# Assignment 4 - Information Retrieval

In this assignment, you will implement code for information retrieval.

## Part 1

In this part, you will learn to use sklearn's built-in functionality to find the most relevant (toy) document for a given query. You will use [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to compute TF-IDF scores. Then you will use [cosine_similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) to compute the similarity between a given query and each document.

In [0]:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
```

In [0]:
```python
# Define the toy documents and the query.
Doc1 = 'Information Retrieval Systems'
Doc2 = 'Information Storage'
Doc3 = 'Digital Speech Synthesis Systems'
Doc4 = 'Speech Filtering, Speech Retrieval'
docs = [Doc1, Doc2, Doc3, Doc4]

# The query is wrapped in a list because the vectorizer expects an iterable of strings.
query = ['Speech Systems']
```

In [10]:
```python
# Create the vectorizer.
vectorizer = TfidfVectorizer()
# Fit the vectorizer on the documents and transform them into TF-IDF vectors.
documents_v = vectorizer.fit_transform(docs)
# Transform the query with the same (already fitted) vectorizer.
query_v = vectorizer.transform(query)
# Compute the cosine similarity between the query and each document.
print(cosine_similarity(query_v, documents_v))
```

```
[[0.40824829 0.         0.6191303  0.55011649]]
```
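
The printed row holds one similarity score per document, in the same order as `docs`. As a minimal sketch of turning those scores into a ranking, the snippet below hard-codes the printed scores instead of recomputing them; the names `doc_names`, `scores`, `ranking`, and `best_doc` are illustrative and not part of the assignment.

```python
# Scores copied from the printed cosine-similarity output (one per document).
doc_names = ["Doc1", "Doc2", "Doc3", "Doc4"]
scores = [0.40824829, 0.0, 0.6191303, 0.55011649]

# Sort the documents from most to least similar to the query.
ranking = sorted(zip(doc_names, scores), key=lambda pair: pair[1], reverse=True)

for name, score in ranking:
    print(f"{name}: {score:.4f}")

# The first entry of the ranking is the most relevant document.
best_doc = ranking[0][0]
```

For these scores, `best_doc` is `"Doc3"`, and `Doc2` ranks last because it shares no term with the query.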


## Part 2

CACM is a collection of abstracts of articles published in the journal Communications of the ACM between 1958 and 1979. This collection has been used in numerous papers to evaluate information retrieval systems. The entire collection (3024 documents) is provided and consists of three files:
* cacm.tar.gz contains the 3024 HTML files.
* cacm.query contains the 64 queries.
* cacm.rel contains the relevance judgments (the documents considered relevant) for each of the 64 queries.

Write code that eliminates all HTML tags and then tokenizes the text. Ignore the numeric columns at the end of each file. Use the tokenized text to answer the following questions.

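Relevant functionality in *NLTK*: functions *word_tokenize* and *sent_tokenize* from package *nltk.tokenize*, and *nltk.clean_html*. Note that in NLTK 3.x, *nltk.clean_html* is no longer implemented and raises an error recommending BeautifulSoup's *get_text()* instead.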
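
As a rough sketch of the tag-stripping and tokenization step, the snippet below uses only the Python standard library. It is a stand-in, not the NLTK route named above: `strip_html`, `drop_trailing_numeric_lines`, and `tokenize` are hypothetical helper names, the regular-expression tokenizer is much cruder than *word_tokenize*, and the assumption that the numeric columns appear as trailing lines made up only of digits and whitespace should be checked against the actual CACM files.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(markup):
    """Return the text content of an HTML string."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


def drop_trailing_numeric_lines(text):
    """Remove trailing lines that contain only digits and whitespace.

    Assumption: the numeric columns sit at the end of each file in this form.
    """
    lines = text.rstrip().splitlines()
    while lines and re.fullmatch(r"[\d\s]*", lines[-1]):
        lines.pop()
    return "\n".join(lines)


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


# Made-up sample input, only to show the call sequence.
sample = "<html><pre>\nPreliminary Report\n1 5 1\n</pre></html>"
tokens = tokenize(drop_trailing_numeric_lines(strip_html(sample)))
print(tokens)
```

If NLTK and its tokenizer data are installed, replacing `tokenize` with *word_tokenize* should give results closer to what the assignment expects.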




1. How many tokens are in the entire collection?

2. What is the size of the vocabulary?

3. How many vocabulary entries are mentioned only once in the entire collection? How about two times?

4. What are the top 20 most common token types?

5. What is the percentage of tokens in the collection that is covered by these 20 most common token types?
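
All five questions can be answered from a single frequency table. The sketch below assumes the whole collection is already available as one flat list of token strings; `collection_stats` and the dictionary keys are illustrative names, and the toy token list is made up for the demonstration.

```python
from collections import Counter


def collection_stats(tokens, top_n=20):
    """Compute the five statistics from a flat list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    top = counts.most_common(top_n)
    covered = sum(freq for _, freq in top)
    return {
        "num_tokens": total,                      # question 1
        "vocabulary_size": len(counts),           # question 2
        "seen_once": sum(1 for f in counts.values() if f == 1),   # question 3
        "seen_twice": sum(1 for f in counts.values() if f == 2),  # question 3
        "top_types": top,                         # question 4
        "top_coverage_pct": 100.0 * covered / total if total else 0.0,  # question 5
    }


# Toy example with top_n=2 so the result is easy to check by hand.
stats = collection_stats("a b a c a b d".split(), top_n=2)
print(stats)
```

In the toy example there are 7 tokens and 4 types; `c` and `d` occur once, `b` occurs twice, and the top two types (`a`, `b`) cover 5 of the 7 tokens.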

Next, process the text further by lowercasing it and removing stopwords, then stem the remaining tokens using Porter's algorithm. Use the fully preprocessed version of the text (tokenization, case folding, stopword removal, and stemming) to answer the same five questions as above.

Relevant functionality in *NLTK*: class *PorterStemmer* from package *nltk.stem.porter*, and the stopwords corpus from package *nltk.corpus*.
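
One possible shape for this pipeline is sketched below. To keep it runnable without NLTK data, the stopword set and the stemming function are passed in as arguments; `preprocess`, `toy_stopwords`, and `toy_stem` are hypothetical names, and the toy stemmer is only a placeholder, not Porter's algorithm. The order used here (lowercase, remove stopwords, stem) is an assumption that follows the order in the text.

```python
def preprocess(tokens, stopword_set, stem):
    """Case-fold the tokens, drop stopwords, then stem what remains."""
    lowered = (token.lower() for token in tokens)
    return [stem(token) for token in lowered if token not in stopword_set]


# Placeholder inputs so the sketch runs on its own.
toy_stopwords = {"the", "of", "a"}


def toy_stem(token):
    # NOT Porter's algorithm: strips a final "s" purely for illustration.
    return token[:-1] if token.endswith("s") else token


print(preprocess(["The", "Retrieval", "of", "Documents"], toy_stopwords, toy_stem))

# With NLTK installed and the stopwords corpus downloaded, the call would be:
#   from nltk.corpus import stopwords
#   from nltk.stem.porter import PorterStemmer
#   preprocess(tokens, set(stopwords.words("english")), PorterStemmer().stem)
```

Lowercasing before the stopword check matters because NLTK's English stopword list is lowercase, so `The` would otherwise slip through.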

1. How many tokens are in the entire collection?

2. What is the size of the vocabulary?

3. How many vocabulary entries are mentioned only once in the entire collection? How about two times?

4. What are the top 20 most common token types?

5. What is the percentage of tokens in the collection that is covered by these 20 most common token types?

Now compute and report the Mean Average Precision (MAP) over all 64 queries in CACM. You may follow these steps:
1. Vectorize the entire collection using *TfidfVectorizer* as above.
2. Vectorize query *i* (see the *cacm.query* file).
3. Compute the cosine similarity between query *i* and every document in the collection.
4. Rank the collection for query *i* by these scores, highest first.
5. Compute the Average Precision (AP) for query *i* as discussed in class (the *cacm.rel* file provides the relevance judgments; ignore its 2nd column).
6. Macro-average the APs across all queries to obtain the MAP.
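
Steps 5 and 6 can be sketched in plain Python as follows. This uses a common definition of AP (the mean of precision@k over the ranks k at which a relevant document appears, divided by the number of relevant documents); verify it against the definition given in class. The function names and the document and query IDs are made up, and treating a query with no relevant documents as AP = 0 is an assumption you may need to change.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """AP for one query, given a full ranking and the set of relevant IDs."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0  # assumption: queries without relevant documents score 0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevance):
    """Macro-average AP over all queries in `rankings`.

    rankings:  dict mapping query ID -> list of document IDs, best first
    relevance: dict mapping query ID -> collection of relevant document IDs
    """
    aps = [average_precision(ranked, relevance.get(qid, ()))
           for qid, ranked in rankings.items()]
    return sum(aps) / len(aps) if aps else 0.0


# Toy data with made-up IDs.
rankings = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d2", "d1"]}
relevance = {"q1": {"d1", "d3"}, "q2": {"d1"}}
print(mean_average_precision(rankings, relevance))
```

In the toy data, q1 has relevant documents at ranks 1 and 3, so its AP is (1/1 + 2/3) / 2; q2 has its only relevant document at rank 2, so its AP is 1/2.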

To maximize the performance of your model, explore the parameters of *TfidfVectorizer*. List the combination of parameter values that gave you the best MAP. Describe anything else you tried that helped you achieve your best performance.
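
One systematic way to explore the parameters is an exhaustive search over a small grid. The sketch below is only a scaffold: `parameter_grid`, `best_setting`, and `placeholder_score` are hypothetical names, and `placeholder_score` returns an arbitrary number so the example runs on its own. In practice you would replace it with a function that builds `TfidfVectorizer(**params)`, ranks the collection for every query, and returns the MAP. The three keys shown (`sublinear_tf`, `stop_words`, `ngram_range`) are existing *TfidfVectorizer* arguments; the candidate values are merely examples.

```python
from itertools import product


def parameter_grid(options):
    """Yield every combination of the listed parameter values as a dict."""
    names = sorted(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


def best_setting(options, evaluate):
    """Return (score, params) for the combination with the highest score."""
    scored = [(evaluate(params), params) for params in parameter_grid(options)]
    return max(scored, key=lambda pair: pair[0])


# Example candidate values for three TfidfVectorizer arguments.
options = {
    "sublinear_tf": [False, True],
    "stop_words": [None, "english"],
    "ngram_range": [(1, 1), (1, 2)],
}


def placeholder_score(params):
    # Stand-in only: replace with a function that computes the MAP on CACM.
    return float(params["sublinear_tf"]) + 0.5 * (params["stop_words"] == "english")


score, params = best_setting(options, placeholder_score)
print(score, params)
```

The grid grows multiplicatively with each added parameter (here 2 x 2 x 2 = 8 runs), so it helps to vary a few parameters at a time.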

The three best reported MAP values will earn **bonus points** (5% of the assignment grade for 1st place, 3% for 2nd, and 2% for 3rd). The winners will be announced in class after the final evaluation.

(Note: merely reporting the MAP is not enough. Your submission must include the complete code that produces your best MAP so that the grader can reproduce it while grading.)

