## Vector Space Models ###

Many tasks in information retrieval (IR) and natural language processing (NLP) involve performing document similarity comparisons. These tasks include document clustering, retrieving the most relevant documents for a given query, finding document translation pairs in a large multilingual collection, etc.  

Most practical applications of document similarity represent documents in a common feature space. Representing documents in a shared feature space abstracts away from the specific sequence of words used in each document and, with appropriate representations, can also facilitate the analysis of relationships between documents written using different vocabularies.   

In this part of the lab session we are going to cover one of the fundamental retrieval models: the vector space model.
In this model, queries and documents are represented in a shared space with one dimension per index term (e.g. words, n-grams, stems, phrases, etc.). More specifically, for a document $D$ and a vocabulary $V$ of index terms, this representation is a vector with $|V|$ dimensions whose components are the weights of the individual index terms.
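
As a minimal sketch of this representation, the snippet below maps a text to a vector with one component per index term. The toy vocabulary, the helper `to_vector` and the use of raw counts as weights are assumptions made for illustration only; they are not part of the lab code.

```python
from collections import Counter

# Hypothetical toy vocabulary V of index terms (here: single words).
vocabulary = ["cat", "dog", "fish", "sat"]

def to_vector(text, vocabulary):
    """Map a text to a |V|-dimensional vector of raw term counts."""
    counts = Counter(text.lower().split())
    # One weight per index term, in vocabulary order; absent terms get 0.
    return [counts[term] for term in vocabulary]

doc_vector = to_vector("The cat sat near the other cat", vocabulary)
query_vector = to_vector("cat fish", vocabulary)
print(doc_vector)    # [2, 0, 0, 1]
print(query_vector)  # [1, 0, 1, 0]
```

Because the query and the document are mapped through the same vocabulary, their vectors have the same length and can be compared component by component.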

Since the document features are weights computed over single index terms, most of them are zero: in a document of a few thousand words, only a few hundred unique words will have non-zero counts. This makes the document representation very sparse, which in turn allows the whole document collection to be represented as a sparse matrix where the rows are the documents and the columns correspond to the index terms. In practice, a typical representation is the transpose of this matrix: the index terms are the rows, the documents are the columns, and each entry holds the weight assigned to the term for a given document.
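
To make the idea of sparsity concrete, here is a small sketch in plain Python that stores only the non-zero weights of each document. The matrix values are made up, documents are taken as rows, and the dictionary layout is just one possible sparse encoding chosen for readability; libraries such as SciPy use more compact formats.

```python
# Hypothetical toy data: 3 documents (rows) over a 6-term vocabulary (columns).
dense_rows = [
    [2, 0, 0, 1, 0, 0],
    [0, 0, 3, 0, 0, 0],
    [0, 1, 0, 0, 0, 4],
]

# Sparse form: for each document keep only {column index: weight}
# for the non-zero entries.
sparse_rows = [
    {col: w for col, w in enumerate(row) if w != 0}
    for row in dense_rows
]

stored = sum(len(row) for row in sparse_rows)
total = len(dense_rows) * len(dense_rows[0])
print(sparse_rows)                            # [{0: 2, 3: 1}, {2: 3}, {1: 1, 5: 4}]
print(stored, "of", total, "entries stored")  # 5 of 18 entries stored
```

With a realistic vocabulary of tens of thousands of terms, the fraction of stored entries becomes far smaller than in this toy case.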

The most widely used weighting scheme for index terms is tf-idf weighting. In this approach, weights are computed as the product of the index term's frequency of occurrence within the document, referred to as the **term frequency (tf)**, and the **inverse document frequency (idf)**, which is the log of the ratio between the total number of documents in the collection $N$ and the number of documents that contain that term $n_k$. For term $k$ and document $i$, these weighting components are computed using the following formulas:

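$$ \Large tf_{ik} =\frac{f_{ik}}{\sum_{j=1}^t{f_{ij}}} \ \ \ \ \ \ \ \ idf_k =\log \frac{N}{n_k}$$

Here $f_{ik}$ is the number of occurrences of term $k$ in document $i$ and $t$ is the number of index terms.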
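
The two components can be checked by hand on a toy collection. The sketch below normalizes each raw term count by the total number of term occurrences in the document and multiplies it by the idf. The documents, the helper `tf_idf` and the choice of the natural logarithm are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical toy collection of N = 3 already-tokenized documents.
docs = [
    ["cat", "sat", "cat"],
    ["dog", "sat"],
    ["cat", "dog", "fish", "dog"],
]
N = len(docs)

# n_k: number of documents that contain term k.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """Return {term: tf * idf} for one tokenized document."""
    counts = Counter(doc)          # raw count of each term in this document
    total = sum(counts.values())   # total number of term occurrences
    return {
        term: (f / total) * math.log(N / doc_freq[term])
        for term, f in counts.items()
    }

weights = [tf_idf(doc) for doc in docs]
print(weights[0])
```

For the first document, "cat" occurs 2 times out of 3 tokens and appears in 2 of the 3 documents, so its weight is $(2/3)\log(3/2)$. Note that scikit-learn's `TfidfVectorizer` by default uses raw counts for tf, a smoothed idf and L2-normalized rows, so its weights will not match these numbers exactly.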

In this lab section we are going to represent a set of books using the tf-idf model. Using the tf-idf vector representation, we'll run a query and rank the books based on their distance from the query.
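
Before working with the real books, it may help to see the distance computation on made-up vectors. In the sketch below the file names and weights are hypothetical, and `math.dist` (available from Python 3.8) stands in for the library routine that handles whole matrices.

```python
import math

# Hypothetical tf-idf vectors over a shared 3-term vocabulary.
book_vectors = {
    "book_a.txt": [0.0, 0.5, 0.5],
    "book_b.txt": [0.9, 0.1, 0.0],
}
query_vector = [1.0, 0.0, 0.0]

# Euclidean distance between the query and each book vector.
distances = {
    name: math.dist(vector, query_vector)
    for name, vector in book_vectors.items()
}

# The book with the smallest distance is the closest match to the query.
closest = min(distances, key=distances.get)
print(closest)  # book_b.txt
```

A smaller distance means the book's term weights are closer to those of the query, which is the criterion used for ranking in this lab.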

In [10]:
import nltk
import os
import string
import sklearn.metrics.pairwise
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
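books = {}
book_names = []
#Table that deletes every punctuation character when passed to str.translate:
remove_punctuation = str.maketrans('', '', string.punctuation)
for filename in os.listdir("./books"):
    if filename.endswith(".txt"):
        #Read the book, lowercase it and strip punctuation:
        with open(os.path.join("./books", filename), 'r') as book_file:
            book = book_file.read().lower()
        book = book.translate(remove_punctuation)
        books[filename] = book
        #Record the name only for files that were loaded, so that
        #book_names stays aligned with the rows of the tf-idf matrix:
        book_names.append(filename)
book_names = np.asarray(book_names)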

#Convert the collection of books into tf-idf vectors (one row per book):
tfidf = TfidfVectorizer(tokenizer=nltk.word_tokenize, stop_words='english')
tfs = tfidf.fit_transform(books.values()).toarray()

#Now that we have the books represented in a shared vector space let's run a query:
query='''And what was he?
    Forsooth, a great arithmetician,
    One Michael Cassio, a Florentine
    (A fellow almost damn'd in a fair wife)
    That never set a squadron in the field,
    Nor the division of a battle knows
    More than a spinster; unless the bookish theoric,
    Wherein the toged consuls can propose
    As masterly as he.'''
query = query.lower()
query = query.translate(str.maketrans('', '', string.punctuation))

#We'll first represent the query in the same shared space as the books.
#transform expects a list of documents, so the query is wrapped in a list:
query_tfidf = tfidf.transform([query]).toarray()
#Next, we'll compute the Euclidean distance between the query point and each book:
eu_distances = sklearn.metrics.pairwise.euclidean_distances(tfs, query_tfidf)
eu_distances = [row[0] for row in eu_distances]
#Get the indices that sort the distances in ascending order (closest book first):
eu_distances_sorted = np.argsort(eu_distances)

**[Assignment 1]** Using the indices of the sorted Euclidean distances, print a ranked list of the books. Hint: book filenames are stored in the book_names array.

**[Solution 1]**

In [11]:
#Enter your code here.