# Week 5: Document Similarity

In some applications, it may be difficult to define ahead of time the classes that we want to use in classification.  Or, classes might be made up of various subclasses (which differ in terms of the vocabulary used).  In both of these cases (and others), it might be more appropriate to think about **document similarity**.  For a new document, can we find the most similar document in our collection?

### Preliminaries

In [None]:
###uncomment if working on colab

#from google.colab import drive
#drive.mount('/content/drive')


In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

Now let's get a document collection.  We are going to use the Gutenberg collection of books.  We will get the tokenised content of each book and store it in a dictionary (key = the fileid of the book) for easy access.

In [None]:
from nltk.corpus import gutenberg
book_ids=gutenberg.fileids()
books={b:gutenberg.words(b) for b in book_ids}

In [None]:
books[book_ids[0]]

We now need to normalise the tokens in the documents and construct a *bag-of-words* document representation.  Combining some of the functionality we have been working on over the past few weeks (which we have imported from utils.py), we could use something like this:

In [None]:
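from nltk.probability import FreqDist
# normalise is assumed to be available from the course's utils.py
book_reps={key:FreqDist(normalise(book)) for key,book in books.items()}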
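The `normalise` helper lives in utils.py and is not shown here. As a rough illustration of what such a step might do, the sketch below uses a hypothetical `toy_normalise` function and a tiny hand-written stopword set (both are assumptions, not the course code). It uses `collections.Counter` in place of `FreqDist`; `FreqDist` is a subclass of `Counter`, so the resulting bag-of-words behaves the same way for our purposes.

```python
from collections import Counter

# Hypothetical stand-in for the `normalise` helper in utils.py;
# the real helper may apply different or additional steps.
TOY_STOPWORDS = {"the", "a", "of", "and", "is"}  # illustrative only, not NLTK's list

def toy_normalise(tokens, stopwords=TOY_STOPWORDS):
    """Lowercase tokens, keep alphabetic ones, and drop stopwords."""
    lowered = (tok.lower() for tok in tokens)
    return [tok for tok in lowered if tok.isalpha() and tok not in stopwords]

tokens = ["The", "Emma", "of", "Highbury", ",", "and", "Emma", "!"]

# Bag-of-words: map each surviving token to its frequency
toy_rep = Counter(toy_normalise(tokens))
print(toy_rep)
```

Punctuation and stopwords disappear, and the two occurrences of "Emma" collapse into a single feature with frequency 2.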

Let's have a look at the representation of the first book:

In [None]:
print(book_reps[book_ids[0]].items())

## Measuring Similarity
We are going to use the cosine measure to determine how similar two books are.  This can be defined in terms of the dot products of vectors:

\begin{eqnarray*}
\mbox{sim}_{\mbox{cosine}}(A,B) = \frac{A.B}{\sqrt{A.A \times B.B}}
\end{eqnarray*}

where the dot product of two vectors, A and B, is defined as:

\begin{eqnarray*}
A.B = \sum_{\mbox{f}} \mbox{weight}(A,f)\times \mbox{weight}(B,f) 
\end{eqnarray*}

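and $\mbox{weight}(X,f)$ tells us the value associated with feature $f$ in the vector representation of $X$.  A feature that is missing from a document has weight 0, so only features present in both documents contribute to the dot product.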
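
As a small worked example (toy numbers, not taken from the Gutenberg data), suppose $A$ has the weights whale = 2, sea = 1 and $B$ has the weights whale = 1, ship = 2.  Taking the features in the order whale, sea, ship:

\begin{eqnarray*}
A.B &=& (2 \times 1) + (1 \times 0) + (0 \times 2) = 2\\
A.A &=& 2^2 + 1^2 = 5\\
B.B &=& 1^2 + 2^2 = 5\\
\mbox{sim}_{\mbox{cosine}}(A,B) &=& \frac{2}{\sqrt{5 \times 5}} = 0.4
\end{eqnarray*}

Note that comparing a document with itself gives $\frac{A.A}{\sqrt{A.A \times A.A}} = 1$, the maximum possible similarity.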

### Exercise 1.1
* Write a function `dot` which takes two documents (represented as dictionaries or `FreqDist`s) and returns their dot product
* Test it out on the first two books in Gutenberg.  You should get the answer 3882298!
* Why is the number so large?

### Exercise 1.2
* Write a function `cos_sim` which takes two documents (represented as dictionaries or `FreqDist`s) and returns their cosine similarity.
* Your function should make 3 calls to the `dot` function you have already defined
* If you test it out on the first two documents in the finance collection you should get 0.72 (to 2 s.f.)

### Exercise 1.3
* Write some code that will compute the similarity of every document in a collection with every document in another collection
* Write code to compute the average similarity of two collections
* Compute (and display) the average similarity of the book collection to itself
    

## Beyond Frequency
The frequency of a word in a document does not make a very good weight because some words occur very frequently in all documents.  A rare word that occurs in both documents of a pair should add more to their perceived similarity than a common word that occurs in both.
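
To see the problem with raw frequencies, the sketch below breaks a dot product down into the contribution of each shared feature. The two toy documents and their counts are invented for illustration only.

```python
# Two toy documents (invented counts): "said" is common everywhere,
# "whale" is the word that actually tells us what the documents are about.
doc_a = {"said": 50, "whale": 3}
doc_b = {"said": 60, "whale": 2}

# Contribution of each shared feature to the dot product
contributions = {f: doc_a[f] * doc_b[f] for f in doc_a if f in doc_b}
total = sum(contributions.values())

# Fraction of the dot product that each feature accounts for
shares = {f: c / total for f, c in contributions.items()}

print(contributions)
print(shares)
```

Here "said" accounts for more than 99% of the dot product, so the similarity score is driven almost entirely by a word that says very little about the content of either document.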

### TF-IDF
A commonly used weight is tf-idf, which stands for **term frequency, inverse document frequency**:

\begin{eqnarray*}
\mbox{tf-idf}(D_i,f) = tf(D_i,f) \times idf(D_i,f)
\end{eqnarray*}

where $tf(D_i,f)$ is simply the frequency of feature $f$ in document $D_i$, and

\begin{eqnarray*}
idf(D_i,f) = \log \frac{N}{df(f)}
\end{eqnarray*}

where $N$ is the total number of documents and $\mbox{df}(f)$ is the number of documents containing $f$:  

\begin{eqnarray*}
df(f)=|\{i|\mbox{freq}(D_i,f)>0\}|
\end{eqnarray*}
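
As a worked example, take a toy collection of $N = 3$ documents (invented for illustration): $D_1$ contains whale 4 times and sea twice, $D_2$ contains sea once and ship 3 times, and $D_3$ contains sea 5 times and whale once.  The base of the logarithm is not specified above; this sketch assumes the natural logarithm, and a different base would only rescale every weight by the same constant.

\begin{eqnarray*}
df(\mbox{sea}) = 3 &\Rightarrow& idf(D_1,\mbox{sea}) = \log \frac{3}{3} = 0\\
df(\mbox{whale}) = 2 &\Rightarrow& idf(D_1,\mbox{whale}) = \log \frac{3}{2} \approx 0.405\\
df(\mbox{ship}) = 1 &\Rightarrow& idf(D_2,\mbox{ship}) = \log \frac{3}{1} \approx 1.099\\
\mbox{tf-idf}(D_1,\mbox{whale}) &=& 4 \times \log \frac{3}{2} \approx 1.62\\
\mbox{tf-idf}(D_1,\mbox{sea}) &=& 2 \times 0 = 0
\end{eqnarray*}

A feature that occurs in every document, like sea here, gets a weight of 0 and so no longer contributes to the similarity of any pair of documents.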

The code below will take a list of documents (represented as dictionaries) and compute the document frequency for each feature.  Test it out on the collection of books.

In [None]:
def doc_freq(doclist):
    df={}
    for doc in doclist:
        for feat in doc.keys():
            df[feat]=df.get(feat,0)+1
            
    return df
    

In [None]:
doc_freq(book_reps.values())
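
The output for the full book collection is large, so it can help to check the behaviour on a tiny hand-made collection first. The sketch below repeats `doc_freq` so that it runs on its own; the three toy documents are invented for illustration.

```python
def doc_freq(doclist):
    """Count, for each feature, how many documents it appears in."""
    df = {}
    for doc in doclist:
        for feat in doc.keys():
            df[feat] = df.get(feat, 0) + 1
    return df

# Toy collection (invented counts)
toy_docs = [
    {"whale": 4, "sea": 2},
    {"sea": 1, "ship": 3},
    {"sea": 5, "whale": 1},
]

toy_df = doc_freq(toy_docs)
print(toy_df)  # {'whale': 2, 'sea': 3, 'ship': 1}
```

Each document counts at most once per feature, however often the feature occurs in it: "sea" has a document frequency of 3 even though it occurs 8 times in total.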

### Exercise 2.1
* Write a function which will compute the idf values for features given a list of documents
* Use it to compute idf values for features given the entire list of books in the book collection
    

### Exercise 2.2
* Write a function `convert_to_tfidf` that takes two arguments:
    * a dictionary of documents mapping fileids to documents
        * where each document is represented as a dictionary or `FreqDist` ({feat:freq})
    * a dictionary containing idf values
* and outputs a dictionary of documents where each document is represented as a dictionary or FreqDist with tfidf weights {feat:tfidf}

### Exercise 2.3
* Recompute the average similarity of the collection of books to itself (as in Ex 1.3), this time using the tf-idf representations.
* What do you notice?

### Exercise 2.4
For each book in the collection, find its most similar book (NOT INCLUDING ITSELF!).
Output your results in a table.