# Document similarity comparison
This is a short introduction to document similarity comparison in Python using the gensim library. It is based on Jonathan Mugan's outstanding [presentation](https://www.oreilly.com/learning/how-do-i-compare-document-similarity-using-python).

## gensim
gensim is a library that was originally written to generate a list of the mathematical articles most similar to a given article.

In [1]:
import gensim

## Creating a collection of "documents"
For simplicity, the documents in this introduction are single sentences, but the principles are the same for longer documents.

In [46]:
raw_documents = ["I'm taking the show on the road.",
                 "My socks are a force multiplier.",
                 "I am the barber who cuts everyone's hair who doesn't cut their own.",
                 "Legend has it that the mind is a mad monkey.",
                 "I make my own fun."]
print("Number of documents:", len(raw_documents))

Number of documents: 5


## Tokenizing the documents
The first step in this technique is to tokenize the documents,
i.e. to break down the documents into the elements (tokens) that
make them up.

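I use NLTK, a widely used Python package for natural
language processing. Its `word_tokenize` function relies on NLTK's
Punkt tokenizer data, which has to be downloaded once (for example
with `nltk.download('punkt')`; the resource name may differ between
NLTK versions).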

In [32]:
from nltk.tokenize import word_tokenize
tokenized_docs = [[w.lower() for w in word_tokenize(text)]
                     for text in raw_documents]
print(tokenized_docs)

[['i', "'m", 'taking', 'the', 'show', 'on', 'the', 'road', '.'], ['my', 'socks', 'are', 'a', 'force', 'multiplier', '.'], ['i', 'am', 'the', 'barber', 'who', 'cuts', 'everyone', "'s", 'hair', 'who', 'does', "n't", 'cut', 'their', 'own', '.'], ['legend', 'has', 'it', 'that', 'the', 'mind', 'is', 'a', 'mad', 'monkey', '.'], ['i', 'make', 'my', 'own', 'fun', '.']]


## Creating a dictionary
The dictionary maps each unique token in the tokenized documents to an integer id.

In [33]:
dictionary = gensim.corpora.Dictionary(tokenized_docs)
for i in range(len(dictionary)):
    print(i, dictionary[i])

0 'm
1 .
2 i
3 on
4 road
5 show
6 taking
7 the
8 a
9 are
10 force
11 multiplier
12 my
13 socks
14 's
15 am
16 barber
17 cut
18 cuts
19 does
20 everyone
21 hair
22 n't
23 own
24 their
25 who
26 has
27 is
28 it
29 legend
30 mad
31 mind
32 monkey
33 that
34 fun
35 make
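
The order of the ids above suggests how they are assigned: documents are processed one after another, and the new tokens of each document are numbered in sorted order. The following is a minimal sketch of that idea in plain Python. It is not gensim's implementation, and `build_token_ids` is a name made up for this illustration, but it reproduces the ids printed above for the first two documents.

```python
def build_token_ids(tokenized_docs):
    """Assign consecutive integer ids to tokens, document by document."""
    token2id = {}
    for doc in tokenized_docs:
        # New tokens within a document receive their ids in sorted order.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

# The first two tokenized documents from above.
docs = [
    ["i", "'m", "taking", "the", "show", "on", "the", "road", "."],
    ["my", "socks", "are", "a", "force", "multiplier", "."],
]
token2id = build_token_ids(docs)
print(token2id["the"], token2id["socks"])  # 7 13
```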


## Creating a corpus
We're interested in counting how many times each word occurs in a document. To do this we create a corpus, which is a list of bags of words. A bag of words represents a document as a list of (token id, count) pairs, one pair for each distinct word in the document.

In [34]:
corpus = [dictionary.doc2bow(tokenized_doc) for tokenized_doc in tokenized_docs]
print(corpus)

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2)], [(1, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1)], [(1, 1), (2, 1), (7, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 2)], [(1, 1), (7, 1), (8, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1)], [(1, 1), (2, 1), (12, 1), (23, 1), (34, 1), (35, 1)]]
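
To make the bag-of-words idea concrete, here is a minimal sketch that builds the first entry of the corpus without gensim. The helper `to_bow` and the hand-written `token2id` mapping are assumptions made for this illustration; the mapping simply copies the first eight dictionary entries shown earlier.

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens missing from token2id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical mapping that mirrors the first eight dictionary entries.
token2id = {"'m": 0, ".": 1, "i": 2, "on": 3, "road": 4, "show": 5, "taking": 6, "the": 7}
tokens = ["i", "'m", "taking", "the", "show", "on", "the", "road", "."]
bow = to_bow(tokens, token2id)
print(bow)
```

The word "the" occurs twice in the sentence, so its id 7 is paired with a count of 2. Tokens that are not in the mapping are silently skipped.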


## Creating a tf-idf model
tf-idf is short for 'term frequency-inverse document frequency'. It measures how important a word is to a document within a collection (corpus). Simply put, a word that occurs many times in a document but in only a few documents of the corpus gets a high tf-idf score.

The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general (source: https://bit.ly/2bm6Qgd).

In [35]:
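tf_idf = gensim.models.TfidfModel(corpus)
print(tf_idf)
# num_nnz is the number of non-zero entries in the corpus,
# i.e. the total number of (token id, count) pairs across all documents.
print(sum(len(doc) for doc in corpus))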

TfidfModel(num_docs=5, num_nnz=47)
47
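
The weighting itself can be sketched in a few lines of plain Python. This sketch assumes the scheme that, to my understanding, gensim uses by default: the raw count of a token times log2(N / df), where N is the number of documents and df the number of documents containing the token, followed by scaling the vector to unit length. The function `tfidf_vector` and the hand-counted `doc_freq` table are names introduced for this illustration only.

```python
import math

def tfidf_vector(bow, doc_freq, n_docs):
    """Weight (token_id, count) pairs by count * log2(n_docs / df), then scale to unit length."""
    weights = {tid: count * math.log2(n_docs / doc_freq[tid]) for tid, count in bow}
    # Tokens that occur in every document get weight 0 and are dropped.
    weights = {tid: w for tid, w in weights.items() if w > 0}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    return {tid: w / norm for tid, w in weights.items()}

# Second document, "My socks are a force multiplier.", as a bag of words.
bow = [(1, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1)]
# Number of documents containing each of these tokens, counted by hand from the corpus.
doc_freq = {1: 5, 8: 2, 9: 1, 10: 1, 11: 1, 12: 2, 13: 1}
vector = tfidf_vector(bow, doc_freq, n_docs=5)
for tid, weight in sorted(vector.items()):
    print(tid, round(weight, 3))
```

The full stop (id 1) appears in all five documents, so its weight is zero and it disappears from the vector. Words that occur in a single document, such as "socks", end up with the largest weights.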


## Creating a similarity measure
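This creates a similarity index over the tf-idf vectors of the five documents. The first argument is the prefix under which gensim stores the index shards on disk, and num_features is the size of the vocabulary.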

In [43]:
sims = gensim.similarities.Similarity('', tf_idf[corpus],
                                      num_features=len(dictionary))
print(sims)
print(type(sims))

Similarity index with 5 documents in 0 shards (stored under )
<class 'gensim.similarities.docsim.Similarity'>


## Creating a query document
Next I create the document that I want to compare to the five documents in raw_documents and then convert it to tf-idf. I also print each intermediate step to show how the query document is transformed from a text string into a list of tokens, a bag of words and finally a tf-idf vector. Note that words missing from the dictionary ("for", "good") are dropped from the bag of words.

In [44]:
query_doc = [w.lower() for w in word_tokenize("Socks are a force for good.")]
print(query_doc)
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)
query_doc_tf_idf = tf_idf[query_doc_bow]
print(query_doc_tf_idf)

['socks', 'are', 'a', 'force', 'for', 'good', '.']
[(1, 1), (8, 1), (9, 1), (10, 1), (13, 1)]
[(8, 0.31226270667960454), (9, 0.5484803253891997), (10, 0.5484803253891997), (13, 0.5484803253891997)]


## Comparing the query document to raw_documents
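Finally I compare the query document to the five documents in raw_documents. Each value in the resulting array is the cosine similarity between the query and the corresponding document, so 0 means no shared weighted terms and values close to 1 mean very similar documents.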

In [45]:
sims[query_doc_tf_idf]

array([0.        , 0.84565616, 0.        , 0.06124881, 0.        ],
      dtype=float32)
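
The score for the second document can be checked by hand. The sketch below computes the cosine similarity between two sparse vectors in plain Python. The query weights are copied from the printed tf-idf output; the document weights assume an inverse document frequency of log2(5 / df), and `cosine` is a helper written for this illustration, not part of gensim.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {token_id: weight} dicts."""
    dot = sum(w * v[tid] for tid, w in u.items() if tid in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# tf-idf vector of the query, copied from the printed output.
query = {8: 0.31226270667960454, 9: 0.5484803253891997,
         10: 0.5484803253891997, 13: 0.5484803253891997}
# Un-normalized weights for "My socks are a force multiplier.",
# assuming idf = log2(5 / df); cosine similarity normalizes them anyway.
doc = {8: math.log2(5 / 2), 9: math.log2(5), 10: math.log2(5),
       11: math.log2(5), 12: math.log2(5 / 2), 13: math.log2(5)}
print(round(cosine(query, doc), 4))  # 0.8457
```

The result agrees with the second entry of the array returned by the similarity index.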

## Conclusion
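The result shows that the query document "Socks are a force for good." is most similar to the second document of raw_documents ("My socks are a force multiplier."), with a score of about 0.85. This makes good sense, since the two sentences share the words "socks", "are", "a" and "force".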
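
To read off the best matches programmatically, the scores can be sorted together with their document indices. This is a minimal sketch, assuming the similarity scores have been copied into a plain Python list; `scores` and `ranked` are names introduced here.

```python
raw_documents = ["I'm taking the show on the road.",
                 "My socks are a force multiplier.",
                 "I am the barber who cuts everyone's hair who doesn't cut their own.",
                 "Legend has it that the mind is a mad monkey.",
                 "I make my own fun."]
# Similarity scores copied from the comparison output.
scores = [0.0, 0.84565616, 0.0, 0.06124881, 0.0]

# Pair each score with its document index and sort from most to least similar.
ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
for index, score in ranked[:2]:
    print(f"{score:.3f}  {raw_documents[index]}")
```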