### Calculating cosine similarity

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Define the documents
documents = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?']

To compute cosine similarity, each document must first be represented as a vector of word counts. scikit-learn's `CountVectorizer` produces raw counts, while `TfidfVectorizer` produces TF-IDF-weighted values; either can be used to build these vectors.
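As a point of comparison, here is a minimal sketch of the `TfidfVectorizer` route mentioned above. It assumes the default `TfidfVectorizer` settings and reuses the same four documents; the names `tfidf_vectorizer`, `tfidf_matrix`, and `tfidf_scores` are illustrative and do not appear elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Same four documents as in the notebook
documents = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?']

# Build TF-IDF weighted vectors instead of raw counts (default settings assumed)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Pairwise cosine similarity between all documents
tfidf_scores = cosine_similarity(tfidf_matrix)
print(tfidf_scores.round(3))
```

TF-IDF down-weights words that occur in every document, so the scores will generally differ from those obtained with raw counts, although documents with identical word content should still score 1 against each other.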

In [3]:
count_vectorizer = CountVectorizer()
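sparse_matrix = count_vectorizer.fit_transform(documents)  # returns a SciPy sparse matrix of word counts
# Words used in the comparison; this vocabulary can be changed with ngram_range.
# Note: in scikit-learn 1.0 and later, use get_feature_names_out() instead.
count_vectorizer.get_feature_names()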

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In [4]:
sparse_matrix.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
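To make the link between these counts and the similarity score concrete, the following sketch computes cosine similarity by hand with NumPy: the dot product of two count vectors divided by the product of their lengths. The `cosine` helper is a hypothetical name introduced for illustration, and the count matrix is copied from the output above.

```python
import numpy as np

# Word-count matrix copied from the output above (one row per document)
counts = np.array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
                   [0, 2, 0, 1, 0, 1, 1, 0, 1],
                   [1, 0, 0, 1, 1, 0, 1, 1, 1],
                   [0, 1, 1, 1, 0, 0, 1, 0, 1]])


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Documents 1 and 2: dot product is 5, lengths are sqrt(5) and sqrt(8)
sim_0_1 = cosine(counts[0], counts[1])
print(round(sim_0_1, 4))  # 5 / sqrt(40), approximately 0.7906
```

Rows 1 and 4 of the matrix are identical, so their cosine similarity is exactly 1 even though the sentences differ in word order.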

In [5]:
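# ngram_range=(2, 2) means only bigrams (pairs of adjacent words) are considered
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
sparse2 = vectorizer2.fit_transform(documents)
vectorizer2.get_feature_names()  # use get_feature_names_out() in scikit-learn 1.0 and later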

['and this',
 'document is',
 'first document',
 'is the',
 'is this',
 'second document',
 'the first',
 'the second',
 'the third',
 'third one',
 'this document',
 'this is',
 'this the']
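Bigram counts can be passed to `cosine_similarity` in the same way as single-word counts. The sketch below is one way to do that; the names `bigram_vectorizer`, `bigram_matrix`, and `bigram_scores` are illustrative. Because bigrams capture some word order, the first and fourth documents, which contain the same words in a different order, should no longer look identical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Same four documents as in the notebook
documents = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?']

# Count bigrams only, as in the cell above
bigram_vectorizer = CountVectorizer(analyzer='word', ngram_range=(2, 2))
bigram_matrix = bigram_vectorizer.fit_transform(documents)

# Pairwise cosine similarity on the bigram counts
bigram_scores = cosine_similarity(bigram_matrix)

# Documents 1 and 4 share only 'the first' and 'first document'
# out of four bigrams each, so 2 / (2 * 2) = 0.5
print(bigram_scores[0, 3])
```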

In [6]:
similarity_scores = cosine_similarity(sparse_matrix)
similarity_scores

array([[1.        , 0.79056942, 0.54772256, 1.        ],
       [0.79056942, 1.        , 0.4330127 , 0.79056942],
       [0.54772256, 0.4330127 , 1.        , 0.54772256],
       [1.        , 0.79056942, 0.54772256, 1.        ]])

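* To interpret this: each row compares one document with all four. The first document has a similarity of 1 with itself, about 0.79 with the second, and about 0.55 with the third. It also scores 1 against the fourth, because the two contain exactly the same words and word counts ignore word order.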
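Reading pairs off the matrix by eye gets harder as the number of documents grows. One possible way to list the pairs from most to least similar is sketched below; it assumes the similarity matrix shown above (copied here so the snippet runs on its own) and uses only the upper triangle, since the matrix is symmetric and the diagonal is always 1. The name `pairs` is illustrative.

```python
import numpy as np

# Similarity matrix copied from the output above
similarity_scores = np.array([
    [1.0, 0.79056942, 0.54772256, 1.0],
    [0.79056942, 1.0, 0.4330127, 0.79056942],
    [0.54772256, 0.4330127, 1.0, 0.54772256],
    [1.0, 0.79056942, 0.54772256, 1.0]])

# Indices above the diagonal: each unordered pair of documents exactly once
rows, cols = np.triu_indices_from(similarity_scores, k=1)

# Sort (score, i, j) tuples from most to least similar
pairs = sorted(
    ((float(similarity_scores[i, j]), int(i), int(j)) for i, j in zip(rows, cols)),
    reverse=True,
)

for score, i, j in pairs:
    print(f"documents {i + 1} and {j + 1}: {score:.2f}")
```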