# Cosine Text Similarity

This notebook is my attempt at solving Task 1.(ii), where we compare two given texts and produce a similarity score using cosine similarity.

Modules used for this task are:
+ NLTK (Natural Language Toolkit)
+ NumPy (For efficient processing)
+ scikit-learn (`sklearn`, for feature extraction)
+ Gensim (For loading the word2vec-format model)

## Text Similarity using Bag of Words/Count Vector
Bag of words is one of the more popular forms of vectorization. 

In this method, we create a `count vector` for each corpus.

The `count vector` may be either:
* Frequency based: the count is the number of times the word occurs in the corpus
* Presence based: the count is binary, 1 if the word is present in the corpus and 0 if it is not

Generally the former is preferred over the latter (and is hence implemented below), even though the latter is a bit easier to code.
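
To make the difference between the two variants concrete, here is a minimal sketch using only the standard library. The helper name `count_vector` and the toy token list are assumptions made for illustration; they are not part of the implementation in this notebook.

```python
from collections import Counter

def count_vector(tokens, vocabulary, binary=False):
    """Return a count vector for `tokens` over `vocabulary` (hypothetical helper).

    binary=False gives the frequency-based vector,
    binary=True gives the presence-based vector.
    """
    counts = Counter(tokens)
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]

# Toy, already-tokenized input (assumed for illustration)
tokens = ["cat", "sat", "cat", "mat"]
vocabulary = sorted(set(tokens) | {"dog"})  # ['cat', 'dog', 'mat', 'sat']

frequency_vector = count_vector(tokens, vocabulary)
presence_vector = count_vector(tokens, vocabulary, binary=True)
print(frequency_vector)  # [2, 0, 1, 1]
print(presence_vector)   # [1, 0, 1, 1]
```

The two vectors differ only where a word repeats ("cat" here), which is exactly the information the presence-based variant throws away.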



### Loading necessary modules:

Let us first import all the necessary modules

In [14]:
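import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# One-time setup if the NLTK data is missing:
# nltk.download('punkt'); nltk.download('stopwords')
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')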

### Preprocessing:

Whenever we are given a corpus, we first need to process it.

Processing, for us, consists of three steps:
- Tokenizing the corpus
- Removing stop words and punctuation
- Converting the remaining tokens to lowercase

This is done using the `process` function below.

In [15]:
def process(sentence, stop_words=set(stopwords.words('english'))):
    '''
        Takes some text (sentence) as input (optionally also stop_words)
        Returns a list of lowercased, alphabetic tokens that are not in stop_words
        stopwords.words('english') is cast to a set because set lookups are O(1) on average
    '''
    tokens = word_tokenize(sentence)
    # Lowercase before the stop-word check: NLTK's stop words are lowercase,
    # so a capitalized "The" would otherwise slip through the filter
    words = [word.lower() for word in tokens
             if word.isalpha() and word.lower() not in stop_words]
    return words

### Generating count vectors
Now that we have some processed data, we can work on generating the `count` vectors. This is done in the `get_vectors` function implemented below.

To generate the count vectors for two token lists, l1 and l2, we need to do the following:
* Get `_all`, the set of all unique tokens in l1 and l2
* For each item in `_all`:
    - Add its frequency in l1 to l1's count vector
    - Add its frequency in l2 to l2's count vector

Note: We store frequencies (rather than 0/1 flags) because we are using the frequency based approach for generating the `count` vector.

In [16]:
def get_vectors(l1, l2):
    _all = np.union1d(l1, l2)
    v1 = np.zeros(_all.shape)
    v2 = np.zeros(_all.shape)
    for i in range(len(_all)):
        v1[i]=l1.count(_all[i])
        v2[i]=l2.count(_all[i])
    return v1,v2

### Taking the cosine

Now all that's left is to take the cosine, which is easily implemented using numpy.

The cosine of the angle between two vectors follows from their dot product:

`cos(θ) = (A · B) / (|A| |B|)`
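
As a quick sanity check of the formula, here is a small worked example in plain Python. The helper name `cosine_similarity` and the example vectors are assumptions chosen for illustration; the notebook's own implementation uses numpy.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention: an all-zero vector has no direction
    return dot / (norm_a * norm_b)

a = [1, 1, 0]
b = [1, 0, 1]
# dot product = 1, and both norms are sqrt(2), so the result is about 1 / 2
similarity = cosine_similarity(a, b)
print(similarity)

print(cosine_similarity([1, 0], [0, 1]))  # orthogonal vectors
print(cosine_similarity([2, 0], [4, 0]))  # same direction, different length
```

Orthogonal vectors score 0 and vectors pointing in the same direction score 1 regardless of their length, which is why cosine similarity suits texts of different sizes.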


In [17]:
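def cosine(vec1, vec2):
    # Product of the two vector lengths; zero if either vector is all zeros
    mod = np.linalg.norm(vec1)*np.linalg.norm(vec2)
    if mod == 0:
        return 0
    # Use the function's own arguments rather than the globals x and y
    return vec1@vec2/mod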

### Driver Code:

We can now test out the cosine similarity between two texts using the driver code below

In [18]:
sentence_1 = input("Enter some text")
sentence_2 = input("Enter more text")
tokens_1 = process(sentence_1)
tokens_2 = process(sentence_2)
x,y = get_vectors(tokens_1, tokens_2)
print("The cosine similarity is:", cosine(x,y))

The cosine similarity is: 0.7647058823529411


## Text Similarity using TF-IDF
TF-IDF is an abbreviation for Term Frequency-Inverse Document Frequency. This is another very common algorithm for vectorization.

It differs from our implementation above because of the "IDF" part.

IDF, or Inverse Document Frequency, scores how rare a word is across documents: the fewer documents a word appears in, the higher its weight.
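
The idea can be sketched with the textbook definition `idf = log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the word. The function name, the toy corpus, and the handling of unseen words below are assumptions made for illustration.

```python
import math

def inverse_document_frequency(term, documents):
    """Textbook IDF: log(N / df), with df = number of documents containing `term`."""
    df = sum(1 for doc in documents if term in doc)
    if df == 0:
        return 0.0  # assumption: a word seen in no document gets no weight
    return math.log(len(documents) / df)

# Toy corpus of already-tokenized documents (assumed for illustration)
documents = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]

idf_the = inverse_document_frequency("the", documents)  # appears in 3 of 3 documents
idf_cat = inverse_document_frequency("cat", documents)  # appears in 2 of 3 documents
idf_dog = inverse_document_frequency("dog", documents)  # appears in 1 of 3 documents
print(idf_the, idf_cat, idf_dog)
```

A word that appears everywhere gets a weight of zero, while rarer words get progressively larger weights. Note that `sklearn`'s `TfidfVectorizer` uses a smoothed variant by default, `ln((1 + N) / (1 + df)) + 1`, so its numbers will not match this sketch exactly.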


### Loading necessary modules:
We will be using `sklearn`'s `TfidfVectorizer` to convert a given text into vector form.
We will then apply our `cosine` function to the resulting vectors.

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

### Preprocessing:
We will rely heavily on the `process` function that we defined earlier to process our sentence. But since the `TfidfVectorizer` needs strings to fit on, we also convert the list we get after processing back into a string of words.

This is done by:
* Applying the `process` function to the string
* Joining the resulting list with spaces between the words


In [20]:
def pre_process(sentence):
    x = ' '.join(process(sentence))
    return x

### Generating Vectors
We will directly be using the `TfidfVectorizer` to vectorize our sentence.

But this has one small problem. The feature vectors that we extract from the `TfidfVectorizer` are in the form of a sparse matrix. We should probably also convert them into numpy arrays in the vector generation step itself.

Note: We can also use `sklearn.metrics.pairwise.cosine_similarity` to calculate the cosine similarity on the sparse matrix itself. But since we have already implemented our own cosine function, we should use that.

In [21]:
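def tfidf_generate_vectors(corpus):
    # ngram_range=(1,2) builds features from single words and adjacent word pairs
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2))
    # fit_transform learns the vocabulary and IDF weights, then vectorizes the corpus
    feature_vectors = tfidf_vectorizer.fit_transform(corpus)
    # Convert the sparse matrix to a dense array; each row is one document's vector
    dense = feature_vectors.toarray()
    return dense[0], dense[1]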

### Taking the cosine

Now all that's left is to take the cosine. We have already implemented a function above, `cosine`, that gets the cosine between two vectors using the dot product.

As a reminder, it is implemented as: `cos(θ) = (A · B) / (|A| |B|)`

### Driver Code:

We can now test out the cosine similarity between two texts using the driver code below

In [22]:
sentence_1 = input("Enter some text")
sentence_2 = input("Enter more text")
corpus = [pre_process(sentence_1), pre_process(sentence_2)]
x,y = tfidf_generate_vectors(corpus)
cosine(x,y)

0.40724772010124605

## Text Similarity Using Word2Vec Embeddings (with GloVe Data)

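Word2vec is a group of related models that are used to produce word embeddings. It takes a large corpus of text as input and produces a vector space, typically of several hundred dimensions, in which each word is assigned a vector.

For this case, that is all we need to know about it. We will use word embeddings to vectorise our words and then get the cosine similarity by comparing those vectors.

We will be using Stanford's `GloVe (Global Vectors for Word Representation)` data for the model (just because the file was comparatively smaller to download). GloVe is a separate embedding method from word2vec; judging by the file name used below, its pre-trained vectors are stored in word2vec text format so that they can be loaded the same way.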


### Loading necessary modules:
Let us start off by loading in the necessary modules along with the model.

We will use the `gensim` library to load in the required model.


In [23]:
from gensim.models import KeyedVectors
filename = '../../data/taskone/ii/glove.6B.300d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)

### Preprocessing

Most of our preprocessing is already done in the `process` method that we defined earlier. We will mostly be using the same for this implementation as well.

One thing to note is that our model has a vocabulary, so if we try to get the similarity (or vectors) for tokens that are not in the vocabulary we will end up with an error.

It will be a good idea to get rid of such words in preprocessing.

In [24]:
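def w2v_process(sentence, model=model):
    tokens = process(sentence)
    # Keep only tokens the model knows, since unknown tokens raise an error.
    # Note: `model.vocab` is the gensim 3.x API; on gensim 4+ use `model.key_to_index`
    tokens = [i for i in tokens if i in model.vocab]
    return tokens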

### Generating the vectors:

One thing to note is that our model has a method called `similarity`, which gives us the similarity between two *words* if they are in its vocabulary. That's not very helpful when we need to find the similarity between two texts.

One way to counter this problem is to look up the vector of every word in both sentences, sum up the vectors in the first sentence, sum up the vectors in the second sentence, and then find the cosine similarity between these two sums. Because we sum the vectors instead of comparing the sentences word by word, the result does not depend on word order.

This is the method that we will implement, and we will do this summing up as a part of our vector generation.
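
The summing approach can be illustrated without loading the real model. The sketch below uses made-up 3-dimensional embeddings; the dictionary `toy_embeddings` and both helper names are assumptions for illustration only.

```python
import math

# Hypothetical 3-dimensional embeddings, made up for illustration;
# the GloVe vectors loaded in this notebook have 300 dimensions.
toy_embeddings = {
    "king": [0.8, 0.1, 0.3],
    "queen": [0.7, 0.2, 0.4],
    "apple": [0.1, 0.9, 0.2],
}

def sentence_vector(tokens, embeddings):
    """Element-wise sum of the embeddings of `tokens` (illustrative helper)."""
    dims = len(next(iter(embeddings.values())))
    total = [0.0] * dims
    for token in tokens:
        for i, value in enumerate(embeddings[token]):
            total[i] += value
    return total

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Same words in a different order give (up to rounding) the same sentence vector
forward = sentence_vector(["king", "queen", "apple"], toy_embeddings)
backward = sentence_vector(["apple", "queen", "king"], toy_embeddings)
same_words = cosine_similarity(forward, backward)

# Different words give a lower similarity
royal = sentence_vector(["king", "queen"], toy_embeddings)
fruit = sentence_vector(["apple"], toy_embeddings)
different_words = cosine_similarity(royal, fruit)

print(same_words, different_words)
```

Since cosine similarity ignores vector length, summing the word vectors and averaging them lead to the same similarity score; summing is simply the shorter thing to write.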

In [25]:
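def w2v_generate_vector(tokens):
    # Start from 0 so the first addition takes on the embedding's shape;
    # an empty token list therefore returns the scalar 0, which `cosine` scores as 0
    x = 0
    for token in tokens:
        x += model.get_vector(token)
    return x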

### Taking the cosine

Now all that's left is to take the cosine. We have already implemented a function above, `cosine`, that gets the cosine between two vectors using the dot product.

As a reminder, it is implemented as: `cos(θ) = (A · B) / (|A| |B|)`

### Driver Code:

We can now test out the cosine similarity between two texts using the driver code below

In [26]:
sentence_1 = input("Enter some text")
sentence_2 = input("Enter more text")
tokens_1 = w2v_process(sentence_1)
tokens_2 = w2v_process(sentence_2)
x, y = w2v_generate_vector(tokens_1),  w2v_generate_vector(tokens_2)
cosine(x,y)

0.95129585