<a href="https://colab.research.google.com/github/Deependra011/Text-Similarity-Quantifier/blob/main/Precily_Text_Similarity_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**IMPORTING SOME IMPORTANT LIBRARIES**

In [19]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings, string
warnings.filterwarnings('ignore')
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity
import joblib
import gensim.downloader as api
from nltk import word_tokenize
from sklearn.pipeline import Pipeline


**LOADING THE DATASET**

In [20]:
text_data = pd.read_csv('Precily_Text_Similarity.csv')

In [21]:
text_data.head()

Unnamed: 0,text1,text2
0,broadband challenges tv viewing the number of ...,gardener wins double in glasgow britain s jaso...
1,rap boss arrested over drug find rap mogul mar...,amnesty chief laments war failure the lack of ...
2,player burn-out worries robinson england coach...,hanks greeted at wintry premiere hollywood sta...
3,hearts of oak 3-2 cotonsport hearts of oak set...,redford s vision of sundance despite sporting ...
4,sir paul rocks super bowl crowds sir paul mcca...,mauresmo opens with victory in la amelie maure...


**Text Cleaning and Preprocessing**

In [18]:
stemmer = PorterStemmer()
def stem_words(text):
    return ' '.join([stemmer.stem(word) for word in text.split()])
text_data['text1'] = text_data['text1'].apply(lambda x: stem_words(x))
text_data['text2'] = text_data['text2'].apply(lambda x: stem_words(x))

In [22]:
lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])
text_data['text1'] = text_data['text1'].apply(lambda x: lemmatize_words(x))
text_data['text2'] = text_data['text2'].apply(lambda x: lemmatize_words(x))

In [23]:
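# Build the stopword set once; calling stopwords.words() for every word is very slow.
STOP_WORDS = set(stopwords.words('english'))

def text_process(text):
    # Strip punctuation characters, then drop stopwords and purely numeric tokens.
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    return ' '.join([word for word in nopunc.split() if word.lower() not in STOP_WORDS and not word.isdigit()])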

In [24]:
text_data.shape

(3000, 2)

In [25]:
text_data['text1'] = text_data['text1'].apply(lambda x: text_process(x))
text_data['text2'] = text_data['text2'].apply(lambda x: text_process(x))

In [26]:
saved_text = text_data.copy()

## **Jaccard Similarity Technique**

The Jaccard similarity index measures the similarity between two sets of data. It can range from 0 to 1. The higher the number, the more similar the two sets of data.

The Jaccard similarity index is calculated as:

Jaccard Similarity = (number of observations in both sets) / (number in either set)

Or, written in notation form:

J(A, B) = |A∩B| / |A∪B|

In [27]:
def jaccard_similarity(a,b):
    intersection = set(a).intersection(set(b))
    union = set(a).union(set(b))
    return len(intersection)/len(union)

In [28]:
jaccard_similarity(text_data['text1'][0],text_data['text2'][0])

0.7222222222222222
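
Because `text1` and `text2` hold strings, `set(a)` inside `jaccard_similarity` is a set of characters, so the score above compares the letters used in the two texts and not their words. A minimal sketch of a word-level variant, assuming whitespace tokenization is acceptable (the name `jaccard_tokens` is hypothetical and not part of the original notebook):

```python
def jaccard_tokens(a, b):
    """Jaccard similarity over whitespace-separated tokens of two strings."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    if not union:
        # Both texts are empty: treat them as identical instead of dividing by zero.
        return 1.0
    return len(set_a & set_b) / len(union)


# 2 shared tokens (rap, arrested) out of 4 distinct tokens overall.
print(jaccard_tokens("rap boss arrested", "rap mogul arrested"))  # 0.5
```

Word-level scores are usually much lower than character-level ones, since two unrelated English texts still share most of the alphabet.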

### **Text Vectorization**

### **Count Vectorizer - TF(Term Frequency)**

We'll convert each text into a vector that machine learning models can understand.

We'll do that in three steps using the bag-of-words model:

1. Count how many times each word occurs in each text (known as term frequency).

2. Weigh the counts so that frequent tokens get lower weight (inverse document frequency).

3. Normalize the vectors to unit length to abstract from the original text length (L2 norm).

Each vector has as many dimensions as there are unique tokens in the corpus. We will first use scikit-learn's CountVectorizer, which converts a collection of text documents to a matrix of token counts.

We can picture this as a two-dimensional matrix: one dimension is the vocabulary and the other is the documents. scikit-learn returns it with one row per document and one column per vocabulary entry.
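
To make the count matrix concrete, here is a small sketch that builds one by hand with the standard library. The three toy documents are invented for illustration, and this is not how `CountVectorizer` is implemented internally; it only shows the shape of the result.

```python
from collections import Counter

docs = ["rap boss arrested", "sir paul rocks super bowl", "rap mogul rocks"]

# Vocabulary: every distinct token, sorted so the column order is reproducible.
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary entry.
count_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    count_matrix.append([counts[token] for token in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Most entries in each row are zero, which is why scikit-learn stores such matrices in a sparse format.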

In [29]:
bow_transformer1 = CountVectorizer(analyzer=text_process).fit(text_data['text1'])
bow_transformer2 = CountVectorizer(analyzer=text_process).fit(text_data['text2'])
print(bow_transformer1)
print(bow_transformer2)

CountVectorizer(analyzer=<function text_process at 0x785fbeae4d30>)
CountVectorizer(analyzer=<function text_process at 0x785fbeae4d30>)


In [30]:
len(bow_transformer1.vocabulary_), len(bow_transformer2.vocabulary_)

(38, 38)
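
A vocabulary of 38 entries is very small for 3,000 news texts. A likely explanation is that `text_process` returns a single string, and `CountVectorizer` iterates over whatever a callable `analyzer` returns, so each character becomes a feature. The sketch below reproduces the effect with plain Python iteration; `clean` is a hypothetical stand-in for `text_process`.

```python
from collections import Counter

def clean(text):
    # Hypothetical stand-in for text_process: returns one cleaned string.
    return ' '.join(word for word in text.split() if not word.isdigit())

text = "hearts of oak 3 2 cotonsport"

# Iterating over the returned string yields characters (including spaces).
char_features = Counter(clean(text))
# Splitting the string first yields word tokens instead.
word_features = Counter(clean(text).split())

print(sorted(char_features))
print(sorted(word_features))
```

An analyzer that returns `text_process(text).split()` would give word-level features, at the cost of changing every vocabulary size and matrix shape reported in this notebook.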

Let's take the 4th text of the text1 column and generate its bag-of-words representation, that is, each feature index along with its frequency.

In [31]:
text3 = text_data['text1'][3]
bow3 = bow_transformer1.transform([text3])
print(bow3)

  (0, 0)	164
  (0, 4)	1
  (0, 6)	1
  (0, 9)	1
  (0, 10)	1
  (0, 11)	98
  (0, 12)	16
  (0, 13)	32
  (0, 14)	36
  (0, 15)	108
  (0, 16)	22
  (0, 17)	24
  (0, 18)	23
  (0, 19)	57
  (0, 20)	3
  (0, 21)	19
  (0, 22)	39
  (0, 23)	31
  (0, 24)	58
  (0, 25)	70
  (0, 26)	12
  (0, 27)	4
  (0, 28)	55
  (0, 29)	38
  (0, 30)	74
  (0, 31)	30
  (0, 32)	3
  (0, 33)	13
  (0, 34)	1
  (0, 35)	8
  (0, 36)	2


In [32]:
bow_text1 = bow_transformer1.transform(text_data['text1'])
print(bow_text1.shape)

(3000, 38)


In [33]:
bow_text2 = bow_transformer2.transform(text_data['text2'])
print(bow_text2.shape)

(3000, 38)


## **TF-IDF Transformer(Term Frequency-Inverse Document Frequency)**

TF-IDF stands for term frequency-inverse document frequency. The tf-idf weight is widely used in information retrieval and text mining as a statistical measure of how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

Typically, the tf-idf weight is the product of two terms. The first is the normalized Term Frequency (TF): the number of times a word appears in a document divided by the total number of words in that document. The second is the Inverse Document Frequency (IDF): the logarithm of the number of documents in the corpus divided by the number of documents in which the term appears.

**TF: Term Frequency** measures how frequently a term occurs in a document. Since documents differ in length, a term is likely to appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (the total number of terms in the document) as a way of normalization:

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

**IDF: Inverse Document Frequency** measures how important a term is. When computing TF, all terms are considered equally important. However, certain terms, such as "is", "of", and "that", may appear many times but carry little importance. We therefore weigh down the frequent terms and scale up the rare ones by computing:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
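
The two formulas can be checked by hand on a tiny corpus. This is a minimal sketch of the textbook definitions given above, using three invented documents; the helper names `tf`, `idf`, and `tfidf` are hypothetical.

```python
import math

docs = [
    "rap boss arrested rap".split(),
    "sir paul rocks".split(),
    "rap mogul rocks".split(),
]

def tf(term, doc):
    # Share of the document's tokens that are `term`.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Natural log of (number of documents / documents containing `term`).
    # Assumes the term occurs in at least one document.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("rap", docs[0], docs))   # 0.5 * ln(3/2): frequent here, but also in another document
print(tfidf("boss", docs[0], docs))  # 0.25 * ln(3): rarer across the corpus
```

scikit-learn's `TfidfTransformer` differs in detail: by default it uses raw counts for TF, the smoothed form `ln((1 + n) / (1 + df)) + 1` for IDF, and L2-normalizes each row, so its values will not match this textbook version exactly.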

**Using Sklearn's TFIDF Transformer to transform the entire bag of words corpus into TFIDF corpus:**

In [34]:
tfidf_transformer1 = TfidfTransformer().fit(bow_text1)
tfidf_transformer2 = TfidfTransformer().fit(bow_text2)

In [35]:
tfidf4 = tfidf_transformer1.transform(bow3)
print(tfidf4)

  (0, 36)	0.012569908538128621
  (0, 35)	0.028643556123870647
  (0, 34)	0.003996199684477045
  (0, 33)	0.0465457787012898
  (0, 32)	0.010755660102155158
  (0, 31)	0.10741333546451493
  (0, 30)	0.2649528941458035
  (0, 29)	0.13605689158838558
  (0, 28)	0.1969244483516107
  (0, 27)	0.020703265274082477
  (0, 26)	0.04296533418580597
  (0, 25)	0.2506311160838682
  (0, 24)	0.2076657818980622
  (0, 23)	0.11099377997999876
  (0, 22)	0.1396373361038694
  (0, 21)	0.06843770906876578
  (0, 20)	0.012541250185322962
  (0, 19)	0.20408533738257836
  (0, 18)	0.08235022385612811
  (0, 17)	0.08593066837161194
  (0, 16)	0.07876977934064427
  (0, 15)	0.38668800767225375
  (0, 14)	0.12889600255741793
  (0, 13)	0.11457422449548259
  (0, 12)	0.057287112247741294
  (0, 11)	0.3508835625174154
  (0, 10)	0.009870303066602852
  (0, 9)	0.009721378425141635
  (0, 6)	0.008126656248887013
  (0, 4)	0.008681416835335026
  (0, 0)	0.5871929005393483


In [36]:
tfidf4.get_shape()

(1, 38)

**To transform the entire bag of words corpus into TFIDF corpus at once:**

In [37]:
tfidf_text1 = tfidf_transformer1.transform(bow_text1)
tfidf_text2 = tfidf_transformer2.transform(bow_text2)

**Checking out the various features of the TFIDF sparse matrices of the 2 text columns:**

In [38]:
print("Amount of non-zero values in TFIDF of text 1:",tfidf_text1.nnz)
print("Amount of non-zero values in TFIDF of text 2:",tfidf_text2.nnz)

Amount of non-zero values in TFIDF of text 1: 85913
Amount of non-zero values in TFIDF of text 2: 86091


In [39]:
print("Shape of TFIDF of text 1:",tfidf_text1.shape)
print("Shape of TFIDF of text 2:",tfidf_text2.shape)

Shape of TFIDF of text 1: (3000, 38)
Shape of TFIDF of text 2: (3000, 38)


In [40]:
print("Density (share of non-zero values) of text 1:",str(np.round((tfidf_text1.nnz/(tfidf_text1.shape[0]*tfidf_text1.shape[1]))*100,2)) + '%')
print("Density (share of non-zero values) of text 2:",str(np.round((tfidf_text2.nnz/(tfidf_text2.shape[0]*tfidf_text2.shape[1]))*100,2)) + '%')

Density (share of non-zero values) of text 1: 75.36%
Density (share of non-zero values) of text 2: 75.52%


## **Cosine Similarity**

Cosine similarity is a metric used to measure how similar the documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space.

The cosine similarity is advantageous because even if the two similar documents are far apart by the Euclidean distance (due to the size of the document), chances are they may still be oriented closer together. The smaller the angle, higher the cosine similarity.
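
A short sketch of the length-invariance argument, using hand-made count vectors (the three vectors are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

short_doc = [1, 2, 0]
long_doc = [10, 20, 0]   # same word proportions, ten times longer
other_doc = [0, 0, 3]    # shares no terms with the other two

print(cosine(short_doc, long_doc))      # close to 1: same direction
print(cosine(short_doc, other_doc))     # 0: orthogonal vectors
print(math.dist(short_doc, long_doc))   # Euclidean distance is large despite the identical direction
```

The first two documents point in the same direction and differ only in magnitude, so cosine similarity treats them as (nearly) identical while Euclidean distance does not.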

**Evaluating the cosine similarity between pairs of TF-IDF vectors to measure how closely the two texts resemble each other lexically.**

In [41]:
cos_similarity = [[]]
for i in range(tfidf_text1.shape[0]):
    cos_similarity.append(cosine_similarity(tfidf_text1[i],tfidf_text2))
cos_similarity = pd.DataFrame(cos_similarity)
cos_similarity.drop(index=cos_similarity.index[0],
        axis=0,
        inplace=True)
cos_similarity.head()

Unnamed: 0,0
1,"[0.9794896326533228, 0.9911470256727979, 0.984..."
2,"[0.9744296384348115, 0.9814332254641669, 0.982..."
3,"[0.9580958798829047, 0.966449943458041, 0.9455..."
4,"[0.9787139732146466, 0.9826477704370604, 0.978..."
5,"[0.9827139662082587, 0.9863813704706574, 0.993..."



In [42]:
cos_similarity.shape

(3000, 1)

## **Generating the cosine similarity matrix consisting of the cosine similarity scores of each and every text present in both the columns 'text1' and 'text2'**

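**It has dimensions 3000 x 3000 because each text in the 'text1' column is compared with each of the 3000 texts in the 'text2' column. Since every row of scores is transposed before concatenation, column i holds the scores for the i-th 'text1' entry and row j corresponds to the j-th 'text2' entry.**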

In [43]:
cosine_similarity_matrix = pd.DataFrame()
for i in range(cos_similarity.shape[0]):
    cosine_similarity_matrix = pd.concat([cosine_similarity_matrix,(pd.DataFrame(cos_similarity.iloc[i].values.tolist()).T)],axis=1)
cosine_similarity_matrix.head()

Unnamed: 0,0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,...,0.10,0.11,0.12,0.13,0.14,0.15,0.16,0.17,0.18,0.19
0,0.97949,0.97443,0.958096,0.978714,0.982714,0.972998,0.978367,0.981301,0.975291,0.968173,...,0.976618,0.975601,0.983786,0.972849,0.969012,0.980798,0.982882,0.96098,0.981413,0.975546
1,0.991147,0.981433,0.96645,0.982648,0.986381,0.981089,0.983925,0.990825,0.983492,0.971646,...,0.985393,0.984376,0.98735,0.980751,0.975346,0.982566,0.987279,0.974484,0.982714,0.98907
2,0.984562,0.982525,0.945575,0.978955,0.993179,0.967054,0.975098,0.985682,0.973008,0.981187,...,0.981488,0.981392,0.972822,0.980918,0.970623,0.977375,0.983194,0.953811,0.977102,0.968173
3,0.983071,0.978106,0.95068,0.978898,0.985792,0.973304,0.983347,0.994692,0.985295,0.980059,...,0.988861,0.985179,0.991335,0.98446,0.979686,0.982485,0.984484,0.96851,0.986607,0.982816
4,0.99165,0.980982,0.957508,0.985628,0.987077,0.982298,0.974667,0.986198,0.973187,0.972489,...,0.979111,0.97552,0.983011,0.973471,0.967253,0.978969,0.987936,0.963964,0.980065,0.977771


In [44]:
cosine_similarity_matrix.shape

(3000, 3000)

In [45]:
cosine_similarity_matrix.idxmin()

0    707
0    707
0    208
0    285
0    285
    ... 
0    285
0    614
0    707
0    211
0    153
Length: 3000, dtype: int64

In [46]:
cosine_similarity_matrix.iloc[707].min()

0.8951597575942714

### **Loading the pretrained Google News word2vec model through the gensim.downloader module. Its large vocabulary of word vectors can be used to check for semantic similarity between the two texts:**

In [47]:
word_vec = api.load('word2vec-google-news-300')



### **Calculating the semantic similarity between the two text documents in each row using word embeddings:**

In [48]:
similarity = []

for idx in text_data.index:
    t1 = text_data['text1'][idx]
    t2 = text_data['text2'][idx]

    if t1 == t2:
        # Identical texts are maximally similar by definition.
        similarity.append(1)
    else:
        # Keep only tokens that have a vector in the word2vec vocabulary.
        t1_words = [word for word in word_tokenize(t1) if word in word_vec]
        t2_words = [word for word in word_tokenize(t2) if word in word_vec]

        if not t1_words or not t2_words:
            # n_similarity cannot average an empty list, so score such pairs as 0.
            similarity.append(0)
        else:
            similarity.append(word_vec.n_similarity(t1_words,t2_words))
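
`n_similarity` compares two token lists by averaging each list's word vectors and taking the cosine between the two means. A minimal sketch of that idea with made-up 3-dimensional vectors (the `toy_vectors` table and the helper names are hypothetical, and far smaller than the 300-dimensional Google News model):

```python
import math

# Hypothetical toy embeddings; real word2vec vectors have 300 dimensions.
toy_vectors = {
    "film":   [0.9, 0.1, 0.0],
    "movie":  [0.8, 0.2, 0.0],
    "actor":  [0.7, 0.3, 0.1],
    "tennis": [0.0, 0.2, 0.9],
}

def mean_vector(words):
    # Component-wise average of the vectors of the given words.
    vectors = [toy_vectors[w] for w in words]
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def mean_similarity(words_a, words_b):
    return cosine(mean_vector(words_a), mean_vector(words_b))

print(mean_similarity(["film", "actor"], ["movie"]))   # related topics score high
print(mean_similarity(["film", "actor"], ["tennis"]))  # unrelated topics score low
```

Averaging discards word order, so this kind of score reflects topical overlap more than sentence meaning.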

Generating a dataframe for the similarity scores for every pair of texts present in each row of the given dataset:

In [49]:
similarity = pd.DataFrame(similarity)
similarity.head()

Unnamed: 0,0
0,0.697249
1,0.621826
2,0.711702
3,0.617103
4,0.813758


In [50]:
text_data = pd.concat([text_data,similarity],axis=1)
text_data.head()

Unnamed: 0,text1,text2,0
0,broadband challenge tv viewing number european...,gardener win double glasgow britain jason gard...,0.697249
1,rap bos arrested drug find rap mogul marion su...,amnesty chief lament war failure lack public o...,0.621826
2,player burnout worry robinson england coach an...,hank greeted wintry premiere hollywood star to...,0.711702
3,heart oak cotonsport heart oak set ghanaian co...,redford vision sundance despite sporting cordu...,0.617103
4,sir paul rock super bowl crowd sir paul mccart...,mauresmo open victory la amelie mauresmo maria...,0.813758


In [51]:
text_data.columns = ['text1','text2','Similarity Score']

In [52]:
text_data.head()

Unnamed: 0,text1,text2,Similarity Score
0,broadband challenge tv viewing number european...,gardener win double glasgow britain jason gard...,0.697249
1,rap bos arrested drug find rap mogul marion su...,amnesty chief lament war failure lack public o...,0.621826
2,player burnout worry robinson england coach an...,hank greeted wintry premiere hollywood star to...,0.711702
3,heart oak cotonsport heart oak set ghanaian co...,redford vision sundance despite sporting cordu...,0.617103
4,sir paul rock super bowl crowd sir paul mccart...,mauresmo open victory la amelie mauresmo maria...,0.813758


In [53]:
text_data.to_csv('similarity_scores.csv')

### **Saving and loading the cosine similarity function with joblib so that it can be used at deployment time:**

In [54]:
joblib.dump(cosine_similarity,'model.pkl')

['model.pkl']

In [55]:
model = joblib.load('model.pkl')
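
Note that `cosine_similarity` is a stateless function, so serializing it stores only a reference to the function and nothing learned from the data. The sketch below illustrates this with the standard-library `pickle` module, which joblib builds on, and uses `math.hypot` as a stand-in for any importable function:

```python
import math
import pickle

payload = pickle.dumps(math.hypot)

# The payload holds the module and function name, not fitted parameters.
print(payload)

restored = pickle.loads(payload)
# Unpickling looks the function up again by name and returns the same object.
print(restored is math.hypot)
```

For deployment, the objects that carry fitted state in this notebook are `bow_transformer1`, `bow_transformer2`, `tfidf_transformer1`, and `tfidf_transformer2`; those are the ones worth passing to `joblib.dump`.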