# Paragraph Similarity

### Importing the dataset and libraries 

In [1]:
import pandas as pd
import numpy as np
data = pd.read_csv('Precily_Text_Similarity.csv')
data.head()

Unnamed: 0,text1,text2
0,broadband challenges tv viewing the number of ...,gardener wins double in glasgow britain s jaso...
1,rap boss arrested over drug find rap mogul mar...,amnesty chief laments war failure the lack of ...
2,player burn-out worries robinson england coach...,hanks greeted at wintry premiere hollywood sta...
3,hearts of oak 3-2 cotonsport hearts of oak set...,redford s vision of sundance despite sporting ...
4,sir paul rocks super bowl crowds sir paul mcca...,mauresmo opens with victory in la amelie maure...


In [2]:
len(data)

3000

### Working

Here we are using bag of words model
In order to determine the similarity between two text paragraphs,each of them is converted into a vector of words and their counts. The matrix formed is known as Document Term Matrix (DTM) where each row is a document and each column represents the term/token in the document.
While creating the DTM, some pre-processing is done, i.e, all words are converted to lower case and punctuation is removed.This matrix is generated using count vectorizer.
Then we calculate cosine similarity between the two paragraphs.
The less is the value of cosine distance more is the similarity.

### DTM for 1st row of paragraphs

In [19]:
dtm= count_vectorizer.fit_transform([data['text1'][0],data['text2'][0]])
print(pd.DataFrame(data=dtm.toarray(), columns=count_vectorizer.get_feature_names()))
val=1-cosine(dtm[0].toarray(),dtm[1].toarray())
print("Similarity between these two paragraphs is ",val)

   100  100m  12  120  15  1500m  16  19  1998  200  ...  win  winning  wins  \
0    2     0   2    1   0      0   0   1     0    0  ...    0        0     0   
1    0     1   0    0   1      2   1   0     1    1  ...    3        1     1   

   with  women  won  world  year  years  yet  
0     7      0    0      0     6      1    0  
1    10      4    6      3     1      0    1  

[2 rows x 414 columns]
Similarity between these two paragraphs is  0.6852556785505336


### Calculating similarity and storing in the dataset 

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
from scipy.spatial.distance import cosine

sim=[]
for ans in range(len(data)):
    dtm= count_vectorizer.fit_transform([data['text1'][ans],data['text2'][ans]])
    pd.DataFrame(data=dtm.toarray(), columns=count_vectorizer.get_feature_names())
    sim.append(1-cosine(dtm[0].toarray(),dtm[1].toarray()))
    
data['similarity count']=sim
data.head()

Unnamed: 0,text1,text2,similarity count
0,broadband challenges tv viewing the number of ...,gardener wins double in glasgow britain s jaso...,0.685256
1,rap boss arrested over drug find rap mogul mar...,amnesty chief laments war failure the lack of ...,0.431582
2,player burn-out worries robinson england coach...,hanks greeted at wintry premiere hollywood sta...,0.564167
3,hearts of oak 3-2 cotonsport hearts of oak set...,redford s vision of sundance despite sporting ...,0.665098
4,sir paul rocks super bowl crowds sir paul mcca...,mauresmo opens with victory in la amelie maure...,0.556794


In [5]:
data1 = pd.read_csv('Precily_Text_Similarity.csv')

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
# import pandas as pd
# sent1 = "India is a republic country. We are proud Indians."
# sent2 = "The current Prime Minister of India is Shri. Narendra Modi."
tfidf_vectorizer = TfidfVectorizer()
sim1=[]
for ans in range(len(data)):
    tfidf_vectors= tfidf_vectorizer.fit_transform([data['text1'][ans],data['text2'][ans]])
    pd.DataFrame(data=tfidf_vectors.toarray(),columns=tfidf_vectorizer.get_feature_names())
    sim1.append(1-cosine(tfidf_vectors[0].toarray(),tfidf_vectors[1].toarray()))

data1['TIDF similarity count']=sim1
data1.head()

Unnamed: 0,text1,text2,TIDF similarity count
0,broadband challenges tv viewing the number of ...,gardener wins double in glasgow britain s jaso...,0.551968
1,rap boss arrested over drug find rap mogul mar...,amnesty chief laments war failure the lack of ...,0.312518
2,player burn-out worries robinson england coach...,hanks greeted at wintry premiere hollywood sta...,0.408212
3,hearts of oak 3-2 cotonsport hearts of oak set...,redford s vision of sundance despite sporting ...,0.53067
4,sir paul rocks super bowl crowds sir paul mcca...,mauresmo opens with victory in la amelie maure...,0.424369


In [30]:
s1= "Max_df stands for maximum document frequency. Similar to min_df, we can ignore words which occur frequently. These words could be like the word ‘the’ that occur in every document and does not provide and valuable information to our text classification or any other machine learning model and can be safely ignored."
#s2= "Max_df stands for maximum document frequency. Similar to min_df, we can ignore words which occur frequently. These words could be like the word ‘the’ that occur in every document and does not provide and valuable information to our text classification or any other machine learning model and can be safely ignored."
s2= "We have 8 unique words in the text and hence 8 different columns each representing a unique word in the matrix. The row represents the word count. Since the words ‘is’ and ‘my’ were repeated twice we have the count for those particular words as 2 and 1 for the rest.Countvectorizer makes it easy for text data to be used directly in machine learning and deep learning models such as text classification."
#s2="Divya Jayendra Chaudhari"
# dtm= count_vectorizer.fit_transform([s1,s2])
# print(pd.DataFrame(data=dtm.toarray(), columns=count_vectorizer.get_feature_names()))
# print(1-cosine(dtm[0].toarray(),dtm[1].toarray()))

tfidf_vectors= tfidf_vectorizer.fit_transform([s1,s2])
print(pd.DataFrame(data=tfidf_vectors.toarray(),columns=tfidf_vectorizer.get_feature_names()))
print(1-cosine(tfidf_vectors[0].toarray(),tfidf_vectors[1].toarray()))

        and       any        as        be       can  classification   columns  \
0  0.288591  0.135202  0.000000  0.192394  0.270404        0.096197  0.000000   
1  0.281481  0.000000  0.197806  0.070370  0.000000        0.070370  0.098903   

      could     count  countvectorizer  ...        to     twice    unique  \
0  0.135202  0.000000         0.000000  ...  0.192394  0.000000  0.000000   
1  0.000000  0.197806         0.098903  ...  0.070370  0.098903  0.197806   

       used  valuable        we      were     which      word     words  
0  0.000000  0.135202  0.096197  0.000000  0.135202  0.096197  0.192394  
1  0.098903  0.000000  0.140741  0.098903  0.000000  0.140741  0.211111  

[2 rows x 73 columns]
0.3587795765041317


In [1]:

model = tfidf_vectors

import joblib

joblib.dump(model, 'Precily.joblib')

NameError: name 'model' is not defined