Count Vectorizer method

# PROBLEM STATEMENT

Semantic Textual Similarity (STS) measures the degree to which two texts are semantically equivalent. Given two paragraphs, the task is to quantify their semantic similarity by predicting a value between 0 and 1 for each pair.

# Data :

The data contains pairs of paragraphs, randomly sampled from a raw
dataset. Each pair may or may not be semantically similar. The candidate is to
predict a value between 0 and 1 indicating the degree of similarity between the two paragraphs in each pair.

0 means highly dissimilar

1 means highly similar

In [1]:
# Importing packages 
import pandas as pd
from scipy.spatial.distance import cosine
from sklearn.feature_extraction.text import CountVectorizer

# Importing packages for pre-processing texts
import re    # for regular expressions
import nltk  # for text manipulation
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

nltk.download('stopwords')  # fetch the stop-word list if it is not already present


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\dhyan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# loading the dataset
df = pd.read_csv('Precily_Text_Similarity.csv')

In [3]:
df.insert(0, 'ID', range(0, len(df)))
df

Unnamed: 0,ID,text1,text2
0,0,broadband challenges tv viewing the number of ...,gardener wins double in glasgow britain s jaso...
1,1,rap boss arrested over drug find rap mogul mar...,amnesty chief laments war failure the lack of ...
2,2,player burn-out worries robinson england coach...,hanks greeted at wintry premiere hollywood sta...
3,3,hearts of oak 3-2 cotonsport hearts of oak set...,redford s vision of sundance despite sporting ...
4,4,sir paul rocks super bowl crowds sir paul mcca...,mauresmo opens with victory in la amelie maure...
...,...,...,...
2995,2995,uk directors guild nominees named martin scors...,steel firm to cut 45 000 jobs mittal steel ...
2996,2996,u2 to play at grammy awards show irish rock ba...,israel looks to us for bank chief israel has a...
2997,2997,pountney handed ban and fine northampton coach...,india and iran in gas export deal india has si...
2998,2998,belle named best scottish band belle & sebas...,mido makes third apology ahmed mido hossam h...


In [4]:
df["text1"][0]

'broadband challenges tv viewing the number of europeans with broadband has exploded over the past 12 months  with the web eating into tv viewing habits  research suggests.  just over 54 million people are hooked up to the net via broadband  up from 34 million a year ago  according to market analysts nielsen/netratings. the total number of people online in europe has broken the 100 million mark. the popularity of the net has meant that many are turning away from tv  say analysts jupiter research. it found that a quarter of web users said they spent less time watching tv in favour of the net  the report by nielsen/netratings found that the number of people with fast internet access had risen by 60% over the past year.  the biggest jump was in italy  where it rose by 120%. britain was close behind  with broadband users almost doubling in a year. the growth has been fuelled by lower prices and a wider choice of always-on  fast-net subscription plans.  twelve months ago high speed internet

In [5]:
df.text2[0]


'gardener wins double in glasgow britain s jason gardener enjoyed a double 60m success in glasgow in his first competitive outing since he won 100m relay gold at the athens olympics.  gardener cruised home ahead of scot nick smith to win the invitational race at the norwich union international. he then recovered from a poor start in the second race to beat swede daniel persson and italy s luca verdecchia. his times of 6.61 and 6.62 seconds were well short of american maurice greene s 60m world record of 6.39secs from 1998.  it s a very hard record to break  but i believe i ve trained very well   said the world indoor champion  who hopes to get closer to the mark this season.  it was important to come out and make sure i got maximum points. my last race was the olympic final and there was a lot of expectation.  this was just what i needed to sharpen up and get some race fitness. i m very excited about the next couple of months.   double olympic champion  marked her first appearance on h

In [6]:
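stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
stemmer = SnowballStemmer('english')
# matches @mentions, URLs, and runs of non-alphanumeric characters
text_cleaning_re = r"@\S+|https?:\S+|[^A-Za-z0-9]+"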

In [7]:
# function to preprocess texts
def preprocess(text, stem=True):
    """Lowercase, strip URLs/mentions/punctuation, drop stop words, optionally stem."""
    text = re.sub(text_cleaning_re, ' ', str(text).lower()).strip()
    tokens = []
    for token in text.split():
        if token not in stop_words:
            tokens.append(stemmer.stem(token) if stem else token)
    return " ".join(tokens)
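
To see what the cleaning and stop-word steps of `preprocess` do without the NLTK downloads, here is a minimal standard-library sketch. The names `demo_pattern`, `DEMO_STOP_WORDS`, and `clean_without_stemming` are hypothetical; the four-word stop list only stands in for NLTK's English list, and stemming is left out.

```python
import re

# Hypothetical stand-ins, not part of the notebook: a URL / non-alphanumeric
# pattern in the spirit of text_cleaning_re and a tiny illustrative stop list.
demo_pattern = r"https?:\S+|[^A-Za-z0-9]+"
DEMO_STOP_WORDS = {"the", "of", "with", "has"}

def clean_without_stemming(text):
    """Lowercase, replace URLs and punctuation with spaces, drop stop words."""
    text = re.sub(demo_pattern, " ", str(text).lower()).strip()
    return " ".join(t for t in text.split() if t not in DEMO_STOP_WORDS)

sample = "The number of Europeans with broadband has exploded: see https://example.com/report!"
cleaned = clean_without_stemming(sample)
print(cleaned)  # number europeans broadband exploded see
```

The notebook's version additionally stems each surviving token, which is why "europeans" appears as "european" in the preprocessed dataframe.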

In [8]:
df["text1"] = df["text1"].apply(preprocess)
df["text2"] = df["text2"].apply(preprocess)

In [9]:
text1 = df.text1.tolist()
text2 = df.text2.tolist()

In [10]:
# Preprocessed dataframe
df.head()

Unnamed: 0,ID,text1,text2
0,0,broadband challeng tv view number european bro...,garden win doubl glasgow britain jason garden ...
1,1,rap boss arrest drug find rap mogul marion sug...,amnesti chief lament war failur lack public ou...
2,2,player burn worri robinson england coach andi ...,hank greet wintri premier hollywood star tom h...
3,3,heart oak 3 2 cotonsport heart oak set ghanaia...,redford vision sundanc despit sport corduroy c...
4,4,sir paul rock super bowl crowd sir paul mccart...,mauresmo open victori la ameli mauresmo maria ...


In [11]:
# Creating a function to find the cosine similarity between a pair of texts
def countvectorizer_cosine_distance_method(s1, s2):
    # fit a shared vocabulary on both texts and count word occurrences (Bag of Words)
    vectorizer = CountVectorizer()
    vectors = vectorizer.fit_transform([s1, s2]).toarray()

    # scipy's cosine() returns the cosine distance, so 1 - distance is the cosine similarity
    cos_dist = cosine(vectors[0], vectors[1])
    return 1 - cos_dist

In [12]:
similarity_score = []
for s1, s2 in zip(text1, text2):
    similarity_score.append(countvectorizer_cosine_distance_method(s1, s2))

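Cosine similarity in general ranges from -1 to 1, but we need values ranging from 0 to 1. Hence we add 1 to each cosine similarity value and divide by 2. Because word counts are never negative, the raw scores here already lie between 0 and 1, so the rescaled scores fall between 0.5 and 1.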
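
The effect of this mapping on count vectors can be checked on toy inputs. The sketch below is a standard-library illustration, not the notebook's code; `cosine_sim` and `rescale` are hypothetical helper names. The two extreme cases for count vectors are texts with no shared words and texts with identical counts.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rescale(x):
    """Map a value from [-1, 1] to [0, 1]."""
    return (x + 1) / 2

no_overlap = cosine_sim([2, 1, 0, 0], [0, 0, 1, 3])   # no shared words
same_counts = cosine_sim([2, 1, 0, 3], [2, 1, 0, 3])  # identical counts

print(rescale(no_overlap))   # 0.5
print(rescale(same_counts))  # approximately 1.0
```

With non-negative inputs the rescaled scores occupy only the upper half of the target range, which is consistent with the minimum of 0.5 in the summary statistics that follow.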

In [13]:
Similarity_Score = [((x+1)/2) for x in similarity_score]

In [14]:
pd.Series(Similarity_Score).describe()

count    3000.000000
mean        0.543828
std         0.029502
min         0.500000
25%         0.524828
50%         0.537786
75%         0.555115
max         1.000000
dtype: float64

In [15]:
df = df.assign(Similarity_Score = Similarity_Score)

In [16]:
df = df[["ID","Similarity_Score"]]

In [17]:
df.head()

Unnamed: 0,ID,Similarity_Score
0,0,0.558715
1,1,0.519592
2,2,0.536538
3,3,0.520178
4,4,0.554313


In [None]:
df.to_csv("STS_score.csv", index=False)  # the ID column already identifies each row

In [18]:
# two example texts
sent1 = 'The prime minister modi greets the press in Chennai'
sent2 = 'Modi speaks to the media in Chennai'

In [19]:
countvectorizer_cosine_distance_method(sent1, sent2)

0.5698028822981898
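
The score above can be reproduced by hand. The following is a sketch using only the standard library, assuming `CountVectorizer`'s defaults (lowercasing, and tokens of two or more word characters); `bag_of_words` and `bow_cosine` are hypothetical names, not part of the notebook.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase tokens of 2+ word characters (CountVectorizer's default tokenization)."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def bow_cosine(s1, s2):
    """Cosine similarity between the word-count vectors of two texts."""
    c1, c2 = bag_of_words(s1), bag_of_words(s2)
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(n * n for n in c1.values()))
    norm2 = math.sqrt(sum(n * n for n in c2.values()))
    return dot / (norm1 * norm2)

sent1 = 'The prime minister modi greets the press in Chennai'
sent2 = 'Modi speaks to the media in Chennai'
score = bow_cosine(sent1, sent2)
print(round(score, 4))  # 0.5698
```

The shared words are "the" (2 × 1), "modi", "in" and "chennai", giving a dot product of 5. The vector norms are √11 and √7, so the similarity is 5 / √77 ≈ 0.5698.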