# Finding Semantic Textual Similarity

# PROBLEM STATEMENT

Given two paragraphs, quantify the degree of similarity between the two text-based on Semantic
similarity. Semantic Textual Similarity (STS) assesses the degree to which two sentences are
semantically equivalent to each other. The STS task is motivated by the observation that accurately
modelling the meaning similarity of sentences is a foundational language understanding problem
relevant to numerous applications including machine translation (MT), summarization, generation,
question-answering (QA), short answer grading, semantic search.
STS is the assessment of pairs of sentences according to their degree of semantic similarity. The task
involves producing real-valued similarity scores for sentence pairs.

# Dataset Link:

https://drive.google.com/file/d/1OSJR7wLfNunt1WPD03Kj63WAH6Ch1cFf/view?ts=5db2bda5

The data contains a pair of paragraphs. These text paragraphs are randomly sampled from a raw
dataset. Each pair of the sentence may or may not be semantically similar. The candidate is to
predict a value between 0-1 indicating a degree of similarity between the pair of text paras.

0 means highly similar

1 means highly dissimilar

# Import some library 

In [1]:
import numpy as np
import pandas as pd

import re
from tqdm import tqdm

import collections

from sklearn.cluster import KMeans

from nltk.stem import WordNetLemmatizer  # For Lemmetization of words
from nltk.corpus import stopwords  # Load list of stopwords
from nltk import word_tokenize # Convert paragraph in tokens

import pickle
import sys

from gensim.models import word2vec # For represent words in vectors
import gensim

# Read Data-Set

In [2]:
# Read given data-set using pandas

text_data = pd.read_csv("Text_Similarity_Dataset.csv")
print("Shape of text_data : ", text_data.shape)
text_data.head(3)

Shape of text_data :  (4023, 3)


Unnamed: 0,Unique_ID,text1,text2
0,0,savvy searchers fail to spot ads internet sear...,newcastle 2-1 bolton kieron dyer smashed home ...
1,1,millions to miss out on the net by 2025 40% o...,nasdaq planning $100m share sale the owner of ...
2,2,young debut cut short by ginepri fifteen-year-...,ruddock backs yapp s credentials wales coach m...


In [3]:
text_data.isnull().sum() # Check if text data have any null values

Unique_ID    0
text1        0
text2        0
dtype: int64

# Preprocessing of text1 & text2 

* Convert phrases like won't to will not using function decontracted() below
* Remove Stopwords.
* Remove any special symbols and lower case all words
* lemmatizing words using WordNetLemmatizer define in function word_tokenizer below

In [4]:
def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [5]:
# Combining all the above stundents 

preprocessed_text1 = []

# tqdm is for printing the status bar

for sentance in tqdm(text_data['text1'].values):
    sent = decontracted(sentance)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)

    sent = ' '.join(e for e in sent.split() if e not in stopwords.words('english'))
    preprocessed_text1.append(sent.lower().strip())

100%|██████████| 4023/4023 [01:50<00:00, 36.35it/s]


In [6]:
# Merging preprocessed_text1 in text_data

text_data['text1'] = preprocessed_text1
text_data.head(3)

Unnamed: 0,Unique_ID,text1,text2
0,0,savvy searchers fail spot ads internet search ...,newcastle 2-1 bolton kieron dyer smashed home ...
1,1,millions miss net 2025 40 uk population still ...,nasdaq planning $100m share sale the owner of ...
2,2,young debut cut short ginepri fifteen year old...,ruddock backs yapp s credentials wales coach m...


In [7]:
# Combining all the above stundents 
from tqdm import tqdm
preprocessed_text2 = []

# tqdm is for printing the status bar
for sentance in tqdm(text_data['text2'].values):
    sent = decontracted(sentance)
    sent = sent.replace('\\r', ' ')
    sent = sent.replace('\\"', ' ')
    sent = sent.replace('\\n', ' ')
    sent = re.sub('[^A-Za-z0-9]+', ' ', sent)
   
    sent = ' '.join(e for e in sent.split() if e not in stopwords.words('english'))
    preprocessed_text2.append(sent.lower().strip())

100%|██████████| 4023/4023 [01:47<00:00, 37.26it/s]


In [8]:


# Merging preprocessed_text2 in text_data

text_data['text2'] = preprocessed_text2

text_data.head(3)



Unnamed: 0,Unique_ID,text1,text2
0,0,savvy searchers fail spot ads internet search ...,newcastle 2 1 bolton kieron dyer smashed home ...
1,1,millions miss net 2025 40 uk population still ...,nasdaq planning 100m share sale owner technolo...
2,2,young debut cut short ginepri fifteen year old...,ruddock backs yapp credentials wales coach mik...


In [9]:
def word_tokenizer(text):
            #tokenizes and stems the text
            tokens = word_tokenize(text)
            lemmatizer = WordNetLemmatizer() 
            tokens = [lemmatizer.lemmatize(t) for t in tokens]
            return tokens

# Proposed Approach

# Word embeddings :

* Word embeddings are low dimensional vectors obtained by training a neural network on a large corpus to predict   a word given a context (Continuous Bag Of Words model) or to predict the context given a word (skip gram model). The context is a window of surrounding words. Pre-trained word embeddings are also available in the word2vec code.google page.


* In this i am using Google news pre trained vectors and compare similarity between text1 & text2 using n_similarity method in gensim library which is nothing compares cosine similarity between two 

* Consider it as a unsupervised problem.

In [10]:
# Load pre_trained Google News Vectors after download file

wordmodelfile="GoogleNews-vectors-negative300.bin.gz"
wordmodel= gensim.models.KeyedVectors.load_word2vec_format(wordmodelfile, binary=True)


In [11]:
# This code check if word in text1 & text2 present in our google news vectors vocabalry.
# if not it removes that word and if present it compares similarity score between text1 and text2 words


similarity = [] # List for store similarity score



for ind in text_data.index:
    
        s1 = text_data['text1'][ind]
        s2 = text_data['text2'][ind]
        
        if s1==s2:
                 similarity.append(0.0) # 0 means highly similar
                
        else:   

            s1words = word_tokenizer(s1)
            s2words = word_tokenizer(s2)
            
           
            
            vocab = wordmodel.vocab #the vocabulary considered in the word embeddings
            
            if len(s1words and s2words)==0:
                    similarity.append(1.0)

            else:
                
                for word in s1words.copy(): #remove sentence words not found in the vocab
                    if (word not in vocab):
                           
                            
                            s1words.remove(word)
                        
                    
                for word in s2words.copy(): #idem

                    if (word not in vocab):
                           
                            s2words.remove(word)
                            
                            
                similarity.append((1-wordmodel.n_similarity(s1words, s2words))) # as it is given 1 means highly dissimilar & 0 means highly similar
       

# Final Submission

* Make Final DataFrame and save a CSV file of similarity scores with Unique_ID. (Columns : Unique_ID, Similarity_Score) 

In [12]:
# Get Unique_ID and similarity

final_score = pd.DataFrame({'Unique_ID':text_data.Unique_ID,
                     'Similarity_score':similarity})
final_score.head(3) 

Unnamed: 0,Unique_ID,Similarity_score
0,0,0.389471
1,1,0.292066
2,2,0.27289


In [13]:
# SAVE DF as CSV file 

final_score.to_csv('final_score.csv',index=False)

# END OF NOTEBOOK