# Semantic Textual Similarity

# Problem statement
Given two paragraphs, quantify how similar they are in meaning. Semantic Textual Similarity (STS)
assesses the degree to which two sentences are semantically equivalent. The STS task is motivated
by the observation that accurately modelling the meaning similarity of sentences is a foundational
language understanding problem relevant to numerous applications, including machine translation (MT),
summarization, generation, question answering (QA), short answer grading, and semantic search.

# Importing necessary libraries 

In [52]:
import tensorflow as tf       # TensorFlow, required by the Universal Sentence Encoder (USE) version 4
import pandas as pd           # To work with dataframes
import tensorflow_hub as hub  # To load USE4 from TensorFlow Hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"  # The model is loaded from this URL
model = hub.load(module_url)
def embed(texts):
  # Return one embedding vector per input string (parameter renamed so it does not shadow the built-in input)
  return model(texts)

We use the Universal Sentence Encoder (USE). It encodes text into high-dimensional vectors that can be used for our semantic similarity task. The pre-trained model is publicly available on TensorFlow Hub.

In [53]:
data = pd.read_csv("C:/Users/Admin/Downloads/Text_Similarity_Dataset (1).csv")
data.head(10)

|   | Unique_ID | text1 | text2 |
|---|---|---|---|
| 0 | 0 | savvy searchers fail to spot ads internet sear... | newcastle 2-1 bolton kieron dyer smashed home ... |
| 1 | 1 | millions to miss out on the net by 2025 40% o... | nasdaq planning $100m share sale the owner of ... |
| 2 | 2 | young debut cut short by ginepri fifteen-year-... | ruddock backs yapp s credentials wales coach m... |
| 3 | 3 | diageo to buy us wine firm diageo the world s... | mci shares climb on takeover bid shares in us ... |
| 4 | 4 | be careful how you code a new european directi... | media gadgets get moving pocket-sized devices ... |
| 5 | 5 | india seeks to boost construction india has cl... | music mogul fuller sells company pop idol supr... |
| 6 | 6 | podcasters look to net money nasa is doing it... | ukip outspent labour on eu poll the uk indepen... |
| 7 | 7 | row over police power for csos the police fe... | ban on hunting comes into force fox hunting wi... |
| 8 | 8 | election could be terror target terrorists m... | nhs waiting time target is cut hospital waitin... |
| 9 | 9 | japan economy slides to recession the japanese... | optimism remains over uk housing the uk proper... |


In [54]:
data['text1'][0]

'savvy searchers fail to spot ads internet search engine users are an odd mix of naive and sophisticated  suggests a report into search habits.  the report by the us pew research center reveals that 87% of searchers usually find what they were looking for when using a search engine. it also shows that few can spot the difference between paid-for results and organic ones. the report reveals that 84% of net users say they regularly use google  ask jeeves  msn and yahoo when online.  almost 50% of those questioned said they would trust search engines much less  if they knew information about who paid for results was being hidden. according to figures gathered by the pew researchers the average users spends about 43 minutes per month carrying out 34 separate searches and looks at 1.9 webpages for each hunt. a significant chunk of net users  36%  carry out a search at least weekly and 29% of those asked only look every few weeks. for 44% of those questioned  the information they are looking

In [55]:
type(data['text1'][0]) # the first entry is a Python string

str

# Text to vectors
We use USE version 4. The model is reported to be trained on a variety of data sources, Wikipedia among them, rather than on Wikipedia alone. Each text is a sequence of words. When we pass a text to the model (USE4), it returns a dense numeric vector. Here we pass a pair of texts and get back a pair of vectors, each with 512 dimensions.



In [56]:
message = [data['text1'][0], data['text2'][0]]
message_embeddings = embed(message)
message_embeddings

<tf.Tensor: shape=(2, 512), dtype=float32, numpy=
array([[ 0.05397232, -0.04840362, -0.05309717, ...,  0.04776653,
        -0.06002417, -0.02362861],
       [-0.04064741, -0.05544911, -0.0575323 , ...,  0.05157086,
        -0.05860625, -0.05815785]], dtype=float32)>

In [57]:
type(message_embeddings)

tensorflow.python.framework.ops.EagerTensor

Here we can see that the returned vector has type tensorflow.python.framework.ops.EagerTensor, so we convert it to a NumPy array before computing the cosine similarity with NumPy functions.

In [58]:
type(message_embeddings[0])

tensorflow.python.framework.ops.EagerTensor

In [59]:

type(message_embeddings.numpy())

numpy.ndarray

In [60]:
a_np = message_embeddings.numpy()  # convert the EagerTensor to a NumPy array

# Proposed approach
* The Universal Sentence Encoder version 4 makes getting sentence-level embeddings as easy as it has historically been to look up the embeddings for individual words.
* Word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word, such that words that are closer in the vector space are expected to be similar in meaning.

# Finding Cosine similarity
We loop over every sentence pair in the data and compute the vector representation of each sentence. For each vector pair, we compute the cosine of the angle between the two vectors using the usual cosine formula.


# Formula for Cosine Similarity
## cos_sim = dot(a, b) / (norm(a) * norm(b))
Cosine similarity ranges from -1 to 1. We need scores between 0 and 1, so we add 1 to each cosine similarity (giving values from 0 to 2) and then divide by the largest shifted value.
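
To make the formula concrete, here is a minimal sketch that evaluates it on toy vectors. The function name `cosine_similarity` and the vectors are made up for illustration; they are not part of the notebook and are not real sentence embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Note the parentheses: divide by the product of the two norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors chosen only for illustration (assumption, not real embeddings).
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])        # perpendicular vectors
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])        # anti-parallel vectors

print(same_direction, orthogonal, opposite)
```

Parallel vectors score close to 1, perpendicular vectors close to 0, and anti-parallel vectors close to -1, which is why a shift and rescale is needed to land in the 0 to 1 range.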


In [61]:
from numpy import dot          # to calculate the dot product of two vectors
from numpy.linalg import norm  # to compute the norm of a vector

ans = []                        # cosine similarity for each text pair, in row order
for i in range(len(data)):
  messages = [data['text1'][i], data['text2'][i]]        # the i-th text pair
  message_embeddings = embed(messages)                   # convert the text pair to a vector pair with embed()
  a = message_embeddings.numpy()                         # convert the EagerTensor to a NumPy array
  cos_sim = dot(a[0], a[1])/(norm(a[0])*norm(a[1]))      # cosine of the angle between the two vectors
  ans.append(cos_sim)
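
The loop above computes one cosine per pair. If the embeddings for all `text1` entries and all `text2` entries were already stacked into two arrays (an assumption; the notebook does not do this), the same scores could be computed row by row in a single NumPy expression. The sketch below uses random vectors of width 512 as hypothetical stand-ins for real embeddings, and the names `rowwise_cosine`, `emb1`, and `emb2` are illustrative only.

```python
import numpy as np

def rowwise_cosine(emb1, emb2):
    """Cosine similarity between matching rows of two (n, d) arrays."""
    emb1 = np.asarray(emb1, dtype=float)
    emb2 = np.asarray(emb2, dtype=float)
    dots = np.sum(emb1 * emb2, axis=1)                                    # one dot product per row
    norms = np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)   # product of row norms
    return dots / norms

# Hypothetical stand-ins for embeddings: random vectors, 4 pairs, 512 dimensions.
rng = np.random.default_rng(0)
emb1 = rng.normal(size=(4, 512))
emb2 = rng.normal(size=(4, 512))

batched = rowwise_cosine(emb1, emb2)

# Same computation written as an explicit loop, for comparison.
looped = np.array([
    np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    for a, b in zip(emb1, emb2)
])

print(np.allclose(batched, looped))
```

Both versions give the same values; the batched form only avoids the Python-level loop over rows.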

In [62]:
len(ans)


4023

In [63]:
Ans = pd.DataFrame(ans, columns = ['Similarity_Score'])         #converting the ans list into Dataframe so that we can add it to our "Data"

In [64]:

Ans.head()

|   | Similarity_Score |
|---|---|
| 0 | 0.170659 |
| 1 | 0.188169 |
| 2 | 0.463088 |
| 3 | 0.421391 |
| 4 | 0.39246 |


In [65]:
Data = data.join(Ans)  #Joining the Similarity_Score Dataframe (Ans) to our main data

In [66]:
Data['Similarity_Score'] = Data['Similarity_Score'] + 1  # add 1 to each Similarity_Score so the values range from 0 to 2 (initially [-1, 1])

In [67]:
Data['Similarity_Score'] = Data['Similarity_Score']/Data['Similarity_Score'].abs().max() #Normalizing the Similarity_Score to get the value between 0 and 1
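
Dividing by the maximum makes the result depend on the data: the highest-scoring pair always maps to exactly 1. The sketch below contrasts that with a fixed mapping, (cos + 1) / 2, which is an alternative shown here only for comparison and is not what the notebook does. The function names and the cosine values are made up for illustration.

```python
import numpy as np

def rescale_by_max(cos_scores):
    """Shift by 1, then divide by the largest shifted value (the notebook's approach)."""
    shifted = np.asarray(cos_scores, dtype=float) + 1.0
    return shifted / np.abs(shifted).max()

def rescale_fixed(cos_scores):
    """Shift by 1, then divide by 2 (alternative, independent of the data)."""
    return (np.asarray(cos_scores, dtype=float) + 1.0) / 2.0

# Made-up cosine values, for illustration only.
cos_scores = np.array([-0.5, 0.0, 0.5])

by_max = rescale_by_max(cos_scores)
fixed = rescale_fixed(cos_scores)

print(by_max)
print(fixed)
```

The two mappings agree whenever the largest shifted score is 2. The scores printed in this notebook are consistent with that case, since (0.170659 + 1) / 0.585329 is approximately 2.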

In [72]:
Data.head()

|   | Unique_ID | text1 | text2 | Similarity_Score |
|---|---|---|---|---|
| 0 | 0 | savvy searchers fail to spot ads internet sear... | newcastle 2-1 bolton kieron dyer smashed home ... | 0.585329 |
| 1 | 1 | millions to miss out on the net by 2025 40% o... | nasdaq planning $100m share sale the owner of ... | 0.594085 |
| 2 | 2 | young debut cut short by ginepri fifteen-year-... | ruddock backs yapp s credentials wales coach m... | 0.731544 |
| 3 | 3 | diageo to buy us wine firm diageo the world s... | mci shares climb on takeover bid shares in us ... | 0.710695 |
| 4 | 4 | be careful how you code a new european directi... | media gadgets get moving pocket-sized devices ... | 0.69623 |


In [73]:

Final_score = Data[['Unique_ID', 'Similarity_Score']]

In [74]:
Final_score.head()

|   | Unique_ID | Similarity_Score |
|---|---|---|
| 0 | 0 | 0.585329 |
| 1 | 1 | 0.594085 |
| 2 | 2 | 0.731544 |
| 3 | 3 | 0.710695 |
| 4 | 4 | 0.69623 |


In [75]:
Final_score.to_csv('C:/Users/Admin/Downloads/Final_score.csv',index = False )