# Semantic Textual Similarity (STS)

## Problem Statement
### Semantic Textual Similarity (STS) assesses the degree to which two sentences are semantically equivalent. Given two paragraphs, quantify how similar they are in meaning. The task is to predict a value between 0 and 1 indicating the similarity between the pair of paragraphs.

## Given Data

### The data contains pairs of paragraphs, randomly sampled from a raw dataset. Each pair may or may not be semantically similar. The goal is to predict a value between 0 and 1 indicating the degree of similarity between the two paragraphs in each pair.
### 1: Highly similar
### 0: Highly dissimilar

## Approach

#### This is a Natural Language Processing (NLP) problem.
#### First, we convert the texts into numerical vectors, i.e. text embeddings.
#### After converting the sentences into vectors, we measure the similarity between them using cosine similarity.
#### We use the Universal Sentence Encoder (USE).
#### The pre-trained Universal Sentence Encoder (USE) is publicly available on TensorFlow Hub.
#### It encodes text into high-dimensional vectors that can be used for our semantic similarity task.
#### The encoding is not based on keywords alone; it also reflects context and meaning.

## Importing Dependencies

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub # used to load USE from TensorFlow Hub

In [2]:
model_url = "https://tfhub.dev/google/universal-sentence-encoder/4" # URL of USE version 4 on TensorFlow Hub

In [5]:
model = hub.load(model_url) # loading model

## Data Processing

In [6]:
df = pd.read_csv("Precily_Text_Similarity.csv")
df.head(5)

Unnamed: 0,text1,text2
0,broadband challenges tv viewing the number of ...,gardener wins double in glasgow britain s jaso...
1,rap boss arrested over drug find rap mogul mar...,amnesty chief laments war failure the lack of ...
2,player burn-out worries robinson england coach...,hanks greeted at wintry premiere hollywood sta...
3,hearts of oak 3-2 cotonsport hearts of oak set...,redford s vision of sundance despite sporting ...
4,sir paul rocks super bowl crowds sir paul mcca...,mauresmo opens with victory in la amelie maure...


In [7]:
df.shape

(3000, 2)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text1   3000 non-null   object
 1   text2   3000 non-null   object
dtypes: object(2)
memory usage: 47.0+ KB


## Encoding Text to Vectors

#### For example, when we pass a sentence to the model (USE4), it returns a dense numeric vector. Here, we pass a sentence pair and get back a pair of 512-dimensional vectors.

In [12]:
def encod(texts):
  # Return one embedding per input string (avoids shadowing the built-in input)
  return model(texts)

In [13]:
test = [df['text1'][0], df['text2'][0]]
test_encod = encod(test)
test_encod

<tf.Tensor: shape=(2, 512), dtype=float32, numpy=
array([[-0.02720235,  0.00681645, -0.03939367, ..., -0.03903358,
        -0.05795866, -0.05810072],
       [-0.05569994, -0.0564485 , -0.056383  , ...,  0.04282599,
        -0.05645383, -0.05647698]], dtype=float32)>

In [14]:
type(test_encod)

tensorflow.python.framework.ops.EagerTensor

### Here we can see that the returned vector has type tensorflow.python.framework.ops.EagerTensor. To work with it using NumPy functions, we first convert it into a NumPy array before computing the cosine similarity.

In [15]:
tf.make_ndarray(tf.make_tensor_proto(test_encod))

array([[-0.02720235,  0.00681645, -0.03939367, ..., -0.03903358,
        -0.05795866, -0.05810072],
       [-0.05569994, -0.0564485 , -0.056383  , ...,  0.04282599,
        -0.05645383, -0.05647698]], dtype=float32)

In [16]:
type(tf.make_ndarray(tf.make_tensor_proto(test_encod)))

numpy.ndarray

In [17]:
ar = tf.make_ndarray(tf.make_tensor_proto(test_encod))
ar

array([[-0.02720235,  0.00681645, -0.03939367, ..., -0.03903358,
        -0.05795866, -0.05810072],
       [-0.05569994, -0.0564485 , -0.056383  , ...,  0.04282599,
        -0.05645383, -0.05647698]], dtype=float32)

## Finding Cosine similarity

#### We iterate over all the sentence pairs in our data with a for loop and compute the vector representation of each sentence. For each vector pair, we compute the cosine of the angle between the two vectors using the usual cosine formula.

##### Cosine formula: https://www.delftstack.com/howto/python/cosine-similarity-between-lists-python/#:~:text=The%20cosine%20similarity%20measures%20the,and%20%2D1%20at%20180%20degrees.

#### cos_sim = dot(a, b) / (norm(a) * norm(b))
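As a small sanity check of the formula, the sketch below computes the cosine similarity of a few toy 2-D vectors with NumPy. The helper name `cosine_similarity` and the example vectors are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Parentheses matter: divide by the product of both norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # perpendicular vectors
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite directions
print(same, orthogonal, opposite)
```

Up to floating-point rounding, the three values are 1, 0, and -1, which are the extremes and midpoint of the cosine range.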


In [25]:
from numpy import dot
from numpy.linalg import norm

result = []
for i in range(len(df)):
    texts = [df['text1'][i], df['text2'][i]]  # one sentence pair
    texts_encod = encod(texts)  # tensor of shape (2, 512)
    ar_nd = tf.make_ndarray(tf.make_tensor_proto(texts_encod))  # convert to a NumPy array
    cos_sim = dot(ar_nd[0], ar_nd[1])/(norm(ar_nd[0])*norm(ar_nd[1]))
    result.append(cos_sim)
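The loop above encodes and compares one pair at a time. If the embeddings for all of `text1` and all of `text2` were first collected into two arrays of shape `(n, 512)`, the cosine for every pair could be computed in a single vectorized step. The sketch below illustrates this on random stand-in vectors rather than real embeddings; `rowwise_cosine`, `emb1`, and `emb2` are hypothetical names, and the notebook itself does not take this route.

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of two (n, d) arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dots = np.sum(a * b, axis=1)  # one dot product per row pair
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms

# Stand-in data: random vectors in place of real sentence embeddings.
rng = np.random.default_rng(0)
emb1 = rng.normal(size=(5, 512))
emb2 = rng.normal(size=(5, 512))

batched = rowwise_cosine(emb1, emb2)

# The same computation done pair by pair, for comparison.
looped = np.array([
    np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    for u, v in zip(emb1, emb2)
])
print(np.allclose(batched, looped))
```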

In [26]:
len(result)

3000

#### Cosine similarity ranges from -1 to 1, but we need values ranging from 0 to 1. Hence we add 1 to each cosine similarity value and then normalize the result.

In [27]:
result[:5]

[0.27266857, 0.277622, 0.1690105, 0.15746728, 0.24620086]

In [28]:
sm_sc = pd.DataFrame(result, columns = ['Similarity_Score'])  
sm_sc.head(5)

Unnamed: 0,Similarity_Score
0,0.272669
1,0.277622
2,0.169011
3,0.157467
4,0.246201


In [29]:
df = df.join(sm_sc)

In [30]:
df.head()

Unnamed: 0,text1,text2,Similarity_Score
0,broadband challenges tv viewing the number of ...,gardener wins double in glasgow britain s jaso...,0.272669
1,rap boss arrested over drug find rap mogul mar...,amnesty chief laments war failure the lack of ...,0.277622
2,player burn-out worries robinson england coach...,hanks greeted at wintry premiere hollywood sta...,0.169011
3,hearts of oak 3-2 cotonsport hearts of oak set...,redford s vision of sundance despite sporting ...,0.157467
4,sir paul rocks super bowl crowds sir paul mcca...,mauresmo opens with victory in la amelie maure...,0.246201


In [31]:
df['Similarity_Score'] = df['Similarity_Score'] + 1    # adding 1 shifts the possible range from [-1, 1] to [0, 2]

In [32]:
df.head()

Unnamed: 0,text1,text2,Similarity_Score
0,broadband challenges tv viewing the number of ...,gardener wins double in glasgow britain s jaso...,1.272669
1,rap boss arrested over drug find rap mogul mar...,amnesty chief laments war failure the lack of ...,1.277622
2,player burn-out worries robinson england coach...,hanks greeted at wintry premiere hollywood sta...,1.169011
3,hearts of oak 3-2 cotonsport hearts of oak set...,redford s vision of sundance despite sporting ...,1.157467
4,sir paul rocks super bowl crowds sir paul mcca...,mauresmo opens with victory in la amelie maure...,1.246201


### Normalization refers to rescaling real-valued numeric attributes into a 0 to 1 range.
#### More on normalization: https://www.statology.org/normalize-data-between-0-and-1/

In [34]:
df['Similarity_Score'] = df['Similarity_Score']/df['Similarity_Score'].abs().max() # divide by the largest score so the values fall in the 0 to 1 range
df.head()

Unnamed: 0,text1,text2,Similarity_Score
0,broadband challenges tv viewing the number of ...,gardener wins double in glasgow britain s jaso...,0.636334
1,rap boss arrested over drug find rap mogul mar...,amnesty chief laments war failure the lack of ...,0.638811
2,player burn-out worries robinson england coach...,hanks greeted at wintry premiere hollywood sta...,0.584505
3,hearts of oak 3-2 cotonsport hearts of oak set...,redford s vision of sundance despite sporting ...,0.578734
4,sir paul rocks super bowl crowds sir paul mcca...,mauresmo opens with victory in la amelie maure...,0.6231
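Dividing by the observed maximum ties the scale to whichever pair happens to score highest in the dataset. In the output above the divisor appears to be about 2 (1.272669 / 0.636334 ≈ 2), so the result coincides with the data-independent mapping (c + 1) / 2 of a cosine value c. A minimal sketch of that alternative, applied to the first five raw scores printed earlier, is shown below; `rescale_cosine` is a hypothetical helper name.

```python
def rescale_cosine(c):
    """Map a cosine similarity from [-1, 1] onto [0, 1] (hypothetical helper)."""
    return (c + 1.0) / 2.0

# First five raw cosine scores from the notebook output above.
raw_scores = [0.27266857, 0.277622, 0.1690105, 0.15746728, 0.24620086]
rescaled = [round(rescale_cosine(c), 6) for c in raw_scores]
print(rescaled)
```

For these five values the rescaled scores agree with the normalized column above to six decimal places.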
