In [None]:
!pip install transformers 
!pip install -U sentence-transformers
!pip install wordcloud

In [None]:
# Data loading and manipulation
import pandas as pd
import numpy as np

import tensorflow as tf 
import torch 
from sentence_transformers import SentenceTransformer, util 
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import spearmanr
from itertools import chain

import nltk
from nltk.tokenize import sent_tokenize 
from nltk.tokenize import word_tokenize


In [None]:
# Get the GPU device name. 
device_name = tf.test.gpu_device_name()
print(device_name)

/device:GPU:0


In [None]:
# For torch to use the GPU, we need to identify it and select it as the device.
if torch.cuda.is_available():
  device = torch.device("cuda")

  print('There are %d GPU(s) available.' % torch.cuda.device_count())
  print('We will use the GPU:',torch.cuda.get_device_name(0))

else:
  print('No GPU available, using the CPU instead')
  device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: Tesla K80


In [None]:
#Loading the data
df = pd.read_csv('/content/drive/MyDrive/Text_Similarity_Dataset.csv')
df.head()

Unnamed: 0,Unique_ID,text1,text2
0,0,savvy searchers fail to spot ads internet sear...,newcastle 2-1 bolton kieron dyer smashed home ...
1,1,millions to miss out on the net by 2025 40% o...,nasdaq planning $100m share sale the owner of ...
2,2,young debut cut short by ginepri fifteen-year-...,ruddock backs yapp s credentials wales coach m...
3,3,diageo to buy us wine firm diageo the world s...,mci shares climb on takeover bid shares in us ...
4,4,be careful how you code a new european directi...,media gadgets get moving pocket-sized devices ...


The two columns, text1 and text2, need to be analysed. Let's look at a single row to get a better understanding of the documents.

In [None]:
#Looking at the shape of the data
df.shape

(4023, 3)

In [None]:
df.isnull().sum()

Unique_ID    0
text1        0
text2        0
dtype: int64

In [None]:
df['text1'][0]  # text1 document for Unique_ID = 0

'savvy searchers fail to spot ads internet search engine users are an odd mix of naive and sophisticated  suggests a report into search habits.  the report by the us pew research center reveals that 87% of searchers usually find what they were looking for when using a search engine. it also shows that few can spot the difference between paid-for results and organic ones. the report reveals that 84% of net users say they regularly use google  ask jeeves  msn and yahoo when online.  almost 50% of those questioned said they would trust search engines much less  if they knew information about who paid for results was being hidden. according to figures gathered by the pew researchers the average users spends about 43 minutes per month carrying out 34 separate searches and looks at 1.9 webpages for each hunt. a significant chunk of net users  36%  carry out a search at least weekly and 29% of those asked only look every few weeks. for 44% of those questioned  the information they are looking

In [None]:
df['text2'][0]  # text2 document for Unique_ID = 0

'newcastle 2-1 bolton kieron dyer smashed home the winner to end bolton s 10-game unbeaten run.  lee bowyer put newcastle ahead when he fed stephen carr on the right flank  then sprinted into the area to power home a header from the resultant cross. wanderers hit back through stelios giannakopoulos  who ended a fluid passing move with a well-struck volley. but dyer had the last word in a game of few chances  pouncing on a loose ball after alan shearer s shot was blocked and firing into the top corner. neither side lacked urgency in the early stages of the game  with plenty of tackles flying in  but opportunities in front of goal were harder to come by. bolton keeper jussi jaaskelainen had to make two saves in quick succession midway through the first-half - keeping out shearer s low shot and dyer s close-range header - but that was the only goalmouth action of note. and it was almost out of nothing that the magpies took the lead on 35 minutes. bowyer found space with a neat turn on the

- Looking at the documents, each entry in 'text1' and 'text2' is a passage of roughly 500-800 words.
- The text has already been converted to lower case.
- We can use any of the BERT-family models for semantic textual similarity (STS).
- Since the dataset is fairly large (4023 document pairs), I will use Sentence-BERT (SBERT), a modification of the BERT network that uses siamese and triplet network structures.
- SBERT makes semantic textual similarity very efficient: each document is encoded once into a fixed-size vector, and we get accuracy close to that of BERT and RoBERTa models at a fraction of the computational cost.

## Architecture of SBERT: 
- We feed the text into a transformer network such as BERT.
- BERT produces a contextual embedding for every input token in our text, e.g. 501 word tokens for example 1 (text1[0]).
- As we want a fixed-size output representation, we need a pooling layer. Different pooling strategies are available; the most basic one is mean pooling.
- Mean pooling simply averages all the contextualised token embeddings from BERT. This gives a 768-dimensional output vector whose size is independent of the length of the input text.
- We can then use cosine similarity for the analysis, where cosine similarity is the cosine of the angle between two vectors.
- Stop words are not removed, since BERT models rely on the full sentence context to capture its meaning.
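To make the pooling step concrete, here is a minimal NumPy sketch of mean pooling. It is an illustration only, not the library's internal code: the helper name `mean_pool`, the toy hidden size of 3 and the toy values are assumptions made for the example (a BERT-base model would produce 768-dimensional token embeddings).

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, ignoring padded positions.

    token_embeddings: array of shape (num_tokens, hidden_size)
    attention_mask:   array of shape (num_tokens,), 1 = real token, 0 = padding
    """
    mask = attention_mask[:, None].astype(token_embeddings.dtype)  # (num_tokens, 1)
    summed = (token_embeddings * mask).sum(axis=0)                  # (hidden_size,)
    count = max(mask.sum(), 1.0)                                    # avoid division by zero
    return summed / count

# Toy input: 4 token positions (the last one is padding), hidden size 3.
tokens = np.array([[1.0, 2.0, 3.0],
                   [3.0, 2.0, 1.0],
                   [2.0, 2.0, 2.0],
                   [9.0, 9.0, 9.0]])  # padding row, must be ignored
mask = np.array([1, 1, 1, 0])

pooled = mean_pool(tokens, mask)
print(pooled.shape)  # one vector of length hidden_size, however many tokens went in
print(pooled)
```

The output length depends only on the hidden size, which is why documents of different lengths end up as vectors that can be compared directly.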

In [None]:
model_name = 'bert-base-nli-mean-tokens' #SBERT model from huggingface

In [None]:
model = SentenceTransformer(model_name)

sentence_vecs = model.encode(df['text1'].tolist())  # Encode the 'text1' column, passed as a plain list of strings.
sentence_vecs

Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/sbert.net_models_bert-base-nli-mean-tokens/0_BERT were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


array([[-0.54137397,  1.2598132 , -0.77625245, ...,  0.29636627,
         0.43212074,  0.19745058],
       [-0.26046246,  0.61270577,  0.06184964, ..., -0.42472935,
         0.3006261 ,  0.34556156],
       [-0.6884721 ,  0.44877404,  0.34996334, ...,  0.7363529 ,
         0.5277435 , -0.51108295],
       ...,
       [-0.5758161 ,  1.1334713 , -0.889428  , ..., -0.09568016,
         0.44836384, -0.04462703],
       [-0.33474427,  0.56737816,  0.661795  , ..., -0.1679    ,
        -0.30790818, -0.05841556],
       [ 0.07700596,  0.6332016 , -0.03906994, ...,  0.18423545,
        -0.02406658,  0.5077925 ]], dtype=float32)

In [None]:
sentence_vecs2 = model.encode(df['text2'].tolist())  # Encode the 'text2' column, passed as a plain list of strings.
sentence_vecs2

array([[-0.4202749 ,  0.6990345 , -0.2226019 , ...,  0.17469361,
         0.46033466,  0.14973462],
       [-0.55514956,  0.32917708, -0.482527  , ..., -0.60323   ,
         0.36121878,  0.0349756 ],
       [-0.29840308,  0.2588199 , -0.00465268, ...,  0.53657174,
         0.5570576 , -0.42199388],
       ...,
       [-0.30856845,  0.85962766,  0.04093473, ...,  0.01395581,
         0.23926379, -0.17645541],
       [-0.38147604,  0.60046315, -0.83278316, ..., -0.33495563,
         0.23949596, -0.02967873],
       [-0.76588607,  0.01247211,  1.0528333 , ..., -0.36547437,
         0.43119842, -0.06551446]], dtype=float32)

In [None]:
#looking at the sentence vector
print(sentence_vecs[0])
print('')
print(sentence_vecs.shape)

[-5.41373968e-01  1.25981319e+00 -7.76252449e-01  1.10632345e-01
 -1.89218000e-02 -4.67766941e-01  1.39817238e+00  2.55670458e-01
  1.24648303e-01  2.24206179e-01 -2.87715614e-01  4.45956409e-01
  2.02211037e-01  1.97331652e-01 -1.07689178e+00  7.82637477e-01
  4.50377285e-01  4.25819129e-01 -1.87249959e-01 -3.51529479e-01
 -9.31338817e-02 -4.24977660e-01 -2.70688891e-01  1.58277620e-02
  9.57239151e-01  9.63204384e-01 -3.80303323e-01 -5.57951748e-01
 -1.05252397e+00  2.84790546e-01 -2.60121912e-01  6.82908356e-01
  3.12892854e-01 -1.26097783e-01  1.32913113e-01  1.78146392e-01
  4.81873721e-01 -3.44094522e-02  5.29673696e-01  3.12636316e-01
  5.82854986e-01  3.82691771e-02  1.97854489e-02  1.86933354e-01
 -3.71493697e-01 -6.54961020e-02  9.68555287e-02  8.31038415e-01
  6.59093916e-01 -2.02466950e-01 -5.36180735e-02  5.73645413e-01
  9.84294057e-01  3.12417805e-01 -5.14605284e-01  2.36952990e-01
  8.25038373e-01 -7.36944914e-01 -6.19647563e-01 -3.74562800e-01
 -8.80797803e-01  4.97478

- Each row is converted into a 768-dimensional vector, i.e. an array of shape (768,).
- Now we can compare the vectors using cosine similarity.

## Cosine Similarity

In [None]:
# Let's look at the cosine similarity for a single pair (row 4).
cosine_similarity([sentence_vecs[4]],[sentence_vecs2[4]])

array([[0.61139774]], dtype=float32)

In [None]:
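# Compute the cosine similarity for every (text1, text2) pair in the dataset.
similarity_score = []

for i in range(len(df)):
  # cosine_similarity expects 2D inputs and returns a (1, 1) array for one pair
  score = cosine_similarity([sentence_vecs[i]], [sentence_vecs2[i]])
  similarity_score.append(score)

print(similarity_score)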

[array([[0.5196991]], dtype=float32), array([[0.6724569]], dtype=float32), array([[0.70906794]], dtype=float32), array([[0.76264703]], dtype=float32), array([[0.61139774]], dtype=float32), array([[0.66226476]], dtype=float32), array([[0.51556075]], dtype=float32), array([[0.82694936]], dtype=float32), array([[0.66762996]], dtype=float32), array([[0.7012521]], dtype=float32), array([[0.6506639]], dtype=float32), array([[0.7778119]], dtype=float32), array([[0.69653076]], dtype=float32), array([[0.582692]], dtype=float32), array([[0.7049339]], dtype=float32), array([[0.6675392]], dtype=float32), array([[0.6215814]], dtype=float32), array([[0.42613533]], dtype=float32), array([[0.6364039]], dtype=float32), array([[0.68371034]], dtype=float32), array([[0.49066374]], dtype=float32), array([[0.709445]], dtype=float32), array([[0.68763137]], dtype=float32), array([[0.5441983]], dtype=float32), array([[0.8368628]], dtype=float32), array([[0.7649214]], dtype=float32), array([[0.60027254]], dtype
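The per-row loop above can also be written as one vectorised computation, which avoids producing a list of (1, 1) arrays. The sketch below is a hedged alternative, not the notebook's original method: `rowwise_cosine` is a hypothetical helper, the toy arrays stand in for `sentence_vecs` and `sentence_vecs2`, and it assumes no row is an all-zero vector.

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between a[i] and b[i] for every row i."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    dots = np.einsum('ij,ij->i', a, b)  # row-by-row dot products
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms

# Toy stand-ins for the two embedding matrices (3 pairs, 2 dimensions).
vecs1 = np.array([[1.0, 0.0], [1.0, 1.0], [3.0, 4.0]])
vecs2 = np.array([[2.0, 0.0], [1.0, -1.0], [4.0, 3.0]])

scores = rowwise_cosine(vecs1, vecs2)
print(scores)
```

Applied to the real embeddings, `rowwise_cosine(sentence_vecs, sentence_vecs2)` would return a flat array with one score per row, so no flattening step would be needed before adding it to the DataFrame.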

- The similarity score of the first row (text1[0], text2[0]) is about 0.52.
- In general, cosine similarity ranges from -1 to 1; all the scores shown above are positive.
- The closer the cosine value is to 1, the smaller the angle and the greater the match between the vectors.
- similarity(A, B) = (A · B) / (|A| |B|), where 'A' and 'B' are two non-zero vectors.
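As a small worked check of the formula, the following sketch computes cosine similarity by hand with the standard library only. The function name `cosine` and the toy vectors are assumptions for illustration; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import math

def cosine(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])  # right angle
opposite = cosine([1.0, 2.0], [-1.0, -2.0])  # opposite direction
print(same, orthogonal, opposite)
```

The three cases come out at approximately 1, 0 and -1. Vectors pointing in opposite directions therefore give a negative score, even though none of the scores displayed for this dataset are negative.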

## Final Submission

In [None]:
# Each score is a (1, 1) array, so flatten the list into one row per pair.
flatten_score = list(chain.from_iterable(similarity_score))
Cosine_Score = pd.DataFrame(flatten_score)
df['Cosine_Score'] = Cosine_Score[0]  # column 0 holds the scores
df_alter = df.copy()
df_alter

Unnamed: 0,Unique_ID,text1,text2,Cosine_Score
0,0,savvy searchers fail to spot ads internet sear...,newcastle 2-1 bolton kieron dyer smashed home ...,0.519699
1,1,millions to miss out on the net by 2025 40% o...,nasdaq planning $100m share sale the owner of ...,0.672457
2,2,young debut cut short by ginepri fifteen-year-...,ruddock backs yapp s credentials wales coach m...,0.709068
3,3,diageo to buy us wine firm diageo the world s...,mci shares climb on takeover bid shares in us ...,0.762647
4,4,be careful how you code a new european directi...,media gadgets get moving pocket-sized devices ...,0.611398
...,...,...,...,...
4018,4018,labour plans maternity pay rise maternity pay ...,no seasonal lift for house market a swathe of ...,0.676032
4019,4019,high fuel costs hit us airlines two of the lar...,new media battle for bafta awards the bbc lead...,0.675134
4020,4020,britons growing digitally obese gadget lover...,film star fox behind theatre bid leading actor...,0.583080
4021,4021,holmes is hit by hamstring injury kelly holmes...,tsunami to hit sri lanka banks sri lanka s b...,0.689893


In [None]:
df_alter = df.drop(['text1','text2'],axis=1)
df_alter.reset_index(drop=True, inplace=True)
df_alter

Unnamed: 0,Unique_ID,Cosine_Score
0,0,0.519699
1,1,0.672457
2,2,0.709068
3,3,0.762647
4,4,0.611398
...,...,...
4018,4018,0.676032
4019,4019,0.675134
4020,4020,0.583080
4021,4021,0.689893


In [None]:
from google.colab import files

df_alter.to_csv('STS.csv', index=False)
files.download('STS.csv')


**References**: 
- https://arxiv.org/pdf/1908.10084.pdf
- https://www.sbert.net/docs/usage/semantic_textual_similarity.html