## Paraphrases Template

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
from nltk.metrics import jaccard_distance

In [3]:
df = pd.read_csv('/content/drive/My Drive/test-gold/STS.input.SMTeuroparl.txt',sep='\t',header=None)

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/My Drive/test-gold/STS.input.SMTeuroparl.txt'

In [None]:
df.head()

Unnamed: 0,0,1
0,The leaders have now been given a new chance a...,The leaders benefit aujourd' hui of a new luck...
1,Amendment No 7 proposes certain changes in the...,Amendment No 7 is proposing certain changes in...
2,Let me remind you that our allies include ferv...,I would like to remind you that among our alli...
3,The vote will take place today at 5.30 p.m.,The vote will take place at 5.30pm
4,"The fishermen are inactive, tired and disappoi...","The fishermen are inactive, tired and disappoi..."


In [None]:
df['gs'] = pd.read_csv('/content/drive/My Drive/test-gold/STS.gs.SMTeuroparl.txt',sep='\t',header=None)

In [None]:
df.shape

(459, 3)

In [None]:
df.head()

Unnamed: 0,0,1,gs
0,The leaders have now been given a new chance a...,The leaders benefit aujourd' hui of a new luck...,4.5
1,Amendment No 7 proposes certain changes in the...,Amendment No 7 is proposing certain changes in...,5.0
2,Let me remind you that our allies include ferv...,I would like to remind you that among our alli...,4.25
3,The vote will take place today at 5.30 p.m.,The vote will take place at 5.30pm,4.5
4,"The fishermen are inactive, tired and disappoi...","The fishermen are inactive, tired and disappoi...",5.0


# Naive approach for the Jaccard distance

In [None]:
df['naive_jaccard'] = df.apply(lambda x: 1-jaccard_distance(set(x[0].split()), set(x[1].split())), axis = 1)

In [None]:
from scipy.stats import pearsonr
pearsonr(df['gs'], df['naive_jaccard'])[0]

0.4402114938513469

For the naive approach, we just compared the set of words (splitting by whitespaces) from the two sentences, and looked at the Jaccard distances between the two sets. Just to align it with the scale of the golden standard, what we really measure is "1-Jaccard distance". We do this to make more similar sentences have higher "similarity scores", aligned with the golden standard, which is higher if they are more similar.

# With tokenized version of the words


In [None]:
import spacy

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
df['doc_0'] = df.apply(lambda x: [token.text for token in nlp(x[0])], axis = 1)
df['doc_1'] = df.apply(lambda x: [token.text for token in nlp(x[1])], axis = 1)

In [None]:
df['jaccard'] = df.apply(lambda x: 1-jaccard_distance(set(x['doc_0']), set(x['doc_1'])), axis = 1)

In [None]:
from scipy.stats import pearsonr
pearsonr(df['gs'], df['jaccard'])[0]

0.46060347675882884

Improving the previous approach, we now don't separate words by whitespaces, instead we use the built-in tokenizer in Spacy. We see a slight improvement.

# Filtering stopwords

In [None]:
df['filtered_0'] = df.apply(lambda x: [token.text for token in nlp(x[0]) if not token.is_stop], axis = 1)
df['filtered_1'] = df.apply(lambda x: [token.text for token in nlp(x[1]) if not token.is_stop], axis = 1)

In [None]:
df['filtered_jaccard'] = df.apply(lambda x: 1-jaccard_distance(set(x['filtered_0']), set(x['filtered_1'])), axis = 1)

In [None]:
from scipy.stats import pearsonr
pearsonr(df['gs'], df['filtered_jaccard'])[0]

0.4681657218681324

Relying on Spacy's database of stopwords, we see a little improvement again, by removing them.

# Removing punctuation marks as well

After analysing a few examples, we saw that it might help if we removed the punctuation marks as well.

In [None]:
df['filtered_punct_0'] = df.apply(lambda x: [token.text for token in nlp(x[0]) if not token.is_stop or not token.is_punct], axis = 1)
df['filtered_punct_1'] = df.apply(lambda x: [token.text for token in nlp(x[1]) if not token.is_stop or not token.is_punct], axis = 1)

In [None]:
df['filtered_punct_jaccard'] = df.apply(lambda x: 1-jaccard_distance(set(x['filtered_punct_0']), set(x['filtered_punct_1'])), axis = 1)

In [None]:
from scipy.stats import pearsonr
pearsonr(df['gs'], df['filtered_punct_jaccard'])[0]

0.46060347675882884

We likely removed the same punctuations from both sentences, making the sentences seem less similar to each other according to the Jaccard distance. Apparently, this didn't help us get closer to the golden standard.