## Students:
1.   João Valério
2.   Eirik Grytøyr

In [1]:
# Getting the file STS.input.SMTeuroparl.txt from drive into a DataFrame
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
dt = pd.read_csv('/content/drive/My Drive/data/ihlt/test-gold/STS.input.SMTeuroparl.txt',sep='\t',header=None)

Mounted at /content/drive


In [2]:
# Adding the gold standard scores from STS.gs.SMTeuroparl.txt as a new column
# (read_csv returns a DataFrame, so select its single column explicitly)
dt['gs'] = pd.read_csv('/content/drive/My Drive/data/ihlt/test-gold/STS.gs.SMTeuroparl.txt',sep='\t',header=None)[0]

In [3]:
import nltk
from nltk.stem import PorterStemmer
import re

# Getting a list of stop words
nltk.download('stopwords')
stopWordSet = set(nltk.corpus.stopwords.words('english'))

# Create the stemmer and compile the punctuation pattern once, not once per word
stemmer = PorterStemmer()
punctuation = re.compile(r'''[!"#$%&'()*+, -./:;<=>?@[\]^_`{|}~]+''')

def cleaner(sentenceList):

  # Stem each word and lowercase it, so inflected forms of a word compare as equal
  sentenceList = [stemmer.stem(word).lower() for word in sentenceList]

  # Drop stop words and any token that contains punctuation
  sentenceList = [word for word in sentenceList if punctuation.search(word) is None and word not in stopWordSet]

  return sentenceList

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
from nltk.metrics import jaccard_distance
nltk.download('punkt')

# Lists to save the Tokenizations
text1 = []
text2 = []

# Adding an empty column to the DataFrame
dt['jaccard'] = ''

limit = len(dt)

for id in range(limit):

  # Tokenization of the 2 texts
  text1.append(cleaner(nltk.word_tokenize(dt.loc[id,0])))
  text2.append(cleaner(nltk.word_tokenize(dt.loc[id,1])))
  
  # Updating the DataFrame with the Jaccard distance (0 = identical token sets, 1 = no shared tokens)
  dt.loc[id,'jaccard'] = jaccard_distance(set(text1[id]), set(text2[id]))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [5]:
# Additional code to compare gs and jaccard in the same scale
'''
for id in range(limit):

  # Difference between jaccard and gs on the same scale
  diff = float(dt.loc[id,'jaccard']) - abs((float(dt.loc[id,'gs']) / 5 - 1))
  
  # The minDiff variable defines the minimum difference we are looking for
  minDiff = 0.8
  if (diff > minDiff):
    print('id:', id)
    print('jaccard:', float(dt.loc[id,'jaccard']))
    print('gs:', dt.loc[id,'gs'])
    print('Difference in the same scale:', diff)
    print('1. Initial phrase:', dt.loc[id,0])
    print('1. Tokenized phrase:', text1[id])
    print('2. Initial phrase:', dt.loc[id,1])
    print('2. Tokenized phrase:', text2[id], '\n\n')
'''

In [6]:
display(dt)

| | 0 | 1 | gs | jaccard |
|---|---|---|---|---|
| 0 | The leaders have now been given a new chance a... | The leaders benefit aujourd' hui of a new luck... | 4.500 | 0.692308 |
| 1 | Amendment No 7 proposes certain changes in the... | Amendment No 7 is proposing certain changes in... | 5.000 | 0.0 |
| 2 | Let me remind you that our allies include ferv... | I would like to remind you that among our alli... | 4.250 | 0.666667 |
| 3 | The vote will take place today at 5.30 p.m. | The vote will take place at 5.30pm | 4.500 | 0.25 |
| 4 | The fishermen are inactive, tired and disappoi... | The fishermen are inactive, tired and disappoi... | 5.000 | 0.0 |
| ... | ... | ... | ... | ... |
| 454 | It is our job to continue to support Latvia wi... | It is of our duty of continue to support the c... | 5.000 | 0.636364 |
| 455 | The vote will take place today at 5.30 p.m. | Vote will take place at 17 h 30. | 4.750 | 0.571429 |
| 456 | Neither was there a qualified majority within ... | There was no qualified majority in this Parlia... | 5.000 | 0.538462 |
| 457 | Let me remind you that our allies include ferv... | I hold you recall that our allies, there are e... | 4.000 | 0.727273 |


In [7]:
from scipy.stats import pearsonr

# Get the Pearson correlation and the p-value between gs and jaccard
# (the jaccard column has object dtype, so cast it to float first)
corr, p = pearsonr(dt['gs'], dt['jaccard'].astype(float))
print("Correlation coefficient:", corr)
print("p-value:", p)

Correlation coefficient: -0.5021113739773013
p-value: 1.0947827632711695e-30


The code gives a Pearson correlation of -0.50 between the gold standard and the Jaccard distance, with a p-value of 1.09e-30. The sign is negative because the gold standard measures similarity (a higher score means the sentences are more alike), whereas the Jaccard distance measures dissimilarity (a higher value means they are less alike). The p-value is very small, so the null hypothesis of no linear correlation can be rejected; even so, we consider the amount of data too small to treat this as a definitive conclusion.
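To make the sign of the coefficient concrete, here is a minimal sketch of the Pearson coefficient in plain Python. The `pearson` helper and the toy values are assumptions made for illustration; they are not taken from the dataset, and the notebook itself uses `scipy.stats.pearsonr`.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, not taken from the dataset
gold = [5.0, 4.0, 2.5, 1.0]        # similarity scores on a 0-5 scale
distance = [0.0, 0.3, 0.6, 0.9]    # Jaccard distances on a 0-1 scale
similarity = [1 - d for d in distance]

r_distance = pearson(gold, distance)      # negative: similarity vs. distance
r_similarity = pearson(gold, similarity)  # same magnitude, positive sign
print(r_distance, r_similarity)
```

Converting the distance into a similarity with `1 - d` only flips the sign of the coefficient, so the strength of the relationship is unchanged.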

With the standard tokenization alone, without data cleaning, the correlation is only -0.45. To improve it, we implemented a cleaner function that:
1.   Stems the words to their base form, because inflected forms have approximately the same meaning;
2.   Lowercases the words, because the Jaccard set comparison is case-sensitive;
3.   Removes the words found in the stop word set, because they contribute little to the meaning of the sentence;
4.   Removes punctuation, because it usually, but not always, contributes little to the meaning of the sentence.
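The effect of these steps can be sketched without any dependency. The example below is an illustration under stated assumptions: `STOP_WORDS` is a deliberately tiny, hypothetical stop word list (the notebook uses NLTK's English list), `simple_clean` and `jaccard_distance` are stand-in helpers, the sentences are invented, and stemming is left out to keep the sketch free of NLTK.

```python
import string

# Hypothetical, deliberately tiny stop word list (assumption for this sketch)
STOP_WORDS = {"the", "a", "is", "of", "in", "at", "will"}

def simple_clean(tokens):
    """Lowercase tokens, then drop stop words and tokens containing punctuation."""
    lowered = [t.lower() for t in tokens]
    return [
        t for t in lowered
        if t not in STOP_WORDS and not any(ch in string.punctuation for ch in t)
    ]

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| for two sets (0.0 when both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

# Invented sentences, already split into tokens
s1 = "The vote will take place today .".split()
s2 = "the Vote takes place today !".split()

raw = jaccard_distance(set(s1), set(s2))
cleaned = jaccard_distance(set(simple_clean(s1)), set(simple_clean(s2)))
print(raw, cleaned)
```

In this toy case the distance drops from about 0.82 to 0.4 after cleaning. The remaining mismatch is "take" versus "takes", which is the kind of difference the stemming step is meant to remove.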

Nonetheless, these steps are general-purpose, knowledge-based rules that do not adapt to particular cases.

To conclude, the data analysis shows that the correlation is influenced mainly by the tokenization and by the Jaccard method. In tokenization, the punkt-based splitter only accounts for abbreviations and sentence-final punctuation marks, so it has problems with particular tokens, such as strings that mix numbers and punctuation (for example, "5.30 p.m." versus "5.30pm" in row 3), and with token frequency. Additionally, the Jaccard distance only measures the overlap between the two sets of words and ignores their meaning, so synonyms and words in another language count as mismatches. Overall, the method is useful for measuring similarity when the phrases use exactly the same vocabulary and language, without such particular tokens.
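These limitations can be illustrated with a small, self-contained sketch. The `jaccard_distance` helper is a stand-in for `nltk.metrics.jaccard_distance`, and the token sets are hand-written assumptions: the time expressions are loosely based on rows 3 and 455 of the table, while the synonym pair is invented.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| for two sets (0.0 when both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

# Same time of day written in two formats: the tokens do not match
time_a = {"vote", "take", "place", "5.30", "p.m."}
time_b = {"vote", "take", "place", "17", "h", "30"}
time_distance = jaccard_distance(time_a, time_b)

# Hypothetical synonym pair: same meaning, no shared word
synonym_distance = jaccard_distance({"big", "car"}, {"large", "automobile"})

# Token frequency is lost because both sentences collapse to the same set
frequency_distance = jaccard_distance(
    set("very very good".split()), set("very good".split())
)

print(time_distance, synonym_distance, frequency_distance)
```

The synonym pair gets the maximum distance of 1.0 although the meaning is the same, while the two sentences that differ only in how often a word is repeated get a distance of 0.0.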