## Mandatory exercise

In [1]:
import nltk
import urllib.request
from bs4 import BeautifulSoup
from nltk.metrics import jaccard_distance
from scipy.stats import pearsonr

**Read all pairs of sentences of the trial set within the
evaluation framework of the project.**

In [2]:
pairs = {}
with open('trial/STS.input.txt', 'r') as f:
    for l in f:
        # Each line holds: sentence id <TAB> first sentence <TAB> second sentence
        sid, s1, s2 = l.rstrip('\n').split('\t')[:3]
        pairs[sid] = [s1, s2]
        print(sid, pairs[sid])

id1 ['The bird is bathing in the sink.', 'Birdie is washing itself in the water basin.']
id2 ['In May 2010, the troops attempted to invade Kabul.', 'The US army invaded Kabul on May 7th last year, 2010.']
id3 ['John said he is considered a witness but not a suspect.', '"He is not a suspect anymore." John said.']
id4 ['They flew out of the nest in groups.', 'They flew into the nest together.']
id5 ['The woman is playing the violin.', 'The young lady enjoys listening to the guitar.']
id6 ['John went horse back riding at dawn with a whole group of friends.', 'Sunrise at dawn is a magnificent view to take in if you wake up early enough for it.']


**Compute their similarities by considering words and Jaccard distance.**


In [11]:
distances = []
for sid in pairs:
    s1, s2 = pairs[sid]
    words_1 = nltk.word_tokenize(s1)
    words_2 = nltk.word_tokenize(s2)
#     print(words_1,words_2)
    # Jaccard distance between the two sets of distinct tokens, computed once
    distance = jaccard_distance(set(words_1), set(words_2))
    distances.append(distance)
    print('id:', sid, 'distance:', distance)

id: id1 distance: 0.6923076923076923
id: id2 distance: 0.7368421052631579
id: id3 distance: 0.5333333333333333
id: id4 distance: 0.5454545454545454
id: id5 distance: 0.7692307692307693
id: id6 distance: 0.8620689655172413
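
To make these numbers concrete, the distance for pair id4 can be worked out by hand. The sketch below is a plain-Python illustration rather than the NLTK implementation: `jaccard` is a hypothetical helper name, and the token lists are written out manually on the assumption that the tokenizer splits on spaces and separates the final period.

```python
def jaccard(a, b):
    """Jaccard distance (|A ∪ B| - |A ∩ B|) / |A ∪ B| (hypothetical helper)."""
    a, b = set(a), set(b)
    union = a | b
    return (len(union) - len(a & b)) / len(union)

# Tokens of pair id4, written out by hand (assumed tokenization)
words_1 = ['They', 'flew', 'out', 'of', 'the', 'nest', 'in', 'groups', '.']
words_2 = ['They', 'flew', 'into', 'the', 'nest', 'together', '.']

shared = set(words_1) & set(words_2)     # tokens present in both sentences
combined = set(words_1) | set(words_2)   # distinct tokens over both sentences
d_id4 = jaccard(words_1, words_2)
print(len(shared), len(combined), d_id4)
```

With 5 shared tokens out of 11 distinct ones, the distance is 6/11, which agrees with the value printed for id4. Note that function words and punctuation count as much as content words, which is one reason the measure says little about meaning.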


**Compare the previous results with the gold standard by giving the Pearson correlation between them.**

In [4]:
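gs = {}
with open('trial/STS.gs.txt', 'r') as f:
    for l in f:
        sid, score = l.split('\t')[:2]
        # Flip the 0-5 scale: a score x becomes |x - 5|
        gs[sid] = abs(int(score) - 5)

# Take the scores in the same order as the pairs used to build `distances`
refs = [gs[sid] for sid in pairs]
print(pearsonr(refs, distances)[0])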

-0.3962389776119233


We obtain a negative correlation coefficient of about -0.396. The sign is negative because the gold standard measures similarity, not distance: its values are larger when the sentences are more alike, whereas the Jaccard distance is smaller when the sentences share more words.
Another important point is that the Pearson coefficient is below 0.5 in absolute value, which indicates only a weak linear relationship between the two arrays. Word-level Jaccard distance is therefore probably not the best way to measure the semantic similarity between sentences.
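
The sign argument can be checked with a small sketch. It uses a hand-written `pearson` helper (a hypothetical name, standing in for `scipy.stats.pearsonr`) and the six distances printed earlier. The gold scores are an assumption: the notebook never prints them, so the list `assumed_gold` below is only an illustrative guess for id1 to id6.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Jaccard distances printed earlier for id1..id6
distances = [0.6923076923076923, 0.7368421052631579, 0.5333333333333333,
             0.5454545454545454, 0.7692307692307693, 0.8620689655172413]
# ASSUMPTION: gold similarity scores for id1..id6 (not shown in the notebook)
assumed_gold = [5, 4, 3, 2, 1, 0]

r_distance = pearson(assumed_gold, distances)
# Turning each distance d into the similarity 1 - d only flips the sign of r
similarities = [1 - d for d in distances]
r_similarity = pearson(assumed_gold, similarities)
print(r_distance, r_similarity)
```

Under this assumption the first value comes out close to the coefficient reported above, and the second has the same magnitude with the opposite sign. Converting the distance into a similarity therefore changes how the result reads, not how strong the relationship is.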

## Optional exercise: Language identifier

Implement a language identifier:

In [111]:
import pandas as pd 
import csv
from nltk.collocations import TrigramCollocationFinder
from nltk import word_tokenize
def preprocessing(data):
    # remove the digits (regex=True makes pandas treat the pattern as a regular expression)
    data['sentence'] = data['sentence'].str.replace(r'\d+', '', regex=True)
    # replace punctuation (runs of non-word characters) with a single space
    data['sentence'] = data['sentence'].str.replace(r'\W+', ' ', regex=True)
    # replace continuous white spaces by a single one
    data['sentence'] = data['sentence'].str.replace(r'\s+', ' ', regex=True)
    # concatenate all sentences, each followed by a double space
    concatenated = ''
    for a in data['sentence']:
        concatenated += a.strip() + '  '
    return data, concatenated

def extract_trigrams(array_of_strings):
    for big_string in array_of_strings:
        # Iterating over a string yields characters, so these are character trigrams
        finder = TrigramCollocationFinder.from_words(big_string)
#         finder.apply_freq_filter(5)
        # Show the first five (trigram, count) pairs
        print(list(finder.ngram_fd.items())[:5])

deu_train = pd.read_csv('datasets/deu_trn.txt', sep='\t', lineterminator='\n',names=['id','sentence'],header=None).set_index('id')
deu_test = pd.read_csv('datasets/deu_tst.txt', sep='\t', lineterminator='\n',names=['id','sentence'], header=None).set_index('id')

eng_train = pd.read_csv('datasets/eng_trn.txt', sep='\t', lineterminator='\n',names=['id','sentence'], header=None).set_index('id')
eng_test = pd.read_csv('datasets/eng_tst.txt', sep='\t', lineterminator='\n',names=['id','sentence'], header=None).set_index('id')

fra_train = pd.read_csv('datasets/fra_trn.txt', sep='\t', lineterminator='\n',names=['id','sentence'], header=None).set_index('id')
fra_test = pd.read_csv('datasets/fra_tst.txt', sep='\t', lineterminator='\n',names=['id','sentence'], header=None).set_index('id')

ita_train = pd.read_csv('datasets/ita_trn.txt', sep='\t', lineterminator='\n', names=['id','sentence'], header=None).set_index('id')
ita_test = pd.read_csv('datasets/ita_tst.txt', sep='\t', lineterminator='\n', names=['id','sentence'], header=None).set_index('id')

dutch_train = pd.read_csv('datasets/nld_trn.txt', sep='\t', lineterminator='\n', names=['id','sentence'], header=None).set_index('id')
dutch_test = pd.read_csv('datasets/nld_tst.txt', sep='\t', lineterminator='\n', names=['id','sentence'], header=None).set_index('id')

spa_train = pd.read_csv('datasets/spa_trn.txt', sep='\t', lineterminator='\n', names=['id','sentence'], quoting=csv.QUOTE_NONE,encoding = 'utf-8', header=None).set_index('id')
spa_test = pd.read_csv('datasets/spa_tst.txt', sep='\t', lineterminator='\n',names=['id','sentence'], encoding = 'utf-8', header=None).set_index('id')
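
To see what the three substitutions in `preprocessing` do to a single sentence, here is a minimal sketch that uses the standard `re` module instead of pandas. The helper name `clean_sentence` and the sample sentence are made up for illustration and are not part of the datasets.

```python
import re

def clean_sentence(sentence):
    """Apply the three regex steps to one string (hypothetical helper)."""
    sentence = re.sub(r'\d+', '', sentence)    # drop digits
    sentence = re.sub(r'\W+', ' ', sentence)   # punctuation runs -> one space
    sentence = re.sub(r'\s+', ' ', sentence)   # collapse whitespace
    return sentence.strip()

# Made-up sample sentence, not taken from the datasets
sample = 'Nel 2005, la squadra vinse 3 gare!'
cleaned = clean_sentence(sample)
print(cleaned)   # Nel la squadra vinse gare
```

None of the three substitutions changes letter case, so `Nel` keeps its capital letter. Accented letters such as `ù` also survive, because `\W` is Unicode-aware for Python strings.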



In [112]:
pre, concat = preprocessing(ita_test)
extract_trigrams([concat])

[(('O', 'r', 'a'), 42), (('r', 'a', ' '), 3430), (('a', ' ', 'q'), 539), ((' ', 'q', 'u'), 2587), (('q', 'u', 'e'), 1635)]
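
The keys in this output are tuples of single characters because a Python string iterates character by character. The same kind of count can be sketched without NLTK, using `collections.Counter` as a stand-in; `char_trigrams` is a hypothetical helper, and the input is just the first two words of the first Italian test sentence.

```python
from collections import Counter

def char_trigrams(text):
    """Count overlapping character trigrams in a string (hypothetical helper)."""
    return Counter(tuple(text[i:i + 3]) for i in range(len(text) - 2))

counts = char_trigrams('Ora questa')
# Counter keeps keys in first-seen order, so these are the first five trigrams
first_five = list(counts)[:5]
print(first_five)
```

The five trigrams are the same ones, in the same order, as the keys printed by `extract_trigrams`. The counts differ because this sketch sees only two words. Per-language frequency tables of this kind are the raw material a trigram-based language identifier would compare against.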


In [68]:
ita_test

| id | sentence |
|---|---|
| 1 | Ora questa squadra può fare il salto di qualità. |
| 2 | Il kaiser di Kerpen, che dovrebbe tornare in p... |
| 3 | Lo rivela ‘Chi’ nel numero in edicola domani. |
| 4 | Ovvero, le applicazioni che determinano la pos... |
| 5 | Maxi operazione antimafia della Squadra Mobile... |
| 6 | SPB 510: chiusura totale alla circolazione dei... |
| 7 | Chiunque è in grado di leggere e verificare". |
| 8 | Schierato in GP2 Series nel 2005 e nel 2006 ne... |
| 9 | I rappresentanti dei lavoratori, che per il 20... |
| 10 | Vittoria del Deportivo La Coruna sullo Xerez, ... |
