# Mandatory 2 - Session 9

Statement:
* Read all pairs of sentences of the trial set within the
evaluation framework of the project.
* Compute the Jaccard similarity of each pair using the
dependency triples from CoreNLPDependencyParser.
* Show the results. Do you think it could be relevant to use
NEs to compute the similarity between two sentences?
Justify the answer.

First, read the trial sentence pairs and their gold-standard scores with pandas.

In [14]:
import pandas as pd
import csv
from nltk.parse.corenlp import CoreNLPDependencyParser
from scipy.stats   import pearsonr
from copy          import deepcopy
from nltk.metrics  import jaccard_distance

In [7]:
trial_path    = 'data/trial/STS.input.txt'
trial_gs_path = 'data/trial/STS.gs.txt'
trial_df      = pd.read_csv(trial_path, sep='\t', lineterminator='\n', names=['sentence0','sentence1'], header=None, quoting=csv.QUOTE_NONE)
trial_gs      = pd.read_csv(trial_gs_path, sep='\t', lineterminator='\n', names=['labels'], header=None, quoting=csv.QUOTE_NONE).iloc[::-1]
trial_df

Unnamed: 0,sentence0,sentence1
id1,The bird is bathing in the sink.,Birdie is washing itself in the water basin.\r
id2,"In May 2010, the troops attempted to invade Ka...",The US army invaded Kabul on May 7th last year...
id3,John said he is considered a witness but not a...,"""He is not a suspect anymore."" John said.\r"
id4,They flew out of the nest in groups.,They flew into the nest together.\r
id5,The woman is playing the violin.,The young lady enjoys listening to the guitar.\r
id6,John went horse back riding at dawn with a who...,Sunrise at dawn is a magnificent view to take ...
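
The `quoting=csv.QUOTE_NONE` argument matters because some sentences begin with a literal double quote (see id3). The following is a minimal sketch of that effect, using only the standard `csv` module instead of pandas. The sample row, its `x1` id, the helper `read_pairs` and the three-column tab layout are assumptions made for illustration.

```python
import csv
import io

# Made-up sample row in an assumed "id <TAB> sentence0 <TAB> sentence1" layout;
# the second sentence starts with a literal double quote, as in id3.
sample = 'x1\tHe left.\t"He is not a suspect anymore." John said.\n'

def read_pairs(text, quoting):
    """Parse tab-separated rows into {id: (sentence0, sentence1)}."""
    reader = csv.reader(io.StringIO(text), delimiter='\t', quoting=quoting)
    return {row[0]: (row[1], row[2]) for row in reader}

raw = read_pairs(sample, csv.QUOTE_NONE)        # quotes are ordinary characters
parsed = read_pairs(sample, csv.QUOTE_MINIMAL)  # default: quotes delimit a field

print(raw['x1'][1])     # "He is not a suspect anymore." John said.
print(parsed['x1'][1])  # the quote characters are consumed by the parser
```

With `QUOTE_NONE` the sentence reaches the parser exactly as written, which is what the dependency parse should see.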


Start the CoreNLP server that CoreNLPDependencyParser connects to (here on port 9000):

_java -Xmx5g -cp C:\stanford-corenlp-full-2018-10-05\* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000_

In [8]:
parser = CoreNLPDependencyParser(url='http://localhost:9000')

The next two functions compute the similarity of each pair as 1 minus the Jaccard distance between two lists of items (the same approach as in Session 2) and report the Pearson correlation coefficient against the gold standard.

In [31]:
def lexical_similarity(df):
    """ Calculate the similarities using the Jaccard distance """
    guess = pd.DataFrame()
    for i in df.index:
        # each cell holds a list of items; duplicates are dropped by set()
        set1 = set(df.loc[i,'sentence0'])
        set2 = set(df.loc[i,'sentence1'])
        # similarity = 1 - distance
        guess.loc[i,'labels'] = 1. - jaccard_distance(set1, set2)
    return guess

def analyze_results(results):
    """ Print similarities and Pearson correlation coefficient """
    guess_lex = lexical_similarity(results)
    # correlation between the gold-standard scores and the computed similarities
    pearson    = pearsonr(trial_gs['labels'], guess_lex['labels'])[0]
    print(results)
    print()
    print('Similarities:\n', guess_lex)
    print()
    print('Pearson correlation index:', pearson)
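
As a worked illustration of the two measures, here is a dependency-free sketch. The helpers `jaccard_similarity` and `pearson` are written here for illustration and are not the `nltk` or `scipy` functions used in the cell above; the toy sets and scores are made up.

```python
from math import sqrt

def jaccard_similarity(a, b):
    """|A & B| / |A | B|, i.e. 1 minus the Jaccard distance."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets (an assumption)
    return len(a & b) / len(a | b)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Two toy token sets: 2 shared items out of 4 distinct ones.
print(jaccard_similarity({'the', 'bird', 'sink'}, {'the', 'bird', 'basin'}))  # 0.5

# Toy gold scores against toy predictions (made-up numbers).
print(pearson([5, 4, 3, 2, 1, 0], [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]))
```

A coefficient near 1 means the computed similarities rank the pairs in the same order as the gold scores; a value near 0 or below means they do not.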

The preprocessing function simply applies the dependency parser to every sentence in both columns.

In [32]:
def preprocessing(data):
    data = data.fillna('')
    for column in data.columns:
        # get triplets
        data[column] = data[column].apply(apply_dependency_triplets)
    return data

Results after applying the Jaccard distance to the __whole__ dependency triples...

_Conclusions at the end_

In [40]:
def apply_dependency_triplets(sentence):
    result = set()
    parse = parser.raw_parse(sentence.lower())
    tree = next(parse)
    for t in tree.triples():
        result.add(t)
    return list(result)

analyze_results(
    preprocessing(deepcopy(trial_df))
)

                                             sentence0  \
id1  [((sink, NN), det, (the, DT)), ((sink, NN), ca...   
id2  [((invade, VB), dobj, (kabul, NN)), ((invade, ...   
id3  [((considered, VBN), nsubjpass, (he, PRP)), ((...   
id4  [((flew, VBD), nmod, (groups, NNS)), ((groups,...   
id5  [((playing, VBG), nsubj, (woman, NN)), ((woman...   
id6  [((friends, NNS), case, (of, IN)), ((group, NN...   

                                             sentence1  
id1  [((washing, VBG), nsubj, (birdie, NN)), ((wash...  
id2  [((year, NN), amod, (last, JJ)), ((invaded, VB...  
id3  [((suspect, JJ), nsubj, (he, PRP)), ((suspect,...  
id4  [((flew, VBD), nmod, (nest, NN)), ((nest, NN),...  
id5  [((lady, NN), det, (the, DT)), ((enjoys, VBZ),...  
id6  [((view, NN), amod, (magnificent, JJ)), ((take...  

Similarities:
        labels
id1  0.000000
id2  0.000000
id3  0.000000
id4  0.400000
id5  0.000000
id6  0.033333

Pearson correlation index: -0.1879821089440828
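
A whole triple only counts as shared when the head, the relation and the dependent all coincide, which helps explain why most pairs score 0. The sketch below illustrates this on hand-written triples in the `((head, tag), relation, (dependent, tag))` shape shown in the output above. They are illustrative and not actual parser output, and `jaccard_similarity` is a helper defined here for the example.

```python
# Hand-written triples (illustrative, not real CoreNLP output) for two
# paraphrases: similar structure, different content words.
triples_a = {
    (('bathing', 'VBG'), 'nsubj', ('bird', 'NN')),
    (('bathing', 'VBG'), 'aux', ('is', 'VBZ')),
    (('bird', 'NN'), 'det', ('the', 'DT')),
}
triples_b = {
    (('washing', 'VBG'), 'nsubj', ('birdie', 'NN')),
    (('washing', 'VBG'), 'aux', ('is', 'VBZ')),
    (('washing', 'VBG'), 'dobj', ('itself', 'PRP')),
}

def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

shared = triples_a & triples_b
print(shared)                                    # set()
print(jaccard_similarity(triples_a, triples_b))  # 0.0
```

Even the auxiliary `is` does not match, because its head verb differs between the two sentences.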


Results after applying the Jaccard distance to the dependency triples. This time the triplets are split into three elements ...

In [41]:
def apply_dependency_triplets(sentence):
    result = set()
    parse = parser.raw_parse(sentence.lower())
    tree = next(parse)
    for t in tree.triples():
        result.add(t[0])
        result.add(t[1])
        result.add(t[2])
    return list(result)

analyze_results(
    preprocessing(deepcopy(trial_df))
)

                                             sentence0  \
id1  [nmod, (the, DT), case, det, (in, IN), (bird, ...   
id2  [nmod, case, (attempted, VBN), (in, IN), aux, ...   
id3  [(he, PRP), auxpass, conj, det, (a, DT), ccomp...   
id4  [nmod, case, (the, DT), det, (groups, NNS), (i...   
id5  [(the, DT), aux, det, (violin, NN), (playing, ...   
id6  [nmod, (a, DT), (at, IN), punct, amod, advmod,...   

                                             sentence1  
id1  [(washing, VBG), nmod, aux, dobj, (itself, PRP...  
id2  [nmod, (the, DT), aux, (invaded, VBD), nmod:tm...  
id3  [(he, PRP), det, (a, DT), ('', ''), (``, ``), ...  
id4  [nmod, case, (the, DT), det, (nest, NN), (., ....  
id5  [nmod, (the, DT), (listening, VBG), det, case,...  
id6  [nmod, (a, DT), (at, IN), advcl, cop, (to, TO)...  

Similarities:
        labels
id1  0.428571
id2  0.433333
id3  0.285714
id4  0.588235
id5  0.238095
id6  0.285714

Pearson correlation index: 0.40556896256359354
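
To see why splitting raises the scores, the sketch below compares both representations on hand-written triples (illustrative, not real parser output). The helper names `split_elements` and `jaccard_similarity` are assumptions made for this example.

```python
# Hand-written triples (illustrative) for two sentences that share a verb
# and a subject but differ in their modifiers.
triples_a = {
    (('flew', 'VBD'), 'nsubj', ('they', 'PRP')),
    (('flew', 'VBD'), 'nmod', ('groups', 'NNS')),
    (('groups', 'NNS'), 'case', ('in', 'IN')),
}
triples_b = {
    (('flew', 'VBD'), 'nsubj', ('they', 'PRP')),
    (('flew', 'VBD'), 'nmod', ('nest', 'NN')),
    (('nest', 'NN'), 'case', ('into', 'IN')),
}

def split_elements(triples):
    """Collect heads, relation labels and dependents as separate set items."""
    elements = set()
    for head, relation, dependent in triples:
        elements.update((head, relation, dependent))
    return elements

def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

whole = jaccard_similarity(triples_a, triples_b)
split = jaccard_similarity(split_elements(triples_a), split_elements(triples_b))
print(whole)  # 0.2  (1 shared triple out of 5)
print(split)  # 5 shared elements out of 9
```

Once split, a relation label such as `nmod` or a `(word, tag)` pair such as `('flew', 'VBD')` can match on its own, even when the triple it came from does not.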


Lastly, each triple is flattened and its elements are added separately ...

In [42]:
def apply_dependency_triplets(sentence):
    result = set()
    parse = parser.raw_parse(sentence.lower())
    tree = next(parse)
    for t in tree.triples():
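        result.add(t[0][0])  # head word
        result.add(t[0][1])  # head POS tag
        result.add(t[1])     # relation label
        result.add(t[2][0])  # dependent word
        # t[2][1] (the dependent's POS tag) is not added in this run,
        # so the figures below do not include it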
    return list(result)

analyze_results(
    preprocessing(deepcopy(trial_df))
)

                                             sentence0  \
id1  [nmod, case, in, det, ., sink, NN, cop, is, pu...   
id2  [nmod, aux, ,, VB, punct, the, 2010, attempted...   
id3  [conj, ccomp, not, RB, suspect, punct, but, cc...   
id4  [nmod, case, in, det, groups, ., they, NN, fle...   
id5  [VBG, violin, aux, det, playing, NN, is, punct...   
id6  [nmod, riding, with, punct, amod, VBD, advmod,...   

                                             sentence1  
id1  [VBG, nmod, aux, case, itself, in, det, birdie...  
id2  [nmod, aux, ,, VB, nmod:tmod, punct, the, 2010...  
id3  [anymore, JJ, det, not, ``, neg, root, suspect...  
id4  [nmod, case, into, det, they, NN, flew, togeth...  
id5  [VBG, nmod, guitar, det, case, listening, xcom...  
id6  [nmod, VBP, VB, advcl, cop, punct, amod, sunri...  

Similarities:
        labels
id1  0.434783
id2  0.388889
id3  0.281250
id4  0.600000
id5  0.291667
id6  0.260000

Pearson correlation index: 0.3506001934267482
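
A fully flattened representation can be sketched as follows, assuming all five atomic components (head word, head tag, relation, dependent word, dependent tag) are wanted. The triples are hand-written for illustration, and `flatten` is a hypothetical helper, not the function used in the cell above.

```python
# Hand-written triples (illustrative, not real CoreNLP output).
triples_a = {
    (('playing', 'VBG'), 'nsubj', ('woman', 'NN')),
    (('woman', 'NN'), 'det', ('the', 'DT')),
    (('playing', 'VBG'), 'dobj', ('violin', 'NN')),
}
triples_b = {
    (('enjoys', 'VBZ'), 'nsubj', ('lady', 'NN')),
    (('lady', 'NN'), 'det', ('the', 'DT')),
    (('enjoys', 'VBZ'), 'xcomp', ('listening', 'VBG')),
}

def flatten(triples):
    """Reduce each triple to its five atomic strings."""
    atoms = set()
    for (head, head_tag), relation, (dep, dep_tag) in triples:
        atoms.update((head, head_tag, relation, dep, dep_tag))
    return atoms

atoms_a, atoms_b = flatten(triples_a), flatten(triples_b)
shared = atoms_a & atoms_b
print(sorted(shared))                       # ['DT', 'NN', 'VBG', 'det', 'nsubj', 'the']
print(len(shared) / len(atoms_a | atoms_b)) # 0.4
```

In this toy pair every shared atom is a tag, a relation label or a function word; none of the content words match, yet the similarity is still 0.4.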


## Conclusions
Final results, Pearson correlation index:
* Whole triples: -0.1879821089440828
* Triples split into 3 elements: 0.40556896256359354
* All elements of the triples split: 0.3506001934267482

Using whole triples does not work for comparing the sentences: the correlation is even negative. A whole triple matches only when the head, the relation and the dependent all coincide, so sentences that express the same idea with different words share almost no triples. The syntactic tree alone may or may not contain enough information to find the real correlation between two sentences. In the other two cases the index is still low, but it is better than the first. This happens because some words are repeated between correlated sentences, so they are probably given the same category in the parse tree, which makes the final Jaccard distance smaller.

In conclusion, this tool is not enough on its own; it can only serve as extra information alongside the other techniques, such as semantic analysis, that were previously used in the course.