# Mandatory - Session 6

In [1]:
import pandas as pd
import csv
from nltk.wsd import lesk
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
from nltk.metrics import jaccard_distance
from scipy.stats import pearsonr
from nltk.corpus import wordnet as wn
from nltk import pos_tag

First, load the datasets with pandas.

In [2]:
trial_path    = 'trial/STS.input.txt'
trial_gs_path = 'trial/STS.gs.txt'
trial_df      = pd.read_csv(trial_path, sep='\t', lineterminator='\n', names=['sentence0','sentence1'], header=None, quoting=csv.QUOTE_NONE)
trial_gs      = pd.read_csv(trial_gs_path, sep='\t', lineterminator='\n', names=['labels'], header=None, quoting=csv.QUOTE_NONE).iloc[::-1]
trial_df

Unnamed: 0,sentence0,sentence1
id1,The bird is bathing in the sink.,Birdie is washing itself in the water basin.\r
id2,"In May 2010, the troops attempted to invade Ka...",The US army invaded Kabul on May 7th last year...
id3,John said he is considered a witness but not a...,"""He is not a suspect anymore."" John said.\r"
id4,They flew out of the nest in groups.,They flew into the nest together.\r
id5,The woman is playing the violin.,The young lady enjoys listening to the guitar.\r
id6,John went horse back riding at dawn with a who...,Sunrise at dawn is a magnificent view to take ...


Next, define the functions that use the Lesk algorithm to infer the sense of each word from its context. If no sense is found for a word, the token is kept as is.

In [3]:
def apply_lesk_to_text(senten):
    ''' Convert a sentence to a list of tokens with meanings. '''
    tokens   = word_tokenize(senten)
    tagged   = pos_tag(tokens)
    semantic = []
    
    for idx, token in enumerate(tokens):             # For each word
        context = [i for i in tokens if i != token]  # Context for the word ({S} - {word})
        tag     = tagged[idx][1][0].lower()          # First letter of the POS tag, lowercased
        synset  = lesk(context, token.lower(), tag)  # Lesk algorithm => Synset, or None
        if synset:
            semantic.append(synset.name())  # A sense was found: keep its name, e.g. 'bird.n.02'
        else:
            semantic.append(token.lower())  # No sense was found: keep the token

    return semantic

def preprocessing(data):
    ''' Preprocess all sentences to infer their word senses. Generally, a more complete preprocessing function would be used. '''
    data = data.fillna('')
    for column in data.columns:
        # Lowercase the words
        data[column] = data[column].str.lower()
        # Disambiguate
        data[column] = data[column].apply(apply_lesk_to_text)
    return data
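
One caveat about the tag passed to `lesk` above: `pos_tag` returns Penn Treebank tags by default, and taking the lowercased first letter only matches a WordNet part of speech in some cases (`n`, `v`, `r`). Adjective tags (`JJ`, `JJR`, `JJS`) become `j`, which is not a WordNet POS code, so `lesk` finds no candidate synsets and adjectives stay as plain tokens. A minimal sketch of an explicit mapping follows. `penn_to_wordnet` is a hypothetical helper, not part of NLTK or of this notebook, and using it would change the results reported in this session.

```python
def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None if there is no match."""
    if penn_tag.startswith('J'):
        return 'a'   # adjective (JJ, JJR, JJS)
    if penn_tag.startswith('V'):
        return 'v'   # verb (VB, VBD, VBZ, ...)
    if penn_tag.startswith('N'):
        return 'n'   # noun (NN, NNS, NNP, ...)
    if penn_tag.startswith('RB'):
        return 'r'   # adverb (RB, RBR, RBS)
    return None      # determiners, prepositions, punctuation, ...

for tag in ['NN', 'VBZ', 'JJ', 'RB', 'DT']:
    print(tag, '->', penn_to_wordnet(tag))
```

Since `lesk` does not filter by part of speech when `pos` is `None`, one would probably skip the `lesk` call for unmapped tags in order to keep function words as plain tokens.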


Run the preprocessing...

In [4]:
trial_df = preprocessing(trial_df)
trial_df

Unnamed: 0,sentence0,sentence1
id1,"[the, bird.n.02, be.v.12, bathe.v.01, in, the,...","[shuttlecock.n.01, be.v.12, wash.v.09, itself,..."
id2,"[in, may, 2010, ,, the, troop.n.02, undertake....","[the, us, army, invaded, kabul.n.01, on, may, ..."
id3,"[whoremaster.n.01, suppose.v.01, he, embody.v....","[``, he, embody.v.02, not.r.01, a, defendant.n..."
id4,"[they, fly.v.12, out, of, the, nest, in, group...","[they, fly.v.10, into, the, nest, together.r.0..."
id5,"[the, woman.n.02, be.v.05, play.v.35, the, vio...","[the, young, lady.n.03, love.v.02, heed.v.01, ..."
id6,"[toilet.n.01, plump.v.04, knight.n.02, back.r....","[sunrise.n.03, at, dawn.n.03, be.v.12, a, magn..."


Now, calculate the similarity of each sentence pair as 1 minus the Jaccard distance...

In [5]:
def lexical_simmilarity(df):
    guess = pd.DataFrame()
    for i in df.index:
        guess.loc[i,'labels'] = 1 - jaccard_distance(set(df.loc[i,'sentence0']), set(df.loc[i,'sentence1']))
    return guess

guess_lex = lexical_simmilarity(trial_df)
guess_lex

Unnamed: 0,labels
id1,0.333333
id2,0.333333
id3,0.571429
id4,0.333333
id5,0.166667
id6,0.1
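
For reference, `1 - jaccard_distance(A, B)` is the size of the intersection divided by the size of the union. The sketch below reproduces that computation with plain Python sets. `jaccard_similarity` is an illustrative helper, and the two token lists are shortened, made-up examples rather than full rows of the dataset.

```python
def jaccard_similarity(a, b):
    """Return |A & B| / |A | B| for two iterables of tokens."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sentences count as dissimilar
    return len(a & b) / len(union)

# Illustrative tokens only: 2 shared tokens out of 5 distinct ones.
s0 = ['the', 'bird.n.02', 'be.v.12', 'bathe.v.01']
s1 = ['shuttlecock.n.01', 'be.v.12', 'the']
print(jaccard_similarity(s0, s1))
```

Only identical tokens count as shared, so `bird.n.02` and `shuttlecock.n.01` contribute nothing to the score.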


In [6]:
print(pearsonr(trial_gs['labels'], guess_lex['labels'])[0])

0.6206712050540985
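
The first element of the `pearsonr` result is the correlation coefficient, hence the `[0]`. As a reminder of what that number measures, the coefficient can be sketched in plain Python as below. `pearson` is an illustrative helper, not SciPy's implementation, and the input lists are toy data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov   = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear relationship
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly inverse relationship
```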


- In session 2, we compared the sentences practically as given and obtained a correlation of 0.51.
- In session 3, we lemmatized the sentences and obtained a correlation of 0.57.
- In this session, we performed word sense disambiguation and obtained a correlation of 0.62, an improvement over lemmatization.

As noted in previous sessions:

> This value is greater than 0.5, which means that there is some correlation between the two arrays, so this approach (Lesk + Jaccard) may be a reasonable way to measure semantic similarity between sentences. It is still not a very good value, though. Using hypernyms and hyponyms will probably improve the result.  

> The _poor_ performance may be a consequence of the definition of the Jaccard distance, which is based purely on set theory and does not take into account the semantic relationships between words.

These facts explain why the current results are better than those of the previous sessions: the senses returned by Lesk are tied to the meaning of the words rather than to their surface form, so the same sense will sometimes appear in both sentences (for example, `be.v.12` appears in both sentences of `id1`), which increases the similarity between them.
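
The limitation quoted above can be illustrated with a small sketch. Jaccard only counts exact matches, so two related but different senses contribute nothing; expanding each token with its hypernym lets them overlap. The `TOY_HYPERNYMS` table below is invented for illustration and is not taken from WordNet; a real version would look the hypernyms up through `wn`. The helper names are also hypothetical.

```python
# Hypothetical hypernym table, invented for this example (not real WordNet data).
TOY_HYPERNYMS = {
    'violin': 'instrument',
    'guitar': 'instrument',
    'woman': 'person',
    'lady': 'person',
}

def jaccard_similarity(a, b):
    """Return |A & B| / |A | B|, or 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def expand(tokens):
    """Add the toy hypernym of each token, when one is known."""
    return set(tokens) | {TOY_HYPERNYMS[t] for t in tokens if t in TOY_HYPERNYMS}

s0 = ['woman', 'play', 'violin']
s1 = ['lady', 'hear', 'guitar']

plain    = jaccard_similarity(s0, s1)                  # no token in common
expanded = jaccard_similarity(expand(s0), expand(s1))  # 'person' and 'instrument' now shared
print(plain, expanded)
```

Whether this helps on the real data is untested here. Expansion can also inflate the similarity of unrelated sentences that happen to share very general hypernyms.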

