# Es2 - Word Sense Disambiguation

In this exercise we extract 50 random sentences from the `SemCor` corpus and try to disambiguate one noun per sentence; the noun is also
chosen at random from the sentence.

1. Random extraction of the sentences from the `SemCor` corpus
   
2. Sentence cleaning:
   
   - Stop-word removal, punctuation removal and lemmatization
  
3. Extraction of a random noun from the sentence that is ambiguous, i.e. that has 
   at least 2 synsets associated with it in WordNet

4. Extraction of the noun's synsets from WordNet and of the correct synset annotated in `SemCor`
   
5. Construction of the `Bag of Words` for the sentence:
   
   - We build a set of the words related to the terms in the sentence; to do so we
      extract the definitions and examples of the WordNet synsets, given by `SemCor`, 
      associated with the terms of the sentence
      
6. We compare the `Bag of Words` of the sentence with the terms found in the definitions and examples
   of the synsets associated with the extracted noun and select the synset with the largest *overlap*,
   which should correspond to the correct synset
   
7. We measure the performance of the algorithm by comparing the synsets it selects with
   the correct ones given by `SemCor`
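
The steps above can be sketched end to end on toy data. The sense labels and glosses below are hypothetical, hand-written stand-ins for WordNet entries, and the small stop-word list is an assumption made only to keep the example self-contained; it illustrates the overlap idea and is not the notebook's implementation.

```python
# Toy sketch of the overlap-based disambiguation outlined above.
# TOY_SENSES and STOP_WORDS are hypothetical stand-ins for WordNet and NLTK data.
STOP_WORDS = {"the", "a", "of", "on", "to", "that", "in", "and", "or", "he", "his"}

TOY_SENSES = {
    "bank_river": "sloping land beside a body of water such as a river",
    "bank_money": "a financial institution that accepts deposits of money",
}


def toy_bag_of_words(text):
    """Lowercase, split on whitespace and drop stop words."""
    return {tok for tok in text.lower().split() if tok not in STOP_WORDS}


def toy_lesk(sentence, senses):
    """Return the sense label whose gloss shares the most words with the sentence."""
    context = toy_bag_of_words(sentence)
    best_label, best_overlap = None, -1
    for label, gloss in senses.items():
        overlap = len(context & toy_bag_of_words(gloss))
        if overlap > best_overlap:  # strict '>' keeps the earliest sense on ties
            best_label, best_overlap = label, overlap
    return best_label


chosen = toy_lesk("He sat on the bank of the river and watched the water", TOY_SENSES)
print(chosen)
```

Here the sentence shares `river` and `water` with the first gloss and nothing with the second, so the first sense is selected.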

## Data preparation

### Imports and dataset download

In [49]:
from nltk.corpus import semcor
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
import nltk
import random
from pprint import pprint

Resources to install

In [39]:
# nltk.download('semcor') # download the semcor corpus
# nltk.download('wordnet') # download wordnet
# nltk.download('omw-1.4') # download the open multilingual wordnet
# nltk.download('stopwords') # download the stopwords
# nltk.download('punkt') # download the punkt tokenizer

### Extracting sentences from the corpus

The plain sentences are extracted with `semcor.sents()`, and the sentences annotated with PoS tags and lemmas with `semcor.tagged_sents(tag="both")`

In [22]:
semc_sents = semcor.sents() # Extract the sentences from the semcor corpus

semc_sents_full = semcor.tagged_sents(tag="both") # Extract the sentences from the semcor corpus with annotations

### Helper functions for the `SemCor` corpus and the `nltk` `Tree` structure

In [8]:
def get_lemma(word):
    '''
    Args:
        word: term as an nltk.Tree
    Returns:
        lemma of the word as a string.
        If there isn't a lemma, return the PoS tag or None.
    '''
    return word.label()

def get_word(word):
    '''
    Args:
        word: term as an nltk.tree.tree.Tree
    Returns:
        the term as a list of strings.
        Return a list because a term may consist of several words
    '''
    if(isinstance(word, nltk.Tree)):
        return word.leaves()
    return None

def get_pos(word):
    '''
    Args:
        word: term as an nltk.Tree
    Returns:
        The PoS tag of a word or None if there is no PoS for the term 
        (ex. '!' has tag None)
    '''
    return word.pos()[0][1]

def get_synset(lemma):
    '''
    Args:
        lemma: Lemma of a word
    Returns:
        The synset associated to the lemma
    '''
    if(isinstance(lemma, nltk.corpus.reader.wordnet.Lemma)):
        return lemma.synset()
    return None

def get_sents(semcor):
    '''
    Args:
        semcor: Semcor corpus
    Return:
        a list of list of words. Each list of words is a sentence.
    '''
    return semcor.sents()

def get_term(lemma):
    '''
    Args:
        lemma: Lemma of a word
    Returns:
        The term associated to the lemma
    '''
    if(isinstance(lemma, nltk.corpus.reader.wordnet.Lemma)):
        return lemma.name()
    return None

def get_synsets(term):
    '''
    Args:
        term: a word as string
    Returns:
        The synsets associated to the word
    '''
    if(len(wn.synsets(term)) > 0):
        return wn.synsets(term)
    return None

### Selecting random sentences

We extract the sentences both as plain word lists and as `Tree` objects, so that we can also get the PoS tag and the 
lemma associated with a term.

In [25]:
def check_sent(sent):
    '''
    Args:
        sent: an annotated sentence extracted from the semcor corpus (semcor.tagged_sents(tag="both"))
    Returns:
        True if the sentence contains an ambiguous noun (NN), False otherwise.
        By ambiguous word we mean a word that has more than one synset in wordnet
    '''
    for el in sent:
        if(get_pos(el) == 'NN'):
            lemma = get_lemma(el)
            if(isinstance(lemma, nltk.corpus.reader.wordnet.Lemma)):
                term = get_term(lemma)
                syns = wn.synsets(term)
                if(len(syns) > 1):
                    return True
    return False

def pick_sents(s, sfull, num):
    '''
    Args:
        s: list of strings -> List of sentences
        sfull: annotated sentences extracted from the semcor corpus (semcor.tagged_sents(tag="both"))
        num: int -> number of sentences to pick
    Returns:
        'num' sentences, together with their annotated versions, each containing an ambiguous noun
    '''
    rand_sents, rand_num, rand_full_sent = [], [], []
    l = len(s) - 1
    n = random.randint(0, l)
    
    while (len(rand_sents) < num):
        while(n in rand_num):
            n = random.randint(0, l)
            
        rand_num.append(n)
        
        if(check_sent(sfull[n])):
            rand_sents.append(s[n])
            rand_full_sent.append(sfull[n])
    
    return rand_sents, rand_full_sent
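
An alternative to re-drawing indices until an unused one appears is to shuffle all indices once and walk through them. The sketch below is a variant under stated assumptions, not the function used in this notebook: `pick_indices` and the `is_valid` predicate are hypothetical names, and plain integers stand in for `SemCor` sentences.

```python
import random


def pick_indices(n_items, num, is_valid, seed=None):
    """Return up to `num` distinct indices in [0, n_items) that satisfy `is_valid`.

    Hypothetical helper: indices are visited in a random order, each exactly once,
    so the loop always terminates even if fewer than `num` items qualify.
    """
    rng = random.Random(seed)
    order = list(range(n_items))
    rng.shuffle(order)

    picked = []
    for idx in order:
        if is_valid(idx):
            picked.append(idx)
            if len(picked) == num:
                break
    return picked


# Toy usage: even numbers play the role of sentences with an ambiguous noun.
sample = pick_indices(100, 5, lambda i: i % 2 == 0, seed=0)
print(sample)
```

With the notebook's data, `is_valid` could plausibly be something like `lambda i: check_sent(semc_sents_full[i])`.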

I extract 50 random sentences from the `SemCor` corpus

I get a list of *semcor* objects -> `nltk.corpus.reader.semcor.SemcorSentence`

In [24]:
sent_list, sent_full_list = pick_sents(semc_sents, semc_sents_full, 50)

# print(len(sent_list) == len(sent_full_list)) # 50

### Implementation of the Lesk algorithm

In [35]:
def find_noun(sent):
    '''
    Args:
        sent: annotated sentences extracted from the semcor corpus (semcor.tagged_sents(tag="both"))
    Returns:
        A random ambiguous noun in the sentence
    '''
    noun_list = []
    for el in sent:
        if(get_pos(el) == 'NN'):
            lemma = get_lemma(el)
            if(isinstance(lemma, nltk.corpus.reader.wordnet.Lemma)):
                term = get_term(lemma)
                syns = wn.synsets(term)
                if(len(syns) > 1):
                    noun_list.append(el)
    return random.choice(noun_list)


def bag_of_word(sent):
    '''
    Transforms the given sentence according to the bag of words approach, apply lemmatization, 
    stop words and punctuation removal.
    Args:
        sent: sentences -> a list of words
    Returns:
        A list of words after the preprocessing (stop words and punctuation removal, lemmatization)
    '''
    stop_words = set(stopwords.words('english'))
    punctuation = {',', ';', '(', ')', '{', '}', ':', '?', '!', '.', '``', '*', '-'}
    
    # Returns the input word unchanged if it cannot be found in WordNet.
    wnl = nltk.WordNetLemmatizer()
    
    # Return a tokenized copy of text, using NLTK's recommended word tokenizer (Treebank + Punkt)
    tokens = nltk.word_tokenize(sent)
    # Lowercase before filtering so that capitalized stop words (e.g. "The") are removed too
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in stop_words and t not in punctuation]
    
    return [wnl.lemmatize(t) for t in tokens]


def get_context(sent):
    '''
    Join the words of the sentence in a single string and call bag_of_word method
    Args:
        sent: sentences -> a list of words
    Returns:
        The context of a sentences, a list of words after the preprocessing (stop words and punctuation 
        removal, lemmatization)
    '''
    merged_sent = ' '.join(sent)
    return bag_of_word(merged_sent)


def get_signature(syn):
    '''
    Args:
        syn: a synset of a word
    Returns:
        A list of words (bag of words) built from the gloss and the examples of the synset
    '''
    bof = bag_of_word(syn.definition())
    for el in syn.examples():
        bof.extend(bag_of_word(el))
    return bof


def get_overlap(s1, s2):
    '''
    Args:
        s1: list of words
        s2: list of words
    Returns:
        The number of words in s1 that are also in s2
    '''
    return len(set(s1).intersection(set(s2)))
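
One detail of this kind of preprocessing is worth illustrating: stop-word lists are typically lowercase, so the membership test is case-sensitive. A minimal sketch, using a hypothetical three-word stop list instead of NLTK's, of how the order of lowercasing and filtering changes the result:

```python
# Hypothetical miniature stop-word list (NLTK's English list is likewise lowercase).
TOY_STOP_WORDS = {"it", "is", "the"}

tokens = ["It", "is", "the", "Future"]

# Filtering first leaves the capitalized "It" in place, then lowercases it.
filter_then_lower = [t.lower() for t in tokens if t not in TOY_STOP_WORDS]

# Lowercasing first lets the filter catch every stop word.
lower_then_filter = [t for t in (tok.lower() for tok in tokens) if t not in TOY_STOP_WORDS]

print(filter_then_lower)  # ['it', 'future']
print(lower_then_filter)  # ['future']
```

A stop word that survives in both the context and a signature would count as a spurious overlap, which is why the second order is usually preferred.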

#### Lesk

1. `Context` - Set of the words in the sentence
2. `Signature` - Set of the words in the definition and examples of each synset of the term to disambiguate

In [66]:
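def lesk(word, sentence):
    '''
    Args:
        word: the term to disambiguate, as a string
        sentence: the sentence containing the term -> a list of words
    Returns:
        The synset of the word whose signature has the largest overlap with the context,
        or None if the word has no synsets in wordnet
    '''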
    max_overlap = -1
    best_synset = None
    context = get_context(sentence)
    word_synsets = wn.synsets(word)
    for syn in word_synsets:
        signature = get_signature(syn)
        overlap = get_overlap(signature, context)
        if(overlap > max_overlap):
            max_overlap = overlap
            best_synset = syn
    return best_synset
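
Because the comparison is strict (`>`), the first synset in the list is kept whenever several synsets reach the same overlap. A minimal sketch on hypothetical (label, overlap) pairs showing that behaviour, and that Python's built-in `max` with a `key` resolves ties the same way:

```python
# Hypothetical overlap scores for three candidate senses, in list order.
candidates = [("sense_1", 2), ("sense_2", 3), ("sense_3", 3)]


def first_best(scored):
    """Mirror the loop above: keep a candidate only if it strictly beats the current best."""
    best_label, best_score = None, -1
    for label, score in scored:
        if score > best_score:
            best_label, best_score = label, score
    return best_label


loop_choice = first_best(candidates)
max_choice = max(candidates, key=lambda pair: pair[1])[0]

print(loop_choice, max_choice)  # sense_2 sense_2
```

WordNet generally orders the senses of a word by estimated frequency, so within a part of speech a tie tends to fall back on a more common sense.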

### Execution and performance evaluation

In [76]:
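def es_2_iteration(n_sent):
    '''
    Args:
        n_sent: int -> number of random sentences to disambiguate
    Returns:
        The per-sentence results, the accuracy and the mean path similarity
    '''
    result = []
    correct = 0
    path_sim_mean = 0
    num_sent = n_sent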
    
    sent_list, sent_full_list = pick_sents(semc_sents, semc_sents_full, num_sent)

    for i in range(num_sent):
        target_word = find_noun(sent_full_list[i]) # find a random ambiguous noun in the sentence
        lemma = get_lemma(target_word) # and get its lemma
        target_word_str = get_term(lemma) # get the word as a string
        target_synset = lemma.synset() # and the right synset associated to the word
        sent = sent_list[i] # select the sentence
        
        lesk_syns = lesk(target_word_str, sent) # get the synset returned by the lesk algorithm
        
        path_sim_score = target_synset.path_similarity(lesk_syns) # compute the path similarity score
            
        path_sim_mean += path_sim_score
        
        if((type(lesk_syns) == type(target_synset)) and (lesk_syns == target_synset)):
            correct += 1
        
        # The same record is stored whether or not the prediction was correct
        partial_res = {
            "target_syn": target_synset,
            "lesk_syn": lesk_syns,
            "distance": path_sim_score # path similarity between the two synsets
        }
            
        result.append(partial_res)
    
    path_sim_mean = path_sim_mean / num_sent  
    
    return result, (correct / num_sent), path_sim_mean


# print("Correct: ", correct, "out of", num_sent, "sentences")
# print("Accuracy: ", correct / num_sent)
# print("Mean path similarity: ", path_sim_mean)
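
For reference, the two quantities returned by `es_2_iteration` can be written as follows, where $N$ is the number of sentences, $s_i$ the synset annotated in `SemCor`, $\hat{s}_i$ the synset chosen by Lesk, and $d$ the length of the shortest path between two synsets in the WordNet hypernym graph. The path-similarity formula is the one NLTK documents for `path_similarity`; the symbols are notation introduced here.

```latex
\mathrm{accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\,[\hat{s}_i = s_i],
\qquad
\mathrm{sim}_{\mathrm{path}}(s_i, \hat{s}_i) = \frac{1}{1 + d(s_i, \hat{s}_i)},
\qquad
\overline{\mathrm{sim}} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{sim}_{\mathrm{path}}(s_i, \hat{s}_i)
```

A correct prediction has $d = 0$ and therefore a similarity of 1, so the mean path similarity is never lower than the accuracy.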

In [81]:
iterations = 100
average_accuracy = 0
average_path_sim = 0
num_sent = 50

for k in range(iterations):
    result, accuracy, path_sim_mean = es_2_iteration(num_sent)
    average_accuracy += accuracy
    average_path_sim += path_sim_mean
    
average_path_sim = average_path_sim / iterations
average_accuracy = average_accuracy / iterations

print("Average accuracy: ", average_accuracy)
print("Average path similarity: ", average_path_sim)

Average accuracy:  0.48080000000000017
Average path similarity:  0.5384015556181453
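
Because each run samples different sentences, the accuracy varies from run to run, and the spread can be reported alongside the mean. A minimal sketch using the standard library; the five accuracy values are made-up placeholders, not results from this notebook.

```python
import statistics

# Hypothetical per-run accuracies (placeholders, not measured values).
run_accuracies = [0.44, 0.52, 0.48, 0.50, 0.46]

mean_acc = statistics.mean(run_accuracies)
std_acc = statistics.stdev(run_accuracies)  # sample standard deviation

print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

In the loop above, the per-run `accuracy` values could be appended to such a list instead of being summed.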


### Trial run on a single sentence

In [573]:
sent_list, sent_full_list = pick_sents(semc_sents, semc_sents_full, 50)

In [574]:
target_word = find_noun(sent_full_list[0])
lemma = get_lemma(target_word)
target_word_str = get_term(lemma)
target_synset = lemma.synset()

sent_0 = sent_list[0]

In [575]:
print("Target Word: " + target_word_str)
print("\nSentence: " + ' '.join(sent_0))

lesk_syns = lesk(target_word_str, sent_0)

print("\nResult: " + lesk_syns.name(), "-- Correct: " + target_synset.name())

Target Word: future

Sentence: It is on them alone that the future of their race depends , for all their relatives ( mothers , husbands , brothers , and unmated sisters ) have perished with the arrival of the cold weather .

Result: future.n.02 -- Correct: future.n.01
