# Ex2 - Word Sense Disambiguation

In this exercise we extract 50 random sentences from the `SemCor` corpus and try to disambiguate one noun per sentence; the noun is
also chosen at random from the sentence.

1. Random extraction of the sentences from the `SemCor` corpus
2. Sentence cleaning:
   1. Stop word and punctuation removal, lemmatization
3. Extraction of a random noun from the sentence
4. Extraction of the synsets of the noun **?? (question: do I extract only the synsets tagged as *NN*?)**
5. Construction of the `Bag of Words` for the sentence and for the noun

## Data preparation

### Imports and dataset download

In [537]:
from nltk.corpus import semcor
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
import nltk
import random
from pprint import pprint
from nltk.tree import *

# nltk.download('semcor') # download the semcor corpus

### Extracting sentences from the corpus

In [538]:
sents = semcor.sents()

sents_full = semcor.tagged_sents(tag="both")

print(len(sents) == len(sents_full)) # True

True


### Helper functions for the `SemCor` corpus and the `nltk` `Tree` structure

In [539]:
def get_lemma(word):
    '''
    Args:
        word: term as an nltk.Tree
    Returns:
        The label of the tree: a WordNet Lemma object when the term is
        sense-tagged and the sense resolves in WordNet; otherwise a string
        (e.g. the PoS tag) or None.
    '''
    return word.label()

def get_word(word):
    '''
    Args:
        word: term as an nltk.tree.tree.Tree
    Returns:
        The term as a list of strings (a term may consist of several words),
        or None if word is not a Tree.
    '''
    if(isinstance(word, nltk.Tree)):
        return word.leaves()
    return None

def get_pos(word):
    '''
    Args:
        word: term as an nltk.Tree
    Returns:
        The PoS tag of a word, or None if there is no PoS for the term
        (e.g. '!' has tag None)
    '''
    return word.pos()[0][1]

def get_synset(lemma):
    '''
    Args:
        lemma: Lemma of a word
    Returns:
        The synset associated to the lemma
    '''
    if(isinstance(lemma, nltk.corpus.reader.wordnet.Lemma)):
        return lemma.synset()
    return None

def get_sents(semcor):
    '''
    Args:
        semcor: Semcor corpus
    Return:
        a list of list of words. Each list of words is a sentence.
    '''
    return semcor.sents()

def get_term(lemma):
    '''
    Args:
        lemma: Lemma of a word
    Returns:
        The term associated to the lemma
    '''
    if(isinstance(lemma, nltk.corpus.reader.wordnet.Lemma)):
        return lemma.name()
    return None

def get_synsets(term):
    '''
    Return the synsets of a term, or None if it has none.
    '''
    synsets = wn.synsets(term)  # query WordNet once
    return synsets if synsets else None

In [33]:
lemma = "primary_election.n.01.primary"
wn.lemma(lemma).synset()

Synset('primary.n.01')

### Selecting the random sentences

We extract the sentences both as strings and as `Tree` objects, so that we can also obtain the PoS and the
lemma associated with each term.

In [540]:
def check_sent(sent):
    '''
        Check whether the sentence contains a noun (NN) that has a lemma and
        more than 1 synset, i.e. a term that is ambiguous
    '''
    for el in sent:
        if(get_pos(el) == 'NN'):
            lemma = get_lemma(el)
            if(isinstance(lemma, nltk.corpus.reader.wordnet.Lemma)):
                term = get_term(lemma)
                syns = wn.synsets(term)
                if(len(syns) > 1):
                    return True
    return False

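def pick_sents(s, sfull, num):
    '''
    Pick num random sentences that contain at least one ambiguous noun.
    Args:
        s: sentences as lists of words
        sfull: the same sentences as lists of nltk.Tree (PoS and lemma tags)
        num: number of sentences to pick
    Returns:
        Two parallel lists: the picked sentences as words and as Trees
    '''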
    rand_sents, rand_num, rand_full_sent = [], [], []
    l = len(s) - 1
    n = random.randint(0, l)
    
    while (len(rand_sents) < num):
        while(n in rand_num):
            n = random.randint(0, l)
            
        rand_num.append(n)
        
        if(check_sent(sfull[n])):
            rand_sents.append(s[n])
            rand_full_sent.append(sfull[n])
    
    return rand_sents, rand_full_sent

Extract 50 random sentences from the `SemCor` corpus.

The result is a list of *semcor* objects -> `nltk.corpus.reader.semcor.SemcorSentence`



In [541]:
sent_list, sent_full_list = pick_sents(sents, sents_full, 50)

# print(len(sent_list) == len(sent_full_list)) # True (both lists hold 50 sentences)
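
The rejection loop in `pick_sents` can also be written as a shuffle followed by a filter, which terminates even when fewer than `num` sentences qualify. The sketch below is only an illustration: `pick_indices` is a hypothetical helper, `is_ambiguous` stands in for `check_sent`, and the toy strings stand in for corpus sentences.

```python
import random

def pick_indices(n_items, num, keep, seed=None):
    """Return up to `num` distinct indices in [0, n_items) for which keep(i) is true.

    Hypothetical helper: `keep` plays the role of check_sent(sfull[i]).
    """
    rng = random.Random(seed)      # local RNG, so a seed gives reproducible picks
    order = list(range(n_items))
    rng.shuffle(order)             # random order, no index is visited twice
    picked = []
    for i in order:
        if keep(i):
            picked.append(i)
            if len(picked) == num:
                break
    return picked

# Toy stand-in for the corpus: only sentences containing "bank" count as ambiguous.
toy_sents = ["the bank closed", "hello there", "river bank erosion", "ok", "bank holiday"]

def is_ambiguous(i):
    return "bank" in toy_sents[i]

chosen = pick_indices(len(toy_sents), 2, is_ambiguous, seed=0)
print([toy_sents[i] for i in chosen])
```

Visiting a shuffled list of indices avoids the repeated membership test on `rand_num` and simply returns fewer items if the corpus runs out of qualifying sentences.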

### Implementing the Lesk algorithm

In [542]:
def find_noun(sent):
    '''
    Take a sentence and return a random ambiguous noun from it, as a Tree
    whose label carries the correct (gold) lemma and therefore its synset
    '''
    noun_list = []
    for el in sent:
        if(get_pos(el) == 'NN'):
            lemma = get_lemma(el)
            if(isinstance(lemma, nltk.corpus.reader.wordnet.Lemma)):
                term = get_term(lemma)
                syns = wn.synsets(term)
                if(len(syns) > 1):
                    noun_list.append(el)
    return random.choice(noun_list)


def bag_of_word(sent):
    '''
    Auxiliary function for the Lesk algorithm. Transforms the given sentence
    according to the bag of words approach, apply lemmatization, stop words
    and punctuation removal.
    Args:
        sent: sentence
    Returns:
        bag of words
    '''
    stop_words = set(stopwords.words('english'))
    punctuation = {',', ';', '(', ')', '{', '}', ':', '?', '!', '.', '``', '*', '-'}
    # The lemmatizer returns the input word unchanged if it cannot be found in WordNet.
    wnl = nltk.WordNetLemmatizer()
    # Tokenize with NLTK's recommended word tokenizer (Treebank + Punkt).
    # Lowercase before filtering, so that capitalized stop words ("It", "The") are removed too.
    tokens = [t.lower() for t in nltk.word_tokenize(sent)]
    tokens = [t for t in tokens if t not in stop_words and t not in punctuation]
    return [wnl.lemmatize(t) for t in tokens]


def get_context(sent):
    '''
    Auxiliary function for the Lesk algorithm. Builds the context from the given sentence.
    Args:
        sent: sentence as a list of words
    Returns:
        list of the words in the sentence after stop word and punctuation removal and lemmatization
    '''
    return bag_of_word(' '.join(sent))


def get_signature(syn):
    '''
    Args:
        syn: a synset of a word
    Returns:
        A list of words taken from the gloss (definition) and the examples of the synset
    '''
    bof = bag_of_word(syn.definition())
    for el in syn.examples():
        bof.extend(bag_of_word(el))
    return bof


def get_overlap(s1, s2):
    '''
    Args:
        s1: list of words
        s2: list of words
    Returns:
        The number of distinct words that appear in both s1 and s2
    '''
    return len(set(s1).intersection(set(s2)))
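
Because `get_overlap` converts both lists to sets, a repeated word counts only once. A small illustration with made-up token lists (not taken from SemCor or WordNet); the function is restated so that the snippet runs on its own:

```python
def get_overlap(s1, s2):
    """Number of distinct words shared by the two lists (same logic as above)."""
    return len(set(s1) & set(s2))

# Invented token lists: "race" and "weather" are repeated on purpose.
context = ["future", "race", "depends", "race", "cold", "weather"]
signature = ["time", "yet", "come", "race", "weather", "weather"]

shared = set(context) & set(signature)
print(sorted(shared))                    # ['race', 'weather']
print(get_overlap(context, signature))   # 2
```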

#### Lesk

1. `Context` - The set of words in the sentence
2. `Signature` - The set of words in the definition and examples of each synset of the term to disambiguate

In [543]:
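def lesk(word, sentence):
    '''
    Args:
        word: term to disambiguate
        sentence: sentence as a list of words
    Returns:
        The synset whose signature overlaps most with the context.
        Falls back to the first synset listed by WordNet when no signature
        overlaps, so the result is None only if the word has no synsets.
    '''
    max_overlap = 0
    context = get_context(sentence)
    word_synsets = wn.synsets(word)
    # Start from the first synset so that callers never get None on zero overlap
    best_synset = word_synsets[0] if word_synsets else None
    for syn in word_synsets:
        signature = get_signature(syn)
        overlap = get_overlap(signature, context)
        if(overlap > max_overlap):
            max_overlap = overlap
            best_synset = syn
    return best_synset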
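
To see how the overlap drives the choice, the sketch below runs the same kind of loop on hand-written signatures instead of WordNet. The sense names and glosses in `toy_senses` are invented for illustration, and falling back to the first listed sense when nothing overlaps is an assumption borrowed from the usual formulation of simplified Lesk.

```python
def toy_lesk(senses, context):
    """Pick the sense whose signature shares the most distinct words with context.

    senses: dict mapping a (hypothetical) sense name to its signature word list,
            ordered with the preferred default sense first.
    """
    context_set = set(context)
    best_sense = next(iter(senses), None)   # assumed fallback: first listed sense
    max_overlap = 0
    for name, signature in senses.items():
        overlap = len(context_set & set(signature))
        if overlap > max_overlap:           # strict '>' keeps the earlier sense on ties
            max_overlap, best_sense = overlap, name
    return best_sense

# Invented glosses, not WordNet data.
toy_senses = {
    "bank.financial": ["institution", "money", "deposit", "loan"],
    "bank.river": ["sloping", "land", "beside", "river", "water"],
}

print(toy_lesk(toy_senses, ["sat", "river", "watched", "water"]))  # bank.river
print(toy_lesk(toy_senses, ["nothing", "relevant"]))               # bank.financial
```

The second call shows the fallback: with no shared words, the first listed sense is returned rather than nothing.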

### Test run on a single sentence

In [573]:
sent_list, sent_full_list = pick_sents(sents, sents_full, 50)

In [574]:
target_word = find_noun(sent_full_list[0])
lemma = get_lemma(target_word)
target_word_str = get_term(lemma)
target_synset = lemma.synset()

sent_0 = sent_list[0]

In [575]:
print("Target Word: " + target_word_str)
print("\nSentence: " + ' '.join(sent_0))

lesk_syns = lesk(target_word_str, sent_0)

print("\nResult: " + lesk_syns.name(), "-- Correct: " + target_synset.name())

Target Word: future

Sentence: It is on them alone that the future of their race depends , for all their relatives ( mothers , husbands , brothers , and unmated sisters ) have perished with the arrival of the cold weather .

Result: future.n.02 -- Correct: future.n.01


### Test on 50 sentences