# Week 5 (Part 2): Dictionary Methods for WSD

We have seen that many words have many different senses.  In order to make the correct decision about the meaning of a sentence or a document, an application often needs to be able to **disambiguate** individual words, that is, choose the correct sense given the context.

In this lab we will be looking at methods for word sense disambiguation (WSD) that make use of dictionaries or other lexical resources (also referred to as **knowledge-based methods** for WSD).  In particular, we will look at
* simplified Lesk
* adapted Lesk
* minimising distance in a semantic hierarchy

As in the previous lab, we will be using WordNet as our lexical resource.  So, first, let's import it.

In [3]:
from nltk import word_tokenize
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic as wn_ic
from nltk.stem.wordnet import WordNetLemmatizer

import operator, sys
from Week4Labs.utils import filter_stopwords, normalise

sys.path.append(r'\\ad.susx.ac.uk\ITS\TeachingResources\Departments\Informatics\LanguageEngineering\resources')
sys.path.append(r'/Users/juliewe/Documents/teaching/NLE2018/resources')
from sussex_nltk.corpus_readers import AmazonReviewCorpusReader


Sussex NLTK root directory is \\ad.susx.ac.uk\ITS\TeachingResources\Departments\Informatics\LanguageEngineering\resources


In [4]:
#make sure that the path to your utils.py file is correct for your computer
sys.path.append('/Users/juliewe/Documents/teaching/NLE/NLE2019/w4/Week4Labs')

from utils import *

## Simplified Lesk

The Lesk algorithm is based on the intuition that the correct combination of senses in a sentence will have more words in common in their definitions than any incorrect combination.

It is computationally very expensive to compare all possible sense combinations of words in a sentence.  If each of the $n$ words in a sentence has just 2 senses, then there are $2^n$ possible sense combinations.
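To get a feel for how quickly this grows, here is a small sketch (not part of the lab code). It assumes Python 3.8+ for `math.prod`; the helper name `sense_combinations` and the example sense counts are made up for illustration.

```python
from math import prod

def sense_combinations(sense_counts):
    """Number of ways to pick one sense per word, given each word's sense count."""
    return prod(sense_counts)

# ten words with exactly 2 senses each: 2**10 combinations
print(sense_combinations([2] * 10))   # 1024

# senses per word are rarely equal; the total is the product of the counts
print(sense_combinations([18, 7]))    # 126
```

Even a short sentence therefore produces far more sense combinations than it has words, which is the motivation for treating each word separately.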

In the simplified Lesk algorithm, below, we consider each word in turn and choose the sense whose definition has the most **overlap** with the contextual words in the sentence.


In [5]:
def simplifiedLesk(word,sentence):
    '''
    Use the simplified Lesk algorithm to disambiguate word in the context of sentence
    word: a String which is the word to be disambiguated
    sentence: a String which is the sentence containing the word
    :return: a pair (chosen sense definition, overlap score)
    '''
    
    #construct the set of context word tokens for the sentence: all words in sentence - word itself
    contexttokens=set(word_tokenize(sentence))-{word}
    
    #get all the possible synsets for the word
    synsets=wn.synsets(word)
    scores=[]
    
    #iterate over synsets
    for synset in synsets:
        #get the set of tokens in the definition of the synset
        sensetokens=set(word_tokenize(synset.definition()))
        #find the size of the intersection of the sensetokens set with the contexttokens set
        scores.append((synset.definition(),len(sensetokens.intersection(contexttokens))))
    
    #sort the score list in descending order by the score (which is item with index 1 in the pair)
    sortedscores=sorted(scores,key=operator.itemgetter(1),reverse=True) 
    #print(sortedscores)
    return sortedscores[0]
    

In [27]:
word="best"
contexttokens=set(word_tokenize("what is the best thing in this world"))-{word}
synsets=wn.synsets(word)
scores=[]
for synset in synsets:
    #get the set of tokens in the definition of the synset
    sensetokens=set(word_tokenize(synset.definition()))
    #find the size of the intersection of the sensetokens set with the contexttokens set
    scores.append((synset.definition(),sensetokens.intersection(contexttokens)))

for i in scores:
    print(i)
    

('the supreme effort one can make', {'the'})
('the person who is most outstanding or excellent; someone who tops all others', {'is', 'the'})
('Canadian physiologist (born in the United States) who assisted F. G. Banting in research leading to the discovery of insulin (1899-1978)', {'in', 'the'})
('get the better of', {'the'})
("(superlative of `good') having the most positive qualities", {'the'})
("(comparative and superlative of `well') wiser or more advantageous and hence advisable", set())
('having desirable or positive qualities especially those suitable for a thing specified', {'thing'})
('having the normally expected amount', {'the'})
('morally admirable', set())
('deserving of esteem and respect', set())
('promoting or enhancing well-being', set())
('agreeable or pleasing', set())
('of moral excellence', set())
('having or showing knowledge and skill and aptitude', set())
('thorough', set())
('with or in a close or intimate relationship', {'in'})
('financially sound', set())
('m

Now let's test it on a couple of sentences containing the word *bank*

In [6]:
banksentences=["he borrowed money from the bank","he sat on the bank of the river and watched the currents"]
for sentence in banksentences:
    print(sentence,":",simplifiedLesk("bank",sentence))

he borrowed money from the bank : ('a financial institution that accepts deposits and channels the money into lending activities', 2)
he sat on the bank of the river and watched the currents : ('sloping land (especially the slope beside a body of water)', 2)


It actually appears not to do too badly.  However, this is more by luck than anything else.   If you inspect the sentences and the definitions, you will notice that most of the overlap is currently generated by stopwords.
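One way to check this is to split each overlap into stopwords and content words. The sketch below is not part of the lab code. It assumes plain whitespace splitting rather than `word_tokenize`, and `TOY_STOPWORDS` is a small hand-picked set rather than a real stopword list. The definitions are copied from the output above.

```python
# hand-picked stopwords for illustration only (an assumption, not a real list)
TOY_STOPWORDS = {"a", "and", "from", "he", "into", "of", "on", "that", "the"}

def split_overlap(sentence, definition, target):
    """Return (stopword overlap, content-word overlap) between context and definition."""
    context = set(sentence.lower().split()) - {target}
    gloss = set(definition.lower().split())
    overlap = context & gloss
    return overlap & TOY_STOPWORDS, overlap - TOY_STOPWORDS

examples = [
    ("he borrowed money from the bank",
     "a financial institution that accepts deposits and channels the money into lending activities"),
    ("he sat on the bank of the river and watched the currents",
     "sloping land (especially the slope beside a body of water)"),
]

for sentence, definition in examples:
    stop, content = split_overlap(sentence, definition, "bank")
    print(sorted(stop), sorted(content))
# ['the'] ['money']
# ['of', 'the'] []
```

Only *money* in the first sentence is a content-word match; everything else comes from function words.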

### Exercise 1.1
Improve the SimplifiedLesk algorithm by carrying out:
* case and number normalisation 
* stopword filtering
* lemmatisation

You should find some useful functions for doing this in `utils.py` based on earlier labs.

Make sure you test it.  Unfortunately, you should now find 0 overlap between any of the senses and the two bank sentences given.
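The helpers in `utils.py` are not reproduced in this notebook. As a rough idea of the kind of preprocessing meant here, the sketch below shows case and number normalisation followed by stopword filtering. The names `normalise_tokens` and `remove_stopwords` and the tiny stopword set are hypothetical stand-ins, not the lab's own `normalise` and `filter_stopwords`, which may behave differently. Lemmatisation is left out because it needs WordNet.

```python
# tiny illustrative stopword set (an assumption, not the list used in the lab)
TOY_STOPWORDS = {"a", "and", "from", "he", "of", "on", "the"}

def normalise_tokens(tokens):
    """Lower-case every token and replace purely numeric tokens with 'NUM'."""
    return ["NUM" if token.isdigit() else token.lower() for token in tokens]

def remove_stopwords(tokens, stopwords=TOY_STOPWORDS):
    """Drop stopwords and tokens that contain no letters or digits (punctuation)."""
    return [t for t in tokens
            if t not in stopwords and any(ch.isalnum() for ch in t)]

tokens = "He borrowed 100 pounds from the Bank".split()
print(remove_stopwords(normalise_tokens(tokens)))
# ['borrowed', 'NUM', 'pounds', 'bank']
```

Normalising before filtering matters: *He* only matches the lower-case stopword *he* once its case has been folded.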

In [29]:
def simplifiedLesk(word,sentence):
    '''
    Use the simplified Lesk algorithm to disambiguate word in the context of sentence
    word: a String which is the word to be disambiguated
    sentence: a String which is the sentence containing the word
    :return: a pair (chosen sense definition, overlap score)
    '''
    
    #construct the set of context word tokens for the sentence: all words in sentence - word itself
    lemma =WordNetLemmatizer()  
    contexttokens=set((filter_stopwords(normalise(word_tokenize(sentence)))))-{word}
    contextlemmas={lemma.lemmatize(contexttoken) for contexttoken in contexttokens}
    
    #get all the possible synsets for the word
    synsets=wn.synsets(word)
    scores=[]
    
    #iterate over synsets
    for synset in synsets:
        #get the set of tokens in the definition of the synset
        sensetokens=set(filter_stopwords(normalise(word_tokenize(synset.definition()))))
        senselemmas={lemma.lemmatize(token) for token in sensetokens}
        #find the size of the intersection of the senselemmas set with the contextlemmas set
        scores.append((synset.definition(),len(senselemmas.intersection(contextlemmas))))
    
    #sort the score list in descending order by the score (which is item with index 1 in the pair)
    sortedscores=sorted(scores,key=operator.itemgetter(1),reverse=True) 
    #print(sortedscores)
    return sortedscores[0]
    

In [30]:
banksentences=["he borrowed money from the bank","he sat on the bank of the river and watched the currents"]
for sentence in banksentences:
    print(sentence,":",simplifiedLesk("bank",sentence))

he borrowed money from the bank : ('a financial institution that accepts deposits and channels the money into lending activities', 1)
he sat on the bank of the river and watched the currents : ('sloping land (especially the slope beside a body of water)', 0)


In [28]:
wn.synsets("bank")

[Synset('bank.n.01'),
 Synset('depository_financial_institution.n.01'),
 Synset('bank.n.03'),
 Synset('bank.n.04'),
 Synset('bank.n.05'),
 Synset('bank.n.06'),
 Synset('bank.n.07'),
 Synset('savings_bank.n.02'),
 Synset('bank.n.09'),
 Synset('bank.n.10'),
 Synset('bank.v.01'),
 Synset('bank.v.02'),
 Synset('bank.v.03'),
 Synset('bank.v.04'),
 Synset('bank.v.05'),
 Synset('deposit.v.02'),
 Synset('bank.v.07'),
 Synset('trust.v.01')]

## Adapted Lesk
WordNet definitions are very short.  However, it is possible to create a bigger set of sense words by including information about the hypernyms and hyponyms of each sense.

### Exercise 2.1
Adapt the Lesk algorithm to include in `sensetokens`:
* all of the lemma_names for the sense itself
* all of the lemma_names for the hypernyms of the sense
* all of the lemma_names for the hyponyms of the sense
* all of the words from the definitions of the hypernyms of the sense
* all of the words from the definitions of the hyponyms of the sense

Make sure you carry out normalisation and lemmatisation of these words as before.

Test each adaptation you make on the bank sentences, recording the overlap observed with the chosen sense.

In [9]:
def adaptedLesk(word,sentence):
    '''
    Use the simplified Lesk algorithm to disambiguate word in the context of sentence, using standard WordNet adaptations
    word: a String which is the word to be disambiguated
    sentence: a String which is the sentence containing the word
    :return: a pair (chosen sense definition, overlap score)
    '''
    
    #construct the set of context word tokens for the sentence: all words in sentence - word itself
    
    lemma =WordNetLemmatizer()
    contexttokens=set((filter_stopwords(normalise(word_tokenize(sentence)))))-{word}
    contextlemmas={lemma.lemmatize(contexttoken) for contexttoken in contexttokens}
    #get all the possible synsets for the word
    synsets=wn.synsets(word)
    scores=[]
    
    #iterate over synsets
    for synset in synsets:
        #get the set of tokens in the definition of the synset
        sensetokens=word_tokenize(synset.definition())
        sensetokens+=synset.lemma_names()
        for hypernym in synset.hypernyms():
            sensetokens+=hypernym.lemma_names()
            sensetokens+=word_tokenize(hypernym.definition())
        for hyponym in synset.hyponyms():
            sensetokens+=hyponym.lemma_names()
            sensetokens+=word_tokenize(hyponym.definition())
        
        sensetokens=set(filter_stopwords(normalise(sensetokens)))
        senselemmas={lemma.lemmatize(token) for token in sensetokens}
        #find the size of the intersection of the senselemmas set with the contextlemmas set
        scores.append((synset.definition(),len(senselemmas.intersection(contextlemmas))))
    
    #sort the score list in descending order by the score (which is item with index 1 in the pair)
    sortedscores=sorted(scores,key=operator.itemgetter(1),reverse=True) 
    #print(sortedscores)
    return sortedscores[0]
    

In [10]:
banksentences=["he borrowed money from the bank","he sat on the bank of the river and watched the currents"]
for sentence in banksentences:
    print(sentence,":",adaptedLesk("bank",sentence))


he borrowed money from the bank : ('a financial institution that accepts deposits and channels the money into lending activities', 1)
he sat on the bank of the river and watched the currents : ('sloping land (especially the slope beside a body of water)', 1)


### Exercise 2.2
* From a sample of 1000 sentences from the dvd category of the Amazon review corpus (using the `sample_raw_sents()` method), find sentences which contain the lemma *film*. It will depend on the exact sample, but I would expect there to be somewhere between 50 and 100. 
* Use your AdaptedLesk algorithm to disambiguate them.  You may want to adapt it slightly so that it takes as input a list or a set of context lemmas rather than the sentence itself.  
* Record the number of instances of each sense of *film* predicted by this algorithm.
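For the last step, one option is to tally the chosen definitions with `collections.Counter`. The sketch below is only an illustration: `count_predicted_senses` is a hypothetical helper, and `fake_disambiguate` is a stand-in so that the example runs without WordNet. It assumes the disambiguator returns a `(definition, score)` pair, as the functions in this lab do.

```python
from collections import Counter

def count_predicted_senses(sentences, disambiguate):
    """Count how often each sense definition is chosen across the sentences."""
    counts = Counter()
    for sentence in sentences:
        definition, _score = disambiguate(sentence)
        counts[definition] += 1
    return counts

def fake_disambiguate(sentence):
    """Stand-in for a real WSD function; returns a (definition, score) pair."""
    if "roll" in sentence:
        return ("photographic material", 1)
    return ("a form of entertainment", 0)

sample = ["a roll of film", "the film was great", "I loved this film"]
for definition, n in count_predicted_senses(sample, fake_disambiguate).most_common():
    print(n, definition)
# 2 a form of entertainment
# 1 photographic material
```

With the real function you would pass something like `lambda s: adaptedLesk("film", s)` in place of `fake_disambiguate`.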

In [11]:
dvd_reader = AmazonReviewCorpusReader().category("dvd")
sentences=dvd_reader.sample_raw_sents(1000)

In [12]:
i=0
for sentence in sentences:
    if(i>=50 and i<=100):
        if("film" in sentence):
            print(sentence,":",adaptedLesk("film",sentence))
    i+=1

The film is highly , highly recommended . : ('a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement', 0)
This movie is definetly worth a rental and is a surprising novelty watching a slasher film not featuring a bunch of half-clad bimbos running around but a bunch of half-clad himbos running around : ('a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement', 2)
I 'd have to say this is one of the best animated films I 've ever seen . : ('a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement', 1)
Despite some claims on here this film did fine at the box office it made $ 32,377,000 domestically . : ('a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement', 1)
Given that the film was shot in Panavision 's 2.35 : 1 anamorphic pr

### Exercise 2.3
Inspect some of the individual predictions for your film sentences (at least one for each sense predicted).  Do you agree with the sense prediction?

In [13]:
film_sentences=["The music was specially composed for the film.","She develops her own film.","The crew has gone to Africa to film a wildlife documentary.","Her last movie was filmed in Spain","I didn’t get my film developed yet.","a roll of film","A film of oil glistened on the surface of the water."]
for film_sentence in film_sentences:
    print(film_sentence,":",adaptedLesk("film",film_sentence))
    

The music was specially composed for the film. : ('a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement', 0)
She develops her own film. : ('a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement', 0)
The crew has gone to Africa to film a wildlife documentary. : ('a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement', 1)
Her last movie was filmed in Spain : ('a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement', 1)
I didn’t get my film developed yet. : ('a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement', 0)
a roll of film : ('photographic material consisting of a base of celluloid covered with a photographic emulsion; used to make negatives or transp

## Minimising the Distance in the Semantic Hierarchy
This WSD method is based on the intuition that the concepts mentioned in a sentence will be close together in the hyponym hierarchy.

### Exercise 3.1
Write a function `max_sim(word, contextlemmas, pos)` which will choose the sense of a *word* given the lemmas of its context *sentence*, using a WordNet-based semantic similarity measure (see Lab_5_1).  You can assume that the part of speech of the word is known and is supplied to the function as another argument.

Within the function, 
1. For each **sense** of the word under consideration:
* compute its semantic similarity with each context **lemma** of the same part of speech.  For each context lemma you will need to consider each of its **senses** (and take the maximum similarity).  Therefore, you will need a triple nested loop! 
* sum the semantic similarities over the sentence
2. Choose the **sense** with the maximum sum.
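The loop structure in the steps above can be tried out on a toy hierarchy before moving to WordNet. Everything in the sketch below is invented for illustration: the `PARENT` tree, the `SENSES` inventory and the function names are not WordNet data or NLTK calls. Similarity is taken to be 1 / (1 + number of edges on the shortest path), which is the idea behind path similarity.

```python
# toy is-a hierarchy: child -> parent (None marks the root); purely illustrative
PARENT = {
    "entity": None,
    "organisation": "entity",
    "bank_institution": "organisation",
    "lender": "organisation",
    "geo_feature": "entity",
    "bank_slope": "geo_feature",
    "river": "geo_feature",
}
# toy sense inventory: word -> list of sense nodes
SENSES = {"bank": ["bank_institution", "bank_slope"],
          "lender": ["lender"], "river": ["river"]}

def ancestors(node):
    """Path from node up to the root, starting with node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def path_sim(a, b):
    """1 / (1 + edges on the shortest path between a and b in the tree)."""
    up_a, up_b = ancestors(a), ancestors(b)
    for dist_a, node in enumerate(up_a):
        if node in up_b:  # first shared ancestor is the lowest one
            return 1.0 / (1 + dist_a + up_b.index(node))
    return 0.0

def toy_max_sim(word, contextlemmas):
    best_sense, best_total = None, -1.0
    for sense in SENSES.get(word, []):             # loop 1: senses of the word
        total = 0.0
        for lemma in contextlemmas:                # loop 2: context lemmas
            best = 0.0
            for other in SENSES.get(lemma, []):    # loop 3: senses of the lemma
                best = max(best, path_sim(sense, other))
            total += best
        if total > best_total:
            best_sense, best_total = sense, total
    return best_sense, best_total

print(toy_max_sim("bank", ["river"])[0])   # bank_slope
print(toy_max_sim("bank", ["lender"])[0])  # bank_institution
```

Context lemmas with no entry in the inventory simply contribute 0 to the sum.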

Test your function on the bank sentences.  You should find, disappointingly for the method, that the first sentence has a maximum score of 2.71 with "an arrangement of similar objects in a row or in tiers" and the second sentence has a maximum score of 4.68 with "an arrangement of similar objects in a row or in tiers".

In [37]:
def max_sim(word,contextlemmas,pos=wn.NOUN):
    #brown_ic=wn_ic.ic("ic-brown.dat")
    synsets=wn.synsets(word,pos)
    scores=[]
    for synset in synsets:
        total=0
        for lemma in contextlemmas:
            sofar=0
            for synsetB in wn.synsets(lemma,pos):
                sim=wn.path_similarity(synset,synsetB)
                #path_similarity can return None when no path connects the two synsets
                if sim is not None and sim>sofar:
                    sofar=sim
            total+=sofar
        scores.append((synset.definition(),total))
    sortedscores=sorted(scores,key=operator.itemgetter(1),reverse=True)
    #print(sortedscores)
    return sortedscores[0]

In [42]:
banksentences=["he borrowed money from the bank","he sat on the bank of the river and watched the currents"]

sentence=filter_stopwords(normalise(word_tokenize(banksentences[0])))
print(banksentences[0],":",max_sim("bank",sentence))


he borrowed money from the bank : ('a supply or stock held in reserve for future use (especially in emergencies)', 1.482901085763742)


### Exercise 3.2
* Run your max_sim function on all of your film sentences and record the number of predictions for each sense.
* Inspect some of the individual predictions.
* Compare the results with those from the AdaptedLesk algorithm and draw some conclusions.

In [1]:
dvd_reader = AmazonReviewCorpusReader().category("dvd")
sentences=dvd_reader.sample_raw_sents(1000)
i=0
for sentence in sentences:
    if(i>=50 and i<=100):
        if("film" in sentence):
            print("adaptLesk:",adaptedLesk("film",sentence))
            print("lin_sim:",max_sim("film",filter_stopwords(normalise(word_tokenize(sentence)))))
            print(sentence)
    i+=1


NameError: name 'AmazonReviewCorpusReader' is not defined

In [48]:
film_sentences=["The music was specially composed for the film.","She develops her own film.","The crew has gone to Africa to film a wildlife documentary.","Her last movie was filmed in Spain","I didn’t get my film developed yet.","a roll of film","A film of oil glistened on the surface of the water."]
for film_sentence in film_sentences:
    print("adaptLesk:",adaptedLesk("film",film_sentence))
    print("lin_sim:",max_sim("film",filter_stopwords(normalise(word_tokenize(film_sentence)))))
    print(film_sentence)
    
    
    

adaptLesk: ('a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement', 0)
lin_sim: ('a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement', 1.2523973849914365)
The music was specially composed for the film.
adaptLesk: ('a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement', 0)
lin_sim: ('a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement', 1.0)
She develops her own film.
adaptLesk: ('a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement', 1)
lin_sim: ('a form of entertainment that enacts a story by sound and a sequence of images giving the illusion of continuous movement', 2.135407478380542)
The crew has gone to Africa to film a wildlife documentary.
ada

In [5]:
wn.synsets("film")

[Synset('movie.n.01'),
 Synset('film.n.02'),
 Synset('film.n.03'),
 Synset('film.n.04'),
 Synset('film.n.05'),
 Synset('film.v.01'),
 Synset('film.v.02')]