<div class="pull-right"><img src=KEY-logo.png></div>

## Knowledge representation and similarity 
### Grounding (Word-Sense Disambiguation) to WordNet

CSI4106 Artificial Intelligence  
Fall 2018  
Caroline Barrière

***

In this notebook, you will first explore WordNet, a lexical semantic network in which knowledge is organized into interrelated synsets (groups of synonyms). Second, you will attempt Word-Sense Disambiguation (WSD) using a simple Lesk-like algorithm that compares BOWs (bags of words).

This notebook uses NLTK, the same package as in the last notebook. We will also reuse some knowledge from that notebook (tokenization, lemmatization, POS tagging), so make sure to complete the NLP Pipeline notebook before this one.

*As you now have more experience, this notebook requires that you write more code by yourself than the previous ones.*

***HOMEWORK***:  
Go through the notebook by running each cell, one at a time. Look for the (**TO-DO**) markers, which flag the tasks that you need to perform.  
Make sure you *sign* (type your name) the notebook at the end. Once you're done, submit your notebook.

***

In [1]:
# let's import nltk, and wordnet

import nltk
from nltk.corpus import wordnet

**1. Exploring Wordnet**  

Let's first explore the WordNet interface within NLTK.  
You can also look at the [WordNet interface description](http://www.nltk.org/howto/wordnet.html).

In [2]:
# a synset is a concept associated with a set of synonyms

paperSenses = wordnet.synsets('paper')
print(paperSenses)

[Synset('paper.n.01'), Synset('composition.n.08'), Synset('newspaper.n.01'), Synset('paper.n.04'), Synset('paper.n.05'), Synset('newspaper.n.02'), Synset('newspaper.n.03'), Synset('paper.v.01'), Synset('wallpaper.v.01')]


This shows that there are 9 senses of paper, 7 nouns and 2 verbs.  The word displayed is the most representative word for each sense.  

You can try other words.  I recommend that you also perform the same search [online](http://wordnetweb.princeton.edu/perl/webwn) to better understand the results.
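
The split between noun and verb senses can be read off the synset names, whose middle field is the part-of-speech letter. Here is a minimal sketch that assumes only the names printed above; `sense_names` is a hand-copied list for illustration. With NLTK loaded, calling `s.pos()` on each synset should give the same letter.

```python
from collections import Counter

# Synset names copied from the output of wordnet.synsets('paper') above
sense_names = [
    "paper.n.01", "composition.n.08", "newspaper.n.01", "paper.n.04",
    "paper.n.05", "newspaper.n.02", "newspaper.n.03",
    "paper.v.01", "wallpaper.v.01",
]

# A synset name has the form <lemma>.<pos>.<sense number>;
# rsplit guards against lemmas that themselves contain a dot
pos_counts = Counter(name.rsplit(".", 2)[1] for name in sense_names)

print(len(sense_names))   # total number of senses
print(dict(pos_counts))   # number of senses per part-of-speech letter
```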

Let's look at the basic information in each synset.        

In [3]:
# We define a function to print the basic information

def printBasicSynsetInfo(d):
    print("SynLemmas")
    print(d.lemmas())
    print("Synonyms")
    synonyms = [l.name() for l in d.lemmas()]
    print(synonyms)
    print("Definition")
    print(d.definition())

In [4]:
# We can print the information for each sense of "paper"

for i, sense in enumerate(paperSenses):
    print("[Sense " + str(i) + "]")
    printBasicSynsetInfo(sense)
    print()

[Sense 0]
SynLemmas
[Lemma('paper.n.01.paper')]
Synonyms
['paper']
Definition
a material made of cellulose pulp derived mainly from wood or rags or certain grasses

[Sense 1]
SynLemmas
[Lemma('composition.n.08.composition'), Lemma('composition.n.08.paper'), Lemma('composition.n.08.report'), Lemma('composition.n.08.theme')]
Synonyms
['composition', 'paper', 'report', 'theme']
Definition
an essay (especially one written as an assignment)

[Sense 2]
SynLemmas
[Lemma('newspaper.n.01.newspaper'), Lemma('newspaper.n.01.paper')]
Synonyms
['newspaper', 'paper']
Definition
a daily or weekly publication on folded sheets; contains news and articles and advertisements

[Sense 3]
SynLemmas
[Lemma('paper.n.04.paper')]
Synonyms
['paper']
Definition
a medium for written communication

[Sense 4]
SynLemmas
[Lemma('paper.n.05.paper')]
Synonyms
['paper']
Definition
a scholarly article describing the results of observations or stating hypotheses

[Sense 5]
SynLemmas
[Lemma('newspaper.n.02.newspaper'), Lemm

WordNet contains a manually developed taxonomy, which makes it a rich resource.  

**(TO-DO : Q1)** Choose two words, and write code to print the taxonomic information for all senses of those words.

In [5]:
# We define a function that receives a synset and prints its taxonomic information

def printTaxonomyInfo(d):
    synonyms = [l.name() for l in d.lemmas()]
    print(synonyms)
    print("Hypernyms:")
    print(d.hypernyms())
    print("Hyponyms:")
    print(d.hyponyms())

In [6]:
# Q1 - ANSWER
# We can print the taxonomy information for each sense of a word X

for i, sense in enumerate(paperSenses):
    print("[Taxonomy " + str(i) + "]")
    printTaxonomyInfo(sense)
    print()


[Taxonomy 0]
['paper']
Hypernyms:
[Synset('material.n.01')]
Hyponyms:
[Synset('art_paper.n.01'), Synset('blotting_paper.n.01'), Synset('blueprint_paper.n.01'), Synset('carbon_paper.n.01'), Synset('card.n.01'), Synset('cardboard.n.01'), Synset('cartridge_paper.n.02'), Synset('chad.n.01'), Synset('computer_paper.n.01'), Synset('confetti.n.01'), Synset('construction_paper.n.01'), Synset('crepe.n.01'), Synset('drawing_paper.n.01'), Synset('filter_paper.n.01'), Synset('flypaper.n.01'), Synset('graph_paper.n.01'), Synset('greaseproof_paper.n.01'), Synset('india_paper.n.01'), Synset('linen.n.02'), Synset('litmus_paper.n.01'), Synset('manifold_paper.n.01'), Synset('manila.n.01'), Synset('music_paper.n.01'), Synset('newspaper.n.04'), Synset('oilpaper.n.01'), Synset('pad.n.01'), Synset('paper_tape.n.01'), Synset('paper_toweling.n.01'), Synset('papier-mache.n.01'), Synset('papyrus.n.01'), Synset('parchment.n.01'), Synset('rice_paper.n.01'), Synset('roofing_paper.n.01'), Synset('sheet.n.02'), Syns
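
To see how hypernym and hyponym links form a hierarchy, here is a small sketch that works on a toy child-to-parent map. The map is built only from a few of the relations printed in the Taxonomy 0 output, and `hypernym_of`, `hypernym_chain` and `hyponyms_of` are hypothetical names. It also assumes a single parent per node, which real WordNet synsets do not guarantee.

```python
# Toy child -> parent map built only from relations printed above
# (a tiny, incomplete slice of WordNet, for illustration)
hypernym_of = {
    "paper.n.01": "material.n.01",
    "art_paper.n.01": "paper.n.01",
    "blotting_paper.n.01": "paper.n.01",
    "cardboard.n.01": "paper.n.01",
}

def hypernym_chain(name):
    """Follow parent links upward until no parent is recorded."""
    chain = [name]
    while chain[-1] in hypernym_of:
        chain.append(hypernym_of[chain[-1]])
    return chain

def hyponyms_of(name):
    """Invert the map to list the direct children of a node."""
    return sorted(child for child, parent in hypernym_of.items() if parent == name)

print(hypernym_chain("art_paper.n.01"))
print(hyponyms_of("paper.n.01"))
```

The chain stops at `material.n.01` only because the toy map records nothing above it.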

**2. Word-Sense Disambiguation.**  

Let's now implement a simple modified Lesk algorithm for WSD.  
The idea is to compare the sentence containing the ambiguous word W to all the definitions of W and choose the most similar.

(Step 1) Create a BOW (bag of words) for each definition.

In [7]:
# we will need the tokenizer

from nltk import word_tokenize

In [8]:
# define a small method to return the set of words found in a text
# optionally, one word can be excluded from the set

def bow(text, excluded=None):
    text = text.replace("_", " ") # the compound nouns in wordnet text have _
    tokens = word_tokenize(text)
    setTokens = set(tokens)
    if excluded is not None:
        setTokens.discard(excluded)  # no error if the word is absent
    return setTokens

In [9]:
# testing 
print(bow("There is a lot of food on the table", excluded='table'))
print(bow("He wrote an excellent conference paper referred by many researchers", excluded='paper'))

{'the', 'food', 'lot', 'There', 'on', 'of', 'a', 'is'}
{'conference', 'He', 'referred', 'an', 'wrote', 'by', 'many', 'excellent', 'researchers'}


In [10]:
# make BOWs for all the senses in a received word
# exclude from the BOW, the word being defined

def makeDefBOWs(testWord):
    synsets = wordnet.synsets(testWord)
    defs = [s.definition() for s in synsets]
    bows = [bow(d, excluded=testWord) for d in defs]
    return bows

In [11]:
# try with different words, look at the resulting info

testWord = "cell" # bank, course, paper, ...
defBOWs = makeDefBOWs(testWord)
    
print(*defBOWs, sep="\n")  # to print a list on separate lines

{'compartment', 'any', 'small'}
{';', 'the', 'may', 'units', 'tissues', 'they', 'plants', 'as', 'or', 'animals', 'life', 'in', 'basic', 'functional', '(', 'exist', 'organisms', 'biology', 'and', 'independent', 'monads', 'form', 'higher', 'of', 'all', 'structural', ')', 'unit', 'colonies'}
{'result', 'that', 'as', 'reaction', 'an', 'electric', 'the', 'device', 'delivers', 'chemical', 'of', 'a', 'current'}
{'as', 'or', 'larger', 'the', 'political', 'small', 'serving', 'part', 'nucleus', 'unit', 'of', 'movement', 'a'}
{'an', 'short-range', 'sections', 'hand-held', 'a', 'mobile', 'in', 'transmitter/receiver', 'with', ',', 'its', 'each', 'small', 'divided', 'radiotelephone', 'use', 'for', 'own', 'into', 'area'}
{'room', 'or', 'nun', 'which', 'in', 'small', 'lives', 'monk', 'a'}
{'where', 'room', 'prisoner', 'kept', 'a', 'is'}


(Step 2) Create a method to compare BOWs

In [12]:
# We're interested in the size of the intersection between the BOWs
# If you wish to see the words in common to understand the results, uncomment the prints

def bowOverlap(bow1, bow2):
    #print(bow1)
    #print(bow2)
    #print(bow1.intersection(bow2))
    #print(len(bow1.intersection(bow2)))
    return len(bow1.intersection(bow2))

**(TO-DO: Q2)** Implement Step 3 of the algorithm. Step 3 consists of comparing the BOW of a test sentence (our context C) containing an ambiguous word (X) to the BOWs of all the senses of X. To do Step 3, complete the method below, which receives a word X as well as the text C in which X occurs. The method should return the synsets whose definition BOWs have the largest overlap with the BOW of C. Note that there could be more than one maximum, so your method should return all synsets with maximum intersection.

In [13]:
# Q2 - ANSWER

# method receives a word and its context
# NOTE: as written, it collects the synsets of the words shared between the
# context and the best-overlapping definition(s), not the senses of `word`

def findMostProbableSense(word, context):
    bows = makeDefBOWs(word)
    textBOW = bow(context)
    # find senses with max overlap
    maxBagSyns = []  # same name as the list that is returned
    maxOver = 0
    for bag in bows:
        overlap = bowOverlap(textBOW, bag)
        common = textBOW.intersection(bag)
        if overlap > maxOver:
            # strictly larger overlap: restart the result list
            maxBagSyns = [wordnet.synsets(w) for w in common]
            maxOver = overlap
        elif overlap == maxOver:
            # tie with the current maximum: extend the result list
            maxBagSyns.extend(wordnet.synsets(w) for w in common)
    return maxBagSyns
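
For comparison, here is a minimal sketch of Step 3 that returns the positions of the senses themselves. It is one possible reading of the task, not a reference solution. It assumes the definition BOWs for "cell" exactly as printed by `makeDefBOWs` earlier, and it uses a regex-based stand-in for `word_tokenize`. The names `cell_def_bows`, `tokenize` and `best_senses` are hypothetical.

```python
import re

# Definition BOWs for "cell", copied from the makeDefBOWs output (one set per sense)
cell_def_bows = [
    {'compartment', 'any', 'small'},
    {';', 'the', 'may', 'units', 'tissues', 'they', 'plants', 'as', 'or', 'animals', 'life', 'in', 'basic', 'functional', '(', 'exist', 'organisms', 'biology', 'and', 'independent', 'monads', 'form', 'higher', 'of', 'all', 'structural', ')', 'unit', 'colonies'},
    {'result', 'that', 'as', 'reaction', 'an', 'electric', 'the', 'device', 'delivers', 'chemical', 'of', 'a', 'current'},
    {'as', 'or', 'larger', 'the', 'political', 'small', 'serving', 'part', 'nucleus', 'unit', 'of', 'movement', 'a'},
    {'an', 'short-range', 'sections', 'hand-held', 'a', 'mobile', 'in', 'transmitter/receiver', 'with', ',', 'its', 'each', 'small', 'divided', 'radiotelephone', 'use', 'for', 'own', 'into', 'area'},
    {'room', 'or', 'nun', 'which', 'in', 'small', 'lives', 'monk', 'a'},
    {'where', 'room', 'prisoner', 'kept', 'a', 'is'},
]

def tokenize(text):
    # Stand-in for nltk.word_tokenize (assumption): runs of word characters, '/' and '-'
    return re.findall(r"[\w/-]+", text)

def best_senses(def_bows, context):
    """Return (indices of all senses with maximal overlap, size of that overlap)."""
    context_bow = set(tokenize(context))
    overlaps = [len(context_bow & bag) for bag in def_bows]
    best = max(overlaps, default=0)
    if best == 0:
        return [], 0  # design choice: no shared word means no sense is chosen
    return [i for i, n in enumerate(overlaps) if n == best], best

print(best_senses(cell_def_bows, "He lived in this prison cell for many years."))  # ([4], 2)
print(best_senses(cell_def_bows, "any unit"))                                      # ([0, 1, 3], 1)
```

Since `makeDefBOWs` keeps the order of `wordnet.synsets(word)`, index `i` should correspond to `wordnet.synsets('cell')[i]` when NLTK is loaded. For the prison sentence, the two shared words are 'in' and 'for', which selects the radiotelephone definition.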
    
    

##### Your method should return the chosen senses for the example below.  We will test your method using the following code.

In [14]:
# Show the BOWs of the senses with the overlap, and the chosen sense(s)
# You can try with various words and sentences

testWord = "cell"
testSentence = "He lived in this prison cell for many years."


####  CALL TO YOUR METHOD RECEIVING THE WORD AND ITS CONTEXT
chosenSynsets = findMostProbableSense(testWord, testSentence)  

# print all the definitions of the most probable senses
for s in chosenSynsets:
    for t in s:
        printBasicSynsetInfo(t)

SynLemmas
[Lemma('inch.n.01.inch'), Lemma('inch.n.01.in')]
Synonyms
['inch', 'in']
Definition
a unit of length equal to one twelfth of a foot
SynLemmas
[Lemma('indium.n.01.indium'), Lemma('indium.n.01.In'), Lemma('indium.n.01.atomic_number_49')]
Synonyms
['indium', 'In', 'atomic_number_49']
Definition
a rare soft silvery metallic element; occurs in small quantities in sphalerite
SynLemmas
[Lemma('indiana.n.01.Indiana'), Lemma('indiana.n.01.Hoosier_State'), Lemma('indiana.n.01.IN')]
Synonyms
['Indiana', 'Hoosier_State', 'IN']
Definition
a state in midwestern United States
SynLemmas
[Lemma('in.s.01.in')]
Synonyms
['in']
Definition
holding office
SynLemmas
[Lemma('in.s.02.in')]
Synonyms
['in']
Definition
directed or bound inward
SynLemmas
[Lemma('in.s.03.in')]
Synonyms
['in']
Definition
currently fashionable
SynLemmas
[Lemma('in.r.01.in'), Lemma('in.r.01.inwards'), Lemma('in.r.01.inward')]
Synonyms
['in', 'inwards', 'inward']
Definition
to or toward the inside of


**(TO-DO: Q3)** What do you notice? With the example above for "cell", what are the words making the BOWs look similar?  Are these significant words?

*Q3-ANSWER*  

The words in common are 'in' and 'for'.  
These are fairly significant words because they indicate that 'cell' is a place rather than an object. If it were interpreted as an object, these words would likely not be used.


**(TO-DO: Q4)  Refining our BOWs**

**Exploring variations:**
1. What if you lowercase everything?
2. What if you apply lemmatisation on all words in the BOWs?
3. What if you focus on only the NOUNS in the BOWs?

(hint) Go back to your NLP Pipeline notebook: for question (2), use the lemmatizer, and for question (3), perform POS tagging on the sentences. 

For your answer (code to write):  

a) First complete the BOW method below in which I've added parameters to possibly activate the lowercase, the lemmatization and the POS tagging.   
b) Add a few tests to see if your BOW works.  


In [15]:
# Q4 - ANSWER - part a)

# The parameters possibly ACTIVATE lowercase, lemmatization, and keeping only Nouns in BOWs.

# nltk contains a method to obtain the part-of-speech of each token
# Download the tagger model that nltk.pos_tag relies on
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')
wnl = WordNetLemmatizer()

# map a Penn Treebank tag to the matching WordNet part-of-speech constant
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    else:
        return wordnet.ADV  # also the default: for ADV the lemmatizer doesn't change anything

# refine the method with parameters
def bow(text, excluded=None, lowercase=False, lemmatize=False, nounsOnly=False):
    text = text.replace("_", " ") # the compound nouns in wordnet text have _
    tokens = word_tokenize(text)
    setTokens = set(tokens)
    if excluded is not None:
        setTokens.discard(excluded)

    if lowercase:
        # NOTE: rebinding the loop variable leaves setTokens unchanged
        for token in setTokens:
            token = token.lower()

    if lemmatize:
        # NOTE: the result is rebuilt as a list from the full token list,
        # so the excluded word comes back
        setTokens = [wnl.lemmatize(t) for t in tokens]

    if nounsOnly:
        posTokens = nltk.pos_tag(setTokens)
        for p in posTokens:
            # drop every token that is not tagged as a noun
            if get_wordnet_pos(p[1]) != wordnet.NOUN:
                setTokens.remove(p[0])

    return setTokens




[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/reynadoerwald/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [16]:
# Q4 - ANSWER - part b)

# TEST YOUR METHOD 
print(bow("There is a lot of food on the table", excluded='table', lowercase=True, lemmatize=True, nounsOnly=True))
# Your example 1
print(bow("This is the final notebook for CSI4106", excluded='notebook', lowercase=False, lemmatize=True, nounsOnly=False))
# Your example 2
print(bow("I really want Swiss Chalet right now", excluded='right', lowercase=True, lemmatize=False, nounsOnly=True))

['lot', 'food', 'table']
['This', 'is', 'the', 'final', 'notebook', 'for', 'CSI4106']
{'want', 'Chalet'}
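
One way to combine the three options is to let every step rewrite the same token list, and to apply the exclusion last. The sketch below is only an illustration of that ordering. It uses a regex stand-in for `word_tokenize`, and the lemmatizer and noun filter are passed in as functions (`lemmatizer`, `noun_filter`, `toy_lemmatizer` and `toy_nouns` are hypothetical names) so that it runs without NLTK data. In the notebook, you might pass `wnl.lemmatize` and a filter built on `nltk.pos_tag` instead.

```python
import re

def refined_bow(text, excluded=None, lowercase=False, lemmatizer=None, noun_filter=None):
    """Build a BOW in which every option rewrites the same token list."""
    # Stand-in tokenizer (assumption); the notebook itself uses nltk.word_tokenize
    tokens = re.findall(r"[\w/-]+", text.replace("_", " "))
    if noun_filter is not None:
        # Filter first, while the tokens are still in sentence order
        tokens = noun_filter(tokens)
    if lowercase:
        tokens = [t.lower() for t in tokens]  # strings are immutable: build a new list
    if lemmatizer is not None:
        tokens = [lemmatizer(t) for t in tokens]
    result = set(tokens)
    if excluded is not None:
        # Exclude last, so that no earlier step can bring the word back
        result.discard(excluded.lower() if lowercase else excluded)
    return result

# Hypothetical stand-ins so that the sketch runs without NLTK data
def toy_lemmatizer(token):
    return token[:-1] if len(token) > 3 and token.endswith("s") else token

toy_nouns = {"lot", "food", "Tables"}

sentence = "There is a lot of food on the Tables"
all_words = refined_bow(sentence, excluded="table", lowercase=True, lemmatizer=toy_lemmatizer)
nouns_only = refined_bow(sentence, excluded="table", lowercase=True, lemmatizer=toy_lemmatizer,
                         noun_filter=lambda toks: [t for t in toks if t in toy_nouns])

print(sorted(all_words))
print(sorted(nouns_only))
```

With these toy helpers, "Tables" is lowercased, reduced to "table", and then excluded, so it appears in neither result.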


**(TO-DO: Q5)** TESTING BOW VARIATIONS IN LESK-LIKE DISAMBIGUATION

a) Redo the methods makeDefBOWs and findMostProbableSense to use the new parameters.  

b) Generate three example cases and test your disambiguation strategy programmed above.  An example case contains an ambiguous word (e.g. bank) and a sentence in which that word must be disambiguated (e.g. He sat on the bank throwing rocks in the water.).  

c) For your examples, which filtering seems to work better (with/without lemmatization, with/without focus only on nouns)?


In [17]:
# Q5 - ANSWER - part a)

# add the parameters to makeDefBOWs as well, same defaults
def makeDefBOWs(testWord, lowercase=False, lemmatize=False, nounsOnly=False):
    synsets = wordnet.synsets(testWord)
    defs = [s.definition() for s in synsets]
    bows = [bow(d, excluded=testWord, lowercase=lowercase, lemmatize=lemmatize, nounsOnly=nounsOnly) for d in defs]
    return bows


# also add the parameters here
# as in Q2, this collects the synsets of the words shared with the best definition(s)
def findMostProbableSense(word, text, lowercase=False, lemmatize=False, nounsOnly=False):
    bows = makeDefBOWs(word, lowercase=lowercase, lemmatize=lemmatize, nounsOnly=nounsOnly)
    textBOW = bow(text)  # NOTE: the context BOW is built with the default options
    # find senses with max overlap
    maxBagSyns = []
    maxOver = 0
    for bag in bows:
        overlap = bowOverlap(textBOW, bag)
        common = textBOW.intersection(bag)
        if overlap > maxOver:
            # strictly larger overlap: restart the result list
            maxBagSyns = [wordnet.synsets(w) for w in common]
            maxOver = overlap
        elif overlap == maxOver:
            # tie with the current maximum: extend the result list
            maxBagSyns.extend(wordnet.synsets(w) for w in common)
    return maxBagSyns
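
In `findMostProbableSense`, the options are passed to `makeDefBOWs`, while the context BOW comes from `bow(text)` with its defaults. The small sketch below suggests why both sides may need the same normalisation. It reuses the last "cell" definition BOW printed earlier, a made-up headline-style sentence, and a stand-in tokenizer; `simple_bow` and `prison_def_bow` are hypothetical names.

```python
import re

# BOW of the "room where a prisoner is kept" definition, as printed earlier
prison_def_bow = {'where', 'room', 'prisoner', 'kept', 'a', 'is'}

def simple_bow(text, lowercase=False):
    # Stand-in for the notebook's bow(); tokenizes on word characters only
    tokens = re.findall(r"\w+", text)
    return {t.lower() for t in tokens} if lowercase else set(tokens)

context = "Prisoner Kept In A Room"  # hypothetical headline-style sentence

raw_overlap = simple_bow(context) & prison_def_bow
norm_overlap = simple_bow(context, lowercase=True) & prison_def_bow

print(sorted(raw_overlap))   # nothing matches while the context keeps its capitals
print(sorted(norm_overlap))  # four words match once the context is lowercased
```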

In [18]:
# Q5 - ANSWER - part b)

testWord = "table"
testSentence = "There is a lot of food on the table."
chosenSynsets = findMostProbableSense(testWord, testSentence, lowercase=True, lemmatize=True, nounsOnly=True)  

# print all the definitions of the most probable senses
for s in chosenSynsets:
    for t in s:
        printBasicSynsetInfo(t)

    
# Your example 1
#
print("#################")
print("#   Example 1.  #")
print("#################")
testWord1 = "break"
testSentence1 = "I really need a break from school."
chosenSynsets1 = findMostProbableSense(testWord1, testSentence1, lowercase=True, lemmatize=False, nounsOnly=False)  
for s in chosenSynsets1:
    for t in s:
        printBasicSynsetInfo(t)


# Your example 2
#
print("#################")
print("#   Example 2.  #")
print("#################")
testWord2 = "clear"
testSentence2 = "Everything is so clear now."
chosenSynsets2 = findMostProbableSense(testWord2, testSentence2, lowercase=False, lemmatize=True, nounsOnly=False)  
for s in chosenSynsets2:
    for t in s:
        printBasicSynsetInfo(t)



# Your example 3
#
print("#################")
print("#   Example 3.  #")
print("#################")

testWord3 = "light"
testSentence3 = "Justyn can see light in the Chalet."
chosenSynsets3 = findMostProbableSense(testWord3, testSentence3, lowercase=True, lemmatize=True, nounsOnly=False)  

for s in chosenSynsets3:
    for t in s:
        printBasicSynsetInfo(t)


SynLemmas
[Lemma('table.n.01.table'), Lemma('table.n.01.tabular_array')]
Synonyms
['table', 'tabular_array']
Definition
a set of data arranged in rows and columns
SynLemmas
[Lemma('table.n.02.table')]
Synonyms
['table']
Definition
a piece of furniture having a smooth flat top that is usually supported by one or more vertical legs
SynLemmas
[Lemma('table.n.03.table')]
Synonyms
['table']
Definition
a piece of furniture with tableware for a meal laid out on it
SynLemmas
[Lemma('mesa.n.01.mesa'), Lemma('mesa.n.01.table')]
Synonyms
['mesa', 'table']
Definition
flat tableland with steep edges
SynLemmas
[Lemma('table.n.05.table')]
Synonyms
['table']
Definition
a company of people assembled at a table for a meal or game
SynLemmas
[Lemma('board.n.04.board'), Lemma('board.n.04.table')]
Synonyms
['board', 'table']
Definition
food or meals in general
SynLemmas
[Lemma('postpone.v.01.postpone'), Lemma('postpone.v.01.prorogue'), Lemma('postpone.v.01.hold_over'), Lemma('postpone.v.01.put_over'), Lemma

Synonyms
['open', 'clear']
Definition
a clear or unobstructed space or expanse of land or water
SynLemmas
[Lemma('unclutter.v.01.unclutter'), Lemma('unclutter.v.01.clear')]
Synonyms
['unclutter', 'clear']
Definition
rid of obstructions
SynLemmas
[Lemma('clear.v.02.clear')]
Synonyms
['clear']
Definition
make a way or path by removing objects
SynLemmas
[Lemma('clear_up.v.04.clear_up'), Lemma('clear_up.v.04.clear'), Lemma('clear_up.v.04.light_up'), Lemma('clear_up.v.04.brighten')]
Synonyms
['clear_up', 'clear', 'light_up', 'brighten']
Definition
become clear
SynLemmas
[Lemma('authorize.v.01.authorize'), Lemma('authorize.v.01.authorise'), Lemma('authorize.v.01.pass'), Lemma('authorize.v.01.clear')]
Synonyms
['authorize', 'authorise', 'pass', 'clear']
Definition
grant authorization or clearance for
SynLemmas
[Lemma('clear.v.05.clear')]
Synonyms
['clear']
Definition
remove
SynLemmas
[Lemma('pass.v.09.pass'), Lemma('pass.v.09.clear')]
Synonyms
['pass', 'clear']
Definition
go unchallenged; be 

*Q5 - ANSWER - part c)*  
Lemmatization without the nouns-only filter seems to work best in my scenarios.

#### Signature

I, Reyna Doerwald, declare that the answers provided in this notebook are my own.