# Week 6 (Part 1): Supervised WSD

In the first part of this week we will be looking at corpus-based methods for carrying out word sense disambiguation.  In particular, we will:
* introduce SemCor, a sense-tagged subsection of the Brown Corpus.
* build Naive Bayes classifiers to carry out sense disambiguation for words with two senses

First, some preliminary imports:

In [2]:
from nltk.corpus import semcor
from nltk.corpus import wordnet as wn
import nltk
import operator
import random


In [3]:
#On first run, you will probably need to uncomment the following line and run this cell
#nltk.download('semcor')

## 1. SemCor
SemCor is a collection of 352 documents which have been annotated in various ways (annotations include POS tags and WordNet synsets for individual words).

`semcor.fileids()` returns a list of all of the individual document ids in SemCor

In [4]:
allfiles=semcor.fileids() #list of fileids
len(allfiles)

352

`semcor.raw(fileid)` returns the raw text of the given file.  Note that this is marked-up using XML and is probably best avoided unless there is no other way to access the information you require from the file!

In [1]:
"""
semcor.raw(allfiles)
"""

'\nsemcor.raw(allfiles)\n'

Other potentially useful SemCor functions include:

* `semcor.words(fileid)`: returns a list of tokens for each file
* `semcor.chunks(fileid)`: returns a list of *chunks* for each file, where a chunk identifies multiword (generally non-compositional) phrases
* `semcor.tagged_chunks(fileid,tag)`: returns the tagged chunks of the file, where `tag` can be *pos* or *sem*.  We are interested in the *sem* tags, which are the WordNet synsets
* `semcor.tagged_sents(fileid,tag)`: maintains the sentence boundaries within the file and therefore returns a list of lists (one for each sentence)

In [6]:
semcor.words(allfiles[0])

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

In [7]:
semcor.chunks(allfiles[0])

[['The'], ['Fulton', 'County', 'Grand', 'Jury'], ...]

In [8]:
semcor.tagged_chunks(allfiles[0],tag='sem')

[['The'], Tree(Lemma('group.n.01.group'), [Tree('NE', ['Fulton', 'County', 'Grand', 'Jury'])]), ...]

In [9]:
tagged_sentences=semcor.tagged_sents(allfiles[0],tag='sem')
tagged_sentences[0]

[['The'],
 Tree(Lemma('group.n.01.group'), [Tree('NE', ['Fulton', 'County', 'Grand', 'Jury'])]),
 Tree(Lemma('state.v.01.say'), ['said']),
 Tree(Lemma('friday.n.01.Friday'), ['Friday']),
 ['an'],
 Tree(Lemma('probe.n.01.investigation'), ['investigation']),
 ['of'],
 Tree(Lemma('atlanta.n.01.Atlanta'), ['Atlanta']),
 ["'s"],
 Tree(Lemma('late.s.03.recent'), ['recent']),
 Tree(Lemma('primary.n.01.primary_election'), ['primary', 'election']),
 Tree(Lemma('produce.v.04.produce'), ['produced']),
 ['``'],
 ['no'],
 Tree(Lemma('evidence.n.01.evidence'), ['evidence']),
 ["''"],
 ['that'],
 ['any'],
 Tree(Lemma('abnormality.n.04.irregularity'), ['irregularities']),
 Tree(Lemma('happen.v.01.take_place'), ['took', 'place']),
 ['.']]

For the purposes of this exercise, we are interested in single words which have been tagged with a WordNet Lemma or synset.  We now define a couple of functions to help us extract this information.

In [10]:
def extract_tags(taggedsentence):
    '''
    For a tagged sentence in SemCor, identify single words which have been tagged with a WN synset
    taggedsentence: a list of items, some of which are of type nltk.tree.Tree
    :return: a list of pairs, (word,synset)
    
    '''
    alist=[]
    for item in taggedsentence:
        if isinstance(item,nltk.tree.Tree):   #check whether this is a Tree
            if isinstance(item.label(),nltk.corpus.reader.wordnet.Lemma) and len(item.leaves())==1:
                #check whether the tree's label is Lemma and whether the tree has a single leaf
                #if so add the pair (lowercased leaf,synsetlabel) to output list
                alist.append((item.leaves()[0].lower(),item.label().synset()))
    return alist
            

def extract_senses(fileid_list):
    '''
    apply extract_tags to all sentences in all documents in a list of file ids
    fileid_list: list of ids
    :return: list of list of (token,tag) pairs, one for each sentence in corpus
    '''
    sentences=[]
    for fileid in fileid_list:
        print("Processing {}".format(fileid))
        sentences+=[extract_tags(taggedsentence) for taggedsentence in semcor.tagged_sents(fileid,tag='sem')]
    return sentences

Let's test this on the first document in the fileid list.  Notice that it takes a while to process a single file in this way.

In [1]:
#some_sentences=extract_senses([allfiles[0]])


### Exercise 1.1
Write a function `find_sense_distributions()` which finds the distribution of senses for every word in a list of sentences (in the format returned by `extract_senses()`).  Your output should be a dictionary of dictionaries.  The key to the outermost dictionary should be the word form and the key to the inner dictionaries should be the sense tag.

Test your function on `some_sentences`

In [12]:
def find_sense_distributions(some_sentences):
    allwords={}
    for sentence in some_sentences:
        for(word, sense) in sentence:
            thisword=allwords.get(word,{})
            thisword[sense]=thisword.get(sense,0)+1
            allwords[word]=thisword
    return allwords

In [27]:
#find_sense_distributions(some_sentences)
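
To make the dictionary-of-dictionaries format concrete, here is a minimal sketch on toy data. It is a variation on the function above that uses `collections.Counter` and `defaultdict` rather than `dict.get`; the name `sense_distributions` and the toy sentences are assumptions made for illustration, with plain strings standing in for WordNet synsets.

```python
from collections import Counter, defaultdict

def sense_distributions(sentences):
    """Count how often each word form occurs with each sense tag."""
    dists = defaultdict(Counter)
    for sentence in sentences:
        for word, sense in sentence:
            dists[word][sense] += 1
    # Convert to plain dicts so the result is a dictionary of dictionaries
    return {word: dict(counts) for word, counts in dists.items()}

# Hypothetical toy data: strings stand in for WordNet synsets
toy_sentences = [
    [("bank", "bank.n.01"), ("river", "river.n.01")],
    [("bank", "bank.n.02"), ("money", "money.n.01")],
    [("bank", "bank.n.02")],
]

toy_dists = sense_distributions(toy_sentences)
print(toy_dists["bank"])
```

Here *bank* is seen once with one sense and twice with another, so its inner dictionary has two keys, whereas *river* has only one.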

### Exercise 1.2
Write a function which returns a list of words which only occur with one sense in the corpus, ordered by frequency (most frequent first).

Test your function on `some_sentences`.  You should find that the fourth most frequently occurring seemingly monosemous word is *georgia* which occurs 6 times in this sample.

In [14]:
def find_monosemous(sense_dists):
    mono=[]
    for key,worddict in sense_dists.items():
        if len(worddict.keys())==1:
            mono.append((key,sum(worddict.values())))
    return sorted(mono, key=operator.itemgetter(1),reverse=True)

In [15]:
#find_monosemous(find_sense_distributions(some_sentences))

### Exercise 1.3
Write a function `find_candidates()` which will find words which 
* have 2 senses in the sample, 
* occurrences of which are roughly balanced between the two classes (between 30% and 70%)
* are as frequent as possible

Test it on `some_sentences`

In [16]:
def find_candidates(sense_dists):
    cands=[]
    for key, worddict in sense_dists.items():
        if len(worddict.keys())==2:
            freq=sum(worddict.values())
            p=list(worddict.values())[0]/freq
            if p>0.3 and p <0.7:
                cands.append((key,freq,p))    
    return sorted(cands,key=operator.itemgetter(1),reverse=True)
    

In [17]:
#find_candidates(find_sense_distributions(some_sentences))
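
The balance test in `find_candidates()` only looks at the proportion of one of the two senses. That is enough because, with exactly two senses, the other proportion is one minus the first. The sketch below isolates this check on toy counts; `is_balanced` and the toy distributions are hypothetical names and values used only for illustration.

```python
def is_balanced(sense_counts, low=0.3, high=0.7):
    """True if a word has exactly two senses and neither dominates."""
    if len(sense_counts) != 2:
        return False
    total = sum(sense_counts.values())
    share = next(iter(sense_counts.values())) / total
    # With two senses the other share is 1 - share, so one check covers both
    return low < share < high

# Hypothetical sense counts; strings stand in for WordNet synsets
toy_dists = {
    "bank": {"bank.n.01": 4, "bank.n.02": 6},   # 40% / 60%
    "bass": {"bass.n.01": 9, "bass.n.07": 1},   # 90% / 10%
    "atom": {"atom.n.01": 5},                   # one sense only
}

balanced = [word for word, counts in toy_dists.items() if is_balanced(counts)]
print(balanced)
```

Only *bank* passes: *bass* is too skewed and *atom* has a single sense in this toy sample.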

We now need to apply our functions to larger samples.  Here we will define two sets of sentences, `training_sentences` and `testing_sentences`.  We are going to choose a random sample of the documents for testing.  We can achieve this by randomly shuffling the fileids and then assigning documents in the first part of the list to training and documents in the second part of the list to testing.  By setting the random seed, we ensure reproducibility of our results (since the random shuffle will be the same each time we run the cell).



In [18]:
random.seed(37)
shuffled=list(allfiles)
random.shuffle(shuffled)
print(shuffled)

['brownv/tagfiles/br-a29.xml', 'brownv/tagfiles/br-l06.xml', 'brown2/tagfiles/br-e31.xml', 'brownv/tagfiles/br-c06.xml', 'brown2/tagfiles/br-j34.xml', 'brownv/tagfiles/br-e11.xml', 'brownv/tagfiles/br-a21.xml', 'brown1/tagfiles/br-j01.xml', 'brownv/tagfiles/br-a17.xml', 'brown1/tagfiles/br-l12.xml', 'brownv/tagfiles/br-e09.xml', 'brown2/tagfiles/br-g17.xml', 'brown2/tagfiles/br-g18.xml', 'brownv/tagfiles/br-g09.xml', 'brownv/tagfiles/br-l04.xml', 'brownv/tagfiles/br-l05.xml', 'brown2/tagfiles/br-l18.xml', 'brownv/tagfiles/br-d08.xml', 'brown1/tagfiles/br-k16.xml', 'brown2/tagfiles/br-f21.xml', 'brown2/tagfiles/br-n11.xml', 'brown1/tagfiles/br-j04.xml', 'brownv/tagfiles/br-e16.xml', 'brownv/tagfiles/br-a25.xml', 'brown2/tagfiles/br-n17.xml', 'brownv/tagfiles/br-g06.xml', 'brownv/tagfiles/br-e06.xml', 'brownv/tagfiles/br-a42.xml', 'brown1/tagfiles/br-g01.xml', 'brown1/tagfiles/br-j19.xml', 'brown2/tagfiles/br-n15.xml', 'brown1/tagfiles/br-j05.xml', 'brownv/tagfiles/br-b16.xml', 'brown1/t

In [2]:
#this cell will take 1-5 minutes to run - avoid rerunning it unnecessarily
#training_sentences=extract_senses(shuffled[:300])
#testing_sentences=extract_senses(shuffled[300:])
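
The shuffle-and-slice split can be checked quickly on a toy list of ids without loading any corpus files. The sketch below is an assumption-laden variant: it uses a local `random.Random(seed)` generator instead of the global `random.seed()` call in the cell above, and `split_ids` and the toy ids are hypothetical.

```python
import random

def split_ids(ids, n_train, seed):
    """Shuffle a copy of ids with a fixed seed, then slice into train/test."""
    rng = random.Random(seed)   # local generator, leaves the global state alone
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical document ids, for illustration only
toy_ids = ["doc{:02d}".format(i) for i in range(10)]
train_a, test_a = split_ids(toy_ids, n_train=8, seed=37)
train_b, test_b = split_ids(toy_ids, n_train=8, seed=37)

print(train_a == train_b and test_a == test_b)   # same seed, same split
print(set(train_a) & set(test_a))                # no document in both sets
```

Running the split twice with the same seed gives identical training and testing lists, and no document appears in both.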

### Exercise 1.4
Use the functionality you have already developed to identify:
* the ten most frequent monosemous words in the training data
* the ten best candidates in the training data for evaluating binary classification algorithms for WSD

In [3]:
#find_monosemous(find_sense_distributions(training_sentences))

#training_sentences

## 2. Building Naive Bayes Classifiers for WSD
We are going to train and use a NB classifier to identify the correct sense of a word.

The functions below will get all of the sentences containing a given word and generate a bag-of-words representation suitable for a Naive Bayes classifier.

Try it out on one of the words you identified above.

In [21]:
def contains(sentence,astring):
    '''
    check whether sentence contains astring
    '''
    if len(sentence)>0:
        tokens,tags=zip(*sentence)
        #print(tokens,tags)
        return astring in tokens
    else:
        return False
    
def get_label(sentence,word):
    '''
    get the synset label for the word in this sentence
    '''
    count=0
    label="none"
    for token,tag in sentence:
        if token==word:
            count+=1
            label=str(tag)
    if count !=1:
        #print("Warning: {} occurs {} times in {}".format(word,count,sentence))
        pass
    return label
    
def get_word_data(sentences,word):
    '''
    select sentences containing word and construct labelled data set where each sentence is represented using Bernoulli event model
    '''
    selected_sentences=[sentence for sentence in sentences if contains(sentence,word)]
    word_data=[({token:True for (token,tag) in sentence},get_label(sentence,word)) for sentence in selected_sentences] 
    return word_data
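
As a small illustration of the Bernoulli (presence-only) representation built by `get_word_data()`, the sketch below applies the same dictionary comprehension to one toy sentence. The helper name `bernoulli_features` and the toy tags are assumptions; strings stand in for synsets.

```python
def bernoulli_features(tagged_sentence):
    """Map a list of (token, tag) pairs to a presence-only feature dict."""
    return {token: True for token, _tag in tagged_sentence}

# Hypothetical sentence in which the token 'atom' occurs twice
toy_sentence = [
    ("atom", "atom.n.01"),
    ("bonds", "bond.v.01"),
    ("atom", "atom.n.01"),
]

features = bernoulli_features(toy_sentence)
print(features)
```

A token that occurs twice still contributes a single `True` feature, so counts are discarded. Note also that the target word itself ends up in the feature dictionary, as it does in the real data.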

We can now train and test a NaiveBayesClassifier.  Here we are going to use the NLTK implementation, but feel free to try out your own developed in earlier labs.

In [22]:
from nltk.classify.naivebayes import NaiveBayesClassifier

#set myword to one of the words you identified as a good candidate for testing WSD algorithms in the earlier exercises
myword="atom"

training=get_word_data(training_sentences,myword)
testing=get_word_data(testing_sentences,myword)
aclassifier=NaiveBayesClassifier.train(training)

In [23]:
testing

[({'appears': True,
   'possible': True,
   'set': True,
   'abstraction': True,
   'chlorine': True,
   'atom': True,
   'molecule': True,
   'form': True,
   'radical': True},
  "Synset('atom.n.01')"),
 ({'furthermore': True,
   'exchange': True,
   'not': True,
   'expected': True,
   'be': True,
   'sensitive': True,
   'trace': True,
   'amounts': True,
   'impurities': True,
   'abstraction': True,
   'chlorine': True,
   'atom': True,
   'too': True,
   'high': True,
   'also': True,
   'compete': True,
   'very': True,
   'effectively': True,
   'scavenger': True,
   'radicals': True},
  "Synset('atom.n.01')")]

### Exercise 2.1
Write a function to evaluate the accuracy of your classifier on some test data.

Test it using `testing`

In [24]:
training

[({'o': True,
   'atoms': True,
   'sheet': True,
   'are': True,
   'cr': True,
   'atom': True,
   'surrounded': True,
   'octahedron': True},
  "Synset('atom.n.01')"),
 ({'found': True,
   'be': True,
   'paramagnetic': True,
   'three': True,
   'unpaired': True,
   'electron': True,
   'chromium': True,
   'atom': True,
   'molecular': True,
   'susceptibility': True},
  "Synset('atom.n.01')")]

In [25]:
def evaluate(cls,test_data):
    correct=0
    wrong=0
    predictions={}
    actual={}
    for doc,label in test_data:
        prediction=cls.classify(doc)
        predictions[prediction]=predictions.get(prediction,0)+1
        actual[label]=actual.get(label,0)+1
        if prediction==label:
            correct+=1
        else:
            wrong+=1
    acc=correct/(correct+wrong)
    print("Accuracy of NB classification on testing data is {} , {} out of {}".format(acc,correct,correct+wrong))
    
evaluate(aclassifier,testing)

Accuracy of NB classification on testing data is 1.0 , 2 out of 2


### Exercise 2.2
Write a function which will return the precision of each class
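
One possible shape for such a function is sketched below. It is not the `evaluate_precision` from `classifiercode`, which is not shown here. It assumes only that the classifier has a `classify(features)` method, as NLTK classifiers do, and it reports 0.0 for a class that is never predicted (a convention chosen for this sketch, since precision is undefined in that case). `KeywordClassifier` is a hypothetical stand-in so the example runs without any trained model.

```python
def precision_per_class(classifier, test_data):
    """Return {label: precision} for every label seen in gold or predictions."""
    predicted = {}   # label -> number of times it was predicted
    correct = {}     # label -> number of those predictions that were right
    labels = set()
    for features, gold in test_data:
        guess = classifier.classify(features)
        labels.update([gold, guess])
        predicted[guess] = predicted.get(guess, 0) + 1
        if guess == gold:
            correct[guess] = correct.get(guess, 0) + 1
    # Precision is undefined for a label that is never predicted; use 0.0 here
    return {
        label: (correct.get(label, 0) / predicted[label]) if label in predicted else 0.0
        for label in labels
    }


class KeywordClassifier:
    """Hypothetical stand-in exposing the same classify() method."""
    def classify(self, features):
        return "sense_a" if "river" in features else "sense_b"


toy_test = [
    ({"river": True}, "sense_a"),
    ({"river": True, "money": True}, "sense_b"),
    ({"money": True}, "sense_b"),
]

precisions = precision_per_class(KeywordClassifier(), toy_test)
print(precisions)
```

In the toy data, `sense_a` is predicted twice but is right only once (precision 0.5), while `sense_b` is predicted once and is right (precision 1.0).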

### Exercise 2.3
Write a function `train_and_test()` which gets the appropriate training and testing data for a given word, builds a classifier and outputs the precision with which each class is predicted

In [26]:
"""
from classifiercode import *
def train_and_test(word):
    training=get_word_data(training_sentences,word)
    testing=get_word_data(testing_sentences,word)
    classifier=NaiveBayesClassifier.train(training)
    #evaluate(classifier,testing)
    evaluate_precision(classifier,testing)

train_and_test("best")
"""



NameError: name 'evaluate_precision' is not defined

### Exercise 2.4
* Run `train_and_test()` on each of your candidate words identified earlier in the exercise.  
* Display results in a pandas dataframe
* Calculate average precision for each word
* Calculate the mean of these per-word average precision scores over the whole set of candidate words
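
The two averaging steps can be sketched in plain Python as follows. The scores in `toy_results` are hypothetical values invented for illustration, not results from SemCor, and the layout (word, then class, then precision) is an assumption about what `train_and_test()` might return.

```python
from statistics import mean

# Hypothetical per-class precision scores, for illustration only
toy_results = {
    "word_one": {"sense_a": 0.5, "sense_b": 1.0},
    "word_two": {"sense_a": 0.75, "sense_b": 0.25},
}

# Step 1: average precision over the classes of each word
per_word = {word: mean(scores.values()) for word, scores in toy_results.items()}

# Step 2: average of those per-word averages over all candidate words
overall = mean(per_word.values())

print(per_word)
print(overall)
```

A nested dictionary of this shape can also be passed to `pandas.DataFrame.from_dict(toy_results, orient="index")` to get one row per word, which is one way to meet the dataframe requirement.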