# FNLP 2019: Lab Session 5: Word Sense Disambiguation

##  Word Sense Disambiguation: Recap

In this tutorial we will be exploring the lexical sample task. This is a task where you use a corpus to learn how to disambiguate a small set of target words using supervised learning. The aim is to build a classifier that maps each occurrence of a target word in a corpus to its sense.

We will use a Naive Bayes classifier. In other words, where the context of an occurrence of a target word in the corpus is represented as a feature vector, the classifier estimates the word sense s on the basis of its context as shown below. 


![Slide from lecture 14](nb_maths.jpg)
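
The slide image is not reproduced in text, so the following is a sketch of the standard Naive Bayes decision rule, on the assumption that this is what the slide shows. The symbols $S$ (the set of senses of the target word) and $f_1, \dots, f_n$ (the context features) are notation introduced here for illustration:

```latex
\hat{s} = \operatorname*{arg\,max}_{s \in S} P(s \mid f_1, \dots, f_n)
        = \operatorname*{arg\,max}_{s \in S} P(s) \prod_{j=1}^{n} P(f_j \mid s)
```

The second step applies Bayes' rule, drops the denominator (which is the same for every sense), and uses the Naive Bayes assumption that the features are conditionally independent given the sense.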

## The corpus

We will use the [senseval-2](http://www.hipposmond.com/senseval2) corpus for our training and test data. This corpus consists of text from a mixture of sources, including the British National Corpus and the Penn Treebank portion of the Wall Street Journal. Each word in the corpus is tagged with its part of speech, and the senses of the following target words are also manually annotated: the nouns *interest* and *line*, the verb *serve*, and the adjective *hard*. You can find out more about the task [here](http://www.hipposmond.com/senseval2/descriptions/english-lexsample.htm).

The sets of senses that are used to annotate each target word come from WordNet (more on that later).

## Getting started: Run the code

Look at the code below and try to understand how it works. Don't worry if you don't understand some of it; that is not necessary for doing this task.
Remember, `help(...)` is your friend:
  * `help([class name])` for classes and all their methods and instance variables
  * `help([any object])` likewise
  * `help([function])` or `help([class].[method])` for functions / methods

This code allows you to do several things. You can now run, train and evaluate a range of Naive Bayes classifiers over the corpus to acquire a model of WSD for a given target word: the adjective *hard*, the nouns *interest* or *line*, and the verb *serve*. We'll learn later how you do this. First, we're going to explore the nature of the corpus itself. 

In [None]:
from __future__ import division
import nltk
import random
from nltk.corpus import senseval
from nltk.classify import accuracy, NaiveBayesClassifier, MaxentClassifier
from collections import defaultdict

# The following shows how the senseval corpus consists of instances, where each instance
# consists of a target word (and its tag), its position in the sentence it appeared in
# within the corpus (an index into the context list; in the sample below, punctuation
# tokens are counted), and the context, which is the words in the sentence plus their tags.
#
# senseval.instances()[:1]
# [SensevalInstance(word='hard-a', position=20, context=[('``', '``'), ('he', 'PRP'),
# ('may', 'MD'), ('lose', 'VB'), ('all', 'DT'), ('popular', 'JJ'), ('support', 'NN'),
# (',', ','), ('but', 'CC'), ('someone', 'NN'), ('has', 'VBZ'), ('to', 'TO'),
# ('kill', 'VB'), ('him', 'PRP'), ('to', 'TO'), ('defeat', 'VB'), ('him', 'PRP'),
# ('and', 'CC'), ('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'), ('to', 'TO'), ('do', 'VB'),
# ('.', '.'), ("''", "''")], senses=('HARD1',))]

def senses(word):
    """Return the list of possible senses for a word per senseval-2
    
    :param word: The word to look up
    :type word: str
    :return: list of senses
    :rtype: list(str)
    """
    return list(set(i.senses[0] for i in senseval.instances(word)))

# Both above and below, we depend on the (non-obvious?) fact that although the field is
#  called 'senses', there is always only 1, i.e. there is no residual ambiguity in the
#  data as we have it, because this is the gold standard and disambiguation per
#  the context has already been done

def sense_instances(instances, sense):
    """Return a list of instances that have the given sense
    
    :param instances: corpus of sense-labelled instances
    :type instances: list(senseval.SensevalInstance)
    :param sense: The target sense
    :type sense: str
    :return: matching instances
    :rtype: list(senseval.SensevalInstance)
    """
    return [instance for instance in instances if instance.senses[0]==sense]

# >>> sense3 = sense_instances(senseval.instances('hard.pos'), 'HARD3')
# >>> sense3[:2]
# [SensevalInstance(word='hard-a', position=15,
#  context=[('my', 'PRP$'), ('companion', 'NN'), ('enjoyed', 'VBD'), ('a', 'DT'), ('healthy', 'JJ'), ('slice', 'NN'), ('of', 'IN'), ('the', 'DT'), ('chocolate', 'NN'), ('mousse', 'NN'), ('cake', 'NN'), (',', ','), ('made', 'VBN'), ('with', 'IN'), ('a', 'DT'), ('hard', 'JJ'), ('chocolate', 'NN'), ('crust', 'NN'), (',', ','), ('topping', 'VBG'), ('a', 'DT'), ('sponge', 'NN'), ('cake', 'NN'), ('with', 'IN'), ('either', 'DT'), ('strawberry', 'NN'), ('or', 'CC'), ('raspberry', 'JJ'), ('on', 'IN'), ('the', 'DT'), ('bottom', 'NN'), ('.', '.')],
#  senses=('HARD3',)),
#  SensevalInstance(word='hard-a', position=5,
#  context=[('``', '``'), ('i', 'PRP'), ('feel', 'VBP'), ('that', 'IN'), ('the', 'DT'), ('hard', 'JJ'), ('court', 'NN'), ('is', 'VBZ'), ('my', 'PRP$'), ('best', 'JJS'), ('surface', 'NN'), ('overall', 'JJ'), (',', ','), ('"', '"'), ('courier', 'NNP'), ('said', 'VBD'), ('.', '.')],
# senses=('HARD3',))]

_inst_cache = {}

STOPWORDS = ['.', ',', '?', '"', '``', "''", "'", '--', '-', ':', ';', '(',
             ')', '$', '000', '1', '2', '10', 'I', 'i', 'a', 'about', 'after', 'all', 'also', 'an', 'any',
             'are', 'as', 'at', 'and', 'be', 'being', 'because', 'been', 'but', 'by',
             'can', "'d", 'did', 'do', "don'", 'don', 'for', 'from', 'had','has', 'have', 'he',
             'her','him', 'his', 'how', 'if', 'is', 'in', 'it', 'its', "'ll", "'m", 'me',
             'more', 'my', 'n', 'no', 'not', 'of', 'on', 'one', 'or', "'re", "'s", "s",
             'said', 'say', 'says', 'she', 'so', 'some', 'such', "'t", 'than', 'that', 'the',
             'them', 'they', 'their', 'there', 'this', 'to', 'up', 'us', "'ve", 'was', 'we', 'were',
             'what', 'when', 'where', 'which', 'who', 'will', 'with', 'years', 'you',
             'your']

STOPWORDS_SET=set(STOPWORDS)

NO_STOPWORDS = []

def wsd_context_features(instance, vocab, dist=3):
    """Return a featureset dictionary of left/right context word features within a distance window
    of the sense-classified word of a senseval-2 instance, also a feature for the word and for
    its part of speech,
    for use by an NLTK classifier such as NaiveBayesClassifier or MaxentClassifier
    
    :param instance: sense-labelled instance to extract features from
    :type instance: senseval.SensevalInstance
    :param vocab: ignored in this case
    :type vocab: str
    :param dist: window size
    :type dist: int
    :return: feature dictionary
    :rtype: dict"""
    features = {}
    ind = instance.position
    con = instance.context
    for i in range(max(0, ind-dist), ind):
        j = ind-i
        features['left-context-word-%s(%s)' % (j, con[i][0])] = True

    for i in range(ind+1, min(ind+dist+1, len(con))):
        j = i-ind
        features['right-context-word-%s(%s)' % (j, con[i][0])] = True

    features['word'] = instance.word
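    # NOTE: con[1][1] is the tag of the second token in the sentence;
    # the tag of the target word itself would be con[ind][1].
    features['pos'] = con[1][1]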
    return features

def wsd_word_features(instance, vocab, dist=3):
    """Return a featureset for an NLTK classifier such as NaiveBayesClassifier or MaxentClassifier
    where every key returns False unless it occurs in the instance's context
    and in a specified vocabulary
    
    :param instance: sense-labelled instance to extract features from
    :type instance: senseval.SensevalInstance
    :param vocab: filter for context words that yield features
    :type vocab: list(str)
    :param dist: ignored in this case
    :type dist: int
    :return: feature dictionary
    :rtype: dict"""
    features = defaultdict(lambda:False)
    features['alwayson'] = True
    # Not all context items are (word,pos) pairs, for some reason some are just strings...
    for w in (e[0] for e in instance.context if isinstance(e,tuple)):
            if w in vocab:
                features[w] = True
    return features

def extract_vocab_frequency(instances, stopwords=STOPWORDS_SET, n=300):
    """Construct a frequency distribution of the non-stopword context words
    in a collection of senseval-2 instances and return the top n entries, sorted
    
    :param instances: sense-labelled instances to extract from
    :type instances: list(senseval.SensevalInstance)
    :param stopwords: words to exclude from the result
    :type stopwords: iterable(string)
    :param n: number of items to return
    :type n: int
    :return: sorted list of at most n items from the frequency distribution
    :rtype: list(tuple(str,int))
    """
    fd = nltk.FreqDist()
    for i in instances:
        (target, suffix) = i.word.split('-')
        words = (c[0] for c in i.context if not c[0] == target)
        for word in set(words) - set(stopwords):
            fd[word] += 1
    return fd.most_common()[:n+1]
        
def extract_vocab(instances, stopwords=STOPWORDS_SET, n=300):
    """Return the n most common non-stopword words appearing as context
    in a collection of senseval-2 instances
    
    A wrapper for extract_vocab_frequency, q.v.
    
    :param instances: sense-labelled instances to extract from
    :type instances: list(senseval.SensevalInstance)
    :param stopwords: words to exclude from the result
    :type stopwords: iterable(string)
    :param n: number of words to return
    :type n: int
    :return: sorted list of at most n words
    :rtype: list(str)"""

    return [w for w,f in extract_vocab_frequency(instances,stopwords,n)]
    
def wst_classifier(trainer, word, features, stopwords_list=STOPWORDS_SET, number=300, log=False, distance=3, confusion_matrix=False):
    """Build a classifier instance for the senseval2 senses of a word and apply it

    :param trainer: the trainer class method for an NLTK classifier such as NaiveBayesClassifier or MaxentClassifier
    :type trainer: method(list(tuple(featureset,label)))
    :param word: from senseval2 (we have 'hard.pos', 'interest.pos', 'line.pos' and 'serve.pos')
    :type word: str
    :param features: a feature set constructor (we have wsd_context_features or wsd_word_features)
    :type features: function(senseval.SensevalInstance,list(str),int)
    :param stopwords_list: passed to extract_vocab as the words to exclude from the vocabulary
    :type stopwords_list: iterable(str)
    :param number: passed to extract_vocab when constructing the second argument to the feature set constructor
    :type number: int
    :param log: if set to True, writes any errors to a file errors.txt
    :type log: bool
    :param distance: passed to the feature set constructor as 3rd argument
    :type distance: int
    :param confusion_matrix: if set to True, prints a confusion matrix
    :type confusion_matrix: bool

    Calling this function splits the senseval data for the word into a training set and a test set (the way it does
    this is the same for each call of this function, because the argument to random.seed is specified,
    but removing this argument would make the training and testing sets different each time you build a classifier).

    It then trains the trainer on the training set to create a classifier that performs WSD on the word,
    using features (with number or distance where relevant).

    It then tests the classifier on the test set, and prints its accuracy on that set.

    If log==True, then the errors of the classifier over the test set are written to errors.txt.
    For each error four things are recorded: (i) the example number within the test data (this is simply the index of the
    example within the list test_data); (ii) the sentence that the target word appeared in, (iii) the
    (incorrect) derived label, and (iv) the gold label.

    If confusion_matrix==True, then calling this function prints out a confusion matrix, where each cell [i,j]
    indicates how often label j was predicted when the correct label was i (so the diagonal entries indicate labels
    that were correctly predicted).
    """
    print("Reading data...")
    global _inst_cache
    if word not in _inst_cache:
        _inst_cache[word] = [(i, i.senses[0]) for i in senseval.instances(word)]
    events = _inst_cache[word][:]
    senses = list(set(l for (i, l) in events))
    instances = [i for (i, l) in events]
    vocab = extract_vocab(instances, stopwords=stopwords_list, n=number)
    print(' Senses: ' + ' '.join(senses))

    # Split the instances into a training and test set,
    #if n > len(events): n = len(events)
    n = len(events)
    random.seed(5043562) # 5444522
    random.shuffle(events)
    training_data = events[:int(0.8 * n)]
    test_data = events[int(0.8 * n):n]
    # Train classifier
    print('Training classifier...')
    classifier = trainer([(features(i, vocab, distance), label) for (i, label) in training_data])
    # Test classifier
    print('Testing classifier...')
    acc = accuracy(classifier, [(features(i, vocab, distance), label) for (i, label) in test_data] )
    print('Accuracy: %6.4f' % acc)
    if log:
        #write error file
        print('Writing errors to errors.txt')
        output_error_file = open('errors.txt', 'w')
        errors = []
        for (i, label) in test_data:
            guess = classifier.classify(features(i, vocab, distance))
            if guess != label:
                con =  i.context
                position = i.position
                item_number = str(test_data.index((i, label)))
                word_list=[cv[0] if isinstance(cv,tuple) else cv for cv in con]
                hard_highlighted = word_list[position].upper()
                word_list_highlighted = word_list[0:position] + [hard_highlighted] + word_list[position+1:]
                sentence = ' '.join(word_list_highlighted)
                errors.append([item_number, sentence, guess,label])
        error_number = len(errors)
        output_error_file.write('There are ' + str(error_number) + ' errors!' + '\n' + '----------------------------' +
                                '\n' + '\n')
        for error in errors:
            output_error_file.write(str(errors.index(error)+1) +') ' + 'example number: ' + error[0] + '\n' +
                                    '    sentence: ' + error[1] + '\n' +
                                    '    guess: ' + error[2] + ';  label: ' + error[3] + '\n' + '\n')
        output_error_file.close()
    if confusion_matrix:
        gold = [label for (i, label) in test_data]
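        # Pass distance here too, so that the features match those used for training and testing
        derived = [classifier.classify(features(i, vocab, distance)) for (i, label) in test_data]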
        cm = nltk.ConfusionMatrix(gold,derived)
        print(cm)
        return cm
        
def demo():
    print("NB, with features based on 300 most frequent context words")
    wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features)
    print("\nNB, with features based on word + pos in 6 word window")
    wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_context_features)
##    print "MaxEnt, with features based word + pos in 6 word window"
##    wst_classifier(MaxentClassifier.train, 'hard.pos', wsd_context_features)
    
#demo()

# The Senseval corpus
## Target words

You can find out the set of target words for the senseval-2 corpus by running:

In [None]:
senseval.fileids()


The result doesn't tell you the syntactic category of the words, but you can find that in the description of the corpus above.

## Word senses

Let's now find out the set of word senses for each target word in senseval. There is a function in the code above that returns this information. For example:


In [None]:
print(senses('hard.pos'))

As you can see this gives you `['HARD1', 'HARD2', 'HARD3']`

So there are 3 senses for the adjective hard in the corpus. You'll shortly be looking at the data to guess what these 3 senses are.

Now it's your turn:

* What are the senses for the other target words? Find out by calling senses with appropriate arguments.
* How many senses does each target have?
* Let's now guess the sense definitions for HARD1, HARD2 and HARD3 by looking at the 100 most frequent open class words that occur in the context of each sense. 


You can find out what these 100 words are for HARD1 by running the following:

In [None]:
from pprint import pprint
instances1 = sense_instances(senseval.instances('hard.pos'), 'HARD1')
features1 = extract_vocab_frequency(instances1, n=100)

# Now let's try printing features1:
pprint(features1)

Now it's your turn:

* Call the above functions for HARD2 and HARD3.
* Look at the resulting lists of 100 most frequent words for each sense, and try to define what HARD1, HARD2 and HARD3 mean.
* These senses are actually the first three senses for the adjective _hard_ in [WordNet](http://wordnet.princeton.edu/). You can enter a word and get its list of WordNet senses from [here](http://wordnetweb.princeton.edu/perl/webwn). Do this for hard, and check whether your estimated definitions for the 3 word senses are correct. 

In [None]:
instances2 = sense_instances(senseval.instances('hard.pos'), 'HARD2')
features2 = extract_vocab_frequency(instances2, n=20)

instances3 = sense_instances(senseval.instances('hard.pos'), 'HARD3')
features3 = extract_vocab_frequency(instances3, n=20)

## The data structures: Senseval instances
Having extracted all instances of a given sense, you can look at what the data structures in the corpus look like: 

In [None]:
print("For HARD2:\nSample instance: %s\nAll features:"%instances2[0])
pprint(features2)
print("\nFor HARD3:\nSample instance: %s\nAll features:"%instances3[0])
pprint(features3)

So the senseval corpus is a collection of information about a set of tagged sentences, where each entry or instance consists of 4 attributes:

* `word` specifies the target word together with its syntactic category (e.g., `hard-a` means that the word is *hard* and its category is 'adjective');
* `position` gives the index of the target word within `context` (note that in the sample instances shown in the code above, punctuation tokens are counted);
* `context` represents the sentence as a list of pairs, each pair being a word or punctuation mark and its tag; and finally
* `senses` is a tuple, each item in the tuple being a sense for that target word. In the subset of the corpus we are working with, this tuple contains only one element. But there are a few examples elsewhere in the corpus where there is more than one, representing the fact that the annotator couldn't decide which sense to assign to the word. For simplicity, our classifiers are going to ignore all but the first element of `senses`.
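
To make these four attributes concrete, here is a minimal sketch that runs without the corpus. `ToyInstance` is a hypothetical stand-in for NLTK's `SensevalInstance` (only the field names are taken from the real class), and the sentence is made up for illustration:

```python
from collections import namedtuple

# Hypothetical stand-in for SensevalInstance, with the same four field names.
ToyInstance = namedtuple("ToyInstance", ["word", "position", "context", "senses"])

inst = ToyInstance(
    word="hard-a",
    position=2,
    context=[("the", "DT"), ("very", "RB"), ("hard", "JJ"), ("exam", "NN"), (".", ".")],
    senses=("HARD1",),
)

# The (word, tag) pair found at index `position` of the context list
target_word, target_tag = inst.context[inst.position]
# Only the first element of `senses` is used as the gold label
gold_sense = inst.senses[0]
# 'hard-a' splits into the word and its syntactic category
lemma, category = inst.word.split("-")

print(target_word, target_tag, gold_sense, lemma, category)
```

The same attribute accesses should work on a real instance, e.g. one taken from `senseval.instances('hard.pos')`.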

# Exploring different WSD classifiers
You're now going to compare the performance of different classifiers that perform word sense disambiguation. You do this by calling the function `wst_classifier`. This function must have at least the following arguments specified by you:

 1. A trainer; e.g., `NaiveBayesClassifier.train` (if you want you could also try `MaxentClassifier.train`, but this takes longer to train).
 2. The target word that the classifier is going to learn to disambiguate: i.e., 'hard.pos', 'line.pos', 'interest.pos' or 'serve.pos'.
 3. A feature set. The code allows you to use two kinds of feature sets:
#### wsd_word_features
This feature set is based on the set **S&nbsp;** of the **n&nbsp;** most frequent words that occur in the same sentence as the target word **w&nbsp;** across the entire training corpus (as you'll see later, you can specify the value of **n&nbsp;**, but if you don't specify it then it defaults to 300). For each occurrence of **w,** `wsd_word_features` represents its context as the subset of those words from **S&nbsp;** that occur in the same sentence as **w**. By default, the closed-class words that are specified in `STOPWORDS` are excluded from the set **S&nbsp;** of most frequent words, but as we'll see later, you can also include closed-class words in **S&nbsp;**, or re-define closed-class words in any way you like! If you want to know what closed-class words are excluded by default, just look at the code above. 
#### wsd_context_features
This feature set represents the context of a word **w&nbsp;** as the sequence of **m&nbsp;** pairs `(word, tag)` that occur before **w&nbsp;** and the sequence of **m&nbsp;** pairs `(word, tag)` that occur after **w&nbsp;** (in the code above, only the word of each pair actually ends up in the feature name). As we'll see shortly, you can specify the value of **m&nbsp;** through the `distance` argument (e.g., `distance=1` means the context consists of just the immediately preceding and immediately following words); otherwise, **m&nbsp;** defaults to 3. 
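
To see the difference between the two feature sets on a small example, the sketch below uses two simplified stand-ins, `toy_context_features` and `toy_word_features`. These are hypothetical helpers written for this illustration: they follow the same naming scheme as the functions above but leave out the `word`, `pos` and `alwayson` features, and the sentence and vocabulary are made up:

```python
def toy_context_features(context, position, dist=3):
    """One feature per word within `dist` positions to the left or right of the target."""
    features = {}
    for i in range(max(0, position - dist), position):
        features["left-context-word-%s(%s)" % (position - i, context[i][0])] = True
    for i in range(position + 1, min(position + dist + 1, len(context))):
        features["right-context-word-%s(%s)" % (i - position, context[i][0])] = True
    return features


def toy_word_features(context, vocab):
    """One feature per vocabulary word that occurs anywhere in the sentence."""
    return {w: True for (w, _tag) in context if w in vocab}


context = [("a", "DT"), ("very", "RB"), ("hard", "JJ"), ("wooden", "JJ"), ("seat", "NN")]

window = toy_context_features(context, position=2, dist=1)
bag = toy_word_features(context, vocab={"seat", "work", "wooden"})

print(sorted(window))
print(sorted(bag))
```

The window features record where a neighbouring word occurs relative to the target, whereas the word features only record that a vocabulary word occurs somewhere in the sentence.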
    
    
## Now let's train our first classifier
Try the following:

In [None]:
wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features) 

The output shows that the adjective *hard* is tagged with 3 senses in the corpus (HARD1, HARD2 and HARD3), and that the Naive Bayes classifier using the feature set based on the 300 most frequent context words yields an accuracy of 0.8362. 

#### Now it's your turn:

Use `wst_classifier` to train a classifier that disambiguates hard using `wsd_context_features`. Build classifiers for *line* and *serve* as well, using the word features and then the context features.

* What's more accurate for disambiguating 'hard.pos', `wsd_context_features` or `wsd_word_features`?
* Does the same hold true for 'line.pos' and 'serve.pos'? Why do you think that might be?
* Why is it not fair to compare the accuracy of the classifiers across different target words? 

    
# Baseline models
Just how good is the accuracy of these WSD classifiers? To find out, we need a baseline. There are two we consider here:

1. A model which assigns a sense at random.
2. A model which always assigns the most frequent sense. 
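
As a minimal sketch of how these two baselines can be computed, the example below uses a made-up list of gold labels (not the real corpus counts). It assumes the random model picks uniformly among the senses, in which case its expected accuracy is one over the number of senses:

```python
from collections import Counter

# Made-up gold labels, for illustration only; these are not the real corpus counts.
gold = ["A"] * 6 + ["B"] * 3 + ["C"] * 1

counts = Counter(gold)

# Uniform random guessing is right 1/k of the time on average, for k senses
random_baseline = 1 / len(counts)

# Always guessing the commonest sense is right as often as that sense occurs
most_frequent_sense, top_count = counts.most_common(1)[0]
frequency_baseline = top_count / len(gold)

print(random_baseline, most_frequent_sense, frequency_baseline)
```

The more skewed the sense distribution, the further the frequency baseline pulls ahead of the random one.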

### Now it's your turn:

* What is the accuracy of the random baseline model for 'hard.pos'?
* To compute the accuracy of the frequency baseline model for 'hard.pos', we need to find out the Frequency Distribution of the three senses in the corpus: 

In [None]:
hard_sense_fd = nltk.FreqDist([i.senses[0] for i in senseval.instances('hard.pos')])
print(hard_sense_fd.most_common())

frequency_hard_sense_baseline = hard_sense_fd.freq('HARD1')
frequency_hard_sense_baseline

 In other words, the frequency baseline has an accuracy of approx. 0.797. What is the most frequent sense for 'hard.pos'? And is the frequency baseline a better model than the random model?
* Now compute the accuracy of the frequency baseline for other target words; e.g. 'line.pos'. 

In [None]:
line_sense_fd = nltk.FreqDist([i.senses[0] for i in senseval.instances('line.pos')])
print(line_sense_fd.most_common())

frequency_line_sense_baseline = line_sense_fd.freq('product')
frequency_line_sense_baseline

# Rich features vs. sparse data
In this part of the tutorial we are going to vary the feature sets and compare the results. As well as being able to choose between `wsd_context_features` vs. `wsd_word_features`, you can also vary the following:

#### wsd_context_features

You can vary the number of word-tag pairs before and after the target word that you include in the feature vector. You do this by specifying the argument `distance` to the function `wst_classifier`. For instance, the following creates a classifier that uses 2 words to the left and right of the target word: 
    
    wst_classifier(NaiveBayesClassifier.train, 'hard.pos',
                   wsd_context_features, distance=2)

What about distance 1?
#### wsd_word_features
You can vary the closed-class words that are excluded from the set of most frequent words, and you can vary the size of the set of most frequent words. For instance, the following results in a model which uses the 100 most frequent words including closed-class words:

    wst_classifier(NaiveBayesClassifier.train, 'hard.pos',
                   wsd_word_features, stopwords_list=[], number=100)
           
#### Now it's your turn:
Build several WSD models for 'hard.pos', including at least the following: for the `wsd_word_features` version, vary `number` between 100, 200 and 300, and vary the `stopwords_list` between `[]` (i.e., the empty list) and `STOPWORDS`; for the `wsd_context_features` version, vary the `distance` between 1, 2 and 3, and vary the `stopwords_list` between `[]` and `STOPWORDS`.

In [None]:
for n in [100, 200, 300, 400]:
    for stopwords in [[], STOPWORDS]:
        stop = 'stopwords' if stopwords else 'no stopwords'
        print('Word features with number: %s and %s'%(n, stop))
        wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features, number=n, stopwords_list=stopwords) 

for n in [1, 2, 3]:
    for stopwords in [[], STOPWORDS]:
        stop = 'stopwords' if stopwords else 'no stopwords'
        print('Context features with distance: %s and %s'%(n, stop))
        wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_context_features,stopwords_list=stopwords, distance=n) 

Why does changing `number` have an inconsistent impact on the word model?
  * This suggests that the data is too sparse for changes in vocabulary size to have a consistent impact.

Why does shrinking the context window before and after the target word to fewer than 3 words improve the model?
  * Sparse data, again.

Why does including closed-class words in the word model improve overall performance?
  * One can see from the distinct lists of closed-class words that are constructed for each sense of *hard* that the distributions of closed-class words with respect to word sense are quite distinct and therefore informative. Furthermore, by including closed-class words one *excludes* open-class words that may be, say, 5 or 6 words away from the target word and are hence less informative clues to the target word sense.

To see if the data really is too sparse for consistent results, try a different seed for the random number generator by editing the `random.seed(...)` call in the definition of `wst_classifier` to use the seed value from the comment on that line instead of the one it has been using. Then try again and see how, if at all, the trend as `number` increases is different.

In [None]:
for n in [100, 200, 300, 400]:
    for stopwords in [[], STOPWORDS]:
        stop = 'stopwords' if stopwords else 'no stopwords'
        print('Word features with number: %s and %s'%(n, stop))
        wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features, number=n, stopwords_list=stopwords) 

It seems slightly odd that the word features for 'hard.pos' include _harder_ and _hardest_. Try using a stopwords list which adds them to STOPWORDS: is the effect what you expected? Can you explain it?

In [None]:
wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features, number=300, stopwords_list=STOPWORDS)
wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features, number=300, stopwords_list=STOPWORDS+['harder', 'hardest'])

The accuracy goes down. This is to be expected if a particular word sense is more likely than the others to appear together with *harder* and *hardest*: removing the two words removes relevant information, and their places in the vocabulary are taken by less frequent words. 
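
The replacement effect can be seen on a toy example. The sketch below uses a hypothetical helper `top_n` (a much simplified analogue of `extract_vocab`) and made-up word counts: adding a frequent word to the stopword list promotes a rarer word into the vocabulary.

```python
from collections import Counter


def top_n(words, stopwords, n):
    """Return the n most frequent words that are not in `stopwords`."""
    counts = Counter(w for w in words if w not in stopwords)
    return [w for w, _count in counts.most_common(n)]


# Made-up context words, for illustration only.
words = ["harder"] * 5 + ["work"] * 4 + ["time"] * 3 + ["seat"] * 2 + ["crust"] * 1

vocab_before = top_n(words, stopwords=set(), n=3)
vocab_after = top_n(words, stopwords={"harder"}, n=3)

print(vocab_before)
print(vocab_after)
```

Here *seat* enters the vocabulary only because *harder* was removed; whether such a swap helps or hurts depends on how informative each word is about the sense.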

# Error analysis
The function `wst_classifier` allows you to explore the errors of the model it creates:

#### Confusion Matrix

You can output a confusion matrix as follows: 

In [None]:
wst_classifier(NaiveBayesClassifier.train, 'hard.pos',
               wsd_context_features, distance=3, confusion_matrix=True)


Note that the rows in the matrix are the gold labels, and the columns are the estimated labels. Recall that the diagonal line represents the number of items that the model gets right. 
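
As a rough sketch of how the matrix relates to per-sense error rates, the example below builds the cell counts by hand from made-up gold and predicted labels. It does not use `nltk.ConfusionMatrix`, and the label lists are invented for illustration:

```python
from collections import Counter

# Made-up gold and predicted labels, for illustration only.
gold = ["HARD1", "HARD1", "HARD1", "HARD2", "HARD2", "HARD3"]
predicted = ["HARD1", "HARD1", "HARD2", "HARD1", "HARD2", "HARD1"]

cells = Counter(zip(gold, predicted))  # cells[(gold_label, predicted_label)] = count
labels = sorted(set(gold))             # one row (and one column) per gold label

error_rates = {}
for g in labels:
    row = [cells[(g, p)] for p in labels]
    row_total = sum(row)
    # Everything off the diagonal in this row is an error for gold label g
    error_rates[g] = (row_total - cells[(g, g)]) / row_total
    print(g, row, "error rate: %.2f" % error_rates[g])
```

Dividing the off-diagonal count of a row by the row total gives the proportion of items with that gold label that the model gets wrong.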
#### Errors

You can also output each error from the test data into a file `errors.txt`. For example:



In [None]:
wst_classifier(NaiveBayesClassifier.train, 'hard.pos',
               wsd_context_features, distance=2, confusion_matrix=True, log=True)

Use your favourite editor to look at `errors.txt`.
You will find it in the same directory as this notebook.

In `errors.txt`, the example number on the first line of each entry is the (list) index of the error in the test_data. 

#### Now it's your turn:

1. Choose your best performing model from your earlier trials, and train it again, but add the arguments `confusion_matrix=True` and `log=True`.
2. Using the confusion matrix, identify which sense is the hardest one for the model to estimate.
3. Look in `errors.txt` for examples where that hardest word sense is the correct label. Do you see any patterns or systematic errors? If so, can you think of a way to adapt the feature vector so as to improve the model? 

In [None]:
wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_context_features,  distance=1, confusion_matrix=True, log=True) 

In [None]:
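# Per-sense error rates (errors / total for each gold label),
# using counts read off the rows of the confusion matrix
print((14+8)/(680+14+8))
print((20+3)/(20+65+3))
print((17+7)/(17+7+53))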

HARD3 is the most difficult sense for the classifier. There isn't one right answer for this question; it is meant to invite speculation and get you thinking about your classifier.

The most obvious pattern is that HARD1 is extremely dominant in the number of examples. This can be seen in the classification results, as the majority of errors come from HARD2 and HARD3 being misclassified as HARD1.

It should also be noted that the classifier used only looks at words at a distance of 1, which really isn't very much context. Given the sparsity of the data, a lot of the errors probably come simply from the fact that a particular context has not been seen before. For example, _HARD shoulders_ and _HARD soled shoes_ seem like obvious examples of HARD3, but they have been classed as HARD1; most likely these contexts were simply not found in the dataset.

One thing that happens quite a few times is that HARD ends up next to an adverb or another adjective, as in _slightly HARDER_, which could appear next to any sense of the word. It might be useful always to include information about the POS context in which the word appears, or parsing information: HARD3 will generally attach to nouns, as in _hard seats_ or _hard hats_, while HARD1 has more of a spread. 

Again, there isn't one correct answer; see what you can spot and try to come up with some reasonable suggestions.