# Week 8: Part-of-Speech Tagging 

This week we learn about part-of-speech (POS) tagging.  This involves deciding the correct part-of-speech tag (e.g., noun, verb, adjective) for each word in a sentence.  Since the correct tag for each word depends not only on the word itself but also on the tags of the words around it, POS tagging is generally viewed as a **sequence labelling** problem.  In other words, for a given sequence of words, we ask: what is the most likely sequence of tags?


In [None]:
import sys
sys.path.append(r'\\ad.susx.ac.uk\ITS\TeachingResources\Departments\Informatics\LanguageEngineering\resources')
sys.path.append(r'/Users/juliewe/resources')

import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import zip_longest
%matplotlib inline
import random
import math
import operator

## Average PoS tag ambiguity 
The Part-of-Speech (PoS) tag ambiguity of a word type is a measure of how varied the PoS tags are for that type.   Note that here, we talk about the ambiguity of a word type rather than a word token, because any given token has a single tag, but different occurrences of the same type may have different tags.  For example, some occurrences of the word *bank* have the tag *noun* whereas others have the tag *verb*.
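
The type/token distinction can be illustrated with a minimal sketch. The tagged tokens below are made up for illustration (they are not taken from the WSJ corpus), and `tags_by_type` is just a name chosen for this example.

```python
from collections import defaultdict

# Toy tagged tokens (made up for illustration, not taken from the WSJ corpus).
tagged_tokens = [
    ("the", "DT"), ("bank", "NN"), ("will", "MD"), ("bank", "VB"),
    ("the", "DT"), ("cheque", "NN"),
]

# Each token carries exactly one tag; a type collects the tags of all its tokens.
tags_by_type = defaultdict(set)
for token, tag in tagged_tokens:
    tags_by_type[token].add(tag)

for word_type, tags in sorted(tags_by_type.items()):
    print(word_type, sorted(tags))
```

Here there are six tokens but only four types, and the type *bank* is ambiguous because its two tokens carry different tags.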

Some types are always (or almost always) labelled with the same PoS tag, so exhibit no (or very little) ambiguity. It is easy to predict the correct PoS tag for such words. 

On the other hand, a type that is commonly labelled by a variety of different PoS tags exhibits a high level of ambiguity, and is more challenging to deal with.

In this session, we are going to consider two measures of a type's ambiguity. We will use the Wall Street Journal (WSJ) corpus, as it has been hand-annotated with part-of-speech tags.
We will consider:
* a simple measure that just **counts** the number of different tags that label the type. 
* a more complex information-theoretic measure based on **entropy**.

First, we create an instance of a `WSJCorpusReader`.  Then we can use the method `tagged_words()` to get a list of all tokens in the corpus tagged with their POS.

In [None]:
from sussex_nltk.corpus_readers import WSJCorpusReader

wsjreader=WSJCorpusReader()
taggedWSJ=wsjreader.tagged_words()
for i,(token,tag) in enumerate(taggedWSJ):
    print(i,token,tag)
    if i>10:
        break

### Exercise 1.1
Write a function `find_tag_distributions(tokentaglist)` which finds the (frequency) distributions of tags for every word in its input.
* input: a list of pairs (token,tag)
* returns: a dictionary of dictionaries.  The key to the outermost dictionary should be the word and the key to each internal dictionary should be the tag.  The value associated with the tag in the internal dictionary should be its frequency of occurrence.

Note that this exercise is very similar to Ex1.1 in Lab_7_1

Test your function on `taggedWSJ` and look at the tag distribution for the word `bank`.  You should find that you get:

`{NN: 521,
VB: 1,
VBP: 1}`
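
The dictionary-of-dictionaries structure described in this exercise might look like the following on toy data. This is only one possible sketch: the helper name `toy_tag_distributions` and the input pairs are hypothetical, and your own `find_tag_distributions` may be organised differently.

```python
from collections import Counter, defaultdict

def toy_tag_distributions(token_tag_pairs):
    """Map each word to a dictionary of {tag: frequency} (hypothetical helper)."""
    distributions = defaultdict(Counter)
    for token, tag in token_tag_pairs:
        distributions[token][tag] += 1
    # Convert to plain dictionaries so the result prints cleanly.
    return {word: dict(tag_counts) for word, tag_counts in distributions.items()}

# Made-up input pairs, not WSJ data.
toy_pairs = [("bank", "NN"), ("bank", "NN"), ("bank", "VB"), ("blue", "JJ")]
toy_dists = toy_tag_distributions(toy_pairs)
print(toy_dists)
```

The outer key is the word and the inner key is the tag, so `toy_dists["bank"]["NN"]` is the number of times *bank* was tagged `NN` in the toy input.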



### Exercise 1.2
Write a function `simple_pos_ambiguity` which takes the tagged WSJ text and returns a dictionary containing the number of distinct part-of-speech tags for each word type.  Note that this is simply the length of the dictionary associated with that word in the output from `find_tag_distributions`.

Check that you get the following results:
* bank: 3
* blue: 2
* walk: 3

### Exercise 1.3
Find the mean average value of the `simple_pos_ambiguity` score for word types in the WSJ.

## Entropy as a Measure of Tag Ambiguity

**Entropy** is a measure of uncertainty. A word has its highest entropy when it occurs the same number of times with each of its parts of speech: there is then maximum uncertainty as to which part of speech it has.

The larger the part of speech tagset, the greater the potential for uncertainty, and the higher the entropy can be.

In the cell below we see a function `entropy`. Its argument is a list of counts (which in our case are counts of how many times a word appeared with a given part of speech).

Check that you understand how the code implements this definition of entropy:
$$H([x_1,\ldots,x_n])= - \sum_{i=1}^nP(x_i)\log_2 P(x_i)$$
where $n$ is the number of PoS tags, $x_i$ is a count of how many times the word was labelled with the $i$th PoS tag, and $P(x_i) = x_i / \sum_{j=1}^n x_j$ is the corresponding relative frequency.
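
As a worked example (the counts are chosen purely for illustration), a word seen 3 times with one tag and once with another has tag probabilities $\tfrac{3}{4}$ and $\tfrac{1}{4}$, so
$$H([3,1]) = -\left(\tfrac{3}{4}\log_2\tfrac{3}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right) \approx 0.311 + 0.5 = 0.811$$
whereas a word seen equally often with two tags reaches the maximum for two tags:
$$H([2,2]) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1$$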

In [None]:
def entropy(counts):                # counts = list of counts of occurrences of tags
    total = sum(counts)             # get total number of occurrences
    if not total:                   # if zero occurrences in total, then 0 entropy
        return 0
    weighted_log_sum = 0            # running sum of p * log2(p)
    for count in counts:            # for each tag count
        p = count / total           # probability that the token occurs with this tag
        try:
            weighted_log_sum += p * math.log(p, 2)   # add this tag's contribution
        except ValueError:          # math.log raises ValueError if p == 0, so ignore this p
            pass
    # Only negate if nonzero; otherwise floats can return -0.0, which is weird.
    return -weighted_log_sum if weighted_log_sum else weighted_log_sum


### Exercise 2.1
Experiment with the `entropy` function.
- It takes a list of counts as its argument.
- Compare the entropy of a list where all counts are the same with the entropy of a list of different counts.
- See what happens when you vary the length of the list of counts.

### Exercise 2.2
Write a function `entropy_ambiguity` which takes the tagged WSJ text and returns a dictionary containing the entropy of each word.

Test out your function; you should find:

`bank: 0.04004053596567404
blue: 0.4394969869215134
walk: 1.3127443531093745
show: 1.5322594002899546`

How does this correspond to our intuitions about which word types are more difficult to correctly POS tag?

## A Simple Unigram Tagger
Now, we will be looking at part-of-speech tagging itself, i.e., the problem of determining the correct tag for a given word token. We will

* implement a unigram tagger
* experiment with an off-the-shelf POS tagger which utilises information about the previous words or tags in the sequence.

First, let's get some tagged text from the WSJ and split it into a training and a testing set.

In [None]:
def get_train_test_pos(split=0.7):

    from sussex_nltk.corpus_readers import WSJCorpusReader
    wsjreader=WSJCorpusReader()
    taggedWSJ=wsjreader.tagged_words()
    taggedlist=list(taggedWSJ)
    
    #we don't want to randomly select data because we need to preserve sequence information
    #so we are just going to take the first part as training and the second as test
    n=int(len(taggedlist)*split)
    return taggedlist[:n],taggedlist[n:]

train, test = get_train_test_pos(split=0.8)



Now, we build a unigram model of the tag distribution for each word type.  We use the `find_tag_distributions` function defined earlier and store the result in the variable `unigram_model`.

In [None]:
unigram_model=find_tag_distributions(train)

### Exercise 3.1
Write a `uni_pos_tag` function which takes:
* a sequence of tokens \[wordtoken1,wordtoken2, ....\]
* a unigram model (stored as a dictionary of dictionaries)

and returns:
* a tagged sequence of tokens \[(wordtoken1,tag1),(wordtoken2,tag2),....\]



### Exercise 3.2
Test that your function works on both the training data `train` and the testing data `test`.  Remember, you can separate the tokens and the tags into two separate lists using:
* `train_toks,train_tags=zip(*train)`
* `test_toks,test_tags=zip(*test)`

Don't worry about evaluating the accuracy at this point (that's the next exercise) - just check that you can generate sequences of (token,tag) pairs in both cases.  What happens if there is a word in the test data that didn't occur in the training data?  You might need to update your `uni_pos_tag` function to take this into account.
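
One way to handle unseen words can be sketched as follows. This is a minimal illustration on a toy model, not the required solution: the helper name `toy_uni_pos_tag` is hypothetical, and falling back to `"NN"` for unknown words is an assumption (another reasonable choice is the most frequent tag overall in the training data).

```python
def toy_uni_pos_tag(tokens, model, default_tag="NN"):
    """Tag each token with its most frequent tag in `model`.

    `default_tag` is an assumed fallback for words missing from the model.
    """
    tagged = []
    for token in tokens:
        tag_counts = model.get(token)
        if tag_counts:
            # Pick the tag with the highest training frequency.
            best_tag = max(tag_counts, key=tag_counts.get)
        else:
            # Unseen word: no distribution available, so use the fallback.
            best_tag = default_tag
        tagged.append((token, best_tag))
    return tagged

# Toy model with made-up counts; "collapsed" is deliberately missing.
toy_model = {"the": {"DT": 10}, "bank": {"NN": 5, "VB": 1}}
toy_tagged = toy_uni_pos_tag(["the", "bank", "collapsed"], toy_model)
print(toy_tagged)
```

Without the fallback branch, looking up *collapsed* directly in the model would raise a `KeyError`.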

### Exercise 3.3
Write a function `evaluate_uni_pos_tag` which will calculate the accuracy of the `uni_pos_tag` function. This should have as arguments:
* the unigram_model
* the gold standard sequence of (token,tag) pairs for comparison

You should find that it is 94.6% accurate on the training data.  How accurate is it on the test data? 
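
Accuracy here means the proportion of tokens whose predicted tag matches the gold standard tag. The sketch below shows that calculation on made-up tag sequences; the helper name `toy_accuracy` is hypothetical, and it compares two tag lists directly rather than taking a model as an argument.

```python
def toy_accuracy(predicted_tags, gold_tags):
    """Return the fraction of positions where the predicted tag equals the gold tag."""
    if len(predicted_tags) != len(gold_tags):
        raise ValueError("predicted and gold sequences must have the same length")
    if not gold_tags:
        return 0.0
    correct = sum(1 for p, g in zip(predicted_tags, gold_tags) if p == g)
    return correct / len(gold_tags)

# Made-up tag sequences for illustration.
gold = ["DT", "NN", "VBD", "RB"]
predicted = ["DT", "NN", "NN", "RB"]
score = toy_accuracy(predicted, gold)
print(f"{score:.1%}")
```

Three of the four predicted tags match the gold tags, so the score is 0.75.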

As an extension, you could implement a `uni_pos_tagger` class, which combines all of the functionality above, and then provide an `evaluate` function which evaluates a tagger. 


## Beyond Unigram Tagging
State-of-the-art POS-taggers use information about likely sequences of tags to get higher performance.

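The `pos_tag` function provided by nltk uses a pre-trained model. Depending on the nltk version, this is a maximum entropy Markov model (MEMM) or, in more recent releases, an averaged perceptron tagger.  We can run it on our sequences of tokens in the same way as our `uni_pos_tag` function.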

In [None]:
from nltk import pos_tag

pos_tag(train_toks)

### Exercise 4.1
Write or adapt your code so that you can evaluate the nltk `pos_tag` function on the training and testing data, as divided above.  What improvement over unigram tagging does the nltk POS tagger provide?


### Extension
Find examples where the unigram tagger makes mistakes but the nltk pos tagger is correct.  What different types of errors are being made?  Can you explain intuitively why the correct sequence predicted by the nltk pos tagger is more likely than the one predicted by the unigram tagger?
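
A possible starting point for this comparison is sketched below. The helper name `find_disagreements` is hypothetical, and the token and tag sequences are made up for illustration; in practice you would pass in the gold tags and the outputs of the two taggers on the same tokens.

```python
def find_disagreements(tokens, gold_tags, tags_a, tags_b):
    """List positions where tagger A is wrong but tagger B matches the gold tag."""
    cases = []
    for i, (token, gold, a, b) in enumerate(zip(tokens, gold_tags, tags_a, tags_b)):
        if a != gold and b == gold:
            # Record position, token, gold tag and tagger A's incorrect tag.
            cases.append((i, token, gold, a))
    return cases

tokens = ["they", "bank", "online"]
gold_tags = ["PRP", "VBP", "RB"]
unigram_tags = ["PRP", "NN", "RB"]      # made-up unigram tagger output
sequence_tags = ["PRP", "VBP", "RB"]    # made-up sequence tagger output
cases = find_disagreements(tokens, gold_tags, unigram_tags, sequence_tags)
print(cases)
```

Grouping the resulting cases by their (gold tag, wrong tag) pair is one way to see which types of error are most common.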