# Topic 4: Part-of-speech (PoS) Tagging

## Preliminaries 
Run this cell.

In [6]:
import sys
sys.path.append(r'\\ad.susx.ac.uk\ITS\TeachingResources\Departments\Informatics\LanguageEngineering\resources')
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import collections
from collections import defaultdict,Counter
from itertools import zip_longest
from IPython.display import display
from random import seed
import random
import math
import matplotlib.pylab as pylab
%matplotlib inline
params = {'legend.fontsize': 'large',
          'figure.figsize': (15, 5),
         'axes.labelsize': 'large',
         'axes.titlesize':'large',
         'xtick.labelsize':'large',
         'ytick.labelsize':'large'}
pylab.rcParams.update(params)
from pylab import rcParams
from operator import itemgetter, attrgetter, methodcaller
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
import seaborn as sns
import csv

### Tokens vs Types
This session concerns the task of part-of-speech tagging. It is loosely divided into 2 parts: the first part deals with the notion of PoS ambiguity of a vocabulary type; and the second part compares the performance of two taggers on various corpora.

We will be making an important distinction between tokens and types. A sentence in a document is made up of a sequence of tokens. For example, the list
`["the", "cat", "sat", "on", "the", "mat", "."]`  
contains 7 tokens, but only 6 distinct strings: there are two occurrences of `"the"`. 

The way we say this is that there are 6 **types** in the sentence, but 7 **tokens**. Tokens are occurrences of types.
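As a small sketch of the distinction, the counts for the example sentence can be checked directly in Python. The variable names here are only illustrative.

```python
# Count tokens and types in the example sentence
tokens = ["the", "cat", "sat", "on", "the", "mat", "."]
types = set(tokens)          # a set keeps one copy of each distinct string

num_tokens = len(tokens)     # every occurrence counts
num_types = len(types)       # each distinct string counts once
print(num_tokens, num_types)
```

Converting the list to a set is what collapses the two occurrences of `"the"` into a single type.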

In this session we will be looking at the ambiguity of types, not tokens.

## Average PoS tag ambiguity 
The Part-of-Speech (PoS) tag ambiguity of a type is a measure of how varied the PoS tags are for that type. 

Some types are always (or almost always) labelled with the same PoS tag, so exhibit no (or very little) ambiguity. It is easy to predict the correct PoS tag for such words. 

On the other hand, a type that is commonly labelled by a variety of different PoS tags exhibits a high level of ambiguity, and is more challenging to deal with.

In this session, we are going to be considering two measures of a type's ambiguity. 

In this section, we consider a simple measure that just counts the number of different tags that label the type. 

In the next section we will look at a more complex information-theoretic measure based on entropy.

### Exercise
In the blank cell below, create a function `simple_pos_ambiguity`. 

Here is the docstring for `simple_pos_ambiguity`:
```
    """
    For each type in the Wall Street Journal corpus, this 
    function determines the number of different PoS tags that
    the type has been assigned.

    :param none
    :return: A dictionary (hashmap) mapping each type to its 
            degree of ambiguity (the number of distinct PoS tags 
            that the type is labelled with in the Wall Street 
            Journal Corpus).
    """
```

Create `simple_pos_ambiguity` as follows:

1. Create a Wall Street Journal corpus reader
2. Use the corpus reader's method `tagged_words`, to get a list of all tokens in the corpus tagged with their PoS (e.g. if your corpus reader is called `wsj_reader`, then you'd call `wsj_reader.tagged_words()`). This method is available because the Wall Street Journal corpus has been hand-annotated with PoS tags.
3. For each type, build a set containing all of the different PoS tags that are assigned to that type. So if in the Wall Street Journal corpus "red" occurred only as a noun and an adjective, then its set would contain just these two part-of-speech tags. The size (cardinality) of the set is the ambiguity of that type. See below for details.
4. Return a Python dictionary (hashmap) mapping each type to its ambiguity.  

Some useful hints:
- It will be useful to have this line: `from collections import defaultdict`.
- See https://docs.python.org/3/library/collections.html#collections.defaultdict for how to use `defaultdict`.
- Think carefully about what is an appropriate type to give `defaultdict` as a parameter.
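
As a minimal sketch of the pattern these hints point at, the following uses a small hand-made list of tagged tokens rather than the Wall Street Journal corpus. The data, the tags and the variable names are illustrative assumptions. `defaultdict(set)` creates an empty set the first time a key is looked up, so there is no need to check whether a type has been seen before.

```python
from collections import defaultdict

# Hypothetical toy data standing in for wsj_reader.tagged_words()
toy_tagged_words = [("red", "JJ"), ("red", "NN"), ("the", "DT"),
                    ("red", "JJ"), ("the", "DT")]

tags_by_type = defaultdict(set)      # missing keys start out as an empty set
for token, tag in toy_tagged_words:
    tags_by_type[token].add(tag)     # adding a repeated tag leaves the set unchanged

# The ambiguity of a type is the size of its tag set
toy_ambiguity = {ty: len(tags) for ty, tags in tags_by_type.items()}
print(toy_ambiguity)
```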


In [23]:
# %load solutions/simple_ambiguity
from collections import defaultdict
from sussex_nltk.corpus_readers import WSJCorpusReader

def simple_pos_ambiguity():
    """
    For each type in the Wall Street Journal corpus, this 
    function determines the number of different PoS tags that
    the type has been assigned.

    :param none
    :return: A dictionary (hashmap) mapping each type to its 
            degree of ambiguity (the number of distinct PoS tags 
            that the type is labelled with in the Wall Street 
            Journal Corpus).
    """
    wsj_reader = WSJCorpusReader()    #Create a new reader
    tags_dict = defaultdict(set)
    for tok,tag in wsj_reader.tagged_words():
        tags_dict[tok].add(tag)
    count_dict = defaultdict(int)
    for ty in tags_dict.keys():
        count_dict[ty] = len(tags_dict[ty])
    return count_dict

### Exercise
In the blank cell below, check that the ambiguity of "*blue*" is 2 in the Wall Street Journal corpus. It occurs as a noun and adjective only.

In [42]:
ambiguities = simple_pos_ambiguity()   # avoid shadowing the built-in `dict`
ambiguities["will"]

3

In [14]:
# %load solutions/blue


### Exercise
In the blank cell below, write code to find the average ambiguity of words in the Wall Street Journal corpus.

This might be useful:  
`from numpy import mean`


In [24]:
# %load solutions/average_ambiguity
from numpy import mean   # scipy's `mean` alias is deprecated in favour of numpy's
ambiguities = simple_pos_ambiguity()
mean(list(ambiguities.values()))   # convert the dict view to a list before averaging


1.1663008576189895

## Entropy as a measure of ambiguity 

In this activity, you are given a function that calculates PoS ambiguity in a different way, using the notion of [entropy](http://en.wikipedia.org/wiki/Entropy_(information_theory)). 

Below we will find a function `get_entropy_ambiguity` that is used to get a measure of the PoS ambiguity of a word in the Wall Street Journal corpus based on entropy.

First let's get a sense of how entropy works.

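Entropy is a measure of uncertainty. For a given set of tags, a word has the highest entropy when it occurs the same number of times with each part of speech: there is then maximum uncertainty as to which part of speech it has. A word that always occurs with the same part of speech has zero entropy.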

The larger the part of speech tagset, the greater the potential for uncertainty, and the higher the entropy can be.

### Exercise
In the cell below we see a function `entropy`. Its argument is a list of counts (which in our case are counts of how many times a word appeared with a given part of speech).

Check that you understand how the code implements this definition of entropy:
$$H([x_1,\ldots,x_n])=-\sum_{i=1}^nP(x_i)\log_2 P(x_i)$$
where $n$ is the number of PoS tags, $x_i$ is a count of how many times the word was labelled with the $i$th PoS tag, and $P(x_i)=x_i/\sum_{j=1}^n x_j$ is the proportion of the word's occurrences that carry that tag.
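
As a worked example, take a hypothetical word seen twice with each of three tags, so the counts are $[2,2,2]$ and each $P(x_i)=2/6=1/3$. Taking the negated sum, which is what the code returns (`return -entropy`):
$$H([2,2,2])=-3\cdot\tfrac{1}{3}\log_2\tfrac{1}{3}=\log_2 3\approx 1.585$$
By contrast, a word that always carries the same tag, say with counts $[6]$, has $P(x_1)=1$ and so
$$H([6])=-1\cdot\log_2 1=0$$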

In [25]:
from math import log

def entropy(counts):            # counts = list of counts of occurrences of tags
    total = sum(counts)         # get total number of occurrences
    if not total: return 0      # if zero occurrences in total, then 0 entropy
    entropy = 0
    for i in counts:            # for each tag count
        p = i/total      # probability that the token occurs with this tag
        try:
            entropy += p * log(p,2) # add to entropy
        except ValueError: pass     # if p==0, then ignore this p
    return -entropy if entropy else entropy   # only negate if nonzero, otherwise 
                                              # floats can return -0.0, which is weird.


### Exercise
In the empty cell below, experiment with the `entropy` function.
- It takes a list of counts as its argument.
- Compare the entropy of a list where all counts are the same with the entropy of a list of different counts.
- Investigate the effect of varying the length of the list of counts.

In [37]:
l = [2, 2, 2]
l2 = [1, 2, 0, 3, 6, 4, 5, 7, 3, 8, 2, 4]
print(entropy(l), entropy(l2))

1.584962500721156 3.263393653274255


We are now ready to look at the `get_entropy_ambiguity` function.

Although it isn't efficient, in order to keep the code simple, `get_entropy_ambiguity` only computes the ambiguity of one word for any given call. This means that to find the average entropy of all of the types in the corpus, you would have to call the function once per type.
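
For comparison, here is a sketch of how the entropy of every type could be computed in a single pass over the tagged tokens, rather than one pass per word. It is not the course's implementation: `entropy_from_counts` and `entropy_ambiguities` are hypothetical names, and the toy list stands in for `WSJCorpusReader().tagged_words()`.

```python
from collections import Counter, defaultdict
from math import log2

def entropy_from_counts(counts):
    """Entropy (in bits) of a collection of non-negative tag counts."""
    total = sum(counts)
    if not total:
        return 0.0
    # zero counts contribute nothing, so skip them
    return sum(-(c / total) * log2(c / total) for c in counts if c)

def entropy_ambiguities(tagged_words):
    """Map each type to the entropy of its tag counts, in one pass over the data."""
    tag_counts = defaultdict(Counter)          # type -> Counter of tags
    for token, tag in tagged_words:
        tag_counts[token][tag] += 1
    return {ty: entropy_from_counts(counts.values())
            for ty, counts in tag_counts.items()}

# Hypothetical toy data (made-up counts, not taken from the corpus)
toy = [("will", "MD"), ("will", "MD"), ("will", "NN"), ("will", "MD"),
       ("value", "NN"), ("value", "VB")]
toy_entropies = entropy_ambiguities(toy)
average_entropy = sum(toy_entropies.values()) / len(toy_entropies)
print(toy_entropies, average_entropy)
```

The trade-off is memory: this version keeps a counter for every type at once, whereas `get_entropy_ambiguity` only ever holds the counts for one word.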

### Exercise
Have a careful look at the code for `get_entropy_ambiguity` in the cell below.

Note that the code below uses `try-except` statements. The code under the try statement is executed, and if an exception is raised, then the code under the except statement is executed. 
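
As a small illustration of this behaviour, the sketch below wraps the same `log` call in a hypothetical helper, `safe_log2`, which is not part of the course code. `math.log` raises a `ValueError` when its argument is zero, and that is the exception the `entropy` function catches and ignores.

```python
from math import log

def safe_log2(p):
    """Return log2(p), or None when p is outside the domain of log (p <= 0)."""
    try:
        return log(p, 2)          # raises ValueError for p == 0
    except ValueError:
        return None               # runs only if the try block raised ValueError

print(safe_log2(0.5))
print(safe_log2(0))
```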

In [43]:
from math import log
from sussex_nltk.corpus_readers import WSJCorpusReader
from collections import defaultdict

def get_entropy_ambiguity(word):
    """Get the PoS ambiguity of *word* according to its occurrence in WSJ."""
    pos_counts = defaultdict(int)       # keep track of the number of times *word* 
                                        # appears with each PoS tag
    for token, tag in WSJCorpusReader().tagged_words():   
        if token == word:               
            pos_counts[tag] += 1
    return entropy(pos_counts.values())

def entropy(counts):            # counts = list of counts of occurrences of tags
    total = sum(counts)         # get total number of occurrences
    if not total: return 0      # if zero occurrences in total, then 0 entropy
    entropy = 0
    for i in counts:            # for each tag count
        p = i/total      # probability that the token occurs with this tag
        try:
            entropy += p * log(p,2) # add to entropy
        except ValueError: pass     # if p==0, then ignore this p
    return -entropy if entropy else entropy   # only negate if nonzero, otherwise 
                                              # floats can return -0.0, which is weird.
    
# Usage:
print('Ambiguity of "will": {0:.4f}'.format(get_entropy_ambiguity("will")))
print('Ambiguity of "value": {0:.4f}'.format(get_entropy_ambiguity("value")))

Ambiguity of "will": 0.0781
Ambiguity of "value": 0.0756


### Exercise
- Use your simple measure of PoS ambiguity (from the previous section) to calculate the PoS ambiguity of the words "*either*" and "*value*". 
- Now do the same with the entropy-based ambiguity measure. 
- How do the measures differ? 
- Which measure produces a more representative figure for how ambiguous the PoS of a type is?
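
When thinking about these questions, it may help to see the two measures side by side on made-up counts. The sketch below uses hypothetical tag counts for two imaginary types, not figures from the corpus, and a compact helper `entropy_bits` that is assumed to be equivalent to the `entropy` function above.

```python
from math import log2

def entropy_bits(counts):
    """Entropy (in bits) of a list of non-negative tag counts."""
    total = sum(counts)
    if not total:
        return 0.0
    return sum(-(c / total) * log2(c / total) for c in counts if c)

# Hypothetical tag counts for two made-up types (not taken from the WSJ corpus)
mostly_one_tag = [99, 1]    # almost always the same tag
evenly_split = [50, 50]     # the two tags are equally likely

for name, counts in [("mostly_one_tag", mostly_one_tag), ("evenly_split", evenly_split)]:
    simple = len([c for c in counts if c > 0])   # simple measure: number of distinct tags
    print(name, simple, round(entropy_bits(counts), 4))
```

Both types have the same simple ambiguity, because each has been seen with two tags, but their entropies differ because entropy takes the relative frequencies of the tags into account.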

## Experiment with PoS taggers
In this section you will have a chance to use two different Part-of-Speech taggers: the NLTK Maximum Entropy PoS tagger; and the Twitter-specific PoS tagger from Gimpel et al.

The following code shows you how to use these taggers.

In [None]:
from sussex_nltk.corpus_readers import ReutersCorpusReader
from sussex_nltk.tag import twitter_tag_batch
from nltk import pos_tag
from nltk.tokenize import word_tokenize

number_of_sentences = 10     #Number of sentences to sample and display
rcr = ReutersCorpusReader()  #Create a corpus reader
sentences = rcr.sample_raw_sents(number_of_sentences)  #Sample some sentences

#Tag with twitter specific tagger
# - it also tokenises for you in a twitter specific way
twitter_tagged = twitter_tag_batch(sentences)   

#Tag with NLTK's maximum entropy tagger         
nltk_tagged = [pos_tag(word_tokenize(sentence)) for sentence in sentences] 

#Print results for each sentence
for raw, twitter_sentence, nltk_sentence in zip(sentences,twitter_tagged,nltk_tagged):
    print("\n",raw,"\n")
    df = pd.DataFrame(list(zip_longest([(token,tag) for token,tag in nltk_sentence],
                                       [(token,tag) for token,tag in twitter_sentence])),
                      columns=["nltk tagger","twitter tagger"])
    print(df)

### Exercise
Make a copy of the cell above that ran the two taggers on a sample of Reuters data, and move the copy to be positioned below this cell.

Adapt the code so that it runs both taggers on a sample of sentences from the Reuters, Medline and Twitter corpora.

Then run the code and try to observe limitations and strengths of the taggers on the various corpora.


In [49]:
# %load solutions/tag_all_corpora
from sussex_nltk.corpus_readers import ReutersCorpusReader, MedlineCorpusReader, TwitterCorpusReader
from sussex_nltk.tag import twitter_tag_batch
from nltk import pos_tag
from nltk.tokenize import word_tokenize

number_of_sentences = 10     # Number of sentences to sample and display

# Create one corpus reader per corpus
readers = [("Reuters", ReutersCorpusReader()),
           ("Medline", MedlineCorpusReader()),
           ("Twitter", TwitterCorpusReader())]

for name, reader in readers:
    sentences = reader.sample_raw_sents(number_of_sentences)  # Sample some sentences

    # Tag with twitter specific tagger
    # - it also tokenises for you in a twitter specific way
    twitter_tagged = twitter_tag_batch(sentences)

    # Tag with NLTK's maximum entropy tagger
    nltk_tagged = [pos_tag(word_tokenize(sentence)) for sentence in sentences]

    # Print results for each sentence
    print("-----------------------------------------")
    print(name, "Sample")
    print("-----------------------------------------")
    for raw, twitter_sentence, nltk_sentence in zip(sentences, twitter_tagged, nltk_tagged):
        print("\n", raw, "\n")
        df = pd.DataFrame(list(zip_longest([(token, tag) for token, tag in nltk_sentence],
                                           [(token, tag) for token, tag in twitter_sentence])),
                          columns=["nltk tagger", "twitter tagger"])
        print(df)

RuntimeError: Can't find an installed Java Runtime Environment (JRE).If you have installed java in a non standard location please call nltk.internals.config_java with the correct JRE path and options='-Xmx1g -XX:ParallelGCThreads=2' before calling sussex_nltk.cmu.tag

### Using PoS features for classification
In the blank cell below, investigate the performance of the Naïve Bayes classifier with two different feature extraction functions involving PoS information:
- A feature extraction function that returns just the PoS tags, i.e. no token.
- A feature extraction function that returns a new token that results from concatenating the token and its PoS.

How do these compare to the standard setup where no feature extractor is used?
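
As a starting point, the two feature extraction functions might look like the sketch below. The function names, the `"_"` separator and the example sentence are illustrative assumptions, and the input is assumed to be a list of `(token, tag)` pairs such as `pos_tag` returns. How the extractors are plugged into the Naïve Bayes classifier depends on the classifier code from the earlier sessions, which is not shown here.

```python
def pos_only_features(tagged_tokens):
    """Keep only the PoS tags, discarding the tokens themselves."""
    return [tag for _token, tag in tagged_tokens]

def token_pos_features(tagged_tokens, sep="_"):
    """Join each token to its PoS tag to form a single new token."""
    return [token + sep + tag for token, tag in tagged_tokens]

# Hypothetical tagged sentence in (token, tag) format
example = [("I", "PRP"), ("love", "VBP"), ("this", "DT"), ("film", "NN")]
print(pos_only_features(example))
print(token_pos_features(example))
```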


In [2]:
# %load solutions/classification_with_PoS
