# [Assignment #3: NPFL067 Statistical NLP II](http://ufal.mff.cuni.cz/~hajic/courses/npfl067/assign3.html)

## Tagging

### Author: Dan Kondratyuk

### March 28, 2018

---

This Python notebook compares Brill's Tagger with a trigram HMM tagger.

The code and an explanation of the results are fully viewable on this page.

## Files

- [index.html](./index.html) - Contains all viewable code and a summary of results
- [README.md](./README.md) - Instructions on how to run the code with Python
- [nlp-assignment-3.ipynb](./nlp-assignment-3.ipynb) - Jupyter notebook where code can be run
- [tag.py](./tag.py) - Contains HMM code for part 2
- [requirements.txt](./requirements.txt) - Required python packages for running

## 1. Brill's Tagger & Tagger Evaluation

> For this whole homework, use data found in `texten2.ptg`, `textcz2.ptg`
>
> In the following, "the data" refers to both English and Czech, as usual.
>
> Split the data in the following way: use last 40,000 words for testing (data S), and from the remaining data, use the last 20,000 for smoothing (data H, if any). Call the rest "data T" (training). 
>
> Download Eric Brill's supervised tagger from [UFAL's course assignment space](http://ufal.mff.cuni.cz/~hajic/courses/npfl067/RULE_BASED_TAGGER_V.1.14.tar.gz). Install it (i.e., uncompress (gunzip), untar, and make).
>
> You might need to make some changes in his makefile of course (it's an OLD program, in this fast changing world...).
>
> After installation, get the data, train it on as much data from T as time allows (in the package, there is an extensive documentation on how to train it on new data), and evaluate on data S. Tabulate the results.
>
> Do cross-validation of the results: split the data into S', [H',] T' such that S' is the first 40,000 words, and T' is the last but the first 20,000 words from the rest. Train Eric Brill's tagger on T' (again, use as much data as time allows) and evaluate on S'. Again, tabulate the results.
>
> Do three more splits of your data (using the same formula: 40k/20k/the rest) in some way or another (as different as possible), and get another three sets of results. Compute the mean (average) accuracy and the standard deviation of the accuracy. Tabulate all results. 

In [1]:
import numpy as np
import pandas as pd
import nltk
from sklearn.metrics import accuracy_score
import itertools
import dill as pickle
from collections import Counter, defaultdict
from tqdm import tqdm_notebook as tqdm, tnrange as trange

from subprocess import call

In [2]:
def open_text(filename):
    """Reads a text line by line, applies light preprocessing, and returns an array of words and tags"""
    with open(filename, encoding='iso-8859-2') as f:
        content = f.readlines()
    
    preprocess = lambda word: tuple(word.strip().rsplit('/', 1))
    
    return [preprocess(word) for word in content]
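
The preprocessing step splits each line on its last slash only, so a word that itself contains a slash keeps its full form. A minimal sketch of that behaviour follows; the sample lines are hypothetical and not taken from the corpus, and `parse_line` is an illustrative name for the `preprocess` lambda above.

```python
def parse_line(line):
    """Split one 'word/tag' line on its last slash, as open_text does."""
    return tuple(line.strip().rsplit('/', 1))

# Hypothetical sample lines, not taken from the corpus.
samples = ['The/DT\n', '1/2/CD\n', '###/###\n']
parsed = [parse_line(s) for s in samples]
print(parsed)  # [('The', 'DT'), ('1/2', 'CD'), ('###', '###')]
```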

In [3]:
def isplit(iterable, splitters):
    # https://stackoverflow.com/a/4322780
    return [list(g) for k,g in itertools.groupby(iterable, lambda x:x in splitters) if not k]

In [4]:
def sentence_split(data, token=('###', '###')):
    return isplit(data, (None, token))
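
To illustrate how the sentence splitting behaves, here is a small self-contained sketch on a hypothetical toy token stream. `itertools.groupby` collects consecutive boundary tokens into a single run, so doubled `###` markers do not produce empty sentences.

```python
import itertools

def isplit(iterable, splitters):
    """Group consecutive items, dropping runs of items found in splitters."""
    return [list(g) for k, g in itertools.groupby(iterable, lambda x: x in splitters) if not k]

def sentence_split(data, token=('###', '###')):
    return isplit(data, (None, token))

# Hypothetical toy stream of (word, tag) pairs with boundary markers.
toy = [('###', '###'), ('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'),
       ('###', '###'), ('###', '###'), ('Stop', 'VB')]
sentences = sentence_split(toy)
print(sentences)
# [[('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')], [('Stop', 'VB')]]
```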

In [5]:
def split_data(words, start=0):
    train, heldout, test = words[:start] + words[start+60_000:],  words[start+40_000:start+60_000], words[start:start+40_000]
    return train, heldout, test

In [6]:
def split_data_end(words):
    train, heldout, test = words[:-60_000],  words[-60_000:-40_000], words[-40_000:]
    return train, heldout, test

In [7]:
def split_all(words):
    return [
        split_data_end(words),
        split_data(words, start=40_000 * 0),
        split_data(words, start=40_000 * 1),
        split_data(words, start=40_000 * 2),
        split_data(words, start=40_000 * 3)
    ]
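
As a sanity check on the split logic, the sketch below re-declares the two split functions and runs them on a stand-in corpus of integers. The corpus size of 250,000 is an assumption chosen only so that every split fits; the real data are `(word, tag)` pairs.

```python
def split_data(words, start=0):
    test = words[start:start + 40_000]
    heldout = words[start + 40_000:start + 60_000]
    train = words[:start] + words[start + 60_000:]
    return train, heldout, test

def split_data_end(words):
    return words[:-60_000], words[-60_000:-40_000], words[-40_000:]

# Stand-in corpus: integers instead of (word, tag) pairs; the size is an assumption.
words = list(range(250_000))
splits = [split_data_end(words)] + [split_data(words, start=40_000 * k) for k in range(4)]

for train, heldout, test in splits:
    # Every split follows the 40k / 20k / rest formula.
    assert len(test) == 40_000 and len(heldout) == 20_000
    assert len(train) == len(words) - 60_000
    # Within one split, no token may appear in more than one part.
    assert not set(train) & set(test)
    assert not set(train) & set(heldout)
    assert not set(heldout) & set(test)

sizes = [(len(tr), len(h), len(te)) for tr, h, te in splits]
print(sizes[0])  # (190000, 20000, 40000)
```

Note that the parts are disjoint only within a split: the held-out block of one split overlaps the test block of the next one, which is harmless because each split is trained and evaluated separately.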

In [8]:
# Taken from https://github.com/nltk/nltk/blob/a84b28ca26ea3ee53da4eaafc2bbf037847779bd/nltk/tbl/demo.py
REGEXP_TAGGER = nltk.tag.RegexpTagger(
    [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),   # cardinal numbers
     (r'(The|the|A|a|An|an)$', 'AT'),   # articles
     (r'.*able$', 'JJ'),                # adjectives
     (r'.*ness$', 'NN'),                # nouns formed from adjectives
     (r'.*ly$', 'RB'),                  # adverbs
     (r'.*s$', 'NNS'),                  # plural nouns
     (r'.*ing$', 'VBG'),                # gerunds
     (r'.*ed$', 'VBD'),                 # past tense verbs
     (r'.*', 'NN')                      # nouns (default)
])
templates = nltk.tag.brill.brill24()

In [9]:
def brill_tagger(train, heldout, baseline_backoff_tagger=REGEXP_TAGGER, templates=templates, trace=0, 
                 ruleformat='str', max_rules=300, min_score=3, min_acc=None):
    baseline_tagger = nltk.tag.UnigramTagger(heldout, backoff=baseline_backoff_tagger)
    trainer = nltk.tag.BrillTaggerTrainer(baseline_tagger, templates, trace=trace, ruleformat=ruleformat)
    tagger = trainer.train(train, max_rules, min_score, min_acc)
    return tagger

In [10]:
def evaluate_brill(split, i=0, lang='', load=False):
    train, heldout, test = split
    
    filename = 'data/brill_tagger_{}_{}.pkl'.format(lang, i)
    
    print('Evaluating Brill Tagger {} [{}]'.format(lang, i))
    if load:
        with open(filename, 'rb') as f:
            tagger = pickle.load(f)
    else:
        tagger = brill_tagger([train], [heldout])
        with open(filename, 'wb') as f:
            pickle.dump(tagger, f)
    
    return tagger.evaluate([test])

In [11]:
def evaluate(tagger_type, eval_func, langs=('en', 'cz')):
    lang_d = {'en': ('English', splits_en), 'cz': ('Czech', splits_cz)}
    
    rows = []
    for lang in langs:
        language, splits = lang_d[lang]
        accuracies = [eval_func(split, i, lang) for i,split in enumerate(splits)]
        acc_str = ' '.join(['{0:0.1f}'.format(i * 100) for i in accuracies])
        row = [tagger_type, language, acc_str, np.mean(accuracies) * 100, np.std(accuracies) * 100]
        rows.append(row)

    columns = ['type', 'language', 'accuracies', 'mean', 'standard_deviation']
    results = pd.DataFrame(rows, columns=columns)
    return results

In [12]:
# Read the texts into memory
english = './data/texten2.ptg'
czech = './data/textcz2.ptg'

words_en = open_text(english)
words_cz = open_text(czech)

In [13]:
splits_en = split_all(words_en)
splits_cz = split_all(words_cz)

### Brill Results

In [14]:
brill_results = evaluate('Brill', evaluate_brill)
brill_results

Evaluating Brill Tagger en [0]
Evaluating Brill Tagger en [1]
Evaluating Brill Tagger en [2]
Evaluating Brill Tagger en [3]
Evaluating Brill Tagger en [4]
Evaluating Brill Tagger cz [0]
Evaluating Brill Tagger cz [1]
Evaluating Brill Tagger cz [2]
Evaluating Brill Tagger cz [3]
Evaluating Brill Tagger cz [4]


|   | type | language | accuracies | mean | standard_deviation |
|---|---|---|---|---|---|
| 0 | Brill | English | 90.4 90.7 90.6 90.2 87.6 | 89.924328 | 1.1614 |
| 1 | Brill | Czech | 61.9 70.2 64.7 64.0 65.7 | 65.28882 | 2.7449 |
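
The standard deviation in `evaluate` comes from `np.std`, which by default divides by n (population standard deviation). With only five splits, the sample estimate that divides by n - 1 is noticeably larger. A minimal sketch, using the rounded English accuracies printed above and the standard library's `statistics` module as a stand-in for NumPy:

```python
import statistics

# Rounded English Brill accuracies (percent) as printed in the results table.
accuracies = [90.4, 90.7, 90.6, 90.2, 87.6]

mean = statistics.fmean(accuracies)
pop_std = statistics.pstdev(accuracies)    # divides by n, like np.std's default (ddof=0)
sample_std = statistics.stdev(accuracies)  # divides by n - 1, like np.std(..., ddof=1)

print(f'mean={mean:.2f} population_std={pop_std:.2f} sample_std={sample_std:.2f}')
# mean=89.90 population_std=1.16 sample_std=1.30
```

The population value of about 1.16 percentage points (0.0116 as a fraction) matches the reported figure; the mean differs slightly from the table because the inputs here are rounded.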


## 2. Unsupervised Learning: HMM Tagging

> Use the datasets T, H, and S. Estimate the parameters of an HMM tagger using supervised learning off the T data (trigram and lower models for tags). Smooth (both the trigram tag model as well as the lexical model) in the same way as in Homework No. 1 (use data H). Evaluate your tagger on S, using the Viterbi algorithm.
>
> Now use only the first 10,000 words of T to estimate the initial (raw) parameters of the HMM tagging model. Strip off the tags from the remaining data T. Use the Baum-Welch algorithm to improve on the initial parameters. Smooth as usual. Evaluate your unsupervised HMM tagger and compare the results to the supervised HMM tagger.
>
> Tabulate and compare the results of the HMM tagger vs. the Brill's tagger. 

In [14]:
from tag import HMMTagger # See tag.py for implementation details
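
The tagger itself lives in `tag.py` and is not reproduced here. To show the idea behind Viterbi decoding, the following is a simplified bigram sketch (the assignment asks for a trigram model), not the author's implementation. All probabilities are hypothetical toy values, and the `floor` parameter merely stands in for real smoothing.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence under a bigram HMM (log-space)."""
    if not words:
        return []

    def lp(p):
        return math.log(max(p, floor))  # floor stands in for real smoothing

    # best[i][t]: best log-probability of any tag path ending in t at position i
    best = [{t: lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + lp(emit_p[t].get(words[i], 0))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Hypothetical toy parameters, chosen only for illustration.
tags = ['DT', 'NN', 'VB']
start_p = {'DT': 0.8, 'NN': 0.1, 'VB': 0.1}
trans_p = {
    'DT': {'DT': 0.05, 'NN': 0.9, 'VB': 0.05},
    'NN': {'DT': 0.1, 'NN': 0.3, 'VB': 0.6},
    'VB': {'DT': 0.5, 'NN': 0.3, 'VB': 0.2},
}
emit_p = {
    'DT': {'the': 0.9, 'a': 0.1},
    'NN': {'dog': 0.5, 'walk': 0.3, 'walks': 0.2},
    'VB': {'walk': 0.4, 'walks': 0.6},
}
decoded = viterbi(['the', 'dog', 'walks'], tags, start_p, trans_p, emit_p)
print(decoded)  # ['DT', 'NN', 'VB']
```

Working in log-space avoids numerical underflow on long sentences. A trigram version would track pairs of tags as states instead of single tags.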

In [15]:
def evaluate_hmm(split, i=0, lang='', unsupervised=False, load=False):
    train, heldout, test = split
    
    name = 'unsupervised' if unsupervised else 'supervised'
    filename = 'data/hmm_{}_tagger_{}_{}.pkl'.format(name, lang, i)
    
    if unsupervised:
        labeled = sentence_split(train[:10_000])
        unlabeled = [list(zip(*sentence))[0] for sentence in sentence_split(train[10_000:])]
    else:
        labeled = sentence_split(train)

    words, tags = list(zip(*(train + heldout + test)))
#     tag_set, word_set = list(set(tags)), list(set(words))
    tag_set, word_set = set(nltk.bigrams(tags, pad_left=True)), set(words)

    test = sentence_split(test)

    print('Evaluating HMM {} {} [{}]'.format(name, lang, i))
    if load:
        with open(filename, 'rb') as f:
            tagger = pickle.load(f)
    else:
        tagger = HMMTagger(labeled, tag_set, word_set)
        tagger.smooth(heldout)
        
        if unsupervised:
            tagger.train_unsupervised(unlabeled, max_iterations=5)
        with open(filename, 'wb') as f:
            pickle.dump(tagger, f)
    
    return tagger.evaluate(test)

In [16]:
def isplit(iterable, splitters):
    # https://stackoverflow.com/a/4322780
    return [list(g) for k,g in itertools.groupby(iterable, lambda x:x in splitters) if not k]

def sentence_split(data, token=('###', '###')):
    """Redefines sentence_split for the HMM: each sentence now starts with the boundary token"""
    return [[(token[0], token[0])] + g for g in isplit(data, (None, token))]

In [None]:
langs=['en', 'cz']
hmm_supervised_results = evaluate('HMM (supervised)', lambda split, i, lang: evaluate_hmm(split, i, lang, unsupervised=False), langs)

In [None]:
langs=['en', 'cz']
hmm_unsupervised_results = evaluate('HMM (unsupervised)', lambda split, i, lang: evaluate_hmm(split, i, lang, unsupervised=True), langs)

### HMM Results

In [392]:
hmm_supervised_results

|   | type | language | accuracies | mean | standard_deviation |
|---|---|---|---|---|---|
| 0 | HMM (supervised) | English | 83.8 83.6 84.0 82.7 84.2 | 83.669745 | 0.502114 |
| 1 | HMM (supervised) | Czech | 55.3 60.5 57.9 56.6 56.0 | 57.269441 | 1.839133 |


In [18]:
hmm_unsupervised_results

|   | type | language | accuracies | mean | standard_deviation |
|---|---|---|---|---|---|
| 0 | HMM (unsupervised) | English | 81.1 79.9 80.6 80.1 80.9 | 80.515292 | 0.460171 |
| 1 | HMM (unsupervised) | Czech | 47.7 62.9 54.0 50.6 48.6 | 52.788909 | 5.517085 |


### Performance Comparison

Below are the final results of all taggers evaluated in this notebook.

In [20]:
# pd.concat expects a list of frames; ignore_index renumbers the rows 0..5
pd.concat([brill_results, hmm_supervised_results, hmm_unsupervised_results], ignore_index=True)

|   | type | language | accuracies | mean | standard_deviation |
|---|---|---|---|---|---|
| 0 | Brill | English | 90.4 90.7 90.6 90.2 87.6 | 89.924328 | 1.1614 |
| 1 | Brill | Czech | 61.9 70.2 64.7 64.0 65.7 | 65.28882 | 2.7449 |
| 2 | HMM (supervised) | English | 83.8 83.6 84.0 82.7 84.2 | 83.669745 | 0.502114 |
| 3 | HMM (supervised) | Czech | 55.3 60.5 57.9 56.6 56.0 | 57.269441 | 1.839133 |
| 4 | HMM (unsupervised) | English | 81.1 79.9 80.6 80.1 80.9 | 80.515292 | 0.460171 |
| 5 | HMM (unsupervised) | Czech | 47.7 62.9 54.0 50.6 48.6 | 52.788909 | 5.517085 |


### Conclusion

For every tagger, accuracy on Czech is much lower than on English (for example, 65.3% vs. 89.9% with Brill's tagger), and the accuracies vary more across splits. Two causes likely account for this:

1. Czech has far more distinct tags (>1500) than English (<50), so there are many more ways to choose a wrong tag.
2. Many test words are out of vocabulary (OOV), i.e. never observed in the training data. Czech encodes rich morphology as inflections on each word, which greatly enlarges the vocabulary. The Czech tagger is therefore much more likely to encounter words it has never seen, which makes it difficult to choose the correct tag.
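
Both causes can be measured directly on a split. The sketch below is one hedged way to do so; `tagset_size` and `oov_rate` are hypothetical helpers that are not part of the notebook, and the toy data are made up.

```python
def tagset_size(tagged_words):
    """Number of distinct tags in a list of (word, tag) pairs."""
    return len({tag for _, tag in tagged_words})

def oov_rate(train, test):
    """Fraction of test tokens whose word form never occurs in train."""
    known = {word for word, _ in train}
    unseen = sum(1 for word, _ in test if word not in known)
    return unseen / len(test) if test else 0.0

# Hypothetical toy data, not taken from the corpus.
train = [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]
test = [('the', 'DT'), ('cat', 'NN'), ('barks', 'VBZ'), ('loudly', 'RB')]
print(tagset_size(train), oov_rate(train, test))  # 3 0.5
```

In the notebook, the same helpers could be applied to the `train` and `test` parts of any entry in `splits_en` or `splits_cz` to compare the two languages.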

Overall, Brill's tagger performs best on both languages. It leads the supervised HMM by about 6 percentage points on English (89.9% vs. 83.7%) and about 8 points on Czech (65.3% vs. 57.3%).

The supervised HMM comes second, and the unsupervised HMM comes last. This is as expected: the supervised HMM was trained with labeled words on the entire training set, while the unsupervised HMM was initialized from just the first 10,000 labeled words and then used Baum-Welch to train on the remaining unlabeled data. Baum-Welch can pick up on the distribution of observed words and update its model accordingly, but this provides less information than having both the observed words and their labels available.

More surprisingly, the difference between the supervised and unsupervised HMM approaches is not very large. Despite not observing the tags for most of the training set, the unsupervised HMM still models the distribution quite well, losing only about 3 percentage points in English (83.7% vs. 80.5%). The drop is larger in Czech, about 4.5 points (57.3% vs. 52.8%), likely because many more tags go unobserved in the small labeled portion given Czech's rich morphological tag set.