# [Assignment #3: NPFL067 Statistical NLP II](http://ufal.mff.cuni.cz/~hajic/courses/npfl067/assign3.html)

## Tagging

### Author: Dan Kondratyuk

### March 28, 2018

---

This Python notebook compares Brill's Tagger with a trigram HMM tagger.

The code and an explanation of the results are fully viewable within this webpage.

## Files

- [index.html](./index.html) - Contains all viewable code and a summary of results
- [README.md](./README.md) - Instructions on how to run the code with Python
- [nlp-assignment-3.ipynb](./nlp-assignment-3.ipynb) - Jupyter notebook where code can be run
- [requirements.txt](./requirements.txt) - Required Python packages for running the code

## 1. Brill's Tagger & Tagger Evaluation

> For this whole homework, use data found in `texten2.ptg`, `textcz2.ptg`
>
> In the following, "the data" refers to both English and Czech, as usual.
>
> Split the data in the following way: use last 40,000 words for testing (data S), and from the remaining data, use the last 20,000 for smoothing (data H, if any). Call the rest "data T" (training). 
>
> Download Eric Brill's supervised tagger from [UFAL's course assignment space](http://ufal.mff.cuni.cz/~hajic/courses/npfl067/RULE_BASED_TAGGER_V.1.14.tar.gz). Install it (i.e., uncompress (gunzip), untar, and make).
>
> You might need to make some changes in his makefile of course (it's an OLD program, in this fast changing world...).
>
> After installation, get the data, train it on as much data from T as time allows (in the package, there is an extensive documentation on how to train it on new data), and evaluate on data S. Tabulate the results.
>
> Do cross-validation of the results: split the data into S', [H',] T' such that S' is the first 40,000 words, and T' is the last but the first 20,000 words from the rest. Train Eric Brill's tagger on T' (again, use as much data as time allows) and evaluate on S'. Again, tabulate the results.
>
> Do three more splits of your data (using the same formula: 40k/20k/the rest) in some way or another (as different as possible), and get another three sets of results. Compute the mean (average) accuracy and the standard deviation of the accuracy. Tabulate all results. 

In [1]:
import numpy as np
import pandas as pd
import collections as c
import nltk
from sklearn.metrics import accuracy_score
import itertools
import dill as pickle

from subprocess import call

In [2]:
def open_text(filename):
    """Reads a text line by line, applies light preprocessing, and returns an array of words and tags"""
    with open(filename, encoding='iso-8859-2') as f:
        content = f.readlines()
    
    preprocess = lambda word: tuple(word.strip().rsplit('/', 1))
    
    return [preprocess(word) for word in content]

In [3]:
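def isplit(iterable, splitters):
    """Splits an iterable into sublists at every element found in `splitters` (the splitters are dropped)"""
    # https://stackoverflow.com/a/4322780
    return [list(g) for k, g in itertools.groupby(iterable, lambda x: x in splitters) if not k]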

In [4]:
def sentence_split(data, token=('###', '###')):
    """Splits a flat list of (word, tag) pairs into sentences at each boundary token"""
    return isplit(data, (None, token))
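
As a small illustration of the sentence splitting above, the sketch below runs the same `groupby` idiom on a toy stream. It assumes, as the default `token` suggests, that sentence boundaries appear in the data as the pair `('###', '###')`; the toy words and tags are made up.

```python
import itertools

def isplit(iterable, splitters):
    """Split `iterable` into sublists at any element contained in `splitters`."""
    return [list(g) for k, g in itertools.groupby(iterable, lambda x: x in splitters) if not k]

# Toy (word, tag) stream; ('###', '###') is assumed to mark a sentence boundary
toy = [('###', '###'), ('the', 'DT'), ('dog', 'NN'),
       ('###', '###'), ('barks', 'VBZ'), ('###', '###')]

toy_sentences = isplit(toy, (None, ('###', '###')))
print(toy_sentences)
# [[('the', 'DT'), ('dog', 'NN')], [('barks', 'VBZ')]]
```

Leading, trailing, and repeated boundary tokens produce no empty sentences, because `groupby` merges adjacent splitters into one group that is then discarded.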

In [5]:
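def split_data(words, start=0):
    """Takes 40k test words from `start`, the next 20k as heldout, and everything else as training data"""
    test = words[start:start+40_000]
    heldout = words[start+40_000:start+60_000]
    train = words[:start] + words[start+60_000:]
    return train, heldout, test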

In [6]:
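def split_data_end(words):
    """Takes the last 40k words as test, the 20k before them as heldout, and the rest as training data"""
    train, heldout, test = words[:-60_000], words[-60_000:-40_000], words[-40_000:]
    return train, heldout, test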

In [7]:
def split_all(words):
    return [
        split_data_end(words),
        split_data(words, start=40_000 * 0),
        split_data(words, start=40_000 * 1),
        split_data(words, start=40_000 * 2),
        split_data(words, start=40_000 * 3)
    ]
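
To make the five splits concrete, here is a minimal sketch that applies the same slicing to a stand-in list of integers. The list length of 200,000 is a hypothetical value chosen for illustration, not the size of either corpus, and the functions are re-declared only so the block runs on its own.

```python
def split_data(words, start=0):
    # test: 40k words from `start`; heldout: the next 20k; train: everything else
    test = words[start:start + 40_000]
    heldout = words[start + 40_000:start + 60_000]
    train = words[:start] + words[start + 60_000:]
    return train, heldout, test

def split_data_end(words):
    # test: last 40k words; heldout: the 20k before them; train: the rest
    return words[:-60_000], words[-60_000:-40_000], words[-40_000:]

words = list(range(200_000))  # hypothetical stand-in for the (word, tag) list
splits = [split_data_end(words)] + [split_data(words, start=40_000 * k) for k in range(4)]

sizes = [(len(train), len(heldout), len(test)) for train, heldout, test in splits]
test_starts = [test[0] for _, _, test in splits]
print(sizes)        # every split: (140000, 20000, 40000)
print(test_starts)  # [160000, 0, 40000, 80000, 120000]
```

Each split keeps the 40k/20k/rest proportions and moves the test window to a different part of the data.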

In [8]:
# Taken from https://github.com/nltk/nltk/blob/a84b28ca26ea3ee53da4eaafc2bbf037847779bd/nltk/tbl/demo.py
REGEXP_TAGGER = nltk.tag.RegexpTagger(
    [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),   # cardinal numbers
     (r'(The|the|A|a|An|an)$', 'AT'),   # articles
     (r'.*able$', 'JJ'),                # adjectives
     (r'.*ness$', 'NN'),                # nouns formed from adjectives
     (r'.*ly$', 'RB'),                  # adverbs
     (r'.*s$', 'NNS'),                  # plural nouns
     (r'.*ing$', 'VBG'),                # gerunds
     (r'.*ed$', 'VBD'),                 # past tense verbs
     (r'.*', 'NN')                      # nouns (default)
])
templates = nltk.tag.brill.brill24()

In [9]:
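def brill_tagger(train, heldout, baseline_backoff_tagger=REGEXP_TAGGER, templates=templates, trace=0,
                 ruleformat='str', max_rules=300, min_score=3, min_acc=None):
    """Trains a Brill tagger: a unigram baseline (fit on heldout) corrected by rules learned on train"""
    # Baseline: unigram tagger estimated on the heldout data, backing off to the regexp tagger
    baseline_tagger = nltk.tag.UnigramTagger(heldout, backoff=baseline_backoff_tagger)
    trainer = nltk.tag.BrillTaggerTrainer(baseline_tagger, templates, trace=trace, ruleformat=ruleformat)
    # Learn transformation rules that correct the baseline's errors on the training data
    tagger = trainer.train(train, max_rules, min_score, min_acc)
    return tagger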

In [10]:
def evaluate_brill(split, i=0, lang='', load=False):
    train, heldout, test = split
    
    filename = 'data/brill_tagger_{}_{}.pkl'.format(lang, i)
    
    print('Evaluating Brill Tagger {} [{}]'.format(lang, i))
    if load:
        with open(filename, 'rb') as f:
            tagger = pickle.load(f)
    else:
        tagger = brill_tagger([train], [heldout])
        with open(filename, 'wb') as f:
            pickle.dump(tagger, f)
    
    return tagger.evaluate([test])

In [11]:
def evaluate(tagger_type, eval_func, langs=('en', 'cz')):
    lang_d = {'en': ('English', splits_en), 'cz': ('Czech', splits_cz)}
    
    rows = []
    for lang in langs:
        language, splits = lang_d[lang]
        accuracies = [eval_func(split, i, lang) for i,split in enumerate(splits)]
        acc_str = ' '.join(['{0:0.1f}'.format(i * 100) for i in accuracies])
        row = [tagger_type, language, acc_str, np.mean(accuracies) * 100, np.std(accuracies) * 100]
        rows.append(row)

    columns = ['type', 'language', 'accuracies', 'mean', 'standard_deviation']
    results = pd.DataFrame(rows, columns=columns)
    return results

In [12]:
# Read the texts into memory
english = './data/texten2.ptg'
czech = './data/textcz2.ptg'

words_en = open_text(english)
words_cz = open_text(czech)

In [13]:
splits_en = split_all(words_en)
splits_cz = split_all(words_cz)

In [14]:
brill_results = evaluate('Brill', evaluate_brill)
brill_results

Evaluating Brill Tagger en [0]
Evaluating Brill Tagger en [1]
Evaluating Brill Tagger en [2]
Evaluating Brill Tagger en [3]
Evaluating Brill Tagger en [4]
Evaluating Brill Tagger cz [0]
Evaluating Brill Tagger cz [1]
Evaluating Brill Tagger cz [2]
Evaluating Brill Tagger cz [3]
Evaluating Brill Tagger cz [4]


| | type | language | accuracies | mean | standard_deviation |
|---|---|---|---|---|---|
| 0 | Brill | English | 90.4 90.7 90.6 90.2 87.6 | 89.924328 | 0.011614 |
| 1 | Brill | Czech | 61.9 70.2 64.7 64.0 65.7 | 65.28882 | 0.027449 |
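
As a quick sanity check of the table above, the mean and standard deviation can be recomputed from the rounded per-split accuracies. This sketch uses only the standard library (`statistics.pstdev` is the population standard deviation, which matches the default behaviour of `np.std`); the variable names are illustrative. The recomputed deviations come out in percentage points, which suggests the `standard_deviation` column was printed as a fraction (0.011614 is about 1.16 points) rather than as a percentage.

```python
import statistics

# Rounded per-split accuracies copied from the Brill results table (in percent)
brill_en = [90.4, 90.7, 90.6, 90.2, 87.6]
brill_cz = [61.9, 70.2, 64.7, 64.0, 65.7]

mean_en, std_en = statistics.mean(brill_en), statistics.pstdev(brill_en)
mean_cz, std_cz = statistics.mean(brill_cz), statistics.pstdev(brill_cz)

print('English: mean {:.1f}, std {:.2f}'.format(mean_en, std_en))
print('Czech:   mean {:.1f}, std {:.2f}'.format(mean_cz, std_cz))
```

Small differences from the table are expected, since the inputs here are rounded to one decimal place.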


## 2. Unsupervised Learning: HMM Tagging

> Use the datasets T, H, and S. Estimate the parameters of an HMM tagger using supervised learning off the T data (trigram and lower models for tags). Smooth (both the trigram tag model as well as the lexical model) in the same way as in Homework No. 1 (use data H). Evaluate your tagger on S, using the Viterbi algorithm.
>
> Now use only the first 10,000 words of T to estimate the initial (raw) parameters of the HMM tagging model. Strip off the tags from the remaining data T. Use the Baum-Welch algorithm to improve on the initial parameters. Smooth as usual. Evaluate your unsupervised HMM tagger and compare the results to the supervised HMM tagger.
>
> Tabulate and compare the results of the HMM tagger vs. the Brill's tagger. 
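
For readers unfamiliar with the decoding step named in the assignment, the following is a minimal sketch of the Viterbi algorithm for a bigram HMM on a three-word toy example. It is a simplification for illustration only: the assignment asks for a trigram tag model, the notebook relies on NLTK's HMM implementation, and every probability below is made up.

```python
import math

def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for `words` under a bigram HMM."""
    # best[t][s] = (log-probability of the best path ending in state s at position t, backpointer)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][words[0]]), None) for s in states}]
    for t in range(1, len(words)):
        column = {}
        for s in states:
            # Choose the predecessor state that maximizes the path score into s
            prev, score = max(
                ((p, best[t - 1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + math.log(emit_p[s][words[t]]), prev)
        best.append(column)

    # Follow the backpointers from the best final state
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(words) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    return path[::-1]

# Hypothetical toy model; all probabilities are nonzero so that log() is defined
states = ['DT', 'NN', 'VB']
start_p = {'DT': 0.6, 'NN': 0.3, 'VB': 0.1}
trans_p = {'DT': {'DT': 0.05, 'NN': 0.9, 'VB': 0.05},
           'NN': {'DT': 0.1, 'NN': 0.3, 'VB': 0.6},
           'VB': {'DT': 0.5, 'NN': 0.4, 'VB': 0.1}}
emit_p = {'DT': {'the': 0.9, 'dog': 0.05, 'barks': 0.05},
          'NN': {'the': 0.05, 'dog': 0.8, 'barks': 0.15},
          'VB': {'the': 0.05, 'dog': 0.15, 'barks': 0.8}}

tags = viterbi(['the', 'dog', 'barks'], states, start_p, trans_p, emit_p)
print(tags)  # ['DT', 'NN', 'VB']
```

Working in log space avoids numerical underflow on long sentences, which is also why zero probabilities have to be smoothed away before decoding.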

In [14]:
def evaluate_hmm(split, i=0, lang='', unsupervised=False, load=False):
    train, heldout, test = split
    
    name = 'unsupervised' if unsupervised else 'supervised'
    filename = 'data/hmm_{}_tagger_{}_{}.pkl'.format(name, lang, i)
    
    if unsupervised:
        labeled = sentence_split(train[:10_000])
        unlabeled = sentence_split(train[10_000:])
    else:
        labeled = sentence_split(train)

    words, tags = list(zip(*(train + heldout + test)))
    states, symbols = list(set(tags)), list(set(words))

    test = sentence_split(test)

    trainer = nltk.hmm.HiddenMarkovModelTrainer(states, symbols)
    
    print('Evaluating HMM {} {} [{}]'.format(name, lang, i))
    if load:
        with open(filename, 'rb') as f:
            tagger = pickle.load(f)
    else:
        tagger = trainer.train_supervised(labeled, estimator=lambda fd, bins: nltk.probability.LidstoneProbDist(fd, 0.1, bins))
        if unsupervised:
            tagger = trainer.train_unsupervised(unlabeled, model=tagger, max_iterations=5)
        with open(filename, 'wb') as f:
            pickle.dump(tagger, f)
    
    return tagger.evaluate(test)
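
The `estimator` passed to `train_supervised` above applies Lidstone (additive) smoothing with `gamma = 0.1`. The sketch below re-implements the formula, `(count + gamma) / (total + gamma * bins)`, in plain Python to show its effect on unseen events; it is an illustration, not NLTK's code, and the toy counts and tag names are made up.

```python
def lidstone_prob(count, total, bins, gamma=0.1):
    """Additively smoothed probability of an event seen `count` times out of `total`."""
    return (count + gamma) / (total + gamma * bins)

# Hypothetical tag counts: two tags observed, two never observed
counts = {'NN': 7, 'VB': 3}
all_tags = ['NN', 'VB', 'JJ', 'RB']
total = sum(counts.values())

probs = {tag: lidstone_prob(counts.get(tag, 0), total, bins=len(all_tags)) for tag in all_tags}
for tag, p in probs.items():
    print('{}: {:.4f}'.format(tag, p))
```

Unseen tags receive a small nonzero probability, and the distribution still sums to one because `gamma` is added once per bin in the denominator.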

In [None]:
langs = ['en']

In [18]:
hmm_supervised_results = evaluate('HMM (supervised)', lambda split, i, lang: evaluate_hmm(split, i, lang, unsupervised=False), langs)
hmm_supervised_results

Evaluating HMM supervised en [0]
Evaluating HMM supervised en [1]
Evaluating HMM supervised en [2]
Evaluating HMM supervised en [3]
Evaluating HMM supervised en [4]


| | type | language | accuracies | mean | standard_deviation |
|---|---|---|---|---|---|
| 0 | HMM (supervised) | English | 91.1 90.7 91.7 89.8 91.2 | 90.908268 | 0.641836 |


In [18]:
call(['spd-say', 'hmm supervised done'])

0

In [33]:
train, heldout, test = splits_en[0]
len(train), len(heldout), len(test)

(187479, 20000, 40000)

In [51]:
train, heldout, test = splits_en[0]

words, tags = list(zip(*(train + heldout + test)))
states, symbols = list(set(tags)), list(set(words))

test = sentence_split(test)

trainer = nltk.hmm.HiddenMarkovModelTrainer(states, symbols)

labeled = sentence_split(train[:10_000])
unlabeled = sentence_split(train[10_000:])

tagger = trainer.train_supervised(labeled, estimator=lambda fd, bins: nltk.probability.LidstoneProbDist(fd, 0.1, bins))

print(tagger.evaluate(test))

tagger = trainer.train_unsupervised(unlabeled, model=tagger, max_iterations=3, update_outputs=False)

tagger.evaluate(test)

0.7380410022779044
iteration 0 logprob -1921749.2676943757


0.7095671981776766

In [46]:
tagger.tag([w for w,t in test[0]])

[('two', 'CD'),
 ('administrations', 'NNPS'),
 ('in', 'NNPS'),
 ('a', 'NNPS'),
 ('row', 'NNPS'),
 ('have', 'NNPS'),
 ('been', 'NNPS'),
 ('unwilling', 'NNPS'),
 ('and', 'NNPS'),
 ('unable', 'NNPS'),
 ('to', 'NNPS'),
 ('develop', 'NNPS'),
 ('any', 'NNPS'),
 ('plan', 'NNPS'),
 (',', 'NNPS'),
 ('military', 'NNPS'),
 ('or', 'NNPS'),
 ('economic', 'NNPS'),
 (',', 'NNPS'),
 ('for', 'NNPS'),
 ('supporting', 'NNPS'),
 ('the', 'NNPS'),
 ('Panamanian', 'NNPS'),
 ('people', 'NNPS'),
 ('in', 'NNPS'),
 ('their', 'NNPS'),
 ('attempts', 'NNPS'),
 ('to', 'NNPS'),
 ('restore', 'NNPS'),
 ('democracy', 'NNPS'),
 ('.', 'NNPS')]

In [None]:
hmm_unsupervised_results = evaluate('HMM (unsupervised)', lambda split, i, lang: evaluate_hmm(split, i, lang, unsupervised=True), langs)
hmm_unsupervised_results

In [None]:
call(['spd-say', 'hmm unsupervised done'])

In [18]:
class SmoothProbDist(nltk.probability.ProbDistI):
    def __init__(self, probdist):
        self._probdist = probdist

    def prob(self, sample):
        return (self._probdist.prob(sample) + 0.1) / 1.1

    def max(self):
        return self._probdist.max()

    def samples(self):
        return self._probdist.samples()

    def __repr__(self):
        return '<SmoothProbDist>'
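
One property of the `prob` method above is worth spelling out. If the wrapped distribution $p$ sums to one over $N$ samples, the wrapped values sum to

$$
\sum_{x} \frac{p(x) + 0.1}{1.1} = \frac{1 + 0.1\,N}{1.1},
$$

which equals one only when $N = 1$, so the result is a score rather than a normalized distribution. A normalized alternative (an assumption for comparison, not what the notebook does) would spread the added mass uniformly over the samples:

$$
\tilde{p}(x) = \frac{p(x) + 0.1 / N}{1.1}, \qquad \sum_{x} \tilde{p}(x) = \frac{1 + 0.1}{1.1} = 1 .
$$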

In [19]:
def smooth(tagger):
    transitions = tagger._transitions
    outputs = tagger._outputs

    for k in transitions:
        transitions[k] = SmoothProbDist(transitions[k])
    for k in outputs:
        outputs[k] = SmoothProbDist(outputs[k])

    return nltk.tag.hmm.HiddenMarkovModelTagger(tagger._symbols, tagger._states, transitions, outputs, tagger._priors)

In [25]:
class LanguageModel:
    """Counts words and calculates probabilities (up to trigrams)"""
    
    def __init__(self, words):
        # Prepend two tokens to avoid beginning-of-data problems
        words = np.array(['<ss>', '<s>'] + list(words))
        
        # Unigrams
        self.unigrams = words
        self.unigram_set = list(set(self.unigrams))
        self.unigram_count = len(self.unigram_set)
        self.total_unigram_count = len(self.unigrams)
        self.unigram_dist = c.Counter(self.unigrams)
        
        # Bigrams
        self.bigrams = list(nltk.bigrams(words))
        self.bigram_set = list(set(self.bigrams))
        self.bigram_count = len(self.bigram_set)
        self.total_bigram_count = len(self.bigrams)
        self.bigram_dist = c.Counter(self.bigrams)
        
        # Trigrams
        self.trigrams = list(nltk.trigrams(words))
        self.trigram_set = list(set(self.trigrams))
        self.trigram_count = len(self.trigram_set)
        self.total_trigram_count = len(self.trigrams)
        self.trigram_dist = c.Counter(self.trigrams)
    
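    @staticmethod
    def count(ngrams):
        """Returns the distinct ngrams, their number, the total ngram count, and the ngram frequencies"""
        ngram_set = list(set(ngrams))
        ngram_count = len(ngram_set)
        total_ngram_count = len(ngrams)
        ngram_dist = c.Counter(ngrams)
        return ngram_set, ngram_count, total_ngram_count, ngram_dist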
        
    def p_uniform(self):
        """Calculates the probability of choosing a word uniformly at random"""
        return self.div(1, self.unigram_count)
    
    def p_unigram(self, w):
        """Calculates the probability a unigram appears in the distribution"""
        return self.div(self.unigram_dist[w], self.total_unigram_count)
    
    def p_bigram_cond(self, wprev, w):
        """Calculates the probability a word appears in the distribution given the previous word"""
        # If neither ngram has been seen, use the uniform distribution for smoothing purposes
        if ((self.bigram_dist[wprev, w], self.unigram_dist[wprev]) == (0,0)):
            return self.p_uniform()
        
        return self.div(self.bigram_dist[wprev, w], self.unigram_dist[wprev])
    
    def p_trigram_cond(self, wprev2, wprev, w):
        """Calculates the probability a word appears in the distribution given the previous two words"""
        # If neither ngram has been seen, use the uniform distribution for smoothing purposes
        if ((self.trigram_dist[wprev2, wprev, w], self.bigram_dist[wprev2, wprev]) == (0,0)):
            return self.p_uniform()
        
        return self.div(self.trigram_dist[wprev2, wprev, w], self.bigram_dist[wprev2, wprev])
    
    def div(self, a, b):
        """Divides a and b safely"""
        return a / b if b != 0 else 0

In [26]:
def init_lambdas(n=3):
    """Initializes a list of lambdas for an ngram language model with uniform probabilities"""
    return np.array([1 / (n + 1)] * (n + 1))

In [27]:
def p_smoothed(lm, lambdas, wprev2, wprev, w):
    """Returns the four lambda-weighted component probabilities (uniform, unigram, bigram, trigram); their sum is the smoothed trigram probability"""
    return np.multiply(lambdas, [
        lm.p_uniform(),
        lm.p_unigram(w),
        lm.p_bigram_cond(wprev, w),
        lm.p_trigram_cond(wprev2, wprev, w)
    ])

In [28]:
def expected_counts(lm, lambdas, heldout):
    """Computes the expected counts by smoothing across all trigrams and summing them all together"""
    smoothed_probs = (p_smoothed(lm, lambdas, *trigram) for trigram in heldout) # Multiply lambdas by probabilities
    # Use the built-in sum: passing a generator to np.sum is deprecated in NumPy
    return sum(smoothed / np.sum(smoothed) for smoothed in smoothed_probs) # Element-wise sum

In [29]:
def next_lambda(lm, lambdas, heldout):
    """Computes the next lambda from the current lambdas by normalizing the expected counts"""
    expected = expected_counts(lm, lambdas, heldout)
    return expected / np.sum(expected) # Normalize

In [30]:
def em_algorithm(train, heldout, stop_tolerance=1e-4):
    """Computes the EM algorithm for linear interpolation smoothing"""
    lambdas = init_lambdas(3)
    
    lm = LanguageModel(train)
    heldout_trigrams = LanguageModel(heldout).trigrams
    
    print('Lambdas:')
    
    next_l = next_lambda(lm, lambdas, heldout_trigrams)
    while not np.all([diff < stop_tolerance for diff in np.abs(lambdas - next_l)]):
        print(next_l)
        lambdas = next_l
        next_l = next_lambda(lm, lambdas, heldout_trigrams)

    lambdas = next_l
    return lambdas
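
Written out, the cells above appear to implement the standard EM update for linear interpolation. The notation is introduced here for illustration: $u, v, w$ stand for `wprev2`, `wprev`, `w`; $|V|$ is `lm.unigram_count`; $H$ is the list of heldout trigrams; and $p_1, p_2, p_3$ are the unigram, bigram, and trigram estimates from the training data. `p_smoothed` returns the four terms of

$$
p'_{\lambda}(w \mid u, v) = \lambda_0 \frac{1}{|V|} + \lambda_1\, p_1(w) + \lambda_2\, p_2(w \mid v) + \lambda_3\, p_3(w \mid u, v),
$$

while `expected_counts` and `next_lambda` compute, with $p_j(w \mid \cdot)$ denoting the $j$-th component above ($p_0 = 1/|V|$),

$$
c(\lambda_j) = \sum_{(u, v, w) \in H} \frac{\lambda_j\, p_j(w \mid \cdot)}{p'_{\lambda}(w \mid u, v)}, \qquad \lambda_j^{\text{next}} = \frac{c(\lambda_j)}{\sum_{k=0}^{3} c(\lambda_k)} .
$$

The loop in `em_algorithm` repeats this update until every $\lambda_j$ changes by less than `stop_tolerance`.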

In [31]:
train, heldout, test = splits_en[0]

In [38]:
lm_en = LanguageModel(train)

In [39]:
lambdas_en = em_algorithm(train, heldout)

Lambdas:
[0.00171272 0.01427302 0.23094838 0.75306588]
[6.09494990e-06 4.42538007e-04 1.01261852e-01 8.98289515e-01]
[3.49390171e-08 6.14055420e-05 4.23362189e-02 9.57602341e-01]
[7.19392903e-10 5.15530093e-05 1.74922896e-02 9.82456157e-01]
[1.71849267e-11 5.12945905e-05 7.20653209e-03 9.92742173e-01]
[4.12379908e-13 5.12845010e-05 2.97099414e-03 9.96977721e-01]
[9.90029950e-15 5.12829547e-05 1.22672529e-03 9.98721992e-01]
[2.37739409e-16 5.12824581e-05 5.07102930e-04 9.99441615e-01]
[5.70960107e-18 5.12822773e-05 2.09756728e-04 9.99738961e-01]
[1.37130442e-19 5.12822076e-05 8.67883404e-05 9.99861929e-01]
