# [Assignment #3: NPFL067 Statistical NLP II](http://ufal.mff.cuni.cz/~hajic/courses/npfl067/assign3.html)

## Tagging

### Author: Dan Kondratyuk

### March 28, 2018

---

This Python notebook compares Brill's Tagger with a trigram HMM tagger.

Code and explanation of results is fully viewable within this webpage.

## Files

- [index.html](./index.html) - Contains all veiwable code and a summary of results
- [README.md](./README.md) - Instructions on how to run the code with Python
- [nlp-assignment-3.ipynb](./nlp-assignment-3.ipynb) - Jupyter notebook where code can be run
- [requirements.txt](./requirements.txt) - Required python packages for running

## 1. Brill's Tagger & Tagger Evaluation

> For this whole homework, use data found in `texten2.ptg`, `textcz2.ptg`
>
> In the following, "the data" refers to both English and Czech, as usual.
>
> Split the data in the following way: use last 40,000 words for testing (data S), and from the remaining data, use the last 20,000 for smoothing (data H, if any). Call the rest "data T" (training). 
>
> Download Eric Brill's supervised tagger from [UFAL's course assignment space](http://ufal.mff.cuni.cz/~hajic/courses/npfl067/RULE_BASED_TAGGER_V.1.14.tar.gz). Install it (i.e., uncompress (gunzip), untar, and make).
>
> You might need to make some changes in his makefile of course (it's and OLD program, in this fast changing world...).
>
> After installation, get the data, train it on as much data from T as time allows (in the package, there is an extensive documentation on how to train it on new data), and evaluate on data S. Tabulate the results.
>
> Do cross-validation of the results: split the data into S', [H',] T' such that S' is the first 40,000 words, and T' is the last but the first 20,000 words from the rest. Train Eric Brill's tagger on T' (again, use as much data as time allows) and evaluate on S'. Again, tabulate the results.
>
> Do three more splits of your data (using the same formula: 40k/20k/the rest) in some way or another (as different as possible), and get another three sets of results. Compute the mean (average) accuracy and the standard deviation of the accuracy. Tabulate all results. 

In [1]:
import numpy as np
import pandas as pd
import nltk
from sklearn.metrics import accuracy_score
import itertools
import dill as pickle
from collections import Counter, defaultdict
from tqdm import tqdm_notebook as tqdm

from subprocess import call

In [2]:
def open_text(filename):
    """Reads a text line by line, applies light preprocessing, and returns an array of words and tags"""
    with open(filename, encoding='iso-8859-2') as f:
        content = f.readlines()
    
    preprocess = lambda word: tuple(word.strip().rsplit('/', 1))
    
    return [preprocess(word) for word in content]

In [3]:
def isplit(iterable, splitters):
    # https://stackoverflow.com/a/4322780
    return [list(g) for k,g in itertools.groupby(iterable, lambda x:x in splitters) if not k]

In [4]:
def sentence_split(data, token=('###', '###')):
    return isplit(data, (None, token))

In [5]:
def split_data(words, start=0):
    train, heldout, test = words[:start] + words[start+60_000:],  words[start+40_000:start+60_000], words[start:start+40_000]
    return train, heldout, test

In [6]:
def split_data_end(words):
    train, heldout, test = words[:-60_000],  words[-60_000:-40_000], words[-40_000:]
    return train, heldout, test

In [7]:
def split_all(words):
    return [
        split_data_end(words),
        split_data(words, start=40_000 * 0),
        split_data(words, start=40_000 * 1),
        split_data(words, start=40_000 * 2),
        split_data(words, start=40_000 * 3)
    ]

In [8]:
# Taken from https://github.com/nltk/nltk/blob/a84b28ca26ea3ee53da4eaafc2bbf037847779bd/nltk/tbl/demo.py
REGEXP_TAGGER = nltk.tag.RegexpTagger(
    [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),   # cardinal numbers
     (r'(The|the|A|a|An|an)$', 'AT'),   # articles
     (r'.*able$', 'JJ'),                # adjectives
     (r'.*ness$', 'NN'),                # nouns formed from adjectives
     (r'.*ly$', 'RB'),                  # adverbs
     (r'.*s$', 'NNS'),                  # plural nouns
     (r'.*ing$', 'VBG'),                # gerunds
     (r'.*ed$', 'VBD'),                 # past tense verbs
     (r'.*', 'NN')                      # nouns (default)
])
templates = nltk.tag.brill.brill24()

In [9]:
def brill_tagger(train, heldout, baseline_backoff_tagger=REGEXP_TAGGER, templates=templates, trace=0, 
                 ruleformat='str', max_rules=300, min_score=3, min_acc=None):
    baseline_tagger = nltk.tag.UnigramTagger(heldout, backoff=baseline_backoff_tagger)
    trainer = nltk.tag.BrillTaggerTrainer(baseline_tagger, templates, trace=trace, ruleformat=ruleformat)
    tagger = trainer.train(train, max_rules, min_score, min_acc)
    return tagger

In [10]:
def evaluate_brill(split, i=0, lang='', load=False):
    train, heldout, test = split
    
    filename = 'data/brill_tagger_{}_{}.pkl'.format(lang, i)
    
    print('Evaluating Brill Tagger {} [{}]'.format(lang, i))
    if load:
        with open(filename, 'rb') as f:
            tagger = pickle.load(f)
    else:
        tagger = brill_tagger([train], [heldout])
        with open(filename, 'wb') as f:
            pickle.dump(tagger, f)
    
    return tagger.evaluate([test])

In [11]:
def evaluate(tagger_type, eval_func, langs=('en', 'cz')):
    lang_d = {'en': ('English', splits_en), 'cz': ('Czech', splits_cz)}
    
    rows = []
    for lang in langs:
        language, splits = lang_d[lang]
        accuracies = [eval_func(split, i, lang) for i,split in enumerate(splits)]
        acc_str = ' '.join(['{0:0.1f}'.format(i * 100) for i in accuracies])
        row = [tagger_type, language, acc_str, np.mean(accuracies) * 100, np.std(accuracies) * 100]
        rows.append(row)

    columns = ['type', 'language', 'accuracies', 'mean', 'standard_deviation']
    results = pd.DataFrame(rows, columns=columns)
    return results

In [12]:
# Read the texts into memory
english = './data/texten2.ptg'
czech = './data/textcz2.ptg'

words_en = open_text(english)
words_cz = open_text(czech)

In [13]:
splits_en = split_all(words_en)
splits_cz = split_all(words_cz)

In [14]:
brill_results = evaluate('Brill', evaluate_brill)
brill_results

Evaluating Brill Tagger en [0]
Evaluating Brill Tagger en [1]
Evaluating Brill Tagger en [2]
Evaluating Brill Tagger en [3]
Evaluating Brill Tagger en [4]
Evaluating Brill Tagger cz [0]
Evaluating Brill Tagger cz [1]
Evaluating Brill Tagger cz [2]
Evaluating Brill Tagger cz [3]
Evaluating Brill Tagger cz [4]


Unnamed: 0,type,language,accuracies,mean,standard_deviation
0,Brill,English,90.4 90.7 90.6 90.2 87.6,89.924328,0.011614
1,Brill,Czech,61.9 70.2 64.7 64.0 65.7,65.28882,0.027449


## 2. Unsupervised Learning: HMM Tagging

> Use the datasets T, H, and S. Estimate the parameters of an HMM tagger using supervised learning off the T data (trigram and lower models for tags). Smooth (both the trigram tag model as well as the lexical model) in the same way as in Homework No. 1 (use data H). Evaluate your tagger on S, using the Viterbi algorithm.
>
> Now use only the first 10,000 words of T to estimate the initial (raw) parameters of the HMM tagging model. Strip off the tags from the remaining data T. Use the Baum-Welch algorithm to improve on the initial parameters. Smooth as usual. Evaluate your unsupervised HMM tagger and compare the results to the supervised HMM tagger.
>
> Tabulate and compare the results of the HMM tagger vs. the Brill's tagger. 

In [14]:
def evaluate_hmm(split, i=0, lang='', unsupervised=False, load=False):
    train, heldout, test = split
    
    name = 'unsupervised' if unsupervised else 'supervised'
    filename = 'data/hmm_{}_tagger_{}_{}.pkl'.format(name, lang, i)
    
    if unsupervised:
        labeled = sentence_split(train[:10_000])
        unlabeled = sentence_split(train[10_000:])
    else:
        labeled = sentence_split(train)

    words, tags = list(zip(*(train + heldout + test)))
    states, symbols = list(set(tags)), list(set(words))

    test = sentence_split(test)

    trainer = nltk.hmm.HiddenMarkovModelTrainer(states, symbols)
    
    print('Evaluating HMM {} {} [{}]'.format(name, lang, i))
    if load:
        with open(filename, 'rb') as f:
            tagger = pickle.load(f)
    else:
        tagger = trainer.train_supervised(labeled, estimator=lambda fd, bins: nltk.probability.LidstoneProbDist(fd, 0.1, bins))
        if unsupervised:
            tagger = trainer.train_unsupervised(unlabeled, model=tagger, max_iterations=5)
        with open(filename, 'wb') as f:
            pickle.dump(tagger, f)
    
    return tagger.evaluate(test)

In [15]:
langs = ['en']

In [18]:
hmm_supervised_results = evaluate('HMM (supervised)', lambda split, i, lang: evaluate_hmm(split, i, lang, unsupervised=False), langs)
hmm_supervised_results

Evaluating HMM supervised en [0]
Evaluating HMM supervised en [1]
Evaluating HMM supervised en [2]
Evaluating HMM supervised en [3]
Evaluating HMM supervised en [4]


Unnamed: 0,type,language,accuracies,mean,standard_deviation
0,HMM (supervised),English,91.1 90.7 91.7 89.8 91.2,90.908268,0.641836


In [18]:
call(['spd-say', 'hmm supervised done'])

0

In [51]:
train, heldout, test = splits_en[0]

words, tags = list(zip(*(train + heldout + test)))
states, symbols = list(set(tags)), list(set(words))

test = sentence_split(test)

trainer = nltk.hmm.HiddenMarkovModelTrainer(states, symbols)

labeled = sentence_split(train[:10_000])
unlabeled = sentence_split(train[10_000:])

tagger = trainer.train_supervised(labeled, estimator=lambda fd, bins: nltk.probability.LidstoneProbDist(fd, 0.1, bins))

print(tagger.evaluate(test))

tagger = trainer.train_unsupervised(unlabeled, model=tagger, max_iterations=3, update_outputs=False)

tagger.evaluate(test)

0.7380410022779044
iteration 0 logprob -1921749.2676943757


0.7095671981776766

In [None]:
hmm_supervised_results = evaluate('HMM (unsupervised)', lambda split, i, lang: evaluate_hmm(split, i, lang, unsupervised=True), langs)
hmm_supervised_results

In [None]:
call(['spd-say', 'hmm unsupervised done'])

In [262]:
class LISmoother:
    """Linear interpolation smoother"""
    
    def __init__(self, p_uniform, p_unigram, p_bigram, p_trigram):
        self.p_uniform = p_uniform
        self.p_unigram = p_unigram
        self.p_bigram = p_bigram
        self.p_trigram = p_trigram
        
        self.lambdas = self.init_lambdas(2)
#         self.lambdas = self.init_lambdas(3)
    
    def init_lambdas(self, n=3):
        """Initializes a list of lambdas for an ngram language model with uniform probabilities"""
        return np.array([1 / (n + 1)] * (n + 1))
    
    def smooth(self, heldout_data, stop_tolerance=1e-4):
        """Computes the EM algorithm for linear interpolation smoothing"""
        
        print('Lambdas:')
        print(self.lambdas)

        next_l = self.next_lambda(self.lambdas, heldout_data)
        while not all(diff < stop_tolerance for diff in np.abs(self.lambdas - next_l)):
            print(next_l)
            self.lambdas = next_l
            next_l = self.next_lambda(self.lambdas, heldout_data)

        print(next_l)
        self.lambdas = next_l

    def next_lambda(self, lambdas, heldout):
        """Computes the next lambda from the current lambdas by normalizing the expected counts"""
        expected = self.expected_counts(lambdas, heldout)
        return expected / np.sum(expected)  # Normalize

    def expected_counts(self, lambdas, heldout):
        """Computes the expected counts by smoothing across all trigrams and summing them all together"""
        smoothed_probs = (self.p_smoothed(lambdas, *h) for h in heldout)  # Multiply lambdas by probabilities
        return np.sum(smoothed / np.sum(smoothed) for smoothed in smoothed_probs)  # Element-wise sum

    def p_smoothed(self, lambdas, tprev, t, w):
        """Calculate the smoothed trigram probability using the weighted product of lambdas"""
        return np.multiply(lambdas, [
            self.p_uniform,
            self.p_unigram[w],
            self.p_bigram[t, w],
#             self.p_trigram[tprev, t, w]
        ])

In [278]:
class HMMTagger:
    def __init__(self, tagged_data, tag_set, word_set):
        # Prepend two tokens to avoid beginning-of-data problems

        self.states = tag_set
        self.symbols = word_set
        
        self.text_size = len(tagged_data)
        
        # Transition tables - p(t | tprev2, tprev)
        self.transition = defaultdict(float)
        self.transition_bigram = defaultdict(float)
        self.transition_unigram = defaultdict(float)
        self.state_uniform = self.div(1, len(self.states))

        # Emission tables - p(w | tprev, t)
        self.emission = defaultdict(float)
        self.emission_bigram = defaultdict(float)
        self.emission_unigram = defaultdict(float)
        self.symbol_uniform = self.div(1, len(self.symbols))

        unigram_tag_dist = defaultdict(int)
        bigram_tag_dist = defaultdict(int)
        trigram_tag_dist = defaultdict(int)

        unigram_output_dist = defaultdict(int)
        bigram_output_dist = defaultdict(int)
        trigram_output_dist = defaultdict(int)

        tprev, tprev2 = None, None
        for w, t in tagged_data:
            unigram_tag_dist[t] += 1
            bigram_tag_dist[tprev, t] += 1
            trigram_tag_dist[tprev2, tprev, t] += 1

            unigram_output_dist[w] += 1
            bigram_output_dist[t, w] += 1
            trigram_output_dist[tprev, t, w] += 1

            tprev2 = tprev
            tprev = t

        # Build transition tables
        for tprev2, tprev, t in trigram_tag_dist:
            # Use uniform distribution if tags not seen
            if (trigram_tag_dist[tprev2, tprev, t], bigram_tag_dist[tprev, t]) == (0, 0):
                self.transition[tprev2, tprev, t] = self.state_uniform
            self.transition[tprev2, tprev, t] = self.div(trigram_tag_dist[tprev2, tprev, t], bigram_tag_dist[tprev, t])
        
        for tprev, t in bigram_tag_dist:
            # Use uniform distribution if tags not seen
            if (bigram_tag_dist[tprev, t], unigram_tag_dist[t]) == (0, 0):
                self.transition_bigram[tprev, t] = self.state_uniform
            self.transition_bigram[tprev, t] = self.div(bigram_tag_dist[tprev, t], unigram_tag_dist[t])

        for t in unigram_tag_dist:
            self.transition_unigram[t] = self.div(unigram_tag_dist[t], self.text_size)
            
        # Build emission tables
        for tprev, t, w in trigram_output_dist:
            # Use uniform distribution if tags not seen
            if (trigram_output_dist[tprev, t, w], bigram_tag_dist[tprev, t]) == (0, 0):
                self.emission[tprev, t, w] = self.symbol_uniform
            self.emission[tprev, t, w] = self.div(trigram_output_dist[tprev, t, w], bigram_tag_dist[tprev, t])    
        
        for t, w in bigram_output_dist:
            # Use uniform distribution if tags not seen
            if (bigram_output_dist[t, w], unigram_tag_dist[t]) == (0, 0):
                self.emission_bigram[t, w] = self.symbol_uniform
            self.emission_bigram[t, w] = self.div(bigram_output_dist[t, w], unigram_tag_dist[t])

        for w in unigram_output_dist:
            self.emission_unigram[w] = self.div(unigram_output_dist[w], self.text_size)
        
        self.transition_smoother = LISmoother(self.state_uniform, self.transition_unigram, 
                                              self.transition_bigram, self.transition)
        self.emission_smoother = LISmoother(self.symbol_uniform, self.emission_unigram, 
                                            self.emission_bigram, self.emission)
    
    def smooth(self, heldout_data):
        """Smooth the transition and emission tables with linear interpolation smoothing"""
        heldout_trigrams = [(tprev, t, w) for (tprev, _), (t, w) in  nltk.bigrams(heldout_data)]
        self.transition_smoother.smooth(heldout_trigrams)
        self.emission_smoother.smooth(heldout_trigrams)
        
    def train_unsupervised(self, unlabeled_data):
        pass
        
    def tag(self, words):
        T = len(words)
        V = defaultdict(float)
        B = {}

        # Find the starting probabilities for each state
        symbol = words[0]
        for state in self.states:
            V[0, state] = self.p_emission(state, symbol)
            B[0, state] = None

        # Find the maximum probabilities for reaching each state at time t
        n_best = 100
        for t in range(1, T):
            symbol = words[t]
#             prev_states = [x for x in self.states if x[0] == None] if t == 1 else self.states
            prev_states = self.states
            
            for j in prev_states:
                sj = j
                best = None
                
#                 best_states = list(sorted(((s, V[t - 1, s]) for s in tagger.states), key=lambda x: x[1], reverse=True))[:n_best]
#                 best_states = [x[0] for x in best_states]

#                 next_states = [x for x in self.states if x[0] == sj[1]]
                next_states = self.states
                
                for i in next_states:
                    si = i
                    va = V[t - 1, i] * self.p_transition(sj, si)
                    if not best or va > best[0]:
                        best = (va, si)
                V[t, j] = best[0] * self.p_emission(sj, symbol)
                B[t, sj] = best[1]

        # Find the highest probability for the final state
        best = None
        for i in self.states:
            val = V[T - 1, i]
            if not best or val > best[0]:
                best = (val, i)

        # traverse the back-pointers B to find the state sequence
        current = best[1]
        sequence = [current]
        for t in range(T - 1, 0, -1):
            last = B[t, current]
            sequence.append(last)
            current = last

        sequence.reverse()
#         sequence = [s[1] for s in sequence]
        return sequence
    
    def evaluate(self, data):
        total, correct = 0, 0
        for sentence in tqdm(data):
            words, tags = zip(*sentence)
            predicted_tags = self.tag(words)
            for tag, pred in zip(tags, predicted_tags):
                if tag == pred:
                    correct +=1
                total += 1
            
        return correct / total
    
    def p_transition(self, tprev, t):
        return self.transition_smoother.lambdas.dot([
            self.state_uniform,
            self.transition_unigram[t],
            self.transition_bigram[tprev, t]
        ])

    def p_emission(self, t, w):
        return self.emission_smoother.lambdas.dot([
            self.symbol_uniform,
            self.emission_unigram[w],
            self.emission_bigram[t, w]
        ])
    
#     def p_transition(self, tprev, t):
#         tprev2, tprev = tprev
#         return self.transition_smoother.lambdas.dot([
#             self.state_uniform,
#             self.transition_unigram[t],
#             self.transition_bigram[tprev, t],
#             self.transition[tprev2, tprev]
#         ])

#     def p_emission(self, t, w):
#         tprev, t = t
#         return self.emission_smoother.lambdas[:3].dot([
#             self.symbol_uniform,
#             self.emission_unigram[w],
#             self.emission_bigram[t, w]
#         ])
    
    def div(self, a, b):
        """Divides a and b safely"""
        return a / b if b != 0 else 0

In [287]:
def isplit(iterable, splitters):
    # https://stackoverflow.com/a/4322780
    return [list(g) for k,g in itertools.groupby(iterable, lambda x:x in splitters) if not k]

def sentence_split(data, token=('###', '###')):
    return [[(token[0], token[0])] + g for g in isplit(data, (None, token))]

In [279]:
train, heldout, test = splits_en[0]

words, tags = list(zip(*(train + heldout + test)))
tag_set, word_set = list(set(tags)), list(set(words))
# tag_set, word_set = list(set(nltk.bigrams(tags, pad_left=True))), list(set(words))

labeled = train[:10_000]
# labeled = train
unlabeled = train[10_000:]

In [280]:
len(word_set), len(tag_set)

(20712, 62)

In [281]:
# Emissions, transitions
len(tag_set) ** 2 * len(word_set), len(tag_set) ** 3

(79616928, 238328)

In [282]:
tagger = HMMTagger(labeled, tag_set, word_set)

In [283]:
tagger.smooth(heldout)

Lambdas:
[0.33333333 0.33333333 0.33333333]
[0.28514722 0.71171577 0.00313701]
[1.59821155e-01 8.40136060e-01 4.27856639e-05]
[9.29713642e-02 9.07027953e-01 6.83290462e-07]
[5.75298161e-02 9.42470172e-01 1.19772173e-08]
[3.80347143e-02 9.61965285e-01 2.21394061e-10]
[2.68713134e-02 9.73128687e-01 4.21892898e-12]
[2.02602120e-02 9.79739788e-01 8.18460231e-14]
[1.62424952e-02 9.83757505e-01 1.60492310e-15]
[1.37533821e-02 9.86246618e-01 3.16787676e-17]
[1.21894433e-02 9.87810557e-01 6.27858922e-19]
[1.11968399e-02 9.88803160e-01 1.24760710e-20]
[1.05623750e-02 9.89437625e-01 2.48317432e-22]
[1.01548486e-02 9.89845151e-01 4.94758570e-24]
[9.89222379e-03 9.90107776e-01 9.86445444e-26]
[9.72260590e-03 9.90277394e-01 1.96762416e-27]
[9.61289746e-03 9.90387103e-01 3.92584859e-29]
[9.54187038e-03 9.90458130e-01 7.83437008e-31]
Lambdas:
[0.33333333 0.33333333 0.33333333]
[0.84297042 0.00987577 0.1471538 ]
[0.846141   0.00167462 0.15218439]
[8.46870576e-01 6.26380865e-04 1.52503043e-01]
[8.47119

In [None]:
# tagger.train_unsupervised(unlabeled)

In [285]:
sentences = sentence_split(test)

In [286]:
tagger.evaluate(sentences[:100]) # TODO: prune states

0.7478048780487805