# [Assignment #3: NPFL067 Statistical NLP II](http://ufal.mff.cuni.cz/~hajic/courses/npfl067/assign3.html)

## Tagging

### Author: Dan Kondratyuk

### March 2, 2018

---

Code and explanation of results is fully viewable within this webpage.

## Files

- [index.html](./index.html) - Contains all veiwable code and a summary of results
- [README.md](./README.md) - Instructions on how to run the code with Python
- [nlp-assignment-3.ipynb](./nlp-assignment-1.ipynb) - Jupyter notebook where code can be run
- [requirements.txt](./requirements.txt) - Required python packages for running

## 1. Brill's Tagger & Tagger Evaluation

>  For this whole homework, use data found in `texten2.ptg`, `textcz2.ptg`

> In the following, "the data" refers to both English and Czech, as usual.

> Split the data in the following way: use last 40,000 words for testing (data S), and from the remaining data, use the last 20,000 for smoothing (data H, if any). Call the rest "data T" (training). 

> Download Eric Brill's supervised tagger from [UFAL's course assignment space](http://ufal.mff.cuni.cz/~hajic/courses/npfl067/RULE_BASED_TAGGER_V.1.14.tar.gz). Install it (i.e., uncompress (gunzip), untar, and make).

> You might need to make some changes in his makefile of course (it's and OLD program, in this fast changing world...).

> After installation, get the data, train it on as much data from T as time allows (in the package, there is an extensive documentation on how to train it on new data), and evaluate on data S. Tabulate the results.

> Do cross-validation of the results: split the data into S', [H',] T' such that S' is the first 40,000 words, and T' is the last but the first 20,000 words from the rest. Train Eric Brill's tagger on T' (again, use as much data as time allows) and evaluate on S'. Again, tabulate the results.

> Do three more splits of your data (using the same formula: 40k/20k/the rest) in some way or another (as different as possible), and get another three sets of results. Compute the mean (average) accuracy and the standard deviation of the accuracy. Tabulate all results. 

In [74]:
import numpy as np
import pandas as pd
import collections as c
import nltk
from nltk.tag import BrillTaggerTrainer, RegexpTagger, UnigramTagger
from nltk.tag.brill import brill24
from nltk.tag import hmm
from sklearn.metrics import accuracy_score
import itertools

In [75]:
def isplit(iterable,splitters):
    return [list(g) for k,g in itertools.groupby(iterable,lambda x:x in splitters) if not k]

In [79]:
def open_text(filename):
    """Reads a text line by line, applies light preprocessing, and returns an array of words and tags"""
    with open(filename, encoding='iso-8859-2') as f:
        content = f.readlines()
        
    preprocess = (word.strip().rsplit('/', 1) for word in content)
    
    return isplit(preprocess, (None, '###/###'))

In [3]:
def split_tags(words):
    return [tuple(word.rsplit('/', 1)) for word in words]

In [4]:
def strip_tags(words):
    return [word.rsplit('/', 1)[0] for word in words]

In [5]:
# Read the texts into memory
english = './data/texten2.ptg'
czech = './data/textcz2.ptg'

words_en = open_text(english)
words_cz = open_text(czech)

In [6]:
def split_data(words, start=0):
    test, heldout, train = words[:start+40_000],  words[start+40_000:start+60_000], words[start+60_000:]
    return train, heldout, test

In [7]:
def split_data_end(words):
    test, heldout, train = words[40_000:],  words[40_000:60_000], words[:60_000]
    return train, heldout, test

In [8]:
def split_all(words):
    return [
        split_data_end(words),
        split_data(words, start=60_000 * 0),
        split_data(words, start=60_000 * 1),
        split_data(words, start=60_000 * 2),
        split_data(words, start=60_000 * 3)
    ]

In [9]:
# Taken from https://github.com/nltk/nltk/blob/a84b28ca26ea3ee53da4eaafc2bbf037847779bd/nltk/tbl/demo.py
REGEXP_TAGGER = RegexpTagger(
    [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),   # cardinal numbers
     (r'(The|the|A|a|An|an)$', 'AT'),   # articles
     (r'.*able$', 'JJ'),                # adjectives
     (r'.*ness$', 'NN'),                # nouns formed from adjectives
     (r'.*ly$', 'RB'),                  # adverbs
     (r'.*s$', 'NNS'),                  # plural nouns
     (r'.*ing$', 'VBG'),                # gerunds
     (r'.*ed$', 'VBD'),                 # past tense verbs
     (r'.*', 'NN')                      # nouns (default)
])
templates = brill24()

In [10]:
def brill_tagger(train, heldout, baseline_backoff_tagger=REGEXP_TAGGER, templates=templates, trace=0, 
                 ruleformat='str', max_rules=300, min_score=3, min_acc=None):
    baseline_tagger = UnigramTagger(heldout, backoff=baseline_backoff_tagger)
    trainer = BrillTaggerTrainer(baseline_tagger, templates, trace=trace, ruleformat=ruleformat)
    tagger = trainer.train(train, max_rules, min_score, min_acc)
    return tagger

In [11]:
def evaluate(split, i=0):
    train, heldout, test = [split_tags(s) for s in split]
    
    print('Evaluating Brill Tagger [{}]'.format(i))
    tagger = brill_tagger([train], [heldout])
    
    test_words = [w for w,_ in test]
    predicted_tags = [t for _,t in tagger.tag(test_words)]
    true_tags = [t for _,t in test]
    return accuracy_score(true_tags, predicted_tags)

In [12]:
splits_en = split_all(words_en)
splits_cz = split_all(words_cz)

In [13]:
print('English')
accuracies_en = [evaluate(split, i) for i,split in enumerate(splits_en)]
print('Czech')
accuracies_cz = [evaluate(split, i) for i,split in enumerate(splits_cz)]

English
Evaluating Brill Tagger [0]
Evaluating Brill Tagger [1]
Evaluating Brill Tagger [2]
Evaluating Brill Tagger [3]
Evaluating Brill Tagger [4]
Czech
Evaluating Brill Tagger [0]
Evaluating Brill Tagger [1]
Evaluating Brill Tagger [2]
Evaluating Brill Tagger [3]
Evaluating Brill Tagger [4]


In [16]:
acc_str_en = ' '.join(['{0:0.1f}'.format(i * 100) for i in accuracies_en])
acc_str_cz = ' '.join(['{0:0.1f}'.format(i * 100) for i in accuracies_cz])

row_en = ['English', acc_str_en, np.mean(accuracies_en) * 100, np.std(accuracies_en)]
row_cz = ['Czech',   acc_str_cz, np.mean(accuracies_cz) * 100, np.std(accuracies_cz)]

columns = ['Language', 'Accuracies', 'Mean', 'Standard Deviation']
brill_results = pd.DataFrame([row_en, row_cz], columns=columns)
brill_results

Unnamed: 0,Language,Accuracies,Mean,Standard Deviation
0,English,90.7 92.8 92.0 90.6 87.6,90.769112,0.017659
1,Czech,64.7 80.4 77.5 76.3 75.0,74.796807,0.053395


In [None]:
# def save_brill(split):
#     train, heldout, test = split
    
#     tagged_all = ' '.join(train + heldout)
#     tagged_1 = ' '.join(train)
#     tagged_2 = ' '.join(heldout)
    
#     untagged_1 = ' '.join(strip_tags(train))
#     untagged_2 = ' '.join(strip_tags(heldout))
    
#     with open('./data/TAGGED-CORPUS-ENTIRE', 'w', encoding='iso-8859-2') as f: f.write(tagged_all)
#     with open('./data/TAGGED-CORPUS', 'w', encoding='iso-8859-2') as f: f.write(tagged_1)
#     with open('./data/TAGGED-CORPUS-2', 'w', encoding='iso-8859-2') as f: f.write(tagged_2)
#     with open('./data/UNTAGGED-CORPUS', 'w', encoding='iso-8859-2') as f: f.write(untagged_1)
#     with open('./data/UNTAGGED-CORPUS-2', 'w', encoding='iso-8859-2') as f: f.write(untagged_2)

# save_brill(splits_en[0])

## 2. Unsupervised Learning: HMM Tagging

> Use the datasets T, H, and S. Estimate the parameters of an HMM tagger using supervised learning off the T data (trigram and lower models for tags). Smooth (both the trigram tag model as well as the lexical model) in the same way as in Homework No. 1 (use data H). Evaluate your tagger on S, using the Viterbi algorithm.

> Now use only the first 10,000 words of T to estimate the initial (raw) parameters of the HMM tagging model. Strip off the tags from the remaining data T. Use the Baum-Welch algorithm to improve on the initial parameters. Smooth as usual. Evaluate your unsupervised HMM tagger and compare the results to the supervised HMM tagger.

> Tabulate and compare the results of the HMM tagger vs. the Brill's tagger. 

In [38]:
def evaluate(split, i=0):
    train, heldout, test = [split_tags(s) for s in split]
    
    print('Evaluating HMM [{}]'.format(i))
    trainer = hmm.HiddenMarkovModelTrainer()
    tagger = trainer.train_supervised([train])
    
    return tagger.evaluate([test])

In [39]:
splits_en = split_all(words_en)
splits_cz = split_all(words_cz)

In [45]:
train, heldout, test = [split_tags(s) for s in splits_en[0]]

trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised([train])

In [68]:
tagger._outputs

AttributeError: 'ConditionalProbDist' object has no attribute 'freqdist'

In [52]:
data = test[:1000]

test_words = [w for w,_ in data]
predicted_tags = [t for _,t in tagger.tag(test_words)]
true_tags = [t for _,t in data]

In [54]:
test_words

['on',
 'behalf',
 'of',
 'the',
 'syndicates',
 '.',
 '###',
 'It',
 'could',
 'take',
 'six',
 'months',
 'for',
 'a',
 'claim',
 'to',
 'be',
 'paid',
 '.',
 '###',
 '``',
 'The',
 'system',
 ',',
 "''",
 'says',
 'Nicholas',
 'Samengo-Turner',
 ',',
 'a',
 'Lloyd',
 "'s",
 'broker',
 'who',
 'left',
 'the',
 'exchange',
 'in',
 '1985',
 ',',
 '``',
 'is',
 'so',
 'ludicrously',
 'unprofessional',
 'it',
 'drives',
 'you',
 'mad',
 '.',
 "''",
 '###',
 'Some',
 'maintain',
 'underwriters',
 'also',
 'have',
 'been',
 'inept',
 '.',
 '###',
 'John',
 'Wetherell',
 ',',
 'a',
 'Lloyd',
 "'s",
 'underwriter',
 ',',
 'says',
 'he',
 'and',
 'his',
 'fellow',
 'underwriters',
 'underestimated',
 'by',
 'as',
 'much',
 'as',
 '50',
 '%',
 'the',
 'premiums',
 'they',
 'should',
 'have',
 'charged',
 'for',
 'property',
 'risks',
 'from',
 '1980',
 'to',
 '1985',
 '.',
 '###',
 '``',
 'How',
 'unprofessional',
 'we',
 'must',
 'have',
 'appeared',
 'to',
 'the',
 'outside',
 'world',
 '--'

In [36]:
print('English')
accuracies_en = [evaluate(split, i) for i,split in enumerate(splits_en)]
print('Czech')
accuracies_cz = [evaluate(split, i) for i,split in enumerate(splits_cz)]

English
Evaluating HMM [0]


In [37]:
acc_str_en = ' '.join(['{0:0.1f}'.format(i * 100) for i in accuracies_en])
acc_str_cz = ' '.join(['{0:0.1f}'.format(i * 100) for i in accuracies_cz])

row_en = ['English', acc_str_en, np.mean(accuracies_en) * 100, np.std(accuracies_en)]
row_cz = ['Czech',   acc_str_cz, np.mean(accuracies_cz) * 100, np.std(accuracies_cz)]

columns = ['Language', 'Accuracies', 'Mean', 'Standard Deviation']
hmm_results = pd.DataFrame([row_en, row_cz], columns=columns)
hmm_results

Unnamed: 0,Language,Accuracies,Mean,Standard Deviation
0,English,4.0,4.032697,0.0
1,English,4.0,4.032697,0.0


In [25]:
class LanguageModel:
    """Counts words and calculates probabilities (up to trigrams)"""
    
    def __init__(self, words):
        # Prepend two tokens to avoid beginning-of-data problems
        words = np.array(['<ss>', '<s>'] + list(words))
        
        # Unigrams
        self.unigrams = words
        self.unigram_set = list(set(self.unigrams))
        self.unigram_count = len(self.unigram_set)
        self.total_unigram_count = len(self.unigrams)
        self.unigram_dist = c.Counter(self.unigrams)
        
        # Bigrams
        self.bigrams = list(nltk.bigrams(words))
        self.bigram_set = list(set(self.bigrams))
        self.bigram_count = len(self.bigram_set)
        self.total_bigram_count = len(self.bigrams)
        self.bigram_dist = c.Counter(self.bigrams)
        
        # Trigrams
        self.trigrams = list(nltk.trigrams(words))
        self.trigram_set = list(set(self.trigrams))
        self.trigram_count = len(self.trigram_set)
        self.total_trigram_count = len(self.trigrams)
        self.trigram_dist = c.Counter(self.trigrams)
    
    def count(ngrams):
        ngram_set = list(set(ngrams))
        ngram_count = len(ngram_set)
        total_ngram_count = len(ngrams)
        ngram_dist = c.Counter(ngrams)
        return ngram_set, ngram_count, total_ngram_count, ngram_dist
        
    def p_uniform(self):
        """Calculates the probability of choosing a word uniformly at random"""
        return self.div(1, self.unigram_count)
    
    def p_unigram(self, w):
        """Calculates the probability a unigram appears in the distribution"""
        return self.div(self.unigram_dist[w], self.total_unigram_count)
    
    def p_bigram_cond(self, wprev, w):
        """Calculates the probability a word appears in the distribution given the previous word"""
        # If neither ngram has been seen, use the uniform distribution for smoothing purposes
        if ((self.bigram_dist[wprev, w], self.unigram_dist[wprev]) == (0,0)):
            return self.p_uniform()
        
        return self.div(self.bigram_dist[wprev, w], self.unigram_dist[wprev])
    
    def p_trigram_cond(self, wprev2, wprev, w):
        """Calculates the probability a word appears in the distribution given the previous word"""
        # If neither ngram has been seen, use the uniform distribution for smoothing purposes
        if ((self.trigram_dist[wprev2, wprev, w], self.bigram_dist[wprev2, wprev]) == (0,0)):
            return self.p_uniform()
        
        return self.div(self.trigram_dist[wprev2, wprev, w], self.bigram_dist[wprev2, wprev])
    
    def div(self, a, b):
        """Divides a and b safely"""
        return a / b if b != 0 else 0

In [26]:
def init_lambdas(n=3):
    """Initializes a list of lambdas for an ngram language model with uniform probabilities"""
    return np.array([1 / (n + 1)] * (n + 1))

In [27]:
def p_smoothed(lm, lambdas, wprev2, wprev, w):
    """Calculate the smoothed trigram probability using the weighted product of lambdas"""
    return np.multiply(lambdas, [
        lm.p_uniform(),
        lm.p_unigram(w),
        lm.p_bigram_cond(wprev, w),
        lm.p_trigram_cond(wprev2, wprev, w)
    ])

In [28]:
def expected_counts(lm, lambdas, heldout):
    """Computes the expected counts by smoothing across all trigrams and summing them all together"""
    smoothed_probs = (p_smoothed(lm, lambdas, *trigram) for trigram in heldout) # Multiply lambdas by probabilities
    return np.sum(smoothed / np.sum(smoothed) for smoothed in smoothed_probs) # Element-wise sum

In [29]:
def next_lambda(lm, lambdas, heldout):
    """Computes the next lambda from the current lambdas by normalizing the expected counts"""
    expected = expected_counts(lm, lambdas, heldout)
    return expected / np.sum(expected) # Normalize

In [30]:
def em_algorithm(train, heldout, stop_tolerance=1e-4):
    """Computes the EM algorithm for linear interpolation smoothing"""
    lambdas = init_lambdas(3)
    
    lm = LanguageModel(train)
    heldout_trigrams = LanguageModel(heldout).trigrams
    
    print('Lambdas:')
    
    next_l = next_lambda(lm, lambdas, heldout_trigrams)
    while not np.all([diff < stop_tolerance for diff in np.abs(lambdas - next_l)]):
        print(next_l)
        lambdas = next_l
        next_l = next_lambda(lm, lambdas, heldout_trigrams)

    lambdas = next_l
    return lambdas

In [31]:
train, heldout, test = splits_en[0]

In [38]:
lm_en = LanguageModel(train)

In [39]:
lambdas_en = em_algorithm(train, heldout)

Lambdas:
[0.00171272 0.01427302 0.23094838 0.75306588]
[6.09494990e-06 4.42538007e-04 1.01261852e-01 8.98289515e-01]
[3.49390171e-08 6.14055420e-05 4.23362189e-02 9.57602341e-01]
[7.19392903e-10 5.15530093e-05 1.74922896e-02 9.82456157e-01]
[1.71849267e-11 5.12945905e-05 7.20653209e-03 9.92742173e-01]
[4.12379908e-13 5.12845010e-05 2.97099414e-03 9.96977721e-01]
[9.90029950e-15 5.12829547e-05 1.22672529e-03 9.98721992e-01]
[2.37739409e-16 5.12824581e-05 5.07102930e-04 9.99441615e-01]
[5.70960107e-18 5.12822773e-05 2.09756728e-04 9.99738961e-01]
[1.37130442e-19 5.12822076e-05 8.67883404e-05 9.99861929e-01]
