Training a Dutch tagger
=======

_Practical Python for Linguistics and the Humanities -- Alexis Dimitriadis_

The NLTK doesn't come with a tagger for Dutch. Today we're going to build our own tagger, using the trainable taggers provided by the NLTK. We'll closely follow the procedure described in the NLTK book, so keep it open during this practicum for reference.

The file `nescio_full.txt` will be used near the end of this activity. Download it and save it where you can find it.

## Reading:

From the NLTK Book, chapter 5:

* Review [sections 1-2](http://www.nltk.org/book/ch05.html), on using taggers.
* [Section 4: Automatic tagging][ch5.4]
* [Section 5: Ngram tagging][ch5.5]


[ch5.4]: http://www.nltk.org/book/ch05.html#automatic-tagging
[ch5.5]: http://www.nltk.org/book/ch05.html#n-gram-tagging

## Contents


**[1. Step 1: Find training data](#1.-Step-1:-Find-training-data)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [1.1 A tagged corpus of Dutch](#1.1-A-tagged-corpus-of-Dutch)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [1.2 Separating training and testing data](#1.2-Separating-training-and-testing-data)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [1.3 Why do we need to train a tagger?](#1.3-Why-do-we-need-to-train-a-tagger?)  

**[2. Step 2: Build a "default tagger"](#2.-Step-2:-Build-a-"default-tagger")**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [2.1 Count the tags](#2.1-Count-the-tags)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [2.2 Making a default tagger](#2.2-Making-a-default-tagger)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [2.3 Preparing tagger input](#2.3-Preparing-tagger-input)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [2.4 Evaluating the tagger](#2.4-Evaluating-the-tagger)  

**[3. Step 3: Train a unigram tagger](#3.-Step-3:-Train-a-unigram-tagger)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [3.1 Backoff taggers](#3.1-Backoff-taggers)  

**[4. Step 4: Train a bigram tagger](#4.-Step-4:-Train-a-bigram-tagger)**  

**[5. Step 5: Save your tagger for future use](#5.-Step-5:-Save-your-tagger-for-future-use)**  

**[6. Using your tagger for tagging](#6.-Using-your-tagger-for-tagging)**  

**[7. Optional: Make your tagger into a stand-alone program](#7.-Optional:-Make-your-tagger-into-a-stand-alone-program)**  


<h2>Overview: Building a tagger</h2>

While it is possible to write a tagger that uses hand-crafted rules for tagging decisions, statistical methods like ngram taggers are simple and perform exceedingly well. But a trigram tagger, for example, can only draw inferences from trigrams it has seen before. What should it do when faced with a novel trigram? Rather than devise ad hoc solutions, nltk taggers can be connected to a "backoff tagger": If the main tagger cannot figure out the tag for a word, it delegates the problem to its backoff cousin.

The process can be repeated: In this activity we will use the NLTK's toolbox to train a bigram tagger. When it cannot tag a word, it will defer to a unigram tagger, which will in turn defer to another tagger when necessary.

Our training data is quite small, but in general training a tagger takes time. We'll learn how to save our trained tagger to disk and reload it when we want to tag new text.

## 1. Step 1: Find training data

### 1.1 A tagged corpus of Dutch

To train a tagger, we need an example of tagged text.
We'll use the [CONLL 2002 corpus](http://www.cnts.ua.ac.be/conll2002/ner/), which comes as part of the NLTK. Note that this corpus consists of newspaper text (four editions of the Belgian newspaper "De Morgen"). Our tagger will have higher error rates if used on other kinds of text.

CONLL stands for _Computational Natural Language Learning_, and each year's edition targets a different NLP task. CONLL2002 is about "language independent" Named Entity recognition, by which they meant languages other than English. The corpus includes texts in Dutch and in Spanish, so we must _always_ take care to only fetch the appropriate file.

The following code will import the CONLL 2002 corpus and assign the tagged sentences from the Dutch training set to the variable `cosents`. (Recall that `tagged_sents()`, like other nltk corpus methods, returns a "view"; so this is not a real list, but we can index it, slice it, or iterate over it as if it were.)

In [1]:
import nltk
from nltk.corpus import conll2002 as conll
cosents = conll.tagged_sents("ned.train")

**Your turn:** 
Add commands to print the first five sentences, in readable form and **without** the tags. Add a blank line between sentences for better readability. 

In [2]:
for sen in cosents[:5]:    # only the first five sentences
    print(" ".join(word for word, tag in sen))
    print()                # blank line between sentences

De tekst van het arrest is nog niet schriftelijk beschikbaar maar het bericht werd alvast bekendgemaakt door een communicatiebureau dat Floralux inhuurde . 

In '81 regulariseert de toenmalige Vlaamse regering de toestand met een BPA dat het bedrijf op eigen kosten heeft laten opstellen . 

publicatie 

Vandaag is Floralux dus met alle vergunningen in orde , maar het BPA waarmee die konden verkregen worden , was omstreden omdat zaakvoerster Christiane Vandenbussche haar schepenambt van ... 

In eerste aanleg werd Vandenbussche begin de jaren '90 veroordeeld wegens belangenvermenging maar later vrijgesproken door het hof van beroep in Gent . 

Now check that you have 15806 sentences, with a total of 202644 tokens.

In [6]:
# YOUR CODE:
print(cosents)
print(len(cosents))                        # number of sentences
print(sum(len(sen) for sen in cosents))    # total number of tokens



[[('De', 'Art'), ('tekst', 'N'), ('van', 'Prep'), ('het', 'Art'), ('arrest', 'N'), ('is', 'V'), ('nog', 'Adv'), ('niet', 'Adv'), ('schriftelijk', 'Adj'), ('beschikbaar', 'Adj'), ('maar', 'Conj'), ('het', 'Art'), ('bericht', 'N'), ('werd', 'V'), ('alvast', 'Adv'), ('bekendgemaakt', 'V'), ('door', 'Prep'), ('een', 'Art'), ('communicatiebureau', 'N'), ('dat', 'Conj'), ('Floralux', 'N'), ('inhuurde', 'V'), ('.', 'Punc')], [('In', 'Prep'), ("'81", 'Num'), ('regulariseert', 'V'), ('de', 'Art'), ('toenmalige', 'Adj'), ('Vlaamse', 'Adj'), ('regering', 'N'), ('de', 'Art'), ('toestand', 'N'), ('met', 'Prep'), ('een', 'Art'), ('BPA', 'N'), ('dat', 'Pron'), ('het', 'Art'), ('bedrijf', 'N'), ('op', 'Prep'), ('eigen', 'Pron'), ('kosten', 'N'), ('heeft', 'V'), ('laten', 'V'), ('opstellen', 'V'), ('.', 'Punc')], ...]
15806
202644


### 1.2 Separating training and testing data

As the NLTK book explains, you should always evaluate a tool on data it has not seen before. Split your corpus into two sublists, `test_sents` and `train_sents`: Reserve the first 1500 sentences (approximately 10%) for testing, and use the rest for training.

In [8]:
test_sents = cosents[0:1500]
train_sents = cosents[1500:]
print(len(test_sents))
print(len(train_sents))

1500
14306


### 1.3 Why do we need to train a tagger?

While POS tags are useful for searching, couldn't we just build a lookup table (dictionary) with the correct tag for each word? E.g., `het` is a determiner (`"Art"`), etc. We can't, because many words are ambiguous: They have multiple functions (or look like unrelated words with a different function). In our training data, the word _in_ appears with the most tags:

    in       ['Adj', 'Adv', 'Conj', 'Misc', 'N', 'Prep']

From a linguistic perspective, we might suspect that some of these tags are mistakes; but the majority are certainly correct. So while we _can_ build a tagger that assigns the "best" tag for each word, such a "unigram" tagger requires some counting to construct, and will not work as well as more complex solutions. We'll see this below. <br/>

**Optional:** Build a dictionary of all the words in the training corpus, listing the POS 
tags assigned to each one. How many words have more than one tag? What proportion of the 
different words in the corpus are ambiguous? Which ten words have the greatest number of different tags?
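
As a starting point for this optional exercise, here is a minimal sketch of one way to collect the tags seen with each word. It runs on a tiny hand-made list, `toy_sents`, which is an assumption made up for illustration; with the real data you would loop over `train_sents` instead.

```python
from collections import defaultdict

# Hypothetical toy data in the same (word, tag) format as the corpus.
toy_sents = [
    [("het", "Art"), ("huis", "N"), ("staat", "V"), ("in", "Prep"), ("Gent", "N")],
    [("hij", "Pron"), ("komt", "V"), ("in", "Adv")],
    [("het", "Pron"), ("regent", "V")],
]

# Map each word to the set of tags it occurs with.
tags_by_word = defaultdict(set)
for sent in toy_sents:
    for word, tag in sent:
        tags_by_word[word].add(tag)

# Words with more than one tag are ambiguous.
ambiguous = {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}
proportion = len(ambiguous) / len(tags_by_word)

print(ambiguous)
print("Proportion of ambiguous word types:", proportion)
```

Using a set per word means that repeated occurrences of the same word/tag pair are stored only once, so `len()` of the set directly gives the number of different tags.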

## 2. Step 2: Build a "default tagger"

**Important: From now on you should be working with the training data only**

A "default tagger" is hardly a real tagger: It assigns the same tag to everything. We'll use it as a fall-back when all else fails. But which tag should it assign? We'll get the best results if we assign the most common one. To find out which tag that is, we must count all the different tags.

### 2.1 Count the tags

**Your turn:** Use a python dictionary or `nltk.FreqDist` to count the uses of the tags in the training corpus, and print out the most common one.

In [46]:
# FILL IN:

# Count how often each tag occurs in the training data.
tagcounts = {}
for sen in train_sents:
    for word, tag in sen:
        tagcounts[tag] = tagcounts.get(tag, 0) + 1

# Turn the dictionary into a list of (tag, count) pairs, most frequent first.
tagcounts = sorted(tagcounts.items(), key=lambda item: item[1], reverse=True)

print(tagcounts)

[('N', 45649), ('V', 23868), ('Punc', 23759), ('Prep', 18687), ('Art', 16727), ('Pron', 13811), ('Adv', 12910), ('Adj', 12285), ('Conj', 8548), ('Num', 6292), ('Misc', 429), ('Int', 163)]


Counting is a common operation, and both python and the NLTK provide tools to make it easier. We have already seen the NLTK's `FreqDist()` object in an earlier notebook (maybe you even used it for the previous problem). It is an extension of python's `Counter` type, which behaves much like a "`defaultdict`": a special dictionary that treats a key it has not seen yet as having a count of zero, so you can add to a count without first checking whether the key exists.

We have been using these counters by passing the data to be counted directly to the constructor, but this is not the only way to use them; they have all the methods of ordinary dictionaries as well. The tags that we want to count now are embedded in sentences, so it is more convenient to create an empty counter and count the tags sentence by sentence:

In [26]:
from collections import Counter

newcounts = Counter()

for s in train_sents:
    newcounts.update(t for w, t in s)

print(newcounts.most_common(5))


[('N', 45649), ('V', 23868), ('Punc', 23759), ('Prep', 18687), ('Art', 16727)]


The `Counter` method `most_common()` will return a list of `(key, value)` pairs, like the `dict.items()` method but sorted by frequency. The argument to `most_common()` specifies how many pairs to return.
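
To make the behaviour of `Counter` concrete, here is a small sketch on made-up data (the tag lists below are invented for illustration, not taken from the corpus). It shows that a missing key reads as zero, that `update()` adds to the existing counts, and how the single most common entry can be unpacked.

```python
from collections import Counter

counts = Counter()
print(counts["N"])    # a missing key reads as 0 instead of raising KeyError

# update() adds to the existing counts rather than replacing them.
counts.update(["N", "V", "N"])
counts.update(["N", "Art"])

# most_common(1) returns a list holding one (key, count) pair.
top_tag, top_count = counts.most_common(1)[0]
print(top_tag, top_count)
```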

**Your turn:** Although we can see the most common tag name in the above output, it is embedded in a larger structure. Write one or more statements to extract the most common tag name and save it in the variable `maxtag`. Make sure `maxtag` ends up containing just the tag name (a string).

In [60]:
# FILL IN:
common = newcounts.most_common(5)
print(common)
maxtag = common[0][0]
print(maxtag)

[('N', 45649), ('V', 23868), ('Punc', 23759), ('Prep', 18687), ('Art', 16727)]
N


**Your turn:**
While we're at it, let's take a look at all the tags used in the CONLL corpus. Print them all out, in alphabetical order. (Hint: The tags are the keys of our dictionary).

In [61]:
# YOUR CODE:
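# The tags are the keys of the counter; sorted() returns them in alphabetical order.
# (Avoid naming a variable `list`: that would hide python's built-in list type.)
alltags = sorted(newcounts)
print(alltags)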


['Adj', 'Adv', 'Art', 'Conj', 'Int', 'Misc', 'N', 'Num', 'Prep', 'Pron', 'Punc', 'V']


### 2.2 Making a default tagger

We can now define a "default tagger" that always assigns the most common tag. The NLTK provides us with a suitable tagger class, so all we have to do is tell the constructor which tag to assign.

In [62]:
default_tagger = nltk.DefaultTagger(maxtag)

### 2.3 Preparing tagger input

We have a tagger! Sort of. But how do we use it? Like the rest of the NLTK's tools, it expects its input to be a list of tokens. So let's use the NLTK's tokenizer to create one:

In [63]:
testzin = nltk.word_tokenize("Bekijk het professionele profiel van Jan Doodle op LinkedIn.")
print(testzin)
print(default_tagger.tag(testzin))


['Bekijk', 'het', 'professionele', 'profiel', 'van', 'Jan', 'Doodle', 'op', 'LinkedIn', '.']
[('Bekijk', 'N'), ('het', 'N'), ('professionele', 'N'), ('profiel', 'N'), ('van', 'N'), ('Jan', 'N'), ('Doodle', 'N'), ('op', 'N'), ('LinkedIn', 'N'), ('.', 'N')]


Be aware that the nltk taggers expect to be applied to one sentence at a time. If you have a longer text, break it up into sentences first with `nltk.sent_tokenize()`, then break up each sentence into tokens with `nltk.word_tokenize()` as above (see [section 6][s6] below).

[s6]: #6.-Using-your-tagger-for-tagging

### 2.4 Evaluating the tagger

How good is our trivial tagger? To find out, we must run it on a bunch of tagged data (that's what our `test` data is for), and compare the answers with the correct tag. But of course the tagger expects untagged text as input, so we must extract the untagged version of each test sentence, tag it, and count how many tags are correct:

In [64]:
mytest =  test_sents[1] # Test with sentence 1 from the test corpus
print(mytest)
untagged = [ w for w,t in mytest ]
print(untagged)
retagged = default_tagger.tag(untagged)
print(retagged)
correct = 0
for i in range(len(retagged)):
    if retagged[i] == mytest[i]:
        correct += 1
print("Correct:", correct, "out of", len(untagged), "words. Proportion:", correct/len(untagged))

[('In', 'Prep'), ("'81", 'Num'), ('regulariseert', 'V'), ('de', 'Art'), ('toenmalige', 'Adj'), ('Vlaamse', 'Adj'), ('regering', 'N'), ('de', 'Art'), ('toestand', 'N'), ('met', 'Prep'), ('een', 'Art'), ('BPA', 'N'), ('dat', 'Pron'), ('het', 'Art'), ('bedrijf', 'N'), ('op', 'Prep'), ('eigen', 'Pron'), ('kosten', 'N'), ('heeft', 'V'), ('laten', 'V'), ('opstellen', 'V'), ('.', 'Punc')]
['In', "'81", 'regulariseert', 'de', 'toenmalige', 'Vlaamse', 'regering', 'de', 'toestand', 'met', 'een', 'BPA', 'dat', 'het', 'bedrijf', 'op', 'eigen', 'kosten', 'heeft', 'laten', 'opstellen', '.']
[('In', 'N'), ("'81", 'N'), ('regulariseert', 'N'), ('de', 'N'), ('toenmalige', 'N'), ('Vlaamse', 'N'), ('regering', 'N'), ('de', 'N'), ('toestand', 'N'), ('met', 'N'), ('een', 'N'), ('BPA', 'N'), ('dat', 'N'), ('het', 'N'), ('bedrijf', 'N'), ('op', 'N'), ('eigen', 'N'), ('kosten', 'N'), ('heeft', 'N'), ('laten', 'N'), ('opstellen', 'N'), ('.', 'N')]
Correct: 5 out of 22 words. Proportion: 0.22727272727272727


The NLTK can save us the trouble of repeating this for every sentence: The tagger comes with a built-in method `evaluate()`. We call it with an entire list of **tagged** sentences, and it will do exactly what we did and give us the over-all accuracy.

In [65]:
print("Default tagger accuracy:", default_tagger.evaluate(test_sents))

Default tagger accuracy: 0.2530744004919041


Our accuracy is about 25%. In other words, about one in four tokens in our test data is tagged as a noun.

The `evaluate()` method is a standard feature of the NLTK's tools. Note that it must be called with **tagged** data: After all, it needs to know the right answer in order to evaluate the tagger's performance! 
Internally, `evaluate()` will separate words from tags, retag the words using the trained tagger, and compare the result to the correct tags to measure performance.
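
The procedure just described can be sketched in a few lines of plain python. This is only an illustration of the idea, not the NLTK's actual implementation; the function `accuracy()`, the stand-in tagger `always_noun()` and the data in `toy_gold` are all hypothetical names invented for this example.

```python
def accuracy(tag_sentence, gold_sents):
    """Proportion of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold in gold_sents:
        words = [w for w, t in gold]       # strip the tags
        predicted = tag_sentence(words)    # retag the bare words
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    return correct / total


# Hypothetical stand-in for a default tagger: everything is a noun.
def always_noun(words):
    return [(w, "N") for w in words]


toy_gold = [
    [("de", "Art"), ("hond", "N"), ("blaft", "V"), (".", "Punc")],
    [("het", "Art"), ("huis", "N")],
]
print(accuracy(always_noun, toy_gold))
```

Only two of the six toy tokens are nouns, so the score is one third. Note that the counts are pooled over all tokens rather than averaged per sentence, so long sentences weigh more heavily than short ones.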

## 3. Step 3: Train a unigram tagger

A unigram tagger decides the tag for each word without considering context. Some words only ever occur in our corpus with one tag, while others are ambiguous. Clearly, the best strategy for the unigram tagger is to assign to each word the most common tag that it occurs with. Again, the nltk makes it easy to train a unigram tagger. Find out how by looking in the NLTK book, [section 5.5.1](http://www.nltk.org/book/ch05.html#unigram-tagging), and train and evaluate one now.

In [68]:
# FILL IN:

# Train on the Dutch training sentences and evaluate on the held-out test sentences.
unigram_tagger = nltk.UnigramTagger(train_sents)
print("Unigram tagger accuracy:", unigram_tagger.evaluate(test_sents))

You should get an accuracy of about 0.86. This is slightly higher than the 0.81 that the NLTK book reports (in section 5.2) for a unigram tagger trained on the Brown corpus, when evaluated on unseen text: Perhaps Dutch words are slightly less ambiguous than English words? But our tagset is very different from the Brown corpus tagset (which is much more fine-grained), so it's not really fair to compare.
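
To see what "the most common tag for each word" amounts to, here is a hand-rolled sketch of the idea in plain python. It is a simplified illustration and not how the NLTK implements `UnigramTagger`; the names `toy_train`, `best_tag` and `unigram_tag()` are invented for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: "het" is ambiguous.
toy_train = [
    [("het", "Art"), ("huis", "N")],
    [("het", "Art"), ("regent", "V")],
    [("het", "Pron"), ("regent", "V")],
]

# Count how often each word occurs with each tag.
tag_counts = defaultdict(Counter)
for sent in toy_train:
    for word, tag in sent:
        tag_counts[word][tag] += 1

# Keep only the most frequent tag per word.
best_tag = {word: c.most_common(1)[0][0] for word, c in tag_counts.items()}


def unigram_tag(words):
    # Unknown words get None here, since there is nothing to fall back on.
    return [(w, best_tag.get(w)) for w in words]


print(unigram_tag(["het", "huis", "brandt"]))
```

The word _brandt_ never occurs in the toy training data, so this simple tagger has no answer for it.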

### 3.1 Backoff taggers

What should a unigram tagger do with unknown words? Instead of making provision for ad hoc solutions, nltk taggers can be provided with a "backoff tagger": If the main tagger cannot figure out the tag for a word, it delegates the problem to its backoff cousin. The backoff tagger can have its own backoff tagger, which makes it possible to chain several taggers together in a fallback chain from the most advanced to the most general.

Create a new unigram tagger, adding our default tagger `default_tagger` as its backoff tagger. (The NLTK book shows how to do this.) Evaluate it again; its accuracy should go up to about 0.941. Not bad for such simple methods! But that's still one wrong tag in twenty; most sentences will contain an error. Moreover, improving performance gets a lot harder from here.

In [None]:
# YOUR CODE:


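To see the effect of a backoff tagger in isolation, here is a small sketch on made-up data. The sentences in `toy_train` and the variable names are assumptions for illustration only; in the notebook you would train on `train_sents` and pass your own `default_tagger` as the backoff.

```python
import nltk

# Hypothetical toy training data; use train_sents in the notebook.
toy_train = [
    [("de", "Art"), ("hond", "N"), ("blaft", "V")],
    [("het", "Art"), ("huis", "N"), ("staat", "V")],
]

toy_default = nltk.DefaultTagger("N")
no_backoff = nltk.UnigramTagger(toy_train)
with_backoff = nltk.UnigramTagger(toy_train, backoff=toy_default)

sentence = ["de", "kat", "blaft"]    # "kat" was never seen in training
print(no_backoff.tag(sentence))      # the unknown word is tagged None
print(with_backoff.tag(sentence))    # the unknown word gets the default tag
```

Without a backoff, the unknown word is simply left untagged (`None`); with one, every token receives some tag, which is why accuracy goes up.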

## 4. Step 4: Train a bigram tagger

Thanks to the nltk's consistent interface, training a bigram tagger is no different from training the unigram version. As a backoff tagger, give it our unigram tagger (with its own backoff tagger). Your accuracy should go up to 0.947.

In [None]:
# YOUR CODE:


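The following sketch shows, on made-up data, a case where a bigram tagger can do better than a unigram tagger: the word _werk_ is usually a noun in the toy data, but a verb when it follows a pronoun. The data in `toy_train` and the names `t0`, `t1`, `t2` are assumptions for illustration; in the notebook you would train on `train_sents`.

```python
import nltk

# Hypothetical toy training data: "werk" is N after an article, V after a pronoun.
toy_train = [
    [("het", "Art"), ("werk", "N")],
    [("het", "Art"), ("werk", "N")],
    [("ik", "Pron"), ("werk", "V")],
]

t0 = nltk.DefaultTagger("N")
t1 = nltk.UnigramTagger(toy_train, backoff=t0)
t2 = nltk.BigramTagger(toy_train, backoff=t1)

print(t1.tag(["ik", "werk"]))    # unigram: picks the most frequent tag for "werk"
print(t2.tag(["ik", "werk"]))    # bigram: also looks at the tag of the previous word
```

The bigram tagger only overrides its backoff where the preceding tag makes a difference; for everything else the answer still comes from the unigram or default tagger.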

Compare the accuracy of the bigram tagger with that of the unigram tagger: the improvement is quite small, and it gets even smaller if we go from bigrams to trigrams. While these methods are indeed more powerful, we are reaching the limits of what we can do with our relatively small corpus. Too many bi- or trigrams in new text were never seen during training.

## 5. Step 5: Save your tagger for future use

**Reference:** Section [Storing taggers](http://www.nltk.org/book/ch05.html#storing-taggers) in the NLTK book.

Training our taggers was not unbearably slow, but only because the CONLL corpus is so tiny. Training a tagger on a larger corpus would take so long that we would not want to do it more than once. We must store the trained "model" of our tagger for future use.

The NLTK relies on python's `pickle` module, which allows us to write out and later restore large, complex objects very easily. We can use it to dump our bigram tagger; it will automatically also dump the backoff taggers that it incorporates.

Note that the output file, `nl-tagger.pickle`, is opened with mode `"wb"`. The `b` stands for "binary", and tells python not to apply certain transformations that are appropriate when writing text to a file. It must always be used when dumping or loading data with `pickle`.

**Your turn:** Run the code below to save your bigram tagger.

In [None]:
import pickle

# The "with" statement closes the file automatically, even if an error occurs.
with open("nl-tagger.pickle", "wb") as output:
    pickle.dump(bigram_tagger, output)

## 6. Using your tagger for tagging

The purpose of taggers is to tag new text. We'll now see how you can load your pickled tagger and use it to tag regular text.

"Unpickling" a dumped object is as simple as dumping it: We simply need to open the pickled file in binary mode and call the pickle module's `load()` function with the open file. If all goes well, it will return an object that is identical to the `bigram_tagger` object we pickled.

In [None]:
# Use a name other than `input`, which would hide python's built-in input() function.
with open("nl-tagger.pickle", "rb") as infile:
    unpickled_tagger = pickle.load(infile)
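
If you want to convince yourself that pickling really restores an equivalent object, you can try the round trip on something small first. The sketch below uses a plain dictionary, `toy_model`, as a hypothetical stand-in for a trained tagger, and writes to a temporary directory so that it does not touch `nl-tagger.pickle`.

```python
import os
import pickle
import tempfile

# Stand-in for a trained tagger: any picklable python object works the same way.
toy_model = {"het": "Art", "huis": "N"}

path = os.path.join(tempfile.mkdtemp(), "toy-model.pickle")

# Write in binary mode ("wb") ...
with open(path, "wb") as outfile:
    pickle.dump(toy_model, outfile)

# ... and read back in binary mode ("rb").
with open(path, "rb") as infile:
    restored = pickle.load(infile)

print(restored == toy_model)    # same content
print(restored is toy_model)    # but a new, separate object
```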

###  **Your turn:** 

Write an _independent_ script that loads your pickled tagger and uses it to tag the simple text below. Print out the tagged sentences in a readable manner.

Your script should be a complete, stand-alone script. You can develop it in this notebook, but ensure it works independently of earlier code cells by restarting the python "kernel" (`Kernel menu > Restart`, or the <button><strong><span style="font-size: 144%">⟳</span></strong></button> button on the toolbar), and running only the code cell below.

Do not manually change the format of the string `hugo`: Your script must first tokenize this text into separate sentences and words, then tag each sentence. Do so by first splitting the string into a list of one-sentence strings, using `nltk.sent_tokenize()`. Then split each string into a list of words, collecting the results into a list of sentences in the usual nltk format.

In [None]:
import nltk
import pickle

hugo = """De mensen moeten eens dringend leren met één voet op de grond 
te staan. Want wie steeds met beide voeten op de grond staat komt niet 
vooruit in het leven, tenzij men hem met kluit en al uitgraaft en per 
kruiwagen vervoert."""

# YOUR CODE:
...

### **Your turn:**
Now write a script to read and tag the file `nescio_full.txt` (remember to download it). Write out
the tagged text to a file `nescio_tagged.txt`, in a form that can be read back by the NLTK's `TaggedCorpusReader`. (That is, words and tags must be separated by `/` ). Open the generated file with a text editor and inspect it to ensure it looks right.

In [None]:
# YOUR CODE:


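As a hint for the output format, here is a minimal sketch of how a tagged sentence can be turned into a line of `word/tag` tokens. The helper `format_tagged()` and the data in `toy_tagged` are invented for this example. The sketch assumes one sentence per line with tokens separated by spaces, which to our knowledge matches the default settings of `TaggedCorpusReader`; check the NLTK documentation if you use other settings.

```python
def format_tagged(tagged_sent, sep="/"):
    """Join (word, tag) pairs into a single line like 'De/Art hond/N'."""
    return " ".join(f"{word}{sep}{tag}" for word, tag in tagged_sent)


# Hypothetical tagger output for two short sentences.
toy_tagged = [
    [("De", "Art"), ("hond", "N"), ("blaft", "V"), (".", "Punc")],
    [("Het", "Art"), ("regent", "V"), (".", "Punc")],
]

lines = [format_tagged(sent) for sent in toy_tagged]
print("\n".join(lines))
```

To write the result to a file, open the file for writing in text mode and write each line followed by a newline.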

## 7. Optional: Make your tagger into a stand-alone program

Make a version of your script that can be used from the command line, without editing, to tag arbitrary files. Your script should find the name of the file to be tagged, and the name of the file where the tagged version will be saved, from the list `sys.argv` (i.e., from the command line or via drag-and-drop on Windows). 

In [None]:
# YOUR CODE:

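Here is one possible sketch of the command-line handling only; loading the tagger and tagging the file are left to you. The function name `parse_args()` and the script name `tagger.py` are hypothetical. To keep the example runnable inside a notebook, it is called with an explicit list; in a real script you would pass `sys.argv` instead.

```python
import sys


def parse_args(argv):
    """Return (input_path, output_path) from an argv-style list."""
    if len(argv) != 3:
        # SystemExit stops the script and prints the message to the terminal.
        raise SystemExit(f"Usage: {argv[0]} INPUT_FILE OUTPUT_FILE")
    return argv[1], argv[2]


# In a stand-alone script this would be: parse_args(sys.argv)
infile, outfile = parse_args(["tagger.py", "nescio_full.txt", "nescio_tagged.txt"])
print(infile, "->", outfile)
```

Remember that `sys.argv[0]` is the name of the script itself, so the two file names are at positions 1 and 2.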
