# Second Group Assignment: A Named Entity Recognizer for Dutch

## Contents

[Introduction](#Introduction)  
[0. Preparation: Training data](#0.-Preparation:-Training-data)  

[1. Step 1: A minimal NER tagger for Dutch](#1.-Step-1:-A-minimal-NER-tagger-for-Dutch)  
[2. Step 2: Turn it into a script](#2.-Step-2:-Turn-it-into-a-script)  
[3. Pickling and unpickling successfully](#3.-Pickling-and-unpickling-successfully)  
[4. Self-testing](#4.-Self-testing)  
[5. Step 3: Improve the feature selection](#5.-Step-3:-Improve-the-feature-selection)  
[6. Step 4: Compare machine learning engines](#6.-Step-4:-Compare-machine-learning-engines)  
[7. Step 5: Performance evaluation](#7.-Step-5:-Performance-evaluation)  
[8. Step 6: Submission](#8.-Step-6:-Submission)  

[9. Practicalities](#9.-Practicalities)  


## References

* NLTK book, Chapter 6: Classification and classifiers
* NLTK book, Chapter 7: Chunking and named entity recognition
* Jurafsky and Martin, ch. 22.1, give a high-level introduction.

## Introduction

The goal of this activity is to construct a Named Entity Recognizer
(NER): A device that can scan natural text, identify named entities
such as persons, places and organizations that are referred to by
name, and classify them according to type (PERSON, LOCATION, etc.).

You will train a classifier on the Dutch component of the CONLL2002
corpus. The corpus also includes a Spanish
component, so always specify which files you want to read.

The necessary background concepts and software techniques are
presented in chapter 7 of the NLTK book:

[Section 7.2]: http://www.nltk.org/book/ch07.html#sec-chunking
[Section 7.3]: http://www.nltk.org/book/ch07.html#developing-and-evaluating-chunkers
[Section 7.5]: http://www.nltk.org/book/ch07.html#named-entity-recognition

* [Section 7.2][] presents the concept of _chunking_, and how the NLTK
manages its chunked corpora.

* [Section 7.3][] shows how to build and evaluate chunkers with the
help of the NLTK's chunked corpora. The discussion is based on the
CONLL2000 corpus (note the year), a corpus of _English_ text in which
all noun phrases are indicated.

* Finally, [Section 7.5][] briefly covers the task of Named Entity
Recognition. (Tip for the impatient: Sections 7.2 and 7.3 are
essential reading; do not skip them.)

In the CONLL2002 corpus, which contains Spanish and Dutch components,
only named entities have been chunked. Although the content of the
chunks is different, the structure and interface of the corpora are the
same: The text is annotated with POS tags, chunks, and chunk types.
Thus the procedures for chunking noun phrases can be adapted to the NER task with
minimal changes: Just train on the Dutch CONLL2002 corpus, and
recognize _its_ chunks.

* [Practicum 12 (Week 7)](https://uu.blackboard.com/webapps/blackboard/content/listContentEditable.jsp?content_id=_3158377_1&course_id=_120751_1) will explain a lot of what you need to know to do this assignment.

-----

**Note:** The nltk's classifiers and taggers need the external `numpy`
library, but fail in a very confusing way if it is not found. The
Anaconda distribution includes `numpy`, so that's not a problem unless
you are using python without Anaconda.

## 0. Preparation: Training data

We will once again use the `CONLL2002` corpus, which was specifically created for the task of named entity recognition.

In earlier practica, we split the file `ned.train` into training
and testing components. In fact, the corpus includes
separate datasets for testing. Use all of the file `"ned.train"` (and nothing else) to train your models. Use the file `"ned.testa"` for testing. 

## 1. Step 1: A minimal NER tagger for Dutch

**In brief:** Use the wrapper `custom_chunker.py`, provided below, to train a
(trivial) named entity recognizer for Dutch. Pickle it and measure its performance.

[Section 7.3.3](http://www.nltk.org/book/ch07.html#training-classifier-based-chunkers)
of the NLTK book provides sample code for a
chunker, showing how to wrap a sequential MaxEnt classifier in a
converter that accepts chunked sentences in `Tree` format.
The code is intentionally simple, so we will use the following
extended version. Save it as a module `custom_chunker.py`,
and import it into your code.

In [2]:
# FILE: custom_chunker.py

# Natural Language Toolkit: code_classifier_chunker
# Based on code from
#   http://www.nltk.org/book/ch07.html#code-classifier-chunker
#
# Revisions:
# - Not using "megam" as the machine learning engine.
# - The feature builder is a constructor parameter.
# - Added the method `explain()`, which prints the docstring of the feature builder.
# - Added access to the `show_most_informative_features()` method of the underlying classifier.
#
# Alexis Dimitriadis

import nltk
from nltk.chunk.util import conlltags2tree, tree2conlltags

nltk.download('conll2002')

# If numpy is absent, the nltk fails with a very confusing error.
# We avoid problems by checking directly
try:
    import numpy
except ImportError:
    print("You need to download and install numpy!!!")
    raise


class ConsecutiveNPChunker(nltk.ChunkParserI):
    """
    Train a classifier on chunked data in Tree format.
    Arguments for the constructor:

    featuremap   The function that will compute features for each word
        in a sentence. See the NLTK book (and the assignment)
        for the arguments it must accept.

    train_sents  A list of sentences in chunked (Tree) format.
    
    algorithm  (optional). The name of the machine-learning model to use.
    """
    def __init__(self, featuremap, train_sents, algorithm="IIS"):
        self._algorithm = algorithm
        tagged_sents = [[((w,t),c) for (w,t,c) in tree2conlltags(sent)]
                            for sent in train_sents]
        self.tagger = _ConsecutiveNPChunkTagger(featuremap, tagged_sents, algorithm)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w,t,c) for ((w,t),c) in tagged_sents]
        return conlltags2tree(conlltags)

    chunk = parse  # A synonym for the absent-minded
    
    def explain(self):
        """Print the docstring of our feature extraction function"""
        print("Algorithm:", self._algorithm)
        # Print the feature map's help string:
        print(self.tagger._featuremap.__doc__)
 
    def show_most_informative_features(self, n=10):
        """Call our classifier's `show_most_informative_features()` function."""
        self.tagger.classifier.show_most_informative_features(n)


class _ConsecutiveNPChunkTagger(nltk.TaggerI):
    """This class is not meant to be
    used directly: Use ConsecutiveNPChunker instead."""

    def __init__(self, featuremap, train_sents, algorithm):
        
        self._featuremap = featuremap
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = self._featuremap(untagged_sent, i, history) 
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.MaxentClassifier.train( 
            train_set, algorithm=algorithm, trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = self._featuremap(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return list(zip(sentence, history))  # a list, not a one-shot iterator



To use this module, you need a training dataset of chunked sentences
(in Tree format) and a feature extractor function that will be used
internally during training and regular use.

A feature extractor must accept a tagged sentence `sentence`, the
index `i` of a word in the sentence, and a list `history` containing
the IOB tags that have already been assigned (presumably to earlier
positions in the sentence). It must return a dictionary of the
extracted features. Here is a very simple example, using just 
one feature out of all the available information:

In [3]:
def simple_features_1(sentence, i, history):
    """Simplest chunker features: Just the POS tag of the word""" 
    word, pos = sentence[i]
    return { "pos": pos }
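Before training anything, it can help to check what an extractor returns for each word. The helper below is a minimal sketch and not part of `custom_chunker.py`: `preview_features` is a hypothetical name, and the four-word sentence with its IOB tags is made up for illustration. It calls the extractor the same way the tagger does during training, with `history` holding the tags of the earlier positions.

```python
def simple_features_1(sentence, i, history):
    """Simplest chunker features: Just the POS tag of the word"""
    word, pos = sentence[i]
    return {"pos": pos}


def preview_features(featuremap, tagged_sentence, iob_tags):
    """Return the featureset computed at each position of one sentence.

    `iob_tags` are the known IOB tags; the tags of earlier positions are
    passed to the extractor as `history`, as happens during training.
    """
    rows = []
    history = []
    for i, tag in enumerate(iob_tags):
        rows.append(featuremap(tagged_sentence, i, history))
        history.append(tag)
    return rows


# Hand-made example sentence (word, POS) with made-up IOB tags.
sentence = [("dat", "Conj"), ("Floralux", "N"), ("inhuurde", "V"), (".", "Punc")]
iob_tags = ["O", "B-ORG", "O", "O"]

preview = preview_features(simple_features_1, sentence, iob_tags)
for (word, pos), featureset in zip(sentence, preview):
    print(word, featureset)
```

A check like this runs instantly, so it is a cheap way to catch crashes or misspelled feature names before starting a long training run.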

Once we have our feature function, we can train a recognizer for the
Dutch CONLL corpus as shown below.

It takes a long time to train a recognizer (about half a second per
sentence on my computer), so we demonstrate here with a tiny
training set. Unsurprisingly,
it's too small for the chunker to do anything useful with novel test
data. Use larger datasets judiciously: Very short training sets are
fine for checking if your code runs or crashes, but to find out if a new
feature improves accuracy, you need to train on the entire dataset--
or at least a substantial portion (several thousand sentences).

If your code is too slow for anything but trivial datasets, figure out
what is slowing it down. Pickle the trained model for later use.

In [4]:
from nltk.corpus import conll2002 as conll

In [5]:
#from custom_chunker import ConsecutiveNPChunker

tiny_sample = 150
# training = conll.chunked_sents("ned.train")  # Train with full dataset
training = conll.chunked_sents("ned.train")[:tiny_sample] # SHORT DATASET: FOR DEMO/DEBUGGING ONLY! 
testing = conll.chunked_sents("ned.testa")

simple_nl_NER = ConsecutiveNPChunker(simple_features_1, training)
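Since training is slow, the text above recommends pickling the trained model. A minimal sketch of how that might look is given here; `save_model` and `load_model` are hypothetical helper names, and the dictionary is only a stand-in so that the example runs without a trained recognizer.

```python
import os
import pickle
import tempfile


def save_model(model, filename):
    """Pickle `model` to `filename` (pickle files must be opened in binary mode)."""
    with open(filename, "wb") as outfile:
        pickle.dump(model, outfile)


def load_model(filename):
    """Reload a model that was stored with save_model()."""
    with open(filename, "rb") as infile:
        return pickle.load(infile)


# Round-trip demonstration with a stand-in object. With a real recognizer
# the call would be, e.g., save_model(simple_nl_NER, "simple_nl_NER.pickle").
demo_path = os.path.join(tempfile.mkdtemp(), "demo.pickle")
save_model({"algorithm": "IIS", "features": ["pos"]}, demo_path)
reloaded = load_model(demo_path)
print(reloaded)
```

Note that a recognizer whose feature function is defined in the notebook itself may not reload correctly elsewhere; section 3 explains why and how to avoid the problem.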

We evaluate our recognizer by calling its `evaluate()` method.
Evaluation is a lot faster than training, so we use the entire test
set, `ned.testa`.

In [6]:
print(conll.chunked_sents("ned.train")[0])
print(conll.chunked_sents("ned.testa")[0])

(S
  De/Art
  tekst/N
  van/Prep
  het/Art
  arrest/N
  is/V
  nog/Adv
  niet/Adv
  schriftelijk/Adj
  beschikbaar/Adj
  maar/Conj
  het/Art
  bericht/N
  werd/V
  alvast/Adv
  bekendgemaakt/V
  door/Prep
  een/Art
  communicatiebureau/N
  dat/Conj
  (ORG Floralux/N)
  inhuurde/V
  ./Punc)
(S
  Dat/Pron
  is/V
  verder/Adj
  opgelaaid/N
  door/Prep
  windsnelheden/N
  die/Pron
  oplopen/V
  tot/Prep
  35/Num
  kilometer/N
  per/Prep
  uur/N
  ./Punc)


In [8]:
print(simple_nl_NER.evaluate(testing))


ChunkParse score:
    IOB Accuracy:  90.1%
    Precision:      7.7%
    Recall:         0.0%
    F-Measure:      0.1%


Unsurprisingly, relying only on the part of speech is a very poor way
to identify named entities. Our trivial recognizer finds a negligible
proportion of all named entities (0% recall). Of the chunks it marks
as named entities, a small proportion (7.7%) are indeed named
entities; the rest were marked incorrectly.
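To make the three scores concrete, here is how they follow from raw chunk counts. The counts below are hypothetical, chosen only because they give percentages like those in the output above; the actual counts for this run are not shown. `chunk_scores` is an illustrative helper, not an NLTK function.

```python
def chunk_scores(correct, guessed, gold):
    """Compute (precision, recall, F-measure) from chunk counts.

    correct: chunks the recognizer marked that are really named entities
    guessed: all chunks the recognizer marked
    gold:    all named entities in the test data
    """
    precision = correct / guessed if guessed else 0.0
    recall = correct / gold if gold else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        # Harmonic mean of precision and recall.
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 13 chunks guessed, 1 of them correct, 2616 to be found.
precision, recall, f_measure = chunk_scores(correct=1, guessed=13, gold=2616)
print(f"Precision: {precision:6.1%}")
print(f"Recall:    {recall:6.1%}")
print(f"F-Measure: {f_measure:6.1%}")
```

A recognizer that marks almost nothing can have a high IOB accuracy simply because most words are outside any named entity, which is why precision and recall are the scores to watch.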

## 2. Step 2: Turn it into a script

Training a non-trivial classifier is too time-consuming to keep
entirely in a notebook. Prepare to work with python scripts (in IDLE
or in your favorite editor), as follows:

1. Save the script `custom_chunker.py` (see above), without any
modifications. It is an importable module that you can use in your
script.

2. Create a script `features.py` for your feature extractors. 
Put the definition of `simple_features_1()` there as a starter. 
(You should later add, and use, additional functions.)

3. You can now import both modules, or parts of them, for use in a
Notebook or in other scripts. For example:

In [None]:
from custom_chunker import ConsecutiveNPChunker
from features import simple_features_1

myRecognizer = ConsecutiveNPChunker(simple_features_1, training)
# etc.

## 3. Pickling and unpickling successfully

We have already seen how to pickle and reload a trained tagger.
Working with a classifier is slightly more complicated, since its
operation relies on code that we write and revise.  This requires some
care to work correctly.

It is important to understand that **pickling in python only stores
data.** Pickled objects do not store python code for function or class
definitions. To reload a pickled object, python must be able to find
the definition of its class and of any functions the object refers to.

1. During pickling, a record is made of the modules where the needed
types and functions were defined.

2. During unpickling, the types and functions are imported from the
recorded modules and used with the reloaded objects.

This is a bit of behind-the-scenes magic, so you must take some care
to avoid problems:

*   After you store a pickled object, you should not modify the
functions and classes it depends on (i.e., `ConsecutiveNPChunker` and
the feature extraction function it uses). **If you modify your feature 
function after pickling a model, the model will become invalid.**
Your model may or may not cause runtime errors, but the statistics 
will be incorrect and you'll have to train and pickle a new
version. Use a different name for each version of your feature extraction 
function (`chunkfeatures_2`, `big_features`, or whatever), so 
that unpickled models can retrieve the right function later.

* Code can only be found in *named* modules, but the main script does
not count as a regular named module (its name is always just
`__main__`). If you define your feature function (e.g.,
`chunkfeatures_1`) in your main script, pickle a model, and unpickle
it from a _different_ script, python will not be able to find your
function. The solution is simple:

    * **All necessary classes and functions should be defined in
modules (one or several), and *imported* into your main script.**

 The script that unpickles your model will then know where to find
everything.



**In short:** Use a different name for each new feature extraction
function you define, and keep their definitions in modules, not in
your main script.
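The following stand-alone demonstration illustrates the point using only the standard library. `os.path.basename` merely plays the role of a feature function that lives in a named module, and the dictionary plays the role of a model; neither has anything to do with the NLTK.

```python
import pickle
from os.path import basename  # stand-in for a feature function defined in a module

# A "model" that holds a reference to a function, as the chunker does.
model = {"featuremap": basename, "weights": [0.5, 1.5]}
blob = pickle.dumps(model)

# The pickle contains the function's name, not its code.
function_name_stored = b"basename" in blob

# Unpickling looks the function up again by module and name.
restored = pickle.loads(blob)
same_function = restored["featuremap"] is basename

# A function without a proper name (here: a lambda) cannot be looked up
# later, so pickle refuses to store it.
try:
    pickle.dumps({"featuremap": lambda sentence, i, history: {}})
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False

print(function_name_stored, same_function, lambda_pickled)
```

Because only the name is stored, the reloaded model uses whatever code is found under that name at load time. That is why a changed feature function silently invalidates an older pickled model.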


## 4. Self-testing

Here is a simple script that should be able to use your pickled
tagger. Ensure that your code is compatible with it. If your code does
not work with this script, **do not modify the script.** Fix your code
so that it is compatible with the script.

In [None]:
# FILE: model_test.py

import pickle
ner = pickle.load(open("best.pickle", "rb"))

from nltk.corpus import conll2002 as conll

# Usage 1: parse a list of sentences (with POS tags)
tagzinnen = conll.tagged_sents("ned.train")[1000:1050]
result = ner.parse_sents(tagzinnen)

# Usage 2: self-evaluate (on chunked sentences)
chunkzinnen = conll.chunked_sents("ned.testa")[1000:1500]
print(ner.evaluate(chunkzinnen))

## 5. Step 3: Improve the feature selection

**In brief:** Choose suitable features to improve your classifier's
performance.

Now that you have a minimal working classifier, define better 
feature functions to improve your classifier's performance. Train with
`ned.train`, and evaluate performance on the data in `ned.testa` only.

You can start with including capitalization information about the
current word; you can also experiment with features about the number
of letters in the word, the tag that precedes or follows it, whether
it follows a word that is already marked as part of a named entity
(use the `history` vector), etc.
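The suggestions above could be combined along the following lines. This is only a sketch under stated assumptions: the function name `example_features`, the feature names, and the `<START>`/`<END>` placeholders are all invented here, and nothing guarantees that these particular features perform well.

```python
def example_features(sentence, i, history):
    """Example features: POS tag, capitalization, word length,
    neighbouring POS tags, and the previously assigned IOB tag"""
    word, pos = sentence[i]
    # Neighbouring POS tags, with placeholders at the sentence edges.
    prev_pos = sentence[i - 1][1] if i > 0 else "<START>"
    next_pos = sentence[i + 1][1] if i < len(sentence) - 1 else "<END>"
    # `history` holds the IOB tags assigned so far; its last element
    # belongs to the word just before the current one.
    prev_iob = history[-1] if history else "<START>"
    return {
        "pos": pos,
        "capitalized": word[:1].isupper(),
        "length": len(word),
        "prev_pos": prev_pos,
        "next_pos": next_pos,
        "prev_iob": prev_iob,
    }


# Hand-made example: features for the second word of a short sentence.
sentence = [("dat", "Conj"), ("Floralux", "N"), ("inhuurde", "V")]
print(example_features(sentence, 1, ["O"]))
```

Features drawn from `history` are only as good as the tags predicted so far, so it is worth testing whether they help or hurt.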

You are also allowed to use (some) external resources. E.g., our list
of Dutch proper names (from an earlier activity) might be helpful in
recognizing whether some word is a person's name: Add a feature that
tells you whether a word is in the list of names (or perhaps in an abbreviated
list containing just the most common names). You may also use any dataset
that is provided by the nltk (available through `nltk.download`). Ask
us first before using any other, external resources.
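A name-list feature might be sketched as follows. `KNOWN_NAMES` is a hypothetical stand-in with a few made-up entries; in practice it would be filled from whatever name list you use. The set is kept at module level, next to the feature function, so that the function remains an ordinary named function that a pickled model can find again.

```python
# Hypothetical stand-in for a list of Dutch proper names; a real list
# would be read from a data file that you submit with your code.
KNOWN_NAMES = frozenset({"jan", "piet", "anna", "sophie"})


def name_features(sentence, i, history):
    """Example features: POS tag, and whether the word is in the name list"""
    word, pos = sentence[i]
    return {
        "pos": pos,
        # Compare in lower case so that "Anna" and "ANNA" both match.
        "in_name_list": word.lower() in KNOWN_NAMES,
    }


print(name_features([("Anna", "N"), ("slaapt", "V")], 0, []))
```

Using a set rather than a list keeps the membership test fast, which matters because the extractor is called once for every word in the training data.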

These are only suggestions: Utilize your reading and your imagination
to come up with more.

It is not necessary to **report** on every version of the feature
extractors you try out, but you may include a demonstration of more
than one if, e.g., you find that the "best" function depends on the algorithm (see
next step).

## 6. Step 4: Compare machine learning engines

There are many different learning algorithms for classifiers, and the
NLTK offers several of them. The code in `custom_chunker.py` uses the
MaxEnt classifier, which supports four kinds of maximum entropy
optimization. You can select among them by using the `algorithm`
argument of the `ConsecutiveNPChunker` constructor (which will be passed
to `MaxEnt`). To see the available algorithms, print the attribute `ALGORITHMS`:

In [None]:
import nltk
print(nltk.classify.MaxentClassifier.ALGORITHMS)

The names stand for Generalized Iterative Scaling ("GIS"), Improved Iterative Scaling
("IIS", the default), and two that require external libraries:
the Megam library (which uses the LM-BFGS algorithm), and the Toolkit
for Advanced Discriminative Modeling (TADM).

The NLTK also offers the NaiveBayes classifier, which can be used
instead of MaxEnt. You'll need to **modify 
`custom_chunker.py`** so that it accepts `"NaiveBayes"` as a special 
value for the `algorithm` argument, and uses it instead of the MaxEnt classifier. 
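One possible shape for this change is sketched below. It is an assumption about how you might structure the code, not the required solution: `train_classifier` is a hypothetical helper that `_ConsecutiveNPChunkTagger.__init__` could call in place of its direct call to `nltk.MaxentClassifier.train()`. The toy training set is invented for the demonstration.

```python
import nltk


def train_classifier(train_set, algorithm="IIS"):
    """Train a classifier on a list of (featureset, tag) pairs.

    The special value "NaiveBayes" selects nltk.NaiveBayesClassifier;
    any other value is passed on to nltk.MaxentClassifier as its
    `algorithm` argument.
    """
    if algorithm == "NaiveBayes":
        return nltk.NaiveBayesClassifier.train(train_set)
    return nltk.MaxentClassifier.train(train_set, algorithm=algorithm, trace=0)


# Toy data, invented for illustration: (featureset, IOB tag) pairs.
toy_train_set = [
    ({"pos": "N"}, "B-PER"),
    ({"pos": "V"}, "O"),
    ({"pos": "N"}, "B-PER"),
    ({"pos": "Art"}, "O"),
]
classifier = train_classifier(toy_train_set, "NaiveBayes")
print(classifier.classify({"pos": "V"}))
```

Both classifier types offer `classify()` and `show_most_informative_features()`, so the rest of the chunker can stay the same.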

Experiment with the different engines and algorithms, and find out
which combination of algorithms and features gives the best
performance on the test data. Note that the Megam and TADM algorithms
require software that must be downloaded and installed separately. Put
those aside and work with the rest: **`IIS, GIS,` and the NaiveBayes
classifier. Document the performance of all three in your report.** (You should at
least report on their performance with your "best" feature selection; 
but different features may perform best with different
algorithms, so consider exploring that as well.)
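A comparison like this is easier to manage with a small harness that trains each configuration, times it, and collects the scores. The sketch below is one hypothetical way to organize it: `compare_models` and `format_results` are invented names, and the `build` and `evaluate` arguments are callables that you supply. The demonstration at the end uses dummy callables so that it runs without any training.

```python
import time


def compare_models(configurations, build, evaluate):
    """Train and score several configurations.

    configurations: list of (label, feature_function, algorithm) triples
    build:          callable (feature_function, algorithm) -> trained model
    evaluate:       callable (model) -> score object or string
    Returns a list of (label, algorithm, training_seconds, score).
    """
    results = []
    for label, featuremap, algorithm in configurations:
        start = time.perf_counter()
        model = build(featuremap, algorithm)
        elapsed = time.perf_counter() - start
        results.append((label, algorithm, elapsed, evaluate(model)))
    return results


def format_results(results):
    """Turn the results into a readable text report."""
    lines = []
    for label, algorithm, elapsed, score in results:
        lines.append(f"MODEL: {label} ({algorithm}), trained in {elapsed:.1f} s")
        lines.append(str(score))
        lines.append("")
    return "\n".join(lines)


# With the real chunker, the callables might look like this:
#   build = lambda featuremap, algorithm: ConsecutiveNPChunker(featuremap, training, algorithm)
#   evaluate = lambda model: model.evaluate(testing)
# Dummy callables, used here only to show the flow:
demo_results = compare_models(
    [("demo features", None, "IIS"), ("demo features", None, "GIS")],
    build=lambda featuremap, algorithm: algorithm,
    evaluate=lambda model: f"(score of the {model} model would go here)",
)
print(format_results(demo_results))
```

Recording the training time alongside each score also gives you the timing figures that the report asks for.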

## 7. Step 5: Performance evaluation

Train your best, finished classifier on the complete set of sentences
in `ned.train`, pickle it, and evaluate it on the data in `ned.testa`.
Include the results in your short report (see Submission).

Precision and recall are discussed in [section 6.3.3][3.3] of the NLTK
book. The section immediately below that presents the NLTK's
[`ConfusionMatrix`][3.4] class, which is useful for identifying
where your classifier makes mistakes. (Its use is optional.)

You may also find the nltk's
[`nltk.chunk.util.ChunkScore`](http://www.nltk.org/api/nltk.chunk.html#nltk.chunk.util.ChunkScore)
class useful (again, optional).

[3.3]: http://www.nltk.org/book/ch06.html#precision-and-recall
[3.4]:  http://www.nltk.org/book/ch06.html#confusion-matrices

**What performance should you aim for?** As a baseline, a very simple
six-feature classifier achieved about 60% precision and recall,
trained on the entire `ned.train` corpus and tested on `ned.testa`.
Your solution should do at least this well.

<pre>
MODEL: Minimal classifier
MaxEnt, IIS algorithm
6 features, 15806 train sentences, 2895 testing

ChunkParse score:
    IOB Accuracy:  95.8%
    Precision:     59.0%
    Recall:        60.0%
    F-Measure:     59.5%

Should have found 2616 chunks, guessed 2661, correct 1564
</pre>

A more complex classifier achieved almost 15 percentage points higher precision and
a modest gain in recall. You can easily do even better with a bit of
exploration. Note that you'll probably need to let the training run
for hours, and possibly overnight or longer: Allow enough time to
train and improve your models. Training on very short datasets is 
not useful for comparing between feature sets: there must be enough 
variety in the training data to weigh the features properly.

<pre>
MODEL: NLTK feature set
MaxEnt, default algorithm
15 features, 15806 train sentences, 2895 testing
Training...  done, 24443.5 s (1.55s per sentence)

ChunkParse score:
    IOB Accuracy:  96.2%
    Precision:     73.8%
    Recall:        62.0%
    F-Measure:     67.4%

Should have found 2616 chunks, guessed 2199, correct 1616
</pre>

Your classifier will also be evaluated on novel newstext in Dutch (to
be distributed later).

## 8. Step 6: Submission

Prepare to submit your code, organized in the following units (which
may, and should, share some additional modules that you **also**
submit).

Double-check that the code and models you upload can run without
modifications, and according to the instructions. Your code should
contain no absolute paths.

1. **Your python source files.** They should consist of the following
standard files, plus any additional files you find it useful to
create:

    1. A script `BuildModels.py` that trains and pickles the different
       classifiers you have developed.
    2. A script `EvaluateModels.py` that loads and evaluates your
       models. Make sure it prints at least something that identifies
       the model/feature set, and the precision, recall, and F-measure
       for each model. You may also want to print more; use your
       discretion. These two scripts together are an abbreviated
       record of the approaches you have explored.
    3. The module `features.py` that defines, or imports from other
       modules, one or more feature extractor functions (whatever you
       find worth reporting on).
    4. Your modified module `custom_chunker.py`.
    5. The file `Evaluation-output.txt`, with the saved output of the
       evaluation.
    6. Any additional files you require, including external data
       (e.g., a list of names).
 Ensure that the output of your evaluation script is a comprehensible report,
not just raw numbers. E.g., print out a line identifying the
classifier that you are about to `evaluate()`, and add some blank
lines or other visual structure. Add a line of output identifying your
best model.
<p/>

2. **The pickled model of your best classifier.** Please name it
`best.pickle` and **test** that it can be reloaded and works correctly
with the simple script `model_test.py` (see above). If you needed to make
**any** changes to `model_test.py` to get it to run, be sure to include
your version in the upload. **If you need to do this, it will cost you a few points.**<p/>

3. **A short report** (1-2 pages, **txt file or PDF only**), summarizing your
results. (Cf. the results from step 4.)  Report the features and
classifier engine you used; precision, recall and F-value; how much
time was needed for training and for evaluation of your best model;
and any additional explanations or observations about the task.

<p/>

## 9. Practicalities

* There will be a small grade bonus (and bragging rights!) for the three
groups that achieve the highest F-scores on new data.


*    Your work should be your own. Attribute external sources etc. Do not ask for help in any forum.


*    You can freely use any part of the conll module itself and any
    other parts of the NLTK, including utility routines for evaluation, etc.
    You may use standard python libraries, and all libraries included with
    Anaconda, but not libraries that must be separately downloaded,
    except with prior approval. If in doubt, ask us.