# Second Group Assignment: A Named Entity Recognizer for Dutch

## Contents

[Introduction](#Introduction)  
[0. Preparation: Training data](#0.-Preparation:-Training-data)  
[1. Step 1: A minimal NER tagger for Dutch](#1.-Step-1:-A-minimal-NER-tagger-for-Dutch)   
[2. Feature Extractors](#2.-Feature-Extractors)  
[3. Step 2: Turn it into a program](#3.-Step-2:-Turn-it-into-a-program)  
[4. Pickling and unpickling successfully](#4.-Pickling-and-unpickling-successfully)  
[5. Self-testing](#5.-Self-testing)  
[6. Step 3: Write scripts to build and evaluate your models](#6.-Step-3:-Write-scripts-to-build-and-evaluate-your-models)  
[7. Step 4: Improve the model by improving the feature selection](#7.-Step-4:-Improve-the-model-by-improving-the-feature-selection)  
[8. Step 5: Performance evaluation](#8.-Step-5:-Performance-evaluation)  
[9. Step 6: Submission](#9.-Step-6:-Submission)  
[10. Practicalities](#10.-Practicalities)  


## References

* NLTK book, Chapter 6: Classification and classifiers
* NLTK book, Chapter 7: Chunking and named entity recognition
* Jurafsky and Martin, ch. 8, cover some NER and other sequence-tagging basics
* Jurafsky and Martin, ch. 17, give a high-level introduction to information extraction

## Introduction

The goal of this activity is to construct a Named Entity Recognizer
(NER): A device that can scan natural text, identify named entities
such as persons, places and organizations that are referred to by
name, and classify them according to type (PERSON, LOCATION, etc.).

You will train a classifier on the Dutch component of the CONLL2002
corpus. The corpus also includes a Spanish
component, so always specify which files you want to read.

The necessary background concepts and software techniques are
presented in chapter 7 of the NLTK book:

[Section 7.2]: http://www.nltk.org/book/ch07.html#sec-chunking
[Section 7.3]: http://www.nltk.org/book/ch07.html#developing-and-evaluating-chunkers
[Section 7.5]: http://www.nltk.org/book/ch07.html#named-entity-recognition

* [Section 7.2][] presents the concept of *chunking*, and how the NLTK
manages its chunked corpora.

* [Section 7.3][] shows how to build and evaluate chunkers with the
help of the NLTK's chunked corpora. The discussion is based on the
CONLL2000 corpus (note the year), a corpus of _English_ text in which
all noun phrases are indicated.

* Finally, [Section 7.5][] briefly covers the task of Named Entity
Recognition. (Tip for the impatient: Sections 7.2 and 7.3 are
essential reading--do not skip them).

In the CONLL2002 corpus, which contains Spanish and Dutch components,
only named entities have been chunked. Although the content of the
chunks is different, the structure and interface of the corpora are the
same: The text is annotated with POS tags, chunks, and chunk types.
Thus the procedures for chunking noun phrases can be adapted to the NER task with
minimal changes: Just train on the Dutch CONLL2002 corpus, and
recognize _its_ chunks.

* Practicum 10 (Week 6) will explain much of what you need to know for this assignment. The assignment is made available now in case you want to start early; you can begin right away by reading the textbook.

-----

**Note:** The nltk's classifiers and taggers need the external `numpy`
library, but fail in a very confusing way if it is not found. The
Anaconda distribution includes `numpy`, so that's not a problem unless
you are using python without Anaconda.

## 0. Preparation: Training data

We will once again use the `CONLL2002` corpus, which was specifically created for the task of named entity recognition.

In some earlier practica, we split the file `ned.train` into training
and testing components. In fact, the corpus includes
separate datasets for testing. Use all of the file `"ned.train"` (and nothing else) to train your models. Use the file `"ned.testa"` for testing. (If you're familiar with machine learning vocabulary, this means you should use `testa` as your development set.)

## 1. Step 1: A minimal NER tagger for Dutch

**In brief:** Complete the wrapper `custom_chunker.py`, provided below. Train a named entity recognizer (NER) for Dutch. Pickle it and measure its performance.

[Section 7.3.3](http://www.nltk.org/book/ch07.html#training-classifier-based-chunkers) of the NLTK book provides sample code for a
chunker, showing how to wrap a sequential MaxEnt classifier in a
converter that accepts chunked sentences in `Tree` format.
You will complete a module based on that code, save it as `custom_chunker.py`,
and import it into your code.


In the following steps you will write a function to prepare the training data, which you will then incorporate into the `_ConsecutiveNPChunkTagger` class below. 

To begin with, we need to import some things from `nltk` and check that we have `numpy`. If this raises an error, install `numpy` before you go on.

In [3]:
import nltk
from nltk.chunk.util import conlltags2tree, tree2conlltags

# If numpy is absent, the nltk fails with a very confusing error.
# We avoid problems by checking directly
try:
    import numpy
except ImportError:
    print("You need to download and install numpy!!!")
    raise

The data set is imported below. For debugging, use a tiny sample so you don't have to wait for anything to train.

In [4]:
from nltk.corpus import conll2002 as conll

# for debugging, use a tiny corpus
tiny_sample = 10
training = conll.chunked_sents("ned.train")[:tiny_sample] # SHORT DATASET: FOR DEMO/DEBUGGING ONLY! 
# training = conll.chunked_sents("ned.train")
testing = conll.chunked_sents("ned.testa")

To use this module, you need a training dataset of chunked sentences
(in `nltk.Tree` format) and a feature extractor function that will be used
internally during training and regular use.

## 2. Feature Extractors

A feature extractor must accept a POS-tagged sentence `sentence`, the
index `i` of a word in the sentence, and a tuple `history` containing
the IOB tags that have already been assigned (presumably to earlier
positions in the sentence). It must return a dictionary of the
extracted features, where the keys are feature names and the values are the feature values. Because of the way the trainer works, the feature values have to be "hashable", which mostly means they can't be lists. If you want a list, turn it into a tuple with `tuple(my_list)`.

Here is a very simple example, using just two features out of all the available information. The entire history is included so we can make sure the feature extractor is working right. (Are these useful features? Maybe, maybe not; the second half of the assignment is for figuring that out!)

In [5]:
def test_features(sentence, i, history):
    """Chunker features designed to test the Chunker class for correctness 
        - the POS tag of the word
        - the entire history of IOB tags so far
            formatted as a tuple because it needs to be hashable
    """ 
    word, pos = sentence[i]
    return { 
        "pos": pos,
        "whole history": tuple(history)
            }
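
To check an extractor by hand, it can be called on a small made-up sentence. The sketch below uses a stand-in, `pos_and_history_features`, with the same body as `test_features` so that it runs on its own. The two-word sentence and its tags are invented for illustration and are not taken from the corpus.

```python
def pos_and_history_features(sentence, i, history):
    """Stand-in with the same body as test_features, repeated for self-containment."""
    word, pos = sentence[i]
    return {"pos": pos, "whole history": tuple(history)}

# A made-up POS-tagged sentence (illustration only, not corpus data)
toy_sentence = [("De", "Art"), ("trein", "N")]

# Features for the second word, given that the first word was tagged "O"
features = pos_and_history_features(toy_sentence, 1, ["O"])
print(features)

# Every feature value must be hashable; hash() raises TypeError otherwise
for value in features.values():
    hash(value)
```

If a value were a list instead of a tuple, the `hash()` call would raise a `TypeError`, which is a quick way to catch the problem before training.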

Here is a utility function for getting the `nltk.Tree` training data into the right format for the tagger, which is lists of ((word, POS), IOB).

In [6]:
def reformat_corpus_for_tagger(train_sents):
    """
    Given a corpus in nltk.Tree format, returns the corpus as a list of lists of tuples,
    where each tuple ((word, POS), IOB) includes the word, its POS tag, and the IOB tag to be predicted.
    :param train_sents: list of nltk.Tree objects
    """
    return [[((word, pos), iob) for (word, pos, iob) in tree2conlltags(sent)] for sent in train_sents]
    

### Step 1.A 

Write a function that takes a feature map and a list of training sentences as given in the corpus and returns a list of appropriate training data for the tagger. Call `reformat_corpus_for_tagger` on the sentences to get them into the right format.

The tagger is trained not on the sentences themselves, but on the features extracted from the sentences, paired with the gold tags. A sentence should therefore become a list of (feature dictionary, IOB tag) pairs. 

Interestingly, even though the corpus is a list of lists, the tagger takes as input a flat list of the (feature dictionary, IOB tag) pairs.

When you extract the features from the words in the sentence, be mindful of how feature functions work: for each word in a sentence, you need to give it the full current sentence in the form of (word, POS) pairs, the index of the word, and the history of IOB tags for the words before this word **in the sentence** (not in the whole corpus).

Finally, when you pass the history to the feature map, make sure you're not passing it a pointer to a history that will keep changing. If you do this, all the words will end up with the history of the last word in the corpus. One way to handle this is to use `from copy import copy`, and then you can use, e.g., `copy(history)` to get a separate copy that won't change.
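
The pitfall can be reproduced in a few lines. This is a minimal sketch: `keep_history` is a hypothetical extractor that stores the history object it receives, and the three tags are invented. (It returns a list as a feature value, so it is for illustration only and would not be usable for training.)

```python
from copy import copy

def keep_history(sentence, i, history):
    """Hypothetical extractor that stores the history object itself."""
    return {"history": history}

tags = ["O", "B-PER", "I-PER"]

# Buggy: every call receives the same list object, which keeps growing
history = []
aliased = []
for i, tag in enumerate(tags):
    aliased.append(keep_history(None, i, history))
    history.append(tag)

# Safer: each call receives its own snapshot of the history
history = []
snapshots = []
for i, tag in enumerate(tags):
    snapshots.append(keep_history(None, i, copy(history)))
    history.append(tag)

print([f["history"] for f in aliased])    # three references to one list
print([f["history"] for f in snapshots])  # three different histories
```

An extractor such as `test_features` sidesteps the problem on its own, because `tuple(history)` already creates a new object, but passing a copy protects every extractor regardless of how it is written.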


In [7]:
def create_training_data(feature_map, train_sents):
    """
    Creates training data from the corpus of train_sents and the feature_map
    :param feature_map: function that maps (untagged sentence, word index,
     history) to a dict of features (from features.py)
    :param train_sents: training sentences as lists of nltk.Tree objects
    :return: list of (dict, IOB tag) pairs
    """
    # TODO reformat sentences to ((word, pos_tag), iob_tag) pairs
    formatted_train_sents = reformat_corpus_for_tagger(train_sents)
    
    # TODO turn the sentences into appropriate training data by finding their features
    train_set = []
    
    for tagged_sent in formatted_train_sents:
        untagged_sent = nltk.tag.untag(tagged_sent)
        history = []
        for i, (word, tag) in enumerate(tagged_sent):
            # pass a snapshot of the history so later appends cannot change it
            feature_set = feature_map(untagged_sent, i, list(history))
            train_set.append((feature_set, tag))
            history.append(tag)
            
        
    return train_set


In [8]:
# TEST

training_data = create_training_data(test_features, training[0:2])

if training_data == [({'pos': 'Art', 'whole history': ()}, 'O'), ({'pos': 'N', 'whole history': ('O',)}, 'O'), ({'pos': 'Prep', 'whole history': ('O', 'O')}, 'O'), ({'pos': 'Art', 'whole history': ('O', 'O', 'O')}, 'O'), ({'pos': 'N', 'whole history': ('O', 'O', 'O', 'O')}, 'O'), ({'pos': 'V', 'whole history': ('O', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'Adv', 'whole history': ('O', 'O', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'Adv', 'whole history': ('O', 'O', 'O', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'Adj', 'whole history': ('O', 'O', 'O', 'O', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'Adj', 'whole history': ('O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'Conj', 'whole history': ('O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'Art', 'whole history': ('O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'N', 'whole history': ('O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'V', 'whole history': ('O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'Adv', 'whole history': ('O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'V', 'whole history': ('O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'Prep', 'whole history': ('O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'Art', 'whole history': ('O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'N', 'whole history': ('O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'Conj', 'whole history': ('O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'N', 'whole history': ('O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O')}, 'B-ORG'), ({'pos': 'V', 'whole history': ('O', 'O', 'O', 'O', 
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG')}, 'O'), ({'pos': 'Punc', 'whole history': ('O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O')}, 'O'), ({'pos': 'Prep', 'whole history': ()}, 'O'), ({'pos': 'Num', 'whole history': ('O',)}, 'O'), ({'pos': 'V', 'whole history': ('O', 'O')}, 'O'), ({'pos': 'Art', 'whole history': ('O', 'O', 'O')}, 'O'), ({'pos': 'Adj', 'whole history': ('O', 'O', 'O', 'O')}, 'O'), ({'pos': 'Adj', 'whole history': ('O', 'O', 'O', 'O', 'O')}, 'B-MISC'), ({'pos': 'N', 'whole history': ('O', 'O', 'O', 'O', 'O', 'B-MISC')}, 'O'), ({'pos': 'Art', 'whole history': ('O', 'O', 'O', 'O', 'O', 'B-MISC', 'O')}, 'O'), ({'pos': 'N', 'whole history': ('O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O')}, 'O'), ({'pos': 'Prep', 'whole history': ('O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O')}, 'O'), ({'pos': 'Art', 'whole history': ('O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'N', 'whole history': ('O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O')}, 'B-MISC'), ({'pos': 'Pron', 'whole history': ('O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'B-MISC')}, 'O'), ({'pos': 'Art', 'whole history': ('O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O')}, 'O'), ({'pos': 'N', 'whole history': ('O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O')}, 'O'), ({'pos': 'Prep', 'whole history': ('O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O')}, 'O'), ({'pos': 'Pron', 'whole history': ('O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'N', 'whole history': ('O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'V', 'whole history': ('O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 
'O')}, 'O'), ({'pos': 'V', 'whole history': ('O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'V', 'whole history': ('O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O')}, 'O'), ({'pos': 'Punc', 'whole history': ('O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O')}, 'O')]:
    print("\nTraining data is formatted correctly")

else:
    print("\n** Training data isn't formatted correctly!** Possible hints below")
    print(f"Type is {type(training_data)} and should be <class 'list'>")
    if type(training_data) == list and len(training_data) > 0:
        print(f"Type of list members is {type(training_data[0])} and should be <class 'tuple'>")
        if len(training_data) > 2:
            print(f"item 2 is:\n{training_data[2]}\nand should be:\n({{'pos': 'Prep', 'whole history': ('O', 'O')}}, 'O')") 

            
# If you can't tell what's wrong, try printing the whole thing
# print(training_data)


Training data is formatted correctly


### Step 1.B.  Complete the Chunker module

Finish the following module by completing the `_ConsecutiveNPChunkTagger` class. Follow the `#TODO` instructions to complete the `__init__` method and transform your `create_training_data` function into a class method. 

In [23]:
"""
FILE: custom_chunker.py AS GIVEN IN ASSIGNMENT

Based on code from http://www.nltk.org/book/ch07.html#code-classifier-chunker

Authors: Alexis Dimitriadis, Meaghan Fowlie, and #TODO you!

Use ConsecutiveNPChunker to train and use a classifier

Treat _ConsecutiveNPChunkTagger as private: do not use it directly; it is called by ConsecutiveNPChunker

"""
from abc import ABC

import nltk
from nltk.chunk.util import conlltags2tree, tree2conlltags

# If numpy is absent, nltk fails with a very confusing error.
# We avoid problems by checking directly
try:
    import numpy
except ImportError:
    print("You need to download and install numpy!!!")
    raise


class ConsecutiveNPChunker(nltk.ChunkParserI, ABC):
    """
    MaxEnt trained classifier for NER
    Classifier Input: a POS-tagged sentence -- (word, POS) list
    Classifier Output: an IOB-tagged sentence -- ((word, POS), IOB) list
    Attributes:
        tagger: a _ConsecutiveNPChunkTagger object, trained on the feature map
                and training set given to __init__
    """

    def __init__(self, feature_map, train_sents, algorithm="NaiveBayes", verbose=0):
        """
        Train a classifier on chunked data in Tree format.
        :param feature_map: The function that will compute features for each
         word in a sentence. See the NLTK book (and the assignment)
         for the arguments it must accept.
        :param train_sents: A list of sentences in chunked (Tree) format.
        :param algorithm: str: which classifier to use 
            (default NaiveBayes; other possibilities IIS, GIS, and DecisionTree)
        :param verbose: int: how much to print during training (default 0, meaning nothing)
        """        
        
        # train the tagger
        self.tagger = _ConsecutiveNPChunkTagger(feature_map,
                                                train_sents,
                                                algorithm=algorithm,
                                                verbose=verbose)

    def parse(self, sentence):
        """
        tag a sentence with IOB tags and return a tree
        :param sentence: list of (word, POS) pairs
        :return: Conll tree
        """
        tagged_sent = self.tagger.tag(sentence)
        # return to conll format
        conll_tags = [(word, pos, iob) for ((word, pos), iob) in tagged_sent]
        return conlltags2tree(conll_tags)

    chunk = parse  # A synonym for the absent-minded

    def explain(self):
        """Print the docstring of our feature extraction function"""
        print("Algorithm:", self.tagger.algorithm)
        # Print the feature map's doc string:
        print(self.tagger.feature_map.__doc__)

    def show_most_informative_features(self, n=10):
        """
        Call our classifier's `show_most_informative_features()` function.
        :param n : int: the number of features to print (default 10)
        """
        self.tagger.classifier.show_most_informative_features(n)
        
    def tag_corpus_sentence(self, sentence):
        """
        tags a sentence in nltk.Tree form
        :param sentence: nltk.Tree formatted sentence, as in the corpora
        :return tagged sentence as ((word, POS), IOB) pairs,
                where IOB are the tags predicted by the model
        """
        # turn the sentence into a unary list,
        # use reformat_corpus_for_tagger,
        # and untag the sentence
        s = nltk.tag.untag(self.tagger.reformat_corpus_for_tagger([sentence])[0])
        
        # use the trained tagger to re-tag the sentence
        return list(self.tagger.tag(s))
    
    def compare_output_to_gold(self, sentence):
        """
        tags a sentence from the corpus and prints out a word-by-word comparison with the gold data
        :param sentence: a sentence in nltk.Tree form, as in the corpora
        """
        gold = self.tagger.reformat_corpus_for_tagger([sentence])[0]
        tagged = self.tag_corpus_sentence(sentence)
        print("gold")
        print("tagged\n")
        for i in range(len(gold)):
            print(gold[i])
            print(tagged[i], "\n")


class _ConsecutiveNPChunkTagger(nltk.TaggerI):
    """This class is not meant to be
    used directly: Use ConsecutiveNPChunker instead.
    Attributes:
        feature_map: map from
                    (sentence, word index, history of features assigned so far)
                    to dict of feature name: feature value.
                    Imported from features.py.
        train_set: list of (feature dict, IOB tag) pairs
        classifier: nltk.NaiveBayesClassifier trained on train_sents (default) 
        algorithm: str: name of the algorithm for reporting
    """

    def __init__(self, feature_map, train_sents, algorithm="NaiveBayes", verbose=0):
        """
        Initialises and trains a tagger using the given features
         and training sentences
        :param feature_map: function that maps (untagged sentence, word index,
         history) to a dict of features (from features.py)
        :param train_sents: training sentences as a list of nltk.Tree objects
                            (converted internally to ((word, pos_tag), iob_tag) pairs)
        :param algorithm: str:  which training algorithm to use. Default NaiveBayes.
                                Other options are IIS, GIS, and DecisionTree.
        :param verbose: int: IIS and GIS only: how much to print during training (0 = nothing)
        """
        
        ALLOWED_ALGOS = {
            "NaiveBayes",
            "DecisionTree",
            "IIS",
            "GIS"  
        }


        # TODO: store the feature_map parameter as self.feature_map
        # TODO: call self.create_training_data on train_sents
        # TODO: check that algorithm is one of "NaiveBayes", "DecisionTree", "IIS", and "GIS"
        # and raise an error if it's not
        
        self.feature_map = feature_map
        self.train_set = self.create_training_data(train_sents)
        if algorithm not in ALLOWED_ALGOS:
            raise ValueError(f"Algorithm must be one of {ALLOWED_ALGOS}, not {algorithm}")
          
        
        # set and train the classifier
        if algorithm == "NaiveBayes":
            self.classifier = nltk.NaiveBayesClassifier.train(self.train_set)
            self.algorithm = "Naive Bayes"
        elif algorithm == "DecisionTree":
            self.classifier = nltk.DecisionTreeClassifier.train(self.train_set)
            self.algorithm = "Decision Tree"
        else:
            self.classifier = nltk.MaxentClassifier.train( 
                self.train_set, algorithm=algorithm, trace=verbose)
            self.algorithm = f"Maximum Entropy with {algorithm}"

    @staticmethod
    def reformat_corpus_for_tagger(train_sents):
        """
        Given a corpus in nltk.Tree list format, returns the corpus as a list of lists of tuples,
        where each tuple ((word, POS), IOB) includes the word, its POS tag, and the IOB tag to be predicted.
        :param train_sents nltk.Tree list of IOB-tagged sentences
        """
        return [[((word, pos), iob) for (word, pos, iob) in tree2conlltags(sent)] for sent in train_sents]
    

    def create_training_data(self, training_sentences):
        """
        Creates training data from the corpus of training_sentences and self.feature_map.
        Returns a list of (dict, IOB tag) pairs, which __init__ stores as self.train_set.
        
        :param training_sentences: list of nltk.Trees with IOB tags
        :return: list of (feature dict, IOB tag) pairs
        
        TODO make your function into a method that 
            uses the stored self.feature_map,
            calls self.reformat_corpus_for_tagger on training_sentences,
            and returns the training data
        """
        # reformat sentences to ((word, pos_tag), iob_tag) pairs
        # (call the static method, so the module works without the notebook's global function)
        formatted_train_sents = self.reformat_corpus_for_tagger(training_sentences)

        # turn the sentences into training data by extracting their features
        train_set = []

        for tagged_sent in formatted_train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []  # IOB tags assigned so far, in this sentence only
            for i, (word, tag) in enumerate(tagged_sent):
                # pass a snapshot of the history so later appends cannot change it
                feature_set = self.feature_map(untagged_sent, i, list(history))
                train_set.append((feature_set, tag))
                history.append(tag)

        return train_set

    def tag(self, sentence):
        """
        uses the trained classifier to tag a sentence
        :param sentence: list of (word, pos_tag) pairs
        :return: list of ((word, pos_tag), IOB_tag) pairs
        """
        history = []
        for i in range(len(sentence)):
            # extract the features
            feature_dict = self.feature_map(sentence, i, history)
            # tag the sentence
            tag = self.classifier.classify(feature_dict)
            history.append(tag)
        # return a list, as promised in the docstring, rather than a one-shot zip iterator
        return list(zip(sentence, history))


Once we have our feature function, we can train a recognizer for the
Dutch CONLL corpus as shown below. This classifier uses a Naive Bayes training algorithm.

It sometimes takes a long time to train a recognizer so we demonstrate here with a tiny
training set. Unsurprisingly,
it's too small for the chunker to do anything useful with novel test
data. Use larger datasets judiciously: Very short training sets are
fine for checking if your code runs or crashes, but to find out if a new
feature improves accuracy, you need to train on the entire dataset--
or at least a substantial portion (several thousand sentences).

If your code is too slow for anything but trivial datasets, figure out
what is slowing it down. 

In [24]:
# from custom_chunker import ConsecutiveNPChunker

training = conll.chunked_sents("ned.train")[:100]

# train a model on 100 training items
test_nl_NER = ConsecutiveNPChunker(test_features, training)

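We evaluate our recognizer by calling its `evaluate()` method.
Evaluation is a lot faster than training, so we use the entire test
set `ned.testa`. (Recent versions of the NLTK print a deprecation warning that recommends `accuracy()` instead, as in the output below.)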

(Are you wondering why there's an `evaluate` method even though it's not in the code for `ConsecutiveNPChunker`? It's _inherited_ from its parent class, `nltk.ChunkParserI`.)

In [25]:
print(test_nl_NER.evaluate(testing))

  Function evaluate() has been deprecated.  Use accuracy(gold)
  instead.


ChunkParse score:
    IOB Accuracy:  90.0%%
    Precision:      6.3%%
    Recall:         0.2%%
    F-Measure:      0.4%%


The `ConsecutiveNPChunker` class has some useful methods for development, including one that tags a sentence and prints the gold and predicted tags word by word:

In [26]:
test_nl_NER.compare_output_to_gold(testing[1])

gold
tagged

(('Bomaanslag', 'Adj'), 'O')
(('Bomaanslag', 'Adj'), 'O') 

(('op', 'Prep'), 'O')
(('op', 'Prep'), 'O') 

(('Indiase', 'Adj'), 'B-MISC')
(('Indiase', 'Adj'), 'O') 

(('trein', 'N'), 'O')
(('trein', 'N'), 'O') 

((':', 'Punc'), 'O')
((':', 'Punc'), 'O') 

(('twaalf', 'Num'), 'O')
(('twaalf', 'Num'), 'O') 

(('doden', 'Adj'), 'O')
(('doden', 'Adj'), 'O') 



Unsurprisingly, relying only on the part of speech and history is a very poor way
to identify named entities. Our trivial recognizer trained on 100 sentences finds a negligible
proportion of all named entities (0.2% recall). Of the chunks it marks
as named entities, a small proportion (6.3%) are indeed named
entities; the rest were marked incorrectly. It does, however, get most of the I/O/B tags right. (Why do you think this is?)

When the whole training corpus is used, it finds no named entities at all. (Can you see why, with this feature set?)
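
One way to explore these questions is to compute the scores by hand for a single sentence. The sketch below is a simplified, hypothetical scorer (the NLTK's own chunk scoring may differ in details). It uses the gold tags of the example sentence above and a prediction that tags every word `O`.

```python
def iob_chunks(tags):
    """Return the set of (start, end, type) chunks in a sequence of IOB tags."""
    chunks, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final chunk
        continues = tag.startswith("I-") and tag[2:] == kind
        if start is not None and not continues:
            chunks.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return chunks

gold = ["O", "O", "B-MISC", "O", "O", "O", "O"]
predicted = ["O", "O", "O", "O", "O", "O", "O"]

# Per-tag accuracy counts every word, including the many "O" words
tag_accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# Precision and recall count whole chunks only
gold_chunks, guessed_chunks = iob_chunks(gold), iob_chunks(predicted)
correct = len(gold_chunks & guessed_chunks)
precision = correct / len(guessed_chunks) if guessed_chunks else 0.0
recall = correct / len(gold_chunks) if gold_chunks else 0.0

print(f"tag accuracy {tag_accuracy:.2f}, precision {precision:.2f}, recall {recall:.2f}")
```

Six of the seven tags match, yet the only named entity in the sentence is missed, so the per-tag accuracy and the chunk-level recall tell very different stories.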

## 3. Step 2: Turn it into a program

Training a non-trivial classifier is too time-consuming to keep
entirely in a notebook. Prepare to work with python scripts (in Spyder
or in your favorite editor), as follows:

1. Save the `custom_chunker` module code above in a file named `custom_chunker.py`. Note that this code is also provided as `incomplete_custom_chunker.py`, so if you've been editing that file, just rename it to `custom_chunker.py`.

2. Create a module `features.py` for your feature extractors. 
Put the definition of `test_features()` there as a starter. 
(You should later add, and use, additional functions.)

3. You can now import both modules, or parts of them, for use in a
Notebook or in other scripts. (Be careful if you're importing into a Notebook. If you edit the file, you may need to restart the kernel in order to properly re-import the edited module.) 

For example:

In [1]:
from custom_chunker import ConsecutiveNPChunker
from features import test_features

In [2]:
from nltk.corpus import conll2002 as conll
training = conll.chunked_sents("ned.train")[:100]  # define the training data first
my_recognizer = ConsecutiveNPChunker(test_features, training)
# etc.

## 4. Pickling and unpickling successfully

We have already seen how to pickle and reload a trained tagger.
Working with a classifier is slightly more complicated, since its
operation relies on code that we write and revise.  This requires some
care to work correctly.

It is important to understand that **pickling in python only stores
data.** Pickled objects do not store python code for function or class
definitions. To reload a pickled object, python must be able to find
the definition of its class and of any functions the object refers to.

1. During pickling, a record is made of the modules where the needed
types and functions were defined.

2. During unpickling, the types and functions are imported from the
recorded modules and used with the reloaded objects.

This is a bit of behind-the-scenes magic, so you must take some care
to avoid problems:

* After you store a pickled object, you should not modify the
functions and classes it depends on (i.e., `ConsecutiveNPChunker` and
the feature extraction function it uses). **If you modify your feature 
function after pickling a model, the model will become invalid.**
Your model may or may not cause runtime errors, but the statistics 
will be incorrect and you'll have to train and pickle a new
version. Use a different name for each version of your feature extraction 
function (`chunkfeatures_2`, `big_features`, or whatever), so 
that unpickled models can retrieve the right function later.

* Code can only be found in *named* modules, but the main script does
not count as a regular named module (its name is always just
`__main__`). If you define your feature function (e.g.,
`chunkfeatures_1`) in your main script, pickle a model, and unpickle
it from a _different_ script, python will not be able to find your
function. The solution is simple:

    * **All necessary classes and functions should be defined in
modules (one or several), and *imported* into your main script.**

 The script that unpickles your model will then know where to find
everything.



**In short:** Use a different name for each new feature extraction
function you define, and keep their definitions in modules, not in
an `if __name__ == "__main__"` script section.
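
The "data only" behaviour is easy to observe with a function from the standard library. This is a small sketch, using `json.dumps` purely as a convenient example of a function that lives in a named module.

```python
import json
import pickle

# Pickling a function stores its module and name, not its code
payload = pickle.dumps(json.dumps)
print(payload)

# Unpickling imports the module and looks the name up again
restored = pickle.loads(payload)
print(restored is json.dumps)
```

The printed bytes contain little more than the strings `json` and `dumps`. The same mechanism applies to your feature function: the pickle records where to find it, so the module and the function name must still exist, unchanged, when the model is loaded.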


## 5. Self-testing

Here is a simple script that should be able to use your pickled
tagger. Ensure that your code is compatible with it. If your code does
not work with this script, **do not modify the script.** Fix your code
so that it is compatible with the script. If your code does not work with this version of the testing script, it will not be compatible with our grading scripts.

In [None]:
"""
FILE: model_test.py
Author: Alexis Dimitriadis

Tests the functionality of a pickled model
"""

import pickle
ner = pickle.load(open("best.pickle", "rb"))

from nltk.corpus import conll2002 as conll

# Usage 1: parse a list of sentences (with POS tags)
tagzinnen = conll.tagged_sents("ned.train")[1000:1050]
result = ner.parse_sents(tagzinnen)

# Usage 2: self-evaluate (on chunked sentences)
chunkzinnen = conll.chunked_sents("ned.testa")[1000:1500]

print(ner.evaluate(chunkzinnen))

## 6. Step 3: Write scripts to build and evaluate your models

Write a script `build_models.py` that trains and pickles classifiers. 

Write a script `evaluate_models.py` that loads and evaluates your models. Make sure it prints at least: something that identifies the model/feature set, and the precision, recall, and F-measure for each model. You may also find you want to print more things; use your discretion.

Give both scripts useful command-line interfaces that allow you to choose what to train and what to evaluate, and explain in a help message or a README file how to use them.
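
As a starting point, a command-line interface can be built with the standard `argparse` module. The sketch below is only one possible design: the function name `build_arg_parser` and the options `--algorithm` and `--limit` are hypothetical choices, not requirements of the assignment.

```python
import argparse

def build_arg_parser():
    """Hypothetical command-line interface for a script like build_models.py."""
    parser = argparse.ArgumentParser(description="Train and pickle NER models.")
    parser.add_argument("feature_functions", nargs="+",
                        help="names of feature functions defined in features.py")
    parser.add_argument("--algorithm", default="NaiveBayes",
                        choices=["NaiveBayes", "DecisionTree", "IIS", "GIS"],
                        help="training algorithm (default: NaiveBayes)")
    parser.add_argument("--limit", type=int, default=None,
                        help="train on the first LIMIT sentences only (for debugging)")
    return parser

# Parse an explicit argument list here; a real script would call parse_args()
args = build_arg_parser().parse_args(["test_features", "--limit", "100"])
print(args.feature_functions, args.algorithm, args.limit)
```

With a parser like this, running the script with `-h` prints a usage message built from the `help` strings, which covers part of the documentation requirement.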

## 7. Step 4: Improve the model by improving the feature selection

**In brief:** Choose suitable features to improve your classifier's
performance.

Now that you have a minimal working classifier, define better 
feature functions to improve your classifier's performance. Train with
`ned.train`, and evaluate performance on the data in `ned.testa` only.

You can start with including capitalization information about the
current word; you can also experiment with features about the number
of letters in the word, the tag that precedes or follows it, whether
it follows a word that is already marked as part of a named entity
(use the `history` vector), etc.
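As an illustration of these suggestions, here is a minimal sketch of a feature function. It assumes the `(sentence, i, history)` signature of the NLTK book's consecutive taggers, with `sentence` a list of `(word, pos)` pairs and `history` the list of IOB tags already assigned to the preceding tokens; check that this matches the interface of your own `chunkfeatures_1`. The function name and the feature names are hypothetical.

```python
def chunkfeatures_caps(sentence, i, history):
    """Features for token i of a POS-tagged sentence.

    sentence: list of (word, pos) pairs
    history:  IOB tags already assigned to tokens 0 .. i-1
    """
    word, pos = sentence[i]
    if i == 0:
        # No previous token: use a placeholder value.
        prev_word, prev_pos = "<START>", "<START>"
    else:
        prev_word, prev_pos = sentence[i - 1]
    return {
        "pos": pos,
        "capitalized": word[:1].isupper(),
        "length": len(word),
        "prev_pos": prev_pos,
        "prev_capitalized": prev_word[:1].isupper(),
        # True if the previous token was tagged as part of a named entity.
        "prev_in_entity": i > 0 and history[i - 1] != "O",
    }


# Toy sentence; the first token has already been tagged B-PER.
sentence = [("Jan", "N"), ("Peeters", "N"), ("woont", "V"),
            ("in", "Prep"), ("Gent", "N")]
features = chunkfeatures_caps(sentence, 1, ["B-PER"])
print(features)
```

Whether any of these features actually helps is an empirical question; evaluate each change on `ned.testa` before keeping it.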

You are also allowed to use (some) external resources. E.g., our list
of Dutch proper names (from an earlier activity) might be helpful in
recognizing whether some word is a person's name: Add a feature that
tells you whether a word is in the list of names (or perhaps in an abbreviated
list containing just the most common names). You may also use any dataset
that is provided by the NLTK (available through `nltk.download`). Ask
us first before using any other external resources.
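A name-list feature could look roughly like the sketch below. The three names are placeholders for the demonstration only; in your own code you would read the real list from a file that you ship with your submission. The function names are hypothetical, and the `(sentence, i, history)` signature is assumed as before.

```python
def load_names(lines):
    """Build a lowercase lookup set from an iterable of lines (one name each)."""
    return {line.strip().lower() for line in lines if line.strip()}


# Placeholder data; replace with e.g. load_names(open_file_with_names).
DEMO_NAMES = load_names(["Jan\n", " Marie \n", "\n", "Pieter\n"])


def chunkfeatures_names(sentence, i, history):
    """POS tag plus a boolean: is the current word in the name list?"""
    word, pos = sentence[i]
    return {"pos": pos, "in_name_list": word.lower() in DEMO_NAMES}


sentence = [("Marie", "N"), ("fietst", "V")]
print(chunkfeatures_names(sentence, 0, []))
print(chunkfeatures_names(sentence, 1, ["B-PER"]))
```

Using a `set` keeps the membership test fast even for a long list. Keeping the list in a module-level constant means the feature function retains the three-argument signature and can be found again when the model is unpickled.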

These are only suggestions: Utilize your reading and your imagination
to come up with more.

It is not necessary to *report* on (i.e. write about in your report) every version of the feature
extractors you try out, but you may include a demonstration of more
than one if you find you have something interesting to report.

### `evaluation_output.txt`: a record of your experiments

As you experiment with models, save the output of `evaluate_models` in a file `evaluation_output.txt`. The `evaluation_output.txt` and `features.py` that you hand in should be consistent with each other: every feature function in (or imported into) `features.py` should have output in `evaluation_output.txt`, and every model in `evaluation_output.txt` should have its feature function in (or imported into) `features.py`. Include an absolute minimum of 3 feature functions and an absolute minimum of 7 features altogether.

Make sure your `evaluate_models` script prints some information identifying the models. When you hand in your assignment, add a line to `evaluation_output.txt` identifying the model pickled as `best.pickle`, and a line identifying your best model overall if you have an even better one with a different training algorithm (see Step 4.a below).
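One way to make each entry self-identifying is to format it with a small helper, as in this sketch. The helper name, its parameters, and the layout are hypothetical; the scores would come from the object returned by `evaluate()` (the NLTK's `ChunkScore` offers `precision()`, `recall()` and `f_measure()` methods).

```python
def format_report(model_name, algorithm, feature_names,
                  precision, recall, f_measure):
    """Return a readable, self-identifying report for one evaluated model.

    precision, recall and f_measure are fractions between 0 and 1.
    """
    lines = [
        "=" * 60,
        f"MODEL: {model_name}",
        f"Algorithm: {algorithm}",
        f"{len(feature_names)} features:",
    ]
    lines.extend(f"    {name}" for name in feature_names)
    lines.append("")
    lines.append(f"Precision: {precision:6.1%}")
    lines.append(f"Recall:    {recall:6.1%}")
    lines.append(f"F-Measure: {f_measure:6.1%}")
    return "\n".join(lines)


# Made-up feature names and scores, for layout only.
report = format_report(
    "demo", "Naive Bayes", ["pos", "capitalized"], 0.5, 0.25, 1 / 3
)
print(report)
```

Redirecting the script's standard output to a file (for example `python evaluate_models.py > evaluation_output.txt`) then produces the record directly.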

### Step 4.a: OPTIONAL: improve other things about your model

If you're feeling ambitious, you may change other things as well; for example, you could swap the learning algorithm for the MaxEnt GIS or IIS algorithm, or for a Decision Tree (all available as options in the classifier class we built), and see whether it does better.

If you do anything that involves changing the provided code beyond what the assignment asks, make sure the default behaviour is the same as in the assignment, and be sure to document it. For example, if you add any new arguments to any methods, make sure they have default values. 

Please note that the Naive Bayes algorithm is much, much faster to train than the others, which is why it is the default in the class we built. It is not usually the best one, though. As such, we have different criteria for what counts as a good model depending on your algorithm. This is to keep the training manageable, especially if your computer is slow.
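To compare training costs across algorithms (and to have numbers for your report), you can time the training call. This is a minimal sketch using `time.perf_counter`; `timed` is a hypothetical helper and `fake_train` is a stand-in for your real training function.

```python
import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs); return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_train(sentences):
    """Stand-in for real training: just counts the sentences."""
    return {"n_sentences": len(sentences)}


train_sents = [["a"], ["b"], ["c"]]
model, seconds = timed(fake_train, train_sents)
print(f"Time to train: {seconds:.3f} seconds "
      f"({seconds / len(train_sents):.6f} per sentence)")
```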

## 8. Step 5: Performance evaluation

Train your best, finished Naive Bayes classifier on the complete set of sentences
in `ned.train`, pickle it, and evaluate it on the data in `ned.testa`. Call the pickle `best.pickle`.
Include the results in your short report (see Submission).

If you have a good classifier that uses a different algorithm, pickle and evaluate that too. 

Precision and recall are discussed in [section 6.3.3][3.3] of the NLTK
book. The section immediately below that presents the NLTK's
[`ConfusionMatrix`][3.4] class, which is useful for identifying
where your classifier makes mistakes. (Its use is optional.)

You may also find the NLTK's
[`nltk.chunk.util.ChunkScore`](https://www.nltk.org/api/nltk.chunk.util.html#nltk.chunk.util.ChunkScore)
class useful (again, optional).
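Note that precision and recall are computed over whole chunks, not over individual IOB tags, which is why they can be much lower than the IOB accuracy. The sketch below shows the arithmetic on made-up data. It assumes a chunk counts as correct only if both its span and its type match exactly; the tuple representation `(sentence index, start, end, type)` is a hypothetical simplification, not the NLTK's internal format.

```python
def chunk_scores(gold_chunks, guessed_chunks):
    """Return (precision, recall, F-measure) over two collections of chunks."""
    gold, guessed = set(gold_chunks), set(guessed_chunks)
    correct = gold & guessed
    precision = len(correct) / len(guessed) if guessed else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up example: 3 gold entities, 4 guessed, 2 of them exactly right.
gold = {(0, 0, 2, "PER"), (0, 4, 5, "LOC"), (1, 3, 4, "ORG")}
guessed = {(0, 0, 2, "PER"), (0, 4, 5, "ORG"),   # right span, wrong type
           (1, 0, 1, "PER"),                     # spurious entity
           (1, 3, 4, "ORG")}

precision, recall, f_measure = chunk_scores(gold, guessed)
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")
```

Here two of the four guessed chunks are correct (precision 2/4) and two of the three gold chunks are found (recall 2/3).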

[3.3]: http://www.nltk.org/book/ch06.html#precision-and-recall
[3.4]:  http://www.nltk.org/book/ch06.html#confusion-matrices

**What performance should you aim for?** As a baseline, a very simple
six-feature Naive Bayes classifier achieved about 42% precision and 57% recall,
trained on the entire `ned.train` corpus and tested on `ned.testa`.
Your solution should do at least this well.

<pre>
Algorithm: Naive Bayes
Time to train: 6.433463261 seconds. (0.000407 per sentence)

    6 features:
        POS
        word (string)
        first letter is capital (boolean)
        first letter of prev word is capital (boolean)
        previous word
        previous POS
    This is the 6-feature one mentioned in the assignment
    
ChunkParse score:
    IOB Accuracy:  93.6%%
    Precision:     42.0%%
    Recall:        56.7%%
    F-Measure:     48.2%%

</pre>

A more complex classifier based on the feature set of NLTK's built-in NER function actually performs worse with the Naive Bayes algorithm, which is something to keep in mind as you explore: more features are not always better.

<pre>

Algorithm: Naive Bayes
The feature set used by the NLTK's NER
ChunkParse score:
    IOB Accuracy:  90.4%%
    Precision:     37.9%%
    Recall:        51.7%%
    F-Measure:     43.7%%

</pre>

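With the IIS MaxEnt algorithm, the same feature set does considerably better, but training is very slow: it took more than two and a half hours in the run below, and will take longer on a slow machine. This is not expected of you.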


<pre>
MODEL: NLTK feature set
MaxEnt, IIS algorithm
15 features, 15806 train sentences, 2895 testing
Time to train: 9473 seconds (0.6s per sentence)

ChunkParse score:
    IOB Accuracy:  96.2%%
    Precision:     73.8%%
    Recall:        61.6%%
    F-Measure:     67.2%%

</pre>



Your classifier will also be evaluated on sentences from outside the `train` and `testa` sets.

## 9. Step 6: Submission

Prepare to submit your code, organized in the following units (which
may share some additional modules that you **also**
submit).

Double-check that the code and models you upload can run without
modifications, and according to the instructions. Your code should
contain no absolute paths or Windows-only paths with backslashes (`\`).
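A simple way to satisfy the path requirement is to build paths with `pathlib` from relative components, as in this sketch. The directory name `data` and the file name `names.txt` are hypothetical examples.

```python
from pathlib import Path

# Relative to the directory the script is run from; no drive letters,
# no backslashes, nothing specific to your own machine.
DATA_DIR = Path("data")


def data_file(name):
    """Return the relative path of a data file shipped with the code."""
    return DATA_DIR / name


names_path = data_file("names.txt")
print(names_path.as_posix())      # forward slashes on every platform
print(names_path.is_absolute())
```

`pathlib` inserts the correct separator for the operating system the code runs on, so the same source works on Windows, macOS and Linux.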

**Use good style**:

* All imports should be at the beginning of the file
* Name your functions and variables informatively
* If you find yourself copying and pasting more than a couple of lines of code into multiple places, instead factor it out into a function.
* Include **comments** for anything complex. A good rule of thumb is that if something is hard for you to write or explain, it might need a comment. 
* **Docstrings**: Make sure every **file, class, and function** has a docstring, and edit the docstrings and comments provided here if necessary, removing the TODOs and updating it to reflect anything about your code that isn't in the original docstring. 

In the file-level docstring, please include the authors of the file.

### Zip the following files and hand in the zip file

1. **Your Python source files.** They should consist of the following
standard files, plus any additional files you find it useful to
create:

    1. Your script `build_models.py` that trains and pickles the different
       classifiers you have developed.
    2. Your script `evaluate_models.py` that loads and evaluates your
       models. Ensure that the output of your evaluation script is a comprehensible report,
       not just raw numbers. E.g., print out a line identifying the
       classifier that you are about to `evaluate()`, and add some blank
       lines or other visual structure.
    3. The module `features.py` that defines, or imports from other
       modules, one or more feature extractor functions (whatever you find
       worth reporting on).
    4. The file `evaluation_output.txt`, with the saved output of the
       evaluation. This should include outputs for any models you mention in the report, even if they're not your best. We expect to see at least a few here. Add a line of text identifying the model saved as `best.pickle`.
    5. Any additional files you require, including external data
       (e.g., a list of names).
    6. **Optional**:
        * Your version of `model_test.py` if you needed to make **any** changes to it to get it to run. (See below.)
        * Your version of `custom_chunker.py` if you made any changes. Make sure to mention it in the report.

 
<p/>

2. **The pickled model of your best Naive Bayes classifier.** Please name it
`best.pickle` and **test** that it can be reloaded and works correctly
with the simple script `model_test.py` (see above). If you needed to make **any** changes to `model_test.py` to get it to run, be sure to include
your version in the upload. **If you need to do this, it will cost you a few points**<p/>

3. **Optional**: Any additional pickled model you want to share, e.g. if you trained with a different algorithm.
<p/>

4. **A short report** (1-2 pages, **txt file, markdown file, or PDF only**), summarizing your
results. (Cf. the results from step 4.) Keep it clear, and walk your reader through your reasoning.
    * Report **and briefly explain** the features you used 
    * Report relevant precision, recall and F-values
    * Report about how much time was needed for training and for evaluation of your best model(s)
    * Any additional explanations or observations about the task.
    * If you ran into difficulties or did anything extra, this is where you should mention it.
    * **Optional**: You learned about the Naive Bayes algorithm in class. Do you have an idea about why it is so fast, and also so poor?


## 10. Practicalities

* There will be a small grade bonus (and bragging rights!) for the three groups that achieve the highest F-scores on new data with a Naive Bayes classifier, and bragging rights (only) for the 3 groups with the best models overall, if they're not Naive Bayes.


*    Your work should be your own. If you rely on an external source for a specific piece of code, cite it in a comment or in the docstring at that place in the code. Also add a citation to your report for every external source you use. Do not ask for help in any forum.


*    You can freely use any part of the `conll` module itself and any
     other parts of the NLTK (except, obviously, the built-in NER!), including utility routines for evaluation, etc.
     You may use standard Python libraries, and all libraries included with
     Anaconda, but not libraries that must be separately downloaded, except
     with prior approval. If in doubt, ask us.