# 9. Evaluation: NER, Chunking, and other retrieval tasks

_Practical Python for Linguistics and the Humanities -- Alexis 
Dimitriadis_

## Contents


**[1. Evaluating a retrieval task: Adjective recognition](#1.-Evaluating-a-retrieval-task:-Adjective-recognition)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [1.1 Prepare the testing data](#1.1-Prepare-the-testing-data)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [1.2 Examine the data](#1.2-Examine-the-data)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [1.3 Prepare the tagger](#1.3-Prepare-the-tagger)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [1.4 Tag and evaluate](#1.4-Tag-and-evaluate)  

**[2. Chunking the CONLL2002 corpus](#2.-Chunking-the-CONLL2002-corpus)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [2.1 IOB and Tree formats](#2.1-IOB-and-Tree-formats)  
&nbsp;&nbsp;&nbsp;&nbsp;
  [2.2 Converting between formats](#2.2-Converting-between-formats)  

**[3. Evaluating the NLTK's own Named Entity chunker](#3.-Evaluating-the-NLTK's-own-Named-Entity-chunker)**  
&nbsp;&nbsp;&nbsp;&nbsp;
  [3.1 Error analysis](#3.1-Error-analysis)  


Measuring the performance of our tools is an important part of Natural
 Language Processing. While in tasks like POS tagging every word must 
be tagged, information retrieval tasks are only concerned with some 
items out of the mass of data.  The notions of precision and recall, 
and the "F-score" that combines them, measure the success of our 
solutions. 


## Reading:

* NLTK book, [section 6.3: 
Evaluation](http://www.nltk.org/book/ch06.html#evaluation).  
See also Jurafsky and Martin section 5.7.


* NLTK book, [section 7.2: Chunking.](http://www.nltk.org/book/ch07.html#sec-chunking)

- - - - 

## 1. Evaluating a retrieval task: Adjective recognition

We begin by evaluating a simple (and easy) information retrieval task:
 Finding adjectives
in the Dutch [CONLL 2002 
corpus](http://www.cnts.ua.ac.be/conll2002/ner/). We will use the 
Dutch tagger you built in a previous assignment, and evaluate its 
performance for the adjective retrieval task.

The tags that come with the corpus are our "gold standard": We will 
assume (contrary to fact!) that they are always right. We will compare
 our tagger's performance on the test data with the "gold standard" 
answers.

### 1.1 Prepare the testing data


Once again we'll use the [CONLL 2002 
corpus](http://www.cnts.ua.ac.be/conll2002/ner/), which comes as part 
of the NLTK. Recall that the CONLL2002 corpus includes texts in Dutch 
and in Spanish, so we must _always_ take care to only fetch the 
appropriate file. We will use the data in the file `"ned.train"`, with
 the same division into training and testing sentences as before: 
Sentences with index less than 1500 are for testing, and the rest are 
for training. Never test a solution on the data it was 
trained with!

In [None]:
from nltk.corpus import conll2002 as conll

allsents = conll.tagged_sents("ned.train")
test_sents = allsents[:1500]
train_sents = allsents[1500:]

**Your turn:** Create an untagged version of the test part of the 
corpus.

In [None]:
# FILL IN:

test_sents_untagged = ...

### 1.2 Examine the data

Find out how many adjectives there _really_ are in the **testing 
part** of the corpus. You'll need to loop over the sentences in 
`test_sents` and count how many words are tagged as adjectives. Print 
out the result nicely. There are over one thousand adjectives in the 
test set.

In [None]:
# YOUR CODE:



### 1.3 Prepare the tagger

1. Use your notebook "training a tagger" to create and pickle a Dutch 
tagger, trained on the data in `train_sents`. If you still have your 
pickled tagger, you may use it instead of training a new one.<p/>

2. In the code cell below, load your pickled tagger into a variable 
`tagger`. (Move or copy your tagger to the directory containing this notebook, 
if necessary.)
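
As a reminder of how pickling works, here is a minimal round trip with a stand-in object. The file name `dutch_tagger.pickle` and the dictionary that plays the role of the tagger are hypothetical; your own file name and object will differ.

```python
import os
import pickle
import tempfile

# Stand-in for a trained tagger (an assumption for this demo);
# any picklable Python object is saved and restored the same way.
stand_in = {"de": "Art", "man": "N"}

# Hypothetical file name, written to a temporary directory for this demo.
path = os.path.join(tempfile.mkdtemp(), "dutch_tagger.pickle")

with open(path, "wb") as outfile:    # pickle files are opened in binary mode
    pickle.dump(stand_in, outfile)

with open(path, "rb") as infile:
    restored = pickle.load(infile)

print("Restored object equals the original:", restored == stand_in)
```

Only load pickle files that you created yourself or that come from a source you trust, since unpickling can execute arbitrary code.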

In [None]:
# YOUR CODE:



### 1.4 Tag and evaluate

Tag the untagged version of our test set with the tagger and save the 
result in a list. Compare the newly tagged sentences with the 
correctly tagged "gold standard": Count the adjectives that 
are correctly retrieved (true positives), those that were missed 
(false negatives), and the words wrongly tagged as adjectives (false 
positives). Print these numbers out for our information.

Hint: To iterate over two lists of sentences in parallel, you can use 
either an index variable or the function `zip()`.
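
As a sketch of the `zip()` option, the cell below walks two toy lists of tagged sentences in parallel and collects the words whose tags differ. The sentences and tags are made up for illustration, and the names `gold_sents`, `guessed_sents`, and `differences` are hypothetical.

```python
# Toy data: two parallel lists of tagged sentences (hypothetical words and tags).
gold_sents = [
    [("de", "Art"), ("oude", "Adj"), ("man", "N")],
    [("hij", "Pron"), ("slaapt", "V")],
]
guessed_sents = [
    [("de", "Art"), ("oude", "N"), ("man", "N")],
    [("hij", "Pron"), ("slaapt", "V")],
]

differences = []
# The outer zip pairs up sentences; the inner zip pairs up their tokens.
for gold_sent, guessed_sent in zip(gold_sents, guessed_sents):
    for (word, gold_tag), (_, guessed_tag) in zip(gold_sent, guessed_sent):
        if gold_tag != guessed_tag:
            differences.append((word, gold_tag, guessed_tag))

for word, gold_tag, guessed_tag in differences:
    print(f"{word}: gold={gold_tag}, guessed={guessed_tag}")
```

Note that `zip()` stops at the end of the shorter list, so this only works if both lists contain the same sentences with the same number of words.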

In [None]:
# FILL IN:

...


print("Found correctly:", found)
print("Missed:         ", missed)
print("False positives:", falsepos)

Use these counts to calculate and print the precision, recall, and 
F-score. (Definitions in the NLTK book's section on 
[evaluation](http://www.nltk.org/book/ch06.html#evaluation).)

If your tagger was built according to our recipe (with all the 
back-off taggers in place),  recall should be around 0.82 and 
precision around 0.97.
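
For reference, the standard definitions can be written as a small function. This is a sketch with toy counts, not results from the corpus; the function name `precision_recall_f` is hypothetical, and returning 0.0 when a measure is undefined (nothing retrieved, or nothing to find) is a convention chosen here.

```python
def precision_recall_f(true_pos, false_neg, false_pos):
    """Return (precision, recall, F-score) computed from raw counts."""
    retrieved = true_pos + false_pos   # everything the system proposed
    relevant = true_pos + false_neg    # everything it should have found
    # Assumption: report 0.0 instead of failing when a measure is undefined.
    precision = true_pos / retrieved if retrieved else 0.0
    recall = true_pos / relevant if relevant else 0.0
    total = precision + recall
    # The F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

# Toy counts for illustration only
p, r, f = precision_recall_f(true_pos=6, false_neg=2, false_pos=1)
print("Precision: %.3f" % p)
print("Recall:    %.3f" % r)
print("F-score:   %.3f" % f)
```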

In [None]:
# FILL IN:

...


print("Recall:    %.3f" % recall)
print("Precision: %.3f" % precision)
print("F-score:   %.3f" % f_score)

## 2. Chunking the CONLL2002 corpus


CONLL stands for the _Conference on Computational Natural Language 
Learning_, and each year's shared task targets a different NLP 
problem. CONLL2002 is about "language independent" Named Entity 
recognition, by which they meant languages other than English: Dutch 
and Spanish data are included.

The CONLL2002 corpus is loaded in the usual way:

In [None]:
from nltk.corpus import conll2002 
conll = conll2002 # Save us some typing with this alias

The corpus provides the usual methods `words(), sents(), 
tagged_words(), tagged_sents(), fileids()`, as well as some special
ones: `chunked_sents()`, `iob_sents()`, and `iob_words()`.

### 2.1 IOB and Tree formats

Because named entities can consist of several words, the task of 
recognizing them falls in the category of _chunking:_ Recognizing and 
categorizing small groups of words. 
As the NLTK book [explains][iob], the **IOB format** (Inside, Outside,
 Begin) is a
commonly used form of tagging that identifies chunks. Text in the
NLTK's IOB format looks like tagged sentences and words, except there
are three items in each tuple: _word, POS-tag, IOB-tag._

[iob]: http://www.nltk.org/book/ch07.html#representing-chunks-tags-vs-trees
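
To make the meaning of the three tag types concrete, here is a sketch in plain Python that groups IOB triples into chunks. The sentence, its tags, and the function name `iob_to_chunks` are made up for illustration. Treating a stray `I-` tag as the start of a new chunk is a lenient choice made here, not something the format prescribes.

```python
def iob_to_chunks(iob_sentence):
    """Group (word, pos-tag, iob-tag) triples into (label, [words]) chunks."""
    chunks = []
    open_label = None                      # label of the chunk being built, if any
    for word, pos, iob in iob_sentence:
        if iob.startswith("I-") and iob[2:] == open_label:
            chunks[-1][1].append(word)     # "I-": continue the open chunk
        elif iob.startswith(("B-", "I-")):
            open_label = iob[2:]           # "B-" (or a stray "I-"): new chunk
            chunks.append((open_label, [word]))
        else:
            open_label = None              # "O": outside any chunk
    return chunks

# Toy sentence in (word, pos-tag, iob-tag) form; not taken from the corpus
toy_iob = [
    ("Alice", "N", "B-PER"),
    ("Smith", "N", "I-PER"),
    ("visited", "V", "O"),
    ("Utrecht", "N", "B-LOC"),
    (".", "Punc", "O"),
]

for label, words in iob_to_chunks(toy_iob):
    print(label, "->", " ".join(words))
```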

In [None]:
for w in conll.iob_sents("ned.train")[14]:
    print(w)    

**Your turn:** 

List the named entities in the above sentence, along with their type.

But the NLTK's designers decided to also provide a **different,** more
 convenient format.  The
method `chunked_sents()` returns instances of the NLTK's `Tree` class,
a special kind of `list` with several extra methods.

In [None]:
sample = conll.chunked_sents("ned.train")[14]
print("Type:", type(sample))
print("Top-level label:", sample.label())
print("Contents:\n"+repr(sample))

We used `repr()` above to show the internal structure of our sample tree. 
When printed the regular way, the `Tree` class displays itself in a 
nice readable form:

In [None]:
print(sample)

If your Python configuration supports it (our Notebooks do), the 
method `draw()` will
display an image of the tree structure in a separate window. 
You'll have to find the pop-up window (it might be hidden behind other windows, like the `nltk.downloader()` window was) and close it to continue execution. 

**Note.** If the tree window does not appear, it _may_ help to add this "magic" notebook command at the top of the code cell, above your code:

    %matplotlib

In [None]:
sample.draw()

It can be seen that this tree contains branches for the two named
entities, _Ruimtelijke Ordening_ (an organization) and _BPA_
("miscellaneous").

* A chunked sentence is a Tree of maximal height 3: the sentence node,
the chunks, and the chunk words. (The height will be 2 if there are no chunks in the sentence.)


* If a sentence contains no chunks, it will have the usual tagged
sentence structure: A list of `(word, pos-tag)` pairs. Although it is
a `Tree`, not a `list`, you can use all the usual list methods.


* Each chunk is a `Tree` containing the chunk's
tagged words. This also applies to one-word "chunks" (i.e., one-word 
named entities). 
Unchunked words are direct children of the top-level `Tree`, as in 
sentences without chunks.

Note that the above only describes _chunked_ trees. Full sentence structure consists of many levels of trees and subtrees. The NLTK's parsed corpora, for example, have deep trees with many levels of branches. 
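
The properties listed above can be checked on a small hand-built tree. This is a sketch with made-up words, tags, and chunk labels; it only needs the `nltk` package itself, not the corpus data.

```python
from nltk import Tree

# Hand-built chunked sentence (toy words, tags, and chunk labels)
chunked = Tree("S", [
    Tree("PER", [("Alice", "N"), ("Smith", "N")]),
    ("visited", "V"),
    Tree("LOC", [("Utrecht", "N")]),
])

# A sentence without chunks: only (word, pos-tag) pairs under the top node
plain = Tree("S", [("it", "Pron"), ("rains", "V")])

print("Heights:", chunked.height(), plain.height())
print("Number of children:", len(chunked))      # list behaviour: len()
print("Last child:", chunked[-1])               # list behaviour: indexing
print("All tagged words:", chunked.leaves())    # Tree method: flatten the chunks
```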

A `Tree` object is also a list. The list elements are the children of
the top node.

In [None]:
print("Type:", type(sample))
print("Top-level label:", sample.label())
print("Elements:")
for part in sample:
    print(part)

Let's examine the first named entity, which is at index zero:

In [None]:
print("Type:", type(sample[0]))
print("Chunk label:", sample[0].label())
for part in sample[0]:
    print(part)

To distinguish between words (i.e., tuples) and subtrees, use the
function `isinstance`.

In [None]:
from nltk import Tree  # Import the type so we can check for it
for part in sample:
    if isinstance(part, Tree):
        print(part, "---> a subtree")
    else:
        print(part)

Since all tree components are either subtrees or tuples, we could have
 instead checked which parts have type `tuple`. 
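
Here is a sketch of that alternative check, used to collect the chunks of a sentence as `(label, text)` pairs. The tree is hand-built toy data and the function name `named_entities` is hypothetical.

```python
from nltk import Tree

def named_entities(chunked_sentence):
    """Return (label, text) for every chunk directly under the top node."""
    entities = []
    for part in chunked_sentence:
        if isinstance(part, tuple):
            continue                      # an unchunked (word, pos-tag) pair
        words = [word for word, tag in part]
        entities.append((part.label(), " ".join(words)))
    return entities

# Toy chunked sentence (made-up words, tags, and chunk labels)
toy = Tree("S", [
    Tree("PER", [("Alice", "N"), ("Smith", "N")]),
    ("visited", "V"),
    Tree("LOC", [("Utrecht", "N")]),
])

print(named_entities(toy))
```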

### 2.2 Converting between formats

The classifiers that actually find and classify the chunks work with
the IOB format, which allows chunking to be approached as a word
classification problem. But Tree-structured chunks are easier to work
with in many ways, so the NLTK uses them as its default
representation. E.g., `chunked_sents()` will give you Trees, not IOB
tags. The NLTK provides functions for converting between the two
formats. (Use `help()` for details, or consult the book and
documentation.)

In [None]:
from nltk.chunk.util import tree2conlltags, conlltags2tree

# Convert sample tree to IOB list
iobsample = tree2conlltags(sample)
for wordtuple in iobsample:
    print(wordtuple)
    
print()
# Convert IOB list to tree again
newtree = conlltags2tree(iobsample)
print(newtree)

print("\nSame as we started with?", newtree == sample)

## 3. Evaluating the NLTK's own Named Entity chunker

**In brief:** We'll process the CONLL2002 corpus with the NLTK's 
built-in named entity recognizer, and measure its performance using 
the NLTK's own tools.

The NLTK comes with a bundled Named Entity Recognizer, available as
`nltk.ne_chunk()` or `nltk.ne_chunk_sents()` (for one sentence or a 
list of
sentences, respectively). Like the NLTK's POS tagger, its NE 
recognizer was trained for English.
Can we use it on the Dutch data?

There are several reasons this might not work: The kind of texts it 
was
trained on might be too different from Dutch news text; named entities
might look different in the two languages (e.g., Dutch names do often
have different structure from common American names); and the POS tags
in our text might not match those that the NER expects. (The
recognizer needs text that has already been POS-tagged.)

The shortcut `nltk.ne_chunk()` will let us process text easily, but to
evaluate its performance it is better to load the pickled recognizer
model that `ne_chunk()` uses.

In [None]:
import nltk
default_NER = nltk.data.load('chunkers/maxent_ne_chunker/english_ace_multiclass.pickle')

Chunking is a form of shallow parsing, so the NE recognizer's chunking methods are called 
`parse()` (for one sentence) and `parse_sents()` (for a list
of sentences). Like other NLTK tools, it also has an `evaluate()`
method that can be used to generate a performance report. 

The NLTK's evaluation methods are used with a dataset that already 
includes
the correct answer. They remove the answer, process the data, and
compare the result against the correct answers. The NER chunker
evaluates "IOB accuracy" by checking all IOB tags (i.e., all words,
whether or not they are part of a chunk); the other measures count how
many NE chunks were found and identified correctly.

**Your turn:** Evaluate `default_NER` with the first 5000 sentences of
 the CONLL file `"ned.train"`. You should get "IOB Accuracy" around 
90%, but precision and recall close to zero. 

In [None]:
testdata = ...
print(default_NER.evaluate(testdata))

### 3.1 Error analysis

The IOB accuracy is high because most words are not in a named entity chunk, and they are correctly tagged as "O" (outside a chunk). But our recognizer found a negligible proportion of all named entities (0%
 recall). Of the chunks it marked as named entities, a small 
proportion (7.7%) were indeed named entities; the rest were identified
 incorrectly. But how many chunks are we talking about, and what do 
they look like? We can find out by saving the value returned by 
`evaluate()`, which is actually an object, and using its methods to 
get more information.
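
The gap between tag accuracy and chunk-level scores can be reproduced on a toy example in plain Python. This is a simplified stand-in for the kind of comparison an evaluation performs, not the NLTK's actual implementation; the tags, the function name `chunk_spans`, and the behaviour of the imagined chunker are all made up.

```python
def chunk_spans(tags):
    """Return the set of (start, end, label) chunks encoded by a list of IOB tags."""
    spans = set()
    start = label = None
    for i, tag in enumerate(tags + ["O"]):     # the extra "O" closes a final chunk
        if start is not None and tag != "I-" + label:
            spans.add((start, i, label))       # the open chunk ends before token i
            start = label = None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]          # a new chunk begins at token i
    return spans

# Toy sentence of 20 tokens with one two-word named entity
gold = ["O"] * 20
gold[3:5] = ["B-PER", "I-PER"]

# A hypothetical chunker that misses the entity and proposes a wrong one
guessed = ["O"] * 20
guessed[10] = "B-LOC"

tag_accuracy = sum(g == p for g, p in zip(gold, guessed)) / len(gold)
gold_chunks, guessed_chunks = chunk_spans(gold), chunk_spans(guessed)
true_pos = len(gold_chunks & guessed_chunks)

print("IOB accuracy: %.2f" % tag_accuracy)
print("Precision:    %.2f" % (true_pos / len(guessed_chunks)))
print("Recall:       %.2f" % (true_pos / len(gold_chunks)))
```

Most tokens are "O" in both lists, so tag accuracy stays high even though not a single chunk was retrieved correctly.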

The methods `correct()`, `guessed()`, `missed()`, and `incorrect()`
 give us not just counts, but a list of the actual chunks that fall in
 each category. There seems to be no direct way to get the chunks that
 were correctly retrieved, but the information is not hard to access.

In [None]:
metrics = default_NER.evaluate(conll.chunked_sents("ned.testa"))
print(metrics)
print()

correct = metrics.correct()
print("Actual NEs in the test data:", len(correct))
print()

guessed = metrics.guessed()
print("Chunks guessed (proposed) by the recognizer:", len(guessed), guessed[:5], "...")
print()

# There seems to be no way to get this via the API
found = [ v[1] for v in metrics._tp ]
print("Found correctly (truepos):", len(found), found)
print()

incorrect = metrics.incorrect()
print("Incorrect guesses (falsepos):", len(incorrect), incorrect[:5], "...")
print()

print("Missed NEs (falseneg):", len(metrics.missed()))

When we are developing our own classifier, access to the classifier's 
mistakes is essential for _error analysis,_ which may allow us to 
identify weak spots and ways to remedy them.

**Your turn:**

1. Print out nicely (one per line) the first 50 incorrect guesses. 

2. Form a set of all different IOB labels appearing in the CONLL corpus (file `ned.train`). Examine them and compare them with the labels generated by the default NE chunker. (The meaning of the non-obvious labels is documented in the NLTK book.) 

In [None]:
# YOUR CODE:

