This tutorial is a slight modification of the tutorial by Sam Galen.

In [1]:
from __future__ import print_function
import nltk
from sklearn.metrics import confusion_matrix
import scipy
import sklearn
import pycrfsuite
from features import *
from evaluate import *
print(sklearn.__version__)

0.19.0


# Use the CoNLL 2002 data to build a Named Entity Recognition (NER) system

The CoNLL 2002 corpus is available in NLTK. We use the Spanish data sets (the files whose IDs start with `esp.`).


In [2]:
nltk.corpus.conll2002.fileids()

['esp.testa', 'esp.testb', 'esp.train', 'ned.testa', 'ned.testb', 'ned.train']

In [3]:
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))

Data format: let's see what our data looks like. Each token is a (word, POS tag, IOB label) triple:

In [4]:
train_sents[0]

[('Melbourne', 'NP', 'B-LOC'),
 ('(', 'Fpa', 'O'),
 ('Australia', 'NP', 'B-LOC'),
 (')', 'Fpt', 'O'),
 (',', 'Fc', 'O'),
 ('25', 'Z', 'O'),
 ('may', 'NC', 'O'),
 ('(', 'Fpa', 'O'),
 ('EFE', 'NC', 'B-ORG'),
 (')', 'Fpt', 'O'),
 ('.', 'Fp', 'O')]
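To make the labelling scheme concrete, here is a minimal sketch that collects the entity spans from one sentence. It assumes the usual IOB2 convention (a `B-` tag opens an entity, an `I-` tag continues it); `iob_to_spans` is a hypothetical helper written for this illustration, not part of NLTK.

```python
def iob_to_spans(sent):
    """Collect (entity_type, text) pairs from a list of (token, pos, iob) triples."""
    spans = []
    current_type, current_tokens = None, []
    for token, _pos, label in sent:
        starts_new = label.startswith('B-') or (
            label.startswith('I-') and label[2:] != current_type)
        if starts_new:
            # A new entity begins: flush the one collected so far, if any.
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith('I-'):
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # 'O' closes any open entity.
            if current_tokens:
                spans.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, ' '.join(current_tokens)))
    return spans


# The first training sentence, copied from the output above.
sent = [('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O'), ('Australia', 'NP', 'B-LOC'),
        (')', 'Fpt', 'O'), (',', 'Fc', 'O'), ('25', 'Z', 'O'), ('may', 'NC', 'O'),
        ('(', 'Fpa', 'O'), ('EFE', 'NC', 'B-ORG'), (')', 'Fpt', 'O'), ('.', 'Fp', 'O')]
print(iob_to_spans(sent))
```

For this sentence the helper yields two locations and one organization.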

## Features

Define some features to characterize the named entities. A Python feature-extraction example is provided in [features](./features.py). In that example, the features are word identity, word suffix, word shape, and word POS tag.

This makes a simple baseline. You can add or remove features to get better (possibly much better) results. Experiment with it, as you will need to do this for the assignment.

Let's see what word2features extracts for the first token of the first training sentence:

In [5]:
sent2features(train_sents[0])[0]

{'+1:postag': 'Fpa',
 '+1:postag[:2]': 'Fp',
 '+1:word.lower()': '(',
 'BOS': True,
 'bias': 1.0,
 'postag': 'NP',
 'postag[:2]': 'NP',
 'word.lower()': 'melbourne',
 'word[-2:]': 'ne',
 'word[-3:]': 'rne'}
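The contents of features.py are not shown here, so the following is only a sketch of what such an extractor might look like. It reproduces the keys in the output above for the first token; the `-1:` (previous token) keys and the `EOS` flag are assumptions added by symmetry, and the name `word2features_sketch` is hypothetical.

```python
def word2features_sketch(sent, i):
    """Hypothetical re-creation of the baseline features for token i of a sentence."""
    word, postag = sent[i][0], sent[i][1]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),   # word identity
        'word[-3:]': word[-3:],         # 3-character suffix
        'word[-2:]': word[-2:],         # 2-character suffix
        'postag': postag,
        'postag[:2]': postag[:2],       # coarse POS tag
        # To experiment, add your own features here, for example:
        # 'word.istitle()': word.istitle(),
    }
    if i > 0:
        prev_word, prev_postag = sent[i - 1][0], sent[i - 1][1]
        features.update({
            '-1:word.lower()': prev_word.lower(),
            '-1:postag': prev_postag,
            '-1:postag[:2]': prev_postag[:2],
        })
    else:
        features['BOS'] = True  # beginning of sentence
    if i < len(sent) - 1:
        next_word, next_postag = sent[i + 1][0], sent[i + 1][1]
        features.update({
            '+1:word.lower()': next_word.lower(),
            '+1:postag': next_postag,
            '+1:postag[:2]': next_postag[:2],
        })
    else:
        features['EOS'] = True  # end of sentence (assumed key name)
    return features


sent = [('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O'), ('Australia', 'NP', 'B-LOC')]
print(word2features_sketch(sent, 0))
```

Context features such as `+1:postag` let the model use the neighbouring tokens when deciding the label of the current one.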

Extract the features from the data:

In [6]:
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
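The helpers sent2labels and sent2tokens also come from features.py. A plausible sketch, assuming they simply project one field out of each (token, POS tag, label) triple; the `_sketch` names are hypothetical:

```python
def sent2labels_sketch(sent):
    """Return the IOB label of every token in the sentence."""
    return [label for _token, _postag, label in sent]


def sent2tokens_sketch(sent):
    """Return the surface form of every token in the sentence."""
    return [token for token, _postag, _label in sent]


sent = [('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O'), ('Australia', 'NP', 'B-LOC')]
print(sent2tokens_sketch(sent))
print(sent2labels_sketch(sent))
```

Whatever the actual implementation, each feature sequence in `X_train` must have the same length as the matching label sequence in `y_train`, one entry per token.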

## Train a NER model

To train a NER model: (1) create a pycrfsuite.Trainer object, (2) load the training data, (3) set the CRF training parameters, (4) call the trainer's train method to start the training process.

In [7]:
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

Set the training parameters. We will use the L-BFGS training algorithm (the default) with Elastic Net (L1 + L2) regularization.

In [8]:
trainer.set_params({
    'c1': 10.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})
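As a rough guide to what `c1` and `c2` control, the objective can be written as follows. This assumes CRFsuite's usual formulation for L-BFGS training; the exact scaling of the two terms is an assumption, not something stated in this tutorial. Here $w$ is the weight vector and $(x^{(n)}, y^{(n)})$ are the $N$ training sequences:

$$
\min_{w} \; -\sum_{n=1}^{N} \log p\left(y^{(n)} \mid x^{(n)}; w\right) + c_1 \lVert w \rVert_1 + c_2 \lVert w \rVert_2^2
$$

A larger `c1` pushes more weights to exactly zero, giving a sparser model, while `c2` shrinks all weights smoothly toward zero.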

Possible parameters for the default training algorithm:

In [10]:
trainer.params()

['feature.minfreq',
 'feature.possible_states',
 'feature.possible_transitions',
 'c1',
 'c2',
 'max_iterations',
 'num_memories',
 'epsilon',
 'period',
 'delta',
 'linesearch',
 'max_linesearch']

Train the model and name the resulting model file (here ner-esp.model; any file name works, e.g. conll2002-esp.crfsuite):

In [11]:
%%time
trainer.train('ner-esp.model')

CPU times: user 15.8 s, sys: 99.9 ms, total: 15.9 s
Wall time: 16 s


trainer.train saves the model to a file:

In [12]:
!ls -lh ./ner-esp.model

-rw-r--r--  1 zhaolongfei  staff   116K Oct 23 12:48 ./ner-esp.model


We can get information about every training step from trainer.logparser.iterations.

Here we extract the information about the last step:

In [13]:
print(len(trainer.logparser.iterations), trainer.logparser.iterations[-1])

50 {'num': 50, 'scores': {}, 'loss': 48787.828124, 'feature_norm': 35.995718, 'error_norm': 1800.820649, 'active_features': 2306, 'linesearch_trials': 1, 'linesearch_step': 1.0, 'time': 0.268}
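Each entry in the iteration log is a plain dictionary, so the training curve can be pulled out with a list comprehension. A minimal sketch, assuming every record has the `num` and `loss` keys seen in the output above; `loss_curve` is a hypothetical helper name.

```python
def loss_curve(iterations):
    """Return [(iteration number, loss), ...] from logparser-style records."""
    return [(record['num'], record['loss']) for record in iterations]


# The last record, copied from the output above. In the notebook you would
# call loss_curve(trainer.logparser.iterations) to get all 50 points.
last = {'num': 50, 'scores': {}, 'loss': 48787.828124, 'feature_norm': 35.995718,
        'error_norm': 1800.820649, 'active_features': 2306,
        'linesearch_trials': 1, 'linesearch_step': 1.0, 'time': 0.268}
print(loss_curve([last]))
```

Note that the log contains exactly 50 iterations, the value set for `max_iterations`, so training probably stopped because it hit that limit rather than because the optimizer converged.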


## Make predictions

To use your NER model, create a pycrfsuite.Tagger, open the model file, and call the tag method:

In [13]:
tagger = pycrfsuite.Tagger()
tagger.open('ner-esp.model')

<contextlib.closing at 0x7f705beb9790>

Let's tag a sentence to see how it works: (1) print the first sentence of the test set, (2) use your tagger to make predictions for that sentence, (3) print the predicted labels, (4) print the correct labels.

In [14]:
example_sent = test_sents[0]
print(' '.join(sent2tokens(example_sent)), end='\n\n')

print("Predicted:", ' '.join(tagger.tag(sent2features(example_sent))))
print("Correct:  ", ' '.join(sent2labels(example_sent)))

La Coruña , 23 may ( EFECOM ) .

Predicted: B-LOC I-LOC O O O O B-ORG O O
Correct:   B-LOC I-LOC O O O O B-ORG O O
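When the two label rows get long, it is easier to read them token by token. A small sketch that lines up the tokens, predictions, and gold labels from the output above and flags mismatches; `align` is a hypothetical helper.

```python
def align(tokens, predicted, correct):
    """Return one (token, predicted, gold, is_match) row per token."""
    return [(tok, pred, gold, pred == gold)
            for tok, pred, gold in zip(tokens, predicted, correct)]


# Copied from the output above.
tokens = 'La Coruña , 23 may ( EFECOM ) .'.split()
predicted = 'B-LOC I-LOC O O O O B-ORG O O'.split()
correct = 'B-LOC I-LOC O O O O B-ORG O O'.split()

rows = align(tokens, predicted, correct)
for tok, pred, gold, is_match in rows:
    print('%-8s %-6s %-6s %s' % (tok, pred, gold, '' if is_match else '<-- mismatch'))
```

In the notebook, the three lists would come from `sent2tokens(example_sent)`, `tagger.tag(sent2features(example_sent))`, and `sent2labels(example_sent)`.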


## Evaluate the NER model

The evaluation code is in the Python file named [evaluate](evaluate.py). Please don't change it.


First, get the predicted entity labels for all sentences in the test set ('testb' Spanish data set):

In [15]:
%%time
y_pred = [tagger.tag(xseq) for xseq in X_test]

CPU times: user 376 ms, sys: 4 ms, total: 380 ms
Wall time: 377 ms


Second, see how well you are doing by running the custom evaluation function:

In [16]:
print(bio_classification_report(y_test, y_pred))

             precision    recall  f1-score   support

      B-LOC       0.68      0.47      0.55      1084
      I-LOC       0.52      0.25      0.34       325
     B-MISC       0.54      0.11      0.19       339
     I-MISC       0.54      0.22      0.32       557
      B-ORG       0.76      0.51      0.61      1400
      I-ORG       0.67      0.44      0.53      1104
      B-PER       0.73      0.68      0.71       735
      I-PER       0.78      0.82      0.80       634

avg / total       0.68      0.48      0.55      6178
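The support column counts individual labels such as I-LOC, which suggests the report is computed per token rather than per entity. Since evaluate.py is not shown here, the following is only a pure-Python sketch of that kind of token-level scoring, not the actual implementation; `token_level_scores` is a hypothetical name.

```python
from collections import Counter
from itertools import chain


def token_level_scores(y_true, y_pred):
    """Per-label (precision, recall, f1, support), skipping the 'O' label."""
    gold = list(chain.from_iterable(y_true))
    pred = list(chain.from_iterable(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        if label == 'O':
            continue
        predicted_count = tp[label] + fp[label]
        support = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / support if support else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1, support)
    return scores


# Tiny made-up example, for illustration only.
y_true = [['B-LOC', 'I-LOC', 'O'], ['B-PER', 'O']]
y_pred = [['B-LOC', 'O', 'O'], ['B-PER', 'B-PER']]
for label, values in token_level_scores(y_true, y_pred).items():
    print(label, values)
```

Token-level scores give partial credit for partly recognized entities, so they are usually more lenient than entity-level scores that require the whole span to match.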



## Check what the classifier has learned

In [17]:
from collections import Counter
info = tagger.info()

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(info.transitions).most_common(15))

print("\nTop unlikely transitions:")
print_transitions(Counter(info.transitions).most_common()[-15:])

Top likely transitions:
B-PER  -> I-PER   5.034204
B-MISC -> I-MISC  4.509289
B-ORG  -> I-ORG   4.445884
I-MISC -> I-MISC  4.169687
I-ORG  -> I-ORG   4.148371
I-PER  -> I-PER   3.584838
B-LOC  -> I-LOC   3.583397
I-LOC  -> I-LOC   3.565042
O      -> O       2.874174
O      -> B-ORG   1.657239
O      -> B-MISC  1.099429
O      -> B-LOC   1.004697
O      -> B-PER   0.827228
B-ORG  -> O       0.678863
B-PER  -> O       0.015656

Top unlikely transitions:
I-PER  -> B-PER   -1.178009
B-LOC  -> I-MISC  -1.179662
I-ORG  -> B-LOC   -1.199538
I-ORG  -> I-PER   -1.305765
I-PER  -> I-LOC   -1.314171
B-ORG  -> I-MISC  -1.416927
I-MISC -> I-ORG   -1.484186
I-ORG  -> I-MISC  -1.530820
B-PER  -> B-PER   -1.690435
I-MISC -> I-LOC   -1.801041
I-ORG  -> I-LOC   -1.970001
O      -> I-PER   -4.536760
O      -> I-MISC  -4.856690
O      -> I-ORG   -5.194917
O      -> I-LOC   -5.523952


We can see that, for example, it is very likely that the beginning of an organization name (B-ORG) will be followed by a token inside an organization name (I-ORG), while transitions to I-ORG from tokens with other labels are penalized (for example, O -> I-ORG has weight -5.19). Transitions from O directly to any I- label receive the most negative weights in the list, which is consistent with a tagging scheme where an entity starts with a B- tag.
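The sign of these weights can be compared with what the tagging scheme allows. A minimal sketch, assuming the IOB2 convention (an `I-X` label may only follow `B-X` or `I-X`); `is_valid_iob_transition` is a hypothetical helper, and the weights are copied from the listing above.

```python
def is_valid_iob_transition(prev_label, label):
    """True if `label` may follow `prev_label` under the IOB2 scheme."""
    if not label.startswith('I-'):
        return True  # 'O' and any 'B-X' may follow any label
    entity_type = label[2:]
    # 'I-X' must continue an entity of the same type X.
    return prev_label in ('B-' + entity_type, 'I-' + entity_type)


# A few learned transition weights from the listing above.
weights = {
    ('B-PER', 'I-PER'): 5.034204,
    ('O', 'O'): 2.874174,
    ('I-ORG', 'I-LOC'): -1.970001,
    ('O', 'I-LOC'): -5.523952,
}
for (label_from, label_to), weight in weights.items():
    status = 'valid' if is_valid_iob_transition(label_from, label_to) else 'invalid'
    print('%-6s -> %-6s %9.6f  %s' % (label_from, label_to, weight, status))
```

In this sample, the transitions that the scheme forbids are exactly the ones with negative weights, so the model has picked up the constraint from the data without being told about it.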

Check the state features:

In [18]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-6s %s" % (weight, label, attr))    

print("Top positive:")
print_state_features(Counter(info.state_features).most_common(20))

print("\nTop negative:")
print_state_features(Counter(info.state_features).most_common()[-20:])

Top positive:
4.569708 O      postag[:2]:Fp
4.139291 B-MISC word.lower():juegos
3.839936 B-ORG  word.lower():efe
3.248121 B-ORG  word.lower():psoe-progresistas
3.133776 B-ORG  word.lower():telefónica
3.120209 O      -1:word.lower():efe
2.994530 B-LOC  -1:word.lower():nuboso
2.938371 B-LOC  word.lower():líbano
2.810946 B-PER  -1:word.lower():dijo
2.796526 B-ORG  word.lower():amena
2.778038 B-ORG  word[-2:]:OE
2.734662 I-LOC  -1:word.lower():calle
2.669057 B-ORG  word.lower():ejército
2.633509 B-PER  -1:word.lower():según
2.629499 B-PER  word.lower():reyes
2.597705 B-MISC word[-3:]:Ley
2.583170 B-LOC  -1:word.lower():despejado
2.571595 I-ORG  -1:word.lower():asociación
2.531081 I-LOC  -1:word.lower():san
2.509848 B-LOC  word.lower():cáceres

Top negative:
-1.421753 B-PER  word[-2:]:os
-1.432009 B-ORG  word[-2:]:or
-1.506883 B-ORG  word[-2:]:de
-1.527245 O      word[-3:]:Los
-1.626755 B-LOC  word[-3:]:la
-1.681866 O      word[-3:]:opa
-1.683306 O      word.lower():estados
-1.684262 O     

Some observations:

* **3.248121 B-ORG  word.lower():psoe-progresistas** - the model memorized the names of some entities; maybe it is overfitting, maybe our features are not adequate, or maybe memorizing is indeed helpful;
* **2.734662 I-LOC  -1:word.lower():calle** - "calle" means "street" in Spanish; the model learns that if the previous word was "calle", the token is likely part of a location;
* **-2.112873 O      postag=NP** - a negative weight for the O label means that proper nouns (NP is the proper-noun tag in the Spanish tagset) are rarely outside an entity.
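To look at one label at a time instead of the global top 20, the state features can be filtered by label. A sketch, assuming the mapping has the same shape as `info.state_features` in the code above, that is, keys of the form (attribute, label) with the weight as value; `top_features_for_label` is a hypothetical helper, and the weights are copied from the listing above.

```python
def top_features_for_label(state_features, label, n=3):
    """Return the n highest-weighted (attribute, weight) pairs for one label."""
    items = [(attr, weight)
             for (attr, feature_label), weight in state_features.items()
             if feature_label == label]
    return sorted(items, key=lambda pair: pair[1], reverse=True)[:n]


# A handful of entries from the listing above.
state_features = {
    ('postag[:2]:Fp', 'O'): 4.569708,
    ('word.lower():efe', 'B-ORG'): 3.839936,
    ('word.lower():psoe-progresistas', 'B-ORG'): 3.248121,
    ('word.lower():telefónica', 'B-ORG'): 3.133776,
    ('-1:word.lower():calle', 'I-LOC'): 2.734662,
    ('word[-2:]:or', 'B-ORG'): -1.432009,
}
for attr, weight in top_features_for_label(state_features, 'B-ORG', n=2):
    print('%0.6f %s' % (weight, attr))
```

In the notebook, pass `info.state_features` as the first argument to inspect the full model.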