# Information Extraction Practice - NER

In this practical lab we will focus on one of the main Information Extraction methodologies: Named Entity Recognition (NER). As seen in class, NER is focus to detect and classify the names in the text.

In the theoretical session we also presented three main methodologies to address the recognition of named entities:
 - Hidden Markov Models
 - MaxEnt Markov Models
 - Conditional Random Fields
 
This latter offered the best of both approaches, therefore, this is the one that we are going to use for implementing our NER system. There are some libraries to create CRF models, but I decide to go for `sklearn_crfsuite` (https://sklearn-crfsuite.readthedocs.io/en/latest/), because it is well documented and provides an interface that can be used with sklearn.



In [1]:
# Required Imports
import nltk
import sklearn_crfsuite
import eli5
from sklearn import metrics

# Experimental Setup

We are using the experimental scenario provided by the CoNLLL 2002 shared task - which is about NER in Spanish and Dutch. Take a look to the task webpage for more details:
https://www.clips.uantwerpen.be/conll2002/ner/

CoNLL 2002 datasets contains a list of Spanish sentences, with Named Entities annotated. The dataset is included in the NLTK distribution.


## Dataset Loading

The dataset is conveniently provided by the NLTK library. We load the training and test sets for Spanish and take a look to the dataset.

In [2]:
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))

In [3]:
train_sents[0]

[(u'Melbourne', u'NP', u'B-LOC'),
 (u'(', u'Fpa', u'O'),
 (u'Australia', u'NP', u'B-LOC'),
 (u')', u'Fpt', u'O'),
 (u',', u'Fc', u'O'),
 (u'25', u'Z', u'O'),
 (u'may', u'NC', u'O'),
 (u'(', u'Fpa', u'O'),
 (u'EFE', u'NC', u'B-ORG'),
 (u')', u'Fpt', u'O'),
 (u'.', u'Fp', u'O')]

The dataset is provided as a list of list, where each sentence is a list of tokens, its POS tagging and its entity annotation (the entity type: `B-LOC` if the token is an entity or `O` if it is not). In particular, we have thes annotations:

    TYPE	DESCRIPTION
    PER     Named person or family.
    LOC 	Name of politically or geographically defined location (cities, provinces, countries,...).
    ORG     Named corporate, governmental, or other organizational entity.
    MISC	Miscellaneous entities, e.g. events, nationalities, products or works of art.
  
They can be of type:


    TAG         DESCRIPTION
    B (EGIN)    The first token of a multi-token entity.
    I (N)	   An inner token of a multi-token entity.

# Feature Extraction

Next, define some features.

In this example we use word identity, word suffix, word shape and word POS tag (POS tags can be seen as pre-extracted features); also, some information from nearby words is used and convert them to sklear-crfsuite format - each sentence should be converted to a list of dicts.

We define a function to given a sentence create all these features

In [4]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

Let see how features extracted from a single token look like

In [5]:
%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

Wall time: 1.06 s


In [6]:
X_train[0][1]

{'+1:postag': u'NP',
 '+1:postag[:2]': u'NP',
 '+1:word.istitle()': True,
 '+1:word.isupper()': False,
 '+1:word.lower()': u'australia',
 '-1:postag': u'NP',
 '-1:postag[:2]': u'NP',
 '-1:word.istitle()': True,
 '-1:word.isupper()': False,
 '-1:word.lower()': u'melbourne',
 'bias': 1.0,
 'postag': u'Fpa',
 'postag[:2]': u'Fp',
 'word.isdigit()': False,
 'word.istitle()': False,
 'word.isupper()': False,
 'word.lower()': u'(',
 'word[-3:]': u'('}

# Train a CRF model

As seen in class, CRFs are the best methodology to detect named entities taking the best of both worlds (HMMs and MaxEnt models).

Once we have features in a right format we can train a linear-chain CRF (Conditional Random Fields) model using sklearn_crfsuite.CRF.

In [7]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=20,
    all_possible_transitions=False,
)
crf.fit(X_train, y_train)

CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=False, averaging=None, c=None, c1=0.1, c2=0.1,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=20,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=False)

# Inspecting trained model

CRFsuite CRF models use two kinds of features: state features and transition features. Let’s check their weights using eli5.explain_weights

In [8]:
eli5.show_weights(crf, top=30)

From \ To,O,B-LOC,I-LOC,B-MISC,I-MISC,B-ORG,I-ORG,B-PER,I-PER
O,3.281,2.204,0.0,2.101,0.0,3.468,0.0,2.325,0.0
B-LOC,-0.259,-0.098,4.058,0.0,0.0,0.0,0.0,-0.212,0.0
I-LOC,-0.173,-0.609,3.436,0.0,0.0,0.0,0.0,0.0,0.0
B-MISC,-0.673,-0.341,0.0,0.0,4.069,-0.308,0.0,-0.331,0.0
I-MISC,-0.803,-0.998,0.0,-0.519,4.977,-0.817,0.0,-0.611,0.0
B-ORG,-0.096,-0.242,0.0,-0.57,0.0,-1.012,4.739,-0.306,0.0
I-ORG,-0.339,-1.758,0.0,-0.841,0.0,-1.382,5.062,-0.472,0.0
B-PER,-0.4,-0.851,0.0,0.0,0.0,-1.013,0.0,-0.937,4.329
I-PER,-0.676,-0.47,0.0,0.0,0.0,0.0,0.0,-0.659,3.754

Weight?,Feature,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,Unnamed: 8_level_0
Weight?,Feature,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Weight?,Feature,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Weight?,Feature,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3
Weight?,Feature,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4
Weight?,Feature,Unnamed: 2_level_5,Unnamed: 3_level_5,Unnamed: 4_level_5,Unnamed: 5_level_5,Unnamed: 6_level_5,Unnamed: 7_level_5,Unnamed: 8_level_5
Weight?,Feature,Unnamed: 2_level_6,Unnamed: 3_level_6,Unnamed: 4_level_6,Unnamed: 5_level_6,Unnamed: 6_level_6,Unnamed: 7_level_6,Unnamed: 8_level_6
Weight?,Feature,Unnamed: 2_level_7,Unnamed: 3_level_7,Unnamed: 4_level_7,Unnamed: 5_level_7,Unnamed: 6_level_7,Unnamed: 7_level_7,Unnamed: 8_level_7
Weight?,Feature,Unnamed: 2_level_8,Unnamed: 3_level_8,Unnamed: 4_level_8,Unnamed: 5_level_8,Unnamed: 6_level_8,Unnamed: 7_level_8,Unnamed: 8_level_8
+4.416,postag[:2]:Fp,,,,,,,
+3.116,BOS,,,,,,,
+2.401,bias,,,,,,,
+2.297,"word.lower():,",,,,,,,
+2.297,postag[:2]:Fc,,,,,,,
+2.297,"word[-3:]:,",,,,,,,
+2.297,postag:Fc,,,,,,,
+2.124,postag:CC,,,,,,,
+2.124,postag[:2]:CC,,,,,,,
+1.984,EOS,,,,,,,

Weight?,Feature
+4.416,postag[:2]:Fp
+3.116,BOS
+2.401,bias
+2.297,"word.lower():,"
+2.297,postag[:2]:Fc
+2.297,"word[-3:]:,"
+2.297,postag:Fc
+2.124,postag:CC
+2.124,postag[:2]:CC
+1.984,EOS

Weight?,Feature
+2.530,word.istitle()
+2.224,-1:word.lower():en
+0.906,word[-3:]:rid
+0.905,word.lower():madrid
+0.646,word.lower():españa
+0.640,word[-3:]:ona
+0.595,word[-3:]:aña
+0.595,+1:postag[:2]:Fp
+0.515,word.lower():parís
+0.514,word[-3:]:rís

Weight?,Feature
+0.886,-1:word.istitle()
+0.664,-1:word.lower():de
+0.582,word[-3:]:de
+0.578,word.lower():de
+0.529,-1:word.lower():san
+0.444,+1:word.istitle()
+0.441,word.istitle()
+0.335,-1:word.lower():la
+0.262,postag[:2]:SP
+0.262,postag:SP

Weight?,Feature
+1.770,word.isupper()
+0.693,word.istitle()
+0.606,"word[-3:]:"""
+0.606,"word.lower():"""
+0.606,postag[:2]:Fe
+0.606,postag:Fe
+0.538,+1:word.istitle()
+0.508,"-1:word.lower():"""
+0.508,-1:postag:Fe
+0.508,-1:postag[:2]:Fe

Weight?,Feature
+1.364,-1:word.istitle()
+0.675,-1:word.lower():de
+0.597,"+1:word.lower():"""
+0.597,+1:postag:Fe
+0.597,+1:postag[:2]:Fe
+0.369,-1:postag[:2]:NC
+0.369,-1:postag:NC
+0.324,-1:word.lower():liga
+0.318,word[-3:]:de
+0.304,word.lower():de

Weight?,Feature
+2.695,word.lower():efe
+2.519,word.isupper()
+2.084,word[-3:]:EFE
+1.174,word.lower():gobierno
+1.142,word.istitle()
+1.018,-1:word.lower():del
+0.958,word[-3:]:rno
+0.671,word.lower():pp
+0.671,word[-3:]:PP
+0.667,-1:word.lower():al

Weight?,Feature
+1.499,-1:word.istitle()
+1.200,-1:word.lower():de
+0.539,-1:word.lower():real
+0.511,word[-3:]:rid
+0.446,word[-3:]:de
+0.433,word.lower():de
+0.428,-1:postag:SP
+0.428,-1:postag[:2]:SP
+0.399,word.lower():madrid
+0.368,word[-3:]:la

Weight?,Feature
+1.698,word.istitle()
+0.683,-1:postag:VMI
+0.601,+1:postag[:2]:VM
+0.589,postag[:2]:NP
+0.589,postag:NP
+0.589,+1:postag:VMI
+0.565,-1:word.lower():a
+0.520,word[-3:]:osé
+0.503,word.lower():josé
+0.476,-1:postag[:2]:VM

Weight?,Feature
+2.742,-1:word.istitle()
+0.736,word.istitle()
+0.660,-1:word.lower():josé
+0.598,-1:postag[:2]:AQ
+0.598,-1:postag:AQ
+0.510,-1:postag[:2]:VM
+0.487,-1:word.lower():juan
+0.419,-1:word.lower():maría
+0.413,-1:postag:VMI
+0.345,-1:word.lower():luis


Does the transition features make sense?

It seems so. Model learned that I-ENITITY must follow B-ENTITY. It also learned some unlikely transitions: it is not common in this dataset to have a location right after an organization name (I-ORG -> B-LOC has a large negative weight).

Features don’t use gazetteers, so model had to remember some geographic names from the training data, e.g. that España is a location.


## Regularization

If we regularize CRF more, we can expect that only features which are generic will remain, and memorized tokens will go. With L1 regularization (c1 parameter) coefficients of most features should be driven to zero. Let’s check what effect does regularization have on CRF weights

In [None]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=200,
    c2=0.1,
    max_iterations=20,
    all_possible_transitions=False,
)
crf.fit(X_train, y_train)
eli5.show_weights(crf, top=30)

As you can see, memorized tokens are mostly gone and model now relies on word shapes and POS tags. There is only a few non-zero features remaining. In our example the change probably made the quality worse, but that’s a separate question.

Let’s focus on transition weights. We can expect that O -> I-ENTITY transitions to have large negative weights because they are impossible. But these transitions have zero weights, not negative weights, both in heavily regularized model and in our initial model. Something is going on here.

The reason they are zero is that crfsuite haven’t seen these transitions in training data, and assumed there is no need to learn weights for them, to save some computation time. This is the default behavior, but it is possible to turn it off using sklearn_crfsuite.CRF all_possible_transitions option. Let’s check how does it affect the result:

In [None]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=20,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train);
eli5.show_weights(crf, top=30)

# Predict on the Test Set

Using the trained crf model we then apply it to predict the named entities in the test set.

There is much more O entities in data set, but we're more interested in other entities. To account for this we'll use averaged F1 score computed for all labels except for O. sklearn-crfsuite.metrics package provides some useful metrics for sequence classification task, including this one.

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

predictions = crf.predict(X_test)
metrics.f1_score(MultiLabelBinarizer().fit_transform(y_test), MultiLabelBinarizer().fit_transform(predictions), 
                      average='weighted')

Inspect per-class results in more detail

In [None]:
labels = list(crf.classes_)
labels
sorted_labels = sorted(
    labels, 
    key=lambda name: (name[1:], name[0])
)

print(metrics.classification_report(
    MultiLabelBinarizer().fit_transform(y_test), MultiLabelBinarizer().fit_transform(predictions), target_names = sorted_labels, digits=3
))

So, we have created a NER system with an F-score equals to 83.5%. It sounds pretty good, but how can we actually have an idea of the performance of our system?

Let us check the results achived by other research teams in the CONLL task (taken from https://www.clips.uantwerpen.be/conll2002/ner/):

       +----------+-----------+-----------++-----------++
       | System   | precision |   recall  ||     F     ||
       +----------+-----------+-----------++-----------++
       | [CMP02]  |   81.38%  |   81.40%  ||   81.39   || ±1.5
       | [Flo02]  |   78.70%  |   79.40%  ||   79.05   || ±1.4
       | [CY02]   |   78.19%  |   76.14%  ||   77.15   || ±1.4
       | [WNC02]  |   75.85%  |   77.38%  ||   76.61   || ±1.4
       | [BHM02]  |   74.19%  |   77.44%  ||   75.78   || ±1.4
       | [Tjo02]  |   76.00%  |   75.55%  ||   75.78   || ±1.5
       | [PWM02]  |   74.32%  |   73.52%  ||   73.92   || ±1.5
       | [Jan02]  |   74.03%  |   73.76%  ||   73.89   || ±1.5
       | [Mal02]  |   73.93%  |   73.39%  ||   73.66   || ±1.6
       | [Tsu02]  |   69.04%  |   74.12%  ||   71.49   || ±1.4
       | [BV02]   |   60.53%  |   67.29%  ||   63.73   || ±1.8
       | [MM02]   |   56.28%  |   66.51%  ||   60.97   || ±1.7
       +----------+-----------+-----------++-----------++
       | baseline |   26.27%  |   56.48%  ||   35.86   || ±1.3
       +----------+-----------+-----------++-----------++

Wow! we have got the best results for the task!! (improving the F-measure by more than 2 points)

Well, hold your horses. These results were obtained back in 2002 when the task was proposed. The state of the art has been improved in these 15 years and nowadays the best systems are around 85% of F-score:

https://arxiv.org/pdf/1603.06270.pdf

Anyhow, we have create a very good model :)