# Named Entity Recognition

This notebook is based on the [sklearn_crfsuite tutorial](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html).

We start by loading the necessary packages:
* **sklearn_crfsuite** for the implementation of the Conditional Random Field (CRF)
* **pandas** for displaying dataframes
* **eli5** for inspecting the learned parameters

In [1]:
import sklearn_crfsuite
import eli5
import pandas as pd
from sklearn_crfsuite import metrics
from sklearn_crfsuite.metrics import flat_classification_report
from sklearn.metrics import classification_report

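# CoNLL-2003 Corpus

In this notebook we work with the CoNLL-2003 named entity recognition corpus (the file starts with the well-known sentence "EU rejects German call to boycott British lamb ."), which was obtained from https://github.com/Franck-Dernoncourt/NeuroNER/

First, we look at the raw training file to inspect the data format.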

In [2]:
def print_input_file(filename, lines=10):
    # Print the first `lines` lines of the file, stripped of surrounding whitespace.
    with open(filename, 'r') as file:
        for line_number, line in enumerate(file):
            if line_number >= lines:
                break
            print(line.strip())
                
print_input_file('NER_data/train.txt')

-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
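
Each non-empty line in the output above holds four space-separated columns: the token, its POS tag, its syntactic chunk tag, and its named entity tag in BIO notation. As a minimal sketch of that layout, the following parses a few lines copied from the output; `parse_line` is a hypothetical helper introduced only for illustration and is not part of the notebook.

```python
# Sample lines copied from the output above.
sample = """EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC"""

def parse_line(line):
    """Split one data line into its four columns (hypothetical helper)."""
    token, postag, chunk, ner = line.split(' ')
    return {'token': token, 'postag': postag, 'chunk': chunk, 'ner': ner}

rows = [parse_line(line) for line in sample.splitlines()]
for row in rows:
    print(row)
```

Only the token, the POS tag and the entity tag are used in the rest of the notebook; the chunk column is ignored.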


Next, we load the data into train and test corpora and inspect the raw sentences.

In [3]:
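def load_file(filename):
    """Read a CoNLL-style file into a list of sentences of (token, POS tag, NER label) tuples."""
    text = []
    with open(filename, 'r') as f:
        sentence = []
        for line in f:
            # Document markers and blank lines end the current sentence.
            if line.startswith('-DOCSTART-') or len(line.strip()) == 0:
                if len(sentence) > 0:
                    text.append(sentence)
                sentence = []
            else:
                # Columns: token, POS tag, chunk tag, NER label; the chunk tag is not used.
                l = line.strip().split(' ')
                sentence.append((l[0], l[1], l[3]))
        # Keep the last sentence if the file does not end with a blank line.
        if len(sentence) > 0:
            text.append(sentence)
    return text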

train = load_file('NER_data/train.txt')
test = load_file('NER_data/test.txt')

In [4]:
def print_text(corpus, amount=1):
    for sentence in corpus[:amount]:
        print([l[0] for l in sentence])

print_text(train,4)

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
['Peter', 'Blackburn']
['BRUSSELS', '1996-08-22']
['The', 'European', 'Commission', 'said', 'on', 'Thursday', 'it', 'disagreed', 'with', 'German', 'advice', 'to', 'consumers', 'to', 'shun', 'British', 'lamb', 'until', 'scientists', 'determine', 'whether', 'mad', 'cow', 'disease', 'can', 'be', 'transmitted', 'to', 'sheep', '.']


Next, we inspect the loaded data. Each token is stored as a (word, POS tag, NER label) tuple.

In [5]:
def print_sentence(corpus, idx=0):
    for token in corpus[idx]:
        print(token)
        
print_sentence(train)

('EU', 'NNP', 'B-ORG')
('rejects', 'VBZ', 'O')
('German', 'JJ', 'B-MISC')
('call', 'NN', 'O')
('to', 'TO', 'O')
('boycott', 'VB', 'O')
('British', 'JJ', 'B-MISC')
('lamb', 'NN', 'O')
('.', '.', 'O')


# Compute features

In order to allow the CRF to learn how to distinguish between different Named Entities, we have to compute features for the individual words.

For the time being, we select information about the word itself, such as
* the word
* the suffix of the word
* the shape of the word
* the POS tag

We also include the same kind of information about the preceding and the following word. In addition, we record whether the word is at the beginning or the end of the sentence.

In [6]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    
    features = {
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],        
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True
        
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True
                
    return features

For the first word, the features look as follows:

In [7]:
word2features(train[0], 0)

{'word.lower()': 'eu',
 'word[-3:]': 'EU',
 'word.isupper()': True,
 'word.istitle()': False,
 'word.isdigit()': False,
 'postag': 'NNP',
 'postag[:2]': 'NN',
 'BOS': True,
 '+1:word.lower()': 'rejects',
 '+1:word.istitle()': False,
 '+1:word.isupper()': False,
 '+1:postag': 'VBZ',
 '+1:postag[:2]': 'VB'}
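
The notebook encodes the shape of a word through the boolean `isupper()`, `istitle()` and `isdigit()` features. A more compact alternative, shown here only as a sketch, is a single word-shape string. The helper `word_shape` is hypothetical and is not used by `word2features`; it could be added as one more feature if desired.

```python
import re

def word_shape(word):
    """Map a token to a coarse shape string (hypothetical helper)."""
    shape = re.sub(r'[A-Z]', 'X', word)    # upper-case letters
    shape = re.sub(r'[a-z]', 'x', shape)   # lower-case letters
    shape = re.sub(r'[0-9]', 'd', shape)   # digits
    # Collapse runs of the same symbol so long and short words share a shape.
    return re.sub(r'(.)\1+', r'\1', shape)

for token in ['EU', 'German', '1996-08-22', 'lamb']:
    print(token, word_shape(token))
```

With this encoding, all capitalised words map to `Xx` and all dates of the form seen in the corpus map to `d-d-d`, which lets the model generalise over tokens it has never seen.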

For the first sentence, the features look like:

In [8]:
def sentence2features(corpus, sent_idx):
    sentence_features = []
    for i in range(len(corpus[sent_idx])):
        sentence_features.append(word2features(corpus[sent_idx], i))
    return sentence_features

sentence2features(train, 0)    

[{'word.lower()': 'eu',
  'word[-3:]': 'EU',
  'word.isupper()': True,
  'word.istitle()': False,
  'word.isdigit()': False,
  'postag': 'NNP',
  'postag[:2]': 'NN',
  'BOS': True,
  '+1:word.lower()': 'rejects',
  '+1:word.istitle()': False,
  '+1:word.isupper()': False,
  '+1:postag': 'VBZ',
  '+1:postag[:2]': 'VB'},
 {'word.lower()': 'rejects',
  'word[-3:]': 'cts',
  'word.isupper()': False,
  'word.istitle()': False,
  'word.isdigit()': False,
  'postag': 'VBZ',
  'postag[:2]': 'VB',
  '-1:word.lower()': 'eu',
  '-1:word.istitle()': False,
  '-1:word.isupper()': True,
  '-1:postag': 'NNP',
  '-1:postag[:2]': 'NN',
  '+1:word.lower()': 'german',
  '+1:word.istitle()': True,
  '+1:word.isupper()': False,
  '+1:postag': 'JJ',
  '+1:postag[:2]': 'JJ'},
 {'word.lower()': 'german',
  'word[-3:]': 'man',
  'word.isupper()': False,
  'word.istitle()': True,
  'word.isdigit()': False,
  'postag': 'JJ',
  'postag[:2]': 'JJ',
  '-1:word.lower()': 'rejects',
  '-1:word.istitle()': False,
  '-

## Computing features for the entire corpus

In [9]:
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

X_train = [sent2features(s) for s in train]
y_train = [sent2labels(s) for s in train]

X_test = [sent2features(s) for s in test]
y_test = [sent2labels(s) for s in test]

In [10]:
X_train[0]

[{'word.lower()': 'eu',
  'word[-3:]': 'EU',
  'word.isupper()': True,
  'word.istitle()': False,
  'word.isdigit()': False,
  'postag': 'NNP',
  'postag[:2]': 'NN',
  'BOS': True,
  '+1:word.lower()': 'rejects',
  '+1:word.istitle()': False,
  '+1:word.isupper()': False,
  '+1:postag': 'VBZ',
  '+1:postag[:2]': 'VB'},
 {'word.lower()': 'rejects',
  'word[-3:]': 'cts',
  'word.isupper()': False,
  'word.istitle()': False,
  'word.isdigit()': False,
  'postag': 'VBZ',
  'postag[:2]': 'VB',
  '-1:word.lower()': 'eu',
  '-1:word.istitle()': False,
  '-1:word.isupper()': True,
  '-1:postag': 'NNP',
  '-1:postag[:2]': 'NN',
  '+1:word.lower()': 'german',
  '+1:word.istitle()': True,
  '+1:word.isupper()': False,
  '+1:postag': 'JJ',
  '+1:postag[:2]': 'JJ'},
 {'word.lower()': 'german',
  'word[-3:]': 'man',
  'word.isupper()': False,
  'word.istitle()': True,
  'word.isdigit()': False,
  'postag': 'JJ',
  'postag[:2]': 'JJ',
  '-1:word.lower()': 'rejects',
  '-1:word.istitle()': False,
  '-

# Training of the CRF

We train the CRF on the computed features using the L-BFGS optimiser (`algorithm='lbfgs'`) with elastic net regularisation, i.e. a combination of an L1 penalty (`c1`) and an L2 penalty (`c2`).
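
As a sketch of what the regularised training objective looks like, with $w$ the feature weights and $(x^{(n)}, y^{(n)})$ the $N$ training sentences and their label sequences (this notation is introduced here for illustration; the exact scaling of the L2 term inside CRFsuite is an assumption):

$$
\min_{w} \; -\sum_{n=1}^{N} \log p\left(y^{(n)} \mid x^{(n)}; w\right) \; + \; c_1 \lVert w \rVert_1 \; + \; c_2 \lVert w \rVert_2^2
$$

The `c1` term pushes many weights to exactly zero, which yields a sparse model, while the `c2` term keeps the remaining weights small.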

In [11]:
%%time
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)

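# Presumably a workaround: with some scikit-learn versions, sklearn_crfsuite
# raises an AttributeError from fit() although the model has been trained.
try:
    crf.fit(X_train, y_train)
except AttributeError:
    pass
# Predict once here so that the timing covers both training and prediction.
predictions = crf.predict(X_test)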

CPU times: total: 19.1 s
Wall time: 22.8 s


CRFsuite CRF models use two kinds of features: state features and transition features. Their weights can be inspected with eli5.show_weights (the call below is commented out):

In [12]:
#eli5.show_weights(crf, top=10)
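
To illustrate how state and transition scores interact at prediction time, here is a minimal Viterbi decoder in plain Python. It is a sketch of the general technique, not the CRFsuite implementation, and all scores in the example are made up for illustration rather than taken from the trained model.

```python
def viterbi(state_scores, transition_scores, labels):
    """Return the highest-scoring label sequence.

    state_scores: one dict per token mapping label -> score.
    transition_scores: dict mapping (previous_label, label) -> score.
    Missing entries count as 0.0.
    """
    # best[i][label] = (score of the best path ending in label at token i, previous label)
    best = [{label: (state_scores[0].get(label, 0.0), None) for label in labels}]
    for scores in state_scores[1:]:
        column = {}
        for label in labels:
            prev = max(labels, key=lambda p: best[-1][p][0]
                       + transition_scores.get((p, label), 0.0))
            total = (best[-1][prev][0]
                     + transition_scores.get((prev, label), 0.0)
                     + scores.get(label, 0.0))
            column[label] = (total, prev)
        best.append(column)
    # Follow the back-pointers from the best final label.
    label = max(labels, key=lambda l: best[-1][l][0])
    path = [label]
    for column in reversed(best[1:]):
        label = column[label][1]
        path.append(label)
    return path[::-1]

# Made-up scores for a two-token input.
labels = ['O', 'B-LOC', 'I-LOC']
state_scores = [{'B-LOC': 2.0, 'O': 1.0},
                {'O': 1.5, 'I-LOC': 1.0, 'B-LOC': 0.5}]
transition_scores = {('B-LOC', 'I-LOC'): 2.0, ('O', 'I-LOC'): -5.0}
path = viterbi(state_scores, transition_scores, labels)
print(path)
```

Looking at the state scores alone, the second token would be labelled `O`. The positive `B-LOC -> I-LOC` transition score changes the decision, so the decoder returns `['B-LOC', 'I-LOC']`.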

# Predict and Evaluate

In [13]:
y_pred = crf.predict(X_test)

def show_sentence_results(corpus, prediction, idx):
    df = pd.DataFrame(corpus[idx])
    df['prediction'] = prediction[idx]
    return df

show_sentence_results(test, y_pred, 0)

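| | 0 | 1 | 2 | prediction |
|---|---|---|---|---|
| 0 | SOCCER | NN | O | O |
| 1 | - | : | O | O |
| 2 | JAPAN | NNP | B-LOC | B-LOC |
| 3 | GET | VB | O | O |
| 4 | LUCKY | NNP | O | O |
| 5 | WIN | NNP | O | O |
| 6 | , | , | O | O |
| 7 | CHINA | NNP | B-PER | B-LOC |
| 8 | IN | IN | O | O |
| 9 | SURPRISE | DT | O | O |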

In [14]:
y_pred_train = crf.predict(X_train)

show_sentence_results(train, y_pred_train, 0)

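| | 0 | 1 | 2 | prediction |
|---|---|---|---|---|
| 0 | EU | NNP | B-ORG | B-ORG |
| 1 | rejects | VBZ | O | O |
| 2 | German | JJ | B-MISC | B-MISC |
| 3 | call | NN | O | O |
| 4 | to | TO | O | O |
| 5 | boycott | VB | O | O |
| 6 | British | JJ | B-MISC | B-MISC |
| 7 | lamb | NN | O | O |
| 8 | . | . | O | O |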


In [15]:
from sklearn.preprocessing import MultiLabelBinarizer
y = [[2, 3, 4], [2], [0, 1, 3], [0, 1, 2, 3, 4], [0, 1, 2]]

MultiLabelBinarizer().fit_transform(y)

array([[0, 0, 1, 1, 1],
       [0, 0, 1, 0, 0],
       [1, 1, 0, 1, 0],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 0, 0]])

In [16]:
from sklearn.preprocessing import MultiLabelBinarizer

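# Note: each sentence is reduced to the set of labels it contains, so this
# report scores label presence per sentence, not individual tokens.
# One binarizer is fitted and reused so that both matrices share the same
# column order (columns follow the sorted label names in mlb.classes_).
mlb = MultiLabelBinarizer().fit(y_test)
y_test_binary = mlb.transform(y_test)
y_pred_binary = mlb.transform(y_pred)

print(classification_report(y_test_binary, y_pred_binary))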

              precision    recall  f1-score   support

           0       0.89      0.84      0.86      1266
           1       0.84      0.82      0.83       563
           2       0.80      0.80      0.80      1229
           3       0.84      0.90      0.87      1025
           4       0.80      0.69      0.74       220
           5       0.70      0.69      0.70       162
           6       0.76      0.77      0.76       515
           7       0.86      0.98      0.91       720
           8       1.00      0.99      0.99      3411

   micro avg       0.89      0.90      0.90      9111
   macro avg       0.83      0.83      0.83      9111
weighted avg       0.89      0.90      0.90      9111
 samples avg       0.91      0.90      0.90      9111
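
The report above is computed on binarized sentences, so it measures whether each label occurs somewhere in a sentence. A token-level evaluation can be sketched in plain Python as follows. The function `token_level_scores` is a hypothetical helper, and the toy labels are invented for illustration; they are not results from the corpus.

```python
from collections import Counter

def token_level_scores(y_true, y_pred):
    """Per-label precision, recall and F1 over flattened token labels."""
    true_flat = [label for sent in y_true for label in sent]
    pred_flat = [label for sent in y_pred for label in sent]
    n_true, n_pred = Counter(true_flat), Counter(pred_flat)
    # Count tokens whose predicted label matches the true label.
    tp = Counter(t for t, p in zip(true_flat, pred_flat) if t == p)
    scores = {}
    for label in sorted(set(true_flat) | set(pred_flat)):
        precision = tp[label] / n_pred[label] if n_pred[label] else 0.0
        recall = tp[label] / n_true[label] if n_true[label] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores

# Toy labels for illustration only, not taken from the corpus.
toy_true = [['B-ORG', 'O', 'B-MISC'], ['B-PER', 'I-PER']]
toy_pred = [['B-ORG', 'O', 'O'], ['B-PER', 'I-PER']]
toy_scores = token_level_scores(toy_true, toy_pred)
for label, (p, r, f) in toy_scores.items():
    print(f'{label:>7}  precision={p:.2f}  recall={r:.2f}  f1={f:.2f}')
```

Applied to `y_test` and `y_pred`, this would give one row per entity label, which is easier to interpret than the numeric class indices above.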



In [17]:
from sklearn.metrics import confusion_matrix

def flatten(y):
    return [word_label for sentence_labels in y for word_label in sentence_labels]

truth = flatten(y_test)
pred = flatten(y_pred)

cm = confusion_matrix(truth, pred)
cm

array([[ 1355,    27,   150,    47,     1,     2,     9,     3,    74],
       [   17,   537,    36,    25,     0,     2,     3,     2,    80],
       [  109,    39,  1211,   143,     0,     0,    18,     6,   135],
       [   53,     9,    55,  1385,     1,     3,     9,    15,    87],
       [    5,     0,     0,     0,   165,     2,    53,    18,    14],
       [    2,     7,     1,     1,     4,   144,    17,    15,    25],
       [    8,     2,    20,     7,    28,    15,   622,    75,    58],
       [    0,     0,     0,     5,     3,     4,    38,  1099,     7],
       [   32,    32,    80,    58,     9,    36,   126,    34, 37916]],
      dtype=int64)
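
The matrix above carries no row or column names. When no `labels` argument is passed, `confusion_matrix` orders the classes by their sorted label names, with rows for the true labels and columns for the predicted ones. Assuming the nine BIO labels of this corpus, that order would be B-LOC, B-MISC, B-ORG, B-PER, I-LOC, I-MISC, I-ORG, I-PER, O, which is consistent with the large last diagonal entry belonging to `O`. The following sketch rebuilds such a labelled matrix in plain Python; `labelled_confusion` is a hypothetical helper and the toy labels are invented for illustration.

```python
from collections import Counter

def labelled_confusion(truth, pred):
    """Count (true label, predicted label) pairs in sorted label order."""
    labels = sorted(set(truth) | set(pred))
    counts = Counter(zip(truth, pred))
    # Rows are true labels, columns are predicted labels.
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix

# Toy labels for illustration only, not taken from the corpus.
toy_truth = ['B-LOC', 'O', 'O', 'B-PER', 'I-PER']
toy_pred = ['B-LOC', 'O', 'B-LOC', 'B-PER', 'O']
toy_labels, toy_matrix = labelled_confusion(toy_truth, toy_pred)
for label, row in zip(toy_labels, toy_matrix):
    print(f'{label:>6}', row)
```

Passing the same sorted label list as index and column names to a `pd.DataFrame` built from `cm` would make the real matrix readable in the same way.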