# Named Entity Recognition using sklearn-crfsuite

In this notebook we train a basic CRF model for Named Entity Recognition on CoNLL2002 data (following https://github.com/TeamHG-Memex/sklearn-crfsuite/blob/master/docs/CoNLL2002.ipynb) and check its weights to see what it learned.

To follow this tutorial you need the NLTK (version 3.x or later), sklearn-crfsuite and eli5 Python packages. The tutorial uses Python 3.

In [None]:
import nltk
import sklearn_crfsuite
import eli5

## 1. Training data

The CoNLL 2002 dataset contains a list of Spanish sentences with named entities annotated. It uses [IOB2](https://en.wikipedia.org/wiki/Inside_Outside_Beginning) encoding. The CoNLL 2002 data also provides POS tags.

In [None]:
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))
train_sents[0]
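
To make the IOB2 encoding concrete, here is a minimal sketch that groups IOB2 labels back into entity spans. The helper name `iob2_to_spans` and the toy sentence are assumptions made for illustration; they are not part of NLTK or of the corpus.

```python
def iob2_to_spans(labels):
    """Convert a list of IOB2 labels into (entity_type, start, end) spans.

    `end` is exclusive, so tokens[start:end] are the entity's tokens.
    """
    spans = []
    start, ent_type = None, None
    for i, label in enumerate(labels):
        # An open entity continues only through a matching I- tag.
        continues = (
            ent_type is not None
            and label.startswith('I-')
            and label[2:] == ent_type
        )
        if continues:
            continue
        # Anything else closes the entity that was open, if any.
        if ent_type is not None:
            spans.append((ent_type, start, i))
            start, ent_type = None, None
        # Only a B- tag opens a new entity in IOB2.
        if label.startswith('B-'):
            start, ent_type = i, label[2:]
    if ent_type is not None:
        spans.append((ent_type, start, len(labels)))
    return spans


# Hand-written toy example, not taken from the corpus.
tokens = ['La', 'Unión', 'Europea', 'visitó', 'Madrid', '.']
labels = ['O', 'B-ORG', 'I-ORG', 'O', 'B-LOC', 'O']
for ent_type, start, end in iob2_to_spans(labels):
    print(ent_type, tokens[start:end])
```

Every entity starts with a `B-` tag, so two adjacent entities of the same type stay distinguishable.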

## 2. Feature extraction

POS tags can be seen as pre-extracted features. Let's extract more features (word parts, simplified POS tags, lower/title/upper flags, features of nearby words) and convert them to sklearn-crfsuite format - each sentence should be converted to a list of dicts. This is a very simple baseline; you certainly can do better.

In [None]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],        
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True
        
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True
                
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

This is what the features extracted from a single token look like:

In [None]:
X_train[0][1]
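
The overall shape of the data can be checked on a toy sentence without loading the corpus. The sketch below uses a reduced, hypothetical feature function (`toy_word2features`) and hand-written POS tags; it only illustrates the "one list per sentence, one dict per token" layout described above.

```python
def toy_word2features(sent, i):
    """Reduced variant of word2features, for illustration only."""
    word, postag = sent[i][0], sent[i][1]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word.istitle()': word.istitle(),
        'postag': postag,
    }
    if i == 0:
        features['BOS'] = True
    if i == len(sent) - 1:
        features['EOS'] = True
    return features


# Hand-written (token, POS tag, label) triples; the tags are illustrative.
toy_sent = [('Madrid', 'NP', 'B-LOC'), ('es', 'VS', 'O'), ('grande', 'AQ', 'O')]

X_toy = [[toy_word2features(toy_sent, i) for i in range(len(toy_sent))]]
y_toy = [[label for _, _, label in toy_sent]]

# One list per sentence, one dict per token, and one label per token.
assert len(X_toy[0]) == len(y_toy[0]) == len(toy_sent)
print(X_toy[0][0])
```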

## 3. Train a CRF model

Once we have features in the right format we can train a linear-chain CRF (Conditional Random Fields) model using sklearn_crfsuite.CRF:

In [None]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1, 
    c2=0.1, 
    max_iterations=20,
    all_possible_transitions=False,
)
crf.fit(X_train, y_train);

## 4. Inspect model weights

CRFsuite CRF models use two kinds of features: state features and transition features. Let's check their weights using eli5.show_weights:

In [None]:
eli5.show_weights(crf, top=30)

Transition features make sense: at least the model learned that I-ENTITY must follow B-ENTITY. It also learned that some transitions are unlikely, e.g. it is not common in this dataset to have a location right after an organization name (I-ORG -> B-LOC has a large negative weight).

Features don't use gazetteers, so the model had to remember some geographic names from the training data, e.g. that España is a location.
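
To see how the two kinds of weights work together, recall that a linear-chain CRF scores a label sequence by summing the state weights of the active features and the transition weights between neighbouring labels. The sketch below does this with hand-picked toy weights; the numbers and the `sequence_score` helper are assumptions for illustration, not values or code from the trained model.

```python
# Hand-picked toy weights, not values learned by the model above.
state_weights = {
    ('word.istitle()', 'B-LOC'): 1.5,
    ('word.istitle()', 'O'): -0.5,
    ('word.lower():españa', 'B-LOC'): 2.0,
    ('bias', 'O'): 1.0,
}
transition_weights = {
    ('O', 'B-LOC'): 0.3,
    ('B-LOC', 'O'): 0.2,
    ('O', 'O'): 0.8,
    ('O', 'I-LOC'): -4.0,
}


def sequence_score(active_features, labels):
    """Unnormalized linear-chain score: state weights plus transition weights."""
    score = 0.0
    for feats, label in zip(active_features, labels):
        score += sum(state_weights.get((f, label), 0.0) for f in feats)
    for prev, cur in zip(labels, labels[1:]):
        score += transition_weights.get((prev, cur), 0.0)
    return score


# Active (binary) features for the toy sentence "en España".
active = [
    ['bias'],
    ['bias', 'word.istitle()', 'word.lower():españa'],
]
print(sequence_score(active, ['O', 'B-LOC']))
print(sequence_score(active, ['O', 'O']))
print(sequence_score(active, ['O', 'I-LOC']))
```

With these toy numbers the memorized token and the title-case flag both push the second token towards B-LOC, while the negative O -> I-LOC transition weight penalizes the invalid sequence.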

If we regularize the CRF more, we can expect that only generic features will remain and memorized tokens will go. With L1 regularization (the c1 parameter), the coefficients of most features should be driven to zero. Let's check what effect regularization has on the CRF weights:

In [None]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=200,
    c2=0.1,
    max_iterations=20,
    all_possible_transitions=False,
)
crf.fit(X_train, y_train)
eli5.show_weights(crf, top=30)

As you can see, memorized tokens are mostly gone and the model now relies on word shapes and POS tags. There are only a few non-zero features remaining. In our example the change probably made the quality worse, but that's a separate question.
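
Why L1 regularization produces exact zeros can be illustrated with a toy shrinkage step. This is only a sketch of the general effect: CRFsuite's L-BFGS trainer handles the L1 term with its own optimizer, not with the functions below, and the weights are hypothetical.

```python
def l1_shrink(w, t):
    """Soft-thresholding: proximal step for the penalty t * |w|."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def l2_shrink(w, t):
    """Proximal step for the penalty t * w**2: rescales but never zeroes."""
    return w / (1.0 + 2.0 * t)


# Hypothetical weights: two strong generic features, three weak ones.
weights = [2.5, -1.8, 0.3, -0.2, 0.05]
print([l1_shrink(w, 0.5) for w in weights])
print([l2_shrink(w, 0.5) for w in weights])
```

The L1 step sets every weight whose magnitude is below the threshold to exactly zero, while the L2 step only scales weights down, which matches the sparse table produced by a large c1.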

Let's focus on transition weights. We can expect O -> I-ENTITY transitions to have large negative weights because they are impossible. But these transitions have zero weights, not negative weights, both in the heavily regularized model and in our initial model. Something is going on here.

The reason they are zero is that crfsuite hasn't seen these transitions in the training data and assumed there is no need to learn weights for them, to save some computation time. This is the default behavior, but it is possible to turn it off using the sklearn_crfsuite.CRF ``all_possible_transitions`` option. Let's check how it affects the result:

In [None]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1, 
    c2=0.1, 
    max_iterations=20, 
    all_possible_transitions=True,
)
crf.fit(X_train, y_train);

In [None]:
eli5.show_weights(crf, top=5, show=['transition_features'])

With `all_possible_transitions=True` the CRF learned large negative weights for impossible transitions like O -> I-ORG.
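
Which transitions count as impossible follows directly from the IOB2 rules. The sketch below enumerates them for a reduced label set; the helper `is_valid_iob2_transition` is a hypothetical name introduced here, not an eli5 or sklearn-crfsuite function.

```python
def is_valid_iob2_transition(prev, cur):
    """Return True if label `cur` may directly follow label `prev` in IOB2."""
    if cur.startswith('I-'):
        # I-X must continue an entity of the same type X.
        return prev in ('B-' + cur[2:], 'I-' + cur[2:])
    # O and B-X may follow any label.
    return True


# Reduced label set for illustration; the corpus has more entity types.
labels = ['O', 'B-LOC', 'I-LOC', 'B-ORG', 'I-ORG']
impossible = [
    (prev, cur)
    for prev in labels
    for cur in labels
    if not is_valid_iob2_transition(prev, cur)
]
for prev, cur in impossible:
    print(prev, '->', cur)
```

These are the label pairs that never occur in correctly annotated training data, so they get no weight unless all possible transitions are generated.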

## 5. Customization

The table above is large and hard to inspect; eli5 provides several options to look at only a subset of the features. You can check only a subset of labels:

In [None]:
eli5.show_weights(crf, top=10, targets=['O', 'B-ORG', 'I-ORG'])

Another option is to check only some of the features - it helps to check if a feature function works as intended. For example, let's check how word shape features are used by the model using the ``feature_re`` argument, and hide the transition table:

In [None]:
eli5.show_weights(crf, top=10, feature_re=r'^word\.is',
                  horizontal_layout=False, show=['targets'])

Looks fine - UPPERCASE and Titlecase words are likely to be entities of some kind.
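
The effect of the pattern can be previewed on the feature names themselves. This sketch assumes that a feature is kept when the regular expression is found somewhere in its name (search semantics); it uses only the standard `re` module and a subset of the names produced by `word2features`.

```python
import re

# Feature names as produced by word2features (a subset).
feature_names = [
    'bias',
    'word.lower()',
    'word.isupper()',
    'word.istitle()',
    'word.isdigit()',
    '-1:word.istitle()',
    '+1:word.isupper()',
]

# Raw string so that \. reaches the regex engine as an escaped dot.
pattern = re.compile(r'^word\.is')

# Assumption: a feature is kept when the pattern is found in its name.
selected = [name for name in feature_names if pattern.search(name)]
print(selected)
```

The `^` anchor is what excludes the `-1:` and `+1:` context features, so only the shape flags of the current word remain.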

## 6. Formatting in console

It is also possible to format the result as text, which can be useful in a console:

In [None]:
expl = eli5.explain_weights(crf, top=5, targets=['O', 'B-LOC', 'I-LOC'])
print(eli5.format_as_text(expl))