# Named Entity Recognition using sklearn-crfsuite

In this notebook we train a basic CRF model for Named Entity Recognition on CoNLL2002 data (following https://github.com/TeamHG-Memex/sklearn-crfsuite/blob/master/docs/CoNLL2002.ipynb) and check its weights to see what it learned.

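To follow this tutorial you need the NLTK (3.x or later), sklearn-crfsuite and eli5 Python packages. The CoNLL 2002 corpus must also be available to NLTK (for example via `nltk.download('conll2002')`). The tutorial uses Python 3.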

In [1]:
import nltk
import sklearn_crfsuite
import eli5

## 1. Training data

The CoNLL 2002 dataset used here contains Spanish sentences with named entities annotated. It uses [IOB2](https://en.wikipedia.org/wiki/Inside_Outside_Beginning) encoding. The CoNLL 2002 data also provides POS tags.

In [2]:
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))
train_sents[0]

[('Melbourne', 'NP', 'B-LOC'),
 ('(', 'Fpa', 'O'),
 ('Australia', 'NP', 'B-LOC'),
 (')', 'Fpt', 'O'),
 (',', 'Fc', 'O'),
 ('25', 'Z', 'O'),
 ('may', 'NC', 'O'),
 ('(', 'Fpa', 'O'),
 ('EFE', 'NC', 'B-ORG'),
 (')', 'Fpt', 'O'),
 ('.', 'Fp', 'O')]
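
To make the IOB2 encoding concrete, here is a minimal sketch that groups the tagged tokens of a sentence back into entity spans. The helper name `iob2_spans` is hypothetical (it is not part of NLTK or sklearn-crfsuite), and the sentence is copied from the output above so the block runs without the corpus.

```python
def iob2_spans(tagged_sent):
    """Group (token, postag, label) triples into (entity_type, tokens) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, _postag, label in tagged_sent:
        if label.startswith('I-') and current_type == label[2:]:
            # I-X continues the entity of type X that is currently open.
            current_tokens.append(token)
            continue
        # Any other label closes the open entity, if there is one.
        if current_type is not None:
            spans.append((current_type, current_tokens))
            current_type, current_tokens = None, []
        if label.startswith('B-'):
            # B-X always opens a new entity of type X.
            current_type, current_tokens = label[2:], [token]
        # 'O' tokens (and a stray I-X with no matching open entity) are
        # skipped; ignoring stray I-X tags is a choice made for this sketch.
    if current_type is not None:
        spans.append((current_type, current_tokens))
    return spans


# First training sentence, copied from the output above.
sent = [
    ('Melbourne', 'NP', 'B-LOC'), ('(', 'Fpa', 'O'),
    ('Australia', 'NP', 'B-LOC'), (')', 'Fpt', 'O'),
    (',', 'Fc', 'O'), ('25', 'Z', 'O'), ('may', 'NC', 'O'),
    ('(', 'Fpa', 'O'), ('EFE', 'NC', 'B-ORG'), (')', 'Fpt', 'O'),
    ('.', 'Fp', 'O'),
]
spans = iob2_spans(sent)
print(spans)
```

Every entity in this sentence is a single token, so each span holds one word; a multi-token entity would appear as one `B-` tag followed by `I-` tags of the same type.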

## 2. Feature extraction

POS tags can be seen as pre-extracted features. Let's extract more features (word parts, simplified POS tags, lower/title/upper flags, features of nearby words) and convert them to the sklearn-crfsuite format: each sentence should be converted to a list of dicts. This is a very simple baseline; you can certainly do better.

In [3]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],        
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True
        
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True
                
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

This is what the features extracted from a single token look like:

In [4]:
X_train[0][1]

{'+1:postag': 'NP',
 '+1:postag[:2]': 'NP',
 '+1:word.istitle()': True,
 '+1:word.isupper()': False,
 '+1:word.lower()': 'australia',
 '-1:postag': 'NP',
 '-1:postag[:2]': 'NP',
 '-1:word.istitle()': True,
 '-1:word.isupper()': False,
 '-1:word.lower()': 'melbourne',
 'bias': 1.0,
 'postag': 'Fpa',
 'postag[:2]': 'Fp',
 'word.isdigit()': False,
 'word.istitle()': False,
 'word.isupper()': False,
 'word.lower()': '(',
 'word[-3:]': '('}
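
Because the expected input shape is easy to get wrong (one list of feature dicts per sentence, and one list of string labels of the same length), a small sanity check can help before training. The sketch below is an illustration only: `check_crf_inputs` is a hypothetical helper, not part of sklearn-crfsuite, and the toy data is made up for the example.

```python
def check_crf_inputs(X, y):
    """Raise ValueError if X / y are not aligned lists of dicts / labels."""
    if len(X) != len(y):
        raise ValueError('X and y contain a different number of sentences')
    for n, (xseq, yseq) in enumerate(zip(X, y)):
        if len(xseq) != len(yseq):
            raise ValueError('sentence %d: %d tokens but %d labels'
                             % (n, len(xseq), len(yseq)))
        for features in xseq:
            # Each token must be described by a dict of feature name -> value.
            if not isinstance(features, dict):
                raise ValueError('sentence %d: token features must be a dict' % n)
        for label in yseq:
            # Labels are plain strings such as 'O' or 'B-LOC'.
            if not isinstance(label, str):
                raise ValueError('sentence %d: labels must be strings' % n)
    return True


# Toy input in the same shape as X_train / y_train (values are made up).
X_toy = [[
    {'bias': 1.0, 'word.lower()': 'melbourne', 'BOS': True},
    {'bias': 1.0, 'word.lower()': '(', 'EOS': True},
]]
y_toy = [['B-LOC', 'O']]
print(check_crf_inputs(X_toy, y_toy))
```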

## 3. Train a CRF model

Once we have the features in the right format, we can train a linear-chain CRF (Conditional Random Fields) model using `sklearn_crfsuite.CRF`:

In [5]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1, 
    c2=0.1, 
    max_iterations=20, 
    all_possible_transitions=False,
)
crf.fit(X_train, y_train);
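
In sklearn-crfsuite, `c1` and `c2` are the coefficients for L1 and L2 regularization. As a rough sketch, and assuming CRFsuite's usual formulation (the exact scaling constants used internally may differ), training with `algorithm='lbfgs'` minimizes an objective of the form

$$
\min_{w} \; -\sum_{(x,\, y) \in D} \log P(y \mid x; w) \;+\; c_1 \lVert w \rVert_1 \;+\; c_2 \lVert w \rVert_2^2
$$

where $D$ is the training set and $w$ collects all state and transition weights. The L1 term tends to push the weights of uninformative features to exactly zero, while the L2 term keeps the remaining weights small.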

## 4. Inspect model weights

CRFsuite CRF models use two kinds of features: state features and transition features. Let's check their weights 
using eli5.explain_weights:

In [6]:
eli5.explain_weights(crf, top=30)

From \ To,O,B-LOC,I-LOC,B-MISC,I-MISC,B-ORG,I-ORG,B-PER,I-PER
O,3.281,2.204,0.0,2.101,0.0,3.468,0.0,2.325,0.0
B-LOC,-0.259,-0.098,4.058,0.0,0.0,0.0,0.0,-0.212,0.0
I-LOC,-0.173,-0.609,3.436,0.0,0.0,0.0,0.0,0.0,0.0
B-MISC,-0.673,-0.341,0.0,0.0,4.069,-0.308,0.0,-0.331,0.0
I-MISC,-0.803,-0.998,0.0,-0.519,4.977,-0.817,0.0,-0.611,0.0
B-ORG,-0.096,-0.242,0.0,-0.57,0.0,-1.012,4.739,-0.306,0.0
I-ORG,-0.339,-1.758,0.0,-0.841,0.0,-1.382,5.062,-0.472,0.0
B-PER,-0.4,-0.851,0.0,0.0,0.0,-1.013,0.0,-0.937,4.329
I-PER,-0.676,-0.47,0.0,0.0,0.0,0.0,0.0,-0.659,3.754

y=O  top features,y=O  top features
Weight,Feature
+4.416,postag[:2]:Fp
+3.116,BOS
+2.401,bias
+2.297,"word[-3:]:,"
+2.297,"word.lower():,"
+2.297,postag:Fc
+2.297,postag[:2]:Fc
+2.124,postag:CC
+2.124,postag[:2]:CC
+1.984,EOS

y=B-LOC  top features,y=B-LOC  top features
Weight,Feature
+2.530,word.istitle()
+2.224,-1:word.lower():en
+0.906,word[-3:]:rid
+0.905,word.lower():madrid
+0.646,word.lower():españa
+0.640,word[-3:]:ona
+0.595,word[-3:]:aña
+0.595,+1:postag[:2]:Fp
+0.515,word.lower():parís
+0.514,word[-3:]:rís

y=I-LOC  top features,y=I-LOC  top features
Weight,Feature
+0.886,-1:word.istitle()
+0.664,-1:word.lower():de
+0.582,word[-3:]:de
+0.578,word.lower():de
+0.529,-1:word.lower():san
+0.444,+1:word.istitle()
+0.441,word.istitle()
+0.335,-1:word.lower():la
+0.262,postag[:2]:SP
+0.262,postag:SP

y=B-MISC  top features,y=B-MISC  top features
Weight,Feature
+1.770,word.isupper()
+0.693,word.istitle()
+0.606,"word[-3:]:"""
+0.606,"word.lower():"""
+0.606,postag:Fe
+0.606,postag[:2]:Fe
+0.538,+1:word.istitle()
+0.508,"-1:word.lower():"""
+0.508,-1:postag[:2]:Fe
+0.508,-1:postag:Fe

y=I-MISC  top features,y=I-MISC  top features
Weight,Feature
+1.364,-1:word.istitle()
+0.675,-1:word.lower():de
+0.597,"+1:word.lower():"""
+0.597,+1:postag[:2]:Fe
+0.597,+1:postag:Fe
+0.369,-1:postag:NC
+0.369,-1:postag[:2]:NC
+0.324,-1:word.lower():liga
+0.318,word[-3:]:de
+0.304,word.lower():de

y=B-ORG  top features,y=B-ORG  top features
Weight,Feature
+2.695,word.lower():efe
+2.519,word.isupper()
+2.084,word[-3:]:EFE
+1.174,word.lower():gobierno
+1.142,word.istitle()
+1.018,-1:word.lower():del
+0.958,word[-3:]:rno
+0.671,word.lower():pp
+0.671,word[-3:]:PP
+0.667,-1:word.lower():al

y=I-ORG  top features,y=I-ORG  top features
Weight,Feature
+1.499,-1:word.istitle()
+1.200,-1:word.lower():de
+0.539,-1:word.lower():real
+0.511,word[-3:]:rid
+0.446,word[-3:]:de
+0.433,word.lower():de
+0.428,-1:postag:SP
+0.428,-1:postag[:2]:SP
+0.399,word.lower():madrid
+0.368,word[-3:]:la

y=B-PER  top features,y=B-PER  top features
Weight,Feature
+1.698,word.istitle()
+0.683,-1:postag:VMI
+0.601,+1:postag[:2]:VM
+0.589,postag:NP
+0.589,postag[:2]:NP
+0.589,+1:postag:VMI
+0.565,-1:word.lower():a
+0.520,word[-3:]:osé
+0.503,word.lower():josé
+0.476,-1:postag[:2]:VM

y=I-PER  top features,y=I-PER  top features
Weight,Feature
+2.742,-1:word.istitle()
+0.736,word.istitle()
+0.660,-1:word.lower():josé
+0.598,-1:postag:AQ
+0.598,-1:postag[:2]:AQ
+0.510,-1:postag[:2]:VM
+0.487,-1:word.lower():juan
+0.419,-1:word.lower():maría
+0.413,-1:postag:VMI
+0.345,-1:word.lower():luis


The features don't use gazetteers, so the model had to memorize some geographic names from the training data, e.g. that España is a location.

Transition features make sense: the model at least learned that I-ENTITY must follow B-ENTITY, and that some transitions are unlikely, e.g. it is not common to have a location right after an organization name (I-ORG -> B-LOC has a large negative weight, -1.758).
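
To see how state and transition weights combine, here is a minimal sketch of the unnormalized score a linear-chain CRF assigns to one label sequence. It is a simplification under stated assumptions: `sequence_score` is a hypothetical helper, each token is represented by the names of its active features (all with value 1), and only a handful of weights are copied from the tables above, so the numbers are partial scores rather than what the trained model would compute.

```python
def sequence_score(feature_seq, label_seq, state_weights, transition_weights):
    """Sum the state and transition weights along one label sequence."""
    score = 0.0
    prev_label = None
    for features, label in zip(feature_seq, label_seq):
        for name in features:
            # (feature, label) pairs without a learned weight count as 0.
            score += state_weights.get((name, label), 0.0)
        if prev_label is not None:
            score += transition_weights.get((prev_label, label), 0.0)
        prev_label = label
    return score


# A few weights copied from the explain_weights output above.
state_weights = {
    ('bias', 'O'): 2.401,
    ('word.istitle()', 'B-LOC'): 2.530,
    ('-1:word.lower():en', 'B-LOC'): 2.224,
    ('word.lower():madrid', 'B-LOC'): 0.905,
    ('word.istitle()', 'B-PER'): 1.698,
}
transition_weights = {('O', 'B-LOC'): 2.204, ('O', 'B-PER'): 2.325}

# Hypothetical two-token fragment "en Madrid", as lists of active feature names.
feature_seq = [
    ['bias'],
    ['bias', 'word.istitle()', '-1:word.lower():en', 'word.lower():madrid'],
]
loc_score = sequence_score(feature_seq, ['O', 'B-LOC'], state_weights, transition_weights)
per_score = sequence_score(feature_seq, ['O', 'B-PER'], state_weights, transition_weights)
print(loc_score > per_score)
```

With these partial weights the location reading scores higher, mostly because of the preceding word "en" and the memorized word "madrid".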

We'd also expect O -> I-ENTITY transitions to have large negative weights because they are impossible, but these transitions have zero weight, not negative weight; this can be a problem and can decrease quality. `sklearn_crfsuite.CRF` provides an ``all_possible_transitions`` argument, which allows the model to learn weights for transitions that are not observed in the training data. Let's check how it affects the result:

In [7]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1, 
    c2=0.1, 
    max_iterations=20, 
    all_possible_transitions=True,
)
crf.fit(X_train, y_train);

In [8]:
eli5.explain_weights(crf, top=5)

From \ To,O,B-LOC,I-LOC,B-MISC,I-MISC,B-ORG,I-ORG,B-PER,I-PER
O,2.732,1.217,-4.675,1.515,-5.785,1.36,-6.19,0.968,-6.236
B-LOC,-0.226,-0.091,3.378,-0.433,-1.065,-0.861,-1.783,-0.295,-1.57
I-LOC,-0.184,-0.585,2.404,-0.276,-0.485,-0.582,-0.749,-0.442,-0.647
B-MISC,-0.714,-0.353,-0.539,-0.278,3.512,-0.412,-1.047,-0.336,-0.895
I-MISC,-0.697,-0.846,-0.587,-0.297,4.252,-0.84,-1.206,-0.523,-1.001
B-ORG,0.419,-0.187,-1.074,-0.567,-1.607,-1.13,5.392,-0.223,-2.122
I-ORG,-0.117,-1.715,-0.863,-0.631,-1.221,-1.442,5.141,-0.397,-1.908
B-PER,-0.127,-0.806,-0.834,-0.52,-1.228,-1.089,-2.076,-1.01,4.04
I-PER,-0.766,-0.242,-0.67,-0.418,-0.856,-0.903,-1.472,-0.692,2.909

y=O  top features,y=O  top features
Weight,Feature
+4.931,BOS
+3.754,postag[:2]:Fp
+3.539,bias
… 15043 more positive …,… 15043 more positive …
… 3906 more negative …,… 3906 more negative …
-3.685,word.isupper()
-7.025,word.istitle()

y=B-LOC  top features,y=B-LOC  top features
Weight,Feature
+2.397,word.istitle()
+2.147,-1:word.lower():en
… 2284 more positive …,… 2284 more positive …
… 433 more negative …,… 433 more negative …
-1.080,postag[:2]:SP
-1.080,postag:SP
-1.273,-1:word.istitle()

y=I-LOC  top features,y=I-LOC  top features
Weight,Feature
+0.882,-1:word.lower():de
+0.780,-1:word.istitle()
+0.718,word[-3:]:de
+0.711,word.lower():de
… 1684 more positive …,… 1684 more positive …
… 268 more negative …,… 268 more negative …
-1.965,BOS

y=B-MISC  top features,y=B-MISC  top features
Weight,Feature
+2.017,word.isupper()
+0.603,word.istitle()
… 2287 more positive …,… 2287 more positive …
… 337 more negative …,… 337 more negative …
-0.850,-1:word.istitle()
-0.959,postag:SP
-0.959,postag[:2]:SP

y=I-MISC  top features,y=I-MISC  top features
Weight,Feature
+0.864,-1:word.istitle()
+0.616,-1:word.lower():de
+0.591,+1:postag[:2]:Fe
+0.591,+1:postag:Fe
+0.591,"+1:word.lower():"""
… 3684 more positive …,… 3684 more positive …
… 582 more negative …,… 582 more negative …

y=B-ORG  top features,y=B-ORG  top features
Weight,Feature
+3.041,word.isupper()
+2.952,word.lower():efe
+1.851,word[-3:]:EFE
… 3528 more positive …,… 3528 more positive …
… 622 more negative …,… 622 more negative …
-1.416,postag:SP
-1.416,postag[:2]:SP

y=I-ORG  top features,y=I-ORG  top features
Weight,Feature
+1.159,-1:word.lower():de
+0.993,-1:word.istitle()
+0.637,-1:postag[:2]:SP
+0.637,-1:postag:SP
… 3519 more positive …,… 3519 more positive …
… 679 more negative …,… 679 more negative …
-1.290,bias

y=B-PER  top features,y=B-PER  top features
Weight,Feature
+1.757,word.istitle()
… 4142 more positive …,… 4142 more positive …
… 352 more negative …,… 352 more negative …
-0.971,-1:postag[:2]:DA
-0.971,-1:postag:DA
-1.503,postag[:2]:DA
-1.503,postag:DA

y=I-PER  top features,y=I-PER  top features
Weight,Feature
+1.545,-1:word.istitle()
+0.976,word.istitle()
+0.695,-1:word.lower():josé
+0.677,postag[:2]:NC
+0.677,postag:NC
… 3930 more positive …,… 3930 more positive …
… 363 more negative …,… 363 more negative …


With `all_possible_transitions=True` the CRF learned large negative weights for impossible transitions like O -> I-ORG.
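
The same check can be made programmatically. The sketch below encodes the IOB2 rule that an I-X tag may only follow B-X or I-X, and applies it to the O row of the transition table above. `is_valid_iob2_transition` is a hypothetical helper written for this illustration, not an eli5 or sklearn-crfsuite function.

```python
def is_valid_iob2_transition(prev_label, label):
    """In IOB2, I-X may only follow B-X or I-X; all else is allowed."""
    if not label.startswith('I-'):
        return True
    entity = label[2:]
    return prev_label in ('B-' + entity, 'I-' + entity)


# O row of the transition table above (all_possible_transitions=True).
from_O = {
    'O': 2.732, 'B-LOC': 1.217, 'I-LOC': -4.675, 'B-MISC': 1.515,
    'I-MISC': -5.785, 'B-ORG': 1.36, 'I-ORG': -6.19, 'B-PER': 0.968,
    'I-PER': -6.236,
}

# Keep only the transitions out of O that IOB2 forbids.
invalid_from_O = {
    label: weight for label, weight in from_O.items()
    if not is_valid_iob2_transition('O', label)
}
print(invalid_from_O)
```

All four forbidden transitions out of O carry clearly negative weights, while the allowed ones (O -> O and O -> B-X) are positive.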

## 5. Customization

The table above is large and hard to inspect; eli5 provides several options for looking at only part of the features. For example, you can check only a subset of labels:

In [9]:
eli5.explain_weights(crf, top=10, targets=['O', 'B-ORG', 'I-ORG'])

From \ To,O,B-ORG,I-ORG
O,2.732,1.36,-6.19
B-ORG,0.419,-1.13,5.392
I-ORG,-0.117,-1.442,5.141

y=O  top features,y=O  top features
Weight,Feature
+4.931,BOS
+3.754,postag[:2]:Fp
+3.539,bias
+2.328,"word[-3:]:,"
+2.328,postag[:2]:Fc
+2.328,postag:Fc
+2.328,"word.lower():,"
… 15039 more positive …,… 15039 more positive …
… 3905 more negative …,… 3905 more negative …
-2.187,postag[:2]:NP

y=B-ORG  top features,y=B-ORG  top features
Weight,Feature
+3.041,word.isupper()
+2.952,word.lower():efe
+1.851,word[-3:]:EFE
+1.278,word.lower():gobierno
+1.033,word[-3:]:rno
+1.005,word.istitle()
+0.864,-1:word.lower():del
… 3524 more positive …,… 3524 more positive …
… 621 more negative …,… 621 more negative …
-0.842,-1:word.lower():en

y=I-ORG  top features,y=I-ORG  top features
Weight,Feature
+1.159,-1:word.lower():de
+0.993,-1:word.istitle()
+0.637,-1:postag[:2]:SP
+0.637,-1:postag:SP
+0.570,-1:word.lower():real
+0.547,word.istitle()
… 3517 more positive …,… 3517 more positive …
… 676 more negative …,… 676 more negative …
-0.480,postag:VMI
-0.508,postag[:2]:VM


Another option is to check only some of the features, which helps to verify that a feature function works as intended. For example, let's check how word shape features are used by the model using the ``feature_re`` argument:

In [10]:
eli5.explain_weights(crf, top=10, feature_re=r'^word\.is')

From \ To,O,B-LOC,I-LOC,B-MISC,I-MISC,B-ORG,I-ORG,B-PER,I-PER
O,2.732,1.217,-4.675,1.515,-5.785,1.36,-6.19,0.968,-6.236
B-LOC,-0.226,-0.091,3.378,-0.433,-1.065,-0.861,-1.783,-0.295,-1.57
I-LOC,-0.184,-0.585,2.404,-0.276,-0.485,-0.582,-0.749,-0.442,-0.647
B-MISC,-0.714,-0.353,-0.539,-0.278,3.512,-0.412,-1.047,-0.336,-0.895
I-MISC,-0.697,-0.846,-0.587,-0.297,4.252,-0.84,-1.206,-0.523,-1.001
B-ORG,0.419,-0.187,-1.074,-0.567,-1.607,-1.13,5.392,-0.223,-2.122
I-ORG,-0.117,-1.715,-0.863,-0.631,-1.221,-1.442,5.141,-0.397,-1.908
B-PER,-0.127,-0.806,-0.834,-0.52,-1.228,-1.089,-2.076,-1.01,4.04
I-PER,-0.766,-0.242,-0.67,-0.418,-0.856,-0.903,-1.472,-0.692,2.909

y=O  top features,y=O  top features
Weight,Feature
-3.685,word.isupper()
-7.025,word.istitle()

y=B-LOC  top features,y=B-LOC  top features
Weight,Feature
2.397,word.istitle()
0.099,word.isupper()
-0.152,word.isdigit()

y=I-LOC  top features,y=I-LOC  top features
Weight,Feature
0.46,word.istitle()
-0.018,word.isdigit()
-0.345,word.isupper()

y=B-MISC  top features,y=B-MISC  top features
Weight,Feature
2.017,word.isupper()
0.603,word.istitle()
-0.012,word.isdigit()

y=I-MISC  top features,y=I-MISC  top features
Weight,Feature
0.271,word.isdigit()
-0.072,word.isupper()
-0.106,word.istitle()

y=B-ORG  top features,y=B-ORG  top features
Weight,Feature
3.041,word.isupper()
1.005,word.istitle()
-0.044,word.isdigit()

y=I-ORG  top features,y=I-ORG  top features
Weight,Feature
0.547,word.istitle()
0.014,word.isdigit()
-0.012,word.isupper()

y=B-PER  top features,y=B-PER  top features
Weight,Feature
1.757,word.istitle()
0.05,word.isupper()
-0.123,word.isdigit()

y=I-PER  top features,y=I-PER  top features
Weight,Feature
0.976,word.istitle()
0.193,word.isupper()
-0.106,word.isdigit()


Looks fine - UPPERCASE and Titlecase words are likely to be entities of some kind.
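
To preview which features a given pattern selects, the filter can be reproduced with the standard `re` module. This sketch assumes the pattern is applied to each feature name with `re.search`, as the eli5 documentation describes for `feature_re`; the list of feature names is a small hand-picked sample, not the model's full feature set.

```python
import re

# A hand-picked sample of feature names produced by word2features.
feature_names = [
    'word.isupper()', 'word.istitle()', 'word.isdigit()',
    '-1:word.istitle()', '+1:word.isupper()',
    'word.lower():efe', 'postag:NP',
]

# Raw string, so the backslash reaches the regex engine unchanged.
pattern = r'^word\.is'
selected = [name for name in feature_names if re.search(pattern, name)]
print(selected)
```

The `^` anchor is what excludes the context features such as `-1:word.istitle()`; dropping it would also select the shape flags of neighbouring words.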

## 6. Formatting in console

It is also possible to format the result as text (could be useful in console):

In [11]:
expl = eli5.explain_weights(crf, top=5, targets=['O', 'B-LOC', 'I-LOC'])
print(eli5.format_as_text(expl))

Explained as: CRF

Transition features:
            O    B-LOC    I-LOC
-----  ------  -------  -------
O       2.732    1.217   -4.675
B-LOC  -0.226   -0.091    3.378
I-LOC  -0.184   -0.585    2.404

y='O' top features
----------------------------
  +4.931  BOS               
  +3.754  postag[:2]:Fp     
  +3.539  bias              
       …  (15043 more positive features)
       …  (3906 more negative features)
  -3.685  word.isupper()    
  -7.025  word.istitle()    

y='B-LOC' top features
----------------------------
  +2.397  word.istitle()    
  +2.147  -1:word.lower():en
       …  (2284 more positive features)
       …  (433 more negative features)
  -1.080  postag[:2]:SP     
  -1.080  postag:SP         
  -1.273  -1:word.istitle() 

y='I-LOC' top features
----------------------------
  +0.882  -1:word.lower():de
  +0.780  -1:word.istitle() 
  +0.718  word[-3:]:de      
  +0.711  word.lower():de   
       …  (1684 more positive features)
       …  (268 more negative features)
