### [`Named Entity Recognition and Classification with Scikit-Learn`](https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2)

#### Essential info about entities:

* geo = Geographical Entity
* org = Organization
* per = Person
* gpe = Geopolitical Entity
* tim = Time indicator
* art = Artifact
* eve = Event
* nat = Natural Phenomenon

#### Inside–outside–beginning (tagging)

IOB (short for inside, outside, beginning) is a common format for tagging tokens in chunking tasks such as named entity recognition.

* An `I-` prefix before a tag indicates that the token is inside a chunk.
* A `B-` prefix before a tag indicates that the token is at the beginning of a chunk.
* An `O` tag indicates that the token belongs to no chunk (outside).
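
As a small illustration of the scheme above, the sketch below groups IOB-tagged tokens back into entity spans. The helper name `iob_to_spans` and the example sentence are made up for this illustration and are not part of the original notebook.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A new chunk starts here (a stray I- tag is treated like B-).
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = entity_type, [token]
        elif prefix == "I":
            # Continue the chunk that is currently open.
            current_tokens.append(token)
        else:
            # "O": close any open chunk.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical example sentence with two multi-token entities.
tokens = ["Tony", "Blair", "visited", "New", "York", "."]
tags = ["B-per", "I-per", "O", "B-geo", "I-geo", "O"]
spans = iob_to_spans(tokens, tags)
print(spans)
```

Treating a stray `I-` tag as the start of a chunk is one possible convention; stricter decoders may reject such sequences instead.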

In [151]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

In [152]:
df = pd.read_csv('ner_dataset.csv', encoding = "ISO-8859-1")[:10000]
#df = pd.read_csv('ner_dataset.csv', encoding='latin1')

In [153]:
df.head()

|   | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 | NaN | of | IN | O |
| 2 | NaN | demonstrators | NNS | O |
| 3 | NaN | have | VBP | O |
| 4 | NaN | marched | VBN | O |


In [154]:
df.isnull().sum(axis = 0)

Sentence #    9543
Word             0
POS              0
Tag              0
dtype: int64

In [155]:
#help(df.fillna)

In [156]:
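# Forward-fill the sentence label so every token row carries its sentence ID.
# (`fillna(method='ffill')` is deprecated in recent pandas releases.)
df = df.ffill()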
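
To see what the forward fill does, here is a minimal sketch on a toy frame shaped like the dataset, where only the first token of each sentence carries a label. The frame `toy` and its words are invented for illustration.

```python
import pandas as pd

# Toy frame: only the first token of a sentence is labelled.
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
    "Word": ["Thousands", "of", "demonstrators", "They", "left"],
})

# Propagate the last seen sentence label down to the following rows.
toy["Sentence #"] = toy["Sentence #"].ffill()
print(toy)
```

After the fill, every row can be grouped by its sentence label, which the CRF section later relies on.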

In [157]:
df.head(40)

|   | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 | Sentence: 1 | of | IN | O |
| 2 | Sentence: 1 | demonstrators | NNS | O |
| 3 | Sentence: 1 | have | VBP | O |
| 4 | Sentence: 1 | marched | VBN | O |
| 5 | Sentence: 1 | through | IN | O |
| 6 | Sentence: 1 | London | NNP | B-geo |
| 7 | Sentence: 1 | to | TO | O |
| 8 | Sentence: 1 | protest | VB | O |
| 9 | Sentence: 1 | the | DT | O |


In [158]:
df.isnull().sum()

Sentence #    0
Word          0
POS           0
Tag           0
dtype: int64

In [159]:
df.columns

Index(['Sentence #', 'Word', 'POS', 'Tag'], dtype='object')

In [160]:
df['Sentence #'].nunique(), df['Word'].nunique(), df['Tag'].nunique()

(457, 2746, 17)

In [161]:
df.groupby('Tag').size().reset_index(name = 'counts')

|   | Tag | counts |
|---|---|---|
| 0 | B-art | 28 |
| 1 | B-eve | 10 |
| 2 | B-geo | 244 |
| 3 | B-gpe | 303 |
| 4 | B-nat | 5 |
| 5 | B-org | 176 |
| 6 | B-per | 160 |
| 7 | B-tim | 149 |
| 8 | I-art | 20 |
| 9 | I-eve | 10 |


In [162]:
X = df.drop('Tag', axis=1)
X.head()

|   | Sentence # | Word | POS |
|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS |
| 1 | Sentence: 1 | of | IN |
| 2 | Sentence: 1 | demonstrators | NNS |
| 3 | Sentence: 1 | have | VBP |
| 4 | Sentence: 1 | marched | VBN |


**[`DictVectorizer`](https://stackoverflow.com/questions/27473957/understanding-dictvectorizer-in-scikit-learn)**

In [163]:
v = DictVectorizer(sparse=False)

In [164]:
#help(v)

In [165]:
X.to_dict('records')

[{'Sentence #': 'Sentence: 1', 'Word': 'Thousands', 'POS': 'NNS'},
 {'Sentence #': 'Sentence: 1', 'Word': 'of', 'POS': 'IN'},
 {'Sentence #': 'Sentence: 1', 'Word': 'demonstrators', 'POS': 'NNS'},
 {'Sentence #': 'Sentence: 1', 'Word': 'have', 'POS': 'VBP'},
 {'Sentence #': 'Sentence: 1', 'Word': 'marched', 'POS': 'VBN'},
 {'Sentence #': 'Sentence: 1', 'Word': 'through', 'POS': 'IN'},
 {'Sentence #': 'Sentence: 1', 'Word': 'London', 'POS': 'NNP'},
 {'Sentence #': 'Sentence: 1', 'Word': 'to', 'POS': 'TO'},
 {'Sentence #': 'Sentence: 1', 'Word': 'protest', 'POS': 'VB'},
 {'Sentence #': 'Sentence: 1', 'Word': 'the', 'POS': 'DT'},
 {'Sentence #': 'Sentence: 1', 'Word': 'war', 'POS': 'NN'},
 {'Sentence #': 'Sentence: 1', 'Word': 'in', 'POS': 'IN'},
 {'Sentence #': 'Sentence: 1', 'Word': 'Iraq', 'POS': 'NNP'},
 {'Sentence #': 'Sentence: 1', 'Word': 'and', 'POS': 'CC'},
 {'Sentence #': 'Sentence: 1', 'Word': 'demand', 'POS': 'VB'},
 {'Sentence #': 'Sentence: 1', 'Word': 'the', 'POS': 'DT'},
 

In [166]:
(v.fit_transform(X.to_dict('records'))==0).any()

True

In [167]:
XX = v.fit_transform(X.to_dict('records'))
XX

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [168]:
y = df.Tag.values
y

array(['O', 'O', 'O', ..., 'O', 'O', 'O'], dtype=object)

In [169]:
XX.shape

(10000, 3242)
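
The wide matrix comes from one-hot encoding: `DictVectorizer` creates one binary column per distinct `key=value` pair for string-valued features. That is consistent with the 457 sentence labels and 2746 distinct words counted earlier, with the remaining columns going to POS tags. A minimal sketch on three hand-written records (the variable names here are illustrative only):

```python
from sklearn.feature_extraction import DictVectorizer

records = [
    {"Word": "London", "POS": "NNP"},
    {"Word": "to", "POS": "TO"},
    {"Word": "Iraq", "POS": "NNP"},
]

vec = DictVectorizer(sparse=False)
encoded = vec.fit_transform(records)

# One column per distinct key=value pair, sorted by name.
feature_names = list(vec.feature_names_)
print(feature_names)
print(encoded)
```

Because almost every column is zero for any given token, keeping the default `sparse=True` would likely save memory on the full dataset.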

In [170]:
classes = np.unique(y).tolist()
classes

['B-art',
 'B-eve',
 'B-geo',
 'B-gpe',
 'B-nat',
 'B-org',
 'B-per',
 'B-tim',
 'I-art',
 'I-eve',
 'I-geo',
 'I-gpe',
 'I-nat',
 'I-org',
 'I-per',
 'I-tim',
 'O']

In [171]:
X_train, X_test, y_train, y_test = train_test_split(XX, y, test_size = 0.10, random_state=0)
X_train.shape, y_train.shape

((9000, 3242), (9000,))

In [172]:
y_train

array(['O', 'O', 'O', ..., 'O', 'O', 'O'], dtype=object)

In [173]:
X_test.shape, y_test.shape

((1000, 3242), (1000,))

In [174]:
new_classes = classes.copy()
new_classes.pop()
new_classes

['B-art',
 'B-eve',
 'B-geo',
 'B-gpe',
 'B-nat',
 'B-org',
 'B-per',
 'B-tim',
 'I-art',
 'I-eve',
 'I-geo',
 'I-gpe',
 'I-nat',
 'I-org',
 'I-per',
 'I-tim']
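
Dropping `'O'` from the label list matters because `'O'` dominates the data: in the test split used below, only 153 of 1000 tokens carry an entity tag. The following sketch, on made-up labels, shows how a model that never predicts an entity still looks good when `'O'` is included in the average.

```python
from sklearn.metrics import f1_score

# Made-up labels: 8 of 10 tokens are outside any entity.
y_true = ["O", "O", "O", "O", "O", "O", "O", "O", "B-geo", "B-per"]
y_pred = ["O"] * 10  # a degenerate model that never predicts an entity

# Micro-averaged F1 over every label rewards the majority "O" class.
f1_all = f1_score(y_true, y_pred, average="micro", zero_division=0)

# Restricting the labels to entity tags exposes the failure.
f1_entities = f1_score(
    y_true, y_pred, labels=["B-geo", "B-per"], average="micro", zero_division=0
)
print(f1_all, f1_entities)
```

The `zero_division` argument assumes a reasonably recent scikit-learn release; older versions emit a warning instead.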

### Perceptron

In [176]:
per = Perceptron(verbose=10, n_jobs=-1, max_iter=5)
per.partial_fit(X_train, y_train, classes)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    0.1s
[Parallel(n_jobs=-1)]: Done   4 out of  17 | elapsed:    0.1s remaining:    0.2s
[Parallel(n_jobs=-1)]: Done   6 out of  17 | elapsed:    0.1s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done   8 out of  17 | elapsed:    0.1s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  10 out of  17 | elapsed:    0.1s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  12 out of  17 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  14 out of  17 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done  17 out of  17 | elapsed:    0.2s finished


-- Epoch 1-- Epoch 1-- Epoch 1
-- Epoch 1

-- Epoch 1

-- Epoch 1-- Epoch 1-- Epoch 1


Norm: 10.68, NNZs: 88, Bias: -4.000000, T: 9000, Avg. loss: 0.005222
Total training time: 0.04 seconds.
Norm: 18.57, NNZs: 170, Bias: -3.000000, T: 9000, Avg. loss: 0.010333Norm: 22.89, NNZs: 316, Bias: -4.000000, T: 9000, Avg. loss: 0.028667
Total training time: 0.05 seconds.

Total training time: 0.05 seconds.
Norm: 26.42, NNZs: 400, Bias: -4.000000, T: 9000, Avg. loss: 0.039889Norm: 5.29, NNZs: 20, Bias: -2.000000, T: 9000, Avg. loss: 0.000889
Total training time: 0.05 seconds.
Norm: 20.49, NNZs: 276, Bias: -4.000000, T: 9000, Avg. loss: 0.021556-- Epoch 1
Total training time: 0.05 seconds.

Total training time: 0.05 seconds.
-- Epoch 1-- Epoch 1
Norm: 28.55, NNZs: 435, Bias: -5.000000, T: 9000, Avg. loss: 0.040222-- Epoch 1


Norm: 6.16, NNZs: 35, Bias: -2.000000, T: 9000, Avg. loss: 0.002111-- Epoch 1
Total training time: 0.05 seconds.

-- Epoch 1

Total training time: 0.05 seconds.
-- Epoch 1


Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0,
      fit_intercept=True, max_iter=5, n_iter=None, n_iter_no_change=5,
      n_jobs=-1, penalty=None, random_state=0, shuffle=True, tol=None,
      validation_fraction=0.1, verbose=10, warm_start=False)
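
`partial_fit` takes the full list of classes because a single batch may not contain every label, and the model has to allocate weights for all of them on the first call. A minimal sketch with random features and three batches (all data here is synthetic and only for illustration):

```python
import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.RandomState(0)
X = rng.rand(60, 4)
# The first two batches below contain only "O"; entity tags appear in the third.
y = np.array(["O"] * 40 + ["B-geo"] * 10 + ["B-per"] * 10)
all_classes = np.unique(y)

clf = Perceptron(random_state=0)
# Feed the data in three batches of 20 rows each.
for start in range(0, len(X), 20):
    batch = slice(start, start + 20)
    clf.partial_fit(X[batch], y[batch], classes=all_classes)

print(clf.classes_, clf.coef_.shape)
```

Each call to `partial_fit` makes a single pass over the batch it receives, so a setting such as `max_iter` does not control how many epochs are run here.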

In [177]:
print(classification_report(y_pred=per.predict(X_test), y_true=y_test, labels=new_classes))

              precision    recall  f1-score   support

       B-art       0.50      0.50      0.50         2
       B-eve       0.00      0.00      0.00         1
       B-geo       0.40      0.95      0.56        20
       B-gpe       0.95      0.60      0.74        35
       B-nat       0.00      0.00      0.00         0
       B-org       0.43      0.33      0.38        18
       B-per       0.42      0.42      0.42        12
       B-tim       1.00      0.83      0.90        23
       I-art       0.00      0.00      0.00         0
       I-eve       0.00      0.00      0.00         3
       I-geo       0.00      0.00      0.00         2
       I-gpe       1.00      0.67      0.80         3
       I-nat       0.00      0.00      0.00         0
       I-org       0.75      0.27      0.40        11
       I-per       0.73      0.35      0.47        23
       I-tim       0.00      0.00      0.00         0

   micro avg       0.58      0.55      0.56       153
   macro avg       0.39   



### Linear classifiers with SGD training

In [178]:
sgd = SGDClassifier()
sgd.partial_fit(X_train, y_train, classes)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

In [179]:
print(classification_report(y_pred=sgd.predict(X_test), y_true=y_test, labels=new_classes))

              precision    recall  f1-score   support

       B-art       1.00      0.50      0.67         2
       B-eve       0.00      0.00      0.00         1
       B-geo       0.69      0.45      0.55        20
       B-gpe       0.89      0.49      0.63        35
       B-nat       0.00      0.00      0.00         0
       B-org       1.00      0.22      0.36        18
       B-per       1.00      0.25      0.40        12
       B-tim       1.00      0.83      0.90        23
       I-art       0.00      0.00      0.00         0
       I-eve       0.00      0.00      0.00         3
       I-geo       0.00      0.00      0.00         2
       I-gpe       0.00      0.00      0.00         3
       I-nat       0.00      0.00      0.00         0
       I-org       0.15      0.91      0.26        11
       I-per       0.67      0.17      0.28        23
       I-tim       0.00      0.00      0.00         0

   micro avg       0.48      0.44      0.46       153
   macro avg       0.40   



### Naive Bayes classifier for multinomial models

In [180]:
nb = MultinomialNB(alpha=0.01)
nb.partial_fit(X_train, y_train, classes)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

In [181]:
print(classification_report(y_pred=nb.predict(X_test), y_true=y_test, labels = new_classes))

              precision    recall  f1-score   support

       B-art       0.00      0.00      0.00         2
       B-eve       0.00      0.00      0.00         1
       B-geo       0.80      0.60      0.69        20
       B-gpe       0.68      0.71      0.69        35
       B-nat       0.00      0.00      0.00         0
       B-org       0.50      0.39      0.44        18
       B-per       0.38      0.50      0.43        12
       B-tim       0.79      0.83      0.81        23
       I-art       0.00      0.00      0.00         0
       I-eve       0.33      0.33      0.33         3
       I-geo       0.00      0.00      0.00         2
       I-gpe       0.50      0.67      0.57         3
       I-nat       0.00      0.00      0.00         0
       I-org       0.47      0.82      0.60        11
       I-per       0.67      0.52      0.59        23
       I-tim       0.00      0.00      0.00         0

   micro avg       0.54      0.61      0.57       153
   macro avg       0.32   



### Passive Aggressive Classifier

In [182]:
pa = PassiveAggressiveClassifier()
pa.partial_fit(X_train, y_train, classes)

PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
              early_stopping=False, fit_intercept=True, loss='hinge',
              max_iter=None, n_iter=None, n_iter_no_change=5, n_jobs=None,
              random_state=None, shuffle=True, tol=None,
              validation_fraction=0.1, verbose=0, warm_start=False)

In [183]:
print(classification_report(y_pred=pa.predict(X_test), y_true=y_test, labels=new_classes))

              precision    recall  f1-score   support

       B-art       0.00      0.00      0.00         2
       B-eve       0.00      0.00      0.00         1
       B-geo       1.00      0.05      0.10        20
       B-gpe       1.00      0.54      0.70        35
       B-nat       0.00      0.00      0.00         0
       B-org       0.19      1.00      0.32        18
       B-per       0.60      0.25      0.35        12
       B-tim       0.90      0.78      0.84        23
       I-art       0.00      0.00      0.00         0
       I-eve       0.00      0.00      0.00         3
       I-geo       0.00      0.00      0.00         2
       I-gpe       0.50      0.33      0.40         3
       I-nat       0.00      0.00      0.00         0
       I-org       1.00      0.09      0.17        11
       I-per       0.75      0.13      0.22        23
       I-tim       0.00      0.00      0.00         0

   micro avg       0.43      0.42      0.43       153
   macro avg       0.37   



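None of the above classifiers produced satisfactory results. Each one labels a token in isolation, using only its word, POS tag, and sentence label, so it cannot use the neighbouring tokens or tags. Classifying named entities well takes a model that looks at the sequence.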

### Conditional Random Fields (CRFs)

In [185]:
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

In [187]:
class SentenceGetter(object):
    """Group token rows into sentences of (word, POS, tag) triples."""

    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        # Collect one list of (word, pos, tag) tuples per sentence.
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(),
                                                           s['POS'].values.tolist(),
                                                           s['Tag'].values.tolist())]
        self.grouped = self.data.groupby('Sentence #').apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        # Return the next sentence, or None once the sentence labels run out.
        try:
            s = self.grouped['Sentence: {}'.format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None
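
The grouping step can also be sketched without a class. The example below is an alternative under assumptions, not the notebook's method: it builds the same list of `(word, POS, tag)` triples on an invented four-row frame. Note that `groupby` sorts its keys as strings by default, so `'Sentence: 10'` would come before `'Sentence: 2'`; passing `sort=False` keeps the order of first appearance.

```python
import pandas as pd

# Invented rows with the same columns as the dataset.
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", "Sentence: 1", "Sentence: 2", "Sentence: 2"],
    "Word": ["London", "calling", "Iraq", "votes"],
    "POS": ["NNP", "VBG", "NNP", "VBZ"],
    "Tag": ["B-geo", "O", "B-geo", "O"],
})

# sort=False keeps the sentences in the order they first appear in the frame.
toy_sentences = [
    list(zip(group["Word"], group["POS"], group["Tag"]))
    for _, group in toy.groupby("Sentence #", sort=False)
]
print(toy_sentences)
```

The sentence order does not affect CRF training, but it does matter when results are mapped back to the original text.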

In [188]:
getter = SentenceGetter(df)

In [189]:
sent = getter.get_next()
print(sent)

[('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')]


In [192]:
sentences = getter.sentences
#sentences

In [193]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    
    features = {
        'bias': 1.0, 
        'word.lower()': word.lower(), 
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]
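
To make the shape of these feature dictionaries concrete, here is a reduced stand-in for `word2features` applied to a three-token fragment. The function name `token_features` is hypothetical and it keeps only a handful of the features defined above.

```python
def token_features(sent, i):
    """Simplified stand-in for word2features: current token plus left neighbour."""
    word, postag = sent[i][0], sent[i][1]
    features = {
        "bias": 1.0,
        "word.lower()": word.lower(),
        "word.istitle()": word.istitle(),
        "postag": postag,
    }
    if i > 0:
        # Context feature taken from the previous token.
        features["-1:word.lower()"] = sent[i - 1][0].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i == len(sent) - 1:
        features["EOS"] = True  # end of sentence
    return features


sent = [("through", "IN", "O"), ("London", "NNP", "B-geo"), ("to", "TO", "O")]
london = token_features(sent, 1)
print(london)
```

Unlike the one-hot token vectors used earlier, each dictionary also describes the neighbouring token, which is the context the per-token classifiers lacked.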

In [194]:
X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]

In [195]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=0)

In [196]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=True, averaging=None, c=None, c1=0.1, c2=0.1,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=False)

In [197]:
y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=new_classes)



0.8785603678722904
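
The `flat_*` metrics work on lists of tag sequences. As far as I understand them, they concatenate the sequences into one flat list of tags and then apply the ordinary scikit-learn metric. The sketch below reproduces that idea by hand on two invented sentences; it is an approximation of the behaviour, not the library's own code.

```python
from itertools import chain
from sklearn.metrics import f1_score

# Two invented sentences: gold tags and predicted tags.
y_true_seqs = [["O", "B-geo", "O"], ["B-per", "I-per", "O"]]
y_pred_seqs = [["O", "B-geo", "O"], ["B-per", "O", "O"]]

# Flatten the per-sentence lists into one list of token tags.
flat_true = list(chain.from_iterable(y_true_seqs))
flat_pred = list(chain.from_iterable(y_pred_seqs))

entity_labels = ["B-geo", "B-per", "I-per"]
score = f1_score(
    flat_true, flat_pred, labels=entity_labels, average="weighted", zero_division=0
)
print(score)
```

Here `B-geo` and `B-per` are predicted perfectly and `I-per` is missed, so the support-weighted F1 over the three entity labels is 2/3.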

In [198]:
print(metrics.flat_classification_report(y_test, y_pred, labels = new_classes))

              precision    recall  f1-score   support

       B-art       1.00      1.00      1.00         1
       B-eve       0.00      0.00      0.00         0
       B-geo       0.86      0.71      0.77        17
       B-gpe       0.70      0.82      0.76        17
       B-nat       0.00      0.00      0.00         0
       B-org       0.88      0.94      0.91        16
       B-per       0.95      0.90      0.92        20
       B-tim       1.00      0.90      0.95        10
       I-art       0.00      0.00      0.00         0
       I-eve       0.00      0.00      0.00         0
       I-geo       1.00      0.33      0.50         3
       I-gpe       0.00      0.00      0.00         0
       I-nat       0.00      0.00      0.00         0
       I-org       0.88      0.94      0.91        16
       I-per       0.96      1.00      0.98        23
       I-tim       0.00      0.00      0.00         0

   micro avg       0.88      0.88      0.88       123
   macro avg       0.51   

In [200]:
import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True
)
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

# use the same metric for evaluation
f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=new_classes)

# search
rs = RandomizedSearchCV(crf, params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        n_iter=50,
                        scoring=f1_scorer)
rs.fit(X_train, y_train)

Fitting 3 folds for each of 50 candidates, totalling 150 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   29.8s
[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed:  2.0min finished


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=True, averaging=None, c=None, c1=None, c2=None,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error...e,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=False),
          fit_params=None, iid='warn', n_iter=50, n_jobs=-1,
          param_distributions={'c1': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a23129278>, 'c2': <scipy.stats._distn_infrastructure.rv_frozen object at 0x1a23129358>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn',
          scoring=make_scorer(flat_f1_score, average=weighted, labels=['B-art', 'B-eve', 'B-geo', 'B-gpe', 'B-nat', 'B-org', 'B-per', 'B-tim', 'I-art', 'I-eve', 'I-geo', 'I-gpe', 

In [201]:
print('best params:', rs.best_params_)
print('best CV score:', rs.best_score_)
print('model size: {:0.2f}M'.format(rs.best_estimator_.size_ / 1000000))

best params: {'c1': 0.0472141053461638, 'c2': 0.04182733498716666}
best CV score: 0.7501685817456327
model size: 0.18M
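
The search space draws `c1` and `c2` from exponential distributions, whose mean equals the `scale` argument. The sketch below samples from the same two distributions to show the range of values the search is likely to try; the sample size and seed are arbitrary choices for illustration.

```python
import scipy.stats

# Same distributions as in params_space above.
c1_dist = scipy.stats.expon(scale=0.5)
c2_dist = scipy.stats.expon(scale=0.05)

# Draw reproducible samples to inspect the typical magnitudes.
c1_samples = c1_dist.rvs(size=1000, random_state=0)
c2_samples = c2_dist.rvs(size=1000, random_state=0)

print("c1 mean:", c1_samples.mean(), "max:", c1_samples.max())
print("c2 mean:", c2_samples.mean(), "max:", c2_samples.max())
```

Both distributions put most of their mass near zero, so the search mostly explores weak regularization, with `c2` typically about ten times smaller than `c1`.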


In [203]:
crf = rs.best_estimator_
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred, labels=new_classes))

              precision    recall  f1-score   support

       B-art       1.00      1.00      1.00         1
       B-eve       0.00      0.00      0.00         0
       B-geo       0.92      0.71      0.80        17
       B-gpe       0.68      0.76      0.72        17
       B-nat       0.00      0.00      0.00         0
       B-org       0.79      0.94      0.86        16
       B-per       0.95      0.90      0.92        20
       B-tim       1.00      0.90      0.95        10
       I-art       0.00      0.00      0.00         0
       I-eve       0.00      0.00      0.00         0
       I-geo       1.00      0.33      0.50         3
       I-gpe       0.00      0.00      0.00         0
       I-nat       0.00      0.00      0.00         0
       I-org       0.88      0.94      0.91        16
       I-per       0.96      1.00      0.98        23
       I-tim       0.00      0.00      0.00         0

   micro avg       0.87      0.87      0.87       123
   macro avg       0.51   

In [205]:
crf.transition_features_

{('O', 'O'): 3.71179,
 ('O', 'B-per'): 0.319432,
 ('O', 'I-per'): -1.107723,
 ('O', 'B-org'): 1.217019,
 ('O', 'B-tim'): 1.568112,
 ('O', 'B-geo'): 1.238541,
 ('O', 'B-gpe'): 1.418985,
 ('O', 'I-geo'): -1.113751,
 ('O', 'B-art'): 0.869546,
 ('O', 'I-art'): -1.484602,
 ('O', 'I-org'): -2.357155,
 ('O', 'I-gpe'): -0.992842,
 ('O', 'B-nat'): 0.285523,
 ('O', 'I-nat'): -0.330126,
 ('O', 'I-tim'): -1.707714,
 ('O', 'B-eve'): 0.846647,
 ('O', 'I-eve'): -0.617048,
 ('B-per', 'B-per'): -0.562426,
 ('B-per', 'I-per'): 4.844823,
 ('B-per', 'B-org'): -0.023253,
 ('B-per', 'B-tim'): -0.003791,
 ('B-per', 'B-gpe'): -0.226479,
 ('B-per', 'I-geo'): -0.069592,
 ('B-per', 'I-art'): -0.064572,
 ('B-per', 'I-org'): -0.297481,
 ('I-per', 'O'): 0.000912,
 ('I-per', 'B-per'): -0.528621,
 ('I-per', 'I-per'): 4.602827,
 ('I-per', 'B-org'): -0.062156,
 ('I-per', 'B-tim'): -0.10524,
 ('I-per', 'B-geo'): -0.064542,
 ('I-per', 'B-gpe'): -0.176418,
 ('I-per', 'I-geo'): -0.29862,
 ('I-per', 'I-art'): -0.159879,
 ('

In [204]:
from collections import Counter

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(crf.transition_features_).most_common(20))

print("\nTop unlikely transitions:")
print_transitions(Counter(crf.transition_features_).most_common()[-20:])

Top likely transitions:
B-org  -> I-org   5.072222
I-org  -> I-org   5.047169
B-per  -> I-per   4.844823
B-art  -> I-art   4.660405
I-per  -> I-per   4.602827
B-eve  -> I-eve   4.503007
B-geo  -> I-geo   4.234307
I-art  -> I-art   4.153501
B-gpe  -> I-gpe   4.131273
O      -> O       3.711790
I-gpe  -> I-gpe   3.321807
I-geo  -> I-geo   3.106486
B-tim  -> I-tim   2.594568
I-tim  -> I-tim   2.427676
B-org  -> B-art   2.364817
B-nat  -> I-nat   2.086803
I-eve  -> I-eve   2.078610
B-gpe  -> B-per   1.641248
O      -> B-tim   1.568112
B-geo  -> B-tim   1.468613

Top unlikely transitions:
B-geo  -> I-art   -0.674879
I-org  -> B-geo   -0.695325
B-geo  -> I-org   -0.731665
B-org  -> B-gpe   -0.739327
B-org  -> B-org   -0.796858
B-gpe  -> B-gpe   -0.826489
B-gpe  -> I-geo   -0.874668
B-tim  -> B-gpe   -0.880130
B-geo  -> I-per   -0.928700
B-geo  -> B-geo   -0.969787
B-gpe  -> I-org   -0.986512
O      -> I-gpe   -0.992842
B-org  -> I-per   -1.061117
I-org  -> I-per   -1.080303
O      -> I-per  
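
The ranking above relies on `Counter.most_common`, which sorts dictionary items by value in descending order. A minimal sketch with four weights rounded from the `transition_features_` output shown earlier:

```python
from collections import Counter

# Weights rounded from the transition_features_ output above.
toy_weights = {
    ("B-per", "I-per"): 4.84,
    ("O", "O"): 3.71,
    ("O", "I-per"): -1.11,
    ("O", "I-org"): -2.36,
}

ranked = Counter(toy_weights).most_common()
most_likely = ranked[:2]    # highest weights first
least_likely = ranked[-2:]  # lowest weights last

print(most_likely)
print(least_likely)
```

The pattern matches the IOB scheme: `B-per -> I-per` gets a large positive weight, while transitions from `O` straight to an `I-` tag are penalized because an inside tag should follow a beginning tag.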

In [206]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-8s %s" % (weight, label, attr))

print("Top positive:")
print_state_features(Counter(crf.state_features_).most_common(30))

print("\nTop negative:")
print_state_features(Counter(crf.state_features_).most_common()[-30:])

Top positive:
5.551995 O        bias
4.692316 O        BOS
4.425858 B-tim    word[-3:]:day
4.354333 B-tim    word[-2:]:ay
3.799938 B-tim    word[-2:]:0s
3.679993 B-geo    -1:word.lower():in
3.394786 B-gpe    word.istitle()
2.998959 B-gpe    -1:word.lower():recognize
2.978161 O        postag:NN
2.943617 B-tim    -1:word.lower():in
2.861157 O        -1:word.lower():8,000
2.846302 B-org    word.isupper()
2.804378 B-per    word.lower():sperling
2.670195 B-org    word[-3:]:ban
2.623569 B-tim    word[-3:]:ber
2.511878 O        +1:word.lower():budget
2.477549 B-org    word.lower():halliburton
2.415843 B-geo    +1:word.lower():jury
2.390959 B-gpe    word[-2:]:na
2.328258 B-geo    word.lower():bali
2.311909 B-gpe    postag:JJ
2.286808 O        word.lower():ibero-american
2.282577 O        +1:word.lower():summit
2.252562 B-gpe    word[-3:]:dan
2.250135 B-geo    -1:word.lower():from
2.242296 O        +1:word.lower():men
2.239394 B-org    +1:word.lower():was
2.239227 B-org    -1:word.lower():u.s.


In [208]:
import eli5

eli5.show_weights(crf, top=10)

From \ To,O,B-art,I-art,B-eve,I-eve,B-geo,I-geo,B-gpe,I-gpe,B-nat,I-nat,B-org,I-org,B-per,I-per,B-tim,I-tim
O,3.712,0.87,-1.485,0.847,-0.617,1.239,-1.114,1.419,-0.993,0.286,-0.33,1.217,-2.357,0.319,-1.108,1.568,-1.708
B-art,-0.079,0.0,4.66,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.205,-0.0,0.0
I-art,-0.197,0.0,4.154,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,-0.446,0.013,0.0
B-eve,-0.516,0.0,0.0,0.0,4.503,0.0,0.0,-0.002,0.0,0.0,0.0,0.0,0.0,0.0,-0.079,0.0,0.0
I-eve,0.0,0.0,0.0,0.0,2.079,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B-geo,0.327,0.0,-0.675,0.0,-0.125,-0.97,4.234,1.296,-0.45,0.0,0.0,-0.175,-0.732,-0.289,-0.929,1.469,-0.041
I-geo,-0.045,0.0,0.0,0.0,0.0,0.0,3.106,0.0,0.0,0.0,0.0,0.0,-0.225,0.0,-0.233,0.0,0.0
B-gpe,0.353,0.0,-0.536,0.0,-0.148,-0.439,-0.875,-0.826,4.131,0.0,0.0,0.854,-0.987,1.641,-1.145,-0.479,0.0
I-gpe,-0.149,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.322,0.0,0.0,0.0,0.0,0.0,-0.104,0.0,0.0
B-nat,-0.466,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.087,0.0,0.0,0.0,0.0,0.0,0.0

Weight?,Feature
+5.552,bias
+4.692,BOS
+2.978,postag:NN
+2.861,"-1:word.lower():8,000"
+2.512,+1:word.lower():budget
+2.287,word.lower():ibero-american
+2.283,+1:word.lower():summit
… 271 more positive …,… 271 more positive …
… 149 more negative …,… 149 more negative …
-3.964,word.isdigit()

y=B-art top features
Weight?,Feature
+1.612,word.lower():vioxx
+1.612,word[-3:]:oxx
+1.612,word[-2:]:xx
+1.602,word.lower():jeep
+1.601,word[-3:]:eep
+1.601,word[-2:]:ep
+1.557,word.lower():dodge
+1.525,word.lower():huygens
+1.512,word.lower():chrysler
… 128 more positive …,… 128 more positive …

y=I-art top features
Weight?,Feature
+1.009,-1:word.istitle()
+1.003,+1:postag[:2]:NN
+0.887,+1:word.lower():airport
+0.736,+1:postag:NN
+0.713,+1:word.lower():interview
+0.703,word.lower():mirror
+0.702,word[-3:]:ror
+0.701,-1:word.lower():daily
+0.697,-1:word.lower():nuclear
+0.686,word.lower():non-proliferation

y=B-eve top features
Weight?,Feature
+1.517,+1:word.lower():war
+1.484,word[-3:]:mes
+1.474,word.lower():games
+1.243,postag:NNPS
+1.141,word[-2:]:es
+1.100,+1:word.istitle()
+1.060,+1:word.lower():open
+0.899,-1:word.lower():the
+0.771,+1:word.lower():olympic
+0.753,word.lower():australian

y=I-eve top features
Weight?,Feature
+1.119,word[-2:]:ic
+1.022,word[-3:]:War
+1.021,-1:postag:CD
+1.021,-1:postag[:2]:CD
+1.019,word[-2:]:ar
+1.017,word.lower():war
+0.981,word.lower():open
+0.981,word[-3:]:pen
+0.904,+1:word.lower():in
+0.860,+1:word.lower():mascots

y=B-geo top features
Weight?,Feature
+3.680,-1:word.lower():in
+2.416,+1:word.lower():jury
+2.328,word.lower():bali
+2.250,-1:word.lower():from
+2.203,word.lower():latvia
+1.979,word[-2:]:ta
+1.966,+1:word.lower():before
+1.964,+1:word.lower():conflict
+1.959,-1:word.lower():neighboring
+1.922,word.lower():baltic

y=I-geo top features
Weight?,Feature
+1.883,-1:word.lower():baltic
+1.562,-1:word.lower():west
+1.371,-1:word.lower():new
+1.327,word[-2:]:st
+1.289,-1:word.lower():south
+1.070,word[-3:]:tes
+1.020,word[-3:]:rab
+1.020,word.lower():arab
+1.011,word[-2:]:ab
+0.930,-1:word.lower():frontier

y=B-gpe top features
Weight?,Feature
+3.395,word.istitle()
+2.999,-1:word.lower():recognize
+2.391,word[-2:]:na
+2.312,postag:JJ
+2.253,word[-3:]:dan
+2.228,-1:word.lower():with
+2.227,word.lower():tanzania
+2.153,word.lower():lima
+2.153,word[-3:]:ima
… 311 more positive …,… 311 more positive …

y=I-gpe top features
Weight?,Feature
+1.491,word.lower():republic
+1.305,word[-3:]:can
+1.276,word[-3:]:ngo
+1.276,word.lower():congo
+1.268,word[-2:]:go
+1.233,word.lower():states
+0.980,word[-3:]:tes
+0.975,+1:word.lower():congo
+0.962,-1:word.lower():republic
+0.959,-1:word.lower():united

y=B-nat top features
Weight?,Feature
+2.083,word.lower():h5n1
+2.083,word[-3:]:5N1
+2.083,word[-2:]:N1
+1.544,+1:word.lower():jing
+1.427,word.lower():jing
+1.285,word.isupper()
+1.174,word[-3:]:ing
+1.084,word[-2:]:ng
+0.866,+1:postag[:2]:NN
+0.574,-1:word.lower():the

y=I-nat top features
Weight?,Feature
+1.491,-1:word.lower():jing
+1.476,word.lower():jing
+1.152,word[-2:]:ng
+1.149,word[-3:]:ing
+0.270,-1:postag:NNP
+0.149,+1:word.lower():'s
+0.149,+1:postag[:2]:PO
+0.149,+1:postag:POS
+0.144,-1:word.istitle()
… 4 more positive …,… 4 more positive …

y=B-org top features
Weight?,Feature
+2.846,word.isupper()
+2.670,word[-3:]:ban
+2.478,word.lower():halliburton
+2.239,+1:word.lower():was
+2.239,-1:word.lower():u.s.
+2.114,word.lower():senate
+2.093,word.lower():merck
+2.093,word[-3:]:rck
+2.078,+1:word.lower():military
+1.975,+1:word.lower():militant

y=I-org top features
Weight?,Feature
+1.389,word.lower():nations
+1.362,word[-3:]:ion
+1.356,+1:word.lower():fighters
+1.328,word[-2:]:ce
+1.295,-1:word.lower():american
+1.249,word[-3:]:ger
+1.237,word[-2:]:ty
+1.227,-1:word.lower():qaida
+1.185,-1:postag[:2]:IN
+1.185,-1:postag:IN

y=B-per top features
Weight?,Feature
+2.804,word.lower():sperling
+2.162,+1:word.lower():administration
+2.105,word[-2:]:am
+2.023,word.lower():ramda
+2.023,word[-3:]:mda
+1.955,word.lower():obama
+1.865,+1:word.lower():paul
+1.752,word[-3:]:yam
+1.752,word.lower():khayam
… 208 more positive …,… 208 more positive …

y=I-per top features
Weight?,Feature
+1.321,-1:word.lower():president
+1.270,word[-2:]:in
+1.181,-1:word.lower():mr.
+1.118,+1:postag:VBD
+1.064,-1:postag:NNP
+0.995,word[-2:]:am
+0.894,postag:NNP
+0.828,word[-2:]:er
… 167 more positive …,… 167 more positive …
… 6 more negative …,… 6 more negative …

y=B-tim top features
Weight?,Feature
+4.426,word[-3:]:day
+4.354,word[-2:]:ay
+3.800,word[-2:]:0s
+2.944,-1:word.lower():in
+2.624,word[-3:]:ber
+2.157,word[-3:]:uly
+2.157,word.lower():july
+2.157,word.lower():january
+2.096,word.isdigit()
+2.063,word[-3:]:ust

y=I-tim top features
Weight?,Feature
+1.999,word.isdigit()
+1.475,postag[:2]:CD
+1.475,postag:CD
+0.919,-1:word.lower():july
+0.768,+1:postag:CD
+0.768,+1:postag[:2]:CD
+0.695,-1:word.lower():25
+0.675,+1:word.lower():1995
+0.564,+1:word.lower():attack
… 32 more positive …,… 32 more positive …


In [209]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=200,
    c2=0.1,
    max_iterations=20,
    all_possible_transitions=False,
)
crf.fit(X_train, y_train)
eli5.show_weights(crf, top=10)

From \ To,O,B-art,I-art,B-eve,I-eve,B-geo,I-geo,B-gpe,I-gpe,B-nat,I-nat,B-org,I-org,B-per,I-per,B-tim,I-tim
O,0.822,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B-art,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
I-art,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B-eve,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
I-eve,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B-geo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
I-geo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B-gpe,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
I-gpe,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B-nat,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

y=O top features
Weight?,Feature
+4.185,bias
-1.707,word.istitle()
-2.395,postag:NNP

y=B-gpe top features
Weight?,Feature
+0.193,word.istitle()


In [210]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train);
eli5.show_weights(crf, top=5, show=['transition_features'])

From \ To,O,B-art,I-art,B-eve,I-eve,B-geo,I-geo,B-gpe,I-gpe,B-nat,I-nat,B-org,I-org,B-per,I-per,B-tim,I-tim
O,3.454,0.751,-1.376,0.766,-0.674,1.141,-1.019,1.402,-0.948,0.239,-0.349,1.257,-2.226,0.405,-0.954,1.568,-1.5
B-art,0.0,0.0,4.063,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.102,-0.004,0.0
I-art,-0.082,0.0,3.63,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.025,-0.014,-0.294,0.001,0.0
B-eve,-0.566,0.0,0.0,0.0,4.089,0.0,0.0,-0.002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
I-eve,0.0,0.0,0.0,0.0,1.807,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B-geo,0.435,0.0,-0.554,0.0,-0.14,-0.668,3.727,0.801,-0.442,0.0,0.0,-0.147,-0.63,-0.198,-0.85,1.19,-0.047
I-geo,0.0,0.0,0.0,0.0,0.0,0.0,2.7,0.0,0.0,0.0,0.0,0.0,-0.16,0.0,-0.191,0.0,0.0
B-gpe,0.297,0.0,-0.471,0.0,-0.134,-0.393,-0.679,-0.712,3.553,0.0,0.0,0.869,-0.79,1.493,-0.827,-0.362,0.0
I-gpe,-0.155,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.788,0.0,0.0,0.0,0.0,0.0,-0.016,0.0,0.0
B-nat,-0.423,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.791,0.0,0.0,0.0,0.0,0.0,0.0
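
One way to read the transition table above is to ask, for a given previous tag, which next tag carries the largest learned weight. The sketch below copies three rows from the table into plain dictionaries and looks that up. The helper names `row` and `best_next_tag` are illustrative only and are not part of eli5 or sklearn-crfsuite.

```python
# Tag order used by the columns of the transition table.
TAGS = ["O", "B-art", "I-art", "B-eve", "I-eve", "B-geo", "I-geo", "B-gpe",
        "I-gpe", "B-nat", "I-nat", "B-org", "I-org", "B-per", "I-per",
        "B-tim", "I-tim"]


def row(*weights):
    """Pair one table row with the column tags (hypothetical helper)."""
    assert len(weights) == len(TAGS)
    return dict(zip(TAGS, weights))


# Three rows copied from the transition table above.
transitions = {
    "O": row(3.454, 0.751, -1.376, 0.766, -0.674, 1.141, -1.019, 1.402, -0.948,
             0.239, -0.349, 1.257, -2.226, 0.405, -0.954, 1.568, -1.5),
    "B-geo": row(0.435, 0.0, -0.554, 0.0, -0.14, -0.668, 3.727, 0.801, -0.442,
                 0.0, 0.0, -0.147, -0.63, -0.198, -0.85, 1.19, -0.047),
    "B-gpe": row(0.297, 0.0, -0.471, 0.0, -0.134, -0.393, -0.679, -0.712, 3.553,
                 0.0, 0.0, 0.869, -0.79, 1.493, -0.827, -0.362, 0.0),
}


def best_next_tag(prev_tag):
    """Return the destination tag with the highest transition weight."""
    weights = transitions[prev_tag]
    return max(weights, key=weights.get)


for prev_tag in transitions:
    print(prev_tag, "->", best_next_tag(prev_tag))

# Split the O row into "O -> B-*" and "O -> I-*" weights.
o_to_begin = {t: w for t, w in transitions["O"].items() if t.startswith("B-")}
o_to_inside = {t: w for t, w in transitions["O"].items() if t.startswith("I-")}
print(all(w > 0 for w in o_to_begin.values()))
print(all(w < 0 for w in o_to_inside.values()))
```

In these rows the strongest transition out of `B-geo` is to `I-geo` and out of `B-gpe` is to `I-gpe`, while every `O` to `I-*` weight is negative and every `O` to `B-*` weight is positive. This is consistent with the IOB scheme, where an inside tag should not directly follow an outside tag.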


In [211]:
eli5.show_weights(crf, top=10, targets=['O', 'B-org', 'I-per'])

From \ To,O,B-org,I-per
O,3.454,1.257,-0.954
B-org,-0.122,-0.59,-0.768
I-per,0.019,-0.014,4.355

y=O top features
Weight?,Feature
+4.902,bias
+4.310,BOS
+2.498,postag:NN
+2.089,"-1:word.lower():8,000"
+1.898,+1:word.lower():budget
+1.845,+1:word.lower():summit
+1.784,word.lower():ibero-american
… 241 more positive …,… 241 more positive …
… 139 more negative …,… 139 more negative …
-3.486,word.isdigit()

y=B-org top features
Weight?,Feature
+2.435,word.isupper()
+2.210,word[-3:]:ban
+1.933,word.lower():halliburton
+1.871,-1:word.lower():u.s.
+1.816,+1:word.lower():was
+1.659,word.lower():merck
+1.659,word[-3:]:rck
+1.645,word.lower():senate
+1.546,word.lower():taleban
+1.520,word.lower():democrats

y=I-per top features
Weight?,Feature
+1.203,-1:word.lower():president
+1.089,word[-2:]:in
+1.065,-1:postag:NNP
+1.016,-1:word.lower():mr.
+0.975,+1:postag:VBD
+0.761,postag:NNP
+0.760,word[-2:]:am
+0.738,word[-2:]:er
… 159 more positive …,… 159 more positive …
… 4 more negative …,… 4 more negative …


In [212]:
# Raw string so that "\." reaches the regex engine as an escaped dot
eli5.show_weights(crf, top=10, feature_re=r'^word\.is',
                  horizontal_layout=False, show=['targets'])

y=O top features
Weight?,Feature
-1.373,word.isupper()
-3.486,word.isdigit()
-4.186,word.istitle()

y=B-art top features
Weight?,Feature
-0.245,word.istitle()
-0.390,word.isupper()

y=I-art top features
Weight?,Feature
+0.320,word.isdigit()
+0.000,word.isupper()

y=B-eve top features
Weight?,Feature
+0.409,word.isdigit()

y=I-eve top features
Weight?,Feature
+0.465,word.isupper()

y=B-geo top features
Weight?,Feature
+1.509,word.istitle()
-0.071,word.isdigit()

y=I-geo top features
Weight?,Feature
+0.085,word.istitle()

y=B-gpe top features
Weight?,Feature
+2.907,word.istitle()
+0.564,word.isupper()

y=I-gpe top features
Weight?,Feature
+0.754,word.istitle()

y=B-nat top features
Weight?,Feature
+1.160,word.isupper()

y=I-nat top features
Weight?,Feature
-0.002,word.istitle()

y=B-org top features
Weight?,Feature
+2.435,word.isupper()
-0.089,word.isdigit()
-0.565,word.istitle()

y=I-org top features
Weight?,Feature
-0.006,word.istitle()

y=B-per top features
Weight?,Feature
+0.619,word.istitle()
-0.649,word.isupper()

y=I-per top features
Weight?,Feature
+0.236,word.istitle()
-0.195,word.isupper()

y=B-tim top features
Weight?,Feature
+1.898,word.isdigit()
-0.075,word.istitle()
-0.142,word.isupper()

y=I-tim top features
Weight?,Feature
+1.687,word.isdigit()
-1.089,word.istitle()
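
The `feature_re` filter used above keeps only features whose names match the pattern. As a rough illustration, the sketch below applies the same pattern with Python's `re` module to a handful of feature names taken from the earlier outputs. It assumes the filter behaves like `re.search` on the feature name, which is an assumption about eli5's matching rather than something shown in this notebook.

```python
import re

# A few feature names that appear in the weight tables of this notebook.
feature_names = [
    "bias",
    "BOS",
    "postag:NNP",
    "word.lower():war",
    "word.isupper()",
    "word.isdigit()",
    "word.istitle()",
    "-1:word.istitle()",
]

# Same pattern as in the cell above, written as a raw string.
pattern = re.compile(r"^word\.is")

# Keep names where the pattern is found (assumed re.search semantics).
kept = [name for name in feature_names if pattern.search(name)]
print(kept)
```

Because of the `^` anchor, context features such as `-1:word.istitle()` are filtered out, which matches the tables above: they list only `word.isupper()`, `word.isdigit()` and `word.istitle()`.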
