# Named Entity Recognition with Conditional Random Fields

One of the classic challenges of Natural Language Processing is sequence labelling. In sequence labelling, the goal is to label each word in a text with a word class. In part-of-speech tagging, these word classes are parts of speech, such as noun or verb. In named entity recognition (NER), they're types of generic named entities, such as locations, people or organizations, or more specialized entities, such as diseases or symptoms in the healthcare domain. In this way, sequence labelling can help us extract the most important information from a text and improve the performance of analytics, search or matching applications. 

In this notebook we'll explore Conditional Random Fields, the most popular approach to sequence labelling before Deep Learning arrived. Deep Learning may get all the attention right now, but Conditional Random Fields are still a powerful tool to build a simple sequence labeller. 

The tool we're going to use is `sklearn-crfsuite`. This is a wrapper around `python-crfsuite`, which itself is a Python binding of [CRFSuite](http://www.chokkan.org/software/crfsuite/). The reason we're using `sklearn-crfsuite` is that it provides a number of handy utility functions, for example for evaluating the output of the model. You can install it with `pip install sklearn-crfsuite`.

## Data

First we get some data. A well-known data set for training and testing NER models is the CoNLL-2002 data, which has Spanish and Dutch texts labelled with four types of entities: locations (LOC), persons (PER), organizations (ORG) and miscellaneous entities (MISC). Both corpora are split into three portions: a training portion and two smaller test portions, one of which we'll use as development data. It's easy to collect the data from NLTK.

In [1]:
import nltk
import sklearn
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer
import sklearn_crfsuite as crfsuite
from sklearn_crfsuite import metrics

In [2]:
# If the corpus is not installed yet, run nltk.download("conll2002") once first.
train_sents = list(nltk.corpus.conll2002.iob_sents("ned.train"))
dev_sents = list(nltk.corpus.conll2002.iob_sents("ned.testa"))
test_sents = list(nltk.corpus.conll2002.iob_sents("ned.testb"))

The data consists of a list of tokenized sentences. For each of the tokens we have the string itself, its part-of-speech tag and its entity tag, which follows the BIO convention. In the deep learning world we live in today, it's common to ignore the part-of-speech tags. However, since CRFs rely on good feature extraction, we'll gladly make use of this information. After all, the part of speech of a word tells us a lot about its possible status as a named entity: nouns will more often be entities than verbs, for example.

In [3]:
train_sents[0]

[('De', 'Art', 'O'),
 ('tekst', 'N', 'O'),
 ('van', 'Prep', 'O'),
 ('het', 'Art', 'O'),
 ('arrest', 'N', 'O'),
 ('is', 'V', 'O'),
 ('nog', 'Adv', 'O'),
 ('niet', 'Adv', 'O'),
 ('schriftelijk', 'Adj', 'O'),
 ('beschikbaar', 'Adj', 'O'),
 ('maar', 'Conj', 'O'),
 ('het', 'Art', 'O'),
 ('bericht', 'N', 'O'),
 ('werd', 'V', 'O'),
 ('alvast', 'Adv', 'O'),
 ('bekendgemaakt', 'V', 'O'),
 ('door', 'Prep', 'O'),
 ('een', 'Art', 'O'),
 ('communicatiebureau', 'N', 'O'),
 ('dat', 'Conj', 'O'),
 ('Floralux', 'N', 'B-ORG'),
 ('inhuurde', 'V', 'O'),
 ('.', 'Punc', 'O')]
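
To make the BIO convention concrete: a `B-` tag opens an entity, an `I-` tag continues it, and `O` marks tokens outside any entity. The sketch below groups tagged tokens into entity spans. The helper name `bio_to_spans` is hypothetical; it is not part of NLTK or `sklearn-crfsuite`, and the short input is an excerpt of the sentence above.

```python
def bio_to_spans(sent):
    """Group (token, postag, label) triples into (entity_type, text) spans.

    Hypothetical helper for illustration; it is not part of NLTK or
    sklearn-crfsuite.
    """
    spans = []
    current_type, current_tokens = None, []
    for token, _postag, label in sent:
        # An I- tag only continues an open entity of the same type.
        if label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
            continue
        # Any other tag closes the entity that was open, if there was one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        # B- opens a new entity; a stray I- is treated as an opening tag too.
        if label != "O":
            current_type, current_tokens = label[2:], [token]
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


example = [
    ("dat", "Conj", "O"),
    ("Floralux", "N", "B-ORG"),
    ("inhuurde", "V", "O"),
    (".", "Punc", "O"),
]
print(bio_to_spans(example))
```

Treating a stray `I-` tag as the start of an entity is a design choice of this sketch; a stricter variant could raise an error instead.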

## Feature Extraction

Whereas today neural networks are expected to learn the relevant features of the input texts themselves, this is very different with Conditional Random Fields. CRFs learn the relationship between the features we give them and the label of a token in a given context. They're not going to learn these features themselves. Instead, the quality of the model will depend highly on the relevance of the features we show it.

The most important method in this tutorial is therefore the one that collects the features for every token. What information could be useful? The word itself, of course, together with its part-of-speech tag. It can also be interesting to know whether the word is completely uppercase, whether it starts with a capital or is a digit. In addition, we take a look at the character bigram and trigram the word ends with. We also give every token a `bias` feature, which always has the same value. This bias feature helps the CRF learn the relative frequency of each label type in the training data.

To give the CRF more information about the meaning of a word, we also introduce information from word embeddings. In our [Word Embedding notebook](https://github.com/nlptown/nlp-notebooks/blob/master/An%20Introduction%20to%20Word%20Embeddings.ipynb), we trained word embeddings on Dutch Wikipedia and clustered them in 500 clusters. Here we'll read these 500 clusters from a file, and map each word to the id of the cluster it is in. This is really useful for Named Entity Recognition, as most entity types cluster together. This allows CRFs to generalize above the word level. For example, when the CRF encounters a word it has never seen (say, *Albania*), it can base its decision on the cluster the word is in. If this cluster contains many other entities the CRF has met in its training data (say, *Italy*, *Germany* and *France*), it will have learnt a strong link between this cluster and a specific entity type. As a result, it can still assign that entity type to the unknown word. In our experiments, this feature alone boosts the performance by around 3%.
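
The generalization argument can be illustrated with a toy lookup. Everything in this sketch is made up for the illustration: the cluster ids, the tiny vocabulary, and the helper name `cluster_feature`. The `word.cluster=0` fallback for words without a cluster is also an assumption of the sketch.

```python
# Hypothetical cluster ids; the real ones come from the clusters file.
toy_word2cluster = {
    "italy": "42",
    "germany": "42",
    "france": "42",
    "albania": "42",
    "run": "7",
}

# Words the CRF is assumed to have seen during training.
training_vocabulary = {"italy", "germany", "france", "run"}


def cluster_feature(word, word2cluster):
    """Return the cluster feature string for a word, with "0" as fallback id."""
    return "word.cluster=%s" % word2cluster.get(word.lower(), "0")


unseen = "Albania"
print(unseen.lower() in training_vocabulary)
print(cluster_feature(unseen, toy_word2cluster))
print(cluster_feature("Italy", toy_word2cluster))
```

The unseen word produces the same cluster feature as the country names from the training vocabulary, so any weight the CRF has learnt for that feature applies to it as well.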

Finally, apart from the token itself, we also want the CRF to look at its context. More specifically, we're going to give it some extra information about the two words to the left and the right of the target word. We'll tell the CRF what these words are, whether they start with a capital or are completely uppercase, and give it their part-of-speech tag. If there is no left or right context, we'll inform the CRF that the token is at the beginning or end of the sentence (`BOS` or `EOS`).

In [4]:
def read_clusters(cluster_file):
    word2cluster = {}
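    # The file holds Dutch words; we assume here that it is UTF-8 encoded.
    with open(cluster_file, encoding="utf-8") as cluster_lines:
        for line in cluster_lines:
            word, cluster = line.strip().split("\t")
            word2cluster[word] = cluster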
    return word2cluster


def word2features(sent, i, word2cluster):
    word = sent[i][0]
    postag = sent[i][1]
    features = [
        "bias",
        "word.lower=" + word.lower(),
        "word[-3:]=" + word[-3:],
        "word[-2:]=" + word[-2:],
        "word.isupper=%s" % word.isupper(),
        "word.istitle=%s" % word.istitle(),
        "word.isdigit=%s" % word.isdigit(),
        # Words missing from the cluster file get the fallback cluster id "0".
        "word.cluster=%s" % word2cluster.get(word.lower(), "0"),
        "postag=" + postag,
    ]
    if i > 0:
        word1 = sent[i - 1][0]
        postag1 = sent[i - 1][1]
        features.extend(
            [
                "-1:word.lower=" + word1.lower(),
                "-1:word.istitle=%s" % word1.istitle(),
                "-1:word.isupper=%s" % word1.isupper(),
                "-1:postag=" + postag1,
            ]
        )
    else:
        features.append("BOS")

    if i > 1:
        word2 = sent[i - 2][0]
        postag2 = sent[i - 2][1]
        features.extend(
            [
                "-2:word.lower=" + word2.lower(),
                "-2:word.istitle=%s" % word2.istitle(),
                "-2:word.isupper=%s" % word2.isupper(),
                "-2:postag=" + postag2,
            ]
        )

    if i < len(sent) - 1:
        word1 = sent[i + 1][0]
        postag1 = sent[i + 1][1]
        features.extend(
            [
                "+1:word.lower=" + word1.lower(),
                "+1:word.istitle=%s" % word1.istitle(),
                "+1:word.isupper=%s" % word1.isupper(),
                "+1:postag=" + postag1,
            ]
        )
    else:
        features.append("EOS")

    if i < len(sent) - 2:
        word2 = sent[i + 2][0]
        postag2 = sent[i + 2][1]
        features.extend(
            [
                "+2:word.lower=" + word2.lower(),
                "+2:word.istitle=%s" % word2.istitle(),
                "+2:word.isupper=%s" % word2.isupper(),
                "+2:postag=" + postag2,
            ]
        )

    return features


def sent2features(sent, word2cluster):
    return [word2features(sent, i, word2cluster) for i in range(len(sent))]


def sent2labels(sent):
    return [label for token, postag, label in sent]


def sent2tokens(sent):
    return [token for token, postag, label in sent]


word2cluster = read_clusters("data/embeddings/clusters_nl.tsv")

In [5]:
sent2features(train_sents[0], word2cluster)[0]

['bias',
 'word.lower=de',
 'word[-3:]=De',
 'word[-2:]=De',
 'word.isupper=False',
 'word.istitle=True',
 'word.isdigit=False',
 'word.cluster=38',
 'postag=Art',
 'BOS',
 '+1:word.lower=tekst',
 '+1:word.istitle=False',
 '+1:word.isupper=False',
 '+1:postag=N',
 '+2:word.lower=van',
 '+2:word.istitle=False',
 '+2:word.isupper=False',
 '+2:postag=Prep']

In [6]:
X_train = [sent2features(s, word2cluster) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_dev = [sent2features(s, word2cluster) for s in dev_sents]
y_dev = [sent2labels(s) for s in dev_sents]

X_test = [sent2features(s, word2cluster) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

## Training

We now create a CRF model and train it. We'll use the standard [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS) algorithm for our parameter estimation and run it for 100 iterations. When we're done, we save the model with `joblib`.

In [7]:
crf = crfsuite.CRF(verbose="true", algorithm="lbfgs", max_iterations=100)

crf.fit(X_train, y_train, X_dev=X_dev, y_dev=y_dev)

loading training data to CRFsuite: 100%|██████████| 15806/15806 [00:02<00:00, 7623.17it/s]
loading dev data to CRFsuite: 100%|██████████| 2895/2895 [00:00<00:00, 7186.08it/s]



Holdout group: 2

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 152117
Seconds required: 0.424

L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=0.37  loss=104214.83 active=152117 precision=0.100  recall=0.111  F1=0.105  Acc(item/seq)=0.901 0.496  feature_norm=1.00
Iter 2   time=0.21  loss=96997.81 active=152117 precision=0.100  recall=0.111  F1=0.105  Acc(item/seq)=0.901 0.496  feature_norm=1.13
Iter 3   time=0.21  loss=92085.38 active=152117 precision=0.100  recall=0.111  F1=0.105  Acc(item/seq)=0.901 0.496  feature_norm=1.26
Iter 4   time=0.21  loss=84277.67 active=152117 precision=0.100  recall=0.111  F1=0.105  Acc(item/seq)=0.901 0.496  feature_norm=1.51
Iter 5   time=0.21  loss=67577.53 active=15

Iter 63  time=0.21  loss=10458.88 active=152117 precision=0.783  recall=0.717  F1=0.744  Acc(item/seq)=0.970 0.788  feature_norm=45.51
Iter 64  time=0.21  loss=10420.78 active=152117 precision=0.763  recall=0.711  F1=0.730  Acc(item/seq)=0.969 0.786  feature_norm=45.67
Iter 65  time=0.21  loss=10315.28 active=152117 precision=0.766  recall=0.721  F1=0.735  Acc(item/seq)=0.969 0.786  feature_norm=46.30
Iter 66  time=0.21  loss=10204.10 active=152117 precision=0.769  recall=0.728  F1=0.740  Acc(item/seq)=0.970 0.786  feature_norm=47.10
Iter 67  time=0.21  loss=10134.54 active=152117 precision=0.769  recall=0.716  F1=0.737  Acc(item/seq)=0.970 0.787  feature_norm=47.87
Iter 68  time=0.21  loss=10095.10 active=152117 precision=0.773  recall=0.718  F1=0.741  Acc(item/seq)=0.970 0.787  feature_norm=47.85
Iter 69  time=0.21  loss=10059.52 active=152117 precision=0.773  recall=0.715  F1=0.738  Acc(item/seq)=0.970 0.790  feature_norm=47.81
Iter 70  time=0.21  loss=10012.53 active=152117 precisi

CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=None, averaging=None, c=None, c1=None, c2=None,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose='true')

In [8]:
import joblib
import os

OUTPUT_PATH = "models/ner/"
OUTPUT_FILE = "crf_model"

# makedirs also creates the parent "models" directory if it is missing.
os.makedirs(OUTPUT_PATH, exist_ok=True)

joblib.dump(crf, os.path.join(OUTPUT_PATH, OUTPUT_FILE))

['models/ner/crf_model']

## Evaluation

Let's evaluate the output of our CRF. We'll load the model from the output file above and have it predict labels for the full test set.

As a sanity check, let's take a look at its predictions for the first test sentence. This output looks pretty good: the CRF predicts all four locations in the sentence correctly. Its only error is the person entity *Der Kaiser*, which it labels as `MISC`. That is a strange case anyway, because it is not actually a person name.

In [9]:
crf = joblib.load(os.path.join(OUTPUT_PATH, OUTPUT_FILE))
y_pred = crf.predict(X_test)

example_sent = test_sents[0]

print("Sentence:", " ".join(sent2tokens(example_sent)))
print(
    "Predicted:", " ".join(crf.predict([sent2features(example_sent, word2cluster)])[0])
)
print("Correct:  ", " ".join(sent2labels(example_sent)))

Sentence: Dat is in Italië , Spanje of Engeland misschien geen probleem , maar volgens ' Der Kaiser ' in Duitsland wel .
Predicted: O O O B-LOC O B-LOC O B-LOC O O O O O O O B-MISC I-MISC O O B-LOC O O
Correct:   O O O B-LOC O B-LOC O B-LOC O O O O O O O B-PER I-PER O O B-LOC O O
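
For longer sentences it is easier to list only the positions where prediction and gold label disagree. The following sketch does that for the three lines printed above, which are copied in as plain strings so that it runs without the model.

```python
tokens = (
    "Dat is in Italië , Spanje of Engeland misschien geen probleem , "
    "maar volgens ' Der Kaiser ' in Duitsland wel ."
).split()
predicted = (
    "O O O B-LOC O B-LOC O B-LOC O O O O O O O B-MISC I-MISC O O B-LOC O O"
).split()
correct = (
    "O O O B-LOC O B-LOC O B-LOC O O O O O O O B-PER I-PER O O B-LOC O O"
).split()

# Keep only the positions where the prediction differs from the gold label.
mismatches = [
    (token, pred, gold)
    for token, pred, gold in zip(tokens, predicted, correct)
    if pred != gold
]

for token, pred, gold in mismatches:
    print(f"{token}: predicted {pred}, correct {gold}")
```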


Now we evaluate on the full test set. We'll print out a classification report for all labels except `O`. If we were to include `O`, which far outnumbers the entity labels in our data, the average scores would be inflated artificially, simply because there's an inherently high probability that the `O` labels from our CRF are correct. We obtain a micro-averaged F1-score of 77% across all entity labels, with particularly good results for `B-LOC`, `B-PER` and `I-PER`.

In [10]:
labels = list(crf.classes_)
labels.remove("O")
y_pred = crf.predict(X_test)
sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))

print(metrics.flat_classification_report(y_test, y_pred, labels=sorted_labels))

              precision    recall  f1-score   support

       B-LOC       0.83      0.83      0.83       774
       I-LOC       0.29      0.41      0.34        49
      B-MISC       0.84      0.61      0.71      1187
      I-MISC       0.59      0.42      0.49       410
       B-ORG       0.80      0.69      0.74       882
       I-ORG       0.74      0.66      0.70       551
       B-PER       0.80      0.90      0.85      1098
       I-PER       0.87      0.95      0.91       807

   micro avg       0.80      0.74      0.77      5758
   macro avg       0.72      0.68      0.70      5758
weighted avg       0.80      0.74      0.76      5758
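
To see how much the `O` label can inflate an average, here is a small sketch on made-up data: 20 tokens, of which 18 are `O`, and a model that finds one of the two entities. The `micro_f1` function is a simplified stand-in written for this illustration, not the implementation behind `flat_classification_report`.

```python
def micro_f1(y_true, y_pred, labels):
    """Micro-averaged F1 over the given labels, computed on flat token lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == p and t in labels)
    fp = sum(1 for t, p in pairs if t != p and p in labels)
    fn = sum(1 for t, p in pairs if t != p and t in labels)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: the model finds the location but misses the person.
gold = ["O"] * 18 + ["B-LOC", "B-PER"]
pred = ["O"] * 18 + ["B-LOC", "O"]

with_o = micro_f1(gold, pred, labels={"O", "B-LOC", "B-PER"})
without_o = micro_f1(gold, pred, labels={"B-LOC", "B-PER"})

print(f"micro F1 including O: {with_o:.3f}")
print(f"micro F1 excluding O: {without_o:.3f}")
```

The model misses half of the entities, yet the score that includes `O` stays close to 1, which is why the report above leaves that label out.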



Now we can also look at the most likely transitions the CRF has identified, and at the top features for every label. We'll do this with the `eli5` library, which helps us explain the predictions of machine learning models.

The top transitions are quite intuitive: the most likely transitions are those within the same entity type (from a B-label or I-label to an I-label of the same type), and those where a B-label follows an O-label.

The features, too, make sense. For example, if a word does not start with an uppercase letter, it is unlikely to be an entity. By contrast, a word is very likely to be a location if it ends in `ië`, which is indeed a very common suffix for locations in Dutch. Notice also how informative the embedding clusters are: for all entity types, the word clusters form some of the most informative features for the CRF. 

In [11]:
import eli5

eli5.show_weights(crf, top=30)

From \ To,O,B-LOC,I-LOC,B-MISC,I-MISC,B-ORG,I-ORG,B-PER,I-PER
O,4.141,4.583,0.0,4.141,0.0,4.366,0.0,3.819,0.0
B-LOC,-0.248,-0.279,7.101,0.0,0.0,0.0,0.0,-0.661,0.0
I-LOC,-1.062,-0.235,5.967,0.0,0.0,0.0,0.0,0.0,0.0
B-MISC,-0.985,0.655,0.0,-0.316,7.73,0.551,0.0,0.46,0.0
I-MISC,-1.781,0.0,0.0,-0.382,7.769,1.145,0.0,-0.719,0.0
B-ORG,-0.261,0.0,0.0,-0.809,0.0,0.0,7.803,0.106,0.0
I-ORG,-0.794,0.0,0.0,0.0,0.0,0.0,7.174,0.084,0.0
B-PER,0.31,-0.346,0.0,-0.611,0.0,0.0,0.0,-1.408,8.68
I-PER,0.104,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.804
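
The pattern is easy to verify by hand. The sketch below copies a handful of weights from the transition table above into a plain dictionary (a subset chosen for illustration, not the full matrix) and ranks them from most to least likely.

```python
# (from_label, to_label) -> weight, copied from the transition table.
transition_weights = {
    ("O", "O"): 4.141,
    ("O", "B-LOC"): 4.583,
    ("O", "B-PER"): 3.819,
    ("B-LOC", "O"): -0.248,
    ("B-LOC", "I-LOC"): 7.101,
    ("B-PER", "O"): 0.31,
    ("B-PER", "B-PER"): -1.408,
    ("B-PER", "I-PER"): 8.68,
    ("I-PER", "I-PER"): 6.804,
}

# Sort the transitions by weight, highest first.
ranked = sorted(transition_weights.items(), key=lambda item: item[1], reverse=True)

for (source, target), weight in ranked:
    print(f"{source:>6} -> {target:<6} {weight:+.3f}")
```

In this subset, the three strongest transitions all stay within one entity type, while a person directly followed by another person (`B-PER` to `B-PER`) gets the lowest weight.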

Top features for label O:
Weight?,Feature
+3.692,word.istitle=False
+3.312,word.isupper=False
+2.211,"-1:word.lower="""
+1.999,-1:word.lower=+
+1.835,EOS
+1.820,BOS
+1.740,word.cluster=158
+1.684,word.cluster=415
+1.590,word[-2:]=ag
+1.588,word.cluster=195

Top features for label B-LOC:
Weight?,Feature
+3.734,word.cluster=325
+3.650,word.cluster=375
+3.466,word.cluster=68
+3.169,word.cluster=139
+2.020,-1:word.lower=in
+1.995,word.cluster=143
+1.973,word.cluster=476
+1.828,word.cluster=102
+1.617,word[-2:]=ië
+1.538,-1:word.lower=(

Top features for label I-LOC:
Weight?,Feature
+1.778,word.cluster=238
+1.488,+2:word.lower=m
+1.367,word.cluster=161
+0.996,-1:word.lower=col
+0.977,word[-2:]=rk
+0.852,-1:word.istitle=False
+0.821,word[-2:]=al
+0.810,word.lower=york
+0.810,word[-3:]=ork
+0.809,word[-3:]=eum

Top features for label B-MISC:
Weight?,Feature
+3.318,word.cluster=23
+2.494,word.cluster=100
+2.419,word.cluster=39
+2.097,+2:word.lower=1
+2.039,word.cluster=294
+2.036,word.cluster=338
+1.786,word[-2:]='s
+1.772,word.cluster=11
+1.700,word.lower=sport
+1.646,word.lower=buitenland

Top features for label I-MISC:
Weight?,Feature
+1.792,-2:word.lower=ronde
+1.684,-1:word.isupper=True
+1.567,-1:word.lower=ronde
+1.332,word.cluster=37
+1.323,+1:word.lower=ned
+1.316,word.cluster=325
+1.298,word.cluster=1
+1.274,word.lower=leven
+1.215,-1:word.istitle=True
+1.201,-1:postag=Num

Top features for label B-ORG:
Weight?,Feature
+2.798,word.cluster=424
+2.635,word.cluster=228
+2.121,word[-3:]=com
+1.991,word.cluster=187
+1.974,word.lower=quizpeople
+1.922,word[-3:]=ple
+1.848,+1:word.lower=morgen
+1.798,word.cluster=250
+1.683,word.cluster=83
+1.560,word.cluster=29

Top features for label I-ORG:
Weight?,Feature
+1.338,word.lower=morgen
+1.304,word.cluster=403
+1.243,word.cluster=413
+1.200,-1:word.lower=vlaams
+1.141,word.cluster=187
+1.120,word[-3:]=gen
+1.101,word[-3:]=ion
+1.057,-1:word.lower=radio
+1.028,word.cluster=321
+0.970,-1:word.lower=ned

Top features for label B-PER:
Weight?,Feature
+3.523,word.cluster=489
+2.888,word.cluster=204
+2.818,word.cluster=301
+2.804,word.cluster=3
+2.765,word.cluster=246
+2.444,word.cluster=337
+2.419,word.cluster=6
+2.361,word.cluster=326
+2.199,word.cluster=296
+2.069,word.cluster=87

Top features for label I-PER:
Weight?,Feature
+1.748,-1:word.lower=van
+1.425,word.cluster=3
+1.313,word.cluster=388
+1.287,word.cluster=450
+1.250,word.cluster=6
+1.231,word.cluster=249
+1.152,+1:word.lower=(
+1.127,+2:word.lower=die
+1.044,word.lower=gucht
+0.970,word.cluster=337


## Finding the optimal hyperparameters

So far we've trained a model with the default parameters. It's unlikely that these will give us the best performance possible. Therefore we're going to search automatically for the best hyperparameter settings by iteratively training different models and evaluating them. Eventually we'll pick the best one.

Here we'll focus on two parameters: `c1` and `c2`. These are the parameters for L1 and L2 regularization, respectively. Regularization prevents overfitting on the training data by adding a penalty to the loss function. In L1 regularization, this penalty is the sum of the absolute values of the weights; in L2 regularization, it is the sum of the squared weights. L1 regularization performs a type of feature selection, as it assigns 0 weight to irrelevant features. L2 regularization, by contrast, makes the weight of irrelevant features small, but not necessarily zero. L1 regularization is often called the Lasso method, L2 is called the Ridge method, and the linear combination of both is called Elastic Net regularization.
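
As a rough illustration of the two penalty terms, the sketch below computes them for a small weight vector. It assumes the penalty has the form `c1 * sum(|w|) + c2 * sum(w**2)`, which matches the description above; the exact scaling that CRFsuite applies internally is not confirmed here, and the function name and the numbers are made up for the example.

```python
def elastic_net_penalty(weights, c1, c2):
    """Return c1 * (sum of absolute weights) + c2 * (sum of squared weights)."""
    l1_term = sum(abs(w) for w in weights)
    l2_term = sum(w * w for w in weights)
    return c1 * l1_term + c2 * l2_term


weights = [0.0, -2.0, 0.5, 3.0]

print("L1 only:    ", elastic_net_penalty(weights, c1=0.1, c2=0.0))
print("L2 only:    ", elastic_net_penalty(weights, c1=0.0, c2=0.01))
print("Elastic Net:", elastic_net_penalty(weights, c1=0.1, c2=0.01))
```

A weight of zero adds nothing to either term. Because the L2 term squares each weight, large weights dominate it, whereas the L1 term grows linearly with the size of every weight.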

We define the parameter space for c1 and c2 and use the flat F1-score to compare the individual models. We'll rely on three-fold cross validation to score each of the 50 candidates. We use a randomized search, which means we're not going to try out all specified parameter settings, but instead, we'll let the process sample randomly from the distributions we've specified in the parameter space. It will do this 50 (`n_iter`) times. This process takes a while, but it's worth the wait.
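
To get a feel for what the random search will try, the sketch below draws candidate settings from two exponential distributions with the same scales as in our parameter space (0.5 for `c1`, 0.05 for `c2`). It uses the standard library's `random.expovariate` as a stand-in for `scipy.stats.expon`, so the values will not be the ones `RandomizedSearchCV` samples; `sample_candidates` is a hypothetical helper.

```python
import random


def sample_candidates(n_iter, c1_scale=0.5, c2_scale=0.05, seed=0):
    """Draw n_iter (c1, c2) settings from exponential distributions."""
    rng = random.Random(seed)
    # expovariate takes a rate, which is 1 / scale (scale is the mean).
    return [
        {
            "c1": rng.expovariate(1 / c1_scale),
            "c2": rng.expovariate(1 / c2_scale),
        }
        for _ in range(n_iter)
    ]


candidates = sample_candidates(50)
for candidate in candidates[:5]:
    print({name: round(value, 4) for name, value in candidate.items()})
```

An exponential distribution puts most of its mass near zero, so the search mainly explores weak regularization and only occasionally tries a large penalty.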

In [12]:
import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV

crf = crfsuite.CRF(algorithm="lbfgs", max_iterations=100, all_possible_transitions=True)

params_space = {
    "c1": scipy.stats.expon(scale=0.5),
    "c2": scipy.stats.expon(scale=0.05),
}

f1_scorer = make_scorer(metrics.flat_f1_score, average="weighted", labels=labels)

rs = RandomizedSearchCV(
    crf, params_space, cv=3, verbose=1, n_jobs=-1, n_iter=50, scoring=f1_scorer
)
rs.fit(X_train, y_train)

Fitting 3 folds for each of 50 candidates, totalling 150 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed: 24.3min finished


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=True, averaging=None, c=None, c1=None, c2=None,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error...e,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=False),
          fit_params=None, iid='warn', n_iter=50, n_jobs=-1,
          param_distributions={'c1': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f9947f04e10>, 'c2': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f9947f04c88>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn',
          scoring=make_scorer(flat_f1_score, average=weighted, labels=['B-ORG', 'B-MISC', 'B-PER', 'I-PER', 'B-LOC', 'I-MISC', 'I-ORG', 'I-LOC']),
          verbose=1)

Let's take a look at the best hyperparameter settings. Our random search suggests a combination of L1 and L2 regularization.

In [13]:
print("best params:", rs.best_params_)
print("best CV score:", rs.best_score_)
print("model size: {:0.2f}M".format(rs.best_estimator_.size_ / 1000000))

best params: {'c1': 0.08869645933566639, 'c2': 0.005642379370340676}
best CV score: 0.7608794798691931
model size: 1.06M


To find out what precision, recall and F1-score this translates to, we take the best estimator from our random search and evaluate it on the test set. This indeed shows a nice improvement over our initial model: the micro-averaged F1-score goes from 77% to 79.1%. Both precision and recall have improved, and the F1-score is higher for every label of all four entity types.

In [14]:
best_crf = rs.best_estimator_
y_pred = best_crf.predict(X_test)
print(
    metrics.flat_classification_report(y_test, y_pred, labels=sorted_labels, digits=3)
)

              precision    recall  f1-score   support

       B-LOC      0.849     0.863     0.856       774
       I-LOC      0.359     0.571     0.441        49
      B-MISC      0.847     0.622     0.717      1187
      I-MISC      0.664     0.415     0.511       410
       B-ORG      0.806     0.727     0.764       882
       I-ORG      0.772     0.677     0.721       551
       B-PER      0.834     0.903     0.867      1098
       I-PER      0.892     0.958     0.924       807

   micro avg      0.823     0.761     0.791      5758
   macro avg      0.753     0.717     0.725      5758
weighted avg      0.820     0.761     0.784      5758



## Conclusions

Conditional Random Fields have lost some of their popularity since the advent of neural-network models. Still, they can be very effective for named entity recognition, particularly when word embedding information is taken into account.  