<a href="https://colab.research.google.com/github/KCL-Health-NLP/nlp_examples/blob/master/chunking/crfsuite_ner_answers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CRF for named entity recognition of clinical concepts

In this practical, we will build a named entity recognition classifier using sklearn-crfsuite, a conditional random field (CRF) package with a scikit-learn-compatible interface.

Named entity recognition is a structured learning problem, i.e., we want to learn sequence patterns.

We will use data from mtsamples again, and build classifiers that find clinical concepts. 

The 'gold' standard data is *not* manually annotated. It is the output of a clinical concept recognition system developed by Zeljko Kraljevic called 'CAT' (a predecessor to MedCAT), so the data is not perfect. This system matches concepts against the entire UMLS; we will only use a few example concepts here.

Part of this material is adapted from, and inspired by, the sklearn-crfsuite tutorial:

https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html

Written by Sumithra Velupillai, March 2019; updated February 2021. Updated May 2023 by Angus Roberts. Acknowledgements and many thanks to Zeljko Kraljevic for the data preparations.

In [1]:
# By default, pip will install the original sklearn_crfsuite package from PyPI.
# However, this is not compatible with more recent versions of sklearn, and is
# no longer being maintained, so we install from a maintained GitHub fork instead.
# You might be able to go back to the PyPI version in the future, if someone
# starts maintaining it again.
try:
  import sklearn_crfsuite
except ImportError as e:
  !pip install git+https://github.com/MeMartijn/updated-sklearn-crfsuite.git#egg=sklearn_crfsuite
  #!pip install sklearn_crfsuite
  import sklearn_crfsuite


# We use sklearn for scoring, metrics,
# and parameter searching
import sklearn
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

# We use scipy to make exponential continuous random variables
# when parameter searching
import scipy

# import random

# requests is a package to submit requests to URLs
# We will use it to fetch our data
import requests

# We use spacy to create our BILOU tags
import spacy
from spacy.training import offsets_to_biluo_tags

# You might choose to turn off warnings - could be for
# documents with no entities, etc
#import warnings
#warnings.filterwarnings('ignore')


# 4: Training a model with crfsuite
There are other machine learning algorithms that can be used for this sequence learning problem. Let's try crfsuite. 

Let's use some functions to get sentences and tokens in the right format.

In [4]:
# We will use a spacy pipeline to POS and BILOU tag our data.
# We do not need to have NER, as we will use CRF for that.
try:
  nlp = spacy.load('en_core_web_sm', exclude=['ner'])
except OSError as e:
  !python -m spacy download en_core_web_sm
  nlp = spacy.load('en_core_web_sm', exclude=['ner'])

In [5]:
print(nlp.pipeline)

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x7fd737740760>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x7fd7364283a0>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x7fd736f39a80>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x7fd7362033c0>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x7fd7361d5140>)]


In [12]:
# This function loads data from a filename and then
# uses SpaCy to get BILUO tags for each sentence.
# The parameter bio flags whether these should be
# converted to BIO tags
def get_sentences(filename, bio=False):
    
    print('reading data: ', filename)
    r = requests.get(filename)
    train_data = r.json()
 
    sentences = []
        
    for text, entities in train_data:
        doc = nlp(text)

        tags = offsets_to_biluo_tags(doc, entities['entities'])

        tag_counter = 0
        for sent in doc.sents:
            tagged_sentence = []
            for tok in sent:
                tag = tags[tag_counter]
                if bio:
                    tag = tag.replace('L-', 'I-')
                    tag = tag.replace('U-', 'B-')
                w = (tok.text, tok.pos_, tag)
                tagged_sentence.append(w)
                tag_counter +=1
            sentences.append(tagged_sentence)
    print('done')
    return sentences

In [13]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]



Let's use these functions and read in the training and test data.
There are alternative token-level representations that can be used: the BIO format (Begin, Inside, Outside) or the BILOU format (Begin, Inside, Last, Outside, Unit).
What do you think is better or worse with each of these?
In the get_sentences function above, you can choose either format with the boolean flag 'bio'. Let's start with BIO.
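To make the difference between the two schemes concrete, here is a minimal sketch of the BILOU-to-BIO mapping that get_sentences applies when bio=True. The helper name `biluo_to_bio` and the example sentence with its tags are made up for illustration; they are not taken from the dataset.

```python
def biluo_to_bio(tags):
    """Map BILOU tags to BIO: 'L-' becomes 'I-' and 'U-' becomes 'B-'."""
    converted = []
    for tag in tags:
        if tag.startswith('L-'):
            tag = 'I-' + tag[2:]
        elif tag.startswith('U-'):
            tag = 'B-' + tag[2:]
        converted.append(tag)
    return converted

# Hypothetical tags for the tokens: Pain / in / the / left / upper / arm / .
biluo = ['U-SIGNSYMPTOM', 'O', 'O', 'B-ANATOMY', 'I-ANATOMY', 'L-ANATOMY', 'O']
bio = biluo_to_bio(biluo)

for before, after in zip(biluo, bio):
    print(f'{before:15} -> {after}')
```

BILOU marks single-token entities and entity endings explicitly, at the cost of more labels (and so fewer training examples per label). BIO folds those cases into B- and I-.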

In [14]:
train_sents = get_sentences('https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_traindata_CAT_updated_2021.json?raw=true', bio=True)
test_sents = get_sentences('https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_testdata_CAT_updated_2021.json?raw=true', bio=True)

reading data:  https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_traindata_CAT_updated_2021.json?raw=true
done
reading data:  https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_testdata_CAT_updated_2021.json?raw=true
done


In [15]:
print(train_sents[37])

[('The', 'DET', 'O'), ('catheter', 'NOUN', 'O'), ('was', 'AUX', 'O'), ('then', 'ADV', 'O'), ('removed', 'VERB', 'O'), ('.', 'PUNCT', 'O'), (' ', 'SPACE', 'O')]


Now let's create the feature and label vectors for the training and test data.

In [16]:
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
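It can help to picture the shape of this data: X is a list of sentences, each sentence a list of feature dictionaries (one per token), and y is a parallel list of label lists. The sketch below shows this on one invented sentence. `toy_features` is a cut-down, hypothetical stand-in for word2features, and the labels are made up.

```python
def toy_features(sent, i):
    """Simplified stand-in for word2features: current word plus its neighbours."""
    word, postag, _ = sent[i]
    features = {'bias': 1.0, 'word.lower()': word.lower(), 'postag': postag}
    if i > 0:
        features['-1:word.lower()'] = sent[i - 1][0].lower()
    else:
        features['BOS'] = True   # beginning of sentence
    if i < len(sent) - 1:
        features['+1:word.lower()'] = sent[i + 1][0].lower()
    else:
        features['EOS'] = True   # end of sentence
    return features

# One invented sentence of (token, POS tag, label) triples
toy_sents = [[('Chest', 'NOUN', 'B-SIGNSYMPTOM'),
              ('pain', 'NOUN', 'I-SIGNSYMPTOM'),
              ('.', 'PUNCT', 'O')]]

X_toy = [[toy_features(s, i) for i in range(len(s))] for s in toy_sents]
y_toy = [[label for _, _, label in s] for s in toy_sents]

print(X_toy[0][0])   # feature dictionary for the first token
print(y_toy[0])      # labels for the whole sentence
```

Each token only sees itself and its immediate neighbours through the features. The CRF adds the dependency between neighbouring labels on top of that.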

What labels do we have?

In [17]:
labels = list(set(x for l in y_test for x in l))
labels

['-',
 'B-ANATOMY',
 'I-ANATOMY',
 'I-DISEASESYNDROME',
 'B-DISEASESYNDROME',
 'B-SIGNSYMPTOM',
 'I-SIGNSYMPTOM',
 'O']

Now let's train the model.

In [18]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',             # gradient descent using the L-BFGS method
    c1=0.1,                        # L1 regularisation
    c2=0.1,                        # L2 regularisation
    max_iterations=100,
    all_possible_transitions=True  # Consider transitions not in the training data
)
crf.fit(X_train, y_train);

# 5: Evaluation
How does this model perform on our test data? Let's look at the weighted F1 score first.

In [19]:
y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels)

0.9810388541756578

We can also print a classification report with more details and metrics.

In [20]:
from sklearn_crfsuite.utils import flatten
print(metrics.flat_classification_report(y_test, y_pred, labels=labels))

                   precision    recall  f1-score   support

                -       0.00      0.00      0.00         1
        B-ANATOMY       0.95      0.84      0.89       945
        I-ANATOMY       0.90      0.76      0.82       299
I-DISEASESYNDROME       0.81      0.53      0.64       219
B-DISEASESYNDROME       0.88      0.71      0.79       480
    B-SIGNSYMPTOM       0.93      0.70      0.80       308
    I-SIGNSYMPTOM       0.93      0.68      0.79       122
                O       0.99      1.00      0.99     36197

         accuracy                           0.98     38571
        macro avg       0.80      0.65      0.71     38571
     weighted avg       0.98      0.98      0.98     38571



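What do you think? There's a huge imbalance in the number of instances: 36,197 of the 38,571 test tokens are labelled 'O'. Do we really want to evaluate the 'O' label? There's also one instance with an erroneous label ('-'), which spaCy emits when an entity's character offsets do not line up with token boundaries. Let's look at the results without these labels.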
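To see how much the 'O' label dominates the averages, we can recompute them by hand from the classification report above. This is only a rough check: the per-label F1 values are copied from the report already rounded to two decimals, and the helper names `macro_avg` and `weighted_avg` are our own.

```python
# (label, f1-score, support) rows copied from the classification report above.
# The F1 values are rounded to two decimals, so the results are approximate.
report_rows = [
    ('-',                 0.00,     1),
    ('B-ANATOMY',         0.89,   945),
    ('I-ANATOMY',         0.82,   299),
    ('I-DISEASESYNDROME', 0.64,   219),
    ('B-DISEASESYNDROME', 0.79,   480),
    ('B-SIGNSYMPTOM',     0.80,   308),
    ('I-SIGNSYMPTOM',     0.79,   122),
    ('O',                 0.99, 36197),
]

def macro_avg(rows):
    """Unweighted mean of the per-label F1 scores."""
    return sum(f1 for _, f1, _ in rows) / len(rows)

def weighted_avg(rows):
    """Mean of the per-label F1 scores, weighted by support."""
    total = sum(support for _, _, support in rows)
    return sum(f1 * support for _, f1, support in rows) / total

# Keep only the entity labels, dropping 'O' and the erroneous '-'
entity_rows = [row for row in report_rows if row[0] not in ('O', '-')]

print('all labels:    macro %.3f, weighted %.3f'
      % (macro_avg(report_rows), weighted_avg(report_rows)))
print('entity labels: macro %.3f, weighted %.3f'
      % (macro_avg(entity_rows), weighted_avg(entity_rows)))
```

The weighted average over all labels is almost entirely determined by 'O', which is why it looks so much better than the per-entity scores.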

In [21]:
labels = list(set(x for l in y_test for x in l if x !='O' and x!='-'))
labels

['B-ANATOMY',
 'I-ANATOMY',
 'I-DISEASESYNDROME',
 'B-DISEASESYNDROME',
 'B-SIGNSYMPTOM',
 'I-SIGNSYMPTOM']

In [22]:
print(metrics.flat_classification_report(y_test, y_pred, labels = labels))

                   precision    recall  f1-score   support

        B-ANATOMY       0.95      0.84      0.89       945
        I-ANATOMY       0.90      0.76      0.82       299
I-DISEASESYNDROME       0.81      0.53      0.64       219
B-DISEASESYNDROME       0.88      0.71      0.79       480
    B-SIGNSYMPTOM       0.93      0.70      0.80       308
    I-SIGNSYMPTOM       0.93      0.68      0.79       122

        micro avg       0.91      0.75      0.82      2373
        macro avg       0.90      0.70      0.79      2373
     weighted avg       0.91      0.75      0.82      2373



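This was quite different! With 'O' excluded, the weighted average F1 drops from 0.98 to 0.82.
Try training this model with the BILOU scheme instead. We can do this by calling the get_sentences function with the boolean flag 'bio' set to False, which keeps the BILOU tags instead of converting them to BIO. Are results better or worse?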

In [23]:
training_file = 'https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_traindata_CAT_updated_2021.json?raw=true'
test_file = 'https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_testdata_CAT_updated_2021.json?raw=true'
train_sents = get_sentences(training_file, bio=False)
test_sents = get_sentences(test_file, bio=False)
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

new_classes = list(set(x for l in y_test for x in l if x !='O' and x!='-'))

c1=0.1
c2=0.1

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=c1,
    c2=c2,
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train)
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred, labels = new_classes))

reading data:  https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_traindata_CAT_updated_2021.json?raw=true
done
reading data:  https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_testdata_CAT_updated_2021.json?raw=true
done
                   precision    recall  f1-score   support

    L-SIGNSYMPTOM       0.92      0.69      0.79        85
        L-ANATOMY       0.91      0.77      0.84       222
        B-ANATOMY       0.90      0.76      0.82       222
        I-ANATOMY       1.00      0.79      0.88        77
U-DISEASESYNDROME       0.89      0.75      0.82       316
I-DISEASESYNDROME       0.88      0.27      0.42        55
B-DISEASESYNDROME       0.83      0.63      0.72       164
    U-SIGNSYMPTOM       0.94      0.74      0.83       223
L-DISEASESYNDROME       0.83      0.63      0.72       164
    B-SIGNSYMPTOM       0.86      0.65      0.74        85
    I-SIGNSYMPTOM       0.96      0.68      0.79        37
        U-ANATOMY 

# Optional: cross-validation to find best parameters with crfsuite
In the above, we fixed the regularisation parameters at c1=0.1 and c2=0.1 rather than tuning them. We can try to find the best parameters on the training data by cross-validation.

__This takes some time, 20 - 30 minutes (even with only 3 folds)!__ 

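You might make it a bit faster by re-reading your data, this time reverting to BIO tags, which have fewer labels than BILOU. If you do, remember to rebuild new_classes as well, since the scorer below uses it.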

In [24]:
# from: https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#hyperparameter-optimization

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True
)
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

# use the same metric for evaluation
f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=new_classes)

# search
rs = RandomizedSearchCV(crf, params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        n_iter=50,
                        scoring=f1_scorer)
rs.fit(X_train, y_train)

Fitting 3 folds for each of 50 candidates, totalling 150 fits
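What kind of values does the search actually try? The parameter space above draws c1 and c2 from exponential distributions with mean 0.5 and 0.05 respectively. The sketch below imitates this with the standard library's `random.expovariate` as a stand-in for `scipy.stats.expon(scale=...)`. The helper name `sample_expon` and the seed are our own choices, and the samples drawn are not the ones RandomizedSearchCV used.

```python
import random

def sample_expon(rng, scale, n):
    """Draw n values from an exponential distribution with the given mean (scale)."""
    # expovariate takes the rate (1 / mean) rather than the scale
    return [rng.expovariate(1.0 / scale) for _ in range(n)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable

c1_samples = sample_expon(rng, scale=0.5, n=10000)
c2_samples = sample_expon(rng, scale=0.05, n=10000)

mean_c1 = sum(c1_samples) / len(c1_samples)
mean_c2 = sum(c2_samples) / len(c2_samples)

print('c1: mean %.3f, min %.4f, max %.3f' % (mean_c1, min(c1_samples), max(c1_samples)))
print('c2: mean %.3f, min %.4f, max %.3f' % (mean_c2, min(c2_samples), max(c2_samples)))
```

Most draws are small, with occasional much larger values, so the search concentrates on weak regularisation while still trying a few stronger settings.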


In [25]:
print('best params:', rs.best_params_)
print('best CV score:', rs.best_score_)

best params: {'c1': 0.1727635814257965, 'c2': 0.005001234271500444}
best CV score: 0.8311706051427139


In [26]:
crf = rs.best_estimator_
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=new_classes, digits=3
))

                   precision    recall  f1-score   support

    L-SIGNSYMPTOM      0.952     0.706     0.811        85
        L-ANATOMY      0.921     0.793     0.852       222
        B-ANATOMY      0.911     0.784     0.843       222
        I-ANATOMY      0.984     0.818     0.894        77
U-DISEASESYNDROME      0.895     0.832     0.862       316
I-DISEASESYNDROME      0.882     0.273     0.417        55
B-DISEASESYNDROME      0.847     0.640     0.729       164
    U-SIGNSYMPTOM      0.933     0.816     0.871       223
L-DISEASESYNDROME      0.847     0.640     0.729       164
    B-SIGNSYMPTOM      0.887     0.647     0.748        85
    I-SIGNSYMPTOM      1.000     0.649     0.787        37
        U-ANATOMY      0.962     0.907     0.934       723

        micro avg      0.925     0.791     0.853      2373
        macro avg      0.918     0.709     0.790      2373
     weighted avg      0.922     0.791     0.847      2373



What do you think? Are there other parameters that could be tested in the cross-validation setup? What about the measure used for optimisation?
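One possible alternative measure is an entity-level F1, where a prediction only counts if the whole span and its label match exactly, rather than scoring token by token. The sketch below is a minimal version of this idea, assuming BIO tags. The function names `bio_spans` and `span_f1` and the example tags are our own, not part of sklearn-crfsuite.

```python
def bio_spans(tags):
    """Return the set of (start, end, label) entity spans in one BIO-tagged sentence."""
    spans = set()
    start, label = None, None
    # A final 'O' sentinel closes any span still open at the end of the sentence
    for i, tag in enumerate(list(tags) + ['O']):
        continues = tag.startswith('I-') and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((start, i, label))
            start, label = None, None
            # A stray 'I-' without a matching 'B-' is treated as starting a new span
            if tag.startswith('B-') or tag.startswith('I-'):
                start, label = i, tag[2:]
    return spans

def span_f1(y_true, y_pred):
    """Exact-match entity-level F1 over lists of tag sequences."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(y_true, y_pred):
        gold, pred = bio_spans(gold_tags), bio_spans(pred_tags)
        tp += len(gold & pred)
        n_gold += len(gold)
        n_pred += len(pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented example: the second token of the ANATOMY entity is missed
toy_true = [['B-ANATOMY', 'I-ANATOMY', 'O', 'B-SIGNSYMPTOM']]
toy_pred = [['B-ANATOMY', 'O', 'O', 'B-SIGNSYMPTOM']]
print(span_f1(toy_true, toy_pred))
```

Here three of the four tokens are tagged correctly, but only one of the two entities is recovered in full. A function like this could, in principle, be wrapped with make_scorer in the same way as flat_f1_score above.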