# Named entity recognition - clinical concepts

In this practical, we will try to build a named entity recognition classifier using spaCy and crfsuite.

Named entity recognition is a structured learning problem: rather than classifying each token on its own, we want to learn patterns over sequences of tokens.

We will use data from mtsamples again, and build classifiers that find clinical concepts. 

The 'gold' standard data is *not* manually annotated: it is the output of a clinical concept recognition system developed by Zeljko Kraljevic called 'CAT' (a predecessor to MedCAT), so this data is not perfect. This system matches concepts to the entire UMLS. We will only use a few example concepts here.

Parts of this material are adapted from or inspired by:

https://spacy.io/usage/training,

https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html,

https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da

Written by Sumithra Velupillai, March 2019, updated February 2021 - acknowledgements and many thanks to Zeljko for the data preparations!

In [1]:
## tested with spacy version '2.2.4'
try:
    import spacy
except ImportError as e:
    !pip install spacy
    import spacy

from spacy import displacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer

import json
import requests

try:
    import sklearn_crfsuite
except ImportError as e:
    !pip install sklearn_crfsuite
    import sklearn_crfsuite

from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

import sklearn
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
# import the stats submodule explicitly; it is needed for scipy.stats.expon later on
import scipy.stats

import random


# 1: Corpus
We have prepared the training and test data in a json format.

In [2]:
data_url = 'https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_traindata_CAT_updated_2021.json?raw=true'
r = requests.get(data_url)
train_data = r.json()

Let's take a look at a random document and its annotations. The json format contains the text itself, and then the start and end offsets for each annotated entity. What are the instances we want to learn?

In [3]:
train_data[15][1]

{'entities': [[124, 134, 'ANATOMY'],
  [629, 633, 'ANATOMY'],
  [695, 701, 'ANATOMY'],
  [765, 770, 'ANATOMY'],
  [829, 839, 'ANATOMY'],
  [1015, 1027, 'ANATOMY'],
  [1349, 1359, 'ANATOMY'],
  [80, 87, 'DISEASESYNDROME'],
  [318, 325, 'DISEASESYNDROME'],
  [534, 543, 'DISEASESYNDROME'],
  [979, 986, 'DISEASESYNDROME'],
  [1084, 1091, 'DISEASESYNDROME'],
  [1397, 1404, 'DISEASESYNDROME'],
  [1138, 1146, 'SIGNSYMPTOM'],
  [1410, 1418, 'SIGNSYMPTOM']]}
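
To make the offsets concrete, here is a minimal sketch that maps each `[start, end, label]` triple back to the piece of text it covers. The example text and annotations are invented for illustration (they are not taken from the mtsamples data), and `entity_strings` is a hypothetical helper name.

```python
def entity_strings(text, annotations):
    """Return (surface string, label) pairs for each annotated character span."""
    # sort by start offset so the entities come out in reading order
    spans = sorted(annotations.get("entities", []))
    return [(text[start:end], label) for start, end, label in spans]


# Invented example in the same (text, {"entities": [...]}) layout as train_data
example_text = "Patient reports chest pain and a history of asthma."
example_annotations = {
    "entities": [[44, 50, "DISEASESYNDROME"], [16, 26, "SIGNSYMPTOM"]]
}

for surface, label in entity_strings(example_text, example_annotations):
    print(label, "->", surface)
```

The same helper could be called on a real document, e.g. `entity_strings(*train_data[15])`, assuming the data has been loaded as above.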

# 2: Training a named entity model with spaCy
We can use spaCy to train our own named entity recognition model using their training algorithm.
First we need to load a spaCy English language model, so that we can split the text into sentences and tokens.

In [4]:
try:
    #nlp = spacy.load('en')
    nlp = spacy.load('en_core_web_sm')
except OSError as e:
    !python -m spacy download en_core_web_sm
    nlp = spacy.load('en_core_web_sm')

What nlp preprocessing parts does this model contain? In spaCy, these are called 'pipes'.

In [5]:
nlp.pipe_names

['tagger', 'parser', 'ner']

The default named entity pipe in spaCy is not trained for our labels. We have our own named entities that we want to develop a model for. Let's add these entity labels to the spaCy ner pipe.

In [6]:
if "ner" not in nlp.pipe_names:
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner, last=True)
# otherwise, get it so we can add labels
else:
    ner = nlp.get_pipe("ner")

## We'll create an empty set where we'll store our ner labels, that we get from the annotations in our data.
labels = set()
for _, annotations in train_data:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])
        labels.add(ent[2])

We don't want to retrain the other pipeline steps, so let's keep those. We only want to retrain the ner pipeline with our own labels and annotations.

In [7]:
pipe_exceptions = ["ner"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

In [8]:
other_pipes

['tagger', 'parser']

What entities/labels do we have in our data?

In [9]:
print(labels)

{'DISEASESYNDROME', 'SIGNSYMPTOM', 'ANATOMY'}
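
Besides knowing which labels exist, it can help to know how often each one occurs. The following is a small sketch that counts annotated spans per label; the toy data is invented, and `count_labels` is a hypothetical helper rather than part of spaCy.

```python
from collections import Counter


def count_labels(data):
    """Count how many annotated spans each entity label has."""
    counts = Counter()
    for _text, annotations in data:
        for _start, _end, label in annotations.get("entities", []):
            counts[label] += 1
    return counts


# Invented toy data in the same layout as train_data
toy_data = [
    ("chest pain", {"entities": [[0, 5, "ANATOMY"], [0, 10, "SIGNSYMPTOM"]]}),
    ("asthma and knee pain", {"entities": [[0, 6, "DISEASESYNDROME"], [11, 15, "ANATOMY"]]}),
]

label_counts = count_labels(toy_data)
print(label_counts.most_common())
```

Calling `count_labels(train_data)` on the real training data would show how unevenly the three concept types are represented.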


Now let's train our clinical concept ner model. Let's set the number of training iterations.

In [10]:
n_iter = 10

In [11]:
import warnings

In [12]:
#optimizer = nlp.begin_training()

Now let's train the model.

In [13]:
with nlp.disable_pipes(other_pipes), warnings.catch_warnings():  # only train NER
    # show each spaCy UserWarning once instead of on every update
    warnings.filterwarnings("once", category=UserWarning, module='spacy')
    # reset and initialize the weights of the pipes being trained (here: ner)
    nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(train_data)
        losses = {}
        # batch up the examples using spaCy's minibatch; a growing batch size,
        # e.g. size=spacy.util.compounding(4.0, 32.0, 1.001), is an alternative
        for batch in spacy.util.minibatch(train_data, size=2):
            texts = [text for text, entities in batch]
            annotations = [entities for text, entities in batch]
            # update the model; dropout makes it harder to memorise the data
            nlp.update(texts, annotations, losses=losses, drop=0.3)
        print("Losses", losses)

Losses {'ner': 40518.12223743927}
Losses {'ner': 39903.97432938963}
Losses {'ner': 39658.97405430768}
Losses {'ner': 39335.66381083056}
Losses {'ner': 38166.94617314637}
Losses {'ner': 37975.83893874474}
Losses {'ner': 38008.94368905318}
Losses {'ner': 37702.0379986912}
Losses {'ner': 38019.34605675191}
Losses {'ner': 37714.754248877056}
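
Each iteration shuffles the training data and feeds it to the model in small batches. The sketch below shows the batching idea in plain Python; it is an illustration only, not spaCy's own `minibatch` implementation, and `fixed_size_batches` and the toy examples are invented names and data.

```python
import random


def fixed_size_batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Five invented (text, annotations) pairs standing in for train_data
toy_examples = [("text %d" % i, {"entities": []}) for i in range(5)]

random.seed(0)                # fixed seed so the shuffle is repeatable
random.shuffle(toy_examples)  # a new order on every training iteration
batches = list(fixed_size_batches(toy_examples, size=2))

for batch in batches:
    texts = [text for text, _ in batch]
    print(len(batch), texts)
```

With five examples and a batch size of 2, the last batch holds only one example. Smaller batches mean more frequent, noisier updates.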


In [14]:
text, _ = train_data[60]

In [15]:
doc2 = nlp(text)
colors = {'ANATOMY': 'lightyellow',
           'DISEASESYNDROME': 'pink',  
           'SIGNSYMPTOM': 'lightgreen'}
displacy.render(doc2, style='ent', jupyter=True, options={'colors':colors})

We have now added a clinical concept entity recognizer in the spaCy nlp model! Let's look at an example document and the predicted entities from the new model.

In [16]:
text, _ = train_data[61]

In [17]:
doc2 = nlp(text)
colors = {'ANATOMY': 'lightyellow',
           'DISEASESYNDROME': 'pink',  
           'SIGNSYMPTOM': 'lightgreen'}
displacy.render(doc2, style='ent', jupyter=True, options={'colors':colors})

We can also look at the underlying representation - let's look at one sentence in this document.

In [18]:
print([(x, x.ent_iob_, x.ent_type_) for x in list(doc2.sents)[3]])

[(PROCEDURE, 'O', ''), (:, 'O', ''), (
 , 'O', '')]


What do you think? Does it seem like the model works well on this document? Are there concepts that are missed? 


# 3: Evaluation
How do we know how good this model is? Let's compare with the 'gold standard' test data.

In [19]:
scorer = Scorer()

data_url = 'https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_testdata_CAT_updated_2021.json?raw=true'
r = requests.get(data_url)
test_data = r.json()

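for text, entity_offsets in test_data:
    # gold standard: tokenise only, then attach the annotated entity offsets
    doc = nlp.make_doc(text)
    gold = GoldParse(doc, entities=entity_offsets.get('entities'))
    # prediction: run the full pipeline, including the retrained ner pipe
    doc = nlp(text)
    scorer.score(doc, gold)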
print('Precision: ',scorer.scores['ents_p'])
print('Recall: ',scorer.scores['ents_r'])
print('F1: ',scorer.scores['ents_f'])

Precision:  77.62345679012346
Recall:  59.21130076515597
F1:  67.17863105175292


Do you think these are good results? Can they be improved? What happens if you increase the number of training iterations?
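
To see what precision, recall and F1 mean at the entity level, here is a minimal sketch that compares gold and predicted spans. It assumes exact matching, i.e. a prediction only counts as correct when start, end and label all agree; the spans are invented, and `span_prf` is a hypothetical helper, not the spaCy `Scorer`.

```python
def span_prf(gold_spans, predicted_spans):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold = set(map(tuple, gold_spans))
    pred = set(map(tuple, predicted_spans))
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented spans: one prediction has the wrong label, one gold span is missed
gold_spans = [(16, 26, "SIGNSYMPTOM"), (44, 50, "DISEASESYNDROME"),
              (60, 64, "ANATOMY"), (70, 75, "ANATOMY")]
predicted_spans = [(16, 26, "SIGNSYMPTOM"), (44, 50, "ANATOMY"), (60, 64, "ANATOMY")]

p, r, f = span_prf(gold_spans, predicted_spans)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.5 0.571
```

A model with high precision and lower recall, as in the results above, finds fewer entities than the gold standard contains but is mostly right about the ones it does find.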

Let's look at a document from the test data.

In [20]:
text, entity_offsets = test_data[37]
doc2 = nlp(text)
colors = {'ANATOMY': 'lightyellow',
           'DISEASESYNDROME': 'pink',  
           'SIGNSYMPTOM': 'lightgreen'}
displacy.render(doc2, style='ent', jupyter=True, options={'colors':colors})

What does the underlying representation look like?

In [21]:
doc2 = nlp.make_doc(text)
gold = GoldParse(doc2, entities=entity_offsets.get('entities'))
gold.ner

['O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'U-ANATOMY',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-SIGNSYMPTOM',
 'L-SIGNSYMPTOM',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'U-ANATOMY',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'U-ANATOMY',
 'O',
 'O',
 'O',
 'O',
 'U-ANATOMY',
 'O',
 'O',
 'U-ANATOMY',
 'O',
 'O',
 'O',
 'O',
 'O',
 'U-ANATOMY',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'U-ANATOMY',
 'O',
 'B-ANATOMY',
 'I-ANATOMY',
 'L-ANATOMY',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'U-ANATOMY',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'U-ANATOMY',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'U-ANATOMY',
 ...

spaCy offers other options for training models. If you are interested, see their website, e.g. https://spacy.io/usage/training

# 4: Training a model with crfsuite
There are other machine learning algorithms that can be used for this sequence learning problem. Let's try crfsuite. 

Let's use some functions to get sentences and tokens in the right format.

In [22]:
def get_sentences(filename, bio=False):
    # 'filename' is the URL of a json file with (text, annotations) pairs
    r = requests.get(filename)
    data = r.json()

    sentences = []
    # keep the parser: doc.sents relies on it for sentence boundaries;
    # only the default ner pipe is not needed here
    try:
        nlp = spacy.load('en', disable=['ner'])
    except OSError as e:
        nlp = spacy.load('en_core_web_sm', disable=['ner'])

    print('reading data: ', filename)
    for text, entity_offsets in data:
        doc = nlp(text)
        # one BILUO tag per token; tokens that do not align with the
        # annotated offsets are tagged '-'
        gold = spacy.gold.biluo_tags_from_offsets(doc, entity_offsets.get('entities'))
        counter = 0
        for s in doc.sents:
            sent = []
            for t in s:
                label = gold[counter]
                if bio:
                    # collapse BILUO to BIO: Last -> Inside, Unit -> Begin
                    label = label.replace('L-', 'I-')
                    label = label.replace('U-', 'B-')
                sent.append((t.text, t.pos_, label))
                counter += 1
            sentences.append(sent)
    print('done')
    return sentences

In [23]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]



Let's use these functions and read in the training and test data.
There are alternative token-level representations that can be used, for example the BIO format (Begin, Inside, Outside) and the BILOU format (Begin, Inside, Last, Outside, Unit), which spaCy calls BILUO.
What do you think is better or worse with each of these?
In the get_sentences function above, you can choose either format with the boolean flag 'bio'. Let's start with BIO.
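
The difference between the two schemes is easiest to see on a small example. The sketch below tags whitespace-separated tokens from character offsets and then converts the tags to BIO. It is a simplification under stated assumptions: whitespace tokenisation instead of spaCy's tokeniser, entity offsets that line up with token boundaries, and invented helper names (`biluo_from_offsets`, `biluo_to_bio`) and example text.

```python
def biluo_from_offsets(text, entities):
    """Tag whitespace-separated tokens with BILUO labels from character offsets."""
    tokens, tags, position = [], [], 0
    for token in text.split():
        start = text.index(token, position)  # character offset of this token
        end = start + len(token)
        position = end
        tag = "O"
        for ent_start, ent_end, label in entities:
            if ent_start <= start and end <= ent_end:
                first, last = start == ent_start, end == ent_end
                if first and last:
                    tag = "U-" + label  # single-token entity
                elif first:
                    tag = "B-" + label
                elif last:
                    tag = "L-" + label
                else:
                    tag = "I-" + label
        tokens.append(token)
        tags.append(tag)
    return tokens, tags


def biluo_to_bio(tags):
    """Collapse BILUO tags to BIO: Last becomes Inside, Unit becomes Begin."""
    mapping = {"L": "I", "U": "B"}
    converted = []
    for tag in tags:
        prefix, sep, label = tag.partition("-")
        converted.append(mapping.get(prefix, prefix) + sep + label)
    return converted


text = "pain in the left upper arm today"
entities = [[0, 4, "SIGNSYMPTOM"], [12, 26, "ANATOMY"]]

tokens, biluo_tags = biluo_from_offsets(text, entities)
bio_tags = biluo_to_bio(biluo_tags)
for token, biluo, bio in zip(tokens, biluo_tags, bio_tags):
    print(token, biluo, bio)
```

BILUO marks entity ends and single-token entities explicitly, which gives the model more specific targets but also more labels with fewer examples each.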

In [24]:
train_sents = get_sentences('https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_traindata_CAT_updated_2021.json?raw=true', bio=True)
test_sents = get_sentences('https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_testdata_CAT_updated_2021.json?raw=true', bio=True)

reading data:  https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_traindata_CAT_updated_2021.json?raw=true
done
reading data:  https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_testdata_CAT_updated_2021.json?raw=true
done


Now let's create the feature and label vectors for the training and test data.

In [25]:
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

What labels do we have?

In [26]:
labels = list(set(x for l in y_test for x in l))
labels

['O',
 'B-SIGNSYMPTOM',
 'B-DISEASESYNDROME',
 'I-SIGNSYMPTOM',
 '-',
 'B-ANATOMY',
 'I-ANATOMY',
 'I-DISEASESYNDROME']

Now let's train the model.

In [27]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train);

# 5: Evaluation
How does this model perform on our test data? Let's look at the f1 score first.

In [28]:
y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels)

0.9815580970053605

We can also print a classification report with more details and metrics.

In [29]:
print(metrics.flat_classification_report(y_test, y_pred, labels = labels))

                   precision    recall  f1-score   support

                O       0.99      1.00      0.99     36011
    B-SIGNSYMPTOM       0.92      0.71      0.81       308
B-DISEASESYNDROME       0.88      0.72      0.79       480
    I-SIGNSYMPTOM       0.86      0.73      0.79       122
                -       0.00      0.00      0.00         1
        B-ANATOMY       0.95      0.85      0.89       945
        I-ANATOMY       0.90      0.78      0.83       299
I-DISEASESYNDROME       0.81      0.53      0.64       219

         accuracy                           0.98     38385
        macro avg       0.79      0.66      0.72     38385
     weighted avg       0.98      0.98      0.98     38385



What do you think? There's a huge imbalance in the number of instances. Do we really want to evaluate the 'O' label? There's also one instance with an erroneous label ('-'), which spaCy assigns to a token whose boundaries do not align with the annotated offsets. Let's look at the results without these labels.
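
The effect of the 'O' label on a support-weighted average can be shown with a few invented tokens. The sketch below is a simplified plain-Python re-implementation of token-level weighted F1, written for illustration; it is not the sklearn or sklearn_crfsuite code, and the function names and tag sequences are assumptions.

```python
from collections import Counter


def per_label_f1(y_true, y_pred, labels):
    """Token-level F1 per label for flat lists of tags."""
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


def weighted_f1(y_true, y_pred, labels):
    """Average the per-label F1 scores, weighted by each label's support."""
    support = Counter(t for t in y_true if t in labels)
    scores = per_label_f1(y_true, y_pred, labels)
    total = sum(support.values())
    return sum(scores[l] * support[l] for l in labels) / total if total else 0.0


# Invented tags: 16 'O' tokens, 4 entity tokens of which only 2 are found
y_true_flat = ["O"] * 16 + ["B-ANATOMY"] * 4
y_pred_flat = ["O"] * 16 + ["B-ANATOMY", "B-ANATOMY", "O", "O"]

with_o = weighted_f1(y_true_flat, y_pred_flat, ["O", "B-ANATOMY"])
without_o = weighted_f1(y_true_flat, y_pred_flat, ["B-ANATOMY"])
print(round(with_o, 3), round(without_o, 3))  # 0.886 0.667
```

Half of the entity tokens are missed, yet the score that includes 'O' still looks high because most tokens are outside any entity.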

In [30]:
labels = list(set(x for l in y_test for x in l if x !='O' and x!='-'))
labels

['B-SIGNSYMPTOM',
 'B-DISEASESYNDROME',
 'I-SIGNSYMPTOM',
 'B-ANATOMY',
 'I-ANATOMY',
 'I-DISEASESYNDROME']

In [31]:
print(metrics.flat_classification_report(y_test, y_pred, labels = labels))

                   precision    recall  f1-score   support

    B-SIGNSYMPTOM       0.92      0.71      0.81       308
B-DISEASESYNDROME       0.88      0.72      0.79       480
    I-SIGNSYMPTOM       0.86      0.73      0.79       122
        B-ANATOMY       0.95      0.85      0.89       945
        I-ANATOMY       0.90      0.78      0.83       299
I-DISEASESYNDROME       0.81      0.53      0.64       219

        micro avg       0.91      0.76      0.83      2373
        macro avg       0.89      0.72      0.79      2373
     weighted avg       0.91      0.76      0.83      2373



This was quite different! 
Try training this model with the BILOU scheme instead. We can do this by setting the boolean flag 'bio' to False in the get_sentences function, which keeps the original BILOU tags instead of converting them to BIO. Are the results better or worse?

In [32]:
training_file = 'https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_traindata_CAT_updated_2021.json?raw=true'
test_file = 'https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_testdata_CAT_updated_2021.json?raw=true'
train_sents = get_sentences(training_file, bio=False)
test_sents = get_sentences(test_file, bio=False)
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

new_classes = list(set(x for l in y_test for x in l if x !='O' and x!='-'))

c1=0.1
c2=0.1

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=c1,
    c2=c2,
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train)
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred, labels = new_classes))

reading data:  https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_traindata_CAT_updated_2021.json?raw=true
done
reading data:  https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_testdata_CAT_updated_2021.json?raw=true
done
                   precision    recall  f1-score   support

    U-SIGNSYMPTOM       0.95      0.73      0.82       223
U-DISEASESYNDROME       0.90      0.77      0.83       316
B-DISEASESYNDROME       0.82      0.63      0.71       164
    B-SIGNSYMPTOM       0.91      0.69      0.79        85
    I-SIGNSYMPTOM       1.00      0.76      0.86        37
L-DISEASESYNDROME       0.83      0.63      0.72       164
        B-ANATOMY       0.90      0.76      0.82       222
        U-ANATOMY       0.96      0.87      0.92       723
        I-ANATOMY       1.00      0.78      0.88        77
I-DISEASESYNDROME       0.79      0.27      0.41        55
        L-ANATOMY       0.91      0.77      0.84       222
    L-SIGNSYMPTO

# Optional: cross-validation to find best parameters with crfsuite
We have used fixed values for the regularisation parameters c1 and c2 above. We can try to find the best values on the training data by cross-validation.

__This takes some time (even with only 3 folds)!__ 

In [None]:
# from: https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#hyperparameter-optimization

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True
)
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

# use the same metric for evaluation
f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=new_classes)

# search
rs = RandomizedSearchCV(crf, params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        n_iter=50,
                        scoring=f1_scorer)
rs.fit(X_train, y_train)
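
The search above does not try a fixed grid: it draws random candidate values for c1 and c2 from exponential distributions and evaluates each candidate with cross-validation. The sketch below only illustrates the sampling step with the standard library; `sample_candidates` is a hypothetical helper, and the drawn values will differ from those that `RandomizedSearchCV` draws.

```python
import random


def sample_candidates(n_candidates, c1_scale=0.5, c2_scale=0.05, seed=0):
    """Draw (c1, c2) candidates from exponential distributions.

    random.expovariate takes a rate, which is 1 / scale, so the mean of
    each distribution equals the scale, as with scipy.stats.expon(scale=...).
    """
    rng = random.Random(seed)
    return [
        {"c1": rng.expovariate(1.0 / c1_scale),
         "c2": rng.expovariate(1.0 / c2_scale)}
        for _ in range(n_candidates)
    ]


candidates = sample_candidates(5)
for candidate in candidates:
    print({name: round(value, 4) for name, value in candidate.items()})
```

Because the scale for c2 is ten times smaller than for c1, most sampled c2 values are much closer to zero.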

In [None]:
print('best params:', rs.best_params_)
print('best CV score:', rs.best_score_)

In [None]:
crf = rs.best_estimator_
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=new_classes, digits=3
))

What do you think? Are there other parameters that could be tested in the cross-validation setup? What about the measure used for optimisation?