# Named entity recognition - clinical concepts

In this practical, we will try to build a named entity recognition classifier using spaCy and crfsuite.

Named entity recognition is a structured learning problem, i.e., we want to learn sequence patterns.

We will use data from mtsamples again, and build classifiers that find clinical concepts. The gold standard data is not manually annotated; it is the output of a clinical concept recognition system called 'CAT', developed by Zeljko Kraljevic. This system matches concepts against the entire UMLS. We will only use a few example concepts here.

Parts of this material are adapted from, or inspired by:

https://spacy.io/usage/training

https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html

https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da

Written by Sumithra Velupillai, March 2019 - acknowledgements and many thanks to Zeljko for the data preparations!

In [None]:
## NOTE: spaCy has been updated since this practical was developed, so we pin the older version it was written for

!pip install spacy==2.0.13
import spacy

from spacy import displacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer

import json
import requests

try:
    import sklearn_crfsuite
except ImportError:
    !pip install sklearn_crfsuite
    import sklearn_crfsuite

from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

import sklearn
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
import scipy.stats  # scipy.stats.expon is used in the optional cross-validation section

import random


# 1: Corpus
We have prepared training and test data in JSON format.

In [None]:
spacy.__version__

In [None]:
data_url = 'https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_trainingdata_CAT.json?raw=true'
r = requests.get(data_url)
train_data = r.json()

Let's take a look at one document and its annotations. Each item in the JSON data is a pair: the text itself, and a dictionary whose 'entities' list holds the start and stop character offsets and the label of each entity. What are the instances we want to learn?

In [None]:
train_data[14][1]
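To make the format concrete, here is a minimal sketch that turns the offsets back into text. The item below is a hypothetical example written for illustration (it is not taken from the data), and the sketch assumes that stop offsets are exclusive, as in spaCy's training format.

```python
# Hypothetical item in the same (text, annotations) layout as train_data.
example_item = [
    "Patient reports chest pain and a swollen left knee.",
    {"entities": [[16, 26, "SIGNSYMPTOM"], [46, 50, "ANATOMY"]]},
]


def entity_spans(item):
    """Return (surface text, label) pairs for every annotated entity."""
    text, annotations = item
    return [
        (text[start:end], label)
        for start, end, label in annotations.get("entities", [])
    ]


print(entity_spans(example_item))
```

The same helper could be applied to any item of train_data, for example entity_spans(train_data[14]).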

# 2: Training a named entity model with spaCy
We can use spaCy to train our own named entity recognition model using its training algorithm.
First we need to load a spaCy English language model, so that we can split the text into sentences and tokens.

In [None]:
try:
    nlp = spacy.load('en')
except OSError as e:
    !python -m spacy download en_core_web_sm
    nlp = spacy.load('en_core_web_sm')


Which NLP preprocessing components does this model contain?

In [None]:
nlp.pipe_names

We have our own named entities that we want to develop a model for. Let's add these entity labels to the spaCy ner pipe.

In [None]:
if "ner" not in nlp.pipe_names:
    # create the ner pipe and add it at the end of the pipeline
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner, last=True)
else:
    # otherwise, get it so we can add labels
    ner = nlp.get_pipe("ner")

# register every entity label found in the training data
labels = set()
for _, annotations in train_data:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])
        labels.add(ent[2])

We don't want to retrain the other pipeline steps, so let's keep those.

In [None]:
# every pipe except ner; these are disabled during training so they stay unchanged
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
other_pipes

What entities do we have?

In [None]:
print(labels)

Before training our clinical concept NER model, let's set the number of training iterations.

In [None]:
n_iter = 10

Now let's train the model.

In [None]:
with nlp.disable_pipes(*other_pipes):  # only train NER
    # reset and initialize the weights randomly – but only if we're
    # training a new model
    nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(train_data)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = spacy.util.minibatch(train_data, size=spacy.util.compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,  # batch of texts
                annotations,  # batch of annotations
                drop=0.5,  # dropout - make it harder to memorise data
                losses=losses,
            )
        print("Losses", losses)
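The batching in the training loop may be easier to follow in isolation. The sketch below is a plain-Python illustration of the idea behind compounding batch sizes: each batch is a little larger than the previous one, up to a cap. The two functions are simplified stand-ins written for this example (assuming start is smaller than stop), not spaCy's own code, and the growth factor of 1.5 is chosen only to make the effect visible.

```python
import itertools


def compounding(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    curr = float(start)
    while True:
        yield min(curr, stop)
        curr *= compound


def minibatch(items, size):
    """Cut items into consecutive batches; sizes are read from the generator."""
    items = iter(items)
    while True:
        batch = list(itertools.islice(items, int(next(size))))
        if not batch:
            break
        yield batch


# 100 dummy training examples, batch sizes growing from 4 towards 32
batch_sizes = [len(b) for b in minibatch(range(100), compounding(4.0, 32.0, 1.5))]
print(batch_sizes)
```

With the factor 1.001 used in the training cell, the batch size grows far more slowly, so most batches in a small data set stay close to 4 examples.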

In [None]:
for text, _ in train_data:
    doc = nlp(text)
    print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
    print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

We have now added a clinical concept entity recognizer to the spaCy nlp model! Let's look at an example document and the entities predicted by the new model.

In [None]:
text, _ = train_data[6]

In [None]:
len(text)

In [None]:
doc2 = nlp(text)
colors = {'ANATOMY': 'lightyellow',
           'DISEASESYNDROME': 'pink',  
           'SIGNSYMPTOM': 'lightgreen'}
displacy.render(doc2, style='ent', jupyter=True, options={'colors':colors})

We can also look at the underlying representation - let's look at one sentence in this document.

In [None]:
print([(x, x.ent_iob_, x.ent_type_) for x in list(doc2.sents)[5]])

What do you think? Does it seem like the model works well on this document? Are there concepts that are missed? 


# 3: Evaluation
How do we know how good this model is? Let's compare with the 'gold standard' test data.

In [None]:
scorer = Scorer()

data_url = 'https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_testdata_CAT.json?raw=true'
r = requests.get(data_url)
test_data = r.json()

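for text, entity_offsets in test_data:
    # gold annotations are aligned to the tokenised (but unannotated) text
    gold_doc = nlp.make_doc(text)
    gold = GoldParse(gold_doc, entities=entity_offsets.get('entities'))
    # predictions come from running the full pipeline on the same text
    pred_doc = nlp(text)
    scorer.score(pred_doc, gold)
print('Precision: ', scorer.scores['ents_p'])
print('Recall: ', scorer.scores['ents_r'])
print('F1: ', scorer.scores['ents_f'])
#print(scorer.scores)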
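To see what entity-level precision, recall and F1 measure, here is a minimal sketch of exact-match scoring over (start, stop, label) spans. It illustrates the general idea only and is not a copy of spaCy's Scorer; the function name span_prf and the toy spans are hypothetical.

```python
def span_prf(gold_spans, pred_spans):
    """Exact-match precision, recall and F1 over (start, stop, label) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# One prediction matches exactly; the other has the right label but a
# different start offset, so it counts as both a miss and a false alarm.
gold_spans = [(16, 26, "SIGNSYMPTOM"), (46, 50, "ANATOMY")]
pred_spans = [(16, 26, "SIGNSYMPTOM"), (41, 50, "ANATOMY")]
precision, recall, f1 = span_prf(gold_spans, pred_spans)
print(precision, recall, f1)
```

Under exact matching, a prediction that overlaps a gold entity but has different boundaries earns no credit, which is worth keeping in mind when judging the scores.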

In [None]:
test_data

Do you think these are good results? Can they be improved? What happens if you increase the number of training iterations?

Let's look at a document from the test data.

In [None]:
text, entity_offsets = test_data[37]
doc2 = nlp(text)
colors = {'ANATOMY': 'lightyellow',
           'DISEASESYNDROME': 'pink',  
           'SIGNSYMPTOM': 'lightgreen'}
displacy.render(doc2, style='ent', jupyter=True, options={'colors':colors})

What does the underlying representation look like?

In [None]:
doc2 = nlp.make_doc(text)
gold = GoldParse(doc2, entities=entity_offsets.get('entities'))
gold.ner

# 4: Training a model with crfsuite
There are other machine learning algorithms that can be used for this sequence learning problem. Let's try crfsuite, an implementation of conditional random fields (CRFs). If you don't have this package, install it with: pip install sklearn-crfsuite

Let's use some functions to get sentences and tokens in the right format.

In [None]:
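def get_sentences(url, bio=False):
    """Download a json dataset and return sentences as lists of (token, pos, tag).

    Tags use the BILUO scheme by default, or BIO if bio=True.
    """
    r = requests.get(url)
    data = r.json()

    sentences = []
    # ner is not needed here (labels come from the gold offsets), but the
    # tagger and parser are: they provide t.pos_ and doc.sents
    nlp = spacy.load('en_core_web_sm', disable=['ner'])

    print('reading data: ', url)
    for text, entity_offsets in data:
        doc = nlp(text)
        # one tag per token in the document
        tags = spacy.gold.biluo_tags_from_offsets(doc, entity_offsets.get('entities'))
        token_index = 0
        for s in doc.sents:
            sent = []
            for t in s:
                tag = tags[token_index]
                if bio:
                    # collapse BILUO to BIO: Last -> Inside, Unit -> Begin
                    tag = tag.replace('L-', 'I-')
                    tag = tag.replace('U-', 'B-')
                sent.append((t.text, t.pos_, tag))
                token_index += 1
            sentences.append(sent)
    print('done')
    return sentences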

In [None]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]
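To get a feel for what the CRF receives as input, here is a reduced sketch of the same idea: every token is described by a dictionary of its own properties plus those of its neighbours. The function token_features and the toy sentence are hypothetical and much smaller than word2features above.

```python
def token_features(sent, i):
    """Describe token i by itself and its immediate left and right neighbours."""
    word, postag = sent[i][0], sent[i][1]
    features = {"bias": 1.0, "word.lower()": word.lower(), "postag": postag}
    if i > 0:
        features["-1:word.lower()"] = sent[i - 1][0].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        features["+1:word.lower()"] = sent[i + 1][0].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features


# Toy sentence in the (token, postag, label) layout returned by get_sentences
toy_sent = [("Chest", "NOUN", "B-SIGNSYMPTOM"), ("pain", "NOUN", "I-SIGNSYMPTOM")]
first = token_features(toy_sent, 0)
last = token_features(toy_sent, 1)
print(first)
print(last)
```

Note that the label is never part of the feature dictionary; it is kept separately as the target that the model has to predict.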



Let's use these functions to read in the training and test data.
There are alternative token-level representations that can be used:
the BIO format (Begin, Inside, Outside) or the BILOU format (Begin, Inside, Last, Outside, Unit).
What do you think are the advantages and disadvantages of each?
In get_sentences, you can choose either format with the boolean flag 'bio'. Let's start with BIO.
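The difference between the two schemes is easiest to see on a short tag sequence. The sketch below converts BILOU-style tags to BIO by rewriting only the prefixes, the same mapping that the 'bio' flag applies; the helper name biluo_to_bio and the example tags are made up for illustration.

```python
def biluo_to_bio(tags):
    """Map BILOU-style tags to BIO: L- becomes I-, U- becomes B-."""
    prefix_map = {"L-": "I-", "U-": "B-"}
    return [prefix_map.get(tag[:2], tag[:2]) + tag[2:] for tag in tags]


# A single-token entity (U-) followed by a three-token entity (B-, I-, L-)
biluo_tags = ["O", "U-ANATOMY", "O", "B-SIGNSYMPTOM", "I-SIGNSYMPTOM", "L-SIGNSYMPTOM"]
bio_tags = biluo_to_bio(biluo_tags)
print(bio_tags)
```

BIO has fewer distinct labels to learn, while BILOU marks entity ends and single-token entities explicitly.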

In [None]:
train_sents = get_sentences('https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_trainingdata_CAT.json?raw=true', bio=True)
test_sents = get_sentences('https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_testdata_CAT.json?raw=true', bio=True)

Now let's create the feature and label vectors for the training and test data.

In [None]:
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

What labels do we have?

In [None]:
labels = list(set(x for l in y_test for x in l))
labels

Now let's train the model.

In [None]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train);

# 5: Evaluation
How does this model perform on this data?

In [None]:
y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels)

In [None]:
print(metrics.flat_classification_report(y_test, y_pred, labels = labels))

What do you think? There's a huge imbalance in the number of instances. Do we really want to evaluate the 'O' label? There's also one instance with an erroneous label ('-'), which spaCy's biluo_tags_from_offsets assigns to tokens that it cannot align with the entity offsets. Let's look at the results without these labels.

In [None]:
labels = list(set(x for l in y_test for x in l if x !='O' and x!='-'))
labels
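The toy sketch below illustrates why the 'O' label matters so much for a support-weighted average. It uses made-up tags and a small pure-Python F1 written for this example (it follows the usual definition, but it is not the sklearn-crfsuite implementation): a model that predicts 'O' everywhere still gets a high weighted score as long as 'O' is included.

```python
def per_label_f1(y_true, y_pred, label):
    """F1 for one label, computed from flat lists of gold and predicted tags."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def weighted_f1(y_true, y_pred, labels):
    """Average the per-label F1 scores, weighted by each label's gold count."""
    support = {label: y_true.count(label) for label in labels}
    total = sum(support.values())
    if not total:
        return 0.0
    return sum(per_label_f1(y_true, y_pred, l) * support[l] for l in labels) / total


# 18 'O' tokens and one two-token entity; the "model" predicts 'O' everywhere
y_true_toy = ["O"] * 18 + ["B-ANATOMY", "I-ANATOMY"]
y_pred_toy = ["O"] * 20
with_o = weighted_f1(y_true_toy, y_pred_toy, ["O", "B-ANATOMY", "I-ANATOMY"])
without_o = weighted_f1(y_true_toy, y_pred_toy, ["B-ANATOMY", "I-ANATOMY"])
print(round(with_o, 3), without_o)
```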

In [None]:
print(metrics.flat_classification_report(y_test, y_pred, labels = labels))

This was quite different! 
Try training this model with the BILOU scheme instead. Are results better or worse?

In [None]:
train_sents = get_sentences('https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_trainingdata_CAT.json?raw=true', bio=False)
test_sents = get_sentences('https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_testdata_CAT.json?raw=true', bio=False)
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

new_classes = list(set(x for l in y_test for x in l if x !='O' and x!='-'))

c1=0.1
c2=0.1

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=c1,
    c2=c2,
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train)
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred, labels = new_classes))

# Optional: cross-validation to find best parameters with crfsuite
We have used fixed values for the regularisation parameters c1 and c2 above. We can try to find the best values on the training data by cross-validation. __This takes some time though (even with only 3 folds)!__ 

In [None]:
# from: https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#hyperparameter-optimization

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True
)
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

# use the same metric for evaluation
f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=new_classes)

# search
rs = RandomizedSearchCV(crf, params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        n_iter=50,
                        scoring=f1_scorer)
rs.fit(X_train, y_train)
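The search space above draws c1 and c2 from exponential distributions whose scale is also their mean (0.5 and 0.05). To see what kind of candidate values this produces, here is a small sketch using only Python's standard library instead of scipy; the helper sample_expon is hypothetical and the seed is fixed just to make the sketch repeatable.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def sample_expon(scale, n):
    """Draw n values from an exponential distribution with the given mean."""
    return [rng.expovariate(1.0 / scale) for _ in range(n)]


c1_samples = sample_expon(0.5, 10000)
c2_samples = sample_expon(0.05, 10000)
mean_c1 = sum(c1_samples) / len(c1_samples)
mean_c2 = sum(c2_samples) / len(c2_samples)
print("mean c1:", round(mean_c1, 3), "mean c2:", round(mean_c2, 3))
print("first c1 candidates:", [round(v, 3) for v in c1_samples[:5]])
```

Most sampled values are small, with occasional larger ones, so the random search concentrates on weak regularisation while still trying some stronger settings.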

In [None]:
print('best params:', rs.best_params_)
print('best CV score:', rs.best_score_)

In [None]:
crf = rs.best_estimator_
y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=new_classes, digits=3
))

What do you think? Are there other parameters that could be tested in the cross-validation setup? What about the measure used for optimisation?