<a href="https://colab.research.google.com/github/KCL-Health-NLP/nlp_examples/blob/master/chunking/spacy_custom_ner_answers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# spaCy for named entity recognition of clinical concepts

In this practical, we will try to build a named entity recognition classifier using spaCy.

Named entity recognition is a structured learning problem, i.e., we want to learn sequence patterns.

We will use data from mtsamples again, and build classifiers that find clinical concepts. 

The 'gold' standard data is *not* manually annotated. It is the output of a clinical concept recognition system called 'CAT' (a predecessor to MedCAT), developed by Zeljko Kraljevic, so the data is not perfect. CAT matches concepts to the entire UMLS; we will only use a few example concepts here.

Parts of this material are adapted from, or inspired by:

https://spacy.io/usage/training


Written by Angus Roberts, May 2023, for spaCy 3. Based on an earlier version for spaCy 2 written by Sumithra Velupillai, March 2019. Acknowledgements and many thanks to Zeljko Kraljevic for the data preparation.

In [None]:
# We'll use spaCy for NER.
try:
    import spacy
except ImportError as e:
    !pip install spacy
    import spacy

# Example holds spacy documents,
# one with predicted annotations
# and one with gold standard annotations
from spacy.training import Example

# DocBin is a serialiser for spacy documents
from spacy.tokens import DocBin

# Displacy provides a graphic display of
# documents and annotations, and Scorer computes
# evaluation scores
from spacy import displacy
from spacy.scorer import Scorer


# requests is a package to submit requests to URLs
# We will use it to fetch our data
import requests

# we use sklearn to split our training data into train
# and dev portions (we have a separate, held-out
# final test set)
from sklearn.model_selection import train_test_split

# data_to_docbin (below) warns about entities it has to skip,
# so we need the warnings module
import warnings
# Uncomment to silence the warnings
#warnings.filterwarnings('ignore')

# 0: What version of spaCy do we have?
spaCy changed a lot between V2 and V3. Let's check we have the right version: we want V3.

In [None]:
!python -m spacy info

# 1: Prepare the corpus
We have prepared the data in JSON format.

In [None]:
data_url = 'https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_traindata_CAT_updated_2021.json?raw=true'
r = requests.get(data_url)
data = r.json()

Let's take a look at one document and its annotations. The JSON format contains the text itself, followed by the start and end character offsets and the label for each annotated entity. What are the instances we want to learn?

In [None]:
len(data)

In [None]:
data[35]
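
To make the format concrete, here is a small sketch using a made-up record (not taken from the corpus) in the shape the later cells assume: a text, then a dictionary whose "entities" list holds [start, end, label] triples. Slicing the text with the offsets recovers each annotated mention. The helper name `entity_mentions` is hypothetical.

```python
# A hypothetical record in the same shape as one item of `data`.
# The text and offsets are invented for illustration.
record = [
    "Patient reports chest pain and fever.",
    {"entities": [[16, 26, "SIGNSYMPTOM"], [31, 36, "SIGNSYMPTOM"]]},
]


def entity_mentions(record):
    """Return (mention text, label) pairs for one record."""
    text, annotations = record
    # End offsets are exclusive, so a plain slice gives the mention
    return [(text[start:end], label)
            for start, end, label in annotations["entities"]]


mentions = entity_mentions(record)
print(mentions)
```

Running the same helper on `data[35]` should list the mentions annotated in that document, assuming it follows this shape.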

We will split our data 80:20 into a train set for training and a dev set for evaluating the model at intervals during training. We will do this with scikit-learn's train_test_split function. Note that we also have a separate, held-out test set that we will use at the end.

In [None]:
train_data, dev_data = train_test_split(data, train_size=0.8) 
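
As a rough picture of what the split does, the sketch below shuffles a list with a fixed seed and cuts it at 80%. It uses only the standard library and is not scikit-learn's implementation; `split_train_dev` and the toy data are hypothetical names for illustration.

```python
import random


def split_train_dev(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of items and cut it into train and dev portions."""
    shuffled = list(items)
    # A seeded generator makes the shuffle repeatable
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


toy_data = [f"doc_{i}" for i in range(10)]
train_part, dev_part = split_train_dev(toy_data)
print(len(train_part), len(dev_part))
```

Fixing the seed means the same documents land in each portion on every run. `train_test_split` has a `random_state` argument for the same purpose, which is worth setting if you want indices such as `train_data[17]` to refer to the same document each time.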

To train spaCy from the command line, we need to pass it binary files. These are created by serialising a spaCy *DocBin* object, which is a collection of spaCy *Doc* objects, to disk.

In [None]:
# A DocBin is a serialisable spaCy container that holds
# spaCy documents, and which can be used in spaCy training.
# This function converts our data format into a DocBin
def data_to_docbin(data):
  
  # The DocBin we will create for this data
  db = DocBin()
  
  # We need to get the spans of our annotations.
  # We can do this with a blank pipeline with no
  # components. No need to do any other processing.
  nlp = spacy.blank('en')

  # The data contains text and annotations
  for text, annot in data:

    # create Document object from text
    # this will contain the tokens and
    # their spans
    doc = nlp(text)

    # Now let's get the entities in to a list 
    ents = []

    # The annotations from our data have a start offset,
    # an end offset and a label
    for start, end, label in annot["entities"]:

      # Make a span in our document for these
      span = doc.char_span(start, end, label=label)

      # ignore any entities with spans that do not align with tokens
      # as they will break our training
      if span is None:
        warnings.warn(f'Skipping entity [{start}, {end}, {label}] : span does not align with token boundaries')
      else:
        ents.append(span)

    # For each document, add the entities to it
    # and add the document to the DocBin
    doc.set_ents(ents)
    db.add(doc)

  # return the DocBin containing all the Documents
  # with their text and entities
  return db
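
Why might a span fail to align? By default, `char_span` returns `None` when the character offsets do not fall on token boundaries. The sketch below illustrates the idea with a simplified whitespace tokenizer, which is an assumption made for brevity and is not spaCy's tokenizer; `token_offsets` and `aligns_with_tokens` are hypothetical helpers.

```python
import re


def token_offsets(text):
    """Return (start, end) character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def aligns_with_tokens(text, start, end):
    """True if start is the start of a token and end is the end of a token."""
    offsets = token_offsets(text)
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    return start in starts and end in ends


example_text = "Mild chest pain today"
aligned = aligns_with_tokens(example_text, 5, 15)      # "chest pain"
misaligned = aligns_with_tokens(example_text, 5, 13)   # "chest pa"
print(aligned, misaligned)
```

The second span ends in the middle of a token, which is the kind of annotation `data_to_docbin` skips with a warning.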





In [None]:
# Now convert our two datasets and serialise them
# to disk ready for training
train_db = data_to_docbin(train_data)
train_db.to_disk("./train.spacy") 

dev_db = data_to_docbin(dev_data)
dev_db.to_disk("./dev.spacy") 

In [None]:
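# Let's check we can deserialise.
# get_docs needs a vocab, so we create a blank English pipeline
# (the nlp inside data_to_docbin is local to that function)
nlp = spacy.blank('en')
doc_bin = DocBin().from_disk("./dev.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))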
print(len(docs))

# 2: Training a named entity model with spaCy
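We can use spaCy to train our own named entity recognition model using their training algorithm.
Training is driven by a config file. First we fill in a base config (base_config.cfg, which must already be in the working directory) with spaCy's default values to get a complete config.cfg, then we train with it.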

In [None]:
!python -m spacy init fill-config base_config.cfg config.cfg

In [None]:
!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy

What nlp preprocessing parts does this model contain? In spaCy, these are called 'pipes'.

The default named entity pipe in spaCy is not trained for our labels. We have our own named entities that we want to develop a model for. We do not need to add these entity labels by hand: when training starts, spaCy reads them from the annotations in train.spacy.

We don't want to retrain the other pipeline steps, so let's keep those. We only want to retrain the ner pipeline with our own labels and annotations.

We have now trained a clinical concept entity recognizer, and spaCy has saved the best-scoring model in ./output/model-best. Let's look at an example document and the predicted entities from the new model. Is it right? Any problems?

In [None]:
text = train_data[17][0]

In [None]:
ner = spacy.load('./output/model-best')
doc = ner(text)
colors = {'ANATOMY': 'lightyellow',
           'DISEASESYNDROME': 'pink',  
           'SIGNSYMPTOM': 'lightgreen'}
displacy.render(doc, style='ent', jupyter=True, options={'colors':colors})





We can also look at the underlying representation - let's look at one sentence in this document.

In [None]:
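# doc.sents needs sentence boundaries. If the trained pipeline
# has no component that sets them, this line raises an error
print([(x, x.ent_iob_, x.ent_type_) for x in list(doc.sents)[4]])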

In [None]:
ner = spacy.load('./output/model-best')
# The sentencizer is a rule-based component that sets sentence boundaries
ner.add_pipe('sentencizer')
doc = ner(text)
print([(x, x.ent_iob_, x.ent_type_) for x in list(doc.sents)[4]])
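
The `ent_iob_` values follow the IOB scheme: `B` marks the first token of an entity, `I` a token inside one, and `O` a token outside any entity. As a minimal sketch of how character-offset annotations map to such tags, assuming whitespace tokenization and entities that align with token boundaries (`iob_tags` is a hypothetical helper, not part of spaCy):

```python
import re


def iob_tags(text, entities):
    """Return (token, IOB tag, label) triples for whitespace tokens."""
    tagged = []
    for m in re.finditer(r"\S+", text):
        tag, label = "O", ""
        for start, end, ent_label in entities:
            if m.start() == start:
                # Token opens the entity
                tag, label = "B", ent_label
            elif start < m.start() < end:
                # Token continues the entity
                tag, label = "I", ent_label
        tagged.append((m.group(), tag, label))
    return tagged


example_text = "Mild chest pain today"
tags = iob_tags(example_text, [(5, 15, "SIGNSYMPTOM")])
print(tags)
```

The triples have the same layout as the `(x, x.ent_iob_, x.ent_type_)` tuples printed above, which may help when reading the model's output.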

What do you think? Does it seem like the model works well on this document? Are there concepts that are missed? 


# 3: Evaluation
How do we know how good this model is? Let's compare with the held out test data.

In [None]:
data_url = 'https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_testdata_CAT_updated_2021.json?raw=true'
r = requests.get(data_url)
test_data = r.json()



In [None]:
examples = []
scorer = Scorer()
for text, annotations in test_data:
    # Run the ner over the text to make predictions
    doc = ner(text)
    # Create the Example from the predicted doc
    # and the gold annotations 
    example = Example.from_dict(doc, annotations)
    examples.append(example)

scores = scorer.score(examples)

print('Precision: ', scores['ents_p'])
print('Recall: ', scores['ents_r'])
print('F1: ', scores['ents_f'])

print('Per type: ', scores['ents_per_type'])
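
As a sketch of what the entity scores mean, the code below computes precision, recall and F1 from exact matches between gold and predicted (start, end, label) triples. It is a simplified illustration with invented spans, not spaCy's `Scorer` code; `prf` is a hypothetical helper.

```python
def prf(gold, predicted):
    """Precision, recall and F1 over exactly matching (start, end, label) triples."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented spans: one exact match, one right span with the wrong label,
# and one gold entity the prediction misses entirely
gold_spans = [(5, 15, "SIGNSYMPTOM"), (30, 35, "ANATOMY"), (40, 49, "DISEASESYNDROME")]
pred_spans = [(5, 15, "SIGNSYMPTOM"), (30, 35, "SIGNSYMPTOM")]

p, r, f = prf(gold_spans, pred_spans)
print(round(p, 3), round(r, 3), round(f, 3))
```

Under exact matching, a prediction with the right offsets but the wrong label counts as both a false positive and a missed gold entity, which is one reason the per-type scores can differ a lot from the overall ones.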

Do you think these are good results? Can they be improved? What happens if you increase the number of iterations in the training?

Let's look at a document from the test data.

In [None]:
text = test_data[37][0]
doc = ner(text)
# We use the colours from before:
#colors = {'ANATOMY': 'lightyellow',
#           'DISEASESYNDROME': 'pink',  
#           'SIGNSYMPTOM': 'lightgreen'}
displacy.render(doc, style='ent', jupyter=True, options={'colors':colors})


What does the underlying representation look like?

In [None]:
print([(x, x.ent_iob_, x.ent_type_) for x in list(doc.sents)[1]])

spaCy offers many other options for training models. If you are interested, see the spaCy website, e.g. https://spacy.io/usage/training