<a href="https://colab.research.google.com/github/KCL-Health-NLP/nlp_examples/blob/master/spacy_custom_ner_to_complete.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# spaCy for named entity recognition of clinical concepts

In this practical, we will build a named entity recognition (NER) classifier using spaCy.

Named entity recognition is a structured learning problem, i.e., we want to learn sequence patterns.

We will use data from mtsamples again, and build classifiers that find clinical concepts. 

The 'gold' standard data is *not* manually annotated. It is the output of a clinical concept recognition system developed by Zeljko Kraljevic called 'CAT' (a predecessor to MedCAT), so it is not perfect, but it will do to illustrate the idea. MedCAT matches concepts to the entire UMLS. We will only use a few example concepts here.

Part of this material is adapted from, or inspired by:

https://spacy.io/usage/training


Written by Angus Roberts, May 2023, for spaCy 3. Based on an earlier version for spaCy 2 written by Sumithra Velupillai, March 2019. Acknowledgements and many thanks to Zeljko Kraljevic for the data preparation.

In [None]:
# We'll use spaCy for NER.
try:
    import spacy
except ImportError as e:
    !pip install spacy
    import spacy

# DocBin is a serialisable collection of spacy
# Documents.
from spacy.tokens import DocBin

# Displacy provides a graphic display of
# documents and annotations, and Scorer scores...
from spacy import displacy
from spacy.scorer import Scorer

# Example holds two spaCy documents,
# one with predicted annotations
# and one with gold standard annotations.
# We will use it when evaluating.
from spacy.training import Example


# requests is a package to submit requests to URLs
# We will use it to fetch our data
import requests

# we use sklearn to split our training data in to train
# and dev portions (we have a separate, held out
# final test set)
from sklearn.model_selection import train_test_split

# We will generate warnings for some things
# You might uncomment the filter below to ignore them
import warnings
#warnings.filterwarnings('ignore')

# 1: What version of spaCy do we have?
SpaCy has changed a lot between V2 and V3. Let's check we have the right version: we want V3.

In [None]:
!python -m spacy info

# 2: Reading in the corpus
Our data is in a json format which derives from an older version of spaCy. We will start by reading it in to a Python compound data structure.

In [None]:
data_url = 'https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_traindata_CAT_updated_2021.json?raw=true'
r = requests.get(data_url)
data = r.json()

How big is our dataset?

In [None]:
# Write this code yourself

Let's take a look at a random document and its annotations. 

In [None]:
# Write this code yourself

The json format is a list of documents, each document being a list that contains:
* [the text itself,
* {a dictionary with
  * a key string 'entities'
  * [a value that is a list containing all the annotated entities
    * [each annotated entity is itself a list with
      * the start character offset for the entity
      * the end character offset for the entity
      * the type of the entity]]}]
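
As a concrete illustration of this layout, here is a minimal sketch using a made-up one-sentence document. The sentence, the offsets, and the names `toy_corpus` and `entity_strings` are invented for illustration and are not taken from the mtsamples data; the label is one used later in this notebook. It assumes the end offset is exclusive, as in Python slicing, so slicing the text with an entity's offsets recovers the annotated string.

```python
# A toy corpus in the same shape as the practical's json data.
# The sentence and the offsets are made up for illustration.
toy_corpus = [
    [
        "Patient reports chest pain and fever.",
        {"entities": [[16, 26, "SIGNSYMPTOM"], [31, 36, "SIGNSYMPTOM"]]},
    ]
]


def entity_strings(document):
    """Return (covered text, label) for each annotated entity in one document."""
    text, annotations = document
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


# Show each entity label next to the text it covers
for covered, label in entity_strings(toy_corpus[0]):
    print(f"{label:12} {covered!r}")
```

The same slicing should work on any document in the real corpus, which is a quick way to sanity-check the annotations by eye.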

What are the instances we want to learn?

We will split our data 80:20 in to a train set for training and a dev set for testing at each training iteration. We will do this with scikit-learn's train_test_split function. Note that we also have a separate, held out test set that we will keep blind, and read in later on in the notebook.
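
The idea behind an 80:20 split can be sketched with the standard library alone. This is only an illustration of the principle (shuffle, then cut), not scikit-learn's function; the helper `split_80_20`, the fixed seed and the toy document names are assumptions made for the example.

```python
import random


def split_80_20(documents, seed=0):
    """Shuffle a copy of the documents and cut it into 80% train, 20% dev."""
    shuffled = list(documents)
    # A seeded generator makes the shuffle repeatable between runs
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder documents standing in for the real corpus
toy_docs = [f"doc{i}" for i in range(10)]
toy_train, toy_dev = split_80_20(toy_docs)
print(len(toy_train), len(toy_dev))
```

Every document ends up in exactly one of the two portions, which is the property we need so that the dev scores are not computed on documents the model was trained on.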

Take a look at the documentation for [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) and work out how to do this. Write the code below, calling the two datasets train_data and dev_data.

In [None]:
# Complete this code yourself
train_data, dev_data = 

# 3: Converting the corpus to .spacy format
When training spaCy, we need to pass it a binary file of serialised spaCy *Document* objects. SpaCy *DocBin* objects help with this. They are iterable collections of Documents which can be serialised.

We will write a function that takes our json corpus and converts it to a DocBin.

In [None]:
# SpaCy documents record the token spans of annotations.
# We therefore need to get the token spans of our
# annotations, so we can save these in the Documents.
# We can do this with a blank spaCy pipeline with no
# components. No need to do any other processing.
nlp = spacy.blank('en')

In [None]:
# A DocBin is a serialisable SpaCy container that holds
# SpaCy documents, and which can be used in SpaCy training.
# This function converts our data format in to a DocBin
def data_to_docbin(json_corpus):
  
  # We create a DocBin to hold our Documents
  db = DocBin()
  
  # The json_corpus contains text and annotations
  for text, annot in json_corpus:

    # create Document object from text
    # this will contain the tokens and
    # their spans
    doc = nlp(text)

    # Now let's get the entities in to a list 
    ents = []

    # The annotations from our data have a start offset,
    # an end offset and a label
    for start, end, label in annot["entities"]:

      # Make a span in our document for these
      span = doc.char_span(start, end, label=label)

      # If the Document can't align the character offsets with tokens,
      # it will return None. We will ignore any entities like this,
      # as they could break our training
      if span is None:
        warnings.warn(f'Skipping entity [{start}, {end}, {label}] : span does not align with token boundaries')
      else:
        ents.append(span)

    # Add the entities to the document
    # and add the document to the DocBin
    doc.set_ents(ents)
    db.add(doc)

  # return the DocBin containing all the Documents
  # with their text and entities
  return db
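
Why might character offsets fail to align with tokens? The following is a minimal sketch of the idea, assuming simple whitespace tokenisation as a stand-in for spaCy's tokenizer (which also splits off punctuation, among other things). The helpers `token_boundaries` and `aligns` and the toy text are invented for illustration and are not how spaCy implements `char_span`.

```python
import re


def token_boundaries(text):
    """Character start and end offsets of whitespace-separated tokens."""
    matches = list(re.finditer(r"\S+", text))
    return {m.start() for m in matches}, {m.end() for m in matches}


def aligns(text, start, end):
    """True if [start, end) begins at a token start and ends at a token end."""
    starts, ends = token_boundaries(text)
    return start < end and start in starts and end in ends


toy_text = "severe chest pain"
print(aligns(toy_text, 7, 17))  # "chest pain": covers whole tokens
print(aligns(toy_text, 7, 15))  # "chest pa": ends inside a token
```

In spaCy itself, `Doc.char_span` also accepts an `alignment_mode` argument (`"strict"` by default, or `"contract"` / `"expand"`) that snaps misaligned offsets to token boundaries instead of returning None. Whether that is appropriate depends on how much you trust the original offsets.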

Now let's use this function to create DocBins for our train data and our dev data, and write these to disk. Complete the code below to do this.

In [None]:
# Now convert our two datasets and serialise them
# to disk ready for training - you need to
# complete the code to do this

# Complete the code below
train_doc_bin = 
train_doc_bin.to_disk("./train.spacy") 

# Complete the code below
dev_doc_bin = 
dev_doc_bin.to_disk("./dev.spacy") 

We can read one of them in to check it's worked

In [None]:
# Let's check we can deserialise
doc_bin = DocBin().from_disk("./train.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))
print(len(docs))

# 4: Training a named entity model with spaCy
The default named entity pipe in spaCy is trained to recognise traditional named entities, such as people and places. We have our own medical named entities that we want to develop a model for.

You can train spaCy models in Python code, passing samples of your data to the pipeline in a training loop. However, spaCy V3 is designed around running many common operations, including training, from a command line interface (CLI). You run spaCy commands from a command line, passing in any data and a configuration file. The configuration file lets you set parameters, data paths, scoring methods, and many more things. It's a flexible approach: you use spaCy's own training code rather than writing your own loop, and can concentrate on configuring your pipeline.

The first thing we need to do is create a config file. There are lots of options, so the best way is to generate a template using spaCy's online tool, which is here:

https://spacy.io/usage/training#quickstart

You can play with different settings, but to start we recommend:

* Language: English
* Components: parser and ner (you need both!)
* Hardware: CPU
* Optimise for: efficiency

Do the following:
1.   Go to https://spacy.io/usage/training#quickstart
2.   Choose the above settings
3.   Click the download icon on the config file display (bottom right)
4.   Save the config file to your local computer
5.   Upload the config to colab.

Once you have done this, you will have a base_config.cfg file in your colab file space. This is not a complete config, and needs some parameters filling in and initialising for your system. Do this by running the following spaCy CLI command:


In [None]:
# Initialise spacy config file
!python -m spacy init fill-config base_config.cfg config.cfg

You now have a config.cfg file. SpaCy will use this for all settings during training. However, it will still need a few settings, which you can supply at the command line (overriding values in the config.cfg file):

* output directory for the trained model
* path to the train dataset
* path to the dev dataset
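
To see how a command-line override relates to the config file, here is a minimal sketch. It assumes the generated config.cfg contains a `[paths]` section with `train` and `dev` keys, which is what `--paths.train` and `--paths.dev` would address. SpaCy has its own config system; the standard library's `configparser`, the `apply_override` helper and the `path/to/train.spacy` value are stand-ins used only for illustration.

```python
from configparser import ConfigParser

# A hand-written fragment in the style of a spaCy config file.
# Assumption: the generated config.cfg has a [paths] section like this.
fragment = """
[paths]
train = null
dev = null
"""


def apply_override(parser, dotted_key, value):
    """Set 'section.key' to value, mimicking a --section.key override."""
    section, key = dotted_key.split(".", 1)
    parser[section][key] = value


toy_config = ConfigParser()
toy_config.read_string(fragment)

# Equivalent in spirit to passing: --paths.train path/to/train.spacy
apply_override(toy_config, "paths.train", "path/to/train.spacy")
print(dict(toy_config["paths"]))
```

The dotted name on the command line is simply the section name followed by the key, so any value in the config can in principle be overridden the same way without editing the file.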

The following command runs spacy's training using config.cfg. You need to complete it to pass in the paths to our prepared train and dev set to it. Edit it to include these paths, and then run it:

In [None]:
# You need to fill in the training and dev dataset paths before running!
!python -m spacy train config.cfg --output ./output --paths.train PUT-PATH_HERE --paths.dev PUT-PATH_HERE

What do the different parts of the training report mean?

We have now trained a clinical concept entity recognizer, and saved it to the ./output directory in our workspace (the best-scoring model is in ./output/model-best).

# 5: Take a look at some examples

Let's look at an example document and the predicted entities from the new model. We will do this with the displacy package.

* Try a few documents
* Are they right?
* Any problems?

In [None]:
# Get a document 
text = train_data[17][0]

# Load the model that was saved to disk by spacy train
ner = spacy.load('./output/model-best')

# Process the document
doc = ner(text)

# Set up some colours for displacy
colors = {'ANATOMY': 'lightyellow',
           'DISEASESYNDROME': 'pink',  
           'SIGNSYMPTOM': 'lightgreen'}

# Display in displacy
displacy.render(doc, style='ent', jupyter=True, options={'colors':colors})

What are we actually learning? What are the instances? We can take a look at the underlying representation - let's look at the tokens in this document.

In [None]:
print([(t, t.ent_iob_, t.ent_type_) for t in doc])
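
A note on reading this output: `ent_iob_` is 'B' for a token that begins an entity, 'I' for a token inside an entity, and 'O' for a token outside any entity. The sketch below shows how character-offset annotations map on to this per-token encoding. It assumes whitespace tokenisation rather than spaCy's tokenizer, and the `iob_tags` helper and the toy sentence are invented for illustration.

```python
import re


def iob_tags(text, entities):
    """Tag whitespace tokens with (token, IOB code, entity type)."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        iob, ent_type = "O", ""
        for start, end, label in entities:
            # The token belongs to an entity if it lies fully inside its span
            if start <= match.start() and match.end() <= end:
                iob = "B" if match.start() == start else "I"
                ent_type = label
                break
        tagged.append((match.group(), iob, ent_type))
    return tagged


toy_text = "severe chest pain today"
toy_entities = [[7, 17, "SIGNSYMPTOM"]]
for row in iob_tags(toy_text, toy_entities):
    print(row)
```

Seen this way, each token with its tag is one instance, and the model has to predict a consistent sequence of tags over the whole document.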

What do you think? Does it seem like the model works well on this document? Are there concepts that are missed? 


# 6: Evaluation
How do we know how good this model is? Let's compare with the held out test data. First you will need to load this in, just like we loaded in the training data. Complete the code below to do this.

In [None]:
data_url = 'https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_testdata_CAT_updated_2021.json?raw=true'


# Complete the code to load the test data, in to a variable called test_data




SpaCy's *Scorer* takes a list of *Examples*. An Example contains two Documents, one with the gold standard annotations, and one with the predicted annotations.

In [None]:
# We create a list to hold our Examples
examples = []

# The scorer
scorer = Scorer()

# Iterate over the text and annotations in our test data
for text, annotations in test_data:

    # Run the ner over the text to make predictions
    doc = ner(text)
  
    # Create the Example from the predicted doc
    # and the gold annotations, add it to our list
    example = Example.from_dict(doc, annotations)
    examples.append(example)

# Score the examples
scores = scorer.score(examples)

print('Precision: ', scores['ents_p'])
print('Recall: ', scores['ents_r'])
print('F1: ', scores['ents_f'])
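
To make these three numbers concrete, here is a minimal sketch of entity-level scoring on made-up offsets. It assumes that a predicted entity only counts as correct when its start, end and label all match a gold entity exactly. The `entity_prf` helper and the toy data are illustrative only and are not spaCy's Scorer.

```python
def entity_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) triples."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if true_pos else 0.0
    return precision, recall, f1


# Made-up gold and predicted entities for one imaginary document
toy_gold = [(7, 17, "SIGNSYMPTOM"), (31, 36, "SIGNSYMPTOM"), (40, 45, "ANATOMY")]
toy_pred = [(7, 17, "SIGNSYMPTOM"), (31, 36, "DISEASESYNDROME")]

toy_p, toy_r, toy_f = entity_prf(toy_gold, toy_pred)
print(f"Precision: {toy_p:.2f}  Recall: {toy_r:.2f}  F1: {toy_f:.2f}")
```

Here one of the two predictions is right (precision 1/2), one of the three gold entities is found (recall 1/3), and the second prediction is penalised twice: the wrong label makes it both a false positive and a missed gold entity.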

Do you think these are good results? Can they be improved? What happens if you increase the number of iterations in the training?

The scores dictionary also contains a dictionary of scores per entity type. You can access this from scores['ents_per_type']. Write code to print out these scores in an easy to read format.

In [None]:
# Write code to print out scores['ents_per_type'] in an easy to read format
# e.g. you might iterate over the labels and the metrics in two nested loops


Let's look at a document from the test data.

In [None]:
text = test_data[37][0]
doc = ner(text)

# We use the colours that we set up before
displacy.render(doc, style='ent', jupyter=True, options={'colors':colors})


What does the underlying representation look like?

In [None]:
print([(t, t.ent_iob_, t.ent_type_) for t in doc])

SpaCy offers many other options for training models. If you are interested, see the documentation, e.g. https://spacy.io/usage/training