<a href="https://colab.research.google.com/github/KCL-Health-NLP/nlp_examples/blob/master/ann/evaluating_spacy_transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Comparing spaCy NER CNN and transformer models

**This is the "question" version of this notebook. It differs from the "answer" version in that you are asked to complete some of the code cells.  It will not work without these cells being completed correctly.**

The default spaCy models are CNN based, trained on several datasets. spaCy also makes available transformer models, fine-tuned on the same datasets. This allows us to compare the two. We will carry out such a comparison for NER.

The notebook uses the same classes and methods as introduced in the earlier spaCy NER practical notebook. spaCy hides the transformer models behind the same API as used in the rest of the library, making transformer use completely transparent to the end user. The approach is very different to Hugging Face / Keras / PyTorch / TensorFlow, but depending on what you are trying to do with your NLP, it is worth considering.

Under the hood, spaCy uses a machine learning library called Thinc, which can itself sit on top of PyTorch (other libraries are also supported).

We will test the spaCy models on a standard NER dataset, [the CoNLL NER shared task dataset](https://www.clips.uantwerpen.be/conll2003/ner/). This was built for a community challenge - an NLP competition between different research teams, run to push forward the state of the art. Such challenges are common in the NLP research world, giving us many datasets to use for testing ideas.

## Using with GPUs

The execution time of this code will benefit from the use of GPUs. To select a GPU runtime in Colab:

* Select the *Runtime* menu
* Select the *Change runtime type* submenu
* In the dialog that appears, under *Hardware accelerator* select *GPU*
* Your existing runtime will disconnect, and you will be allocated and connected to a new GPU runtime.

We will also improve execution time through the way in which we fetch and cache data, in one of the steps below.

## Install spaCy packages

The standard spaCy installation does not include transformer support. We need to install the extra packages that spaCy needs for this, including the spaCy transformers package and the spaCy CUDA package. CUDA is NVIDIA's platform for running parallel code on GPUs.

See the [spaCy usage notes](https://spacy.io/usage) for more details on installing extra packages.

*Note: as of May 2023, the install below was giving a couple of errors. However, the notebook ran successfully despite these.*

In [None]:
# Install spaCy transformer packages
# Gives some errors, but seems ok to ignore these
!pip install -U pip setuptools wheel
!pip install -U 'spacy[transformers,cuda-autodetect]'

## ***Restart your runtime***
**You need to restart your runtime in order for the above packages to be made available for imports**

* Menus
  * Runtime
    * Restart runtime
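
After restarting, it can be useful to confirm that the new packages are visible before going further. The following is a minimal sketch using only the standard library; the list of module names is an assumption about what the install cell provides (in particular, that the CUDA extra is imported as `cupy`), and `missing_modules` is a hypothetical helper, not part of spaCy.

```python
import importlib.util

# Assumed import names for the packages installed above.
REQUIRED = ["spacy", "spacy_transformers", "torch", "cupy"]

def missing_modules(names):
    """Return the names that cannot be found on the import path."""
    return [n for n in names if importlib.util.find_spec(n) is None]

missing = missing_modules(REQUIRED)
if missing:
    print("Not importable yet (re-run the install, then restart):", missing)
else:
    print("All required modules are importable.")
```

If anything is reported as missing, re-running the install cell and restarting the runtime again is the first thing to try.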

# Imports

In [None]:
# We are using spaCy for NER.
import spacy

# DocBin is a serialisable collection of spacy
# Documents.
from spacy.tokens import DocBin

# Displacy provides a graphic display of
# documents and annotations, and Scorer scores...
from spacy import displacy
from spacy.scorer import Scorer

# Example holds two spaCy documents,
# one with predicted annotations
# and one with gold standard annotations.
# We will use it when evaluating.
from spacy.training import Example


## Checks and bug hacks

We'll just do a quick check, to make sure we are using GPUs. We will do this with the torch library.

In [None]:
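# Check that torch can see a CUDA device
import torch
print('CUDA available:', torch.cuda.is_available())
print('Number of CUDA devices:', torch.cuda.device_count())
# current_device() raises an error when no GPU is available, so guard it
if torch.cuda.is_available():
  print('Current CUDA device:', torch.cuda.current_device())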

Next we will provide a fix for a bug that happens when initialising spaCy on GPUs. This is apparently a bug in the CUDA library. The code below is a temporary workaround until CUDA is fixed.

In [None]:
# Get a locale error on spacy init with GPU - here's a quick fix
# Code from https://github.com/explosion/spaCy/issues/11909
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

## Download spaCy models

We will download two models:

* The ```en_core_web_lg``` model is a CNN based entity recogniser, trained on several datasets.
* The ```en_core_web_trf``` model is based on the [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta) pre-trained transformer model, fine-tuned with the same datasets as used for ```en_core_web_lg``` training. RoBERTa is a variant of BERT. spaCy sources its transformer models from Hugging Face.

You can find out more about the models and the data used to train them in the [spaCy model documentation](https://spacy.io/models/en).


In [None]:
# Download spacy models
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_trf

# Get the data

* The dataset is sourced from here: [CoNLL 2003 dataset](https://www.clips.uantwerpen.be/conll2003/ner/)
* Its original construction and use is described in this paper: [CoNLL 2003 NER shared task](https://aclanthology.org/W03-0419/)


In [None]:
# Get the data - wget is a web client, useful for
# downloading web pages or files hosted on the web
!wget https://github.com/KCL-Health-NLP/nlp_examples/raw/master/ann/conll2003_test.txt

In [None]:
# Take a look at our test data - head is a Unix command to
# retrieve the first n lines of a file
!head -100 ./conll2003_test.txt

Recall the spaCy NER labels, and compare them to the CoNLL ones shown above. spaCy covers more entity types, uses different names, and has slightly different definitions from CoNLL. The following list shows the equivalences for the three types we will evaluate (CoNLL's fourth type, MISC, has no single spaCy counterpart):

* **Location:** CoNLL LOC == most are spaCy GPE, with some being spaCy LOC
* **Person:** CoNLL PER == spaCy PERSON
* **Organisation:** CoNLL ORG == spaCy ORG

So that we can evaluate against the CoNLL data, we will map CoNLL entity types to spaCy types. We could do this in Python, but it is quicker and easier to use another Unix command line tool, ```sed```. This is a stream editor which will replace one string with another in a stream of text. The code below shows this for converting ```-LOC``` to ```-GPE```.

**Exercise**

Complete the below code cell to add another ```sed``` line that converts CoNLL PER tags to PERSON tags.

In [None]:
# sed stream editing
# -i means "inplace", i.e. edit and save back to the same file
# g at the end of the pattern means replace globally
# NB run this cell once only: a second run would turn
# -PERSON into -PERSONSON
!sed -i 's/-LOC/-GPE/g' conll2003_test.txt
!sed -i 's/-PER/-PERSON/g' conll2003_test.txt

Did it work?

In [None]:
# Take another look at our test data
!head -100 ./conll2003_test.txt
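
As noted above, the same mapping could be done in Python. The following is a minimal sketch, not the notebook's method: `TAG_MAP` and `map_tags` are hypothetical names, the sample lines are invented for illustration, and it assumes the NER tag is the last column of each line. Unlike the ```sed``` commands, it only rewrites a tag at the end of a line, so running it twice leaves an already converted tag alone.

```python
import re

# CoNLL suffix -> spaCy suffix, following the equivalences listed earlier.
TAG_MAP = {"LOC": "GPE", "PER": "PERSON"}

# Match the tag suffix only at the end of the line, so "B-PERSON"
# is not rewritten again on a second pass.
_TAG_RE = re.compile(r"-(" + "|".join(TAG_MAP) + r")$")

def map_tags(line):
    """Rewrite the NER tag at the end of one CoNLL-style line."""
    return _TAG_RE.sub(lambda m: "-" + TAG_MAP[m.group(1)], line.rstrip("\n"))

# Invented sample lines in an assumed "token POS chunk NER" layout.
sample = ["Tokyo NNP B-NP B-LOC", "Smith NNP B-NP B-PER", "said VBD B-VP O"]
converted = [map_tags(s) for s in sample]
print(converted)
```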

Next, we convert the data to .spacy serialised DocBin format, as in the earlier spaCy NER practical notebook.

In [None]:
# Convert to .spacy DocBin format
!python -m spacy convert ./conll2003_test.txt . -c ner -n 10
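
To see how per-token tags relate to the entity spans that end up in each document, here is a rough sketch of grouping IOB-style tags into spans. It is an illustration only and is not spaCy's converter code; `iob_to_spans` is a hypothetical helper and the tag sequence is invented.

```python
def iob_to_spans(tags):
    """Group IOB tags into (start, end, label) spans; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        # An open span closes on "O", on a "B-" tag, or on a label change.
        if start is not None and (prefix != "I" or kind != label):
            spans.append((start, i, label))
            start, label = None, None
        # A span opens on "B-", or on an "I-" tag when no span is open.
        if prefix in ("B", "I") and start is None:
            start, label = i, kind
    # Close a span that runs to the end of the sequence.
    if start is not None:
        spans.append((start, len(tags), label))
    return spans

# Invented tag sequence for a seven-token sentence.
tags = ["B-PERSON", "I-PERSON", "O", "B-GPE", "B-GPE", "O", "I-ORG"]
spans = iob_to_spans(tags)
print(spans)
```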


## Load the spaCy models

**NB if you did not restart your runtime after installing spaCy (first code cell), then you will not have the spaCy transformers library, and loading the trf model will fail.**

In [None]:
# Use the GPU if available; must be called before loading the models
spacy.prefer_gpu()
# Load spacy models
nlp_lg = spacy.load('en_core_web_lg')
nlp_tr = spacy.load('en_core_web_trf')

## Load the data into spaCy

We now de-serialise the data into a DocBin, and take a look at it.

In [None]:
docs = DocBin().from_disk("./conll2003_test.spacy")

In [None]:
print(len(docs))

In [None]:
for doc in docs.get_docs(nlp_lg.vocab):
  print(doc.ents)


## Scoring the Named Entity Recognition

We will write a function to run a pipeline over all of the documents in a DocBin, and to compare each one to the CoNLL gold standard named entities. This is very similar to the scoring in our previous spaCy NER notebook.

**Exercise**

Complete the scoring function given below, using the comments to guide you.

In [None]:
# Run pipeline over the text of each gold standard document
# in docs. For each document, add both the predicted
# version and the gold standard version to an Example
# object.
# Return a tuple containing:
# (1) the results of running a Scorer over all examples
# (2) a list of all examples 
def run_and_score_nlp(docs, pipeline):

  # A Scorer to do our scoring at the end
  scorer = Scorer()

  # A list in which to store the Examples
  examples = []

  # Iterate over the documents
  for gold_doc in docs.get_docs(pipeline.vocab):    # COMPLETE THIS LINE

    # Create the predicted document from
    # the gold standard text
    pred_doc = pipeline(gold_doc.text)              # COMPLETE THIS LINE

    # Create the Example
    ex = Example(pred_doc, gold_doc)                # COMPLETE THIS LINE

    # Add the Example to the examples list
    examples.append(ex)                             # COMPLETE THIS LINE

  return (scorer.score(examples), examples)         # COMPLETE THIS LINE
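
The scores returned for each entity type are precision, recall and F-score. As a rough illustration of what these numbers mean, the sketch below computes them by hand for a pair of span sets. It assumes that a predicted entity counts as correct only when its start, end and label all match a gold entity exactly; `prf_per_type` is a hypothetical helper, not spaCy's implementation, and the spans are invented.

```python
from collections import Counter

def prf_per_type(gold, pred):
    """Exact-match precision, recall and F1 per label.

    gold and pred are sets of (start, end, label) tuples.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for span in pred:
        # A prediction is a true positive only if it is also in gold.
        (tp if span in gold else fp)[span[2]] += 1
    for span in gold - pred:
        # Gold spans that were not predicted are false negatives.
        fn[span[2]] += 1
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"p": p, "r": r, "f": f}
    return scores

# Invented spans for one short text.
gold = {(0, 2, "PERSON"), (5, 6, "GPE"), (8, 10, "ORG")}
pred = {(0, 2, "PERSON"), (5, 6, "ORG"), (8, 10, "ORG")}
toy_scores = prf_per_type(gold, pred)
print(toy_scores)
```

In this toy case the GPE entity was predicted with the wrong label, so it costs one false negative for GPE and one false positive for ORG.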

Now we can score our two pipelines, using the function we wrote above. Remember, it returns a tuple: the scores and the examples.

We've also added in a few lines to time each one, in nanoseconds. Timing like this is a bit crude, but may be interesting.

In [None]:
import time

start = time.process_time_ns()
scores_lg, examples_lg = run_and_score_nlp(docs, nlp_lg)
print("LG time:", time.process_time_ns() - start)

start = time.process_time_ns()
scores_tr, examples_tr = run_and_score_nlp(docs, nlp_tr)
print("TR time:", time.process_time_ns() - start)
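
Note that ```time.process_time_ns()``` measures CPU time used by the current process, so it may not reflect time spent waiting on other things, such as a GPU. If elapsed wall-clock time is of more interest, a sketch along the following lines could be used instead. `timed` is a hypothetical helper, and the call to `sum` is only a stand-in workload for the scoring function.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in this notebook it would be something like
# timed(run_and_score_nlp, docs, nlp_lg).
result, seconds = timed(sum, range(1_000_000))
print(f"elapsed: {seconds:.3f} s")
```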


We have the scores: let's print them out.

**Exercise**

The scores data structure is a nested dictionary, like this:

```scores['ents_per_type'][LABEL-NAME][METRIC-NAME]```

Complete the code below to print out the scores for each of the listed labels and metrics.


In [None]:
# Lists of labels and metrics to score
labels = ['GPE', 'ORG', 'PERSON']
metrics = ['p', 'r', 'f']

print(f'{"label": <18}{"score": <8}{"lg": <6}{"tr": <6}')

# Iterate over the labels
for l in labels:                                          # COMPLETE THIS CODE

  # Iterate over the metrics
  for m in metrics:                                       # COMPLETE THIS CODE

    # Retrieve the large model's metric
    lg = scores_lg['ents_per_type'][l][m]                 # COMPLETE THIS CODE

    # Retrieve the transformer model metric
    tr = scores_tr['ents_per_type'][l][m]                 # COMPLETE THIS CODE

    # Print the two scores
    print(f'{l: <18}{m: <8}{lg: <6.2f}{tr: <6.2f}')       # COMPLETE THIS CODE
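
Direct indexing into the nested dictionary raises a ```KeyError``` if a label is absent, which may happen if a label never occurs in the data being scored. A more defensive lookup is sketched below. `get_score` is a hypothetical helper, and the numbers in `demo_scores` are made-up placeholders in the same nested shape, not real results.

```python
import math

def get_score(scores, label, metric):
    """Return scores['ents_per_type'][label][metric], or NaN if absent."""
    per_type = scores.get("ents_per_type") or {}
    return per_type.get(label, {}).get(metric, math.nan)

# Made-up placeholder values, only to show the shape of the structure.
demo_scores = {"ents_per_type": {"ORG": {"p": 0.5, "r": 0.25, "f": 0.33}}}

for label in ["ORG", "PERSON"]:
    row = "".join(f"{get_score(demo_scores, label, m): <6.2f}" for m in ["p", "r", "f"])
    print(f"{label: <18}{row}")
```

Missing entries print as `nan` rather than stopping the loop.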

## Display some documents

Finally, let's take a look at some documents, using displacy. We will choose a document number, and render the transformer, large, and gold standard annotations for this document.

Before rendering, we will add titles to our document data. displacy will render these as a heading on each document, to help us distinguish them.

In [None]:
# Retrieve documents
# Add titles to documents - displacy will render these titles.

doc_num = 150

# transformer model
doc_tr = examples_tr[doc_num].predicted
doc_tr.user_data["title"] = "Transformer model predictions"

# large model
doc_lg = examples_lg[doc_num].predicted
doc_lg.user_data["title"] = "Large model predictions"

# gold standard
doc_ref = examples_tr[doc_num].reference
doc_ref.user_data["title"] = "Gold standard"



**Exercise**

Write code to render all three documents. You will need to make sure you only display the entity labels we are interested in. See the [displacy documentation](https://spacy.io/usage/visualizers#ent) for information on how to do this. Put a line of dashes or some newlines between each document.

In [None]:
# Display in displacy
displacy.render(doc_tr, style='ent', jupyter=True, options={'ents':labels})
print('\n'*2)
displacy.render(doc_lg, style='ent', jupyter=True, options={'ents':labels})
print('\n'*2)
displacy.render(doc_ref, style='ent', jupyter=True, options={'ents':labels})