<a href="https://colab.research.google.com/github/KCL-Health-NLP/nlp_examples/blob/master/ann/fine_tune_transformer_with_spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine tune spaCy's default transformer for medical NER

This notebook uses the same data as the spaCy NER notebook and follows much the same pattern. See that notebook for more detail.

The setup is fragile and depends heavily on the versions of CUDA, PyTorch and spaCy.

Training is time consuming. One run:
* used CUDA
* had 122 documents in the training set
* was aborted after:
  * 1 hour 35 mins
  * 171 epochs
  * 8400 instances
  * P 0.80, R 0.66
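
For a rough sense of scale, the figures above can be turned into an F1 score and a time per epoch. This is a small sketch using only the numbers listed; the helper name `f1_score` is ours and is not part of spaCy.

```python
# Figures reported above for the aborted run
precision = 0.80
recall = 0.66
minutes = 1 * 60 + 35   # 1 hour 35 mins
epochs = 171

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

f1 = f1_score(precision, recall)
seconds_per_epoch = minutes * 60 / epochs

print(f"F1: {f1:.2f}")
print(f"Seconds per epoch: {seconds_per_epoch:.1f}")
```

That works out to an F1 of about 0.72, at roughly 33 seconds per epoch.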

See the [spaCy Embeddings and Transformers guide](https://spacy.io/usage/embeddings-transformers) for up-to-date information on installation with CUDA.

See the [spaCy installation guide](https://spacy.io/usage#installation) for information on installing spaCy for GPU use.

See [this discussion](https://github.com/explosion/spaCy/discussions/12353) on which versions of CUDA and PyTorch to use, as of March 2023. At that point, the recommendation was to use CUDA 11.8. That did not work here, so the cells below use CUDA 11.3 instead.



## ***Make sure your runtime is using a GPU***

In [None]:
# Install cuda
!sed -i '/developer\.download\.nvidia\.com\/compute\/cuda\/repos/d' /etc/apt/sources.list.d/*
!sed -i '/developer\.download\.nvidia\.com\/compute\/machine-learning\/repos/d' /etc/apt/sources.list.d/*
!wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
!dpkg -i cuda-keyring_1.0-1_all.deb
!apt-get update
!apt-get -y install cuda-11.3

## ***Now restart the runtime***

In [None]:
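# Set the CUDA path. A "!export" runs in a throwaway subshell and
# would not persist, so set it in the notebook's own environment.
import os
os.environ["CUDA_PATH"] = "/usr/local/cuda-11.3"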

In [None]:
# install torch
!pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

In [None]:
# install spaCy with the extras for our
# CUDA version and transformers
# (quoted so the shell does not interpret the brackets)
!pip install -U 'spacy[cuda113,transformers]'

In [None]:
# get the data for training
import requests
data_url = 'https://github.com/KCL-Health-NLP/nlp_examples/blob/master/chunking/chunking_traindata_CAT_updated_2021.json?raw=true'
r = requests.get(data_url)
data = r.json()
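
The later cells unpack each item of `data` as a `(text, annotations)` pair, where `annotations["entities"]` holds `[start, end, label]` triples of character offsets. A minimal sketch of that shape follows, with a sanity check on the offsets. The example record, its label, and the helper `check_record` are invented for illustration and are not taken from the real dataset.

```python
# Hypothetical record in the shape the notebook expects:
# [text, {"entities": [[start, end, label], ...]}]
example = [
    "Patient denies chest pain.",
    {"entities": [[15, 25, "SYMPTOM"]]},
]

def check_record(record):
    """Return (entity text, label) pairs, raising if an offset is out of range."""
    text, annot = record
    found = []
    for start, end, label in annot["entities"]:
        if not 0 <= start < end <= len(text):
            raise ValueError(f"Bad offsets [{start}, {end}] for label {label}")
        found.append((text[start:end], label))
    return found

print(check_record(example))
```

Running a check like this over the whole corpus before building the DocBins can surface offset problems early.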

In [None]:
# How much data is there in total, before splitting?
print(len(data))

In [None]:
# We use sklearn to split our data into train
# and dev portions (we have a separate, held-out
# final test set)
from sklearn.model_selection import train_test_split

train_data, dev_data = train_test_split(data, train_size=0.8) 
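
As written, the split is shuffled differently on every run, so scores from separate runs are not directly comparable. Passing a fixed `random_state` to `train_test_split` makes it repeatable. The idea is sketched below in plain Python rather than sklearn; `split_train_dev`, the seed, and the stand-in documents are assumptions for illustration.

```python
import random

def split_train_dev(items, train_size=0.8, seed=0):
    """Shuffle a copy of items with a fixed seed, then cut at train_size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]

docs = [f"doc{i}" for i in range(10)]   # stand-in for the real data
train, dev = split_train_dev(docs)
print(len(train), len(dev))

# The same seed always gives the same split
assert split_train_dev(docs) == (train, dev)
```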

In [None]:
# spacy for creating docbins
import spacy
nlp = spacy.blank('en')

# DocBin is a serialisable collection of spacy
# Documents.
from spacy.tokens import DocBin

# We will generate warnings for some things.
# Uncomment the filter below to ignore them.
import warnings
#warnings.filterwarnings('ignore')

In [None]:
# A DocBin is a serialisable spaCy container that holds
# spaCy documents, and which can be used in spaCy training.
# This function converts our data format into a DocBin
def data_to_docbin(json_corpus):
  
  # We create a DocBin to hold our Documents
  db = DocBin()
  
  # The json_corpus contains text and annotations
  for text, annot in json_corpus:

    # Create a Document object from the text.
    # This will contain the tokens and
    # their spans
    doc = nlp(text)

    # Now let's get the entities into a list
    ents = []

    # The annotations from our data have a start offset,
    # an end offset and a label
    for start, end, label in annot["entities"]:

      # Make a span in our document for these
      span = doc.char_span(start, end, label=label)

      # If the Document can't align the character offsets with tokens,
      # it will return None. We will ignore any entities like this,
      # as they could break our training
      if span is None:
        warnings.warn(f'Skipping entity [{start}, {end}, {label}] : span does not align with token boundaries')
      else:
        ents.append(span)

    # Add the entities to the document
    # and add the document to the DocBin
    doc.set_ents(ents)
    db.add(doc)

  # return the DocBin containing all the Documents
  # with their text and entities
  return db
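
The `span is None` branch fires when an annotation's character offsets start or stop in the middle of a token. The sketch below illustrates that check without spaCy, using whitespace tokens as a deliberately simplified stand-in (spaCy's tokenizer also splits off punctuation, so real results can differ). The helpers `token_boundaries` and `aligns` are ours.

```python
import re

def token_boundaries(text):
    """Start and end offsets of whitespace-separated tokens."""
    starts, ends = set(), set()
    for m in re.finditer(r"\S+", text):
        starts.add(m.start())
        ends.add(m.end())
    return starts, ends

def aligns(text, start, end):
    """True if [start, end) begins and ends on token boundaries."""
    starts, ends = token_boundaries(text)
    return start in starts and end in ends

text = "severe headache today"
print(aligns(text, 7, 15))   # "headache": a whole token
print(aligns(text, 7, 12))   # "heada": stops mid-token
```

If many entities are being skipped, `Doc.char_span` also accepts an `alignment_mode` argument (`"strict"` by default, or `"contract"` / `"expand"`) that snaps offsets to token boundaries instead of returning `None`. Whether that is appropriate depends on how the annotations were made.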

In [None]:
# convert data and save to disk

train_doc_bin = data_to_docbin(train_data)
train_doc_bin.to_disk("./train.spacy")

dev_doc_bin = data_to_docbin(dev_data)
dev_doc_bin.to_disk("./dev.spacy") 


In [None]:
# Running spacy init can fail with a locale error; this is a quick fix
# that forces the preferred encoding to UTF-8.
# Code from https://github.com/explosion/spaCy/issues/11909
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

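## ***Now get your spaCy base config file, for transformers and NER***

Save it as `base_config.cfg` in the working directory, where the next cell expects to find it. The quickstart widget in the spaCy training documentation can generate one.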

In [None]:
# Fill in the base config with defaults to produce the full config.cfg
!python -m spacy init fill-config base_config.cfg config.cfg
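
Before committing to a long training run, it can be worth confirming that the config really lists the transformer and NER components. The sketch below does this with the standard library on an inline fragment. The fragment is an assumed example of what an `[nlp]` section for this setup might look like, not the output of the command above, and `pipeline_components` is our helper.

```python
import configparser
import json

# Assumed fragment; in the notebook you would read config.cfg instead
fragment = """
[nlp]
lang = "en"
pipeline = ["transformer","ner"]
"""

def pipeline_components(cfg_text):
    """Return the component names listed under [nlp] pipeline."""
    # interpolation=None leaves ${...} references in other values untouched
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    # spaCy config values are JSON-style, so the list parses with json
    return json.loads(parser["nlp"]["pipeline"])

components = pipeline_components(fragment)
print(components)
```

With spaCy installed, `spacy.util.load_config("config.cfg")` is the more direct route to the same information.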

In [None]:
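# Check that torch can see the GPU
import torch
print('CUDA available:', torch.cuda.is_available())
print('Number of CUDA devices:', torch.cuda.device_count())
# current_device() raises an error when no GPU is visible, so guard it
if torch.cuda.is_available():
    print('Current CUDA device:', torch.cuda.current_device())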

In [None]:
# The flag -g 0 targets GPU number 0 (i.e. the first GPU)
!python -m spacy train -g 0 config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy