[![Colab Badge Link](https://img.shields.io/badge/open-in%20colab-blue)](https://colab.research.google.com/github/Glasgow-AI4BioMed/tutorials/blob/main/custom_token_classification_models.ipynb)

# Creating a custom token classification model

This notebook illustrates how to create a custom Transformer model that is compatible with the [Huggingface trainer](https://huggingface.co/docs/transformers/main_classes/trainer). The model uses an intermediate hidden state of a Transformer model (rather than the final hidden state) for a token classification task.

## Install dependencies

If needed, you can install the main dependency with the command below. The notebook also imports `torch` and `scipy`, so those need to be available too.

```
pip install transformers
```

## Tokenize some text

We'll work with a single example. First we need a tokenizer:

In [1]:
from transformers import AutoTokenizer

MODEL_NAME = 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

Let's tokenize an example sentence.

In [2]:
text = 'The quick brown fox jumps over the lazy dog.'

encoded = tokenizer(text, max_length=512, padding=True, truncation=True, return_tensors='pt')

encoded

{'input_ids': tensor([[    2,  1680,  8787, 10418,  9596, 15163,  1026,  2150,  1680,  5736,
          1035,  1005,  4840,    17,     3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

How many tokens do we have? It looks like 15.

In [3]:
encoded['input_ids'].shape

torch.Size([1, 15])

We also need some dummy labels to act as the desired targets for our model. There are 15 tokens in the sequence and we need one label for each token. In a biomedical NER task, the labels could correspond to tags such as `[O, B-DRUG, I-DRUG, etc]`. Arbitrarily, let's say there are nine unique labels.

In [4]:
num_labels = 9
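
As a rough illustration of what nine labels might look like in a BIO-style tagging scheme, the sketch below builds the two lookup tables that usually accompany a token classification model. The entity types are invented for this example and do not come from any particular dataset.

```python
# Hypothetical label set: 'O' (outside any entity) plus B-/I- tags
# for four made-up entity types, which gives nine labels in total.
entity_types = ['DRUG', 'DISEASE', 'GENE', 'SPECIES']

label_names = ['O']
for entity_type in entity_types:
  label_names.append(f'B-{entity_type}')  # first token of an entity
  label_names.append(f'I-{entity_type}')  # continuation of an entity

# The model and the loss function only ever see the integer ids
label2id = {name: i for i, name in enumerate(label_names)}
id2label = {i: name for name, i in label2id.items()}

print(len(label_names))
print(label2id['B-DRUG'], id2label[2])
```

The integers produced by `torch.randint` in the next cell play the role of the ids in `label2id`.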

Let's create some labels randomly using [torch.randint](https://pytorch.org/docs/stable/generated/torch.randint.html).

In [5]:
import torch

labels = torch.randint(low=0, high=num_labels, size=(1,15))
labels.shape

torch.Size([1, 15])

For realism, some tokens shouldn't have labels, such as the `[CLS]` and `[SEP]` tokens used in many BERT models. In this case, they are at the beginning and end of the sequence. To tell the model to ignore these (and not factor them into any calculations), you can set their labels to the special value `-100`. The loss function that we'll use later ([CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)) skips targets equal to `-100` by default, as that is the default value of its `ignore_index` parameter.

In [6]:
labels[0,0] = -100
labels[0,-1] = -100

And finally what do our made-up labels look like?

In [7]:
labels

tensor([[-100,    2,    1,    1,    2,    8,    2,    5,    5,    5,    8,    7,
            8,    4, -100]])
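
To make the effect of `-100` concrete, here is a small sketch that uses random stand-in logits (not real model output). It checks that the loss computed with the ignored positions left in place matches the loss computed after removing those positions by hand.

```python
import torch

torch.manual_seed(0)

num_labels = 9
seq_len = 15

# Random stand-ins for illustration only
logits = torch.randn(seq_len, num_labels)
labels = torch.randint(low=0, high=num_labels, size=(seq_len,))
labels[0] = -100   # e.g. the [CLS] position
labels[-1] = -100  # e.g. the [SEP] position

# ignore_index defaults to -100, so no extra arguments are needed
loss_func = torch.nn.CrossEntropyLoss()
loss_with_ignored = loss_func(logits, labels)

# The same loss, computed only over positions that have a real label
keep = labels != -100
loss_on_kept = loss_func(logits[keep], labels[keep])

print(loss_with_ignored, loss_on_kept)
```

Both values should agree, because the default `mean` reduction averages only over the positions that are not ignored.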

## Examining the AutoModelForTokenClassification

Now that we've got some tokenized text and some made-up labels, let's see what happens when we put them through a standard `AutoModelForTokenClassification`. Our eventual model should give the same type of output as this.

Let's create an `AutoModelForTokenClassification` and pass in the number of labels to be predicted.

In [8]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Normally, you would then fine-tune this model with data before using it. But let's just use it now and see the type of the output. The actual output will be nonsense as there hasn't been any fine-tuning.

Practically, when a model is trained, it is given data along with the labels. So let's take the tokenized text and add in the labels.

In [9]:
encoded_with_labels = dict(encoded)
encoded_with_labels['labels'] = labels

Then we pass in the tokenized text with the labels and examine what the model returns.

In [10]:
output = model(**encoded_with_labels)

First, what is the type of the object returned?

In [11]:
type(output)

transformers.modeling_outputs.TokenClassifierOutput

It is a [TokenClassifierOutput](https://huggingface.co/docs/transformers/main_classes/output#transformers.modeling_outputs.TokenClassifierOutput) which wraps various bits of information.

Let's just print it out and see what it gives:

In [12]:
output

TokenClassifierOutput(loss=tensor(2.7876, grad_fn=<NllLossBackward0>), logits=tensor([[[ 1.6239,  0.4530, -0.2071, -0.0514, -0.4117,  0.5804,  0.4853,
          -2.3014, -1.7350],
         [ 0.9774,  0.3130,  0.0528,  0.1127, -0.5041,  0.4156,  0.4323,
          -2.0229, -1.5194],
         [ 2.1344,  0.5107, -0.3481, -0.2140, -0.1960,  0.7112,  0.2396,
          -2.1424, -1.4772],
         [ 1.4910,  0.4706, -0.1716,  0.1899, -0.3729,  0.3282,  0.3491,
          -1.8324, -1.7708],
         [ 1.0691,  0.6996, -0.2954,  0.4707, -0.2819,  0.5988,  0.4125,
          -1.9505, -1.3610],
         [ 1.1215,  0.5364,  0.0738,  0.3399, -0.4563,  0.1871,  0.4104,
          -2.1555, -1.7678],
         [ 1.0024,  0.4320,  0.0317,  0.3498, -0.6502,  0.4121,  0.1683,
          -2.0321, -1.6957],
         [ 1.2773,  0.2202,  0.0600,  0.0503, -0.6146,  0.5854,  0.6330,
          -1.7299, -0.7957],
         [ 1.3225,  0.2818,  0.0949,  0.0964, -0.5555,  0.2897,  0.3718,
          -2.0948, -1.4673],
    

There are two important things in this object:
 - **loss**: This is the loss that the fine-tuning will try to minimise.
 - **logits**: These are the raw (unnormalised) scores that the model outputs for each label at each token.

Let's examine each. First, the logits are a [torch.Tensor](https://pytorch.org/docs/stable/tensors.html). Let's see its dimensions:

In [13]:
output.logits.shape

torch.Size([1, 15, 9])

The dimensions are explained below:
  - 1: We've only given a single input text
  - 15: The length of the input sequence
  - 9: The number of labels

For our custom model, we will want to output a tensor of this same dimension for the same input: `[1, 15, 9]`

There is a score for each of the possible nine labels. Let's see the scores for the first token in the sequence:

In [14]:
output.logits[0,0,:]

tensor([ 1.6239,  0.4530, -0.2071, -0.0514, -0.4117,  0.5804,  0.4853, -2.3014,
        -1.7350], grad_fn=<SliceBackward0>)

You could use `.argmax` to find the label that has the highest score. We don't need to do that here.
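
For reference, a minimal sketch of that `.argmax` step is below. The first tensor copies the scores printed above; the batch tensor is a random stand-in with the same shape as the real logits.

```python
import torch

# Scores for the first token, copied from the output above
token_logits = torch.tensor([1.6239, 0.4530, -0.2071, -0.0514, -0.4117,
                             0.5804, 0.4853, -2.3014, -1.7350])

# Index of the highest-scoring label for this one token
predicted_label = token_logits.argmax().item()
print(predicted_label)

# For a whole batch, take the argmax over the last (label) dimension
batch_logits = torch.randn(1, 15, 9)  # random stand-in for output.logits
predicted_labels = batch_logits.argmax(dim=-1)
print(predicted_labels.shape)
```

The result for a batch has one predicted label id per token, so its shape matches the shape of `labels`.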

As an aside, these scores are often passed through a [softmax](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html) to turn them into probabilities between 0 and 1 that sum to 1.

In [15]:
from scipy.special import softmax

softmax(output.logits[0,0,:].tolist())

array([0.39757951, 0.12328794, 0.06371286, 0.07444804, 0.05192478,
       0.14003358, 0.1273403 , 0.00784731, 0.01382568])
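
The same calculation can be done without leaving PyTorch. The sketch below copies the scores for the first token from the output above and applies `torch.softmax` over the label dimension.

```python
import torch

# Scores for the first token, copied from the output above
token_logits = torch.tensor([1.6239, 0.4530, -0.2071, -0.0514, -0.4117,
                             0.5804, 0.4853, -2.3014, -1.7350])

# Softmax over the label dimension
probs = torch.softmax(token_logits, dim=-1)

print(probs)
print(probs.sum())
```

The probabilities should match the `scipy` result up to rounding, and the largest one still belongs to the label with the largest logit.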

The other important output from this `AutoModelForTokenClassification` is the loss. This is a single number that the fine-tuning tries to minimise. It is calculated using the `logits` above when compared against the provided target `labels`.

In [16]:
output.loss

tensor(2.7876, grad_fn=<NllLossBackward0>)
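
To show how the logits and labels combine into that single number, here is a sketch with random stand-in values (so the result will not be 2.7876). It computes the average negative log-probability of the target label over the tokens that have a real label, and compares it with PyTorch's built-in cross-entropy.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Random stand-ins: 15 tokens, 9 labels
logits = torch.randn(15, 9)
labels = torch.randint(low=0, high=9, size=(15,))
labels[0] = -100
labels[-1] = -100

# Log-probability of every label at every token
log_probs = F.log_softmax(logits, dim=-1)

# Pick out the log-probability of the target label for each token.
# The -100 entries are clamped to 0 so indexing works, then masked out.
mask = labels != -100
safe_labels = labels.clamp(min=0)
picked = log_probs.gather(-1, safe_labels.unsqueeze(-1)).squeeze(-1)

manual_loss = -picked[mask].mean()
reference_loss = F.cross_entropy(logits, labels)

print(manual_loss, reference_loss)
```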

## Create a custom TokenClassifierOutput

Now let's say that we want to make our own model that can be used for TokenClassification but does things slightly differently.

We might start off with a general-purpose `AutoModel` that doesn't have a final task-specific layer on it. If we want access to all the hidden states, we can provide `output_hidden_states=True`.

In [17]:
from transformers import AutoModel

model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)

Then we can get the output of this model and rework it for what we need it to do.

In [18]:
output = model(**encoded)

We can get access to all the hidden states. Let's see how many and their dimensions.

In [19]:
for i,hidden_state in enumerate(output.hidden_states):
  print(i, hidden_state.shape)

0 torch.Size([1, 15, 768])
1 torch.Size([1, 15, 768])
2 torch.Size([1, 15, 768])
3 torch.Size([1, 15, 768])
4 torch.Size([1, 15, 768])
5 torch.Size([1, 15, 768])
6 torch.Size([1, 15, 768])
7 torch.Size([1, 15, 768])
8 torch.Size([1, 15, 768])
9 torch.Size([1, 15, 768])
10 torch.Size([1, 15, 768])
11 torch.Size([1, 15, 768])
12 torch.Size([1, 15, 768])


There are 12 hidden layers in a standard BERT model, so why are there 13 hidden states? The first hidden state (index 0) is the output of the embedding layer, which is the input to the first Transformer layer. It is followed by the output of each of the 12 layers, which comes to 13 sets of context vectors. All the context vectors are of dimension 768, which is common for standard BERT models.
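
The "layers plus one" pattern can be checked on a much smaller model. The sketch below builds a tiny, randomly initialised BERT from a `BertConfig` (nothing is downloaded, and the sizes are arbitrary choices for this example) and asks for the hidden states at call time.

```python
import torch
from transformers import BertConfig, BertModel

# A deliberately tiny BERT with random weights (sizes chosen arbitrarily)
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
tiny_model = BertModel(config)
tiny_model.eval()  # switch off dropout so the outputs are deterministic

# Random token ids standing in for a tokenized sentence of 15 tokens
input_ids = torch.randint(low=0, high=100, size=(1, 15))

with torch.no_grad():
  tiny_output = tiny_model(input_ids=input_ids, output_hidden_states=True)

# Two layers give three hidden states: embeddings + one per layer
print(len(tiny_output.hidden_states))

# The last hidden state in the tuple is the model's final output
print(torch.allclose(tiny_output.hidden_states[-1], tiny_output.last_hidden_state))
```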

Now our target shape is `[1, 15, 9]`. Each of the hidden states is almost there with a shape of `[1, 15, 768]`, but we need to make the last dimension smaller. For this, we can use a fully-connected linear layer to go from 768 down to 9:

In [20]:
import torch

linear = torch.nn.Linear(768, num_labels)

If we apply the linear layer to the final hidden state, we get the logits with the desired shape.

In [21]:
logits = linear(output.hidden_states[-1])
logits.shape

torch.Size([1, 15, 9])

But we could also apply it to one of the intermediate hidden states!

In [22]:
logits = linear(output.hidden_states[5])
logits.shape

torch.Size([1, 15, 9])
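
It may help to see that the linear layer only acts on the last dimension, so every token is scored independently with the same weights. The sketch below uses a random stand-in for a hidden state and compares scoring the whole sequence with scoring a single token.

```python
import torch

torch.manual_seed(0)

hidden_state = torch.randn(1, 15, 768)  # random stand-in for one hidden state
linear = torch.nn.Linear(768, 9)

# Apply the layer to the whole sequence at once
all_logits = linear(hidden_state)

# Apply the same layer to the context vector of a single token
one_token_logits = linear(hidden_state[0, 3])

print(all_logits.shape, one_token_logits.shape)
print(torch.allclose(all_logits[0, 3], one_token_logits, atol=1e-4))
```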

At the moment, the linear layer is not fine-tuned, so the output logits would be meaningless. But with fine-tuning, these logits could give us the scores for each of the nine labels, with the highest score being the predicted label for that token.

To effectively train it, we need to calculate the loss between provided labels and the model's current logits for that input. Then the training process can slowly move the logits towards the desired labels. So how do we calculate the loss?

First, remember what the labels look like. We've got one input sequence and an integer representing the labels for each of the fifteen tokens.

In [23]:
labels.shape

torch.Size([1, 15])

Now to calculate the loss, we use [CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) which is used for multi-class classification problems. It expects two inputs:

- The logits in the shape of (sample_count, num_labels)
- The labels (as integers) in the shape (sample_count).

We can use [.reshape](https://pytorch.org/docs/stable/generated/torch.reshape.html) as below to adjust the shapes accordingly. And recall that CrossEntropyLoss will nicely ignore the tokens with `-100` labels as they shouldn't contribute to the loss.

In [24]:
loss_func = torch.nn.CrossEntropyLoss()
loss = loss_func(logits.reshape(-1,num_labels), labels.reshape(-1))
loss

tensor(2.3545, grad_fn=<NllLossBackward0>)
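
The reshaping step is easier to follow with the shapes printed out. The sketch below uses random stand-in values with a batch of two sequences, and also shows an alternative layout that `CrossEntropyLoss` accepts, where the label dimension comes second. Either form should give the same loss.

```python
import torch

torch.manual_seed(0)

num_labels = 9

# Random stand-ins: a batch of 2 sequences with 15 tokens each
logits = torch.randn(2, 15, num_labels)
labels = torch.randint(low=0, high=num_labels, size=(2, 15))
labels[:, 0] = -100
labels[:, -1] = -100

loss_func = torch.nn.CrossEntropyLoss()

# Flatten the batch and sequence dimensions into one "sample" dimension
flat_logits = logits.reshape(-1, num_labels)
flat_labels = labels.reshape(-1)
print(flat_logits.shape, flat_labels.shape)
loss_flat = loss_func(flat_logits, flat_labels)

# Alternative: keep the batch dimension and move the labels to dimension 1,
# giving logits of shape (batch, num_labels, seq_len)
loss_transposed = loss_func(logits.transpose(1, 2), labels)

print(loss_flat, loss_transposed)
```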

Now that we've calculated the logits and loss, we can create a `TokenClassifierOutput` object that encapsulates them. This gives us an output similar to that of `AutoModelForTokenClassification`.

In [25]:
from transformers.modeling_outputs import TokenClassifierOutput

TokenClassifierOutput(loss=loss, logits=logits)

TokenClassifierOutput(loss=tensor(2.3545, grad_fn=<NllLossBackward0>), logits=tensor([[[-0.7257,  0.1261,  0.2513, -0.1950, -0.4206, -0.0177,  0.9735,
          -0.7954, -0.6908],
         [-0.5863,  0.3909,  0.3255, -0.6236, -0.1923,  0.0904,  0.8039,
          -0.6539, -0.8246],
         [ 0.1443,  0.1452,  0.0357, -0.5384,  0.4453, -0.2895,  0.1985,
          -0.8318, -0.6544],
         [-0.6492, -0.0659,  0.7838, -0.2551, -0.1202,  0.0188,  1.1917,
          -0.5900, -0.9259],
         [-0.5200,  0.4048,  0.4674, -0.5886, -0.1336,  0.2274,  0.8792,
          -0.5888, -0.5039],
         [-0.1510, -0.3334, -0.0940, -1.3177, -0.0819,  0.2822,  0.5376,
          -0.7423, -0.6280],
         [-0.2342, -0.2151, -0.1154, -1.1312, -0.4490,  0.6518,  1.0108,
           0.0335, -0.6501],
         [-0.6722, -0.3545, -0.1431, -0.9318, -0.0070, -0.0715,  1.0794,
          -0.7420, -0.7918],
         [-0.5958,  0.0806,  0.3105, -0.6504, -0.2293,  0.0511,  0.8867,
          -0.7367, -0.8476],
    

## Creating a custom model

To actually use this custom approach, we need to wrap it up as a `torch.nn.Module`. The example class below takes the various steps from before and puts them into a single class.

In [26]:
class CustomModel(torch.nn.Module):
  def __init__(self, num_labels, hidden_layer):
    super().__init__()
    # General-purpose Transformer that returns every hidden state
    self.base_model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)

    self.num_labels = num_labels
    self.hidden_layer = hidden_layer # Index into output.hidden_states (0 is the embedding output)

    # Read the context vector size (768 for this model) from the config rather than hard-coding it
    self.linear = torch.nn.Linear(self.base_model.config.hidden_size, num_labels)
    self.loss_func = torch.nn.CrossEntropyLoss() # Ignores labels of -100 by default

  def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
    output = self.base_model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)

    # Score every token using the chosen hidden state: (batch, seq_len, num_labels)
    logits = self.linear(output.hidden_states[self.hidden_layer])

    loss = None
    if labels is not None: # If we're provided with labels, use them to calculate the loss
      loss = self.loss_func(logits.reshape(-1,self.num_labels), labels.reshape(-1))

    return TokenClassifierOutput(loss=loss, logits=logits)

Note that the above class works very similarly to the actual implementation behind `AutoModelForTokenClassification`. You can have a look at `BertForTokenClassification` in the [HuggingFace source code](https://github.com/huggingface/transformers/blob/main/src/transformers/models/bert/modeling_bert.py#L1714).

One key difference is that this implementation does not apply [dropout](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) to the hidden state before the linear layer. `BertForTokenClassification` does, and adding it here may be beneficial.
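
If you wanted to experiment with dropout, one possible approach is sketched below. The `TokenClassificationHead` class and the dropout probability of 0.1 are assumptions made for this example, not part of the notebook's model. The head could stand in for the plain linear layer in the custom model.

```python
import torch

class TokenClassificationHead(torch.nn.Module):
  """Hypothetical head: dropout followed by a linear layer."""
  def __init__(self, hidden_size, num_labels, dropout_prob=0.1):
    super().__init__()
    self.dropout = torch.nn.Dropout(dropout_prob)
    self.linear = torch.nn.Linear(hidden_size, num_labels)

  def forward(self, hidden_state):
    # Dropout randomly zeroes some features, but only in training mode
    return self.linear(self.dropout(hidden_state))

head = TokenClassificationHead(hidden_size=768, num_labels=9)
hidden_state = torch.randn(1, 15, 768)  # random stand-in for a hidden state

head.train()  # dropout active: repeated calls give different logits
train_logits = head(hidden_state)

head.eval()   # dropout disabled: repeated calls give identical logits
eval_logits_a = head(hidden_state)
eval_logits_b = head(hidden_state)

print(train_logits.shape)
print(torch.equal(eval_logits_a, eval_logits_b))
```

Remember to call `.eval()` before making predictions, otherwise the dropout stays active and the logits will vary between calls.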

Now we can create a model and even select which hidden state to connect to the output layer. Note that the base model still runs all of its layers in the forward pass; the layers after the chosen hidden state simply no longer influence the logits.

In [27]:
model = CustomModel(num_labels=num_labels, hidden_layer=5)

Let's pass in the tokenized text with the labels and see what we get.

In [28]:
output = model(**encoded_with_labels)
output

TokenClassifierOutput(loss=tensor(2.2885, grad_fn=<NllLossBackward0>), logits=tensor([[[ 0.4321, -0.5317,  0.8398,  0.1393, -0.4323,  0.2329, -0.3783,
           0.1467, -0.5762],
         [ 0.5527, -0.6231,  1.2532,  0.4491,  0.1198,  0.0828, -0.0640,
           0.1872, -0.6063],
         [ 1.1288, -0.2954,  1.3737,  0.4369,  0.2266,  0.3522, -0.2014,
           0.6996, -0.6206],
         [ 0.7409, -0.3621,  0.7774,  0.8140, -0.2069,  0.0141, -0.4136,
          -0.3533, -0.1180],
         [ 1.1722, -0.6568,  1.1027,  0.2157, -0.1164,  0.0153, -0.0832,
          -0.3034,  0.0688],
         [ 0.2826, -0.1435,  0.7772,  0.6200,  0.2144,  0.1037, -0.1978,
           0.1359, -0.0591],
         [ 0.2379,  0.0087,  0.9060,  0.0091,  0.0130,  0.3740, -0.1898,
           0.1617, -0.1956],
         [ 0.2272, -0.3866,  0.9681,  0.5030, -0.1783,  0.3064, -0.5520,
          -0.0882, -0.2468],
         [ 0.6588, -0.5525,  1.1593,  0.3387,  0.4629,  0.2235, -0.3387,
          -0.0639, -0.2114],
    

Excellent. The [HuggingFace trainer](https://huggingface.co/docs/transformers/main_classes/trainer) can now be used to fine-tune the model on an appropriate dataset.
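
Roughly speaking, a training loop repeatedly calls the model on a batch, reads the loss from the returned output, and back-propagates it. The sketch below shows that core loop in plain PyTorch. It is not how the Trainer is implemented. `ToyTokenClassifier` is a hypothetical stand-in with no Transformer inside, it returns a plain dict instead of a `TokenClassifierOutput`, and the data is random.

```python
import torch

torch.manual_seed(0)

class ToyTokenClassifier(torch.nn.Module):
  """Hypothetical stand-in: a linear layer on pre-computed features."""
  def __init__(self, feature_size, num_labels):
    super().__init__()
    self.num_labels = num_labels
    self.linear = torch.nn.Linear(feature_size, num_labels)
    self.loss_func = torch.nn.CrossEntropyLoss()

  def forward(self, features, labels=None):
    logits = self.linear(features)
    loss = None
    if labels is not None:
      loss = self.loss_func(logits.reshape(-1, self.num_labels), labels.reshape(-1))
    return {'loss': loss, 'logits': logits}

# One random "batch": 15 tokens with 16 features each, and 9 possible labels
features = torch.randn(1, 15, 16)
labels = torch.randint(low=0, high=9, size=(1, 15))
labels[0, 0] = -100
labels[0, -1] = -100

toy_model = ToyTokenClassifier(feature_size=16, num_labels=9)
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.05)

losses = []
for step in range(30):
  optimizer.zero_grad()
  output = toy_model(features=features, labels=labels)
  output['loss'].backward()  # the loss in the output drives the update
  optimizer.step()
  losses.append(output['loss'].item())

print(losses[0], losses[-1])
```

The loss on this single batch should go down as the linear layer fits the random labels, which is why the custom model needs to return a loss whenever labels are provided.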