In this notebook, we explore a Named Entity Recognition (NER) task using transformers. The task involves fine-tuning the [ClinicalBERT](https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT) model.

In [1]:
! pip install pandas
! pip install datasets
! pip install transformers
! pip install torch
! pip install ipywidgets

Defaulting to user installation because normal site-packages is not writeable
Collecting pandas
  Downloading pandas-1.4.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.7 MB)
Collecting numpy>=1.18.5
  Downloading numpy-1.23.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Collecting pytz>=2020.1
  Downloading pytz-2022.1-py2.py3-none-any.whl (503 kB)
Installing collected packages: pytz, numpy, pandas
Successfully installed numpy-1.23.0 pandas-1.4.3 pytz-2022.1
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' com

In [1]:
import os
import itertools
import pandas as pd
import numpy as np
import datasets
from datasets import Dataset
from datasets import load_metric
from transformers import AutoTokenizer
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification
import torch

## The data

In [2]:
import pandas as pd


# Put the code below into a function later

# Test Data Frame
with open("../data/Bioinformatics_16/BioNLP13CG-IOB/test_tokens.txt") as f:
    lines = f.readlines()
tokens = [tokens.replace("\n", "").rstrip().split() for tokens in lines]
with open("../data/Bioinformatics_16/BioNLP13CG-IOB/test_labels.txt") as f:
    lines = f.readlines()
labels = [tokens.replace("\n", "").rstrip().split() for tokens in lines]
test = pd.DataFrame({"tokens": tokens, "ner_tags": labels})

# Validation Data Frame
with open("../data/Bioinformatics_16/BioNLP13CG-IOB/validation_tokens.txt") as f:
    lines = f.readlines()
tokens = [tokens.replace("\n", "").rstrip().split() for tokens in lines]
with open("../data/Bioinformatics_16/BioNLP13CG-IOB/validation_labels.txt") as f:
    lines = f.readlines()
labels = [tokens.replace("\n", "").rstrip().split() for tokens in lines]
validation = pd.DataFrame({"tokens": tokens, "ner_tags": labels})


# Train Data Frame
with open("../data/Bioinformatics_16/BioNLP13CG-IOB/train_tokens.txt") as f:
    lines = f.readlines()
tokens = [tokens.replace("\n", "").rstrip().split() for tokens in lines]
with open("../data/Bioinformatics_16/BioNLP13CG-IOB/train_labels.txt") as f:
    lines = f.readlines()
labels = [tokens.replace("\n", "").rstrip().split() for tokens in lines]
train = pd.DataFrame({"tokens": tokens, "ner_tags": labels})
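
The three blocks above repeat the same steps, so, as the comment suggests, they could be folded into one helper. A minimal sketch, assuming each split is stored as `<split>_tokens.txt` and `<split>_labels.txt` with one whitespace-separated sentence per line; the name `read_split` is hypothetical:

```python
from pathlib import Path


def read_split(data_dir, split):
    """Read one split into a dict of token lists and tag lists.

    Hypothetical helper: assumes `<split>_tokens.txt` and `<split>_labels.txt`
    live in `data_dir`, one whitespace-separated sentence per line.
    """
    columns = {}
    for column, suffix in (("tokens", "tokens"), ("ner_tags", "labels")):
        path = Path(data_dir) / f"{split}_{suffix}.txt"
        with open(path) as f:
            # str.split() with no arguments also drops the trailing newline
            columns[column] = [line.split() for line in f]
    return columns
```

The returned dictionary can be passed straight to pandas, for example `pd.DataFrame(read_split("../data/Bioinformatics_16/BioNLP13CG-IOB", "train"))`.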


In [3]:
train.iloc[:2]

Unnamed: 0,tokens,ner_tags
0,"[The, TGF, -, beta, type, II, receptor, in, ch...","[O, B-Gene_or_gene_product, I-Gene_or_gene_pro..."
1,"[Genomic, instability, is, one, mechanism, pro...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


Convert data to a datasets dictionary

In [4]:
med_df = datasets.DatasetDict({
    "train": Dataset.from_pandas(train),
    "validation": Dataset.from_pandas(validation),
    "test": Dataset.from_pandas(test)
})

med_df

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 3021
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 1895
    })
})

In [5]:
# Example
example = pd.DataFrame(med_df["validation"][221])
example

Unnamed: 0,tokens,ner_tags
0,The,O
1,rats,B-Organism
2,were,O
3,divided,O
4,into,O
5,4,O
6,groups,O
7,.,O


Each record is annotated in the `inside-outside-beginning` (IOB) format: a `B-` prefix marks the first token of an entity, and the following tokens belonging to the same entity are given an `I-` prefix. An `O` tag indicates that the token does not belong to any entity. For example, in the sentence shown above, `rats` is tagged `B-Organism` and every other token is tagged `O`.
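
To make the tagging scheme concrete, here is a small sketch that groups IOB tags back into entity spans, using the validation record shown earlier. The helper name `iob_to_spans` is an illustrative assumption, not part of the dataset tooling:

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # partition() splits on the first hyphen only, so entity names
        # that contain hyphens stay intact
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the entity that is currently open
            current_tokens.append(token)
            continue
        # Any other tag closes the open entity, if there is one
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # "B-" starts a new entity; a stray "I-" is treated as a start too
            current_type, current_tokens = entity_type, [token]
        else:
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["The", "rats", "were", "divided", "into", "4", "groups", "."]
tags = ["O", "B-Organism", "O", "O", "O", "O", "O", "O"]
print(iob_to_spans(tokens, tags))  # [('Organism', 'rats')]
```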


As a quick check that we don't have any unusual imbalance in the tags, let's calculate the frequencies of each entity across each split:

In [6]:
# subclass for counting hashable objects
from collections import Counter
# calls a factory function to supply missing values
from collections import defaultdict
from datasets import DatasetDict

split2freqs = defaultdict(Counter)
for split, dataset in med_df.items():
    for row in dataset["ner_tags"]:
        for tag in row:
            if tag.startswith("B-"):
                # Split only on the first hyphen so that names such as
                # "Multi-tissue_structure" are kept whole
                tag_type = tag.split("-", 1)[1]
                split2freqs[split][tag_type] += 1


pd.DataFrame(split2freqs)


Unnamed: 0,train,validation,test
Gene_or_gene_product,4016,1354,2513
Cancer,1226,430,918
Cell,1934,540,996
Organism,896,295,513
Simple_chemical,1096,443,720
Multi-tissue_structure,415,138,303
Organ,194,71,156
Organism_subdivision,47,12,39
Tissue,314,86,184
Immaterial_anatomical_entity,52,19,31


Some entities, such as `Organism_subdivision` and `Immaterial_anatomical_entity`, have very few examples in every split. One option would be to combine the train and validation sets; for now, let's see how well the model generalizes on these.

We now need a way to encode the `ner_tags`, e.g. `B-Amino_acid`, into a numerical form. Let's create the list of tags used to label each token, along with a mapping from each tag to an ID and vice versa. All of this information can be derived as follows:

In [8]:
split2freqs = defaultdict(Counter)

for split, dataset in med_df.items():
    for row in dataset["ner_tags"]:
        for tag in row:
            tag_type = tag
            split2freqs[split][tag_type] +=1


tag_names = pd.DataFrame(split2freqs).reset_index()["index"].to_list()
tag_names

['O',
 'B-Gene_or_gene_product',
 'I-Gene_or_gene_product',
 'B-Cancer',
 'I-Cancer',
 'B-Cell',
 'I-Cell',
 'B-Organism',
 'B-Simple_chemical',
 'I-Simple_chemical',
 'B-Multi-tissue_structure',
 'I-Multi-tissue_structure',
 'B-Organ',
 'B-Organism_subdivision',
 'B-Tissue',
 'I-Tissue',
 'B-Immaterial_anatomical_entity',
 'B-Organism_substance',
 'I-Organism_substance',
 'I-Organism',
 'I-Organism_subdivision',
 'B-Cellular_component',
 'I-Immaterial_anatomical_entity',
 'I-Cellular_component',
 'B-Pathological_formation',
 'I-Pathological_formation',
 'I-Organ',
 'B-Amino_acid',
 'I-Amino_acid',
 'B-Anatomical_system',
 'I-Anatomical_system',
 'B-Developing_anatomical_structure',
 'I-Developing_anatomical_structure']

Next, we create the tag-to-index and index-to-tag dictionaries:

In [9]:
# Create index and tag mappings
tag2index = {tag: idx for idx, tag in enumerate(tag_names)}
index2tag = {idx: tag for idx, tag in enumerate(tag_names)}
print(index2tag[32])
print(tag2index["I-Developing_anatomical_structure"])

I-Developing_anatomical_structure
32
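
The tag list and both mappings can also be built without going through a `DataFrame`. A sketch on toy rows, assuming only that each row is a list of tag strings; the function name `build_tag_maps` and the `toy_` variables are hypothetical. Because `dict.fromkeys` keeps first-seen order, the IDs depend on the order in which rows are visited:

```python
from itertools import chain


def build_tag_maps(rows):
    """Return (tag_names, tag2index, index2tag) for rows of tag strings."""
    # dict.fromkeys de-duplicates while keeping first-seen order
    names = list(dict.fromkeys(chain.from_iterable(rows)))
    to_index = {tag: idx for idx, tag in enumerate(names)}
    to_tag = {idx: tag for tag, idx in to_index.items()}
    return names, to_index, to_tag


toy_rows = [["O", "B-Organism", "O"], ["B-Cancer", "I-Cancer", "O"]]
toy_names, toy_tag2index, toy_index2tag = build_tag_maps(toy_rows)
print(toy_names)  # ['O', 'B-Organism', 'B-Cancer', 'I-Cancer']
```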


With these, the next step is to create a new column in each split with the numeric class labels for each observation. We'll use the `map()` method to apply a function to each observation:

In [10]:
# Add ner_tag ids
def create_tag_ids(example):
    # Without `batched=True`, `map()` passes one example (a dict) at a time
    return {"tag_ids": [tag2index[ner_tag] for ner_tag in example["ner_tags"]]}

# Apply the function to every example in each split
med_df = med_df.map(create_tag_ids)
med_df



  0%|          | 0/3021 [00:00<?, ?ex/s]

  0%|          | 0/1000 [00:00<?, ?ex/s]

  0%|          | 0/1895 [00:00<?, ?ex/s]

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags', 'tag_ids'],
        num_rows: 3021
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'tag_ids'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'tag_ids'],
        num_rows: 1895
    })
})
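
By default, `map()` calls the function once per example; passing `batched=True` instead hands it a dictionary of lists covering several examples at once. The sketch below contrasts the two function shapes in plain Python. `toy_tag2index` and both function names are made up for illustration:

```python
# Toy stand-in for the mapping built earlier in the notebook
toy_tag2index = {"O": 0, "B-Organ": 1, "B-Cancer": 2}


def create_tag_ids_single(example):
    # Default `map()` call: `example["ner_tags"]` is one list of tags
    return {"tag_ids": [toy_tag2index[tag] for tag in example["ner_tags"]]}


def create_tag_ids_batched(batch):
    # `map(..., batched=True)`: `batch["ner_tags"]` is a list of tag lists
    return {
        "tag_ids": [[toy_tag2index[tag] for tag in tags] for tags in batch["ner_tags"]]
    }


single = create_tag_ids_single({"ner_tags": ["O", "B-Organ", "O"]})
batched = create_tag_ids_batched({"ner_tags": [["O", "B-Organ"], ["B-Cancer"]]})
print(single)   # {'tag_ids': [0, 1, 0]}
print(batched)  # {'tag_ids': [[0, 1], [2]]}
```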

In [11]:
example = pd.DataFrame(med_df["validation"][111])
example

Unnamed: 0,tokens,ner_tags,tag_ids
0,Postoperative,O,0
1,progression,O,0
2,of,O,0
3,pulmonary,B-Organ,12
4,metastasis,O,0
5,in,O,0
6,osteosarcoma,B-Cancer,3
7,.,O,0


Much better! We'll still need to convert the tokens themselves into numeric representations. We'll get back to that shortly.

## ClinicalBERT

In [29]:
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

The main concept that makes Transformers so versatile is the split of the architecture into a body and head. This separation of bodies and heads allows us to build a custom head
for any task and just mount it on top of a pretrained model.


In [14]:
from transformers import AutoConfig

model_ckpt = "emilyalsentzer/Bio_ClinicalBERT"

AutoConfig.from_pretrained(model_ckpt)

Downloading:   0%|          | 0.00/385 [00:00<?, ?B/s]

BertConfig {
  "_name_or_path": "emilyalsentzer/Bio_ClinicalBERT",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

Let's store the index and tag mappings and the number of distinct classes in the AutoConfig object.

In [17]:
clinical_bert_config = AutoConfig.from_pretrained(model_ckpt, num_labels = len(tag_names),
id2label = index2tag, label2id = tag2index)

clinical_bert_config

BertConfig {
  "_name_or_path": "emilyalsentzer/Bio_ClinicalBERT",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "B-Gene_or_gene_product",
    "2": "I-Gene_or_gene_product",
    "3": "B-Cancer",
    "4": "I-Cancer",
    "5": "B-Cell",
    "6": "I-Cell",
    "7": "B-Organism",
    "8": "B-Simple_chemical",
    "9": "I-Simple_chemical",
    "10": "B-Multi-tissue_structure",
    "11": "I-Multi-tissue_structure",
    "12": "B-Organ",
    "13": "B-Organism_subdivision",
    "14": "B-Tissue",
    "15": "I-Tissue",
    "16": "B-Immaterial_anatomical_entity",
    "17": "B-Organism_substance",
    "18": "I-Organism_substance",
    "19": "I-Organism",
    "20": "I-Organism_subdivision",
    "21": "B-Cellular_component",
    "22": "I-Immaterial_anatomical_entity",
    "23": "I-Cellular_component",
    "24": "B-Pathological_formation",
    "25": "I-Pathological_

The `AutoConfig` class contains the blueprint of a model's architecture and is usually downloaded automatically when we run `AutoModelForTokenClassification.from_pretrained`. In this case, we load the model with the additional `config` argument so that it uses the configuration we modified above.

In [26]:
from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained(model_ckpt, config = clinical_bert_config).to(device)
model

Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint 

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwis

Next, we load the model's tokenizer, which splits a string into subword tokens and maps each one to a numeric ID.

In [23]:
from transformers import AutoTokenizer
model_ckpt = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
tokenizer

PreTrainedTokenizerFast(name_or_path='emilyalsentzer/Bio_ClinicalBERT', vocab_size=28996, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

Quick sanity check to ensure the model and tokenizer have been initialized correctly:

In [47]:
text = med_df["validation"][111]["tokens"]
text = " ".join(text)
text

'Postoperative progression of pulmonary metastasis in osteosarcoma .'

In [56]:
tokens = tokenizer(text, return_tensors="pt")
pd.DataFrame([tokens.tokens(), tokens["input_ids"][0].numpy()], index = ["Tokens", "Input IDs"])



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
Tokens,[CLS],post,##oper,##ative,progression,of,pulmonary,meta,##sta,##sis,in,o,##ste,##osa,##rc,##oma,.,[SEP]
Input IDs,101,2112,19807,5838,16147,1104,26600,27154,8419,4863,1107,184,13894,9275,19878,7903,119,102
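
The table shows 18 subword tokens for a sentence that has only 8 word-level tags, so the labels will eventually have to be aligned with the subwords. One common convention, sketched here as an assumption rather than as this notebook's method, is to keep the label on the first subword of each word and mask the rest with `-100`, which PyTorch's cross-entropy loss ignores by default. The helper name `align_labels` is hypothetical, and the word indices are written by hand; with a fast tokenizer they would typically come from `word_ids()`:

```python
IGNORE_INDEX = -100  # default `ignore_index` of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_labels):
    """Give each word's label to its first subword; mask everything else."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # Special tokens and continuation subwords are masked out
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Word indices written by hand to mirror the table above (assumption)
word_ids = [None, 0, 0, 0, 1, 2, 3, 4, 4, 4, 5, 6, 6, 6, 6, 6, 7, None]
word_tag_ids = [0, 0, 0, 12, 0, 0, 3, 0]  # word-level IDs for the same sentence
print(align_labels(word_ids, word_tag_ids))
```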


Finally, we need to pass the inputs to the model and extract the predictions by taking
the argmax to get the most likely class per token:

In [62]:
outputs = model(tokens["input_ids"].to(device)).logits

outputs.shape #[batch_size, num_tokens, num_tags]

torch.Size([1, 18, 33])

In [63]:
outputs

tensor([[[-3.0681e-02, -7.0484e-01, -1.5960e-01, -5.3787e-01, -1.5586e-01,
          -1.8620e-02, -3.7622e-02,  2.0091e-01,  1.5004e-01, -2.2316e-01,
          -1.4840e-02,  1.6940e-01, -3.7284e-01,  4.2044e-01,  9.6409e-02,
          -1.5482e-01,  3.4893e-02, -6.6570e-02,  3.6009e-01,  1.2015e-01,
          -7.6215e-02,  1.3905e-01,  9.6012e-02,  7.3222e-01, -2.8005e-01,
          -5.3698e-01, -5.8754e-01,  2.1537e-01, -1.0474e-01,  3.7457e-02,
          -1.1115e-02, -6.4045e-01, -2.1389e-01],
         [ 5.3055e-02, -8.6738e-01, -2.3088e-01, -6.4398e-01, -1.4056e-01,
           1.9622e-01, -3.0946e-01,  3.9120e-02,  1.4106e-01,  4.8184e-02,
          -1.2648e-01,  8.4362e-02, -4.1441e-01,  2.7051e-01,  2.9913e-02,
           7.1965e-02,  2.1617e-01,  1.3861e-01, -3.3692e-02,  3.2228e-01,
          -5.7417e-01,  1.2631e-01,  4.4106e-01,  2.1798e-02,  2.9999e-03,
          -4.0930e-01, -4.2402e-01, -5.7801e-02, -1.8042e-01, -7.3760e-02,
           1.0539e-01, -2.7793e-01,  3.2392e-02],


Here we see that each of the 18 tokens is assigned 33 logits, one for each possible NER tag.
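
To see what the argmax step does, here is a sketch on made-up logits for two tokens and three tags. The numbers and the `toy_index2tag` mapping are illustrative, not model output. Softmax turns each row of logits into probabilities, and because it preserves ordering, the argmax picks the same index either way:

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


toy_index2tag = {0: "O", 1: "B-Cancer", 2: "I-Cancer"}
toy_logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]  # [num_tokens, num_tags]

for row in toy_logits:
    probs = softmax(row)
    # Softmax is monotonic, so both argmaxes agree
    assert argmax(row) == argmax(probs)
    print(toy_index2tag[argmax(row)], round(max(probs), 3))
```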

In [72]:
# Extract predictions
predictions = torch.argmax(outputs, -1).cpu().numpy()
predictions

array([23, 22, 13, 13, 23, 23, 23,  0,  0, 13, 23, 13, 11, 11, 23, 13, 13,
       14])

In [1]:
preds = [index2tag[p] for p in predictions[0]]
pd.DataFrame([tokens.tokens(), preds], index = ["Tokens", "Tags"])


Unsurprisingly, our token classification layer with random weights leaves a lot to be
desired; let’s fine-tune on some labeled data to make it better! Before doing so, let’s
wrap the preceding steps into a helper function for later use:

In [None]:
def tag_text(text, tags, model, tokenizer):
    # Tokenize the text
    tokens = tokenizer(text, return_tensors="pt")
    # Get logits over the 33 possible classes for each token
    outputs = model(tokens["input_ids"].to(device)).logits
    # Take the argmax to get the most likely class per token
    predictions = torch.argmax(outputs, -1).cpu().numpy()

    # Map each predicted index to its tag string; `tags` is an
    # index-to-tag mapping such as `index2tag`
    preds = [tags[p] for p in predictions[0]]
    return pd.DataFrame([tokens.tokens(), preds], index=["Tokens", "Tags"])