## CORD-NER Token Classification

The following notebook fine-tunes (<a href="https://huggingface.co/medicalai/ClinicalBERT" >ClinicalBERT</a>) on the entity recognition (token classification) task using the CORD-
NER dataset. The dataset was downloaded from Kaggle at:

<a href="https://www.kaggle.com/datasets/sushilkumarinfo/covid-ner-data-set">Covid NER Data Set</a>

This notebooks is based on the following huggingface tutorial:

<a href="https://huggingface.co/learn/nlp-course/en/chapter7/2?fw=pt#using-the-fine-tuned-model">Token classification tutorial</a>

### Data preprocessing

In [1]:
from datasets import load_from_disk

cord_ner_dataset = load_from_disk('./data/cord-ner') 
cord_ner_dataset

  from .autonotebook import tqdm as notebook_tqdm


DatasetDict({
    train: Dataset({
        features: ['doc_id', 'sent_id', 'sent_tokens', 'labels'],
        num_rows: 2568124
    })
    test: Dataset({
        features: ['doc_id', 'sent_id', 'sent_tokens', 'labels'],
        num_rows: 321016
    })
    validation: Dataset({
        features: ['doc_id', 'sent_id', 'sent_tokens', 'labels'],
        num_rows: 321015
    })
})

In [2]:
ner_feature = cord_ner_dataset["train"].features["labels"]
label_names = ner_feature.feature.names
label_names

['O',
 'B-CARDINAL',
 'I-CARDINAL',
 'B-ORGANISM',
 'I-ORGANISM',
 'B-CELL',
 'I-CELL',
 'B-BACTERIUM',
 'I-BACTERIUM',
 'B-LABORATORY_OR_TEST_RESULT',
 'I-LABORATORY_OR_TEST_RESULT',
 'B-CELL_COMPONENT',
 'I-CELL_COMPONENT',
 'B-PERSON',
 'I-PERSON',
 'B-MACHINE_ACTIVITY',
 'I-MACHINE_ACTIVITY',
 'B-SIGN_OR_SYMPTOM',
 'I-SIGN_OR_SYMPTOM',
 'B-BODY_SUBSTANCE',
 'I-BODY_SUBSTANCE',
 'B-TIME',
 'I-TIME',
 'B-SUBSTRATE',
 'I-SUBSTRATE',
 'B-CELL_FUNCTION',
 'I-CELL_FUNCTION',
 'B-ORDINAL',
 'I-ORDINAL',
 'B-HUMAN-CAUSED_PHENOMENON_OR_PROCESS',
 'I-HUMAN-CAUSED_PHENOMENON_OR_PROCESS',
 'B-EVOLUTION',
 'I-EVOLUTION',
 'B-IMMUNE_RESPONSE',
 'I-IMMUNE_RESPONSE',
 'B-EDUCATIONAL_ACTIVITY',
 'I-EDUCATIONAL_ACTIVITY',
 'B-FOOD',
 'I-FOOD',
 'B-LANGUAGE',
 'I-LANGUAGE',
 'B-GPE',
 'I-GPE',
 'B-BODY_PART_ORGAN_OR_ORGAN_COMPONENT',
 'I-BODY_PART_ORGAN_OR_ORGAN_COMPONENT',
 'B-SOCIAL_BEHAVIOR',
 'I-SOCIAL_BEHAVIOR',
 'B-EVENT',
 'I-EVENT',
 'B-TISSUE',
 'I-TISSUE',
 'B-FAC',
 'I-FAC',
 'B-MONEY',
 '

### Tokenization

In [3]:
from transformers import AutoTokenizer

path_to_model = "./models/ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(path_to_model)

In [4]:
tokenizer.is_fast

True

Align labels with tokens. Special tokens will get the label -100. It is a label index that will be ignored by the loss function (cross-entropy.)

In [7]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

To preprocess the whole dataset, it is necessary to tokenize all the inputs and apply align_labels_with_tokens() on all the labels. To take advantage of the speed of the fast tokenizer, it’s best to tokenize in batches.  So, we use the following function that processes a list of examples and use the Dataset.map() method with the option batched=True. The word_ids() function needs to get the index of the example we want the word IDs of when the inputs to the tokenizer are lists of texts (or in this case, list of lists of words.)

In [9]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["sent_tokens"], truncation=True, max_length=96, is_split_into_words=True
    )
    all_labels = examples["labels"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

Padding is not done at this step yet. It will be done at the data collator.

In [10]:
tokenized_datasets = cord_ner_dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=cord_ner_dataset["train"].column_names,
)

Map: 100%|██████████| 2568124/2568124 [04:21<00:00, 9837.25 examples/s] 
Map: 100%|██████████| 321016/321016 [00:33<00:00, 9593.46 examples/s] 
Map: 100%|██████████| 321015/321015 [00:32<00:00, 9756.55 examples/s] 


In [11]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 2568124
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 321016
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 321015
    })
})

### Data Collation

In [12]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

### Metrics

To have the Trainer compute a metric every epoch, the function compute_metrics() is defined. It takes the arrays of predictions and labels, and returns a dictionary with the metric names and values.

In [15]:
import evaluate

metric = evaluate.load("seqeval")

This metric does not behave like the standard accuracy: it will actually take the lists of labels as strings, not integers, so we will need to fully decode the predictions and labels before passing them to the metric. 

This compute_metrics() function first takes the argmax of the logits to convert them to predictions (as usual, the logits and the probabilities are in the same order, so we don’t need to apply the softmax). Then we have to convert both labels and predictions from integers to strings. We remove all the values where the label is -100, then pass the results to the metric.compute() method:

In [19]:
import numpy as np


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

### Model definition

Since we are working on a token classification problem, we will use the AutoModelForTokenClassification class. The main thing to remember when defining this model is to pass along some information on the number of labels we have. The easiest way to do this is to pass that number with the num_labels argument, but if we want a nice inference widget working like the one we saw at the beginning of this section, it’s better to set the correct label correspondences instead.

They should be set by two dictionaries, id2label and label2id, which contain the mappings from ID to label and vice versa:

In [20]:
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [21]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    path_to_model,
    id2label=id2label,
    label2id=label2id,
)

model.gradient_checkpointing_enable()

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at ./models/ClinicalBERT and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
model.config.num_labels

127

## Fine-tuning the model

In [22]:
from accelerate import Accelerator

accelerator = Accelerator()
device = accelerator.device
print(device)

cuda


### DataLoader definition

The data loader will generate batch sizes of 32 examples. 

In [23]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=32,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=32
)

### Optimizer definition

We choose an initial learning rate of 2e-5

In [24]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

We move model, optimizer, and both dataloaders to accelerator

In [25]:
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

### Scheduler

We choose to train for 3 epochs

In [26]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

### Training loop

To simplify the evaluation part, we define the following postprocess() function that takes predictions and labels and converts them to lists of strings, like the metric object expects:

In [27]:
def postprocess(predictions, labels):
    predictions = predictions.detach().cpu().clone().numpy()
    labels = labels.detach().cpu().clone().numpy()

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return true_labels, true_predictions

We define the output directory where model checkpoints will go: 

In [28]:
output_dir = "./models/clinicalbert-finetuned-cord-ner"

training loop:

In [29]:
from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for batch in eval_dataloader:
        with torch.no_grad():
            outputs = model(**batch)

        predictions = outputs.logits.argmax(dim=-1)
        labels = batch["labels"]

        # Necessary to pad predictions and labels for being gathered
        predictions = accelerator.pad_across_processes(predictions, dim=1, pad_index=-100)
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(predictions)
        labels_gathered = accelerator.gather(labels)

        true_predictions, true_labels = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=true_predictions, references=true_labels)

    results = metric.compute()
    print(
        f"epoch {epoch}:",
        {
            key: results[f"overall_{key}"]
            for key in ["precision", "recall", "f1", "accuracy"]
        },
    )

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        #repo.push_to_hub(
        #    commit_message=f"Training in progress epoch {epoch}", blocking=False
        #)

 33%|███▎      | 80254/240762 [7:08:58<15:15:03,  2.92it/s]

epoch 0: {'precision': 0.8077804519750881, 'recall': 0.7637188743620039, 'f1': 0.7851319656012556, 'accuracy': 0.9134144878001901}


 67%|██████▋   | 160508/240762 [14:27:52<7:48:24,  2.86it/s]   

epoch 1: {'precision': 0.830607309234364, 'recall': 0.7868277971285362, 'f1': 0.808125057793943, 'accuracy': 0.9239892663781595}


100%|██████████| 240762/240762 [21:50:48<00:00,  2.42it/s]      

epoch 2: {'precision': 0.8375507991885777, 'recall': 0.7982051615540402, 'f1': 0.8174047804447852, 'accuracy': 0.9282542328741817}


### Using the fine-tuned model

In [11]:
from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "./models/clinicalbert-finetuned-cord-ner"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)
token_classifier("All five of the genes needed to make the amino acid tryptophan in Escherichia coli are located next to each other in the trp operon.")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


[{'entity_group': 'CARDINAL',
  'score': 0.5394215,
  'word': 'all five',
  'start': 0,
  'end': 8},
 {'entity_group': 'CHEMICAL',
  'score': 0.92919856,
  'word': 'amino acid tryptophan',
  'start': 41,
  'end': 62},
 {'entity_group': 'CHEMICAL',
  'score': 0.92330784,
  'word': 'escherichia',
  'start': 66,
  'end': 77},
 {'entity_group': 'GENE_OR_GENOME',
  'score': 0.99423695,
  'word': 'trp',
  'start': 121,
  'end': 124}]

### Evaluation on test set 

In [14]:
from transformers import AutoTokenizer

path_to_model = "./models/clinicalbert-finetuned-cord-ner"
tokenizer = AutoTokenizer.from_pretrained(path_to_model)

In [15]:
tokenizer.is_fast

True

In [18]:
tokenized_test_dataset = cord_ner_dataset["test"].map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=cord_ner_dataset["test"].column_names,
)

Map: 100%|██████████| 321016/321016 [01:14<00:00, 4281.34 examples/s]


In [19]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [20]:
import evaluate

metric = evaluate.load("seqeval")

In [21]:
import numpy as np


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

In [23]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    path_to_model,
    id2label=id2label,
    label2id=label2id,
)

In [24]:
model.config.num_labels

127

In [25]:
from torch.utils.data import DataLoader

test_dataloader = DataLoader(
    tokenized_test_dataset, collate_fn=data_collator, batch_size=32
)

In [28]:
import torch

# Evaluation on test set
model.eval()
for batch in test_dataloader:
    with torch.no_grad():
        outputs = model(**batch)

    predictions = outputs.logits.argmax(dim=-1)
    labels = batch["labels"]

    # Necessary to pad predictions and labels for being gathered
    # predictions = accelerator.pad_across_processes(predictions, dim=1, pad_index=-100)
    # labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

    # predictions_gathered = accelerator.gather(predictions)
    # labels_gathered = accelerator.gather(labels)

    true_predictions, true_labels = postprocess(predictions, labels)
    metric.add_batch(predictions=true_predictions, references=true_labels)

results = metric.compute()
print(
    {
        key: results[f"overall_{key}"]
        for key in ["precision", "recall", "f1", "accuracy"]
    },
)

{'precision': 0.8374526946399435, 'recall': 0.79827810498694, 'f1': 0.8173962980288354, 'accuracy': 0.9281765313397233}


Metrics for individual classes:

In [29]:
results

{'ANATOMICAL_STRUCTURE': {'precision': 0.7191666666666666,
  'recall': 0.7570175438596491,
  'f1': 0.7376068376068375,
  'number': 1140},
 'ARCHAEON': {'precision': 1.0,
  'recall': 0.9782608695652174,
  'f1': 0.989010989010989,
  'number': 46},
 'BACTERIUM': {'precision': 0.8599182004089979,
  'recall': 0.788191190253046,
  'f1': 0.8224938875305624,
  'number': 1067},
 'BODY_PART_ORGAN_OR_ORGAN_COMPONENT': {'precision': 0.8726891557080236,
  'recall': 0.8365762309308487,
  'f1': 0.854251200970104,
  'number': 10947},
 'BODY_SUBSTANCE': {'precision': 0.9775072770574226,
  'recall': 0.9639874739039666,
  'f1': 0.9707003021941926,
  'number': 3832},
 'CARDINAL': {'precision': 0.8973055078589354,
  'recall': 0.844080811092888,
  'f1': 0.8698797649077429,
  'number': 80484},
 'CELL': {'precision': 0.854814158802203,
  'recall': 0.8192186298172559,
  'f1': 0.8366379543853947,
  'number': 37703},
 'CELL_COMPONENT': {'precision': 0.8203454894433782,
  'recall': 0.8019764823617713,
  'f1': 0.8