ASL, v.0602323

# Introduction

In this notebook, we use Transformers on the same task as in the previous notebook (`DAT255-NLP-2.0-MedTweets-fastai-ULMFiT.ipynb`)

# Setup

In [None]:
# This is a quick check of whether the notebook is currently running on Google Colaboratory
# or on Kaggle, as that makes some difference for the code below.
# We'll do this in every notebook of the course.
try:
    import colab
    colab=True
except:
    colab=False

import os
kaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

In [None]:
%matplotlib inline
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from pathlib import Path

In [None]:
#import os
#os.environ['CUDA_VISIBLE_DEVICES'] = "0"

In [None]:
if (colab or kaggle):
    !pip install transformers datasets

In [None]:
if colab:
    from google.colab import drive
    drive.mount("/content/gdrive")
    DATA = Path("/content/gdrive/MyDrive/Colab Notebooks/dat255-data")
    DATA.mkdir(exist_ok=True)
if not colab:
    DATA=Path('./data')
    DATA.mkdir(exist_ok=True)

In [None]:
import torch

In [None]:
import datasets

In [None]:
from datasets import load_dataset

In [None]:
import transformers

In [None]:
# Verify that the transformers library is installed and operational
print(transformers.pipeline('sentiment-analysis')('this is great!'))

In [None]:
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification, 
                          PreTrainedModel, BertModel, BertForSequenceClassification,
                          TrainingArguments, Trainer, GPT2ForSequenceClassification, 
                          RobertaForSequenceClassification)

from transformers.modeling_outputs import SequenceClassifierOutput

# MedWeb using Transformers

Load the data as before:

In [None]:
df = pd.read_csv('https://github.com/HVL-ML/DAT255/raw/main/3-NLP/data/medwebdata.csv')
df.head()

For convenience, we combine all the labels into one vector stored under `y`:

In [None]:
df.drop(['is_test','labels'], axis=1, inplace=True)

In [None]:
df['labels'] = df.apply(lambda x: [x[c] for c in df.columns[2:]], axis=1)

In [None]:
df.head()

Set up the transformers model. There are multiple possible models to try (at the time of writing, HuggingFace has 146,394 models in its library). 

One interesting option is the [PubMedBERT model](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract) created by Microsoft Research by training a BERT model on 14 million abstracts of PubMed articles. Have a look at the blog post [Domain-specific language model pretraining for biomedical natural language processing](https://www.microsoft.com/en-us/research/blog/domain-specific-language-model-pretraining-for-biomedical-natural-language-processing/) and the accompanying paper. 

A related model is the BioMed-RoBERTa model from AllenAI: https://huggingface.co/allenai/biomed_roberta_base. 

The more recent GPT models are also interesting options (for example the BioMedLM model from Stanford CRFM: https://huggingface.c/stanford-crfm/BioMedLM). Unfortunately, GPT models require enormous amounts of computing resources compared to many alternatives. See the end of the notebook for an example run of the GPT-2 model BioMedLM. 

In [None]:
#model_name = 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract'
#model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

model_name = "allenai/biomed_roberta_base"

model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)

We need to tokenize the data in the same way as was done for the original dataset and create a data set compatible with HuggingFace:

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

In [None]:
ds = Dataset.from_pandas(df, split='train').train_test_split()

In [None]:
ds

In [None]:
ds['train'][0]

In [None]:
def tokenize_and_encode(examples):
    return tokenizer(examples["Tweet"], truncation=True)

In [None]:
cols = ds['train'].column_names
cols.remove('labels')
ds_enc = ds.map(tokenize_and_encode, batched=True, remove_columns=cols)

In [None]:
ds_enc

In [None]:
model_name

In [None]:
num_labels=8

if "roberta" not in model_name:
    print("Assuming a BERT model")
    model = BertForSequenceClassification.from_pretrained(model_name, 
                                                        problem_type="multi_label_classification", 
                                                        num_labels=num_labels)

elif "roberta" in model_name:
    print("Assuming a RoBERTa model")
    model = RobertaForSequenceClassification.from_pretrained(model_name, 
                                                        problem_type="multi_label_classification", 
                                                        num_labels=num_labels)

We define some metrics to use when scoring on the test data:

In [None]:
from sklearn.metrics import f1_score
def accuracy_thresh(y_pred, y_true, thresh=0.5, sigmoid=True): 
    y_pred = torch.from_numpy(y_pred)
    y_true = torch.from_numpy(y_true)
    if sigmoid: 
        y_pred = y_pred.sigmoid()
    return ((y_pred>thresh)==y_true.bool()).float().mean().item()

def f1score_thresh(y_pred, y_true, average='micro',thresh=0.5, sigmoid=True): 
    y_pred = torch.from_numpy(y_pred)
    y_true = torch.from_numpy(y_true)
    if sigmoid: 
        y_pred = y_pred.sigmoid()
    return f1_score(y_true, y_pred>thresh, average='micro')

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    return {'accuracy_thresh': accuracy_thresh(predictions, labels),
           'f1score_micro_thresh': f1score_thresh(predictions, labels, average='micro'),
           'f1score_macro_thresh': f1score_thresh(predictions, labels, average='macro')}

..and then the training setup:

In [None]:
batch_size = 8
num_train_epochs = 6

args = TrainingArguments(
    output_dir=".",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    logging_steps=50,
    weight_decay=0.01, 
    save_steps=2000
)

We have to modify the loss function to deal with multilabel problems. Here's a way to do it (from https://discuss.huggingface.co/t/fine-tune-for-multiclass-or-multilabel-multiclass/4035/9):

In [None]:
class MultilabelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = torch.nn.BCEWithLogitsLoss()
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), 
                        labels.float().view(-1, self.model.config.num_labels))
        return (loss, outputs) if return_outputs else loss

In [None]:
trainer = MultilabelTrainer(
    model,
    args,
    train_dataset=ds_enc["train"],
    eval_dataset=ds_enc["test"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer)

Let's see how the model does without any training on the MedWeb data:

In [None]:
trainer.evaluate()

Then fine-tune it:

In [None]:
trainer.train()

In [None]:
trainer.train()

In [None]:
trainer.train()

## How does it compare to other approaches?

From the [original article](https://www.jmir.org/2019/2/e12783/) from 2019 that presented the data set:

<img src="https://github.com/MMIV-ML/ELMED219-2022/raw/main/Lab2-NLP/assets/medweb_results.png">

The "NAIST-en" models are _"ensembles of hierarchical attention network and deep character-level convolutional neural network with loss functions (negative loss function, hinge, and hinge squared)"_. I.e. also deep learning-based models.

# Extra: using a GPT-2 model

In [16]:
model_name = "stanford-crfm/BioMedLM" # This is a GPT2 model
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)

Some weights of the model checkpoint at stanford-crfm/BioMedLM were not used when initializing GPT2ForSequenceClassification: ['lm_head.weight']
- This IS expected if you are initializing GPT2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at stanford-crfm/BioMedLM and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
# GPT2
tokenizer.pad_token = tokenizer.eos_token

In [None]:
model = GPT2ForSequenceClassification.from_pretrained(model_name, 
                                                        problem_type="multi_label_classification", 
                                                        num_labels=num_labels)

In [27]:
model.config.pad_token_id = model.config.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

In [None]:
batch_size = 1
num_train_epochs = 2



In [None]:
args = TrainingArguments(
    output_dir=".",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_train_epochs,
    logging_steps=50,
    weight_decay=0.01
)

In [None]:
trainer = MultilabelTrainer(
    model,
    args,
    train_dataset=ds_enc["train"],
    eval_dataset=ds_enc["test"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer)

42GB

In [36]:
# GPT2
trainer.train()

***** Running training *****
  Num examples = 1920
  Num Epochs = 2
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 3840
  Number of trainable parameters = 2594268160
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy Thresh,F1score Micro Thresh,F1score Macro Thresh
1,0.1396,0.256822,0.951563,0.797054,0.797054
2,0.1646,0.163555,0.966211,0.865579,0.865579


Saving model checkpoint to ./checkpoint-500
Configuration saved in ./checkpoint-500/config.json
The model is bigger than the maximum size per checkpoint (10GB) and is going to be split in 2 checkpoint shards. You can find where each parameters has been saved in the index located at ./checkpoint-500/pytorch_model.bin.index.json.
tokenizer config file saved in ./checkpoint-500/tokenizer_config.json
Special tokens file saved in ./checkpoint-500/special_tokens_map.json
Saving model checkpoint to ./checkpoint-1000
Configuration saved in ./checkpoint-1000/config.json
The model is bigger than the maximum size per checkpoint (10GB) and is going to be split in 2 checkpoint shards. You can find where each parameters has been saved in the index located at ./checkpoint-1000/pytorch_model.bin.index.json.
tokenizer config file saved in ./checkpoint-1000/tokenizer_config.json
Special tokens file saved in ./checkpoint-1000/special_tokens_map.json
Saving model checkpoint to ./checkpoint-1500
Configurat

TrainOutput(global_step=3840, training_loss=0.2673093371093273, metrics={'train_runtime': 2015.1392, 'train_samples_per_second': 1.906, 'train_steps_per_second': 1.906, 'total_flos': 1032739377500160.0, 'train_loss': 0.2673093371093273, 'epoch': 2.0})