# Finetuning BERT

First, let's load the required libraries. We will use the Hugging Face transformers package for the model and for the tokenizer that handles the preprocessing.

Our goal in this notebook is to classify sentences from a clinical context according to whether or not they indicate a medical condition.

In [1]:
import torch
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    get_linear_schedule_with_warmup,
)
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from torch.optim import AdamW
import os

We start by loading the pretrained model and its tokenizer. The model is moved to the GPU if one is available.

In [2]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
).to(device)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



The warning tells us that a newly initialized, untrained classification layer was added on top of the pretrained model, so we need to train the model on our downstream task before its predictions mean anything.
To see the effect of fine-tuning, we first classify a clinical sentence before training.

In [3]:
# To do!:
# Come up with two test sentences.
# One should indicate the presence of a medical condition, and the other should come from a clinical setting but not indicate any medical condition.

test_sentence_condition = "The patient suffers from chronic headaches."
test_sentence_no_condition = "The hospital has a new meal plan."

In [4]:
sentence_to_classify = test_sentence_no_condition

# Set the model to evaluation mode
model.eval()

# Tokenize the sentence and get tensors
encoding = tokenizer(sentence_to_classify, return_tensors="pt")
input_ids = encoding["input_ids"].to(device)
attention_mask = encoding["attention_mask"].to(device)

# Make a prediction
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits

# Convert logits to probabilities
probabilities = torch.nn.functional.softmax(logits, dim=1)
predicted_class = torch.argmax(probabilities, dim=1).item()

print(f"Sentence: '{sentence_to_classify}'")
print(f"Logits: {logits.cpu().numpy()[0]}")
print(
    f"Probabilities (No Condition, Medical Condition): {probabilities.cpu().numpy()[0]}"
)
print(f"Predicted Class (before training): {predicted_class}")
print("--------------------------------------------\n")

Sentence: 'The hospital has a new meal plan.'
Logits: [-0.22869185  0.11311549]
Probabilities (No Condition, Medical Condition): [0.41537052 0.5846295 ]
Predicted Class (before training): 1
--------------------------------------------
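
The probabilities printed above follow from the logits via the softmax function. As a quick check, here is a minimal sketch in plain Python that reproduces them from the printed logits. The helper name `softmax` and the variable names are ours and not part of the notebook.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits printed by the untrained model for the "meal plan" sentence
logits_before = [-0.22869185, 0.11311549]
probs_before = softmax(logits_before)

# Index of the larger probability: 0 = no condition, 1 = medical condition
predicted_before = probs_before.index(max(probs_before))

print(probs_before)
print(predicted_before)
```

Both probabilities are close to 0.5, which is consistent with a randomly initialized classification layer: the prediction for this sentence is little better than a coin flip.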



## Finetuning the model

Let's load some data and have a look.

In [5]:
# To Do!

# 1. Load the file medical_data.csv as pandas dataframe
# 2. Display the content of the dataframe. How is the label information encoded?
# 3. Load the content of the column "text" into a list called texts and the column label into a list called labels

df = pd.read_csv("../medical_data.csv")
df.head(5)

Unnamed: 0.1,Unnamed: 0,text,label
0,19,Undergoing chemotherapy for lymphoma.,1
1,28,Discharged from the hospital this morning.,0
2,8,Surgical intervention is recommended.,1
3,7,Prescribed medication for high cholesterol.,1
4,27,Medical history was updated in the file.,0


In [6]:
texts = list(df["text"])
labels = list(df["label"])
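
Judging from the preview, the label is a binary integer: 1 for sentences that indicate a medical condition and 0 for sentences that do not. With so few sentences it is worth checking how balanced the two classes are. The sketch below does this with the standard library, using only the five preview rows shown above rather than the full file. The names `preview_csv`, `preview_texts` and `preview_labels` are illustrative. On the real data, the same `Counter` call could be applied to the `labels` list.

```python
import csv
import io
from collections import Counter

# The five rows from the preview above (not the full medical_data.csv)
preview_csv = """text,label
Undergoing chemotherapy for lymphoma.,1
Discharged from the hospital this morning.,0
Surgical intervention is recommended.,1
Prescribed medication for high cholesterol.,1
Medical history was updated in the file.,0
"""

rows = list(csv.DictReader(io.StringIO(preview_csv)))
preview_texts = [row["text"] for row in rows]
preview_labels = [int(row["label"]) for row in rows]  # CSV values are strings

# Tally how often each class occurs
label_counts = Counter(preview_labels)
print(label_counts)
```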

Let's create training and validation splits:

In [7]:
# Split data into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

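Next, we bring the data into the format the model expects: we tokenize the texts and wrap the input IDs, attention masks, and labels in tensor datasets.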

In [8]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)

# Convert tokenized inputs to PyTorch tensors
train_dataset = TensorDataset(
    torch.tensor(train_encodings["input_ids"]),
    torch.tensor(train_encodings["attention_mask"]),
    torch.tensor(train_labels),
)

val_dataset = TensorDataset(
    torch.tensor(val_encodings["input_ids"]),
    torch.tensor(val_encodings["attention_mask"]),
    torch.tensor(val_labels),
)

print(f"{len(train_dataset)} Training samples.")
print(f"{len(val_dataset)} Validation samples.")

24 Training samples.
6 Validation samples.


Let's take a look at two training samples. Each one is a tuple of input IDs, attention mask, and label:

In [9]:
train_dataset[0]

(tensor([  101,  1996, 16012, 18075,  3463,  2020,  3893,  2005, 16007, 28207,
          5666,  1012,   102,     0,     0]),
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]),
 tensor(1))

In [10]:
train_dataset[1]

(tensor([ 101, 2668, 3778, 2003, 2306, 1996, 3671, 2846, 1012,  102,    0,    0,
            0,    0,    0]),
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]),
 tensor(0))
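
Both samples have 15 entries because `padding=True` pads every sequence to the longest one in the tokenizer call. The trailing zeros are `[PAD]` tokens, and the attention mask is 0 at exactly those positions so that the model ignores them. The following plain-Python sketch mimics that behaviour for the two samples above. `pad_batch` is an illustrative helper, not the tokenizer's implementation, and `longest` is a made-up 15-token stand-in for the longest training sentence.

```python
PAD_ID = 0  # [PAD] in the bert-base-uncased vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Unpadded token IDs of the two samples shown above (101 = [CLS], 102 = [SEP])
sample_0 = [101, 1996, 16012, 18075, 3463, 2020, 3893, 2005, 16007, 28207, 5666, 1012, 102]
sample_1 = [101, 2668, 3778, 2003, 2306, 1996, 3671, 2846, 1012, 102]
# Hypothetical stand-in for the longest training sentence (15 tokens)
longest = [101] + [1996] * 13 + [102]

ids, mask = pad_batch([sample_0, sample_1, longest])
print(ids[1])
print(mask[1])
```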

Next, we create our dataloaders:

In [11]:
batch_size = 2

train_loader = DataLoader(
    train_dataset, sampler=RandomSampler(train_dataset), batch_size=batch_size
)

val_loader = DataLoader(
    val_dataset, sampler=SequentialSampler(val_dataset), batch_size=batch_size
)
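
The number of batches matters beyond memory use. The training cell that follows steps its linear learning-rate schedule once per batch, so `len(train_loader) * epochs` fixes the length of the schedule. Here is a rough sketch of that arithmetic in plain Python, using the values from this notebook (24 training samples, batch size 2, 10 epochs, base learning rate 2e-5, no warmup). `linear_lr` is an illustrative helper that mirrors the shape of a linear warmup and decay schedule. It is not the library's own code.

```python
import math

num_samples = 24  # training samples reported above
batch_size = 2
epochs = 10       # value used in the training cell
base_lr = 2e-5    # value used in the training cell

# A DataLoader that keeps the last partial batch yields ceil(n / batch_size) batches
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs


def linear_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate after `step` scheduler updates: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(steps_per_epoch, total_steps)
for step in (0, 60, 119, 120):
    print(step, linear_lr(step, base_lr, total_steps))
```

With these settings the learning rate starts at the base value, is halved by the midpoint of training, and reaches zero after the last step.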

Finally, we can train the model:

In [13]:
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)
epochs = 10

# Create a learning rate scheduler to linearly decrease the learning rate over the training epochs
total_steps = len(train_loader) * epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=total_steps
)


for epoch in range(epochs):
    print(f"\n======== Epoch {epoch + 1} / {epochs} ========")

    # --- Training Phase ---
    print("Training...")
    model.train()
    train_loss = 0.0

    for batch in train_loader:
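        # move data to device
        input_ids = batch[0].to(device)
        attention_mask = batch[1].to(device)
        # use a separate name so the `labels` list created earlier is not overwritten
        batch_labels = batch[2].to(device)

        optimizer.zero_grad()

        # forward pass; passing labels makes the model return the loss as well
        outputs = model(input_ids, attention_mask=attention_mask, labels=batch_labels)
        loss = outputs.loss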
        train_loss += loss.item()

        loss.backward()
        optimizer.step()
        scheduler.step()

    avg_train_loss = train_loss / len(train_loader)
    print(f"  Average training loss: {avg_train_loss:.4f}")

    # --- Validation Phase ---
    print("Validating...")
    model.eval()
    val_loss = 0.0

    with torch.no_grad():
        for batch in val_loader:

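            # move data to device
            input_ids = batch[0].to(device)
            attention_mask = batch[1].to(device)
            # use a separate name so the `labels` list created earlier is not overwritten
            batch_labels = batch[2].to(device)

            # forward pass; passing labels makes the model return the loss as well
            outputs = model(input_ids, attention_mask=attention_mask, labels=batch_labels)
            loss = outputs.loss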
            val_loss += loss.item()

    avg_val_loss = val_loss / len(val_loader)
    print(f"  Validation Loss: {avg_val_loss:.4f}")

print("\nTraining complete!")


======== Epoch 1 / 10 ========
Training...
  Average training loss: 0.7435
Validating...
  Validation Loss: 0.6835

======== Epoch 2 / 10 ========
Training...
  Average training loss: 0.6547
Validating...
  Validation Loss: 0.6718

======== Epoch 3 / 10 ========
Training...
  Average training loss: 0.6248
Validating...
  Validation Loss: 0.6447

======== Epoch 4 / 10 ========
Training...
  Average training loss: 0.5266
Validating...
  Validation Loss: 0.5554

======== Epoch 5 / 10 ========
Training...
  Average training loss: 0.3538
Validating...
  Validation Loss: 0.4841

======== Epoch 6 / 10 ========
Training...
  Average training loss: 0.2011
Validating...
  Validation Loss: 0.4793

======== Epoch 7 / 10 ========
Training...
  Average training loss: 0.1247
Validating...
  Validation Loss: 0.4920

======== Epoch 8 / 10 ========
Training...
  Average training loss: 0.0948
Validating...
  Validation Loss: 0.4992

======== Epoch 9 / 10 ========
Training...
  Average training loss: 0.0800
Validating...
  Validation Loss: 0.5061

======== Epoch 10 / 10 ========
Training...
  Average training loss: 0.0778
Validating...
  Validation Loss: 0.5081

Training complete!
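
The log shows the training loss falling steadily, while the validation loss reaches its minimum of 0.4793 in the sixth epoch and creeps upward afterwards. This pattern is commonly read as the onset of overfitting, which would not be surprising with only 24 training and 6 validation sentences. One common response is early stopping. The sketch below applies a simple patience rule to the losses logged above. The helper names and `patience=2` are arbitrary choices for illustration, not something the notebook itself uses.

```python
# Per-epoch losses copied from the training log above
train_losses = [0.7435, 0.6547, 0.6248, 0.5266, 0.3538, 0.2011, 0.1247, 0.0948, 0.0800, 0.0778]
val_losses = [0.6835, 0.6718, 0.6447, 0.5554, 0.4841, 0.4793, 0.4920, 0.4992, 0.5061, 0.5081]


def best_epoch(losses):
    """Return the 1-based epoch with the lowest loss."""
    return min(range(len(losses)), key=losses.__getitem__) + 1


def early_stop_epoch(losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the loss has failed to improve on its best value
    for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


print(best_epoch(val_losses))
print(early_stop_epoch(val_losses))
```

In a real training loop one would typically also keep a copy of the model weights from the best epoch, so that the checkpoint with the lowest validation loss can be restored after stopping.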


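Let's repeat the classification after training. This time we classify the sentence that does indicate a medical condition. To compare directly with the result from before training, set sentence_to_classify to test_sentence_no_condition and run the cell again.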

In [14]:
# To Do!

# Redo the classification on your test sentences!

sentence_to_classify = test_sentence_condition

# Set the model to evaluation mode
model.eval()

# Tokenize the sentence and get tensors
encoding = tokenizer(sentence_to_classify, return_tensors="pt")
input_ids = encoding["input_ids"].to(device)
attention_mask = encoding["attention_mask"].to(device)

# Make a prediction
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits

# Convert logits to probabilities
probabilities = torch.nn.functional.softmax(logits, dim=1)
predicted_class = torch.argmax(probabilities, dim=1).item()

print(f"Sentence: '{sentence_to_classify}'")
print(f"Logits: {logits.cpu().numpy()[0]}")
print(
    f"Probabilities (No Condition, Medical Condition): {probabilities.cpu().numpy()[0]}"
)
print(f"Predicted Class (after training): {predicted_class}")
print("--------------------------------------------\n")

Sentence: 'The patient suffers from chronic headaches.'
Logits: [-1.0838794  1.3528422]
Probabilities (No Condition, Medical Condition): [0.08041502 0.919585  ]
Predicted Class (after training): 1
--------------------------------------------

