# Finetuning BERT

First, let's load the required libraries. We will use the popular transformers package for the model and the necessary preprocessing steps.

Our goal in this notebook is to classify sentences from a clinical context according to whether or not they indicate a medical condition.

In [None]:
import torch
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification, get_linear_schedule_with_warmup
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from torch.optim import AdamW
import os

We start by loading the pretrained model and its tokenizer.

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
).to(device)

tokenizer = BertTokenizer.from_pretrained(
    "bert-base-uncased",
    do_lower_case=True
)

The warning tells us that we need to train the model for our downstream task, because an untrained classification layer was added on top of the pretrained model.
To see the effect of fine-tuning, we first try to classify a "medical" sentence before training.

In [8]:
# To do!:
# Come up with two test sentences.
# One should indicate the presence of a medical condition, and the other should come from a clinical setting but not indicate any medical condition.

test_sentence_condition = ""
test_sentence_no_condition = ""

In [None]:
sentence_to_classify = test_sentence_condition

# Set the model to evaluation mode
model.eval()

# Tokenize the sentence and get tensors
encoding = tokenizer(sentence_to_classify, return_tensors='pt')
input_ids = encoding['input_ids'].to(device)
attention_mask = encoding['attention_mask'].to(device)

# Make a prediction
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    logits = outputs.logits

# Convert logits to probabilities
probabilities = torch.nn.functional.softmax(logits, dim=1)
predicted_class = torch.argmax(probabilities, dim=1).item()

print(f"Sentence: '{sentence_to_classify}'")
print(f"Logits: {logits.cpu().numpy()[0]}")
print(f"Probabilities (No Condition, Medical Condition): {probabilities.cpu().numpy()[0]}")
print(f"Predicted Class (before training): {predicted_class}")
print("--------------------------------------------\n")
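
To make the last step of the cell above concrete, here is a minimal sketch of what the softmax and argmax do, written in plain Python. The logit values are hypothetical and not real model output. A freshly initialized classification layer tends to produce logits that lie close together, so the probabilities usually end up near 0.5 each.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the maximum improves numerical stability and leaves the result unchanged
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (no condition, medical condition); NOT real model output
example_logits = [0.1, -0.2]
example_probs = softmax(example_logits)

# argmax: the index of the largest probability is the predicted class
example_class = max(range(len(example_probs)), key=example_probs.__getitem__)

print(f"Probabilities: {example_probs}")
print(f"Predicted class: {example_class}")
```

Because softmax preserves the ordering of the logits, the predicted class is the same whether the argmax is taken over logits or over probabilities.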

## **Finetuning the model**

Let's load some data and have a look.

In [24]:
# To Do!

# 1. Load the file medical_data.csv as a pandas DataFrame
# 2. Display the content of the DataFrame. How is the label information encoded?
# 3. Load the content of the column "text" into a list called texts and the column "label" into a list called labels

Let's create training and validation splits:

In [29]:
# Split data into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

Bring the data into the right format:

In [None]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)

# Convert tokenized inputs to PyTorch tensors
train_dataset = TensorDataset(
    torch.tensor(train_encodings["input_ids"]),
    torch.tensor(train_encodings["attention_mask"]),
    torch.tensor(train_labels)
)

val_dataset = TensorDataset(
    torch.tensor(val_encodings["input_ids"]),
    torch.tensor(val_encodings["attention_mask"]),
    torch.tensor(val_labels)
)

print(f"{len(train_dataset)} Training samples.")
print(f"{len(val_dataset)} Validation samples.")
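
The following sketch illustrates what `padding=True` does conceptually: shorter sequences are filled up to the length of the longest one, and the attention mask marks real tokens with 1 and padding with 0. The helper `pad_batch` and the token ids are hypothetical and only serve as an illustration. They are not the tokenizer's actual implementation or output.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to the longest length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # fill up on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Hypothetical token ids, for illustration only
toy_sequences = [[101, 7, 8, 9, 102], [101, 7, 102]]
toy_ids, toy_mask = pad_batch(toy_sequences)

print(toy_ids)
print(toy_mask)
```

Since every row now has the same length, the lists can be converted into rectangular tensors, which is what `torch.tensor(...)` requires in the cell above.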

Let's take a look:

In [None]:
train_dataset[0]

In [None]:
train_dataset[1]

Next, we create our dataloaders:

In [50]:
batch_size = 2

train_loader = DataLoader(
    train_dataset,
    sampler=RandomSampler(train_dataset),
    batch_size=batch_size
)

val_loader = DataLoader(
    val_dataset,
    sampler=SequentialSampler(val_dataset),
    batch_size=batch_size
)

Finally, we can train the model:

In [None]:
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)
epochs = 10

# Create a learning rate scheduler that linearly decreases the learning rate to zero over all training steps
total_steps = len(train_loader) * epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)



for epoch in range(epochs):
    print(f"\n======== Epoch {epoch + 1} / {epochs} ========")
    
    # --- Training Phase ---
    print("Training...")
    model.train()
    train_loss = 0.0

    for batch in train_loader:
        # move data to device
        input_ids = batch[0].to(device)
        attention_mask = batch[1].to(device)
        batch_labels = batch[2].to(device)  # separate name so the `labels` list is not overwritten

        optimizer.zero_grad()

        # forward pass
        outputs = model(input_ids, attention_mask=attention_mask, labels=batch_labels)
        loss = outputs.loss
        train_loss += loss.item()

        loss.backward()   # backward pass: compute the gradients
        optimizer.step()  # update the model weights
        scheduler.step()  # update the learning rate

    avg_train_loss = train_loss / len(train_loader)
    print(f"  Average training loss: {avg_train_loss:.4f}")

    # --- Validation Phase ---
    print("Validating...")
    model.eval()  
    val_loss = 0.0

    with torch.no_grad():  # no gradients are needed for validation
        for batch in val_loader:

            # move to device
            input_ids = batch[0].to(device)
            attention_mask = batch[1].to(device)
            batch_labels = batch[2].to(device)  # separate name so the `labels` list is not overwritten

            # forward pass
            outputs = model(input_ids, attention_mask=attention_mask, labels=batch_labels)
            loss = outputs.loss
            val_loss += loss.item()

    avg_val_loss = val_loss / len(val_loader)
    print(f"  Validation Loss: {avg_val_loss:.4f}")

print("\nTraining complete!")
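
To get a feeling for the learning rate schedule used above, the sketch below recomputes it by hand in plain Python. It follows the documented behaviour of a linear schedule with warmup: a linear increase during warmup (none here, since `num_warmup_steps=0`), then a linear decrease to zero. The function `linear_schedule_factor` and the sample count of 8 are assumptions made for illustration. This is not the library's code.

```python
import math

def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # warmup: increase linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # afterwards: decrease linearly from 1 to 0
    return max(0.0, (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps))

toy_base_lr = 2e-5
toy_n_samples = 8   # hypothetical number of training samples
toy_batch_size = 2
toy_epochs = 10

# One optimizer step per batch, so steps per epoch = number of batches
toy_steps_per_epoch = math.ceil(toy_n_samples / toy_batch_size)
toy_total_steps = toy_steps_per_epoch * toy_epochs

for step in (0, toy_total_steps // 2, toy_total_steps):
    lr = toy_base_lr * linear_schedule_factor(step, 0, toy_total_steps)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

Because `scheduler.step()` is called once per batch, the learning rate shrinks a little after every optimizer step rather than once per epoch.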

Let's redo the classification of our test sample!

In [None]:
# To Do!

# Redo the classification on your test sentences!