# 🧠 Fine-Tuning BERT with LoRA for Sequence Classification

This notebook demonstrates how to fine-tune a pretrained BERT model using **LoRA (Low-Rank Adaptation)** via the `peft` library for a binary **sequence classification task**. The approach leverages **parameter-efficient fine-tuning**, meaning only a small portion of the model is updated during training — making it faster and more memory-friendly.

## ✅ Key Steps Covered

- **Load and preprocess dataset** using 🤗 `datasets` and `transformers` tokenizers  
- **Apply LoRA** to a `BertForSequenceClassification` model using the `peft` library  
- **Tokenize and dynamically pad** text inputs with `DataCollatorWithPadding`  
- **Define training arguments** with Hugging Face `TrainingArguments`  
- **Log training progress** and metrics with Weights & Biases (`wandb`)  
- **Train the model** efficiently using Hugging Face `Trainer`  
- **Evaluate performance** using `accuracy`, `f1`, `precision`, and `recall`  
- **Review metrics across epochs** and identify the best checkpoint

## 📦 Tools & Libraries Used

- `transformers` — model loading, tokenization, training
- `datasets` — easy access to the dataset
- `peft` — for efficient fine-tuning using LoRA
- `sklearn` — for metric computation
- `wandb` — to visualize and track experiments

## 🧪 Why LoRA?

Traditional fine-tuning updates every model parameter, which is expensive and slow. LoRA injects **trainable low-rank matrices** into attention layers and trains **only those**, keeping the pretrained weights frozen. In this notebook that cuts the trainable parameters to roughly 1% of the model, usually with little loss in accuracy.

---

At the end of this notebook, you'll have a high-performing, lightweight BERT classifier — ready for deployment or further experimentation.


In [1]:
!pip install wandb
!pip install transformers 
!pip install peft
!pip install evaluate
!pip install scikit-learn


### 📊 Experiment Tracking with Weights & Biases (wandb)

In this notebook, we use **Weights & Biases (`wandb`)** to track and visualize training metrics such as loss, accuracy, precision, and F1-score across training epochs.

#### 📦 What is `wandb`?

[`wandb`](https://wandb.ai) is a powerful tool for:
- Tracking training progress and hyperparameters
- Logging metrics, model checkpoints, and evaluation scores
- Visualizing learning curves in real-time
- Comparing multiple experiment runs in a dashboard

---

### 🧪 How to Set It Up (First-Time Use)

🛑 Important:
When using `wandb` for the first time, you must log in with your personal API key, which you can find in your WandB account settings. This authenticates you and enables access to your dashboard.

⚠️ Keep your API key private and never publish it in public notebooks or repositories.

To log to WandB, set the `report_to` parameter in your `TrainingArguments`: `report_to=["wandb"]`.
To disable WandB logging, set `report_to=[]`.
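
One way to keep the key out of the notebook is to read it from an environment variable and fall back to no logging when it is absent. The sketch below is an illustration rather than part of the original notebook: `choose_report_to` is a hypothetical helper, and it assumes the key is stored in `WANDB_API_KEY`, the variable the `wandb` client reads.

```python
import os


def choose_report_to(env=None):
    """Return a value for TrainingArguments(report_to=...).

    Hypothetical helper: enables WandB only when an API key is present.
    """
    env = os.environ if env is None else env
    # wandb reads WANDB_API_KEY by itself, so the key never has to appear in a cell.
    return ["wandb"] if env.get("WANDB_API_KEY") else []


print(choose_report_to({"WANDB_API_KEY": "dummy-key-for-illustration"}))  # ['wandb']
print(choose_report_to({}))                                                # []
```

The result could then be passed as `report_to=choose_report_to()` when building `TrainingArguments`.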

In [2]:
# import wandb
# wandb.login()

### 📁 Dataset Overview: `jackhhao/jailbreak-classification`

In this notebook, we use the dataset **`jackhhao/jailbreak-classification`**, available on the 🤗 Hugging Face Hub.

#### 📌 Purpose:
The dataset is designed to train and evaluate models on their ability to **classify prompts** as either:

- **"jailbreak"** — prompts attempting to bypass AI safety mechanisms (e.g., trying to make a model say something harmful, unsafe, or restricted)
- **"benign"** — safe, standard prompts with no harmful intent

This task is important in the context of **AI safety**, especially for large language models (LLMs), where detecting and preventing harmful usage is a priority.

---

### 📊 Dataset Structure

The dataset consists of two splits:
- `train`
- `test`

Each sample includes:
- `prompt`: the input text given to the model
- `type`: the label (either `"jailbreak"` or `"benign"`)

In [3]:
from datasets import load_dataset

dataset = load_dataset("jackhhao/jailbreak-classification")

README.md:   0%|          | 0.00/988 [00:00<?, ?B/s]

jailbreak_dataset_train_balanced.csv:   0%|          | 0.00/1.31M [00:00<?, ?B/s]

jailbreak_dataset_test_balanced.csv:   0%|          | 0.00/370k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1044 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/262 [00:00<?, ? examples/s]

In [4]:
dataset['train'][5]

 'type': 'jailbreak'}

### 📌 Extracting Labels from the Training Dataset

This line creates a list of labels by iterating over each example in the training split of the dataset.  
Each example is a dictionary, and the `'type'` field represents the class label (`'benign'` or `'jailbreak'`).


In [5]:
labels = [x['type'] for x in dataset['train']]

In [6]:
print(f"Benign type: {labels.count('benign')}, Jailbreak type: {labels.count('jailbreak')}")

Benign type: 517, Jailbreak type: 527
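
The same balance check can be written with `collections.Counter`, which also makes it easy to express the counts as shares. This is a small sketch; `labels_example` is a stand-in list rebuilt from the counts printed above, not the real `labels` variable.

```python
from collections import Counter

# Stand-in for `labels`, rebuilt from the printed counts (517 benign, 527 jailbreak).
labels_example = ["benign"] * 517 + ["jailbreak"] * 527

counts = Counter(labels_example)
total = sum(counts.values())
# Share of each class in the training split, rounded to 4 decimals.
shares = {name: round(n / total, 4) for name, n in counts.items()}

print(total)   # 1044
print(shares)  # {'benign': 0.4952, 'jailbreak': 0.5048}
```

With classes this close to 50/50, accuracy is a reasonable headline metric and no class re-weighting is needed.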


In [7]:
label_mapping = {"benign": 0, "jailbreak": 1}
dataset = dataset.map(lambda x: {"label": label_mapping[x["type"]]})

Map:   0%|          | 0/1044 [00:00<?, ? examples/s]

Map:   0%|          | 0/262 [00:00<?, ? examples/s]

### 🧠 Tokenizing the Dataset

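This code block loads the `bert-base-cased` tokenizer and applies it to every `prompt` using the `map()` function from the 🤗 Datasets library. `truncation=True` cuts prompts that exceed the model's 512-token limit, and `padding="max_length"` pads every shorter prompt up to that same length.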


In [8]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["prompt"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Map:   0%|          | 0/1044 [00:00<?, ? examples/s]

Map:   0%|          | 0/262 [00:00<?, ? examples/s]

In [9]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42)
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42)

In [10]:
import numpy as np
from evaluate import load
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import Trainer

# Load the F1 metric from `evaluate` (not used below; compute_metrics relies on scikit-learn)
metric = load("f1")

# Define a custom compute_metrics function
def compute_metrics(pred):
    # pred.predictions holds the logits; pick the highest-scoring class per example
    preds = np.argmax(pred.predictions, axis=1)
    labels = pred.label_ids
    # average='binary' scores the positive class (label 1, "jailbreak")
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }



Downloading builder script:   0%|          | 0.00/6.79k [00:00<?, ?B/s]

In [11]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-cased', 
    num_labels=2
)

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
import torch

### 🧠 Model Evaluation with Hugging Face `Trainer`

This block sets up evaluation for a pretrained BERT model using the Hugging Face `Trainer` API — without performing any training or fine-tuning.

#### 📌 What happens step by step:

- **TrainingArguments**: Configuration is defined for the evaluation process, including batch size, logging directory, and disabling external loggers like Weights & Biases (`report_to=[]`).

- **Trainer initialization**: A `Trainer` object is created, linking the model, tokenizer, arguments, and a custom `compute_metrics` function. This enables automated evaluation and metric reporting.

- **Evaluation**: The `evaluate()` method is called on a small evaluation dataset. It feeds the data through the model in evaluation mode and uses the `compute_metrics` function to calculate metrics like accuracy and F1-score.

#### ⚠️ Important:

- The model used here is a **pretrained BERT encoder** with a **newly initialized classification head** (see the warning printed when the model was loaded); it has **not been fine-tuned** on this dataset.
- As a result, the evaluation metrics reflect the **baseline performance** of the model before training and should be close to chance.

This setup is useful for:
- Checking baseline metrics before fine-tuning
- Ensuring data is processed correctly
- Validating that the evaluation pipeline works as expected



In [13]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./output",
    per_device_eval_batch_size=32,
    do_eval=True,
    logging_dir="./logs",
    report_to=[]
    # report_to=["wandb"]    # ✅ enables WandB logging
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    compute_metrics=compute_metrics
)

# Run evaluation
results = trainer.evaluate(small_eval_dataset)
print(results)



{'eval_loss': 0.6976484656333923, 'eval_model_preparation_time': 0.0027, 'eval_accuracy': 0.5152671755725191, 'eval_f1': 0.33507853403141363, 'eval_precision': 0.6153846153846154, 'eval_recall': 0.2302158273381295, 'eval_runtime': 4.5507, 'eval_samples_per_second': 57.573, 'eval_steps_per_second': 1.978}
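
An `eval_loss` of about 0.70 is what an uninformed binary classifier produces: assigning probability 0.5 to each class gives a cross-entropy of ln 2. The short check below, in plain Python, compares that reference value with the loss reported above.

```python
import math

# Cross-entropy of a classifier that assigns probability 0.5 to each of two classes.
chance_loss = -math.log(0.5)
observed_loss = 0.6976484656333923  # eval_loss from the output above

print(round(chance_loss, 4))                  # 0.6931
print(round(observed_loss - chance_loss, 4))  # 0.0045
```

The gap is tiny, which fits a model whose classification head has not been trained yet.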


### 📊 Evaluation Results

Values rounded from the output above:

```python
{
  'eval_loss': 0.6976,
  'eval_accuracy': 0.5153,
  'eval_f1': 0.3351,
  'eval_precision': 0.6154,
  'eval_recall': 0.2302,
  'eval_model_preparation_time': 0.0027,
  'eval_runtime': 4.5507,
  'eval_samples_per_second': 57.573,
  'eval_steps_per_second': 1.978
}
```

### 🔧 LoRA Configuration for Sequence Classification

In this section, we define a configuration for applying **LoRA (Low-Rank Adaptation)** to a sequence classification task using the `peft` library.

#### 📦 Library: `peft`
- The `peft` library enables **parameter-efficient fine-tuning** of large language models.
- Instead of updating all model weights, LoRA injects small trainable matrices into selected layers, reducing memory and compute requirements.

#### 🧩 What the configuration means:

- `task_type=TaskType.SEQ_CLS`  
  Specifies the task type as **sequence classification**, e.g., sentiment analysis, toxicity detection, etc.

- `r=32`  
  Defines the **rank** of the low-rank matrices. Higher values add expressiveness but also increase the number of trainable parameters.

- `lora_alpha=16`  
  A scaling factor for the LoRA update. The update is multiplied by `lora_alpha / r`, which is 16 / 32 = 0.5 here.

This configuration tells the `peft` system:
- "I want to fine-tune only a small number of parameters"
- "The model is for sequence classification"
- "Use low-rank matrices of rank 32 and scale the update by α / r = 0.5"

#### ⚡️Why it matters:
Using LoRA allows you to fine-tune large models like BERT or LLaMA on consumer hardware, with:
- far fewer trainable parameters (about 1% of the model in this notebook)
- minimal impact on performance (if `r` is well-chosen)
- faster training and, often, a lower risk of overfitting

This is especially helpful for cases where:
- GPU memory is limited
- training time must be short
- multiple personalized models are needed
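
To make the roles of `r` and `lora_alpha` concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. It is an illustration, not the `peft` implementation: the weights are random stand-ins, the bias is omitted, and `lora_linear` is a hypothetical helper name. It uses the common LoRA convention of scaling the update by `lora_alpha / r` and starting `B` at zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, lora_alpha = 768, 32, 16            # BERT-base hidden size, rank, alpha
scaling = lora_alpha / r                  # 0.5 for this configuration

W0 = rng.normal(size=(d, d))              # frozen pretrained weight (stand-in values)
A = rng.normal(scale=0.01, size=(r, d))   # trainable, small random init
B = np.zeros((d, r))                      # trainable, zero init so the update starts at 0


def lora_linear(x):
    """y = x W0^T + scaling * x A^T B^T (bias omitted)."""
    return x @ W0.T + scaling * (x @ A.T) @ B.T


x = rng.normal(size=(4, d))
# With B = 0 the adapted layer reproduces the frozen layer exactly.
print(np.allclose(lora_linear(x), x @ W0.T))  # True
# Parameters in the full matrix vs. in the two low-rank factors.
print(W0.size, A.size + B.size)               # 589824 49152
```

Only `A` and `B` would receive gradients during training, so each adapted matrix contributes 49,152 trainable values instead of 589,824.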


In [14]:
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, r=32, lora_alpha=16,
)

### 🔗 Applying LoRA to the Base Model

In this step, we apply the previously defined LoRA configuration to our pretrained model using `get_peft_model()` from the `peft` library.

#### ⚙️ What happens here:

- `get_peft_model(model, lora_config)` wraps the base model (e.g., `BertForSequenceClassification`) and injects **LoRA layers** according to the provided configuration.
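- The original model parameters are **frozen** (non-trainable), apart from the classification head, which `peft` keeps trainable for `SEQ_CLS` tasks.
- Only the newly added LoRA parameters and the classification head will be updated during training.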

This results in a lightweight model that is significantly more efficient to fine-tune, especially useful when:
- working with large language models,
- using low-resource environments (e.g., laptops, limited GPU),
- or training many small models in parallel.

In [15]:
from peft import get_peft_model
model = get_peft_model(model, lora_config)

You can check the number of trainable parameters using:

In [16]:
model.print_trainable_parameters()

trainable params: 1,181,186 || all params: 109,492,996 || trainable%: 1.0788
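
The reported count can be reproduced by hand. The arithmetic below rests on two assumptions that are not stated in the notebook but are consistent with the printed total: LoRA is attached to the query and value projections of each of the 12 encoder layers, and the classification head (weight and bias) stays trainable.

```python
hidden, r, num_layers, num_labels = 768, 32, 12, 2

# Assumption: LoRA wraps the query and value projections in every encoder layer.
adapted_per_layer = 2
lora_per_matrix = r * hidden + hidden * r   # A is (r x hidden), B is (hidden x r)
lora_total = lora_per_matrix * adapted_per_layer * num_layers

# Assumption: the classification head (weight + bias) also stays trainable.
classifier = hidden * num_labels + num_labels

trainable = lora_total + classifier
all_params = 109_492_996                    # taken from the output above

print(f"{trainable:,}")                            # 1,181,186
print(f"{100 * trainable / all_params:.4f}")       # 1.0788
```

Doubling `r` would roughly double the LoRA share of this total, since `lora_total` grows linearly with the rank.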


### ⚙️ TrainingArguments Explanation

This section defines training hyperparameters using Hugging Face's `TrainingArguments`. Below is a breakdown of each parameter:

| Argument                        | Description |
|---------------------------------|-------------|
| `output_dir="./results_2"`      | Directory where checkpoints and model outputs will be saved. |
| `eval_strategy="epoch"`         | Evaluation is triggered at the end of each training epoch. |
| `learning_rate=5e-5`            | Initial learning rate for the optimizer. |
| `per_device_train_batch_size=16`| Batch size used for training on each device (GPU/CPU). |
| `per_device_eval_batch_size=16` | Batch size used for evaluation on each device. |
| `num_train_epochs=10`           | Number of times the model will iterate over the entire training dataset. |
| `weight_decay=0.01`             | Weight decay applied by the optimizer to discourage large weights and reduce overfitting. |
| `save_total_limit=2`            | Limits the total number of saved checkpoints; older ones are deleted. |
| `load_best_model_at_end=True`   | Loads the best checkpoint at the end of training (lowest validation loss, since `metric_for_best_model` is not set). |
| `logging_dir="./logs"`          | Directory for storing logs (e.g., for TensorBoard). |
| `logging_steps=100`             | Logs the training loss every 100 steps. |
| `fp16=True`                     | Enables mixed-precision training (float16) for faster performance on GPUs. |
| `save_strategy="epoch"`         | Saves the model at the end of each epoch. |
| `report_to=[]`                  | Disables external logging integrations like WandB or TensorBoard. |
| `run_name="workshop"`           | Name of the training run for easier tracking in logs or dashboards. |

This configuration evaluates and saves the model after every epoch and uses mixed precision to speed up training.
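
These settings also fix the length of the run. The calculation below is a sketch that assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; the split size of 1044 comes from the dataset loading output earlier in the notebook.

```python
import math

train_examples = 1044   # size of the train split
batch_size = 16         # per_device_train_batch_size
epochs = 10             # num_train_epochs
logging_steps = 100

# Assumptions: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
# Epoch during which the first training-loss log is written.
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch, total_steps, first_logged_epoch)  # 66 660 2
```

The total of 660 matches the `global_step` reported by `trainer.train()` later in the notebook, and because the first log is written at step 100, the training-loss column for epoch 1 shows "No log".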


In [17]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results_2",
    eval_strategy="epoch",
    learning_rate=5e-5, 
    per_device_train_batch_size=16, 
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    save_total_limit=2,
    load_best_model_at_end=True,
    logging_dir="./logs", 
    logging_steps=100,
    fp16=True,
    save_strategy = "epoch",
    report_to = [ ],
    # report_to = ["wandb"],  # ✅ enables WandB logging
    run_name = "workshop"
)


### 📦 Dynamic Padding with `DataCollatorWithPadding`

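To train a model using Hugging Face's `Trainer`, all sequences in a batch must be the same length. Since input texts vary in length, padding is required. Instead of padding all sequences to a fixed length (`max_length`), a **dynamic padding strategy** pads each batch only to its longest sequence. Note that the tokenization step above already pads with `padding="max_length"`, so `DataCollatorWithPadding` has nothing left to shorten here; to benefit from dynamic padding, drop that argument from `tokenize_function`.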
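
The difference between the two strategies is easy to see on a toy batch. The sketch below is plain Python and does not call the tokenizer; `pad_batch` is a hypothetical helper, the token ids are made-up values, and 512 stands for the maximum input length of BERT-base.

```python
def pad_batch(batch, pad_id=0, max_length=None):
    """Pad token-id lists to max_length, or to the longest sequence if max_length is None."""
    target = max_length if max_length is not None else max(len(seq) for seq in batch)
    return [seq + [pad_id] * (target - len(seq)) for seq in batch]


# Toy token ids (made-up values) standing in for three tokenized prompts.
batch = [[101, 7592, 102], [101, 2129, 2024, 2017, 102], [101, 102]]

dynamic = pad_batch(batch)                 # pads to the longest sequence in this batch
fixed = pad_batch(batch, max_length=512)   # pads every sequence to 512

print(len(dynamic[0]), len(fixed[0]))                # 5 512
print(sum(map(len, dynamic)), sum(map(len, fixed)))  # 15 1536
```

For short prompts the fixed strategy processes far more padding tokens per batch, which is the cost dynamic padding avoids.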


In [18]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [19]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
    data_collator = data_collator
)

No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [20]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|-------|---------------|-----------------|----------|----|-----------|--------|
| 1 | No log | 0.459113 | 0.847328 | 0.859155 | 0.841379 | 0.877698 |
| 2 | 0.553900 | 0.262848 | 0.90458 | 0.911661 | 0.895833 | 0.928058 |
| 3 | 0.553900 | 0.155971 | 0.958015 | 0.960573 | 0.957143 | 0.964029 |
| 4 | 0.267400 | 0.123469 | 0.961832 | 0.964539 | 0.951049 | 0.978417 |
| 5 | 0.169000 | 0.100604 | 0.969466 | 0.971223 | 0.971223 | 0.971223 |
| 6 | 0.169000 | 0.095116 | 0.969466 | 0.971429 | 0.964539 | 0.978417 |
| 7 | 0.110600 | 0.09085 | 0.965649 | 0.967742 | 0.964286 | 0.971223 |
| 8 | 0.108800 | 0.085096 | 0.973282 | 0.974729 | 0.978261 | 0.971223 |
| 9 | 0.108800 | 0.085604 | 0.973282 | 0.97491 | 0.971429 | 0.978417 |
| 10 | 0.095700 | 0.084201 | 0.973282 | 0.97491 | 0.971429 | 0.978417 |


TrainOutput(global_step=660, training_loss=0.20585528792756977, metrics={'train_runtime': 448.0124, 'train_samples_per_second': 23.303, 'train_steps_per_second': 1.473, 'total_flos': 2784762037370880.0, 'train_loss': 0.20585528792756977, 'epoch': 10.0})
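
The epoch-10 metrics can be sanity-checked against a confusion matrix. The counts below are not printed by the notebook; they are inferred here as the integer counts over the 262 test examples that reproduce the logged values, so treat them as an assumption.

```python
# Counts inferred from the epoch-10 row (an assumption, not notebook output).
tp, fp, fn, tn = 136, 4, 3, 119

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(tp + fp + fn + tn)  # 262
print(round(accuracy, 6), round(f1, 6), round(precision, 6), round(recall, 6))
```

Under these counts the model misses 3 jailbreak prompts and flags 4 benign prompts, which is a more tangible reading of the scores than the ratios alone.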

### ✅ Training Summary

The model was fine-tuned using the LoRA (Low-Rank Adaptation) approach on a sequence classification task for 10 epochs. Below is a summary of training outcomes:

#### 📈 Performance Metrics:

| Metric           | Final Value (Epoch 10) |
|------------------|------------------------|
| **Accuracy**     | 0.9733                 |
| **F1 Score**     | 0.9749                 |
| **Precision**    | 0.9714                 |
| **Recall**       | 0.9784                 |
| **Validation Loss** | 0.0842              |
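
To identify the best checkpoint, recall that with `load_best_model_at_end=True` and no `metric_for_best_model`, `Trainer` compares checkpoints by validation loss. The sketch below repeats that selection by hand on the validation losses copied from the training log; the dictionary is typed in manually for illustration.

```python
# Validation loss per epoch, copied from the training log above.
val_loss = {
    1: 0.459113, 2: 0.262848, 3: 0.155971, 4: 0.123469, 5: 0.100604,
    6: 0.095116, 7: 0.090850, 8: 0.085096, 9: 0.085604, 10: 0.084201,
}

# Lower validation loss is better, so take the epoch with the minimum.
best_epoch = min(val_loss, key=val_loss.get)
print(best_epoch, val_loss[best_epoch])  # 10 0.084201
```

Here the last epoch is also the best one, so the model held in memory after training corresponds to the epoch-10 checkpoint.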
