# TASK 02: Parameter-Efficient Supervised Fine-Tuning of LLaMA 3.2 (1B) on a Medical Chain-of-Thought Dataset

## Project Overview

This project focuses on fine-tuning a large language model, specifically Meta’s LLaMA 3.2 (1B parameter variant), for medical reasoning tasks using Chain-of-Thought (CoT) supervision. The goal is to enhance the model's ability to generate detailed, step-by-step reasoning and accurate answers for complex medical questions.

## Task and Objectives:
**Task:** Train a causal language model to perform complex medical question-answering by following the chain-of-thought approach.

**Objective:** Improve the model’s reasoning and answer quality by explicitly teaching it to generate intermediate reasoning steps (CoT) before producing the final answer.

**Dataset:** We use the FreedomIntelligence/medical-o1-reasoning-SFT dataset containing medical questions, CoT reasoning, and answers.

## Approach:
**Model:** Meta LLaMA 3.2 1B, a powerful open-source foundation model for language tasks.

**Fine-tuning Method:** Parameter-efficient fine-tuning using LoRA (Low-Rank Adaptation) to adapt the model with fewer trainable parameters and reduced compute requirements.

**Tokenization & Preprocessing:** We prepare the input by combining instruction, chain-of-thought reasoning, and target responses, tokenizing them with padding and truncation to fixed lengths.

**Training:** Using the Hugging Face Trainer API, we train the model with mixed precision (fp16) for efficiency, evaluate periodically, and save the best checkpoint.

**Evaluation:** We split the data into training (95%) and validation (5%) sets to monitor performance during training and select the best model.

**Saving:** Finally, the fine-tuned model and tokenizer are saved for reuse and deployment.


## Why This Matters:
Medical question-answering benefits greatly from detailed, explainable reasoning.

Chain-of-Thought fine-tuning trains the model to think stepwise, improving trustworthiness and answer accuracy.

LoRA enables fine-tuning large models even on limited hardware by reducing parameter updates.

The project demonstrates how to adapt large language models for specialized, high-stakes domains effectively.
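
To make the LoRA point concrete, here is a minimal sketch of the parameter arithmetic for a single weight matrix. LoRA freezes the dense matrix and trains two low-rank factors instead. The dimensions (2048 x 2048) and the rank (16) are illustrative assumptions, not values read from the model or from this notebook.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight matrix that LoRA leaves frozen."""
    return d_in * d_out


# Illustrative dimensions and rank (assumptions, not read from the model config).
d_in, d_out, rank = 2048, 2048, 16
lora = lora_trainable_params(d_in, d_out, rank)
full = full_params(d_in, d_out)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Under these assumptions the adapter trains only a small fraction of the parameters in the matrix it wraps, which is where the memory savings come from.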

## Installing Required Packages
This cell installs the key Python libraries required for parameter-efficient fine-tuning of the LLaMA 3.2 model using the Unsloth library and supporting tools.

**unsloth:** A high-level library optimized for fast and efficient fine-tuning of LLaMA models (especially 1B–8B scale), with support for Low-Rank Adaptation (LoRA).

**bitsandbytes:** Enables 8-bit and 4-bit quantized models to save memory during training—key for fitting large models into consumer GPUs.

**accelerate:** From Hugging Face, used to abstract hardware management (CPU, GPU, TPU) and distributed training.

**trl:** Transformer Reinforcement Learning library; it provides trainers for supervised fine-tuning (SFT) as well as reinforcement-learning methods such as PPO for alignment. It is not strictly needed here, since this notebook trains with the standard Trainer.

**peft:** Parameter-Efficient Fine-Tuning by Hugging Face, helps in using methods like LoRA to reduce memory and computational footprint.

**datasets:** Simplifies loading and preprocessing datasets from Hugging Face Hub or local files.

**wandb:** Weights & Biases, an optional experiment tracking tool.


This cell prepares the Colab/notebook environment with all the libraries needed to:

Load and quantize a large LLM (LLaMA 3.2) using unsloth and bitsandbytes.

Efficiently fine-tune using LoRA with peft.

Work with datasets using datasets.

Optionally monitor your experiments using wandb.



In [1]:
!pip install -q unsloth
!pip install -q bitsandbytes accelerate trl peft datasets wandb
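
After installing, it can help to confirm which versions actually ended up in the environment. The following is a small sketch using only the standard library; the package list simply mirrors the install commands above, and the helper name `installed_versions` is my own.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED = ["unsloth", "bitsandbytes", "accelerate", "trl", "peft", "datasets", "wandb"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:>14}: {ver or 'NOT INSTALLED'}")
```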


## Upgrading the Transformers Library

I’m upgrading the Transformers library to the latest version.

The latest versions of transformers:

Include bug fixes, performance improvements, and new model support (like LLaMA 3).

Are often required by other libraries like unsloth, which expect newer versions of transformers for compatibility.

Support current argument names used later in this notebook, such as eval_strategy in TrainingArguments.

If I skipped this step and used an older version, I might run into compatibility errors when initializing models or tokenizers later on.

In [2]:
!pip install --upgrade transformers




## Logging Into Hugging Face

I’m logging into the Hugging Face Hub from this notebook using my personal Hugging Face account.

Logging in gives me authenticated access to:

Download private or gated models and datasets from Hugging Face (like meta-llama models).

Upload my fine-tuned models back to the Hugging Face Hub if I want to share or deploy them.

Avoid 403 Forbidden or rate-limit issues when accessing large files hosted on Hugging Face.

**What Happens During This Step:**

I’ll be prompted to paste my Hugging Face access token.

Once authenticated, the token is saved locally so that later Hugging Face API calls from this environment are authenticated behind the scenes.


In [3]:
from huggingface_hub import notebook_login
notebook_login()



## Essential Imports

In this step, I’m importing all the core libraries required for model loading, training, fine-tuning, evaluation, and performance boosts.

**Line-by-Line Explanation:**
**import unsloth**
This is crucial to import first, as Unsloth applies performance optimizations early on to speed up and memory-optimize Hugging Face models (especially for LLaMA 3).

**from transformers import ...**
I'm importing the Hugging Face Transformers components:

**AutoTokenizer & AutoModelForCausalLM:** for tokenizing inputs and loading the base LLM.

**Trainer, TrainingArguments:** to handle model training and evaluation with built-in functionality.

**from datasets import load_dataset**
This lets me load Hugging Face datasets easily — in my case, the Medical Chain-of-Thought dataset.

**from peft import ...**
From the Parameter-Efficient Fine-Tuning (PEFT) library, I import:

LoraConfig, get_peft_model, TaskType to apply LoRA (Low-Rank Adaptation) and reduce memory usage and training time.

**import evaluate**
This allows me to use standard evaluation metrics like accuracy, BLEU, F1, etc.

**import torch**
The core PyTorch library, required because the model and training loop used in this notebook run on PyTorch.

This cell sets up the environment with all the necessary libraries and tools to:

Load the LLM model

Fine-tune it using LoRA

Evaluate it efficiently

Optimize performance with unsloth

In [4]:
# Step 1: Imports (import unsloth BEFORE transformers!)

import unsloth  # must be first for speed optimizations
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType
import evaluate
import torch


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


## Loading the Dataset

In this step, I’m loading the dataset that I want to fine-tune my model on. Specifically, I’m using the English configuration of a Chain-of-Thought (CoT) medical reasoning dataset.


**Line-by-Line Explanation:**
**load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train")**
This downloads and loads the training split of the Hugging Face dataset:

**Dataset: FreedomIntelligence/medical-o1-reasoning-SFT**

Configuration: "en" (English)

Split: "train" only (for now)

**print(dataset.column_names)**
This shows me all the column names (fields) in the dataset so I know what data is available — such as "Question", "Complex_CoT", and "Response".

**print(dataset[0])**
I print the first example in the dataset to visually inspect what a full sample looks like. This helps confirm the format and content, which is critical before preprocessing.

This dataset contains structured medical questions, detailed reasoning steps (Chain of Thought), and final answers. It’s ideal for training LLMs on step-by-step medical reasoning, which is valuable for many real-world healthcare AI applications.

I’ve successfully loaded and inspected the Medical CoT dataset to understand its structure and contents. This dataset will be used for training the LLaMA 3 model with Chain-of-Thought supervision.

In [30]:
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT", "en", split="train")
print(dataset.column_names)
print(dataset[0])


['Question', 'Complex_CoT', 'Response']
{'Question': 'Given the symptoms of sudden weakness in the left arm and leg, recent long-distance travel, and the presence of swollen and tender right lower leg, what specific cardiac abnormality is most likely to be found upon further evaluation that could explain these findings?', 'Complex_CoT': "Okay, let's see what's going on here. We've got sudden weakness in the person's left arm and leg - and that screams something neuro-related, maybe a stroke?\n\nBut wait, there's more. The right lower leg is swollen and tender, which is like waving a big flag for deep vein thrombosis, especially after a long flight or sitting around a lot.\n\nSo, now I'm thinking, how could a clot in the leg end up causing issues like weakness or stroke symptoms?\n\nOh, right! There's this thing called a paradoxical embolism. It can happen if there's some kind of short circuit in the heart - like a hole that shouldn't be there.\n\nLet's put this together: if a blood clo

##  Loading the LLaMA 3.2 Model & Setting the Tokenizer

I'm loading the LLaMA 3.2 1B model tokenizer and making sure it has a padding token. Some Hugging Face tokenizers (especially for causal language models like LLaMA) don’t come with a pad_token by default, so I manually set it to the eos_token.

**Line-by-Line Breakdown:**
**model_name = "meta-llama/Llama-3.2-1B"**
I define the model name. This points to the official Hugging Face ID of Meta’s LLaMA 3.2 1B model (the smallest LLaMA 3.2 variant, good for fast experiments).

**AutoTokenizer.from_pretrained(model_name, use_fast=False)**
Loads the tokenizer for LLaMA 3.

I set use_fast=False as a compatibility precaution. LLaMA 3 checkpoints ship a fast (tokenizers-based) tokenizer, so this flag may have no effect here.

**if tokenizer.pad_token is None:**
Checks whether a pad_token is defined (most LLaMA tokenizers don't have one by default).

**tokenizer.pad_token = tokenizer.eos_token**
If it’s missing, I simply set the pad token to the same as the end-of-sequence token (eos_token).
This is common practice for causal LMs and makes sure padding works during training.

Without a pad_token, training with batches of unequal lengths would fail or throw an error. We need to pad all sequences in a batch to the same length — hence this fix is essential for stable training.

I’ve loaded the tokenizer for LLaMA 3.2-1B and fixed a critical issue by assigning a padding token, ensuring compatibility with batching and training.


In [31]:
model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

# Fix if tokenizer has no pad token:
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


## Preprocessing Function for Supervised Fine-Tuning (SFT)

In this step, I’m defining the preprocessing logic that will convert raw question–CoT–answer examples into tokenized inputs and labels that the model can learn from.

**Step-by-Step Explanation:**
**Input Formatting:**
Combines each question ("Question") with its corresponding chain-of-thought ("Complex_CoT") into a structured input string.

The format helps the model understand the reasoning path leading to the answer.

**Tokenize the Input:**
Tokenizes the formatted prompt.

Ensures uniform length by padding/truncating to 512 tokens — essential for batching.

**Tokenize the Output Labels (Target Text):**
Tokenizes the "Response" field — this is what the model is supposed to generate.

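Wraps the call in as_target_tokenizer(), a helper from encoder-decoder (Seq2Seq) setups. LLaMA is a decoder-only model with a single tokenizer, so the wrapper changes nothing here, and recent Transformers releases deprecate it.

**Apply Padding Mask to Labels:**
Sets all padding tokens in the labels to -100, which PyTorch's cross-entropy loss ignores.

This avoids penalizing the model at padding positions. Because the pad token is the EOS token here, any genuine EOS token in the labels is masked as well.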

**Attach Labels to Input Dict:**
Adds the masked label tokens to the model_inputs dictionary.

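Now, each data point includes both input_ids and labels, ready for training. One caveat: a causal LM computes its loss position by position against input_ids, so labels built from a separately tokenized response do not line up with the prompt tokens they are paired with.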

Training a model with raw text isn't enough — we need to structure and tokenize it in a way that matches how the model processes input and evaluates performance. This function ensures:

The model understands the format of reasoning (CoT),

The loss is calculated only on relevant parts of the sequence,

Every sequence has the same length (via fixed-length padding), so examples can be stacked into batches.

I’ve created a preprocessing function that prepares structured examples for training by tokenizing prompts and responses, applying padding, and masking loss on padded tokens.

In [38]:
def preprocess_function(examples):
    inputs = [
        f"Instruction: {q}\nCoT: {cot}\nAnswer:"
        for q, cot in zip(examples["Question"], examples["Complex_CoT"])
    ]
    model_inputs = tokenizer(
        inputs,
        max_length=512,
        padding="max_length",
        truncation=True,
    )

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["Response"],
            max_length=512,           # match input max_length here
            padding="max_length",
            truncation=True,
        )

    # Mask padding tokens with -100 for loss ignoring
    labels_with_mask = []
    for label_seq in labels["input_ids"]:
        labels_with_mask.append([token if token != tokenizer.pad_token_id else -100 for token in label_seq])

    model_inputs["labels"] = labels_with_mask
    return model_inputs
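
For a decoder-only model, a common alternative is to build one sequence (prompt followed by response) and copy it into the labels, masking the prompt and padding positions with -100. The sketch below shows that layout on toy token ids. It is an assumption about how the labels could be aligned, not the function used in this notebook, and the helper name `build_causal_lm_example` is my own.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_causal_lm_example(prompt_ids, response_ids, eos_id, pad_id, max_length):
    """Concatenate prompt and response, and supervise only the response tokens."""
    input_ids = (list(prompt_ids) + list(response_ids) + [eos_id])[:max_length]
    # Prompt positions are masked; response and EOS positions keep their token ids.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id])[:max_length]
    attention_mask = [1] * len(input_ids)

    pad_len = max_length - len(input_ids)
    input_ids += [pad_id] * pad_len
    attention_mask += [0] * pad_len
    labels += [IGNORE_INDEX] * pad_len  # mask by position, so a real EOS is never hidden
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Toy token ids standing in for tokenizer output (pad_id equals eos_id, as in this notebook).
example = build_causal_lm_example(
    prompt_ids=[11, 12, 13], response_ids=[21, 22], eos_id=2, pad_id=2, max_length=8
)
print(example)
```

Which tokens count as the supervised part is a design choice. With the chain-of-thought placed in the prompt, only the final answer is supervised; to train the model to produce the reasoning itself, the CoT tokens would move into the response part.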


## Tokenizing the Dataset Using the Preprocessing Function

In this step, I’m applying the preprocess_function (from the previous cell) to every example in the dataset to produce tokenized and model-ready data.

**Step-by-Step Explanation:**
**dataset.map(...)**
This is a Hugging Face datasets function that applies a transformation (in our case, the preprocess_function) to every item in the dataset.

It’s efficient and optimized for large datasets.

**batched=True**
This tells the map function to send a batch of examples at a time into the function, instead of one-by-one.

This is faster and compatible with our preprocess_function, which is designed to process lists of examples.

**remove_columns=dataset.column_names**
After tokenization, the original columns like "Question", "Complex_CoT", and "Response" are no longer needed.

This cleans up the dataset so that only tokenized inputs and labels remain — exactly what the model expects.

At this point, we’re turning our raw medical CoT dataset into something the LLaMA model can train on. Without this step:

The model wouldn't understand the structure or tokenized format.

We wouldn’t be able to pass this data into Trainer.

I’m now transforming my dataset into tokenized format by mapping the preprocess_function over it. This gives me inputs (input_ids, attention_mask) and corresponding target labels, all ready for training.



In [39]:
# Assuming your original dataset is loaded as `dataset`
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=dataset.column_names)
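
To illustrate what batched=True means for the function being mapped, here is a toy stand-in written in plain Python. It is not the datasets implementation; `map_batched` and `format_prompts` are hypothetical helpers that only mimic the contract: the function receives a dict of lists (one list per column) and returns a dict of lists.

```python
def map_batched(columns, fn, batch_size=2):
    """Toy stand-in for a batched map: call fn on column slices and merge the outputs."""
    n_rows = len(next(iter(columns.values())))
    merged = {}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            merged.setdefault(name, []).extend(values)
    return merged  # only the columns fn returned, similar in effect to remove_columns


def format_prompts(examples):
    # `examples` is a dict of lists, one list per column, with one entry per row.
    return {
        "prompt": [
            f"Instruction: {q}\nCoT: {cot}\nAnswer:"
            for q, cot in zip(examples["Question"], examples["Complex_CoT"])
        ]
    }


toy = {"Question": ["q1", "q2", "q3"], "Complex_CoT": ["c1", "c2", "c3"]}
result = map_batched(toy, format_prompts)
print(result["prompt"][2])
```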


Map:   0%|          | 0/19704 [00:00<?, ? examples/s]

## Splitting the Tokenized Dataset into Train and Validation Sets

I'm dividing the tokenized dataset into two parts:

Training set (train_data): used by the model to learn.

Validation set (val_data): used to evaluate how well the model is performing during training (without touching test data).

**Explanation of Each Line:**
**tokenized_dataset.train_test_split(test_size=0.05, seed=42)**
This splits the dataset into:

95% for training

5% for validation

The seed=42 ensures reproducibility — the same split every time you run the code.

**train_data and val_data**
I’m assigning the resulting splits to two separate variables.

These will be passed later to the Trainer for training and evaluation.

We need separate training and validation sets to detect overfitting and monitor generalization.

If we train and evaluate on the same data, we won’t know how well the model performs on unseen examples.

I’m now splitting the tokenized data into training and validation sets so the model can learn and be evaluated properly during training.

In [40]:
split_dataset = tokenized_dataset.train_test_split(test_size=0.05, seed=42)
train_data = split_dataset["train"]
val_data = split_dataset["test"]
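
As a quick sanity check on the split, the sizes can be worked out by hand. The sketch below assumes the test set size is rounded up and the training set gets the remainder; the total of 19,704 examples comes from the Map progress output above.

```python
import math


def split_sizes(n_samples: int, test_size: float):
    """Return (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test


n_train, n_test = split_sizes(19704, 0.05)
print(n_train, n_test)
```

Comparing these numbers with len(train_data) and len(val_data) is a cheap way to confirm the split behaved as expected.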


## Fixing the Tokenizer Padding Token

This cell repeats the check from the tokenizer-loading step: if the tokenizer has no padding token defined, I assign its end-of-sequence (eos) token as the padding token.

Many transformer models require a padding token to pad sequences to the same length during batching.

Some tokenizers (especially from newer or less common models) do not have a pad token by default.

Without a padding token:

You might get errors during training or tokenization.

Padding behavior will be undefined.


**Why choose eos_token as padding?**
The eos token typically signals the end of a sequence.

It's a safe fallback since it already exists in the tokenizer’s vocabulary.

This avoids adding a new token, which would require resizing the model's embedding matrix.

In [41]:
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


## Creating the Data Collator for Seq2Seq Training

I’m creating a data collator that dynamically batches and pads inputs during training and evaluation.

During training, input sequences usually have different lengths.

To feed them in batches, sequences need to be padded to the same length.

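This data collator:

Handles dynamic padding (pads only as much as needed per batch). Because the preprocessing step already pads every example to 512 tokens, in this notebook it mostly just stacks examples into tensors.

Prepares batches with correct padding tokens and labels.

Pads the labels alongside the inputs. The class is named for seq2seq models, but it also works with a causal LM whose examples already carry a labels field.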



**Why use DataCollatorForSeq2Seq instead of a generic collator?**
This collator was designed for sequence-to-sequence models (e.g., translation), where each example has a separate target sequence. I use it here because the examples come with precomputed labels.

It makes sure that:

Any padding it adds to the labels uses -100, so those positions are ignored during loss calculation.

Padding is handled consistently for both inputs and labels.

In [42]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
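
The idea of dynamic padding with label masking can be sketched in a few lines of plain Python. This is a simplified illustration of the behavior described above, not the library's implementation; `pad_batch` is a hypothetical helper and it assumes each example's labels have the same length as its input_ids.

```python
def pad_batch(features, pad_token_id, label_pad_id=-100):
    """Right-pad every feature to the longest sequence in this batch."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        gap = longest - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * gap)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * gap)
        batch["labels"].append(f["labels"] + [label_pad_id] * gap)  # ignored by the loss
    return batch


features = [
    {"input_ids": [5, 6, 7], "labels": [-100, 6, 7]},
    {"input_ids": [8, 9], "labels": [-100, 9]},
]
batch = pad_batch(features, pad_token_id=0)
print(batch)
```

Padding to the longest sequence in each batch, rather than to a fixed 512 tokens, only saves compute if the preprocessing step leaves sequences unpadded.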


## Setting Up Training Arguments

I’m configuring the training setup for the model by specifying all the important hyperparameters and behavior for training.

**Why are these parameters important?**
**output_dir:** Directory to save model checkpoints and logs.

**per_device_train_batch_size & per_device_eval_batch_size:** Batch sizes per GPU/CPU for train and eval.

**eval_strategy="steps":** Run evaluation every eval_steps during training.

**eval_steps=200:** Frequency of evaluation steps.

**logging_steps=100:** Log training stats every 100 steps.

**save_steps=400:** Save checkpoint every 400 steps (must be multiple of eval_steps when load_best_model_at_end=True).

**save_total_limit=2:** Keep only the last 2 checkpoints to save disk space.

**num_train_epochs=3:** Number of times to go through the entire training dataset.

**learning_rate=3e-4:** Controls step size in gradient descent; key for training speed and quality.

**warmup_steps=100:** Gradually increase learning rate over these initial steps for stability.

**logging_dir:** Directory to save tensorboard logs.

**report_to="none":** Disable reporting to any external tool like WandB.

**load_best_model_at_end=True:** Automatically load the checkpoint with the best evaluation metric at the end of training.

**bf16=False:** Disabled because bf16 requires specific hardware (Ampere+ GPUs).

**fp16=True:** Enables mixed precision training (faster and less memory usage if hardware supports it).

**Why set save_steps as multiple of eval_steps?**
Because load_best_model_at_end=True requires checkpoints to be saved at evaluation points so it can correctly pick the best checkpoint.
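
The constraint can be checked before building the arguments. The helper below is a hypothetical pre-flight check of my own, written to mirror the rule described above rather than to reproduce the library's validation code.

```python
def check_checkpoint_schedule(eval_steps: int, save_steps: int, load_best_model_at_end: bool):
    """Raise if checkpoints would not line up with evaluations."""
    if load_best_model_at_end and save_steps % eval_steps != 0:
        raise ValueError(
            f"save_steps={save_steps} is not a multiple of eval_steps={eval_steps}"
        )
    return True


# The values used in this notebook pass the check.
check_checkpoint_schedule(eval_steps=200, save_steps=400, load_best_model_at_end=True)
```

With eval_steps=200 and save_steps=400, every checkpoint coincides with an evaluation, but only every second evaluation has a checkpoint that could be restored as the best model.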


In [43]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama3_medical_cot_lora",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    eval_strategy="steps",
    eval_steps=200,
    logging_steps=100,
    save_steps=400,               # Make this multiple of eval_steps (200)
    save_total_limit=2,
    num_train_epochs=3,
    learning_rate=3e-4,
    warmup_steps=100,
    logging_dir="./logs",
    report_to="none",
    load_best_model_at_end=True,
    bf16=False,                  # Set to False unless you have Ampere+ GPU and CUDA 11+
    fp16=True,                   # Use fp16 if supported
)
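
To get a feel for what these settings imply, here is a rough sketch of the step count and the learning-rate curve. It assumes a single device, no gradient accumulation, 18,718 training rows (a 95% split of 19,704 examples), and a linear schedule with warmup, which I believe is the Trainer default when no scheduler is specified.

```python
import math


def steps_per_epoch(n_train: int, per_device_batch: int, n_devices: int = 1, grad_accum: int = 1):
    """Optimizer steps in one pass over the training set."""
    return math.ceil(n_train / (per_device_batch * n_devices * grad_accum))


def linear_schedule_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Assumed training-set size and hardware; adjust to the actual run.
per_epoch = steps_per_epoch(n_train=18718, per_device_batch=4)
total = per_epoch * 3  # num_train_epochs=3
print(per_epoch, total)
for step in (0, 50, 100, total):
    print(step, linear_schedule_lr(step, 3e-4, 100, total))
```

Under these assumptions a full run is on the order of fourteen thousand optimizer steps, so the 100 warmup steps cover well under 1% of training.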


## Initializing the Trainer

I’m creating a Trainer object, which is the core component from Hugging Face Transformers that handles the training loop, evaluation, checkpoint saving, and more — all configured with the parameters and datasets I prepared earlier.

**Why is this important?**
**model=model:** The model to train (in our case, LLaMA 3 with LoRA).

**args=training_args:** The training settings we just defined.

**train_dataset=train_data:** The dataset for training.

**eval_dataset=val_data:** The dataset for validation during training.

**tokenizer=tokenizer:** Tokenizer for converting text to tokens; the Trainer also saves it alongside checkpoints. Recent Transformers releases deprecate this argument in favor of processing_class.

**data_collator=data_collator:** Handles batching and padding of samples dynamically during training (important since sequences may vary in length).

**Why use Trainer?**
Trainer abstracts away all the complex steps like:

Batching & padding sequences,

Computing loss,

Running backpropagation,

Saving checkpoints,

Running periodic evaluations, and

Logging progress.

It lets me focus on what I want to train rather than how to train it.


In [44]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    tokenizer=tokenizer,
    data_collator=data_collator,
)




## Start Training

This command kicks off the entire training process with all the configurations, data, and model setup we did earlier.

**Why is this important?**
It runs the forward and backward passes on batches of data.

Optimizes the model weights according to the loss function.

Performs evaluation at intervals to monitor progress.

Saves checkpoints as configured.

Applies all training settings like learning rate, batch size, mixed precision (fp16), etc.

**What happens during training?**
The model processes input sequences, predicts outputs.

The loss between predictions and true labels is computed.

Gradients are backpropagated to update model weights.

This repeats for the specified number of epochs or steps.

Validation runs every few steps to check how well the model generalizes.



In [None]:
trainer.train()


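| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 200 | 6.7959 | 6.827287 |
| 400 | 6.7302 | 6.734843 |
| 600 | 6.6476 | 6.698766 |
| 800 | 6.6108 | 6.672456 |
| 1000 | 6.5483 | 6.663226 |
| 1200 | 6.5822 | 6.640367 |
| 1400 | 6.5762 | 6.624905 |
| 1600 | 6.5773 | 6.61207 |
| 1800 | 6.579 | 6.601445 |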
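
One way to read these numbers is to convert the cross-entropy loss into perplexity, which is simply exp(loss). The sketch below does this for a few validation losses copied from the training log above.

```python
import math

# (step, validation loss) pairs copied from the training log above.
val_log = [(200, 6.827287), (1000, 6.663226), (1800, 6.601445)]

for step, loss in val_log:
    print(f"step {step:>4}: val loss {loss:.3f} -> perplexity {math.exp(loss):.0f}")
```

The loss is falling, but a perplexity in the hundreds is high for a pretrained language model. That may point to an issue in how the labels are constructed during preprocessing rather than to a lack of training time, so it is worth checking before running more epochs.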


##  Save Fine-Tuned Model and Tokenizer

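This step saves the trained model weights and tokenizer configuration to a local folder after training finishes. It must run in the same session as training: model and tokenizer have to be in memory, otherwise the cell fails with a NameError like the one shown in the output.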

**Why is this important?**
Preserves the fine-tuned model so you can reuse it later without retraining.

Saves tokenizer vocab and special tokens to ensure consistency during inference.

Makes it easy to share or deploy your model elsewhere.

The saved folder contains all necessary files to reload the model and tokenizer with from_pretrained().

**What happens when I reload?**
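The model weights and architecture load exactly as saved. If the model is wrapped with PEFT/LoRA, save_pretrained() writes only the adapter weights and config, so reloading also needs the base model.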

The tokenizer loads its vocabulary and special tokens.

You can immediately run inference or continue training.

In [2]:
model.save_pretrained("./llama3_medical_cot_lora")
tokenizer.save_pretrained("./llama3_medical_cot_lora")


NameError: name 'model' is not defined