# 🎓 Tutorial: Fine-Tuning Llama 3.2 with QLoRA

Welcome to this hands-on tutorial on fine-tuning Large Language Models (LLMs)! 

In this notebook, we will fine-tune the **Llama 3.2** model on the **SAMSum dataset** (dialogue summarization) using a technique called **QLoRA** (Quantized Low-Rank Adaptation).

### 🚀 What you will learn:
1.  **QLoRA:** How to fit a large model into memory using 4-bit quantization (NF4) and Double Quantization.
2.  **LoRA Adapters:** How to train only a tiny fraction (<1%) of the parameters to save time and compute.
3.  **Assistant-Only Masking:** A critical data preprocessing technique to ensure the model only learns to generate *responses*, not *prompts*.
4.  **Hugging Face Transformers/PEFT:** How to use the `Trainer` API together with PEFT for efficient training.

---


## 1. Setup and Installation

First, we need to install the necessary libraries. 
* `peft`: For Parameter-Efficient Fine-Tuning (LoRA).
* `bitsandbytes`: For 4-bit quantization.
* `transformers`: The core library for loading models.
* `evaluate` & `rouge_score`: For measuring the quality of our summaries.

In [None]:
# Install dependencies
! pip install -q evaluate torch tqdm datasets peft transformers rouge_score pyyaml
! pip install -q -U bitsandbytes

## 2. Imports and Configuration

We import the standard data science stack alongside the specific libraries for QLoRA (`BitsAndBytesConfig`, `LoraConfig`).

In [None]:
import os
import yaml
import torch
from transformers import (
    TrainingArguments,
    Trainer,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
from datasets import load_dataset, load_from_disk
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model, PeftModel

# Setup directories to keep our workspace clean
DATASETS_DIR = "./datasets"
OUTPUTS_DIR = "./outputs"
CONFIG_FILE_PATH = "./config.yaml"

os.makedirs(DATASETS_DIR, exist_ok=True)
os.makedirs(OUTPUTS_DIR, exist_ok=True)

## 3. The QLoRA Setup Function

This is the heart of the optimization. Instead of loading the full model in 16-bit precision (roughly 14 GB of weights alone for a 7B model), we load it in **4-bit precision**, which cuts the weight memory to roughly a quarter.

### Key Concepts:
1.  **`load_in_4bit=True`**: Compresses weights to 4-bit.
2.  **`bnb_4bit_quant_type="nf4"`**: Uses "NormalFloat4", a data type optimized for the bell-curve distribution of neural network weights.
3.  **`bnb_4bit_use_double_quant=True`**: Quantizes the quantization constants themselves to save even more memory.
4.  **`bnb_4bit_compute_dtype`**: We store weights in 4-bit, but perform calculations in `bfloat16` for stability.
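
To get a feel for what these settings buy, here is a back-of-the-envelope estimate of weight memory. It is a rough sketch, not a measurement: the block sizes (64 weights per quantization constant, 256 constants per second-level constant) are assumptions taken from the QLoRA paper, and the helper `weight_memory_gb` is a hypothetical name introduced only for this illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Assumed block sizes (from the QLoRA paper, not from this notebook)
BLOCK_SIZE = 64            # weights that share one quantization constant
CONSTANT_BLOCK_SIZE = 256  # constants that share one second-level constant

# Without double quantization: one fp32 constant per block of weights
overhead_single = 32 / BLOCK_SIZE
# With double quantization: 8-bit constants plus one fp32 constant per block of constants
overhead_double = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * CONSTANT_BLOCK_SIZE)

N_PARAMS = 7e9  # the 7B example mentioned above
for label, bits in [
    ("16-bit", 16),
    ("NF4", 4 + overhead_single),
    ("NF4 + double quant", 4 + overhead_double),
]:
    gb = weight_memory_gb(N_PARAMS, bits)
    print(f"{label:<20} {bits:6.3f} bits/param  ~{gb:5.2f} GB")
```

Real usage will be somewhat higher, because some layers (for example embeddings and layer norms) are typically kept in higher precision, and activations, gradients, and optimizer state need memory too.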

In [None]:
def load_config(config_path: str = CONFIG_FILE_PATH):
    """Helper to load the yaml config file."""
    if not os.path.exists(config_path):
        # Fallback config if file doesn't exist (for tutorial purposes)
        return {
            "base_model": "meta-llama/Llama-3.2-1B-Instruct",
            "dataset": {"name": "knkarthick/samsum", "splits": {"train": 1000, "validation": 100, "test": 100}},
            "task_instruction": "Summarize the following dialogue.",
            "sequence_len": 512,
            "lora_r": 16,
            "lora_alpha": 32,
            "target_modules": ["q_proj", "v_proj"],
            "num_epochs": 1,
            "batch_size": 4,
            "gradient_accumulation_steps": 4,
            "learning_rate": 2e-4,
            "load_in_4bit": True
        }
    with open(config_path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

def setup_model_and_tokenizer(cfg, use_4bit: bool = None, use_lora: bool = None):
    """
    Sets up the model with Quantization (BitsAndBytes) and LoRA adapters (PEFT).
    """
    model_name = cfg["base_model"]
    print(f"\nLoading model: {model_name}")

    # 1. Setup Tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Llama models often lack a pad token, so we use the EOS token
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right" # Right padding is standard for training

    # Check configuration overrides
    load_in_4bit = use_4bit if use_4bit is not None else cfg.get("load_in_4bit", False)
    apply_lora = use_lora if use_lora is not None else ("lora_r" in cfg)

    # 2. Configure Quantization (BitsAndBytes)
    quant_cfg = None
    if load_in_4bit:
        print("⚙️  Enabling 4-bit quantization (NF4 + Double Quantization)...")
        quant_cfg = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type=cfg.get("bnb_4bit_quant_type", "nf4"), # NormalFloat4
            bnb_4bit_use_double_quant=cfg.get("bnb_4bit_use_double_quant", True),
            bnb_4bit_compute_dtype=torch.bfloat16 # Compute in bf16 for speed/stability
        )

    # 3. Load Base Model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_cfg,
        device_map="auto",
        # If not using quantization, load in bf16 directly
        dtype=(torch.bfloat16 if cfg.get("bf16", True) and torch.cuda.is_available() else torch.float32),
    )

    # 4. Apply LoRA Adapters
    if apply_lora:
        print("🔧 Applying LoRA configuration...")
        # k-bit preparation (freezing weights, casting layer norms) is only needed for a quantized base model
        if load_in_4bit:
            model = prepare_model_for_kbit_training(model)
        
        lora_cfg = LoraConfig(
            r=cfg.get("lora_r", 8),               # Rank: The "capacity" of the adapter
            lora_alpha=cfg.get("lora_alpha", 16), # Alpha: Scaling factor (usually 2x rank)
            target_modules=cfg.get("target_modules", ["q_proj", "v_proj"]), # Target attention layers
            lora_dropout=cfg.get("lora_dropout", 0.05),
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, lora_cfg)
        
        # Print how many parameters we are actually training (usually < 1%)
        model.print_trainable_parameters()
    
    return model, tokenizer

## 4. Dataset Helper Functions

These helper functions allow us to easily load the SAMSum dataset and select specific subsets (e.g., just 1000 examples) to keep our training fast for this tutorial.

In [None]:
def select_subset(dataset, n_samples, seed=42):
    """Helper to grab a random subset of data for quick iteration."""
    if n_samples == "all" or n_samples is None:
        return dataset
    if n_samples > len(dataset):
        return dataset
    return dataset.shuffle(seed=seed).select(range(n_samples))

def load_and_prepare_dataset(cfg):
    """Loads the dataset from Hugging Face or local cache."""
    # 1. Parse config
    if "dataset" in cfg:
        cfg_dataset = cfg["dataset"]
        dataset_name = cfg_dataset["name"]
        n_train = cfg_dataset.get("splits", {}).get("train", "all")
        n_val = cfg_dataset.get("splits", {}).get("validation", "all")
    else:
        # Fallback
        dataset_name = "knkarthick/samsum"
        n_train = 1000
        n_val = 100

    # 2. Download/Load
    print(f"⬇️  Loading dataset: {dataset_name}")
    dataset = load_dataset(dataset_name)

    # 3. Select Subsets
    val_key = "validation" if "validation" in dataset else "val"
    train = select_subset(dataset["train"], n_train)
    val = select_subset(dataset[val_key], n_val)
    
    print(f"📊 Ready for training with {len(train)} train and {len(val)} validation samples.")
    return train, val, None

## 5. Data Preprocessing: Assistant-Only Masking

This is **the most critical part** of instruction fine-tuning.

When training on a conversation like:
> **User:** Summarize this.
>
> **Assistant:** Here is the summary.

We do **NOT** want the model to learn how to write the user's prompt. We only want it to learn the assistant's response.

### How we do it:
1.  **Tokenize** the full conversation.
2.  **Create Labels**: A copy of the input IDs.
3.  **Masking**: We set the label ID to `-100` for all tokens belonging to the **User Prompt**.
4.  **PyTorch behavior**: `torch.nn.CrossEntropyLoss` uses `ignore_index=-100` by default, so every position labeled `-100` is skipped. The loss is therefore calculated only on the Assistant's response.
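
The steps above can be tried on a toy example before touching a real tokenizer. This is only a sketch: `toy_tokenize` is a hypothetical whitespace tokenizer with made-up token IDs, standing in for the character offsets a real fast tokenizer returns, and `mask_prompt` is an illustrative helper name.

```python
import re

IGNORE_INDEX = -100  # label value that the loss skips

def toy_tokenize(text):
    """Hypothetical whitespace tokenizer: returns made-up ids and character offsets."""
    ids, offsets = [], []
    for match in re.finditer(r"\S+", text):
        ids.append(sum(map(ord, match.group())) % 1000)  # made-up token id
        offsets.append((match.start(), match.end()))
    return ids, offsets

def mask_prompt(input_ids, offsets, prompt_char_len):
    """Copy input_ids into labels, masking every token that starts inside the prompt."""
    start_idx = len(input_ids)  # if no response token is found, mask everything
    for i, (start, _end) in enumerate(offsets):
        if start >= prompt_char_len:
            start_idx = i
            break
    return [IGNORE_INDEX] * start_idx + input_ids[start_idx:]

prompt = "User: Summarize this. Assistant:"
full = prompt + " Here is the summary."

ids, offsets = toy_tokenize(full)
labels = mask_prompt(ids, offsets, len(prompt))

print("input_ids:", ids)
print("labels:   ", labels)
```

The first four labels (the prompt tokens) come out as `-100`, and the remaining four are identical to the input IDs, so only the response contributes to the loss.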

In [None]:
def build_user_prompt(dialogue: str, task_instruction: str) -> str:
    """Formats the input into a standard prompt."""
    return f"{task_instruction}\n\n## Dialogue:\n{dialogue}\n## Summary:"

def build_messages_for_sample(sample, task_instruction, include_assistant=False):
    """Creates the list of messages dictionary required by chat templates."""
    messages = [
        {
            "role": "user",
            "content": build_user_prompt(sample["dialogue"], task_instruction),
        }
    ]
    if include_assistant:
        messages.append({"role": "assistant", "content": sample["summary"]})
    return messages

def preprocess_samples(examples, tokenizer, task_instruction, max_length):
    """
    Tokenizes data and applies Assistant-Only Masking.
    """
    input_ids_list, labels_list, attn_masks = [], [], []

    for d, s in zip(examples["dialogue"], examples["summary"]):
        sample = {"dialogue": d, "summary": s}

        # 1. Create the full conversation (User + Assistant)
        msgs_full = build_messages_for_sample(sample, task_instruction, include_assistant=True)
        
        # 2. Create just the prompt (User only) to measure its length
        msgs_prompt = build_messages_for_sample(sample, task_instruction, include_assistant=False)

        # 3. Apply Chat Template (converts list of dicts to string)
        text_full = tokenizer.apply_chat_template(msgs_full, tokenize=False)
        text_prompt = tokenizer.apply_chat_template(msgs_prompt, tokenize=False, add_generation_prompt=True)
        
        prompt_len = len(text_prompt)

        # 4. Tokenize the full text
        tokens = tokenizer(
            text_full,
            max_length=max_length,
            truncation=True,
            padding=False,
            add_special_tokens=False,
            return_offsets_mapping=True, # We need offsets to find where the prompt ends
        )

        # 5. Create Masking (The Magic Step)
        # Find the token index where the prompt ends
        start_idx = len(tokens["input_ids"])
        for i, (start, _) in enumerate(tokens["offset_mapping"]):
            if start >= prompt_len:
                start_idx = i
                break
        
        # Create labels: Mask the prompt part with -100
        labels = [-100] * start_idx + tokens["input_ids"][start_idx:]
        
        input_ids_list.append(tokens["input_ids"])
        labels_list.append(labels)
        attn_masks.append(tokens["attention_mask"])

    return {
        "input_ids": input_ids_list,
        "labels": labels_list,
        "attention_mask": attn_masks,
    }

def tokenize_dataset(cfg, tokenizer, train_data, val_data):
    """Applies the preprocessing to the whole dataset."""
    print("\nTokenizing datasets...")
    fn = lambda e: preprocess_samples(e, tokenizer, cfg["task_instruction"], cfg["sequence_len"])
    
    tokenized_train = train_data.map(fn, batched=True, remove_columns=train_data.column_names)
    tokenized_val = val_data.map(fn, batched=True, remove_columns=val_data.column_names)

    return tokenized_train, tokenized_val

## 6. Data Collator

Since our sequences are of different lengths, we need a custom collator to pad them dynamically per batch. 
Crucially, we must pad the `labels` with `-100` so that the padding tokens are ignored during loss calculation.
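
As a plain-Python illustration of what such a collator has to produce, the sketch below pads a toy batch by hand. The function name `pad_batch`, the token IDs, and the pad ID `0` are all made up for this example and are not part of the tutorial's code.

```python
def pad_batch(batch, pad_token_id, label_pad_id=-100):
    """Right-pad every example to the longest sequence in the batch."""
    max_len = max(len(ex["input_ids"]) for ex in batch)
    out = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in batch:
        n_real = len(ex["input_ids"])
        n_pad = max_len - n_real
        out["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        out["attention_mask"].append([1] * n_real + [0] * n_pad)
        # Labels are padded with -100, not with the pad token id
        out["labels"].append(ex["labels"] + [label_pad_id] * n_pad)
    return out

batch = [
    {"input_ids": [5, 6, 7, 8], "labels": [-100, -100, 7, 8]},
    {"input_ids": [5, 9], "labels": [-100, 9]},
]
padded = pad_batch(batch, pad_token_id=0)
for key, rows in padded.items():
    print(key, rows)
```

Padding the labels with the pad token ID instead would train the model to predict padding, which matters here because the pad token is set to the EOS token.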

In [None]:
class PaddingCollator:
    def __init__(self, tokenizer, label_pad_token_id=-100):
        self.tokenizer = tokenizer
        self.label_pad_token_id = label_pad_token_id

    def __call__(self, batch):
        # Convert lists to tensors
        input_ids = [torch.tensor(f["input_ids"], dtype=torch.long) for f in batch]
        attn_masks = [torch.tensor(f["attention_mask"], dtype=torch.long) for f in batch]
        labels = [torch.tensor(f["labels"], dtype=torch.long) for f in batch]

        # Pad to the max length in this batch
        input_ids = pad_sequence(input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id)
        attn_masks = pad_sequence(attn_masks, batch_first=True, padding_value=0)
        
        # Important: Pad labels with -100 so loss ignores padding
        labels = pad_sequence(labels, batch_first=True, padding_value=self.label_pad_token_id)

        return {
            "input_ids": input_ids,
            "attention_mask": attn_masks,
            "labels": labels,
        }

## 7. Initialization & Preprocessing Execution

Let's put it all together: Load config, initialize the model/tokenizer, and process the data.

In [None]:
# 1. Load Config
cfg = load_config()

# 2. Setup Model (QLoRA)
model, tokenizer = setup_model_and_tokenizer(cfg, use_4bit=True, use_lora=True)

# 3. Load Data
train_data, val_data, _ = load_and_prepare_dataset(cfg)

# 4. Tokenize & Mask Data
tokenized_train, tokenized_val = tokenize_dataset(cfg, tokenizer, train_data, val_data)

### 🔍 Inspecting the Data

It's good practice to check if our masking worked. In the labels below, you should see a long sequence of `-100` at the start (masking the prompt), followed by actual token IDs (the response).

In [None]:
sample_idx = 0
input_ids = tokenized_train[sample_idx]['input_ids']
labels = tokenized_train[sample_idx]['labels']

print(f"Original Length: {len(input_ids)}")
print(f"Label Length: {len(labels)}")
print("\nFirst 20 labels (Should be mostly -100):", labels[:20])
print("Last 20 labels (Should be real IDs):", labels[-20:])

## 8. Training Loop

We use the Hugging Face `Trainer` class. Note the QLoRA specific optimizations:
* `paged_adamw_8bit`: An optimizer that saves memory by using 8-bit statistics and paging to CPU RAM if GPU VRAM gets full.
* `gradient_accumulation_steps`: Simulates a larger batch size without using more memory.
* `bf16`: Mixed-precision training in bfloat16 (GPUs without bfloat16 support, such as the T4, need `fp16` instead).
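
To make the effect of gradient accumulation concrete, the following sketch works through the arithmetic for the fallback config in this notebook (1000 training samples, batch size 4, 4 accumulation steps, 1 epoch). The helper names are hypothetical, and the step count is an approximation: the exact number the `Trainer` reports can differ slightly depending on how it handles the last partial accumulation window.

```python
import math

def effective_batch_size(per_device_batch, grad_accum_steps, n_devices=1):
    """Number of samples that contribute to each optimizer update."""
    return per_device_batch * grad_accum_steps * n_devices

def approx_optimizer_steps(n_samples, per_device_batch, grad_accum_steps,
                           epochs=1, n_devices=1):
    """Rough count of optimizer updates over the whole run."""
    micro_batches = math.ceil(n_samples / (per_device_batch * n_devices))
    return math.ceil(micro_batches / grad_accum_steps) * epochs

eff = effective_batch_size(per_device_batch=4, grad_accum_steps=4)
steps = approx_optimizer_steps(n_samples=1000, per_device_batch=4, grad_accum_steps=4)

print(f"Effective batch size: {eff}")
print(f"Approximate optimizer steps: {steps}")
```

With these values each update sees 16 samples while only 4 are held in memory at a time, and the run makes roughly 63 updates, so a warmup of 10 steps covers a noticeable share of training.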

In [None]:
def train_model(cfg, model, tokenizer, tokenized_train, tokenized_val):
    collator = PaddingCollator(tokenizer=tokenizer)

    output_dir = os.path.join(OUTPUTS_DIR, "lora_samsum")

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=cfg["num_epochs"],
        per_device_train_batch_size=cfg["batch_size"],
        gradient_accumulation_steps=cfg["gradient_accumulation_steps"],
        learning_rate=float(cfg["learning_rate"]),
        lr_scheduler_type="cosine",
        warmup_steps=10,
        bf16=True, # Requires a GPU with bfloat16 support (e.g., A100); on a T4, set fp16=True instead
        optim="paged_adamw_8bit", # Saves memory!
        logging_steps=10,
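        save_strategy="epoch",
        eval_strategy="epoch", # Evaluate on the validation set each epoch (named `evaluation_strategy` in older transformers releases)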
        report_to="none",
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_val,
        data_collator=collator,
    )

    print("\nStarting LoRA fine-tuning...")
    trainer.train()
    print("\nTraining complete!")
    
    # Save the adapters
    model.save_pretrained(output_dir)
    return model

# Execute Training
model = train_model(cfg, model, tokenizer, tokenized_train, tokenized_val)

## 9. Conclusion & Next Steps

Congratulations! You have successfully fine-tuned a Large Language Model using QLoRA. 

### What happened?
1.  We loaded the base model in 4-bit precision and froze its parameters.
2.  We trained small adapter matrices (LoRA) on top of them.
3.  We masked the user prompts so the model only learned to generate summaries.
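
To see why the adapters are so small, the sketch below counts the parameters LoRA adds for the fallback settings used here (`r=16`, targeting `q_proj` and `v_proj`). The layer shapes and the total parameter count are assumptions about Llama-3.2-1B that this tutorial does not state; check `model.config` and the output of `model.print_trainable_parameters()` for the real values.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer.

    LoRA learns two small matrices, A (r x d_in) and B (d_out x r), while the
    original weight matrix stays frozen.
    """
    return r * d_in + d_out * r

# Assumed shapes for Llama-3.2-1B (not stated in this tutorial)
HIDDEN = 2048                # input size of q_proj and v_proj, output size of q_proj
KV_DIM = 512                 # output size of v_proj (grouped-query attention)
N_LAYERS = 16
BASE_PARAMS = 1_236_000_000  # approximate total parameter count
R = 16                       # lora_r from the fallback config

per_layer = lora_param_count(HIDDEN, HIDDEN, R) + lora_param_count(HIDDEN, KV_DIM, R)
total = per_layer * N_LAYERS

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"Share of base model:       {100 * total / BASE_PARAMS:.2f}%")
```

Under these assumptions the adapters hold about 1.7 million parameters, well below 1% of the base model, which is why the saved adapter files are small compared with the full checkpoint.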

### Optional: Push to Hub
You can now push your adapters to the Hugging Face Hub to share them or load them later for inference.

In [None]:
def push_to_hub(model, tokenizer, model_name, hf_username):
    model_id = f"{hf_username}/{model_name}"
    print(f"Pushing to {model_id}...")
    model.push_to_hub(model_id)
    tokenizer.push_to_hub(model_id)

# Uncomment to push
# from google.colab import userdata
# HF_USERNAME = userdata.get('HF_USERNAME')
# push_to_hub(model, tokenizer, "Llama-3.2-1B-QLoRA-Summarizer", HF_USERNAME)