# NB09: Fine-tuning Qwen3-4B with LoRA

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RJuro/unistra-nlp2026/blob/main/notebooks/NB09_finetuning_qwen3.ipynb)

**Duration:** 75 minutes

## Learning Goals

By the end of this notebook you will be able to:

1. **Understand LoRA** parameter-efficient fine-tuning and why it matters
2. **Prepare instruction-tuning data** in the conversational chat format
3. **Fine-tune a 4B parameter model** on a free Colab T4 GPU using Unsloth
4. **Evaluate fine-tuned vs zero-shot** performance on a classification task

---

> **Requires T4 GPU runtime** -- go to **Runtime -> Change runtime type -> T4 GPU** before running any cells.

In [None]:
%%capture
!pip install unsloth
# Also get latest transformers and trl
!pip install --upgrade transformers trl datasets pandas scikit-learn -q

## 1. Why Fine-tune?

Zero-shot LLMs are impressively versatile -- you can prompt them for almost any task and get reasonable results out of the box. But they are **generic**: they were not trained specifically for your task, domain, or label set.

**Fine-tuning** adapts a pre-trained model to **your** specific task by continuing training on task-specific examples. The benefits:

- **Higher accuracy** on your target task
- **More consistent** output format (no parsing headaches)
- **Faster inference** (the model has learned the task, so prompts can be shorter and need fewer tokens)
- **Smaller models can match larger ones** when fine-tuned

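The catch? Full fine-tuning of a 4B parameter model needs far more GPU memory than a free Colab GPU provides, because gradients and optimizer state must be kept for every weight. This is where **LoRA** comes in -- with LoRA, we train only **~1% of the parameters**, which, combined with 4-bit quantization of the frozen base model, fits in the free Colab T4 (16 GB VRAM).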
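To put a rough number on the memory problem, a back-of-the-envelope estimate helps. The sketch below assumes the commonly quoted mixed-precision AdamW accounting of about 16 bytes per parameter (2 for 16-bit weights, 2 for gradients, 4 for an fp32 master copy, 8 for the two Adam moments) and ignores activations entirely. The function names and the nominal 4e9 parameter count are illustrative assumptions, not values read from the model.

```python
def full_finetune_gb(n_params: float, bytes_per_param: int = 16) -> float:
    """Rough memory for weights + gradients + AdamW state (activations excluded)."""
    return n_params * bytes_per_param / 1e9


def frozen_weights_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed just to hold frozen weights at a given precision."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 4e9  # nominal "4B" model; the exact count differs slightly

full_gb = full_finetune_gb(N_PARAMS)
fp16_gb = frozen_weights_gb(N_PARAMS, 16)
int4_gb = frozen_weights_gb(N_PARAMS, 4)

print(f"Full fine-tuning (weights + grads + optimizer): ~{full_gb:.0f} GB")
print(f"Frozen weights in 16-bit:                       ~{fp16_gb:.0f} GB")
print(f"Frozen weights in 4-bit:                        ~{int4_gb:.0f} GB")
```

Under these assumptions, full fine-tuning lands far above a 16 GB card, while 4-bit frozen weights leave room for adapters, activations, and optimizer state for the small trainable part.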

### What is LoRA?

**LoRA** (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique introduced by Hu et al. (2021).

The key idea is simple:

- **Freeze** all the original model weights (billions of parameters)
- **Add** small trainable matrices (adapters) to specific layers
- These adapters use a **low-rank decomposition**: instead of a full weight update matrix of size `d x d`, we use two smaller matrices of size `d x r` and `r x d`, where `r << d` (the "rank")

**Why does this work?**

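- Hu et al. hypothesize, and support with experiments, that weight updates during fine-tuning have low intrinsic rank
- A small rank such as 16 (versus hidden dimensions in the thousands) is often enough to capture the task-specific adaptation
- Result: **~1-2% trainable parameters**, **far less memory** (~70% is a rough ballpark that depends on the setup), and **comparable quality** on many tasks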
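The low-rank idea can be sketched in a few lines of plain Python. This is a toy illustration, not how Unsloth or PEFT implement it: the helper names (`matmul`, `lora_forward`, `down`, `up`) and the sizes are made up for the example. It shows two things: the adapter contributes nothing at initialization when one factor starts at zero, and the parameter count of a `d x r` plus `r x d` pair is tiny compared with a full `d x d` update.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, W, down, up, alpha, r):
    """Frozen base projection plus a scaled low-rank update."""
    base = matmul(x, W)                   # (n, d) @ (d, d), frozen
    update = matmul(matmul(x, down), up)  # (n, d) @ (d, r) @ (r, d), trainable
    scale = alpha / r                     # same role as lora_alpha / r
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]


random.seed(0)
d, r, alpha = 8, 2, 2  # toy sizes; real layers have d in the thousands

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]        # frozen weights
down = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(d)]  # d x r, trainable
up = [[0.0] * d for _ in range(r)]                                    # r x d, starts at zero
x = [[random.gauss(0, 1) for _ in range(d)]]

# With `up` at zero, the adapter is a no-op, so training starts from the base model.
unchanged_at_init = lora_forward(x, W, down, up, alpha, r) == matmul(x, W)
print("Adapter is a no-op at initialization:", unchanged_at_init)


def full_update_params(d_model):
    return d_model * d_model


def lora_update_params(d_model, rank):
    return 2 * d_model * rank


d_model, rank = 4096, 16  # hypothetical layer width, not Qwen3-4B's actual shape
full = full_update_params(d_model)
lora = lora_update_params(d_model, rank)
print(f"Full update: {full:,} | LoRA update: {lora:,} ({lora / full:.2%})")
```

For a single square matrix of this hypothetical size, the adapter holds under 1% of the parameters a full update would need. The model-wide percentage depends on which modules receive adapters.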

Combined with **4-bit quantization** (QLoRA), we can fine-tune a 4B parameter model on a free T4 GPU!

In [None]:
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B",
    max_seq_length=2048,
    dtype=None,  # Auto-detect
    load_in_4bit=True,  # QLoRA — 4-bit quantization
)

print(f"Model loaded! Parameters: {model.num_parameters():,}")
print(f"GPU memory used: {torch.cuda.memory_allocated()/1e9:.1f} GB")

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                    # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Memory optimization
    random_state=42,
)

# Count trainable parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,} ({trainable/total:.2%})")

## 2. Preparing Training Data

For instruction-tuning (also called supervised fine-tuning / SFT), we need to format each training example as a **conversation**:

1. **System message** -- tells the model its role and the task
2. **User message** -- the input (the tweet to classify)
3. **Assistant message** -- the expected output (the emotion label)

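We use the **dair-ai/emotion** dataset: English tweets labeled with 6 emotions (sadness, joy, love, anger, fear, surprise). The default configuration loaded below contains about 20K examples (16K train / 2K validation / 2K test); a larger unsplit configuration with ~416K tweets also exists. This creates a natural NB08→NB09 pipeline: in NB08 we distilled emotion labels from an LLM, and now we fine-tune a small model on them.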

We use the tokenizer's built-in `apply_chat_template()` to format this correctly for Qwen3, including all special tokens the model expects.
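To get a feel for what the template produces, here is a hand-rolled approximation of a ChatML-style rendering, the `<|im_start|>role ... <|im_end|>` layout used by the Qwen family. It is for illustration only: `render_chatml` is a hypothetical helper, the example tweet is made up, and the real Qwen3 template adds details (such as the handling of reasoning blocks) that are omitted here. For actual training data, rely on `apply_chat_template()`.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Approximate ChatML rendering of a list of {'role', 'content'} dicts."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # At inference time the prompt ends with an open assistant turn.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {"role": "system", "content": "Respond with only the emotion label."},
    {"role": "user", "content": "Classify the emotion in this tweet:\n\ni feel so happy today"},
    {"role": "assistant", "content": "joy"},
]

training_text = render_chatml(example)
inference_prompt = render_chatml(example[:2], add_generation_prompt=True)

print(training_text)
print("---")
print(inference_prompt)
```

The training text contains the full conversation including the label, while the inference prompt stops at the open assistant turn so the model fills in the label itself.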

In [None]:
import os
import numpy as np
import pandas as pd
from datasets import load_dataset, Dataset

EMOTION_LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]
CATEGORIES = sorted(EMOTION_LABELS)

# ── Option A: Load distilled labels from NB08 (if available) ────────
NB08_CSV = "emotion_distilled_labels.csv"

if os.path.exists(NB08_CSV):
    print(f"Found NB08 distilled labels: {NB08_CSV}")
    distilled = pd.read_csv(NB08_CSV)
    distilled = distilled.rename(columns={"llm_label": "label_name"})
    
    # Use distilled data for training, fresh HF data for eval
    train_subset = distilled.sample(min(800, len(distilled)), random_state=42)
    
    emotion_ds = load_dataset("dair-ai/emotion")
    test_full = pd.DataFrame(emotion_ds["test"])
    test_full["label_name"] = test_full["label"].map(lambda x: EMOTION_LABELS[x])
    eval_subset = test_full.sample(200, random_state=42)
    
    print(f"Training on {len(train_subset)} distilled examples from NB08")
    print(f"Evaluating on {len(eval_subset)} gold-labeled examples from HF")

else:
    # ── Option B: Load fresh from HuggingFace ───────────────────────
    print(f"NB08 CSV not found — loading fresh data from HuggingFace")
    emotion_ds = load_dataset("dair-ai/emotion")
    
    train_full = pd.DataFrame(emotion_ds["train"])
    train_full["label_name"] = train_full["label"].map(lambda x: EMOTION_LABELS[x])
    
    test_full = pd.DataFrame(emotion_ds["test"])
    test_full["label_name"] = test_full["label"].map(lambda x: EMOTION_LABELS[x])
    
    np.random.seed(42)
    train_subset = train_full.sample(800, random_state=42)
    eval_subset = test_full.sample(200, random_state=42)
    
    print(f"Training on {len(train_subset)} examples from HuggingFace")
    print(f"Evaluating on {len(eval_subset)} examples from HuggingFace")

print(f"\nEmotion distribution (train):")
print(train_subset['label_name'].value_counts())

def format_instruction(row):
    """Convert a labeled example into an instruction-tuning format."""
    return {
        "text": tokenizer.apply_chat_template([
            {"role": "system", "content": f"You classify tweets into one of these emotion categories: {CATEGORIES}. Respond with only the emotion label."},
            {"role": "user", "content": f"Classify the emotion in this tweet:\n\n{row['text'][:500]}"},
            {"role": "assistant", "content": row['label_name']}
        ], tokenize=False)
    }

# Apply formatting
train_data = [format_instruction(row) for _, row in train_subset.iterrows()]
eval_data = [format_instruction(row) for _, row in eval_subset.iterrows()]

train_dataset = Dataset.from_list(train_data)
eval_dataset = Dataset.from_list(eval_data)

print(f"\nTraining examples: {len(train_dataset)}")
print(f"Eval examples: {len(eval_dataset)}")
print(f"\nExample formatted input:")
print(train_dataset[0]['text'][:500])

## 3. Training with SFTTrainer

We use the `SFTTrainer` from the `trl` library (Transformer Reinforcement Learning), which handles all the details of supervised fine-tuning.

Key training hyperparameters:
- **Batch size 2 x 4 accumulation steps** = effective batch size of 8
- **Learning rate 2e-4** -- a common starting point for LoRA fine-tuning
- **3 epochs** -- usually enough for a small dataset like this one; watch the training loss for signs of overfitting
- **8-bit AdamW** -- memory-efficient optimizer

Since tweets are short, training on 800 examples should be fast -- about 5-10 minutes on a T4 GPU.
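The hyperparameters above translate into a concrete number of optimizer steps. The arithmetic below is a quick sanity check, assuming a single GPU and exactly 800 training examples (if you load fewer distilled examples from NB08, the counts shrink accordingly). The variable names mirror the `TrainingArguments` fields but the snippet itself is only an illustration.

```python
import math

n_examples = 800                  # size of train_subset in the default path
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 3
warmup_steps = 5

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
micro_batches_per_epoch = math.ceil(n_examples / per_device_train_batch_size)
# 400 micro-batches divide evenly by 4, so there is no partial step to worry about here.
steps_per_epoch = micro_batches_per_epoch // gradient_accumulation_steps
total_steps = steps_per_epoch * num_train_epochs

print(f"Effective batch size: {effective_batch_size}")
print(f"Optimizer steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
print(f"Warmup share of training: {warmup_steps / total_steps:.1%}")
```

With `logging_steps=10`, this means you should see a loss value logged roughly thirty times over the run, which is enough to spot whether the loss is still falling or has flattened out.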

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        report_to="none",
    ),
)

In [None]:
print("Starting training...")
gpu_stats = torch.cuda.get_device_properties(0)
print(f"GPU: {gpu_stats.name} ({gpu_stats.total_memory/1e9:.0f} GB)")

trainer_stats = trainer.train()

print(f"\nTraining complete!")
print(f"  Time: {trainer_stats.metrics['train_runtime']:.0f}s")
print(f"  Loss: {trainer_stats.metrics['train_loss']:.4f}")
print(f"  GPU memory peak: {torch.cuda.max_memory_allocated()/1e9:.1f} GB")

## 4. Evaluation: Fine-tuned vs Zero-shot

Now let's see how our fine-tuned model performs on emotion classification compared to zero-shot prompting. We switch the model to Unsloth's inference mode (which enables its faster generation path) and run it over our held-out evaluation set. The zero-shot and student baselines are approximate ranges carried over from NB08, not recomputed here.
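Generated text rarely arrives as a clean label: it may carry quotes, capitalization, a trailing period, or a leftover `<think>...</think>` block. The helper below is one possible way to normalize such output, sketched independently of the evaluation cell; `normalize_prediction` and its `cutoff` default are assumptions for illustration. Unlike a silent fallback to the first category, it returns `None` when nothing is close, so unparseable outputs can be counted separately.

```python
import re
from difflib import get_close_matches

CATEGORIES = sorted(["sadness", "joy", "love", "anger", "fear", "surprise"])


def normalize_prediction(raw, categories=CATEGORIES, cutoff=0.6):
    """Map raw generated text to a valid label, or None if nothing is close."""
    # Remove any reasoning block the model may have emitted.
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    # Trim whitespace, then surrounding quotes and periods, and lowercase.
    text = text.strip().strip("\"'.").lower()
    if text in categories:
        return text
    # Tolerate small typos such as a dropped letter.
    match = get_close_matches(text, categories, n=1, cutoff=cutoff)
    return match[0] if match else None


samples = ['"Joy"', "<think>\n\n</think>\n\nsadness", "sadnes", "12345"]
normalized = [normalize_prediction(s) for s in samples]
unparseable = sum(label is None for label in normalized)

print(normalized)
print(f"Unparseable outputs: {unparseable} of {len(samples)}")
```

Tracking the unparseable count alongside accuracy makes it easier to tell a model that picks the wrong emotion from one that fails to follow the output format.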

In [None]:
import re
from difflib import get_close_matches
from tqdm import tqdm

FastLanguageModel.for_inference(model)

def classify_with_model(text: str) -> str:
    """Classify text using the fine-tuned model."""
    messages = [
        {"role": "system", "content": f"You classify tweets into one of these emotion categories: {CATEGORIES}. Respond with only the emotion label."},
        {"role": "user", "content": f"Classify the emotion in this tweet:\n\n{text[:500]}"}
    ]
    # enable_thinking=False asks the Qwen3 chat template to skip the reasoning
    # block, so the model should emit the label directly.
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True,
        enable_thinking=False, return_tensors="pt",
    ).to("cuda")

    # Greedy decoding (do_sample=False) is deterministic; temperature is not needed.
    outputs = model.generate(input_ids=inputs, max_new_tokens=50, do_sample=False)
    response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
    # Drop any leftover <think>...</think> block before returning the label text.
    response = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    return response.strip()

# Evaluate on test set
predictions = []

for _, row in tqdm(eval_subset.iterrows(), total=len(eval_subset), desc="Evaluating"):
    pred = classify_with_model(row['text'])
    # Normalize, then match to the closest valid category
    pred_clean = pred.strip().strip('"').strip("'").lower()
    if pred_clean in CATEGORIES:
        predictions.append(pred_clean)
    else:
        # Fuzzy match; outputs with no close match fall back to the first category
        match = get_close_matches(pred_clean, CATEGORIES, n=1, cutoff=0.3)
        predictions.append(match[0] if match else CATEGORIES[0])

eval_subset = eval_subset.copy()
eval_subset['prediction'] = predictions

In [None]:
from sklearn.metrics import classification_report, accuracy_score

acc = accuracy_score(eval_subset['label_name'], eval_subset['prediction'])
print(f"Fine-tuned Model Accuracy: {acc:.1%}")
print(f"\nClassification Report:")
print(classification_report(eval_subset['label_name'], eval_subset['prediction'], zero_division=0))

# Compare to baselines
print(f"\n{'='*50}")
print(f"Comparison to approaches on this dataset (NB08 figures are approximate ranges):")
print(f"  NB08 LLM zero-shot (teacher): ~70-80%")
print(f"  NB08 TF-IDF student:          ~60-70%")
print(f"  NB08 SBERT student:            ~70-80%")
print(f"  NB09 Fine-tuned Qwen3:         {acc:.0%}")

## 5. Bonus: Export to GGUF for Local Use

One of the great advantages of fine-tuning an open-weights model is that you **own** the result. You can export it in GGUF format and run it locally with tools like [Ollama](https://ollama.com/) or [llama.cpp](https://github.com/ggerganov/llama.cpp) -- no cloud API needed.
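Once the export cell below has produced a `.gguf` file, Ollama needs a Modelfile that points at it. The sketch below builds one as a string, using the `FROM`, `SYSTEM`, and `PARAMETER` instructions. The helper `build_modelfile` is hypothetical, the file path is simply the one printed by the export cell and may differ in your session, and the sketch does not set a `TEMPLATE`; check the Ollama documentation for whether your version needs one.

```python
CATEGORIES = sorted(["sadness", "joy", "love", "anger", "fear", "surprise"])


def build_modelfile(gguf_path: str, categories: list) -> str:
    """Return Modelfile text that reuses the system prompt from training."""
    # Keep the wording identical to the training prompt so inference matches it.
    system_prompt = (
        f"You classify tweets into one of these emotion categories: {categories}. "
        "Respond with only the emotion label."
    )
    return (
        f"FROM {gguf_path}\n"
        f'SYSTEM """{system_prompt}"""\n'
        "PARAMETER temperature 0\n"
    )


# Path copied from the export cell's instructions; adjust to the file you downloaded.
modelfile_text = build_modelfile("./model_gguf-unsloth.Q4_K_M.gguf", CATEGORIES)
print(modelfile_text)
```

Baking the system prompt into the Modelfile means callers only have to send the tweet, which keeps local usage consistent with how the model was trained.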

In [None]:
# Save the model in GGUF format for use with Ollama
model.save_pretrained_gguf("model_gguf", tokenizer, quantization_method="q4_k_m")
print("Model saved in GGUF format!")
print("To use with Ollama:")
print("  1. Download the .gguf file")
print("  2. Create a Modelfile:")
print('     FROM ./model_gguf-unsloth.Q4_K_M.gguf')
print("  3. ollama create my-classifier -f Modelfile")
print("  4. ollama run my-classifier")

## Bonus: Deploy as a Gradio App

Let's create an interactive demo of our fine-tuned emotion classifier. Users can type tweets and see the model's predictions in real time.

> **Note:** This runs inference on the GPU, so it works best in the Colab session where the model is loaded.

In [None]:
try:
    !pip install gradio -q
    import gradio as gr

    def classify_emotion(text):
        """Classify emotion in text using the fine-tuned model."""
        if not text.strip():
            return "Please enter some text."
        prediction = classify_with_model(text)
        return f"## Predicted emotion: **{prediction}**"

    demo = gr.Interface(
        fn=classify_emotion,
        inputs=gr.Textbox(lines=3, placeholder="Type a tweet or short text..."),
        outputs=gr.Markdown(label="Prediction"),
        title="Emotion Classifier (Fine-tuned Qwen3-4B)",
        description="Classify the emotion in text as: sadness, joy, love, anger, fear, or surprise. Powered by a LoRA-fine-tuned Qwen3-4B model.",
        examples=[
            ["I can't believe how beautiful this sunset is tonight!"],
            ["Just got rejected from my dream job. Feeling awful."],
            ["My best friend surprised me with concert tickets!"],
            ["I'm worried about the exam results coming out tomorrow."],
        ],
    )
    demo.launch(share=True)

except ImportError:
    print("Gradio not available. Install with: pip install gradio")

## 6. Exercise

Try modifying the fine-tuning setup and observe how it affects performance:

1. **Different LoRA rank**: Try `r=4`, `r=8`, or `r=32`. How does this affect the number of trainable parameters and the final accuracy?
2. **Different number of epochs**: Try `num_train_epochs=1` or `num_train_epochs=5`. Does more training help, or does the model start overfitting?
3. **More training data**: Increase from 800 to 2000 or 5000 examples. Does the fine-tuned model improve?
4. **Compare gold vs distilled labels**: If you ran NB08 first, the data cell above loaded distilled labels automatically. Try re-running without the CSV (rename it) to use gold HuggingFace labels instead. How does label noise from distillation affect fine-tuning quality?

Use the skeleton below to experiment.
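When comparing several runs, it helps to record each one in a structured way instead of scattered comments. The helper below is one possible sketch; `RunResult` and `summarize` are hypothetical names, and the example entries deliberately carry no measurements (they print as `???`) until you fill in the values from your own runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    rank: int
    epochs: int
    trainable_params: Optional[int] = None  # fill in from the parameter-count cell
    accuracy: Optional[float] = None        # fill in from the evaluation cell


def fmt(value, spec):
    """Format a value, or show a placeholder when it has not been measured yet."""
    return format(value, spec) if value is not None else "???"


def summarize(runs):
    """Return an aligned text table sorted by rank, then epochs."""
    lines = [f"{'rank':>4} {'epochs':>6} {'trainable':>12} {'accuracy':>8}"]
    for run in sorted(runs, key=lambda r: (r.rank, r.epochs)):
        lines.append(
            f"{run.rank:>4} {run.epochs:>6} "
            f"{fmt(run.trainable_params, ','):>12} {fmt(run.accuracy, '.1%'):>8}"
        )
    return "\n".join(lines)


# No results recorded yet: these rows only list the configurations to try.
runs = [RunResult(rank=16, epochs=3), RunResult(rank=4, epochs=3), RunResult(rank=32, epochs=3)]
table = summarize(runs)
print(table)
```

After each experiment, append something like `RunResult(rank=4, epochs=3, trainable_params=trainable, accuracy=acc)` using the variables computed earlier in the notebook.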

In [None]:
# ============================================================
# Exercise: Experiment with different LoRA configurations
# ============================================================

# TODO: Reload the base model (copy from cell above)
# model, tokenizer = FastLanguageModel.from_pretrained(...)

# TODO: Try a different LoRA rank
# model = FastLanguageModel.get_peft_model(
#     model,
#     r=4,  # <-- try 4, 8, or 32
#     ...
# )

# TODO: Try a different number of epochs in TrainingArguments
# args = TrainingArguments(
#     num_train_epochs=1,  # <-- try 1 or 5
#     ...
# )

# TODO: Train, evaluate, and compare results
# trainer = SFTTrainer(...)
# trainer.train()

# Record your observations here:
# r=4:  trainable params = ???, accuracy = ???
# r=16: trainable params = ???, accuracy = ???
# r=32: trainable params = ???, accuracy = ???

## 7. Summary & Takeaways

In this notebook we fine-tuned a 4-billion parameter language model on emotion classification using a free Colab T4 GPU. Here are the key takeaways:

- **LoRA makes fine-tuning accessible on free hardware.** By training only ~1-2% of the parameters on top of a 4-bit quantized base model, we cut memory requirements sharply (roughly 70% as a ballpark), making it possible to fine-tune a 4B model on a single T4 GPU (16 GB VRAM).

- **Qwen3-4B is a capable base model.** Modern open-weights models like Qwen3 provide strong foundations for task-specific fine-tuning. The 4B parameter size hits a sweet spot between capability and efficiency.

- **Fine-tuning can significantly improve task-specific performance.** Compared to zero-shot prompting, fine-tuned models produce more consistent outputs, follow the expected format more reliably, and often achieve higher accuracy on emotion detection.

- **NB08→NB09 pipeline is a real-world pattern.** Using LLM distillation to create training data (NB08), then fine-tuning a small model on those labels (NB09), is a practical approach for building custom classifiers, provided you validate against gold labels as we did here.

- **Export to GGUF enables local deployment.** Unlike API-based solutions, fine-tuned open models can be exported and run locally with tools like Ollama -- giving you full control, privacy, and zero per-token costs.

### What's Next?

- Try fine-tuning on your own datasets and tasks
- Experiment with different base models (Llama, Mistral, Gemma)
- Explore more advanced techniques: DPO, RLHF, multi-task fine-tuning
- Deploy your fine-tuned model as a local API with Ollama or vLLM