# 🦥 Fine-Tuning with Unsloth: Teaching Llama to Think Like DeepSeek-R1

## Overview

In this tutorial, we'll fine-tune **Llama-3.2-3B-Instruct** using **Unsloth** on the [ServiceNow-AI/R1-Distill-SFT](https://huggingface.co/datasets/ServiceNow-AI/R1-Distill-SFT) dataset. The goal is to transfer DeepSeek-R1's **chain-of-thought reasoning capabilities** into the smaller Llama model.

### What You'll Learn

| Concept | Description |
|---------|-------------|
| **QLoRA** | Quantized Low-Rank Adaptation for memory-efficient fine-tuning |
| **4-bit Quantization** | Running 3B models on consumer GPUs |
| **LoRA Adapters** | Training only a small fraction of parameters |
| **Knowledge Distillation** | Teaching smaller models to reason like larger ones |

### Why Unsloth?

Unsloth advertises **2-5× faster training** and up to **70% less memory usage** compared to standard Hugging Face training, achieved through custom GPU kernels and optimized implementations.

---

## Step 1: Install Dependencies

First, install Unsloth and its dependencies, for example with `pip install unsloth`. Unsloth automatically handles the installation of optimized GPU kernels for your GPU architecture.

## Step 2: Load the Quantized Model

We load **Llama-3.2-3B-Instruct** with 4-bit quantization enabled. This reduces the memory footprint of the weights from ~6GB in FP16 (~12GB in FP32) to roughly 2-3GB, making the model trainable on consumer GPUs.

### Key Parameters Explained:

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! Unsloth also supports RoPE (Rotary Positional Embedding) scaling internally.
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit, # Will load the 4Bit Quantized Model
)

### Parameter Breakdown

| Parameter | Value | Purpose |
|-----------|-------|---------|
| `model_name` | `"unsloth/Llama-3.2-3B-Instruct"` | Pre-trained model from Unsloth's optimized collection |
| `max_seq_length` | `2048` | Maximum tokens per sequence (Unsloth handles RoPE scaling internally) |
| `dtype` | `None` | Auto-detect optimal precision (FP16 for T4/V100, BF16 for Ampere+) |
| `load_in_4bit` | `True` | Enable 4-bit (NF4) quantization; weights take roughly a quarter of their FP16 size |

**Memory Impact**: Loading in 4-bit reduces the VRAM needed for the weights from ~6GB (FP16) to roughly 2-3GB, enabling training on GPUs with 8-16GB VRAM.
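
As a rough cross-check of these figures, the weight footprint can be estimated as parameter count times bits per parameter. The sketch below assumes about 3.2 billion parameters for the 3B model (an approximation, not a value taken from this notebook) and ignores activations, optimizer state, quantization constants, and any layers kept in higher precision, so real usage will be somewhat higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

# Assumption: ~3.2 billion parameters for Llama-3.2-3B.
NUM_PARAMS = 3.2e9

for label, bits in [("FP32", 32), ("FP16/BF16", 16), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

The 4-bit estimate is a lower bound; the footprint observed in practice is usually larger because not every tensor is quantized.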

## Step 3: Attach LoRA Adapters (PEFT)

Instead of fine-tuning all 3 billion parameters, we attach **LoRA (Low-Rank Adaptation)** layers to specific modules. This trains only ~0.1-1% of parameters while achieving comparable results to full fine-tuning.

### How LoRA Works

LoRA decomposes weight updates into low-rank matrices:

```
W_new = W_original + (B × A)
```

Where `A` and `B` are small matrices with rank `r`. Instead of updating the full weight matrix, we only train `A` and `B`.
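
To make the decomposition concrete, here is a minimal pure-Python sketch on toy matrices. It assumes the standard LoRA formulation, in which the update is scaled by `alpha / r` and `B` starts at zero so that the adapted model initially matches the base model. The sizes and the helpers `matmul` and `lora_merge` are illustrative only, not part of Unsloth.

```python
import random

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_merge(w, a, b, alpha, r):
    """Return W + (alpha / r) * (B @ A), the merged weight matrix."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

random.seed(0)
d_out, d_in, r, alpha = 4, 6, 2, 2  # toy sizes, not the model's real dimensions

w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
b = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r

# With B = 0 the adapter is a no-op, so training starts from the base model.
merged_at_init = lora_merge(w, a, b, alpha, r)
print("unchanged at init:", merged_at_init == w)

# Once B moves away from zero, the merged weight differs from the base weight.
b[0][0] = 1.0
merged_after_update = lora_merge(w, a, b, alpha, r)
print("changed after update:", merged_after_update != w)
```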

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16, # a higher alpha value assigns more weight to the LoRA activations
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

<img src="w_formula.jpg"></img>

### LoRA Configuration Explained

| Parameter | Value | Purpose |
|-----------|-------|---------|
| `r` | `16` | **Rank** of low-rank matrices. Higher = more capacity but more memory. Common values: 8, 16, 32, 64 |
| `target_modules` | `["q_proj", "k_proj", ...]` | Transformer layers to apply LoRA. Targeting attention + MLP layers covers the most important weights |
| `lora_alpha` | `16` | **Scaling factor** for LoRA updates (the update is scaled by `lora_alpha / r`). Higher alpha = stronger influence of fine-tuning |
| `lora_dropout` | `0` | Dropout for regularization. Set to 0 for Unsloth's optimized kernels |
| `bias` | `"none"` | Whether to train bias terms. "none" is fastest and often sufficient |
| `use_gradient_checkpointing` | `"unsloth"` | Memory optimization for long sequences. Trades compute for memory |
| `random_state` | `3407` | Seed for reproducibility |
| `use_rslora` | `False` | Rank-Stabilized LoRA variant (experimental) |
| `loftq_config` | `None` | LoftQ (LoRA-fine-tuning-aware quantization) initialization config (disabled here) |

### Target Modules

```
q_proj, k_proj, v_proj, o_proj  → Attention mechanism
gate_proj, up_proj, down_proj  → MLP/Feed-Forward layers
```

These modules contain the majority of learnable parameters in transformers.
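
The number of trainable LoRA parameters follows directly from the shapes of the targeted layers: each adapted layer adds `r × (in_features + out_features)` values. The sketch below assumes the layer shapes of Llama-3.2-3B (hidden size 3072, key/value width 1024, MLP width 8192, 28 decoder layers); these numbers are assumptions that do not appear in this notebook and should be checked against the model config.

```python
# Assumed Llama-3.2-3B shapes; verify against the model's config before relying on them.
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 3072, 1024, 8192, 28

# (in_features, out_features) for each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_params(in_features: int, out_features: int, r: int) -> int:
    """A is r x in_features and B is out_features x r."""
    return r * (in_features + out_features)

def total_lora_params(r: int) -> int:
    """Trainable LoRA parameters across all targeted modules and layers."""
    per_layer = sum(lora_params(i, o, r) for i, o in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS

for r in (8, 16, 32, 64):
    n = total_lora_params(r)
    print(f"r={r:>2}: {n:,} trainable parameters ({n / 3.2e9:.2%} of ~3.2B)")
```

Under these assumptions the count scales linearly with `r`, which is why raising the rank increases both capacity and memory use.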

## Step 4: Load the Training Dataset

We use the **R1-Distill-SFT** dataset which contains reasoning examples from DeepSeek-R1. Each example has:
- `problem`: The input question
- `reannotated_assistant_content`: The response with `<think>` reasoning tags

This dataset teaches the model to "think before answering" like reasoning models (o1, DeepSeek-R1).

In [None]:
from datasets import load_dataset
dataset = load_dataset("ServiceNow-AI/R1-Distill-SFT",'v0', split = "train")

In [None]:
print(dataset[:5])

### Dataset Structure

The dataset contains ~170K examples with the following fields:
- `problem`: The original question/task
- `reannotated_assistant_content`: Response with `<think>` tags containing reasoning steps
- `solution`: The final answer

The `<think>` tags teach the model to **show its work** before answering, enabling chain-of-thought reasoning.
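
When inspecting examples (or model outputs later on), it can help to separate the reasoning from the rest of the response. The helper below is a small sketch; `split_reasoning` is a hypothetical name, and the sample string is hand-written to mimic the field described above, not copied from the dataset.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(response: str) -> tuple[str, str]:
    """Return (reasoning, remainder); reasoning is '' if no <think> block is found."""
    match = THINK_RE.search(response)
    if match is None:
        return "", response.strip()
    reasoning = match.group(1).strip()
    remainder = (response[:match.start()] + response[match.end():]).strip()
    return reasoning, remainder

# Hypothetical record, written by hand for illustration.
sample = "<think>\n2 + 2 means adding two and two, which gives 4.\n</think>\nThe answer is 4."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```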

## Step 5: Create the Prompt Template

We define a prompt template that structures the input for training. The template includes:
1. **System instruction**: Tells the model to reason step-by-step
2. **Problem**: The actual question wrapped in `<problem>` tags
3. **Thinking + Solution**: The reasoning and answer (filled during training)

The `formatting_prompts_func` transforms each dataset example into this format and appends the EOS token to signal sequence end.

In [None]:
r1_prompt = """You are a reflective assistant engaging in thorough, iterative reasoning, mimicking human stream-of-consciousness thinking. Your approach emphasizes exploration, self-doubt, and continuous refinement before coming up with an answer.
<problem>
{}
</problem>

{}
{}
"""
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
  problems = examples["problem"]
  thoughts = examples["reannotated_assistant_content"]
  solutions = examples["solution"]
  texts = []

  for problem, thought, solution in zip(problems, thoughts, solutions):
    text = r1_prompt.format(problem, thought, solution)+EOS_TOKEN
    texts.append(text)

  return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched = True,)
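
With `batched = True`, `map` passes each column as a list of values, which is why the function zips three lists together. The toy version below illustrates that layout without loading the dataset or tokenizer; the shortened template and the `<eos>` placeholder are stand-ins, not the real prompt or EOS token.

```python
# Stand-ins for illustration only: a shortened template and a placeholder EOS string.
toy_prompt = "<problem>\n{}\n</problem>\n\n{}\n{}\n"
TOY_EOS = "<eos>"

def toy_formatting_func(examples):
    """Same shape as formatting_prompts_func: dict of columns in, dict with 'text' out."""
    texts = [
        toy_prompt.format(problem, thought, solution) + TOY_EOS
        for problem, thought, solution in zip(
            examples["problem"],
            examples["reannotated_assistant_content"],
            examples["solution"],
        )
    ]
    return {"text": texts}

# A batch is a dict mapping each column name to a list of values.
toy_batch = {
    "problem": ["What is 2 + 2?", "What is 3 * 3?"],
    "reannotated_assistant_content": ["<think>2 + 2 = 4</think>", "<think>3 * 3 = 9</think>"],
    "solution": ["4", "9"],
}
formatted = toy_formatting_func(toy_batch)
print(formatted["text"][0])
```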

## Step 6: Configure the Trainer

We use Hugging Face's **SFTTrainer** (Supervised Fine-Tuning Trainer) with carefully tuned hyperparameters:

### SFTTrainer Configuration

| Parameter | Value | Purpose |
|-----------|-------|---------|
| `dataset_text_field` | `"text"` | Column containing formatted prompts |
| `max_seq_length` | `2048` | Maximum tokens per example; longer sequences are truncated |
| `dataset_num_proc` | `2` | Parallel processes for data loading |
| `packing` | `False` | Disable sequence packing (can enable for 5× speedup with short sequences) |

### Training Arguments Explained

| Parameter | Value | Purpose |
|-----------|-------|---------|
| `per_device_train_batch_size` | `2` | Samples per GPU per step (limited by VRAM) |
| `gradient_accumulation_steps` | `4` | Accumulate gradients over 4 steps → effective batch size = 8 |
| `warmup_steps` | `5` | Gradually increase learning rate to prevent early instability |
| `max_steps` | `60` | Total training steps (increase for better results) |
| `learning_rate` | `2e-4` | Standard for LoRA fine-tuning |
| `fp16`/`bf16` | Auto | Use BF16 on Ampere+ GPUs, FP16 otherwise |
| `optim` | `"adamw_8bit"` | 8-bit AdamW optimizer saves memory |
| `weight_decay` | `0.01` | Weight decay (decoupled, L2-style regularization in AdamW) to limit overfitting |
| `lr_scheduler_type` | `"linear"` | Linear learning rate decay |

### Effective Batch Size Calculation

```
Effective batch size = per_device_batch × gradient_accumulation × num_gpus
                     = 2 × 4 × 1 = 8
```
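
The same arithmetic also shows how little of the dataset a 60-step run actually touches. The sketch below reuses the values from this tutorial; the dataset size is the approximate figure quoted earlier, not an exact count.

```python
per_device_batch = 2
gradient_accumulation = 4
num_gpus = 1
max_steps = 60
dataset_size = 170_000  # approximate size quoted in this tutorial

effective_batch = per_device_batch * gradient_accumulation * num_gpus
samples_seen = effective_batch * max_steps

print(f"effective batch size: {effective_batch}")
print(f"samples seen in {max_steps} steps: {samples_seen}")
print(f"fraction of dataset: {samples_seen / dataset_size:.2%}")
```

This is why the run works as a quick demonstration, and why raising `max_steps` is the first thing to try for better results.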

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2, # Number of processors to use for processing the dataset
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2, # The batch size per GPU/TPU core
        gradient_accumulation_steps = 4, # Number of micro-batches to accumulate before each optimizer update
        warmup_steps = 5, # Few updates with low learning rate before actual training
        max_steps = 60, # Specifies the total number of training steps (batches) to run.
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit", # Optimizer
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Set to "wandb" (or another tracker) to log metrics for observability
    ),
)

In [None]:
trainer_stats = trainer.train()

## Step 7: Train the Model

Run the training loop. With the current settings (60 steps), this takes approximately 5-15 minutes depending on your GPU.

**Training Metrics to Watch**:
- `loss`: Should decrease over time (target: < 1.0)
- `grad_norm`: Should stay stable (spikes indicate instability)
- `learning_rate`: Should decrease linearly after warmup
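
To know what the `learning_rate` column should look like, the schedule can be sketched by hand. The function below is an illustrative reimplementation of linear warmup followed by linear decay using this tutorial's settings; `linear_schedule_lr` is a hypothetical helper, not the trainer's own code, and logged values may be offset by one step.

```python
def linear_schedule_lr(step: int, base_lr: float = 2e-4,
                       warmup_steps: int = 5, max_steps: int = 60) -> float:
    """Learning rate at a given optimizer step for linear warmup then linear decay."""
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))
    return base_lr * factor

for step in (0, 1, 5, 30, 60):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step):.2e}")
```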

In [None]:
from unsloth.chat_templates import get_chat_template
sys_prompt = """You are a reflective assistant engaging in thorough, iterative reasoning, mimicking human stream-of-consciousness thinking. Your approach emphasizes exploration, self-doubt, and continuous refinement before coming up with an answer.
<problem>
{}
</problem>
"""
message = sys_prompt.format("How many 'r's are present in 'strawberry'?")
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": message},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 1024, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
response = tokenizer.batch_decode(outputs)

## Step 8: Test the Fine-Tuned Model

Now let's test if the model learned to reason! We use the classic "How many r's in strawberry?" question, a task that trips up many LLMs.

**Key changes for inference**:
- `FastLanguageModel.for_inference(model)` enables 2× faster inference
- `temperature = 1.5` + `min_p = 0.1` for diverse, creative reasoning
- `max_new_tokens = 1024` allows space for full chain-of-thought
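
The interaction between `temperature` and `min_p` can be illustrated on a toy distribution. The sketch below is a simplified stand-in for what the sampling pipeline does, not the library's implementation: it flattens the distribution with the temperature, then discards every token whose probability is below `min_p` times the top token's probability and renormalizes. The logits are made up.

```python
import math

def min_p_distribution(logits, temperature=1.5, min_p=0.1):
    """Temperature-scaled softmax, then drop tokens below min_p * (top probability)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]

toy_logits = [2.0, 1.0, 0.0, -3.0]  # made-up scores for a four-token vocabulary
toy_probs = min_p_distribution(toy_logits)
print([round(p, 3) for p in toy_probs])
```

A high temperature encourages varied reasoning paths, while `min_p` removes the unlikely tail that a high temperature would otherwise make reachable.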

In [None]:
print(response[0])

### Check the Response

The model should now show **explicit reasoning** in `<think>` tags before providing the answer. A successful fine-tune will count each 'r' systematically: s-t-**r**-a-w-b-e-**r**-**r**-y = 3 r's.
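
For reference, the expected answer can be confirmed in plain Python, which gives a ground truth to compare the model's reasoning against.

```python
word = "strawberry"

# 1-indexed positions of every 'r' in the word.
positions = [i for i, ch in enumerate(word, start=1) if ch == "r"]
print(f"{word!r} contains {len(positions)} 'r' characters at positions {positions}")
```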

In [None]:
model.save_pretrained("chintan-001-3B")  # Local saving
tokenizer.save_pretrained("chintan-001-3B")

## Step 9: Save the Model

Save the fine-tuned model in two formats:

### LoRA Adapters Only
Saves only the trained LoRA weights (~50-100 MB). To use, you need the base model + these adapters.
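
The size range follows from the number of trainable parameters. The sketch below assumes roughly 24.3 million LoRA parameters, which is what `r = 16` on all seven target modules gives under assumed Llama-3.2-3B layer shapes (hidden size 3072, key/value width 1024, MLP width 8192, 28 layers); treat that count as an assumption, not a measured value.

```python
# Assumption: ~24.3M LoRA parameters (r = 16 on all seven target modules of a 3B Llama).
LORA_PARAMS = 24_313_856

def adapter_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Size of the adapter weights in decimal megabytes, ignoring file metadata."""
    return num_params * bytes_per_param / 1e6

print(f"16-bit adapters: {adapter_size_mb(LORA_PARAMS, 2):.0f} MB")
print(f"32-bit adapters: {adapter_size_mb(LORA_PARAMS, 4):.0f} MB")
```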

### GGUF Format (Optional)
Converts to llama.cpp compatible format for use with Ollama, LM Studio, or llama.cpp directly. This creates a single portable file.

In [None]:
model.save_pretrained_gguf("chintan-001-3B-GGUF", tokenizer,)

## Step 10: Deploy with Ollama (Optional)

**Ollama** is a tool for running LLMs locally with a simple API. These cells:
1. Install Ollama
2. Start the Ollama server
3. Register our GGUF model
4. Allow interaction via `ollama run unsloth_model`

After this, you can chat with your fine-tuned model from the terminal!
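
Besides the terminal, a running Ollama server can be queried over its local HTTP API. The sketch below assumes the default endpoint `http://localhost:11434/api/generate` and the model name `unsloth_model` registered in this step; the helper names `build_generate_request` and `ask` are hypothetical. The network call is left commented out so the cell runs even when no server is up.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_generate_request(prompt: str, model: str = "unsloth_model") -> urllib.request.Request:
    """Build a non-streaming generation request for a locally running Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt: str, model: str = "unsloth_model") -> str:
    """Send the prompt and return the generated text (requires `ollama serve` to be running)."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]

# Example (uncomment once the server is up and the model is registered):
# print(ask("How many 'r's are present in 'strawberry'?"))
```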

In [None]:
!curl -fsSL https://ollama.com/install.sh | sh

In [None]:
import subprocess
subprocess.Popen(["ollama", "serve"])
import time
time.sleep(3)

In [None]:
print(tokenizer._ollama_modelfile)

In [None]:
!ollama create unsloth_model -f ./chintan-001-3B-GGUF/Modelfile

---

## Summary & Key Takeaways

### What We Built
We fine-tuned Llama-3.2-3B to think step-by-step like DeepSeek-R1, using only ~50 lines of code and consumer GPU hardware.

### Techniques Used

| Technique | Benefit |
|-----------|---------|
| **4-bit Quantization** | Up to ~4× smaller weights than FP16 (~6GB → ~2-3GB) |
| **LoRA Adapters** | Train under 1% of parameters, comparable results |
| **QLoRA** | Combine quantization + LoRA for extreme efficiency |
| **Knowledge Distillation** | Transfer reasoning from larger models |

### Next Steps

1. **Increase `max_steps`** to 500-1000 for better quality
2. **Experiment with `r`** (LoRA rank): 32 or 64 for more capacity
3. **Try different datasets** for domain-specific fine-tuning
4. **Upload to Hugging Face Hub** for sharing

### Resources

- [Unsloth GitHub](https://github.com/unslothai/unsloth)
- [QLoRA Paper](https://arxiv.org/abs/2305.14314)
- [R1-Distill Dataset](https://huggingface.co/datasets/ServiceNow-AI/R1-Distill-SFT)