**Day 9 topics:**
1. **LoRA (Low-Rank Adaptation)** – Efficient fine-tuning method
2. **QLoRA** – LoRA with 4-bit quantization (even more efficient)
3. **Fine-tuning pipeline** – Complete training setup
4. **Parameter efficiency** – Training 0.01%-0.1% of model weights

**Goal:** Teach **how to customize LLMs efficiently** for specialized tasks.

**1. LoRA** → Train only **tiny matrices** added to model, not all weights. Saves memory/time.

**2. QLoRA** → LoRA but with **model weights in 4-bit** (not 16-bit). Even more memory saved.

**3. Fine-tuning pipeline** → **Code structure** to: load data, train model, evaluate, save.

**4. Parameter efficiency** → Train **a few million params** (roughly 0.01%-0.1% of the weights) instead of all **7,000,000,000**. Makes fine-tuning possible on consumer hardware.
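The memory saving from 4-bit weights is simple arithmetic. The sketch below is a rough estimate only: it counts weight storage for the 7-billion-parameter example and ignores activations, optimizer state, and the small overhead that quantization schemes add for scaling constants. `weight_memory_gb` is a hypothetical helper name, not a library function.

```python
def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 7_000_000_000  # the 7B example from the text

fp16_gb = weight_memory_gb(n_params, 16)  # 16-bit weights
int4_gb = weight_memory_gb(n_params, 4)   # 4-bit weights, as in QLoRA

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f" 4-bit weights: {int4_gb:.1f} GB")
```

Under these assumptions the 4-bit copy needs a quarter of the 16-bit storage, which is what brings a 7B model within reach of a single consumer GPU.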

**Simple analogy:**  
Instead of rebuilding a car engine (full fine-tuning), just **add a small chip** (LoRA) that makes it drive better for your specific roads.

#### 1: LoRA

**LoRA = Small "adapter" matrices** added to a frozen model.

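**How it works:**
1. **Freeze** the original LLM (its weights get no gradient updates).
2. **Add** two tiny matrices (`A` and `B`) to selected layers, typically attention or MLP projections.
3. **Train only** these tiny matrices.
4. In every forward pass (training and inference): `Output = Original(input) + scale × (B × A × input)`, where `scale = alpha / rank`.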
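The steps above can be sketched without any framework. This is a minimal pure-Python illustration, not the notebook's implementation: the sizes, the seed, and the helper names `matvec` and `lora_forward` are made up for the example. It applies `A` first and `B` second, the same order the notebook's `forward_lora` uses, and starts `B` at zero so the adapter initially changes nothing.

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, rank):
    """h = W x + (alpha / rank) * B (A x); W is frozen, only A and B would be trained."""
    base_out = matvec(W, x)
    lora_out = matvec(B, matvec(A, x))  # A projects down to `rank`, B projects back up
    scale = alpha / rank
    return [b + scale * l for b, l in zip(base_out, lora_out)]

random.seed(0)
d_in, d_out, rank, alpha = 6, 4, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weights
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(rank)]   # rank x d_in
B = [[0.0] * rank for _ in range(d_out)]                               # d_out x rank, zeros
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# With B at zero the adapter contributes nothing, so the output equals the base output.
assert lora_forward(W, A, B, x, alpha, rank) == matvec(W, x)

# After a (pretend) training update to B, the first output shifts away from the base value.
B[0][0] = 0.5
shifted = lora_forward(W, A, B, x, alpha, rank)
print(shifted[0] - matvec(W, x)[0])  # equals scale * 0.5 * (A x)[0]
```

Starting `B` at zero is why a freshly wrapped model behaves exactly like the original until training begins.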

**Example:**  
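Instead of training **1 billion parameters**, train only **about 1 million** (the `A` and `B` matrices).

**Result:** Customized model with roughly **1/1000th the trainable parameters**, which sharply cuts gradient and optimizer memory. The forward pass still runs through the full frozen model.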
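To make the counting concrete, here is a small sketch for a single layer. It assumes the layer is GPT-2 small's first MLP projection (768 inputs, 3072 outputs) and uses rank 4, the value the demo cell passes to `init_lora`. `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * d_in + d_out * rank

d_in, d_out, rank = 768, 3072, 4   # assumed GPT-2 small MLP projection, rank from the demo

full = d_in * d_out                 # weight matrix of the frozen layer (bias not counted)
adapter = lora_param_count(d_in, d_out, rank)

print(f"frozen layer weights: {full:,}")               # 2,359,296
print(f"LoRA adapter weights: {adapter:,}")            # 15,360
print(f"adapter / layer:      {adapter / full:.2%}")   # 0.65%
```

The ratio for a whole model is smaller still, because most layers get no adapter at all.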

**LoRA lets you customize a giant LLM on a normal computer.**

**Benefits:**
1. **Makes LLMs fit your task** – Train it to excel at your specific format (like your `TOOL:` prompt).
2. **Saves huge memory** – Train 0.1% of parameters instead of 100%.
3. **Fast training** – Hours, not weeks.
4. **Keep base model** – One base model, many LoRA adapters for different tasks.
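The "one base model, many adapters" point can be illustrated with tiny matrices. This is a sketch under made-up numbers: the task names, the 2×3 base matrix, and the helpers `matmul` and `merge` are all hypothetical. Each adapter is folded into a copy of the shared weights as `W + scale × (B × A)`, and the base itself is left untouched.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def matmul(P, Q):
    """Matrix product of P (n x k) and Q (k x m)."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

def merge(W, A, B, scale):
    """Fold an adapter into a copy of the base weights: W + scale * (B @ A)."""
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]                      # shared frozen base (2 x 3)

adapters = {                               # hypothetical task names, rank-1 adapters
    "tool_format": {"A": [[1.0, 0.0, 0.0]], "B": [[0.5], [0.0]]},
    "reviews":     {"A": [[0.0, 1.0, 1.0]], "B": [[0.0], [2.0]]},
}
scale = 2.0                                # alpha / rank
x = [1.0, 2.0, 3.0]

print("base", matvec(W, x))                # [7.0, 2.0]
for task, ad in adapters.items():
    W_task = merge(W, ad["A"], ad["B"], scale)
    print(task, matvec(W_task, x))         # [8.0, 2.0] then [7.0, 22.0]
```

Only the small `A` and `B` matrices differ between tasks, so switching tasks means swapping a few numbers rather than reloading a second full model.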

**For you:** Could train Llama to follow your agent's `TOOL:` format **much more consistently**, making it more reliable than prompting alone.

In [1]:
# How: the demo below fine-tunes GPT-2 on IMDB reviews.
# You could swap in your agent's tool-use examples instead, to make the model better at following the TOOL: format.

In [1]:
import torch
import torch.nn as nn  # Import PyTorch neural network modules
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from datasets import load_dataset
from torch.utils.data import DataLoader
import torch.optim as optim


In [2]:
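def init_lora(base_layer, rank=8, alpha=16):
    # Freeze the wrapped layer; only the two adapter matrices below are trained
    for p in base_layer.parameters():
        p.requires_grad = False
    if isinstance(base_layer, nn.Linear):
        in_features, out_features = base_layer.in_features, base_layer.out_features
    else:
        # GPT-2 uses transformers' Conv1D, which stores its weight as (in, out)
        in_features, out_features = base_layer.weight.shape
    lora_A = nn.Linear(in_features, rank, bias=False)
    lora_B = nn.Linear(rank, out_features, bias=False)
    nn.init.kaiming_uniform_(lora_A.weight, a=5**0.5)
    nn.init.zeros_(lora_B.weight)  # B starts at zero, so the adapter is a no-op at first
    return {
        'base': base_layer,
        'base_forward': base_layer.forward,  # captured before the layer's forward is patched
        'lora_A': lora_A,
        'lora_B': lora_B,
        'scale': alpha / rank
    }

def forward_lora(lora_dict, x):
    # Call the saved original forward; calling the patched layer itself would recurse forever
    base_out = lora_dict['base_forward'](x)
    lora_out = lora_dict['lora_B'](lora_dict['lora_A'](x))
    return base_out + lora_dict['scale'] * lora_out

In [4]:
# Load GPT-2
model = GPT2LMHeadModel.from_pretrained("./gpt2-local")
tokenizer = GPT2Tokenizer.from_pretrained("./gpt2-local")
tokenizer.pad_token = tokenizer.eos_token

In [5]:
# Apply LoRA to the first block's MLP input projection (demo).
# GPT-2 names this layer c_fc; it has no fc1 attribute.
for p in model.parameters():
    p.requires_grad = False  # freeze the whole model; only the adapters will train

mlp = model.transformer.h[0].mlp
lora = init_lora(mlp.c_fc, rank=4)
# Register the adapters as submodules so model.parameters() includes them
mlp.lora_A = lora['lora_A']
mlp.lora_B = lora['lora_B']
mlp.c_fc.forward = lambda x: forward_lora(lora, x)

print("LoRA applied to GPT-2. Ready for IMDB fine-tuning.")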

LoRA applied to GPT-2. Ready for IMDB fine-tuning.


In [6]:
# Add IMDB dataset and training:

In [9]:
# Load IMDB
dataset = load_dataset("imdb")
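# IMDB is sorted by label, so shuffle before taking a small subset
train_data = dataset["train"].shuffle(seed=42).select(range(1000))  # 1000 random reviews
test_data = dataset["test"].shuffle(seed=42).select(range(200))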


In [10]:
# Tokenize
def tokenize(examples):
    # Pad every example to the same length so the DataLoader can stack them into batches
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

train_data = train_data.map(tokenize, batched=True)
test_data = test_data.map(tokenize, batched=True)
train_data.set_format(type="torch", columns=["input_ids", "attention_mask"])
test_data.set_format(type="torch", columns=["input_ids", "attention_mask"])

# DataLoader
train_loader = DataLoader(train_data, batch_size=4, shuffle=True)
test_loader = DataLoader(test_data, batch_size=4)

print(f"Train batches: {len(train_loader)}, Test batches: {len(test_loader)}")

Train batches: 250, Test batches: 50





In [11]:
# Training Loop

In [None]:
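# Optimizer over the parameters that are still trainable
# (only the LoRA adapters, provided the base model has been frozen)
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.AdamW(trainable_params, lr=1e-3)

model.train()
for batch_idx, batch in enumerate(train_loader):
    inputs = {k: v.to(model.device) for k, v in batch.items()}

    # Causal-LM labels are the input ids; -100 makes the loss skip padding positions
    labels = inputs["input_ids"].clone()
    labels[inputs["attention_mask"] == 0] = -100

    optimizer.zero_grad()
    outputs = model(**inputs, labels=labels)
    loss = outputs.loss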
    loss.backward()
    optimizer.step()
    
    if batch_idx % 50 == 0:
        print(f"Batch {batch_idx}, Loss: {loss.item():.4f}")

print("Training done.")
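The pipeline also lists an evaluation step, which the notebook does not show. One common summary for a language model is perplexity, the exponential of the mean loss. The helper below is a sketch: `perplexity` is a hypothetical name, the input is assumed to be a list of per-batch mean cross-entropy values (such as `loss.item()` collected over `test_loader` under `torch.no_grad()`), and the numbers are illustrative only, not results from this run.

```python
import math

def perplexity(batch_losses):
    """exp(mean loss); assumes each entry is a mean per-token cross-entropy in nats."""
    if not batch_losses:
        raise ValueError("need at least one loss value")
    return math.exp(sum(batch_losses) / len(batch_losses))

# Illustrative numbers only, not results from this notebook.
example_losses = [3.2, 3.0, 3.1]
print(f"perplexity: {perplexity(example_losses):.2f}")
```

Averaging per-batch means is an approximation when batches contain different numbers of non-padding tokens; weighting by token count would be more exact.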

**Is fine-tuning the same as giving the model documents as context? No.** They are **different things**.

| | **Fine-Tuning** | **Documents as Context (RAG)** |
| :--- | :--- | :--- |
| **What** | **Retrain model weights** on your data | **Add documents to prompt** at query time |
| **Changes model?** | Yes – weights are updated (with LoRA, in a separate adapter) | No – uses existing model |
| **Best for** | Teaching **new patterns** (like `TOOL:` format) | Adding **knowledge/facts** (like your docs) |
| **Example** | Make GPT-2 output `TOOL:` reliably | Give GPT-2 your manual to answer questions about it |

**For your agent:**
- **Fine-tune** → Make it better at `TOOL:` format.
- **Documents as context** → Let it answer questions about your documents (this alone does not teach it the tool format).

They're **separate techniques** you can combine.

**Does fine-tuning replace what the model already knows? No.**

Fine-tuning **adapts** the model; it **doesn't replace all of its knowledge**.

**What changes:**
- **Adds new patterns** (like your `TOOL:` format)
- **Strengthens existing** relevant knowledge (movie reviews for IMDB)
- **Can weaken unrelated** knowledge (in the extreme case this is called catastrophic forgetting)

**Result:** Model **knows both** old and new, but is **better at the new task**. Like a doctor who **specializes in cardiology** but still knows general medicine, just better at heart issues.

**Does putting your own documents in the prompt change the model? No.** Using your own documents **doesn't change the model's weights at all**.

**How it works:**
1. You **add your documents** to the prompt.
2. The model **reads them** each time.
3. It **answers based on** those documents + its existing knowledge.
4. **Model weights stay unchanged.**

**It's like giving a book to a student for an open-book test.** The student (model) reads it to answer, but doesn't memorize it permanently.
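The "documents in the prompt" flow can be sketched as plain string assembly. This is a minimal illustration, not a retrieval system: `build_prompt`, the prompt wording, the character cap, and the sample document are all made up for the example.

```python
def build_prompt(question: str, documents: list[str], max_chars: int = 2000) -> str:
    """Put documents ahead of the question; the model's weights are never touched."""
    context = "\n\n".join(f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(documents))
    context = context[:max_chars]  # crude length cap; real systems count tokens instead
    return (
        "Answer using the documents below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

docs = ["The device resets when the red button is held for five seconds."]  # made-up text
prompt = build_prompt("How do I reset the device?", docs)
print(prompt)
```

The same documents have to be sent again with every question, which is the open-book part of the analogy: nothing is memorized between calls.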

**Where does the model's knowledge come from? Pretraining.**

**Pretraining** is the initial training on massive data (text, images, code) that **creates all the model's knowledge from scratch**.

**What it does:**
- Takes **random weights**
- Trains on **terabytes of data** (internet, books, etc.)
- Builds the **base knowledge** that fine-tuning and prompting later work with

**Fine-tuning** and **documents** just **adjust** or **use** this pretrained knowledge.

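**Can you pretrain an existing model? Not in the strict sense.**

**Pretraining** = **creating** a model from random weights + massive data.  
**Existing model** = **already pretrained**. Any further training starts from those weights, so it counts as fine-tuning (sometimes called continued pretraining when it uses large unlabeled corpora).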

You can only:
1. **Fine-tune** it (adjust existing knowledge)
2. Use **prompting/RAG** (use existing knowledge)

To pretrain from scratch, you need **billions to trillions of tokens** and **massive compute** (weeks to months on hundreds or thousands of GPUs for LLM-scale models). Not feasible for most individuals.