# PyTorch Tutorial: Large Models and Fine-Tuning

In 2025, you rarely train models from scratch. Instead, you take a massive pre-trained model (like Llama, BERT, or ResNet) and **fine-tune** it on your data. This notebook introduces you to the world of Large Language Models (LLMs) and efficient fine-tuning.

## Learning Objectives
- Load pre-trained models from Hugging Face
- Understand Fine-Tuning vs Training from Scratch
- Learn about Parameter-Efficient Fine-Tuning (PEFT/LoRA)
- Understand Quantization (loading models in 4-bit/8-bit)


In [None]:
import torch
import torch.nn as nn

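# Note: In a real environment, you would install the libraries used below.
# 'accelerate' is needed for device_map="auto"; 'datasets' is used by the training script.
# !pip install transformers peft bitsandbytes accelerate datasets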
print("Ready to explore LLMs!")

## 1. Loading Pre-trained Models

We use the `transformers` library (by Hugging Face) to load state-of-the-art models.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small LLM (e.g., TinyLlama)
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```

## 2. What is Fine-Tuning?

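- **Pre-training**: The model learns general language patterns from a very large text corpus (expensive: typically weeks to months on large GPU clusters).
- **Fine-tuning**: The model continues training on *your* specific dataset to learn a task or domain (comparatively cheap: often hours to days).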

Example: Turning a generic model into a medical assistant.

## 3. Parameter-Efficient Fine-Tuning (PEFT)

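Full fine-tuning of a 7B-parameter model needs far more GPU memory than the weights alone, because gradients and optimizer states must be stored for every parameter. **LoRA (Low-Rank Adaptation)** addresses this.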

Instead of updating all weights ($W$), LoRA freezes the model and adds small trainable adapter matrices ($A$ and $B$):

$$ W_{new} = W_{frozen} + BA $$

For typical ranks this cuts the number of trainable parameters by more than 99%.

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,             # Rank
    lora_alpha=32,   # Scaling factor
    target_modules=["q_proj", "v_proj"], # Where to add adapters
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
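# Example output for a 7B Llama-style model (TinyLlama reports smaller totals;
# exact formatting varies by peft version):
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622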
```
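
As a rough cross-check of the numbers `print_trainable_parameters()` reports, the adapter count can be worked out by hand. The sketch below assumes a Llama-2-7B-like shape (32 decoder layers, hidden size 4096, square `q_proj` and `v_proj` matrices, about 6.74B base parameters). These shapes and the helper name `lora_adapter_params` are assumptions for illustration, not values read from a checkpoint.

```python
def lora_adapter_params(num_layers, in_features, out_features, rank, modules_per_layer):
    """Trainable parameters added by LoRA: each adapted matrix gains
    A (rank x in_features) and B (out_features x rank)."""
    per_matrix = rank * (in_features + out_features)
    return num_layers * modules_per_layer * per_matrix

# Assumed Llama-2-7B-like shape (illustrative, not read from a checkpoint)
NUM_LAYERS = 32
HIDDEN = 4096
BASE_PARAMS = 6.74e9  # approximate base-model size

# r=8 on q_proj and v_proj, i.e. two adapted matrices per layer
trainable = lora_adapter_params(NUM_LAYERS, HIDDEN, HIDDEN, rank=8, modules_per_layer=2)
percent = 100 * trainable / (BASE_PARAMS + trainable)

print(f"trainable params: {trainable:,}")  # trainable params: 4,194,304
print(f"trainable%: {percent:.4f}")
```

Models with grouped-query attention (such as TinyLlama) have a narrower `v_proj`, so their adapter count per layer is smaller than this square-matrix estimate.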

## 4. Quantization (4-bit / 8-bit)

To fit large models on consumer GPUs, we reduce precision from Float32 (32 bits) to Int8 (8 bits) or even 4-bit.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quant_config)
```
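
To make "reducing precision" concrete, here is a minimal sketch of absmax 8-bit quantization in plain Python. It is a deliberately simple scheme chosen for illustration. It is not the algorithm `bitsandbytes` uses internally, and the function names are hypothetical.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.33, 0.0, 0.49, -0.07]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"scale={scale:.5f}, max round-trip error={max_error:.5f}")
```

Each value is stored in 8 bits plus one shared scale, and the round-trip error is at most half a quantization step. Real libraries quantize in small blocks so that one outlier does not inflate the scale for every weight.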

## 5. Understanding LoRA Mathematics in Depth

LoRA is one of the most important techniques for FAANG ML engineers to understand deeply. Let's dive into the mathematics.

### The Core Insight: Low-Rank Decomposition

For a pre-trained weight matrix $W \in \mathbb{R}^{d \times k}$, instead of updating all $d \times k$ parameters, LoRA introduces two smaller matrices:

$$W_{new} = W_{frozen} + \Delta W = W_{frozen} + BA$$

Where:
- $B \in \mathbb{R}^{d \times r}$ (up-projection, from rank $r$ back to $d$)
- $A \in \mathbb{R}^{r \times k}$ (down-projection, from $k$ down to rank $r$)
- $r$ is the **rank** (typically 4, 8, 16, or 32)

**Parameter Savings**:
- Original: $d \times k$ parameters
- LoRA: $(d \times r) + (r \times k) = r(d + k)$ parameters
- For $d = k = 4096$ and $r = 8$: 16.7M → 65K (256x reduction!)
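
The savings formula is easy to tabulate for several ranks. This short sketch (plain Python, with a hypothetical helper name `lora_params`) evaluates $r(d + k)$ for the same $d = k = 4096$ case:

```python
def lora_params(d, k, r):
    """Parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)

d = k = 4096
full = d * k  # parameters in the dense weight matrix

for r in (4, 8, 16, 32):
    n = lora_params(d, k, r)
    print(f"r={r:>2}: {n:>9,} trainable params, {full / n:.0f}x fewer than {full:,}")
```

Doubling the rank doubles the adapter size, so the reduction factor halves each time: the trade-off between capacity and memory is linear in $r$.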

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALayer(nn.Module):
    """
    Implementation of a LoRA adapter layer.
    
    Key insight: We decompose the weight update into two low-rank matrices.
    This dramatically reduces trainable parameters while maintaining performance.
    """
    def __init__(self, original_layer: nn.Linear, rank: int = 8, alpha: float = 16):
        super().__init__()
        
        self.original_layer = original_layer
        self.rank = rank
        self.alpha = alpha  # Scaling factor
        
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        
        # Freeze original weights
        for param in self.original_layer.parameters():
            param.requires_grad = False
        
        # LoRA matrices (trainable)
        # A is initialized with Kaiming, B is initialized to zero
        # This ensures ΔW = BA = 0 at initialization (no change initially)
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        
        # Initialize A with Kaiming
        nn.init.kaiming_uniform_(self.lora_A, a=5**0.5)
        
        # Scaling factor
        self.scaling = self.alpha / self.rank
    
    def forward(self, x):
        # Original forward pass
        original_output = self.original_layer(x)
        
        # LoRA path: x @ A^T @ B^T * scaling
        # (batch, in) @ (in, rank) @ (rank, out) = (batch, out)
        lora_output = F.linear(F.linear(x, self.lora_A), self.lora_B)
        
        return original_output + lora_output * self.scaling

# Demonstration
original = nn.Linear(4096, 4096)
lora_layer = LoRALayer(original, rank=8, alpha=16)

# Count parameters
original_params = sum(p.numel() for p in original.parameters())
trainable_params = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)

print(f"Original parameters: {original_params:,}")
print(f"LoRA trainable parameters: {trainable_params:,}")
print(f"Reduction: {original_params / trainable_params:.1f}x")

### Why Zero Initialization for B?

A critical design choice in LoRA is initializing $B = 0$. This ensures:

1. **At initialization**: $\Delta W = BA = 0$, so the model behaves exactly like the pre-trained model
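2. **During training**: Updates emerge gradually as $B$ moves away from zero
3. **Stability**: Training starts from the pre-trained function, so the first steps are not spent undoing a random perturbation

If both matrices were initialized randomly, $\Delta W$ would be non-zero from the first step and would perturb the model's outputs before any training had happened. The frozen weights themselves would still be intact, but the effective weights $W + BA$ would no longer match the pre-trained model.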
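
The effect of the initialization can be checked on a toy example. The sketch below uses plain Python lists and tiny, arbitrary dimensions so the arithmetic is visible without any library. The helper names are hypothetical and the $\alpha / r$ scaling is omitted for brevity.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

random.seed(0)
d, k, r = 4, 3, 2
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]       # frozen, d x k
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]       # r x k, random init
B_zero = [[0.0] * r for _ in range(d)]                               # d x r, zero init
B_rand = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # d x r, random init
x = [[random.gauss(0, 1)] for _ in range(k)]                         # input column, k x 1

base_out = matmul(W, x)
out_zero_init = matmul(add(W, matmul(B_zero, A)), x)
out_random_init = matmul(add(W, matmul(B_rand, A)), x)

print("B = 0 matches base model:     ", out_zero_init == base_out)
print("random B matches base model:  ", out_random_init == base_out)
```

With $B = 0$ the adapted layer reproduces the base output exactly, while a random $B$ changes the output before a single gradient step.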

## 6. QLoRA: Quantized LoRA

QLoRA combines quantization with LoRA for even more memory savings. Key innovations:

### 4-bit NormalFloat (NF4)
Instead of standard 4-bit integers, QLoRA uses a special data type optimized for normally distributed weights:
- Theoretically optimal for Gaussian distributions
- Better precision than INT4 for neural network weights

### Double Quantization
Quantizes the quantization constants themselves:
- First quantization: weights to 4-bit
- Second quantization: the per-block scaling constants to 8-bit
- Saves about 0.37 bits per parameter, roughly 0.3 GB for a 7B model
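
The saving from double quantization is a short piece of arithmetic. The sketch below takes the block sizes reported in the QLoRA paper (64 weights per constant, 256 constants per second-level constant) as assumptions. Other configurations would give different numbers.

```python
# Block sizes as reported in the QLoRA paper (treated here as assumptions)
WEIGHT_BLOCK = 64      # weights sharing one quantization constant
CONSTANT_BLOCK = 256   # first-level constants sharing one second-level constant

# Without double quantization: one FP32 constant per block of 64 weights
overhead_plain = 32 / WEIGHT_BLOCK

# With double quantization: 8-bit constants plus FP32 second-level constants
overhead_double = 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)

saved_bits_per_param = overhead_plain - overhead_double
saved_gb_7b = saved_bits_per_param * 7e9 / 8 / 1e9  # bits -> bytes -> GB

print(f"overhead without double quantization: {overhead_plain:.3f} bits/param")
print(f"overhead with double quantization:    {overhead_double:.3f} bits/param")
print(f"saved: {saved_bits_per_param:.3f} bits/param, about {saved_gb_7b:.2f} GB for 7B params")
```

Under these assumptions the saving is about a third of a gigabyte for a 7B model, and it scales linearly with parameter count.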

### Paged Optimizers
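Uses NVIDIA unified memory to page optimizer states out to CPU RAM when the GPU runs short of memory, which avoids out-of-memory errors during memory spikes (for example, long sequences with gradient checkpointing).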

In [None]:
# QLoRA configuration example
qlora_config_example = """
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 (optimal for weights)
    bnb_4bit_compute_dtype=torch.bfloat16,   # Compute in bfloat16
    bnb_4bit_use_double_quant=True,          # Double quantization
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Add LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
"""

print("QLoRA Configuration Example:")
print(qlora_config_example)

## 7. Memory Calculations for Fine-Tuning

Understanding memory requirements is CRITICAL for production ML. Let's calculate exactly what you need.

In [None]:
def calculate_memory_requirements(
    model_params_billions: float,
    precision: str = "fp32",  # fp32, fp16, bf16, int8, int4
    training: bool = True,
    lora_rank: int = 0,  # 0 = full fine-tuning
    batch_size: int = 1,
    sequence_length: int = 2048,
    hidden_size: int = 4096,
    num_layers: int = 32,  # Llama-2-7B has 32 decoder layers
):
    """
    Estimate GPU memory requirements for model training/inference.

    Key insight: Memory is dominated by:
    1. Model weights
    2. Optimizer states (Adam keeps two FP32 values per trainable parameter)
    3. Gradients
    4. Activations (for backprop)

    These are rough estimates, not exact measurements.
    """

    bytes_per_param = {
        "fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5
    }

    param_bytes = bytes_per_param[precision]
    params = model_params_billions * 1e9

    # Model weights
    model_memory_gb = (params * param_bytes) / 1e9

    if not training:
        # Inference needs model weights + KV cache.
        # KV cache: keys and values (2 tensors) per layer, assumed stored in 16-bit (2 bytes)
        kv_cache_gb = (2 * num_layers * batch_size * sequence_length * hidden_size * 2) / 1e9
        total = model_memory_gb + kv_cache_gb
        return {
            "model_weights_gb": round(model_memory_gb, 2),
            "kv_cache_gb": round(kv_cache_gb, 2),
            "total_gb": round(total, 2),
        }

    # Training memory
    if lora_rank > 0:
        # LoRA: assume adapters on the 4 attention projections (q, k, v, o),
        # each adding rank * (in + out) = rank * 2 * hidden_size parameters
        trainable_params = num_layers * 4 * lora_rank * 2 * hidden_size

        optimizer_memory_gb = (trainable_params * 8) / 1e9  # Adam needs 8 bytes/param
        # Assume adapter gradients are kept in 16-bit, even if the base model is quantized
        gradient_memory_gb = (trainable_params * 2) / 1e9
    else:
        # Full fine-tuning
        optimizer_memory_gb = (params * 8) / 1e9  # Adam: momentum + variance
        gradient_memory_gb = (params * param_bytes) / 1e9

    # Activations (rough estimate): assume gradient checkpointing, so roughly one
    # 16-bit hidden-state tensor is kept per layer
    activation_memory_gb = (batch_size * sequence_length * hidden_size * 2 * num_layers) / 1e9

    total = model_memory_gb + optimizer_memory_gb + gradient_memory_gb + activation_memory_gb

    return {
        "model_weights_gb": round(model_memory_gb, 2),
        "optimizer_states_gb": round(optimizer_memory_gb, 2),
        "gradients_gb": round(gradient_memory_gb, 2),
        "activations_gb": round(activation_memory_gb, 2),
        "total_gb": round(total, 2),
    }

# Compare different configurations
print("=" * 60)
print("Memory Requirements for Llama-2-7B")
print("=" * 60)

configs = [
    ("Full FT (FP32)", {"precision": "fp32", "lora_rank": 0}),
    ("Full FT (FP16)", {"precision": "fp16", "lora_rank": 0}),
    ("LoRA (FP16)", {"precision": "fp16", "lora_rank": 16}),
    ("QLoRA (INT4)", {"precision": "int4", "lora_rank": 16}),
]

for name, config in configs:
    result = calculate_memory_requirements(7, **config)
    print(f"\n{name}:")
    for k, v in result.items():
        print(f"  {k}: {v} GB")

print("\n" + "=" * 60)
print("GPU Recommendations:")
print("=" * 60)
print("Full Fine-Tuning (7B): 4x A100-80GB or 8x A100-40GB")
print("LoRA Fine-Tuning (7B): 1x A100-40GB or 2x RTX 4090")
print("QLoRA Fine-Tuning (7B): 1x RTX 4090 (24GB) or 1x RTX 3090")
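
The calculator above treats the weights as stored in a single precision. Mixed-precision training with Adam commonly keeps extra FP32 copies, which is why a figure of about 16 bytes per parameter is often quoted for full fine-tuning. The sketch below spells out that accounting. The recipe and the helper name are assumptions, and real frameworks differ in which copies they keep.

```python
# Bytes per parameter under a common mixed-precision Adam recipe (assumed)
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}

def full_finetune_state_gb(params_billions):
    """Memory for weights, gradients, and optimizer state; activations excluded."""
    # (params_billions * 1e9 params) * bytes / 1e9 bytes-per-GB
    return params_billions * sum(BYTES_PER_PARAM.values())

for size in (1.1, 7, 13):
    print(f"{size:>4}B params: about {full_finetune_state_gb(size):.0f} GB before activations")
```

For a 7B model this gives 112 GB before activations, which is why full fine-tuning is usually spread across several GPUs.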

## 8. Other PEFT Techniques

LoRA isn't the only option. Here's a comparison of PEFT methods:

### Comparison of PEFT Methods

| Method | Parameters | Memory | Quality | Use Case |
|--------|-----------|--------|---------|----------|
| **Full Fine-Tuning** | 100% | Highest | Best | Large compute budget |
| **LoRA** | ~0.1% | Low | Near-SOTA | General purpose |
| **QLoRA** | ~0.1% | Lowest | Good | Consumer GPUs |
| **Prefix Tuning** | <0.1% | Very Low | Moderate | NLU tasks |
| **Prompt Tuning** | <0.01% | Minimal | Lower | Simple classification |
| **Adapter Layers** | ~2% | Medium | Good | Multi-task learning |

### When to Use What

1. **LoRA**: Default choice for most fine-tuning tasks
2. **QLoRA**: When GPU memory is limited (consumer hardware)
3. **Prefix Tuning**: When you need multiple tasks from one model
4. **Full Fine-Tuning**: When you have the compute and need best quality
5. **Adapter Layers**: When adding new capabilities incrementally

## 9. Production Fine-Tuning Script Template

In [None]:
fine_tuning_script = """
# Production Fine-Tuning Script with QLoRA (4-bit base model + LoRA adapters)
# Tested on: RTX 4090 (24GB), Llama-2-7B

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset

# 1. Load Model with Quantization
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# 2. Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# 3. Training Arguments
training_args = TrainingArguments(
    output_dir="./lora_output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch = 16
    learning_rate=2e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,  # Match bnb_4bit_compute_dtype=torch.bfloat16
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    optim="paged_adamw_8bit",  # Memory-efficient optimizer
    gradient_checkpointing=True,  # Trade compute for memory
)

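# 4. Train (train_dataset is a placeholder for your own tokenized dataset)
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
# )
# trainer.train()

# 5. Save LoRA Weights (small file!)
# model.save_pretrained("./lora_adapter")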
"""

print("Production Fine-Tuning Script:")
print(fine_tuning_script)
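
The batch settings in the template determine how many optimizer steps a run takes. The sketch below derives them in plain Python. The dataset size and GPU count are hypothetical, and the result is an approximation: the `Trainer`'s own step accounting may differ slightly.

```python
import math

def training_schedule(num_examples, per_device_batch, grad_accum, num_gpus, epochs, warmup_ratio):
    """Derive step counts from batch settings (the last partial batch is kept)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, total_steps, warmup_steps

# num_examples and num_gpus are hypothetical; the other values mirror the template
effective, total, warmup = training_schedule(
    num_examples=10_000, per_device_batch=4, grad_accum=4,
    num_gpus=1, epochs=3, warmup_ratio=0.03,
)
print(f"effective batch: {effective}, optimizer steps: {total}, warmup steps: {warmup}")
```

Knowing the step count up front helps when choosing `save_steps` and `logging_steps`, since both are expressed in optimizer steps.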

## 10. FAANG Interview Questions

### Q1: Explain how LoRA reduces memory requirements while maintaining model quality.

**Answer**: LoRA freezes the pre-trained weights $W$ and introduces a low-rank decomposition for the weight update: $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$.

Memory savings come from:
1. **Fewer trainable parameters**: Instead of $d \times k$ parameters, we only train $r(d + k)$ where $r \ll \min(d, k)$
2. **No gradient storage for frozen weights**: Only LoRA parameters need gradients
3. **Reduced optimizer states**: Adam stores momentum/variance only for trainable params

Quality is maintained because:
1. Pre-trained weights contain most of the knowledge
2. Task-specific adaptation is low-rank in nature
3. The scaling factor $\alpha/r$ controls adaptation strength

---

### Q2: What are the key hyperparameters in LoRA and how do you tune them?

**Answer**:
1. **Rank (r)**: Higher = more capacity, more memory. Start with 8-16, increase if underfitting.
2. **Alpha (α)**: Scaling factor. Common: $\alpha = 2r$. Higher = stronger adaptation.
3. **Target modules**: Which layers to adapt. Attention layers (q, k, v, o) are most impactful.
4. **Dropout**: 0.05-0.1 for regularization.

Tuning strategy:
- Start with r=8, α=16, all attention layers
- If quality is low: increase r, add more target modules
- If overfitting: increase dropout, decrease r

---

### Q3: How does quantization affect model quality? What's the trade-off between INT4 and INT8?

**Answer**:

| Precision | Memory vs. FP32 | Typical Quality Loss | Speed |
|-----------|-----------------|----------------------|-------|
| FP16/BF16 | 2x smaller | Minimal | Usually faster than FP32 on GPUs |
| INT8 | 4x smaller | Roughly ~1% perplexity | Often faster on CPU; on GPU it depends on the kernels |
| INT4 | 8x smaller | Roughly ~3-5% perplexity | Depends on hardware and kernels |

Key considerations:
- **INT8**: Usually small quality loss, 4x memory savings over FP32 (2x over FP16). A safe default.
- **INT4**: More quality degradation, but NF4 (NormalFloat4) reduces this significantly.
- **Calibration**: Post-training quantization needs representative data for accurate scaling.
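
The memory column is simple arithmetic on bits per weight. A small sketch for a 7B-parameter model, counting weights only (quantization constants, activations, and the KV cache are excluded; the helper name is hypothetical):

```python
BITS_PER_WEIGHT = {"fp32": 32, "fp16/bf16": 16, "int8": 8, "4-bit": 4}

def weight_memory_gb(params_billions, bits):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # (params_billions * 1e9 params) * (bits / 8 bytes) / 1e9
    return params_billions * bits / 8

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>9}: {weight_memory_gb(7, bits):5.1f} GB for a 7B-parameter model")
```

At 4 bits the weights of a 7B model take about 3.5 GB, which is what makes single consumer GPUs viable for QLoRA.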

---

### Q4: What's the difference between full fine-tuning, LoRA, and prompt tuning?

**Answer**:

| Method | Updates | Use Case |
|--------|---------|----------|
| **Full Fine-Tuning** | All weights | Max quality, large compute budget |
| **LoRA** | Low-rank adapters (~0.1%) | Standard fine-tuning with limited compute |
| **Prompt Tuning** | Prepended embeddings only | Simple classification, multiple tasks |

Decision framework:
1. **Production with quality focus**: Full fine-tuning if budget allows, LoRA otherwise
2. **Consumer GPU**: QLoRA (4-bit + LoRA)
3. **Many tasks, one model**: Prefix tuning or prompt tuning

---

### Q5: How do you handle catastrophic forgetting during fine-tuning?

**Answer**: Catastrophic forgetting occurs when the model "forgets" pre-trained knowledge while learning new tasks.

Mitigation strategies:
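1. **Lower learning rate**: about 1e-5 to 5e-5 for full fine-tuning (roughly 10x below typical pre-training peak rates); LoRA adapters usually tolerate 1e-4 to 2e-4 because the base weights stay frozen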
2. **LoRA/PEFT**: Freeze most weights, limiting what can be "forgotten"
3. **Regularization**: L2 penalty toward original weights
4. **Rehearsal**: Mix in pre-training data during fine-tuning
5. **Learning rate warmup**: Gradual increase prevents early weight destruction

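LoRA limits forgetting because the original weights stay frozen, and removing the adapter recovers the base model exactly. The adapted model ($W + BA$) can still lose general ability if it is trained too aggressively on narrow data.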
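
Learning rate warmup, listed among the mitigation strategies, is easy to picture as a function of the step number. The sketch below implements linear warmup followed by cosine decay in plain Python. It is written in the spirit of common schedulers rather than as a copy of any library's implementation, and the run length and peak rate are hypothetical.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_steps):
    """Linear warmup from 0 to base_lr, then cosine decay toward 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical run: 1,000 steps, 30 warmup steps, peak learning rate 2e-4
for step in (0, 15, 30, 500, 1000):
    print(f"step {step:>4}: lr = {lr_at_step(step, 1000, 2e-4, 30):.2e}")
```

The first updates use a small fraction of the peak rate, so early noisy gradients cannot move the weights far from the pre-trained solution.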

## Key Takeaways

1. **Don't train from scratch**: Use pre-trained models.
2. **Fine-tuning**: Adapts a general model to your specific data.
3. **LoRA**: Allows fine-tuning huge models on small GPUs by using low-rank decomposition.
4. **QLoRA**: Combines 4-bit quantization with LoRA for even more memory savings.
5. **Memory math**: Understanding GPU memory requirements is critical for production ML.
6. **PEFT landscape**: Choose the right method based on compute budget and quality requirements.