# PyTorch Tutorial: Large Models and Fine-Tuning

In 2025, you rarely train models from scratch. Instead, you take a large pre-trained model (like Llama, BERT, or ResNet) and **fine-tune** it on your data. This notebook introduces Large Language Models (LLMs) and efficient ways to fine-tune them.

## Learning Objectives
- Load pre-trained models from Hugging Face
- Understand Fine-Tuning vs Training from Scratch
- Learn about Parameter-Efficient Fine-Tuning (PEFT/LoRA)
- Understand Quantization (loading models in 4-bit/8-bit)


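```python
import torch
import torch.nn as nn

# Note: the examples in this notebook also need 'transformers', 'peft', 'bitsandbytes',
# and 'accelerate' (used when loading quantized models). Install them with:
# !pip install transformers peft bitsandbytes accelerate
print("Ready to explore LLMs!")
```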

## 1. Loading Pre-trained Models

We use the `transformers` library (by Hugging Face) to load state-of-the-art models.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small LLM (e.g., TinyLlama)
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```

## 2. What is Fine-Tuning?

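- **Pre-training**: The model learns general language patterns from a very large text corpus (expensive; typically weeks to months on large GPU clusters).
- **Fine-tuning**: The model continues training on *your* specific dataset to learn a task (comparatively cheap; often hours on one or a few GPUs).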

Example: Turning a generic model into a medical assistant.
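
As a rough illustration of the idea, the sketch below shows one common variant, head-only fine-tuning, on a toy network. The `backbone` and `head` modules and the random data are hypothetical stand-ins (a real backbone would be loaded from a checkpoint); the point is only that the frozen weights stay fixed while the task-specific part learns.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pre-trained network: a "backbone" we keep frozen
# plus a freshly initialised task head. Real backbones come from a checkpoint.
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
head = nn.Linear(32, 2)

# Freeze the backbone so only the head receives gradient updates.
for param in backbone.parameters():
    param.requires_grad = False

# Toy "task" data: 64 random feature vectors with binary labels.
inputs = torch.randn(64, 16)
labels = (inputs[:, 0] > 0).long()

# The optimizer only sees the head's parameters.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

backbone_before = backbone[0].weight.clone()
losses = []
for step in range(50):
    optimizer.zero_grad()
    logits = head(backbone(inputs))
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Full fine-tuning would instead leave every parameter trainable, usually with a small learning rate so the pre-trained knowledge is not overwritten.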

## 3. Parameter-Efficient Fine-Tuning (PEFT)

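Fully fine-tuning a 7B-parameter model requires far more GPU memory than the weights alone, because gradients and optimizer state must be stored for every parameter. That typically puts it beyond a single consumer GPU. **LoRA (Low-Rank Adaptation)** addresses this.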

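Instead of updating all weights ($W$), LoRA freezes the model and adds a pair of small trainable adapter matrices, $B$ ($d \times r$) and $A$ ($r \times k$). Their product has the same shape as $W$ ($d \times k$) but a much smaller rank $r$:

$$ W_{new} = W_{frozen} + \frac{\alpha}{r} (B \times A) $$

Here $\alpha$ is the `lora_alpha` scaling factor. Because only $A$ and $B$ are trained, the number of trainable parameters typically drops by more than 99%.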
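
To see where a reduction of that size comes from, here is a small back-of-the-envelope calculation. It assumes a single square weight matrix of size 4096 × 4096 (a hidden size typical of 7B Llama-style models, used here only as an illustration) and rank 8; the helper `lora_param_counts` is a hypothetical name introduced for this sketch.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # every entry of W is trainable
    lora = rank * (d_out + d_in)   # one (d_out x rank) and one (rank x d_in) adapter
    return full, lora


# Assumed sizes: a 4096 x 4096 projection with rank-8 adapters.
full, lora = lora_param_counts(4096, 4096, 8)
reduction = 1 - lora / full
print(f"full: {full:,}  lora: {lora:,}  reduction: {reduction:.2%}")
# prints: full: 16,777,216  lora: 65,536  reduction: 99.61%
```

The saving for a whole model is usually even larger, since adapters are attached to only a few of its weight matrices.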

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,             # Rank
    lora_alpha=32,   # Scaling factor
    target_modules=["q_proj", "v_proj"], # Where to add adapters
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
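model.print_trainable_parameters()
# For a 7B Llama-style model this prints roughly:
#   "trainable params: 4.2M || all params: 6.7B || trainable%: 0.06"
# The 1.1B TinyLlama loaded above will report different counts.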
```
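
To make the mechanics more concrete, the following is a minimal sketch of what a LoRA-adapted linear layer might look like in plain PyTorch. It is not PEFT's actual implementation: the class name `LoRALinear`, the 0.01 initialisation scale, and the 64 × 64 layer size are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Hypothetical minimal LoRA wrapper around a frozen nn.Linear."""

    def __init__(self, base: nn.Linear, r: int = 8, lora_alpha: int = 32):
        super().__init__()
        self.base = base
        for param in self.base.parameters():
            param.requires_grad = False  # freeze the pre-trained weights
        # A starts small and random, B starts at zero, so the adapter
        # contributes nothing until training updates B.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = lora_alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank update.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


torch.manual_seed(0)
base = nn.Linear(64, 64)
layer = LoRALinear(base, r=8, lora_alpha=32)

x = torch.randn(4, 64)
same_at_init = torch.allclose(layer(x), base(x))

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(same_at_init, trainable, total)
```

Running this prints `True 1024 5184`: the wrapped layer initially behaves exactly like the original, and only the two adapter matrices (2 × 8 × 64 values) are trainable.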

## 4. Quantization (4-bit / 8-bit)

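To fit large models on consumer GPUs, we store the weights at lower precision: 8-bit or even 4-bit values instead of Float32 (32 bits). Memory for the weights shrinks roughly in proportion to the bit width, usually at a small cost in accuracy.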
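
A quick estimate shows why this matters. The sketch below counts only the memory needed to hold the weights of a 7B-parameter model (an assumed size; activations, caches, and any per-block quantization metadata are ignored), using a hypothetical helper `weight_memory_gb`.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate memory (in GB, 1 GB = 1e9 bytes) for the weights alone."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumed model size: 7 billion parameters.
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(7e9, dtype):5.1f} GB")
```

Under these assumptions the weights take about 28 GB in Float32, 14 GB in Float16, 7 GB in 8-bit, and 3.5 GB in 4-bit.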

```python
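from transformers import BitsAndBytesConfig

# Reuses torch, AutoModelForCausalLM, and model_name from the cells above.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the weights in 4-bit
    bnb_4bit_compute_dtype=torch.float16   # run the matrix multiplications in float16
)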

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quant_config)
```
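
To give a feel for what reducing precision does to the numbers themselves, here is a toy absmax 8-bit quantizer in plain Python. This is a deliberately simplified scheme, not the one `bitsandbytes` uses (its 4-bit and 8-bit formats are more elaborate), and the function names and sample weights are made up for the illustration.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


# Made-up weights for illustration.
weights = [0.12, -0.50, 0.33, 0.07, -0.21]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q)
print(f"max round-trip error: {max_error:.4f}")
```

Each weight now needs only one byte plus a shared scale, and the round-trip error is bounded by half a quantization step (`scale / 2`).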

## Key Takeaways

1. **Don't train from scratch**: Use pre-trained models.
2. **Fine-tuning**: Adapts a general model to your specific data.
3. **LoRA**: Allows fine-tuning huge models on small GPUs.
4. **Quantization**: Reduces memory usage significantly.