
# 🚀 Day 10 – Fine-tuning, Quantization & Performance Optimization

This notebook covers **advanced concepts in LLM optimization** including **Fine-tuning, PEFT (LoRA & QLoRA), Quantization, and Performance Optimization techniques**.  
These notes are designed for **easy revision** and **interview preparation**.

---

## 📌 Fine-tuning in LLMs

**What is Fine-tuning?**  
Fine-tuning means adapting a **pre-trained LLM** (like LLaMA, GPT, Gemini) to a **specific domain/task** by training it further on domain-specific data.

### 🔑 Key Points:
- Full-parameter fine-tuning = all model parameters are updated, so the GPU must hold weights, gradients, and optimizer state (very expensive, GPU heavy).
- Allows models to adapt to **new data** not present during pre-training.
- Common in **domain-specific LLM applications** (e.g., medical chatbots, financial assistants).
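
To make "GPU heavy" concrete, here is a rough back-of-the-envelope estimate. It assumes mixed-precision training with Adam (about 16 bytes per parameter) and ignores activations; the byte counts are assumptions for illustration, not measurements.

```python
def full_finetune_memory_gb(num_params: float) -> float:
    """Rough memory for weights, gradients and Adam state (activations excluded)."""
    bytes_per_param = (
        2    # FP16 weights
        + 2  # FP16 gradients
        + 4  # FP32 master copy of the weights
        + 4  # Adam first moment (FP32)
        + 4  # Adam second moment (FP32)
    )
    return num_params * bytes_per_param / 1e9

# A 7B-parameter model under these assumptions
print(f"{full_finetune_memory_gb(7e9):.0f} GB")
```

Under these assumptions a 7B model needs roughly 112 GB before activations, which is why full fine-tuning usually needs multiple GPUs.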

### Example (Hugging Face - Pseudocode)
```python
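from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

# The "-hf" repo holds the Transformers-format weights (gated; needs approved access).
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder: a tokenized dataset with input_ids, attention_mask and labels.
my_dataset = ...

training_args = TrainingArguments(
    output_dir="./finetuned-model",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=my_dataset,
)

trainer.train()                          # updates all model parameters
trainer.save_model("./finetuned-model")  # write the fine-tuned weights to disk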
```

---

## 📌 PEFT (Parameter Efficient Fine-Tuning)

Instead of tuning **all parameters**, PEFT freezes the base model and trains only a **small number of parameters** (a subset of existing weights or small added modules) → faster & cheaper.

### 🔑 Techniques:
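1. **LoRA (Low-Rank Adaptation)**
   - Freezes the pre-trained weight `W` and adds a trainable low-rank update `B·A` (rank `r` much smaller than the layer width), typically on the attention projections.
   - Only a small % of parameters is trained (saves compute and optimizer memory); the update can be merged into `W` after training, so inference adds no latency.

2. **QLoRA (Quantized LoRA)**
   - Like LoRA, but the frozen base model is stored in **4-bit (NF4)** while the adapters train in 16-bit → saves **GPU memory**.
   - Example: Fine-tuning LLaMA 13B on a single 24GB GPU.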
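
The LoRA idea (frozen weight plus a trainable low-rank product) can be sketched in plain NumPy. This is a minimal illustration, not the `peft` library implementation; the sizes, `alpha`, and the names `A`, `B`, `lora_forward` are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8   # illustrative sizes, not from a real model

W = rng.normal(size=(d_out, d_in))     # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))               # trainable, zero init so training starts at W

def lora_forward(x):
    # base path + scaled low-rank update: x W^T + (alpha / r) * x A^T B^T
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(2, d_in))
assert np.allclose(lora_forward(x), x @ W.T)  # with B = 0 the adapter is a no-op

trainable = A.size + B.size
print(f"trainable: {trainable} of {W.size} ({trainable / W.size:.1%})")

# Pretend training changed B, then merge the adapter into a single weight matrix.
B = rng.normal(size=(d_out, r))
W_merged = W + (alpha / r) * B @ A
assert np.allclose(lora_forward(x), x @ W_merged.T)
```

The last assertion shows why a merged LoRA model adds no inference cost: the adapter collapses into one matrix of the original shape.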

---

## 📌 Quantization

**Definition:** Storing weights (and optionally activations) in lower precision (FP16, INT8, INT4) instead of high precision (FP32).  
This reduces **memory usage** and, on hardware and kernels that support the low-precision format, **inference time**.

### 🔑 Quantization Types:
- **Weight-only** → Only weights quantized (activations stay full precision).
- **Dynamic** → Weights pre-quantized, activations quantized on-the-fly.
- **Static** → Both weights + activations quantized after calibration.
- **QAT (Quantization Aware Training)** → Simulates quantization during training.
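
As a concrete instance of weight-only quantization, here is a minimal NumPy sketch of symmetric per-tensor INT8 ("absmax") quantization. The function names are made up for this example, and production libraries typically quantize per channel or per block rather than per tensor.

```python
import numpy as np

def quantize_absmax_int8(w):
    """Symmetric per-tensor quantization: map [-max|w|, max|w|] onto [-127, 127]."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original weights
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_absmax_int8(w)
w_hat = dequantize(q, scale)

max_err = np.abs(w - w_hat).max()
print("size ratio:", q.nbytes / w.nbytes)           # INT8 vs FP32 storage
print("max error within half a step:", max_err <= scale / 2 + 1e-6)
```

The rounding error per weight is bounded by half a quantization step, which is the source of the accuracy trade-off.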

### Example (Hugging Face Quantization)
```python
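import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Requires the bitsandbytes package (typically with a CUDA GPU).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.float16,  # matmuls run in FP16 after dequantization
)

model_name = "meta-llama/Llama-2-7b-hf"    # Transformers-format repo (gated)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",                     # place layers on available devices
)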
```

### Trade-offs:
- ✅ Smaller model size, often faster inference (depends on hardware/kernel support).
- ❌ Possible drop in accuracy, usually larger at lower bit widths.
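
The size side of the trade-off is simple arithmetic. The sketch below counts weight storage only and ignores the small overhead of quantization constants (scales), so treat the numbers as approximate.

```python
BITS = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

def weight_memory_gb(num_params: float, bits: int) -> float:
    # bits per parameter -> bytes -> gigabytes
    return num_params * bits / 8 / 1e9

for size_name, num_params in (("7B", 7e9), ("13B", 13e9)):
    row = ", ".join(
        f"{name}: {weight_memory_gb(num_params, bits):.1f} GB"
        for name, bits in BITS.items()
    )
    print(f"{size_name} -> {row}")
```

For example, a 13B model needs about 26 GB of weights in FP16 but about 6.5 GB in 4-bit, which is what makes a single 24GB GPU workable.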

---

## 📌 Performance Optimization in LLMs

1. **Batching**  
   - Process multiple requests together → increases throughput.  
   - Trade-off: Higher latency for individual requests.

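2. **KV Caching**  
   - Store the Key & Value projections of already-processed tokens, so each decoding step only computes them for the new token.  
   - Speeds up token generation massively by avoiding recomputation over the whole prefix.  
   - Uses more memory: the cache grows linearly with sequence length and batch size.

3. **Fused Kernels (e.g., FlashAttention)**  
   - Combine multiple GPU operations into one kernel.  
   - Reduces memory I/O (FlashAttention never writes the full attention matrix to GPU main memory), boosts speed.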
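
KV caching can be illustrated with a toy single-head attention layer in NumPy. This is a sketch under simplifying assumptions (one layer, one head, random weights, random vectors standing in for token embeddings); it only shows that caching gives the same outputs while projecting each token once.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # head dimension (illustrative)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attend(q, K, V):
    # single-query attention over all positions seen so far
    return softmax(K @ q / np.sqrt(d)) @ V

tokens = rng.normal(size=(5, d))         # stand-in for token embeddings

# Without a cache: re-project every previous token at each step.
no_cache = []
for t in range(len(tokens)):
    ctx = tokens[: t + 1]
    K, V = ctx @ Wk, ctx @ Wv
    no_cache.append(attend(tokens[t] @ Wq, K, V))

# With a cache: project only the new token and append to the cache.
k_cache, v_cache, with_cache = [], [], []
for x in tokens:
    k_cache.append(x @ Wk)
    v_cache.append(x @ Wv)
    with_cache.append(attend(x @ Wq, np.array(k_cache), np.array(v_cache)))

assert np.allclose(no_cache, with_cache)  # same outputs, far fewer projections
```

The uncached loop performs a number of key/value projections that grows quadratically with sequence length, while the cached loop performs one per token at the cost of keeping `k_cache` and `v_cache` in memory.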

---

## 📌 Interview Style Q&A

### Q1. What is Fine-tuning?  
👉 Fine-tuning means adapting a pre-trained LLM on domain/task-specific data. It updates model parameters to specialize in a new context.

---

### Q2. What are PEFT techniques?  
👉 Methods like LoRA & QLoRA that reduce GPU cost by training only a small number of parameters while the base model stays frozen.

---

### Q3. Difference between LoRA and QLoRA?  
- **LoRA** → Low-Rank Adaptation; the frozen base model stays in FP16/FP32.  
- **QLoRA** → LoRA on top of a 4-bit quantized frozen base model; much more memory-efficient.

---

### Q4. Difference between Fine-tuning and Quantization?  
- **Fine-tuning** → Updating parameters to adapt the model to new tasks.  
- **Quantization** → Storing weights in lower precision to make inference faster & more memory-efficient, without changing what the model was trained to do.

---

### Q5. Difference between RAG and a plain LLM?  
- **LLM** → Generates answers only from knowledge stored in its weights during training.  
- **RAG** → Retrieves relevant external documents at query time and passes them to the LLM as context for the response.
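
The retrieve-then-generate flow can be sketched with a toy retriever. Everything here is hypothetical: the `DOCS` list, the word-overlap scoring, and the prompt template are assumptions for illustration. Real pipelines usually rank by embedding similarity, and the final prompt would be sent to an LLM (not called here).

```python
import re

DOCS = [  # hypothetical knowledge base
    "LoRA adds trainable low-rank matrices to a frozen model.",
    "KV caching stores key and value projections of past tokens.",
    "Quantization stores weights in lower precision such as INT8.",
]

def words(text):
    # lowercase word set, punctuation stripped
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, docs, k=1):
    # rank documents by the number of words shared with the question
    q = words(question)
    return sorted(docs, key=lambda d: len(q & words(d)), reverse=True)[:k]

def build_prompt(question, docs):
    context = "\n".join(retrieve(question, docs))
    return f"Answer using the context.\n\nContext:\n{context}\n\nQuestion: {question}"

question = "What does KV caching store?"
prompt = build_prompt(question, DOCS)
print(prompt)
```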

---

### Q6. Difference between Gen AI and Agentic AI?  
- **Gen AI** → Creates new content (text, images, audio).  
- **Agentic AI** → Goal-oriented, uses planning, tools, memory, reasoning.

---

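### Q7. Difference between ReLU and GELU?  
- **ReLU** → `max(0, x)`; cheap to compute, outputs exactly zero for all negative inputs.  
- **GELU** → `x·Φ(x)` (Φ = standard normal CDF); a smooth curve used in many transformers (e.g., BERT, GPT-2) that lets small negative values through instead of zeroing them.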
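
A quick way to see the difference is to evaluate both functions at a few points. This sketch uses the exact GELU definition via `math.erf`; many frameworks also offer a tanh-based approximation, which is not shown here.

```python
import math

def relu(x: float) -> float:
    return max(0.0, x)

def gelu(x: float) -> float:
    # exact GELU: x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

For negative inputs ReLU is flat at zero, while GELU dips slightly below zero and approaches zero smoothly, so it still passes a gradient there.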

---

## 📌 Resume Keywords
- Fine-tuning with **LoRA & QLoRA** on LLaMA models.  
- Quantization using **BitsAndBytes (4-bit, 8-bit)**.  
- Performance optimized inference using **KV caching & FlashAttention**.  
- RAG pipelines using **LangChain & LlamaIndex**.

---

## ✅ Practical Notes
- **Restart the Google Colab runtime** after running large models to release GPU memory.  
- Use **Hugging Face** (model hub and libraries) and **Groq** (hosted inference API) for serving large LLMs.  
- For interviews: focus on **theory differences + practical trade-offs**.

---
