# 🚀 Llama 3.2 1B: SFT + DPO Training Pipeline

## Full Training Pipeline - Just Click "Run All"!

### ⚙️ Prerequisites:
1. ✅ Runtime: **GPU (T4)** - Runtime → Change runtime type → GPU
2. ✅ Upload project folder to: `MyDrive/llama32-mcq-cot/`
3. ✅ Get tokens ready:
   - HuggingFace token: https://huggingface.co/settings/tokens
   - Wandb token: https://wandb.ai/authorize

### 📊 Expected Timeline:
- Setup: ~10 minutes
- SFT training: ~3-4 hours
- DPO training: ~2-3 hours
- Evaluation: ~20-30 minutes
- **Total: ~6-8 hours**

### 🎯 What You'll Get:
- Base model accuracy: ~40-50%
- SFT model accuracy: ~55-65%
- DPO model accuracy: ~57-68%

---

## Step 1: Mount Google Drive

In [None]:
from google.colab import drive
import os

print("📁 Mounting Google Drive...")
drive.mount('/content/drive')
print("✓ Drive mounted successfully!\n")

## Step 2: Navigate to Project Directory

In [None]:
import os
from pathlib import Path

PROJECT_DIR = '/content/drive/MyDrive/llama32-mcq-cot'

print(f"📂 Navigating to: {PROJECT_DIR}")
os.chdir(PROJECT_DIR)

# Verify
print(f"✓ Current directory: {os.getcwd()}")
print("\n📄 Project files:")
!ls -la

# Verify required directories exist
required_dirs = ['src', 'configs']
for dir_name in required_dirs:
    if Path(dir_name).exists():
        print(f"✓ {dir_name}/ found")
    else:
        print(f"✗ {dir_name}/ NOT FOUND! Please check your upload.")

# Create data directory if needed
Path('data').mkdir(exist_ok=True)
print("✓ data/ directory ready")

## Step 3: Install Dependencies

Installing required packages... (~5 minutes)

In [None]:
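print("📦 Installing dependencies...\n")

# Quote each requirement: an unquoted ">" is shell output redirection, so pip
# would ignore the version constraint and write its output to a file named "=4.44.0".
!pip install -q "transformers>=4.44.0"
!pip install -q "datasets>=2.14.0"
!pip install -q "accelerate>=0.24.0"
!pip install -q "bitsandbytes>=0.41.0"
!pip install -q "peft>=0.6.0"
!pip install -q "trl>=0.7.0"
!pip install -q "wandb>=0.15.0"
!pip install -q scipy scikit-learn

print("\n✓ All packages installed!\n")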

# Verify installations
import transformers
import torch
import datasets
import peft
import trl

print("📊 Package Versions:")
print(f"  Transformers: {transformers.__version__}")
print(f"  Datasets: {datasets.__version__}")
print(f"  PEFT: {peft.__version__}")
print(f"  TRL: {trl.__version__}")
print(f"  PyTorch: {torch.__version__}")

print("\n🖥️ GPU Info:")
print(f"  CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"  GPU: {torch.cuda.get_device_name(0)}")
    print(f"  Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("  ⚠️ WARNING: No GPU detected! Please change runtime to GPU.")

## Step 4: Authentication (HuggingFace & Wandb)

**You'll need to enter tokens here:**
- HuggingFace: https://huggingface.co/settings/tokens
- Wandb: https://wandb.ai/authorize

In [None]:
# HuggingFace Login
print("🤗 HuggingFace Login")
print("Get your token from: https://huggingface.co/settings/tokens\n")

from huggingface_hub import notebook_login
notebook_login()

In [None]:
# Wandb Login
print("📊 Wandb Login")
print("Get your token from: https://wandb.ai/authorize\n")

import wandb
wandb.login()

## Step 5: Data Preparation

Loading ECQA dataset and validating format...
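
As a rough illustration of what "validating format" might involve, the sketch below renders one multiple-choice item in the same layout as the sample prompt in Step 10. The function name `build_prompt` and its arguments are hypothetical; the actual logic lives in `src/prepare_data.py`, which is not shown here and may differ.

```python
from string import ascii_uppercase


def build_prompt(question: str, options: list[str]) -> str:
    """Format one multiple-choice item in the style of the Step 10 sample prompt."""
    # Label the options A, B, C, ... in the order given.
    option_lines = [f"{letter}. {text}" for letter, text in zip(ascii_uppercase, options)]
    return (
        "Answer the following question with step-by-step reasoning.\n\n"
        f"Question: {question}\n"
        "Options:\n"
        + "\n".join(option_lines)
        + "\n\n"
        'Think through this step by step, then provide your answer as "Answer: X".'
    )


# Hypothetical usage with the jellyfish question from Step 10.
example = build_prompt(
    "Where would you find a jellyfish that has not been captured?",
    ["ocean", "store", "tank", "internet", "aquarium"],
)
print(example)
```

Keeping the prompt layout identical across data preparation, training, and inference is the main point: a model fine-tuned on one layout tends to respond less reliably to another.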

In [None]:
print("="*80)
print("STEP 5: DATA PREPARATION")
print("="*80)
print("")

!python src/prepare_data.py

print("\n✓ Data preparation completed!")

## Step 6: Build DPO Preference Pairs

Creating (prompt, chosen, rejected) pairs for DPO training...
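
For orientation, here is a minimal sketch of what one (prompt, chosen, rejected) record could look like when stored as JSONL, one pair per line. The example texts are invented, and the file is written to a temporary directory rather than `data/dpo_pairs.jsonl`; the real records come from `src/build_dpo_data.py` and may carry additional fields.

```python
import json
import tempfile
from pathlib import Path

# One hypothetical preference pair: same prompt, a preferred and a dispreferred answer.
pair = {
    "prompt": (
        "Question: Where would you find a jellyfish that has not been captured?\n"
        "Options:\nA. ocean\nB. store\nC. tank\nD. internet\nE. aquarium"
    ),
    "chosen": "A jellyfish that has not been captured still lives in the wild.\nAnswer: A",
    "rejected": "Jellyfish are often displayed for visitors.\nAnswer: E",
}

# Write to a temporary file so this sketch never touches the project's data files.
path = Path(tempfile.mkdtemp()) / "example_pairs.jsonl"
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps(pair) + "\n")

# Read the file back and check that every record has the three expected fields.
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

required = {"prompt", "chosen", "rejected"}
num_valid = sum(1 for r in records if required.issubset(r))
print(f"{num_valid}/{len(records)} records have prompt, chosen, rejected")
```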

In [None]:
print("="*80)
print("STEP 6: BUILD DPO PREFERENCE PAIRS")
print("="*80)
print("")

!python src/build_dpo_data.py

print("\n✓ DPO data created!")

# Verify file exists
from pathlib import Path
dpo_file = Path('data/dpo_pairs.jsonl')
if dpo_file.exists():
    # JSONL stores one preference pair per line, so counting lines counts pairs
    with open(dpo_file, 'r') as f:
        num_pairs = sum(1 for _ in f)
    print(f"✓ DPO pairs file: {num_pairs} pairs saved")
else:
    print("✗ Warning: DPO pairs file not found!")

## Step 7: SFT Training (Supervised Fine-Tuning)

### ⏱️ Expected Time: ~3-4 hours

Training Llama 3.2 1B with QLoRA on ECQA dataset...

**What's happening:**
- 4-bit quantization to save memory
- LoRA adapters (r=16) for efficient training
- Training on ~10k samples
- Wandb tracking enabled
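
To give a feel for why LoRA and 4-bit quantization save memory, the sketch below does the arithmetic for a single weight matrix and for weight storage overall. The 2048 x 2048 matrix size and the round 1B parameter count are assumptions chosen for illustration, not values read from the training scripts.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns an update B @ A, with A of shape (r, d_in) and B of shape (d_out, r).
    """
    return r * d_in + d_out * r


# Assumed for illustration: a single square 2048 x 2048 projection matrix.
d = 2048
full = d * d
for r in (16, 8):
    added = lora_params(d, d, r)
    print(f"r={r}: {added:,} trainable vs {full:,} frozen ({added / full:.2%})")

# Rough weight storage for ~1B parameters at 16-bit vs 4-bit precision.
# This ignores quantization constants, activations, and optimizer state.
n_params = 1_000_000_000
fp16_gb = n_params * 2 / 1024**3
int4_gb = n_params * 0.5 / 1024**3
print(f"fp16 weights: ~{fp16_gb:.2f} GB, 4-bit weights: ~{int4_gb:.2f} GB")
```

Halving the rank from 16 to 8 halves the adapter parameters, which is why a lower `lora_r` is a common response to memory pressure.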

**⚠️ Don't close the browser tab during training!**

In [None]:
print("="*80)
print("STEP 7: SFT TRAINING")
print("="*80)
print("")
print("⏱️  This will take ~3-4 hours")
print("📊 Monitor progress on wandb (link will appear below)")
print("")

!python src/train_sft.py

print("\n" + "="*80)
print("✓ SFT TRAINING COMPLETED!")
print("="*80)

In [None]:
# Verify SFT model saved
from pathlib import Path

print("\n🔍 Verifying SFT outputs...\n")

sft_merged = Path("outputs/sft-llama32-1b-mcq-merged")
if sft_merged.exists():
    print("✓ SFT merged model saved")
    !ls -lh outputs/sft-llama32-1b-mcq-merged/ | head -10
else:
    print("✗ SFT merged model not found")

# Clear GPU memory
import torch
import gc
gc.collect()
torch.cuda.empty_cache()
print("\n🧹 GPU memory cleared")

## Step 8: DPO Training (Direct Preference Optimization)

### ⏱️ Expected Time: ~2-3 hours

Training with preference pairs to improve reasoning quality...

**What's happening:**
- Loading SFT checkpoint
- Training on preference pairs (correct vs wrong reasoning)
- Beta=0.1 for preference strength
- Lower learning rate than SFT
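
The role of beta can be seen in a small worked example of the standard DPO objective for a single preference pair. This is a sketch of the published formula, not the code in `src/train_dpo.py`; the log-probability values are made up, and it assumes the frozen reference model is the SFT checkpoint.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one pair, given summed log-probabilities of each response."""
    # How much more the policy prefers chosen over rejected, relative to the reference.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


# Hypothetical log-probabilities, chosen only to illustrate the behaviour.
at_start = dpo_loss(-40.0, -42.0, -40.0, -42.0)  # policy identical to reference
improved = dpo_loss(-35.0, -50.0, -40.0, -42.0)  # policy now favours the chosen answer
print(f"policy equals reference: {at_start:.4f}")
print(f"policy favours chosen:   {improved:.4f}")
```

When the policy equals the reference the margin is zero and the loss is log 2; the loss falls as the policy shifts probability toward the chosen response. A larger beta makes the same margin count for more, so the loss saturates sooner and the policy stays closer to the reference.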

In [None]:
print("="*80)
print("STEP 8: DPO TRAINING")
print("="*80)
print("")
print("⏱️  This will take ~2-3 hours")
print("📊 Monitor progress on wandb")
print("")

!python src/train_dpo.py

print("\n" + "="*80)
print("✓ DPO TRAINING COMPLETED!")
print("="*80)

In [None]:
# Verify DPO model saved
from pathlib import Path

print("\n🔍 Verifying DPO outputs...\n")

dpo_merged = Path("outputs/dpo-llama32-1b-mcq-merged")
if dpo_merged.exists():
    print("✓ DPO merged model saved")
    !ls -lh outputs/dpo-llama32-1b-mcq-merged/ | head -10
else:
    print("✗ DPO merged model not found")

# Clear GPU memory
import torch
import gc
gc.collect()
torch.cuda.empty_cache()
print("\n🧹 GPU memory cleared")

## Step 9: Evaluation

### ⏱️ Expected Time: ~20-30 minutes

Comparing all three models:
1. Base Llama-3.2-1B-Instruct
2. SFT model
3. DPO model

On ECQA validation set...
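
Accuracy on a multiple-choice benchmark is typically computed by parsing the final letter out of each generated response and comparing it with the gold label. The sketch below shows one way to do that for the "Answer: X" format used in the prompts; the helper names and sample outputs are hypothetical, and `src/evaluate.py` may parse responses differently.

```python
import re

# Matches the final-answer format the prompts ask for, e.g. "Answer: A".
ANSWER_RE = re.compile(r"Answer:\s*([A-E])\b")


def extract_answer(text: str):
    """Return the last 'Answer: X' letter in the text, or None if absent."""
    matches = ANSWER_RE.findall(text)
    return matches[-1] if matches else None


def accuracy(outputs: list[str], gold: list[str]) -> float:
    """Fraction of outputs whose extracted letter equals the gold letter."""
    correct = sum(extract_answer(o) == g for o, g in zip(outputs, gold))
    return correct / len(gold)


# Hypothetical model outputs for three questions.
outputs = [
    "Wild jellyfish live in the sea.\nAnswer: A",
    "It could be a tank, so Answer: C",
    "I am not sure.",  # no parsable answer counts as wrong
]
gold = ["A", "A", "B"]
print(f"accuracy = {accuracy(outputs, gold):.2f}")
```

Taking the last match rather than the first guards against responses that mention "Answer:" while reasoning before committing to a final choice.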

In [None]:
print("="*80)
print("STEP 9: MODEL EVALUATION")
print("="*80)
print("")
print("⏱️  This will take ~20-30 minutes")
print("🎯 Evaluating: Base → SFT → DPO")
print("")

!python src/evaluate.py

print("\n" + "="*80)
print("✓ EVALUATION COMPLETED!")
print("="*80)

## Step 10: Quick Inference Test

Test your DPO model on a sample question!

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

print("🧪 Loading DPO model for testing...\n")

model_path = "outputs/dpo-llama32-1b-mcq-merged"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)

print("✓ Model loaded!\n")

# Sample question
prompt = """Answer the following question with step-by-step reasoning.

Question: Where would you find a jellyfish that has not been captured?
Options:
A. ocean
B. store
C. tank
D. internet
E. aquarium

Think through this step by step, then provide your answer as "Answer: X"."""

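print("📝 Sample Question:")
print(prompt)
print("\n" + "="*80)
print("🤖 Model Response:")
print("="*80 + "\n")

# Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Fall back to the EOS token for padding if the tokenizer defines no pad token
pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
outputs = model.generate(
    **inputs,
    max_new_tokens=150,
    temperature=0.7,
    do_sample=True,
    pad_token_id=pad_id
)

# Decode only the newly generated tokens. Slicing the decoded string by
# len(prompt) can misalign if decoding does not reproduce the prompt exactly.
prompt_len = inputs["input_ids"].shape[1]
response_only = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()
print(response_only)

print("\n" + "="*80)
print("\n✓ Inference test completed!")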

## 🎉 Training Complete!

### What You've Accomplished:

- ✅ Loaded ECQA dataset (~10k samples)
- ✅ Fine-tuned Llama 3.2 1B with QLoRA (SFT)
- ✅ Applied Direct Preference Optimization (DPO)
- ✅ Evaluated all models and compared performance

### Your Models:
- `outputs/sft-llama32-1b-mcq-merged/` - SFT model
- `outputs/dpo-llama32-1b-mcq-merged/` - DPO model (expected best; confirm with the evaluation results)

### Next Steps:
1. Check your **wandb dashboard** for training metrics
2. Review evaluation results above
3. Test with your own questions
4. Download models if needed (see cell below)

---

## Optional: Download Trained Models

Download models to your local machine

In [None]:
# Zip models (this may take a few minutes)
print("📦 Zipping trained models...\n")

!zip -r trained_models.zip outputs/*-merged/

print("\n✓ Models zipped!")
print("\n📥 Downloading...")

from google.colab import files
files.download('trained_models.zip')

print("✓ Download started! Check your browser downloads.")

## 🛠️ Utilities

Helpful commands for debugging and monitoring

In [None]:
# Check GPU memory usage
!nvidia-smi

In [None]:
# Clear GPU memory
import torch
import gc

gc.collect()
torch.cuda.empty_cache()
print("🧹 GPU memory cleared!")

In [None]:
# Check project status
from pathlib import Path

print("📊 Project Status")
print("="*50)

checks = [
    ("DPO data", "data/dpo_pairs.jsonl"),
    ("SFT model", "outputs/sft-llama32-1b-mcq-merged"),
    ("DPO model", "outputs/dpo-llama32-1b-mcq-merged"),
]

for name, path in checks:
    if Path(path).exists():
        print(f"✓ {name}")
    else:
        print(f"✗ {name}")

print("="*50)

In [None]:
# Check disk usage
!df -h /content/drive/MyDrive/llama32-mcq-cot/
!du -sh /content/drive/MyDrive/llama32-mcq-cot/outputs/

## 🐛 Troubleshooting

### Out of Memory (OOM)?
Run this cell to print smaller settings, then copy them into `configs/config.py`:

In [None]:
# Edit config for lower memory usage
print("⚙️ Applying memory-optimized settings...\n")

config_edit = '''
# Lower memory config
data_config.train_sample_size = 5000  # Use 5k instead of 10k
sft_config.max_seq_length = 384       # Reduce from 512
dpo_config.max_length = 384
model_config.lora_r = 8               # Reduce from 16
'''

print("Add this to configs/config.py:")
print(config_edit)
print("\nThen restart training from the failed step.")

---

## 📚 Resources

- **Project README**: Check `README.md` for detailed documentation
- **Colab Guide**: See `COLAB_GUIDE.md` for tips and tricks
- **Config**: Edit `configs/config.py` to customize hyperparameters

## 🎓 Learning Outcomes

You now understand:
- ✅ QLoRA (4-bit quantization + LoRA)
- ✅ Supervised Fine-Tuning (SFT)
- ✅ Direct Preference Optimization (DPO)
- ✅ Chain-of-Thought reasoning
- ✅ Model evaluation and comparison

## 📝 For Your CV

Check the README.md file for a ready-made CV description!

---

**Happy Training! 🚀**