# QLoRA Training on Mistral-7B (GPU)

**⚠️ REQUIRES GPU!** This notebook must be run in **Google Colab with GPU enabled** (Runtime → Change runtime type → GPU).

**Why GPU is required:**
- QLoRA still needs GPU for training (even with 4-bit quantization)
- CPU training would take days/weeks and likely crash
- GPU training takes ~30-60 minutes for 1 epoch

**Recommended GPU:**
- T4 (16GB) - works fine, free tier
- A100 (40GB in Colab) - faster, paid tier

## What is QLoRA?

**QLoRA** (Quantized Low-Rank Adaptation) combines:
- **4-bit quantization:** Reduces model memory by ~75% compared with 16-bit weights
- **LoRA (Low-Rank Adaptation):** Trains small adapter matrices instead of full weights

Result: you can fine-tune a 7B model on a 16GB T4 GPU, where full fine-tuning normally requires 40GB+ of GPU memory.
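
To make the LoRA half concrete, here is a minimal pure-Python sketch of one adapted linear layer. The frozen weight `W` is never modified; only the two small matrices `A` (r × d_in) and `B` (d_out × r) would be trained, and their product is scaled by `alpha / r`. The toy sizes, the values, and the helper names `matvec` and `lora_forward` are illustrative assumptions, not code from the peft library.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Compute y = W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)               # frozen path: the original layer
    update = matvec(B, matvec(A, x))  # low-rank path: project down to r, then back up
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

d_in, d_out, r, alpha = 4, 3, 2, 4    # toy sizes (assumed for illustration)

W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]            # frozen base weight, d_out x d_in
A = [[0.5, 0.0, 0.0, 0.0],
     [0.0, 0.5, 0.0, 0.0]]            # r x d_in
x = [1.0, 2.0, 3.0, 4.0]

# B starts at zero, so the adapted layer initially behaves exactly like the base layer
B_zero = [[0.0] * r for _ in range(d_out)]
y_init = lora_forward(W, A, B_zero, x, alpha, r)

# After training, B is non-zero and shifts the output
B_trained = [[1.0, 0.0],
             [0.0, 1.0],
             [0.0, 0.0]]              # d_out x r
y_trained = lora_forward(W, A, B_trained, x, alpha, r)

print(y_init)
print(y_trained)
```

The memory saving only appears when `r` is much smaller than the layer dimensions, which is the case for a 4096-wide model with `r=8`.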

## How 4-bit Quantization Works

Instead of storing each weight in 16 bits (FP16/BF16, 2 bytes), we use:
- **4-bit codes (NF4):** 0.5 bytes per weight
- **Quantization constants:** One scaling factor per small block of weights, used to convert the codes back to 16-bit values for computation

This is lossy but preserves most model knowledge. Combined with LoRA, we get:
- Fast training
- Low memory usage
- Good performance
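
The idea of 4-bit codes plus a per-block constant can be sketched in a few lines. This is a simplified illustration that uses uniform integer levels and absmax scaling; the NF4 format used by bitsandbytes instead maps each code to a non-uniform 16-value table, so treat the functions below (`quantize_block`, `dequantize_block`) as hypothetical teaching code.

```python
def quantize_block(weights):
    """Map floats to integer codes in [-7, 7] plus one scale (the quantization constant)."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 7 if absmax > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize_block(codes, scale):
    """Convert integer codes back to approximate float weights."""
    return [c * scale for c in codes]

block = [0.70, -0.33, 0.10, 0.00, -0.70, 0.21]   # one small block of weights
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)

# The rounding error per weight is at most half a quantization step
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("codes:", codes)
print("scale:", scale)
print("max error:", max_error)
```

Each code fits in 4 bits, and the only extra storage is one `scale` per block, which is why the overhead of the constants stays small.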

## Why T4 Fits

Google Colab's T4 GPU has 16GB VRAM. With QLoRA:
- Base model: ~4GB (4-bit)
- LoRA adapters: ~100MB
- Training overhead: ~8GB
- **Total: ~12GB** ✅ Fits!
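
The first two lines of this budget can be checked with back-of-the-envelope arithmetic. The sketch below assumes roughly 7.24 billion parameters for Mistral-7B and the published layer shapes (hidden size 4096, 32 layers, key/value projection width 1024); verify these against `model.config` before relying on them. It also assumes LoRA on `q_proj`, `k_proj`, `v_proj`, and `o_proj` with `r=8`, stored in FP32.

```python
GB = 1024 ** 3
MB = 1024 ** 2

# Base model: 0.5 bytes per weight in 4-bit
base_params = 7.24e9                   # approximate parameter count (assumption)
base_4bit_gb = base_params * 0.5 / GB

def lora_params(d_in, d_out, r):
    """A is r x d_in and B is d_out x r, so the adapter adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)

hidden, kv_dim, layers, r = 4096, 1024, 32, 8   # assumed Mistral-7B shapes
per_layer = (
    lora_params(hidden, hidden, r)     # q_proj
    + lora_params(hidden, kv_dim, r)   # k_proj
    + lora_params(hidden, kv_dim, r)   # v_proj
    + lora_params(hidden, hidden, r)   # o_proj
)
adapter_params = per_layer * layers
adapter_fp32_mb = adapter_params * 4 / MB

print(f"4-bit base weights: {base_4bit_gb:.2f} GB")
print(f"LoRA parameters:    {adapter_params:,}")
print(f"LoRA weights, FP32: {adapter_fp32_mb:.0f} MB")
```

The raw 4-bit figure lands a little under 4GB; layers that are kept in 16-bit and the quantization constants add to it. The adapter weights alone are far below 100MB, and gradients plus Adam optimizer state (about three more copies of the adapter) bring the adapter-related total to roughly the ~100MB quoted above.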

## Hyperparameters in Plain English

- **r (rank):** Size of adapter matrices. Higher = more capacity, more memory. r=8 is a good start.
- **alpha:** Scaling factor. Usually alpha = 2*r. Controls adapter strength.
- **dropout:** Regularization. 0.05 = 5% chance of dropping connections.
- **lr:** Learning rate. 2e-4 is a common starting point for LoRA.
- **grad_accum:** Effective batch size = batch_size × grad_accum. Use 16 to simulate larger batches.
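
Why gradient accumulation "simulates" a larger batch can be shown on a toy problem. The sketch below uses a hypothetical one-parameter model `y = w * x` with a mean squared error loss, and checks that averaging the gradients of 16 micro-batches of size 1 gives the same gradient as one batch of 16. It illustrates the principle only and is not how the Trainer is implemented internally.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(float(i), 3.0 * i + 1.0) for i in range(1, 17)]  # 16 toy samples
w = 0.5

# One large batch of 16
full_batch_grad = grad_mse(w, data)

# batch_size = 1 with grad_accum = 16: sum the scaled per-sample gradients,
# then take a single optimizer step with the result
grad_accum = 16
accumulated_grad = 0.0
for sample in data:
    accumulated_grad += grad_mse(w, [sample]) / grad_accum

print(full_batch_grad, accumulated_grad)
```

Only one sample's activations are held in memory at a time, which is why this trick trades wall-clock time for lower peak memory.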

## Avoiding OOM (Out of Memory)

- Use gradient checkpointing
- Keep batch_size=1, use grad_accum for effective batch
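- Use bfloat16 on GPUs that support it (Ampere or newer, e.g. A100); it is more stable than float16. The T4 has no native bfloat16 support, so use float16 there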
- Monitor GPU memory with `nvidia-smi`
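
For monitoring from inside a notebook cell, `nvidia-smi` can be queried in a machine-readable form. The sketch below is one way to do it with the standard library; the helper names `parse_memory_csv` and `gpu_memory` are hypothetical, and the function returns `None` when `nvidia-smi` is missing or fails, so it is safe to run on a CPU-only machine.

```python
import shutil
import subprocess

def parse_memory_csv(text):
    """Parse lines like '1234, 15360' into (used_mib, total_mib) tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(part.strip()) for part in line.split(","))
        rows.append((used, total))
    return rows

def gpu_memory():
    """Return per-GPU (used_mib, total_mib), or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return None
    return parse_memory_csv(result.stdout)

print(gpu_memory())
```

Calling `gpu_memory()` before and after loading the model, and again during training, shows how much headroom is left before an OOM.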


In [2]:
# === TODO (you code this) ===
# Install GPU deps. Keep versions conservative. Verify CUDA is available.
# Hints:
#   - Install torch, transformers, peft, bitsandbytes, accelerate
#   - Use !pip install in Colab
#   - Check torch.cuda.is_available()
# Acceptance:
#   - torch.cuda.is_available() is True

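import subprocess
import sys

import torch  # preinstalled in Colab, so it is not reinstalled below

def install_gpu_reqs():
    """
    Install GPU dependencies and verify CUDA availability.
    """
    # Versions are left unpinned here; pin them if you need reproducible runs.
    # datasets is included because the following cells import it.
    packages = ["transformers", "peft", "bitsandbytes", "accelerate", "datasets"]
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *packages])

    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available. Please enable GPU in Colab.")

install_gpu_reqs()
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")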


CUDA available: True
GPU: NVIDIA A100-SXM4-40GB


## Load Dataset

Pull the dataset from the Hub (or load from local CSV if you didn't push it).


In [None]:
# ⚠️ FOR COLAB USE: Replace the placeholder below with your actual HF token
# In Colab, you can either:
# 1. Replace "YOUR_TOKEN_HERE" with your actual token (temporary, for this session)
# 2. Use: from huggingface_hub import login; login()  (recommended - stores token securely)
# 3. Set as Colab secret: HF_TOKEN in Colab secrets (most secure)

# Replace this placeholder with your actual token in Colab:
HF_TOKEN = "YOUR_TOKEN_HERE"  # Replace with your actual token in Colab!

# Alternative (recommended): Use login instead
# from huggingface_hub import login
# login()  # Enter token when prompted
# Then use: from huggingface_hub import HfFolder; HF_TOKEN = HfFolder.get_token()

In [11]:
# === TODO (you code this) ===
# Load dataset from HF Hub or local CSV; tokenize with seq_length from config.
# Hints:
#   - Try load_dataset() first (Hub), fallback to CSV if needed
#   - Tokenize using the function from notebook 03
#   - Set padding token if missing
# Acceptance:
#   - tokenized train/validation Datasets ready for Trainer

from datasets import load_dataset
from transformers import AutoTokenizer

import os

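# Use the HF_TOKEN defined in the cell above. If the placeholder was left unchanged,
# pass None so that a token stored by login() (if any) is used instead.
hf_token = None if HF_TOKEN == "YOUR_TOKEN_HERE" else HF_TOKEN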



def load_and_tokenize(hub_id: str, base_model: str, seq_length: int):
    """
    Load dataset from Hub or CSV and tokenize.
    
    Args:
        hub_id: Hub dataset ID or path to CSV
        base_model: Model name for tokenizer
        seq_length: Maximum sequence length
        
    Returns:
        tuple: (tokenized_train, tokenized_val) datasets
    """
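    try:
        # Try loading from Hub
        dataset = load_dataset(hub_id, token=hf_token)
        print(f"Loaded dataset from Hub: {hub_id}")
    except Exception as e:
        print(f"Error loading from Hub: {e}")
        # Fallback to CSV: only possible when hub_id is a local file path
        if not os.path.isfile(hub_id):
            raise
        csv_data = load_dataset("csv", data_files=hub_id)["train"]
        # A CSV has no predefined splits; the 5% validation hold-out is an assumed ratio
        split = csv_data.train_test_split(test_size=0.05, seed=42)
        dataset = {"train": split["train"], "validation": split["test"]}
        print(f"Loaded dataset from CSV: {hub_id}")

    # Tokenize
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    # Set the padding token only if the tokenizer does not define one
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token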

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=seq_length)
    
    # Apply tokenization to both train and validation splits
    tokenized_train = dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"])
    tokenized_val = dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"])
    
    print(f"Tokenized train: {len(tokenized_train)} samples")
    print(f"Tokenized validation: {len(tokenized_val)} samples")
    
    return tokenized_train, tokenized_val

# Load and tokenize
hub_id = "Tuminha/frankenstein-fanfic-snippets"  # or "path/to/local.csv"
base_model = "mistralai/Mistral-7B-Instruct-v0.2"
ds_train, ds_val = load_and_tokenize(hub_id, base_model, seq_length=512)
print(f"Train: {len(ds_train)}, Val: {len(ds_val)}")


Loaded dataset from Hub: Tuminha/frankenstein-fanfic-snippets


Tokenized train: 456 samples
Tokenized validation: 25 samples
Train: 456, Val: 25


## Build 4-bit Model

Load Mistral-7B in 4-bit mode using BitsAndBytes. This is the memory-saving step.


In [None]:
# === TODO (you code this) ===
# Build 4-bit Mistral with BitsAndBytes and prepare for k-bit training.
# Hints:
#   - Use BitsAndBytesConfig with load_in_4bit=True
#   - Load model with quantization_config
#   - Enable gradient checkpointing to save memory
#   - Set tokenizer padding side
# Acceptance:
#   - model loads on GPU; gradients checkpointed; memory < 16GB on T4

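from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

def build_4bit_model(base_model: str):
    """
    Load model in 4-bit quantization mode.
    
    Args:
        base_model: Model name
        
    Returns:
        tuple: (model, tokenizer)
    """
    # bfloat16 needs compute capability 8.0+ (Ampere, e.g. A100); use float16 on T4
    use_bf16 = torch.cuda.get_device_capability()[0] >= 8
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16 if use_bf16 else torch.float16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        quantization_config=bnb_config,
        device_map="auto",
    )
    # Trade extra compute for lower activation memory
    model.gradient_checkpointing_enable()
    # The generation KV cache is not needed for training and conflicts with checkpointing
    model.config.use_cache = False

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    return model, tokenizer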

model, tokenizer = build_4bit_model("mistralai/Mistral-7B-Instruct-v0.2")
print("4-bit model loaded!")


## Configure LoRA and Train

Set up LoRA adapters and training arguments. Then run one epoch.


In [None]:
# === TODO (you code this) ===
# Create LoRA config and TrainingArguments; run one epoch.
# Hints:
#   - Use LoraConfig from peft with r/alpha/dropout from config
#   - Set target_modules to attention layers
#   - Use TrainingArguments with grad_accum, bf16, etc.
#   - Use SFTTrainer from trl (or Trainer from transformers)
# Acceptance:
#   - training completes; loss decreases; adapter folder saved

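import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

def train_qlora(model, tokenizer, ds_train, ds_val, cfg: dict, out_dir: str):
    """
    Train LoRA adapters on 4-bit model.
    
    Args:
        model: 4-bit quantized model
        tokenizer: Tokenizer
        ds_train: Training dataset
        ds_val: Validation dataset
        cfg: Config dict with qlora settings
        out_dir: Output directory for adapters
    """
    q = cfg["qlora"]
    # Prepare the quantized model for training (norm layers in FP32, input grads enabled)
    model = prepare_model_for_kbit_training(model)
    lora_config = LoraConfig(
        r=q["r"],
        lora_alpha=q["alpha"],
        lora_dropout=q["dropout"],
        target_modules=q["target_modules"],
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    # bfloat16 on Ampere or newer (e.g. A100), float16 otherwise (e.g. T4)
    use_bf16 = torch.cuda.get_device_capability()[0] >= 8
    args = TrainingArguments(
        output_dir=out_dir,
        num_train_epochs=q["epochs"],
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        gradient_accumulation_steps=q["grad_accum"],
        learning_rate=q["lr"],
        bf16=use_bf16,
        fp16=not use_bf16,
        gradient_checkpointing=True,
        logging_steps=5,
        report_to="none",
    )
    # Trainer is used instead of SFTTrainer because the datasets are already tokenized;
    # the collator copies input_ids into labels for causal language modeling
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=ds_train,
        eval_dataset=ds_val,
        data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    )
    trainer.train()
    print(trainer.evaluate())
    # For a PEFT model this saves only the adapter weights and config
    model.save_pretrained(out_dir)
    tokenizer.save_pretrained(out_dir)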

# Train
cfg = {
    'qlora': {
        'r': 8,
        'alpha': 16,
        'dropout': 0.05,
        'target_modules': ['q_proj', 'k_proj', 'v_proj', 'o_proj'],
        'lr': 2.0e-4,
        'grad_accum': 16,
        'epochs': 1
    }
}
train_qlora(model, tokenizer, ds_train, ds_val, cfg, out_dir="adapters/mistral-frankenstein")
print("Training complete!")


## Push Adapters to Hub

Save the adapters to the Hub so you can use them later (and share them).


In [None]:
# === TODO (you code this) ===
# Push the adapter to the Hub (private ok).
# Hints:
#   - Use adapter.push_to_hub() or model.push_to_hub()
#   - Set private=True if desired
#   - Include tokenizer if needed
# Acceptance:
#   - repo exists with adapter files; URL printed

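from huggingface_hub import HfApi

def push_adapters(local_dir: str, repo_id: str):
    """
    Push LoRA adapters to Hugging Face Hub.
    
    Args:
        local_dir: Local directory with adapter files
        repo_id: Hub repository ID
    """
    # hf_token comes from the dataset cell; None falls back to the token stored by login()
    api = HfApi(token=hf_token)
    # Private by default; set private=False to share the adapters publicly
    api.create_repo(repo_id, private=True, exist_ok=True)
    # Skip intermediate Trainer checkpoints (optimizer state etc.)
    api.upload_folder(folder_path=local_dir, repo_id=repo_id, ignore_patterns=["checkpoint-*"])
    print(f"https://huggingface.co/{repo_id}")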

push_adapters("adapters/mistral-frankenstein", "YOURUSER/mistral-frankenstein-qlora")
print("Adapters pushed to Hub!")
