# Colab 2: LoRA Finetuning with SmolLM2 135M
## Parameter-Efficient Fine-tuning using Unsloth

This notebook demonstrates **LoRA (Low-Rank Adaptation)** fine-tuning of the SmolLM2 135M model.

### What is LoRA?
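- **LoRA**: Freezes the base weights and trains small low-rank adapter matrices added to selected layers
- **Memory Efficient**: Needs much less VRAM than full finetuning, since gradients and optimizer state exist only for the adapters
- **Faster Training**: Fewer trainable parameters usually means quicker training steps
- **Good Performance**: Often comes close to full-finetuning quality, though the gap depends on the task, rank, and data
- **Easy to Share**: Adapters are small (~10-50MB here, versus hundreds of MB for the full 135M model and GBs for larger models)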
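
To make the adapter idea concrete, here is a minimal pure-Python sketch of one LoRA-adapted linear layer. It assumes the standard LoRA formulation, `y = W x + (alpha / r) * B (A x)`, with `W` frozen and only `A` and `B` trainable. The tiny dimensions and the helper names (`matvec`, `lora_forward`) are illustrative and are not part of Unsloth or PEFT.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update B @ (A @ x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, r, alpha = 8, 6, 2, 2

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r, starts at zero
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B initialised to zero the adapter is a no-op, so training starts
# from exactly the base model's behaviour.
starts_as_base = lora_forward(W, A, B, x, alpha, r) == matvec(W, x)

full_params = d_in * d_out        # weights in W
lora_params = r * (d_in + d_out)  # weights in A and B
print(starts_as_base, full_params, lora_params)
```

At this toy size the saving is modest; it grows quickly with layer width, because the adapter cost scales with `d_in + d_out` while the full matrix scales with `d_in * d_out`.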

### Key Differences from Colab 1 (Full Finetuning):
| Feature | Full Finetuning | LoRA Finetuning |
|---------|----------------|------------------|
| Parameters Updated | All (135M) | ~4.9M adapter weights (~3-4%) |
| Memory Usage | High | Low |
| Training Speed | Slower | Faster |
| Saved Artifact | Full model | Adapters only (~10-50MB) |
| r parameter | Not used | 16 here (typical: 8-64) |
| lora_alpha | Not used | 16 here (typically r or 2*r) |

### Dataset:
We'll use the same Alpaca dataset for fair comparison with Colab 1.

## Step 1: Install Unsloth

In [None]:
%%capture
!pip install unsloth
!pip install --upgrade datasets transformers accelerate

## Step 2: Import Libraries

In [None]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments, TextStreamer
from unsloth import is_bfloat16_supported

print("All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## Step 3: Configure Model Parameters

In [None]:
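max_seq_length = 2048  # maximum tokens per training example
dtype = None           # None lets Unsloth pick float16 or bfloat16 for the GPU
load_in_4bit = True    # load the frozen base weights in 4-bit (QLoRA-style) to save memory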

model_name = "unsloth/SmolLM2-135M-Instruct"

print(f"Configuration:")
print(f"  Model: {model_name}")
print(f"  Max Sequence Length: {max_seq_length}")
print(f"  4-bit Quantization: {load_in_4bit}")
print(f"  Training Mode: LoRA FINETUNING")

## Step 4: Load Model and Tokenizer

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

print("Model loaded successfully!")
print(f"Model type: {type(model)}")
print(f"Tokenizer vocab size: {len(tokenizer)}")

## Step 5: Configure LoRA Parameters

### LoRA Parameters Explained:

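**r (rank)**:
- Sets the inner dimension of the two low-rank matrices added to each target layer
- Higher rank = more trainable parameters and capacity, but more memory and slower training
- Typical values: 8, 16, 32, 64
- We use **16** for a good balance

**lora_alpha**:
- Scaling factor: the adapter update is multiplied by lora_alpha / r
- Usually set to r or 2*r
- We use **16** (same as r, so the scale is 1)

**lora_dropout**:
- Dropout applied inside the adapters to reduce overfitting
- 0 = no dropout, 0.1 = 10% dropout
- We use **0**, which keeps Unsloth's optimized code path and is adequate for this short run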

**target_modules**: 
- Which layers to apply LoRA to
- More modules = more parameters
- We target all attention and MLP layers

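**Key Difference from Full Finetuning**: calling `get_peft_model` with a rank r > 0 attaches the LoRA adapters and freezes the base weights, so only the adapters are trained.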
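
The link between rank, target modules, and trainable parameters can be worked out by hand. The sketch below counts adapter weights per rank. The layer dimensions are assumptions about SmolLM2-135M (hidden size 576, MLP size 1536, 192-wide key/value projections, 30 layers) and should be checked against `model.config`; `lora_param_count` is a hypothetical helper, not a library function.

```python
# Assumed SmolLM2-135M dimensions (check model.config before relying on them).
HIDDEN = 576          # hidden_size
INTERMEDIATE = 1536   # intermediate_size
KV_DIM = 192          # num_key_value_heads (3) * head_dim (64)
NUM_LAYERS = 30       # num_hidden_layers

# (in_features, out_features) for each targeted projection in one layer.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(r, shapes=MODULE_SHAPES, num_layers=NUM_LAYERS):
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

for rank in (8, 16, 32, 64):
    print(f"r={rank:>2}: {lora_param_count(rank):,} trainable parameters")
```

The count grows linearly with r, so doubling the rank doubles the adapter size. After `get_peft_model`, `model.print_trainable_parameters()` reports the actual figure for the loaded model.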

In [None]:
# Configure LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank - KEY PARAMETER!
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
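    lora_alpha=16,  # LoRA scaling factor (update is scaled by lora_alpha / r)
    lora_dropout=0,  # 0 disables dropout; Unsloth's fast path is optimized for 0
    bias="none",  # Leave bias terms frozen; "none" is the optimized setting
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-saving checkpointing
    random_state=3407,
    use_rslora=False,  # Set True for rank-stabilized LoRA (scales by alpha / sqrt(r))
    loftq_config=None,  # No LoftQ initialization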
)

print("Model configured for LoRA FINETUNING!")
print(f"\nLoRA Configuration:")
print(f"  Rank (r): 16")
print(f"  Alpha: 16")
print(f"  Dropout: 0")
print(f"  Target modules: 7 (all attention + MLP projections)")
print("\nTrainable vs. total parameters:")
model.print_trainable_parameters()  # reports the real count instead of a hardcoded estimate

## Step 6: Load and Prepare Dataset

Using the same Alpaca dataset as Colab 1 for fair comparison.

In [None]:
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")

print(f"Dataset loaded: {len(dataset)} samples")
print(f"\nSample data point:")
print(dataset[0])

## Step 7: Format Dataset with Chat Template

In [None]:
chat_template = """<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
{instruction}{input}<|im_end|>
<|im_start|>assistant
{output}<|im_end|>"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        input_section = f"\n\nInput: {input_text}" if input_text else ""
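        text = chat_template.format(
            instruction=instruction,
            input=input_section,
            output=output
        )
        # The template already ends with <|im_end|>; append EOS only when it is
        # a different token, so the end marker is not duplicated.
        if not text.endswith(EOS_TOKEN):
            text += EOS_TOKEN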
        texts.append(text)
    
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)

print("Dataset formatted successfully!")
print(f"\nExample formatted text:")
print(dataset[0]["text"][:500] + "...")
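
The formatting step can be tried on a single record without loading the model. This is a sketch under two assumptions: the tokenizer's EOS token is `<|im_end|>` (confirm with `tokenizer.eos_token`), and the sample record is made up for illustration. `format_example` is a hypothetical helper that mirrors the template used in this step.

```python
CHAT_TEMPLATE = (
    "<|im_start|>system\n"
    "You are a helpful AI assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "{instruction}{input}<|im_end|>\n"
    "<|im_start|>assistant\n"
    "{output}<|im_end|>"
)

def format_example(instruction, input_text, output, eos_token):
    """Render one Alpaca record, appending EOS only if it is not already there."""
    input_section = f"\n\nInput: {input_text}" if input_text else ""
    text = CHAT_TEMPLATE.format(
        instruction=instruction, input=input_section, output=output
    )
    if not text.endswith(eos_token):
        text += eos_token
    return text

# Assumption: the tokenizer's EOS token is "<|im_end|>".
sample = format_example(
    instruction="Translate to French.",
    input_text="Good morning",
    output="Bonjour",
    eos_token="<|im_end|>",
)
print(sample)
```

Checking the end of the rendered text matters because a doubled end-of-turn marker in the training data can teach the model to emit it twice.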

## Step 8: Configure Training Arguments

### Key Differences from Full Finetuning:
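- **learning_rate**: 2e-4 (LoRA tolerates higher learning rates than full finetuning; 5e-4 to 1e-3 is sometimes used)
- **Training is faster** because only the adapter weights are updated
- **Less memory usage** overall, since optimizer state is kept only for the adapters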

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,  # Same as full finetuning for comparison
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",
    ),
)

print("Trainer configured successfully!")
print(f"\nTraining Configuration:")
print(f"  Batch size: 2")
print(f"  Gradient accumulation: 4")
print(f"  Effective batch size: 8")
print(f"  Max steps: 60")
print(f"  Learning rate: 2e-4")
print(f"  Training mode: LoRA (Parameter-Efficient)")
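
The numbers printed above follow from simple arithmetic, sketched here in plain Python. The schedule function is an approximation of a linear warmup and decay schedule written for illustration; it is not the Trainer's own code, and `effective_batch_size` and `linear_schedule_lr` are hypothetical helper names. A single GPU is assumed.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Examples consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices

def linear_schedule_lr(step, base_lr, warmup_steps, max_steps):
    """Linear warmup from 0 to base_lr, then linear decay to 0 at max_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0, max_steps - step) / max(1, max_steps - warmup_steps)

batch = effective_batch_size(per_device_batch=2, grad_accum_steps=4)
samples_seen = batch * 60        # max_steps = 60
epochs = samples_seen / 1000     # the dataset slice holds 1,000 examples

print(f"effective batch = {batch}, samples seen = {samples_seen}, epochs = {epochs:.2f}")
for step in (0, 5, 30, 60):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step, 2e-4, 5, 60):.2e}")
```

With these settings the run covers 480 of the 1,000 examples, which is less than half an epoch, so it is a quick demonstration and not a full training run.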

## Step 9: Start Training!

Training only the LoRA adapter weights (a few million parameters).
This should be noticeably faster than full finetuning!

In [None]:
# Show GPU memory before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

# Start training
print("\nStarting LoRA finetuning...")
trainer_stats = trainer.train()

# Show statistics
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_training = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
training_time = trainer_stats.metrics['train_runtime']

print(f"\n{'='*60}")
print(f"LoRA Training completed successfully!")
print(f"{'='*60}")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Memory used for training = {used_memory_for_training} GB.")
print(f"Percentage of max memory used = {used_percentage}%")
print(f"Training time = {training_time:.2f} seconds")
print(f"{'='*60}")
print(f"\n💡 Compare this with Full Finetuning from Colab 1!")
print(f"   LoRA should use less memory and train faster.")

## Step 10: Test the LoRA Fine-tuned Model

In [None]:
FastLanguageModel.for_inference(model)

test_prompts = [
    "Explain what machine learning is in simple terms.",
    "Write a Python function to calculate the factorial of a number.",
    "What are the benefits of exercise?"
]

print("Testing the LoRA fine-tuned model...\n")

for i, instruction in enumerate(test_prompts, 1):
    prompt = f"""<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
"""
    
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    
    print(f"{'='*60}")
    print(f"Test {i}: {instruction}")
    print(f"{'='*60}")
    
    text_streamer = TextStreamer(tokenizer, skip_prompt=True)
    output = model.generate(
        **inputs,
        streamer=text_streamer,
        max_new_tokens=128,
        use_cache=True,
        do_sample=True,  # temperature/top_p only take effect when sampling is enabled
        temperature=0.7,
        top_p=0.9,
    )
    print(f"\n")
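
For readers unfamiliar with the `temperature` and `top_p` arguments, the sketch below shows conceptually what they do to a next-token distribution. It is a simplified illustration with a made-up four-token vocabulary, not the `transformers` implementation, and both function names are hypothetical. In `transformers`, these settings only apply when sampling is enabled with `do_sample=True`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalise so the surviving candidates sum to 1.
    return {i: probs[i] / mass for i in kept}

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
probs = softmax_with_temperature(logits, temperature=0.7)
candidates = top_p_filter(probs, top_p=0.9)
print(candidates)
```

The next token is then drawn at random from `candidates`, which is why repeated runs of the test cell can produce different completions.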

## Step 11: Save the LoRA Adapters

### Advantage of LoRA:
LoRA adapters are tiny (~10-50MB) compared to a full checkpoint (hundreds of MB for this 135M model, GBs for larger models)!
You can easily share them or switch between multiple adapters.
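
A rough size estimate shows where these figures come from. The parameter counts are assumptions (about 4.88M adapter weights for r=16 on all seven projections, about 135M base weights), as are the storage dtypes; real files also carry some metadata, so treat the result as an order-of-magnitude check.

```python
def size_mb(num_params, bytes_per_param):
    """Raw tensor storage in megabytes (10^6 bytes), ignoring file metadata."""
    return num_params * bytes_per_param / 1e6

# Assumed counts; verify with model.print_trainable_parameters().
ADAPTER_PARAMS = 4_884_480
BASE_PARAMS = 135_000_000

adapter_fp32 = size_mb(ADAPTER_PARAMS, 4)  # assumes adapters saved as 32-bit floats
base_fp16 = size_mb(BASE_PARAMS, 2)        # assumes base model saved as 16-bit floats
print(f"adapter ~{adapter_fp32:.1f} MB, base model ~{base_fp16:.0f} MB")
```

For a model this small the gap is about one order of magnitude; it widens for larger base models, since adapter size grows much more slowly than model size.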

In [None]:
# Save LoRA adapters only
model.save_pretrained("smollm2_135m_lora_adapters")
tokenizer.save_pretrained("smollm2_135m_lora_adapters")

print("LoRA adapters saved to 'smollm2_135m_lora_adapters/'")
print("\n✨ LoRA adapters are much smaller than full models!")
print("   You can easily share or switch between multiple adapters.")

## Step 12: (Optional) Merge LoRA with Base Model

You can merge the LoRA adapters into the base model to create a standalone model.
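
Conceptually, merging folds the low-rank update into the base weight so that no adapter is needed at inference time. The sketch below illustrates this on a tiny made-up layer, assuming the standard formulation `W_merged = W + (alpha / r) * B A`. It is not what `save_pretrained_merged` does internally, and `matmul` and `merge_lora` are hypothetical helpers.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, A, B, alpha, r):
    """Fold the adapter into the base weight: W + (alpha / r) * (B @ A)."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Tiny hypothetical layer: 2 outputs, 3 inputs, rank 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]      # r x d_in
B = [[0.5], [-1.0]]        # d_out x r
alpha, r = 2, 1

W_merged = merge_lora(W, A, B, alpha, r)
x = [[1.0], [1.0], [1.0]]  # column vector

# Output with the adapter kept separate versus output of the merged weight.
adapter_out = [[base[0] + (alpha / r) * upd[0]]
               for base, upd in zip(matmul(W, x), matmul(B, matmul(A, x)))]
merged_out = matmul(W_merged, x)
print(W_merged, merged_out == adapter_out)
```

Both paths give the same output, which is why a merged model behaves like base plus adapter while being loadable as an ordinary standalone checkpoint.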

In [None]:
# Merge LoRA adapters with base model
# Uncomment to run:

# model.save_pretrained_merged("smollm2_135m_lora_merged", tokenizer, save_method="merged_16bit")
# print("Merged model saved to 'smollm2_135m_lora_merged/'")

# Or save in different formats:
# model.save_pretrained_merged("model", tokenizer, save_method="merged_4bit")  # 4-bit
# model.save_pretrained_merged("model", tokenizer, save_method="lora")  # LoRA only

print("To merge LoRA with base model, uncomment the code above.")

## Step 13: (Optional) Export to GGUF for Ollama

In [None]:
# Export to GGUF format
# Uncomment the quantization method you want:

# Q4_K_M - recommended
# model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")

# Q8_0 - high quality
# model.save_pretrained_gguf("model", tokenizer, quantization_method="q8_0")

# F16 - 16-bit floats, unquantized (largest file)
# model.save_pretrained_gguf("model", tokenizer, quantization_method="f16")

print("To export to GGUF, uncomment one of the lines above.")

## Summary & Comparison

### What We Did:
1. ✅ Installed unsloth
2. ✅ Loaded SmolLM2 135M model
3. ✅ Configured **LoRA parameters** (r=16, alpha=16)
4. ✅ Trained only the LoRA adapters (~3-4% of the parameters)
5. ✅ Tested the model
6. ✅ Saved tiny LoRA adapters (~10-50MB)

### Full Finetuning (Colab 1) vs LoRA (Colab 2):

| Metric | Full Finetuning | LoRA Finetuning |
|--------|----------------|------------------|
| **Parameters Updated** | 135M (100%) | ~4.9M adapter weights (~3-4%) |
| **Memory Usage** | Higher | Lower |
| **Training Speed** | Slower | Faster |
| **Saved Artifact** | Full model (hundreds of MB) | Adapters only (~10-50MB) |
| **Performance** | Baseline | Usually close to baseline; task-dependent |
| **Best For** | Critical tasks | Most tasks |

### LoRA Configuration:
```python
r = 16  # Rank - higher = more parameters
lora_alpha = 16  # Scaling factor
lora_dropout = 0  # Regularization
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]
```

### When to Use LoRA:
- ✅ Limited GPU memory
- ✅ Need fast training
- ✅ Want to share adapters easily
- ✅ Good enough performance for most tasks
- ✅ Want to maintain multiple task-specific adapters

### When to Use Full Finetuning:
- ✅ Critical tasks requiring maximum performance
- ✅ Have sufficient GPU memory and time
- ✅ Domain is very different from pre-training

### Tips for Better LoRA Results:
1. **Increase r**: Try r=32 or r=64 if the model underfits; higher rank adds capacity but also memory and training time
2. **Adjust alpha**: Usually alpha = r or alpha = 2*r
3. **More data**: Use more training samples
4. **More steps**: Increase max_steps to 500-1000
5. **Learning rate**: Experiment with 1e-4 to 5e-4