# Colab 1: Full Finetuning with SmolLM2 135M
## Complete Fine-tuning (Not LoRA) using Unsloth

This notebook demonstrates **full finetuning** of the SmolLM2 135M parameter model using Unsloth.

### What is Full Finetuning?
- **Full Finetuning**: Updates ALL model parameters during training
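- **More Memory Intensive**: Requires more VRAM than LoRA, because gradients and optimizer state are kept for every weight
- **Potentially Better Performance**: Can achieve better results on specific tasks when enough data is available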
- **Use Case**: Chat/conversation fine-tuning
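
To make the memory cost concrete, here is a rough back-of-the-envelope sketch. It assumes fp32 weights and gradients (4 bytes each) and an Adam-style optimizer that keeps two fp32 values per parameter; these byte counts are illustrative assumptions, not measurements. Activation memory, which depends on batch size and sequence length, is ignored.

```python
def training_memory_gb(num_params, weight_bytes=4, grad_bytes=4, optim_bytes=8):
    """Rough memory estimate (GiB) for weights, gradients and optimizer state."""
    total_bytes = num_params * (weight_bytes + grad_bytes + optim_bytes)
    return total_bytes / 1024**3


NUM_PARAMS = 135_000_000  # nominal size of SmolLM2 135M

# Assumption: plain AdamW keeps two fp32 states per parameter (8 bytes).
full_fp32_adam = training_memory_gb(NUM_PARAMS)
# Assumption: an 8-bit optimizer keeps two 1-byte states per parameter (2 bytes).
full_8bit_adam = training_memory_gb(NUM_PARAMS, optim_bytes=2)

print(f"fp32 weights + grads + AdamW states: {full_fp32_adam:.2f} GiB")
print(f"fp32 weights + grads + 8-bit AdamW:  {full_8bit_adam:.2f} GiB")
```

Even under these assumptions the total stays small for a 135M model, which is why full finetuning is practical here on a single Colab GPU.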

### Key Points:
- Model: `unsloth/SmolLM2-135M-Instruct` (smallest model for faster training)
- Dataset: We'll use a chat/instruction dataset
- Method: Full parameter training (not LoRA)
- Template: Chat template for instruction following

## Step 1: Install Unsloth
Unsloth advertises up to 2x faster training and up to 80% less memory usage than standard approaches; actual gains depend on the model, the hardware, and the training mode.

**Important**: We need to use `datasets==4.3.0` to avoid recursion errors with Unsloth.

In [None]:
%%capture
# Install unsloth with all dependencies
!pip install unsloth
# Install compatible version of datasets (4.3.0 to avoid recursion errors)
!pip install datasets==4.3.0
# Also install other useful libraries
!pip install --upgrade transformers accelerate

## Step 2: Import Required Libraries

In [2]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments, TextStreamer
from unsloth import is_bfloat16_supported

print("All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

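**Note**: The import fails with `NotImplementedError: Unsloth cannot find any torch accelerator? You need a GPU.` when no GPU is attached. In Colab, switch the runtime to a GPU (for example a T4) and rerun the cell.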

## Step 3: Configure Model Parameters

### Important Parameters Explained:
- **max_seq_length**: Maximum sequence length (2048 tokens)
- **dtype**: Data type (None = auto-detect, can be float16 or bfloat16)
- **load_in_4bit**: 4-bit quantization saves memory but is meant for QLoRA; quantized weights cannot be updated directly, so full finetuning needs `False`

### Note about Full Finetuning:
- **r, lora_alpha, lora_dropout**: NOT USED in full finetuning (only for LoRA)
- Full finetuning doesn't use `get_peft_model()` at all; in Unsloth it is requested with `full_finetuning=True` when loading the model
- We'll enable gradient checkpointing instead for memory efficiency

In [None]:
# Model configuration
max_seq_length = 2048  # Can be increased for longer sequences
dtype = None  # Auto-detect. Use float16 for Tesla T4/V100, bfloat16 for Ampere+
load_in_4bit = False  # Must be False for full finetuning; 4-bit weights are frozen (QLoRA only)

# Model name - using SmolLM2 135M (smallest for fastest training)
model_name = "unsloth/SmolLM2-135M-Instruct"

print(f"Configuration:")
print(f"  Model: {model_name}")
print(f"  Max Sequence Length: {max_seq_length}")
print(f"  4-bit Quantization: {load_in_4bit}")
print(f"  Training Mode: FULL FINETUNING (not LoRA)")

## Step 4: Load the Model and Tokenizer

We load the SmolLM2 135M model and its tokenizer. For full finetuning the weights must not be 4-bit quantized, since quantized weights cannot be updated directly.

In [None]:
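model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,  # Should be False: quantized weights cannot be trained directly
    full_finetuning=True,  # Ask Unsloth to keep all weights trainable (needs a recent Unsloth release)
)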

print("Model loaded successfully!")
print(f"Model type: {type(model)}")
print(f"Tokenizer vocab size: {len(tokenizer)}")

## Step 5: Configure for FULL FINETUNING

### CRITICAL: For Full Finetuning, Skip get_peft_model()!

**What happens here:**
- We do **NOT** call `get_peft_model()` - that's only for LoRA!
- For full finetuning, we use the model directly
- We enable gradient checkpointing for memory efficiency
- All 135M parameters will be updated during training

**Difference:**
- **LoRA**: Call `get_peft_model()` with `r=16` or higher
- **Full Finetuning**: Skip `get_peft_model()`, use model directly

In [None]:
# For FULL finetuning, we DON'T call get_peft_model()
# We use the model directly and enable gradient checkpointing for memory efficiency

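# Enable gradient checkpointing (trades extra compute for lower activation memory)
model.gradient_checkpointing_enable()

# Verify that the weights are actually trainable instead of assuming it
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print("Model configured for FULL FINETUNING!")
print(f"Trainable parameters: {trainable:,} of {total:,}")
print("Gradient checkpointing enabled for memory efficiency.")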

## Step 6: Prepare the Dataset

### Dataset Format:
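We'll use a simple instruction-following dataset. Each Alpaca record has three fields, where `input` is optional context and may be empty:
```
instruction: <task description>
input: <optional context>
output: <response>
```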

### Chat Template:
SmolLM2-Instruct expects the ChatML chat format:
```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
```
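
The template above can be rendered from a list of role/content messages. The following is a minimal sketch in plain Python; `render_chatml` is a hypothetical helper written for illustration, and its layout is copied from the template shown above rather than taken from the tokenizer.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {'role', 'content'} dicts in the ChatML layout shown above."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>"
        for message in messages
    ]
    text = "\n".join(parts)
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference time
        text += "\n<|im_start|>assistant\n"
    return text


example = render_chatml(
    [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4"},
    ]
)
print(example)
```

In practice, the tokenizer's own `apply_chat_template(messages, tokenize=False)` is the safer choice, since it uses the template shipped with the model; its exact whitespace or default system prompt may differ from this sketch.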

In [None]:
# Load a simple instruction dataset
# Using a subset of the Alpaca dataset for chat/instruction following
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1000]")  # Using 1000 samples for quick training

print(f"Dataset loaded: {len(dataset)} samples")
print(f"\nSample data point:")
print(dataset[0])

## Step 7: Format Dataset with Chat Template

We need to convert the dataset into the proper chat format that SmolLM2 expects.

In [None]:
# Define the chat template formatting function
chat_template = """<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
{instruction}{input}<|im_end|>
<|im_start|>assistant
{output}<|im_end|>"""

EOS_TOKEN = tokenizer.eos_token  # Must add EOS token

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Combine instruction and input
        input_section = f"\n\nInput: {input_text}" if input_text else ""
        
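        # Format with chat template
        text = chat_template.format(
            instruction=instruction,
            input=input_section,
            output=output
        )
        # Append EOS only if the text does not already end with it
        # (for ChatML tokenizers the EOS token is often <|im_end|> itself)
        if not text.endswith(EOS_TOKEN):
            text += EOS_TOKEN
        texts.append(text)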
    
    return {"text": texts}

# Apply formatting to dataset
dataset = dataset.map(formatting_prompts_func, batched=True)

print("Dataset formatted successfully!")
print(f"\nExample formatted text:")
print(dataset[0]["text"][:500] + "...")

## Step 8: Configure Training Arguments

### Training Parameters Explained:
- **per_device_train_batch_size**: Number of samples per batch (2 for memory efficiency)
- **gradient_accumulation_steps**: Accumulate gradients over 4 steps (effective batch size = 2*4=8)
- **warmup_steps**: Gradually increase learning rate over 5 steps
- **max_steps**: Total training steps (60 for quick demo, increase for better results)
- **learning_rate**: 2e-4, a common LoRA default; full finetuning usually works better with a lower rate (for example 2e-5), so reduce it if the loss is unstable
- **fp16/bf16**: Use mixed precision training
- **logging_steps**: Log every step
- **optim**: Use AdamW 8-bit optimizer to save memory
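
To see how `warmup_steps`, `max_steps` and the linear scheduler interact, here is a small sketch of the usual "linear warmup, then linear decay to zero" shape. The function `linear_schedule_lr` is written for illustration; the Trainer's per-step values may differ slightly depending on how steps are indexed.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=60):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


for step in (0, 1, 5, 30, 60):
    print(f"step {step:>2}: lr = {linear_schedule_lr(step):.2e}")
```

With the values used in this notebook, the rate peaks at step 5 and falls steadily afterwards, so most of the 60 steps run below the nominal learning rate.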

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,  # Setting True can speed up training on short sequences
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,  # Increase for better results (e.g., 500-1000)
        learning_rate=2e-4,  # Common LoRA value; full finetuning often needs a lower rate (e.g., 2e-5)
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",  # Disable wandb logging
    ),
)

print("Trainer configured successfully!")
print(f"\nTraining Configuration:")
print(f"  Batch size: 2")
print(f"  Gradient accumulation: 4")
print(f"  Effective batch size: 8")
print(f"  Max steps: 60")
print(f"  Learning rate: 2e-4")
print(f"  Training mode: FULL FINETUNING")

## Step 9: Start Training!

This will train ALL 135M parameters of the model.
Training time: ~5-10 minutes on a T4 GPU.
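
It helps to know how much of the dataset this run actually covers. The sketch below does the arithmetic for the settings in this notebook, assuming a single GPU and `packing=False` (one sample per sequence); `training_coverage` is a helper written for illustration.

```python
import math


def training_coverage(num_samples, per_device_batch, grad_accum, max_steps, num_devices=1):
    """Return effective batch size, samples seen, epochs covered and steps per epoch."""
    effective_batch = per_device_batch * grad_accum * num_devices
    samples_seen = effective_batch * max_steps
    epochs = samples_seen / num_samples
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, samples_seen, epochs, steps_per_epoch


effective_batch, samples_seen, epochs, steps_per_epoch = training_coverage(
    num_samples=1000, per_device_batch=2, grad_accum=4, max_steps=60
)
print(f"Effective batch size: {effective_batch}")
print(f"Samples seen in {60} steps: {samples_seen} ({epochs:.2f} epochs)")
print(f"Steps needed for one full epoch: {steps_per_epoch}")
```

With 60 steps the model sees fewer than half of the 1000 samples, so this run is a smoke test rather than a complete pass over the data.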

In [None]:
# Show GPU memory before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

# Start training
print("\nStarting full finetuning...")
trainer_stats = trainer.train()

# Show GPU memory after training
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_training = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
training_time = trainer_stats.metrics['train_runtime']

print(f"\n{'='*60}")
print(f"Training completed successfully!")
print(f"{'='*60}")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Memory used for training = {used_memory_for_training} GB.")
print(f"Percentage of max memory used = {used_percentage}%")
print(f"Training time = {training_time:.2f} seconds")
print(f"{'='*60}")

## Step 10: Test the Fine-tuned Model (Inference)

Let's test our fully fine-tuned model with some sample prompts!
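
The generation call in this step uses `temperature` and `top_p`. The sketch below illustrates what those two settings do to a next-token distribution, using made-up logits for a four-token vocabulary. It is a simplified illustration, not the library's implementation. Note that in Hugging Face `generate`, these settings only matter when sampling is enabled with `do_sample=True`.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Indices of the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for idx in ranked:
        kept.append(idx)
        cumulative += probs[idx]
        if cumulative >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.5, -1.0]  # made-up values for illustration
probs = softmax_with_temperature(logits, temperature=0.7)
kept = top_p_filter(probs, top_p=0.9)
print("Probabilities:", [round(p, 3) for p in probs])
print("Token indices kept by top-p:", kept)
```

Sampling then draws the next token only from the kept indices, after renormalizing their probabilities.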

In [None]:
# Enable inference mode
FastLanguageModel.for_inference(model)

# Test prompts
test_prompts = [
    "Explain what machine learning is in simple terms.",
    "Write a Python function to calculate the factorial of a number.",
    "What are the benefits of exercise?"
]

print("Testing the fine-tuned model...\n")

for i, instruction in enumerate(test_prompts, 1):
    # Format input with chat template
    prompt = f"""<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
"""
    
    # Tokenize
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    
    # Generate
    print(f"{'='*60}")
    print(f"Test {i}: {instruction}")
    print(f"{'='*60}")
    
    text_streamer = TextStreamer(tokenizer, skip_prompt=True)
    output = model.generate(
        **inputs,
        streamer=text_streamer,
        max_new_tokens=128,
        use_cache=True,
        do_sample=True,  # Required for temperature/top_p to take effect
        temperature=0.7,
        top_p=0.9,
    )
    print(f"\n")

## Step 11: Save the Model

### Save Options:
1. **Local saving**: Save to disk
2. **GGUF format**: For Ollama/llama.cpp
3. **Push to Hugging Face**: Share with the community

In [None]:
# Save locally
model.save_pretrained("smollm2_135m_full_finetuned")
tokenizer.save_pretrained("smollm2_135m_full_finetuned")

print("Model saved locally to 'smollm2_135m_full_finetuned/'")
print("\nYou can now:")
print("1. Use it for inference")
print("2. Continue training from this checkpoint")
print("3. Export to GGUF for Ollama")
print("4. Push to Hugging Face Hub")

## Step 12: (Optional) Save in GGUF Format for Ollama

GGUF format allows you to run the model locally with Ollama.
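
As a rough guide to the size trade-off between quantization methods, the sketch below estimates file sizes from bits per weight. The value for `f16` is 16 bits and `q8_0` stores 8.5 bits per weight; the `q4_k_m` figure is an approximate average and is an assumption here. Real files also contain metadata and may keep some tensors at higher precision, so actual sizes will differ, especially for a model this small.

```python
# Approximate bits per weight for each method (q4_k_m is an assumed average)
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,
    "q4_k_m": 4.85,
}


def gguf_size_mb(num_params, bits_per_weight):
    """Estimated weight storage in megabytes (1 MB = 1e6 bytes), ignoring metadata."""
    return num_params * bits_per_weight / 8 / 1e6


NUM_PARAMS = 135_000_000  # nominal size of SmolLM2 135M
sizes = {name: gguf_size_mb(NUM_PARAMS, bits) for name, bits in BITS_PER_WEIGHT.items()}

for name, size in sizes.items():
    print(f"{name:>7}: ~{size:.0f} MB")
```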

In [None]:
# Save in different quantization formats
# Uncomment the quantization method you want:

# Q4_K_M - recommended for most use cases (good balance)
# model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")

# Q8_0 - high quality, larger file size
# model.save_pretrained_gguf("model", tokenizer, quantization_method="q8_0")

# F16 - unquantized 16-bit weights, largest file size
# model.save_pretrained_gguf("model", tokenizer, quantization_method="f16")

print("To save in GGUF format, uncomment one of the lines above and run the cell.")

## Summary & Key Takeaways

### What We Did:
1. ✅ Installed unsloth and dependencies
2. ✅ Loaded the SmolLM2 135M model and tokenizer
3. ✅ Configured for **FULL FINETUNING** (not LoRA)
4. ✅ Prepared and formatted Alpaca dataset with chat template
5. ✅ Trained ALL 135M parameters
6. ✅ Tested the model with inference
7. ✅ Saved the model for future use

### Full Finetuning vs LoRA:
| Aspect | Full Finetuning | LoRA |
|--------|----------------|------|
| **get_peft_model()** | ❌ Not used | ✅ Used (e.g., r=16) |
| **Parameters Updated** | ALL (135M) | Small adapter matrices (a few million at most) |
| **Memory Usage** | Higher | Lower |
| **Training Time** | Longer | Shorter |
| **Performance** | Can be better with enough data | Often comparable |
| **Use Case** | Large domain or behavior shifts | Quick, low-cost adaptation |
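
The "Parameters Updated" row can be checked with simple arithmetic: a LoRA adapter of rank `r` on a `d_in x d_out` weight adds `r * (d_in + d_out)` parameters. The sketch below applies this to all attention and MLP projections. The layer dimensions are assumptions about the SmolLM2 135M configuration (hidden size 576, key/value width 192, MLP width 1536, 30 layers); check the model's `config.json` before relying on them.

```python
def lora_params_per_matrix(d_in, d_out, rank):
    """A rank-r adapter adds an (d_in x r) and an (r x d_out) matrix."""
    return rank * (d_in + d_out)


# Assumed SmolLM2 135M dimensions (not taken from this notebook)
HIDDEN, KV_DIM, MLP_DIM, LAYERS = 576, 192, 1536, 30

TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_total(rank, targets=tuple(TARGET_SHAPES)):
    """Total adapter parameters across all layers for the chosen target modules."""
    per_layer = sum(lora_params_per_matrix(*TARGET_SHAPES[name], rank) for name in targets)
    return LAYERS * per_layer


print(f"r=16, all projections: {lora_total(16):,} parameters")
print(f"r=16, q_proj and v_proj only: {lora_total(16, ('q_proj', 'v_proj')):,} parameters")
```

Under these assumptions the adapter stays in the low millions, a small fraction of the 135M weights that full finetuning updates.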

### Dataset Format:
```
<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{response}<|im_end|>
```

### Tips for Better Results:
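1. **Increase max_steps**: 60 steps at an effective batch size of 8 cover fewer than half of the 1000 samples; use 500-1000 steps for production
2. **More data**: Use more training samples
3. **Adjust learning rate**: The 1e-4 to 5e-4 range is typical for LoRA; for full finetuning also try lower rates such as 1e-5 to 5e-5
4. **Enable packing**: Set `packing=True` when most samples are much shorter than `max_seq_length`
5. **Monitor loss**: Watch the training loss and, if possible, a held-out validation loss; training loss alone does not reveal overfitting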
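
For the last tip, per-step losses are noisy when `logging_steps=1`, so a moving average is easier to read. The sketch below smooths a Trainer-style log history, which is a list of dicts where training entries carry `step` and `loss` keys. The `fake_history` data is made up for illustration; `smoothed_losses` is a helper written for this example.

```python
def smoothed_losses(log_history, window=5):
    """Moving average over the 'loss' entries of a Trainer-style log history."""
    losses = [(entry["step"], entry["loss"]) for entry in log_history if "loss" in entry]
    smoothed = []
    for i, (step, _) in enumerate(losses):
        recent = [loss for _, loss in losses[max(0, i - window + 1): i + 1]]
        smoothed.append((step, sum(recent) / len(recent)))
    return smoothed


# Made-up log entries; the last one mimics the summary entry without a 'loss' key
fake_history = [
    {"step": 1, "loss": 2.0},
    {"step": 2, "loss": 1.0},
    {"step": 3, "loss": 3.0},
    {"train_runtime": 12.3, "step": 3},
]

for step, value in smoothed_losses(fake_history, window=2):
    print(f"step {step}: smoothed loss = {value:.2f}")
```

After `trainer.train()` has finished, the same function can be applied to `trainer.state.log_history`.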