# Fine-tuning Llama Models with Unsloth

This notebook demonstrates how to fine-tune Llama models using Unsloth, a library that makes fine-tuning about 2x faster while using less GPU memory.

Unsloth provides optimized implementations for:
- Fast model loading
- Efficient memory usage
- Accelerated training
- Easy model saving and inference

## Step 1: Install Required Packages

Install Unsloth and other necessary dependencies.

In [None]:
# Install Unsloth and dependencies
!pip install -q unsloth
!pip install -q "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install -q --upgrade trl transformers accelerate peft
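
After installation it can help to confirm which versions ended up in the environment, since the training APIs of `trl` and `transformers` change between releases. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of Unsloth.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages installed by the cell above.
for name in ["unsloth", "trl", "transformers", "accelerate", "peft"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Recording these versions alongside your results makes a run easier to reproduce later.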

## Step 2: Import Libraries

Import all necessary libraries for the fine-tuning process.

In [None]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

## Step 3: Configure Model Parameters

Set up the model name, maximum sequence length, data type, and whether to use 4-bit quantization.

In [None]:
# Model configuration
max_seq_length = 2048  # Choose any! Unsloth auto-supports RoPE Scaling internally
dtype = None  # None for auto-detection. Use torch.float16 for Tesla T4/V100, torch.bfloat16 for Ampere+
load_in_4bit = True  # Use 4bit quantization to reduce memory usage. Can be False

# Supported models:
# "unsloth/llama-3-8b-bnb-4bit"
# "unsloth/llama-3-70b-bnb-4bit"
# "unsloth/mistral-7b-bnb-4bit"
# "unsloth/gemma-7b-bnb-4bit"
model_name = "unsloth/llama-3-8b-bnb-4bit"
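
To see why `load_in_4bit` matters, a rough back-of-the-envelope estimate of the memory needed for the weights alone can be sketched as follows. The parameter count of 8 billion is an assumption for an "8b" checkpoint, and `estimate_weight_memory_gib` is a hypothetical helper. The estimate ignores quantization constants, activations, gradients, and optimizer state, so real usage will be higher.

```python
def estimate_weight_memory_gib(num_params, bits_per_param):
    """Memory needed to hold the weights alone, in GiB (1024**3 bytes)."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1024**3


# Assumption: roughly 8 billion parameters for an "8b" checkpoint.
num_params = 8e9

mem_16bit = estimate_weight_memory_gib(num_params, 16)
mem_4bit = estimate_weight_memory_gib(num_params, 4)

print(f"16-bit weights: {mem_16bit:.1f} GiB")
print(f" 4-bit weights: {mem_4bit:.1f} GiB")
```

Under these assumptions the weights take about 14.9 GiB in 16-bit and about 3.7 GiB in 4-bit, which is the difference between not fitting and fitting on a 16 GB GPU.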

## Step 4: Load Model and Tokenizer

Use Unsloth's FastLanguageModel to load the pre-trained model and tokenizer efficiently.

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
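
Passing `dtype=None` leaves the choice of 16-bit format to the library. One plausible rule for that choice, following the comment in Step 3, is sketched below. This is an illustration only and not Unsloth's actual implementation; `pick_dtype_name` is a hypothetical helper.

```python
def pick_dtype_name(compute_capability_major):
    """Pick a 16-bit dtype name from the GPU's major compute capability.

    Assumption: bfloat16 is used on Ampere (compute capability 8.x) and newer,
    float16 on older GPUs such as the Tesla T4 (7.5) and V100 (7.0).
    """
    return "bfloat16" if compute_capability_major >= 8 else "float16"


for gpu, major in [("Tesla T4", 7), ("V100", 7), ("A100", 8)]:
    print(f"{gpu}: {pick_dtype_name(major)}")
```

On a machine with a GPU, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair that such a rule would consume.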

## Step 5: Configure LoRA Adapters

Set up LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank. Choose any number > 0. Suggested 8, 16, 32, 64, 128
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",     # Supports any, but = "none" is optimized
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,  # Support rank stabilized LoRA
    loftq_config=None,  # And LoftQ
)
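
The rank `r` and the list of `target_modules` together determine how many parameters are trained. For a linear layer with input size `d_in` and output size `d_out`, LoRA adds `r * (d_in + d_out)` parameters. The sketch below applies this to the seven targeted projections. The layer dimensions are assumptions based on the published Llama 3 8B configuration; check them against `model.config` before relying on the result.

```python
def lora_params(d_in, d_out, r):
    """LoRA adds two matrices per layer: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Assumed Llama 3 8B dimensions; verify against model.config.
hidden = 4096         # hidden_size
kv_dim = 1024         # 8 key/value heads x 128 head dim (grouped-query attention)
intermediate = 14336  # intermediate_size
num_layers = 32       # num_hidden_layers
r = 16                # LoRA rank used above

# (input dim, output dim) of each targeted projection in one decoder layer
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in shapes.values())
total = per_layer * num_layers
print(f"Trainable LoRA parameters: {total:,}")
```

Under these assumptions the adapters add about 42 million trainable parameters, roughly 0.5% of an 8-billion-parameter model. Doubling `r` doubles this count.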

## Step 6: Prepare Dataset

Load and prepare your dataset. Here we use the `yahma/alpaca-cleaned` dataset from Hugging Face as an example.

You can replace this with your own dataset in the format:
```python
{
    "instruction": "Your instruction text",
    "input": "Optional input",
    "output": "Expected output"
}
```

In [None]:
# Example: Loading a dataset from Hugging Face
# Replace with your dataset
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

# Define prompt template
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
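    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Append EOS so the model learns where a response ends
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    # formatting_func should return the formatted strings themselves, not a dict
    return texts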
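
A formatting function of this kind receives a batch as a dict of columns (each value is a list), not as a list of rows. The self-contained sketch below shows the transformation on a two-row toy batch. The shortened template, the `<EOS>` placeholder, and the name `format_batch` are illustrative stand-ins; in the notebook the EOS string comes from `tokenizer.eos_token`.

```python
PROMPT_TEMPLATE = """### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Placeholder; in the notebook this comes from tokenizer.eos_token.
EOS = "<EOS>"


def format_batch(batch):
    """Turn a dict of columns into one training string per row."""
    rows = zip(batch["instruction"], batch["input"], batch["output"])
    return [PROMPT_TEMPLATE.format(ins, inp, out) + EOS for ins, inp, out in rows]


batch = {
    "instruction": ["Add the numbers.", "Name a primary color."],
    "input": ["2 and 3", ""],
    "output": ["5", "Red"],
}

texts = format_batch(batch)
print(texts[0])
```

Each string ends with the EOS marker. Without it, the model has no training signal for where a response should stop.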

## Step 7: Configure Training Arguments

Set up the training parameters, including batch size, learning rate, and the number of training steps.

In [None]:
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    max_steps=60,  # Set this to -1 to train for full epochs
    # num_train_epochs=1,  # Alternatively, set number of epochs
    learning_rate=2e-4,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    output_dir="outputs",
)
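
Two consequences of these settings are worth working out: how many examples the run actually sees, and how the learning rate moves under `lr_scheduler_type="linear"` with `warmup_steps=5`. The sketch below assumes a single GPU and models the schedule as linear warmup followed by linear decay to zero; `linear_schedule_lr` is a hypothetical helper, not a library function.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


per_device_batch = 2
grad_accum = 4
num_gpus = 1  # assumption: single-GPU run
max_steps = 60

effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps
print(f"Effective batch size: {effective_batch}")
print(f"Examples seen in {max_steps} steps: {examples_seen}")

for step in (0, 5, 30, 60):
    lr = linear_schedule_lr(step, 2e-4, warmup_steps=5, total_steps=max_steps)
    print(f"step {step:>2}: lr = {lr:.2e}")
```

With an effective batch size of 8, 60 steps cover only 480 examples, so this configuration is a quick demonstration rather than a full pass over a dataset with tens of thousands of rows.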

## Step 8: Initialize Trainer

Create the SFTTrainer with the model, dataset, and training arguments.

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
    args=training_args,
    packing=False,
    max_seq_length=max_seq_length,
)
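
With `packing=False`, every example is tokenized as its own sequence. Packing instead concatenates several short examples into fixed-length blocks so that fewer tokens are wasted on padding. The toy sketch below illustrates the idea on hand-made token ids. It is not TRL's actual implementation, and `pack_sequences` is a hypothetical name.

```python
def pack_sequences(tokenized, block_size):
    """Concatenate token lists and cut the stream into blocks of `block_size`.

    The trailing partial block is dropped in this sketch.
    """
    stream = [tok for seq in tokenized for tok in seq]
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Toy token ids; 0 stands in for the EOS token that separates examples.
tokenized = [[11, 12, 0], [21, 22, 23, 24, 0], [31, 0]]

blocks = pack_sequences(tokenized, block_size=4)
print(blocks)
```

Note that a block can contain the end of one example and the start of the next, which is why the EOS token between examples matters when packing is enabled.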

## Step 9: Start Training

Begin the fine-tuning process. This may take some time depending on your hardware and dataset size.

In [None]:
# Show current memory usage
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

# Train the model
trainer_stats = trainer.train()

## Step 10: Display Training Statistics

Show memory usage and training time after fine-tuning.

In [None]:
# Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime'] / 60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
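
The statistics above divide byte counts by 1024 three times, so the printed values are technically GiB (binary gigabytes) even though they are labeled GB. The sketch below shows the difference and the percentage arithmetic. The byte counts are illustrative numbers, not measurements, and the helper names are hypothetical.

```python
def bytes_to_gib(n_bytes):
    """Binary gigabytes: 1 GiB = 1024**3 bytes."""
    return n_bytes / 1024**3


def bytes_to_gb(n_bytes):
    """Decimal gigabytes: 1 GB = 1000**3 bytes."""
    return n_bytes / 1000**3


def percent_of(part, whole):
    """Share of `whole` taken by `part`, as a percentage rounded to 3 decimals."""
    return round(part / whole * 100, 3)


# Illustrative numbers only, not measurements.
total_bytes = 16 * 1024**3
peak_bytes = 6 * 1024**3

print(f"Total: {bytes_to_gib(total_bytes):.2f} GiB = {bytes_to_gb(total_bytes):.2f} GB")
print(f"Peak share: {percent_of(peak_bytes, total_bytes)} %")
```

The roughly 7% gap between the two units is worth keeping in mind when comparing these numbers with a GPU's advertised capacity.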

## Step 11: Test the Model (Inference)

Test the fine-tuned model with a sample prompt.

In [None]:
# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Test prompt
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Explain the concept of machine learning",  # instruction
            "",  # input
            "",  # output - leave this blank for generation!
        )
    ],
    return_tensors="pt"
).to("cuda")

# Generate response
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    use_cache=True
)

# Decode and print the output
decoded_output = tokenizer.batch_decode(outputs)
print(decoded_output[0])
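
The decoded string contains the full prompt followed by the generated text, and may include special tokens. A small post-processing step can isolate the answer. The sketch below is one way to do this with plain string operations; `extract_response` is a hypothetical helper, `<EOS>` is a placeholder for the tokenizer's EOS string, and the sample text is hand-written rather than real model output.

```python
def extract_response(decoded, eos_token, marker="### Response:"):
    """Return the text after the last response marker, without the EOS token."""
    _, _, tail = decoded.rpartition(marker)
    return tail.replace(eos_token, "").strip()


# Hand-written stand-in for a decoded generation, not real model output.
decoded = (
    "### Instruction:\nExplain the concept of machine learning\n\n"
    "### Input:\n\n\n"
    "### Response:\nMachine learning lets programs learn patterns from data.<EOS>"
)

print(extract_response(decoded, eos_token="<EOS>"))
```

Passing `skip_special_tokens=True` to `tokenizer.batch_decode` is an alternative way to drop special tokens, though the prompt text would still need to be removed separately.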

## Step 12: Save the Model

Save the fine-tuned model locally and optionally push to Hugging Face Hub.

In [None]:
# Save locally
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")

# Save to 16bit for vLLM
# model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")

# Save to 4bit for later use
# model.save_pretrained_merged("model", tokenizer, save_method="merged_4bit")

# Save to GGUF format for llama.cpp
# model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")

# Push to Hugging Face Hub (uncomment and replace with your repo name)
# model.push_to_hub("your_username/your_model_name", token="your_hf_token")
# tokenizer.push_to_hub("your_username/your_model_name", token="your_hf_token")
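
Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. A common alternative is to read it from an environment variable. The sketch below assumes the token is stored in `HF_TOKEN`; `get_hf_token` is a hypothetical helper name.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face token from the environment, or raise if it is unset."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before pushing.")
    return token


# Usage (commented out so the sketch runs without credentials):
# model.push_to_hub("your_username/your_model_name", token=get_hf_token())
```

Failing early with a clear message is preferable to discovering a missing token only after a long training run has finished.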

## Conclusion

You have successfully fine-tuned a Llama model using Unsloth!

Key advantages of using Unsloth:
- **2x faster training** compared to standard methods
- **Reduced memory usage** with optimizations
- **Easy-to-use API** for quick setup
- **Multiple export formats** (16bit, 4bit, GGUF)

Next steps:
1. Experiment with different hyperparameters
2. Try different datasets
3. Test with various model sizes
4. Deploy your model for inference