# Fine-Tuning with Unsloth

This notebook demonstrates how to fine-tune a language model using Unsloth, a library that makes fine-tuning LLMs faster and more memory-efficient.

## What is Unsloth?

Unsloth optimizes the fine-tuning process for large language models, providing:
- Up to 2x faster fine-tuning
- Lower memory usage (fits larger batch sizes)
- Support for various models (Llama, Mistral, Gemma, Phi, etc.)
- Easy integration with Hugging Face's ecosystem

## 1. Setup and Installation

If you haven't installed Unsloth yet, uncomment and run the following cell:

In [None]:
# Install Unsloth
# !pip install "unsloth[cu118] @ git+https://github.com/unslothai/unsloth.git"
# !pip install --no-deps trl peft accelerate bitsandbytes

## 2. Configuration

Set up the configuration parameters for fine-tuning. Feel free to modify these based on your needs.

In [None]:
# Import necessary libraries
import torch
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel, is_bfloat16_supported

# Model configuration
MODEL_NAME = "unsloth/llama-3-8b-bnb-4bit"  # Pre-quantized model for faster loading
MAX_SEQ_LENGTH = 2048  # Context length
LOAD_IN_4BIT = True    # Use 4-bit quantization to reduce memory usage

# LoRA configuration
LORA_RANK = 16         # Rank of LoRA matrices (higher = more capacity, more memory)
LORA_ALPHA = 16        # Scaling factor; the LoRA update is scaled by LORA_ALPHA / LORA_RANK
LORA_DROPOUT = 0       # LoRA dropout (0 is optimized)

# Training configuration
BATCH_SIZE = 1         # Batch size per device
GRADIENT_ACCUMULATION_STEPS = 4  # Accumulate gradients over multiple steps
LEARNING_RATE = 2e-4   # Learning rate
MAX_STEPS = 100        # Number of training steps; overrides NUM_EPOCHS when positive (set to -1 to train for full epochs)
NUM_EPOCHS = 1         # Number of epochs, used only when MAX_STEPS is -1
OUTPUT_DIR = "outputs" # Directory to save the model

# Prompt template for instruction fine-tuning
PROMPT_TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{response}"""
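
The batch size and gradient accumulation settings combine into an effective batch size per optimizer step. The sketch below works through that arithmetic, assuming a single GPU; the helper name and the `num_devices` parameter are illustrative and not part of Unsloth.

```python
def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of examples consumed per optimizer step."""
    return per_device_batch * accumulation_steps * num_devices

# Values mirror BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS and MAX_STEPS above.
batch = effective_batch_size(per_device_batch=1, accumulation_steps=4)
examples_seen = batch * 100

print(f"Effective batch size: {batch}")
print(f"Examples seen in 100 steps: {examples_seen}")
```

With these defaults a 100-step run touches only a few hundred examples, which is enough for a smoke test but far less than one pass over a dataset with tens of thousands of rows.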

## 3. Load Model and Tokenizer

Load the pre-trained model and tokenizer using Unsloth's optimized loading function.

In [None]:
print("Loading model and tokenizer...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # Auto-detect dtype
    load_in_4bit=LOAD_IN_4BIT,
    # token="hf_...",  # Uncomment and add your token if using gated models
)
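
To see why 4-bit loading matters, it helps to estimate the memory needed for the weights alone. This is a rough back-of-the-envelope sketch: it assumes about 8 billion parameters and ignores quantization constants, layers kept in higher precision, activations, and optimizer state. The helper name is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

params = 8e9  # assumption: roughly 8 billion parameters for an "8b" model

print(f"16-bit weights: {weight_memory_gib(params, 16):.1f} GiB")
print(f" 4-bit weights: {weight_memory_gib(params, 4):.1f} GiB")
```

The real footprint reported by the training cell will be higher than the 4-bit figure, since it also includes the LoRA adapters, gradients, and activations.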

## 4. Apply LoRA for Efficient Fine-Tuning

LoRA (Low-Rank Adaptation) freezes the pre-trained weights and trains a pair of small low-rank matrices for each targeted layer, so only a small fraction of the parameters is updated.

In [None]:
print("Applying LoRA for efficient fine-tuning...")
model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_RANK,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"],
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",  # "none" is optimized
    use_gradient_checkpointing="unsloth",  # "unsloth" uses 30% less VRAM
    random_state=42,
    use_rslora=False,  # Rank-stabilized LoRA (optional)
    loftq_config=None, # LoftQ (optional)
)
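
To make the "small, trainable matrices" concrete, the sketch below counts the parameters LoRA adds for the seven target modules. The layer shapes are assumptions taken from the published Llama-3-8B configuration (hidden size 4096, MLP size 14336, 32 layers, 1024-wide key/value projections); other models need other numbers.

```python
# Assumed Llama-3-8B shapes; adjust for other models.
HIDDEN, MLP, KV, LAYERS = 4096, 14336, 1024, 32

# (input features, output features) of each targeted linear layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_param_count(rank):
    """LoRA adds A (rank x in) and B (out x rank) for every targeted layer."""
    per_block = sum(rank * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
    return per_block * LAYERS

trainable = lora_param_count(rank=16)
print(f"Trainable LoRA parameters: {trainable:,}")
print(f"Share of an ~8B-parameter model: {trainable / 8e9:.2%}")
```

Because the count grows linearly with the rank, doubling `LORA_RANK` doubles the adapter size while the frozen base model stays unchanged.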

## 5. Load and Prepare Dataset

Load a dataset and format it for training. This example uses the cleaned Alpaca dataset (`yahma/alpaca-cleaned`), but you can replace it with your own as long as it provides `instruction` and `output` fields.

In [None]:
print("Loading dataset...")
# Option 1: Load a dataset from Hugging Face
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

# Option 2: Load your own dataset (uncomment to use)
# dataset = load_dataset("json", data_files="your_data.json", split="train")

# Display a sample from the dataset
print("\nSample from the dataset:")
print(dataset[0])
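
If you go with Option 2, the file needs the same `instruction` and `output` fields that the formatting function reads. Below is a minimal sketch that writes such a file in JSON Lines form (one object per line), which the `json` loader accepts; the records themselves are made-up placeholders.

```python
import json

# Hypothetical records; replace them with your own data.
records = [
    {"instruction": "Translate 'hello' to French.", "output": "Bonjour."},
    {"instruction": "Name a prime number greater than 10.", "output": "11"},
]

# Write one JSON object per line.
with open("your_data.json", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back and confirm every record has the expected fields.
with open("your_data.json", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

assert all({"instruction", "output"} <= set(r) for r in loaded)
print(f"Wrote {len(loaded)} records to your_data.json")
```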

In [None]:
# Function to format the prompts
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    responses = examples["output"]
    
    texts = []
    for instruction, response in zip(instructions, responses):
        # Format the text according to the template and add EOS token
        text = PROMPT_TEMPLATE.format(instruction=instruction, response=response) + tokenizer.eos_token
        texts.append(text)
    
    return {"text": texts}

# Apply the formatting function to the dataset
print("Formatting dataset...")
formatted_dataset = dataset.map(formatting_prompts_func, batched=True)

# Display a formatted example
print("\nFormatted example:")
print(formatted_dataset[0]["text"][:500] + "...")
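
The formatting function above reads only `instruction` and `output`. Alpaca-style datasets also carry an optional `input` field with extra context, which this template drops. One possible way to keep it is sketched below; `PROMPT_TEMPLATE_WITH_INPUT`, `format_example`, and the `"<eos>"` placeholder are illustrative assumptions (in the notebook you would pass `tokenizer.eos_token`).

```python
PROMPT_TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{response}"""

# Hypothetical variant used only when an example has a non-empty input.
PROMPT_TEMPLATE_WITH_INPUT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{context}

### Response:
{response}"""

def format_example(instruction, response, context="", eos_token="<eos>"):
    """Pick the template based on whether extra context is present."""
    if context.strip():
        text = PROMPT_TEMPLATE_WITH_INPUT.format(
            instruction=instruction, context=context, response=response
        )
    else:
        text = PROMPT_TEMPLATE.format(instruction=instruction, response=response)
    # The EOS token teaches the model where a response ends.
    return text + eos_token

with_context = format_example("Summarize the text.", "A short summary.", context="Some long text.")
without_context = format_example("Say hi.", "Hi!")
print(with_context)
```

If you train with the two-template variant, build inference prompts the same way so the model sees the layout it was trained on.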

## 6. Setup Trainer

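Configure the `SFTTrainer` with the model, tokenizer, and training parameters. The keyword arguments below match older TRL releases; newer releases move options such as `dataset_text_field`, `max_seq_length`, and `packing` into `SFTConfig`.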

In [None]:
print("Setting up trainer...")
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_num_proc=2,  # Number of processes for dataset preparation
    packing=False,  # Set to True for short sequences to speed up training
    args=TrainingArguments(
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        warmup_steps=5,
        max_steps=MAX_STEPS,
        num_train_epochs=NUM_EPOCHS,
        learning_rate=LEARNING_RATE,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir=OUTPUT_DIR,
    ),
)
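
The combination of `warmup_steps=5` and `lr_scheduler_type="linear"` means the learning rate ramps up for a few steps and then decays linearly to zero. The sketch below reproduces that shape with the usual linear warmup and decay rule; it is an approximation for intuition, and the values the trainer logs may be offset by a step.

```python
def linear_schedule_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=100):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

for step in (0, 1, 5, 50, 100):
    print(f"step {step:>3}: lr = {linear_schedule_lr(step):.2e}")
```

The peak equals `LEARNING_RATE` and is reached right after warmup, so with only 100 steps most of the run happens at a decaying rate.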

## 7. Train the Model

Start the training process and monitor GPU memory usage.

In [None]:
print("Starting training...")
# Print GPU information
if torch.cuda.is_available():
    gpu_stats = torch.cuda.get_device_properties(0)
    print(f"GPU: {gpu_stats.name}")
    print(f"Total GPU memory: {round(gpu_stats.total_memory / 1024 / 1024 / 1024, 2)} GB")
    
# Track memory usage
start_gpu_memory = 0
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    print(f"Starting GPU memory usage: {start_gpu_memory} GB")

In [None]:
# Train the model
trainer_stats = trainer.train()

In [None]:
# Print training statistics
print("\n===== TRAINING COMPLETE =====")
print(f"Training time: {trainer_stats.metrics['train_runtime']:.2f} seconds")
print(f"Training time: {round(trainer_stats.metrics['train_runtime']/60, 2)} minutes")

# Print memory usage
if torch.cuda.is_available():
    used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
    used_memory_for_training = round(used_memory - start_gpu_memory, 3)
    print(f"Peak memory usage: {used_memory} GB")
    print(f"Memory used for training: {used_memory_for_training} GB")

## 8. Save the Model

Save the fine-tuned model for later use.

In [None]:
print("\nSaving model...")
# Save the LoRA adapter only (small file)
model.save_pretrained(f"{OUTPUT_DIR}/lora_adapter")
tokenizer.save_pretrained(f"{OUTPUT_DIR}/lora_adapter")
print(f"LoRA adapter saved to {OUTPUT_DIR}/lora_adapter")

In [None]:
# Optional: Save the merged model (much larger file)
print("\nSaving merged model (this may take a while)...")
model.save_pretrained_merged(f"{OUTPUT_DIR}/merged_model", tokenizer, save_method="merged_16bit")
print(f"Merged model saved to {OUTPUT_DIR}/merged_model")

## 9. Test the Model

Test the fine-tuned model with some example prompts.

In [None]:
# Clean up CUDA memory
torch.cuda.empty_cache()

# Switch the model to Unsloth's optimized inference mode
FastLanguageModel.for_inference(model)

# Function to generate text
def generate_text(instruction, max_new_tokens=512):
    # Format the prompt
    prompt = PROMPT_TEMPLATE.format(instruction=instruction, response="")
    
    # Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    # Generate text
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )
    
    # Decode the generated text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Extract the response part
    response_start = generated_text.find("### Response:")
    if response_start != -1:
        response = generated_text[response_start + len("### Response:"):].strip()
    else:
        response = generated_text[len(prompt):].strip()
    
    return response

In [None]:
# Test with an example
test_instruction = "Explain the concept of fine-tuning in machine learning."
print(f"Instruction: {test_instruction}")
print("\nResponse:")
print(generate_text(test_instruction))
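
The `temperature=0.7` and `top_p=0.9` settings in `generate_text` reshape the next-token distribution before sampling. The pure-Python sketch below illustrates the idea on a toy set of logits; it is a simplified illustration rather than the exact Hugging Face implementation, and the function name is hypothetical.

```python
import math

def next_token_distribution(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: probability} after temperature and top-p filtering."""
    # Temperature: divide logits before the softmax; values below 1 sharpen it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the most likely tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the surviving tokens.
    return {i: probs[i] / mass for i in kept}

dist = next_token_distribution([2.0, 1.0, 0.5, -1.0])
for token, prob in dist.items():
    print(f"token {token}: {prob:.3f}")
```

Lowering either value makes outputs more deterministic; raising them increases variety at the cost of more off-topic completions.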

## 10. Convert to GGUF Format (Optional)

Convert the model to GGUF format for use with llama.cpp or Ollama.

In [None]:
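# Uncomment to export to GGUF format (Unsloth sets up llama.cpp on first use, which can take a while)
# model.save_pretrained_gguf(f"{OUTPUT_DIR}/gguf_model", tokenizer, quantization_method="q4_k_m")
# The exported file name can vary by Unsloth version; rename it to my_model.gguf or update the Modelfile path.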

## 11. Use with Ollama (Optional)

Create a Modelfile for use with Ollama:

In [None]:
# Create a Modelfile
modelfile_content = """FROM my_model.gguf
SYSTEM You are a helpful assistant that provides accurate and informative responses.
"""

with open("Modelfile", "w") as f:
    f.write(modelfile_content)

print("Modelfile created. To use with Ollama:")
print("1. Install Ollama from https://ollama.com")
print("2. Run: ollama create mymodel -f Modelfile")
print("3. Run: ollama run mymodel")