# 🚀 ModelOps Fine-Tuning on Google Colab

**GPU-Accelerated LLM Fine-Tuning with QLoRA**

This notebook replicates your ModelOps platform's fine-tuning workflow, optimized for Google Colab's GPU resources.

## Features
- ✅ GPU acceleration (T4/A100)
- ✅ QLoRA fine-tuning
- ✅ Automatic model download
- ✅ Local model saving for download
- ✅ Compatible with your ModelOps app

## Setup
1. Connect to GPU runtime (Runtime → Change runtime type → GPU)
2. Run cells in order
3. Download your fine-tuned model at the end

## 📦 Install Dependencies

Install the same dependencies as your ModelOps platform.

In [None]:
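# Install PyTorch with CUDA support
# PyTorch 2.6.0 wheels are published for CUDA 11.8, 12.4, and 12.6 (not 12.1),
# so the wheel index must be cu124 rather than cu121 for this pin to resolve.
!pip install torch==2.6.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124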

# Install ML frameworks
!pip install transformers==4.36.0
!pip install peft==0.7.0
!pip install bitsandbytes==0.41.0
!pip install trl==0.7.0
!pip install datasets==2.15.0
!pip install accelerate==0.25.0
!pip install scipy
!pip install wandb  # Optional: for logging

# Install quantization tools
!pip install autoawq==0.1.6

# Other utilities
!pip install python-dotenv
!pip install huggingface-hub

print("✅ All dependencies installed!")
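
Because the cell above pins specific versions, it can help to confirm what actually got installed before moving on. The following is a minimal sketch using only the standard library; the helper name `report_versions` and the chosen package list are illustrative and not part of the ModelOps platform.

```python
from importlib import metadata


def report_versions(packages):
    """Return {package: installed version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names mirror the pip installs above.
installed = report_versions(
    ["torch", "transformers", "peft", "bitsandbytes", "trl", "datasets", "accelerate"]
)
for name, version in installed.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If a version differs from the pin, restarting the Colab runtime after installation is usually the first thing to try.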

## 🔑 Setup Hugging Face (Optional)

If using gated models, add your Hugging Face token.

In [None]:
from huggingface_hub import login

# Optional: Login to Hugging Face for gated models
# hf_token = "your_huggingface_token_here"
# login(hf_token)

print("Hugging Face setup complete (login if needed)")
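
Rather than pasting a token into the notebook, one option is to read it from an environment variable and log in only when it is present. This is a sketch under assumptions: the helper `login_if_token` is hypothetical, the variable name `HF_TOKEN` is a choice made here, and the login function is passed in so the helper can be exercised without network access.

```python
import os


def login_if_token(login_fn, env=None, var_name="HF_TOKEN"):
    """Call login_fn(token=...) if var_name holds a non-empty value.

    Returns True when a login was attempted, False otherwise.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        return False
    login_fn(token=token)
    return True


# Demonstration with a stand-in login function and an explicit environment
# mapping; "hf_example" is a placeholder, not a real token.
attempted = login_if_token(
    lambda token: print("would log in with a token of length", len(token)),
    env={"HF_TOKEN": "hf_example"},
)
print("login attempted:", attempted)
```

In the notebook itself, the call would be `login_if_token(login)`, using the `login` function imported from `huggingface_hub` in the cell above.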

## ⚙️ Configuration

Configure your fine-tuning parameters. You can copy these from your ModelOps app.

In [None]:
# Fine-tuning configuration
config = {
    # Model settings
    "base_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # or your chosen model
    "output_dir": "./fine_tuned_model",
    
    # Dataset settings
    "dataset_name": "timdettmers/openassistant-guanaco",  # or upload your own
    "text_column": "text",
    
    # LoRA settings
    "lora_rank": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.1,
    "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"],
    
    # Training settings
    "num_epochs": 3,
    "batch_size": 2,
    "gradient_accumulation_steps": 8,
    "learning_rate": 2e-4,
    "max_seq_length": 512,
    "logging_steps": 10,
    "save_steps": 50,
    "evaluation_strategy": "steps",
    "eval_steps": 50,
    
    # Memory optimization
    "load_in_4bit": True,
    "bnb_4bit_compute_dtype": "float16",
    "bnb_4bit_quant_type": "nf4",
    
    # Experiment tracking
    "experiment_name": "colab_finetune"
}

print("Configuration loaded:")
for key, value in config.items():
    print(f"  {key}: {value}")
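
The `batch_size` and `gradient_accumulation_steps` values interact: the optimizer only updates once per accumulated group of batches, and `save_steps` and `eval_steps` are counted in those updates. The sketch below estimates the schedule for a single GPU. The helper `training_schedule` is hypothetical, the example count is made up, and the rounding only approximates how the Hugging Face `Trainer` counts steps, so exact numbers may differ by version.

```python
import math


def training_schedule(num_examples, batch_size, grad_accum_steps, num_epochs):
    """Estimate optimizer updates for a single-device run."""
    effective_batch = batch_size * grad_accum_steps
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer update per full group of accumulated batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return {
        "effective_batch_size": effective_batch,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": updates_per_epoch * num_epochs,
    }


# 8,100 training examples is a made-up figure for illustration.
plan = training_schedule(num_examples=8100, batch_size=2, grad_accum_steps=8, num_epochs=3)
print(plan)
```

Raising `gradient_accumulation_steps` increases the effective batch size without increasing GPU memory use, at the cost of fewer optimizer updates per epoch.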

## 📊 Load Dataset

Download or upload your training dataset.

In [None]:
from datasets import load_dataset
import pandas as pd

print("Loading dataset...")

# Load from Hugging Face Hub
if "dataset_name" in config and config["dataset_name"]:
    dataset = load_dataset(config["dataset_name"], split="train")
    print(f"✅ Loaded {len(dataset)} examples from {config['dataset_name']}")
    
    # Show sample
    print("\nSample data:")
    for i in range(min(3, len(dataset))):
        print(f"Example {i+1}: {dataset[i][config['text_column']][:200]}...")

# Alternative: Upload your own CSV file
# from datasets import Dataset
# from google.colab import files
# uploaded = files.upload()
# df = pd.read_csv(list(uploaded.keys())[0])
# dataset = Dataset.from_pandas(df)

print("Dataset ready!")

## 🤖 Load Base Model

Load the base model with 4-bit quantization for memory efficiency.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

print("Loading base model...")

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=config["load_in_4bit"],
    bnb_4bit_compute_dtype=getattr(torch, config["bnb_4bit_compute_dtype"]),
    bnb_4bit_quant_type=config["bnb_4bit_quant_type"],
    bnb_4bit_use_double_quant=True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    config["base_model"],
    trust_remote_code=True,
    padding_side="left",
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load model
model = AutoModelForCausalLM.from_pretrained(
    config["base_model"],
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

print(f"✅ Model loaded: {config['base_model']}")
print(f"📊 Model size: {model.num_parameters()/1e6:.1f}M parameters")
print(f"🖥️ Device: {model.device}")
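
To see why 4-bit loading matters on a Colab GPU, a back-of-the-envelope estimate of weight storage is enough. This sketch assumes roughly 1.1 billion parameters (taken from the model name) and ignores quantization constants, layers kept in higher precision, activations, and optimizer state, so treat the results as a lower bound. The helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 1.1e9  # rough parameter count implied by the model name
fp16_gb = weight_memory_gb(num_params, 16)
nf4_gb = weight_memory_gb(num_params, 4)

print(f"fp16 weights:  ~{fp16_gb:.2f} GB")
print(f"4-bit weights: ~{nf4_gb:.2f} GB (before quantization constants and activations)")
```

The same arithmetic applies to larger base models: swapping in a different `num_params` gives a quick check of whether the weights alone fit on the selected GPU.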

## 🔧 Apply LoRA Configuration

Configure the model for parameter-efficient fine-tuning.

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

print("Applying LoRA configuration...")

# Prepare model for training
model = prepare_model_for_kbit_training(model)

# Configure LoRA
lora_config = LoraConfig(
    r=config["lora_rank"],
    lora_alpha=config["lora_alpha"],
    target_modules=config["target_modules"],
    lora_dropout=config["lora_dropout"],
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA
model = get_peft_model(model, lora_config)

print("✅ LoRA applied")
print(f"📊 Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad)/1e6:.1f}M")
print(f"📊 Total parameters: {sum(p.numel() for p in model.parameters())/1e6:.1f}M")

# Print LoRA info
model.print_trainable_parameters()
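
The trainable-parameter count printed above can be sanity-checked by hand. For each targeted linear layer, LoRA adds two matrices of shapes `(r, in_features)` and `(out_features, r)`, so `r * (in_features + out_features)` parameters. The sketch below applies that formula; the projection shapes and the layer count are assumptions chosen for illustration and should be checked against `model.config` for the model actually loaded.

```python
def lora_param_count(layer_shapes, rank, num_layers):
    """Count LoRA parameters: rank * (in + out) per target module per layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes.values())
    return per_layer * num_layers


# Assumed (in_features, out_features) for each target module; these are
# illustrative values, not read from the loaded model.
assumed_shapes = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
}

total = lora_param_count(assumed_shapes, rank=8, num_layers=22)
print(f"Estimated LoRA parameters: {total:,}")

# The adapter output is scaled by lora_alpha / r.
print(f"LoRA scaling factor: {16 / 8}")
```

If the estimate and the output of `model.print_trainable_parameters()` disagree, the assumed shapes or layer count are the likely cause.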

## 🎯 Prepare Dataset for Training

Tokenize and format the dataset for training.

In [None]:
def tokenize_function(examples):
    return tokenizer(
        examples[config["text_column"]],
        truncation=True,
        padding="max_length",
        max_length=config["max_seq_length"],
    )

print("Tokenizing dataset...")

# Tokenize dataset
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset.column_names
)

# Split into train/eval
split_dataset = tokenized_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split_dataset["train"]
eval_dataset = split_dataset["test"]

print("✅ Dataset prepared")
print(f"📊 Train samples: {len(train_dataset)}")
print(f"📊 Eval samples: {len(eval_dataset)}")
print(f"📊 Sequence length: {config['max_seq_length']}")
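
The combination of `truncation=True` and `padding="max_length"` means every example ends up exactly `max_seq_length` tokens long. The toy function below imitates that behavior on plain lists of token IDs so the effect on `input_ids` and `attention_mask` is visible. It is a stand-in written for illustration, not the tokenizer's implementation, and it ignores special tokens.

```python
def pad_or_truncate(token_ids, max_length, pad_id, padding_side="left"):
    """Mimic truncation=True with padding="max_length" for one sequence."""
    ids = token_ids[:max_length]           # drop tokens beyond max_length
    pad = [pad_id] * (max_length - len(ids))
    mask = [1] * len(ids)                  # 1 = real token, 0 = padding
    if padding_side == "left":
        return {"input_ids": pad + ids, "attention_mask": [0] * len(pad) + mask}
    return {"input_ids": ids + pad, "attention_mask": mask + [0] * len(pad)}


short = pad_or_truncate([5, 6, 7], max_length=5, pad_id=0)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)
print(short)
print(long)
```

Padding every example to the full length is simple but wastes compute on short texts; lowering `max_seq_length` is the easiest lever if training is slow or memory-bound.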

## 🏋️ Train the Model

Start the QLoRA fine-tuning process.

In [None]:
from transformers import TrainingArguments, Trainer
from trl import SFTTrainer
import os

print("Starting training...")

# Create output directory
os.makedirs(config["output_dir"], exist_ok=True)

# Training arguments
training_args = TrainingArguments(
    output_dir=config["output_dir"],
    num_train_epochs=config["num_epochs"],
    per_device_train_batch_size=config["batch_size"],
    gradient_accumulation_steps=config["gradient_accumulation_steps"],
    learning_rate=config["learning_rate"],
    logging_steps=config["logging_steps"],
    save_steps=config["save_steps"],
    evaluation_strategy=config["evaluation_strategy"],
    eval_steps=config["eval_steps"],
    save_total_limit=2,
    load_best_model_at_end=True,
    fp16=True,
    report_to="none",  # Disable wandb/tensorboard for simplicity
)

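# Initialize trainer
# The dataset was tokenized above and its text column removed, so SFTTrainer's
# dataset_text_field lookup would fail. The plain Trainer with a causal-LM
# collator (which builds labels from input_ids) handles the tokenized data.
from transformers import DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)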

# Start training
print("🚀 Training started...")
trainer.train()

# Save the model
trainer.save_model(config["output_dir"])
print(f"✅ Training complete! Model saved to {config['output_dir']}")

# Show training results
training_results = trainer.state.log_history
if training_results:
    print("\n📊 Final training metrics:")
    final_log = training_results[-1]
    for key, value in final_log.items():
        if isinstance(value, (int, float)):
            print(f"  {key}: {value:.4f}")
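
The last entry in the log history is usually a run summary, so the most recent evaluation loss can be easier to interpret when converted to perplexity (`exp(eval_loss)`). The sketch below scans a list of log dictionaries for the latest `eval_loss`. The helper name and the sample entries are hypothetical; in the notebook the input would be `trainer.state.log_history`.

```python
import math


def latest_eval_perplexity(log_history):
    """Return exp(eval_loss) for the most recent evaluation entry, or None."""
    for entry in reversed(log_history):
        if "eval_loss" in entry:
            return math.exp(entry["eval_loss"])
    return None


# Hypothetical entries shaped like trainer.state.log_history.
example_history = [
    {"loss": 1.9, "step": 10},
    {"eval_loss": 1.5, "step": 50},
    {"train_runtime": 120.0, "step": 60},
]

perplexity = latest_eval_perplexity(example_history)
print(f"Latest eval perplexity: {perplexity:.2f}")
```

Lower perplexity on the held-out split suggests the model predicts that data better, though it says little about the quality of generated responses.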

## 🧪 Test the Fine-Tuned Model

Generate some sample outputs to verify the model works.

In [None]:
from transformers import pipeline

print("Testing fine-tuned model...")

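# Reuse the in-memory fine-tuned model for inference.
# output_dir holds only the LoRA adapter weights, so loading it by path
# may not yield the base model with the adapter applied.
model.eval()  # disable LoRA dropout during generation
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=128,
)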

# Test prompts
test_prompts = [
    "Explain quantum computing in simple terms:",
    "What is machine learning?",
    "Write a short story about a robot:",
]

print("🤖 Model outputs:")
for prompt in test_prompts:
    print(f"\n📝 Prompt: {prompt}")
    output = pipe(prompt)[0]["generated_text"]
    # Remove the prompt from output if it's included
    if output.startswith(prompt):
        output = output[len(prompt):].strip()
    print(f"🤖 Response: {output[:200]}...")

print("\n✅ Model testing complete!")
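
A fine-tuned model tends to respond best when test prompts follow the same layout as the training text. The sketch below wraps a plain question in turn markers. It assumes the configured dataset uses `### Human:` and `### Assistant:` markers; check the samples printed in the dataset cell and adjust the tags if they differ. The helper `format_prompt` is hypothetical.

```python
def format_prompt(user_message, human_tag="### Human:", assistant_tag="### Assistant:"):
    """Wrap a user message in the (assumed) turn markers used during training."""
    return f"{human_tag} {user_message.strip()} {assistant_tag}"


# The tags are assumptions; compare with a sample from the training data.
for question in ["What is machine learning?", "  Explain quantum computing in simple terms:  "]:
    print(format_prompt(question))
```

Passing `format_prompt(prompt)` to the generation call instead of the raw prompt is one way to compare outputs with and without the training-time layout.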

## 💾 Prepare Model for Download

Compress the fine-tuned model for easy download.

In [None]:
import os  # used below for os.path.getsize
import shutil
from google.colab import files

print("Preparing model for download...")

# Create a zip file of the fine-tuned model
zip_filename = f"{config['experiment_name']}_model.zip"
shutil.make_archive(config["experiment_name"] + "_model", 'zip', config["output_dir"])

print(f"✅ Model compressed: {zip_filename}")
print(f"📊 File size: {os.path.getsize(zip_filename)/1024/1024:.2f} MB")

# Show download link
print("\n⬇️ Click below to download your fine-tuned model:")
files.download(zip_filename)

print("\n📋 Instructions:")
print("1. Download the zip file above")
print("2. Upload it to your ModelOps app")
print("3. Import the model for deployment")

print("\n🎉 Fine-tuning complete! Your model is ready to use.")
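
Before uploading the archive elsewhere, it can be worth checking that it contains what is expected. The sketch below lists the files in a zip and reports any required names that are missing. The helper names are hypothetical, and treating `adapter_config.json` as required assumes the output directory holds a PEFT adapter. The demonstration builds a throwaway archive so it runs without the trained model.

```python
import os
import tempfile
import zipfile


def list_archive(zip_path):
    """Return the sorted member names of a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())


def missing_files(zip_path, required=("adapter_config.json",)):
    """Return required file names not found anywhere in the archive."""
    base_names = {name.split("/")[-1] for name in list_archive(zip_path)}
    return [name for name in required if name not in base_names]


# Demonstration on a temporary archive with a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo_model.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("adapter_config.json", "{}")
    print("Contents:", list_archive(demo_zip))
    print("Missing:", missing_files(demo_zip))
```

In the notebook, calling `missing_files(zip_filename)` after the archive is created gives a quick check before downloading.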

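## 🔄 Optional: Quantize for Deployment

Quantize the model for deployment. The commented-out sketch below uses AutoAWQ, which produces 4-bit AWQ weights intended for GPU inference; exporting to GGUF for CPU inference needs a separate converter such as the one shipped with llama.cpp.

In [None]:
# Optional: 4-bit AWQ quantization (not GGUF)
# This requires additional setup and may take time.
# AutoAWQ expects a full-precision checkpoint, so the LoRA adapter must be
# merged into the base model first; merged_dir is a placeholder for that path.

# from awq import AutoAWQForCausalLM

# print("Quantizing model...")
# merged_dir = config["output_dir"] + "_merged"  # placeholder path
# quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}
# awq_model = AutoAWQForCausalLM.from_pretrained(merged_dir)
# awq_model.quantize(tokenizer, quant_config=quant_config)
# awq_model.save_quantized(config["output_dir"] + "_quantized")

print("Quantization skipped - enable if needed for deployment")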