# Fine-tuning Aya Vision 8B for African Languages

This notebook demonstrates how to fine-tune the Aya Vision 8B model using LoRA (Low-Rank Adaptation) for improved performance on African language vision-language tasks.

## Model Overview
- **Model**: CohereLabs/aya-vision-8b
- **Parameters**: 8 billion
- **Context Length**: 16K tokens
- **Languages**: 23 languages including African languages
- **Architecture**: Vision-Language Model with SigLIP2 vision encoder

## Key Features of This Notebook
- Memory-efficient training with 4-bit quantization
- LoRA fine-tuning for reduced computational requirements
- Automatic model saving and uploading to Hugging Face
- Progress tracking and checkpoint saving
- Support for custom African language datasets

## Requirements
- Kaggle GPU environment (T4 or P100 recommended)
- Hugging Face account with write token
- Dataset uploaded to Hugging Face Hub


## 📋 Phase 1: Environment Setup and Installation

In [None]:
# Install required libraries
# Aya Vision requires transformers 4.50.0 or newer
!pip install -q "transformers>=4.50.0,<4.51" "datasets==2.19.1" "accelerate==0.30.1"
!pip install -q "bitsandbytes==0.43.1" "peft==0.11.1" "trl==0.9.4"
!pip install -q "torch>=2.0.0" "torchvision" "pillow" "wandb"

print("✅ All packages installed successfully!")

In [None]:
# Import necessary libraries
import os
import torch
import json
import pandas as pd
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Core ML libraries
from transformers import (
    AutoProcessor, 
    AutoModelForImageTextToText, 
    BitsAndBytesConfig,
    TrainingArguments
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset, Dataset
from huggingface_hub import notebook_login, HfApi

print(f"✅ PyTorch version: {torch.__version__}")
print(f"✅ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✅ GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 🔐 Phase 2: Authentication and Configuration

In [None]:
# Configuration parameters
CONFIG = {
    # Model settings
    "base_model_id": "CohereLabs/aya-vision-8b",
    "new_model_name": "aya-vision-8b-african-finetuned",  # Change this to your desired name
    "hub_model_id": None,  # Will be set after login
    
    # Dataset settings
    "dataset_id": "Afri-Aya/afri-aya-dataset-v1",  # Replace with your dataset
    "dataset_split": "train",
    "max_samples": None,  # Set to a number for testing, None for full dataset
    
    # Training settings
    "num_epochs": 1,
    "batch_size": 2,
    "gradient_accumulation_steps": 8,
    "learning_rate": 2e-4,
    "max_seq_length": 1024,
    "save_steps": 50,
    "logging_steps": 10,
    
    # LoRA settings
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    
    # Other settings
    "temperature": 0.3,
    "max_new_tokens": 300,
}

print("📋 Configuration loaded:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

In [None]:
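# Login to Hugging Face
from huggingface_hub import login

try:
    try:
        # On Kaggle, read the token from Add-ons > Secrets (assumes the secret is named HF_TOKEN)
        from kaggle_secrets import UserSecretsClient
        login(token=UserSecretsClient().get_secret("HF_TOKEN"))
    except Exception:
        # Outside Kaggle, or if the secret is missing, fall back to the interactive prompt
        notebook_login()
    
    # Get username for model naming
    api = HfApi()
    username = api.whoami()["name"]
    CONFIG["hub_model_id"] = f"{username}/{CONFIG['new_model_name']}"
    
    print(f"✅ Logged in as: {username}")
    print(f"🎯 Model will be saved as: {CONFIG['hub_model_id']}")
    
except Exception as e:
    print(f"❌ Login failed: {e}")
    print("Please ensure your HF_TOKEN is properly set in Kaggle secrets")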

## 📊 Phase 3: Data Loading and Preparation

In [None]:
# Load dataset
print(f"📥 Loading dataset: {CONFIG['dataset_id']}")

try:
    dataset = load_dataset(CONFIG["dataset_id"], split=CONFIG["dataset_split"])
    print(f"✅ Dataset loaded successfully!")
    print(f"📊 Dataset size: {len(dataset)} samples")
    
    # Show dataset structure
    print("\n📋 Dataset columns:", dataset.column_names)
    print("\n🔍 Sample entry:")
    sample = dataset[0]
    for key, value in sample.items():
        if key == 'image':
            print(f"  {key}: <PIL.Image object>")
        else:
            print(f"  {key}: {str(value)[:100]}..." if len(str(value)) > 100 else f"  {key}: {value}")
    
except Exception as e:
    print(f"❌ Failed to load dataset: {e}")
    print("Please check your dataset ID and ensure it's publicly accessible")

In [None]:
# Optional: Create a smaller subset for testing
if CONFIG["max_samples"] is not None:
    print(f"🔄 Creating subset of {CONFIG['max_samples']} samples for testing...")
    dataset = dataset.shuffle(seed=42).select(range(min(CONFIG["max_samples"], len(dataset))))
    print(f"✅ Subset created with {len(dataset)} samples")

print(f"\n📊 Final dataset size: {len(dataset)} samples")

In [None]:
# Data formatting function for Aya Vision chat template
def format_for_aya_vision(example):
    """
    Format dataset entries for Aya Vision chat template.
    Adjust this function based on your dataset structure.
    """
    # Example assumes your dataset has 'image' and 'caption' or similar fields
    # Modify these field names based on your actual dataset structure
    
    # Default prompts - customize based on your use case
    prompts = [
        "Describe this image in detail.",
        "What do you see in this image?",
        "Please provide a detailed caption for this image.",
        "What is happening in this image?"
    ]
    
    # Use different fields based on your dataset structure
    if "english_caption" in example:
        response = example["english_caption"]
    elif "caption" in example:
        response = example["caption"]
    elif "description" in example:
        response = example["description"]
    else:
        # Try to find any text field
        text_fields = [k for k, v in example.items() if isinstance(v, str) and len(v) > 10]
        response = example[text_fields[0]] if text_fields else "This is an image."
    
    # Pick one of the prompts at random (the choice varies between runs unless `random` is seeded)
    import random
    user_prompt = random.choice(prompts)
    
    return {
        "image": example["image"],
        "messages": [
            {
                "role": "user", 
                "content": f"<image>\n{user_prompt}"
            },
            {
                "role": "assistant", 
                "content": response
            }
        ]
    }

# Format the dataset
print("🔄 Formatting dataset for training...")
formatted_dataset = dataset.map(format_for_aya_vision, desc="Formatting data")

print("✅ Dataset formatted successfully!")
print("\n🔍 Sample formatted entry:")
sample_formatted = formatted_dataset[0]
print(f"Messages: {sample_formatted['messages']}")
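
Because `random.choice` draws a new prompt on every run, the formatted dataset above is not reproducible. One possible alternative, sketched here, derives the prompt from a row index so the same row always gets the same prompt. The helpers `pick_prompt` and `build_messages` are hypothetical names, and the sketch assumes a stable per-row index is available (for example through `dataset.map(..., with_indices=True)`).

```python
import random

PROMPTS = [
    "Describe this image in detail.",
    "What do you see in this image?",
    "Please provide a detailed caption for this image.",
    "What is happening in this image?",
]

def pick_prompt(index: int, seed: int = 42) -> str:
    """Return the same prompt for the same (seed, index) pair on every run."""
    rng = random.Random(seed * 1_000_003 + index)  # private RNG, global state untouched
    return rng.choice(PROMPTS)

def build_messages(index: int, caption: str) -> list:
    """Build the user/assistant message pair in the layout used by the cell above."""
    return [
        {"role": "user", "content": f"<image>\n{pick_prompt(index)}"},
        {"role": "assistant", "content": caption},
    ]

# The caption is a made-up placeholder, not taken from the dataset
for message in build_messages(0, "A placeholder caption."):
    print(message)
```

Using a private `random.Random` instance keeps the choice independent of any other code that touches the global random state.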

## 🤖 Phase 4: Model and Processor Loading

In [None]:
# Configure 4-bit quantization for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

print("⚙️ Quantization config created")
print(f"  Quantization type: 4-bit NF4")
print(f"  Compute dtype: {bnb_config.bnb_4bit_compute_dtype}")
print(f"  Double quantization: {bnb_config.bnb_4bit_use_double_quant}")
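
As a rough back-of-the-envelope check on why 4-bit loading matters on a Kaggle GPU, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes a round 8 billion parameters and ignores quantization constants, layers that stay unquantized, activations, and optimizer state, so real usage will be higher. `estimate_weight_memory_gb` is a hypothetical helper, not a library function.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 8e9  # assumed round figure for an "8B" model

for label, bits in [("bf16/fp16", 16), ("int8", 8), ("nf4 (4-bit)", 4)]:
    print(f"{label:>12}: ~{estimate_weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the 16-bit weights would already fill a 16 GB card before any activations are stored, while the 4-bit weights leave room for LoRA adapters and training state.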

In [None]:
# Load processor first
print(f"📥 Loading processor for {CONFIG['base_model_id']}...")

try:
    processor = AutoProcessor.from_pretrained(
        CONFIG["base_model_id"], 
        trust_remote_code=True
    )
    print("✅ Processor loaded successfully!")
    
except Exception as e:
    print(f"❌ Failed to load processor: {e}")
    raise

In [None]:
# Load the base model with quantization
print(f"📥 Loading model {CONFIG['base_model_id']} with 4-bit quantization...")
print("⏳ This may take several minutes...")

try:
    model = AutoModelForImageTextToText.from_pretrained(
        CONFIG["base_model_id"],
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
        torch_dtype=torch.bfloat16
    )
    
    print("✅ Model loaded successfully!")
    print(f"🎯 Model device: {next(model.parameters()).device}")
    print(f"📊 Model dtype: {next(model.parameters()).dtype}")
    
    # Display memory usage
    if torch.cuda.is_available():
        memory_used = torch.cuda.memory_allocated() / 1e9
        memory_total = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"💾 GPU Memory used: {memory_used:.1f} GB / {memory_total:.1f} GB")
    
except Exception as e:
    print(f"❌ Failed to load model: {e}")
    raise

## 🔧 Phase 5: LoRA Configuration and Model Preparation

In [None]:
# Configure LoRA for efficient fine-tuning
lora_config = LoraConfig(
    r=CONFIG["lora_r"],
    lora_alpha=CONFIG["lora_alpha"],
    lora_dropout=CONFIG["lora_dropout"],
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Standard attention modules
    task_type=TaskType.CAUSAL_LM
)

print("🔧 LoRA configuration:")
print(f"  Rank (r): {lora_config.r}")
print(f"  Alpha: {lora_config.lora_alpha}")
print(f"  Dropout: {lora_config.lora_dropout}")
print(f"  Target modules: {lora_config.target_modules}")

# Calculate trainable parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

trainable_params_before = count_parameters(model)
print(f"\n📊 Trainable parameters before LoRA: {trainable_params_before:,}")
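
To see where the trainable-parameter count comes from: LoRA adds two small matrices to each adapted linear layer, one of shape `r × d_in` and one of shape `d_out × r`, and scales their product by `lora_alpha / r`. The sketch below works through that arithmetic. The layer shapes are illustrative placeholders, not values read from the Aya Vision config, and the helper names are hypothetical.

```python
def lora_params_per_layer(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

def lora_scaling(alpha: float, r: int) -> float:
    """Factor applied to the low-rank update before it is added to the frozen weight."""
    return alpha / r

# Hypothetical (d_in, d_out) shapes for one attention block
EXAMPLE_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}

r, alpha = 16, 32  # same values as lora_r and lora_alpha in CONFIG
per_block = sum(lora_params_per_layer(d_in, d_out, r) for d_in, d_out in EXAMPLE_SHAPES.values())

print(f"LoRA parameters per attention block: {per_block:,}")
print(f"Scaling factor alpha / r: {lora_scaling(alpha, r)}")
```

Doubling `r` doubles the adapter size, which is why the rank is the main knob for trading capacity against memory.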

In [None]:
# Apply LoRA to the model
print("🔄 Applying LoRA adapters to the model...")

try:
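    # Let gradients flow through the frozen base model's inputs so that
    # gradient checkpointing still produces gradients for the LoRA weights
    model.enable_input_require_grads()
    model = get_peft_model(model, lora_config)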
    
    # Print trainable parameters info
    model.print_trainable_parameters()
    
    trainable_params_after = count_parameters(model)
    total_params = sum(p.numel() for p in model.parameters())
    
    print(f"\n📊 Parameter Statistics:")
    print(f"  Total parameters: {total_params:,}")
    print(f"  Trainable parameters: {trainable_params_after:,}")
    print(f"  Trainable %: {100 * trainable_params_after / total_params:.2f}%")
    print(f"  Parameter reduction: {trainable_params_before / trainable_params_after:.1f}x")
    
    print("✅ LoRA adapters applied successfully!")
    
except Exception as e:
    print(f"❌ Failed to apply LoRA: {e}")
    raise

## 🏃‍♂️ Phase 6: Training Configuration and Execution

In [None]:
# Setup training arguments
training_args = SFTConfig(
    output_dir=CONFIG["new_model_name"],
    num_train_epochs=CONFIG["num_epochs"],
    per_device_train_batch_size=CONFIG["batch_size"],
    gradient_accumulation_steps=CONFIG["gradient_accumulation_steps"],
    learning_rate=CONFIG["learning_rate"],
    logging_steps=CONFIG["logging_steps"],
    save_strategy="steps",
    save_steps=CONFIG["save_steps"],
    eval_strategy="no",  # Can be changed to "steps" if you have eval data
    push_to_hub=True,
    hub_model_id=CONFIG["hub_model_id"],
    report_to="tensorboard",
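    # bf16 needs an Ampere-or-newer GPU (compute capability >= 8); T4 and P100 fall back to fp16
    bf16=torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8,
    fp16=torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] < 8,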
    max_seq_length=CONFIG["max_seq_length"],
    dataloader_num_workers=2,
    remove_unused_columns=False,
    gradient_checkpointing=True,
    warmup_ratio=0.1,
    weight_decay=0.01,
    optim="paged_adamw_8bit",  # Memory efficient optimizer
)

print("📋 Training Configuration:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"  Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Max sequence length: {training_args.max_seq_length}")
print(f"  Output directory: {training_args.output_dir}")
print(f"  Hub model ID: {training_args.hub_model_id}")
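
To make `warmup_ratio=0.1` concrete: with a linear scheduler, roughly the first 10% of optimizer steps ramp the learning rate from zero up to `learning_rate`, and the remaining steps decay it linearly back to zero. The sketch below reproduces that shape in plain Python. It assumes the linear scheduler is in use, and `linear_schedule_lr` is a hypothetical helper written for illustration.

```python
import math

def linear_schedule_lr(step: int, total_steps: int,
                       base_lr: float = 2e-4, warmup_ratio: float = 0.1) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining

# 100 total steps is an arbitrary figure chosen for the illustration
for step in (0, 5, 10, 55, 100):
    print(f"step {step:>3}: lr = {linear_schedule_lr(step, total_steps=100):.2e}")
```

With very short runs the warmup can span only a handful of steps, so the peak learning rate is reached almost immediately.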

In [None]:
# Initialize the SFT Trainer
print("🏗️ Initializing SFT Trainer...")

try:
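    trainer = SFTTrainer(
        model=model,  # already wrapped with LoRA above, so no peft_config is passed here
        args=training_args,
        train_dataset=formatted_dataset,
        tokenizer=processor.tokenizer,  # trl 0.9.x takes `tokenizer`; it has no `processor` argument
        # Image inputs additionally need a custom `data_collator` that runs the processor on each batch
    )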
    
    print("✅ Trainer initialized successfully!")
    print(f"📊 Training dataset size: {len(formatted_dataset)}")
    
    # Calculate training steps
    total_steps = len(formatted_dataset) // (training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps) * training_args.num_train_epochs
    print(f"📈 Estimated total training steps: {total_steps}")
    
except Exception as e:
    print(f"❌ Failed to initialize trainer: {e}")
    raise
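
The step estimate in the cell above uses floor division, so it drops any final partial batch. A ceil-based count, sketched below, may be closer when the dataset size is not a multiple of the effective batch size. The exact number the Trainer reports can still differ slightly depending on the library version, and `estimate_optimizer_steps` is a hypothetical helper.

```python
import math

def estimate_optimizer_steps(num_samples: int, per_device_batch_size: int,
                             grad_accum_steps: int, num_epochs: int = 1,
                             num_devices: int = 1) -> int:
    """Estimate optimizer updates, counting partial batches as full steps."""
    batches_per_epoch = math.ceil(num_samples / (per_device_batch_size * num_devices))
    updates_per_epoch = max(1, math.ceil(batches_per_epoch / grad_accum_steps))
    return updates_per_epoch * num_epochs

# 1,000 samples is a made-up dataset size; batch settings match CONFIG
print(estimate_optimizer_steps(1000, per_device_batch_size=2, grad_accum_steps=8))  # 63
```

Comparing this figure with `save_steps` shows how many checkpoints a run will write.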

In [None]:
# Start training
print("🚀 Starting training...")
print(f"⏰ Training started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("\n" + "="*50)
print("🔥 TRAINING IN PROGRESS")
print("="*50)

try:
    # Start training
    training_output = trainer.train()
    
    print("\n" + "="*50)
    print("✅ TRAINING COMPLETED SUCCESSFULLY!")
    print("="*50)
    print(f"⏰ Training finished at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    
    # Print training metrics
    if hasattr(training_output, 'metrics'):
        print("\n📊 Final Training Metrics:")
        for key, value in training_output.metrics.items():
            print(f"  {key}: {value}")
    
except Exception as e:
    print(f"\n❌ Training failed: {e}")
    print("💡 Check the error message above and consider reducing batch size or sequence length")
    raise

## 💾 Phase 7: Model Saving and Upload

In [None]:
# Save the final model
print("💾 Saving the fine-tuned model...")

try:
    # Save model locally first
    trainer.save_model()
    print(f"✅ Model saved locally to: {training_args.output_dir}")
    
    # Save processor as well
    processor.save_pretrained(training_args.output_dir)
    print("✅ Processor saved locally")
    
except Exception as e:
    print(f"❌ Failed to save model locally: {e}")

In [None]:
# Push to Hugging Face Hub
print(f"🚀 Uploading model to Hugging Face Hub: {CONFIG['hub_model_id']}")

try:
    # Push the model
    trainer.push_to_hub()
    print(f"✅ Model uploaded successfully to: https://huggingface.co/{CONFIG['hub_model_id']}")
    
    # Create and upload model card
    model_card_content = f"""
# {CONFIG['new_model_name']}

This model is a fine-tuned version of [{CONFIG['base_model_id']}](https://huggingface.co/{CONFIG['base_model_id']}) 
for improved performance on African language vision-language tasks.

## Training Details

- **Base Model**: {CONFIG['base_model_id']}
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Training Data**: {CONFIG['dataset_id']}
- **Training Samples**: {len(formatted_dataset)}
- **Epochs**: {CONFIG['num_epochs']}
- **Batch Size**: {CONFIG['batch_size']}
- **Learning Rate**: {CONFIG['learning_rate']}
- **LoRA Rank**: {CONFIG['lora_r']}
- **LoRA Alpha**: {CONFIG['lora_alpha']}

## Usage

```python
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import PeftModel

# Load the base model and processor
base_model = AutoModelForImageTextToText.from_pretrained("{CONFIG['base_model_id']}")
processor = AutoProcessor.from_pretrained("{CONFIG['base_model_id']}")

# Load the fine-tuned LoRA weights
model = PeftModel.from_pretrained(base_model, "{CONFIG['hub_model_id']}")

# Use the model for inference
# [Add your inference code here]
```

## Training Infrastructure

- **Platform**: Kaggle
- **GPU**: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU'}
- **Training Date**: {datetime.now().strftime('%Y-%m-%d')}

## African Language Support

This model has been fine-tuned to better understand and generate content related to African languages and cultures,
building upon the strong multilingual foundation of the base Aya Vision model.
"""
    
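    # Save the model card locally, then upload it (the push above ran before this file existed)
    readme_path = os.path.join(training_args.output_dir, "README.md")
    with open(readme_path, "w", encoding="utf-8") as f:
        f.write(model_card_content)
    HfApi().upload_file(
        path_or_fileobj=readme_path,
        path_in_repo="README.md",
        repo_id=CONFIG["hub_model_id"],
    )
    
    print("✅ Model card created and uploaded")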
    
except Exception as e:
    print(f"❌ Failed to upload to Hub: {e}")
    print("💡 You can manually upload later using: trainer.push_to_hub()")

## 🧪 Phase 8: Model Testing and Validation

In [None]:
# Test the fine-tuned model
print("🧪 Testing the fine-tuned model...")

# Get a test sample from the dataset
test_sample = formatted_dataset[0]
test_image = test_sample["image"]
test_messages = test_sample["messages"]

print("📸 Test image loaded")
print(f"💬 Test prompt: {test_messages[0]['content']}")
print(f"🎯 Expected response: {test_messages[1]['content'][:200]}...")

# Prepare the input
test_messages_for_inference = [{
    "role": "user",
    "content": [
        {"type": "image", "url": test_image},
        {"type": "text", "text": "Describe this image in detail."}
    ]
}]

try:
    # Apply chat template
    inputs = processor.apply_chat_template(
        test_messages_for_inference,
        padding=True,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt"
    ).to(model.device)
    
    # Generate response (switch to eval mode so LoRA dropout is disabled)
    print("\n🔄 Generating response...")
    model.eval()
    with torch.no_grad():
        gen_tokens = model.generate(
            **inputs,
            max_new_tokens=CONFIG["max_new_tokens"],
            do_sample=True,
            temperature=CONFIG["temperature"],
            pad_token_id=processor.tokenizer.eos_token_id
        )
    
    # Decode the response
    response = processor.tokenizer.decode(
        gen_tokens[0][inputs.input_ids.shape[1]:], 
        skip_special_tokens=True
    )
    
    print("\n" + "="*50)
    print("🤖 MODEL RESPONSE:")
    print("="*50)
    print(response)
    print("="*50)
    
    print("\n✅ Model test completed successfully!")
    
except Exception as e:
    print(f"❌ Model testing failed: {e}")
    print("💡 The model was trained but there might be an issue with inference")
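
The generation call above samples with `temperature=0.3`. Temperature divides the model's raw scores before they are turned into probabilities, so values below 1 concentrate probability on the top-scoring tokens. The sketch below illustrates this with made-up scores for three candidate tokens. `softmax_with_temperature` is a hypothetical helper, not the implementation used inside `generate`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.3):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

A low temperature such as 0.3 keeps captions close to the most likely wording, while higher values give more varied output.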

## 📈 Phase 9: Training Summary and Next Steps

In [None]:
# Print comprehensive training summary
print("\n" + "="*60)
print("📊 TRAINING SUMMARY")
print("="*60)

print(f"\n🎯 Model Information:")
print(f"  Base Model: {CONFIG['base_model_id']}")
print(f"  Fine-tuned Model: {CONFIG['hub_model_id']}")
print(f"  Model Size: 8B parameters")
print(f"  Fine-tuning Method: LoRA")

print(f"\n📊 Training Configuration:")
print(f"  Dataset: {CONFIG['dataset_id']}")
print(f"  Training Samples: {len(formatted_dataset):,}")
print(f"  Epochs: {CONFIG['num_epochs']}")
print(f"  Batch Size: {CONFIG['batch_size']}")
print(f"  Gradient Accumulation: {CONFIG['gradient_accumulation_steps']}")
print(f"  Learning Rate: {CONFIG['learning_rate']}")
print(f"  LoRA Rank: {CONFIG['lora_r']}")

if torch.cuda.is_available():
    final_memory = torch.cuda.memory_allocated() / 1e9
    max_memory = torch.cuda.max_memory_allocated() / 1e9
    total_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"\n💾 Memory Usage:")
    print(f"  Current GPU Memory: {final_memory:.1f} GB")
    print(f"  Peak GPU Memory: {max_memory:.1f} GB")
    print(f"  Total GPU Memory: {total_memory:.1f} GB")
    print(f"  Peak GPU Utilization: {max_memory/total_memory*100:.1f}%")

print(f"\n🚀 Model Deployment:")
print(f"  Hugging Face Hub: https://huggingface.co/{CONFIG['hub_model_id']}")
print(f"  Local Save Path: {training_args.output_dir}")

print(f"\n⏰ Timing:")
print(f"  Training Completed: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

print("\n" + "="*60)
print("🎉 FINE-TUNING COMPLETED SUCCESSFULLY!")
print("="*60)

print("\n🔥 Next Steps:")
print("  1. Test your model with different images and prompts")
print("  2. Evaluate on your specific use cases")
print("  3. Consider additional fine-tuning with more epochs if needed")
print("  4. Deploy your model for inference")
print("  5. Share your results with the community!")

print(f"\n📚 Resources:")
print(f"  - Model Hub: https://huggingface.co/{CONFIG['hub_model_id']}")
print(f"  - Base Model: https://huggingface.co/{CONFIG['base_model_id']}")
print(f"  - Dataset: https://huggingface.co/datasets/{CONFIG['dataset_id']}")
print(f"  - Aya Expanse Paper: https://arxiv.org/abs/2412.04261")

## 🔧 Troubleshooting and Tips

### Common Issues and Solutions:

1. **Out of Memory (OOM) Errors:**
   - Reduce `batch_size` from 2 to 1
   - Increase `gradient_accumulation_steps` to maintain effective batch size
   - Reduce `max_seq_length` from 1024 to 512
   - Use `torch.cuda.empty_cache()` between training phases

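2. **Training Too Slow:**
   - Ensure you're using GPU T4 or P100 in Kaggle
   - Note that `gradient_checkpointing=True` (already enabled) saves memory at the cost of speed; turn it off only if you have memory headroom
   - Use `dataloader_num_workers=0` if the data loader hangs or crashes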

3. **Model Not Uploading to Hub:**
   - Check your HF_TOKEN in Kaggle secrets
   - Ensure you have write permissions
   - Try manual upload: `trainer.push_to_hub()`

4. **Dataset Loading Issues:**
   - Ensure your dataset is public or you have access
   - Check dataset structure matches expected format
   - Modify the `format_for_aya_vision` function as needed
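
For the out-of-memory advice in item 1, the effective batch size is the per-device batch size multiplied by the gradient accumulation steps. A minimal sketch of how to choose the new accumulation value after lowering `batch_size` follows; `accumulation_steps_for` is a hypothetical helper.

```python
import math

def accumulation_steps_for(target_effective_batch: int, per_device_batch_size: int,
                           num_devices: int = 1) -> int:
    """Gradient accumulation steps needed to reach the target effective batch size."""
    samples_per_forward = per_device_batch_size * num_devices
    return max(1, math.ceil(target_effective_batch / samples_per_forward))

current_effective = 2 * 8  # batch_size * gradient_accumulation_steps from CONFIG
print(accumulation_steps_for(current_effective, per_device_batch_size=1))  # 16
```

Keeping the effective batch size constant means the learning rate and warmup settings should not need retuning after the change.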

### Performance Optimization Tips:

- **For better quality:** Increase epochs to 2-3, but watch for overfitting
- **For faster training:** Use smaller LoRA rank (r=8) and reduce max_seq_length
- **For memory efficiency:** Use gradient_accumulation_steps=16 with batch_size=1
- **For stability:** Keep learning_rate between 1e-4 and 5e-4

### Kaggle-Specific Tips:

- Save checkpoints frequently (every 50 steps) due to session limits
- Monitor your GPU quota usage
- Download important checkpoints before session expires
- Use persistent storage for large datasets
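
If a session expires mid-run, training can continue from the most recent checkpoint. The Trainer writes each checkpoint to a `checkpoint-<step>` folder inside `output_dir`, and the sketch below finds the one with the highest step number using only the standard library. `find_latest_checkpoint` is a hypothetical helper, and the folders in the demo are created in a temporary directory purely for illustration.

```python
import os
import re
import tempfile

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")

def find_latest_checkpoint(output_dir):
    """Return the path of the checkpoint folder with the highest step, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

# Demo with fake checkpoint folders
with tempfile.TemporaryDirectory() as tmp:
    for step in (50, 100, 9):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    print(os.path.basename(find_latest_checkpoint(tmp)))  # checkpoint-100
```

The returned path can be passed to `trainer.train(resume_from_checkpoint=path)`. The steps are compared as integers because a plain string sort would rank `checkpoint-9` above `checkpoint-100`.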
