# OpenAI Whisper Fine-tuning for Low-Resource African Languages

This notebook implements **Phase 3** of the Low-Resource Speech-to-Speech pipeline, focusing on fine-tuning OpenAI Whisper for African languages.

## Why Whisper for Low-Resource Languages?

Based on our research, Whisper is ideal because:
- ✅ **Robust to noise and accents** - Critical for real-world African audio
- ✅ **Strong zero-shot baseline** - Usable without fine-tuning for well-covered languages, weaker for languages with limited built-in support
- ✅ **Multiple model sizes** - From tiny (39M) to large-v3 (1.5B) parameters
- ✅ **Fine-tunable with limited data** - As little as a few hours of quality audio
- ✅ **Straightforward fine-tuning** - Well-supported by Hugging Face

## Supported African Languages

Whisper has some built-in support for:
- **Swahili** (sw) - Good baseline
- **Hausa** (ha) - Limited support
- **Yoruba** (yo) - Limited support  
- **Arabic** (ar) - Excellent support
- **Somali** (so) - Limited support

This notebook will help improve performance through fine-tuning with your specific data.

## Pipeline Overview

1. **Data Preparation** - Process your audio + transcription data
2. **Synthetic Data Integration** - Combine with generated data from Phase 2
3. **Model Fine-tuning** - Adapt Whisper to your target language
4. **Evaluation** - Measure WER (Word Error Rate) improvements
5. **Deployment** - Export for inference


## 📦 Environment Setup and Installation

In [None]:
# Install required packages for Whisper fine-tuning
!pip install -q torch torchaudio transformers datasets accelerate evaluate jiwer
!pip install -q librosa soundfile pydub
!pip install -q huggingface_hub wandb tensorboard

print("✅ Whisper fine-tuning environment ready!")

In [None]:
# Import necessary libraries
import os
import torch
import torchaudio
import librosa
import numpy as np
import pandas as pd
from pathlib import Path
from dataclasses import dataclass
from typing import Dict, List, Union, Any, Optional

# Transformers and training
from transformers import (
    WhisperFeatureExtractor,
    WhisperTokenizer, 
    WhisperProcessor,
    WhisperForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    TrainerCallback
)

# Dataset and evaluation
from datasets import Dataset, DatasetDict, load_dataset, Audio
from evaluate import load
import jiwer

# Audio processing
from pydub import AudioSegment
from pydub.utils import which

# Utilities
import json
import re
import warnings
warnings.filterwarnings('ignore')

print(f"✅ PyTorch version: {torch.__version__}")
print(f"✅ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")

## 🔧 Configuration for African Language Fine-tuning

In [None]:
# Configuration for low-resource African language fine-tuning
CONFIG = {
    # Model settings
    "model_name_or_path": "openai/whisper-small",  # Options: tiny, base, small, medium, large-v3
    "language": "sw",  # Target language code (sw=Swahili, ha=Hausa, yo=Yoruba, etc.)
    "language_name": "Swahili",  # Human readable language name
    "task": "transcribe",  # "transcribe" or "translate"
    
    # Data settings
    "dataset_name": "your-username/african-speech-dataset",  # Your Hugging Face dataset
    "audio_column": "audio",
    "transcript_column": "sentence",
    "test_size": 0.2,  # Fraction for test set
    
    # Audio processing
    "sampling_rate": 16000,  # Whisper expects 16kHz
    "max_audio_length": 30.0,  # Maximum audio length in seconds
    "min_audio_length": 1.0,   # Minimum audio length in seconds
    
    # Training settings
    "output_dir": "./whisper-small-swahili",  # Change based on your language
    "per_device_train_batch_size": 8,  # Adjust based on GPU memory
    "per_device_eval_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "learning_rate": 1e-5,
    "warmup_steps": 500,
    "max_steps": 5000,  # Adjust based on dataset size
    "gradient_checkpointing": True,
    "fp16": True,
    "evaluation_strategy": "steps",
    "eval_steps": 500,
    "save_steps": 500,
    "logging_steps": 100,
    "report_to": "tensorboard",
    
    # Generation settings for evaluation
    "generation_max_length": 225,
    "suppress_tokens": [-1],  # Suppress special tokens
    
    # Low-resource optimizations
    "freeze_encoder": False,  # Set to True if very limited data
    "freeze_feature_extractor": True,
    "dropout": 0.1,
}

print("📋 Configuration loaded for African language fine-tuning:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")
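
The `max_steps` comment above says to adjust the value to the dataset size. One rough way to reason about it is to convert steps into passes over the training set. The sketch below assumes a single GPU by default; the helper name `estimate_epochs` and the 10,000-clip dataset size are illustrative and not part of this notebook.

```python
def estimate_epochs(num_train_samples: int,
                    per_device_batch_size: int,
                    gradient_accumulation_steps: int,
                    max_steps: int,
                    num_devices: int = 1) -> float:
    """Return roughly how many passes over the training set max_steps implies."""
    if num_train_samples <= 0:
        raise ValueError("num_train_samples must be positive")
    # Samples consumed per optimizer step
    effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
    samples_seen = effective_batch * max_steps
    return samples_seen / num_train_samples

# CONFIG values from above (batch 8, accumulation 2, 5000 steps)
# with a hypothetical 10,000-clip training set
epochs = estimate_epochs(10_000, 8, 2, 5000)
print(f"Effective batch size: {8 * 2}, approx. epochs: {epochs:.1f}")
```

If the estimate comes out at many epochs for a small dataset, lowering `max_steps` (and `warmup_steps` with it) is one way to reduce the risk of overfitting.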

## 📊 Phase 1: Data Collection & Preparation

This implements **Phase 1** from your methodology - preparing the seed dataset.

In [None]:
# Load and prepare the dataset
def load_and_prepare_dataset():
    """
    Load dataset from various sources and prepare for training.
    Supports multiple formats commonly used for African language datasets.
    """
    
    # Option 1: Load from Hugging Face Hub
    try:
        print(f"📥 Loading dataset: {CONFIG['dataset_name']}")
        dataset = load_dataset(CONFIG["dataset_name"])
        print(f"✅ Loaded dataset from Hugging Face Hub")
        return dataset
    except Exception as e:
        print(f"❌ Failed to load from HF Hub: {e}")
    
    # Option 2: Load Common Voice (if available for your language)
    try:
        print(f"📥 Attempting to load Common Voice for {CONFIG['language']}...")
        dataset = load_dataset("mozilla-foundation/common_voice_11_0", CONFIG["language"])
        print(f"✅ Loaded Common Voice dataset")
        return dataset
    except Exception as e:
        print(f"❌ Common Voice not available for {CONFIG['language']}: {e}")
    
    # Option 3: Create from local files
    print("📁 Creating dataset from local files...")
    return create_dataset_from_local_files()

def create_dataset_from_local_files():
    """
    Create dataset from local audio files and transcriptions.
    Adapt this function based on your data structure.
    """
    
    # Example structure - modify based on your data
    audio_dir = "./audio_files"  # Directory with .wav files
    transcript_file = "./transcripts.csv"  # CSV with filename,transcription
    
    if not os.path.exists(audio_dir) or not os.path.exists(transcript_file):
        print("❌ Local files not found. Please prepare your data or use a different source.")
        print("Expected structure:")
        print("  ./audio_files/audio001.wav, audio002.wav, ...")
        print("  ./transcripts.csv with columns: filename,transcription")
        return None
    
    # Load transcriptions
    df = pd.read_csv(transcript_file)
    
    # Prepare data for Dataset creation
    data = []
    for _, row in df.iterrows():
        audio_path = os.path.join(audio_dir, row['filename'])
        if os.path.exists(audio_path):
            data.append({
                "audio": audio_path,
                "sentence": row['transcription']
            })
    
    # Create Dataset
    dataset = Dataset.from_list(data)
    dataset = dataset.cast_column("audio", Audio(sampling_rate=CONFIG["sampling_rate"]))
    
    return DatasetDict({"train": dataset})

# Load the dataset
raw_dataset = load_and_prepare_dataset()

if raw_dataset:
    print(f"✅ Dataset loaded successfully!")
    print(f"📊 Dataset info:")
    for split_name, split_data in raw_dataset.items():
        print(f"  {split_name}: {len(split_data)} samples")
        print(f"  Columns: {split_data.column_names}")
else:
    print("❌ Failed to load dataset. Please check your data sources.")
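
`create_dataset_from_local_files` silently skips rows whose audio file is missing, so a typo in `transcripts.csv` can shrink the dataset without any warning. A small pre-flight check can surface those rows first. This is a minimal sketch that assumes the `filename,transcription` CSV layout described above; the helper name `find_manifest_problems` is hypothetical.

```python
import csv
import os

def find_manifest_problems(audio_dir: str, transcript_file: str) -> dict:
    """Report CSV rows that would be dropped or that carry no usable text."""
    missing_audio, empty_text = [], []
    with open(transcript_file, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            filename = (row.get("filename") or "").strip()
            text = (row.get("transcription") or "").strip()
            # Rows pointing at a file that does not exist are skipped by the loader
            if not filename or not os.path.exists(os.path.join(audio_dir, filename)):
                missing_audio.append(filename)
            # Rows with an empty transcription would yield empty training labels
            if not text:
                empty_text.append(filename)
    return {"missing_audio": missing_audio, "empty_text": empty_text}
```

Calling `find_manifest_problems("./audio_files", "./transcripts.csv")` before loading gives two lists that should both be empty for a clean manifest.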

In [None]:
# Data preprocessing and cleaning
def preprocess_dataset(dataset):
    """
    Clean and preprocess the dataset for optimal training.
    """
    
    def normalize_text(text):
        """
        Normalize text for better training consistency.
        Adapt normalization rules for your target African language.
        """
        # Basic cleaning
        text = text.strip().lower()
        
        # Remove extra whitespace
        text = re.sub(r'\s+', ' ', text)
        
        # Language-specific normalizations
        if CONFIG["language"] == "sw":  # Swahili-specific
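            # Add Swahili-specific normalizations here. Avoid blind substring
            # replacements: they rewrite every occurrence inside longer words
            # (e.g. replacing "si" with "sio" would turn "siku" into "sioku").
            pass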
        elif CONFIG["language"] == "ha":  # Hausa-specific
            # Add Hausa-specific normalizations
            pass
        elif CONFIG["language"] == "yo":  # Yoruba-specific
            # Add Yoruba-specific normalizations
            pass
        
        return text
    
    def filter_audio_length(example):
        """
        Filter audio based on length criteria.
        """
        duration = len(example["audio"]["array"]) / example["audio"]["sampling_rate"]
        return (CONFIG["min_audio_length"] <= duration <= CONFIG["max_audio_length"])
    
    def prepare_example(example):
        """
        Prepare individual examples for training.
        """
        # Normalize transcription
        example["sentence"] = normalize_text(example["sentence"])
        
        # Ensure audio is at correct sampling rate
        if example["audio"]["sampling_rate"] != CONFIG["sampling_rate"]:
            # Resample if needed
            audio_array = librosa.resample(
                example["audio"]["array"], 
                orig_sr=example["audio"]["sampling_rate"],
                target_sr=CONFIG["sampling_rate"]
            )
            example["audio"] = {
                "array": audio_array,
                "sampling_rate": CONFIG["sampling_rate"]
            }
        
        return example
    
    print("🔄 Preprocessing dataset...")
    
    # Apply preprocessing
    processed_dataset = {}
    for split_name, split_data in dataset.items():
        print(f"  Processing {split_name} split...")
        
        # Filter by audio length
        filtered = split_data.filter(filter_audio_length)
        print(f"    Filtered: {len(split_data)} → {len(filtered)} samples")
        
        # Prepare examples
        processed = filtered.map(prepare_example, desc="Preparing examples")
        processed_dataset[split_name] = processed
    
    return DatasetDict(processed_dataset)

# Preprocess the dataset
if raw_dataset:
    processed_dataset = preprocess_dataset(raw_dataset)
    print("✅ Dataset preprocessing completed!")
    
    # Show sample
    sample = processed_dataset[list(processed_dataset.keys())[0]][0]
    print(f"\n📝 Sample example:")
    print(f"  Audio shape: {sample['audio']['array'].shape}")
    print(f"  Sampling rate: {sample['audio']['sampling_rate']}")
    print(f"  Transcription: '{sample['sentence']}'")
else:
    print("❌ Cannot preprocess dataset - no data loaded")
    processed_dataset = None  # Keeps later cells from raising NameError
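
The normalization step is meant to be adapted per language. As a starting point, the sketch below shows a conservative normalizer that never rewrites words: it applies Unicode NFC composition (relevant for tone-marked text such as Yoruba, where the same letter can be encoded in more than one way), lowercases, strips common punctuation and collapses whitespace. The function name `normalize_transcript` and the punctuation set are assumptions; adjust both to your corpus.

```python
import re
import unicodedata

def normalize_transcript(text: str) -> str:
    """Conservative transcript normalization that leaves word content intact."""
    # NFC composes a base letter and combining mark into one code point
    # where a precomposed form exists, so equal-looking text compares equal
    text = unicodedata.normalize("NFC", text)
    text = text.strip().lower()
    # Remove sentence punctuation but keep apostrophes and hyphens,
    # which can be word-internal (e.g. Swahili "ng'ombe")
    text = re.sub(r'[.,!?;:"“”()\[\]]', "", text)
    # Collapse runs of whitespace left behind by the removals
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize_transcript("  Ng'ombe   WANGU! "))  # ng'ombe wangu
```

Whatever rules you choose, apply the same function to references and predictions at evaluation time so that WER is not inflated by formatting differences.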

## 🤖 Phase 3: Model Loading and Preparation

Loading the pretrained Whisper model and preparing for fine-tuning.

In [None]:
# Load Whisper model components
def load_whisper_model():
    """
    Load Whisper model, tokenizer, and feature extractor.
    """
    print(f"📥 Loading Whisper model: {CONFIG['model_name_or_path']}")
    
    # Load feature extractor
    feature_extractor = WhisperFeatureExtractor.from_pretrained(CONFIG["model_name_or_path"])
    
    # Load tokenizer
    tokenizer = WhisperTokenizer.from_pretrained(
        CONFIG["model_name_or_path"], 
        language=CONFIG["language"], 
        task=CONFIG["task"]
    )
    
    # Load model
    model = WhisperForConditionalGeneration.from_pretrained(CONFIG["model_name_or_path"])
    
    # Configure model for fine-tuning
    model.config.forced_decoder_ids = None
    model.config.suppress_tokens = []
    model.config.use_cache = False  # Required for gradient checkpointing
    
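    # Language-specific configuration: Whisper marks each supported language
    # with a special token such as <|sw|>, so check for that token rather than the bare code
    if f"<|{CONFIG['language']}|>" in tokenizer.get_vocab():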
        model.config.forced_decoder_ids = tokenizer.get_decoder_prompt_ids(
            language=CONFIG["language"], 
            task=CONFIG["task"]
        )
    
    # Freezing options for low-resource scenarios
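    if CONFIG["freeze_feature_extractor"]:
        # Whisper models expose freeze_encoder() but not a Wav2Vec2-style
        # freeze_feature_encoder(), so freeze the encoder's two convolutional
        # front-end layers directly
        for conv in (model.model.encoder.conv1, model.model.encoder.conv2):
            for param in conv.parameters():
                param.requires_grad = False
        print("🔒 Feature extractor frozen")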
    
    if CONFIG["freeze_encoder"]:
        for param in model.model.encoder.parameters():
            param.requires_grad = False
        print("🔒 Encoder frozen")
    
    # Create processor
    processor = WhisperProcessor.from_pretrained(
        CONFIG["model_name_or_path"],
        language=CONFIG["language"],
        task=CONFIG["task"]
    )
    
    return model, processor, feature_extractor, tokenizer

# Load model components
model, processor, feature_extractor, tokenizer = load_whisper_model()

print("✅ Whisper model loaded successfully!")
print(f"🏗️  Model: {model.config.name_or_path}")
print(f"🌍 Language: {CONFIG['language']} ({CONFIG['language_name']})")
print(f"📋 Task: {CONFIG['task']}")

# Model size info
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"📊 Total parameters: {total_params:,}")
print(f"🎯 Trainable parameters: {trainable_params:,} ({100*trainable_params/total_params:.1f}%)")

## 🔄 Data Collation and Training Setup

In [None]:
# Custom data collator for Whisper fine-tuning
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    """
    Data collator that will dynamically pad the inputs received and prepare decoder input ids.
    """
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Split inputs and labels since they have to be of different lengths and need different padding methods
        model_input_name = self.processor.model_input_names[0]
        input_features = [{model_input_name: feature[model_input_name]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # Replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # If the bos token was prepended in the previous tokenization step,
        # cut it here because it is added again later anyway
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

# Prepare dataset for training
def prepare_dataset_for_training(dataset):
    """
    Apply final processing to prepare dataset for training.
    """
    
    def prepare_dataset_batch(batch):
        # Load and process audio
        audio = batch["audio"]
        
        # Compute input features
        batch["input_features"] = feature_extractor(
            audio["array"], 
            sampling_rate=audio["sampling_rate"]
        ).input_features[0]
        
        # Encode target text
        batch["labels"] = tokenizer(batch["sentence"]).input_ids
        
        return batch
    
    print("🔄 Preparing dataset for training...")
    
    prepared_dataset = {}
    for split_name, split_data in dataset.items():
        prepared = split_data.map(
            prepare_dataset_batch,
            remove_columns=split_data.column_names,
            desc=f"Preparing {split_name} split"
        )
        prepared_dataset[split_name] = prepared
    
    return DatasetDict(prepared_dataset)

# Create data collator
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)

print("✅ Data collator created successfully!")
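
The collator does two things to the labels: it pads them to a common length using `-100`, and it drops a leading start token when every sequence has one. The pure-Python sketch below mirrors that logic on plain lists so the effect is easy to inspect. The token ids are made up for illustration and both helper names are hypothetical; the real collator works on tensors produced by the tokenizer.

```python
from typing import List

IGNORE_INDEX = -100  # PyTorch's cross-entropy loss skips this value by default

def pad_and_mask_labels(label_seqs: List[List[int]],
                        ignore_index: int = IGNORE_INDEX) -> List[List[int]]:
    """Right-pad label sequences to equal length, filling with ignore_index."""
    max_len = max(len(seq) for seq in label_seqs)
    return [seq + [ignore_index] * (max_len - len(seq)) for seq in label_seqs]

def strip_leading_start_token(labels: List[List[int]],
                              start_token_id: int) -> List[List[int]]:
    """Drop the first column only if every row begins with the start token."""
    if labels and all(row and row[0] == start_token_id for row in labels):
        return [row[1:] for row in labels]
    return labels

# Made-up token ids: 1 plays the role of the decoder start token
batch = [[1, 7, 8, 9], [1, 5]]
padded = pad_and_mask_labels(batch)
print(strip_leading_start_token(padded, start_token_id=1))
# [[7, 8, 9], [5, -100, -100]]
```

Padding positions filled with `-100` contribute nothing to the loss, which is why the shorter sequence does not get penalized for the padding.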

## 📊 Evaluation Metrics Setup

In [None]:
# Evaluation function
def compute_metrics(eval_preds):
    """
    Compute Word Error Rate (WER) and other metrics for evaluation.
    """
    pred_ids = eval_preds.predictions
    label_ids = eval_preds.label_ids

    # Replace -100 with pad token id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # Decode predictions and labels
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    # Compute WER
    wer_score = jiwer.wer(label_str, pred_str)
    
    # Compute CER (Character Error Rate)
    cer_score = jiwer.cer(label_str, pred_str)
    
    return {
        "wer": wer_score,
        "cer": cer_score,
    }

print("✅ Evaluation metrics configured!")
print("📊 Metrics: WER (Word Error Rate), CER (Character Error Rate)")
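
For reference, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words: WER = (S + D + I) / N, where S, D and I count substitutions, deletions and insertions. The sketch below is a from-scratch illustration of that definition, not a replacement for `jiwer`; the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# One substituted word out of three reference words
print(f"{word_error_rate('habari ya asubuhi', 'habari za asubuhi'):.2%}")  # 33.33%
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.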

## 🏃‍♂️ Training Configuration and Execution

In [None]:
# Prepare final dataset and splits
if processed_dataset and 'train' in processed_dataset:
    # Prepare dataset for training
    final_dataset = prepare_dataset_for_training(processed_dataset)
    
    # Create train/test split if needed
    if 'test' not in final_dataset and 'validation' not in final_dataset:
        print(f"🔄 Creating train/test split ({1-CONFIG['test_size']:.1f}/{CONFIG['test_size']:.1f})...")
        train_test = final_dataset['train'].train_test_split(
            test_size=CONFIG['test_size'],
            seed=42
        )
        final_dataset = DatasetDict({
            'train': train_test['train'],
            'test': train_test['test']
        })
    
    print("📊 Final dataset splits:")
    for split_name, split_data in final_dataset.items():
        print(f"  {split_name}: {len(split_data)} samples")
        
    train_dataset = final_dataset['train']
    eval_dataset = final_dataset['test'] if 'test' in final_dataset else final_dataset.get('validation')
    
else:
    print("❌ No training data available. Please check your dataset loading.")
    train_dataset = None
    eval_dataset = None

In [None]:
# Training arguments
if train_dataset:
    training_args = Seq2SeqTrainingArguments(
        output_dir=CONFIG["output_dir"],
        per_device_train_batch_size=CONFIG["per_device_train_batch_size"],
        per_device_eval_batch_size=CONFIG["per_device_eval_batch_size"],
        gradient_accumulation_steps=CONFIG["gradient_accumulation_steps"],
        learning_rate=CONFIG["learning_rate"],
        warmup_steps=CONFIG["warmup_steps"],
        max_steps=CONFIG["max_steps"],
        gradient_checkpointing=CONFIG["gradient_checkpointing"],
        fp16=CONFIG["fp16"],
        evaluation_strategy=CONFIG["evaluation_strategy"],
        eval_steps=CONFIG["eval_steps"],
        save_steps=CONFIG["save_steps"],
        logging_steps=CONFIG["logging_steps"],
        report_to=CONFIG["report_to"],
        load_best_model_at_end=True,
        metric_for_best_model="wer",
        greater_is_better=False,
        push_to_hub=False,  # Set to True if you want to push to HF Hub
        remove_unused_columns=False,  # Important for Whisper
        label_names=["labels"],  # Important for Seq2Seq
        predict_with_generate=True,
        generation_max_length=CONFIG["generation_max_length"],
        generation_num_beams=1,  # Greedy decoding; raise above 1 for beam search at higher eval cost
    )

    print("📋 Training configuration:")
    print(f"  Output directory: {training_args.output_dir}")
    print(f"  Batch size per device: {training_args.per_device_train_batch_size}")
    print(f"  Gradient accumulation: {training_args.gradient_accumulation_steps}")
    print(f"  Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
    print(f"  Learning rate: {training_args.learning_rate}")
    print(f"  Max steps: {training_args.max_steps}")
    print(f"  Mixed precision: {training_args.fp16}")

else:
    print("❌ Cannot create training configuration - no training data available.")

In [None]:
# Initialize trainer
if train_dataset:
    trainer = Seq2SeqTrainer(
        args=training_args,
        model=model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        tokenizer=processor.feature_extractor,
    )

    print("✅ Trainer initialized successfully!")
    print(f"📊 Training samples: {len(train_dataset)}")
    if eval_dataset:
        print(f"📊 Evaluation samples: {len(eval_dataset)}")

else:
    print("❌ Cannot initialize trainer - no training data available.")
    trainer = None

In [None]:
# Start training
if trainer:
    print("🚀 Starting Whisper fine-tuning...")
    print(f"🌍 Target language: {CONFIG['language_name']} ({CONFIG['language']})")
    print(f"⏰ Started at: {pd.Timestamp.now()}")
    print("\n" + "="*50)
    print("🔥 TRAINING IN PROGRESS")
    print("="*50)

    try:
        # Train the model
        trainer.train()
        
        print("\n" + "="*50)
        print("✅ TRAINING COMPLETED SUCCESSFULLY!")
        print("="*50)
        print(f"⏰ Finished at: {pd.Timestamp.now()}")
        
        # Save final model
        trainer.save_model()
        processor.save_pretrained(CONFIG["output_dir"])
        
        print(f"💾 Model saved to: {CONFIG['output_dir']}")
        
    except Exception as e:
        print(f"\n❌ Training failed: {e}")
        print("💡 If you ran out of GPU memory, try reducing per_device_train_batch_size or using a smaller model")

else:
    print("❌ Cannot start training - trainer not initialized.")

## 📊 Phase 4: Evaluation and Testing

Implementing **Phase 4** of your methodology - measuring performance improvements.

In [None]:
# Comprehensive evaluation
if trainer and eval_dataset:
    print("📊 Running comprehensive evaluation...")
    
    # Evaluate on test set
    eval_results = trainer.evaluate()
    
    print("\n" + "="*50)
    print("📈 EVALUATION RESULTS")
    print("="*50)
    
    for key, value in eval_results.items():
        if isinstance(value, float):
            print(f"  {key}: {value:.4f}")
        else:
            print(f"  {key}: {value}")
    
    # WER interpretation
    wer = eval_results.get('eval_wer', None)
    if wer is not None:
        print(f"\n🎯 Word Error Rate (WER): {wer:.2%}")
        if wer < 0.10:
            print("🏆 Excellent! WER < 10% - Production ready")
        elif wer < 0.20:
            print("✅ Good! WER < 20% - Very usable")
        elif wer < 0.30:
            print("⚠️  Fair! WER < 30% - Needs improvement")
        else:
            print("❌ Poor! WER ≥ 30% - Requires more training data or different approach")

else:
    print("❌ Cannot run evaluation - no trained model or evaluation data available.")
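
The absolute thresholds above do not show whether fine-tuning actually helped. A common complement is the relative WER reduction against the baseline (non-fine-tuned) model evaluated on the same test set. The sketch below assumes you have both WER values at hand; the function name and the example numbers are hypothetical and are not measured results.

```python
def relative_wer_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Fraction of the baseline error removed by fine-tuning (negative if worse)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - finetuned_wer) / baseline_wer

# Hypothetical values purely for illustration
print(f"{relative_wer_reduction(0.40, 0.30):.0%}")  # 25%
```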

## 🧪 Interactive Testing and Inference

In [None]:
# Test the fine-tuned model on sample audio
def test_model_inference(audio_file_path=None):
    """
    Test the fine-tuned model on a sample audio file.
    """
    if not trainer:
        print("❌ No trained model available for testing")
        return
    
    # Use sample from dataset if no file provided
    if audio_file_path is None and eval_dataset:
        print("🎧 Testing on sample from evaluation dataset...")
        sample = eval_dataset[0]
        
        # The stored sample already holds log-mel input features and label ids,
        # so the model can be run on it directly without reloading the audio
        print("📝 Ground truth transcription:")
        ground_truth = tokenizer.decode(sample['labels'], skip_special_tokens=True)
        print(f"  '{ground_truth}'")
        
        # Generate prediction
        input_features = torch.tensor([sample['input_features']]).to(model.device)
        
        with torch.no_grad():
            predicted_ids = model.generate(
                input_features,
                max_length=CONFIG["generation_max_length"],
                num_beams=1,
                do_sample=False
            )
        
        prediction = tokenizer.decode(predicted_ids[0], skip_special_tokens=True)
        
        print("\n🤖 Model prediction:")
        print(f"  '{prediction}'")
        
        # Calculate WER for this sample
        sample_wer = jiwer.wer(ground_truth, prediction)
        print(f"\n📊 Sample WER: {sample_wer:.2%}")
    
    elif audio_file_path and os.path.exists(audio_file_path):
        print(f"🎧 Testing on audio file: {audio_file_path}")
        
        # Load and process audio
        audio, sr = librosa.load(audio_file_path, sr=CONFIG["sampling_rate"])
        input_features = feature_extractor(audio, sampling_rate=sr, return_tensors="pt")
        
        # Generate transcription
        with torch.no_grad():
            predicted_ids = model.generate(
                input_features.input_features.to(model.device),
                max_length=CONFIG["generation_max_length"],
                num_beams=2,
                do_sample=False
            )
        
        prediction = tokenizer.decode(predicted_ids[0], skip_special_tokens=True)
        
        print(f"\n🤖 Transcription: '{prediction}'")
        print(f"⏱️  Audio duration: {len(audio)/sr:.1f} seconds")
    
    else:
        print("❌ No audio file provided or evaluation dataset unavailable")
        print("💡 Provide an audio file path to test: test_model_inference('path/to/audio.wav')")

# Test the model
test_model_inference()
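
The file-based branch above passes the whole recording to the feature extractor, which by default pads or truncates its input to 30 seconds, so longer recordings lose everything past that point. One simple workaround is to split the waveform into windows first. The sketch below only computes the window boundaries; the name `chunk_bounds` and its parameters are hypothetical.

```python
from typing import List, Tuple

def chunk_bounds(num_samples: int,
                 sampling_rate: int = 16000,
                 chunk_seconds: float = 30.0,
                 overlap_seconds: float = 0.0) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices covering the audio in fixed-size windows."""
    chunk = int(chunk_seconds * sampling_rate)
    step = chunk - int(overlap_seconds * sampling_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk_seconds must be positive and larger than overlap_seconds")
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:  # Last window reached the end of the audio
            break
        start += step
    return bounds

# A 70-second clip at 16 kHz splits into 30 s + 30 s + 10 s
print(chunk_bounds(70 * 16000))
# [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Each slice `audio[start:end]` can then go through `feature_extractor` and `model.generate` as in the function above, with the transcripts joined afterwards. Hard cuts can split a word in two; a non-zero `overlap_seconds` reduces that risk, but merging the overlapping transcripts is left out of this sketch.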

## 📈 Training Summary and Next Steps

In [None]:
# Comprehensive training summary
print("\n" + "="*60)
print("📊 WHISPER FINE-TUNING SUMMARY")
print("="*60)

print(f"\n🎯 Target Language: {CONFIG['language_name']} ({CONFIG['language']})")
print(f"🤖 Base Model: {CONFIG['model_name_or_path']}")
print(f"📋 Task: {CONFIG['task']}")

if trainer:
    print(f"\n📊 Training Configuration:")
    print(f"  Training samples: {len(train_dataset) if train_dataset else 0:,}")
    print(f"  Evaluation samples: {len(eval_dataset) if eval_dataset else 0:,}")
    print(f"  Batch size: {CONFIG['per_device_train_batch_size']}")
    print(f"  Learning rate: {CONFIG['learning_rate']}")
    print(f"  Max steps: {CONFIG['max_steps']}")
    
    if 'eval_results' in locals():
        print(f"\n📈 Final Results:")
        wer = eval_results.get('eval_wer')
        cer = eval_results.get('eval_cer')
        if wer is not None: print(f"  Word Error Rate (WER): {wer:.2%}")
        if cer is not None: print(f"  Character Error Rate (CER): {cer:.2%}")
    
    print(f"\n💾 Model Output:")
    print(f"  Saved to: {CONFIG['output_dir']}")
    print(f"  Files: model weights (model.safetensors or pytorch_model.bin), config.json, preprocessor_config.json")

else:
    print("\n❌ Training was not completed successfully")

print(f"\n🔄 Phase 4 Next Steps:")
print("  1. ✅ Evaluate WER performance (completed)")
print("  2. 🔄 Test on diverse audio samples")
print("  3. 🔄 Compare with baseline Whisper performance")
print("  4. 🔄 Iterate: Add more synthetic data if WER > 20%")
print("  5. 🔄 Deploy for production use if WER < 10%")

print(f"\n🚀 Deployment Options:")
print("  - Save to Hugging Face Hub for easy sharing")
print("  - Convert to ONNX for faster inference")
print("  - Quantize for mobile deployment")
print("  - Integrate into speech-to-speech pipeline")

print(f"\n📚 Resources for African Language STT:")
print("  - Mozilla Common Voice: https://commonvoice.mozilla.org/")
print("  - OpenSLR African Languages: http://openslr.org/")
print("  - Masakhane Community: https://www.masakhane.io/")

print("\n" + "="*60)
print("🎉 WHISPER FINE-TUNING PIPELINE COMPLETE!")
print("="*60)

## 🔧 Advanced: Integration with Phase 2 Synthetic Data

This section shows how to integrate synthetic data generated from Phase 2 of your methodology.

In [None]:
# Example integration with synthetic data (Phase 2)
def integrate_synthetic_data(original_dataset, synthetic_audio_dir, synthetic_transcripts_file):
    """
    Combine real dataset with synthetic data generated from voice cloning platforms.
    
    This implements the combination strategy from Phase 2 of your methodology.
    """
    
    if not os.path.exists(synthetic_audio_dir) or not os.path.exists(synthetic_transcripts_file):
        print("ℹ️  Synthetic data not found - using only real data")
        print("💡 Generate synthetic data using Phase 2 pipeline first:")
        print("   - Use ElevenLabs, Podcastle, or Cartesia for voice cloning")
        print("   - Generate thousands of hours of synthetic speech")
        print("   - Save audio files and transcriptions")
        return original_dataset
    
    print("🔄 Integrating synthetic data with real dataset...")
    
    # Load synthetic transcripts
    synthetic_df = pd.read_csv(synthetic_transcripts_file)
    
    # Prepare synthetic data
    synthetic_data = []
    for _, row in synthetic_df.iterrows():
        audio_path = os.path.join(synthetic_audio_dir, row['filename'])
        if os.path.exists(audio_path):
            synthetic_data.append({
                "audio": audio_path,
                "sentence": row['transcription'],
                "is_synthetic": True  # Flag to track synthetic samples
            })
    
    # Create synthetic dataset
    synthetic_dataset = Dataset.from_list(synthetic_data)
    synthetic_dataset = synthetic_dataset.cast_column("audio", Audio(sampling_rate=CONFIG["sampling_rate"]))
    
    # Add synthetic flag to original data (add_column avoids decoding every audio file)
    original_train = original_dataset['train']
    original_with_flag = original_train.add_column("is_synthetic", [False] * len(original_train))
    
    # Combine datasets
    from datasets import concatenate_datasets
    combined_dataset = concatenate_datasets([original_with_flag, synthetic_dataset])
    
    # Shuffle the combined dataset
    combined_dataset = combined_dataset.shuffle(seed=42)
    
    print(f"✅ Combined dataset created:")
    print(f"  Original samples: {len(original_with_flag)}")
    print(f"  Synthetic samples: {len(synthetic_dataset)}")
    print(f"  Total samples: {len(combined_dataset)}")
    print(f"  Synthetic ratio: {len(synthetic_dataset)/len(combined_dataset):.1%}")
    
    return DatasetDict({"train": combined_dataset})

# Example usage (uncomment when you have synthetic data)
# combined_dataset = integrate_synthetic_data(
#     processed_dataset,
#     "./synthetic_audio/",  # Directory with synthetic audio files
#     "./synthetic_transcripts.csv"  # CSV with synthetic transcriptions
# )

print("💡 Synthetic data integration ready!")
print("   Run the integration function when you have synthetic data from Phase 2")
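
`integrate_synthetic_data` adds every synthetic sample it finds and then reports the resulting synthetic ratio. If you would rather cap that ratio up front, the number of synthetic samples to keep follows from s / (r + s) ≤ t, which gives s ≤ t·r / (1 − t) for r real samples and target ratio t. The sketch below computes that bound; the function name is hypothetical, and a suitable ratio is an empirical question for your data.

```python
import math
from fractions import Fraction

def max_synthetic_samples(num_real: int, target_ratio: float) -> int:
    """Largest synthetic count s such that s / (num_real + s) <= target_ratio."""
    if num_real < 0:
        raise ValueError("num_real must be non-negative")
    if not 0 <= target_ratio < 1:
        raise ValueError("target_ratio must be in [0, 1)")
    # Exact rational arithmetic avoids off-by-one results from float rounding
    t = Fraction(str(target_ratio))
    return math.floor(t * num_real / (1 - t))

# 1,000 real clips with at most 20% synthetic data in the mix
print(max_synthetic_samples(1000, 0.2))  # 250
```

The synthetic dataset could then be shuffled and truncated to that size before concatenation, for example with `synthetic_dataset.shuffle(seed=42).select(range(n))`.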