# Azerbaijani ASR Model Training

This notebook fine-tunes a Whisper Automatic Speech Recognition (ASR) model for the Azerbaijani language.

**Features:**
- Toggle between sample mode (CPU) and full training (GPU)
- Auto-detection of available hardware
- Fine-tuning Whisper model on Azerbaijani dataset

## 1. Configuration

**Switch `SAMPLE_MODE` to control training:**
- `True` = Small sample, CPU training, quick testing
- `False` = Full dataset, GPU training, production model

In [1]:
# ============================================================
# MAIN SWITCH - Toggle between sample and full training
# ============================================================
SAMPLE_MODE = True  # Set to False for full dataset training

# ============================================================
# Configuration based on mode
# ============================================================
CONFIG = {
    "sample": {
        "dataset_size": 100,          # Number of samples for testing
        "batch_size": 4,
        "epochs": 1,
        "learning_rate": 1e-5,
        "max_steps": 50,
        "eval_steps": 25,
        "save_steps": 25,
        "warmup_steps": 10,
        "gradient_accumulation_steps": 1,
        "fp16": False,                 # CPU doesn't support fp16
    },
    "full": {
        "dataset_size": None,          # Use full dataset
        "batch_size": 16,
        "epochs": 3,
        "learning_rate": 1e-5,
        "max_steps": -1,               # Train for full epochs
        "eval_steps": 500,
        "save_steps": 500,
        "warmup_steps": 500,
        "gradient_accumulation_steps": 2,
        "fp16": True,                  # GPU supports fp16
    }
}

# Select active config
ACTIVE_CONFIG = CONFIG["sample"] if SAMPLE_MODE else CONFIG["full"]

# Model configuration
MODEL_NAME = "openai/whisper-small"  # Options: whisper-tiny, whisper-base, whisper-small, whisper-medium
DATASET_NAME = "LocalDoc/azerbaijani_asr"
OUTPUT_DIR = "./whisper-azerbaijani-sample" if SAMPLE_MODE else "./whisper-azerbaijani"
LANGUAGE = "azerbaijani"
TASK = "transcribe"

print(f"Mode: {'SAMPLE (CPU)' if SAMPLE_MODE else 'FULL (GPU)'}")
print(f"Dataset size: {ACTIVE_CONFIG['dataset_size'] or 'Full'}")
print(f"Batch size: {ACTIVE_CONFIG['batch_size']}")
print(f"Epochs: {ACTIVE_CONFIG['epochs']}")
print(f"Output directory: {OUTPUT_DIR}")

Mode: SAMPLE (CPU)
Dataset size: 100
Batch size: 4
Epochs: 1
Output directory: ./whisper-azerbaijani-sample
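
The two presets differ mainly in how many examples contribute to each optimizer update and in how long training runs. The sketch below works through that arithmetic. It is an approximation only: the helper names and the 10,000-example dataset size are hypothetical, and the Trainer's own step count can differ slightly because of rounding.

```python
import math

def effective_batch_size(batch_size: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of examples that contribute to one optimizer update."""
    return batch_size * grad_accum_steps * num_devices

def optimizer_steps(num_examples: int, batch_size: int, grad_accum_steps: int,
                    epochs: int, max_steps: int = -1) -> int:
    """Approximate number of optimizer updates for a run."""
    # A positive max_steps takes precedence over the epoch count.
    if max_steps > 0:
        return max_steps
    per_epoch = math.ceil(num_examples / effective_batch_size(batch_size, grad_accum_steps))
    return per_epoch * epochs

# "sample" preset: max_steps=50 decides the run length.
print(optimizer_steps(100, batch_size=4, grad_accum_steps=1, epochs=1, max_steps=50))

# "full" preset with a hypothetical 10,000-example training split.
print(effective_batch_size(16, 2))
print(optimizer_steps(10_000, batch_size=16, grad_accum_steps=2, epochs=3))
```

In sample mode, 100 examples at batch size 4 give 25 steps per epoch, so `max_steps=50` amounts to roughly two passes over the sample even though `epochs` is 1.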


## 2. SSL Configuration (Corporate Network Fix)

In [2]:
import os
import ssl
import sys

# ============================================================
# SSL BYPASS FOR CORPORATE NETWORKS  
# Must run BEFORE any huggingface imports
# ============================================================

# Disable Xet storage (causes 503 errors)
os.environ['HF_HUB_DISABLE_XET'] = '1'

# Block hf_xet from being imported
sys.modules['hf_xet'] = None

# SSL environment variables
os.environ['HF_HUB_DISABLE_SSL_VERIFY'] = '1'
os.environ['CURL_CA_BUNDLE'] = ''
os.environ['REQUESTS_CA_BUNDLE'] = ''

# Disable SSL globally
ssl._create_default_https_context = ssl._create_unverified_context

# Suppress warnings
import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
import warnings
warnings.filterwarnings('ignore', message='Unverified HTTPS request')

# Patch requests to disable SSL verification
import requests
_orig_request = requests.Session.request

def _patched_request(self, method, url, **kwargs):
    kwargs['verify'] = False
    return _orig_request(self, method, url, **kwargs)

requests.Session.request = _patched_request

print("SSL verification disabled")
print("Xet storage disabled - using regular HTTP")

SSL verification disabled
Xet storage disabled - using regular HTTP
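
If the corporate proxy's root certificate is available as a file, a narrower alternative to disabling verification is to point the HTTP clients at that certificate. The sketch below is a minimal illustration under assumptions: `use_ca_bundle` is a hypothetical helper, the path is a placeholder, and whether your installed library versions honor each environment variable is worth verifying.

```python
import os

def use_ca_bundle(ca_path: str) -> bool:
    """Point common HTTP clients at a custom CA bundle.

    Returns False and changes nothing if the file does not exist.
    """
    if not os.path.isfile(ca_path):
        return False
    # REQUESTS_CA_BUNDLE / CURL_CA_BUNDLE are read by requests,
    # SSL_CERT_FILE by OpenSSL-based clients.
    for var in ("REQUESTS_CA_BUNDLE", "CURL_CA_BUNDLE", "SSL_CERT_FILE"):
        os.environ[var] = ca_path
    return True

# Placeholder path (assumption): replace with your organization's CA file.
if not use_ca_bundle("/etc/ssl/certs/corporate-ca.pem"):
    print("CA bundle not found; SSL settings left unchanged")
```

Like the bypass cell, this would need to run before any Hugging Face imports.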


## 3. Install Dependencies (if needed)

In [3]:
# Uncomment if packages not installed
# !pip install -q datasets transformers accelerate evaluate jiwer librosa soundfile tensorboard

## 4. Device Detection

In [4]:
import torch

def detect_device():
    """Auto-detect the best available device."""
    if torch.cuda.is_available():
        device = "cuda"
        device_name = torch.cuda.get_device_name(0)
        memory = torch.cuda.get_device_properties(0).total_memory / 1e9
        print(f"GPU detected: {device_name}")
        print(f"GPU memory: {memory:.1f} GB")
    elif torch.backends.mps.is_available():
        device = "mps"  # Apple Silicon
        print("Apple Silicon (MPS) detected")
    else:
        device = "cpu"
        print("No GPU detected, using CPU")
    
    return device

DEVICE = detect_device()

# Override fp16 based on device: fp16 mixed precision in the Trainer needs a CUDA GPU
if DEVICE != "cuda":
    ACTIVE_CONFIG["fp16"] = False
    print(f"\nNote: fp16 disabled for {DEVICE.upper()} training")

# Warning if running full training on CPU
if not SAMPLE_MODE and DEVICE == "cpu":
    print("\n" + "="*50)
    print("WARNING: Full training on CPU will be very slow!")
    print("Consider using SAMPLE_MODE=True or getting GPU access")
    print("="*50)

No GPU detected, using CPU

Note: fp16 disabled for CPU training
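
The device string ends up driving several training flags. The sketch below gathers that mapping in one place, assuming the conservative rule that fp16 mixed precision is enabled only on CUDA. The helper name `resolve_device_flags` is hypothetical and is not part of the notebook.

```python
def resolve_device_flags(device: str, fp16_requested: bool) -> dict:
    """Map a device string ("cuda", "mps", "cpu") to training flags."""
    return {
        # Assumption: only CUDA gets fp16 mixed precision.
        "fp16": fp16_requested and device == "cuda",
        # Force CPU execution only when no accelerator was detected.
        "use_cpu": device == "cpu",
    }

for device in ("cuda", "mps", "cpu"):
    print(device, resolve_device_flags(device, fp16_requested=True))
```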


## 5. Load Dataset

In [None]:
from datasets import load_dataset, load_from_disk, Audio
import os

print(f"Loading dataset: {DATASET_NAME}")

# Try loading from local disk first (if downloaded via download_data.py)
LOCAL_DATA_DIR = "./data"

if os.path.exists(LOCAL_DATA_DIR):
    print(f"Loading from local: {LOCAL_DATA_DIR}")
    dataset = load_from_disk(LOCAL_DATA_DIR)
else:
    print("Local data not found. Loading with streaming mode...")
    print("(Run 'python download_data.py' first for faster loading)")
    
    # Use streaming to avoid download issues
    dataset_stream = load_dataset(DATASET_NAME, streaming=True, trust_remote_code=True)
    
    # Convert streaming to regular dataset (takes subset for sample mode)
    if SAMPLE_MODE:
        n_samples = ACTIVE_CONFIG["dataset_size"]
        print(f"Taking {n_samples} samples from stream...")
        
        train_samples = list(dataset_stream["train"].take(n_samples))
        test_samples = list(dataset_stream["test"].take(20)) if "test" in dataset_stream else train_samples[:20]
        
        from datasets import Dataset, DatasetDict
        dataset = DatasetDict({
            "train": Dataset.from_list(train_samples),
            "test": Dataset.from_list(test_samples),
        })
    else:
        raise RuntimeError(
            "Full dataset required but not downloaded locally.\n"
            "Run 'python download_data.py' first to download the full dataset."
        )

print(f"\nDataset structure:")
print(dataset)

# Show sample
print(f"\nSample from training set:")
print(dataset["train"][0])

In [None]:
# Subsample if in sample mode (only needed for local dataset)
if SAMPLE_MODE and ACTIVE_CONFIG["dataset_size"] and os.path.exists(LOCAL_DATA_DIR):
    sample_size = ACTIVE_CONFIG["dataset_size"]
    
    # Sample from train split
    if len(dataset["train"]) > sample_size:
        dataset["train"] = dataset["train"].select(range(sample_size))
    
    # Sample from test/validation split if exists
    eval_size = min(sample_size // 5, 20)
    if "test" in dataset and len(dataset["test"]) > eval_size:
        dataset["test"] = dataset["test"].select(range(eval_size))
    elif "validation" in dataset and len(dataset["validation"]) > eval_size:
        dataset["validation"] = dataset["validation"].select(range(eval_size))
    
    print(f"Subsampled dataset for testing:")
    print(dataset)
else:
    print(f"Using dataset:")
    print(dataset)
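
The subsampling rule keeps at most `dataset_size` training examples and caps the evaluation split at one fifth of that, never more than 20. The sketch below restates that arithmetic as a standalone function. The name `subsample_sizes` and the split lengths passed to it are hypothetical.

```python
def subsample_sizes(train_len: int, eval_len: int, sample_size: int, eval_cap: int = 20):
    """Return (train_n, eval_n) after applying the sample-mode limits."""
    train_n = min(train_len, sample_size)
    # Eval size is a fifth of the sample size, capped at eval_cap,
    # and never larger than the split itself.
    eval_n = min(eval_len, sample_size // 5, eval_cap)
    return train_n, eval_n

print(subsample_sizes(train_len=5000, eval_len=500, sample_size=100))
print(subsample_sizes(train_len=60, eval_len=10, sample_size=100))
```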

## 6. Load Whisper Model & Processor

In [None]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration

print(f"Loading model: {MODEL_NAME}")

# Load processor (tokenizer + feature extractor)
processor = WhisperProcessor.from_pretrained(MODEL_NAME, language=LANGUAGE, task=TASK)

# Load model
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

# Configure model for Azerbaijani
model.generation_config.language = LANGUAGE
model.generation_config.task = TASK
model.generation_config.forced_decoder_ids = None

print(f"Model loaded successfully!")
print(f"Model parameters: {model.num_parameters():,}")

## 7. Preprocess Dataset

In [None]:
from functools import partial

# Resample audio to 16kHz (Whisper requirement)
SAMPLING_RATE = 16000

# Detect audio and text column names
sample = dataset["train"][0]
audio_column = next((c for c in ("audio", "path") if c in sample), None)
text_column = next((c for c in ("sentence", "text", "transcription") if c in sample), None)
if audio_column is None or text_column is None:
    raise KeyError(f"Could not find audio/text columns in: {list(sample.keys())}")

print(f"Audio column: {audio_column}")
print(f"Text column: {text_column}")

# Cast audio column to Audio type with correct sampling rate
dataset = dataset.cast_column(audio_column, Audio(sampling_rate=SAMPLING_RATE))

print(f"\nAudio resampled to {SAMPLING_RATE} Hz")

In [None]:
def prepare_dataset(batch, audio_column, text_column, processor):
    """Prepare a single example for training (map is called without batched=True)."""
    # Accessing the audio column decodes the file and resamples it to 16 kHz
    audio = batch[audio_column]
    
    # Compute input features from audio
    batch["input_features"] = processor.feature_extractor(
        audio["array"], 
        sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    
    # Encode target text
    batch["labels"] = processor.tokenizer(batch[text_column]).input_ids
    
    return batch

# Create partial function with fixed arguments
prepare_fn = partial(
    prepare_dataset, 
    audio_column=audio_column, 
    text_column=text_column,
    processor=processor
)

# Process dataset
print("Processing dataset...")
dataset = dataset.map(
    prepare_fn,
    remove_columns=dataset.column_names["train"],
    num_proc=1 if SAMPLE_MODE else 4,  # Use multiprocessing for full dataset
)

print(f"Dataset processed!")
print(dataset)
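
Each processed example has the same feature shape, because Whisper's feature extractor pads or trims audio to a fixed window before computing the log-mel spectrogram. The sketch below derives that shape. The constants are assumed to match the default feature extractor settings for `whisper-small` (30-second window, hop length 160, 80 mel bins).

```python
SAMPLING_RATE = 16_000   # Hz
CHUNK_SECONDS = 30       # assumed fixed window length
HOP_LENGTH = 160         # assumed samples between spectrogram frames
N_MELS = 80              # assumed number of mel bins

def whisper_feature_shape(sampling_rate=SAMPLING_RATE, chunk_seconds=CHUNK_SECONDS,
                          hop_length=HOP_LENGTH, n_mels=N_MELS):
    """Return (n_mels, n_frames) for one padded or trimmed audio window."""
    n_samples = sampling_rate * chunk_seconds
    n_frames = n_samples // hop_length
    return n_mels, n_frames

shape = whisper_feature_shape()
bytes_per_example = shape[0] * shape[1] * 4  # float32 values
print(shape, bytes_per_example)
```

Under these assumptions each example costs about 1 MB of features regardless of clip length, which is useful when estimating disk and memory needs for the full dataset.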

## 8. Data Collator

In [None]:
import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    """Data collator that dynamically pads the inputs and labels."""
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Split inputs and labels
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        # Pad input features
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad labels
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # Replace padding with -100 to ignore in loss
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # Drop the leading decoder start token if the tokenizer added it;
        # the model prepends it again when it shifts labels into decoder inputs
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

# Create data collator
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)

print("Data collator created!")
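
The label handling in the collator can be hard to follow through tensor operations. The sketch below reproduces the same two steps on plain Python lists: pad with -100 so the padded positions are ignored by the loss, then drop a leading start token if every sequence has one. The token ids are made up for illustration, and `pad_labels` is a hypothetical helper.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def pad_labels(label_seqs: List[List[int]], start_token_id: int) -> List[List[int]]:
    """Pad non-empty label sequences to equal length and strip a shared start token."""
    max_len = max(len(seq) for seq in label_seqs)
    padded = [list(seq) + [IGNORE_INDEX] * (max_len - len(seq)) for seq in label_seqs]
    # Strip the first column only when every row starts with the start token.
    if all(row[0] == start_token_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded

# Hypothetical token ids; 1 plays the role of the decoder start token.
print(pad_labels([[1, 5, 6], [1, 7]], start_token_id=1))
```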

## 9. Evaluation Metrics

In [None]:
import evaluate

# Load WER metric
wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    """Compute Word Error Rate (WER)."""
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # Replace -100 with pad token id
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # Decode predictions and references
    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    # Compute WER
    wer = 100 * wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

print("Metrics configured!")
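
Word Error Rate is the word-level edit distance between prediction and reference, divided by the number of reference words. The sketch below is a minimal pure-Python version for intuition, pooling errors and word counts over all sentence pairs. It is not the `evaluate` implementation, and the example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def corpus_wer(references, predictions):
    """WER in percent, pooled over all sentence pairs."""
    errors = 0
    words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        words += len(ref_words)
    return 100 * errors / words

# One substituted word out of three reference words.
print(corpus_wer(["salam dünya necəsən"], ["salam dunya necəsən"]))
```

Because the comparison is on whitespace-separated words, a single wrong diacritic counts as a full word error.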

## 10. Training Configuration

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir=OUTPUT_DIR,
    
    # Training parameters
    per_device_train_batch_size=ACTIVE_CONFIG["batch_size"],
    per_device_eval_batch_size=ACTIVE_CONFIG["batch_size"],
    gradient_accumulation_steps=ACTIVE_CONFIG["gradient_accumulation_steps"],
    learning_rate=ACTIVE_CONFIG["learning_rate"],
    num_train_epochs=ACTIVE_CONFIG["epochs"],
    max_steps=ACTIVE_CONFIG["max_steps"],
    warmup_steps=ACTIVE_CONFIG["warmup_steps"],
    
    # Precision
    fp16=ACTIVE_CONFIG["fp16"],
    
    # Evaluation and saving
    eval_strategy="steps",
    eval_steps=ACTIVE_CONFIG["eval_steps"],
    save_strategy="steps",
    save_steps=ACTIVE_CONFIG["save_steps"],
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    
    # Generation settings for evaluation
    predict_with_generate=True,
    generation_max_length=225,
    
    # Logging
    logging_steps=25 if SAMPLE_MODE else 100,
    report_to=["tensorboard"],
    
    # Device
    use_cpu=(DEVICE == "cpu"),
    
    # Misc
    push_to_hub=False,
    remove_unused_columns=False,
)

print("Training arguments configured!")
print(f"\nKey settings:")
print(f"  - Batch size: {ACTIVE_CONFIG['batch_size']}")
print(f"  - Learning rate: {ACTIVE_CONFIG['learning_rate']}")
print(f"  - Epochs: {ACTIVE_CONFIG['epochs']}")
print(f"  - FP16: {ACTIVE_CONFIG['fp16']}")
print(f"  - Device: {DEVICE}")
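
With `load_best_model_at_end=True`, checkpoints have to line up with evaluations, so `save_steps` should be a multiple of `eval_steps`. The sketch below is a small pre-flight check for that and for one other easy mistake. `check_step_schedule` is a hypothetical helper, not a Transformers API.

```python
def check_step_schedule(eval_steps: int, save_steps: int, max_steps: int = -1) -> list:
    """Return a list of human-readable problems; empty means the schedule looks fine."""
    problems = []
    if save_steps % eval_steps != 0:
        problems.append("save_steps should be a multiple of eval_steps "
                        "when load_best_model_at_end=True")
    if max_steps > 0 and eval_steps > max_steps:
        problems.append("eval_steps exceeds max_steps, so no evaluation "
                        "would run during training")
    return problems

print(check_step_schedule(eval_steps=25, save_steps=25, max_steps=50))   # sample preset
print(check_step_schedule(eval_steps=500, save_steps=500))               # full preset
print(check_step_schedule(eval_steps=500, save_steps=750))
```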

## 11. Initialize Trainer

In [None]:
from transformers import Seq2SeqTrainer

# Determine eval dataset
eval_dataset = None
if "test" in dataset:
    eval_dataset = dataset["test"]
elif "validation" in dataset:
    eval_dataset = dataset["validation"]
else:
    # Use a portion of training data for evaluation
    split = dataset["train"].train_test_split(test_size=0.1, seed=42)
    dataset["train"] = split["train"]
    eval_dataset = split["test"]
    print("Created eval split from training data (10%)")

# Create trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    processing_class=processor.feature_extractor,
)

print("Trainer initialized!")
print(f"\nDataset sizes:")
print(f"  - Train: {len(dataset['train'])} samples")
print(f"  - Eval: {len(eval_dataset)} samples")
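
When no test or validation split exists, the cell carves 10% out of the training data with a fixed seed, so the split is the same on every run. The sketch below shows the idea with index lists. It illustrates seeded splitting only: it will not reproduce the exact examples that `train_test_split` selects, and the rounding rule is an assumption.

```python
import random

def split_indices(n: int, test_fraction: float = 0.1, seed: int = 42):
    """Shuffle indices 0..n-1 with a fixed seed and split off a test portion."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n * test_fraction))  # assumed rounding rule
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))
```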

## 12. Train Model

In [None]:
print(f"Starting training in {'SAMPLE' if SAMPLE_MODE else 'FULL'} mode...")
print(f"Device: {DEVICE}")
print("="*50)

# Train!
train_result = trainer.train()

print("="*50)
print("Training completed!")
print(f"\nTraining metrics:")
for key, value in train_result.metrics.items():
    print(f"  {key}: {value:.4f}" if isinstance(value, float) else f"  {key}: {value}")

## 13. Evaluate Model

In [None]:
print("Running evaluation...")

eval_results = trainer.evaluate()

print(f"\nEvaluation results:")
for key, value in eval_results.items():
    print(f"  {key}: {value:.4f}" if isinstance(value, float) else f"  {key}: {value}")

## 14. Save Model

In [None]:
# Save the model and processor
trainer.save_model(OUTPUT_DIR)
processor.save_pretrained(OUTPUT_DIR)

print(f"Model saved to: {OUTPUT_DIR}")

## 15. Test Inference

In [None]:
from transformers import pipeline

# Load the trained model for inference
pipe = pipeline(
    "automatic-speech-recognition",
    model=OUTPUT_DIR,
    device=0 if DEVICE == "cuda" else -1,
)

print("Inference pipeline loaded!")

In [None]:
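# Test with samples from the eval dataset (already loaded)
print("Testing model on samples:\n")

# Use existing eval_dataset
test_samples = eval_dataset.select(range(min(5, len(eval_dataset))))

# The eval set stores log-mel features rather than raw audio, so decode
# them with the in-memory model instead of the audio pipeline
model.eval()

for i in range(len(test_samples)):
    sample = test_samples[i]
    features = torch.tensor(sample["input_features"]).unsqueeze(0).to(model.device)

    with torch.no_grad():
        pred_ids = model.generate(input_features=features, max_length=225)

    prediction = processor.tokenizer.decode(pred_ids[0], skip_special_tokens=True)
    reference = processor.tokenizer.decode(sample["labels"], skip_special_tokens=True)
    print(f"Sample {i+1}:")
    print(f"  Reference:  {reference}")
    print(f"  Prediction: {prediction}")
    print()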

print("To test with your own audio:")
print("  result = pipe('path/to/audio.wav')")
print("  print(result['text'])")

## 16. Push to Hugging Face Hub (Optional)

In [None]:
# Uncomment and run to push to Hub
# from huggingface_hub import notebook_login
# notebook_login()

# HUB_MODEL_NAME = "your-username/whisper-azerbaijani"
# trainer.push_to_hub(HUB_MODEL_NAME)
# processor.push_to_hub(HUB_MODEL_NAME)
# print(f"Model pushed to: https://huggingface.co/{HUB_MODEL_NAME}")

---

## Summary

This notebook fine-tunes a Whisper model for Azerbaijani ASR.

**To switch from sample to full training:**
1. Set `SAMPLE_MODE = False` in Cell 1
2. Restart the kernel and run all cells

**Next steps:**
- Experiment with different model sizes (whisper-tiny, whisper-base, whisper-medium)
- Adjust learning rate and batch size for your hardware
- Add data augmentation for better robustness
- Push the final model to Hugging Face Hub