## Final Hugging Face Model Link:
https://huggingface.co/ishmamzarif/augmented_normal_v2_extra_dataset_bangla-whisper-epoch-6

## Table of Contents
1. Setup & Installs
    - Installing dependencies
    - Global seeding
    - Hugging Face Login
2. Data Loading (Train / Validation / Test)
    - Loaded original dataset
    - Loaded extra dataset
3. Train/Validation Split
4. Model Setup & Preprocessing
    - Checkpoint Saving Logic
5. Creating Datasets
6. Preprocessing Function
7. Applying Preprocessing
    - Data Collator
    - Logic for evaluation metrics
8. Training Model
    - Configuring training
    - Initializing trainer
    - Initializing training arguments
9. Evaluation & Basic Inference

## Pipeline:
Required Installations -> Data Loading -> Dataset splitting and preprocessing -> Training model -> Evaluation

In [None]:
!unzip -q shobdotori -d /content/
!unzip -q ExtraDataSet.zip -d /content/

In [1]:
!pip install -q huggingface_hub transformers
from huggingface_hub import notebook_login

notebook_login()


## 1. Setup & Installs

In [None]:
import os
os.environ['PYTHONWARNINGS'] = 'ignore::FutureWarning,ignore::DeprecationWarning'

# Uninstall potentially conflicting packages first
!pip uninstall -y transformers accelerate -q

# Install core packages with compatible versions
!pip install -q transformers==4.44.2 accelerate==0.33.0
!pip install -q datasets==2.20.0 evaluate==0.4.1 jiwer==3.0.3
!pip install -q librosa==0.10.1 soundfile==0.12.1 tqdm
!pip install -q scikit-learn pandas

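# Pin fsspec to the version the installed gcsfs expects.
# Note: pip reports that datasets 2.20.0 wants fsspec<=2024.5.0, so this pin trades one conflict for another.
!pip install -q fsspec==2025.3.0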


!pip install -q sentencepiece

print("\n" + "="*80)
print("All dependencies installed successfully!")
print("="*80)


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2025.3.0 requires fsspec==2025.3.0, but you have fsspec 2024.5.0 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 2.20.0 requires fsspec[http]<=2024.5.0,>=2023.1.0, but you have fsspec 2025.3.0 which is incompatible.
All dependencies installed successfully!


In [None]:
# Verify installations and check versions
import sys
import torch
import transformers
import datasets
import librosa
import soundfile
import numpy as np

print("="*60)
print("ENVIRONMENT VERIFICATION")
print("="*60)
print(f"Python version: {sys.version.split()[0]}")
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"Datasets version: {datasets.__version__}")
print(f"Librosa version: {librosa.__version__}")
print(f"Soundfile version: {soundfile.__version__}")
print(f"NumPy version: {np.__version__}")
print("\n" + "="*60)
print("CUDA INFO")
print("="*60)
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
print("="*60)


import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    """Set seed for reproducibility across all libraries"""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    print(f"Global seed set to {seed}")

# Set seed globally
set_seed(42)

ENVIRONMENT VERIFICATION
Python version: 3.12.12
PyTorch version: 2.8.0+cu126
Transformers version: 4.44.2
Datasets version: 2.20.0
Librosa version: 0.10.1
Soundfile version: 0.12.1
NumPy version: 1.26.4

CUDA INFO
CUDA available: True
GPU: Tesla T4
GPU Memory: 15.83 GB
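
The effect of global seeding can be illustrated without a GPU. The sketch below uses only the standard-library `random` module (the notebook's `set_seed` additionally seeds NumPy and PyTorch); `draw_sample` is a hypothetical helper introduced for illustration.

```python
import random

def draw_sample(seed: int, n: int = 5) -> list:
    """Seed the stdlib RNG, then draw n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 999) for _ in range(n)]

first = draw_sample(42)
second = draw_sample(42)

# Re-seeding with the same value replays the same sequence.
print(first == second)
```

Seeding alone does not guarantee bit-identical GPU training runs, but it does fix the data split and any Python-level shuffling.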


In [None]:
import os
from pathlib import Path

class Config:
    # Paths - UPDATE THESE IF YOUR DATA IS IN A DIFFERENT LOCATION
    BASE_PATH = "/content/shobdotori"
    TRAIN_AUDIO_PATH = f"{BASE_PATH}/Train"
    TRAIN_ANNOTATION_PATH = f"{BASE_PATH}/Train_annotation"
    TEST_AUDIO_PATH = f"{BASE_PATH}/Test"
    OUTPUT_DIR = "./whisper-bangla-dialect"
    EXTRA_DATA_PATH = "/content/ExtraDataSet"

    # Whisper Model
    MODEL_NAME = "zarifmahir21/finetuned-modelv6"
    LANGUAGE = "bengali"
    TASK = "transcribe"

    # Training parameters - Whisper
    BATCH_SIZE = 4  # For T4 GPU
    GRADIENT_ACCUMULATION_STEPS = 4  # Effective batch size = 16
    LEARNING_RATE = 2e-5
    WARMUP_STEPS = 200
    NUM_TRAIN_EPOCHS = 8
    LOGGING_STEPS = 50

    # Audio processing
    SAMPLING_RATE = 16000
    MAX_AUDIO_LENGTH = 30  # seconds


# Verify paths exist
print("Checking paths...")
print(f"Base path exists: {os.path.exists(Config.BASE_PATH)}")
print(f"Train audio exists: {os.path.exists(Config.TRAIN_AUDIO_PATH)}")
print(f"Train annotations exist: {os.path.exists(Config.TRAIN_ANNOTATION_PATH)}")
print(f"Test audio exists: {os.path.exists(Config.TEST_AUDIO_PATH)}")

if not os.path.exists(Config.TRAIN_AUDIO_PATH):
    print("\nWARNING: Update Config.BASE_PATH to match your Drive structure!")

Checking paths...
Base path exists: True
Train audio exists: True
Train annotations exist: True
Test audio exists: True


## 2. Data Loading (Train / Validation / Test)

#### The external dataset is described in the `description.docx` file.

In [None]:
import pandas as pd
import os

def load_training_data(train_audio_path, train_annotation_path, extradata_path):
    """Load and combine all regional dialect data"""
    all_data = []

    # Loading the extra data
    region_folders = sorted([f for f in os.listdir(extradata_path)
                              if os.path.isdir(os.path.join(extradata_path, f))])

    for region in region_folders:
        word_folder_path = os.path.join(extradata_path, region)
        word_folders = [f for f in os.listdir(word_folder_path)
                          if os.path.isdir(os.path.join(word_folder_path, f))]
        for word in word_folders:
            current_word_rows = []  # Rows for the current word only
            audio_folder_path = os.path.join(word_folder_path, word)
            # Keep regular files with a common audio extension (skip subdirectories)
            audios = [f for f in os.listdir(audio_folder_path)
                        if os.path.isfile(os.path.join(audio_folder_path, f)) and f.lower().endswith(('.wav', '.mp3', '.flac', '.ogg'))]

            if not audios:
                print(f"Warning: No audio files found in {audio_folder_path} for word '{word}' in region '{region}'")
                continue

            for a in audios:
                current_word_rows.append({"audio": a, "text": word})

            # Build a DataFrame from this word's rows; the folder name serves as the transcript
            df = pd.DataFrame(current_word_rows)
            df['audio_path'] = df['audio'].apply(lambda x: os.path.join(audio_folder_path, x))
            df['region'] = region
            all_data.append(df)
            print(f"Loaded {len(df)} samples from {region} of word: {word}")


    # Get all region folders
    region_folders = sorted([f for f in os.listdir(train_audio_path)
                            if os.path.isdir(os.path.join(train_audio_path, f))])

    print(f"Found {len(region_folders)} regional dialects")

    for region in region_folders:
        csv_files = [f for f in os.listdir(train_annotation_path)
                    if region.lower() in f.lower() and f.endswith('.csv')]

        if not csv_files:
            print(f"Warning: No CSV found for region {region}")
            continue

        csv_path = os.path.join(train_annotation_path, csv_files[0])
        df = pd.read_csv(csv_path)

        # Add full audio paths
        audio_dir = os.path.join(train_audio_path, region)
        df['audio_path'] = df['audio'].apply(lambda x: os.path.join(audio_dir, x))
        df['region'] = region

        # Verify files exist
        before_count = len(df)
        df = df[df['audio_path'].apply(os.path.exists)]
        after_count = len(df)

        if before_count != after_count:
            print(f"{region}: {before_count - after_count} audio files not found")

        all_data.append(df)
        print(f"Loaded {len(df)} samples from {region}")




    # Combine all data
    combined_df = pd.concat(all_data, ignore_index=True)

    # Remove any rows with empty text
    combined_df = combined_df[combined_df['text'].notna() & (combined_df['text'].str.strip() != '')]

    print(f"\n{'='*60}")
    print(f"Total training samples: {len(combined_df)}")
    print(f"{'='*60}")

    return combined_df

# Load the data
print("Loading training data...\n")
train_df = load_training_data(Config.TRAIN_AUDIO_PATH, Config.TRAIN_ANNOTATION_PATH, Config.EXTRA_DATA_PATH)

print("\nSample data:")
print(train_df[['audio', 'text', 'region']].head(3))

Loading training data...

Loaded 6 samples from BARISAL_Division of word: অভাব
Loaded 9 samples from BARISAL_Division of word: হ্যাঁ
Loaded 5 samples from BARISAL_Division of word: আমাদের
Loaded 16 samples from BARISAL_Division of word: ভয়
Loaded 6 samples from BARISAL_Division of word: আমরা
Loaded 10 samples from BARISAL_Division of word: সেই
Loaded 6 samples from BARISAL_Division of word: আসুন
Loaded 12 samples from BARISAL_Division of word: সে
Loaded 4 samples from BARISAL_Division of word: নেয়া
Loaded 11 samples from BARISAL_Division of word: আমি
Loaded 11 samples from BARISAL_Division of word: সুন্দর
Loaded 2 samples from BARISAL_Division of word: সত্য
Loaded 8 samples from BARISAL_Division of word: উঠানো
Loaded 10 samples from BARISAL_Division of word: ভালো
Loaded 5 samples from BARISAL_Division of word: তারপর
Loaded 11 samples from BARISAL_Divisio
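
The extra dataset is read from a `region/word/audio-file` folder layout, where the word folder name doubles as the transcript. The following is a minimal sketch of that traversal using only the standard library; `collect_rows` and the demo folder names are hypothetical and exist only to exercise the idea on a throwaway directory.

```python
import os
import tempfile

AUDIO_EXTENSIONS = (".wav", ".mp3", ".flac", ".ogg")

def collect_rows(extra_root: str) -> list:
    """Walk <root>/<region>/<word>/<audio files> and return one row per clip."""
    rows = []
    for region in sorted(os.listdir(extra_root)):
        region_dir = os.path.join(extra_root, region)
        if not os.path.isdir(region_dir):
            continue
        for word in sorted(os.listdir(region_dir)):
            word_dir = os.path.join(region_dir, word)
            if not os.path.isdir(word_dir):
                continue
            for name in sorted(os.listdir(word_dir)):
                path = os.path.join(word_dir, name)
                if os.path.isfile(path) and name.lower().endswith(AUDIO_EXTENSIONS):
                    # The folder name is the transcript for every clip inside it.
                    rows.append({"audio_path": path, "text": word, "region": region})
    return rows

# Build a tiny throwaway tree to exercise the function (names are made up).
with tempfile.TemporaryDirectory() as root:
    word_dir = os.path.join(root, "DEMO_Division", "hello")
    os.makedirs(word_dir)
    for fname in ("a.wav", "b.WAV", "notes.txt"):
        open(os.path.join(word_dir, fname), "w").close()
    rows = collect_rows(root)

print(len(rows), rows[0]["text"], rows[0]["region"])
```

Non-audio files such as `notes.txt` are skipped, and the extension check is case-insensitive.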

## 3. Train/Validation Split

In [8]:
from sklearn.model_selection import train_test_split

# Split into train and validation (90/10 split, stratified by region)
train_split, val_split = train_test_split(
    train_df,
    test_size=0.1,
    random_state=42,
    stratify=train_df['region']
)

print(f"Training samples: {len(train_split)}")
print(f"Validation samples: {len(val_split)}")
print(f"\nValidation region distribution:")
print(val_split['region'].value_counts().sort_index())

Training samples: 5190
Validation samples: 577

Validation region distribution:
region
BARISAL_Division       17
Barisal                 4
Bhola                  19
Bogura                 11
Brahmanbaria           11
CHITTAGONG_Division    56
Chittagong             40
Comilla                 5
DHAKA_Division         84
Dhaka                  11
Feni                    7
Jessore                 3
Jhenaidah              27
KHULNA_Division        13
Khulna                  2
Kushtia                 5
Lakshmipur             40
MYMENSINGH_Division    30
Mymensingh             40
Natore                 11
Noakhali                6
Pabna                  17
RAJSHAHI_Division      15
RANGPUR_Division        8
Rajshahi                7
Rangpur                36
SYLHET_Division        20
Sylhet                 32
Name: count, dtype: int64
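
The 577 validation rows are 10% of the 5,767 combined rows, rounded up, and stratification keeps each region's share roughly the same in both splits. The sketch below illustrates the idea in plain Python. It is not scikit-learn's algorithm: `stratified_split` is a hypothetical helper, and its per-group rounding is a simplification.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.1, seed=42):
    """Return (train_idx, test_idx) with roughly test_fraction of each label held out."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Hold out at least one sample per group so every region is evaluated.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = ["Dhaka"] * 100 + ["Sylhet"] * 50 + ["Khulna"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Without stratification, a small region such as Khulna (2 validation rows above) could end up with no validation samples at all.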


## 4. Model Setup & Preprocessing

In [9]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration

print(f"Loading {Config.MODEL_NAME}...")

# Load processor with language and task
processor = WhisperProcessor.from_pretrained(
    Config.MODEL_NAME,
    language=Config.LANGUAGE,
    task=Config.TASK
)

# Load model
model = WhisperForConditionalGeneration.from_pretrained(Config.MODEL_NAME)

# Explicitly set pad_token_id for model and tokenizer (crucial for Seq2SeqTrainer and data collator)
# Whisper's tokenizer often doesn't have a pad_token_id by default, but it's needed for data collators and internal logic.
if model.config.pad_token_id is None:
    # Use the EOS token ID as the pad token ID as is common practice for Whisper
    model.config.pad_token_id = processor.tokenizer.pad_token_id = processor.tokenizer.eos_token_id
    print(f"  Set model.config.pad_token_id and processor.tokenizer.pad_token_id to: {model.config.pad_token_id} (EOS token ID)")
elif processor.tokenizer.pad_token_id is None:
    processor.tokenizer.pad_token_id = model.config.pad_token_id
    print(f"  Set processor.tokenizer.pad_token_id to: {processor.tokenizer.pad_token_id} (from model config)")


# Configure model for training with proper forced decoder IDs
forced_decoder_ids = processor.get_decoder_prompt_ids(
    language=Config.LANGUAGE,
    task=Config.TASK
)
model.config.forced_decoder_ids = forced_decoder_ids
model.config.suppress_tokens = None
model.config.use_cache = False

def repo_name_for_epoch(base, epoch):
    return f"augmented_normal_v2_extra_dataset_{base}-epoch-{epoch}"

from transformers import TrainerCallback

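class PushEachEpochCallback(TrainerCallback):
    """Push the model and processor to a per-epoch Hub repo after every epoch."""
    def on_epoch_end(self, args, state, control, **kwargs):
        # state.epoch is a float and int() truncates it, so a value just below a
        # whole number is labelled with the previous epoch. This may be why the
        # training log repeats some repo names; round(state.epoch) would avoid it.
        epoch = int(state.epoch)
        repo = repo_name_for_epoch("bangla-whisper", epoch)
        print(f"\n📤 [Epoch {epoch}] Pushing model to HF repo: {repo}\n")

        model = kwargs["model"]
        model.push_to_hub(repo)

        # The processor bundles the feature extractor and the tokenizer,
        # so a single push covers both.
        try:
            processor.push_to_hub(repo)
        except Exception as exc:
            print(f"Could not push processor to {repo}: {exc}")

        return control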

print("Model and processor loaded successfully!")
print(f"  Model parameters: {model.num_parameters() / 1e6:.1f}M")
print(f"  Language: {Config.LANGUAGE}")
print(f"  Task: {Config.TASK}")

Loading zarifmahir21/finetuned-modelv6...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Model and processor loaded successfully!
  Model parameters: 241.7M
  Language: bengali
  Task: transcribe


## 5. Creating Datasets

In [10]:
from datasets import Dataset, Audio
import gc

def create_dataset(df, sampling_rate=16000):
    """Create HuggingFace Dataset with Audio feature"""
    dataset_dict = {
        'audio': df['audio_path'].tolist(),
        'text': df['text'].tolist(),
        'region': df['region'].tolist()
    }

    dataset = Dataset.from_dict(dataset_dict)

    # Cast audio column - this handles resampling to 16kHz automatically
    # Uses soundfile/librosa under the hood (no torchcodec needed)
    dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate, mono=True))

    return dataset

print("Creating datasets with Audio feature...")
train_dataset = create_dataset(train_split, Config.SAMPLING_RATE)
val_dataset = create_dataset(val_split, Config.SAMPLING_RATE)

print(f"Train dataset: {len(train_dataset)} samples")
print(f"Val dataset: {len(val_dataset)} samples")

# Clear memory
gc.collect()

Creating datasets with Audio feature...
Train dataset: 5190 samples
Val dataset: 577 samples


10348

## 6. Preprocessing Function

In [11]:
def prepare_dataset(batch):
    """Preprocess audio and text for Whisper training"""
    # Load audio (already resampled to 16kHz by Audio feature)
    audio = batch["audio"]
    audio_array = audio["array"]

    # Truncate if too long (30 seconds max)
    max_samples = Config.MAX_AUDIO_LENGTH * Config.SAMPLING_RATE
    if len(audio_array) > max_samples:
        audio_array = audio_array[:max_samples]

    # Compute log-Mel spectrogram features
    batch["input_features"] = processor.feature_extractor(
        audio_array,
        sampling_rate=audio["sampling_rate"]
    ).input_features[0]

    # Tokenize text labels
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids

    return batch

print("Preprocessing function defined!")

Preprocessing function defined!
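
The truncation step works in samples rather than seconds. A minimal sketch of the arithmetic, using plain lists in place of audio arrays (`truncate_audio` is a hypothetical helper; the two constants mirror `Config.SAMPLING_RATE` and `Config.MAX_AUDIO_LENGTH`):

```python
SAMPLING_RATE = 16_000   # Hz, as in Config.SAMPLING_RATE
MAX_AUDIO_LENGTH = 30    # seconds, as in Config.MAX_AUDIO_LENGTH

def truncate_audio(samples, sampling_rate=SAMPLING_RATE, max_seconds=MAX_AUDIO_LENGTH):
    """Keep at most max_seconds of audio, counted in samples."""
    max_samples = max_seconds * sampling_rate
    return samples[:max_samples]

max_samples = MAX_AUDIO_LENGTH * SAMPLING_RATE
long_clip = [0.0] * (35 * SAMPLING_RATE)   # 35 s of silence
short_clip = [0.0] * (2 * SAMPLING_RATE)   # 2 s of silence

print(max_samples, len(truncate_audio(long_clip)), len(truncate_audio(short_clip)))
```

Clips shorter than the limit pass through unchanged; only the tail of longer clips is discarded, so any transcript text spoken after 30 seconds would no longer match the audio.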


## 7. Applying Preprocessing

In [12]:
print("Preprocessing datasets...")
print("This will take several minutes. Please be patient.\n")

# Process training data
print("Processing training data...")
train_dataset = train_dataset.map(
    prepare_dataset,
    remove_columns=train_dataset.column_names,
    desc="Processing training data",
    num_proc=1,  # Sequential processing to avoid memory issues
)

# Force garbage collection
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("\nProcessing validation data...")
val_dataset = val_dataset.map(
    prepare_dataset,
    remove_columns=val_dataset.column_names,
    desc="Processing validation data",
    num_proc=1,
)

# Clear memory again
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print(f"\nPreprocessing complete!")
print(f"  Train samples: {len(train_dataset)}")
print(f"  Val samples: {len(val_dataset)}")

Preprocessing datasets...
This will take several minutes. Please be patient.

Processing training data...


Processing validation data...


Preprocessing complete!
  Train samples: 5190
  Val samples: 577



### Data Collator

In [13]:
from dataclasses import dataclass
from typing import Any, Dict, List, Union
import torch

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    """Custom data collator for Whisper training with dynamic padding"""
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Split inputs and labels
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Get labels
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # Replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # If the start-of-sequence token was added during tokenization,
        # drop it here because it is prepended again when the labels are shifted
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        # Explicitly create decoder_input_ids by shifting labels
        # This is crucial because `WhisperForConditionalGeneration` sometimes fails
        # to derive `decoder_input_ids` from `labels` automatically during training
        # when `predict_with_generate=True` or due to other interactions, leading to ValueError.

        # Make a copy of labels. Replace -100 (ignore_index) with the actual pad_token_id
        # because the decoder needs real tokens for its input, not -100.
        decoder_input_ids = labels.clone()
        decoder_input_ids[decoder_input_ids == -100] = self.processor.tokenizer.pad_token_id

        # Shift input_ids right: pad with self.decoder_start_token_id at the beginning.
        # This mirrors the model's internal `_shift_right` function.
        shifted_decoder_input_ids = decoder_input_ids.new_zeros(decoder_input_ids.shape)
        shifted_decoder_input_ids[:, 1:] = decoder_input_ids[:, :-1].clone()
        shifted_decoder_input_ids[:, 0] = self.decoder_start_token_id

        batch["decoder_input_ids"] = shifted_decoder_input_ids

        return batch

# Initialize data collator
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id
)

print("Data collator initialized!")

Data collator initialized!
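
The collator's label handling can be traced on plain Python lists. The sketch below mirrors its steps (pad, mask padding with -100, drop the leading start token, shift right) without PyTorch. The token ids are made up, and all helper names are hypothetical. Note that the padding is masked through the attention mask rather than by token id, which matters because the pad and end-of-sequence ids are the same here.

```python
IGNORE_INDEX = -100
START_ID = 1          # stand-in for decoder_start_token_id
EOS_ID = PAD_ID = 2   # pad token shares the EOS id, as configured in the notebook

def pad_labels(seqs, pad_id):
    """Right-pad to a common width and return (padded, attention_mask)."""
    width = max(len(s) for s in seqs)
    padded = [s + [pad_id] * (width - len(s)) for s in seqs]
    mask = [[1] * len(s) + [0] * (width - len(s)) for s in seqs]
    return padded, mask

def mask_padding(padded, mask):
    """Replace padded positions with IGNORE_INDEX so the loss skips them."""
    return [[t if m == 1 else IGNORE_INDEX for t, m in zip(row, mrow)]
            for row, mrow in zip(padded, mask)]

def strip_leading_start(labels, start_id):
    """Drop the first column when every row begins with the start token."""
    if all(row[0] == start_id for row in labels):
        return [row[1:] for row in labels]
    return labels

def shift_right(labels, start_id, pad_id):
    """Build decoder inputs: start token first, labels shifted one step right."""
    shifted = []
    for row in labels:
        real = [pad_id if t == IGNORE_INDEX else t for t in row]
        shifted.append([start_id] + real[:-1])
    return shifted

seqs = [[START_ID, 10, 11, EOS_ID], [START_ID, 20, EOS_ID]]
padded, mask = pad_labels(seqs, PAD_ID)
labels = strip_leading_start(mask_padding(padded, mask), START_ID)
decoder_input_ids = shift_right(labels, START_ID, PAD_ID)

print(labels)
print(decoder_input_ids)
```

The real end-of-sequence token in the shorter row stays in the labels, and only the position added by padding is ignored by the loss.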


### Evaluation Metrics (WER + Normalized Levenshtein Similarity)

In [14]:
import evaluate
import numpy as np
from jiwer import cer

# Load WER metric
metric = evaluate.load("wer")

def normalized_levenshtein_similarity(refs, hyps):
    """Compute normalized Levenshtein similarity (1 - normalized edit distance)"""
    dist = cer(refs, hyps)  # character error rate; usually in [0, 1] but can exceed 1 when the hypothesis has many insertions
    return 1.0 - dist

def compute_metrics(pred):
    """Compute WER and Normalized Levenshtein Similarity metrics"""
    pred_ids = pred.predictions

    # Handle tuple predictions from generation
    if isinstance(pred_ids, tuple):
        pred_ids = pred_ids[0]

    label_ids = pred.label_ids

    # Replace -100 with pad_token_id for proper decoding
    label_ids = np.where(label_ids != -100, label_ids, processor.tokenizer.pad_token_id)

    # Decode predictions and labels
    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    # Compute metrics
    wer_val = metric.compute(predictions=pred_str, references=label_str)
    norm_lev_sim = normalized_levenshtein_similarity(label_str, pred_str)

    return {
        "wer": wer_val * 100.0,
        "norm_levenshtein_similarity": norm_lev_sim * 100.0,
    }

print("Evaluation metrics configured!")
print("  - WER (Word Error Rate): Lower is better")
print("  - Normalized Levenshtein Similarity: Higher is better (0-100%)")

Evaluation metrics configured!
  - WER (Word Error Rate): Lower is better
  - Normalized Levenshtein Similarity: Higher is better (0-100%)
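
The similarity metric is one minus a character-level edit distance divided by the reference length. The sketch below computes it for a single reference/hypothesis pair in pure Python. It is an illustration of the definition, not jiwer's implementation (which aggregates over all pairs in a batch), and both function names are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def similarity(ref: str, hyp: str) -> float:
    """1 - (edit distance / reference length); requires a non-empty reference."""
    return 1.0 - levenshtein(ref, hyp) / len(ref)

print(similarity("kitten", "sitting"))
print(similarity("same", "same"))
print(similarity("ab", "abcdef"))
```

The last call shows that the score can go negative when the hypothesis is much longer than the reference, so the reported percentage is not strictly bounded below by 0. The sketch counts Unicode code points, so Bengali combining marks are treated as separate characters.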


### Training Configuration (Optimized)

In [15]:
from transformers import Seq2SeqTrainingArguments
import gc

# Clear memory before training
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Check device
device = "cuda" if torch.cuda.is_available() else "cpu"
use_fp16 = device == "cuda"

training_args = Seq2SeqTrainingArguments(
    output_dir=Config.OUTPUT_DIR,
    per_device_train_batch_size=Config.BATCH_SIZE,
    per_device_eval_batch_size=Config.BATCH_SIZE,
    gradient_accumulation_steps=Config.GRADIENT_ACCUMULATION_STEPS,
    learning_rate=Config.LEARNING_RATE,
    warmup_steps=Config.WARMUP_STEPS,
    num_train_epochs=Config.NUM_TRAIN_EPOCHS,
    gradient_checkpointing=True,
    fp16=use_fp16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_steps=Config.LOGGING_STEPS,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="norm_levenshtein_similarity",  # Use similarity for best model
    greater_is_better=True,
    predict_with_generate=True,
    generation_max_length=225,
    save_total_limit=2,
    dataloader_num_workers=0,
    seed=42,  # Reproducibility
    # Optimization improvements
    lr_scheduler_type="cosine_with_restarts",
    weight_decay=0.01,
    label_smoothing_factor=0.1,
    max_grad_norm=1.0,
)

print("Training arguments configured!")
print(f"  Device: {device}")
print(f"  Epochs: {Config.NUM_TRAIN_EPOCHS}")
print(f"  Batch size per device: {Config.BATCH_SIZE}")
print(f"  Gradient accumulation: {Config.GRADIENT_ACCUMULATION_STEPS}")
print(f"  Effective batch size: {Config.BATCH_SIZE * Config.GRADIENT_ACCUMULATION_STEPS}")
print(f"  Learning rate: {Config.LEARNING_RATE}")
print(f"  LR scheduler: cosine_with_restarts")
print(f"  Weight decay: 0.01")
print(f"  Label smoothing: 0.1")
print(f"  Max grad norm: 1.0")
print(f"  FP16 training: {use_fp16}")



Training arguments configured!
  Device: cuda
  Epochs: 8
  Batch size per device: 4
  Gradient accumulation: 4
  Effective batch size: 16
  Learning rate: 2e-05
  LR scheduler: cosine_with_restarts
  Weight decay: 0.01
  Label smoothing: 0.1
  Max grad norm: 1.0
  FP16 training: True
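
The arguments above combine a 200-step warmup with a cosine schedule. As a rough sketch, assuming the Trainer counts optimizer steps per epoch as batches divided by the accumulation steps (integer division) and that the schedule has the usual warmup-then-cosine shape with a single cycle, the learning rate at a few steps can be worked out by hand. `lr_factor` is a hypothetical helper, not a transformers function.

```python
import math

def lr_factor(step: int, warmup_steps: int, total_steps: int, num_cycles: int = 1) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Cosine decay; with more than one cycle the curve restarts at 1.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * ((num_cycles * progress) % 1.0))))

batches_per_epoch = math.ceil(5190 / 4)      # 5,190 samples, batch size 4
updates_per_epoch = batches_per_epoch // 4   # gradient accumulation of 4
total_steps = updates_per_epoch * 8          # 8 epochs

base_lr = 2e-5
for step in (0, 100, 200, 1396, total_steps):
    print(step, base_lr * lr_factor(step, 200, total_steps))
```

With a single cycle the curve has no restarts, so under these assumptions it behaves like a plain cosine decay after warmup.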


### Initialize Trainer

In [16]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.tokenizer,  # Fixed: use tokenizer, not feature_extractor
    callbacks=[PushEachEpochCallback()],
)

print("Trainer initialized!")
print(f"  Training samples: {len(train_dataset)}")
print(f"  Validation samples: {len(val_dataset)}")

Trainer initialized!
  Training samples: 5190
  Validation samples: 577


## 8. Training Model

In [None]:
# Critical GPU check
if not torch.cuda.is_available():
    print("="*80)
    print("WARNING: NO GPU DETECTED!")
    print("="*80)
    print("\nTraining on CPU will be EXTREMELY SLOW (days instead of hours)!")
    print("\nTO FIX:")
    print("   1. Go to Runtime → Change runtime type")
    print("   2. Set Hardware accelerator: GPU (T4)")
    print("   3. Click Save")
    print("   4. Runtime will restart - rerun cells from the beginning")
    print("\n" + "="*80)
    raise RuntimeError("GPU is required for training. Please enable GPU in runtime settings.")
else:
    print("="*80)
    print("✅ GPU DETECTED - Ready for training!")
    print("="*80)
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print("="*80)

✅ GPU DETECTED - Ready for training!
GPU: Tesla T4
Memory: 15.8 GB


In [None]:
import time

print("="*80)
print("Starting Whisper training...")
print("="*80)
print(f"Training on {device.upper()}")
print(f"Estimated time: ~2-4 hours on T4 GPU for {Config.NUM_TRAIN_EPOCHS} epochs\n")

start_time = time.time()

# Train the model
train_result = trainer.train()

end_time = time.time()
training_duration = end_time - start_time

print("\n" + "="*80)
print("✓ Whisper training completed!")
print(f"  Duration: {training_duration/3600:.2f} hours ({training_duration/60:.1f} minutes)")
print("="*80)

# Save training metrics
print("\nTraining metrics:")
for key, value in train_result.metrics.items():
    print(f"  {key}: {value}")

Starting Whisper training...
Training on CUDA
Estimated time: ~2-4 hours on T4 GPU for 8 epochs



| Epoch | Training Loss | Validation Loss | WER | Norm Levenshtein Similarity |
|---|---|---|---|---|
| 0 | 1.4806 | 1.456514 | 6.940371 | 94.997286 |
| 2 | 1.4458 | 1.432858 | 4.83871 | 96.56233 |
| 4 | 1.4285 | 1.428582 | 2.932551 | 97.937398 |
| 6 | 1.422 | 1.425159 | 2.981427 | 97.792654 |


📤 [Epoch 0] Pushing model to HF repo: augmented_normal_v2_extra_dataset_bangla-whisper-epoch-0

Non-default generation parameters: {'max_length': 448, 'begin_suppress_tokens': [220, 50257]}
Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

📤 [Epoch 2] Pushing model to HF repo: augmented_normal_v2_extra_dataset_bangla-whisper-epoch-2

📤 [Epoch 2] Pushing model to HF repo: augmented_normal_v2_extra_dataset_bangla-whisper-epoch-2

No files have been modified since last commit. Skipping to prevent empty commit.

📤 [Epoch 4] Pushing model to HF repo: augmented_normal_v2_extra_dataset_bangla-whisper-epoch-4

📤 [Epoch 4] Pushing model to HF repo: augmented_normal_v2_extra_dataset_bangla-whisper-epoch-4

No files have been modified since last commit. Skipping to prevent empty commit.

📤 [Epoch 6] Pushing model to HF repo: augmented_normal_v2_extra_dataset_bangla-whisper-epoch-6

📤 [Epoch 6] Pushing model to HF repo: augmented_normal_v2_extra_dataset_bangla-whisper-epoch-6

No files have been modified since last commit. Skipping to prevent empty commit.


## 9. Evaluation & Basic Inference

In [None]:
print("Evaluating Whisper model on validation set...\n")

eval_results = trainer.evaluate()

print("Validation Results:")
print("="*60)
for key, value in eval_results.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")
print("="*60)

final_wer = eval_results.get('eval_wer', 0.0)
final_similarity = eval_results.get('eval_norm_levenshtein_similarity', 0.0)
print(f"\nFinal Validation WER: {final_wer:.2f}%")
print(f"Final Normalized Levenshtein Similarity: {final_similarity:.2f}%")

In [None]:
print("Saving final model...\n")

# Save model and processor
trainer.save_model(Config.OUTPUT_DIR)
processor.save_pretrained(Config.OUTPUT_DIR)


from huggingface_hub import create_repo, upload_folder

# Replace with your HF username and desired repo name
repo_name = "finetuned-modelv7_ultimate"
username = "ishmamzarif"
model_id = f"{username}/{repo_name}"

# Create the repo if it does not exist, then upload the saved model folder
create_repo(model_id, exist_ok=True)
upload_folder(
    folder_path=Config.OUTPUT_DIR,
    repo_id=model_id,
    path_in_repo=".",
)

print(f"Model saved to: {Config.OUTPUT_DIR}")
print(f"\nSaved files:")
for f in os.listdir(Config.OUTPUT_DIR):
    fpath = os.path.join(Config.OUTPUT_DIR, f)
    if os.path.isfile(fpath):
        size_mb = os.path.getsize(fpath) / (1024 * 1024)
        print(f"  {f}: {size_mb:.1f} MB")