<a href="https://colab.research.google.com/github/ArvindSeram123/Hindi-ASR-Fine-Tuning-Whisper-Small-for-Local-Dialects/blob/main/test2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers datasets==3.0.1 torch torchaudio evaluate jiwer



In [None]:
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import evaluate

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)


Using device: cuda


In [None]:

# Load the Hindi portion of the FLEURS test set
fleurs_test = load_dataset("google/fleurs", "hi_in", split="test")

# Load processor (tokenizer + feature extractor)
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# Load models (baseline and fine-tuned)
baseline_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device)
finetuned_model = WhisperForConditionalGeneration.from_pretrained(
    "/content/drive/MyDrive/hindi_asr_finetuning/fine_tuned/checkpoint-150"
).to(device)

# Evaluation metric
wer_metric = evaluate.load("wer")



In [None]:
# Compute corpus-level WER on the first `num_samples` utterances of `dataset`
def compute_wer(model, processor, dataset, num_samples=100):
    references = []
    predictions = []

    for example in dataset.select(range(num_samples)):
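        audio = example["audio"]["array"]
        sampling_rate = example["audio"]["sampling_rate"]  # 16 kHz for FLEURS

        # Extract log-Mel input features and move them to the selected device (GPU or CPU)
        input_features = processor(
            audio, sampling_rate=sampling_rate, return_tensors="pt"
        ).input_features.to(device)

        # No `language`/`task` is passed, so decoding follows each model's generation config
        # (for the multilingual baseline this means language detection; see the warnings below)
        with torch.no_grad():
            predicted_ids = model.generate(input_features)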

        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0].lower()
        references.append(example["transcription"].lower())
        predictions.append(transcription)

    wer = wer_metric.compute(references=references, predictions=predictions)
    return wer

# Run evaluations on the first 100 test utterances (not a random sample)
print("Evaluating baseline Whisper-small...")
baseline_wer = compute_wer(baseline_model, processor, fleurs_test, num_samples=100)
print(f"Baseline WER: {baseline_wer:.3f}")

print("Evaluating fine-tuned model...")
finetuned_wer = compute_wer(finetuned_model, processor, fleurs_test, num_samples=100)
print(f"Fine-tuned WER: {finetuned_wer:.3f}")

Evaluating baseline Whisper-small...


Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`. See https://github.com/huggingface/transformers/pull/28687 for more details.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
`generation_config` default values have been modified to match model-specific defaults: {'suppress_tokens': [], 'begin_suppress_tokens': [220, 50257]}. If this is not desired, please set these values explicitly.


Baseline WER: 0.859
Evaluating fine-tuned model...


A custom logits processor of type <class 'transformers.generation.logits_process.SuppressTokensLogitsProcessor'> has been passed to `.generate()`, but it was also created in `.generate()`, given its parameterization. The custom <class 'transformers.generation.logits_process.SuppressTokensLogitsProcessor'> will take precedence. Please check the docstring of <class 'transformers.generation.logits_process.SuppressTokensLogitsProcessor'> to see related `.generate()` flags.
A custom logits processor of type <class 'transformers.generation.logits_process.SuppressTokensAtBeginLogitsProcessor'> has been passed to `.generate()`, but it was also created in `.generate()`, given its parameterization. The custom <class 'transformers.generation.logits_process.SuppressTokensAtBeginLogitsProcessor'> will take precedence. Please check the docstring of <class 'transformers.generation.logits_process.SuppressTokensAtBeginLogitsProcessor'> to see related `.generate()` flags.


Fine-tuned WER: 1.479
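`compute_wer` lowercases references and predictions, but Devanagari has no letter case, so `.lower()` leaves Hindi text unchanged. If the references and the model output differ only in punctuation (for example a trailing danda, ।) or in how the same letter is encoded in Unicode, those words still count as errors. A minimal stdlib normalizer is sketched below, assuming the goal is to compare words only; `normalize_hindi` is a hypothetical helper, not part of `transformers` or `evaluate`.

```python
import re
import unicodedata


def normalize_hindi(text: str) -> str:
    """Hypothetical helper: NFC-normalize, drop punctuation, collapse whitespace."""
    # NFC maps canonically equivalent spellings (e.g. a precomposed nukta letter
    # vs. base letter + nukta sign) to a single form
    text = unicodedata.normalize("NFC", text)
    # Replace every Unicode punctuation character (category P*), including the
    # danda U+0964 and double danda U+0965, with a space. Combining marks
    # (category M*) are kept because they carry vowel signs and the virama.
    text = "".join(" " if unicodedata.category(ch).startswith("P") else ch for ch in text)
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()


print(normalize_hindi("नमस्ते, दुनिया।"))  # नमस्ते दुनिया
```

To use it, the two `.lower()` calls in `compute_wer` could be replaced with `normalize_hindi(...)`. This changes what counts as an error, so both models would need to be re-evaluated rather than compared against the numbers above.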


In [None]:
import pandas as pd

results = pd.DataFrame({
    "Model": ["Whisper Small (Pretrained)", "Whisper Small (Fine-tuned)"],
    "Dataset": ["FLEURS Hindi Test", "FLEURS Hindi Test"],
    "WER": [baseline_wer, finetuned_wer]
})

print(results)

                        Model            Dataset       WER
0  Whisper Small (Pretrained)  FLEURS Hindi Test  0.858796
1  Whisper Small (Fine-tuned)  FLEURS Hindi Test  1.478781
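The fine-tuned WER of 1.479 is above 1.0. This is possible because WER divides substitutions (S), deletions (D) and insertions (I) by the number of reference words N, WER = (S + D + I) / N, and insertions are not bounded by N: a hypothesis that keeps repeating words keeps adding insertions. The sketch below re-implements word-level WER in plain Python so that per-utterance S/D/I counts can be inspected. `word_error_counts` and `corpus_wer` are hypothetical helpers, not part of `evaluate` or `jiwer`. They pool errors over all utterances before dividing, which, as far as we know, is also how the `evaluate` `wer` metric aggregates by default.

```python
from typing import List, Tuple


def word_error_counts(reference: str, hypothesis: str) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) from a word-level Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] holds (edits, S, D, I) for the best alignment of ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, d, n)  # match: no edit
            else:
                diag = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = dp[i - 1][j]
            up = (e + 1, s, d + 1, n)  # deletion
            e, s, d, n = dp[i][j - 1]
            left = (e + 1, s, d, n + 1)  # insertion
            dp[i][j] = min(diag, up, left)  # fewest total edits wins
    _, s, d, n = dp[len(ref)][len(hyp)]
    return s, d, n


def corpus_wer(references: List[str], hypotheses: List[str]) -> float:
    """Pool S + D + I over all utterances and divide by the total reference word count."""
    errors = words = 0
    for ref, hyp in zip(references, hypotheses):
        s, d, n = word_error_counts(ref, hyp)
        errors += s + d + n
        words += len(ref.split())
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words


# A repetition loop: the last word is emitted six extra times
reference = "मैं घर जा रहा हूँ"
hypothesis = reference + " हूँ" * 6
print(word_error_counts(reference, hypothesis))  # (0, 0, 6)
print(corpus_wer([reference], [hypothesis]))  # 1.2
```

Running `word_error_counts` over the 100 reference/prediction pairs built inside `compute_wer` (which would first need to return those lists) would show whether the fine-tuned score is driven mostly by insertions; the aggregate numbers in the table alone do not say.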
