# Fine-Tuned Whisper ASR for Low-Resource Niger-Congo Languages

>[Fine-Tuned Whisper ASR for Low-Resource Niger-Congo Languages](#scrollTo=75b58048-7d14-4fc6-8085-1fc08c81b4a6)

>>[Model Description](#scrollTo=_fYxmZ1ZGsrr)

>>[Dataset Information](#scrollTo=Af1FzuqKOVw7)

>>>[High Resource Language](#scrollTo=NV7rHVJVOvGw)

>>>[Low Resource Language(s)](#scrollTo=NV7rHVJVOvGw)

>[Setting Up the Model](#scrollTo=6dUdjcYjfMro)

>>[Environment](#scrollTo=2n7MzrC-zMNm)

>>[Feature Extractor, Tokenizer and Processor](#scrollTo=GUcsls-Dx-ex)

>>[Load Processor](#scrollTo=GilquOqWyjoI)

>[Datasets](#scrollTo=PjrOPsLOzvag)

>>[Decompress Files](#scrollTo=8E8O9H0bHzSY)

>>[Using Datasets with Hugging Face](#scrollTo=O7qdPyO2GCCp)

>>[Pre-Process Data](#scrollTo=_2xHk87pJeLc)

>[Training and Evaluation](#scrollTo=KQ-Lit5bN-06)

>>[Evaluation](#scrollTo=KBjtkjX9OH6n)

>>>[Pre-Trained Checkpoint](#scrollTo=sWxxEokhOWX2)

>>>[Training Config](#scrollTo=DIifIsD_Odis)

>>[Training](#scrollTo=J2wl-wpmQ783)



## Model Description

whisper-finetuned-yoruba is a model obtained by fine-tuning a pre-trained Whisper ASR model on a low-resource language, Yoruba.

The goal is two stages of fine-tuning: first fine-tune the pre-trained Whisper ASR model on a high-resource African language (Swahili), then fine-tune it again on a low-resource language (Yoruba or Igbo).

Guide Followed: https://colab.research.google.com/github/sanchit-gandhi/notebooks/blob/main/fine_tune_whisper.ipynb

## Dataset Information

### High Resource Language
Swahili:
https://datacollective.mozillafoundation.org/datasets/cmj8u3puq00qhnxxbg26y0owu

### Low Resource Language(s)
Yoruba:
https://datacollective.mozillafoundation.org/datasets/cmj8u3q2500v5nxxb6xfa6jn5

Igbo:
https://datacollective.mozillafoundation.org/datasets/cmj8u3p9a00c5nxxb7jwi217j


# Setting Up the Model

## Environment

In [None]:
from huggingface_hub import notebook_login

notebook_login()


## Feature Extractor, Tokenizer and Processor

Yoruba is among the languages Whisper was pre-trained on, but Igbo is not (as the language table below shows).

In [None]:
from transformers.models.whisper.tokenization_whisper import TO_LANGUAGE_CODE

TO_LANGUAGE_CODE

{'english': 'en',
 'chinese': 'zh',
 'german': 'de',
 'spanish': 'es',
 'russian': 'ru',
 'korean': 'ko',
 'french': 'fr',
 'japanese': 'ja',
 'portuguese': 'pt',
 'turkish': 'tr',
 'polish': 'pl',
 'catalan': 'ca',
 'dutch': 'nl',
 'arabic': 'ar',
 'swedish': 'sv',
 'italian': 'it',
 'indonesian': 'id',
 'hindi': 'hi',
 'finnish': 'fi',
 'vietnamese': 'vi',
 'hebrew': 'he',
 'ukrainian': 'uk',
 'greek': 'el',
 'malay': 'ms',
 'czech': 'cs',
 'romanian': 'ro',
 'danish': 'da',
 'hungarian': 'hu',
 'tamil': 'ta',
 'norwegian': 'no',
 'thai': 'th',
 'urdu': 'ur',
 'croatian': 'hr',
 'bulgarian': 'bg',
 'lithuanian': 'lt',
 'latin': 'la',
 'maori': 'mi',
 'malayalam': 'ml',
 'welsh': 'cy',
 'slovak': 'sk',
 'telugu': 'te',
 'persian': 'fa',
 'latvian': 'lv',
 'bengali': 'bn',
 'serbian': 'sr',
 'azerbaijani': 'az',
 'slovenian': 'sl',
 'kannada': 'kn',
 'estonian': 'et',
 'macedonian': 'mk',
 'breton': 'br',
 'basque': 'eu',
 'icelandic': 'is',
 'armenian': 'hy',
 'nepali': 'ne',
 'mongol
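Since the printed table is cut off before it reaches the entries of interest, a direct lookup is easier to read. The following is a minimal sketch: the helper name `whisper_language_code` is hypothetical, and the small fallback dictionary is an assumption used only when `transformers` is not installed.

```python
# Look up individual languages instead of printing the whole table.
try:
    from transformers.models.whisper.tokenization_whisper import TO_LANGUAGE_CODE
except ImportError:
    # Fallback excerpt (assumption: only used when transformers is unavailable).
    TO_LANGUAGE_CODE = {"english": "en", "swahili": "sw", "yoruba": "yo"}


def whisper_language_code(language):
    """Return Whisper's code for a language name, or None if it has no language token."""
    return TO_LANGUAGE_CODE.get(language.lower())


for name in ("swahili", "yoruba", "igbo"):
    print(name, "->", whisper_language_code(name))
```

A result of `None` for Igbo means there is no dedicated language token to pass to the processor, so a fine-tuning run on Igbo would have to reuse the token of another language.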

## Load Processor

In [None]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(
    #"openai/whisper-small", language= "swahili", task="transcribe"
    "openai/whisper-small", language= "yoruba", task="transcribe"
)


# Datasets

API Authentication Guides:

https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Authentication.ipynb#scrollTo=gZDX51Y27pN4

https://www.analyticsvidhya.com/blog/2024/12/api-keys-in-google-colab/

In [None]:
from google.colab import userdata
MOZILLA_API_KEY = userdata.get("MOZILLA_API_KEY")

In [None]:
# Swahili Download Session
!curl -X POST "https://datacollective.mozillafoundation.org/api/datasets/cmj8u3puq00qhnxxbg26y0owu/download" -H "Authorization: Bearer $MOZILLA_API_KEY" -H "Content-Type: application/json"

{"downloadToken":"dlt_f0d03171-348e-483a-b5d9-2e5bd1f57f88","downloadUrl":"https://datacollective.mozillafoundation.org/api/datasets/cmj8u3puq00qhnxxbg26y0owu/download/dlt_f0d03171-348e-483a-b5d9-2e5bd1f57f88","expiresAt":"2025-12-26T21:25:16.726Z","sizeBytes":"22967482338","contentType":"application/gzip","filename":"mcv-scripted-sw-v24.0.tar.gz","checksum":"709aa10aadbc920e36e51bb238185dde5400f26e24f212f5d50a637a67d8e4e6"}

In [None]:
# Swahili: download the archive using the session token returned above
!curl -X GET "https://datacollective.mozillafoundation.org/api/datasets/cmj8u3puq00qhnxxbg26y0owu/download/dlt_f0d03171-348e-483a-b5d9-2e5bd1f57f88" -H "Authorization: Bearer $MOZILLA_API_KEY" -o "Swahili.tar.gz"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    45    0    45    0     0     58      0 --:--:-- --:--:-- --:--:--    58


In [None]:
# Yoruba Download Session
!curl -X POST "https://datacollective.mozillafoundation.org/api/datasets/cmj8u3q2500v5nxxb6xfa6jn5/download" -H "Authorization: Bearer $MOZILLA_API_KEY" -H "Content-Type: application/json"

{"downloadToken":"dlt_63bd583a-9aeb-41fa-b317-7e6c6a0e1f51","downloadUrl":"https://datacollective.mozillafoundation.org/api/datasets/cmj8u3q2500v5nxxb6xfa6jn5/download/dlt_63bd583a-9aeb-41fa-b317-7e6c6a0e1f51","expiresAt":"2025-12-26T21:35:34.291Z","sizeBytes":"170839274","contentType":"application/gzip","filename":"mcv-scripted-yo-v24.0.tar.gz","checksum":"6afad07dac589bd86a1794c2d918c6d3da232590bc79c4d19085abbf836cf368"}

In [None]:
# Yoruba: download the archive using the session token returned above
!curl -X GET "https://datacollective.mozillafoundation.org/api/datasets/cmj8u3q2500v5nxxb6xfa6jn5/download/dlt_63bd583a-9aeb-41fa-b317-7e6c6a0e1f51" -H "Authorization: Bearer $MOZILLA_API_KEY" -o "Yoruba.tar.gz"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  162M  100  162M    0     0  20.3M      0  0:00:07  0:00:07 --:--:-- 26.2M
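In the cells above, the download URL was copied by hand from each session response into the next `curl` command. Because the response is JSON, the same fields can be read programmatically. This sketch parses the Yoruba response shown above; the variable names are illustrative.

```python
import json

# Response body copied from the Yoruba download-session cell above.
response_text = (
    '{"downloadToken":"dlt_63bd583a-9aeb-41fa-b317-7e6c6a0e1f51",'
    '"downloadUrl":"https://datacollective.mozillafoundation.org/api/datasets/'
    'cmj8u3q2500v5nxxb6xfa6jn5/download/dlt_63bd583a-9aeb-41fa-b317-7e6c6a0e1f51",'
    '"expiresAt":"2025-12-26T21:35:34.291Z","sizeBytes":"170839274",'
    '"contentType":"application/gzip","filename":"mcv-scripted-yo-v24.0.tar.gz",'
    '"checksum":"6afad07dac589bd86a1794c2d918c6d3da232590bc79c4d19085abbf836cf368"}'
)

session = json.loads(response_text)
download_url = session["downloadUrl"]
expected_bytes = int(session["sizeBytes"])  # the API returns the size as a string
expected_checksum = session["checksum"]

print(download_url)
print(session["filename"], expected_bytes, "bytes, expires", session["expiresAt"])
```

Keeping `expected_bytes` and `expected_checksum` around makes it possible to validate the downloaded file before extracting it.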


## Decompress Files

In [None]:
!tar -xvzf Swahili.tar.gz


gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
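The Swahili extraction fails because the download above received only 45 bytes, while the session response reported 22,967,482,338 bytes. The saved file is therefore probably a short error message rather than the archive. A minimal sketch of a check to run before `tar` follows. The function name `check_archive` is hypothetical, and treating the reported `checksum` as SHA-256 is an assumption based on its 64 hexadecimal characters; the source does not state the algorithm.

```python
import hashlib
import os

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def check_archive(path, expected_bytes, expected_sha256):
    """Return a list of problems found with a downloaded .tar.gz (empty if none)."""
    problems = []
    actual_bytes = os.path.getsize(path)
    if actual_bytes != expected_bytes:
        problems.append(f"size is {actual_bytes} bytes, expected {expected_bytes}")

    digest = hashlib.sha256()
    with open(path, "rb") as f:
        head = f.read(2)
        if head != GZIP_MAGIC:
            problems.append("file does not start with the gzip magic bytes")
        digest.update(head)
        # Hash the rest of the file in 1 MiB chunks to keep memory use low.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_sha256:
        problems.append("SHA-256 digest does not match the reported checksum")
    return problems


# Values taken from the Swahili session response above.
if os.path.exists("Swahili.tar.gz"):
    for problem in check_archive(
        "Swahili.tar.gz",
        22967482338,
        "709aa10aadbc920e36e51bb238185dde5400f26e24f212f5d50a637a67d8e4e6",
    ):
        print(problem)
```

If the size check fails, printing the file contents (for example with `!cat Swahili.tar.gz`) should show what the server actually returned.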


In [None]:
!tar -xvzf Yoruba.tar.gz

cv-corpus-24.0-2025-12-05/yo/clip_durations.tsv
cv-corpus-24.0-2025-12-05/yo/clips/
cv-corpus-24.0-2025-12-05/yo/dev.tsv
cv-corpus-24.0-2025-12-05/yo/invalidated.tsv
cv-corpus-24.0-2025-12-05/yo/other.tsv
cv-corpus-24.0-2025-12-05/yo/reported.tsv
cv-corpus-24.0-2025-12-05/yo/test.tsv
cv-corpus-24.0-2025-12-05/yo/train.tsv
cv-corpus-24.0-2025-12-05/yo/unvalidated_sentences.tsv
cv-corpus-24.0-2025-12-05/yo/validated.tsv
cv-corpus-24.0-2025-12-05/yo/validated_sentences.tsv
cv-corpus-24.0-2025-12-05/yo/clips/common_voice_yo_36518279.mp3
cv-corpus-24.0-2025-12-05/yo/clips/common_voice_yo_36518280.mp3
cv-corpus-24.0-2025-12-05/yo/clips/common_voice_yo_36518281.mp3
cv-corpus-24.0-2025-12-05/yo/clips/common_voice_yo_36518282.mp3
cv-corpus-24.0-2025-12-05/yo/clips/common_voice_yo_36518283.mp3
cv-corpus-24.0-2025-12-05/yo/clips/common_voice_yo_36520588.mp3
cv-corpus-24.0-2025-12-05/yo/clips/common_voice_yo_36520639.mp3
cv-corpus-24.0-2025-12-05/yo/clips/common_voice_yo_36520640.mp3
cv-corpus-24.

In [None]:
!ls cv-corpus-24.0-2025-12-05/yo

clip_durations.tsv  other.tsv	  unvalidated_sentences.tsv
clips		    reported.tsv  validated_sentences.tsv
dev.tsv		    test.tsv	  validated.tsv
invalidated.tsv     train.tsv


## Using Datasets with Hugging Face
https://huggingface.co/learn/llm-course/en/chapter5/2

In [None]:
from datasets import load_dataset

# A single file is loaded under the default split name "train", even though this is test.tsv
common_voice = load_dataset("csv", data_files="cv-corpus-24.0-2025-12-05/yo/test.tsv", delimiter="\t")

In [None]:
common_voice

DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'sentence_id', 'sentence', 'sentence_domain', 'up_votes', 'down_votes', 'age', 'gender', 'accents', 'variant', 'locale', 'segment'],
        num_rows: 1122
    })
})
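Because only one file was passed, `test.tsv` ends up under a split named `train`, and there is no `test` split for the trainer cell further down, which reads `common_voice["test"]`. One way to get named splits is to pass a dictionary as `data_files`. This is a sketch under that assumption: the helpers `split_files` and `load_splits` are hypothetical, and `load_splits` needs the `datasets` package and the extracted corpus to run.

```python
import os

CORPUS_DIR = "cv-corpus-24.0-2025-12-05/yo"


def split_files(corpus_dir, splits=("train", "test")):
    """Map each split name to its TSV file inside the extracted corpus."""
    return {split: os.path.join(corpus_dir, f"{split}.tsv") for split in splits}


def load_splits(corpus_dir):
    """Load the TSV files as a DatasetDict with one entry per split."""
    from datasets import load_dataset  # imported lazily so the helper above works without it

    return load_dataset("csv", data_files=split_files(corpus_dir), delimiter="\t")


print(split_files(CORPUS_DIR))
```

Calling `common_voice = load_splits(CORPUS_DIR)` would then give a `DatasetDict` with `train` and `test` keys, and the later `map`, `cast_column` and `filter` steps would need to cover both splits.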

In [None]:
# Path is the relative path to the audio file (usually .mp3 or .wav)
# It points to a file inside the clips/ directory
common_voice = common_voice.select_columns(["path", "sentence"])

In [None]:
# Rebuilds the correct paths
# Path column is just a filename, not a full or relative path that HF Datasets can resolve to an audio file
import os
from datasets import Audio

BASE_DIR = "/content/cv-corpus-24.0-2025-12-05/yo/clips"

common_voice = common_voice.map(
    lambda x: {
        "path": os.path.join(BASE_DIR, x["path"])
    }
)


In [None]:
common_voice = common_voice.rename_column("path", "audio")

In [None]:
common_voice

DatasetDict({
    train: Dataset({
        features: ['audio', 'sentence'],
        num_rows: 1122
    })
})

In [None]:
common_voice["train"].features

{'audio': Value('string'), 'sentence': Value('string')}

## Pre-Process Data

In [None]:
sampling_rate = processor.feature_extractor.sampling_rate
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=sampling_rate))

In [None]:
common_voice["train"].features

{'audio': Audio(sampling_rate=16000, decode=True, num_channels=None, stream_index=None),
 'sentence': Value('string')}

In [None]:
def prepare_dataset(example):
    audio = example["audio"]

    example = processor(
        audio=audio["array"],
        sampling_rate=audio["sampling_rate"],
        text=example["sentence"],
    )

    # compute input length of audio sample in seconds
    example["input_length"] = len(audio["array"]) / audio["sampling_rate"]

    return example

In [None]:
common_voice = common_voice.map(
    prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=1
)


In [None]:
max_input_length = 30.0


def is_audio_in_length_range(length):
    return length < max_input_length

In [None]:
common_voice["train"] = common_voice["train"].filter(
    is_audio_in_length_range,
    input_columns=["input_length"],
)


In [None]:
common_voice["train"]

Dataset({
    features: ['input_features', 'labels', 'input_length'],
    num_rows: 1122
})
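The row count is unchanged at 1122, so every clip in this file is shorter than 30 seconds. The arithmetic behind `input_length` and the filter is small enough to check by hand. The sample counts below are hypothetical.

```python
SAMPLING_RATE = 16_000  # Hz, the rate reported by the Audio feature above
MAX_INPUT_LENGTH = 30.0  # seconds, same threshold as max_input_length


def duration_seconds(num_samples, sampling_rate=SAMPLING_RATE):
    """Clip length in seconds: number of samples divided by samples per second."""
    return num_samples / sampling_rate


# Hypothetical sample counts for three clips.
clips = {"short": 80_000, "borderline": 480_000, "long": 520_000}
for name, num_samples in clips.items():
    seconds = duration_seconds(num_samples)
    print(f"{name}: {seconds:.1f} s, kept={seconds < MAX_INPUT_LENGTH}")
```

Note that the comparison is strict, so a clip of exactly 30.0 seconds is dropped.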

# Training and Evaluation

In [None]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [
            {"input_features": feature["input_features"][0]} for feature in features
        ]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # if a BOS token was added during the earlier tokenization step,
        # cut it here because it is added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [None]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
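To make the label handling in the collator concrete, here is a plain-Python sketch of the same idea on toy data. It is a simplification: the collator above pads with the tokenizer's pad token and then masks via the attention mask, whereas this sketch pads directly with -100. The token ids and the function name `pad_and_mask` are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_and_mask(label_seqs, bos_token_id):
    """Pad label sequences to equal length with IGNORE_INDEX, then drop a shared leading BOS."""
    max_len = max(len(seq) for seq in label_seqs)
    padded = [seq + [IGNORE_INDEX] * (max_len - len(seq)) for seq in label_seqs]
    # Only strip the first position if every sequence starts with the BOS token.
    if all(seq[0] == bos_token_id for seq in padded):
        padded = [seq[1:] for seq in padded]
    return padded


# Hypothetical token ids; 1 stands in for the BOS token.
batch = pad_and_mask([[1, 7, 8, 9], [1, 5]], bos_token_id=1)
print(batch)  # [[7, 8, 9], [5, -100, -100]]
```

The padded positions carry -100 so they contribute nothing to the loss, which keeps short transcripts from being penalised for padding.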

## Evaluation

In [None]:
import evaluate

metric = evaluate.load("wer")


In [None]:
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer()


def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = processor.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)

    # compute orthographic wer
    wer_ortho = 100 * metric.compute(predictions=pred_str, references=label_str)

    # compute normalised WER
    pred_str_norm = [normalizer(pred) for pred in pred_str]
    label_str_norm = [normalizer(label) for label in label_str]
    # filtering step to only evaluate the samples that correspond to non-zero references:
    pred_str_norm = [
        pred_str_norm[i] for i in range(len(pred_str_norm)) if len(label_str_norm[i]) > 0
    ]
    label_str_norm = [
        label_str_norm[i]
        for i in range(len(label_str_norm))
        if len(label_str_norm[i]) > 0
    ]

    wer = 100 * metric.compute(predictions=pred_str_norm, references=label_str_norm)

    return {"wer_ortho": wer_ortho, "wer": wer}
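For readers unfamiliar with the metric, word error rate is the number of word-level edits (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is an illustrative pure-Python version, not the implementation behind `evaluate.load("wer")`; the function name and the example sentences are hypothetical.

```python
def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Levenshtein distance over words, computed one row at a time.
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return total_edits / total_words


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
wer = 100 * word_error_rate(["the cat sat on the mat"], ["the cat sit on mat"])
print(f"{wer:.1f}")  # 33.3
```

The orthographic WER in `compute_metrics` is computed on the raw decoded strings, while the normalised WER first applies `BasicTextNormalizer`, so punctuation and casing differences do not count as errors.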

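In [None]:
# compute_metrics is not called directly: the trainer passes it an EvalPrediction
# (with .predictions and .label_ids) during evaluation, so no standalone call is needed here.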

### Pre-Trained Checkpoint

In [None]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")


In [None]:
from functools import partial

# disable cache during training since it's incompatible with gradient checkpointing
model.config.use_cache = False

# set language and task for generation and re-enable cache
model.generate = partial(
    model.generate, language="yoruba", task="transcribe", use_cache=True
)
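The `partial` call pre-binds keyword arguments so that every later call to `model.generate`, including the ones the trainer makes during evaluation, uses them. The mechanism can be seen with a stand-in function. In this sketch `generate` is a hypothetical placeholder rather than the real method, and it binds `language="yoruba"`, the language this notebook targets.

```python
from functools import partial


def generate(inputs, language=None, task=None, use_cache=False):
    """Stand-in for model.generate that just reports the keyword arguments it received."""
    return {"inputs": inputs, "language": language, "task": task, "use_cache": use_cache}


# Bind the generation settings once; callers no longer need to pass them.
generate_yoruba = partial(generate, language="yoruba", task="transcribe", use_cache=True)

result = generate_yoruba("audio features")
print(result)

# Keyword arguments given at call time still override the bound ones.
override = generate_yoruba("audio features", language="swahili")
print(override["language"])
```

This also explains why the cache can be disabled in `model.config` for training yet stay enabled for generation: the bound `use_cache=True` applies only to calls that go through `generate`.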

### Training Config

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetuned-yoruba",  # name on the HF Hub
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    lr_scheduler_type="constant_with_warmup",
    warmup_steps=50,
    max_steps=500,  # increase to 4000 if you have your own GPU or a Colab paid plan
    gradient_checkpointing=True,
    fp16=True,
    fp16_full_eval=True,
    eval_strategy="steps",  # named `evaluation_strategy` in older transformers releases
    per_device_eval_batch_size=16,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=500,
    eval_steps=500,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)
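The comment on `gradient_accumulation_steps` ("increase by 2x for every 2x decrease in batch size") is about keeping the effective batch size constant when memory forces a smaller per-device batch. A short sketch of that arithmetic, with hypothetical helper names:

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples that contribute to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def examples_seen(max_steps, per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Total examples processed over the whole run."""
    return max_steps * effective_batch_size(
        per_device_batch_size, gradient_accumulation_steps, num_devices
    )


# Halving the batch size while doubling accumulation leaves both numbers unchanged.
for batch_size, accumulation in [(16, 1), (8, 2), (4, 4)]:
    print(
        batch_size,
        accumulation,
        effective_batch_size(batch_size, accumulation),
        examples_seen(500, batch_size, accumulation),
    )
```

With the settings above (batch size 16, no accumulation, 500 steps on one device), the run processes 8,000 examples in total, which means several passes over a dataset of roughly a thousand clips.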

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
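    # NOTE: the DatasetDict built above only contains a "train" split (loaded from test.tsv),
    # so a "test" split must be loaded and pre-processed before this line can run.
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor,  # recent transformers releases name this argument `processing_class`
)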

## Training

In [None]:
trainer.train()