<a href="https://colab.research.google.com/github/Baah134/Baah134/blob/main/Whisper/FineTune_Whisper_Africa_LoRA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning Whisper on AfriSpeech using Parameter Efficient Fine-Tuning -Lora

This project involves fine-tuning OpenAI's Whisper model using LoRA (Low-Rank Adaptation) on the AfriSpeech dataset—a multilingual African speech corpus. The goal was to improve the model's transcription accuracy for under-resourced African languages.

**Key Details:**          
Base Model: openai/whisper-large.                   

Fine-Tuning Method: Parameter-efficient fine-tuning using PEFT's LoRA

Dataset: AfriSpeech- a curated dataset with audio-transcription pairs from multiple African languages.

Training Environment: Google Colab + bitsandbytes 8-bit quantization to reduce memory usage

Use Case: Improved ASR (Automatic Speech Recognition) for African language speech inputs

**Results:**      
Evaluated the fine-tuned model on a held-out set of 100 samples

Achieved lower WER (Word Error Rate) of 0.18 compared to the base Whisper model on this subset which was 0.4

Demonstrated better handling of African accents and language-specific phonemes





## Inital Setup

Installing Required Libraries

In [None]:
!add-apt-repository -y ppa:jonathonf/ffmpeg-4
!apt update
!apt install -y ffmpeg

Repository: 'deb https://ppa.launchpadcontent.net/jonathonf/ffmpeg-4/ubuntu/ jammy main'
Description:
Backport of FFmpeg 4 and associated libraries. Now includes AOM/AV1 support!

FDK AAC is not compatible with GPL and FFmpeg can't be redistributed with it included. Please don't ask for it to be added to this public PPA.

---

PPA supporters:

BigBlueButton (https://bigbluebutton.org)

---

Donate to FFMPEG: https://ffmpeg.org/donations.html
Donate to Debian: https://www.debian.org/donations
Donate to this PPA: https://ko-fi.com/jonathonf
More info: https://launchpad.net/~jonathonf/+archive/ubuntu/ffmpeg-4
Adding repository.
Adding deb entry to /etc/apt/sources.list.d/jonathonf-ubuntu-ffmpeg-4-jammy.list
Adding disabled deb-src entry to /etc/apt/sources.list.d/jonathonf-ubuntu-ffmpeg-4-jammy.list
Adding key to /etc/apt/trusted.gpg.d/jonathonf-ubuntu-ffmpeg-4.gpg with fingerprint 4AB0F789CBA31744CC7DA76A8CF63AD3F06FC659
Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ In

In [None]:
!pip install datasets>=2.6.1
!pip install git+https://github.com/huggingface/transformers
!pip install librosa
!pip install evaluate>=0.30
!pip install jiwer
!pip install gradio
!pip install -q bitsandbytes datasets accelerate loralib
!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git@main

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-sddep3vp
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-sddep3vp
  Resolved https://github.com/huggingface/transformers to commit 3457e8e73e4f5532cc69059682b1ba4484d7e7e8
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... [?25l[?25hdone
  Created wheel for transformers: filename=transformers-4.54.0.dev0-py3-none-any.whl size=11788847 sha256=ea8d2503f95d7637a92072c7dbf2f473e65a5bd6d413e84dbfd4a556b037ae67
  Stored in directory: /tmp/pip-ephem-wheel-cache-7fi_kd4r/wheels/04/a3/f1/b88775f8e1665827525b19ac7590250f1038d947067beba9fb
Successfully built transformer

In [None]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
# Select CUDA device index
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
model_name_or_path = "openai/whisper-large"
language = "English"
language_abbr = "en"
task = "transcribe"


## Load Dataset

In [None]:
# Mount Drive and Set Paths
from google.colab import drive
drive.mount('/content/drive')

# Define paths
train_audio_dir = '/content/drive/MyDrive/Fine_Tuning/Train_Data/'
test_audio_dir  = '/content/drive/MyDrive/Fine_Tuning/Hundred/'

train_csv = '/content/drive/MyDrive/Fine_Tuning/train_extracted_data.csv'
test_csv  = '/content/drive/MyDrive/Fine_Tuning/Hundred_Transcriptions.csv'


import os
import librosa
import pandas as pd

#Load CSVs and Build Transcription Dictionaries
train_df = pd.read_csv(train_csv)
test_df = pd.read_csv(test_csv)

train_df['filename'] = train_df['filename'].astype(str)
test_df['audio_name'] = test_df['audio_name'].astype(str)

train_transcripts = dict(zip(train_df['filename'], train_df['transcription']))
test_transcripts  = dict(zip(test_df['audio_name'],  test_df['transcript']))

# Function to Load Dataset
def load_dataset(audio_dir, transcript_dict):
    audio_files_in_dir = [f for f in os.listdir(audio_dir) if f.endswith('.wav')]

    data = []
    for audio_file_with_ext in audio_files_in_dir:
        # Get the filename without the .wav extension
        audio_file_base = os.path.splitext(audio_file_with_ext)[0]

        # Check if the base filename is in the transcript dictionary (for test data)
        if audio_file_base in transcript_dict:
            audio_path = os.path.join(audio_dir, audio_file_with_ext)
            waveform, _ = librosa.load(audio_path, sr=16000)
            data.append({
                "audio": waveform,
                "text": transcript_dict[audio_file_base], # Use the base filename for lookup
                "path": audio_path
            })
        # Also check if the full filename is in the transcript dictionary (for train data)
        elif audio_file_with_ext in transcript_dict:
            audio_path = os.path.join(audio_dir, audio_file_with_ext)
            waveform, _ = librosa.load(audio_path, sr=16000)
            data.append({
                "audio": waveform,
                "text": transcript_dict[audio_file_with_ext],
                "path": audio_path
            })
        else:
            print(f"Warning: No transcription found for {audio_file_with_ext} in {audio_dir}")

    return data

# Load and Prepare Datasets
train_data = load_dataset(train_audio_dir, train_transcripts)
test_data  = load_dataset(test_audio_dir,  test_transcripts)


# Print First 5 Samples
print("\nFirst 5 training samples:")
for i, sample in enumerate(train_data[:5]):
    print(f"\nSample {i+1}:")
    print(f"  Audio Path: {sample['path']}")
    print(f"  Transcription: {sample['text'][:200]}")  # limit to 200 chars

print("\nFirst 5 test samples:")
for i, sample in enumerate(test_data[:5]):
    print(f"\nSample {i+1}:")
    print(f"  Audio Path: {sample['path']}")
    print(f"  Transcription: {sample['text'][:200]}")


Mounted at /content/drive
✅ Example train sample:
{'audio': array([ 2.3672277e-04, -4.3032607e-05, -9.1965212e-06, ...,
       -1.7559691e-03, -1.4749651e-03,  1.6117134e-04], dtype=float32), 'text': 'The Gophers were less than a minute from escaping the second with a two-goal lead before the Lions cashed in on a 2-on-1 break, with Wall flipping a backhand shot past LaFontaine, for a 2-1 Minnesota lead after 40 minutes.\n', 'path': '/content/drive/MyDrive/Fine_Tuning/Train_Data/sample_1002.wav'}

📘 First 5 training samples:

Sample 1:
  Audio Path: /content/drive/MyDrive/Fine_Tuning/Train_Data/sample_1002.wav
  Transcription: The Gophers were less than a minute from escaping the second with a two-goal lead before the Lions cashed in on a 2-on-1 break, with Wall flipping a backhand shot past LaFontaine, for a 2-1 Minnesota 

Sample 2:
  Audio Path: /content/drive/MyDrive/Fine_Tuning/Train_Data/sample_1000.wav
  Transcription: Use of restraints may decrease environmental stimulation and 

## Prepare Feature Extractor, Tokenizer and Data

In [None]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_name_or_path)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


preprocessor_config.json: 0.00B [00:00, ?B/s]

In [None]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained(model_name_or_path, language=language, task=task)

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

normalizer.json: 0.00B [00:00, ?B/s]

added_tokens.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

In [None]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(model_name_or_path, language=language, task=task)

### Prepare Data

Re-sampling audio to 16000Hz for Whisper to work with and preparing tokens

In [None]:
def prepare_dataset(example):
    # Convert raw waveform to input features (log-Mel spectrograms)
    example["input_features"] = feature_extractor(
        example["audio"], sampling_rate=16000
    ).input_features[0]

    # Tokenize transcript
    example["labels"] = tokenizer(example["text"]).input_ids
    return example

      # Apply to all training and test samples
train_data = [prepare_dataset(sample) for sample in train_data]
test_data  = [prepare_dataset(sample) for sample in test_data]

      #  View Final Prepared Example
print("Train Sample:")
print(train_data[0])
print("\nTest Sample:")
print(test_data[0])

Train Sample:
{'audio': array([ 2.3672277e-04, -4.3032607e-05, -9.1965212e-06, ...,
       -1.7559691e-03, -1.4749651e-03,  1.6117134e-04], dtype=float32), 'text': 'The Gophers were less than a minute from escaping the second with a two-goal lead before the Lions cashed in on a 2-on-1 break, with Wall flipping a backhand shot past LaFontaine, for a 2-1 Minnesota lead after 40 minutes.\n', 'path': '/content/drive/MyDrive/Fine_Tuning/Train_Data/sample_1002.wav', 'input_features': array([[-0.63684106, -0.63684106, -0.63684106, ..., -0.63684106,
        -0.63684106, -0.63684106],
       [-0.63684106, -0.63684106, -0.63684106, ..., -0.63684106,
        -0.63684106, -0.63684106],
       [-0.63684106, -0.42045987, -0.5785929 , ..., -0.63684106,
        -0.63684106, -0.63684106],
       ...,
       [-0.43240392, -0.16385365, -0.63684106, ..., -0.63684106,
        -0.63684106, -0.63684106],
       [-0.61142516, -0.3355645 , -0.63684106, ..., -0.63684106,
        -0.63684106, -0.63684106],
     

## Training and Evaluation

### Define a Data Collator

In [None]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's append later anyways
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        if "num_items_in_batch" in batch:

          del batch["num_items_in_batch"]


        return batch

Initialising the data collator

In [None]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

### Evaluation Metrics

The word error rate (WER) will be used as the metric for evaluation.

In [None]:
import evaluate

metric = evaluate.load("wer")

Downloading builder script: 0.00B [00:00, ?B/s]

We then simply have to define a function that takes our model
predictions and returns the WER metric. This function, called
`compute_metrics`, first replaces `-100` with the `pad_token_id`
in the `label_ids` (undoing the step we applied in the
data collator to ignore padded tokens correctly in the loss).
It then decodes the predicted and label ids to strings. Finally,
it computes the WER between the predictions and reference labels:

In [None]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}

### Load a Pre-Trained Checkpoint

Loads the pre-trained Whisper large checkpoint.

In [None]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained(model_name_or_path, load_in_8bit=True, device_map="auto")



config.json: 0.00B [00:00, ?B/s]

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


model.safetensors:   0%|          | 0.00/6.17G [00:00<?, ?B/s]

generation_config.json: 0.00B [00:00, ?B/s]

Override generation arguments - no tokens are forced as decoder outputs (see [`forced_decoder_ids`](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.forced_decoder_ids)), no tokens are suppressed during generation (see [`suppress_tokens`](https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.generation_utils.GenerationMixin.generate.suppress_tokens)):

In [None]:
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []

### Post-processing on the model

Finally, we need to apply some post-processing on the 8-bit model to enable training, let's freeze all our layers, and cast the layer-norm in `float32` for stability. We also cast the output of the last layer in `float32` for the same reasons.

In [None]:
from peft import prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

# Unfreeze proj_out
for name, param in model.named_parameters():
    if "proj_out" in name:
        param.requires_grad = True


### Apply LoRA

LoRA is applied in this step. A `PeftModel` is loaded and we specify that we are going to use low-rank adapters (LoRA) using `get_peft_model` utility function from `peft`.

In [None]:
from peft import LoraConfig, PeftModel, LoraModel, LoraConfig, get_peft_model

config = LoraConfig(r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, bias="none")

model = get_peft_model(model, config)
model.print_trainable_parameters()

trainable params: 15,728,640 || all params: 1,559,033,600 || trainable%: 1.0089


Only using **1%** of the total trainable parameters, thereby performing **Parameter-Efficient Fine-Tuning**

### Define the Training Configuration

In the final step, we define all the parameters related to training.

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
   output_dir="./whisper-afrispeech",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1e-3,
    warmup_steps=50,
    num_train_epochs=2,
    eval_strategy="epoch",
    fp16=True,
    per_device_eval_batch_size=8,
    generation_max_length=128,
    logging_steps=25,
    remove_unused_columns=False,
    label_names=["labels"],
)

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=train_data,
    eval_dataset=test_data,
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
)
model.config.use_cache = False

  trainer = Seq2SeqTrainer(


In [None]:
trainer.train()



<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
wandb: Paste an API key from your profile and hit enter:

 ··········


[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mprincemensah915[0m ([33mprincemensah915-ashesi-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


  return fn(*args, **kwargs)


Epoch,Training Loss,Validation Loss
1,0.5936,0.563416
2,0.5104,0.581366




TrainOutput(global_step=476, training_loss=0.625696531301286, metrics={'train_runtime': 2884.229, 'train_samples_per_second': 1.318, 'train_steps_per_second': 0.165, 'total_flos': 8.15411699712e+18, 'train_loss': 0.625696531301286, 'epoch': 2.0})

In [None]:
model_name_or_path = "openai/whisper-large"
peft_type = model.peft_config["default"].peft_type
peft_model_id = "Baah134/" + f"{model_name_or_path}-{peft_type}-colab".replace("/", "-")

model.push_to_hub(peft_model_id)
print(peft_model_id)


README.md: 0.00B [00:00, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/63.0M [00:00<?, ?B/s]

Baah134/openai-whisper-large-PeftType.LORA-colab


# Evaluation and Inference

Loads model from Hugging Face Repo

In [None]:
from peft import PeftModel, PeftConfig
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainer

peft_model_id = "Baah134/openai-whisper-large-PeftType.LORA-colab"
peft_config = PeftConfig.from_pretrained(peft_model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
)
model = PeftModel.from_pretrained(model, peft_model_id)

adapter_config.json:   0%|          | 0.00/948 [00:00<?, ?B/s]

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


adapter_model.safetensors:   0%|          | 0.00/63.0M [00:00<?, ?B/s]

In [None]:
from torch.utils.data import DataLoader
from tqdm import tqdm
import numpy as np
import gc

eval_dataloader = DataLoader(test_data, batch_size=8, collate_fn=data_collator)

model.eval()
for step, batch in enumerate(tqdm(eval_dataloader)):
    with torch.cuda.amp.autocast():
        with torch.no_grad():
            generated_tokens = (
                model.generate(
                    input_features=batch["input_features"].to("cuda"),
                    decoder_input_ids=batch["labels"][:, :4].to("cuda"),
                    max_new_tokens=255,
                )
                .cpu()
                .numpy()
            )
            labels = batch["labels"].cpu().numpy()
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
            metric.add_batch(
                predictions=decoded_preds,
                references=decoded_labels,
            )
    del generated_tokens, labels, batch
    gc.collect()
wer = 100 * metric.compute()
print(f"{wer=}")

  with torch.cuda.amp.autocast():
Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`. See https://github.com/huggingface/transformers/pull/28687 for more details.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
100%|██████████| 13/13 [09:22<00:00, 43.27s/it]

wer=19.596541786743515





Print first batch for qualitative analysis.

In [None]:

from jiwer import wer as jiwer_wer

# Get one batch from the dataloader
batch = next(iter(eval_dataloader))

model.eval()
with torch.cuda.amp.autocast():
    with torch.no_grad():
        generated_tokens = (
            model.generate(
                input_features=batch["input_features"].to("cuda"),
                decoder_input_ids=batch["labels"][:, :4].to("cuda"),
                max_new_tokens=255,
            )
            .cpu()
            .numpy()
        )

# Process labels and decode
labels = batch["labels"].cpu().numpy()
labels = np.where(labels != -100, labels, processor.tokenizer.pad_token_id)
decoded_preds = processor.tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
decoded_labels = processor.tokenizer.batch_decode(labels, skip_special_tokens=True)

# Print first batch results
print("=== Sample Results (First Batch) ===\n")
for i, (ref, pred) in enumerate(zip(decoded_labels, decoded_preds)):
    sample_wer = jiwer_wer(ref, pred) * 100
    print(f"Sample {i+1}")
    print(f"Ground Truth : {ref}")
    print(f"Prediction    : {pred}")
    print(f"WER           : {sample_wer:.2f}%\n")


  with torch.cuda.amp.autocast():


=== Sample Results (First Batch) ===

Sample 1
Ground Truth : Kumariya in the Morning was written by Ifatola and Ugeruomba after Ituaton finished her Oyo Tour on 25041998
Prediction    : Komaria in the morning was written by Ifatola and Uge Rumba after Etwaton finished Daoyoto on 25th April 1998
WER           : 55.56%

Sample 2
Ground Truth : The menstrual history may include perimenstrual symptoms such as anxiety, UID retention, nervousness, mood fluctuations, food cravings, variations in sexual feelings, and difficulty sleeping.
Prediction    : The menstrual history may include peremenstrual symptoms such as anxiety, UID retention, nervousness, mood fluctuations, food cravings, variations in sexual feelings, and difficulty sleeping.
WER           : 4.17%

Sample 3
Ground Truth : Firsttrimester or secondtrimester screening, or both, for Downs syndrome.
Prediction    : 4th-trimester or 2nd-trimester squini, or boot, for Down syndrome
WER           : 66.67%

Sample 4
Ground Truth : This