*  Paa Kwesi Jnr Thompson
*  Isaac Baah
*  Emmanuel Nhyira Freduah-Agyemang

# Checking for GPU Availability
Before starting the model training, we checked if a GPU was available using the nvidia-smi command. This command gives details about any connected NVIDIA GPUs, such as memory usage, temperature, and running processes. The code checks the output for the word "failed" to confirm whether a GPU is accessible. If no GPU is detected, a message is displayed to let us know. Otherwise, the GPU details are printed.

This step is important because deep learning models require a lot of computational power, and training on a GPU is much faster than using a CPU. By checking for a GPU at the start, we made sure that the hardware we needed for the project was available before moving on to training the model.

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)
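
The `!nvidia-smi` syntax above only works inside a notebook. As a hedged sketch, the same check can be written in plain Python for use in a script; the helper name `describe_gpu` is hypothetical and not part of the original project.

```python
import shutil
import subprocess


def describe_gpu(command="nvidia-smi"):
    """Return the tool's report as text, or None when no GPU can be queried."""
    # The executable is missing on machines without the NVIDIA driver.
    if shutil.which(command) is None:
        return None
    try:
        result = subprocess.run(
            [command], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Mirror the notebook check: a report containing "failed" means no usable GPU.
    if result.returncode != 0 or "failed" in result.stdout:
        return None
    return result.stdout


report = describe_gpu()
print(report if report is not None else "Not connected to a GPU")
```

Returning `None` instead of printing inside the helper lets later cells branch on the result, for example to stop early before a long training run.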

Earlier in the project, we faced repeated system crashes due to resource exhaustion when training on a CPU instead of a GPU. These crashes slowed our progress and highlighted the importance of using the appropriate hardware. By confirming the availability of a high-performance GPU, we ensured that the training process would be stable and efficient, allowing us to proceed confidently with the next stages of the project.

# Installing Necessary Libraries
As part of setting up the environment for our project, we upgraded and installed essential Python libraries using pip. The first command upgraded the pip package installer itself to the latest version, ensuring compatibility with the latest features and dependencies required by modern libraries. This step is critical because outdated versions of pip can lead to installation errors or incompatibility issues.

The second command installed and upgraded several key libraries required for our project. These include:


*   datasets[audio] for handling and processing datasets, particularly those involving audio data.
*   transformers for working with pre-trained models and fine-tuning them on specific tasks.
*   accelerate to optimize the training process across multiple devices, such as GPUs.
*   evaluate and jiwer for calculating metrics like Word Error Rate (WER) to assess the model's performance.
*   tensorboard for tracking training progress and visualizing metrics.
*   gradio for creating user-friendly interfaces to interact with the model.







In [None]:
!pip install --upgrade --quiet pip
!pip install --upgrade --quiet datasets[audio] transformers accelerate evaluate jiwer tensorboard gradio

# Using Hugging Face for Dataset, Model, and Endpoint Management
In this step, we logged into the Hugging Face Hub to manage our datasets, models, and endpoints. We chose Hugging Face because it offers a simple and efficient interface, making it highly convenient for collaborative projects. Previously, we attempted to use Google Drive for these tasks, but it presented significant challenges. Each group member had to individually upload datasets whenever changes were made, leading to version control issues and inefficiencies. Additionally, the upload and download process was time-consuming, which slowed down our workflow.

By switching to Hugging Face, we streamlined collaboration within our group. Once a dataset or model is pushed to the Hub, every member can load the latest version directly by its repository name, without redundant uploads.

In [None]:


from huggingface_hub import notebook_login

notebook_login()

# Data Preprocessing and Handling

---



# Loading Datasets
To create a more expressive model, we began with the dataset provided, which was split into 90% for training and 10% for testing. This dataset was primarily focused on financial contexts, making it valuable for specialized applications but limited in its ability to generalize across diverse Twi language use cases. To address this limitation, we decided to expand the dataset by incorporating a more general Twi dataset containing 28,000 examples from the repository kojo-george/asante-twi-tts.

The financial training dataset was further split into training (80%) and validation (20%) sets to evaluate model performance effectively during training. To integrate the general dataset with the financial dataset, we aligned the column names and removed any unnecessary columns to ensure consistency. This step allowed us to seamlessly combine the datasets, creating a more comprehensive and diverse dataset for training.

By combining the specialized financial dataset with the larger and more general Twi dataset, we ensured that the model could better handle diverse linguistic contexts while retaining its financial specialization. This approach strikes a balance between domain-specific accuracy and general expressiveness, improving the model's overall utility.
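
The 80/20 split described above comes down to simple index arithmetic. The sketch below reproduces it on plain ranges; `split_indices` and the row count of 1003 are illustrative assumptions, not values from our dataset.

```python
def split_indices(n_rows, train_fraction=0.8):
    """Return (train, validation) index ranges for a contiguous split."""
    # int() truncates, so any leftover row goes to the validation side.
    train_size = int(train_fraction * n_rows)
    return range(train_size), range(train_size, n_rows)


train_idx, val_idx = split_indices(1003)
print(len(train_idx), len(val_idx))  # 802 201
```

Because the split is contiguous, the validation rows are simply the last 20% of the dataset in its stored order; shuffling first would be needed if that order is not random.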


In [None]:
from datasets import load_dataset, DatasetDict, concatenate_datasets

# Load your financial dataset
train_dataset = load_dataset("Ibaahjnr/Twi_Train_Dataset", split="train")
test_dataset = load_dataset("Ibaahjnr/Twi_Test_Dataset", split="train")

# Split your financial training dataset into train and validation
dataset_size = len(train_dataset)
train_size = int(0.8 * dataset_size)
val_size = dataset_size - train_size

train_financial = train_dataset.select(range(train_size))
val_financial = train_dataset.select(range(train_size, dataset_size))

# Load the general Twi dataset
general_dataset = load_dataset("kojo-george/asante-twi-tts")

# Align column names between datasets
general_dataset = general_dataset.rename_column("text", "transcription")
general_dataset = general_dataset.remove_columns(["file_name"])

# Check the column names of your financial dataset
print("Financial Dataset Columns:", train_financial.column_names)

# Check the column names of the general dataset
print("General Dataset Columns:", general_dataset['train'].column_names)
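
To make the column alignment concrete, here is a minimal sketch of what renaming and dropping columns does to each row, using plain dictionaries instead of the datasets library. The helper `align_columns` and the sample row are hypothetical.

```python
def align_columns(rows, rename, drop):
    """Rename keys according to `rename` and discard keys listed in `drop`."""
    aligned = []
    for row in rows:
        aligned.append(
            {rename.get(key, key): value for key, value in row.items() if key not in drop}
        )
    return aligned


# A made-up row shaped like the general dataset before alignment.
general_rows = [{"audio": "clip_0.wav", "text": "hello", "file_name": "clip_0.wav"}]
aligned_rows = align_columns(general_rows, rename={"text": "transcription"}, drop={"file_name"})
print(sorted(aligned_rows[0]))  # ['audio', 'transcription']
```

After this step both datasets expose the same two columns, which is the precondition for concatenating them.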



# Combining and Preparing the Datasets
In this step, we focused on preparing a unified dataset for training, validation, and testing. Since the project involves working with audio data, we first ensured consistency in the datasets by casting the audio column in all datasets to the same format using the Audio class from the datasets library. This step standardizes the audio features, making them compatible for model training and evaluation.

To enhance the dataset's expressiveness, we combined the financial dataset with the general Twi dataset. The training and validation sets from the financial dataset were concatenated with the training and validation splits from the general dataset, while the testing set included samples from both datasets. The combined datasets were shuffled with a fixed seed to randomize the examples, ensuring that the model does not learn in a biased sequence. For evaluation purposes, the testing dataset was limited to 100 samples to streamline validation during development.

The combined dataset, stored as a DatasetDict named common_voice, is well-structured and includes distinct splits for training, validation, and testing. This process is critical to ensure that the model is trained on a diverse and representative dataset, evaluated effectively during training, and tested on a balanced subset. By merging these datasets, we aim to build a robust model that balances domain-specific expertise with general linguistic expressiveness.

In [None]:
from datasets import Audio

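# Cast 'audio' column in every dataset to ensure consistent features
financial_train_dataset = train_financial.cast_column("audio", Audio(sampling_rate=None))
val_financial = val_financial.cast_column("audio", Audio(sampling_rate=None))
# Build the financial test split from the held-out test dataset, not from the training rows
financial_test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=None))
general_dataset = general_dataset.cast_column("audio", Audio(sampling_rate=None))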

# Combine the datasets
train_combined = concatenate_datasets([financial_train_dataset, general_dataset["train"]]).shuffle(seed=42)
val_combined = concatenate_datasets([val_financial, general_dataset["validation"]]).shuffle(seed=42)
test_combined = concatenate_datasets([financial_test_dataset, general_dataset["test"]]).shuffle(seed=42)

# Create the DatasetDict named 'common_voice'
common_voice = DatasetDict({
    "train": train_combined,
    "validation": val_combined,
    "test": test_combined.select(range(100))
})

# Print the resulting DatasetDict
print(common_voice)
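
The combine, shuffle, and truncate pattern used above can be illustrated with ordinary lists. This is a sketch under the assumption that a fixed seed is all that is needed for reproducibility; `combine_and_shuffle` and the toy rows are hypothetical stand-ins for the real datasets.

```python
import random


def combine_and_shuffle(*sources, seed=42):
    """Concatenate several row lists and shuffle them reproducibly."""
    combined = [row for source in sources for row in source]
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(combined)
    return combined


financial = [f"financial_{i}" for i in range(150)]
general = [f"general_{i}" for i in range(150)]

test_rows = combine_and_shuffle(financial, general)[:100]  # keep 100 samples
print(len(test_rows))  # 100
```

Shuffling before taking the first 100 rows matters: without it, the subset would contain only rows from whichever dataset was listed first.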



# Initializing the Feature Extractor
In this step, we initialized a WhisperFeatureExtractor from the transformers library using the pre-trained openai/whisper-medium model. The feature extractor is a crucial component in processing audio data for training and inference. It transforms raw audio waveforms into the format the Whisper model expects: each clip is padded or truncated to 30 seconds and then converted into a log-Mel spectrogram.

The choice of using a pre-trained feature extractor ensures consistency with the Whisper model's architecture and pre-training configurations. This step is essential because the model expects inputs in a specific format to perform effectively. By leveraging the pre-trained WhisperFeatureExtractor, we save time and avoid potential errors from manually defining audio preprocessing steps, ensuring our data preparation aligns with the model's requirements.

Using this feature extractor also helps maintain the fidelity of the audio data while enabling the model to capture both domain-specific and general acoustic patterns. This ensures that our training and evaluation processes are optimized for the Whisper model's capabilities.

In [None]:
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-medium")
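
To give a feel for the shapes involved, the sketch below works through the padding arithmetic by hand. The constants (16 kHz input, 30-second window, hop length of 160 samples, 80 Mel bins) are our understanding of the openai/whisper-medium defaults and should be treated as assumptions; the code does not call the real feature extractor.

```python
SAMPLING_RATE = 16_000   # samples per second expected by the model
CHUNK_SECONDS = 30       # every clip is padded or trimmed to this length
HOP_LENGTH = 160         # samples between consecutive spectrogram frames
N_MELS = 80              # Mel frequency bins per frame


def pad_or_trim(samples, target_len=SAMPLING_RATE * CHUNK_SECONDS):
    """Zero-pad short clips and cut long ones to a fixed number of samples."""
    trimmed = list(samples[:target_len])
    return trimmed + [0.0] * (target_len - len(trimmed))


clip = [0.1] * (5 * SAMPLING_RATE)        # a 5-second dummy clip
padded = pad_or_trim(clip)
n_frames = len(padded) // HOP_LENGTH
print(len(padded), (N_MELS, n_frames))    # 480000 (80, 3000)
```

Under these assumptions every example ends up with the same feature shape regardless of how long the original recording was.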

# Initializing the Tokenizer
In this step, we initialized the WhisperTokenizer from the transformers library using the pre-trained openai/whisper-medium model. The tokenizer converts text into the numerical tokens that the model processes. While the primary goal of our model is to transcribe Twi, we faced a challenge: Twi is not among the languages the tokenizer accepts for its language setting. Additionally, our dataset occasionally included English words mixed with Twi (code-switching), which is common in everyday speech.

To address this, we specified language="English" and task="transcribe". The language argument only selects the language token that prefixes each label sequence; the vocabulary itself is shared across languages and operates on bytes, so Twi characters such as ɛ and ɔ can still be encoded. This allowed the tokenizer to handle English words naturally while still representing Twi text, so mixed-language transcriptions could be tokenized without dropping characters.

In [None]:
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-medium", language="English", task="transcribe")
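
One reason a tokenizer without a dedicated Twi setting can still encode Twi text is that, as far as we understand, Whisper's vocabulary is built on bytes rather than on whole characters. The sketch below only shows the UTF-8 byte view of two Twi letters; it does not reproduce the tokenizer's actual merges, and `utf8_byte_ids` is a hypothetical helper.

```python
def utf8_byte_ids(text):
    """Return the UTF-8 byte values that a byte-level vocabulary starts from."""
    return list(text.encode("utf-8"))


for character in ["ɛ", "ɔ", "a"]:
    print(repr(character), utf8_byte_ids(character))
# 'ɛ' [201, 155]
# 'ɔ' [201, 148]
# 'a' [97]
```

Every possible byte has an entry, so no character is ever out of vocabulary; the trade-off is that letters such as ɛ and ɔ may cost more tokens than plain ASCII letters.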

# Initializing the Processor
Building on the earlier steps, we initialized the WhisperProcessor from the transformers library using the pre-trained openai/whisper-medium model. The processor integrates the functionality of both the feature extractor and tokenizer, making it a unified tool for preparing audio data and handling text outputs. This ensures consistency in the preprocessing pipeline for our transcription task.

As noted earlier, the primary goal of our model is to transcribe Twi, but our dataset often contains English words mixed with Twi. To address this, we specified language="English" and task="transcribe", leveraging the multilingual capabilities of the Whisper model. This builds on the earlier tokenizer setup, allowing the processor to handle code-switching seamlessly while preserving the integrity of both Twi and English segments.


In [None]:
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-medium", language="English", task="transcribe")

# Prepare Data

In [None]:
!pip install numba


In [None]:
print(common_voice['train'][0])

# Resampling Audio Data

In this step, we resampled all the audio data in the common_voice dataset (train, validation, and test splits) to a standardized sampling rate of 16,000 Hz using the cast_column method from the datasets library. This step was crucial for overcoming inconsistencies in the dataset, which could arise from variations in the sampling rates of audio files collected from different sources.

One of the major challenges we faced earlier was dealing with datasets that had varying audio characteristics, including differing sampling rates. This inconsistency led to issues during preprocessing, as the Whisper model's feature extractor expects audio input with a fixed sampling rate. Without standardization, the model's performance could degrade due to mismatched input features or processing errors.

In [None]:
from datasets import Audio

common_voice['train'] = common_voice['train'].cast_column("audio", Audio(sampling_rate=16000))
common_voice['validation'] = common_voice['validation'].cast_column("audio", Audio(sampling_rate=16000))
common_voice['test'] = common_voice['test'].cast_column("audio", Audio(sampling_rate=16000))
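
Resampling changes the number of samples but not the duration of a clip. The following sketch shows only that arithmetic; it is not the resampling algorithm the datasets library uses, and `resampled_length` and the 44.1 kHz source rate are illustrative assumptions.

```python
def resampled_length(n_samples, orig_rate, target_rate=16_000):
    """Number of samples a clip has after resampling to target_rate."""
    return round(n_samples * target_rate / orig_rate)


three_seconds_at_44k = 44_100 * 3
print(resampled_length(three_seconds_at_44k, 44_100))  # 48000
```

A 3-second clip is still 3 seconds long afterwards (48,000 samples at 16 kHz), which is why recordings from different sources become directly comparable once they share a sampling rate.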

In [None]:
print(common_voice["train"][0])
print(common_voice["validation"][0])

The prepare_dataset function prepares the data for training with the Whisper model. Although its parameter is named batch, it receives a single example at a time when used with map (as we do later without batched=True), and it performs two key operations.

First, it computes log-Mel spectrogram input features from the audio data. The function extracts the audio array and its sampling rate from the audio field of the example and passes them to the feature_extractor. The resulting spectrogram features are stored under the key input_features. These features represent the audio in a form that the model can process.

Second, the function tokenizes the transcription text. It uses the tokenizer to encode the transcription field into a sequence of label IDs, which are stored under the key labels. These labels represent the text in a numerical format compatible with the model's output during training.

By performing these operations, the function ensures that each example contains both the input features and the corresponding labels needed for training the model. It returns the processed example for further use in the training pipeline.

In [None]:
def prepare_dataset(batch):

    audio = batch["audio"]

    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]

    # encode target text to label ids
    batch["labels"] = tokenizer(batch["transcription"]).input_ids
    return batch

# Mapping the prepare_dataset Function to the Dataset
The following line applies the prepare_dataset function to all splits (train, validation, test) of the common_voice dataset using the map method. This step transforms the dataset into a format that is ready for training the Whisper model.

Here’s a breakdown of what this does:

*   Applying the transformation: prepare_dataset is applied to each example of the dataset (map is called without batched=True, so the function receives one example at a time). It converts the audio into log-Mel spectrogram features and tokenizes the transcriptions into numerical label IDs, which the model requires for training.
*   Removing unnecessary columns: the remove_columns argument drops the original columns once they have been processed. It uses the column names from the train split (common_voice.column_names["train"]), so the dataset only contains the processed input_features and labels columns.
*   Parallel processing: the num_proc=4 argument enables multiprocessing, so the dataset is processed in parallel using 4 processes. This speeds up preprocessing, especially for large datasets.

By applying this transformation, the common_voice dataset is fully prepared for training. Each example in the dataset is now in the format expected by the Whisper model, containing the extracted audio features (input_features) and the corresponding tokenized transcriptions (labels). This step is crucial for ensuring the data pipeline runs efficiently and effectively during training.

In [None]:
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=4)
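
The interaction between the mapped function and remove_columns can be hard to picture, so here is a small stand-in written with plain dictionaries. `map_examples` and `toy_prepare` are hypothetical and only imitate the behaviour described above; they do not use the datasets library or the real feature extractor.

```python
def map_examples(rows, fn, remove_columns=()):
    """Apply fn to each row, then drop the listed original columns."""
    processed_rows = []
    for row in rows:
        processed = fn(dict(row))  # copy so the input rows stay unchanged
        processed_rows.append(
            {key: value for key, value in processed.items() if key not in remove_columns}
        )
    return processed_rows


def toy_prepare(example):
    """Stand-in for prepare_dataset: adds two new fields to one example."""
    example["input_features"] = [len(example["audio"])]            # fake features
    example["labels"] = [ord(ch) for ch in example["transcription"]]  # fake label ids
    return example


rows = [{"audio": [0.0, 0.1, 0.2], "transcription": "hi"}]
prepared = map_examples(rows, toy_prepare, remove_columns=["audio", "transcription"])
print(sorted(prepared[0]))  # ['input_features', 'labels']
```

Only the newly created columns survive, which keeps the processed dataset small and matches the inputs the data collator looks for.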

# Model

---



In [None]:
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

In [None]:
model.generation_config.language = "English"
model.generation_config.task = "transcribe"

model.generation_config.forced_decoder_ids = None

In [None]:
import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if the bos token was added in the previous tokenization step,
        # cut it here because it is appended again later anyway
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch

In [None]:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)
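
The label handling inside the collator is easier to follow on small lists. The sketch below imitates two of its steps, padding with -100 and removing a shared leading start token, without torch; `pad_labels` and the token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # positions with this value are skipped by the loss


def pad_labels(sequences, start_token_id):
    """Pad label sequences to equal length and drop a shared start token."""
    max_len = max(len(sequence) for sequence in sequences)
    padded = [
        list(sequence) + [IGNORE_INDEX] * (max_len - len(sequence))
        for sequence in sequences
    ]
    # The start token is added again during training, so drop it here,
    # but only when every sequence in the batch begins with it.
    if all(row[0] == start_token_id for row in padded):
        padded = [row[1:] for row in padded]
    return padded


print(pad_labels([[1, 5, 6], [1, 7]], start_token_id=1))  # [[5, 6], [7, -100]]
```

Using -100 rather than the pad token id means the padded positions contribute nothing to the loss, so short transcriptions are not penalised for the padding they carry.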

In [None]:
import evaluate

metric = evaluate.load("wer")
metric2 = evaluate.load("cer")

In [None]:
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    cer = 100 * metric2.compute(predictions=pred_str, references=label_str)

    return {"wer": wer, "cer": cer}
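
For readers unfamiliar with the two metrics, the sketch below computes WER and CER from scratch with an edit-distance function. It is a simplified illustration, not the implementation behind evaluate.load("wer"); the helper names and the English sample sentences are hypothetical, and we assume errors are pooled over the whole corpus before dividing.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of substitutions, insertions and deletions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def error_rate(predictions, references, split):
    """Total edits divided by total reference units, as a percentage."""
    errors = sum(edit_distance(split(r), split(p)) for p, r in zip(predictions, references))
    total = sum(len(split(r)) for r in references)
    return 100 * errors / total


references = ["the cat sat", "good morning"]
predictions = ["the cat sit down", "good morning"]

wer = error_rate(predictions, references, split=str.split)  # word level
cer = error_rate(predictions, references, split=list)       # character level
print(round(wer, 2))  # 40.0
```

Here one substituted word and one inserted word against five reference words give a WER of 40%. CER applies the same formula to characters, which makes it more forgiving when a word is only slightly misspelled.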

In [None]:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./Asanti_Twi_Model_V2.1",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=3000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)
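
To see what learning_rate, warmup_steps, and max_steps imply together, the sketch below evaluates a linear warmup followed by linear decay, which we assume is the default schedule used by the trainer. `linear_schedule_lr` is a hypothetical helper and does not call transformers.

```python
def linear_schedule_lr(step, peak_lr=1e-5, warmup_steps=500, max_steps=3000):
    """Learning rate at a given step for linear warmup then linear decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = (max_steps - step) / (max_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)


for step in (0, 250, 500, 1750, 3000):
    print(step, linear_schedule_lr(step))

# Examples processed per device over the whole run:
# batch size x accumulation steps x optimizer steps.
examples_seen = 16 * 1 * 3000
print(examples_seen)  # 48000
```

Under this assumption the rate climbs to 1e-5 over the first 500 steps and then falls back to zero by step 3000, so most updates happen at less than the peak rate.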

In [None]:
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

In [None]:
processor.save_pretrained(training_args.output_dir)

# Training

In [None]:
trainer.train()

In [None]:
kwargs = {
    "dataset_tags": "Ibaahjnr/Twi_Train_Dataset",  # Ensure this dataset exists on Hugging Face Hub
    "dataset": "Twi_Train_Dataset",  # Human-readable name for the dataset
    "dataset_args": '{"config": "audio translation", "split": "train"}',  # Valid JSON-like string for dataset arguments
    "language": ["twi"],  # Language in lowercase
    "model_name": "Twi_Whisper",  # Human-readable name for the model
    "finetuned_from": "openai/whisper-medium",  # Ensure this is correct
    "tasks": ["automatic-speech-recognition"],  # Valid task name
}



In [None]:
# Pass the metadata defined above so it is written to the model card
trainer.push_to_hub(**kwargs)

# Testing the Model on the Test Set

In [None]:
!pip install datasets

# Evaluation

In [None]:
import random
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
import evaluate
import torch
import torchaudio
from tqdm import tqdm  # For progress bar

# Load the Whisper model and processor
model_name = "Ibaahjnr/Asanti_Twi_Model_V2.1"  # Replace with your Hugging Face model path
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)

# Load the dataset
dataset_name = "Ibaahjnr/Twi_Test_Dataset"  # Replace with your dataset path
test_dataset = load_dataset(dataset_name, split="train")  # The test data is stored under the "train" split of this repository

# Randomly select 10 samples from the test dataset
test_samples = random.sample(list(test_dataset), 10)

# Load WER and CER metrics from the evaluate library
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Prepare evaluation
predictions = []
references = []

# Resample the audio to 16000 Hz
def resample_audio(audio, target_sampling_rate=16000):
    waveform = torch.tensor(audio["array"], dtype=torch.float32)  # Ensure the tensor is float32
    resampled_waveform = torchaudio.transforms.Resample(
        orig_freq=audio["sampling_rate"], new_freq=target_sampling_rate
    )(waveform)
    return resampled_waveform.numpy()

# Iterate through the randomly selected test samples
for sample in tqdm(test_samples, desc="Processing Audio Samples", unit="sample"):
    # Resample audio and load target transcription
    audio = sample['audio']
    resampled_audio = resample_audio(audio)
    reference_text = sample['transcription']  # Replace 'transcription' with your dataset's transcription column name

    # Process the resampled audio
    inputs = processor(resampled_audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        predicted_ids = model.generate(inputs["input_features"])

    # Decode the prediction
    predicted_text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

    # Append for metric computation
    predictions.append(predicted_text)
    references.append(reference_text)

# Print predictions and references to verify
print("Predictions:", predictions)
print("References:", references)

# Compute WER and CER
wer_score = wer_metric.compute(predictions=predictions, references=references)
cer_score = cer_metric.compute(predictions=predictions, references=references)

print(f"Word Error Rate (WER): {wer_score:.2f}")
print(f"Character Error Rate (CER): {cer_score:.2f}")
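
Because the 10 samples are drawn without a seed, each run of the cell above evaluates different examples and reports different scores. One possible way to make the draw repeatable is to sample indices with a seeded generator, as sketched below; `sample_indices`, the seed value, and the row count of 250 are assumptions for illustration.

```python
import random


def sample_indices(n_rows, k=10, seed=0):
    """Pick k distinct row indices reproducibly."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), min(k, n_rows)))


indices = sample_indices(n_rows=250)
print(len(indices))  # 10
```

The resulting indices could then be passed to the dataset's select method, which also avoids loading every audio file into a Python list just to pick ten of them.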

In [None]:
import pandas as pd

# displaying the 10 sentences for our held out test data
heldout = pd.DataFrame({"predictions":predictions,"references":references})
heldout

In [None]:

# Load the dataset
dataset_name = "Ibaahjnr/Asante_Twi_Collected_Test"  # Replace with your dataset path
test_dataset = load_dataset(dataset_name, split="train")  # The collected test data is stored under the "train" split of this repository

# Randomly select 10 samples from the test dataset
test_samples = random.sample(list(test_dataset), 10)

# Load WER and CER metrics from the evaluate library
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Prepare evaluation
predictions = []
references = []

# Resample the audio to 16000 Hz
def resample_audio(audio, target_sampling_rate=16000):
    waveform = torch.tensor(audio["array"], dtype=torch.float32)  # Ensure the tensor is float32
    resampled_waveform = torchaudio.transforms.Resample(
        orig_freq=audio["sampling_rate"], new_freq=target_sampling_rate
    )(waveform)
    return resampled_waveform.numpy()

# Iterate through the randomly selected test samples
for sample in tqdm(test_samples, desc="Processing Audio Samples", unit="sample"):
    # Resample audio and load target transcription
    audio = sample['audio']
    resampled_audio = resample_audio(audio)
    reference_text = sample['transcription']  # Replace 'transcription' with your dataset's transcription column name

    # Process the resampled audio
    inputs = processor(resampled_audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        predicted_ids = model.generate(inputs["input_features"])

    # Decode the prediction
    predicted_text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

    # Append for metric computation
    predictions.append(predicted_text)
    references.append(reference_text)

# Print predictions and references to verify
print("Predictions:", predictions)
print("References:", references)

# Compute WER and CER
wer_score = wer_metric.compute(predictions=predictions, references=references)
cer_score = cer_metric.compute(predictions=predictions, references=references)

print(f"Word Error Rate (WER): {wer_score:.2f}")
print(f"Character Error Rate (CER): {cer_score:.2f}")

In [None]:
# displaying the 10 sentences for our compiled data
our_test = pd.DataFrame({"predictions":predictions,"references":references})
our_test