# Finetune Wav2Vec2 XLS-R for Luganda ASR using Mozilla CommonVoice Dataset

Log into HuggingFace Hub in order to access the models and to upload checkpoints directly to HuggingFace.   

In [None]:
from huggingface_hub import notebook_login

notebook_login()


Install Git LFS in order to upload the model checkpoints to HuggingFace.

In [None]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


# Prepare Data, Tokenizer, Feature Extractor

### Create Wav2Vec2CTCTokenizer

Before you can download the dataset, you have to open its page on HuggingFace and agree to the dataset's terms of use.

In [None]:
# Load the dataset
from datasets import load_dataset, Audio  # the metrics are loaded later with the `evaluate` library

# Load the training, validation and test datasets separately
# We can load a portion of the dataset instead of the whole dataset
luganda_train_dataset = load_dataset("mozilla-foundation/common_voice_7_0", "lg", split="train[:1%]")   # 1% of the training dataset
luganda_valid_dataset = load_dataset("mozilla-foundation/common_voice_7_0", "lg", split="validation")
luganda_test_dataset = load_dataset("mozilla-foundation/common_voice_7_0", "lg", split="test")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 6626
    })
    test: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 4276
    })
    validation: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 3549
    })
    other: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 29407
    })
    invalidated: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
        num_rows: 2195
    })
})


In [None]:
# Select relevant columns from the dataset
luganda_train_dataset = luganda_train_dataset.select_columns(["audio", "sentence", ])
luganda_valid_dataset = luganda_valid_dataset.select_columns(["audio", "sentence", ])
luganda_test_dataset = luganda_test_dataset.select_columns(["audio", "sentence", ])

Calculate the duration of the dataset 

In [None]:
def calculate_duration(batch):
    """Calculate the duration of each data sample in the batch"""
    audio = batch['audio']
    batch['duration'] = len(audio['array'])/audio['sampling_rate']
    return batch

In [None]:
luganda_train_dataset = luganda_train_dataset.map(calculate_duration, num_proc=4)
luganda_valid_dataset = luganda_valid_dataset.map(calculate_duration, num_proc=4)
luganda_test_dataset = luganda_test_dataset.map(calculate_duration, num_proc=4)

Let's calculate the total duration, in hours, of each of the data splits.

In [None]:
train_duration = sum(luganda_train_dataset['duration'])/3600
valid_duration = sum(luganda_valid_dataset['duration'])/3600
test_duration = sum(luganda_test_dataset['duration'])/3600
print(f"{train_duration=}")
print(f"{valid_duration=}")
print(f"{test_duration=}")

We can plot a histogram of the durations in the train dataset

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

pd.Series(luganda_train_dataset['duration']).hist()
plt.show()

If we have data samples that are longer than 30s or shorter than 2s, we need to filter them out of the dataset to avoid issues during training.

In [None]:
luganda_train_dataset = luganda_train_dataset.filter(lambda x: x['duration'] <= 30 and x['duration'] >= 2)
luganda_valid_dataset = luganda_valid_dataset.filter(lambda x: x['duration'] <= 30 and x['duration'] >= 2)

### Display some of the rows in the dataset
Check out samples from the dataset to make sure that they are what we expect. We want to look out for numbers, which need to be converted to text. We also want to check whether there are any letters that are not part of the alphabet of the language we are training for; these need to be substituted with the closest characters in that language.
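The check described above can also be done programmatically. The following is a minimal sketch, assuming a hypothetical reference alphabet (`EXPECTED_CHARS`) and a hypothetical helper (`unexpected_characters`); neither comes from this notebook, and the second toy sentence is made up for illustration. It counts every character, digits included, that falls outside the expected set.

```python
from collections import Counter

# Hypothetical reference alphabet: lowercase Latin letters, apostrophe and space.
# Adjust it to the orthography you are targeting.
EXPECTED_CHARS = set("abcdefghijklmnopqrstuvwxyz' ")


def unexpected_characters(sentences):
    """Count characters that fall outside EXPECTED_CHARS, after lowercasing."""
    counts = Counter()
    for sentence in sentences:
        for char in sentence.lower():
            if char not in EXPECTED_CHARS:
                counts[char] += 1
    return counts


# Toy sentences standing in for the `sentence` column of the dataset
samples = ["Kakuyege atongozeddwa leero.", "Yazaalibwa mu 1990 (e Kampala)."]
report = unexpected_characters(samples)
print(report.most_common())
```

On the real data, the same function could be called on `luganda_train_dataset["sentence"]` to decide which characters to strip or substitute.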

In [None]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML
import re

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))

show_random_elements(luganda_valid_dataset.remove_columns(["audio"]))

| | sentence |
|---|---|
| 0 | Amazzi kyetaago bu bulamu obwa bulijjo. |
| 1 | "Abayizi ababiri baayitira mu ddaala erisooka mu bigezo bya pulayimale eby'akamalirizo." |
| 2 | Ebikolwa ebimu biggya omuntu mu kkubo lya Katonda. |
| 3 | "Empuuta y'ennaku zino tekyawunnya nnyo nga ey'edda." |
| 4 | "Gavumenti yeetaaga ssente endala okubeezaawo abanoonyi b'obubuddamu." |
| 5 | Nagenda okusoma obulimi mu lukumi lwenda kyenda. |
| 6 | Yakola ekikolwa kibi okwoleka ebyama bya mukyalawe. |
| 7 | "Bwe yalaba anaatera okubatuukako n'atta ku bigere baleme mmulaba." |
| 8 | Kakuyege atongozeddwa leero. |
| 9 | Ebintu ebimu tusaanye tusooke kubiteesaako kuba bya nkizo nnyo. |


In CTC, chunks of speech are classified into letters, so we need to extract all the distinct letters in the dataset and build a vocabulary.   
We need a mapping function that concatenates all the transcriptions into one long transcription and then transforms that string into a set of characters.

In [None]:
# Use batched=True with batch_size=-1 so that the map function can access all the transcriptions at once
def extract_all_chars(batch):
  all_text = " ".join(batch["sentence"])
  vocab = list(set(all_text))
  return {"vocab": [vocab], "all_text": [all_text]}


luganda_train_vocab   = luganda_train_dataset.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=luganda_train_dataset.column_names)
luganda_valid_vocab   = luganda_valid_dataset.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=luganda_valid_dataset.column_names)
luganda_test_vocab    = luganda_test_dataset.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=luganda_test_dataset.column_names)


In [None]:
vocab_list = list(set(luganda_train_vocab["vocab"][0]) | set(luganda_valid_vocab["vocab"][0]) | set(luganda_test_vocab["vocab"][0]))

In [None]:
# Create a vocabulary of all letters in the train, validation and test datasets
vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict

{'(': 0,
 ' ': 1,
 'w': 2,
 'x': 3,
 's': 4,
 'm': 5,
 'y': 6,
 'k': 7,
 'r': 8,
 'o': 9,
 'g': 10,
 'p': 11,
 'u': 12,
 'e': 13,
 'v': 14,
 'i': 15,
 '’': 16,
 "'": 17,
 ')': 18,
 'h': 19,
 'b': 20,
 'f': 21,
 '‘': 22,
 'n': 23,
 'a': 24,
 'c': 25,
 'l': 26,
 'j': 27,
 't': 28,
 'd': 29,
 'z': 30}

In [None]:
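# Normalize the transcriptions: lowercase the text and strip punctuation, because without a language model it is difficult to classify such characters as they do not correspond to a characteristic sound.
# Note: the vocabulary printed above contains no uppercase letters and none of the characters removed here, so this normalization has to run before the vocabulary is extracted.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'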

def remove_special_characters(batch):
    batch["transcription"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    return batch

luganda_train_dataset = luganda_train_dataset.map(remove_special_characters)
luganda_valid_dataset = luganda_valid_dataset.map(remove_special_characters)
luganda_test_dataset = luganda_test_dataset.map(remove_special_characters)

We need to replace the " " in the vocabulary with a more visible character, "|". We also need to add the UNKNOWN token so that the tokenizer can deal with characters not encountered in the training dataset.

In [None]:
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]

We need to add the pad token that corresponds to CTC's blank token. The blank token is a core component of the CTC algorithm.
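To illustrate why the blank token matters, here is a minimal sketch of the collapse step used in greedy CTC decoding: consecutive duplicate predictions are merged, then blanks are dropped. The function name `ctc_collapse` and the hand-written frame sequences are illustrative assumptions, not output from the model. The example uses the word "nnyo" from the dataset, where a blank between the two "n" frames is what lets the double consonant survive.

```python
from itertools import groupby

BLANK = "[PAD]"  # in this notebook the tokenizer's pad token doubles as the CTC blank


def ctc_collapse(frame_tokens, blank=BLANK):
    """Merge consecutive duplicate tokens, then drop the blank tokens."""
    merged = [token for token, _ in groupby(frame_tokens)]
    return "".join(token for token in merged if token != blank)


# Hand-written frame-level predictions (one token per chunk of audio)
with_blank = ["n", "n", "[PAD]", "n", "y", "y", "[PAD]", "o", "o"]
without_blank = ["n", "n", "n", "y", "y", "o", "o"]

print(ctc_collapse(with_blank))     # nnyo
print(ctc_collapse(without_blank))  # nyo
```

Without the blank, repeated frames of the same letter are indistinguishable from a genuinely doubled letter, which is common in Luganda spelling.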

In [None]:
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
print(len(vocab_dict))

33
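With the word delimiter and the special tokens in place, encoding a transcription amounts to a character lookup. The sketch below shows the idea on a toy vocabulary; `toy_vocab` and `encode` are hypothetical names with illustrative ids, and the real `Wav2Vec2CTCTokenizer` handles more cases than this.

```python
# Toy vocabulary in the same shape as vocab_dict (the ids are illustrative)
toy_vocab = {"a": 0, "b": 1, "e": 2, "k": 3, "|": 4, "[UNK]": 5, "[PAD]": 6}


def encode(text, vocab):
    """Map each character to its id: spaces become '|', unseen characters '[UNK]'."""
    ids = []
    for char in text:
        token = "|" if char == " " else char
        ids.append(vocab.get(token, vocab["[UNK]"]))
    return ids


print(encode("baka be", toy_vocab))  # [1, 0, 3, 0, 4, 1, 2]
print(encode("bq", toy_vocab))       # [1, 5] because 'q' is not in the toy vocabulary
```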


In [None]:
# Save the vocabulary to a json file
import json
with open('./vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

In [None]:
# Use the json file to instantiate an object of the Wav2Vec2CTCTokenizer class
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

In [None]:
# Let's push the tokenizer to the hub
repo_name = "luganda-wav2vec2-xls-r-common-voice-7-0"
tokenizer.push_to_hub(repo_name)

CommitInfo(commit_url='https://huggingface.co/dmusingu/luganda_wav2vec2_ctc/commit/0e024f5993351c4bf9d5ed2ab34dba98bb001df1', commit_message='Upload tokenizer', commit_description='', oid='0e024f5993351c4bf9d5ed2ab34dba98bb001df1', pr_url=None, pr_revision=None, pr_num=None)

# Create Wav2Vec Feature Extractor

In [None]:
# Create a feature extractor using Wav2Vec2FeatureExtractor. We shall pass feature size as 1 because we are dealing with raw audio files.
from transformers import Wav2Vec2FeatureExtractor

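# XLS-R checkpoints expect an attention mask so that padded samples are ignored, hence return_attention_mask=True
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=True)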
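The `do_normalize=True` flag asks the feature extractor to scale each waveform to zero mean and unit variance. A rough sketch of that operation in plain Python follows; the helper name `zero_mean_unit_variance` and the epsilon value are assumptions made for illustration, and the library's own implementation may differ in detail.

```python
import math


def zero_mean_unit_variance(samples, eps=1e-7):
    """Shift a waveform to zero mean and scale it to (approximately) unit variance."""
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    # eps keeps the division stable for near-silent audio
    return [(x - mean) / math.sqrt(variance + eps) for x in samples]


waveform = [0.1, 0.3, -0.2, 0.0]  # toy amplitudes standing in for raw audio
normalized = zero_mean_unit_variance(waveform)
print(normalized)
```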

In [None]:
# Wrap the feature extractor and the tokenizer into a Wav2Vec2Processor
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Preprocess the dataset

In [None]:
from datasets import Audio

In [None]:
luganda_train_dataset = luganda_train_dataset.cast_column("audio", Audio(sampling_rate=16000))
luganda_valid_dataset = luganda_valid_dataset.cast_column("audio", Audio(sampling_rate=16000))
luganda_test_dataset = luganda_test_dataset.cast_column("audio", Audio(sampling_rate=16000))

In [None]:
# Listen to sample audio from the dataset
import IPython.display as ipd
import numpy as np
import random

rand_int = random.randint(0, len(luganda_train_dataset)-1)

# luganda_train_dataset is a single split, so it is indexed by row directly
print(luganda_train_dataset[rand_int]["sentence"])
ipd.Audio(data=np.asarray(luganda_train_dataset[rand_int]["audio"]["array"]), autoplay=True, rate=16000)

ekibiina kyajaguzza emyaka amakumi asatu mu ena bukya nga kibeerawo


In [None]:
rand_int = random.randint(0, len(luganda_train_dataset)-1)  # randint includes both endpoints

print("Target text:", luganda_train_dataset[rand_int]["transcription"])
print("Input array shape:", np.asarray(luganda_train_dataset[rand_int]["audio"]["array"]).shape)
print("Sampling rate:", luganda_train_dataset[rand_int]["audio"]["sampling_rate"])

Target text: abaserikale batuuse kikeerezi okuzikiza ennyumba eyabadde ekutte omuliro
Input array shape: (95713,)
Sampling rate: 16000


In [None]:
# Extract the model inputs from the audio (resampled to 16kHz above, since the model was pretrained on audio sampled at 16kHz) and encode the transcriptions as label ids
def prepare_dataset(batch):
    audio = batch["audio"]

    # batched output is "un-batched" to ensure mapping is correct
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]

    # encode the target text with the tokenizer directly (as_target_processor is deprecated)
    batch["labels"] = processor.tokenizer(batch["transcription"]).input_ids
    return batch

In [None]:
# Apply the map function to the dataset
luganda_train_dataset = luganda_train_dataset.map(prepare_dataset, remove_columns=['audio', 'transcription'], num_proc=4)
luganda_valid_dataset = luganda_valid_dataset.map(prepare_dataset, remove_columns=['audio', 'transcription'], num_proc=4)

### Training and Evaluation

In [None]:
# Set up the trainer
import torch

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence if provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (:obj:`int`, `optional`):
            Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
        max_length_labels (:obj:`int`, `optional`):
            Maximum length of the ``labels`` returned list and optionally padding length (see above).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.5 (Volta).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        # pad the labels with the tokenizer directly (as_target_processor is deprecated)
        labels_batch = self.processor.tokenizer.pad(
            label_features,
            padding=self.padding,
            max_length=self.max_length_labels,
            pad_to_multiple_of=self.pad_to_multiple_of_labels,
            return_tensors="pt",
        )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

In [None]:
# Initialize the data_collator
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
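The label-masking step in the collator can be illustrated without any library. The sketch below pads label sequences to a common length and replaces the padded positions with -100, mirroring what `masked_fill` does above; `pad_labels` and the toy label ids are hypothetical and only serve the illustration.

```python
IGNORE_INDEX = -100  # padded label positions are set to this value so the loss skips them


def pad_labels(label_sequences, pad_id):
    """Pad label sequences to equal length, then mask the padding with IGNORE_INDEX."""
    max_len = max(len(seq) for seq in label_sequences)
    masked = []
    for seq in label_sequences:
        n_pad = max_len - len(seq)
        padded = seq + [pad_id] * n_pad
        attention_mask = [1] * len(seq) + [0] * n_pad
        # keep real tokens, replace padded positions
        masked.append([tok if keep == 1 else IGNORE_INDEX
                       for tok, keep in zip(padded, attention_mask)])
    return masked


# Toy label ids; 32 stands in for the [PAD] id of the tokenizer
print(pad_labels([[3, 1, 4], [2]], pad_id=32))  # [[3, 1, 4], [2, -100, -100]]
```

Masking through the attention mask rather than by comparing against `pad_id` means a genuine label that happens to equal the pad id would not be masked by mistake.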

We shall evaluate the model using WER and CER
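Both metrics are edit distances normalized by the length of the reference: WER counts word-level substitutions, insertions and deletions, and CER does the same at character level. The following is a minimal per-sentence sketch with hypothetical helper names (`edit_distance`, `wer`, `cer`); the `evaluate` metrics used in this notebook aggregate over the whole corpus, so their numbers need not equal an average of per-sentence values. The sentence pair is taken from the test output shown later in this notebook.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate of one sentence pair."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate of one sentence pair (spaces count as characters)."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "maama wange amanyi okuboobeza emmere"
prediction = "maama wange amayi okuboobeza emmere"
print(f"WER: {wer(reference, prediction):.3f}")  # 1 wrong word out of 5
print(f"CER: {cer(reference, prediction):.3f}")  # 1 missing character out of 36
```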

In [None]:
# Load WER and CER from evaluate
import evaluate 

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")



In [None]:
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    cer = cer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer, "cer": cer}

In [None]:
# Load the pretrained Wav2Vec2 checkpoint. We use the tokenizer's pad token id to define the model's pad token id
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    attention_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.1,
    mask_time_prob=0.05,
    layerdrop=0.1,
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer)
)

  return self.fget.__get__(instance, owner)()
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['lm_head.bias', 'lm_head.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Freeze the feature encoder (freeze_feature_extractor is the deprecated name of this method)
model.freeze_feature_encoder()



We log training progress to Weights and Biases

In [None]:
import wandb

# Log in to Weights & Biases
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mdmusingu[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

In [None]:
%env WANDB_LOG_MODEL=end
%env WANDB_PROJECT=ASR Africa
%env WANDB_WATCH=all
%env WANDB_SILENT=true

env: WANDB_PROJECT=LugandaASR-wav2vec
env: WANDB_LOG_MODEL="checkpoint"


Use callbacks to log training progress. This helps us monitor whether the transcriptions predicted by the model improve as the model is trained.

In [None]:
from transformers.integrations import WandbCallback
import pandas as pd


def map_to_result(batch):
  with torch.no_grad():
    input_values = torch.tensor(batch["input_values"], device="cuda").unsqueeze(0)
    logits = model(input_values).logits

  pred_ids = torch.argmax(logits, dim=-1)
  batch["pred_str"] = processor.batch_decode(pred_ids)[0]
  batch["text"] = processor.decode(batch["labels"], group_tokens=False)

  return batch

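def decode_predictions(tokenizer, predictions):
    # replace the -100 loss-masking value with the pad token id before decoding
    label_ids = np.where(predictions.label_ids == -100, tokenizer.pad_token_id, predictions.label_ids)
    # labels are not CTC output, so repeated characters must not be grouped
    labels = tokenizer.batch_decode(label_ids, group_tokens=False)
    pred_ids = predictions.predictions.argmax(axis=-1)
    prediction_text = tokenizer.batch_decode(pred_ids)
    return {"labels": labels, "predictions": prediction_text}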


class WandbPredictionProgressCallback(WandbCallback):
    """Custom WandbCallback to log model predictions during training.

    This callback logs model predictions and labels to a wandb.Table at each
    logging step during training. It allows to visualize the
    model predictions as the training progresses.

    Attributes:
        trainer (Trainer): The Hugging Face Trainer instance.
        tokenizer (AutoTokenizer): The tokenizer associated with the model.
        sample_dataset (Dataset): A subset of the validation dataset
          for generating predictions.
        num_samples (int, optional): Number of samples to select from
          the validation dataset for generating predictions. Defaults to 10.
        freq (int, optional): Frequency of logging. Defaults to 2.
    """

    def __init__(self, trainer, tokenizer, val_dataset,
                 num_samples=10, freq=2):
        """Initializes the WandbPredictionProgressCallback instance.

        Args:
            trainer (Trainer): The Hugging Face Trainer instance.
            tokenizer (AutoTokenizer): The tokenizer associated
              with the model.
            val_dataset (Dataset): The validation dataset.
            num_samples (int, optional): Number of samples to select from
              the validation dataset for generating predictions.
              Defaults to 10.
            freq (int, optional): Frequency of logging. Defaults to 2.
        """
        super().__init__()
        self.trainer = trainer
        self.tokenizer = tokenizer
        self.sample_dataset = val_dataset.select(range(num_samples))
        self.freq = freq

    def on_evaluate(self, args, state, control, **kwargs):
        super().on_evaluate(args, state, control, **kwargs)
        # control the frequency of logging: evaluation runs every `eval_steps`,
        # so `state.epoch` is usually fractional and is truncated before the check
        if int(state.epoch) % self.freq == 0:
            # generate predictions
            predictions = self.trainer.predict(self.sample_dataset)
            # decode predictions and labels
            predictions = decode_predictions(self.tokenizer, predictions)
            # add predictions to a wandb.Table
            predictions_df = pd.DataFrame(predictions)
            predictions_df["epoch"] = state.epoch
            records_table = self._wandb.Table(dataframe=predictions_df)
            # log the table to wandb
            self._wandb.log({"sample_predictions": records_table})

In [None]:
# Define the training arguments
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=repo_name,
    group_by_length=True,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    evaluation_strategy="steps",
    save_strategy="steps",
    num_train_epochs=50,
    torch_compile = True,
    bf16=True,
    gradient_checkpointing=True,
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-4,
    weight_decay=0.005,
    warmup_steps=1000,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    report_to="wandb",
    run_name="luganda-wav2vec2-xls-r-common-voice-7-0",
    push_to_hub=True,
    hub_model_id=repo_name,
    dataloader_num_workers=8,
    dataloader_pin_memory=True,
    dataloader_prefetch_factor=2,
)

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=luganda_train_dataset,
    eval_dataset=luganda_valid_dataset,
    tokenizer=processor.feature_extractor,
)


# Instantiate the WandbPredictionProgressCallback
progress_callback = WandbPredictionProgressCallback(
    trainer=trainer,
    tokenizer=tokenizer,
    val_dataset=luganda_valid_dataset,
    num_samples=10,
    freq=2,
)

# Add the callback to the trainer
trainer.add_callback(progress_callback)

In [None]:
# Train the model
trainer.train()
wandb.finish()




| Step | Training Loss | Validation Loss | Wer |
|---|---|---|---|
| 500 | 4.2675 | 1.999853 | 0.999869 |
| 1000 | 0.5754 | 0.697634 | 0.705035 |
| 1500 | 0.231 | 0.615251 | 0.644045 |
| 2000 | 0.1557 | 0.658143 | 0.613043 |
| 2500 | 0.1221 | 0.671751 | 0.606266 |
| 3000 | 0.1013 | 0.67106 | 0.593433 |
| 3500 | 0.0871 | 0.672803 | 0.57307 |
| 4000 | 0.0751 | 0.672918 | 0.572612 |
| 4500 | 0.0666 | 0.688354 | 0.568945 |
| 5000 | 0.0604 | 0.745167 | 0.560859 |





| Metric | Value |
|---|---|
| eval/loss | 0.76218 |
| eval/runtime | 214.2001 |
| eval/samples_per_second | 19.963 |
| eval/steps_per_second | 2.498 |
| eval/wer | 0.5422 |
| train/epoch | 60.0 |
| train/global_step | 8340.0 |
| train/learning_rate | 0.0 |
| train/loss | 0.0353 |
| train/total_flos | 1.991286461060895e+19 |


In [None]:
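# Push the model to hub
# The first positional argument of push_to_hub is the commit message, not the repository name; the target repository is set by hub_model_id in the training arguments
trainer.push_to_hub()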

CommitInfo(commit_url='https://huggingface.co/dmusingu/luganda_wav2vec2_ctc/commit/2c8b124fafa8c6e46cea736e5169d7dbce996d9c', commit_message='luganda_wav2vec2_ctc', commit_description='', oid='2c8b124fafa8c6e46cea736e5169d7dbce996d9c', pr_url=None, pr_revision=None, pr_num=None)

### Testing

In [None]:
processor = Wav2Vec2Processor.from_pretrained(repo_name)
model = Wav2Vec2ForCTC.from_pretrained(repo_name)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

In [None]:
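# Evaluation is carried out with a batch size of 1
def map_to_result(batch):
    with torch.no_grad():
        # move the inputs to the same device as the model instead of hard-coding "cuda"
        input_values = torch.tensor(batch["input_values"], device=device).unsqueeze(0)
        logits = model(input_values).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_str"] = processor.batch_decode(pred_ids)[0]
    batch["text"] = processor.decode(batch["labels"], group_tokens=False)

    return batch

# The test split has not been preprocessed yet, so extract the input values and labels first
luganda_test_dataset = luganda_test_dataset.map(prepare_dataset, remove_columns=['audio', 'transcription'])
results = luganda_test_dataset.map(map_to_result, remove_columns=luganda_test_dataset.column_names)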




In [None]:
print("Test WER: {:.3f}".format(wer_metric.compute(predictions=results["pred_str"], references=results["text"])))
print("Test CER: {:.3f}".format(cer_metric.compute(predictions=results["pred_str"], references=results["text"])))

Test WER: 0.456


In [None]:
# Check the errors made by the model
show_random_elements(results)

| | pred_str | text |
|---|---|---|
| 0 | abazadde b'abayizi be baagoba ku ssomero bakiriza okukozesa emeeza abaana ze baayonoona | abazadde b'abayizi be baagoba ku ssomero bakkirizza okukozesa emmeeza abaana ze bayonoona |
| 1 | akaakantu kennyumidde nnyo bampi | ako akantu kakunyumidde nnyo bambi |
| 2 | abakulembeze ab'enjawulo beetabi mu musomo | abakulembeze ab'enjawulo beetabye mu musomo |
| 3 | ti mmy'odlwalinga gya mu kutti mwu esinga mu ntaunta | ttiimu ya proline y'emu ku ttiimu ezisinga mu uganda |
| 4 | buzi bu ki obuvo mu kufumbo obw'ekito | buzibu ki obuva mu bufumbo bw'ekito |
| 5 | omukulembeze omulungi 'oyo ategeera ebizibu byabo baakulembera | omukulembeze omulungi y'oyo ategeera ebizibu by'abo b'akulembera |
| 6 | maama wange amayi okuboobeza emmere | maama wange amanyi okuboobeza emmere |
| 7 | ensi ezimu ze tugenda okukoleramu olwayo yannungi mu ndabikonaye ebikolebwayo gya ttabbu | ensi ezimu ze tugenda okukoleramu obwa yaaya nnungi mu ndabika naye ebikolebwayo bya ttabu |
| 8 | entampuzider etelababanti okuvuumi bifo biitabwe erebaagennda mu lokipe biriemw'edddembe | entalo zireetera abantu okuva mu bifo byabwe ne bagenda mu bifo ebirimu eddembe |
| 9 | omutamiivu abeera nga te yeebasse naye ngaategeera bulungi nnya | omutamiivu abeera nga eyeebase naye nga ategeera bulungi nnyo |


From the output above we can make the following observations:
1. The model finds it difficult to predict the position of the apostrophe ('), which plays a significant role in Luganda (for example, `byabo` is predicted where the reference has `by'abo`).
2. The weight-initialization log above mentions `facebook/wav2vec2-base`, a checkpoint pretrained on English only, whereas XLS-R is pretrained on multilingual speech. English has a different morphology from Luganda, and English-only pretraining could be one cause of the high WER on the test set. More experiments are needed to confirm this.
3. The model sometimes merges words that should be written separately (for example, `akaakantu` for `ako akantu`).
4. In other instances the model splits a word that should be written as one (for example, `buzi bu ki` for `buzibu ki`).
5. Training for many epochs results in overfitting: validation loss is lowest at step 1500 and rises afterwards while training loss keeps falling. WER improves only marginally after 4000 steps (0.573 at step 4000, 0.561 at step 5000).
6. The performance of the model could be improved by adding a language model.
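The overfitting observation can be checked against the training log. The short sketch below copies the logged values from the training table in this notebook and finds the step with the lowest validation loss and the step with the lowest WER; the variable names are illustrative.

```python
# (step, training loss, validation loss, WER) copied from the training log
training_log = [
    (500, 4.2675, 1.999853, 0.999869),
    (1000, 0.5754, 0.697634, 0.705035),
    (1500, 0.231, 0.615251, 0.644045),
    (2000, 0.1557, 0.658143, 0.613043),
    (2500, 0.1221, 0.671751, 0.606266),
    (3000, 0.1013, 0.67106, 0.593433),
    (3500, 0.0871, 0.672803, 0.57307),
    (4000, 0.0751, 0.672918, 0.572612),
    (4500, 0.0666, 0.688354, 0.568945),
    (5000, 0.0604, 0.745167, 0.560859),
]

best_loss_step = min(training_log, key=lambda row: row[2])[0]
best_wer_step = min(training_log, key=lambda row: row[3])[0]

# Absolute WER improvement between step 4000 and step 5000
wer_by_step = {step: wer for step, _, _, wer in training_log}
late_gain = wer_by_step[4000] - wer_by_step[5000]

print(f"Lowest validation loss at step {best_loss_step}")
print(f"Lowest WER at step {best_wer_step}")
print(f"WER gain from step 4000 to 5000: {late_gain:.4f}")
```

Validation loss and WER reach their minima at different steps, which is why `metric_for_best_model="wer"` in the training arguments selects a later checkpoint than the loss alone would.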