## Import Packages
We need to install transformers and datasets. librosa is used to load audio files, and jiwer is used to evaluate the fine-tuned model with the word error rate (WER) metric.

In [None]:
# Quote the requirement so the shell does not treat ">" as an output redirect
!pip install "datasets>=1.18.3"
!pip install transformers==4.11.3
!pip install librosa
!pip install jiwer

To upload our training checkpoints directly to the Hugging Face Hub, we first log in so that the Hugging Face authentication token is stored.

In [4]:
from huggingface_hub import notebook_login

notebook_login()

Install Git LFS, which is needed to upload the model checkpoints.

In [5]:
!apt install git-lfs

# Prepare Data, Tokenizer, Feature Extractor 

### Create Wav2Vec2CTCTokenizer 

In [7]:
# Load the dataset
from datasets import load_dataset, load_metric

# You can pass the streaming option to load_dataset to stream the data from the source instead of downloading and caching it
luganda = load_dataset("mozilla-foundation/common_voice_16_1", "lg")

print(luganda)

In [10]:
luganda

DatasetDict({
    train: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        num_rows: 71069
    })
    validation: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        num_rows: 13331
    })
    test: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        num_rows: 13358
    })
    other: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        num_rows: 36998
    })
    invalidated: Dataset({
        features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment', 'variant'],
        num_

In [11]:
# Remove the unnecessary columns from the dataset
luganda = luganda.remove_columns(["client_id", "up_votes", "down_votes", "age", "gender", "accent", "locale", "segment", "variant"])

### Display some of the rows in the dataset

In [13]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML
import re

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))

show_random_elements(luganda["train"].remove_columns(["path", "audio"]))


| | sentence |
|---|---|
| 0 | Ab’e Buddu bakola nnyo. |
| 1 | Njagala kugenda mu kibuga. |
| 2 | Ebisolo ebingi awamu biyitibwa eggana. |
| 3 | Nga tonnaba kusalawo sooka olowooze ku kintu ekiddako. |
| 4 | Abantu bafuna empeereza y'obujjanjabi embi. |
| 5 | Kati nsuubira Mukuumaddamula okwogera ku nsonga y'emirimu egigootaanye olwa Corona |
| 6 | Omugoba w'ekidduka ateekeddwa okugoberera amateeka g'enguudo okusobola okwewala obubenje. |
| 7 | Akasawo kano kalabika bulungi. |
| 8 | Buvunaanyizibwa bwa gavumenti okulaba ng'abaana bonna basoma. |
| 9 | Obulwadde bwa Anthrax buzibu nnyo okutangira. |


In [None]:
# Normalize the transcriptions: lower-case everything and strip punctuation.
# Without a language model, such characters are hard to predict because they
# do not correspond to a characteristic sound.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'

def remove_special_characters(batch):
    # Common Voice stores the transcription in "sentence"; the normalized
    # copy is written to a new "text" column, which the later cells use.
    batch["text"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    return batch

luganda = luganda.map(remove_special_characters)
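To see what the normalization does to a single transcription, here is a minimal sketch that applies the same character set to two of the sample sentences shown earlier. The helper name `normalize` is only illustrative and is not part of the notebook. Note that apostrophes are not in the ignored set, so they are kept.

```python
import re

# Same punctuation set as in the cell above, written as a raw string.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'

def normalize(sentence):
    """Strip the ignored punctuation and lower-case the sentence."""
    return re.sub(chars_to_ignore_regex, '', sentence).lower()

samples = [
    "Njagala kugenda mu kibuga.",
    "Buvunaanyizibwa bwa gavumenti okulaba ng'abaana bonna basoma.",
]

for sample in samples:
    print(normalize(sample))
```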

In [None]:
# Display samples from the normalized dataset
show_random_elements(luganda["train"].remove_columns(["path", "audio"]))

In CTC, chunks of speech are classified into letters, so we need to extract all distinct letters in the dataset and build a vocabulary from them.
We need a mapping function that concatenates all the transcriptions into one long transcription and then transforms that string into a set of characters.

In [None]:
# Use batched=True with batch_size=-1 so that the map function can access all the transcriptions at once
def extract_all_chars(batch):
  all_text = " ".join(batch["text"])
  vocab = list(set(all_text))
  return {"vocab": [vocab], "all_text": [all_text]}

vocabs = luganda.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=luganda.column_names["train"])

In [None]:
# Create a vocabulary of all letters in the train and test splits
# (sorted so that the character-to-id mapping is the same on every run)
vocab_list = sorted(set(vocabs["train"]["vocab"][0]) | set(vocabs["test"]["vocab"][0]))

vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict
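The vocabulary construction can be tried on a few transcriptions without downloading the dataset. The sketch below calls the same mapping function on plain dictionaries; the `toy_train` and `toy_test` batches are made-up stand-ins for the real splits, and sorting the characters is an assumption added here so that the ids are stable between runs.

```python
def extract_all_chars(batch):
    # Join every transcription in the batch and collect the distinct characters.
    all_text = " ".join(batch["text"])
    vocab = list(set(all_text))
    return {"vocab": [vocab], "all_text": [all_text]}

# Hypothetical mini-batches standing in for the train and test splits.
toy_train = {"text": ["njagala kugenda", "mu kibuga"]}
toy_test = {"text": ["bakola nnyo"]}

train_vocab = extract_all_chars(toy_train)["vocab"][0]
test_vocab = extract_all_chars(toy_test)["vocab"][0]

# Union of both character sets, then one integer id per character.
vocab_list = sorted(set(train_vocab) | set(test_vocab))
vocab_dict = {char: idx for idx, char in enumerate(vocab_list)}
print(vocab_dict)
```

The space character ends up in the vocabulary like any other letter, which is why it is handled separately in the next step.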

We need to replace the space character " " with a more visible character, "|", which serves as the word delimiter. We also need to add an unknown token to deal with characters not encountered in the training dataset.

In [None]:
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]

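We also need to add a padding token that corresponds to CTC's blank token. The blank token is a core component of the CTC algorithm.

In [None]:
# Unknown token for characters that are not in the vocabulary
vocab_dict["[UNK]"] = len(vocab_dict)
# Padding token, which doubles as the CTC blank token
vocab_dict["[PAD]"] = len(vocab_dict)
print(len(vocab_dict))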
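To illustrate why the blank token matters, here is a minimal sketch of how greedy CTC decoding is usually described: consecutive repeated predictions are merged, blanks are dropped, and the word delimiter is turned back into a space. The toy vocabulary, the frame ids, and the function name `ctc_greedy_collapse` are assumptions made for this example; this is not the tokenizer's actual implementation.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one is built from the dataset.
toy_vocab = {"a": 0, "b": 1, "o": 2, "|": 3, "[UNK]": 4, "[PAD]": 5}
id_to_token = {idx: token for token, idx in toy_vocab.items()}

def ctc_greedy_collapse(frame_ids, blank_id, delimiter="|"):
    """Turn per-frame predictions into text."""
    # Step 1: merge runs of identical consecutive predictions.
    merged = [frame_id for frame_id, _ in groupby(frame_ids)]
    # Step 2: drop the blank token.
    tokens = [id_to_token[i] for i in merged if i != blank_id]
    # Step 3: map the word delimiter back to a space.
    return "".join(tokens).replace(delimiter, " ")

# Made-up per-frame predictions: b b _ a a _ a | | b o o  (_ is the blank)
frames = [1, 1, 5, 0, 0, 5, 0, 3, 3, 1, 2, 2]
print(ctc_greedy_collapse(frames, blank_id=toy_vocab["[PAD]"]))
```

The blank between the two runs of "a" is what allows a doubled letter to survive the merge step. Without it, the repeated frames would collapse into a single "a", which matters for a language like Luganda where doubled letters are common.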

In [None]:
# Save the vocabulary to a json file
import json
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

In [None]:
# Use the json file to instantiate an object of the Wav2Vec2CTCTokenizer class
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

In [None]:
# Push the tokenizer to the hub
repo_name = "luganda_wav2vec2_ctc_tokenizer" 
tokenizer.push_to_hub(repo_name, use_auth_token=True)

# Create Wav2Vec2 Feature Extractor