## Import Packages
We need to install transformers and datasets. librosa is used to load audio files, and jiwer is used to evaluate the fine-tuned model using word error rate (WER).

In [5]:
!pip install "datasets>=1.18.3"
!pip install transformers==4.11.3
!pip install librosa
!pip install jiwer
# Restart the runtime for this change to take effect
!pip install accelerate -U

Collecting transformers==4.11.3
  Downloading transformers-4.11.3-py3-none-any.whl (2.9 MB)
Collecting sacremoses (from transformers==4.11.3)
  Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Collecting tokenizers<0.11,>=0.10.1 (from transformers==4.11.3)
  Downloading tokenizers-0.10.3.tar.gz (212 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: tokenizers
  error: subprocess-exited-with-error

  × Building whe

To upload our training checkpoints directly to the Hugging Face Hub, we have to store the Hugging Face authentication token.

In [1]:
from huggingface_hub import notebook_login

notebook_login()


Install Git LFS in order to upload the model checkpoints.

In [2]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


# Prepare Data, Tokenizer, Feature Extractor

### Create Wav2Vec2CTCTokenizer

In [None]:
# Load the dataset
from datasets import load_dataset, load_metric, Audio

# You can pass the streaming option to load_dataset to stream the data from the source instead of downloading and caching it
luganda = load_dataset("mozilla-foundation/common_voice_7_0", "lg")

print(luganda)

In [4]:
# Remove the unnecessary columns from the dataset
luganda = luganda.remove_columns(["client_id", "up_votes", "down_votes", "age", "gender", "accent", "locale", "segment"])

### Display some of the rows in the dataset

In [5]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML
import re

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))

show_random_elements(luganda["train"].remove_columns(["path", "audio"]))

Unnamed: 0,sentence
0,"""Tulina okulwanyisa ebikolwa ebityoboola eddembe ly'abaana mu kitundu."""
1,Maama alina okunaaba mu ngalo nga tannakwata mwana yakazaalibwa..
2,"""Abantu bajja kufuna ensimbi balongoose embeera y'obulamu bwabwe."""
3,"""Abamenyi b'amateeka balina okuloopebwa ku poliisi eri okumpi."""
4,"""Abakulu b'ebika banenya abakazi ne bwe kiba nga abasajja be bali mu nsobi."""
5,"""Ebbanga wakati w'abayizi n'amasomero likosa ensoma yaabwe."""
6,Endwadde zonna zigemebwa?
7,Ebibira birina emigaso mingi.
8,"""Disitulikiti ejja kugabanya ensimbi okusinziira ku byetaago by'ekitundu."""
9,"""Obujjanjabi bw'akafuba weebuli eri abalwadde b'akafuba."""
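The rejection loop in `show_random_elements` can also be written with `random.sample`, which draws unique indices in a single call. A minimal sketch, assuming only the dataset length is needed to pick the rows (the helper name `pick_unique_indices` is hypothetical):

```python
import random

def pick_unique_indices(dataset_len, num_examples=10):
    # random.sample draws without replacement, so no index can repeat
    if num_examples > dataset_len:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    return random.sample(range(dataset_len), num_examples)

picks = pick_unique_indices(100, num_examples=10)
print(picks)
```

The returned list can be passed to `dataset[picks]` in the same way as the `picks` list built by the loop.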


In [6]:
# Normalize the dataset to lower-case letters only and strip punctuation: without a language model it is difficult to classify such characters, as they do not correspond to a characteristic sound.
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'

def remove_special_characters(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    return batch

luganda = luganda.map(remove_special_characters)


In [7]:
# Display samples from the normalized dataset
show_random_elements(luganda["train"].remove_columns(["path", "audio"]))

Unnamed: 0,sentence
0,akade ko okateze kuvuga ssaawa mmeka
1,kaweefube w'okukomya omusujja gw'ensiri
2,lekera awo okumala ebiseera nga wennyonnyolako mu abalala
3,ekitabo kino kyali kya mugaso nnyo
4,omusajja avunaanyizibwa ku ndiisa y'abaana yatugaanye okulya nga bwe tunywa omulundi ogumu
5,mu kulunda embuzi ngasseeko n'okulima nsobole okufuna emiganyulo mingi mu by'enkola
6,endokwa z'emiti ennungi zirina okugabirwa amaka okusimbibwa
7,weetaaga okukola osobole okusomesa abaana bo
8,baasubwa ebigezo kubanga baalina ebbanja ly'ebisale by'essomero
9,ani yali alina okukoola ebijanjaalo bino
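The normalization step can be checked on a single sentence without loading the dataset. The sketch below applies the same pattern to one sentence from the table above and to a second, deliberately modified sentence (the added parentheses and curly quotes are made up for illustration). It suggests why characters such as `(`, `)`, `‘` and `’` can survive normalization: they are not listed in the pattern.

```python
import re

# Same pattern as in the notebook, written as a raw string
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"]'

def normalize(sentence):
    # Remove the listed punctuation, then lowercase
    return re.sub(chars_to_ignore_regex, '', sentence).lower()

normalized = normalize('"Endwadde zonna zigemebwa?"')
print(normalized)

# Characters outside the pattern are left untouched
leftover = normalize("(Ebibira) birina ‘emigaso’ mingi.")
print(leftover)
```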


In CTC, chunks of speech are classified into letters. We need to extract all distinct letters in the dataset and build a vocabulary.
We need a mapping function that concatenates all the transcriptions into one long transcription and then transforms the string into a set of characters.

In [8]:
# Use batched=True with batch_size=-1 so that the map function can access all the transcriptions at once
def extract_all_chars(batch):
  all_text = " ".join(batch["sentence"])
  vocab = list(set(all_text))
  return {"vocab": [vocab], "all_text": [all_text]}

vocabs = luganda.map(extract_all_chars, batched=True, batch_size=-1, keep_in_memory=True, remove_columns=luganda.column_names["train"])


In [9]:
# Create a vocabulary of all letters in the train and test splits
vocab_list = list(set(vocabs["train"]["vocab"][0]) | set(vocabs["test"]["vocab"][0]))

vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict

{'f': 0,
 '’': 1,
 'v': 2,
 "'": 3,
 'j': 4,
 'i': 5,
 'e': 6,
 'a': 7,
 '(': 8,
 'h': 9,
 'y': 10,
 't': 11,
 '‘': 12,
 'b': 13,
 'o': 14,
 'd': 15,
 'p': 16,
 'c': 17,
 'm': 18,
 'z': 19,
 'x': 20,
 ')': 21,
 'w': 22,
 'g': 23,
 'n': 24,
 's': 25,
 ' ': 26,
 'l': 27,
 'u': 28,
 'r': 29,
 'k': 30}
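The iteration order of a `set` of strings can differ between Python sessions, so the ids assigned above may change when the notebook is re-run. A minimal sketch, assuming it is acceptable to sort the characters before numbering them, that yields the same mapping every time (the two sentences are taken from the normalized samples shown earlier):

```python
sentences = [
    "ekitabo kino kyali kya mugaso nnyo",
    "ani yali alina okukoola ebijanjaalo bino",
]

# Collect every distinct character, then sort so the ids are reproducible
vocab_list = sorted(set(" ".join(sentences)))
vocab_dict = {char: idx for idx, char in enumerate(vocab_list)}
print(vocab_dict)
```

A stable ordering matters because the label ids stored in the preprocessed dataset are only meaningful together with the exact vocabulary that produced them.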

We need to replace the " " in the vocabulary with a more visible character. We also need to add an unknown token so that we can deal with characters not encountered in the training dataset.

In [10]:
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]

We need to add the pad token that corresponds to CTC's blank token. The blank token is a core component of the CTC algorithm.

In [11]:
vocab_dict["[UNK]"] = len(vocab_dict)
vocab_dict["[PAD]"] = len(vocab_dict)
print(len(vocab_dict))

33
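To see why the blank token matters, here is a minimal sketch of greedy CTC decoding on a made-up sequence of per-frame predictions: runs of identical tokens are merged first, then the blank (`[PAD]`) is dropped. The blank placed between two identical letters is what lets the model emit a genuine double letter such as the "nn" in "nnyo". The function name and the frame sequence are illustrative and not part of the notebook.

```python
from itertools import groupby

def ctc_greedy_collapse(frame_tokens, blank="[PAD]", word_delimiter="|"):
    # 1. Merge runs of identical consecutive tokens
    merged = [token for token, _ in groupby(frame_tokens)]
    # 2. Drop the blank token
    kept = [token for token in merged if token != blank]
    # 3. Turn the word delimiter back into a space
    return "".join(kept).replace(word_delimiter, " ")

frames = ["n", "n", "[PAD]", "n", "y", "y", "o", "[PAD]", "|", "b", "a", "a", "m", "b", "i"]
decoded = ctc_greedy_collapse(frames)
print(decoded)
```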


In [12]:
# Save the vocabulary to a json file
import json
with open('./vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

In [13]:
# Use the json file to instantiate an object of the Wav2Vec2CTCTokenizer class
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

In [14]:
# Push the tokenizer to the hub
repo_name = "luganda_wav2vec2_ctc_tokenizer_with_lm"
tokenizer.push_to_hub(repo_name)

CommitInfo(commit_url='https://huggingface.co/dmusingu/luganda_wav2vec2_ctc_tokenizer_with_lm/commit/614fe0f0a1165b20ad1b140a0658086fd1b1aa8f', commit_message='Upload tokenizer', commit_description='', oid='614fe0f0a1165b20ad1b140a0658086fd1b1aa8f', pr_url=None, pr_revision=None, pr_num=None)

### Create Wav2Vec2 Feature Extractor

In [15]:
# Create a feature extractor using Wav2Vec2FeatureExtractor. We shall pass feature size as 1 because we are dealing with raw audio files.
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=False)
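`do_normalize=True` rescales each waveform to zero mean and unit variance. A minimal pure-Python sketch of that idea follows; the epsilon value is an assumption, and the library's own implementation works on NumPy arrays rather than lists.

```python
import math

def zero_mean_unit_variance(samples, eps=1e-7):
    # eps is an assumed small constant that avoids division by zero on silent audio
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return [(s - mean) / math.sqrt(variance + eps) for s in samples]

normalized = zero_mean_unit_variance([0.0, 0.5, -0.5, 0.25, -0.25])
print(normalized)
```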

In [16]:
# Wrap the feature extractor and the tokenizer into a Wav2Vec2Processor
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

### Preprocess the dataset

In [17]:
from datasets import Audio

In [18]:
luganda = luganda.cast_column("audio", Audio(sampling_rate=16000))

In [19]:
# Display an audio sample from the dataset
luganda['train'][10]["audio"]

{'path': '/root/.cache/huggingface/datasets/downloads/extracted/f41fa3f1dcaf43ac5071b5022181b4a0d77f872115acbc3c8386d9568963751d/cv-corpus-7.0-2021-07-21/lg/clips/common_voice_lg_23722908.mp3',
 'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         1.87113613e-09, -7.67451491e-10,  0.00000000e+00]),
 'sampling_rate': 16000}

In [20]:
# Listen to sample audio from the dataset
import IPython.display as ipd
import numpy as np
import random

# random.randint includes both endpoints, so subtract 1 to stay within the valid index range
rand_int = random.randint(0, len(luganda["train"]) - 1)

print(luganda["train"][rand_int]["sentence"])
ipd.Audio(data=np.asarray(luganda["train"][rand_int]["audio"]["array"]), autoplay=True, rate=16000)

obuuma obuyamba ba kiggala okuwulira bukoze ebyewunyo eri ba kiggala


In [21]:
# random.randint includes both endpoints, so subtract 1 to stay within the valid index range
rand_int = random.randint(0, len(luganda["train"]) - 1)

print("Target text:", luganda["train"][rand_int]["sentence"])
print("Input array shape:", np.asarray(luganda["train"][rand_int]["audio"]["array"]).shape)
print("Sampling rate:", luganda["train"][rand_int]["audio"]["sampling_rate"])

Target text: osoma amawulire g'empapula
Input array shape: (48960,)
Sampling rate: 16000
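The shape and sampling rate printed above determine both the clip duration and, roughly, how many frames the model will classify. The sketch below works this out for the 48960-sample clip. The kernel sizes and strides are assumed to be the default Wav2Vec2 feature encoder settings; they can be checked against `model.config.conv_kernel` and `model.config.conv_stride` once the model is loaded.

```python
def conv_output_length(num_samples, kernels=(10, 3, 3, 3, 3, 2, 2), strides=(5, 2, 2, 2, 2, 2, 2)):
    # Apply the standard 1-D convolution length formula layer by layer
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length

num_samples = 48960
sampling_rate = 16000

duration_seconds = num_samples / sampling_rate
num_frames = conv_output_length(num_samples)
print(duration_seconds, num_frames)
```

Under these assumptions the clip lasts about 3.06 seconds and yields 152 frames, roughly one frame every 20 ms.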


In [22]:
# Extract the model inputs from the audio (already resampled to 16 kHz above, the rate the model was pretrained on) and encode the transcriptions as label ids
def prepare_dataset(batch):
    audio = batch["audio"]

    # batched output is "un-batched" to ensure mapping is correct
    batch["input_values"] = processor(audio["array"], sampling_rate=16000).input_values[0]

    with processor.as_target_processor():
        batch["labels"] = processor(batch["sentence"]).input_ids
    return batch

In [23]:
# Apply the map function to the dataset
luganda = luganda.map(prepare_dataset, remove_columns=luganda.column_names["train"], num_proc=4)


In [25]:
luganda.push_to_hub("mozilla_commonvoices_7_0_luganda")


CommitInfo(commit_url='https://huggingface.co/datasets/dmusingu/mozilla_commonvoices_7_0_luganda/commit/27fa9407c98028e07028e982f1d9f7324149f4e3', commit_message='Upload dataset', commit_description='', oid='27fa9407c98028e07028e982f1d9f7324149f4e3', pr_url=None, pr_revision=None, pr_num=None)

### N-gram Language Model

In [1]:
%pip install https://github.com/kpu/kenlm/archive/master.zip pyctcdecode

Collecting https://github.com/kpu/kenlm/archive/master.zip
  Downloading https://github.com/kpu/kenlm/archive/master.zip (553 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pyctcdecode
  Downloading pyctcdecode-0.5.0-py2.py3-none-any.whl (39 kB)
Collecting pygtrie<3.0,>=2.1 (from pyctcdecode)
  Downloading pygtrie-2.5.0-py3-none-any.whl (25 kB)
Collecting hypothesis<7,>=6.14 (from pyctcdecode)
  Downloading hypothesis-6.98.9-py3-none-any.whl (446 kB)
Building wheels for collected packages: kenlm
  Building wheel for kenlm (pyproject.toml) ... done
  Created wheel for kenlm: filename=kenlm-0.2

In [2]:
# Install KenLM library binaries
!sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
build-essential is already the newest version (12.9ubuntu3).
libboost-program-options-dev is already the newest version (1.74.0.3ubuntu7).
libboost-program-options-dev set to manually installed.
libboost-system-dev is already the newest version (1.74.0.3ubuntu7).
libboost-system-dev set to manually installed.
libboost-thread-dev is already the newest version (1.74.0.3ubuntu7).
libboost-thread-dev set to manually installed.
libbz2-dev is already the newest version (1.0.8-5build1).
libbz2-dev set to manually installed.
liblzma-dev is already the newest version (5.2.5-2ubuntu1).
liblzma-dev set to manually installed.
libboost-test-dev is already the newest version (1.74.0.3ubuntu7).
libboost-test-dev set to manually installed.
cmake is already the newest version (3.22.1-1ubuntu1.22.04.1).
zlib1g-dev is already the newest version (1:1.2.11.dfsg-2ubuntu9.2).
zlib1g-dev set to manually installed.

In [3]:
# Download and unpack the KenLM repo
!wget -O - https://kheafield.com/code/kenlm.tar.gz | tar xz

--2024-02-22 12:56:02--  https://kheafield.com/code/kenlm.tar.gz
Resolving kheafield.com (kheafield.com)... 35.196.63.85
Connecting to kheafield.com (kheafield.com)|35.196.63.85|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 491888 (480K) [application/x-gzip]
Saving to: ‘STDOUT’


2024-02-22 12:56:02 (2.79 MB/s) - written to stdout [491888/491888]



In [4]:
!mkdir kenlm/build && cd kenlm/build && cmake .. && make -j2
!ls kenlm/build/bin

  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.74.0/BoostConfig.cmake (found suitable version "1.74.0", minimum required is "1.41.0") found components: program_options system thread unit_test_framework 
-- Found Threads: TRUE  
-- Found ZLIB: /usr

In [26]:
print(luganda)

DatasetDict({
    train: Dataset({
        features: ['input_values', 'labels'],
        num_rows: 6626
    })
    test: Dataset({
        features: ['input_values', 'labels'],
        num_rows: 4276
    })
    validation: Dataset({
        features: ['input_values', 'labels'],
        num_rows: 3549
    })
    other: Dataset({
        features: ['input_values', 'labels'],
        num_rows: 29407
    })
    invalidated: Dataset({
        features: ['input_values', 'labels'],
        num_rows: 2195
    })
})


By default, KenLM computes an n-gram model with Kneser-Ney smoothing. All text data used to create the n-gram model is expected to be stored in a text file. We download our dataset and save it as a .txt file.

In [30]:
print(dataset['labels'])

[[24, 25, 7, 13, 7, 26, 14, 18, 16, 6, 6, 10, 14, 26, 14, 18, 28, 30, 5, 25, 7, 26, 14, 18, 28, 27, 7, 27, 7], [24, 6, 6, 13, 7, 19, 7, 26, 24, 24, 10, 14, 26, 25, 25, 6, 19, 7, 7, 27, 7, 26, 22, 7, 24, 23, 6, 26, 30, 28, 13, 7, 26, 10, 7, 24, 23, 7, 13, 5, 29, 7, 26, 6, 24, 11, 6], [11, 28, 13, 28, 28, 27, 5, 29, 6, 26, 18, 28, 26, 13, 28, 0, 28, 24, 19, 6, 26, 6, 13, 5, 30, 28, 30, 22, 7, 11, 7, 30, 14, 26, 6, 29, 7, 26, 14, 13, 5, 29, 7, 18, 13, 28, 27, 28, 27, 6], [7, 24, 5, 26, 10, 7, 27, 5, 26, 7, 18, 7, 24, 10, 5, 26, 24, 11, 5, 26, 24, 7, 24, 23, 6, 26, 24, 15, 5, 0, 28, 24, 7, 26, 14, 18, 28, 30, 7, 19, 5, 26, 14, 18, 28, 27, 28, 24, 23, 5, 26, 24, 23, 7, 26, 14, 24, 14], [6, 30, 5, 22, 14, 24, 2, 28, 26, 30, 5, 24, 14, 26, 30, 5, 29, 5, 18, 28, 26, 7, 18, 7, 19, 19, 5, 26, 7, 18, 7, 27, 28, 24, 23, 5, 26, 14, 30, 28, 22, 7, 26, 6, 24, 11, 6], [6, 18, 5, 29, 7, 24, 15, 5, 29, 7, 26, 23, 10, 7, 26, 18, 28, 22, 14, 23, 14, 26, 11, 6, 23, 5, 25, 14, 13, 14, 27, 7, 26, 30, 28, 2, 

In [28]:
username = "dmusingu"  # change to your username

dataset = load_dataset(f"{username}/mozilla_commonvoices_7_0_luganda", split="train")

# The "labels" column holds lists of token ids, not strings, so decode each one back to text before joining
sentences = [tokenizer.decode(ids, group_tokens=False) for ids in dataset["labels"]]

with open("text.txt", "w") as file:
  file.write(" ".join(sentences))

Run KenLM's `lmplz` command to build our n-gram model, called "5gram.arpa".

In [None]:
!kenlm/build/bin/lmplz -o 5 <"text.txt" > "5gram.arpa"

In [None]:
!head -20 5gram.arpa
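An n-gram model is estimated from counts of word sequences. The sketch below only counts bigrams in one sentence from the dataset to illustrate the raw statistic. It does not implement Kneser-Ney smoothing or the ARPA file format, both of which `lmplz` takes care of. The helper name `count_ngrams` is hypothetical.

```python
from collections import Counter

def count_ngrams(text, n):
    # Pad with sentence-boundary markers, as n-gram toolkits commonly do
    words = ["<s>"] + text.split() + ["</s>"]
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

corpus = "ekitabo kino kyali kya mugaso nnyo"
bigrams = count_ngrams(corpus, 2)
print(bigrams.most_common(3))
```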

In [None]:
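# Combine the n-gram model with Wav2Vec2.
# Download the currently "LM-less" processor of xls-r-300m-sv. Note that this is a Swedish checkpoint,
# so its tokenizer vocabulary is unlikely to match the Luganda vocabulary built above.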
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("hf-test/xls-r-300m-sv")

In [None]:
# we extract the vocabulary of its tokenizer as it represents the "labels" of pyctcdecode's BeamSearchDecoder class.
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}
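The sort by id matters because the decoder treats position i in the label list as the token for logit column i. A minimal sketch with a made-up miniature vocabulary (the real one comes from `processor.tokenizer.get_vocab()`):

```python
# Made-up miniature vocabulary for illustration only
vocab_dict = {"[PAD]": 4, "A": 1, "|": 0, "[UNK]": 3, "B": 2}

# Sort by token id so that position i in the label list matches logit column i
sorted_vocab_dict = {k.lower(): v for k, v in sorted(vocab_dict.items(), key=lambda item: item[1])}
labels = list(sorted_vocab_dict.keys())
print(labels)
```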

In [None]:
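# Build the decoder. This expects 5gram_correct.arpa, which is written by the "Add an end of sentence token" cell further down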
from pyctcdecode import build_ctcdecoder

decoder = build_ctcdecoder(
    labels=list(sorted_vocab_dict.keys()),
    kenlm_model_path="5gram_correct.arpa",
)

In [None]:
# wrap the just created decoder, together with the processor's tokenizer and feature_extractor into a Wav2Vec2ProcessorWithLM class.
from transformers import Wav2Vec2ProcessorWithLM

processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder
)

# Add an end of sentence token

In [None]:
with open("5gram.arpa", "r") as read_file, open("5gram_correct.arpa", "w") as write_file:
  has_added_eos = False
  for line in read_file:
    if not has_added_eos and "ngram 1=" in line:
      count=line.strip().split("=")[-1]
      write_file.write(line.replace(f"{count}", f"{int(count)+1}"))
    elif not has_added_eos and "<s>" in line:
      write_file.write(line)
      write_file.write(line.replace("<s>", "</s>"))
      has_added_eos = True
    else:
      write_file.write(line)

In [None]:
# Inspect the corrected arpa file
!head -20 5gram_correct.arpa
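The effect of the end-of-sentence fix can be checked on a tiny in-memory example before running it on the real file. The sketch below mirrors the same loop using `io.StringIO`. The ARPA content is made up and heavily shortened, and the function name `add_eos_token` is hypothetical.

```python
import io

def add_eos_token(arpa_text):
    # Mirror the file-based loop: bump the unigram count and add </s> right after <s>
    out = io.StringIO()
    has_added_eos = False
    for line in io.StringIO(arpa_text):
        if not has_added_eos and "ngram 1=" in line:
            count = line.strip().split("=")[-1]
            out.write(line.replace(count, str(int(count) + 1)))
        elif not has_added_eos and "<s>" in line:
            out.write(line)
            out.write(line.replace("<s>", "</s>"))
            has_added_eos = True
        else:
            out.write(line)
    return out.getvalue()

# Made-up, heavily shortened ARPA content for illustration only
sample_arpa = "\n".join([
    "\\data\\",
    "ngram 1=3",
    "",
    "\\1-grams:",
    "-1.0\t<unk>\t0",
    "0\t<s>\t-0.5",
    "-0.7\tnnyo\t-0.1",
    "",
])
fixed = add_eos_token(sample_arpa)
print(fixed)
```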

### Training and Evaluation

In [None]:
# Set up the trainer
import torch

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (:obj:`int`, `optional`):
            Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
        max_length_labels (:obj:`int`, `optional`):
            Maximum length of the ``labels`` returned list and optionally padding length (see above).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.5 (Volta).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need
        # different padding methods
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(
                label_features,
                padding=self.padding,
                max_length=self.max_length_labels,
                pad_to_multiple_of=self.pad_to_multiple_of_labels,
                return_tensors="pt",
            )

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch

In [None]:
# Initialize the data_collator
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
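The key step in the collator is replacing label padding with -100 so that padded positions are ignored by the loss. A minimal pure-Python sketch of that step, assuming a pad id of 0 for illustration (the collator itself works on tensors and uses the tokenizer's attention mask):

```python
def pad_labels(label_sequences, pad_id=0, ignore_index=-100):
    # Pad every sequence to the longest one, then mark the padding positions
    max_len = max(len(seq) for seq in label_sequences)
    padded, masked = [], []
    for seq in label_sequences:
        num_pad = max_len - len(seq)
        padded.append(seq + [pad_id] * num_pad)
        masked.append(seq + [ignore_index] * num_pad)
    return padded, masked

padded, masked = pad_labels([[24, 25, 7], [11, 28, 13, 28, 28]])
print(masked)
```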

In [None]:
# Load the metric
wer_metric = load_metric("wer")

  wer_metric = load_metric("wer")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [None]:
def compute_metrics(pred):
    pred_logits = pred.predictions
    pred_ids = np.argmax(pred_logits, axis=-1)

    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    # we do not want to group tokens when computing the metrics
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    wer = wer_metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
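Word error rate is the word-level edit distance between the prediction and the reference, divided by the number of reference words. The sketch below computes it for a single sentence pair. The hypothesis is made up, and the `wer` metric used above aggregates over the whole set of sentences rather than a single pair.

```python
def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming, one row at a time
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(ref_words)

reference = "ekitabo kino kyali kya mugaso nnyo"
hypothesis = "ekitabo kino kyali mugaso nyo"  # made-up prediction: one deletion, one substitution
sentence_wer = word_error_rate(reference, hypothesis)
print(sentence_wer)
```

Here two errors over six reference words give a WER of one third.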

In [None]:
# Load the pretrained Wav2Vec2 checkpoint. We use the tokenizer's pad token id to define the model's pad token id
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer)
)

  return self.fget.__get__(instance, owner)()
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['lm_head.bias', 'lm_head.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Freeze the feature extractor
model.freeze_feature_extractor()



In [None]:
!pip install transformers[torch]

In [None]:
# Define the training arguments
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir=repo_name,
  group_by_length=True,
  per_device_train_batch_size=32,
  evaluation_strategy="steps",
  num_train_epochs=30,
  fp16=True,
  gradient_checkpointing=True,
  save_steps=500,
  eval_steps=500,
  logging_steps=500,
  learning_rate=1e-4,
  weight_decay=0.005,
  warmup_steps=1000,
  save_total_limit=2
)
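With `warmup_steps=1000` and the Trainer's default linear scheduler (assumed here, since `lr_scheduler_type` is not set), the learning rate rises linearly to `1e-4` and then decays linearly to zero. A minimal sketch of that shape follows; `total_steps=6000` is a made-up value, since the real total depends on dataset size, batch size, and number of epochs.

```python
def linear_schedule_lr(step, base_lr=1e-4, warmup_steps=1000, total_steps=6000):
    # total_steps is a made-up value for illustration
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)

for step in (0, 500, 1000, 3500, 6000):
    print(step, linear_schedule_lr(step))
```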

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=luganda["train"],
    eval_dataset=luganda["test"],
    tokenizer=processor.feature_extractor,
)

In [None]:
# Train the model
trainer.train()



| Step | Training Loss | Validation Loss | WER |
|---|---|---|---|
| 500 | 4.1365 | 1.959844 | 1.0 |
| 1000 | 0.5695 | 0.585338 | 0.732862 |
| 1500 | 0.176 | 0.538128 | 0.67472 |
| 2000 | 0.0845 | 0.512766 | 0.626989 |
| 2500 | 0.0424 | 0.46507 | 0.601355 |
| 3000 | 0.0127 | 0.539471 | 0.604891 |
| 3500 | -0.0063 | 0.516856 | 0.584168 |
| 4000 | -0.0212 | 0.499033 | 0.583284 |
| 4500 | -0.0336 | 0.531802 | 0.568029 |
| 5000 | -0.0424 | 0.546537 | 0.570156 |


In [None]:
# Push the model to hub
trainer.push_to_hub(repo_name)


CommitInfo(commit_url='https://huggingface.co/dmusingu/luganda_wav2vec2_ctc_tokenizer/commit/841b4b6b64d379592f1cb8f9a059ab6139b30c28', commit_message='luganda_wav2vec2_ctc_tokenizer', commit_description='', oid='841b4b6b64d379592f1cb8f9a059ab6139b30c28', pr_url=None, pr_revision=None, pr_num=None)

### Evaluation

In [None]:
processor = Wav2Vec2Processor.from_pretrained(repo_name)
# model = Wav2Vec2ForCTC.from_pretrained(repo_name)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

In [None]:
# Evaluation is carried out with a batch size of 1
def map_to_result(batch):
  with torch.no_grad():
    # Use the device selected above instead of hard-coding "cuda"
    input_values = torch.tensor(batch["input_values"], device=device).unsqueeze(0)
    logits = model(input_values).logits

  pred_ids = torch.argmax(logits, dim=-1)
  batch["pred_str"] = processor.batch_decode(pred_ids)[0]
  batch["text"] = processor.decode(batch["labels"], group_tokens=False)

  return batch

results = luganda["test"].map(map_to_result, remove_columns=luganda["test"].column_names)


In [None]:
print("Test WER: {:.3f}".format(wer_metric.compute(predictions=results["pred_str"], references=results["text"])))

Test WER: 0.475


In [None]:
# Check the errors made by the model
show_random_elements(results)

Unnamed: 0,pred_str,text
0,abazadde b'aba[UNK]izi be baagobaku ssomero bakkiriza okukozesa emeeza abaana ze baa[UNK]onoona,abazadde b'aba[UNK]izi be baagoba ku ssomero bakkirizza okukozesa emmeeza abaana ze ba[UNK]onoona
1,akaakanu kann[UNK]umidde nn[UNK]o bambi,ako akanu kakun[UNK]umidde nn[UNK]o bambi
2,abakulembeze ab'enjawulo beeab[UNK]e mu musomo,abakulembeze ab'enjawulo beeab[UNK]e mu musomo
3,im[UNK]obulwanga [UNK]amukui we esinga mu nabanda,iimu [UNK]a proline [UNK]'emu ku iimu ezisinga mu uganda
4,buzi bu ki obuba mu kufumbo obw'ekio,buzibu ki obuva mu bufumbo bw'ekio
5,omukulembeze omulungi [UNK]'o[UNK]o aegeera ebizibu b[UNK]abo baakulembera,omukulembeze omulungi [UNK]'o[UNK]o aegeera ebizibu b[UNK]'abo b'akulembera
6,maama wange ama[UNK]i okuboobeza emmere,maama wange aman[UNK]i okuboobeza emmere
7,ensi ezimu ze ugenda okukoleramu olwa oa [UNK]annungi mu ndabika na[UNK]e ebikole bwa[UNK]o b[UNK]a abpu,ensi ezimu ze ugenda okukoleramu obwa [UNK]aa[UNK]a nnungi mu ndabika na[UNK]e ebikolebwa[UNK]o b[UNK]a abu
8,eeampu zi realaba nu okuvukumi bifobira bwe erewagenda muolokippe ebi[UNK]emweddreeme,enalo zireeera abanu okuva mu bifo b[UNK]abwe ne bagenda mu bifo ebirimu eddembe
9,omuamiivu abeera nga e [UNK]eeba se na[UNK]e ngaai ngeera bulungi nn[UNK]a,omuamiivu abeera nga e[UNK]eebase na[UNK]e nga aegeera bulungi nn[UNK]o


From the output above we can make the following observations:
1. The model outputs [UNK] in place of the y's in the input. The same substitution appears in the reference transcriptions, which means that there is an error in the tokenization process.
2. The Wav2Vec2 model was pretrained on English, which has a different morphology from Luganda, and this could be one of the causes of the high WER on the test set. More experiments need to be carried out to determine whether this is the case.
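One way to investigate the first observation is to compare the characters in the transcriptions against the tokenizer's vocabulary. The sketch below uses a made-up vocabulary that deliberately lacks "y" to mimic the symptom. With the real objects, one would pass `processor.tokenizer.get_vocab()` and the normalized sentences instead. The helper name `find_unknown_characters` is hypothetical.

```python
def find_unknown_characters(sentences, vocab):
    # Characters that are absent from the vocabulary and would map to [UNK]
    known = set(vocab) | {" "}  # the space is stored as "|" in the vocabulary
    return sorted({char for sentence in sentences for char in sentence if char not in known})

# Made-up vocabulary that is deliberately missing "y"
toy_vocab = {"a": 0, "b": 1, "i": 2, "z": 3, "|": 4, "[UNK]": 5, "[PAD]": 6}
missing = find_unknown_characters(["abayizi", "baba"], toy_vocab)
print(missing)
```

An empty result for the real vocabulary would point away from missing characters and toward a mismatch between the vocabulary used for encoding and the one used for decoding.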