<a href="https://colab.research.google.com/github/Marconi-Lab/Swahili_ASR_Model/blob/main/Fine_tuning_XLS_R_Wav2Vec2_with_Swahili_corpus_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we take a pre-trained model from Hugging Face, [Wav2Vec2-XLS-R-300M](https://huggingface.co/facebook/wav2vec2-xls-r-300m), and fine-tune it on [Swahili data](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) from Mozilla Common Voice, hosted on the Hugging Face datasets platform.

(We will also try other models: [Wav2Vec2](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/), [XLSR-Wav2Vec2](https://ai.facebook.com/blog/-xlm-r-state-of-the-art-cross-lingual-understanding-through-self-supervision/), [Whisper](https://huggingface.co/openai/whisper-large-v2), [Wav2Vec2-XLS-R-1B](https://huggingface.co/facebook/wav2vec2-xls-r-1b), and [Wav2Vec2-XLS-R-2B](https://huggingface.co/facebook/wav2vec2-xls-r-2b).)
 
 



## Install all the requirements

In [None]:
!nvidia-smi
!pip install datasets
!pip install transformers==4.27.0
!pip install torchaudio==1.13.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html # Install a torchaudio build with CUDA 11.3 support for NVIDIA GPUs.
!pip install jiwer

Mon Mar 20 13:10:47 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P0    28W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

We use the `notebook_login()` function to enter an access token, which authenticates us to the Hugging Face Hub so that we can download datasets and models and save our checkpoints during training. The Git Large File Storage (LFS) package helps us upload our model checkpoints:

In [None]:
from huggingface_hub import notebook_login
notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Install Git LFS to upload the model checkpoints:

In [None]:
%%capture
!apt install git-lfs

## Data

In this stage, we download the Common Voice data and then prepare it for fine-tuning one of the pre-trained models mentioned at the beginning. (Downloading 15% of the training set and 3% of the test set took about 17 minutes at roughly 6 Mbps download and 0.08 Mbps upload; the cell below loads 14% of the training split and 1% of the test split.)

In [None]:
from datasets import load_dataset

training_data = load_dataset("mozilla-foundation/common_voice_11_0", "sw", split="train[:14%]")
testing_data = load_dataset("mozilla-foundation/common_voice_11_0", "sw", split="test[:1%]")


Dataset common_voice_11_0 downloaded and prepared to /root/.cache/huggingface/datasets/mozilla-foundation___common_voice_11_0/sw/11.0.0/2c65b95d99ca879b1b1074ea197b65e0497848fd697fdb0582e0f6b75b6f4da0. Subsequent calls will reuse this data.




We now have 3726 sentences for training, up from 2660 sentences.

In [None]:
training_data

Dataset({
    features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
    num_rows: 3726
})

In [None]:
testing_data

Dataset({
    features: ['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment'],
    num_rows: 102
})

###  We observe that:

For the training data, we have 11 columns: `['client_id', 'path', 'audio', 'sentence', 'up_votes', 'down_votes', 'age', 'gender', 'accent', 'locale', 'segment']`, and `3726` rows.

For the testing data, we have the same columns but only `102` rows.
    

###  Let's explore the data:

1. Remove the columns we do not need from both the training and testing sets
2. Then output 10 random sentences from the training set

In [None]:
# keep only path, audio and sentence, the only columns the model requires for training
training_data = training_data.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "segment", "up_votes"])
testing_data = testing_data.remove_columns(["accent", "age", "client_id", "down_votes", "gender", "locale", "segment", "up_votes"])

In [None]:
# we only have the path, audio and sentence as we expected
print(training_data)
print(testing_data)

Dataset({
    features: ['path', 'audio', 'sentence'],
    num_rows: 3726
})
Dataset({
    features: ['path', 'audio', 'sentence'],
    num_rows: 102
})


In [None]:
# we now display 10 random sentences from our two datasets
import random
import pandas as pd
from IPython.display import display, HTML

# this function receives a dataset and displays `num_examples` random rows
def display_random_elements(dataset, num_examples=10):

    # make sure the dataset has at least `num_examples` rows
    if num_examples > len(dataset):
        raise ValueError("Can't pick more elements than there are in the dataset.")

    # unique, randomly selected indices in the range 0 .. len(dataset) - 1
    random_indices = random.sample(range(len(dataset)), num_examples)

    # convert the picked examples into a pandas DataFrame and display it
    df = pd.DataFrame(dataset[random_indices])
    display(HTML(df.to_html()))

In [None]:
# lets see any 10 examples on the train set
display_random_elements(training_data.remove_columns(["path", "audio"]), num_examples=10)

Unnamed: 0,sentence
0,alirejea kutoka Zanzibar elfu moja mia tisa sitini na tatu mwaka ambao visiwa hivyo
1,Huu ni miongoni mwa miji ya zamani sana nchini Uswidi.
2,Hivyo aliweza kumpa mwanawe elimu nzuri.
3,nakukuona enzini
4,kumekuwa na hadithi nyingi juu ya eneo hilo ambazo nyingi ni za kutisha
5,Bidhaa za kila aina hupatikana hapo.
6,Kidogo huku vita vya malumbano ya Ardhi kati nchi hizo mbili
7,Tatu kati ya hizo ni kwa kazi yake ya utayarishaji.
8,kila nchi duniani inakuwa na vivutio vyake vya utaliii
9,Eneo lililofanya biashara hasa ya watumwa na pembe za Ndovu


In [None]:
#lets also see 10 from test set
display_random_elements(testing_data.remove_columns(["path", "audio"]), num_examples=10)

Unnamed: 0,sentence
0,Baada ya wapelelezi kupeleka tarifa na ramani
1,"""Imani ni kuamini, huruma ni huruma"""
2,Kinywa ni jumba la maneno
3,Mwidhi aliyejepa simu amekatiwa kifungo cha mwaka mmoya
4,Aliamka mapema sana jana
5,Serikali za Tanzania na Kenya hutunza idadi kadhaa ya maeneo yaliyotengwa
6,Kuku wanga kanaata mayai maili
7,Ni kati ya miji ya kwanza iliyoundwa na Wahispania katika Amerika Kusini.
8,Sote twesangaa twelipomuona mwalimu Ali apika
9,Serikali ilipatana na upande mmoja na upande mwingine


Let's extract all distinct letters of the training and test data and build our vocabulary from this set of letters.

In [None]:
# take a batch of sentences
def extract_all_chars(batch):

  # Concatenates all the sentences in the batch into a single string, separating each sentence with a space character
  all_text = " ".join(batch["sentence"])

  # keep each distinct character only once
  vocab = list(set(all_text))

  # we then return a list of unique characters
  return {"vocab": [vocab], "all_text": [all_text]}

In [None]:
# we use map function to both the entire training and testing data
training_vocabulary = training_data.map(extract_all_chars, batched=True, batch_size=-1, remove_columns= training_data.column_names )
testing_vocabulary = testing_data.map(extract_all_chars, batched=True, batch_size=-1, remove_columns = testing_data.column_names)

Map:   0%|          | 0/3726 [00:00<?, ? examples/s]

Map:   0%|          | 0/102 [00:00<?, ? examples/s]

In [None]:
print(training_vocabulary)
print(testing_vocabulary)

Dataset({
    features: ['vocab', 'all_text'],
    num_rows: 1
})
Dataset({
    features: ['vocab', 'all_text'],
    num_rows: 1
})


We will now create a list of all the unique characters found in the training and test datasets, and then create a dictionary in which each unique character is assigned a numerical index.

In [None]:
# we create a new list of unique elements from the training and testing vocabulary list
vocabulary_list = list(set(training_vocabulary["vocab"][0]) | set(testing_vocabulary["vocab"][0]))

# then we map each unique character to an integer index (its position in sorted order)
vocabulary_dict = {cha: i for i, cha in enumerate(sorted(vocabulary_list))}
vocabulary_dict


{' ': 0,
 '!': 1,
 '"': 2,
 "'": 3,
 '*': 4,
 ',': 5,
 '-': 6,
 '.': 7,
 '/': 8,
 ':': 9,
 ';': 10,
 '?': 11,
 'A': 12,
 'B': 13,
 'C': 14,
 'D': 15,
 'E': 16,
 'F': 17,
 'G': 18,
 'H': 19,
 'I': 20,
 'J': 21,
 'K': 22,
 'L': 23,
 'M': 24,
 'N': 25,
 'O': 26,
 'P': 27,
 'Q': 28,
 'R': 29,
 'S': 30,
 'T': 31,
 'U': 32,
 'V': 33,
 'W': 34,
 'X': 35,
 'Y': 36,
 'Z': 37,
 'a': 38,
 'b': 39,
 'c': 40,
 'd': 41,
 'e': 42,
 'f': 43,
 'g': 44,
 'h': 45,
 'i': 46,
 'j': 47,
 'k': 48,
 'l': 49,
 'm': 50,
 'n': 51,
 'o': 52,
 'p': 53,
 'q': 54,
 'r': 55,
 's': 56,
 't': 57,
 'u': 58,
 'v': 59,
 'w': 60,
 'x': 61,
 'y': 62,
 'z': 63,
 'á': 64,
 'â': 65,
 'é': 66,
 'ó': 67,
 'ː': 68,
 '‘': 69,
 '’': 70}

From the output above, we see that we have 71 characters, including special characters and both capital and small letters.

We can remove special characters that do not change the pronunciation of words, such as ":" and ".".

We also don't want the model to think that "R" and "r" are pronounced differently, so we convert all the characters to lower case.

In [None]:
import re
# raw string, so the backslashes reach the regex engine without invalid-escape warnings
chars_to_remove_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\?\'\…\•\°\(\)\=\*\/\`\ː\’]'

def remove_special_characters(batch):
    batch["sentence"] = re.sub(chars_to_remove_regex, '', batch["sentence"]).lower()
    return batch

In [None]:
# we map the "remove_special_characters" function over both the training and testing data

training_data = training_data.map(remove_special_characters)
testing_data = testing_data.map(remove_special_characters)


Map:   0%|          | 0/3726 [00:00<?, ? examples/s]

Map:   0%|          | 0/102 [00:00<?, ? examples/s]

Let's replace the accented ("hatted") characters. Swahili does not really use characters such as "ū", "ó", "á", etc., so we assume the intended letters were "u", "o", "a", etc., and convert every character carrying such a mark into its closest plain substitute.

In [None]:
def replace_hatted_characters(batch):
    batch["sentence"] = re.sub('[á]', 'a', batch["sentence"])
    batch["sentence"] = re.sub('[â]', 'a', batch["sentence"])
    batch["sentence"] = re.sub('[é]', 'e', batch["sentence"])
    batch["sentence"] = re.sub('[ó]', 'o', batch["sentence"])
    return batch


In [None]:
training_data = training_data.map(replace_hatted_characters)
testing_data = testing_data.map(replace_hatted_characters)

Map:   0%|          | 0/3726 [00:00<?, ? examples/s]

Map:   0%|          | 0/102 [00:00<?, ? examples/s]
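The four substitutions above cover exactly the accented letters found in this subset. As a hedged alternative, here is a minimal sketch that strips accents generically using Unicode decomposition from the standard library. The helper name `strip_accents` is hypothetical and is not part of the notebook's pipeline.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Drop combining marks so accented letters map to their base letter."""
    # NFD splits a letter such as 'á' into 'a' plus a combining accent,
    # and combining accents have the Unicode category 'Mn'.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("áâéó"))  # aaeo
```

This approach would also handle accented letters that only appear once a larger share of the dataset is loaded, but it should be checked against the vocabulary before replacing the explicit substitutions.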

Let's re-run the function that extracts the characters from both the training and testing sets, then run the code block below to build a new vocabulary list. We check whether the special characters have been removed and whether all the letters are lower case.

In [None]:
# re-running the extract_all character function to see whether we have successfully replaced the hatted characters and removed the punctuation marks.
training_vocabulary = training_data.map(extract_all_chars, batched=True, batch_size=-1, remove_columns= training_data.column_names )
testing_vocabulary = testing_data.map(extract_all_chars, batched=True, batch_size=-1, remove_columns = testing_data.column_names)

Map:   0%|          | 0/3726 [00:00<?, ? examples/s]

Map:   0%|          | 0/102 [00:00<?, ? examples/s]

Let's generate the vocabulary list again

In [None]:
# we create a new list of unique elements from the training and testing vocabulary list
vocabulary_list = list(set(training_vocabulary["vocab"][0]) | set(testing_vocabulary["vocab"][0]))

# then we map each unique character to an integer index (its position in sorted order)
vocabulary_dict = {cha: i for i, cha in enumerate(sorted(vocabulary_list))}
vocabulary_dict

{' ': 0,
 'a': 1,
 'b': 2,
 'c': 3,
 'd': 4,
 'e': 5,
 'f': 6,
 'g': 7,
 'h': 8,
 'i': 9,
 'j': 10,
 'k': 11,
 'l': 12,
 'm': 13,
 'n': 14,
 'o': 15,
 'p': 16,
 'q': 17,
 'r': 18,
 's': 19,
 't': 20,
 'u': 21,
 'v': 22,
 'w': 23,
 'x': 24,
 'y': 25,
 'z': 26}

We can see that we now have only the lower-case alphabet and the space (" "). These are the only characters the model needs to learn.

We now replace the space (" ") with the more visible token "|", which the CTC tokenizer uses as its word delimiter.

We then add only the unknown token and the padding token; the padding token also serves as the blank token of the Connectionist Temporal Classification (CTC) algorithm.

In [None]:
vocabulary_dict["|"] = vocabulary_dict[" "]
del vocabulary_dict[" "]

In [None]:
vocabulary_dict["[UNK]"] = len(vocabulary_dict)
vocabulary_dict["[PAD]"] = len(vocabulary_dict)
len(vocabulary_dict)

29
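To illustrate why the "|" and "[PAD]" tokens matter, here is a simplified sketch of greedy CTC decoding over this 29-entry vocabulary. It is an illustration under stated assumptions, not the tokenizer's actual implementation: the function name `ctc_greedy_decode` is hypothetical, and the frame ids are hand-written rather than model output.

```python
import string

# Rebuild the 29-entry vocabulary: "|" at 0, a-z at 1..26, then [UNK] and [PAD].
vocab = {"|": 0}
vocab.update({ch: i for i, ch in enumerate(string.ascii_lowercase, start=1)})
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)
id_to_token = {i: tok for tok, i in vocab.items()}

def ctc_greedy_decode(frame_ids):
    """Collapse repeated ids, drop the blank ([PAD]), and turn '|' into spaces."""
    tokens = []
    previous = None
    for idx in frame_ids:
        # A token is emitted only when it differs from the previous frame
        # and is not the blank token.
        if idx != previous and idx != vocab["[PAD]"]:
            tokens.append(id_to_token[idx])
        previous = idx
    return "".join(tokens).replace("|", " ")

# Hand-written frames for "saa na": the blank between the two "a" frames
# is what allows a doubled letter to survive the collapse step.
frames = [vocab[t] for t in ["s", "s", "a", "[PAD]", "a", "|", "n", "a"]]
print(ctc_greedy_decode(frames))  # saa na
```

Without the blank token, the two consecutive "a" frames would be merged into one, which is why the vocabulary needs it even though it never appears in the transcripts.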

We now save the vocabulary to a file named `vocab.json`.

In [None]:
import json
with open('vocab.json', 'w') as vocab_file:
    json.dump(vocabulary_dict, vocab_file)

In [None]:
# lets see how the training set looks like now
display_random_elements(training_data.remove_columns(["path", "audio"]), num_examples=5)


# and lets also see how the testing set looks like now
display_random_elements(testing_data.remove_columns(["path", "audio"]), num_examples=5)

Unnamed: 0,sentence
0,ndiyo sababu mkatoliki hawezi kufunga ndoa bila ya kuhusisha kanisa na kufuata taratibu zake
1,ya chati ya joto na utabiri juu
2,wazee hawa walipokuwa wakikaa na kuamua tanzania mambo yakanyooka
3,kupitia katika bahari hii ya atlantik enzi hizo
4,mengi yalishuhudiwa katika uchaguzi mkuu wa nchini tanzania


Unnamed: 0,sentence
0,kuku wanga kanaata mayai maili
1,laki tisa arobaini na tano elfu na themanini na saba
2,makao makuu yalikuwa arraqqah syria
3,anasema baada ya hapo ulikuja utawala wa mahdali na ujenzi uliimarishwa zaidi
4,maamoun pia ni mwanachama wa the academy of the arts of the world


We load the vocabulary into the [Wav2Vec2 CTC tokenizer](https://huggingface.co/docs/transformers/v4.26.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizer).

In [None]:
from transformers import Wav2Vec2CTCTokenizer

tokenizer = Wav2Vec2CTCTokenizer("/content/vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")

Let's push the tokenizer to a Hugging Face repository

In [None]:
hugging_face_repo = "AntonyG/fine-tune-wav2vec2-large-xls-r-1b-sw"

In [None]:
tokenizer.push_to_hub(hugging_face_repo)


CommitInfo(commit_url='https://huggingface.co/AntonyG/fine-tune-wav2vec2-large-xls-r-1b-sw/commit/76d61cba8ec90a1f745f7d31cd60f310b286b750', commit_message='Upload tokenizer', commit_description='', oid='76d61cba8ec90a1f745f7d31cd60f310b286b750', pr_url=None, pr_revision=None, pr_num=None)

We then create the [wav2vec feature extractor](https://huggingface.co/docs/transformers/v4.26.1/en/model_doc/wav2vec2#transformers.Wav2Vec2FeatureExtractor),
which converts the raw audio into the normalized input values that are fed to the model.

In [None]:
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, padding_value=0.0, do_normalize=True, return_attention_mask=True)

Wav2Vec2 processor wraps a Wav2Vec2 feature extractor and a Wav2Vec2 CTC tokenizer into a single processor. [Wav2Vec2Processor](https://huggingface.co/docs/transformers/v4.26.1/en/model_doc/wav2vec2#transformers.Wav2Vec2Processor) offers all the functionalities of Wav2Vec2FeatureExtractor and PreTrainedTokenizer. 

In [None]:
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

### Audio data
- loading the Swahili audio data
- resampling and preprocessing it to match the data used to pre-train the Wav2Vec2 model

In [None]:
# load the path of the first audio file of our training data
print(training_data[0]["path"])


# lets see the sampling rate and other characteristics of that audio
print(training_data[0]["audio"])

/root/.cache/huggingface/datasets/downloads/extracted/d0c515954317076cb4654c80caae844d8922490cb28ecb53f2df4b52ab7baa52/common_voice_sw_28660554.mp3
{'path': '/root/.cache/huggingface/datasets/downloads/extracted/d0c515954317076cb4654c80caae844d8922490cb28ecb53f2df4b52ab7baa52/common_voice_sw_28660554.mp3', 'array': array([ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
        1.2611529e-06, -3.0917859e-06, -4.2307597e-06], dtype=float32), 'sampling_rate': 48000}


We notice that the audio is sampled at 48 kHz, so we resample it to 16 kHz, the sampling rate the wav2vec model was trained on.

In [None]:
#we use the Audio object from hugging face datasets function
from datasets import Audio

#we resample the training and testing data to 16KHz as that is the sampling rate used to train the model
training_data = training_data.cast_column("audio", Audio(sampling_rate=16000))
testing_data = testing_data.cast_column("audio", Audio(sampling_rate=16000))

In [None]:
# let's look at the first example of each set

print(training_data[0]["audio"])
print(testing_data[0]["audio"])

{'path': '/root/.cache/huggingface/datasets/downloads/extracted/d0c515954317076cb4654c80caae844d8922490cb28ecb53f2df4b52ab7baa52/common_voice_sw_28660554.mp3', 'array': array([ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00, ...,
        3.9681845e-06, -4.1867329e-06, -2.1413080e-06], dtype=float32), 'sampling_rate': 16000}
{'path': '/root/.cache/huggingface/datasets/downloads/extracted/51be8e6185f5507509f311a8e86535649a659f2e388d898f86a8a45ad84306fd/common_voice_sw_31428161.mp3', 'array': array([0., 0., 0., ..., 0., 0., 0.], dtype=float32), 'sampling_rate': 16000}


We can listen to a random clip

In [None]:
import IPython.display as ipd
import numpy as np
import random

# Choose a random integer between 0 and the number of examples in the dataset (exclusive)
rand_index = random.randint(0, len(training_data) - 1)

# Print the sentence at the chosen random index from the dataset
sentence = training_data[rand_index]["sentence"]
print(sentence)

# Play the audio at the chosen random index from the dataset
audio = training_data[rand_index]["audio"]["array"]
ipd.Audio(data=audio, autoplay=True, rate=16000)


sababu hizo hazina mashiko kutokana na kuwa wanahistoria hawajaeleza


We want to generate random sentences and their transcripts and the shape of their arrays of the training data.

In [None]:
rand_int = random.randint(0, len(training_data)-1)

print("Target text:", training_data[rand_int]["sentence"])
print("Input array shape:", training_data[rand_int]["audio"]["array"].shape)
print("Sampling rate:", training_data[rand_int]["audio"]["sampling_rate"])

Target text: jambo ambalo wakati mwingine husababisha migogoro hasa ya mipaka
Input array shape: (78336,)
Sampling rate: 16000


The data is a 1-dimensional array, the sampling rate is 16 kHz, and the target text is normalized (lower-cased, with punctuation removed and the spaces between words kept).
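As a quick sanity check on these numbers, the clip duration follows directly from the array length and the sampling rate. The sketch below uses the shape printed above; the sample count at 48 kHz is an estimate, since the resampler may round differently.

```python
num_samples = 78336      # "Input array shape" printed above
target_rate = 16_000     # sampling rate after cast_column
source_rate = 48_000     # original Common Voice sampling rate

# Duration in seconds = number of samples / samples per second.
duration_s = num_samples / target_rate

# The same clip holds roughly three times as many samples at 48 kHz.
approx_source_samples = round(duration_s * source_rate)

print(duration_s)             # 4.896
print(approx_source_samples)  # 235008
```

Resampling therefore cuts the input length by a factor of three, which reduces the memory the model needs per clip.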


Going forward, we complete the data preparation by using the Wav2Vec2 processor to transform the data for training.

In [None]:
def process_dataset(batch):
    # Retrieve the decoded audio (array and sampling rate) from the example
    audio = batch["audio"]

    # Normalize the waveform with the processor's feature extractor
    # and store the resulting input values
    batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    # Store the number of samples, used by the optional length filter
    batch["input_length"] = len(batch["input_values"])

    # Switch the processor to its tokenizer to encode the target text
    with processor.as_target_processor():
        # Tokenize the sentence and store the token ids as labels
        batch["labels"] = processor(batch["sentence"]).input_ids

    # Return the modified example
    return batch


The `process_dataset` function is applied to all examples in the training and testing sets. The Wav2Vec2Processor only normalizes the data.

In [None]:
training_data = training_data.map(process_dataset, remove_columns = training_data.column_names)
testing_data = testing_data.map(process_dataset, remove_columns = testing_data.column_names)

Map:   0%|          | 0/3726 [00:00<?, ? examples/s]



Map:   0%|          | 0/102 [00:00<?, ? examples/s]
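To make "normalizes" concrete, here is a minimal pure-Python sketch of zero-mean, unit-variance normalization, which is what `do_normalize=True` requests from the feature extractor. The function name and the small `eps` constant are assumptions for illustration; the library's own implementation may differ in detail.

```python
import math

def zero_mean_unit_variance(samples, eps=1e-7):
    """Shift a waveform to mean 0 and scale it to variance close to 1."""
    mean = sum(samples) / len(samples)
    variance = sum((x - mean) ** 2 for x in samples) / len(samples)
    # eps (an assumed constant) guards against division by zero on silent clips.
    return [(x - mean) / math.sqrt(variance + eps) for x in samples]

waveform = [0.0, 0.5, -0.5, 0.25]
normalized = zero_mean_unit_variance(waveform)

new_mean = sum(normalized) / len(normalized)
new_variance = sum((x - new_mean) ** 2 for x in normalized) / len(normalized)
print(new_mean, new_variance)
```

Normalizing each clip this way makes the model less sensitive to differences in recording volume between speakers.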

Optionally, keep only training clips shorter than 5 seconds (this filter is commented out in the cell below).

In [None]:
#max_input_length_in_sec = 5.0
#training_data = training_data.filter(lambda x: x < max_input_length_in_sec * processor.feature_extractor.sampling_rate, input_columns=["input_length"])
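For reference, the threshold used by this filter can be worked out by hand. The sketch below mirrors the lambda above on a few made-up input lengths; `keep_clip` and the example lengths are hypothetical.

```python
max_input_length_in_sec = 5.0
sampling_rate = 16_000  # matches the feature extractor's sampling rate

# 5 seconds of audio at 16 kHz corresponds to 80,000 samples.
max_samples = max_input_length_in_sec * sampling_rate

def keep_clip(input_length):
    # Same strict comparison as the lambda in the filter above.
    return input_length < max_samples

example_lengths = [78336, 80000, 120000]
kept = [n for n in example_lengths if keep_clip(n)]
print(kept)  # [78336]
```

Because the comparison is strict, a clip of exactly 5 seconds would be dropped; using `<=` instead would keep it.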



## Training: 

Let's set up a [trainer](https://huggingface.co/docs/transformers/main/main_classes/trainer). The Trainer class provides an API for feature-complete training in PyTorch for most standard use cases.

1. `DataCollatorCTCWithPadding` pads the input sequences and the labels in a batch to a common length. With `padding=True` (or `'longest'`), each batch is padded to its longest sequence (no padding is applied if only a single sequence is provided). This is the strategy we'll use.

The collator is inspired by the [huggingface/transformer run_speech_recognition_ctc.py](https://github.com/huggingface/transformers/blob/7e61d56a45c19284cfda0cee8995fb552f6b1f4e/examples/pytorch/speech-recognition/run_speech_recognition_ctc.py#L219) script on GitHub.

## Data collator

In [None]:
import torch

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Union

# dataclass generator generates special methods for a class, such as __init__, __repr__, and __eq__, based on the class variables.
@dataclass
class DataCollatorCTCWithPadding:

    processor: Wav2Vec2Processor

    # padding method used for the input sequences and defaults to True.
    padding: Union[bool, str] = True

    # the function takes in a list of features, where each feature is a dictionary containing input values and labels,
    # and returns a dictionary containing the padded input values and labels.
    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        
        # extracts the input values from the features.
        input_features = [{"input_values": feature["input_values"]} for feature in features]

        # extracts the labels from the features.
        label_features = [{"input_ids": feature["labels"]} for feature in features]

        # pads the input sequences to the same length then return the result as PyTorch tensors.
        batch = self.processor.pad(input_features,padding=self.padding,return_tensors="pt", )

        
        with self.processor.as_target_processor():
            # pads the labels to the length of the longest label sequence in the batch
            labels_batch = self.processor.pad(label_features,padding=self.padding,return_tensors="pt",)

        # replaces padding in the labels with -100 so that it is ignored when calculating the loss.
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        #  sets the padded labels as the "labels" key in the batch dictionary.
        batch["labels"] = labels

        return batch

**Note:** A data collator is a function or class that takes a list of samples from a dataset and combines them into batches to feed into a machine learning model. The data collator can apply padding or truncation to ensure that the sequences within each batch have the same length. It can also apply any necessary data pre-processing steps, such as tokenization or numerical encoding. The purpose of the data collator is to ensure that the model receives input data in a format that it can process efficiently.

In [None]:
# define the data collator
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
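To make the label handling concrete, here is a small pure-Python sketch of what the collator does to the labels: pad every sequence to the longest one in the batch, then replace the padded positions with -100 so the loss ignores them. The helper `pad_labels` and the toy ids are hypothetical; the real collator works on PyTorch tensors through `processor.pad`.

```python
def pad_labels(label_lists, pad_id, ignore_index=-100):
    """Return (padded, masked) copies of a batch of label id lists."""
    longest = max(len(labels) for labels in label_lists)
    padded, masked = [], []
    for labels in label_lists:
        n_pad = longest - len(labels)
        # Step 1: pad with the tokenizer's pad id.
        padded.append(labels + [pad_id] * n_pad)
        # Step 2: the same positions become -100 so the loss skips them.
        masked.append(labels + [ignore_index] * n_pad)
    return padded, masked

batch_labels = [[8, 1], [19, 1, 1, 0]]  # toy label ids
padded, masked = pad_labels(batch_labels, pad_id=28)
print(padded)  # [[8, 1, 28, 28], [19, 1, 1, 0]]
print(masked)  # [[8, 1, -100, -100], [19, 1, 1, 0]]
```

Inputs and labels are padded separately because they have different lengths and different padding values: audio is padded with 0.0, labels with the pad token id.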

## Load the checkpoints

We now load the pretrained checkpoint of [Wav2Vec2-XLS-R-300M](https://huggingface.co/facebook/wav2vec2-xls-r-300m).

The hyperparameters we chose are fairly arbitrary, inspired by [Patrick's blog](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2) and a [Swahili fine-tuned model](https://huggingface.co/alokmatta/wav2vec2-large-xlsr-53-sw) on Hugging Face. We will play with the parameters to find better results.


In [None]:
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m", 
    attention_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    mask_time_prob=0.05,
    layerdrop=0.1,
    gradient_checkpointing=True, 
    ctc_loss_reduction="mean",
    ctc_zero_infinity=True,
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer)
)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.57k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.27G [00:00<?, ?B/s]

Some weights of the model checkpoint at facebook/wav2vec2-xls-r-300m were not used when initializing Wav2Vec2ForCTC: ['quantizer.weight_proj.bias', 'quantizer.codevectors', 'project_q.bias', 'quantizer.weight_proj.weight', 'project_hid.bias', 'project_q.weight', 'project_hid.weight']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-xls-r-300m and are newly initialized: ['lm_head.weight', 'lm_head.bias']
You should probably TRAIN this model on a down-stream task to be able to use it 

We freeze the feature encoder, the CNN part of the architecture, which was already trained during pre-training. We only want to fine-tune the transformer with a CTC loss on top, without fine-tuning the CNN layers.

In [None]:
# freeze_feature_encoder() replaces the deprecated freeze_feature_extractor()
model.freeze_feature_encoder()



## Metric

We now set up the evaluation:

1. We install the `evaluate` library.
2. We import the installed library.
3. We define the metric as Word Error Rate (WER).
4. Finally, we define the metric function, which computes the WER between the predicted and true labels for a batch of examples and which we eventually pass to our trainer.

In [None]:
%%capture
!pip install evaluate nltk rouge_score

In [None]:
import evaluate

In [None]:
metric = evaluate.load("wer")


Downloading builder script:   0%|          | 0.00/4.49k [00:00<?, ?B/s]

This is the function that computes the Word Error Rate (WER).

In [None]:
def compute_metrics(pred):
    # Get the predicted logits from the model's output.
    pred_logits = pred.predictions

    # Get the predicted token ids by taking the index with maximum probability across the last dimension of the logits tensor.
    pred_ids = np.argmax(pred_logits, axis=-1)
 
    # we replace the -100 pad with corresponding padding id
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id

    # Convert the predicted token ids to string
    pred_str = processor.batch_decode(pred_ids)
   

    # Convert the true label token ids to string
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)

    # Compute the WER metric between the predicted and true label strings 
    wer = metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
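For intuition about what the metric measures, here is a minimal pure-Python sketch of WER for a single reference and prediction pair: the word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words. The function name is hypothetical, it assumes a non-empty reference, and the `evaluate` metric aggregates over the whole batch rather than per sentence.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

reference = "kinywa ni jumba la maneno"
print(word_error_rate(reference, "kinywa ni jumba maneno"))  # 0.2
```

A WER of 1.0, as in the first evaluation steps of a run, means the number of errors equals the number of reference words, for example when the model outputs nothing at all.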

## Training arguments

`TrainingArguments` exposes all the points of customization during training.

In our case, we specify where the checkpoints will be stored, the evaluation strategy, the learning rate, and so on.

Notes:
`group_by_length` -> when true, groups training samples of similar input length into one batch, which reduces the amount of padding.




In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
  output_dir= hugging_face_repo,
  group_by_length=True,
  per_device_train_batch_size=8,
  gradient_accumulation_steps=4,
  evaluation_strategy="steps",
  num_train_epochs=9,
  gradient_checkpointing=True,
  fp16=True,
  save_steps=200,
  eval_steps=200,
  logging_steps=400,
  learning_rate=3e-4,
  warmup_steps=500,
  save_total_limit=2,
  push_to_hub=True,
)
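These settings determine the effective batch size and the total number of optimizer steps. The sketch below works through the arithmetic for the 3726 training examples, assuming a single GPU and that leftover micro-batches at the end of an epoch do not count as a separate optimizer step.

```python
import math

num_examples = 3726
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
num_train_epochs = 9

# Gradients from 4 micro-batches of 8 are accumulated before each update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Micro-batches per epoch (the last one may be smaller than 8).
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)

# Optimizer updates per epoch and over the whole run.
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
total_updates = updates_per_epoch * num_train_epochs

print(effective_batch_size, batches_per_epoch, updates_per_epoch, total_updates)
# 32 466 116 1044
```

Under these assumptions the run performs 1044 updates, so `warmup_steps=500` means almost half of training is spent in learning-rate warm-up, and `eval_steps=200` yields only five evaluations.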


The trainer takes the objects we instantiated in the cells above:

1. The model -> `Wav2Vec2ForCTC`, which wraps the pretrained XLS-R wav2vec 2.0 model with 300M parameters.

2. The data collator -> pads the examples of each batch into the format the model expects as input.

3. The training arguments -> the specifics of the training run, such as the directory where we store the model, the learning rate, and other hyperparameters, which we will tune in order to get the best model.

4. The metric function -> computes the word error rate between the predicted string and the true label string.

5. The training and testing sets -> the training set is used to fine-tune the model and the testing set is used to evaluate its performance.

6. The tokenizer -> the `tokenizer` argument ensures that input data is processed the same way during training, evaluation, and inference. Here we pass `processor.feature_extractor`, so the audio preprocessing configuration is saved alongside the checkpoints.

Reference: [Hugging Face](https://huggingface.co/docs/evaluate/transformers_integrations)

In [None]:
%%bash
git clone https://huggingface.co/AntonyG/fine-tune-wav2vec2-large-xls-r-1b-sw
cd fine-tune-wav2vec2-large-xls-r-1b-sw
git lfs install
git config --global user.email "antony.gitau@students.ku.ac.ke"
git config --global user.name "AntonyG"

Updated git hooks.
Git LFS initialized.


Cloning into 'fine-tune-wav2vec2-large-xls-r-1b-sw'...


In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=training_data,
    eval_dataset=testing_data,
    tokenizer=processor.feature_extractor,
)

/content/AntonyG/fine-tune-wav2vec2-large-xls-r-1b-sw is already a clone of https://huggingface.co/AntonyG/fine-tune-wav2vec2-large-xls-r-1b-sw. Make sure you pull the latest changes with `repo.git_pull()`.


## Fine tuning

We then call the train function on our trainer. Fine tuning just began!

In [None]:
trainer.train()



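| Step | Training Loss | Validation Loss | Wer |
|------|---------------|-----------------|-----|
| 200 | No log | 3.009205 | 1.0 |
| 400 | 4.130500 | 2.915897 | 1.0 |
| 600 | 4.130500 | 1.430101 | 0.704019 |
| 800 | 0.921700 | 1.31434 | 0.652862 |
| 1000 | 0.921700 | 1.283393 | 0.583435 |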




TrainOutput(global_step=1044, training_loss=2.0035566059565637, metrics={'train_runtime': 4733.3199, 'train_samples_per_second': 7.085, 'train_steps_per_second': 0.221, 'total_flos': 5.502951198457689e+18, 'train_loss': 2.0035566059565637, 'epoch': 8.96})

In [None]:
trainer.push_to_hub()