# Africa's Talking x Mozilla Common Voice Hackathon

This is the main notebook used for model development. For experimentation, use the notebooks in the `misc_notebooks` folder.

## 1. Problem Statement

This project was originally a submission to the Africa's Talking x Mozilla Common Voice Hackathon, from October 13th through November

### 1.1 Objectives
>> The main objective is to build a deep learning model that infers `text` conditioned on `voice` sequences, i.e. a speech-to-text model.

The model is evaluated on the following metrics, all of which should be as low as possible:
* Character Error Rate
* Word Error Rate
* Phone Error Rate
* Loss
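
As a rough illustration of the first two metrics, the sketch below computes Word Error Rate and Character Error Rate as the edit distance between a reference and a hypothesis, divided by the reference length. The helper names (`edit_distance`, `word_error_rate`, `char_error_rate`) and the hypothesis string are made up for this example; the training toolkit reports its own metrics, and this is not its implementation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# The reference is a transcript from the dev set; the hypothesis is invented.
reference = "kukosa pesa ni hatari sana"
hypothesis = "kukosa pesa hatari sama"

print(word_error_rate(reference, hypothesis))  # 2 word edits / 5 words = 0.4
print(char_error_rate(reference, hypothesis))  # 4 character edits / 26 characters
```

Both functions assume a non-empty reference; an empty one would need to be handled separately to avoid a division by zero.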

## 2. Data Loading

The data is sourced from Mozilla's Common Voice site [here](https://commonvoice.mozilla.org/en/datasets).

In [65]:
# importations
import pandas as pd
import numpy as np
import os
from scripts import coqui_data_prepper
# import scripts/to_csv.py as to_csv


In [66]:
# read `tsv` files
test_df = pd.read_csv('sw/test.tsv', delimiter='\t')
train_df = pd.read_csv('sw/train.tsv', delimiter='\t')
# reported_df = pd.read_csv('sw/reported.tsv', engine='c', delimiter='\t') # causing EOF error in reading, possibly corrupt or inconsistent
# others_df = pd.read_csv('sw/other.tsv', delimiter='\t') # causing EOF error in reading, possibly corrupt or inconsistent
invalidated_df = pd.read_csv('sw/invalidated.tsv', delimiter='\t')
dev_df = pd.read_csv('sw/dev.tsv', delimiter='\t')
durations_df = pd.read_csv('sw/clip_durations.tsv', delimiter='\t')
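
An EOF error while parsing a TSV file is often caused by an unbalanced quote character inside a field rather than by a corrupt file. Whether that is what happens with `reported.tsv` and `other.tsv` has not been verified here, so treat it as an assumption. The sketch below uses a small in-memory TSV (invented for this example) to show how `quoting=csv.QUOTE_NONE` makes pandas treat quotes as ordinary characters.

```python
import csv
import io

import pandas as pd

# Hypothetical two-row TSV whose first sentence opens a quote that is never closed.
raw_tsv = 'path\tsentence\na.mp3\t"habari\nb.mp3\tdunia\n'

# QUOTE_NONE tells the parser not to give quote characters any special meaning,
# so the stray quote stays in the text instead of swallowing the following rows.
df = pd.read_csv(io.StringIO(raw_tsv), sep='\t', quoting=csv.QUOTE_NONE)

print(df)
```

If this turns out to be the cause, the same `quoting` argument could be passed to the two commented-out `pd.read_csv` calls above.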

## 3. EDA

Let's inspect some of these DataFrames to get our bearings.

In [67]:
#train_df
train_df.head()

Unnamed: 0,client_id,path,sentence,up_votes,down_votes,age,gender,accents,variant,locale,segment
0,70c2b9896ca6150002ef6e888a3f7788e49e61aef85ec8...,common_voice_sw_37083656.mp3,yeyote yule atakaepatikana akiandamana katika ...,2,0,twenties,female,,,sw,
1,70c2b9896ca6150002ef6e888a3f7788e49e61aef85ec8...,common_voice_sw_37085166.mp3,kwa ujumla kwenye kwenye sehemu za wilaya ama ...,3,1,twenties,female,,,sw,
2,70c2b9896ca6150002ef6e888a3f7788e49e61aef85ec8...,common_voice_sw_37085182.mp3,Hawai ni kuzuri,2,0,twenties,female,,,sw,
3,70c2b9896ca6150002ef6e888a3f7788e49e61aef85ec8...,common_voice_sw_37085525.mp3,na kuhatarisha mifumo mizima ya uhai katika se...,13,3,twenties,female,,,sw,
4,70c2b9896ca6150002ef6e888a3f7788e49e61aef85ec8...,common_voice_sw_37085535.mp3,kundi zima lililosababisha mauaji lilitakiwa k...,2,1,twenties,female,,,sw,


In [68]:
# test_df
test_df.head()

Unnamed: 0,client_id,path,sentence,up_votes,down_votes,age,gender,accents,variant,locale,segment
0,011544b78c18417a5869b8e70ebc7675c7eb3b517754b8...,common_voice_sw_37190902.mp3,Tuma pesa sahii.,2,0,,,,,sw,
1,0133d8ddf5c1a3c678fde017e0b07d2835bfd707d5b3ec...,common_voice_sw_31428161.mp3,wachambuzi wa soka wanamtaja Messi kama nyota ...,2,0,twenties,female,,,sw,
2,01c95772efd3fbe4a1122206c7474c77ed6591c8c9fb00...,common_voice_sw_30317714.mp3,romario aliingia kwenye orodha ya wachezaji wa...,2,1,,,,,sw,
3,023711185d4404ff398c2697f2e72868d1ecf69a92b581...,common_voice_sw_32116997.mp3,Sote twesangaa twelipomuona mwalimu Ali apika,2,1,twenties,male,,,sw,
4,0244639ffd7ec755a01b21ea204735ca3c44443e9cf46c...,common_voice_sw_29002392.mp3,Inajulikana kama shina la Warangi.,2,0,,,,,sw,


In [69]:
# dev_df
dev_df.head()

Unnamed: 0,client_id,path,sentence,up_votes,down_votes,age,gender,accents,variant,locale,segment
0,8ed66acaaecf3d1ffd887ed5af76a220976a8b3dd1209f...,common_voice_sw_30628088.mp3,Mbali na kuwa afisa wa serikali Jokate pia ni ...,2,0,twenties,male,,,sw,
1,8ed66acaaecf3d1ffd887ed5af76a220976a8b3dd1209f...,common_voice_sw_30628121.mp3,Kukosa pesa ni hatari sana,3,1,twenties,male,,,sw,
2,8ed66acaaecf3d1ffd887ed5af76a220976a8b3dd1209f...,common_voice_sw_30628126.mp3,Jina Karagwe linatokana na kilima kinachopatik...,2,0,twenties,male,,,sw,
3,8ed66acaaecf3d1ffd887ed5af76a220976a8b3dd1209f...,common_voice_sw_30628160.mp3,sokwe wanaopatikana mahale ni aina adimu zaidi...,2,0,twenties,male,,,sw,
4,8ed66acaaecf3d1ffd887ed5af76a220976a8b3dd1209f...,common_voice_sw_30628161.mp3,mamba wa mto naili wana sumu kali inayoozesha ...,3,0,twenties,male,,,sw,


We need to prefix each entry in the `path` column so that it is a valid path relative to this directory. We can then create new DataFrames to train with Coqui STT.

In [70]:
def attach_prefix(df, prefix):
  """Attaches a prefix to all entries in the column 'path' of a dataframe.

  Args:
    df: A Pandas dataframe with the column 'path'.
    prefix: The prefix to attach to the entries in the column 'path'.

  Returns:
    A Pandas dataframe with the column 'path' updated to include the prefix.
  """

  df["path"] = df["path"].apply(lambda x: prefix + x)
  return df



In [71]:
# Create a list of Pandas DataFrames
df_list = [train_df, dev_df, test_df]

# Iterate over the DataFrames and attach the clips directory prefix to each.
# attach_prefix modifies the DataFrame in place, so run this cell only once;
# a second run would prepend the prefix again.
for df in df_list:
    attach_prefix(df, "sw/clips/")
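
After prefixing, it can be worth confirming that the paths actually resolve to files on disk. The sketch below is one way to do that; `find_missing_files` is a hypothetical helper written for this example, and the demo uses temporary placeholder files rather than the real clips.

```python
import os
import tempfile


def find_missing_files(paths):
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]


# Hypothetical demo: one placeholder clip exists, the other does not.
with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "clip_a.mp3")
    open(existing, "wb").close()
    absent = os.path.join(tmp, "clip_b.mp3")

    missing = find_missing_files([existing, absent])
    demo_result = [os.path.basename(p) for p in missing]

print(demo_result)  # ['clip_b.mp3']
```

On the real data this would be called as `find_missing_files(train_df["path"])`; an empty list means every prefixed path points to a file.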

In [72]:
train_df.head()

Unnamed: 0,client_id,path,sentence,up_votes,down_votes,age,gender,accents,variant,locale,segment
0,70c2b9896ca6150002ef6e888a3f7788e49e61aef85ec8...,sw/clips/common_voice_sw_37083656.mp3,yeyote yule atakaepatikana akiandamana katika ...,2,0,twenties,female,,,sw,
1,70c2b9896ca6150002ef6e888a3f7788e49e61aef85ec8...,sw/clips/common_voice_sw_37085166.mp3,kwa ujumla kwenye kwenye sehemu za wilaya ama ...,3,1,twenties,female,,,sw,
2,70c2b9896ca6150002ef6e888a3f7788e49e61aef85ec8...,sw/clips/common_voice_sw_37085182.mp3,Hawai ni kuzuri,2,0,twenties,female,,,sw,
3,70c2b9896ca6150002ef6e888a3f7788e49e61aef85ec8...,sw/clips/common_voice_sw_37085525.mp3,na kuhatarisha mifumo mizima ya uhai katika se...,13,3,twenties,female,,,sw,
4,70c2b9896ca6150002ef6e888a3f7788e49e61aef85ec8...,sw/clips/common_voice_sw_37085535.mp3,kundi zima lililosababisha mauaji lilitakiwa k...,2,1,twenties,female,,,sw,


In [73]:
dev_df.head()

Unnamed: 0,client_id,path,sentence,up_votes,down_votes,age,gender,accents,variant,locale,segment
0,8ed66acaaecf3d1ffd887ed5af76a220976a8b3dd1209f...,sw/clips/common_voice_sw_30628088.mp3,Mbali na kuwa afisa wa serikali Jokate pia ni ...,2,0,twenties,male,,,sw,
1,8ed66acaaecf3d1ffd887ed5af76a220976a8b3dd1209f...,sw/clips/common_voice_sw_30628121.mp3,Kukosa pesa ni hatari sana,3,1,twenties,male,,,sw,
2,8ed66acaaecf3d1ffd887ed5af76a220976a8b3dd1209f...,sw/clips/common_voice_sw_30628126.mp3,Jina Karagwe linatokana na kilima kinachopatik...,2,0,twenties,male,,,sw,
3,8ed66acaaecf3d1ffd887ed5af76a220976a8b3dd1209f...,sw/clips/common_voice_sw_30628160.mp3,sokwe wanaopatikana mahale ni aina adimu zaidi...,2,0,twenties,male,,,sw,
4,8ed66acaaecf3d1ffd887ed5af76a220976a8b3dd1209f...,sw/clips/common_voice_sw_30628161.mp3,mamba wa mto naili wana sumu kali inayoozesha ...,3,0,twenties,male,,,sw,


## 4. Preprocessing

Coqui-STT requires the data to be in a specific format before training begins, with the following columns:
* wav_filename: path to the WAV file (MP3 is not supported)
* wav_filesize: size of each WAV clip in bytes
* transcript: the transcription of each clip, taken from the `sentence` column

**Note**: the `scripts` module included in this project contains scripts that convert MP3 to WAV and perform this preprocessing.
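
To make the target layout concrete, here is a minimal sketch that builds the three columns from a DataFrame holding `path` and `sentence`. The function name `to_coqui_columns` is hypothetical and this is not the implementation of `coqui_data_prepper.format_df`, whose source is not shown here. It assumes `path` already points to existing WAV files; the demo writes placeholder bytes instead of real audio.

```python
import os
import tempfile

import pandas as pd


def to_coqui_columns(df):
    """Build the wav_filename / wav_filesize / transcript columns.

    Assumes `df` has `path` (existing WAV files) and `sentence` columns.
    """
    return pd.DataFrame({
        "wav_filename": df["path"],
        "wav_filesize": df["path"].apply(os.path.getsize),  # size in bytes
        "transcript": df["sentence"],
    })


with tempfile.TemporaryDirectory() as tmp:
    wav_path = os.path.join(tmp, "clip.wav")
    with open(wav_path, "wb") as f:
        f.write(b"\x00" * 44)  # placeholder bytes standing in for audio data

    demo = pd.DataFrame({"path": [wav_path], "sentence": ["hawai ni kuzuri"]})
    coqui_df = to_coqui_columns(demo)

print(coqui_df.columns.tolist())
```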

In [74]:
from pydub import AudioSegment

def convert_mp3_to_wav(df, label):
  """Converts all MP3 files in the `path` column to WAV files resampled at 16 kHz.

  The WAV files are saved to a new directory (test, train, or dev) under the
  top-level folder of the original MP3 paths.

  Args:
    df: A Pandas DataFrame with the column `path`.
    label: The label for the new directory (test, train, or dev).

  Returns:
    A Pandas DataFrame with the `path` column updated to point to the new WAV files.
  """

  # Create a new directory for the WAV files.
  new_dir = os.path.join(df['path'].iloc[0].split('/')[0], label)
  os.makedirs(new_dir, exist_ok=True)

  # Iterate over the DataFrame and convert each MP3 file to WAV.
  for idx, mp3_path in zip(df.index, df['path'].tolist()):
    wav_path = os.path.join(new_dir, os.path.splitext(os.path.basename(mp3_path))[0] + '.wav')

    # Convert the MP3 file to WAV, resampling to 16 kHz as the docstring states.
    audio = AudioSegment.from_mp3(mp3_path).set_frame_rate(16000)
    audio.export(wav_path, format='wav')

    # Update `path` by index label, which also works when the index is not 0..n-1.
    df.loc[idx, 'path'] = wav_path

  return df
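
Since the conversion targets 16 kHz audio, a quick check of the exported files can catch clips that were written with a different sample rate. The sketch below reads the WAV header with the standard-library `wave` module; `wav_properties` is a hypothetical helper, and the demo file is one second of generated silence rather than a real clip.

```python
import os
import tempfile
import wave


def wav_properties(path):
    """Return (sample_rate_hz, channels, sample_width_bytes) of a WAV file."""
    with wave.open(path, "rb") as wav_file:
        return (wav_file.getframerate(),
                wav_file.getnchannels(),
                wav_file.getsampwidth())


# Demo: write one second of 16 kHz, mono, 16-bit silence and read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "silence.wav")
    with wave.open(demo_path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(16000)
        wav_file.writeframes(b"\x00\x00" * 16000)

    props = wav_properties(demo_path)

print(props)  # (16000, 1, 2)
```

Running `wav_properties` over a sample of the converted clips would show whether they all share the expected sample rate.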

>> **NOTE**: the 3 cells below need only be run once, and they may take some time because they process a rather large amount of data

In [75]:
# convert train_df clips to .wav
# train_df = convert_mp3_to_wav(train_df, 'train')

In [76]:
# convert dev_df clips to .wav
# dev_df = convert_mp3_to_wav(dev_df, 'dev')

In [77]:
# convert test_df clips to .wav
# test_df = convert_mp3_to_wav(test_df, 'test')

At this stage we have already run the notebook and obtained the formatted CSV files. We re-read them here to avoid having to process and format the data again.

In [78]:
# read formatted csv files
train_df = pd.read_csv('sw/train.csv')
test_df = pd.read_csv('sw/test.csv')
dev_df = pd.read_csv('sw/dev.csv')

In [79]:
# temporary script to do further data formatting; not required after converting to csv
# dfs = [train_df, test_df, dev_df]

# for df in dfs:
#     df['wav_filename'] = df['wav_filename'].str.replace('sw/', '', regex=False)

In [80]:
train_df.head()

Unnamed: 0,wav_filename,wav_filesize,transcript
0,train/common_voice_sw_37083656.wav,359468,yeyote yule atakaepatikana akiandamana katika ...
1,train/common_voice_sw_37085166.wav,391724,kwa ujumla kwenye kwenye sehemu za wilaya ama ...
2,train/common_voice_sw_37085182.wav,191276,hawai ni kuzuri
3,train/common_voice_sw_37085525.wav,391724,na kuhatarisha mifumo mizima ya uhai katika se...
4,train/common_voice_sw_37085535.wav,391724,kundi zima lililosababisha mauaji lilitakiwa k...


In [81]:
test_df.head()

Unnamed: 0,wav_filename,wav_filesize,transcript
0,test/common_voice_sw_37190902.wav,225836,tuma pesa sahii
1,test/common_voice_sw_31428161.wav,502316,wachambuzi wa soka wanamtaja messi kama nyota ...
2,test/common_voice_sw_30317714.wav,331820,romario aliingia kwenye orodha ya wachezaji wa...
3,test/common_voice_sw_32116997.wav,640556,sote twesangaa twelipomuona mwalimu ali apika
4,test/common_voice_sw_29002392.wav,294956,inajulikana kama shina la warangi


In [82]:
# format DFs for training (Repetitive), needs to be better implemented (DRY)
train_df = coqui_data_prepper.format_df(train_df, 'train')
dev_df = coqui_data_prepper.format_df(dev_df, 'dev')
test_df = coqui_data_prepper.format_df(test_df, label='test')

Format acceptable: checking clip directories


Format acceptable: checking clip directories
Format acceptable: checking clip directories


>> **END NOTE**: comment out the cells above if they have already been run

In [83]:
train_df.head()

Unnamed: 0,wav_filename,wav_filesize,transcript
0,train/common_voice_sw_37083656.wav,359468,yeyote yule atakaepatikana akiandamana katika ...
1,train/common_voice_sw_37085166.wav,391724,kwa ujumla kwenye kwenye sehemu za wilaya ama ...
2,train/common_voice_sw_37085182.wav,191276,hawai ni kuzuri
3,train/common_voice_sw_37085525.wav,391724,na kuhatarisha mifumo mizima ya uhai katika se...
4,train/common_voice_sw_37085535.wav,391724,kundi zima lililosababisha mauaji lilitakiwa k...


In [84]:
test_df.head()

Unnamed: 0,wav_filename,wav_filesize,transcript
0,test/common_voice_sw_37190902.wav,225836,tuma pesa sahii
1,test/common_voice_sw_31428161.wav,502316,wachambuzi wa soka wanamtaja messi kama nyota ...
2,test/common_voice_sw_30317714.wav,331820,romario aliingia kwenye orodha ya wachezaji wa...
3,test/common_voice_sw_32116997.wav,640556,sote twesangaa twelipomuona mwalimu ali apika
4,test/common_voice_sw_29002392.wav,294956,inajulikana kama shina la warangi


In [85]:
# convert transcript to lower case characters
def to_lower(df, column_name):
  """Converts all characters in a column to lowercase.

  Args:
    df: A Pandas DataFrame.
    column_name: The name of the column to convert to lowercase.

  Returns:
    A Pandas DataFrame with the column `column_name` converted to lowercase.
  """

  df[column_name] = df[column_name].str.lower()
  return df

In [86]:
train_df = to_lower(train_df, 'transcript')
dev_df = to_lower(dev_df, 'transcript')
test_df = to_lower(test_df, 'transcript')

In [87]:
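# remove punctuation, numbers and unnecessary whitespace
import re

def preprocess_text(text):
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    
    # Remove underscores (they count as word characters, so the punctuation step keeps them)
    text = text.replace('_', '')

    # Collapse runs of whitespace into a single space; done last so that
    # removed tokens do not leave double spaces behind
    text = re.sub(r'\s+', ' ', text)

    return text.strip()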

# Apply preprocessing to 'transcript' column in train_df
train_df['transcript'] = train_df['transcript'].apply(preprocess_text)

# Apply preprocessing to 'transcript' column in dev_df
dev_df['transcript'] = dev_df['transcript'].apply(preprocess_text)

# Apply preprocessing to 'transcript' column in test_df
test_df['transcript'] = test_df['transcript'].apply(preprocess_text)
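
To see what these cleaning steps do to individual strings, the sketch below applies a condensed copy of them to a few inputs. The name `clean_transcript` and the last two sample strings are made up for this example. One point worth noticing is that `\w` matches Unicode letters in Python 3, so accented characters survive the cleaning and would have to appear in the alphabet used for training.

```python
import re


def clean_transcript(text):
    """Condensed copy of the cleaning steps, for illustration only."""
    text = re.sub(r'[^\w\s]', '', text)  # drop punctuation
    text = re.sub(r'\d+', '', text)      # drop digits
    text = text.replace('_', '')         # drop underscores (they count as \w)
    text = re.sub(r'\s+', ' ', text)     # collapse whitespace
    return text.strip()


print(clean_transcript("Tuma pesa sahii."))             # Tuma pesa sahii
print(clean_transcript("  watu 25 walifika,   leo! "))  # watu walifika leo
print(clean_transcript("café_x"))                       # caféx
```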

In [88]:
train_df.head()

Unnamed: 0,wav_filename,wav_filesize,transcript
0,train/common_voice_sw_37083656.wav,359468,yeyote yule atakaepatikana akiandamana katika ...
1,train/common_voice_sw_37085166.wav,391724,kwa ujumla kwenye kwenye sehemu za wilaya ama ...
2,train/common_voice_sw_37085182.wav,191276,hawai ni kuzuri
3,train/common_voice_sw_37085525.wav,391724,na kuhatarisha mifumo mizima ya uhai katika se...
4,train/common_voice_sw_37085535.wav,391724,kundi zima lililosababisha mauaji lilitakiwa k...


In [89]:
dev_df.head()

Unnamed: 0,wav_filename,wav_filesize,transcript
0,dev/common_voice_sw_30628088.wav,322604,mbali na kuwa afisa wa serikali jokate pia ni ...
1,dev/common_voice_sw_30628121.wav,225836,kukosa pesa ni hatari sana
2,dev/common_voice_sw_30628126.wav,375596,jina karagwe linatokana na kilima kinachopatik...
3,dev/common_voice_sw_30628160.wav,336428,sokwe wanaopatikana mahale ni aina adimu zaidi...
4,dev/common_voice_sw_30628161.wav,359468,mamba wa mto naili wana sumu kali inayoozesha ...


In [90]:
test_df.head()

Unnamed: 0,wav_filename,wav_filesize,transcript
0,test/common_voice_sw_37190902.wav,225836,tuma pesa sahii
1,test/common_voice_sw_31428161.wav,502316,wachambuzi wa soka wanamtaja messi kama nyota ...
2,test/common_voice_sw_30317714.wav,331820,romario aliingia kwenye orodha ya wachezaji wa...
3,test/common_voice_sw_32116997.wav,640556,sote twesangaa twelipomuona mwalimu ali apika
4,test/common_voice_sw_29002392.wav,294956,inajulikana kama shina la warangi


We need to convert the *train*, *dev* and *test* TSV files to CSV to work with `coqui`.

In [91]:
# Create a list of Pandas DataFrames
df_list = [train_df, dev_df, test_df]

# Create the sw/ directory if it does not exist
csv_directory_path = "sw/"
if not os.path.exists(csv_directory_path):
    os.makedirs(csv_directory_path)

# Iterate over the list of DataFrames and save each DataFrame to a CSV file
for df, df_name in zip(df_list, ["train", "dev", "test"]):

    # Join the DataFrame name with the CSV directory path
    csv_file_path = os.path.join(csv_directory_path, df_name + ".csv")

    # Save the DataFrame to the CSV file
    df.to_csv(csv_file_path, index=False)

The CSV training files are saved, but the `transcript` columns need additional preprocessing, e.g. **tokenization**, before they can be used for training.


In [92]:
# detect alphabet
def detect_characters_in_dataframes(train_df, dev_df, test_df, alphabet_file_path):
  """Detects all the characters in the column 'transcript' for dataframes 'train_df, dev_df and test_df' and saves them to a file.

  Args:
    train_df: A Pandas dataframe containing the training data.
    dev_df: A Pandas dataframe containing the development data.
    test_df: A Pandas dataframe containing the test data.
    alphabet_file_path: The path to the file to save the alphabet to.
  """

  alphabet = set()

  for df in [train_df, dev_df, test_df]:
    # Skip missing transcripts (NaN) and collect every distinct character.
    # Iterating over the transcript itself gives its characters; the earlier
    # chardet call returned an encoding name (e.g. "ascii"), not the characters.
    for transcript in df["transcript"].dropna():
      alphabet.update(str(transcript))

  # Write one character per line, sorted so the file is reproducible.
  with open(alphabet_file_path, "w", encoding="utf-8") as alphabet_file:
    for character in sorted(alphabet):
      alphabet_file.write(character + "\n")

The function above collects the set of characters (the alphabet) used in the corpus. It is (optionally) run in the cell below.

In [93]:
# alphabet_file_path = "sw/alphabet.txt"

# detect_characters_in_dataframes(train_df, dev_df, test_df, alphabet_file_path)

## 5. Modelling

### Initialize Default Hyperparameters (Coqui-STT)

In [94]:
from coqui_stt_training.util.config import initialize_globals_from_args

initialize_globals_from_args(
    checkpoint_dir="ckpt_dir",
    train_files=["sw/train.csv"],
    dev_files=["sw/dev.csv"],
    test_files=["sw/test.csv"],
    load_train="init",
    n_hidden=100,
    epochs=2,
    train_batch_size=1,
    dev_batch_size=1,
    test_batch_size=1,
)

I --alphabet_config_path not specified, but all input datasets are present and in the same folder (--train_files, --dev_files and --test_files), and an alphabet.txt file was found alongside the sets (sw/alphabet.txt). Will use this alphabet file for this run.


In [95]:
# !python -m coqui_stt_training.train --train_files sw/train.csv --dev_files sw/dev.csv --test_files sw/test.csv --checkpoint_dir ckpt_dir --n_hidden 100 --epochs 100

In [96]:
# Kick off training job; configures CUDA to only use one GPU
from coqui_stt_training.train import train

# use maximum one GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

train()

I Performing dummy training to check for memory problems.
I If the following process crashes, you likely have batch sizes that are too big for your available system memory (or GPU memory).
I Initializing all variables.
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 0 |   Training | Elapsed Time: 0:00:02 | Steps: 1 | Loss: 1806.064331
Epoch 0 |   Training | Elapsed Time: 0:00:02 | Steps: 2 | Loss: 1506.971375
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 3 | Loss: 1351.944580
Epoch 0 | Validation | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 | Dataset: sw/dev.csv
Epoch 0 | Validation | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 812.518127 | Dataset: sw/dev.csv
Epoch 0 | Validation | Elapsed Time: 0:00:02 | Steps: 3 | Loss: 709.431641 | Dataset: sw/dev.csv

2023-10-21 12:31:19.133362: W tensorflow/core/framework/op_kernel.cc:1639] Invalid argument: ValueError: Alphabet cannot encode transcript "kocha wa real madrid alikuwa vincent del bosque" while processing sample "sw/train/common_voice_sw_30111211.wav", check that your alphabet contains all characters in the training corpus. Missing characters are: ['q'].
Traceback (most recent call last):

  File "/home/josh/Desktop/proj/mcv-stt-hackathon/mcv-stt-env/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in __call__
    ret = func(*args)

  File "/home/josh/Desktop/proj/mcv-stt-hackathon/mcv-stt-env/lib/python3.7/site-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 594, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/home/josh/Desktop/proj/mcv-stt-hackathon/mcv-stt-env/lib/python3.7/site-packages/coqui_stt_training/util/feeding.py", line 188, in generate_values
    sample.transcript, Config.alphabet, 

InvalidArgumentError: ValueError: Alphabet cannot encode transcript "kocha wa real madrid alikuwa vincent del bosque" while processing sample "sw/train/common_voice_sw_30111211.wav", check that your alphabet contains all characters in the training corpus. Missing characters are: ['q'].
Traceback (most recent call last):

  File "/home/josh/Desktop/proj/mcv-stt-hackathon/mcv-stt-env/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in __call__
    ret = func(*args)

  File "/home/josh/Desktop/proj/mcv-stt-hackathon/mcv-stt-env/lib/python3.7/site-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 594, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/home/josh/Desktop/proj/mcv-stt-hackathon/mcv-stt-env/lib/python3.7/site-packages/coqui_stt_training/util/feeding.py", line 188, in generate_values
    sample.transcript, Config.alphabet, context=sample.sample_id

  File "/home/josh/Desktop/proj/mcv-stt-hackathon/mcv-stt-env/lib/python3.7/site-packages/coqui_stt_training/util/text.py", line 22, in text_to_char_array
    list(ch for ch in transcript if not alphabet.CanEncodeSingle(ch)),

ValueError: Alphabet cannot encode transcript "kocha wa real madrid alikuwa vincent del bosque" while processing sample "sw/train/common_voice_sw_30111211.wav", check that your alphabet contains all characters in the training corpus. Missing characters are: ['q'].


	 [[{{node PyFunc}}]]
	 [[tower_0/IteratorGetNext]]
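
The error above says that the alphabet cannot encode the character `q`. One way to catch this before starting a training run is to compare the characters used in the transcripts with the characters in the alphabet. The sketch below does that for in-memory data; `find_missing_characters` is a hypothetical helper, and the alphabet here is a hand-written set that deliberately omits `q`, not the contents of `sw/alphabet.txt`.

```python
def find_missing_characters(transcripts, alphabet):
    """Return the sorted characters used in `transcripts` but absent from `alphabet`."""
    used = set()
    for transcript in transcripts:
        used.update(transcript)
    return sorted(used - set(alphabet))


# Illustrative alphabet: lowercase letters without 'q', plus the space character.
alphabet = set("abcdefghijklmnoprstuvwxyz ")

# The first transcript is the one named in the error message.
transcripts = [
    "kocha wa real madrid alikuwa vincent del bosque",
    "kukosa pesa ni hatari sana",
]

print(find_missing_characters(transcripts, alphabet))  # ['q']
```

An empty list means every transcript can be encoded with the given alphabet; any character that is reported either needs to be added to the alphabet file or removed from the transcripts during preprocessing.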

### Try Transfer Learning

In [None]:
# Import pretrained model


# END