# Africa's Talking x Mozilla Common Voice Hackathon

This is the main notebook used for model development. For experimentation, use the notebooks in the `misc_notebooks` folder.

## 1. Problem Statement

This project was originally a submission to the Africa's Talking x Mozilla Common Voice Hackathon, from October 13th through November

### 1.1 Objectives
>> The main objective is to build a deep learning model that infers `text` conditioned on `voice` sequences, i.e. a speech-to-text model.

The model is expected to achieve the best possible performance on the selected evaluation metrics:
* Character Error Rate
* Word Error Rate
* Phone Error Rate
* Loss
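As a rough illustration of how the error-rate metrics listed above are usually defined, the sketch below computes them as the edit distance between reference and hypothesis divided by the reference length. The function names are hypothetical and this is not the scoring code used by the hackathon or by Coqui STT. Phone Error Rate follows the same formula when both inputs are lists of phonemes, which assumes a phonemizer that this notebook does not provide.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over whitespace-separated words, divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over characters, divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For the reference `Kukosa pesa ni hatari sana` and the hypothesis `Kukosa pesa ni hatari`, one of five words is missing, so the word error rate is 1/5, and five of 26 characters are missing, so the character error rate is 5/26.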

## 2. Data Loading

The data is sourced from Mozilla's Common Voice datasets page [here](https://commonvoice.mozilla.org/en/datasets). The files read below live under `sw/`, the Swahili locale.

In [1]:
# imports
import pandas as pd
import numpy as np
import os
# import scripts/to_csv.py as to_csv


In [2]:
# read `tsv` files
test_df = pd.read_csv('sw/test.tsv', delimiter='\t')
train_df = pd.read_csv('sw/train.tsv', delimiter='\t')
# reported_df = pd.read_csv('sw/reported.tsv', engine='c', delimiter='\t')  # raises an EOF error on read; file is possibly corrupt or inconsistent
# others_df = pd.read_csv('sw/other.tsv', delimiter='\t')  # raises an EOF error on read; file is possibly corrupt or inconsistent
invalidated_df = pd.read_csv('sw/invalidated.tsv', delimiter='\t')
dev_df = pd.read_csv('sw/dev.tsv', delimiter='\t')
durations_df = pd.read_csv('sw/clip_durations.tsv', delimiter='\t')
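The two commented-out reads above fail with an EOF error. One common cause of that kind of error in tab-separated text is an unbalanced double quote inside a sentence, which the parser treats as the start of a quoted field. Whether that is what happens in `reported.tsv` and `other.tsv` is an assumption that has not been checked here. Under that assumption, a minimal sketch is to turn off quote handling. The helper name `read_tsv_unquoted` is hypothetical.

```python
import csv

import pandas as pd


def read_tsv_unquoted(path):
    """Read a tab-separated file, treating quote characters as ordinary text."""
    return pd.read_csv(path, sep="\t", quoting=csv.QUOTE_NONE)


# Possible usage (not run in this notebook):
# others_df = read_tsv_unquoted("sw/other.tsv")
# reported_df = read_tsv_unquoted("sw/reported.tsv")
```

If the files still fail to load with quoting disabled, the cause is something else, such as a truncated download.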

## 3. EDA

Let's inspect some of these DataFrames to get our bearings.

In [3]:
# train_df
train_df.head()

Unnamed: 0,client_id,path,sentence,up_votes,down_votes,age,gender,accents,variant,locale,segment
0,70c2b9896ca6150002ef6e888a3f7788e49e61aef85ec8...,common_voice_sw_37083656.mp3,yeyote yule atakaepatikana akiandamana katika ...,2,0,twenties,female,,,sw,
1,70c2b9896ca6150002ef6e888a3f7788e49e61aef85ec8...,common_voice_sw_37085166.mp3,kwa ujumla kwenye kwenye sehemu za wilaya ama ...,3,1,twenties,female,,,sw,
2,70c2b9896ca6150002ef6e888a3f7788e49e61aef85ec8...,common_voice_sw_37085182.mp3,Hawai ni kuzuri,2,0,twenties,female,,,sw,
3,70c2b9896ca6150002ef6e888a3f7788e49e61aef85ec8...,common_voice_sw_37085525.mp3,na kuhatarisha mifumo mizima ya uhai katika se...,13,3,twenties,female,,,sw,
4,70c2b9896ca6150002ef6e888a3f7788e49e61aef85ec8...,common_voice_sw_37085535.mp3,kundi zima lililosababisha mauaji lilitakiwa k...,2,1,twenties,female,,,sw,


In [4]:
# test_df
test_df.head()

Unnamed: 0,client_id,path,sentence,up_votes,down_votes,age,gender,accents,variant,locale,segment
0,011544b78c18417a5869b8e70ebc7675c7eb3b517754b8...,common_voice_sw_37190902.mp3,Tuma pesa sahii.,2,0,,,,,sw,
1,0133d8ddf5c1a3c678fde017e0b07d2835bfd707d5b3ec...,common_voice_sw_31428161.mp3,wachambuzi wa soka wanamtaja Messi kama nyota ...,2,0,twenties,female,,,sw,
2,01c95772efd3fbe4a1122206c7474c77ed6591c8c9fb00...,common_voice_sw_30317714.mp3,romario aliingia kwenye orodha ya wachezaji wa...,2,1,,,,,sw,
3,023711185d4404ff398c2697f2e72868d1ecf69a92b581...,common_voice_sw_32116997.mp3,Sote twesangaa twelipomuona mwalimu Ali apika,2,1,twenties,male,,,sw,
4,0244639ffd7ec755a01b21ea204735ca3c44443e9cf46c...,common_voice_sw_29002392.mp3,Inajulikana kama shina la Warangi.,2,0,,,,,sw,


In [5]:
# dev_df
dev_df.head()

Unnamed: 0,client_id,path,sentence,up_votes,down_votes,age,gender,accents,variant,locale,segment
0,8ed66acaaecf3d1ffd887ed5af76a220976a8b3dd1209f...,common_voice_sw_30628088.mp3,Mbali na kuwa afisa wa serikali Jokate pia ni ...,2,0,twenties,male,,,sw,
1,8ed66acaaecf3d1ffd887ed5af76a220976a8b3dd1209f...,common_voice_sw_30628121.mp3,Kukosa pesa ni hatari sana,3,1,twenties,male,,,sw,
2,8ed66acaaecf3d1ffd887ed5af76a220976a8b3dd1209f...,common_voice_sw_30628126.mp3,Jina Karagwe linatokana na kilima kinachopatik...,2,0,twenties,male,,,sw,
3,8ed66acaaecf3d1ffd887ed5af76a220976a8b3dd1209f...,common_voice_sw_30628160.mp3,sokwe wanaopatikana mahale ni aina adimu zaidi...,2,0,twenties,male,,,sw,
4,8ed66acaaecf3d1ffd887ed5af76a220976a8b3dd1209f...,common_voice_sw_30628161.mp3,mamba wa mto naili wana sumu kali inayoozesha ...,3,0,twenties,male,,,sw,


The `sentence` column needs to be renamed to `transcript`, the column name Coqui STT looks for in its training CSVs.

In [6]:
# Create a list of Pandas DataFrames
df_list = [train_df, dev_df, test_df]

# Create the sw/ directory if it does not exist
csv_directory_path = "sw/"
os.makedirs(csv_directory_path, exist_ok=True)

# Iterate over the list of DataFrames and save each DataFrame to a CSV file
for df, df_name in zip(df_list, ["train", "dev", "test"]):

    # Join the DataFrame name with the CSV directory path
    csv_file_path = os.path.join(csv_directory_path, df_name + ".csv")

    # Save the DataFrame to the CSV file
    df.to_csv(csv_file_path, index=False)
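The cell above writes the frames as they are, so the CSV headers still say `sentence` rather than `transcript`. A minimal sketch of the rename is shown below. It assumes Coqui STT reads the columns `wav_filename` and `transcript`. The mapping `COLUMN_MAP` and the helper `to_coqui_columns` are hypothetical names, and the sketch has not been run against the training job in this notebook.

```python
import pandas as pd

# Assumed mapping from Common Voice column names to the names Coqui STT reads.
COLUMN_MAP = {"path": "wav_filename", "sentence": "transcript"}


def to_coqui_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy holding only the mapped columns, renamed."""
    return df[list(COLUMN_MAP)].rename(columns=COLUMN_MAP)


# Possible usage before saving (not run in this notebook):
# to_coqui_columns(train_df).to_csv("sw/train.csv", index=False)
```

The Coqui STT CSV format also includes a `wav_filesize` column and expects WAV audio, while the clips listed here are `.mp3`. Converting the audio and filling in file sizes is left out of this sketch.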

## 4. Preprocessing

In [7]:
# Implement wave tokenizer to obtain feature table


## 5. Modelling

### Initialize Default Hyperparameters (Coqui STT)

In [8]:
from coqui_stt_training.util.config import initialize_globals_from_args

initialize_globals_from_args(
    alphabet_config_path="sw/alphabet.txt",
    checkpoint_dir="ckpt_dir",
    train_files=["sw/train.csv"],
    dev_files=["sw/dev.csv"],
    test_files=["sw/test.csv"],
    load_train="init",
    n_hidden=200,
    epochs=100,
    train_batch_size=2,
    dev_batch_size=2,
    test_batch_size=2,
)

2023-10-14 12:27:16.583535: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2023-10-14 12:27:16.609092: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2400000000 Hz
2023-10-14 12:27:16.610181: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x211e4a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-10-14 12:27:16.610218: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
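The configuration above points at `sw/alphabet.txt`, which this notebook never creates. One way such a file could be produced is sketched below: collect every distinct character in the transcripts and write one character per line. The one-character-per-line layout is an assumption about the alphabet file format, and `build_alphabet` and `write_alphabet` are hypothetical helpers.

```python
def build_alphabet(transcripts):
    """Return the sorted list of distinct characters found in the transcripts."""
    chars = set()
    for text in transcripts:
        if isinstance(text, str):  # skip missing values such as NaN
            chars.update(text)
    chars.discard("\n")
    return sorted(chars)


def write_alphabet(chars, path):
    """Write one character per line (assumed alphabet file layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for ch in chars:
            f.write(ch + "\n")


# Possible usage (not run in this notebook):
# write_alphabet(build_alphabet(train_df["sentence"]), "sw/alphabet.txt")
```

Building the alphabet from the training transcripts only means that characters appearing solely in the dev or test sets would be missing, so passing all three splits may be the safer choice.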


In [9]:
# Kick off training job; configures CUDA to only use one GPU
from coqui_stt_training.train import train

# use at most one GPU (device 0)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

train()

I Performing dummy training to check for memory problems.
I If the following process crashes, you likely have batch sizes that are too big for your available system memory (or GPU memory).
I Initializing all variables.
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 0 | Validation | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 | Dataset: sw/dev.csv


2023-10-14 12:27:19.604910: W tensorflow/core/framework/op_kernel.cc:1639] Unknown: RuntimeError: No transcript data (missing CSV column)

UnknownError: RuntimeError: No transcript data (missing CSV column)
Traceback (most recent call last):

  File "/home/josh/Desktop/proj/mcv-stt-hackathon/mcv-stt-env/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in __call__
    ret = func(*args)

  File "/home/josh/Desktop/proj/mcv-stt-hackathon/mcv-stt-env/lib/python3.7/site-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 594, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/home/josh/Desktop/proj/mcv-stt-hackathon/mcv-stt-env/lib/python3.7/site-packages/coqui_stt_training/util/feeding.py", line 160, in generate_values
    sources, buffering=buffering, labeled=True, reverse=reverse

  File "/home/josh/Desktop/proj/mcv-stt-hackathon/mcv-stt-env/lib/python3.7/site-packages/coqui_stt_training/util/sample_collections.py", line 722, in samples_from_sources
    sample_sources[0], buffering=buffering, labeled=labeled, reverse=reverse

  File "/home/josh/Desktop/proj/mcv-stt-hackathon/mcv-stt-env/lib/python3.7/site-packages/coqui_stt_training/util/sample_collections.py", line 681, in samples_from_source
    return CSV(sample_source, labeled=labeled, reverse=reverse)

  File "/home/josh/Desktop/proj/mcv-stt-hackathon/mcv-stt-env/lib/python3.7/site-packages/coqui_stt_training/util/sample_collections.py", line 545, in __init__
    raise RuntimeError("No transcript data (missing CSV column)")

RuntimeError: No transcript data (missing CSV column)


	 [[{{node PyFunc}}]]
	 [[tower_0/IteratorGetNext]]

### Try Transfer Learning

In [None]:
# Import pretrained model


# END