# How to Access Open Voice Training Data: Mozilla’s Common Voice Platform

**Assessment Three**

## Introduction
In this notebook, you will train a [Mozilla Voice STT / DeepSpeech](https://deepspeech.readthedocs.io/en/v0.8.0/) model using the Single Word Target Segment dataset.

### Learning Objectives
1. How to train a simple speech to text model.
2. How to evaluate the trained model.

### General Steps:
1. Configure the notebook.
2. Get the training data.
3. Prepare the data for training.
4. Train and Evaluate the STT model.

### 1.0 Notebook Configuration
We shall start by installing the dependency libraries required for the speech-to-text model.
***
Steps:
1. Clone the [DeepSpeech](https://github.com/mozilla/DeepSpeech) repository from GitHub using the `git` command.
2. Install the system packages required by the DeepSpeech library, including [Sound eXchange](http://sox.sourceforge.net/) (SoX), an audio conversion tool used here to convert `.mp3` files to `.wav`.
3. Set up the cloned repository by installing the libraries listed in its `setup.py` script. This step installs all the dependencies required by `DeepSpeech`.
4. The `DeepSpeech` library is built on `TensorFlow` version 1.15.2, but this environment ships with the latest `TensorFlow` version, so `TensorFlow` must be downgraded to the right version.
5. Finally, download the pretrained English model. The pretrained model enables transfer learning: training on a small dataset can still achieve reasonable results.

Run the cell below to run through the steps listed above. If you are curious about what exactly is happening, double-click the cell.
***
**NOTE: The setup takes quite a while to finish, kindly be patient.**

In [None]:
# Helper libraries
import os

# 1. Clone the STT repository from GitHub.
!git clone https://github.com/mozilla/STT.git

# 2. Install the dependency libraries.
!apt-get install sox libsox-fmt-mp3
!pip install sox
!apt-get install python3-dev

# 3a. Change into the STT directory
%cd STT

# 3b. Install the model's required libraries.
!pip3 install --upgrade pip==20.0.2 wheel==0.34.2 setuptools==46.1.3
!pip3 install --upgrade -e .

# 4. Downgrade the Tensorflow version to 1.15.2, STT uses this version
#    of Tensorflow.
!pip install tensorflow_gpu==1.15.2

# 5. Download a pretrained model into the checkpoints directory.
# 5a. Download the English (en-US) pretrained DeepSpeech checkpoint and scorer.
!wget -P checkpoints/ https://github.com/mozilla/STT/releases/download/v0.8.0/deepspeech-0.8.0-checkpoint.tar.gz
!wget -P checkpoints/ https://github.com/mozilla/STT/releases/download/v0.8.0/deepspeech-0.8.0-models.scorer

# 6. Untar the checkpoints into the checkpoint folder
!tar -xvf /content/STT/checkpoints/deepspeech-0.8.0-checkpoint.tar.gz -C /content/STT/checkpoints/

# 7. Install git-lfs, enable it, then fetch the LFS-tracked files.
!apt-get install git-lfs
!git lfs install
!git lfs pull

# 8. Define utility functions that wrap the data preparation and training
#    commands, so that later cells can call them as Python functions.
def prepare_data(path):
  """
    This function calls the import_cv2.py script that 
    converts the dataset into the .wav format.

    path <string> the path to the dataset directory.
  """
  !bin/import_cv2.py {path}

def train_model(train, dev, test, epochs):
  """
    Calls the DeepSpeech script to train on the specified 
    sets.

    train <string> the path to the train files.
    dev <string>   the path to the val files.
    test <string>  the path to the test files.
    epochs <int>   the number of training epochs.
  """
  !python3 DeepSpeech.py \
    --train_files {train} \
    --dev_files {dev} \
    --test_files {test} \
    --epochs {epochs} \
    --scorer_path /content/STT/checkpoints/deepspeech-0.8.0-models.scorer \
    --load_checkpoint_dir /content/STT/checkpoints/deepspeech-0.8.0-checkpoint \
    --save_checkpoint_dir /content/STT/checkpoints/deepspeech-0.8.0-checkpoint \
    --n_hidden 2048 \
    --train_cudnn True
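
Because step 4 depends on a specific `TensorFlow` release, it can be worth confirming the installed version once the setup cell has finished. The sketch below is an illustration only: `matches_release` and `installed_tensorflow_version` are hypothetical helper names that are not part of STT, and the check assumes that any `1.15.x` patch release is acceptable.

```python
from typing import Optional


def matches_release(installed: str, expected: str = "1.15") -> bool:
    """Return True if `installed` equals `expected` or is a patch release of it."""
    return installed == expected or installed.startswith(expected + ".")


def installed_tensorflow_version() -> Optional[str]:
    """Return the version of the importable TensorFlow, or None if it is missing."""
    try:
        import tensorflow as tf  # imported lazily because it may not be installed
    except ImportError:
        return None
    return tf.__version__
```

Calling `matches_release(installed_tensorflow_version() or "")` in a new cell should return `True` if the downgrade took effect.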

### 2.0 Getting the training dataset.
For training, we shall use the Single Word Target Segment dataset. This is a use-case-driven segment containing data for spoken digit recognition, yes/no detection, and wake word testing for [Firefox Voice](https://voice.mozilla.org/firefox-voice).
***
Steps:
1. Download the dataset from [Common Voice Datasets](https://commonvoice.mozilla.org/en/datasets) using `wget`.
2. Uncompress the `tar.gz` file.
***
**NOTE: This process takes some time (about 3 minutes), kindly be patient as the cell runs.**

In [None]:
# 1. Download the dataset from Common Voice Datasets.
!wget https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-5-singleword/cv-corpus-5-singleword.tar.gz

# 2. Uncompress the tar.gz file.
!tar -xvf /content/STT/cv-corpus-5-singleword.tar.gz
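
If you would like to see what the archive contains without reading the full `tar -xvf` output, the standard-library `tarfile` module can list its top-level entries. This is a minimal sketch: `top_level_entries` is a hypothetical helper, and scanning a large compressed archive this way still takes some time because every member header has to be read.

```python
import tarfile
from typing import List


def top_level_entries(archive_path: str) -> List[str]:
    """Return the sorted top-level file and folder names inside a tar archive."""
    names = set()
    # "r:*" lets tarfile detect the compression (gzip for a .tar.gz file).
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive.getmembers():
            # Ignore empty and "." path components such as a leading "./".
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if parts:
                names.add(parts[0])
    return sorted(names)
```

For example, `top_level_entries('/content/STT/cv-corpus-5-singleword.tar.gz')` lists the folder or folders that the extraction step creates.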

### 3.0 Prepare the data
In this section, the dataset is converted into the `.wav` audio format that the STT model uses: each `.mp3` file in the dataset is converted to a `.wav` file. Three `.csv` files are also created under the `clips` folder:
- `clips/train.csv` lists the training files.
- `clips/dev.csv` lists the validation files.
- `clips/test.csv` lists the test files.
***
**NOTE: This process takes quite a while (about 12 minutes) to finish, kindly be patient.**

In [None]:
# We use the prepare_data function defined in the notebook configuration.
prepare_data('/content/STT/cv-corpus-5-singleword/en')
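
Once the conversion has finished, a quick look at the generated `.csv` files shows how many clips each split contains and which words they cover. The sketch below uses only the standard library. It assumes that each file has a header row with a `transcript` column; that column name is an assumption about the importer's output, so check the header of your own files first. `summarize_split` is a hypothetical helper name.

```python
import csv
from collections import Counter


def summarize_split(csv_path, label_column="transcript"):
    """Return (row_count, Counter of labels) for one split file.

    `label_column` is assumed to hold the spoken word for each clip;
    a KeyError here means the file uses a different column name.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        labels = Counter(row[label_column] for row in reader)
    return sum(labels.values()), labels
```

For example, `summarize_split('/content/STT/cv-corpus-5-singleword/en/clips/train.csv')` returns the number of training clips together with a per-word count.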

### 4.0 Train and Evaluate the STT model.
With the data prepared, we can now train the model. STT uses the `DeepSpeech.py` script as its central training, testing, evaluation and model-exporting script. When the specified number of epochs is done, the function evaluates the trained model on the test set and reports the [`WER`](https://en.wikipedia.org/wiki/Word_error_rate) (word error rate), the [`CER`](https://rechtsprechung-im-ostseeraum.archiv.uni-greifswald.de/word-error-rate-character-error-rate-how-to-evaluate-a-model/#:~:text=The%20Word%20Error%20Rate%20%28WER,punctuations%2C%20spaces%2C%20etc.) (character error rate) and the `loss`. <br>
The `WER` normally lies between 0 (i.e. no errors) and 1 (i.e. the model did not recognize any word), although inserted words can push it above 1.
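
To make the two metrics concrete, here is a minimal sketch of how `WER` and `CER` can be computed for a single reference and prediction using the Levenshtein edit distance. It illustrates the definitions only and is not taken from the STT code base, which may aggregate errors over the whole test set differently. All function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

With single-word references such as `"yes"`, the per-clip `WER` is 0 for an exact match and 1 for a wrong word, while the `CER` gives partial credit: predicting `"yet"` for `"yes"` has a `WER` of 1 but a `CER` of 1/3.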
***
We shall use the `train_model` function defined in the notebook configuration. The function takes four arguments:
- `train`, the path to the `train.csv` file.
- `dev`, the path to the `dev.csv` file, which lists the validation files.
- `test`, the path to the `test.csv` file.
- `epochs`, the number of complete passes over the training set.
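
The four arguments map directly onto flags of the `DeepSpeech.py` script. As a rough sketch of that mapping, the hypothetical helper below builds the equivalent command as a list of strings. It covers only these four flags; the `train_model` function in the notebook configuration also passes the scorer, checkpoint and network-size flags.

```python
import shlex


def build_training_command(train, dev, test, epochs, script="DeepSpeech.py"):
    """Return the training command as an argument list (hypothetical helper)."""
    return [
        "python3", script,
        "--train_files", train,
        "--dev_files", dev,
        "--test_files", test,
        "--epochs", str(epochs),
    ]


def format_command(command):
    """Render an argument list as a single shell-quoted string for display."""
    return " ".join(shlex.quote(part) for part in command)
```

Printing `format_command(build_training_command(train, dev, test, epochs))` shows the command line that corresponds to a given set of arguments.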

The `train_model` function passes a number of flags to the `DeepSpeech.py` script. To learn more about these flags, run the cell below.

In [None]:
# Displays the full list of flags accepted by the training script.
# You could experiment with these flags by adding them to the train_model
# function defined in the notebook configuration.
!python DeepSpeech.py --helpfull

In [None]:
# The full path to the train.csv file.
train = r'/content/STT/cv-corpus-5-singleword/en/clips/train.csv'

# The full path to the dev.csv file.
dev = r'/content/STT/cv-corpus-5-singleword/en/clips/dev.csv'

# The full path to the test.csv.
test = r'/content/STT/cv-corpus-5-singleword/en/clips/test.csv'

# The number of epochs to train the model. Increasing the epochs may improve
# performance.
epochs = 30

# Finally, we start the model training process by calling the train_model
# function.
train_model(train, dev, test, epochs)

### Reporting to Atingi.
1. Record the best `WER` that you get from the model evaluation. This is the value that will be reported back to Atingi.
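
If you train several times and save the printed output of each run, the lowest overall `WER` can be picked out with a short script rather than by eye. The sketch below assumes that each evaluation summary is printed on a line of the form `Test on <file> - WER: <value>, CER: <value>, loss: <value>`. That format is an assumption, so compare it against your own output before relying on it. `best_wer` is a hypothetical helper name.

```python
import re

# Assumed summary format: "Test on <file> - WER: 0.123456, CER: ..., loss: ...".
# Anchoring on "Test on" skips any per-sample lines that also mention "WER:".
_SUMMARY_WER = re.compile(r"^Test on .* - WER:\s*([0-9]*\.?[0-9]+)", re.MULTILINE)


def best_wer(log_text):
    """Return the lowest summary WER found in `log_text`, or None if none is found."""
    values = [float(match) for match in _SUMMARY_WER.findall(log_text)]
    return min(values) if values else None
```

Pass the saved output of your runs as one string; a result of `None` means no summary line matched the assumed format.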