# Train Adapt Optimize (TAO) Toolkit

Train Adapt Optimize (TAO) Toolkit is a Python-based AI toolkit for taking purpose-built, pre-trained AI models and customizing them with your own data.

Transfer learning carries the features learned by an existing neural network over to a new one. It is often used when creating a large training dataset is not feasible.

Developers, researchers, and software partners building intelligent vision AI apps and services can bring their own data to fine-tune pre-trained models instead of going through the hassle of training from scratch.

![Train Adapt Optimize (TAO) Toolkit](https://developer.nvidia.com/sites/default/files/akamai/TAO/tlt-tao-toolkit-bring-your-own-model-diagram.png)

The goal of this toolkit is to reduce an 80-hour workload to an 8-hour workload, which lets data scientists run considerably more train-test iterations in the same time frame.

Let's see this in action with a use case for Automatic Speech Recognition!

#### Note
1. This notebook uses the AN4 dataset by default, which is about 91 MB.
1. With the default config/spec file provided in this notebook, each weight file created during training is about 1.2 GB for speech_to_text-Jasper and about 73 MB for speech_to_text-QuartzNet.

## Automatic Speech Recognition

Automatic Speech Recognition (ASR) is often the first step in building a Conversational AI model. An ASR model converts audible speech into text. The main metric for these models is Word Error Rate (WER), which should be as low as possible. Simply put, the goal is to take an audio file and transcribe it.
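
As a rough illustration of the metric, WER can be computed as the word-level edit distance (substitutions, deletions, and insertions) between the reference and the predicted transcript, divided by the number of reference words. The sketch below is a minimal standalone version, not the implementation used by `tao speech_to_text evaluate`; the function name `word_error_rate` and the sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One of four reference words is missing from the hypothesis.
print(word_error_rate("go to the store", "go to store"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many extra words.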

In this work, we are going to discuss two models, [QuartzNet](https://arxiv.org/pdf/1910.10261.pdf) and [Jasper (Just Another SPeech Recognizer)](https://arxiv.org/abs/1904.03288), both of which are end-to-end ASR models that take in audio and produce text.

Jasper architectures consist of a repeated block structure that utilizes 1D convolutions. In a Jasper_KxR model, R sub-blocks (each consisting of a 1D convolution, batch norm, ReLU, and dropout) are grouped into a single block, which is then repeated K times. There is also one extra block at the beginning and a few more at the end that are independent of K and R. The model is trained with CTC loss.
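
To make the KxR naming concrete, the sketch below lists the convolutional sub-blocks of a Jasper_KxR model by name. It is only a bookkeeping aid, not model code. The function name `jasper_layout` and the default of three closing blocks are assumptions for illustration, since the text above only says "a few more at the end".

```python
def jasper_layout(k, r, prologue=1, epilogue=3):
    """Return the names of the convolutional sub-blocks in a Jasper_KxR model.

    `prologue` and `epilogue` count the extra blocks at the start and end;
    they do not depend on k or r. The epilogue default is an assumption.
    """
    layers = [f"prologue_{i}" for i in range(prologue)]
    for block in range(k):
        # Each of the k blocks groups r identical sub-blocks
        # (1D convolution, batch norm, ReLU, dropout).
        for sub in range(r):
            layers.append(f"block{block}_sub{sub}")
    layers += [f"epilogue_{i}" for i in range(epilogue)]
    return layers


layout = jasper_layout(k=10, r=5)
print(len(layout), layout[:3], layout[-1])
```

With these assumed defaults, a 10x5 configuration has 10 * 5 repeated sub-blocks plus the fixed blocks at either end.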

QuartzNet is a variant of Jasper whose key difference is that it uses time-channel separable 1D convolutions. This allows it to dramatically reduce the number of weights while keeping similar accuracy.
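
The weight savings can be seen with simple arithmetic. A regular 1D convolution needs `kernel * c_in * c_out` weights, while a time-channel separable convolution splits the work into a depthwise convolution over time (`kernel * c_in` weights) and a pointwise convolution over channels (`c_in * c_out` weights). The sketch below compares the two; the kernel size and channel count are illustrative values only, and biases are ignored.

```python
def conv1d_weights(kernel, c_in, c_out):
    """Weights in a regular 1D convolution (bias ignored)."""
    return kernel * c_in * c_out


def separable_conv1d_weights(kernel, c_in, c_out):
    """Weights in a time-channel separable 1D convolution (bias ignored)."""
    depthwise = kernel * c_in  # one temporal filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise


kernel, channels = 33, 256  # illustrative values
regular = conv1d_weights(kernel, channels, channels)
separable = separable_conv1d_weights(kernel, channels, channels)
print(f"regular:   {regular:,}")
print(f"separable: {separable:,}")
print(f"reduction: {regular / separable:.1f}x")
```

The reduction grows with the kernel size, which is why the separable form allows wide temporal kernels at a small weight cost.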

![QuartzNet with CTC](https://developer.nvidia.com/blog/wp-content/uploads/2020/05/quartznet-model-architecture-1-625x742.png)

## Connect to a GPU Runtime

1.   Change the runtime type to GPU: Runtime (top-left tab) -> Change Runtime Type -> GPU (Hardware Accelerator)
2.   Then click Connect (top right)



## Mounting Google Drive
Mount your Google Drive storage to this Colab instance.

In [None]:
try:
    import google.colab
    %env GOOGLE_COLAB=1
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
except ImportError:  # google.colab is only importable inside Colab
    %env GOOGLE_COLAB=0
    print("Warning: Not a Colab Environment")

## Setup Python Environment
Set up the environment necessary to run the TAO networks by running the bash script.

#### FIXME
1. COLAB_NOTEBOOKS_PATH - for Google Colab environment, set this path where you want to clone the repo to; for local system environment, set this path to the already cloned repo
1. NUM_GPUS - set this to <= the number of GPUs available on the instance
1. DATA_DIR - set this path to a folder location where you want the dataset to be stored
1. SPECS_DIR - set this path to a folder location where the configuration/spec files will be saved
1. RESULTS_DIR - set this path to a folder location where pretrained models, checkpoints and log files during different model actions will be saved

In [None]:
import os
#FIXME1
%env COLAB_NOTEBOOKS_PATH=/content/drive/MyDrive/nvidia-tao
if os.environ["GOOGLE_COLAB"] == "1":
    os.environ["bash_script"] = "setup_env.sh"
    if not os.path.exists(os.environ["COLAB_NOTEBOOKS_PATH"]):
        !git clone https://github.com/NVIDIA-AI-IOT/nvidia-tao.git $COLAB_NOTEBOOKS_PATH
else:
    os.environ["bash_script"] = "setup_env_desktop.sh"
    if not os.path.exists(os.environ["COLAB_NOTEBOOKS_PATH"]):
        raise FileNotFoundError("COLAB_NOTEBOOKS_PATH does not exist; set it to the path of the already cloned repo")

!sed -i "s|PATH_TO_COLAB_NOTEBOOKS|$COLAB_NOTEBOOKS_PATH|g" $COLAB_NOTEBOOKS_PATH/pytorch/$bash_script
!sh $COLAB_NOTEBOOKS_PATH/pytorch/$bash_script

---
## Let's Dig in: ASR using TAO

---
### Set Relevant Paths
Set these paths according to your environment.

In [None]:
%env TAO_DOCKER_DISABLE=1

# NOTE: The following paths are set from the perspective of the TAO Docker. 

# The data is saved here
#FIXME2
%env DATA_DIR=/data/asr
!sudo mkdir -p $DATA_DIR && sudo chmod -R 777 $DATA_DIR

# The configuration files are stored here
#FIXME3
%env SPECS_DIR=/specs/asr
!sudo mkdir -p $SPECS_DIR && sudo chmod -R 777 $SPECS_DIR

# The results are saved at this path
#FIXME4
%env RESULTS_DIR=/results/asr
!sudo mkdir -p $RESULTS_DIR && sudo chmod -R 777 $RESULTS_DIR

# Set your encryption key, and use the same key for all commands
%env KEY=tlt_encode

Now that everything is set up, let's take a moment to explain the TAO interface. The command structure can be broken down as follows: `tao <task name> <subcommand>`

Let's see this in further detail.


### Downloading Specs
TAO's Conversational AI Toolkit works off of spec files, which make it easy to edit hyperparameters on the fly. You can modify or rewrite these specs, or override individual values through the launcher. Download the default spec files with the `download_specs` command. <br>

The `-o` argument specifies the folder where the default specification files will be downloaded, and `-r` tells the script where to save the logs. **Make sure `-o` points to an empty folder!**
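
If you want to check the target folder before downloading, a small helper like the one below can do it. This is a minimal sketch: the name `is_empty_or_missing` is hypothetical, and treating a folder that does not exist yet as acceptable is an assumption about how `download_specs` behaves.

```python
import os
import tempfile


def is_empty_or_missing(path):
    """Return True if `path` does not exist yet or is a directory with no entries."""
    if not os.path.exists(path):
        return True
    return os.path.isdir(path) and not os.listdir(path)


# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    print(is_empty_or_missing(tmp))  # freshly created directory
    with open(os.path.join(tmp, "old_spec.yaml"), "w") as f:
        f.write("placeholder\n")
    print(is_empty_or_missing(tmp))  # directory now holds a file
```

In this notebook, the path to check would be `os.path.join(os.environ["SPECS_DIR"], "speech_to_text")`.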

In [None]:
!tao speech_to_text download_specs \
    -r $RESULTS_DIR/speech_to_text \
    -o $SPECS_DIR/speech_to_text

### Download Data

For the purposes of demonstration we will use the popular AN4 dataset. Let's download it.

In [None]:
! wget https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz  # for the original source, please visit http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz

After downloading, untar the dataset, and move it to the correct directory.

In [None]:
! tar -xvf an4_sphere.tar.gz 
! mv an4 $DATA_DIR

### Pre-Processing

This step converts the SPHERE (.sph) audio files shipped in the AN4 archive into wav files and splits the data into training and testing sets. It also generates a "meta-data" (manifest) file to be consumed by the dataloader for training and testing.
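
To give a feel for what such a manifest looks like, the sketch below writes and reads a small file in the JSON-lines layout commonly used for NeMo-style ASR manifests: one JSON object per line with `audio_filepath`, `duration`, and `text` fields. That the files produced by `dataset_convert` follow exactly this layout is an assumption, and the file names, durations, and transcripts here are made up.

```python
import json
import os
import tempfile


def write_manifest(path, entries):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")


def read_manifest(path):
    """Read a JSON-lines manifest back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical entries, for illustration only.
samples = [
    {"audio_filepath": "wavs/example-1.wav", "duration": 2.5, "text": "hello world"},
    {"audio_filepath": "wavs/example-2.wav", "duration": 1.8, "text": "one two three"},
]

with tempfile.TemporaryDirectory() as tmp:
    manifest_path = os.path.join(tmp, "train_manifest.json")
    write_manifest(manifest_path, samples)
    loaded = read_manifest(manifest_path)

total = sum(entry["duration"] for entry in loaded)
print(f"{len(loaded)} utterances, {total:.1f} s of audio")
```

The same `read_manifest` helper can be pointed at the generated manifest files to inspect how many utterances each split contains.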

In [None]:
!tao speech_to_text dataset_convert \
    -e $SPECS_DIR/speech_to_text/dataset_convert_an4.yaml \
    -r $RESULTS_DIR/quartznet/dataset_convert \
    source_data_dir=$DATA_DIR/an4 \
    target_data_dir=$DATA_DIR/an4_converted

Let's listen to a sample audio file.

In [None]:
# Change the path of the file here
import os
import IPython.display as ipd
path = os.path.join(os.environ["DATA_DIR"], 'an4_converted/wavs/an268-mbmg-b.wav')
ipd.Audio(path)

As mentioned earlier, we cover two models: QuartzNet and Jasper. The training commands for both are similar. Let's have a look!

### Training 

TAO provides an interface that lets the end user configure training parameters from the command line. <br>

Opening the training script, finding the parameters of interest (which might be spread across multiple files), making the needed changes, and double-checking everything is replaced by a much easier to use and more visible command line interface.

For instance, to change both the number of epochs and the learning rate, the user can add `trainer.max_epochs=10` and `optim.lr=0.02` to the command and train the model. Sample commands are given below.


<b>A list of some of the customizable parameters along with their default values is as follows:</b>

trainer:
<ul>  
  <li>gpus: 1 </li>
  <li>num_nodes: 1 </li>
  <li>max_epochs: 5 </li>
  <li>max_steps: null </li>
  <li>checkpoint_callback: false </li>
</ul>

training_ds:
<ul>  
  <li>sample_rate: 16000 </li>
  <li>labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
           "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"] </li>
  <li>batch_size: 32 </li>
  <li>trim_silence: true </li>
  <li>max_duration: 16.7 </li>
  <li>shuffle: true </li>
  <li>is_tarred: false </li>
  <li>tarred_audio_filepaths: null </li>
</ul>  

validation_ds:
<ul>  
  <li>sample_rate: 16000 </li>
  <li>labels: [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
           "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"] </li>
  <li>batch_size: 32 </li>
  <li>shuffle: false </li>
</ul>

optim:
<ul>
  <li>name: novograd </li>
  <li>lr: 0.01 </li>
  <li>betas: [0.8, 0.5] </li>
  <li>weight_decay: 0.001 </li>
</ul>
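
The dotted override syntax maps naturally onto the nested sections listed above: `trainer.max_epochs=10` sets `max_epochs` inside `trainer`. The sketch below mimics that idea on a plain dictionary. It is not how TAO parses overrides internally; `apply_overrides` and `parse_value` are hypothetical helpers, and only a few of the defaults are reproduced.

```python
import copy
import json


def parse_value(raw):
    """Interpret numbers, booleans, and null; fall back to the raw string."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return raw


def apply_overrides(spec, overrides):
    """Return a copy of `spec` with `section.key=value` overrides applied."""
    result = copy.deepcopy(spec)
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Create intermediate sections that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return result


defaults = {
    "trainer": {"gpus": 1, "max_epochs": 5},
    "optim": {"name": "novograd", "lr": 0.01},
}
updated = apply_overrides(defaults, ["trainer.max_epochs=10", "optim.lr=0.02"])
print(updated["trainer"]["max_epochs"], updated["optim"]["lr"])
```

The original `defaults` dictionary is left untouched, which mirrors the way command-line overrides leave the spec file on disk unchanged.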

The steps below might take considerable time depending on the GPU being used. For the best experience, we recommend using an A100 GPU.

To train an ASR model in TAO, we use the `tao speech_to_text train` command with the following arguments:
<ul>
    <li> <b>-e</b> : Path to the spec file </li>
    <li> <b>-g</b> : Number of GPUs to use </li>
    <li> <b>-r</b> : Path to the results folder </li>
    <li> <b>-m</b> : Path to the model </li>
    <li> <b>-k</b> : User specified encryption key to use while saving/loading the model </li>
    <li> Any overrides to the spec file, e.g., trainer.max_epochs </li>
</ul>
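
When the same command is launched many times with different overrides, it can help to assemble it programmatically. The sketch below builds the argument list from the flags described above and prints a shell-ready string. `build_tao_command` is a hypothetical helper, not part of TAO, and the paths are the default locations used earlier in this notebook.

```python
import shlex


def build_tao_command(task, subcommand, spec, results_dir, key,
                      gpus=1, model=None, overrides=None):
    """Assemble a `tao <task name> <subcommand>` call as an argument list."""
    cmd = ["tao", task, subcommand,
           "-e", spec,
           "-g", str(gpus),
           "-k", key,
           "-r", results_dir]
    if model is not None:
        cmd += ["-m", model]
    # Spec overrides are passed as plain `name=value` arguments.
    for name, value in (overrides or {}).items():
        cmd.append(f"{name}={value}")
    return cmd


cmd = build_tao_command(
    "speech_to_text", "train",
    spec="/specs/asr/speech_to_text/train_quartznet.yaml",
    results_dir="/results/asr/quartznet/train",
    key="tlt_encode",
    overrides={"trainer.max_epochs": 10, "optim.lr": 0.02},
)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems.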

#### Training QuartzNet 15x5

In [None]:
!tao speech_to_text train \
     -e $SPECS_DIR/speech_to_text/train_quartznet.yaml \
     -g 1 \
     -k $KEY \
     -r $RESULTS_DIR/quartznet/train \
     training_ds.manifest_filepath=$DATA_DIR/an4_converted/train_manifest.json \
     validation_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json \
     trainer.max_epochs=1 \
     training_ds.num_workers=4 \
     validation_ds.num_workers=4

#### Training Jasper 10x5

In [None]:
!tao speech_to_text train \
     -e $SPECS_DIR/speech_to_text/train_jasper.yaml \
     -g 1 \
     -k $KEY \
     -r $RESULTS_DIR/jasper/train \
     training_ds.manifest_filepath=$DATA_DIR/an4_converted/train_manifest.json \
     validation_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json \
     trainer.max_epochs=1 \
     training_ds.num_workers=4 \
     validation_ds.num_workers=4

### ASR evaluation

Now that we have a model trained, we need to check how well it performs.

In [None]:
!tao speech_to_text evaluate \
     -e $SPECS_DIR/speech_to_text/evaluate.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/quartznet/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/quartznet/evaluate \
     test_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json

### ASR finetuning

Once the model is trained and evaluated, the following command can be used to fine-tune it if needed. This step can also be used for transfer learning by making changes in the `train.json` and `dev.json` files to add new data.

The list of customizable parameters is the same as for training, except for parameters that affect the model architecture. Also, `finetuning_ds` is used instead of `training_ds`.

Note: If you wish to proceed with a pre-trained model for better inference results, you can find a .nemo model [here](https://ngc.nvidia.com/catalog/collections/nvidia:nemotrainingframework).

Simply rename the .nemo file to .tlt and pass it through the finetune pipeline.
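
The rename can also be scripted. The sketch below copies the file under the new extension instead of renaming it in place, so the downloaded original is preserved; that is a design choice of this example, not a TAO requirement. `copy_nemo_as_tlt` is a hypothetical helper and the file used in the demonstration is a placeholder, not a real model.

```python
import shutil
import tempfile
from pathlib import Path


def copy_nemo_as_tlt(nemo_path, dest_dir=None):
    """Copy a .nemo file to the same name with a .tlt extension."""
    src = Path(nemo_path)
    if src.suffix != ".nemo":
        raise ValueError(f"expected a .nemo file, got {src.name}")
    target_dir = Path(dest_dir) if dest_dir is not None else src.parent
    target = target_dir / src.with_suffix(".tlt").name
    shutil.copyfile(src, target)
    return target


# Demonstration with a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    nemo_file = Path(tmp) / "pretrained.nemo"
    nemo_file.write_bytes(b"placeholder")
    tlt_file = copy_nemo_as_tlt(nemo_file)
    print(tlt_file.name, tlt_file.exists())
```

The resulting .tlt path is what would be passed to the `-m` argument of the finetune command.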

**Note: The finetune spec file is configured to adapt the English model we just trained to Russian. If you wish to proceed with English, please make the changes in the spec file *finetune.yaml*, which you can find in the SPECS_DIR folder you set earlier. Be sure to delete older finetuning checkpoints if you change the language after finetuning with the spec as is.**

In [None]:
!tao speech_to_text finetune \
     -e $SPECS_DIR/speech_to_text/finetune.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/quartznet/train/checkpoints/trained-model.tlt \
     -r $RESULTS_DIR/quartznet/finetune \
     finetuning_ds.manifest_filepath=$DATA_DIR/an4_converted/train_manifest.json \
     validation_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json \
     trainer.max_epochs=1 \
     finetuning_ds.num_workers=20 \
     validation_ds.num_workers=20 \
     trainer.gpus=1

## What's Next?

You could use TAO to build custom models for your own applications, or you could deploy the custom model to NVIDIA Riva!