<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/riva_asr_asr-python-advanced-finetune-am-citrinet-tao-finetuning/nvidia_logo.png" style="width: 90px; float: right;">

# How to fine-tune a Riva ASR Acoustic Model (Citrinet) with TAO Toolkit
This tutorial walks you through how to fine-tune a Riva ASR acoustic model (Citrinet) with TAO Toolkit.

## Overview

In this tutorial, we are going to discuss the Citrinet model, which is an end-to-end ASR model that takes in audio and produces text.

Citrinet is a descendant of QuartzNet that adds squeeze-and-excitation (SE) blocks and sub-word tokenization, and it has better accuracy and performance than QuartzNet.

![CitriNet with CTC](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/_images/citrinet_vertical.png)

---
## ASR using TAO

The TAO launcher uses Docker containers under the hood, and **for our data and results directories to be visible to Docker, they need to be mapped**. The launcher can be configured using the config file `~/.tao_mounts.json`. Apart from the mounts, you can also configure additional options such as environment variables and the amount of shared memory available to the TAO launcher. <br>

`IMPORTANT NOTE:` The following code creates a sample `~/.tao_mounts.json` file that maps the directories in which we save the data, specs, results, and cache. Configure it for your specific use case so that these directories are correctly visible to the Docker container.

In [None]:
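# Working directory for this tutorial.
# An absolute path is used because Docker bind mounts expect absolute host paths.
import os
WORKING_DIR = os.path.abspath('asr_am_finetuning')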

# Defining paths on the local host machine
%env HOST_DATA_DIR = {WORKING_DIR}/data
%env HOST_SPECS_DIR = {WORKING_DIR}/specs
%env HOST_RESULTS_DIR = {WORKING_DIR}/results

In [None]:
! mkdir -p $WORKING_DIR
! mkdir -p $HOST_DATA_DIR
! mkdir -p $HOST_SPECS_DIR
! mkdir -p $HOST_RESULTS_DIR

In [None]:
# Map the local directories into the TAO Docker container.
import json
import os
mounts_file = os.path.expanduser("~/.tao_mounts.json")
tlt_configs = {
   "Mounts":[
       {
           "source": os.environ["HOST_DATA_DIR"],
           "destination": "/data"
       },
       {
           "source": os.environ["HOST_SPECS_DIR"],
           "destination": "/specs"
       },
       {
           "source": os.environ["HOST_RESULTS_DIR"],
           "destination": "/results"
       },
       {
           "source": os.path.expanduser("~/.cache"),
           "destination": "/root/.cache"
       }
   ],
   "DockerOptions": {
        "shm_size": "128G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
         }
   }
}
# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(tlt_configs, mfile, indent=4)

In [None]:
!cat ~/.tao_mounts.json
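
Before launching any TAO task, it can help to confirm that every mount source actually exists on the host. The following is a minimal sketch, not part of the TAO launcher: `check_mounts` is a hypothetical helper, and it assumes the mounts file has the `Mounts` / `source` / `destination` layout written above.

```python
import json
import os


def check_mounts(mounts_path):
    """Return a list of human-readable problems found in a TAO mounts file."""
    with open(os.path.expanduser(mounts_path)) as f:
        config = json.load(f)

    problems = []
    for mount in config.get("Mounts", []):
        source = mount.get("source", "")
        destination = mount.get("destination", "")
        # Assumption: bind-mount sources should be absolute, existing directories.
        if not os.path.isabs(source):
            problems.append(f"source is not an absolute path: {source!r}")
        elif not os.path.isdir(source):
            problems.append(f"source directory does not exist: {source!r}")
        # Paths inside the container are expected to be absolute as well.
        if not destination.startswith("/"):
            problems.append(f"destination is not an absolute path: {destination!r}")
    return problems
```

Calling `check_mounts("~/.tao_mounts.json")` returns an empty list when nothing looks wrong; any returned strings point at the mount entry to fix.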

You can check the Docker image versions and the tasks they perform by issuing `tao --help` or:

In [None]:
! tao info --verbose

### Set Relevant Paths

In [None]:
# NOTE: The following paths are set from the perspective of the TAO Docker.

# The data is saved here:
DATA_DIR = "/data"
SPECS_DIR = "/specs"
RESULTS_DIR = "/results"

# Set the encryption key and use the same key for all commands.
KEY = 'tlt_encode'
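
Because these paths are written from the container's point of view, it is easy to lose track of where a file lands on the host. The sketch below shows one way to translate a container path back to a host path using the mounts list. `to_host_path` and the `/home/user/...` sources are hypothetical and only illustrate the mapping; they are not part of TAO.

```python
import posixpath


def to_host_path(container_path, mounts):
    """Map a path inside the container back to the host, using the mounts list."""
    # Prefer the longest matching destination so nested mounts resolve correctly.
    for mount in sorted(mounts, key=lambda m: len(m["destination"]), reverse=True):
        destination = mount["destination"].rstrip("/")
        if container_path == destination or container_path.startswith(destination + "/"):
            relative = container_path[len(destination):].lstrip("/")
            return posixpath.join(mount["source"], relative) if relative else mount["source"]
    raise ValueError(f"{container_path!r} is not under any mounted destination")


# Example mounts mirroring this tutorial's layout (host paths are made up).
example_mounts = [
    {"source": "/home/user/asr_am_finetuning/data", "destination": "/data"},
    {"source": "/home/user/asr_am_finetuning/results", "destination": "/results"},
]
print(to_host_path("/results/citrinet/riva/asr-model.riva", example_mounts))
# prints /home/user/asr_am_finetuning/results/citrinet/riva/asr-model.riva
```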

The command structure for the TAO interface can be broken down as follows: `tao <task name> <subcommand>` <br> 

Let's see this in further detail.


### Downloading Specs
TAO's conversational AI toolkit is driven by spec files, which make it easy to edit hyperparameters on the fly. You may modify or rewrite these specs, or override individual values through the launcher. You can download the default spec files by using the `download_specs` command.<br>

The `-o` argument indicates the folder where the default specification files will be downloaded. The `-r` argument instructs the script on where to save the logs. **Ensure the `-o` points to an empty folder.**

In [None]:
# Delete the specs directory if it already exists, because `-o` must point to an empty folder.
! rm -rf $HOST_SPECS_DIR/speech_to_text_citrinet
! tao speech_to_text_citrinet download_specs \
    -r $RESULTS_DIR/speech_to_text_citrinet \
    -o $SPECS_DIR/speech_to_text_citrinet

### Download Data

In this tutorial we will use the popular AN4 dataset. Let's download it.

In [None]:
! wget https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz  # for the original source, please visit http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz

After downloading, untar the dataset and move it to the correct directory.

In [None]:
! tar -xvf an4_sphere.tar.gz 
! mv an4 $HOST_DATA_DIR

### Pre-Processing

This step converts the `.sph` (NIST SPHERE) files into `.wav` files and splits the data into training and testing sets. It also generates a manifest ("meta-data") file for each set, which the data loader consumes during training and testing.

In [None]:
! tao speech_to_text_citrinet dataset_convert \
    -e $SPECS_DIR/speech_to_text_citrinet/dataset_convert_an4.yaml \
    -r $RESULTS_DIR/citrinet/dataset_convert \
    source_data_dir=$DATA_DIR/an4 \
    target_data_dir=$DATA_DIR/an4_converted
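
To sanity-check the conversion, you can summarize a generated manifest. This is a minimal sketch that assumes the manifest is a JSON-lines file in which each line is an object with a `duration` field in seconds (the layout commonly used by NeMo-style manifests). `summarize_manifest` is a hypothetical helper, and the field name is an assumption you should verify against your own file.

```python
import json


def summarize_manifest(manifest_path):
    """Count utterances and total audio duration in a JSON-lines manifest."""
    num_utterances = 0
    total_seconds = 0.0
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            entry = json.loads(line)
            num_utterances += 1
            # Assumption: each entry stores its length in seconds under "duration".
            total_seconds += float(entry.get("duration", 0.0))
    return num_utterances, total_seconds
```

On the host, the training manifest from this tutorial would be read from `os.environ["HOST_DATA_DIR"] + "/an4_converted/train_manifest.json"`.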

Let's listen to a sample audio file.

In [None]:
# change path of the file here
import os
import IPython.display as ipd
path = os.environ["HOST_DATA_DIR"] + '/an4_converted/wavs/an268-mbmg-b.wav'
ipd.Audio(path)
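
The fine-tuning spec later in this tutorial sets `sample_rate: 16000`, so it can be worth checking that the converted audio matches. The sketch below uses Python's standard `wave` module; `describe_wav` is a hypothetical helper and assumes the file is uncompressed PCM `.wav`.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM .wav file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        # Duration is the number of frames divided by frames per second.
        duration = wav_file.getnframes() / float(sample_rate)
    return sample_rate, channels, duration
```

For example, `describe_wav(path)` with the `path` from the cell above returns the sample rate, channel count, and duration of that utterance.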

### Finetuning 

#### Create Tokenizer

Before we can do the actual finetuning, we need to pre-process the text. This step, called subword tokenization, builds a subword vocabulary from the training transcripts. In Citrinet, a subword can be one or more characters. We use the `create_tokenizer` command to create the tokenizer, which generates the subword vocabulary used in training.

In [None]:
!tao speech_to_text_citrinet create_tokenizer \
-e $SPECS_DIR/speech_to_text_citrinet/create_tokenizer.yaml \
-r $RESULTS_DIR/citrinet/create_tokenizer \
manifests=$DATA_DIR/an4_converted/train_manifest.json \
output_root=$DATA_DIR/an4 \
vocab_size=32
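
To make the idea of subwords concrete, here is a toy illustration that splits words into the longest pieces found in a small vocabulary. It is a simplified greedy longest-match scheme with a made-up vocabulary, not the algorithm that `create_tokenizer` runs (the output directory name used later, `tokenizer_spe_unigram_v32`, suggests a SentencePiece unigram model).

```python
def greedy_subword_tokenize(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, falling back to a single character.
        for end in range(len(word), start, -1):
            candidate = word[start:end]
            if candidate in vocab or end == start + 1:
                pieces.append(candidate)
                start = end
                break
    return pieces


# Hypothetical vocabulary, chosen only for this illustration.
toy_vocab = {"t", "h", "e", "r", "th", "ee", "ir", "ty", "three", "thir"}
print(greedy_subword_tokenize("thirty", toy_vocab))  # prints ['thir', 'ty']
print(greedy_subword_tokenize("three", toy_vocab))   # prints ['three']
```

A larger `vocab_size` lets frequent words become single tokens, while a small one (such as the 32 used above) keeps pieces close to individual characters.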

For finetuning an ASR Citrinet model in TAO, we use the `tao speech_to_text_citrinet finetune` command with the following arguments:
<ul>
    <li><code>-e</code>: Path to the spec file </li>
    <li><code>-g</code>: Number of GPUs to use </li>
    <li><code>-r</code>: Path to the results folder </li>
    <li><code>-m</code>: Path to the model </li>
    <li><code>-k</code>: User-specified encryption key to use while saving/loading the model </li>
    <li>Any overrides to the spec file. For example, <code>trainer.max_epochs</code>. </li>
</ul>
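
The arguments above can be assembled programmatically, which is handy when you run the same task with different overrides. This is a sketch only: `build_tao_command` is a hypothetical helper that mirrors the flags listed above and simply builds an argument list. It does not run or validate anything.

```python
import shlex


def build_tao_command(task, subcommand, spec, results_dir, model=None,
                      key=None, gpus=None, overrides=None):
    """Assemble a `tao <task> <subcommand>` command line as a list of arguments."""
    command = ["tao", task, subcommand, "-e", spec, "-r", results_dir]
    if gpus is not None:
        command += ["-g", str(gpus)]
    if key is not None:
        command += ["-k", key]
    if model is not None:
        command += ["-m", model]
    # Spec overrides are passed as trailing key=value pairs.
    for name, value in (overrides or {}).items():
        command.append(f"{name}={value}")
    return command


command = build_tao_command(
    "speech_to_text_citrinet", "finetune",
    spec="/specs/finetune_en.yaml",
    results_dir="/results/citrinet/finetune",
    model="/results/speechtotext_en_us_citrinet_vtrainable_v3.0/speechtotext_en_us_citrinet.tlt",
    key="tlt_encode",
    gpus=1,
    overrides={"trainer.max_epochs": 5},
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command)` if you prefer launching TAO from Python rather than with the `!` shell escape.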

Now that we have the data and the tokenizer ready, let's download the pre-trained Citrinet checkpoint that we will use for finetuning. We will download the ASR model [Citrinet-1024](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/speechtotext_en_us_citrinet), which is used in the Riva ASR speech skill.


In [None]:
! ngc registry model download-version "nvidia/tao/speechtotext_en_us_citrinet:trainable_v3.0"
! mv speechtotext_en_us_citrinet_vtrainable_v3.0/ $HOST_RESULTS_DIR/

Note: The default fine-tune spec file (`$SPECS_DIR/finetune.yaml`) contains settings for fine-tuning the English model that we just downloaded to the Russian language. Because AN4 is an English ASR dataset, we will use a modified version of that spec file to fine-tune the model for English.

Here is the minimal spec file that we will use for finetuning.

In [None]:
%%writefile finetune_en.yaml

# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
# TLT spec file for fine-tuning a previously trained CTC-based ASR model on the AN4 English dataset.

trainer:
  max_epochs: 3   # This is low for demo purposes

tlt_checkpoint_interval: 1

# Whether or not to change the decoder vocabulary.
# Note that this MUST be set if the labels change, e.g. to a different language's character set
# or if additional punctuation characters are added.
change_vocabulary: false

tokenizer:
  dir: ???
  type: "bpe"  # Can be either bpe or wpe

# Fine-tuning settings: training dataset
finetuning_ds:
  manifest_filepath: ???
  sample_rate: 16000
  batch_size: 32
  trim_silence: true
  max_duration: 16.7
  shuffle: true
  is_tarred: false
  tarred_audio_filepaths: null

# Fine-tuning settings: validation dataset
validation_ds:
  manifest_filepath: ???
  sample_rate: 16000
  batch_size: 32
  shuffle: false

# Fine-tuning settings: optimizer
optim:
  name: novograd
  lr: 0.001

In [None]:
# Moving the above created specs file
!mv finetune_en.yaml $HOST_SPECS_DIR/
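
In the spec above, `???` follows the OmegaConf convention for a mandatory value that has no default, so each such key needs to be supplied as a command-line override. The sketch below lists those keys for a nested dictionary. `find_missing_values` is a hypothetical helper, and the `spec` dictionary is a hand-written, trimmed mirror of the YAML rather than something parsed from the file.

```python
def find_missing_values(config, prefix=""):
    """Return dotted keys whose value is the '???' placeholder."""
    missing = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted path.
            missing.extend(find_missing_values(value, dotted))
        elif value == "???":
            missing.append(dotted)
    return missing


# Trimmed, hand-written mirror of the spec above (not parsed from the YAML file).
spec = {
    "trainer": {"max_epochs": 3},
    "tokenizer": {"dir": "???", "type": "bpe"},
    "finetuning_ds": {"manifest_filepath": "???", "batch_size": 32},
    "validation_ds": {"manifest_filepath": "???", "batch_size": 32},
}
print(find_missing_values(spec))
# prints ['tokenizer.dir', 'finetuning_ds.manifest_filepath', 'validation_ds.manifest_filepath']
```

These three dotted keys are the same ones passed as overrides in the `finetune` command that follows.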

In [None]:
!tao speech_to_text_citrinet finetune \
     -e $SPECS_DIR/finetune_en.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/speechtotext_en_us_citrinet_vtrainable_v3.0/speechtotext_en_us_citrinet.tlt \
     -r $RESULTS_DIR/citrinet/finetune \
     finetuning_ds.manifest_filepath=$DATA_DIR/an4_converted/train_manifest.json \
     validation_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json \
     trainer.max_epochs=5 \
     finetuning_ds.num_workers=20 \
     validation_ds.num_workers=20 \
     tokenizer.dir=$DATA_DIR/an4/tokenizer_spe_unigram_v32

### ASR evaluation

Now that we have a model trained, we need to check how well it performs.

In [None]:
!tao speech_to_text_citrinet evaluate \
     -e $SPECS_DIR/speech_to_text_citrinet/evaluate.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/finetune/checkpoints/finetuned-model.tlt \
     -r $RESULTS_DIR/citrinet/evaluate \
     test_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json
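
ASR quality is commonly summarized as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The following is a minimal, self-contained sketch of that calculation for a single utterance. It is an illustration only and not the implementation used by the `evaluate` command.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


print(word_error_rate("one two three four", "one too three"))  # prints 0.5
```

In the example, one substitution ("two" to "too") and one deletion ("four") against four reference words give a WER of 0.5.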

### ASR model export

With TAO, you can also export your model in a format that can be deployed using NVIDIA Riva, a highly performant application framework for multi-modal conversational AI services using GPUs. The same command used for exporting to ONNX works here. The only difference is the value of `export_format` in the spec file.

#### Export to Riva

In [None]:
!tao speech_to_text_citrinet export \
     -e $SPECS_DIR/speech_to_text_citrinet/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/finetune/checkpoints/finetuned-model.tlt \
     -r $RESULTS_DIR/citrinet/riva \
     export_format=RIVA \
     export_to=asr-model.riva

#### Export to ONNX (Note: Export to ONNX is not needed for Riva)

In [None]:
!tao speech_to_text_citrinet export \
     -e $SPECS_DIR/speech_to_text_citrinet/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/finetune/checkpoints/finetuned-model.tlt \
     -r $RESULTS_DIR/citrinet/export \
     export_format=ONNX

### ASR Inference using TLT checkpoint

#### ASR Inference with TAO Toolkit

In this section, we run inference on the `.tlt` checkpoint with TAO Toolkit.
For real-time inference and the best latency, the model needs to be deployed on Riva, which is covered in the next tutorial.

In [None]:
!tao speech_to_text_citrinet infer \
     -e $SPECS_DIR/speech_to_text_citrinet/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/citrinet/finetune/checkpoints/finetuned-model.tlt \
     -r $RESULTS_DIR/citrinet/infer \
     file_paths=[$DATA_DIR/an4_converted/wavs/an268-mbmg-b.wav]

You can upload your recorded `.wav` file and provide its path to the `file_paths` argument in the cell above to get the transcribed speech.

## What's Next?

Now that we've fine-tuned the Citrinet acoustic model, we can deploy this custom model to NVIDIA Riva.

Make sure to keep the path of `asr-model.riva` handy for deployment. Because `$HOST_RESULTS_DIR` is mounted as `/results` in the container, the file is at `$HOST_RESULTS_DIR/citrinet/riva/asr-model.riva` on the host.