# Downloads and Preparations for Example Use Case

In this notebook we download and prepare the models and data required for the example use-case. 

To illustrate the end-to-end domain-specific NeMo ASR application, we start with an acoustic model pre-trained on the open-source English datasets [LibriSpeech](http://www.openslr.org/12) and [English - Mozilla Common Voice](https://voice.mozilla.org/en/datasets). We then fine-tune the pre-trained acoustic and language models on the **Wall Street Journal (WSJ)** news dataset. This example shows how transfer learning, or domain adaptation, carries a model from relatively old fiction books (LibriSpeech) to relatively modern business news (WSJ).

The steps followed in this notebook are:
1. Create a folder named `example_data` inside the `data directory` you specified when starting the container.
2. Get the `pre-trained` acoustic model from NVIDIA GPU Cloud (NGC).
3. Build a baseline language model (a 6-gram KenLM model trained on LibriSpeech).
4. Download the WSJ fine-tuned models created by this application from NGC, or alternatively obtain the WSJ dataset for *acoustic model* and *language model* training (fine-tuning), and learn how to build NeMo-ready datasets.

In [None]:
import os 
import sys

os.environ['APP_DIR']='..'
os.environ['DATA_DIR']=os.path.join(os.environ['APP_DIR'],'data')
sys.path.append(os.environ['APP_DIR'])

In [None]:
# required imports
%load_ext autoreload
%autoreload 2

import os
from tools.System.config import cfg
from tools.filetools import mkdir_p
from tools.misc import create_lm_dataset

# 1. Create `example_data` folder
We use the `example_data` folder within the data directory to host all pre-trained models and example datasets used in our use case.

In [None]:
# expected data path
print("Expected path for 'example_data' folder:", cfg.DATASET.PATHS.EXAMPLE_DATA)

# create folder
mkdir_p(cfg.DATASET.PATHS.EXAMPLE_DATA)
print("Created example_data folder at: ", cfg.DATASET.PATHS.EXAMPLE_DATA)

# 2. Download pre-trained model from NGC

In this ASR workflow we use the QuartzNet model, a high-performing yet small end-to-end neural acoustic model for automatic speech recognition. Learn more about QuartzNet here: [tutorial](https://nvidia.github.io/NeMo/asr/quartznet.html) and [paper](https://arxiv.org/pdf/1910.10261.pdf). 

### QuartzNet 15x5 for NeMo
You can find this pre-trained model inside the demo folder: `/tmp/nemo_asr_app/demo/pre-trained/`

You can also download it from:
https://ngc.nvidia.com/catalog/models/nvidia:quartznet15x5

QuartzNet is a Jasper-like network that uses separable convolutions and larger filter sizes. Its accuracy is comparable to Jasper's, with far fewer parameters. This particular model has 15 blocks, each repeated 5 times.

The QuartzNet15x5 encoder and decoder neural module checkpoints were trained with the Neural Modules (NeMo) toolkit. NVIDIA's Apex/Amp O1 optimization level was used for training on 8xV100 GPUs. The modules were trained on LibriSpeech (with ±10% speed perturbation) and the "validated" set of Mozilla's English Common Voice.

# 3.  Build Baseline Language Model
Next, we provide the command and required scripts (`build_6-gram_OpenSLR_lm.sh`) to build a baseline language model for any ASR application. This language model uses [Baidu's CTC decoder with LM implementation](https://github.com/PaddlePaddle/DeepSpeech), specifically a **6-gram KenLM model** trained on **LibriSpeech**.

**Note**: If you wish to use a language model in your ASR pipeline, you need to install the necessary software. Please refer to the Dockerfile in `/tmp/nemo_asr_app/Dockerfile` and re-build the container.

In [None]:
cmd = "! cd "+ cfg.NEMO.TOOLS + " && ./build_6-gram_OpenSLR_lm.sh " + cfg.DATASET.PATHS.EXAMPLE_DATA
print(cmd)

In [None]:
os.environ

In [None]:
! cd ../tools/NeMo && ./build_6-gram_OpenSLR_lm.sh ../data/example_data

Copy the generated command (select it, then Shift + right-click and choose Copy) and paste it into the cell below to build the KenLM model. Note that this can take some time, so we recommend running it inside the container's terminal.

In [None]:
# ! cd /home/adrianaf/projects/asr_system/nemo_asr_app/tools/NeMo && ./build_6-gram_OpenSLR_lm.sh /raid/datasets/asr/data/example_data

# 4. WSJ dataset

In this example use-case, we fine-tune the pre-trained acoustic and language models with Wall Street Journal news datasets. 
Through this use case, we show we can easily do transfer learning or domain adaptation from old fiction books (LibriSpeech) to business news (WSJ).

This dataset is part of the Linguistic Data Consortium and can be found here:
- CSR-I (WSJ0) Complete: https://catalog.ldc.upenn.edu/LDC93S6A
- CSR-II (WSJ1) Complete: https://catalog.ldc.upenn.edu/LDC94S13A

To use this dataset you must normalize the text: lowercase it, remove punctuation, and convert digits to their spelled-out form. We provide utility functions in `tools/transcript_tools.py` that can help you with dataset preparation.
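The normalization steps above can be sketched in plain Python. This is a minimal illustration only, not the implementation in `tools/transcript_tools.py`: the helper names `spell_number` and `normalize_transcript` are hypothetical, and the number handling is deliberately simple (whole numbers are read as cardinals, so years, decimals, and comma-grouped figures would need extra rules).

```python
import re

_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
         "eighty", "ninety"]


def spell_number(n):
    """Spell out a non-negative integer below one million (hypothetical helper)."""
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] + (" " + _ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return _ONES[hundreds] + " hundred" + (" " + spell_number(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return spell_number(thousands) + " thousand" + (" " + spell_number(rest) if rest else "")


def _spell_match(match):
    """Replace one run of digits; very long runs are read digit by digit."""
    digits = match.group(0)
    if len(digits) <= 6:
        return " " + spell_number(int(digits)) + " "
    return " " + " ".join(_ONES[int(d)] for d in digits) + " "


def normalize_transcript(text):
    """Lowercase, spell out digit runs, and drop punctuation."""
    text = text.lower()
    text = re.sub(r"\d+", _spell_match, text)
    # Keep only lowercase letters, apostrophes and spaces.
    text = re.sub(r"[^a-z' ]+", " ", text)
    # Collapse repeated whitespace.
    return " ".join(text.split())


print(normalize_transcript("The Dow rose 42 points in 1987."))
# the dow rose forty two points in one thousand nine hundred eighty seven
```

Keeping apostrophes is a design choice made here so that contractions such as "don't" survive; adjust the allowed character set to match the labels your acoustic model expects.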

Note: Downloading this dataset requires a license. Please refer to [LDC to learn more](https://www.ldc.upenn.edu/language-resources/data/obtaining).


## WSJ Fine-tuned Model Checkpoints

To help you walk through this example use-case, we provide you with the WSJ fine-tuned models for both acoustic and language models. These models can be found in NGC:
1. The acoustic model created in the Step 1 notebook, which fine-tunes the pre-trained acoustic model on WSJ data, can be downloaded from this [link](https://ngc.nvidia.com/models/nvidia:wsj_quartznet_15x5).
2. The language model trained on WSJ in the Step 2 notebook can be downloaded from this [link](https://ngc.nvidia.com/models/nvidia:wsj_lm_decoder).

When downloading the models, you can save these inside the folders listed below (which are automatically generated when running the Step 1 and Step 2 notebooks): 
- Acoustic model
    - `[data_dir]/models/acoustic_models/WSJ/WSJ_finetuning-lr_0.0001-bs_16-e_100-wd_0.0-opt_novograd/checkpoints/`
    - You can also find this finetuned model inside the demo folder: `/tmp/nemo_asr_app/demo/finetuned/`
- Language model
    - `[data_dir]/models/language_models/`[WS_lm.binary]
    
Alternatively, if you downloaded the WSJ data, or if you are working with your own dataset, you can follow the instructions below to create NeMo-ready datasets.
These models can also be used directly on your own data to perform inference.

## 4.1 Create NeMo ready - Acoustic Model Dataset

NeMo requires datasets in the following format:
- `wav` audio clips with the sampling rate (16000 Hz) and max clip duration (16.5 seconds) specified by the [configuration file](/tools/NeMo/example_configs/quartznet15x5.yaml).
- A `json` dataset file in which each entry has the keys `audio_filepath`, `duration` and `text`.

You can see the script used to create NeMo datasets from common_voice datasets in `tools/NeMo/create_common_voice_dataset.py`. This script can help you correctly format your audio clips and json training dataset.
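As a rough illustration of the format described above, the sketch below writes a dataset file from `(wav_path, transcript)` pairs using only the standard library. It is not the code in `create_common_voice_dataset.py`; the function names are hypothetical, and it assumes the file holds one JSON object per line and that clips are uncompressed PCM `wav` files.

```python
import json
import wave

SAMPLE_RATE = 16000   # expected sampling rate in Hz, as stated above
MAX_DURATION = 16.5   # max clip duration in seconds, as stated above


def wav_duration(path):
    """Return (duration_in_seconds, sample_rate) for a PCM wav file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return wav.getnframes() / float(rate), rate


def write_manifest(samples, manifest_path):
    """Write one JSON entry per (wav_path, transcript) pair.

    Clips with an unexpected sampling rate or a duration above
    MAX_DURATION are skipped. Returns the number of entries written.
    """
    written = 0
    with open(manifest_path, "w", encoding="utf-8") as out:
        for wav_path, text in samples:
            duration, rate = wav_duration(wav_path)
            if rate != SAMPLE_RATE or duration > MAX_DURATION:
                continue
            entry = {"audio_filepath": wav_path,
                     "duration": duration,
                     "text": text}
            out.write(json.dumps(entry) + "\n")
            written += 1
    return written
```

Skipping out-of-spec clips rather than raising an error is a choice made for this sketch; for a real dataset you may prefer to resample or log the rejected files.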

## 4.2 Create NeMo ready - Language Model Dataset

A Language Model (LM) can improve the decoder's performance by resolving ambiguities in speech-to-text transcription. More information on how an LM helps ASR systems is provided in Step 2.

The language model dataset holds the pre-processed text from the WSJ dataset in a single column and is saved as a `.txt` file. For reference, see the example LibriSpeech LM dataset, `example_data/language_model/librispeech-lm-norm.txt`, which was used to create the baseline language model. To create the WSJ language model dataset, take the pre-processed text from the WSJ training dataset and save it as a single column in a `.txt` file.
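One possible way to produce such a single-column file is to pull the `text` field out of an acoustic training dataset. The sketch below is an assumption-laden illustration: `manifest_to_lm_text` is a hypothetical helper (it is not the `create_lm_dataset` function imported earlier, whose signature is not shown here), and it assumes the `json` file stores one entry per line with already normalized text.

```python
import json


def manifest_to_lm_text(manifest_path, output_path):
    """Write the `text` of every dataset entry to output_path, one per line.

    Returns the number of lines written. Blank lines and empty
    transcripts are ignored.
    """
    count = 0
    with open(manifest_path, encoding="utf-8") as src, \
         open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            text = json.loads(line)["text"].strip()
            if text:
                dst.write(text + "\n")
                count += 1
    return count
```

For example, calling it with `wsj-train-si284-speed-0.9-1.1.json` as input and `wsj-lm-data.txt` as output would yield a file in the layout described above, though a speed-perturbed dataset may repeat each transcript, so deduplicating the lines first may be worthwhile.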

For this WSJ end-to-end use case we use 3 datasets:
1. `wsj-train-si284-speed-0.9-1.1.json` for acoustic model training (Note: Audio speed perturbation helps build better models)
2. `wsj-eval-92.json` or `wsj-dev-93.json` for model evaluation
3. `wsj-lm-data.txt` for language model training

You can use the commands below to check and correct the paths to the audio files inside the acoustic datasets.

In [None]:
#!head -n1 /raid/datasets/asr/data/example_data/wsj/wsj-dev-93.json

In [None]:
# Now replace this path with the correct path inside the container:
#!sed -i 's,/data,/raid/datasets/asr/data/example_data,g' /raid/datasets/asr/data/example_data/wsj/wsj-dev-93.json

In [None]:
# confirm change
#!head -n1 /raid/datasets/asr/data/example_data/wsj/wsj-dev-93.json
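If you prefer to fix the paths from Python instead of `sed`, a sketch along the following lines may help. It assumes the dataset file stores one JSON entry per line; `rewrite_audio_prefix` is a hypothetical helper, and unlike the global `sed` substitution it only touches a prefix at the start of `audio_filepath`.

```python
import json


def rewrite_audio_prefix(manifest_path, old_prefix, new_prefix):
    """Replace old_prefix at the start of each audio_filepath, in place.

    Returns the number of entries whose path was changed.
    """
    with open(manifest_path, encoding="utf-8") as f:
        entries = [json.loads(line) for line in f if line.strip()]

    changed = 0
    for entry in entries:
        path = entry["audio_filepath"]
        if path.startswith(old_prefix):
            entry["audio_filepath"] = new_prefix + path[len(old_prefix):]
            changed += 1

    # Rewrite the file only after every entry has been parsed successfully.
    with open(manifest_path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")
    return changed
```

Restricting the replacement to the path prefix avoids accidentally rewriting a transcript that happens to contain the same string.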

At this point you have downloaded and pre-processed the models and datasets needed to walk through our example use-case.

Your `example_data` folder now contains 2 sub-folders: 1) `wsj` with the fine-tuning data and 2) `language_model` with the baseline LM and its datasets.