# N-gram Language Modelling using Transfer Learning Toolkit

*Transfer learning* is the process of transferring learned features from one application to another. It is a commonly used training technique in which you take a model trained on one task and retrain it for a different task.

**Train Adapt Optimize (TAO) Toolkit** is a simple and easy-to-use Python based AI toolkit for taking purpose-built AI models and customizing them with users' own data. Developers, researchers and software partners building Conversational AI and Vision AI can leverage TAO Toolkit to avoid the hassle of training from scratch, and significantly accelerate their workflow.

<center><img src="https://developer.nvidia.com/sites/default/files/akamai/TAO/tlt-tao-toolkit-bring-your-own-model-diagram.png"></center>

## Learning Objectives
In this notebook, you will learn how to leverage the simplicity and convenience of TAO Toolkit to:
- Pre-process/convert a dataset for the [**Language Modelling**](#isc-task-description) task.
- [**Train/Finetune**](#isc-training) an [N-gram language model](https://web.stanford.edu/~jurafsky/slp3/3.pdf) on the [Librispeech LM Normalized](https://www.openslr.org/11/) dataset
- Run [**Inference**](#isc-inference) and [**Evaluate**](#evaluation) our trained model on the [Librispeech dev-clean](https://www.openslr.org/12/) dataset
- [**Export**](#isc-export-riva) in a format suitable for deployment in [Riva](https://developer.nvidia.com/riva).

The earlier sections of the notebook briefly introduce the N-gram language modelling task and the datasets used to train and evaluate our N-gram language model. If you are already familiar with these and want to jump right in, you can start at the section on [Data Preparation](#isc-prepare-data).

#### Note
1. This notebook uses the Librispeech LM dataset by default, which is about 4.6 GB.
1. With the default config/spec file provided in this notebook, each n_gram weight file created during training is about 7 MB.

## Connect to a GPU Runtime

1.   Change the runtime type to GPU: Runtime (top-left menu) -> Change Runtime Type -> GPU (Hardware Accelerator)
2.   Then click Connect (top right)


## Mounting Google Drive
Mount your Google Drive storage to this Colab instance.

In [None]:
try:
    import google.colab
    %env GOOGLE_COLAB=1
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
except ImportError:
    %env GOOGLE_COLAB=0
    print("Warning: Not a Colab Environment")

## Setup Python Environment
Set up the environment needed to run the TAO networks by running the bash script.

#### FIXME
1. COLAB_NOTEBOOKS_PATH - for Google Colab environment, set this path where you want to clone the repo to; for local system environment, set this path to the already cloned repo
1. DATA_DIR - set this path to a folder location where you want the dataset to be stored
1. SPECS_DIR - set this path to a folder location where the configuration/spec files will be saved
1. RESULTS_DIR - set this path to a folder location where pretrained models, checkpoints and log files during different model actions will be saved

In [None]:
import os
#FIXME1
%env COLAB_NOTEBOOKS_PATH=/content/drive/MyDrive
if os.environ["GOOGLE_COLAB"] == "1":
    if not os.path.exists(os.path.join(os.environ["COLAB_NOTEBOOKS_PATH"],"nvidia-tao")):
        !git clone https://github.com/NVIDIA-AI-IOT/nvidia-tao.git $COLAB_NOTEBOOKS_PATH
else:
    if not os.path.exists(os.environ["COLAB_NOTEBOOKS_PATH"]):
        raise Exception("Error, enter the path of the colab notebooks repo correctly")

!sed -i "s|PATH_TO_COLAB_NOTEBOOKS|$COLAB_NOTEBOOKS_PATH|g" $COLAB_NOTEBOOKS_PATH/pytorch/$bash_script
!sh $COLAB_NOTEBOOKS_PATH/pytorch/$bash_script


---
<a id='isc-task-description'></a>
## Language Modelling

### Task Description

A language model assigns a probability to a sequence of words. It can also estimate how likely a given word (or sequence of words) is to follow a preceding sequence of words. <br>

> The sentence:  **all of a sudden I notice three guys standing on the sidewalk**
> would be scored higher than 
> the sentence: **on guys all I of notice sidewalk three a sudden standing the** by the language model. <br>
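
Formally, by the chain rule of probability, the probability of a sequence of $m$ words factors into per-word conditional probabilities (the symbols $w_i$ and $m$ are notation introduced here for illustration):

$$P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})$$

Each factor is the probability of the next word given the words before it, which is the second quantity described above.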

A language model trained on a large corpus can significantly improve the accuracy of an Automatic Speech Recognition system, as many recent studies suggest.

### N-gram Language Model
There are primarily two types of language models:

- **N-gram Language Models**: These models use frequency of n-grams to learn the probability distribution over words. Two benefits of N-gram Language Model are simplicity and scalability – with larger n, a model can store more context with a well-understood space–time tradeoff, enabling small experiments to scale up efficiently.
- **Neural Language Models**: They use different kinds of Neural Networks to model the probability distribution over words, and have surpassed the N-gram language models in the ability to model language, but are generally slower to evaluate.
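
To make the counting idea concrete, here is a minimal sketch of a bigram model with add-one (Laplace) smoothing. It is purely illustrative: the toy corpus and the function names `train_bigram` and `log10_prob` are made up for this example, and it is not meant to reflect how TAO Toolkit estimates probabilities internally.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log10_prob(sentence, unigrams, bigrams):
    """Log10 probability of a sentence under add-one smoothed bigram estimates."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        # P(word | prev) = (count(prev, word) + 1) / (count(prev) + |V|)
        total += math.log10((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total

# Toy corpus (hypothetical, for illustration only)
corpus = [
    "all of a sudden i notice three guys standing on the sidewalk",
    "three guys notice the sidewalk",
    "all of a sudden the guys notice",
]
unigrams, bigrams = train_bigram(corpus)

fluent = "all of a sudden i notice three guys standing on the sidewalk"
shuffled = "on guys all i of notice sidewalk three a sudden standing the"
# The fluent word order should receive the higher score
print(log10_prob(fluent, unigrams, bigrams) > log10_prob(shuffled, unigrams, bigrams))
```

Both sentences contain the same words, so the difference in score comes only from which word pairs were observed in the corpus.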

In this notebook, we will show how to train, evaluate and optionally finetune an [N-gram language model](https://web.stanford.edu/~jurafsky/slp3/3.pdf) leveraging TAO Toolkit.

---
### Set Relevant Paths
Please set these paths according to your environment.

In [None]:
%env TAO_DOCKER_DISABLE=1

# The data is saved here
#FIXME2
DATA_DIR='/data/lm'
!sudo mkdir -p $DATA_DIR && sudo chmod -R 777 $DATA_DIR

# The configuration files are stored here
#FIXME3
SPECS_DIR='/specs/lm'
!sudo mkdir -p $SPECS_DIR && sudo chmod -R 777 $SPECS_DIR

# The results are saved at this path
#FIXME4
RESULTS_DIR='/results/lm'
!sudo mkdir -p $RESULTS_DIR && sudo chmod -R 777 $RESULTS_DIR

# Set your encryption key, and use the same key for all commands
KEY='tlt_encode'

---
<a id='isc-prepare-data'></a>
### Preparing the dataset
#### Librispeech LM Normalized dataset
For this tutorial, we use the normalized version of `Librispeech LM` dataset to train our N-gram language model. The normalized version of `Librispeech LM` dataset is available [here](https://www.openslr.org/11/).

#### Librispeech dev-clean dataset
For this tutorial, we also use the clean version of `Librispeech` development set to evaluate our N-gram language model. The clean version of `Librispeech` development set is available [here](https://www.openslr.org/12/).

#### Downloading the dataset

#### Librispeech LM Normalized dataset
The training data is publicly available [here](https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz) and can be downloaded directly.

In [None]:
# NOTE: Ensure that the wget and gzip utilities are available. If not, please install them
!wget 'https://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz' -P $DATA_DIR

# Extract the data
!gzip -dk $DATA_DIR/librispeech-lm-norm.txt.gz

#### Librispeech dev-clean dataset
The evaluation data is publicly available [here](https://www.openslr.org/resources/12/dev-clean.tar.gz) and can be downloaded directly. The Python script below downloads and preprocesses the dataset for you.

In [None]:
"""
Scripts to download and preprocess LibriSpeech dev-clean
"""
from multiprocessing import Pool
import os

LOG_STR = " To regenerate this file, please, remove it."

def find_transcript_files(dir):
    files = []
    for dirpath, _, filenames in os.walk(dir):
        for filename in filenames:
            if filename.endswith(".trans.txt"):
                files.append(os.path.join(dirpath, filename))
    return files

def transcript_to_list(file):
    audio_path = os.path.dirname(file)
    ret = []
    with open(file, "r") as f:
        for line in f:
            file_id, trans = line.strip().split(" ", 1)
            audio_file = os.path.abspath(os.path.join(audio_path, file_id + ".flac"))
            duration = 0  # We are not using the audio
            ret.append([file_id, audio_file, str(duration), trans.lower()])

    return ret


if __name__ == "__main__":
    
    name = "dev-clean"
    data_path = os.path.join(DATA_DIR, "eval_data")
    text_path = os.path.join(DATA_DIR, "text")
    lists_path = os.path.join(DATA_DIR, "lists")
    os.makedirs(data_path, exist_ok=True)
    os.makedirs(text_path, exist_ok=True)
    os.makedirs(lists_path, exist_ok=True)
    data_http = "http://www.openslr.org/resources/12/"

    # Download the audio data
    print("Downloading the evaluation data.", flush=True)
    if not os.path.exists(os.path.join(data_path, "LibriSpeech", name)):
        print("Downloading and unpacking {}...".format(name))
        cmd = """wget -c {http}{name}.tar.gz -P {path};
                 yes n 2>/dev/null | gunzip {path}/{name}.tar.gz;
                 tar -C {path} -xf {path}/{name}.tar"""
        os.system(cmd.format(path=data_path, http=data_http, name=name))
    else:
        log_str = "{} part of data exists, skip its downloading and unpacking"
        print(log_str.format(name) + LOG_STR, flush=True)

    # Prepare the audio data
    print("Converting data into necessary format.", flush=True)
    word_dict = {}
    word_dict[name] = set()
    src = os.path.join(data_path, "LibriSpeech", name)
    assert os.path.exists(src), "Unable to find the directory - '{src}'".format(
        src=src
    )

    dst_list = os.path.join(lists_path, name + ".lst")
    # Only build the list file when it does not already exist
    if os.path.exists(dst_list):
        print(
            "Path {} exists, skip its generation.".format(dst_list) + LOG_STR,
            flush=True,
        )
    else:
        print("Analyzing {src}...".format(src=src), flush=True)
        transcript_files = find_transcript_files(src)
        transcript_files.sort()

        print("Writing to {dst}...".format(dst=dst_list), flush=True)
        with Pool(processes=8) as p:
            samples = list(p.imap(transcript_to_list, transcript_files))

        with open(dst_list, "w") as fout:
            for sp in samples:
                for s in sp:
                    word_dict[name].update(s[-1].split(" "))
                    s[0] = name + "-" + s[0]
                    fout.write(" ".join(s) + "\n")

    current_path = os.path.join(text_path, name + ".txt")
    if not os.path.exists(current_path):
        with open(os.path.join(lists_path, name + ".lst"), "r") as flist, open(
            os.path.join(text_path, name + ".txt"), "w"
        ) as fout:
            for line in flist:
                fout.write(" ".join(line.strip().split(" ")[3:]) + "\n")
    else:
        print(
            "Path {} exists, skip its generation.".format(current_path) + LOG_STR,
            flush=True,
        )

print("Done!", flush=True)


For the sake of reducing the time this demo takes, we reduce the number of lines of the training dataset. Feel free to modify the number of used lines.

In [None]:
# Use a random 10,000 lines for training
!shuf -n 10000 $DATA_DIR/librispeech-lm-norm.txt  > $DATA_DIR/reduced_training.txt
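
If `shuf` is not available on your system, the same subsampling can be sketched in pure Python with reservoir sampling, which keeps a uniform random sample of `k` lines while reading the input only once. The function name `sample_lines` and the fixed seed are assumptions made for this illustration.

```python
import random

def sample_lines(lines, k, seed=0):
    """Return k lines chosen uniformly at random from an iterable (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines
            reservoir.append(line)
        else:
            # Keep the new line with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir

# Example on an in-memory list of made-up lines
demo = ["line {}\n".format(n) for n in range(100)]
subset = sample_lines(demo, k=10)
print(len(subset))
```

For the real data, an open file handle on `librispeech-lm-norm.txt` could be passed in place of `demo`, with `k=10000`, and the result written to `reduced_training.txt`.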

---
## TAO Toolkit workflow
The rest of the notebook exemplifies the simplicity of the TAO Toolkit workflow. Users with basic knowledge of Deep Learning can get started building their own custom models using a simple specification file. It's essentially just one command each to run data preprocessing, training, fine-tuning, evaluation, inference, and export! All configuration happens through YAML spec files. <br>

---
### Configuration/Specification Files

The essence of all commands in TAO Toolkit lies in the YAML spec files. There are sample spec files already available for you to use directly or as reference to create your own.  Through these spec files, you can tune many knobs like the model, dataset, hyperparameters etc. Each command (like train, finetune, evaluate etc.) should have a dedicated spec file with configurations pertinent to it. <br>

Here is an example of the training spec file:

---
```
model:
  intermediate: True
  order: 2
  pruning:
    - 0
training_ds:
  is_tarred: false
  is_file: true
  data_file: ???

vocab_file: ""
encryption_key: "tlt_encode"
...
```


---
### Downloading Specs
We can proceed to downloading the spec files. The user may choose to modify/rewrite these specs, or even individually override them through the launcher. You can download the default spec files by using the `download_specs` command. <br>

The `-o` argument specifies the folder where the default specification files will be downloaded, and `-r` specifies where the script saves its logs. **Make sure `-o` points to an empty folder!**
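
A quick way to confirm that a folder is empty before downloading the specs is sketched below. The helper name `is_empty_dir` is hypothetical, and the example runs against a temporary directory rather than your actual `SPECS_DIR`.

```python
import os
import tempfile

def is_empty_dir(path):
    """True if path is an existing directory with no entries."""
    return os.path.isdir(path) and not os.listdir(path)

with tempfile.TemporaryDirectory() as tmp:
    empty_before = is_empty_dir(tmp)
    # Create a file to simulate a folder that already holds a spec
    open(os.path.join(tmp, "train.yaml"), "w").close()
    empty_after = is_empty_dir(tmp)

print(empty_before, empty_after)
```

In the notebook, the same check could be applied to `SPECS_DIR` before running `download_specs`.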

In [None]:
!tao n_gram download_specs \
    -r $RESULTS_DIR \
    -o $SPECS_DIR

---
### Data Convert


In preparation for training/fine-tuning, we need to preprocess the dataset. `tao n_gram dataset_convert` command can be used in conjunction with appropriate configuration in the spec file. Here is the sample `dataset_convert.yaml` spec file we use:
```
# Dataset. Available options: [assistant]
dataset_name: assistant

# Extension of the files containing in dataset
extension: ???

# Path to the folder containing the dataset source files.
source_data_dir: ???

# Path to the output folder.
target_data_file: ???

```
We encourage you to take a look at the .yaml spec files we provide!
As we show below, you can override the `extension`, `source_data_dir` and `target_data_file` options with appropriate values.

In [None]:
# Preprocess training data (Librispeech LM Normalized)
!tao n_gram dataset_convert \
            -e $SPECS_DIR/dataset_convert.yaml \
            -r $RESULTS_DIR/dataset_convert \
            extension=*.txt \
            source_data_dir=$DATA_DIR/reduced_training.txt \
            target_data_file=$DATA_DIR/preprocessed.txt

# Preprocess evaluation data (Librispeech dev-clean)
!tao n_gram dataset_convert \
            -e $SPECS_DIR/dataset_convert.yaml \
            -r $RESULTS_DIR/dataset_convert \
            extension=*.txt \
            source_data_dir=$DATA_DIR/text/dev-clean.txt \
            target_data_file=$DATA_DIR/preprocessed_dev_clean.txt

These commands preprocess the training and evaluation datasets with basic text cleanup (lowercasing, normalization, punctuation removal, and so on) and write the results to `preprocessed.txt` and `preprocessed_dev_clean.txt`, respectively. In both files, each line holds one preprocessed sentence.
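
As a rough illustration of this kind of cleanup (not the actual implementation behind `dataset_convert`, whose exact rules are not shown in this notebook), a minimal normalizer might lowercase the text, drop punctuation and collapse whitespace. The function name `normalize_line` is hypothetical, and keeping apostrophes is an assumption made for this sketch.

```python
import re
import string

# Translation table that deletes ASCII punctuation except the apostrophe
_PUNCT = str.maketrans("", "", string.punctuation.replace("'", ""))

def normalize_line(line):
    """Lowercase, remove punctuation (keeping apostrophes), and collapse whitespace."""
    line = line.lower().translate(_PUNCT)
    return re.sub(r"\s+", " ", line).strip()

print(normalize_line("  All of a sudden, I notice THREE guys -- don't they see me?  "))
```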

---
<a id='isc-training'></a>
### Training / Fine-tuning


Training a model using TAO Toolkit is as simple as configuring your spec file and running the train command. The code cell below uses the train.yaml spec file available for users as reference. The spec file configurations can easily be overridden using the tao-launcher CLI as shown below. For instance, below we override `model.order`, `model.pruning` and `training_ds.data_file` configurations to suit our needs. <br>
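
Conceptually, a dotted override such as `model.order=3` addresses a nested key in the YAML spec. The sketch below mimics that behaviour on a plain Python dictionary. It only illustrates the idea and is not the launcher's actual implementation; `apply_override` is a hypothetical helper.

```python
def apply_override(config, dotted_key, value):
    """Set config[a][b]...[z] = value for a key written as 'a.b...z'."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        # Walk down the nested dictionaries, creating levels if missing
        node = node.setdefault(key, {})
    node[leaf] = value
    return config

# Values mirror the example training spec shown earlier in this notebook
spec = {
    "model": {"intermediate": True, "order": 2, "pruning": [0]},
    "training_ds": {"is_tarred": False, "is_file": True, "data_file": None},
}
apply_override(spec, "model.order", 3)
apply_override(spec, "model.pruning", [0, 0, 1])
apply_override(spec, "training_ds.data_file", "/data/lm/preprocessed.txt")
print(spec["model"])
```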

For training an N-gram language model in TAO Toolkit, we use the `tao n_gram train` command with the following args:
- `-e`: Path to the spec file
- `-k`: User specified encryption key to use while saving/loading the model
- `-r`: Path to a folder where the outputs should be written. Make sure this is mapped in tlt_mounts.json
- Any overrides to the spec file, e.g. `model.order`
<br>


More details about these arguments are present in the [TAO Toolkit Getting Started Guide](https://docs.nvidia.com/tao/tao-toolkit/text/overview.html) <br>
`Note:` All file paths correspond to the destination mounted directory that is visible in the TAO Toolkit docker container used in the backend.<br>
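
The `model.pruning` override takes one threshold per n-gram order. Assuming it follows the count-threshold convention common to KenLM-style tools (an assumption; check the TAO documentation for the exact semantics), n-grams of a given order whose count is at or below that order's threshold are dropped. A sketch with a hypothetical `prune_counts` helper and made-up counts:

```python
from collections import Counter

def prune_counts(counts_by_order, thresholds):
    """Drop n-grams whose count is <= the threshold for their order (orders start at 1)."""
    pruned = {}
    for order, counts in counts_by_order.items():
        threshold = thresholds[order - 1]
        pruned[order] = Counter({g: c for g, c in counts.items() if c > threshold})
    return pruned

# Toy counts (made up for illustration)
counts = {
    1: Counter({("the",): 5, ("cat",): 1}),
    2: Counter({("the", "cat"): 1, ("the", "dog"): 3}),
    3: Counter({("the", "cat", "sat"): 1, ("the", "dog", "ran"): 2}),
}
kept = prune_counts(counts, thresholds=[0, 0, 1])
print({order: sorted(c) for order, c in kept.items()})
```

Under this reading, `[0,0,1]` keeps every unigram and bigram and removes trigrams that were seen only once.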

In [None]:
!tao n_gram train \
            -e $SPECS_DIR/train.yaml \
            -r $RESULTS_DIR/train \
            training_ds.data_file=$DATA_DIR/preprocessed.txt \
            model.order=3 \
            model.pruning=[0,0,1]

The train command produces 3 files called `train_n_gram.arpa`, `train_n_gram.vocab` and `train_n_gram.kenlm_intermediate` saved at `$RESULTS_DIR/train/checkpoints`.
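
ARPA is a plain-text format whose header lists how many n-grams of each order the model stores. Assuming the `.arpa` file produced here is standard, unencrypted ARPA (an assumption, given that an encryption key is configured in the spec), the header can be inspected with a few lines of Python. `read_arpa_counts` is a hypothetical helper and the sample header below is made up.

```python
import io
import re

def read_arpa_counts(lines):
    """Parse the '\\data\\' header of an ARPA file into {order: n-gram count}."""
    counts = {}
    in_header = False
    for line in lines:
        line = line.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header:
            match = re.fullmatch(r"ngram (\d+)=(\d+)", line)
            if match:
                counts[int(match.group(1))] = int(match.group(2))
            elif line.startswith("\\"):
                break  # reached the first '\N-grams:' section
    return counts

# In-memory stand-in for an open .arpa file
sample = io.StringIO("\\data\\\nngram 1=4\nngram 2=5\nngram 3=3\n\n\\1-grams:\n")
arpa_counts = read_arpa_counts(sample)
print(arpa_counts)
```

With `model.order=3`, one would expect three entries in the result, one per order.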

---
<a id='evaluation'></a>
### Evaluation
The evaluation spec .yaml is as simple as:

```
# Name of the .arpa or .binary file where trained model will be restored from.
restore_from: ???

test_ds:
  data_file: ???
  
```

In [None]:
!tao n_gram evaluate \
     -e $SPECS_DIR/evaluate.yaml \
     -r $RESULTS_DIR/evaluate \
     restore_from=$RESULTS_DIR/train/checkpoints/train_n_gram.arpa \
     test_ds.data_file=$DATA_DIR/preprocessed_dev_clean.txt

The output of the evaluation gives us the perplexity of the N-gram language model on the evaluation (Librispeech dev-clean) dataset!
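
Perplexity is the inverse probability of the evaluation text, normalized by the number of scored tokens; lower is better. A minimal sketch, assuming per-token log10 probabilities are available (the values below are invented, and how TAO Toolkit counts tokens is not specified here):

```python
def perplexity(log10_probs):
    """Perplexity from per-token log10 probabilities: 10 ** (-mean log10 p)."""
    return 10 ** (-sum(log10_probs) / len(log10_probs))

# Four tokens, each assigned probability 0.1 by a hypothetical model
ppl = perplexity([-1.0, -1.0, -1.0, -1.0])
print(ppl)
```

A model that assigns every token probability 0.1 is, on average, as uncertain as a uniform choice among 10 words, which is what the resulting value expresses.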

---
<a id='isc-inference'></a>
### Inference
Inference using a trained `.arpa` or `.binary` model uses the `tao n_gram infer` command.  <br>
The infer.yaml is also very simple, and we can directly give inputs for the model to run inference.
```
# "Simulate" user input:
input_batch:
  - 'set alarm for seven thirty am'
  - 'lower volume by fifty percent'
  - 'what is my schedule for tomorrow'

restore_from: ???

```

We encourage you to try out your own inputs as an exercise!

In [None]:
!tao n_gram infer \
            -e $SPECS_DIR/infer.yaml \
            -r $RESULTS_DIR/infer \
            restore_from=$RESULTS_DIR/train/checkpoints/train_n_gram.arpa

This command returns the log likelihood, perplexity and all n-grams for each of the input sequences that users provided.
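
To see what "all n-grams" of an input sequence means, the sketch below enumerates the trigrams of one of the sample inputs. The `ngrams` helper is hypothetical, and padding with `<s>` and `</s>` is a common convention assumed here; the exact output format of `tao n_gram infer` may differ.

```python
def ngrams(sentence, n):
    """List the n-grams of a sentence padded with <s> and </s> markers."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

trigrams = ngrams("set alarm for seven thirty am", 3)
for gram in trigrams:
    print(" ".join(gram))
```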

---
### What's Next?

You could use TAO Toolkit to build custom models for your own applications, or you could deploy the custom model to NVIDIA Riva!