# Train Adapt Optimize (TAO) Toolkit

Train Adapt Optimize (TAO) Toolkit is a Python-based AI toolkit for taking purpose-built pre-trained AI models and customizing them with your own data.

Transfer learning extracts learned features from an existing neural network to a new one. Transfer learning is often used when creating a large training dataset is not feasible.

Developers, researchers, and software partners building intelligent AI apps and services can bring their own data to fine-tune pre-trained models instead of going through the hassle of training from scratch.

![Train Adapt Optimize (TAO) Toolkit](https://developer.nvidia.com/sites/default/files/akamai/TAO/tlt-tao-toolkit-bring-your-own-model-diagram.png)

The goal of this toolkit is to reduce an 80-hour workload to an 8-hour workload, which enables data scientists to run considerably more train-test iterations in the same time frame.

Let's see this in action with a use case for Speech Synthesis!

#### Note
1. This notebook uses the LJSpeech dataset by default, which should be around 3.75 GB.
1. Using the default config/spec file provided in this notebook, each spectrogram generator weight file created during training will be ~1.76 GB, and each vocoder weight file created during training will be ~324 MB.

## Text to Speech

Text to Speech (TTS) is often the last step in building a Conversational AI model. A TTS model converts text into audible speech. The main objective is to synthesize reasonable and natural speech for a given text. Since there is no universal standard for measuring the quality of synthesized speech, you will need to listen to some inferred speech to tell whether a TTS model is well trained.

In TAO Toolkit, TTS is made up of two models: [FastPitch](https://arxiv.org/pdf/2006.06873.pdf) for spectrogram generation and [HiFiGAN](https://arxiv.org/pdf/2010.05646.pdf) as the vocoder.

## Connect to a GPU Runtime

1.   Change Runtime type to GPU by Runtime(Top Left tab)->Change Runtime Type->GPU(Hardware Accelerator)
2.   Then click on Connect (Top Right)



## Mounting Google Drive
Mount your Google Drive storage to this Colab instance.

In [None]:
try:
    import google.colab
    %env GOOGLE_COLAB=1
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
except:
    %env GOOGLE_COLAB=0
    print("Warning: Not a Colab Environment")

## Setup Python Environment
Set up the environment necessary to run the TAO networks by running the bash script.

#### FIXME
1. COLAB_NOTEBOOKS_PATH - for Google Colab environment, set this path where you want to clone the repo to; for local system environment, set this path to the already cloned repo
1. NUM_GPUS - set this to <= number of GPUs available on the instance
1. DATA_DIR - set this path to a folder location where you want the dataset to be stored
1. SPECS_DIR - set this path to a folder location where the configuration/spec files will be saved
1. RESULTS_DIR - set this path to a folder location where pretrained models, checkpoints and log files during different model actions will be saved

In [None]:
import os
#FIXME1
%env COLAB_NOTEBOOKS_PATH=/content/drive/MyDrive/nvidia-tao
if os.environ["GOOGLE_COLAB"] == "1":
    os.environ["bash_script"] = "setup_env.sh"
    if not os.path.exists(os.path.join(os.environ["COLAB_NOTEBOOKS_PATH"])):
        !git clone https://github.com/NVIDIA-AI-IOT/nvidia-tao.git $COLAB_NOTEBOOKS_PATH
else:
    os.environ["bash_script"] = "setup_env_desktop.sh"
    if not os.path.exists(os.environ["COLAB_NOTEBOOKS_PATH"]):
        raise Exception("Error, enter the path of the colab notebooks repo correctly")

!sed -i "s|PATH_TO_COLAB_NOTEBOOKS|$COLAB_NOTEBOOKS_PATH|g" $COLAB_NOTEBOOKS_PATH/pytorch/$bash_script
!sh $COLAB_NOTEBOOKS_PATH/pytorch/$bash_script

---
## Let's Dig in: TTS using TAO

This notebook assumes that you are already familiar with TTS Training using TAO, as described in the [text-to-speech-training](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/resources/texttospeech_notebook) notebook, and that you have a pretrained TTS model.

### Set Relevant Paths

In [None]:
%env TAO_DOCKER_DISABLE=1

#FIXME2
%env NUM_GPUS=1

#FIXME3
%env DATA_DIR=/data/tts
!sudo mkdir -p $DATA_DIR && sudo chmod -R 777 $DATA_DIR

#FIXME4
%env SPECS_DIR=/specs/tts
!sudo mkdir -p $SPECS_DIR && sudo chmod -R 777 $SPECS_DIR

#FIXME5
%env RESULTS_DIR=/results/tts
!sudo mkdir -p $RESULTS_DIR && sudo chmod -R 777 $RESULTS_DIR


# Set your encryption key, and use the same key for all commands
%env KEY=tlt_encode

Now that everything is set up, we would like to take a bit of time to explain the `tao` interface for ease of use. The command structure can be broken down as follows: `tao <task name> <subcommand>` <br>

Let's see this in further detail.


### Downloading Specs
TAO's Conversational AI Toolkit works off of spec files which make it easy to edit hyperparameters on the fly. We can proceed to downloading the spec files. The user may choose to modify/rewrite these specs, or even individually override them through the launcher. You can download the default spec files by using the `download_specs` command. <br>

The `-o` argument indicates the folder where the default specification files will be downloaded, and `-r` tells the script where to save the logs. **Make sure `-o` points to an empty folder!**
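Because the spec download expects an empty output folder, a quick pre-flight check can save a failed run. The helper below is a minimal sketch using only the standard library; the name `is_empty_dir` is our own and is not part of TAO.

```python
import os

def is_empty_dir(path):
    """Return True only if `path` is an existing directory with no entries."""
    return os.path.isdir(path) and not os.listdir(path)
```

For example, `is_empty_dir(os.path.join(os.environ["SPECS_DIR"], "spectro_gen"))` could be checked before running the download cell for FastPitch.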

In [None]:
# download spec files for FastPitch
! tao spectro_gen download_specs \
    -r $RESULTS_DIR/spectro_gen \
    -o $SPECS_DIR/spectro_gen

In [None]:
# download spec files for HiFiGAN
! tao vocoder download_specs \
    -r $RESULTS_DIR/vocoder \
    -o $SPECS_DIR/vocoder

### Data

For the rest of this notebook, it is assumed that you have:

 - Pretrained FastPitch and HiFiGAN models that were trained on LJSpeech sampled at 22kHz
 
If you are not using a TTS model trained on LJSpeech at the correct sampling rate, please ensure that you have the original data, including wav files and a .json manifest file. If you have a TTS model that was not trained at 22kHz, please ensure that you set the correct sampling rate and fft parameters.

For the rest of the notebook, we will be using a toy dataset consisting of 5 mins of audio. This dataset is for demo purposes only. For a good quality model, we recommend at least 30 minutes of audio. We recommend using the [NVIDIA Custom Voice Recorder](https://developer.nvidia.com/riva-voice-recorder-early-access) tool, to generate a good dataset for finetuning.
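To check a recording against the 30-minute recommendation, the clip durations in a manifest can be summed. This is a minimal sketch that assumes a JSON-lines manifest in which each entry carries a `duration` field in seconds; that field name is an assumption, so adjust it to match your manifest. The function name `total_minutes` is hypothetical.

```python
import json

def total_minutes(manifest_lines):
    """Sum the `duration` field (seconds) over JSON-lines manifest entries."""
    total_seconds = 0.0
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        total_seconds += float(entry["duration"])  # assumed field name
    return total_seconds / 60.0

# Two made-up entries: 90 s + 30 s of audio, i.e. 2 minutes in total.
sample = [
    '{"audio_filepath": "clips/a.wav", "duration": 90.0, "text": "hello"}',
    '{"audio_filepath": "clips/b.wav", "duration": 30.0, "text": "world"}',
]
print(total_minutes(sample))
```

An open file object can be passed directly, since iterating over it yields one manifest line at a time.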

Let's first download the original LJSpeech dataset and set variables that point to this as the original data's `.json` file.

In [None]:
! wget -O $DATA_DIR/ljspeech.tar.bz2 https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2

After downloading, untar the dataset, and move it to the correct directory.

In [None]:
# Extracting and moving the data to the correct directories.
! tar -xvf $DATA_DIR/ljspeech.tar.bz2
! sudo rm -rf $DATA_DIR/ljspeech
! mv LJSpeech-1.1 $DATA_DIR/ljspeech

### Pre-Processing

This step downloads audio to text file lists from NVIDIA for LJSpeech and generates the manifest files. If you use your own dataset, you have to generate three files: `ljs_audio_text_train_filelist.txt`, `ljs_audio_text_val_filelist.txt`, `ljs_audio_text_test_filelist.txt` yourself. Those files correspond to your train / val / test split. For each text file, the number of rows should be equal to number of samples in this split and each row should be like:

```
DUMMY/<file_name>.wav|<text_of_the_audio>
```

An example row is:

```
DUMMY/LJ045-0096.wav|Mrs. De Mohrenschildt thought that Oswald,
```
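If you are building these file lists for your own dataset, the rows can be generated from a mapping of file names to transcripts. The sketch below only illustrates the row format described above; the helper name `make_filelist_rows` is hypothetical, and rejecting transcripts that contain `|` is a precaution we assume is sensible because `|` is the field separator.

```python
def make_filelist_rows(transcripts):
    """Format {wav_file_name: text} pairs as `DUMMY/<file_name>.wav|<text>` rows."""
    rows = []
    for file_name, text in transcripts.items():
        if "|" in text or "\n" in text:
            raise ValueError(f"Transcript for {file_name} contains '|' or a newline")
        # Accept names given with or without the .wav extension.
        stem = file_name[:-4] if file_name.endswith(".wav") else file_name
        rows.append(f"DUMMY/{stem}.wav|{text.strip()}")
    return rows

rows = make_filelist_rows(
    {"LJ045-0096.wav": "Mrs. De Mohrenschildt thought that Oswald,"}
)
print(rows[0])
```

Writing `"\n".join(rows)` to each of the three `.txt` files, one per split, gives the layout the conversion step expects.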

Once those three files are in your `data_dir`, you can run the following command as you would for the LJSpeech dataset.

Be patient! This step can take several minutes.

In [None]:
! tao spectro_gen dataset_convert \
      -e $SPECS_DIR/spectro_gen/dataset_convert_ljs.yaml \
      -r $RESULTS_DIR/spectro_gen/dataset_convert \
      data_dir=$DATA_DIR/ljspeech \
      dataset_name=ljspeech

In [None]:
import os

original_data_json = os.path.join(os.environ["DATA_DIR"], "ljspeech/ljspeech_train.json")
os.environ["original_data_json"] = original_data_json

Let's now download the data from the NVIDIA Custom Voice Recorder tool, and place the data in the `$DATA_DIR`

In [None]:
import os

# Name of the untarred dataset from the NVIDIA Custom Voice Recorder.
finetune_data_name = FIXME
finetune_data_path = os.path.join(os.environ["DATA_DIR"], finetune_data_name)

os.environ["finetune_data_name"] = finetune_data_name

Now that you have downloaded the data, let's make sure that the audio clips are sampled at the same sampling frequency as the clips used to train the pretrained model. For the course of this notebook, NVIDIA recommends using a model trained on the LJSpeech dataset. The sampling rate for this model is 22.05kHz.
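Before resampling, it can help to see which clips are actually off-target. The following is a minimal sketch using Python's standard `wave` module, which reads PCM WAV headers only; the helper names `wav_sample_rate` and `needs_resampling` are hypothetical and are not used elsewhere in this notebook.

```python
import wave

def wav_sample_rate(path):
    """Return the sampling rate in Hz stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()

def needs_resampling(path, target_sampling_rate=22050):
    """Return True if the clip's sampling rate differs from the target."""
    return wav_sample_rate(path) != target_sampling_rate
```

Clips recorded in other encodings (for example floating-point WAV) may not open with `wave`; in that case a library such as `soundfile` is the safer choice.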

In [None]:
! pip install soundfile

import soundfile
import librosa
import json
import os

def resample_audio(input_file_path, output_path, target_sampling_rate=22050):
    """Resample a single audio file.
    
    Args:
        input_file_path (str): Path to the input audio file.
        output_path (str): Path to the output audio file.
        target_sampling_rate (int): Sampling rate for output audio file.
        
    Returns:
        No explicit returns
    """
    if not input_file_path.endswith(".wav"):
        raise NotImplementedError("Loading only implemented for wav files.")
    if not os.path.exists(input_file_path):
        raise FileNotFoundError(f"Cannot find input file at {input_file_path}")
    audio, sampling_rate = librosa.load(
      input_file_path,
      sr=target_sampling_rate
    )
    # Filtering out empty audio files.
    if librosa.get_duration(y=audio, sr=sampling_rate) == 0:
        print(f"0 duration audio file encountered at {input_file_path}")
        return None
    filename = os.path.basename(input_file_path)
    if not os.path.exists(output_path):
        os.makedirs(output_path)
    soundfile.write(
        os.path.join(output_path, filename),
        audio,
        samplerate=target_sampling_rate,
        format="wav"
    )
    return filename

In [None]:
! pip install tqdm
from tqdm.notebook import tqdm

relative_path = f"{finetune_data_name}/clips_resampled"
resampled_manifest_file = os.path.join(
    os.environ["DATA_DIR"],
    f"{finetune_data_name}/manifest_resampled.json"
)
input_manifest_file = os.path.join(
    os.environ["DATA_DIR"],
    f"{finetune_data_name}/manifest.json"
)
sampling_rate = 22050
output_path = os.path.join(os.environ["DATA_DIR"], relative_path)

# Resampling the audio clip.
with open(input_manifest_file, "r") as finetune_file:
    with open(resampled_manifest_file, "w") as resampled_file:
        for line in tqdm(finetune_file.readlines()):
            data = json.loads(line)
            filename = resample_audio(
                os.path.join(
                    os.environ["DATA_DIR"],
                    finetune_data_name,
                    data["audio_filepath"]
                ),
                output_path,
                target_sampling_rate=sampling_rate
            )
            if not filename:
                print(f"Skipping clip {data['audio_filepath']} from training dataset")
                continue
            data["audio_filepath"] = os.path.join(
                os.environ["DATA_DIR"],
                relative_path, filename
            )
            resampled_file.write(f"{json.dumps(data)}\n")

assert resampled_file.closed, "Output file wasn't closed properly"
assert finetune_file.closed, "Input file wasn't closed properly"

In [None]:
# Splitting the dataset to train and val set.
! cat $finetune_data_path/manifest_resampled.json | tail -n 2 > $finetune_data_path/manifest_val.json
! cat $finetune_data_path/manifest_resampled.json | head -n -2 > $finetune_data_path/manifest_train.json

In [None]:
from pathlib import Path

finetune_data_json = os.path.join(os.environ["DATA_DIR"], f'{finetune_data_name}/manifest_train.json')
os.environ["finetune_data_json"] = finetune_data_json

The first step is to create a json that contains data from both the original data and the finetuning data. We can do this using dataset_convert:

In [None]:
! tao spectro_gen dataset_convert \
      dataset_name=merge \
      original_json=$original_data_json \
      finetune_json=$finetune_data_json \
      save_path=$DATA_DIR/$finetune_data_name/merged_train.json \
      -r $DATA_DIR/dataset_convert/merge \
      -e $SPECS_DIR/spectro_gen/dataset_convert_ljs.yaml

In [None]:
import json
finetune_val_json = os.path.join(
    os.getenv("DATA_DIR"), f'{finetune_data_name}/manifest_val.json'
)
finetune_val_dataset = os.path.join(
    os.getenv("DATA_DIR"), f'{finetune_data_name}/merged_val.json'
)
os.environ["finetune_val_dataset"] = finetune_val_dataset

with open(finetune_val_json, "r") as val_json:
    with open(finetune_val_dataset, "w") as out_file:
        for line in val_json.readlines():
            data = json.loads(line)
            data["speaker"] = 1
            out_file.write(f"{json.dumps(data)}\n")

# You may uncomment this line to view the file after the modification.
# ! cat $finetune_val_dataset


### Getting Pitch Statistics

Training FastPitch requires you to set 4 values for pitch extraction:
  - `fmin`: The minimum frequency value in Hz used to estimate the fundamental frequency (f0)
  - `fmax`: The maximum frequency value in Hz used to estimate the fundamental frequency (f0)
  - `avg`: The average used to normalize the pitch
  - `std`: The standard deviation used to normalize the pitch

In order to get these, we first find a good `fmin` and `fmax` which are hyperparameters to librosa's pyin function.
After we set those, we can iterate over the finetuning dataset to extract the pitch mean and standard deviation.
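To make the role of `avg` and `std` concrete, here is a minimal pure-Python sketch of computing them from per-frame f0 values and applying the normalization. It assumes unvoiced frames are marked as NaN (as librosa's pyin does by default), uses the population standard deviation, and maps unvoiced frames to 0.0 after normalization. These are illustrative assumptions, not a statement of how the TAO `pitch_stats` task computes its numbers, and the function names are hypothetical.

```python
import math

def pitch_mean_std(f0_values):
    """Mean and population std of f0 over voiced frames (NaN/None = unvoiced)."""
    voiced = [f for f in f0_values if f is not None and not math.isnan(f)]
    if not voiced:
        raise ValueError("No voiced frames found")
    mean = sum(voiced) / len(voiced)
    variance = sum((f - mean) ** 2 for f in voiced) / len(voiced)
    return mean, math.sqrt(variance)

def normalize_pitch(f0_values, avg, std):
    """Apply (f0 - avg) / std to voiced frames; set unvoiced frames to 0.0."""
    return [
        0.0 if f is None or math.isnan(f) else (f - avg) / std
        for f in f0_values
    ]

# Made-up f0 track in Hz: three voiced frames surrounded by unvoiced ones.
f0 = [float("nan"), 100.0, 120.0, 140.0, float("nan")]
avg, std = pitch_mean_std(f0)
normalized = normalize_pitch(f0, avg, std)
```

After normalization the voiced frames are centered on zero, which is why a poor `fmin` or `fmax` choice (and the resulting wrong f0 values) would skew both statistics.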

#### Obtain fmin and fmax

To get fmin and fmax, we start with some defaults, and iterate through random samples of the dataset to ensure that pyin is correctly extracting the pitch.

We look at the plotted spectrogram as well as the predicted fundamental frequency, f0. We want the predicted f0 (the cyan line) to match the lowest energy band in the spectrogram. Here is an example of a good match between the predicted f0 and the spectrogram:

![good_pitch.png](https://github.com/vpraveen-nv/model_card_images/raw/main/conv_ai/samples/texttospeech/good_pitch.png)

Here is an example of a bad match between the f0 and the spectrogram. The fmin was likely set too high. The f0 algorithm is missing the first two vocalizations, and is correctly matching the last half of speech. To fix this, the fmin should be set lower.

![bad_pitch.png](https://github.com/vpraveen-nv/model_card_images/raw/main/conv_ai/samples/texttospeech/bad_pitch.png)

Here is an example of samples that have low frequency noise. In order to eliminate the effects of noise, you have to set fmin above the noise frequency. Unfortunately, this will result in degraded TTS quality. It would be best to re-record the data in an environment with less noise.

![noise.png](https://github.com/vpraveen-nv/model_card_images/raw/main/conv_ai/samples/texttospeech/noise.png)


*Note: You will have to run the below cell multiple times with different hyperparameters before you are able to find a good value for fmin and fmax.*

*We set the `num_files` parameter to 10, so as to visualize only 10 plots in the dataset. You may choose to increase or decrease this value to generate more or fewer plots*

*Note: As a starting point, we have set `fmin` to `65` Hz and `fmax` to `2094` Hz.*

In [None]:
!pip3 install matplotlib==3.3.3
import matplotlib.pyplot as plt
%matplotlib inline
import os
from math import ceil
from IPython.display import Image

valid_image_ext = ['.jpg', '.png', '.jpeg', '.ppm']

pitch_fmin = 65    # in Hz
pitch_fmax = 2094    # in Hz

os.environ["pitch_fmin"] = str(pitch_fmin)
os.environ["pitch_fmax"] = str(pitch_fmax)

def visualize_images(image_dir, num_cols=2, num_images=10):
    """Visualize images in the notebook.
    
    Args:
        image_dir (str): Path to the directory containing images.
        num_cols (int): Number of columns.
        num_images (int): Number of images.

    """
    output_path = os.path.join(os.environ['RESULTS_DIR'], image_dir)
    num_rows = int(ceil(float(num_images) / float(num_cols)))
    f, axarr = plt.subplots(num_rows, num_cols, figsize=[240,90])
    f.tight_layout()
    a = [os.path.join(output_path, image) for image in os.listdir(output_path) 
         if os.path.splitext(image)[1].lower() in valid_image_ext]
    for idx, img_path in enumerate(a[:num_images]):
        col_id = idx % num_cols
        row_id = idx // num_cols
        img = plt.imread(img_path)
        axarr[row_id, col_id].imshow(img)
        

# Computing f0 with the pitch_fmin and pitch_fmax values set above
!tao spectro_gen pitch_stats num_files=10 \
     pitch_fmin=$pitch_fmin \
     pitch_fmax=$pitch_fmax \
     output_path=results/spectro_gen/pitch_stats \
     compute_stats=false \
     render_plots=true \
     manifest_filepath=$DATA_DIR/$finetune_data_name/manifest_train.json \
     --results_dir $RESULTS_DIR/spectro_gen/pitch_stats

visualize_images("spectro_gen/pitch_stats", num_cols=5, num_images=10)

Once you have chosen a good value for your `pitch_fmin` and `pitch_fmax`, the cell below will compute the pitch statistics (`pitch_mean` and `pitch_std`) to be used to finetune the model.

In [None]:
! tao spectro_gen pitch_stats num_files=10 \
      pitch_fmin=$pitch_fmin \
      pitch_fmax=$pitch_fmax \
      output_path=results/spectro_gen/pitch_stats \
      compute_stats=true \
      render_plots=false \
      manifest_filepath=$DATA_DIR/$finetune_data_name/manifest_train.json \
      --results_dir $RESULTS_DIR/spectro_gen/pitch_stats

Set `pitch_mean` and `pitch_std` based on the results from the cell above.

In [None]:
# Please set the fmin, fmax, pitch_mean and pitch_std values based on
# the output from the tao spectro_gen pitch_stats task.
pitch_mean = FIXME
pitch_std = FIXME

os.environ["pitch_mean"] = str(pitch_mean)
os.environ["pitch_std"] = str(pitch_std)

print(f"pitch fmin: {pitch_fmin}")
print(f"pitch fmax: {pitch_fmax}")
print(f"pitch mean: {pitch_mean}")
print(f"pitch std: {pitch_std}")

assert pitch_fmin < pitch_fmax , f"pitch_fmin [{pitch_fmin}] > pitch_fmax [{pitch_fmax}]"

### Finetuning

For finetuning TTS models in TAO, we use the `tao spectro_gen finetune` and `tao vocoder finetune` commands with the following args:
<ul>
    <li> <b>-m</b> : Path to the model weights we want to finetune from </li>
    <li> <b>-e</b> : Path to the spec file </li>
    <li> <b>-g</b> : Number of GPUs to use </li>
    <li> <b>-r</b> : Path to the results folder </li>
    <li> <b>-k</b> : User specified encryption key to use while saving/loading the model </li>
    <li> Any overrides to the spec file </li>
</ul>

In order to get a finetuned TTS pipeline, you need to finetune FastPitch. For best results, you need to finetune HiFiGAN as well.

Please update the `-m` parameter to the path of your pre-trained checkpoint. This can be a previously trained `.tlt` or `.nemo` file.
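The way flags and `key=value` spec overrides combine on the command line can be sketched with a small helper. This is purely illustrative: `build_tao_command` is a hypothetical function of our own, not part of TAO, and the paths and values in the example are placeholders taken from this notebook's defaults.

```python
import shlex

def build_tao_command(task, subcommand, flags, overrides):
    """Assemble `tao <task> <subcommand> <flags> <key=value overrides>` as a string."""
    args = ["tao", task, subcommand]
    for flag, value in flags.items():
        args += [flag, str(value)]
    for key, value in overrides.items():
        args.append(f"{key}={value}")
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(args)

cmd = build_tao_command(
    "spectro_gen",
    "finetune",
    flags={"-e": "/specs/tts/spectro_gen/finetune.yaml", "-g": 1, "-k": "tlt_encode"},
    overrides={"trainer.max_epochs": 2, "n_speakers": 2},
)
print(cmd)
```

Keeping the overrides in a dictionary makes it easy to log exactly which spec values a given run changed.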

####  Downloading the pretrained model

NVIDIA recommends using these [FastPitch](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/tts_en_fastpitch) and [HiFiGAN](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/tts_hifigan) checkpoints on [NGC](https://ngc.nvidia.com)

The cells below install the NGC CLI on your local environment and use it to download the models.

In [None]:
# Installing NGC CLI on the local machine.
## Download and install
%env LOCAL_PROJECT_DIR=/ngc_content/
%env CLI=ngccli_cat_linux.zip
!sudo mkdir -p $LOCAL_PROJECT_DIR/ngccli && sudo chmod -R 777 $LOCAL_PROJECT_DIR

# Remove any previously existing CLI installations
!sudo rm -rf $LOCAL_PROJECT_DIR/ngccli/*
!wget "https://ngc.nvidia.com/downloads/$CLI" -P $LOCAL_PROJECT_DIR/ngccli
!unzip -u -q "$LOCAL_PROJECT_DIR/ngccli/$CLI" -d $LOCAL_PROJECT_DIR/ngccli/
!rm $LOCAL_PROJECT_DIR/ngccli/*.zip 
os.environ["PATH"]="{}/ngccli/ngc-cli:{}".format(os.getenv("LOCAL_PROJECT_DIR", ""), os.getenv("PATH", ""))
!cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6 $LOCAL_PROJECT_DIR/ngccli/ngc-cli/libstdc++.so.6

!ngc registry model download-version "nvidia/nemo/tts_en_fastpitch:1.8.1" --dest $DATA_DIR/
!ngc registry model download-version "nvidia/nemo/tts_hifigan:1.0.0rc1" --dest $DATA_DIR/

In [None]:
pretrained_fastpitch_model = os.path.join(os.environ["DATA_DIR"], "tts_en_fastpitch_v1.8.1/tts_en_fastpitch_align.nemo")
os.environ["pretrained_fastpitch_model"] = pretrained_fastpitch_model
pretrained_hifigan_model = os.path.join(os.environ["DATA_DIR"], "tts_hifigan_v1.0.0rc1/tts_hifigan.nemo")
os.environ["pretrained_hifigan_model"] = pretrained_hifigan_model

#### Finetuning FastPitch

In [None]:
# A prior is needed for FastPitch training. If an empty folder is provided, the prior will be generated on the fly.
# Please be patient, especially if you provided an empty prior folder.
! mkdir -p $RESULTS_DIR/spectro_gen/finetune/prior_folder

In [None]:
## Downloading auxiliary files to train.
!wget -O $DATA_DIR/cmudict-0.7b_nv22.01 https://github.com/NVIDIA/NeMo/raw/v1.9.0/scripts/tts_dataset_files/cmudict-0.7b_nv22.01
!wget -O $DATA_DIR/heteronyms-030921 https://github.com/NVIDIA/NeMo/raw/v1.9.0/scripts/tts_dataset_files/heteronyms-030921
!wget -O $DATA_DIR/lj_speech.tsv https://github.com/NVIDIA/NeMo/raw/v1.9.0//nemo_text_processing/text_normalization/en/data/whitelist/lj_speech.tsv

In [None]:
!tao spectro_gen finetune \
     -e $SPECS_DIR/spectro_gen/finetune.yaml \
     -g $NUM_GPUS \
     -k $KEY \
     -r $RESULTS_DIR/spectro_gen/finetune \
     -m $pretrained_fastpitch_model \
     train_dataset=$DATA_DIR/$finetune_data_name/merged_train.json \
     validation_dataset=$DATA_DIR/$finetune_data_name/merged_val.json \
     prior_folder=$RESULTS_DIR/spectro_gen/finetune/prior_folder \
     trainer.max_epochs=2 \
     n_speakers=2 \
     pitch_fmin=$pitch_fmin \
     pitch_fmax=$pitch_fmax \
     pitch_avg=$pitch_mean \
     pitch_std=$pitch_std \
     trainer.precision=16 \
     phoneme_dict_path=$DATA_DIR/cmudict-0.7b_nv22.01 \
     heteronyms_path=$DATA_DIR/heteronyms-030921 \
     whitelist_path=$DATA_DIR/lj_speech.tsv

#### Finetuning HiFiGAN

In order to get the best audio from HiFiGAN, we need to finetune it:
  - on the new speaker
  - using mel spectrograms from our finetuned FastPitch Model

Let's first generate mels from our FastPitch model, and save them to a new .json manifest for use with HiFiGAN.

In [None]:
!sudo mkdir -p /raid && sudo chmod -R 777 /raid

import json
import os

def infer_and_save_json(infer_json, save_json, subdir="train"):
    # Get records from the training manifest
    manifest_path = os.path.join(os.environ["DATA_DIR"], infer_json)
    os.environ["tao_manifest_path"] = os.path.join(os.environ["DATA_DIR"], infer_json)
    os.environ["subdir"] = subdir
    save_json = os.path.join(os.environ["DATA_DIR"], save_json)
    records = []
    text = {"input_batch": []}
    print("Appending mel spectrogram paths to the dataset.")
    with open(manifest_path, "r") as f:
        for i, line in enumerate(f):
            manifest_info = json.loads(line)
            manifest_info["mel_filepath"] = f"{os.environ['RESULTS_DIR']}/spectro_gen/infer/spectro/{subdir}/{i}.npy"
            records.append(manifest_info)
            text["input_batch"].append(manifest_info["text"])

    !tao spectro_gen infer \
         -e $SPECS_DIR/spectro_gen/infer.yaml \
         -g 1 \
         -k $KEY \
         -m $RESULTS_DIR/spectro_gen/finetune/checkpoints/finetuned-model.tlt \
         -r $RESULTS_DIR/spectro_gen/infer \
         output_path=$RESULTS_DIR/spectro_gen/infer/spectro/$subdir \
         speaker=1 \
         mode="infer_hifigan_ft" \
         input_json=$tao_manifest_path

    # Save to a new json
    with open(save_json, "w") as f:
        for r in records:
            f.write(json.dumps(r) + '\n')

# Infer for train
infer_and_save_json(f"{finetune_data_name}/manifest_train.json", f"{finetune_data_name}/hifigan_train_ft.json")
# Infer for dev
infer_and_save_json(f"{finetune_data_name}/manifest_val.json", f"{finetune_data_name}/hifigan_dev_ft.json", "dev")

Now let's finetune HiFiGAN.

Please update the `-m` parameter to the path of your pre-trained checkpoint.

In [None]:
!tao vocoder finetune \
     -e $SPECS_DIR/vocoder/finetune.yaml \
     -g $NUM_GPUS \
     -k $KEY \
     -r $RESULTS_DIR/vocoder/finetune \
     -m $pretrained_hifigan_model \
     train_dataset=$DATA_DIR/$finetune_data_name/hifigan_train_ft.json \
     validation_dataset=$DATA_DIR/$finetune_data_name/hifigan_dev_ft.json \
     trainer.max_epochs=2 \
     training_ds.dataloader_params.batch_size=8

### TTS Inference

As mentioned earlier, since there is no universal standard for measuring the quality of synthesized speech, you will need to listen to some inferred speech to tell whether a TTS model is well trained. Therefore, we do not provide `evaluate` functionality in TAO Toolkit for TTS but only provide `infer` functionality.

#### Generate spectrogram

The first step for inference is generating a spectrogram: a numpy array (saved as a `.npy` file) for a sentence, which a vocoder can convert to voice. We use the FastPitch model we just trained to generate the spectrogram.

Please update the infer.yaml configuration file in `$SPECS_DIR/spectro_gen` to add new sentences. The sample infer.yaml file contains 3 sentences.

```yaml

input_batch:
  - "by the end of no such thing the audience , like beatrice , has a watchful affection for the monster ."
  - "director rob marshall went out gunning to make a great one ."
  - "uneasy mishmash of styles and genres ."
```

You may add new sentences by adding new lines to the `input_batch` field.
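If you prefer to generate the sentence list programmatically, the `input_batch` block shown above can be rendered from a Python list. This is a minimal sketch that produces only that block as text; it relies on `json.dumps` emitting double-quoted strings, which are also valid YAML scalars for ordinary text. The helper name `render_input_batch` is hypothetical, and any other fields in your infer.yaml would need to be preserved separately.

```python
import json

def render_input_batch(sentences):
    """Render sentences as an `input_batch` YAML list."""
    lines = ["input_batch:"]
    for sentence in sentences:
        # json.dumps adds the surrounding double quotes and escapes
        # embedded quotes or backslashes.
        lines.append(f"  - {json.dumps(sentence)}")
    return "\n".join(lines) + "\n"

print(render_input_batch([
    "director rob marshall went out gunning to make a great one .",
    "uneasy mishmash of styles and genres .",
]))
```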

In [None]:
!tao spectro_gen infer \
     -e $SPECS_DIR/spectro_gen/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/spectro_gen/finetune/checkpoints/finetuned-model.tlt \
     -r $RESULTS_DIR/spectro_gen/infer_output \
     output_path=$RESULTS_DIR/spectro_gen/infer_output/spectro \
     speaker=1

#### Generate sound file

The second step for inference is generating a wav sound file from the spectrogram you generated in the last step.

In [None]:
!tao vocoder infer \
     -e $SPECS_DIR/vocoder/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/vocoder/finetune/checkpoints/finetuned-model.tlt \
     -r $RESULTS_DIR/vocoder/infer_output \
     input_path=$RESULTS_DIR/spectro_gen/infer_output/spectro \
     output_path=$RESULTS_DIR/vocoder/infer_output/wav

In [None]:
import os
import IPython.display as ipd
# change path of the file here
ipd.Audio(os.environ["RESULTS_DIR"] + '/vocoder/infer_output/wav/0.wav')
# ipd.Audio(os.environ["RESULTS_DIR"] + '/vocoder/infer_output/wav/1.wav')
# ipd.Audio(os.environ["RESULTS_DIR"] + '/vocoder/infer_output/wav/2.wav')

#### Debug

The data provided is only meant to be a sample to understand how finetuning works in TAO. In order to generate better speech quality, we recommend recording at least 30 mins of audio, and increasing the number of finetuning steps from the current `trainer.max_steps=1000` to `trainer.max_steps=5000` for both models.

### What's Next?

You could use TAO to build custom models for your own applications, and deploy them to NVIDIA Riva!