# Introduction

In this tutorial we show how to run the scripts for training or fine-tuning an audio codec.

# License

> Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
>
> Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
>
> http://www.apache.org/licenses/LICENSE-2.0
>
> Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

# Install

In [None]:
BRANCH = 'main'
# Install NeMo library. If you are running locally (rather than on Google Colab), comment out the below line
# and instead follow the instructions at https://github.com/NVIDIA/NeMo#Installation
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

In [None]:
NEMO_ROOT_DIR = "/content/nemo"
# Download local version of NeMo scripts. If you are running locally and want to use your own local NeMo code,
# comment out the below line and set NEMO_ROOT_DIR to your local path.
!git clone -b $BRANCH https://github.com/NVIDIA/NeMo.git $NEMO_ROOT_DIR

# Configuration

In [None]:
from pathlib import Path

# Choose target sample rate for codec.
# This notebook has out-of-the-box configurations for 16000, 22050, 24000, and 44100.
SAMPLE_RATE = 16000

# Configure nemo paths
NEMO_DIR = Path(NEMO_ROOT_DIR)
NEMO_EXAMPLES_DIR = NEMO_DIR / "examples" / "tts"
NEMO_CONFIG_DIR = NEMO_EXAMPLES_DIR / "conf"
NEMO_SCRIPT_DIR = NEMO_DIR / "scripts" / "dataset_processing" / "tts"

# Dataset Preparation

For this tutorial, we use a subset of the [VCTK](https://datashare.ed.ac.uk/handle/10283/2950) dataset with 5 speakers (p225-p229).

In [None]:
import os
import tarfile
import wget

from nemo.collections.asr.parts.utils.manifest_utils import read_manifest, write_manifest

In [None]:
# Create dataset directory
root_dir = Path("/content")
data_root = root_dir / "data"

data_root.mkdir(parents=True, exist_ok=True)

In [None]:
# Download the dataset
dataset_url = "https://vctk-subset.s3.amazonaws.com/vctk_subset_multispeaker.tar.gz"
dataset_tar_filepath = data_root / "vctk.tar.gz"

if not os.path.exists(dataset_tar_filepath):
    wget.download(dataset_url, out=str(dataset_tar_filepath))

In [None]:
# Extract the dataset
with tarfile.open(dataset_tar_filepath) as tar_f:
    tar_f.extractall(data_root)

In [None]:
DATA_DIR = data_root / "vctk_subset_multispeaker"

In [None]:
# Visualize the raw dataset
train_raw_filepath = DATA_DIR / "train.json"
!head $train_raw_filepath

## Manifest Processing

The downloaded manifest is formatted for TTS training, so it contains metadata such as text and speaker.

For codec training we only need the `audio_filepath`. The `audio_filepath` field can either be an *absolute path*, or a *relative path* with the root directory provided as an argument to each script. Here we use relative paths.

If you include `duration`, the training script will automatically calculate the total size of every dataset used. The field is also useful for filtering based on utterance length.
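To make the two fields concrete, here is a minimal sketch that reads a manifest, resolves each relative `audio_filepath` against a root directory, and totals `duration`. It assumes one JSON object per line; the function name `summarize_manifest` and the `max_duration` cutoff are hypothetical and are not part of the NeMo scripts.

```python
import json
from pathlib import Path


def summarize_manifest(manifest_path, audio_root, max_duration=None):
    """Resolve audio paths and total the `duration` field of a JSON-lines manifest."""
    audio_root = Path(audio_root)
    kept, total_seconds = [], 0.0
    with open(manifest_path, encoding="utf-8") as manifest_f:
        for line in manifest_f:
            if not line.strip():
                continue  # Ignore blank lines.
            entry = json.loads(line)
            # Skip utterances longer than the optional cutoff (hypothetical parameter).
            if max_duration is not None and entry["duration"] > max_duration:
                continue
            # On POSIX, joining a root with an absolute path keeps the absolute path,
            # so both relative and absolute entries are handled by the same expression.
            entry["resolved_path"] = str(audio_root / entry["audio_filepath"])
            kept.append(entry)
            total_seconds += entry["duration"]
    return kept, total_seconds
```

Calling `summarize_manifest(DATA_DIR / "train_raw.json", DATA_DIR / "audio", max_duration=20.0)` after the manifest update cells below have run would return the retained entries and their combined length in seconds.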

In [None]:
def update_manifest(data_type):
    input_filepath = DATA_DIR / f"{data_type}.json"
    output_filepath = DATA_DIR / f"{data_type}_raw.json"

    entries = read_manifest(input_filepath)
    new_entries = []
    for entry in entries:
        # Provide relative path instead of absolute path
        audio_filepath = entry["audio_filepath"].replace("audio/", "")
        duration = round(entry["duration"], 2)
        new_entry = {
            "audio_filepath": audio_filepath,
            "duration": duration
        }
        new_entries.append(new_entry)

    write_manifest(output_path=output_filepath, target_manifest=new_entries, ensure_ascii=False)

In [None]:
update_manifest("dev")
update_manifest("train")

In [None]:
# Visualize updated 'audio_filepath' field.
train_filepath = DATA_DIR / "train_raw.json"
!head $train_filepath

## Audio Preprocessing

Next we process the audio data using [preprocess_audio.py](https://github.com/NVIDIA/NeMo/blob/main/scripts/dataset_processing/tts/preprocess_audio.py).

During this step we can apply the following transformations:

1. Resample the audio from 48 kHz to the target sample rate for codec training.
2. Remove long silence from the beginning and end of each audio file. This can be done using an *energy*-based approach, which works on clean audio, or using *voice activity detection (VAD)*, which is slower but also works on audio with background or static noise (e.g. from a microphone). Here we suggest VAD because some audio in VCTK has background noise.
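To illustrate the idea behind energy-based trimming, the sketch below measures the RMS level of fixed-size frames and keeps the span between the first and last frame above a threshold. It is a toy illustration only: it is not the logic used by `preprocess_audio.py` or `energy.yaml`, and `frame_size` and `threshold_db` are hypothetical parameters chosen for the example.

```python
import math


def trim_silence(samples, frame_size=4, threshold_db=-40.0):
    """Return the half-open (start, end) sample range covering frames louder than threshold_db.

    `samples` is a list of floats in [-1.0, 1.0].
    """
    loud_frames = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        # Convert RMS to decibels relative to full scale; guard against log(0).
        level_db = 20.0 * math.log10(max(rms, 1e-10))
        if level_db > threshold_db:
            loud_frames.append(start)
    if not loud_frames:
        return 0, 0  # Every frame is below the threshold.
    first, last = loud_frames[0], loud_frames[-1]
    return first, min(last + frame_size, len(samples))


# Toy signal: 8 silent samples, 8 loud samples, 8 silent samples.
signal = [0.0] * 8 + [0.5, -0.5] * 4 + [0.0] * 8
start, end = trim_silence(signal)
print(start, end)  # 8 16
```

A fixed threshold like this is what makes the approach fragile on noisy recordings: steady background noise can sit above the threshold and never be trimmed, which is why VAD is suggested for VCTK.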

In [None]:
import IPython.display as ipd

In [None]:
# Python wrapper to invoke the given Python script with the given input args
def run_script(script, args):
    args = ' \\'.join(args)
    cmd = f"python {script} \\{args}"

    print(cmd.replace(" \\", "\n"))
    print()
    !$cmd
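
The `run_script` helper above relies on IPython's `!` shell escape, so it only works inside a notebook. If you want to run the same scripts from a plain Python file, a rough equivalent can be built on `subprocess`. The names `build_command` and `run_script_subprocess` are hypothetical and are not part of NeMo.

```python
import subprocess
import sys


def build_command(script, args):
    """Build an argument list equivalent to `python <script> <args...>`."""
    # sys.executable is the interpreter running this code, which keeps the
    # child process in the same Python environment.
    return [sys.executable, str(script), *[str(arg) for arg in args]]


def run_script_subprocess(script, args):
    """Hypothetical stand-in for `run_script` that works outside IPython."""
    cmd = build_command(script, args)
    print("\n".join(cmd))
    print()
    # check=True raises CalledProcessError if the script exits with a non-zero status.
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list bypasses the shell, so values containing spaces or shell metacharacters need no extra quoting.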

In [None]:
audio_preprocessing_script = NEMO_SCRIPT_DIR / "preprocess_audio.py"

# Directory with raw audio data
input_audio_dir = DATA_DIR / "audio"
# Directory to write preprocessed audio to
output_audio_dir = DATA_DIR / "audio_preprocessed"
# Whether to overwrite existing audio, if it exists in the output directory
overwrite_audio = True
# Whether to overwrite output manifest, if it exists
overwrite_manifest = True
# Number of threads to parallelize audio processing across
num_workers = 4
# Format of output audio files. Use "flac" to compress to a smaller file size.
output_format = "flac"
# Method for silence trimming. Can use "energy.yaml" or "vad.yaml".
trim_config_path = NEMO_CONFIG_DIR / "trim" / "vad.yaml"

def preprocess_audio(data_type):
    input_filepath = DATA_DIR / f"{data_type}_raw.json"
    output_filepath = DATA_DIR / f"{data_type}_manifest.json"

    args = [
        f"--input_manifest={input_filepath}",
        f"--output_manifest={output_filepath}",
        f"--input_audio_dir={input_audio_dir}",
        f"--output_audio_dir={output_audio_dir}",
        f"--num_workers={num_workers}",
        f"--output_sample_rate={SAMPLE_RATE}",
        f"--output_format={output_format}",
        f"--trim_config_path={trim_config_path}"
    ]
    if overwrite_manifest:
        args.append("--overwrite_manifest")
    if overwrite_audio:
        args.append("--overwrite_audio")

    run_script(audio_preprocessing_script, args)

In [None]:
preprocess_audio("dev")

In [None]:
preprocess_audio("train")

We should listen to a few audio files before and after processing to be sure we configured it correctly.

Note that the processed audio is shorter because we trimmed the leading and trailing silence.

In [None]:
audio_file = "p228_009.wav"
audio_filepath = input_audio_dir / audio_file
processed_audio_filepath = output_audio_dir / audio_file.replace(".wav", ".flac")

print("Original audio.")
ipd.display(ipd.Audio(audio_filepath))

print("Processed audio.")
ipd.display(ipd.Audio(processed_audio_filepath))

# Audio Codec Training

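Here we show how to train an audio codec model. With the default sample rate of 16000, the configuration cell below initializes training from a pretrained NGC checkpoint; the other sample rates have no checkpoint configured in this notebook and train from scratch.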


In [None]:
import torch
from omegaconf import OmegaConf

In [None]:
dataset_name = "vctk"
audio_dir = DATA_DIR / "audio_preprocessed"
train_manifest_filepath = DATA_DIR / "train_manifest.json"
dev_manifest_filepath = DATA_DIR / "dev_manifest.json"

In [None]:
audio_codec_training_script = NEMO_EXAMPLES_DIR / "audio_codec.py"

# The total number of training steps will be (epochs * steps_per_epoch)
epochs = 10
steps_per_epoch = 10

# Config files specifying all codec parameters
codec_config_dir = NEMO_CONFIG_DIR / "audio_codec"

# Select model config depending on target sample rate.
if SAMPLE_RATE == 16000:
    codec_config_filename = "audio_codec_16000.yaml"
    ngc_model_name = "audio_codec_16khz_small"
    ngc_model_url = "https://api.ngc.nvidia.com/v2/models/nvidia/nemo/audio_codec_16khz_small/versions/v1/files/audio_codec_16khz_small.nemo"
elif SAMPLE_RATE == 22050:
    codec_config_filename = "mel_codec_22050.yaml"
    ngc_model_name = None
    ngc_model_url = None
elif SAMPLE_RATE == 24000:
    codec_config_filename = "audio_codec_24000.yaml"
    ngc_model_name = None
    ngc_model_url = None
elif SAMPLE_RATE == 44100:
    codec_config_filename = "mel_codec_44100.yaml"
    ngc_model_name = None
    ngc_model_url = None
else:
    raise ValueError(f"Config file not available for sample rate {SAMPLE_RATE}")

config_filepath = codec_config_dir / codec_config_filename
omega_conf = OmegaConf.load(config_filepath)
model_name = omega_conf.name

# Name of the experiment that will determine where it is saved locally and in TensorBoard and WandB
run_id = "test_run"
exp_dir = root_dir / "exps"
codec_exp_output_dir = exp_dir / model_name / run_id
# Directory where predicted audio will be stored periodically throughout training
codec_log_dir = codec_exp_output_dir / "logs"
# Optionally log visualization of learned codes.
log_dequantized = True
# Optionally log predicted audio and other artifacts to WandB
log_to_wandb = False
# Optionally log predicted audio and other artifacts to TensorBoard
log_to_tensorboard = False

if torch.cuda.is_available():
    accelerator = "gpu"
    batch_size = 4
else:
    accelerator = "cpu"
    batch_size = 2

args = [
    f"--config-path={codec_config_dir}",
    f"--config-name={codec_config_filename}",
    f"max_epochs={epochs}",
    f"weighted_sampling_steps_per_epoch={steps_per_epoch}",
    f"batch_size={batch_size}",
    f"log_dir={codec_log_dir}",
    f"exp_manager.exp_dir={exp_dir}",
    f"+exp_manager.version={run_id}",
    f"model.log_config.log_wandb={log_to_wandb}",
    f"model.log_config.log_tensorboard={log_to_tensorboard}",
    f"model.log_config.generators.0.log_dequantized={log_dequantized}",
    f"trainer.accelerator={accelerator}",
    f"+train_ds_meta.{dataset_name}.manifest_path={train_manifest_filepath}",
    f"+train_ds_meta.{dataset_name}.audio_dir={audio_dir}",
    f"+val_ds_meta.{dataset_name}.manifest_path={dev_manifest_filepath}",
    f"+val_ds_meta.{dataset_name}.audio_dir={audio_dir}",
    f"+log_ds_meta.{dataset_name}.manifest_path={dev_manifest_filepath}",
    f"+log_ds_meta.{dataset_name}.audio_dir={audio_dir}"
]

# Optionally load pretrained checkpoint
if ngc_model_name is not None:
    model_checkpoint_path = root_dir / "models" / f"{ngc_model_name}.nemo"

    if not os.path.exists(model_checkpoint_path):
        model_checkpoint_path.parent.mkdir(exist_ok=True)
        wget.download(ngc_model_url, out=str(model_checkpoint_path))

    args.append(f"+init_from_nemo_model={model_checkpoint_path}")

In [None]:
# If an error occurs, log the entire stacktrace.
os.environ["HYDRA_FULL_ERROR"] = "1"

In [None]:
# Do the model training. For some configurations this step might hang when using CPU.
run_script(audio_codec_training_script, args)

During training, the model will automatically save predictions for all audio files specified in the `log_ds_meta` manifest.
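If you change the number of epochs, or want to inspect a run that is still in progress, it can help to locate the most recent prediction directory programmatically. The sketch below assumes the log directory contains one `epoch_<N>` subdirectory per logged epoch, which is inferred from the `epoch_10` path used in this notebook rather than from a documented guarantee. The helper name `latest_epoch_dir` is hypothetical.

```python
import re
from pathlib import Path


def latest_epoch_dir(log_dir):
    """Return the `epoch_<N>` subdirectory with the largest N, or None if there is none."""
    best_epoch, best_dir = -1, None
    for child in Path(log_dir).iterdir():
        match = re.fullmatch(r"epoch_(\d+)", child.name)
        # Compare numerically so that epoch_10 sorts after epoch_2.
        if child.is_dir() and match and int(match.group(1)) > best_epoch:
            best_epoch, best_dir = int(match.group(1)), child
    return best_dir
```

With the variables defined in this notebook, `latest_epoch_dir(codec_log_dir) / dataset_name` would point at the newest set of predictions, assuming at least one epoch has been logged.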

In [None]:
# Predictions from the final epoch (epochs = 10 in the configuration above)
codec_log_epoch_dir = codec_log_dir / f"epoch_{epochs}" / dataset_name
!ls $codec_log_epoch_dir

This makes it easy to listen to the audio to determine how well the model is performing. We can decide to stop training when either:

*   The predicted audio sounds almost identical to the original audio.
*   The predicted audio stops improving between epochs.
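
Listening remains the check this tutorial relies on. If you also record one number per epoch, such as a validation loss (an assumption; this notebook does not collect one), the second criterion can be approximated with a simple plateau test. The function `has_plateaued` and its `patience` and `min_delta` parameters are hypothetical.

```python
def has_plateaued(losses, patience=3, min_delta=0.0):
    """Return True if the last `patience` epochs improved on the earlier best loss by at most min_delta."""
    if len(losses) <= patience:
        return False  # Not enough history to compare against.
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_recent >= best_before - min_delta
```

For example, `has_plateaued([1.0, 0.8, 0.7, 0.7, 0.71, 0.7])` returns `True` because none of the last three epochs beats the earlier best of 0.7.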

**Note that when training from scratch, the dataset in this tutorial is too small to get good audio quality.**

In [None]:
audio_filepath_ground_truth = output_audio_dir / "p228_009.flac"
audio_filepath_reconstructed = codec_log_epoch_dir / "p228_009_audio_out.wav"

print("Ground truth audio.")
ipd.display(ipd.Audio(audio_filepath_ground_truth))

print("Reconstructed audio.")
ipd.display(ipd.Audio(audio_filepath_reconstructed))

dequantized_filepath = codec_log_epoch_dir / "p228_009_dequantized.png"
ipd.Image(dequantized_filepath)