# Tutorial: NeMo & Lhotse Data Loading

![image](https://raw.githubusercontent.com/lhotse-speech/lhotse/master/docs/logo.png)

In this tutorial we introduce the integration of NeMo with Lhotse, a library for speech data preparation and loading. Lhotse adds new capabilities to NeMo by moving certain operations, such as bucketing or dataset blending, from dataset preparation to dataloading runtime. This gives greater flexibility to try out various dataset setups without preparing different data variants ahead of time.

In [None]:
import os
import pickle
from pathlib import Path

import nemo
import lhotse
from lhotse import CutSet
from lhotse.recipes import (
    download_librispeech, 
    download_yesno, 
    prepare_librispeech, 
    prepare_yesno,
)

In [None]:
nemo_root = str(Path(nemo.__path__[0]).parent)
root_dir = Path("data")
root_dir.mkdir(parents=True, exist_ok=True)
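# Leave one core free. os.cpu_count() may return None, and we need at least one job.
num_jobs = max(1, (os.cpu_count() or 2) - 1)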

## Introduction

### Quick Lhotse Primer

[Lhotse](https://github.com/lhotse-speech/lhotse) is a toolkit for speech data handling that originated as a part of the next-generation Kaldi framework. Next-gen Kaldi includes other tools, such as [k2](https://github.com/k2-fsa/k2), which is also [integrated into NeMo](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/experimental/k2/speech_to_text_bpe.py).

Lhotse represents speech data as Python objects, and provides efficient sampling and dataloading mechanisms. The core Lhotse concepts leveraged in the NeMo integration are:
- Representation for individual examples: [Recording](https://lhotse.readthedocs.io/en/latest/corpus.html#recording-manifest), [SupervisionSegment](https://lhotse.readthedocs.io/en/latest/corpus.html#supervision-manifest), and [Cut](https://lhotse.readthedocs.io/en/latest/cuts.html#cuts).
- Representation for a dataset and/or mini-batch: [CutSet](https://lhotse.readthedocs.io/en/latest/api.html#lhotse.cut.CutSet).
- Samplers (stratified and non-stratified): [DynamicCutSampler](https://lhotse.readthedocs.io/en/latest/datasets.html#lhotse.dataset.sampling.DynamicCutSampler) and [DynamicBucketingSampler](https://lhotse.readthedocs.io/en/latest/datasets.html#lhotse.dataset.sampling.DynamicBucketingSampler).
- A specific method of blending multiple datasets together, based on a probabilistic multiplexer: [CutSet.mux](https://lhotse.readthedocs.io/en/latest/api.html?highlight=mux#lhotse.cut.CutSet.mux).

### How is Lhotse integrated into NeMo?

Like NeMo, Lhotse leverages JSON Lines (JSONL) format to keep metadata in manifests. Unlike NeMo, Lhotse's manifests are mapped into Python objects. Most data represented by NeMo manifests are a special case supported by Lhotse and can be adapted on-the-fly to be directly read as a Lhotse `CutSet`. This conversion is enabled by [LazyNeMoIterator](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/common/data/lhotse/nemo_adapters.py#L33) and [LazyTarredNeMoIterator](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/common/data/lhotse/nemo_adapters.py#L131) classes inside NeMo.

Building a `DataLoader` with Lhotse relies heavily on a `CutSampler`, which samples a mini-batch at the metadata level as a `CutSet`. The `CutSet` is then passed to a map-style PyTorch dataset that converts it to a tuple/dict of tensors. This is covered in more detail in the Lhotse tutorials:
- Introductory tutorial [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lhotse-speech/lhotse/blob/master/examples/00-basic-workflow.ipynb)
- Tutorial on Lhotse Shar format ("tarred" data format in Lhotse): [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lhotse-speech/lhotse/blob/master/examples/04-lhotse-shar.ipynb)

NeMo constructs the dataloader from a dataset configuration using the function [get_lhotse_dataloader_from_config](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/common/data/lhotse/dataloader.py#L103).
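To make the sampler-then-dataset split more concrete, here is a minimal sketch in plain Python. `ToyCut`, `toy_sampler`, and `ToyDataset` are hypothetical stand-ins invented for this illustration; they are not Lhotse or NeMo classes, and a real dataset would load audio and return padded tensors.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class ToyCut:
    """Hypothetical stand-in for a Lhotse Cut: metadata only, no audio loaded yet."""
    id: str
    duration: float  # seconds
    text: str


def toy_sampler(cuts: List[ToyCut], max_duration: float) -> Iterator[List[ToyCut]]:
    """Yield mini-batches of metadata whose total duration stays within max_duration."""
    batch: List[ToyCut] = []
    total = 0.0
    for cut in cuts:
        # Close the current batch if adding this cut would exceed the budget.
        if batch and total + cut.duration > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(cut)
        total += cut.duration
    if batch:
        yield batch


class ToyDataset:
    """Map-style dataset indexed by a whole mini-batch of cuts rather than an integer."""

    def __getitem__(self, cuts: List[ToyCut]) -> Dict[str, list]:
        # A real dataset would read audio here and collate it into tensors.
        return {
            "ids": [c.id for c in cuts],
            "durations": [c.duration for c in cuts],
            "texts": [c.text for c in cuts],
        }


cuts = [ToyCut("a", 4.0, "hello"), ToyCut("b", 5.0, "world"), ToyCut("c", 7.0, "again")]
dataset = ToyDataset()
batches = [dataset[batch] for batch in toy_sampler(cuts, max_duration=10.0)]
print([b["ids"] for b in batches])
```

The point of the design is that all batching decisions happen on lightweight metadata, so the expensive audio I/O only runs once the batch composition is already known.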

### Getting mini LibriSpeech

For the sake of the tutorial, we'll be using very small datasets to keep things quick. The first example is mini LibriSpeech, a 5h train set and 2h dev set collection that was released alongside the popular LibriSpeech dataset. Lhotse provides a download function that fetches and unpacks the archive with data, and a prepare function that creates Lhotse manifests.

In [None]:
libri_root = download_librispeech(root_dir, dataset_parts="mini_librispeech")

libri = prepare_librispeech(
    libri_root, output_dir=root_dir, num_jobs=num_jobs
)

for split in ("train-clean-5", "dev-clean-2"):
    (
        CutSet
        .from_manifests(**libri[split])
        .to_file(f"data/librispeech_cuts_{split}.jsonl.gz")
    )

# Training NeMo models with Lhotse 

For a quick illustration, we'll fine-tune a small NeMo model, `nvidia/stt_en_conformer_ctc_small`, for 100 steps, running validation every 50 steps. When training from scratch, you'd simply use a different NeMo script while keeping the same dataloading options (perhaps tuning the batch size settings to fit your model choice).

**Technical note.** We train with the existing NeMo `speech_to_text_finetune.py` script to illustrate real-world usage patterns of NeMo. If you're unfamiliar with running bash commands from a Jupyter notebook: starting a line with an exclamation mark `!` runs a single command, while starting a cell with `%%bash -s {var}` runs the whole cell in bash and passes `var` as argument `$1` (subsequent `-s` args are passed as `$2`, `$3`, and so on).

**Common arguments.** Let's briefly discuss the CLI arguments we provide to the fine-tuning script.
- Most of the options are in `model.train_ds` and `model.validation_ds` namespaces:
  - `use_lhotse=true` enables Lhotse dataloading backend
  - Batch size is dynamic and controlled via `batch_duration`, `use_bucketing`, `num_buckets`, and `quadratic_duration`. [Please refer to **NeMo documentation** for details](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/datasets.html#enabling-lhotse-via-configuration).
  - We set `batch_size=null` explicitly (the base config has some value that would have otherwise limited our dynamic batch size).
  - Although we don't enable dynamic batch size for validation in this example for brevity, it does also work (including with bucketing).
- `trainer.use_distributed_sampler=false` is required by Lhotse (it has its own distributed sampling handling).
- In `trainer` namespace, `max_steps` controls the total number of steps in training; `val_check_interval` is the number of steps between validation runs.
- The `+` prefix appends a new key to the config (i.e., it is required when the key is not present in the YAML config file), while `++` adds the key or overrides it if it already exists. [See **Hydra override syntax** for details](https://hydra.cc/docs/1.1/advanced/override_grammar/basic/#basic-override-syntax).
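To build some intuition for how `batch_duration` and `quadratic_duration` interact, here is a rough sketch in plain Python. It assumes that the effective length of an utterance is `duration + duration**2 / quadratic_duration`, which is our understanding of Lhotse's time constraint; the helper names are hypothetical and the real sampler is more elaborate.

```python
from typing import Iterator, List, Optional, Sequence


def effective_duration(duration: float, quadratic_duration: Optional[float]) -> float:
    """Assumed penalty for long utterances: duration + duration**2 / quadratic_duration."""
    if quadratic_duration is None:
        return duration
    return duration + duration ** 2 / quadratic_duration


def dynamic_batches(
    durations: Sequence[float],
    batch_duration: float,
    quadratic_duration: Optional[float] = None,
) -> Iterator[List[float]]:
    """Greedily fill batches until the (effective) duration budget is used up."""
    batch: List[float] = []
    total = 0.0
    for d in durations:
        cost = effective_duration(d, quadratic_duration)
        if batch and total + cost > batch_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(d)
        total += cost
    if batch:
        yield batch


short = list(dynamic_batches([5.0] * 12, batch_duration=60.0))
long_linear = list(dynamic_batches([30.0] * 12, batch_duration=60.0))
long_quadratic = list(dynamic_batches([30.0] * 12, batch_duration=60.0, quadratic_duration=15.0))
print(len(short), len(long_linear), len(long_quadratic))
```

Short utterances pack into large batches, while the quadratic penalty shrinks batches of long utterances further than a purely linear budget would. This is useful for models whose memory use grows faster than linearly with sequence length.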

**Model inference.** We skip inference in this tutorial for brevity. [Please refer to **NeMo ASR inference documentation** to learn more](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/intro.html).

## I. Training using Lhotse CutSet

Let's start with training from our existing Lhotse manifests for mini LibriSpeech. Highlights for relevant CLI arguments:
- The manifest path is provided via `cuts_path` so the trainer knows to read this as a Lhotse manifest.
- We have to set `manifest_filepath=null` explicitly as older NeMo configs expect some value to be provided there regardless. 
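Before launching training, it can help to look at what is actually stored in the file passed as `cuts_path`. The sketch below reads the first few lines of a gzipped JSONL file with the standard library only; `peek_jsonl_gz` is a hypothetical helper, and the exact set of fields you'll see depends on your Lhotse version.

```python
import gzip
import json
from itertools import islice
from pathlib import Path
from typing import List, Union


def peek_jsonl_gz(path: Union[str, Path], n: int = 2) -> List[dict]:
    """Parse at most the first n lines of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in islice(f, n) if line.strip()]


# Only runs if the manifest from the earlier preparation step is present.
manifest = Path("data/librispeech_cuts_train-clean-5.jsonl.gz")
if manifest.exists():
    for item in peek_jsonl_gz(manifest):
        print(sorted(item.keys()))
```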

In [None]:
%%bash -s {nemo_root}
python $1/examples/asr/speech_to_text_finetune.py \
    +init_from_pretrained_model="nvidia/stt_en_conformer_ctc_small" \
    name="finetune_from_cutset" \
    +model.train_ds.use_lhotse=true \
    model.train_ds.manifest_filepath=null \
    +model.train_ds.cuts_path=data/librispeech_cuts_train-clean-5.jsonl.gz \
    model.train_ds.batch_size=null \
    +model.train_ds.batch_duration=300 \
    +model.train_ds.use_bucketing=true \
    +model.train_ds.num_buckets=30 \
    +model.train_ds.quadratic_duration=15 \
    +model.validation_ds.use_lhotse=true \
    model.validation_ds.manifest_filepath=null \
    +model.validation_ds.cuts_path=data/librispeech_cuts_dev-clean-2.jsonl.gz \
    model.validation_ds.batch_size=64 \
    +trainer.use_distributed_sampler=false \
    trainer.max_steps=100 \
    ++trainer.val_check_interval=50

In [None]:
!ls nemo_experiments/finetune_from_cutset/*/checkpoints/

## II. Training using NeMo manifest

Training from NeMo manifest format is done in a similar way. Highlights:
- We'll convert Lhotse CutSet manifests to NeMo manifests for this exercise. If you have existing data in NeMo format, just use it as-is.
- We can now use the `manifest_filepath` argument directly.
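A NeMo manifest is a JSON Lines file with one utterance per line. The sketch below checks a couple of example lines for the keys `audio_filepath`, `duration`, and `text`; treating exactly these three as required is an assumption of this sketch, and both the file paths and the `validate_nemo_manifest` helper are made up for illustration.

```python
import json
from typing import Iterable, List

# Assumption: these are the keys we expect every ASR manifest entry to carry.
REQUIRED_KEYS = ("audio_filepath", "duration", "text")


def validate_nemo_manifest(lines: Iterable[str]) -> List[dict]:
    """Parse JSONL lines and fail loudly if an entry lacks an expected key."""
    entries = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        entries.append(entry)
    return entries


example_lines = [
    json.dumps({"audio_filepath": "audio/utt1.flac", "duration": 3.2, "text": "hello world"}),
    json.dumps({"audio_filepath": "audio/utt2.flac", "duration": 1.8, "text": "good morning"}),
]
entries = validate_nemo_manifest(example_lines)
total_seconds = sum(entry["duration"] for entry in entries)
print(f"{len(entries)} utterances, {total_seconds:.1f} s in total")
```

To check a real manifest, pass an open file handle instead of `example_lines`, since iterating over a text file yields its lines.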

In [None]:
for split in ("train-clean-5", "dev-clean-2"):
    lhotse.serialization.save_to_jsonl(
        (
            {
                "audio_filepath": cut.recording.sources[0].source,
                "duration": cut.duration,
                "text": cut.supervisions[0].text,
                "lang": cut.supervisions[0].language,
                "sampling_rate": cut.sampling_rate,
            }
            for cut in CutSet.from_file(f"data/librispeech_cuts_{split}.jsonl.gz")
        ),
        f"data/nemo_{split}.json",
    )

In [None]:
%%bash -s {nemo_root}
python $1/examples/asr/speech_to_text_finetune.py \
    +init_from_pretrained_model="nvidia/stt_en_conformer_ctc_small" \
    name="finetune_from_nemo_json" \
    +model.train_ds.use_lhotse=true \
    model.train_ds.manifest_filepath=data/nemo_train-clean-5.json \
    model.train_ds.batch_size=null \
    +model.train_ds.batch_duration=300 \
    +model.train_ds.use_bucketing=true \
    +model.train_ds.num_buckets=30 \
    +model.train_ds.quadratic_duration=15 \
    +model.validation_ds.use_lhotse=true \
    model.validation_ds.manifest_filepath=data/nemo_dev-clean-2.json \
    model.validation_ds.batch_size=64 \
    +trainer.use_distributed_sampler=false \
    trainer.max_steps=100 \
    ++trainer.val_check_interval=50

In [None]:
!ls nemo_experiments/finetune_from_nemo_json/*/checkpoints/

## III. Training using NeMo tarred manifest

Tarred manifest format is useful when dealing with I/O bottlenecks. Typical scenarios are compute grids with shared resources, magnetic disks, cloud storage such as AWS S3, GCS, Azure Blob Storage, etc. 

We'll cover the NeMo native tarred format below. If you're interested in the Lhotse Shar tarred format, it also works in NeMo -- [please visit the **relevant Lhotse Shar tutorial** for details](https://colab.research.google.com/github/lhotse-speech/lhotse/blob/master/examples/04-lhotse-shar.ipynb).

Highlights:
- First, let's convert the data to NeMo tarred manifests using the dedicated NeMo script `convert_to_tarred_audio_dataset.py`. We'll split the data into 32 shards here.
- For training, we'll use `manifest_filepath` for sharded JSONL manifests, and `tarred_audio_filepaths` for sharded audio tar files. Note the NeMo-specific syntax of `_OP_0..31_CL_`, which tells NeMo to expand it into a list of 32 items.
- The only other modification is the added `trainer.limit_train_batches` option, which tells the trainer the size of a pseudo-epoch. In previous examples we could have measured training length in epochs, but chose not to. With tarred datasets, however, we completely discard the notion of an epoch -- the data iterator is infinite, so we count steps for everything instead. This design prevents hanging caused by uneven dataloader lengths in distributed training.
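To see what the `_OP_0..31_CL_` placeholder stands for, here is a small stand-alone sketch of the expansion. `expand_op_cl` is a hypothetical helper written for this tutorial; NeMo's own implementation may differ (for example, in how it handles zero-padded or multiple ranges).

```python
import re
from typing import List

# Matches a single numeric range written as _OP_<start>..<end>_CL_.
_RANGE = re.compile(r"_OP_(\d+)\.\.(\d+)_CL_")


def expand_op_cl(pattern: str) -> List[str]:
    """Expand one _OP_a..b_CL_ range into the list of concrete paths."""
    match = _RANGE.search(pattern)
    if match is None:
        return [pattern]  # nothing to expand
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [f"{prefix}{i}{suffix}" for i in range(start, end + 1)]


shards = expand_op_cl("tarred/audio__OP_0..31_CL_.tar")
print(len(shards), shards[0], shards[-1])
```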

In [None]:
%%bash -s {nemo_root}
python $1/scripts/speech_recognition/convert_to_tarred_audio_dataset.py \
    --manifest_path data/nemo_train-clean-5.json \
    --num_shards 32 \
    --workers 4 \
    --max_duration 30

In [None]:
%%bash -s {nemo_root}
python $1/examples/asr/speech_to_text_finetune.py \
    +init_from_pretrained_model="nvidia/stt_en_conformer_ctc_small" \
    name="finetune_from_nemo_tarred" \
    +model.train_ds.use_lhotse=true \
    model.train_ds.manifest_filepath=tarred/sharded_manifests/manifest__OP_0..31_CL_.json \
    model.train_ds.tarred_audio_filepaths=tarred/audio__OP_0..31_CL_.tar \
    model.train_ds.batch_size=null \
    +model.train_ds.batch_duration=300 \
    +model.train_ds.use_bucketing=true \
    +model.train_ds.num_buckets=30 \
    +model.train_ds.quadratic_duration=15 \
    +model.validation_ds.use_lhotse=true \
    model.validation_ds.manifest_filepath=data/nemo_dev-clean-2.json \
    model.validation_ds.batch_size=64 \
    +trainer.use_distributed_sampler=false \
    trainer.max_steps=100 \
    +trainer.limit_train_batches=50 \
    ++trainer.val_check_interval=50

In [None]:
!ls nemo_experiments/finetune_from_nemo_tarred/*/checkpoints/

## IV. Dynamically mixing multiple NeMo tarred datasets

Let's introduce another feature of NeMo+Lhotse dataloading: mixing multiple datasets together. 

Lhotse supports a special type of data mixing that we call **weighted stochastic multiplexing**. When iterating over multiple independent data sources, at each step we sample one source according to the source weights and pick the next item from it. When weights are not provided, we count the number of elements in each source before dataloading starts and use those counts as "natural" weights (this can be time-consuming, so it's best to precompute the weights and provide them explicitly). This sampling strategy ensures that each mini-batch consists of a roughly constant blend of data from multiple sources throughout training.
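The idea of weighted stochastic multiplexing can be sketched in a few lines of plain Python. `weighted_mux` below is a hypothetical illustration and not Lhotse's `CutSet.mux`; in particular, this sketch keeps drawing from the remaining sources after one runs out, which is only one of several possible stopping policies.

```python
import random
from typing import Iterable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def weighted_mux(
    sources: Sequence[Iterable[T]], weights: Sequence[float], seed: int = 0
) -> Iterator[T]:
    """At each step, pick a source at random (by weight) and yield its next item."""
    if len(sources) != len(weights):
        raise ValueError("sources and weights must have the same length")
    rng = random.Random(seed)
    iterators = [iter(source) for source in sources]
    active = list(range(len(iterators)))  # indices of sources that still have items
    while active:
        idx = rng.choices(active, weights=[weights[i] for i in active], k=1)[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            active.remove(idx)  # this source is exhausted, so stop sampling it


libri = [f"libri-{i}" for i in range(800)]
yesno = [f"yesno-{i}" for i in range(200)]
mixed = list(weighted_mux([libri, yesno], weights=[0.8, 0.2], seed=0))
print(mixed[:10])
```

Because the source is re-sampled for every item, any window of consecutive items (and thus any mini-batch) contains roughly the same proportions, rather than long runs from a single dataset.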

### Fetch another dataset (yesno)

For these experiments, we download another small dataset called `yesno`. It's a bunch of recordings where a single person says either "yes" or "no" in Hebrew, with transcriptions in English (so it's technically a speech translation task). We convert the data to NeMo tarred format, same as previously.

In [None]:
yesno_root = download_yesno(root_dir)

yesno = prepare_yesno(yesno_root, output_dir=root_dir)

for split in ("train", "test"):
    cuts = CutSet.from_manifests(**yesno[split])
    cuts.to_file(f"data/yesno_cuts_{split}.jsonl.gz")
    lhotse.serialization.save_to_jsonl(
        (
            {
                "audio_filepath": cut.recording.sources[0].source,
                "duration": cut.duration,
                "text": cut.supervisions[0].text,
                "lang": cut.supervisions[0].language,
                "sampling_rate": cut.sampling_rate,
            }
            for cut in cuts
        ),
        f"data/yesno_{split}.json",
    )

In [None]:
%%bash -s {nemo_root}
python $1/scripts/speech_recognition/convert_to_tarred_audio_dataset.py \
    --manifest_path data/yesno_train.json \
    --num_shards 1 \
    --workers 1 \
    --max_duration 30 \
    --target_dir yesno_tarred

### [optional] Maximum efficiency: precomputing bucket duration bins

We can use the NeMo script `estimate_duration_bins.py` to precompute the optimal bucketing settings for our data blend. This step is optional, so the cell below is commented out; when you do run it, pass the resulting bins to the training script.
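To illustrate what "duration bins" mean, the sketch below picks bucket boundaries so that every bucket holds roughly the same total amount of audio. This is only one plausible strategy, with a hypothetical helper name; the NeMo script may compute its bins differently, and it additionally accounts for the dataset weights.

```python
from bisect import bisect_right
from typing import List, Sequence


def toy_duration_bins(durations: Sequence[float], num_buckets: int) -> List[float]:
    """Return up to num_buckets - 1 boundaries splitting the data by total duration."""
    if num_buckets < 2:
        return []
    ordered = sorted(durations)
    per_bucket = sum(ordered) / num_buckets  # target amount of audio per bucket
    bins: List[float] = []
    running = 0.0
    for d in ordered:
        # Start a new bucket once the current one has reached its share.
        if running >= per_bucket and len(bins) < num_buckets - 1:
            bins.append(d)
            running = 0.0
        running += d
    return bins


durations = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 4.0, 4.0, 8.0]
bins = toy_duration_bins(durations, num_buckets=3)
# Each utterance lands in the bucket whose boundaries enclose its duration.
bucket_ids = [bisect_right(bins, d) for d in durations]
print(bins, bucket_ids)
```

With boundaries precomputed like this, the sampler does not need to scan the data to estimate them when training starts.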

In [None]:
# %%bash -s {nemo_root}
# python $1/scripts/speech_recognition/estimate_duration_bins.py \
#     --buckets 30 \
#     '[[tarred/sharded_manifests/manifest__OP_0..31_CL_.json,0.8],[yesno_tarred/tarred_audio_manifest.json,0.2]]'

### Training with NeMo tarred manifests from two separate datasets

The training command is similar to the examples above. Highlights:
- Note that `manifest_filepath` and `tarred_audio_filepaths` now use a special list syntax for providing multiple data sources. Both lists must list the datasets in the same order.
- The list in `manifest_filepath` additionally specifies the weight for each dataset. The weights can be greater than one, and will be automatically re-normalized later.
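As a quick illustration of the re-normalization, the sketch below turns arbitrary positive weights into sampling probabilities. `normalize_weights` is a hypothetical helper, not a NeMo function; it shows that weights of 4 and 1 describe the same blend as 0.8 and 0.2.

```python
from typing import List, Tuple


def normalize_weights(sources: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Rescale the weights so that they sum to one."""
    total = sum(weight for _, weight in sources)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [(path, weight / total) for path, weight in sources]


blend = normalize_weights(
    [
        ("tarred/sharded_manifests/manifest__OP_0..31_CL_.json", 4.0),
        ("yesno_tarred/tarred_audio_manifest.json", 1.0),
    ]
)
for path, probability in blend:
    print(f"{probability:.2f}  {path}")
```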

In [None]:
%%bash -s {nemo_root}
python $1/examples/asr/speech_to_text_finetune.py \
    +init_from_pretrained_model="nvidia/stt_en_conformer_ctc_small" \
    name="finetune_from_nemo_multidataset" \
    +model.train_ds.use_lhotse=true \
    model.train_ds.manifest_filepath=[[tarred/sharded_manifests/manifest__OP_0..31_CL_.json,0.8],[yesno_tarred/tarred_audio_manifest.json,0.2]] \
    model.train_ds.tarred_audio_filepaths=[[tarred/audio__OP_0..31_CL_.tar],[yesno_tarred/audio_0.tar]] \
    model.train_ds.batch_size=null \
    +model.train_ds.batch_duration=300 \
    +model.train_ds.use_bucketing=true \
    +model.train_ds.num_buckets=30 \
    +model.train_ds.quadratic_duration=15 \
    +model.validation_ds.use_lhotse=true \
    model.validation_ds.manifest_filepath=[data/nemo_dev-clean-2.json,data/yesno_test.json] \
    model.validation_ds.batch_size=64 \
    +trainer.use_distributed_sampler=false \
    trainer.max_steps=100 \
    +trainer.limit_train_batches=50 \
    ++trainer.val_check_interval=50

In [None]:
!ls nemo_experiments/finetune_from_nemo_multidataset/*/checkpoints/