# Tutorial: NeMo & Lhotse Data Loading

![image](https://raw.githubusercontent.com/lhotse-speech/lhotse/master/docs/logo.png)

In this tutorial we introduce the integration of NeMo with Lhotse, a library for speech data preparation and loading. Lhotse adds new capabilities to NeMo by moving certain operations, such as bucketing or dataset blending, from dataset preparation to dataloading runtime. This gives greater flexibility to try out various dataset setups without preparing different data variants ahead of time.

In [1]:
import os
import pickle
from pathlib import Path

import nemo
import lhotse
from lhotse import CutSet
from lhotse.recipes import (
    download_librispeech, 
    download_yesno, 
    prepare_librispeech, 
    prepare_yesno,
)

In [2]:
nemo_root = str(Path(nemo.__path__[0]).parent)
root_dir = Path("data")
root_dir.mkdir(parents=True, exist_ok=True)
# os.cpu_count() may return None, and a single-core machine would yield 0 jobs.
num_jobs = max(1, (os.cpu_count() or 2) - 1)

## Introduction

### Quick Lhotse Primer

[Lhotse](https://github.com/lhotse-speech/lhotse) is a toolkit for speech data handling that originated as a part of the next-generation Kaldi framework. Next-gen Kaldi includes other tools, such as [k2](https://github.com/k2-fsa/k2), which is also [integrated into NeMo](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/experimental/k2/speech_to_text_bpe.py).

Lhotse represents speech data as Python objects and provides efficient sampling and dataloading mechanisms. The core Lhotse concepts leveraged in the NeMo integration are:
- Representation for individual examples: [Recording](https://lhotse.readthedocs.io/en/latest/corpus.html#recording-manifest), [SupervisionSegment](https://lhotse.readthedocs.io/en/latest/corpus.html#supervision-manifest), and [Cut](https://lhotse.readthedocs.io/en/latest/cuts.html#cuts).
- Representation for a dataset and/or mini-batch: [CutSet](https://lhotse.readthedocs.io/en/latest/api.html#lhotse.cut.CutSet).
- Samplers (stratified and non-stratified): [DynamicCutSampler](https://lhotse.readthedocs.io/en/latest/datasets.html#lhotse.dataset.sampling.DynamicCutSampler) and [DynamicBucketingSampler](https://lhotse.readthedocs.io/en/latest/datasets.html#lhotse.dataset.sampling.DynamicBucketingSampler).
- A specific method of blending multiple datasets together, based on a probabilistic multiplexer: [CutSet.mux](https://lhotse.readthedocs.io/en/latest/api.html?highlight=mux#lhotse.cut.CutSet.mux).
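
To give an intuition for what a probabilistic multiplexer does, here is a minimal sketch in plain Python. It is not Lhotse's implementation of `CutSet.mux`; the function name `mux_sketch` and its arguments are hypothetical, and plain strings stand in for cuts. At each step it picks one of the still-active sources at random according to the weights and yields that source's next item.

```python
import random
from typing import Iterable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def mux_sketch(
    sources: Sequence[Iterable[T]], weights: Sequence[float], seed: int = 0
) -> Iterator[T]:
    """Hypothetical helper: blend several iterables by weighted random choice."""
    rng = random.Random(seed)
    iterators = [iter(source) for source in sources]
    active = list(range(len(iterators)))
    while active:
        # Pick one of the sources that still has data, proportionally to its weight.
        idx = rng.choices(active, weights=[weights[i] for i in active], k=1)[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            # This source is exhausted; keep sampling from the remaining ones.
            active.remove(idx)


blend = list(mux_sketch([["a1", "a2", "a3"], ["b1", "b2"]], weights=[0.7, 0.3], seed=0))
print(blend)
```

Each source keeps its internal order, and only the interleaving is random. Because the choice happens while iterating, the blend ratio can be changed between runs without re-preparing any data.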

### How is Lhotse integrated into NeMo?

Like NeMo, Lhotse leverages JSON Lines (JSONL) format to keep metadata in manifests. Unlike NeMo, Lhotse's manifests are mapped into Python objects. Most data represented by NeMo manifests are a special case supported by Lhotse and can be adapted on-the-fly to be directly read as a Lhotse `CutSet`. This conversion is enabled by [LazyNeMoIterator](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/common/data/lhotse/nemo_adapters.py#L33) and [LazyTarredNeMoIterator](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/common/data/lhotse/nemo_adapters.py#L131) classes inside NeMo.
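
As a rough illustration of what "mapping manifest lines into Python objects" means, the sketch below reads NeMo-style JSONL lines lazily and turns each into a small dataclass. The `ManifestEntry` class and the `read_nemo_manifest` helper are hypothetical and use only the standard library; they are not the NeMo adapter classes linked above. The keys `audio_filepath`, `duration`, and `text` follow the NeMo manifest convention.

```python
import io
import json
from dataclasses import dataclass
from typing import IO, Iterator, Optional


@dataclass
class ManifestEntry:
    """Hypothetical stand-in for a richer object such as a Lhotse Cut."""

    audio_filepath: str
    duration: float
    text: Optional[str] = None


def read_nemo_manifest(lines: IO[str]) -> Iterator[ManifestEntry]:
    """Lazily convert JSONL lines into Python objects, one line at a time."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item = json.loads(line)
        yield ManifestEntry(
            audio_filepath=item["audio_filepath"],
            duration=float(item["duration"]),
            text=item.get("text"),
        )


# An in-memory file with two made-up entries, so the example is self-contained.
example = io.StringIO(
    '{"audio_filepath": "a.wav", "duration": 1.5, "text": "hello"}\n'
    '{"audio_filepath": "b.wav", "duration": 2.0, "text": "world"}\n'
)
entries = list(read_nemo_manifest(example))
print(entries[0])
```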

Building a `DataLoader` with Lhotse relies heavily on `CutSampler` to sample a mini-batch on metadata level as a `CutSet`, which is then passed to a map-style PyTorch dataset that converts the `CutSet` to a tuple/dict of tensors. This is covered in more detail in Lhotse tutorials:
- Introductory tutorial [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lhotse-speech/lhotse/blob/master/examples/00-basic-workflow.ipynb)
- Tutorial on Lhotse Shar format ("tarred" data format in Lhotse): [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lhotse-speech/lhotse/blob/master/examples/04-lhotse-shar.ipynb)
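
The division of labor described above (the sampler works on metadata, the dataset turns a whole mini-batch of metadata into arrays) can be sketched without any dependencies. Everything here is a hypothetical stand-in: `sample_batches` plays the role of a sampler, `BatchDataset` plays the role of a map-style dataset, and lists of zeros replace real audio loading and tensors.

```python
from typing import Dict, Iterator, List

# Hypothetical metadata record: an ID, a length in samples, and a transcript.
Meta = Dict[str, object]


def sample_batches(metadata: List[Meta], max_items: int) -> Iterator[List[Meta]]:
    """Stand-in for a sampler: yields mini-batches of metadata only."""
    for start in range(0, len(metadata), max_items):
        yield metadata[start : start + max_items]


class BatchDataset:
    """Stand-in for a map-style dataset that is indexed by a whole mini-batch."""

    def __getitem__(self, batch_meta: List[Meta]) -> Dict[str, list]:
        # Placeholder for audio loading: zeros with the length given in the metadata.
        audio = [[0.0] * int(m["num_samples"]) for m in batch_meta]
        lens = [len(a) for a in audio]
        longest = max(lens)
        # Pad every example to the longest one in this mini-batch.
        padded = [a + [0.0] * (longest - len(a)) for a in audio]
        return {
            "audio": padded,
            "audio_lens": lens,
            "text": [m["text"] for m in batch_meta],
        }


metadata = [
    {"id": "utt1", "num_samples": 4, "text": "yes"},
    {"id": "utt2", "num_samples": 2, "text": "no"},
    {"id": "utt3", "num_samples": 3, "text": "yes no"},
]
dataset = BatchDataset()
batches = [dataset[batch_meta] for batch_meta in sample_batches(metadata, max_items=2)]
print(batches[0]["audio_lens"])
```

Because the sampler already decides the batch composition, the dataset needs no collate step of its own. With PyTorch, this pattern typically corresponds to passing the sampler to `DataLoader` with `batch_size=None`.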

NeMo constructs the dataloader from a dataset configuration using the function [get_lhotse_dataloader_from_config](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/common/data/lhotse/dataloader.py#L103).

### Getting mini LibriSpeech

For the sake of the tutorial, we'll be using very small datasets to keep things quick. The first example is mini LibriSpeech, a 5h train set and 2h dev set collection that was released alongside the popular LibriSpeech dataset. Lhotse provides a download function that fetches and unpacks the archive with data, and a prepare function that creates Lhotse manifests.

In [3]:
libri_root = download_librispeech(root_dir, dataset_parts="mini_librispeech")

libri = prepare_librispeech(
    libri_root, output_dir=root_dir, num_jobs=num_jobs
)

for split in ("train-clean-5", "dev-clean-2"):
    (
        CutSet
        .from_manifests(**libri[split])
        .to_file(f"data/librispeech_cuts_{split}.jsonl.gz")
    )

Downloading LibriSpeech parts:   0%|          | 0/2 [00:00<?, ?it/s]

Dataset parts:   0%|          | 0/2 [00:00<?, ?it/s]

Distributing tasks: 0it [00:00, ?it/s]

Processing:   0%|          | 0/1519 [00:00<?, ?it/s]

Distributing tasks: 0it [00:00, ?it/s]

Processing:   0%|          | 0/1089 [00:00<?, ?it/s]

# Training NeMo models with Lhotse 

For quick illustration, we'll fine-tune a small NeMo model `nvidia/stt_en_conformer_ctc_small` for 100 steps, running a validation every 50 steps. When training from scratch, you'd simply use a different NeMo script, while keeping the same dataloading options (perhaps tuning the batch size settings to fit your model choice).

**Technical note.** We train using the existing NeMo `speech_to_text_finetune.py` script to illustrate real-world usage patterns of NeMo. If you're unfamiliar with running bash commands from a Jupyter notebook: starting a line with an exclamation mark `!` runs a single command, while starting a cell with `%%bash -s {var}` runs the whole cell in bash and passes `var` as argument `$1` (subsequent `-s` args are passed as `$2`, `$3`, and so on).

**Common arguments.** Let's briefly discuss the CLI arguments we provide to the fine-tuning script.
- Most of the options are in `model.train_ds` and `model.validation_ds` namespaces:
  - `use_lhotse=true` enables Lhotse dataloading backend
  - Batch size is dynamic and controlled via `batch_duration`, `use_bucketing`, `num_buckets`, and `quadratic_duration`. [Please refer to **NeMo documentation** for details](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/datasets.html#enabling-lhotse-via-configuration).
  - We set `batch_size=null` explicitly (the base config has some value that would have otherwise limited our dynamic batch size).
  - Although we don't enable dynamic batch size for validation in this example for brevity, it does also work (including with bucketing).
- `trainer.use_distributed_sampler=false` is required by Lhotse (it has its own distributed sampling handling).
- In `trainer` namespace, `max_steps` controls the total number of steps in training; `val_check_interval` is the number of steps between validation runs.
- The `+` prefix appends a new key to the config (it is required when the key is not already present in the YAML config file), while the `++` prefix adds the key or overrides it if it already exists. [See **Hydra override syntax** for details](https://hydra.cc/docs/1.1/advanced/override_grammar/basic/#basic-override-syntax).
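
To see how `batch_duration` leads to a dynamic batch size, consider the following sketch. It greedily fills a batch until a total-duration budget would be exceeded. The helper names are hypothetical, and the quadratic penalty `duration + duration**2 / quadratic_duration` is our assumption about how `quadratic_duration` behaves; please check the NeMo and Lhotse documentation for the authoritative definition.

```python
from typing import Iterable, Iterator, List, Optional


def effective_duration(duration: float, quadratic_duration: Optional[float] = None) -> float:
    """Assumed penalty: long utterances count for more than their raw duration."""
    if quadratic_duration is None:
        return duration
    return duration + duration**2 / quadratic_duration


def batch_by_duration(
    durations: Iterable[float],
    batch_duration: float,
    quadratic_duration: Optional[float] = None,
) -> Iterator[List[float]]:
    """Hypothetical helper: greedily group durations under a total budget."""
    batch: List[float] = []
    total = 0.0
    for duration in durations:
        cost = effective_duration(duration, quadratic_duration)
        if batch and total + cost > batch_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(duration)
        total += cost
    if batch:
        yield batch


# Same budget, very different batch sizes depending on utterance length.
short_batches = list(batch_by_duration([2.0] * 300, batch_duration=300, quadratic_duration=15))
long_batches = list(batch_by_duration([15.0] * 300, batch_duration=300, quadratic_duration=15))
print(len(short_batches[0]), len(long_batches[0]))
```

Under these assumptions, short utterances are packed into large batches and long utterances into small ones, which is why a fixed `batch_size` is set to `null` in the commands of this tutorial.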

**Model inference.** We skip inference in this tutorial for brevity. [Please refer to **NeMo ASR inference documentation** to learn more](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/intro.html).

## I. Training using Lhotse CutSet

Let's start with training from our existing Lhotse manifests for mini LibriSpeech. Highlights for relevant CLI arguments:
- The manifest path is provided via `cuts_path` so the trainer knows to read this as a Lhotse manifest.
- We have to set `manifest_filepath=null` explicitly as older NeMo configs expect some value to be provided there regardless. 
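
The training command in this section also enables bucketing. As a rough illustration of why grouping utterances of similar duration helps, the sketch below compares the share of padding in mini-batches with and without bucketing. It is a simplification with hypothetical helper names: it sorts the whole list up front, which a streaming sampler cannot do, and it uses a fixed batch size to keep the arithmetic easy to follow.

```python
from typing import List


def chunk(items: List[float], size: int) -> List[List[float]]:
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i : i + size] for i in range(0, len(items), size)]


def padding_ratio(batches: List[List[float]]) -> float:
    """Fraction of each padded batch that is padding rather than real audio."""
    padded = sum(max(batch) * len(batch) for batch in batches)
    real = sum(sum(batch) for batch in batches)
    return 1.0 - real / padded


def bucketed_batches(durations: List[float], num_buckets: int, batch_size: int) -> List[List[float]]:
    """Hypothetical helper: group by duration first, then batch within each bucket."""
    ordered = sorted(durations)
    bucket_len = -(-len(ordered) // num_buckets)  # ceiling division
    buckets = chunk(ordered, bucket_len)
    return [batch for bucket in buckets for batch in chunk(bucket, batch_size)]


durations = [1.0, 15.0, 2.0, 14.0, 3.0, 13.0, 4.0, 12.0]
plain = chunk(durations, 2)
bucketed = bucketed_batches(durations, num_buckets=2, batch_size=2)
print(round(padding_ratio(plain), 3), round(padding_ratio(bucketed), 3))
```

With this toy data, mixing short and long utterances wastes a large share of each batch on padding, while bucketing keeps the padding small.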

In [4]:
%%bash -s {nemo_root}
python $1/examples/asr/speech_to_text_finetune.py \
    +init_from_pretrained_model="nvidia/stt_en_conformer_ctc_small" \
    name="finetune_from_cutset" \
    +model.train_ds.use_lhotse=true \
    model.train_ds.manifest_filepath=null \
    +model.train_ds.cuts_path=data/librispeech_cuts_train-clean-5.jsonl.gz \
    model.train_ds.batch_size=null \
    +model.train_ds.batch_duration=300 \
    +model.train_ds.use_bucketing=true \
    +model.train_ds.num_buckets=30 \
    +model.train_ds.quadratic_duration=15 \
    +model.validation_ds.use_lhotse=true \
    model.validation_ds.manifest_filepath=null \
    +model.validation_ds.cuts_path=data/librispeech_cuts_dev-clean-2.jsonl.gz \
    model.validation_ds.batch_size=64 \
    +trainer.use_distributed_sampler=false \
    trainer.max_steps=100 \
    ++trainer.val_check_interval=50

    
    


[NeMo I 2024-02-26 09:59:11 speech_to_text_finetune:190] Hydra config: name: finetune_from_cutset
    init_from_nemo_model: null
    model:
      sample_rate: 16000
      compute_eval_loss: false
      log_prediction: true
      rnnt_reduction: mean_volume
      skip_nan_grad: false
      train_ds:
        manifest_filepath: null
        sample_rate: ${model.sample_rate}
        batch_size: null
        shuffle: true
        num_workers: 8
        pin_memory: true
        max_duration: 20
        min_duration: 0.1
        is_tarred: false
        tarred_audio_filepaths: null
        shuffle_n: 2048
        bucketing_strategy: fully_randomized
        bucketing_batch_size: null
        use_lhotse: true
        cuts_path: data/librispeech_cuts_train-clean-5.jsonl.gz
        batch_duration: 300
        use_bucketing: true
        num_buckets: 30
        quadratic_duration: 15
      validation_ds:
        manifest_filepath: null
        sample_rate: ${model.sample_rate}
        batch_size:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


[NeMo I 2024-02-26 09:59:11 exp_manager:396] Experiments will be logged at /home/pzelasko/code/canary/nemo_experiments/finetune_from_cutset/2024-02-26_09-59-11
[NeMo I 2024-02-26 09:59:11 exp_manager:856] TensorboardLogger has been set up


[NeMo W 2024-02-26 09:59:11 exp_manager:966] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 100. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.


[NeMo I 2024-02-26 09:59:11 speech_to_text_finetune:99] Sleeping for at least 60 seconds to wait for model download to finish.
[NeMo I 2024-02-26 10:00:11 mixins:172] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2024-02-26 10:00:12 modelPT:165] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /data/NeMo_ASR_SET/English/v2.0/train/tarred_audio_manifest.json
    sample_rate: 16000
    batch_size: 64
    shuffle: true
    num_workers: 8
    pin_memory: true
    use_start_end_token: false
    trim_silence: false
    max_duration: 20.0
    min_duration: 0.1
    shuffle_n: 2048
    is_tarred: true
    tarred_audio_filepaths: /data/NeMo_ASR_SET/English/v2.0/train/audio__OP_0..4095_CL_.tar
    
[NeMo W 2024-02-26 10:00:12 modelPT:172] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath:
    - /data/ASR/LibriSpeech/librispeech_withs

[NeMo I 2024-02-26 10:00:12 features:289] PADDING: 0
[NeMo I 2024-02-26 10:00:12 save_restore_connector:263] Model EncDecCTCModelBPE was successfully restored from /home/pzelasko/.cache/huggingface/hub/models--nvidia--stt_en_conformer_ctc_small/snapshots/f879b51de584983383de815ce87d25469b2abbf3/stt_en_conformer_ctc_small.nemo.
[NeMo I 2024-02-26 10:00:12 speech_to_text_finetune:131] Reusing the vocabulary from the pre-trained model.
We will be using a Lhotse DataLoader.
Creating a Lhotse DynamicBucketingSampler (max_batch_duration=300.0 max_batch_size=None)
We will be using a Lhotse DataLoader.
Creating a Lhotse DynamicCutSampler (bucketing is disabled, (max_batch_duration=None max_batch_size=64)


[NeMo W 2024-02-26 10:00:12 modelPT:612] Trainer wasn't specified in model constructor. Make sure that you really wanted it.


[NeMo I 2024-02-26 10:00:12 modelPT:723] Optimizer config = AdamW (
    Parameter Group 0
        amsgrad: False
        betas: [0.9, 0.98]
        capturable: False
        differentiable: False
        eps: 1e-08
        foreach: None
        fused: None
        lr: 0.0001
        maximize: False
        weight_decay: 0.001
    )


[NeMo W 2024-02-26 10:00:12 lr_scheduler:895] Neither `max_steps` nor `iters_per_batch` were provided to `optim.sched`, cannot compute effective `max_steps` !
    Scheduler will not be instantiated !
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

You are using a CUDA device ('NVIDIA RTX 6000 Ada Generation') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


[NeMo I 2024-02-26 10:00:13 modelPT:723] Optimizer config = AdamW (
    Parameter Group 0
        amsgrad: False
        betas: [0.9, 0.98]
        capturable: False
        differentiable: False
        eps: 1e-08
        foreach: None
        fused: None
        lr: 0.0001
        maximize: False
        weight_decay: 0.001
    )
[NeMo I 2024-02-26 10:00:13 lr_scheduler:915] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7f77c01db850>" 
    will be used during training (effective maximum steps = 100) - 
    Parameters : 
    (warmup_steps: 5000
    warmup_ratio: null
    min_lr: 5.0e-06
    max_steps: 100
    )



  | Name              | Type                              | Params
------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0     
1 | encoder           | ConformerEncoder                  | 13.0 M
2 | decoder           | ConvASRDecoder                    | 181 K 
3 | loss              | CTCLoss                           | 0     
4 | spec_augmentation | SpectrogramAugmentation           | 0     
5 | wer               | WER                               | 0     
6 | spec_augment      | SpectrogramAugmentation           | 0     
------------------------------------------------------------------------
13.2 M    Trainable params
0         Non-trainable params
13.2 M    Total params
52.616    Total estimated model params size (MB)


Epoch 0: : 9it [00:02,  3.58it/s, v_num=9-11, train_step_timing in s=0.151][NeMo I 2024-02-26 10:00:17 wer:318] 
    
[NeMo I 2024-02-26 10:00:17 wer:319] reference:his line rang suddenly jack she cried you got a bite he pulled missed the strike and wound in the minnow was all right so he tossed it back again that isn't your name he said
[NeMo I 2024-02-26 10:00:17 wer:320] predicted:ng suddenly youtll the strike andound thener so hes again isn't said
Epoch 0: : 19it [00:03,  4.78it/s, v_num=9-11, train_step_timing in s=0.144][NeMo I 2024-02-26 10:00:19 wer:318] 
    
[NeMo I 2024-02-26 10:00:19 wer:319] reference:the giant's heavy eyes lifted quickly but he spoke to the girl you go on home
[NeMo I 2024-02-26 10:00:19 wer:320] predicted:the giant's heavy eyes lifted quickly but he spoke to the girl you go on home
Epoch 0: : 29it [00:05,  5.43it/s, v_num=9-11, train_step_timing in s=0.113][NeMo I 2024-02-26 10:00:20 wer:318] 
    
[NeMo I 2024-02-26 10:00:20 wer:319] reference:in your w

    
    
    


Epoch 0: : 50it [00:15,  3.22it/s, v_num=9-11, train_step_timing in s=0.118]


Epoch 0, global step 50: 'val_wer' reached 0.53526 (best 0.53526), saving model to '/home/pzelasko/code/canary/nemo_experiments/finetune_from_cutset/2024-02-26_09-59-11/checkpoints/finetune_from_cutset--val_wer=0.5353-epoch=0.ckpt' as top 5


[NeMo I 2024-02-26 10:00:31 nemo_model_checkpoint:223] New .nemo model saved to: /home/pzelasko/code/canary/nemo_experiments/finetune_from_cutset/2024-02-26_09-59-11/checkpoints/finetune_from_cutset.nemo
[NeMo I 2024-02-26 10:00:32 nemo_model_checkpoint:223] New .nemo model saved to: /home/pzelasko/code/canary/nemo_experiments/finetune_from_cutset/2024-02-26_09-59-11/checkpoints/finetune_from_cutset.nemo
Epoch 0: : 59it [00:18,  3.16it/s, v_num=9-11, train_step_timing in s=0.0971][NeMo I 2024-02-26 10:00:33 wer:318] 
    
[NeMo I 2024-02-26 10:00:33 wer:319] reference:and enjoyed municipal liberty under the suzerainty of the empire justinian displayed in his day of adversity a degree of capacity which astonished his contemporaries he fled from cherson and took refuge with the khan of the khazars
[NeMo I 2024-02-26 10:00:33 wer:320] predicted:and enjoynipalberty under they of thempirepyed in his dayversity agacitytonishedtempories hes

Epoch 0, global step 100: 'val_wer' reached 0.47179 (best 0.47179), saving model to '/home/pzelasko/code/canary/nemo_experiments/finetune_from_cutset/2024-02-26_09-59-11/checkpoints/finetune_from_cutset--val_wer=0.4718-epoch=0.ckpt' as top 5


[NeMo I 2024-02-26 10:00:46 nemo_model_checkpoint:223] New .nemo model saved to: /home/pzelasko/code/canary/nemo_experiments/finetune_from_cutset/2024-02-26_09-59-11/checkpoints/finetune_from_cutset.nemo
[NeMo I 2024-02-26 10:00:47 nemo_model_checkpoint:223] New .nemo model saved to: /home/pzelasko/code/canary/nemo_experiments/finetune_from_cutset/2024-02-26_09-59-11/checkpoints/finetune_from_cutset.nemo


`Trainer.fit` stopped: `max_steps=100` reached.


Epoch 0: : 100it [00:32,  3.05it/s, v_num=9-11, train_step_timing in s=0.152]


In [5]:
!ls nemo_experiments/finetune_from_cutset/*/checkpoints/

 finetune_from_cutset.nemo
'finetune_from_cutset--val_wer=0.4718-epoch=0.ckpt'
'finetune_from_cutset--val_wer=0.4718-epoch=0-last.ckpt'
'finetune_from_cutset--val_wer=0.5353-epoch=0.ckpt'


## II. Training using NeMo manifest

Training from NeMo manifest format is done in a similar way. Highlights:
- We'll convert Lhotse CutSet manifests to NeMo manifests for this exercise. If you have existing data in NeMo format, just use it as-is.
- We can now use the `manifest_filepath` argument directly.

In [6]:
for split in ("train-clean-5", "dev-clean-2"):
    lhotse.serialization.save_to_jsonl(
        (
            {
                "audio_filepath": cut.recording.sources[0].source,
                "duration": cut.duration,
                "text": cut.supervisions[0].text,
                "lang": cut.supervisions[0].language,
                "sampling_rate": cut.sampling_rate,
            }
            for cut in CutSet.from_file(f"data/librispeech_cuts_{split}.jsonl.gz")
        ),
        f"data/nemo_{split}.json",
    )

In [7]:
%%bash -s {nemo_root}
python $1/examples/asr/speech_to_text_finetune.py \
    +init_from_pretrained_model="nvidia/stt_en_conformer_ctc_small" \
    name="finetune_from_nemo_json" \
    +model.train_ds.use_lhotse=true \
    model.train_ds.manifest_filepath=data/nemo_train-clean-5.json \
    model.train_ds.batch_size=null \
    +model.train_ds.batch_duration=300 \
    +model.train_ds.use_bucketing=true \
    +model.train_ds.num_buckets=30 \
    +model.train_ds.quadratic_duration=15 \
    +model.validation_ds.use_lhotse=true \
    model.validation_ds.manifest_filepath=data/nemo_dev-clean-2.json \
    model.validation_ds.batch_size=64 \
    +trainer.use_distributed_sampler=false \
    trainer.max_steps=100 \
    ++trainer.val_check_interval=50

    
    


[NeMo I 2024-02-26 10:00:55 speech_to_text_finetune:190] Hydra config: name: finetune_from_nemo_json
    init_from_nemo_model: null
    model:
      sample_rate: 16000
      compute_eval_loss: false
      log_prediction: true
      rnnt_reduction: mean_volume
      skip_nan_grad: false
      train_ds:
        manifest_filepath: data/nemo_train-clean-5.json
        sample_rate: ${model.sample_rate}
        batch_size: null
        shuffle: true
        num_workers: 8
        pin_memory: true
        max_duration: 20
        min_duration: 0.1
        is_tarred: false
        tarred_audio_filepaths: null
        shuffle_n: 2048
        bucketing_strategy: fully_randomized
        bucketing_batch_size: null
        use_lhotse: true
        batch_duration: 300
        use_bucketing: true
        num_buckets: 30
        quadratic_duration: 15
      validation_ds:
        manifest_filepath: data/nemo_dev-clean-2.json
        sample_rate: ${model.sample_rate}
        batch_size: 64
        shu

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


[NeMo I 2024-02-26 10:00:55 exp_manager:396] Experiments will be logged at /home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_json/2024-02-26_10-00-55
[NeMo I 2024-02-26 10:00:55 exp_manager:856] TensorboardLogger has been set up


[NeMo W 2024-02-26 10:00:55 exp_manager:966] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 100. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.


[NeMo I 2024-02-26 10:00:55 speech_to_text_finetune:99] Sleeping for at least 60 seconds to wait for model download to finish.
[NeMo I 2024-02-26 10:01:56 mixins:172] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2024-02-26 10:01:56 modelPT:165] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /data/NeMo_ASR_SET/English/v2.0/train/tarred_audio_manifest.json
    sample_rate: 16000
    batch_size: 64
    shuffle: true
    num_workers: 8
    pin_memory: true
    use_start_end_token: false
    trim_silence: false
    max_duration: 20.0
    min_duration: 0.1
    shuffle_n: 2048
    is_tarred: true
    tarred_audio_filepaths: /data/NeMo_ASR_SET/English/v2.0/train/audio__OP_0..4095_CL_.tar
    
[NeMo W 2024-02-26 10:01:56 modelPT:172] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath:
    - /data/ASR/LibriSpeech/librispeech_withs

[NeMo I 2024-02-26 10:01:56 features:289] PADDING: 0
[NeMo I 2024-02-26 10:01:56 save_restore_connector:263] Model EncDecCTCModelBPE was successfully restored from /home/pzelasko/.cache/huggingface/hub/models--nvidia--stt_en_conformer_ctc_small/snapshots/f879b51de584983383de815ce87d25469b2abbf3/stt_en_conformer_ctc_small.nemo.
[NeMo I 2024-02-26 10:01:56 speech_to_text_finetune:131] Reusing the vocabulary from the pre-trained model.
We will be using a Lhotse DataLoader.
Initializing Lhotse CutSet from a single NeMo manifest (tarred): 'data/nemo_train-clean-5.json'
Creating a Lhotse DynamicBucketingSampler (max_batch_duration=300.0 max_batch_size=None)
We will be using a Lhotse DataLoader.
Initializing Lhotse CutSet from a single NeMo manifest (tarred): 'data/nemo_dev-clean-2.json'
Creating a Lhotse DynamicCutSampler (bucketing is disabled, (max_batch_duration=None max_batch_size=64)


[NeMo W 2024-02-26 10:01:57 modelPT:612] Trainer wasn't specified in model constructor. Make sure that you really wanted it.


[NeMo I 2024-02-26 10:01:57 modelPT:723] Optimizer config = AdamW (
    Parameter Group 0
        amsgrad: False
        betas: [0.9, 0.98]
        capturable: False
        differentiable: False
        eps: 1e-08
        foreach: None
        fused: None
        lr: 0.0001
        maximize: False
        weight_decay: 0.001
    )


[NeMo W 2024-02-26 10:01:57 lr_scheduler:895] Neither `max_steps` nor `iters_per_batch` were provided to `optim.sched`, cannot compute effective `max_steps` !
    Scheduler will not be instantiated !
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

You are using a CUDA device ('NVIDIA RTX 6000 Ada Generation') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


[NeMo I 2024-02-26 10:01:58 modelPT:723] Optimizer config = AdamW (
    Parameter Group 0
        amsgrad: False
        betas: [0.9, 0.98]
        capturable: False
        differentiable: False
        eps: 1e-08
        foreach: None
        fused: None
        lr: 0.0001
        maximize: False
        weight_decay: 0.001
    )
[NeMo I 2024-02-26 10:01:58 lr_scheduler:915] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7f49e013a890>" 
    will be used during training (effective maximum steps = 100) - 
    Parameters : 
    (warmup_steps: 5000
    warmup_ratio: null
    min_lr: 5.0e-06
    max_steps: 100
    )



  | Name              | Type                              | Params
------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0     
1 | encoder           | ConformerEncoder                  | 13.0 M
2 | decoder           | ConvASRDecoder                    | 181 K 
3 | loss              | CTCLoss                           | 0     
4 | spec_augmentation | SpectrogramAugmentation           | 0     
5 | wer               | WER                               | 0     
6 | spec_augment      | SpectrogramAugmentation           | 0     
------------------------------------------------------------------------
13.2 M    Trainable params
0         Non-trainable params
13.2 M    Total params
52.616    Total estimated model params size (MB)


Epoch 0: : 9it [00:02,  3.52it/s, v_num=0-55, train_step_timing in s=0.163][NeMo I 2024-02-26 10:02:01 wer:318] 
    
[NeMo I 2024-02-26 10:02:01 wer:319] reference:his line rang suddenly jack she cried you got a bite he pulled missed the strike and wound in the minnow was all right so he tossed it back again that isn't your name he said
[NeMo I 2024-02-26 10:02:01 wer:320] predicted:his line rang suddenlyck sheried herake and in the mi was he back again that he
Epoch 0: : 19it [00:03,  4.82it/s, v_num=0-55, train_step_timing in s=0.140][NeMo I 2024-02-26 10:02:03 wer:318] 
    
[NeMo I 2024-02-26 10:02:03 wer:319] reference:the giant's heavy eyes lifted quickly but he spoke to the girl you go on home
[NeMo I 2024-02-26 10:02:03 wer:320] predicted:the giant's heavy eyes lifted quickly but he spoke to the girl you go on home
Epoch 0: : 29it [00:05,  5.41it/s, v_num=0-55, train_step_timing in s=0.158][NeMo I 2024-02-26 10:02:04 wer:318] 
    
[NeMo I 2024-02-26 10:02:04 wer:319] referenc

    
    
    


Epoch 0: : 50it [00:15,  3.21it/s, v_num=0-55, train_step_timing in s=0.109]


Epoch 0, global step 50: 'val_wer' reached 0.53069 (best 0.53069), saving model to '/home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_json/2024-02-26_10-00-55/checkpoints/finetune_from_nemo_json--val_wer=0.5307-epoch=0.ckpt' as top 5


[NeMo I 2024-02-26 10:02:15 nemo_model_checkpoint:223] New .nemo model saved to: /home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_json/2024-02-26_10-00-55/checkpoints/finetune_from_nemo_json.nemo
[NeMo I 2024-02-26 10:02:16 nemo_model_checkpoint:223] New .nemo model saved to: /home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_json/2024-02-26_10-00-55/checkpoints/finetune_from_nemo_json.nemo
Epoch 0: : 59it [00:18,  3.16it/s, v_num=0-55, train_step_timing in s=0.147] [NeMo I 2024-02-26 10:02:18 wer:318] 
    
[NeMo I 2024-02-26 10:02:18 wer:319] reference:and enjoyed municipal liberty under the suzerainty of the empire justinian displayed in his day of adversity a degree of capacity which astonished his contemporaries he fled from cherson and took refuge with the khan of the khazars
[NeMo I 2024-02-26 10:02:18 wer:320] predicted:ed the wholeberty under the in of thempire just in his day ofdversity a degree of

Epoch 0, global step 100: 'val_wer' reached 0.46767 (best 0.46767), saving model to '/home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_json/2024-02-26_10-00-55/checkpoints/finetune_from_nemo_json--val_wer=0.4677-epoch=0.ckpt' as top 5


[NeMo I 2024-02-26 10:02:30 nemo_model_checkpoint:223] New .nemo model saved to: /home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_json/2024-02-26_10-00-55/checkpoints/finetune_from_nemo_json.nemo
[NeMo I 2024-02-26 10:02:31 nemo_model_checkpoint:223] New .nemo model saved to: /home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_json/2024-02-26_10-00-55/checkpoints/finetune_from_nemo_json.nemo


`Trainer.fit` stopped: `max_steps=100` reached.


Epoch 0: : 100it [00:32,  3.04it/s, v_num=0-55, train_step_timing in s=0.128]


In [8]:
!ls nemo_experiments/finetune_from_nemo_json/*/checkpoints/

 finetune_from_nemo_json.nemo
'finetune_from_nemo_json--val_wer=0.4677-epoch=0.ckpt'
'finetune_from_nemo_json--val_wer=0.4677-epoch=0-last.ckpt'
'finetune_from_nemo_json--val_wer=0.5307-epoch=0.ckpt'


## III. Training using NeMo tarred manifest

Tarred manifest format is useful when dealing with I/O bottlenecks. Typical scenarios are compute grids with shared resources, magnetic disks, cloud storage such as AWS S3, GCS, Azure Blob Storage, etc. 

We'll cover the NeMo native tarred format below. If you're interested in the Lhotse Shar tarred format, it also works in NeMo -- [please visit the **relevant Lhotse Shar tutorial** for details](https://colab.research.google.com/github/lhotse-speech/lhotse/blob/master/examples/04-lhotse-shar.ipynb).

Highlights:
- First, let's convert data to NeMo tarred manifests using dedicated NeMo script `convert_to_tarred_audio_dataset.py`. We'll split the data into 32 shards here.
- For training, we'll use `manifest_filepath` for sharded JSONL manifests, and `tarred_audio_filepaths` for sharded audio tar files. Note the NeMo-specific syntax of `_OP_0..31_CL_` which tells NeMo to expand it into a list of 32 items.
- The only other modification is the added `trainer.limit_train_batches` option, which tells the trainer the size of a pseudo-epoch. In previous examples we could have measured training length in epochs, but chose not to. With tarred datasets, however, we discard the notion of an epoch completely -- the data iterator is infinite, so we count steps for everything instead. This design prevents hangs caused by uneven dataloader lengths in distributed training.
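
To clarify what the `_OP_0..31_CL_` syntax stands for, here is a small sketch of the expansion in plain Python. The `expand_sharded_path` helper is hypothetical and is not NeMo's own implementation; it only mirrors the behavior described above, where the range is inclusive on both ends.

```python
import re
from typing import List

# Matches the range marker, e.g. "_OP_0..31_CL_", and captures both bounds.
_RANGE = re.compile(r"_OP_(\d+)\.\.(\d+)_CL_")


def expand_sharded_path(pattern: str) -> List[str]:
    """Hypothetical helper: expand one range marker into a list of paths."""
    match = _RANGE.search(pattern)
    if match is None:
        return [pattern]  # nothing to expand
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[: match.start()], pattern[match.end() :]
    return [f"{prefix}{index}{suffix}" for index in range(start, end + 1)]


shards = expand_sharded_path("audio__OP_0..31_CL_.tar")
print(len(shards), shards[0], shards[-1])
```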

In [9]:
%%bash -s {nemo_root}
python $1/scripts/speech_recognition/convert_to_tarred_audio_dataset.py \
    --manifest_path data/nemo_train-clean-5.json \
    --num_shards 32 \
    --workers 4 \
    --max_duration 30

The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path=None, config_name='index_config')
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.


Creating new tarred dataset ...
After filtering, manifest has 1519 files which amounts to 19118.674812500023 seconds of audio.
Number of samples added : 1519
Remainder: 15
Shard 0 has entries 0 ~ 47
Shard 0 contains 47 files
Shard 1 has entries 47 ~ 94
Shard 1 contains 47 files
Shard 2 has entries 94 ~ 141
Shard 2 contains 47 files
Shard 3 has entries 141 ~ 188
Shard 3 contains 47 files
Shard 4 has entries 188 ~ 235
Shard 4 contains 47 files
Shard 5 has entries 235 ~ 282
Shard 5 contains 47 files
Shard 6 has entries 282 ~ 329
Shard 6 contains 47 files
Shard 7 has entries 329 ~ 376
Shard 7 contains 47 files
Shard 8 has entries 376 ~ 423
Shard 8 contains 47 files
Shard 9 has entries 423 ~ 470
Shard 9 contains 47 files
Shard 10 has entries 470 ~ 517
Shard 10 contains 47 files
Shard 11 has entries 517 ~ 564
Shard 11 contains 47 files
Shard 12 has entries 564 ~ 611
Shard 12 contains 47 files
Shard 13 has entries 611 ~ 658
Shard 13 contains 47 files
Shard 14 has entries 658 ~ 705
Shard 14 co

[Parallel(n_jobs=4)]: Done   1 tasks      | elapsed:    0.2s
[Parallel(n_jobs=4)]: Done   2 tasks      | elapsed:    0.2s
[Parallel(n_jobs=4)]: Done   3 tasks      | elapsed:    0.2s
[Parallel(n_jobs=4)]: Done   4 tasks      | elapsed:    0.2s
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done   6 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done   7 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done   8 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done   9 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Batch computation too fast (0.17278412062133794s.) Setting batch_size=2.
[Parallel(n_jobs=4)]: Done  11 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done  12 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done  13 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done  14 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done  15 tasks      | elapsed

Total number of entries in manifest : 1504
Note: we estimated the optimal bucketing duration bins for 30 buckets. You can enable dynamic bucketing by setting the following options in your training script:
  use_lhotse=true
  use_bucketing=true
  num_buckets=30
  bucket_duration_bins=[6.435,8.76,10.155,11.185,12.005,12.435,12.71,13.085,13.31,13.56,13.785,13.95,14.13,14.285,14.47,14.6,14.715,14.855,14.985,15.135,15.24,15.355,15.445,15.53,15.66,15.79,15.92,16.1299375,16.43]
  batch_duration=<tune-this-value>
If you'd like to use a different number of buckets, re-estimate this option manually using scripts/speech_recognition/estimate_duration_bins.py


In [10]:
%%bash -s {nemo_root}
python $1/examples/asr/speech_to_text_finetune.py \
    +init_from_pretrained_model="nvidia/stt_en_conformer_ctc_small" \
    name="finetune_from_nemo_tarred" \
    +model.train_ds.use_lhotse=true \
    model.train_ds.manifest_filepath=tarred/sharded_manifests/manifest__OP_0..31_CL_.json \
    model.train_ds.tarred_audio_filepaths=tarred/audio__OP_0..31_CL_.tar \
    model.train_ds.batch_size=null \
    +model.train_ds.batch_duration=300 \
    +model.train_ds.use_bucketing=true \
    +model.train_ds.num_buckets=30 \
    +model.train_ds.quadratic_duration=15 \
    +model.validation_ds.use_lhotse=true \
    model.validation_ds.manifest_filepath=data/nemo_dev-clean-2.json \
    model.validation_ds.batch_size=64 \
    +trainer.use_distributed_sampler=false \
    trainer.max_steps=100 \
    +trainer.limit_train_batches=50 \
    ++trainer.val_check_interval=50

    
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    


[NeMo I 2024-02-26 10:02:43 speech_to_text_finetune:190] Hydra config: name: finetune_from_nemo_tarred
    init_from_nemo_model: null
    model:
      sample_rate: 16000
      compute_eval_loss: false
      log_prediction: true
      rnnt_reduction: mean_volume
      skip_nan_grad: false
      train_ds:
        manifest_filepath: tarred/sharded_manifests/manifest__OP_0..31_CL_.json
        sample_rate: ${model.sample_rate}
        batch_size: null
        shuffle: true
        num_workers: 8
        pin_memory: true
        max_duration: 20
        min_duration: 0.1
        is_tarred: false
        tarred_audio_filepaths: tarred/audio__OP_0..31_CL_.tar
        shuffle_n: 2048
        bucketing_strategy: fully_randomized
        bucketing_batch_size: null
        use_lhotse: true
        batch_duration: 300
        use_bucketing: true
        num_buckets: 30
        quadratic_duration: 15
      validation_ds:
        manifest_filepath: data/nemo_dev-clean-2.json
        sample_rate: ${m

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


[NeMo I 2024-02-26 10:02:43 exp_manager:396] Experiments will be logged at /home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_tarred/2024-02-26_10-02-43
[NeMo I 2024-02-26 10:02:43 exp_manager:856] TensorboardLogger has been set up


[NeMo W 2024-02-26 10:02:43 exp_manager:966] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 100. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.


[NeMo I 2024-02-26 10:02:43 speech_to_text_finetune:99] Sleeping for at least 60 seconds to wait for model download to finish.
[NeMo I 2024-02-26 10:03:44 mixins:172] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2024-02-26 10:03:44 modelPT:165] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /data/NeMo_ASR_SET/English/v2.0/train/tarred_audio_manifest.json
    sample_rate: 16000
    batch_size: 64
    shuffle: true
    num_workers: 8
    pin_memory: true
    use_start_end_token: false
    trim_silence: false
    max_duration: 20.0
    min_duration: 0.1
    shuffle_n: 2048
    is_tarred: true
    tarred_audio_filepaths: /data/NeMo_ASR_SET/English/v2.0/train/audio__OP_0..4095_CL_.tar
    
[NeMo W 2024-02-26 10:03:44 modelPT:172] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath:
    - /data/ASR/LibriSpeech/librispeech_withs

[NeMo I 2024-02-26 10:03:44 features:289] PADDING: 0
[NeMo I 2024-02-26 10:03:45 save_restore_connector:263] Model EncDecCTCModelBPE was successfully restored from /home/pzelasko/.cache/huggingface/hub/models--nvidia--stt_en_conformer_ctc_small/snapshots/f879b51de584983383de815ce87d25469b2abbf3/stt_en_conformer_ctc_small.nemo.
[NeMo I 2024-02-26 10:03:45 speech_to_text_finetune:131] Reusing the vocabulary from the pre-trained model.
We will be using a Lhotse DataLoader.
Initializing Lhotse CutSet from a single NeMo manifest (tarred): 'tarred/sharded_manifests/manifest__OP_0..31_CL_.json'
Creating a Lhotse DynamicBucketingSampler (max_batch_duration=300.0 max_batch_size=None)


[NeMo W 2024-02-26 10:03:45 ctc_models:372] Model Trainer was not set before constructing the dataset, incorrect number of training batches will be used. Please set the trainer and rebuild the dataset.


We will be using a Lhotse DataLoader.
Initializing Lhotse CutSet from a single NeMo manifest (tarred): 'data/nemo_dev-clean-2.json'
Creating a Lhotse DynamicCutSampler (bucketing is disabled, (max_batch_duration=None max_batch_size=64)


[NeMo W 2024-02-26 10:03:46 modelPT:612] Trainer wasn't specified in model constructor. Make sure that you really wanted it.


[NeMo I 2024-02-26 10:03:46 modelPT:723] Optimizer config = AdamW (
    Parameter Group 0
        amsgrad: False
        betas: [0.9, 0.98]
        capturable: False
        differentiable: False
        eps: 1e-08
        foreach: None
        fused: None
        lr: 0.0001
        maximize: False
        weight_decay: 0.001
    )


[NeMo W 2024-02-26 10:03:46 lr_scheduler:895] Neither `max_steps` nor `iters_per_batch` were provided to `optim.sched`, cannot compute effective `max_steps` !
    Scheduler will not be instantiated !
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

You are using a CUDA device ('NVIDIA RTX 6000 Ada Generation') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


[NeMo I 2024-02-26 10:03:46 modelPT:723] Optimizer config = AdamW (
    Parameter Group 0
        amsgrad: False
        betas: [0.9, 0.98]
        capturable: False
        differentiable: False
        eps: 1e-08
        foreach: None
        fused: None
        lr: 0.0001
        maximize: False
        weight_decay: 0.001
    )
[NeMo I 2024-02-26 10:03:46 lr_scheduler:915] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7f995c13c520>" 
    will be used during training (effective maximum steps = 100) - 
    Parameters : 
    (warmup_steps: 5000
    warmup_ratio: null
    min_lr: 5.0e-06
    max_steps: 100
    )



  | Name              | Type                              | Params
------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0     
1 | encoder           | ConformerEncoder                  | 13.0 M
2 | decoder           | ConvASRDecoder                    | 181 K 
3 | loss              | CTCLoss                           | 0     
4 | spec_augmentation | SpectrogramAugmentation           | 0     
5 | wer               | WER                               | 0     
6 | spec_augment      | SpectrogramAugmentation           | 0     
------------------------------------------------------------------------
13.2 M    Trainable params
0         Non-trainable params
13.2 M    Total params
52.616    Total estimated model params size (MB)


Epoch 0:  18%|█▊        | 9/50 [00:02<00:09,  4.26it/s, v_num=2-43, train_step_timing in s=0.118][NeMo I 2024-02-26 10:03:50 wer:318] 
    
[NeMo I 2024-02-26 10:03:50 wer:319] reference:our itching is really the itching for the infinite the immeasurable like the rider on his forward panting horse we let the reins fall before the infinite we modern men we semi barbarians
[NeMo I 2024-02-26 10:03:50 wer:320] predicted:chingching for thefinite themmeasurable theder hiswarding theinsll before the infin barb
Epoch 0:  38%|███▊      | 19/50 [00:03<00:05,  5.22it/s, v_num=2-43, train_step_timing in s=0.227][NeMo I 2024-02-26 10:03:51 wer:318] 
    
[NeMo I 2024-02-26 10:03:51 wer:319] reference:and he saw them in the sand where the first tiny brook tinkled across the path from a gloomy ravine there the little creature had taken a flying leap across it and beyond he could see the prints no more he little guessed that while he halted to let his horse drink
[NeMo I 2024-02-26 10:03:51 wer:320] 

    
    
    


Epoch 0: 100%|██████████| 50/50 [00:15<00:00,  3.31it/s, v_num=2-43, train_step_timing in s=0.135]


Epoch 0, global step 50: 'val_wer' reached 0.24734 (best 0.24734), saving model to '/home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_tarred/2024-02-26_10-02-43/checkpoints/finetune_from_nemo_tarred--val_wer=0.2473-epoch=0.ckpt' as top 5


[NeMo I 2024-02-26 10:04:03 nemo_model_checkpoint:223] New .nemo model saved to: /home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_tarred/2024-02-26_10-02-43/checkpoints/finetune_from_nemo_tarred.nemo
[NeMo I 2024-02-26 10:04:04 nemo_model_checkpoint:223] New .nemo model saved to: /home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_tarred/2024-02-26_10-02-43/checkpoints/finetune_from_nemo_tarred.nemo
Epoch 1:  18%|█▊        | 9/50 [00:01<00:05,  7.66it/s, v_num=2-43, train_step_timing in s=0.102][NeMo I 2024-02-26 10:04:06 wer:318] 
    
[NeMo I 2024-02-26 10:04:06 wer:319] reference:gave me the creeps too makes me surer than ever that he has an abominably deep purpose in using his wits to hang on here he suggests resources as hard to understand as anything that has happened in the old room you'll confess bobby he's had a good deal of influence over you an influence for evil
[NeMo I 2024-02-26 10:04:06 wer:320

Epoch 1, global step 100: 'val_wer' reached 0.44016 (best 0.24734), saving model to '/home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_tarred/2024-02-26_10-02-43/checkpoints/finetune_from_nemo_tarred--val_wer=0.4402-epoch=1.ckpt' as top 5


[NeMo I 2024-02-26 10:04:19 nemo_model_checkpoint:223] New .nemo model saved to: /home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_tarred/2024-02-26_10-02-43/checkpoints/finetune_from_nemo_tarred.nemo
[NeMo I 2024-02-26 10:04:19 nemo_model_checkpoint:223] New .nemo model saved to: /home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_tarred/2024-02-26_10-02-43/checkpoints/finetune_from_nemo_tarred.nemo


`Trainer.fit` stopped: `max_steps=100` reached.


Epoch 1: 100%|██████████| 50/50 [00:14<00:00,  3.36it/s, v_num=2-43, train_step_timing in s=0.131]


In [11]:
!ls nemo_experiments/finetune_from_nemo_tarred/*/checkpoints/

 finetune_from_nemo_tarred.nemo
'finetune_from_nemo_tarred--val_wer=0.2473-epoch=0.ckpt'
'finetune_from_nemo_tarred--val_wer=0.4402-epoch=1.ckpt'
'finetune_from_nemo_tarred--val_wer=0.4402-epoch=1-last.ckpt'
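
The `epoch=0` and `epoch=1` checkpoint names follow from the step arithmetic of pseudo-epochs. The sketch below is a simplified model of that bookkeeping, not the trainer's actual code: it assumes one optimizer step per batch and a `val_check_interval` counted in batches within a pseudo-epoch. The function name `validation_points` is hypothetical.

```python
def validation_points(max_steps: int, limit_train_batches: int, val_check_interval: int):
    """Return (pseudo_epoch, global_step) pairs at which validation would run.

    Simplified model: one optimizer step per batch, interval counted per pseudo-epoch.
    """
    points = []
    step = 0
    epoch = 0
    while step < max_steps:
        for batch_idx in range(1, limit_train_batches + 1):
            step += 1
            if batch_idx % val_check_interval == 0:
                points.append((epoch, step))
            if step >= max_steps:
                break  # training stops as soon as max_steps is reached
        epoch += 1
    return points


# Values used in the training command of this section.
schedule = validation_points(max_steps=100, limit_train_batches=50, val_check_interval=50)
print(schedule)
```

With the values from this section, the model gives validation at global steps 50 and 100, in pseudo-epochs 0 and 1, which matches the log lines above.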


## IV. Dynamically mixing multiple NeMo tarred datasets

Let's introduce another feature of NeMo+Lhotse dataloading: mixing multiple datasets together. 

Lhotse supports a special type of data mixing that we call **weighted stochastic multiplexing**. When iterating over multiple independent data sources, at each step we sample one source according to the source weights and take the next item from it. When the weights are not provided, we count the number of elements in each source before starting dataloading and use those counts as "natural" weights (this can be time-consuming, so it's best to precompute the weights and provide them explicitly). This sampling strategy ensures that each mini-batch consists of a roughly constant blend of data from multiple sources throughout the training.
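
The idea can be sketched in a few lines of plain Python. This is an illustration of the sampling scheme under simple assumptions (a source is dropped once it is exhausted, and weights are treated as relative), not Lhotse's implementation; `weighted_mux` is a hypothetical name.

```python
import itertools
import random


def weighted_mux(sources, weights, seed=0):
    """Yield items from several iterables, picking the source of each item at random.

    Weights are relative (they need not sum to one). Exhausted sources are dropped.
    """
    rng = random.Random(seed)
    iterators = [iter(source) for source in sources]
    active_weights = list(weights)
    while iterators:
        idx = rng.choices(range(len(iterators)), weights=active_weights, k=1)[0]
        try:
            yield next(iterators[idx])
        except StopIteration:
            # This source ran out of data: remove it and keep sampling the rest.
            del iterators[idx]
            del active_weights[idx]


# Two infinite toy sources standing in for the two datasets.
librispeech = (f"libri-{i}" for i in itertools.count())
yesno = (f"yesno-{i}" for i in itertools.count())

sample = list(itertools.islice(weighted_mux([librispeech, yesno], [0.8, 0.2]), 10_000))
libri_share = sum(item.startswith("libri") for item in sample) / len(sample)
print(f"share of items from the first source: {libri_share:.3f}")
```

Over a long enough stream, the share of items from the first source should stay close to its normalized weight (0.8 here).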

### Fetch another dataset (yesno)

For these experiments, we download another small dataset called `yesno`. It's a set of recordings in which a single person says a sequence of "yes" and "no" words in Hebrew, with transcriptions in English (so it's technically a speech translation task). We convert the data to the NeMo tarred format, as before.

In [12]:
yesno_root = download_yesno(root_dir)

yesno = prepare_yesno(yesno_root, output_dir=root_dir)

for split in ("train", "test"):
    cuts = CutSet.from_manifests(**yesno[split])
    cuts.to_file(f"data/yesno_cuts_{split}.jsonl.gz")
    lhotse.serialization.save_to_jsonl(
        (
            {
                "audio_filepath": cut.recording.sources[0].source,
                "duration": cut.duration,
                "text": cut.supervisions[0].text,
                "lang": cut.supervisions[0].language,
                "sampling_rate": cut.sampling_rate,
            }
            for cut in cuts
        ),
        f"data/yesno_{split}.json",
    )
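
Before converting a manifest to the tarred format, it can help to sanity-check it. The sketch below reads a JSONL manifest with the fields written above (`audio_filepath`, `duration`, `text`) and reports the number of entries and total duration. The helper `summarize_manifest` and the demo file are hypothetical and use only the standard library.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FIELDS = {"audio_filepath", "duration", "text"}


def summarize_manifest(path):
    """Return (number of entries, total duration in seconds) of a JSONL manifest."""
    num_entries = 0
    total_duration = 0.0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            missing = REQUIRED_FIELDS - entry.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
            num_entries += 1
            total_duration += entry["duration"]
    return num_entries, total_duration


# Demo on a tiny made-up manifest; point it at data/yesno_train.json in practice.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_manifest.json"
    demo_entries = [
        {"audio_filepath": "a.wav", "duration": 6.0, "text": "yes no"},
        {"audio_filepath": "b.wav", "duration": 5.5, "text": "no yes"},
    ]
    demo_path.write_text("\n".join(json.dumps(e) for e in demo_entries) + "\n", encoding="utf-8")
    num_entries, total_duration = summarize_manifest(demo_path)

print(num_entries, total_duration)
```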

In [13]:
%%bash -s {nemo_root}
python $1/scripts/speech_recognition/convert_to_tarred_audio_dataset.py \
    --manifest_path data/yesno_train.json \
    --num_shards 1 \
    --workers 1 \
    --max_duration 30 \
    --target_dir yesno_tarred

The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path=None, config_name='index_config')
[NeMo W 2024-02-26 10:04:26 manifest:225] Manifest file `yesno_tarred/tarred_audio_manifest.json` seems to be part of a tarred dataset, skip checking for relative paths. If this is not intended, please avoid having `/sharded_manifests/` and `tarred_audio_manifest.json` in manifest_filepath.


Creating new tarred dataset ...
After filtering, manifest has 30 files which amounts to 181.39000000000001 seconds of audio.
Number of samples added : 30
Remainder: 0
Shard 0 has entries 0 ~ 30
Shard 0 contains 30 files
Have 0 entries left over that will be discarded.
Total number of entries in manifest : 30
Note: we estimated the optimal bucketing duration bins for 30 buckets. You can enable dynamic bucketing by setting the following options in your training script:
  use_lhotse=true
  use_bucketing=true
  num_buckets=30
  bucket_duration_bins=[5.58,5.71,5.87,5.92,5.98,6.02,6.05,6.08,6.11,6.16,6.18,6.18,6.22,6.22,6.23,6.24,6.32,6.35,6.4,6.58,6.6,6.74]
  batch_duration=<tune-this-value>
If you'd like to use a different number of buckets, re-estimate this option manually using scripts/speech_recognition/estimate_duration_bins.py


### [optional] Maximum efficiency: precomputing bucket duration bins

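We can use the NeMo script `estimate_duration_bins.py` to precompute the optimal bucketing settings for our data blend. This step is optional, so the cell below is left commented out; the estimated bins can be passed to the training script through the `bucket_duration_bins` option.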

In [14]:
# %%bash -s {nemo_root}
# python $1/scripts/speech_recognition/estimate_duration_bins.py \
#     --buckets 30 \
#     '[[tarred/sharded_manifests/manifest__OP_0..31_CL_.json,0.8],[yesno_tarred/tarred_audio_manifest.json,0.2]]'
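
To give an intuition for what duration bins are, the sketch below picks bucket boundaries so that each bucket holds roughly the same number of utterances. This is a simplified equal-count quantile scheme assumed for illustration; the NeMo script may balance buckets differently. Like the script output shown earlier, it returns one boundary fewer than the number of buckets. The function name is hypothetical.

```python
def equal_count_duration_bins(durations, num_buckets):
    """Return num_buckets - 1 duration boundaries that split the data into
    buckets with roughly equal numbers of utterances (simplified illustration)."""
    ordered = sorted(durations)
    if num_buckets < 2:
        raise ValueError("need at least two buckets")
    if len(ordered) < num_buckets:
        raise ValueError("need at least one utterance per bucket")
    bins = []
    for k in range(1, num_buckets):
        # Index of the first utterance that belongs to bucket k.
        idx = k * len(ordered) // num_buckets
        bins.append(ordered[idx])
    return bins


toy_durations = [float(d) for d in range(1, 11)]  # 1.0 .. 10.0 seconds
toy_bins = equal_count_duration_bins(toy_durations, num_buckets=5)
print(toy_bins)
```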

### Training with NeMo tarred manifests from two separate datasets

The training command is similar to the examples above. Highlights:
- Note that `manifest_filepath` and `tarred_audio_filepaths` now use a special list syntax for providing multiple data sources. Both lists must name the datasets in the same order.
- The list in `manifest_filepath` additionally specifies the weight for each dataset. The weights can be greater than one, and will be automatically re-normalized later. 
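
Since both lists must agree on dataset order, one option is to generate the two override strings from a single list of sources. The sketch below does that and also shows how relative weights normalize. The helpers `build_train_overrides` and `normalize` are hypothetical conveniences, not part of NeMo.

```python
def build_train_overrides(sources):
    """Build the two Hydra overrides from (manifest, tar, weight) triples,
    so that both lists are guaranteed to share the same dataset order."""
    manifests = ",".join(f"[{manifest},{weight}]" for manifest, _, weight in sources)
    tars = ",".join(f"[{tar}]" for _, tar, _ in sources)
    return (
        f"model.train_ds.manifest_filepath=[{manifests}]",
        f"model.train_ds.tarred_audio_filepaths=[{tars}]",
    )


def normalize(weights):
    """Rescale relative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


sources = [
    ("tarred/sharded_manifests/manifest__OP_0..31_CL_.json", "tarred/audio__OP_0..31_CL_.tar", 0.8),
    ("yesno_tarred/tarred_audio_manifest.json", "yesno_tarred/audio_0.tar", 0.2),
]
manifest_override, tar_override = build_train_overrides(sources)
print(manifest_override)
print(tar_override)
print(normalize([4, 1]))  # relative weights 4:1 express the same blend as 0.8:0.2
```

The two printed overrides can be pasted into the training command as they are.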

In [15]:
%%bash -s {nemo_root}
python $1/examples/asr/speech_to_text_finetune.py \
    +init_from_pretrained_model="nvidia/stt_en_conformer_ctc_small" \
    name="finetune_from_nemo_multidataset" \
    +model.train_ds.use_lhotse=true \
    model.train_ds.manifest_filepath=[[tarred/sharded_manifests/manifest__OP_0..31_CL_.json,0.8],[yesno_tarred/tarred_audio_manifest.json,0.2]] \
    model.train_ds.tarred_audio_filepaths=[[tarred/audio__OP_0..31_CL_.tar],[yesno_tarred/audio_0.tar]] \
    model.train_ds.batch_size=null \
    +model.train_ds.batch_duration=300 \
    +model.train_ds.use_bucketing=true \
    +model.train_ds.num_buckets=30 \
    +model.train_ds.quadratic_duration=15 \
    +model.validation_ds.use_lhotse=true \
    model.validation_ds.manifest_filepath=[data/nemo_dev-clean-2.json,data/yesno_test.json] \
    model.validation_ds.batch_size=64 \
    +trainer.use_distributed_sampler=false \
    trainer.max_steps=100 \
    +trainer.limit_train_batches=50 \
    ++trainer.val_check_interval=50

    
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    


[NeMo I 2024-02-26 10:04:31 speech_to_text_finetune:190] Hydra config: name: finetune_from_nemo_multidataset
    init_from_nemo_model: null
    model:
      sample_rate: 16000
      compute_eval_loss: false
      log_prediction: true
      rnnt_reduction: mean_volume
      skip_nan_grad: false
      train_ds:
        manifest_filepath:
        - - tarred/sharded_manifests/manifest__OP_0..31_CL_.json
          - 0.8
        - - yesno_tarred/tarred_audio_manifest.json
          - 0.2
        sample_rate: ${model.sample_rate}
        batch_size: null
        shuffle: true
        num_workers: 8
        pin_memory: true
        max_duration: 20
        min_duration: 0.1
        is_tarred: false
        tarred_audio_filepaths:
        - - tarred/audio__OP_0..31_CL_.tar
        - - yesno_tarred/audio_0.tar
        shuffle_n: 2048
        bucketing_strategy: fully_randomized
        bucketing_batch_size: null
        use_lhotse: true
        batch_duration: 300
        use_bucketing: true
   

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


[NeMo I 2024-02-26 10:04:31 exp_manager:396] Experiments will be logged at /home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_multidataset/2024-02-26_10-04-31
[NeMo I 2024-02-26 10:04:31 exp_manager:856] TensorboardLogger has been set up


[NeMo W 2024-02-26 10:04:31 exp_manager:966] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 100. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.


[NeMo I 2024-02-26 10:04:31 speech_to_text_finetune:99] Sleeping for at least 60 seconds to wait for model download to finish.
[NeMo I 2024-02-26 10:05:32 mixins:172] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2024-02-26 10:05:32 modelPT:165] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /data/NeMo_ASR_SET/English/v2.0/train/tarred_audio_manifest.json
    sample_rate: 16000
    batch_size: 64
    shuffle: true
    num_workers: 8
    pin_memory: true
    use_start_end_token: false
    trim_silence: false
    max_duration: 20.0
    min_duration: 0.1
    shuffle_n: 2048
    is_tarred: true
    tarred_audio_filepaths: /data/NeMo_ASR_SET/English/v2.0/train/audio__OP_0..4095_CL_.tar
    
[NeMo W 2024-02-26 10:05:32 modelPT:172] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath:
    - /data/ASR/LibriSpeech/librispeech_withs

[NeMo I 2024-02-26 10:05:32 features:289] PADDING: 0
[NeMo I 2024-02-26 10:05:33 save_restore_connector:263] Model EncDecCTCModelBPE was successfully restored from /home/pzelasko/.cache/huggingface/hub/models--nvidia--stt_en_conformer_ctc_small/snapshots/f879b51de584983383de815ce87d25469b2abbf3/stt_en_conformer_ctc_small.nemo.
[NeMo I 2024-02-26 10:05:33 speech_to_text_finetune:131] Reusing the vocabulary from the pre-trained model.
We will be using a Lhotse DataLoader.
Initializing Lhotse CutSet from multiple tarred NeMo manifest sources with a weighted multiplexer. We found the following sources and weights: 
- manifest_path='tarred/sharded_manifests/manifest__OP_0..31_CL_.json' weight=0.8
- manifest_path='yesno_tarred/tarred_audio_manifest.json' weight=0.2
Creating a Lhotse DynamicBucketingSampler (max_batch_duration=300.0 max_batch_size=None)


[NeMo W 2024-02-26 10:05:36 ctc_models:372] Model Trainer was not set before constructing the dataset, incorrect number of training batches will be used. Please set the trainer and rebuild the dataset.


We will be using a Lhotse DataLoader.
Initializing Lhotse CutSet from a single NeMo manifest (tarred): 'data/nemo_dev-clean-2.json'
Creating a Lhotse DynamicCutSampler (bucketing is disabled, (max_batch_duration=None max_batch_size=64)
We will be using a Lhotse DataLoader.
Initializing Lhotse CutSet from a single NeMo manifest (tarred): 'data/yesno_test.json'
Creating a Lhotse DynamicCutSampler (bucketing is disabled, (max_batch_duration=None max_batch_size=64)


[NeMo W 2024-02-26 10:05:36 modelPT:612] Trainer wasn't specified in model constructor. Make sure that you really wanted it.


[NeMo I 2024-02-26 10:05:36 modelPT:723] Optimizer config = AdamW (
    Parameter Group 0
        amsgrad: False
        betas: [0.9, 0.98]
        capturable: False
        differentiable: False
        eps: 1e-08
        foreach: None
        fused: None
        lr: 0.0001
        maximize: False
        weight_decay: 0.001
    )


[NeMo W 2024-02-26 10:05:36 lr_scheduler:895] Neither `max_steps` nor `iters_per_batch` were provided to `optim.sched`, cannot compute effective `max_steps` !
    Scheduler will not be instantiated !
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

You are using a CUDA device ('NVIDIA RTX 6000 Ada Generation') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


[NeMo I 2024-02-26 10:05:37 modelPT:723] Optimizer config = AdamW (
    Parameter Group 0
        amsgrad: False
        betas: [0.9, 0.98]
        capturable: False
        differentiable: False
        eps: 1e-08
        foreach: None
        fused: None
        lr: 0.0001
        maximize: False
        weight_decay: 0.001
    )
[NeMo I 2024-02-26 10:05:37 lr_scheduler:915] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7efe042dcd00>" 
    will be used during training (effective maximum steps = 100) - 
    Parameters : 
    (warmup_steps: 5000
    warmup_ratio: null
    min_lr: 5.0e-06
    max_steps: 100
    )



  | Name              | Type                              | Params
------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0     
1 | encoder           | ConformerEncoder                  | 13.0 M
2 | decoder           | ConvASRDecoder                    | 181 K 
3 | loss              | CTCLoss                           | 0     
4 | spec_augmentation | SpectrogramAugmentation           | 0     
5 | wer               | WER                               | 0     
6 | spec_augment      | SpectrogramAugmentation           | 0     
------------------------------------------------------------------------
13.2 M    Trainable params
0         Non-trainable params
13.2 M    Total params
52.616    Total estimated model params size (MB)


Epoch 0:  18%|█▊        | 9/50 [00:10<00:49,  1.20s/it, v_num=4-31, train_step_timing in s=0.126][NeMo I 2024-02-26 10:05:49 wer:318] 
    
[NeMo I 2024-02-26 10:05:49 wer:319] reference:mere chance and yet to jane it seemed so like him to have taken up his position precisely at the right spot on that long platform an enthusiastic lady patient had once said of deryck brand with more accuracy of definition than of grammar
[NeMo I 2024-02-26 10:05:49 wer:320] predicted:tone it so like him take his positioncisely at the right spot on thatatformthusiastic ladytient had ofricknd with moreuracy ofinition than grammar
Epoch 0:  38%|███▊      | 19/50 [00:12<00:19,  1.57it/s, v_num=4-31, train_step_timing in s=0.115][NeMo I 2024-02-26 10:05:51 wer:318] 
    
[NeMo I 2024-02-26 10:05:51 wer:319] reference:look at morgan and rockefeller and all the men that make a pile they know just as much as jeff did about the countries where they make it it stands to reason did i say that jeff shaved in the s

    


Validation DataLoader 0: : 18it [00:07,  2.57it/s]
Validation DataLoader 0: : 0it [00:00, ?it/s]
Validation DataLoader 1: : 0it [00:00, ?it/s][NeMo I 2024-02-26 10:06:02 wer:318] 
    
[NeMo I 2024-02-26 10:06:02 wer:319] reference:no no no yes no no no yes
[NeMo I 2024-02-26 10:06:02 wer:320] predicted:



    
    
    
    
    
    


Epoch 0: 100%|██████████| 50/50 [00:24<00:00,  2.08it/s, v_num=4-31, train_step_timing in s=0.128]


Epoch 0, global step 50: 'val_wer' reached 0.49404 (best 0.49404), saving model to '/home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_multidataset/2024-02-26_10-04-31/checkpoints/finetune_from_nemo_multidataset--val_wer=0.4940-epoch=0.ckpt' as top 5


[NeMo I 2024-02-26 10:06:03 nemo_model_checkpoint:223] New .nemo model saved to: /home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_multidataset/2024-02-26_10-04-31/checkpoints/finetune_from_nemo_multidataset.nemo
[NeMo I 2024-02-26 10:06:04 nemo_model_checkpoint:223] New .nemo model saved to: /home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_multidataset/2024-02-26_10-04-31/checkpoints/finetune_from_nemo_multidataset.nemo
Epoch 1:  18%|█▊        | 9/50 [00:01<00:05,  7.48it/s, v_num=4-31, train_step_timing in s=0.123][NeMo I 2024-02-26 10:06:06 wer:318] 
    
[NeMo I 2024-02-26 10:06:06 wer:319] reference:that the woman had sold as many as two dozen eggs in a day to the summer visitors but what with reading about amalgamated asbestos and consolidated copper and all that the hens began to seem pretty small business and in any case the idea of two dozen eggs at a cent apiece almost makes one blush
[NeMo I 2024-

Epoch 1, global step 100: 'val_wer' reached 0.36061 (best 0.36061), saving model to '/home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_multidataset/2024-02-26_10-04-31/checkpoints/finetune_from_nemo_multidataset--val_wer=0.3606-epoch=1.ckpt' as top 5


[NeMo I 2024-02-26 10:06:18 nemo_model_checkpoint:223] New .nemo model saved to: /home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_multidataset/2024-02-26_10-04-31/checkpoints/finetune_from_nemo_multidataset.nemo
[NeMo I 2024-02-26 10:06:19 nemo_model_checkpoint:223] New .nemo model saved to: /home/pzelasko/code/canary/nemo_experiments/finetune_from_nemo_multidataset/2024-02-26_10-04-31/checkpoints/finetune_from_nemo_multidataset.nemo


`Trainer.fit` stopped: `max_steps=100` reached.


Epoch 1: 100%|██████████| 50/50 [00:14<00:00,  3.35it/s, v_num=4-31, train_step_timing in s=0.129]


In [16]:
!ls nemo_experiments/finetune_from_nemo_multidataset/*/checkpoints/

 finetune_from_nemo_multidataset.nemo
'finetune_from_nemo_multidataset--val_wer=0.3606-epoch=1.ckpt'
'finetune_from_nemo_multidataset--val_wer=0.3606-epoch=1-last.ckpt'
'finetune_from_nemo_multidataset--val_wer=0.4940-epoch=0.ckpt'
