Script for estimating Lhotse dynamic duration buckets #8237

pzelasko · 2024-01-24T20:34:51Z

What does this PR do?

The script provides the user with the optimal, data-driven value of model.train_ds.bucket_duration_bins, so that it does not have to be computed at the start of training.

It required small changes to enable dataset mixing with non-tarred manifests (as we don't want to actually load any data here) and prompted a small refactor of the CutSet initialization code. Another issue was that NeMo manifests do not have sampling_rate info, but Lhotse must have it to create a Recording object -- I introduced a dummy_mode for use cases like this, where we only need to iterate over the metadata.

Collection: All speech collections

Changelog

Script for estimating Lhotse dynamic duration buckets

Usage

Help message

$ python scripts/speech_recognition/estimate_duration_bins.py --help
usage: estimate_duration_bins.py [-h] [-b BUCKETS] [-n NUM_EXAMPLES]
                                 [-l MIN_DURATION] [-u MAX_DURATION] [-q QUIET]
                                 input

Estimate duration bins for Lhotse dynamic bucketing using a sample of the input
dataset. The dataset is read either from one or more manifest files and supports
data weighting.

positional arguments:
  input                 Same input format as in model configs under
                        model.train_ds.manifest_filepath. Options: 1)
                        "path.json"; 2) "[path1.json,path2.json,...]"; 3)
                        "[[path1.json,weight1],[path2.json,weight2],...]"

options:
  -h, --help            show this help message and exit
  -b BUCKETS, --buckets BUCKETS
                        The desired number of buckets.
  -n NUM_EXAMPLES, --num_examples NUM_EXAMPLES
                        The number of examples (utterances) to estimate the
                        bins. -1 means use all data.
  -l MIN_DURATION, --min_duration MIN_DURATION
                        If specified, we'll filter out utterances shorter than
                        this.
  -u MAX_DURATION, --max_duration MAX_DURATION
                        If specified, we'll filter out utterances longer than
                        this.
  -q QUIET, --quiet QUIET
                        When specified, only print the estimated duration bins.
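
To make the three accepted input forms concrete, here is a minimal sketch of how such a string could be split into (path, weight) pairs. The helper name parse_manifest_spec is hypothetical and is not part of this PR; NeMo parses this value through its own config machinery, and the sketch assumes unquoted paths that contain no commas or brackets.

```python
import re
from typing import List, Optional, Tuple


def parse_manifest_spec(spec: str) -> List[Tuple[str, Optional[float]]]:
    """Hypothetical helper: split an input spec into (path, weight) pairs.

    The weight is None when the spec does not provide one.
    """
    spec = spec.strip()
    if not spec.startswith("["):
        # Form 1: a single path such as "path.json".
        return [(spec, None)]
    inner = spec[1:-1].strip()
    if inner.startswith("["):
        # Form 3: nested pairs such as "[[path1.json,weight1],[path2.json,weight2]]".
        pairs = re.findall(r"\[([^\[\],]+),([^\[\],]+)\]", inner)
        return [(path.strip(), float(weight)) for path, weight in pairs]
    # Form 2: a flat list such as "[path1.json,path2.json]".
    return [(path.strip(), None) for path in inner.split(",") if path.strip()]
```

For example, parse_manifest_spec("[[a.json,0.7],[b.json,0.3]]") yields two pairs that carry the weights 0.7 and 0.3.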

Example manifest and options

$ python scripts/speech_recognition/estimate_duration_bins.py -b 30 -l 2 -u 2.5 -n 10000 manifest.json
Note: we discarded 50168/60168 (83.38%) utterances due to min/max duration filtering.
Use the following options in your config:
        num_buckets=30
        bucket_duration_bins=[2.03,2.05419,2.08,2.11,2.13452,2.16,2.18,2.2,2.22,2.24,2.2531,2.27,2.285,2.3,2.3175,2.33,2.342,2.36,2.37,2.3896,2.4,2.41369,2.4295,2.44,2.45,2.46,2.47,2.48,2.491]
Computing utterance duration distribution...
Cut statistics:
╒═══════════════════════════╤══════════╕
│ Cuts count:               │ 10000    │
├───────────────────────────┼──────────┤
│ Total duration (hh:mm:ss) │ 06:21:17 │
├───────────────────────────┼──────────┤
│ mean                      │ 2.3      │
├───────────────────────────┼──────────┤
│ std                       │ 0.1      │
├───────────────────────────┼──────────┤
│ min                       │ 2.0      │
├───────────────────────────┼──────────┤
│ 25%                       │ 2.2      │
├───────────────────────────┼──────────┤
│ 50%                       │ 2.3      │
├───────────────────────────┼──────────┤
│ 75%                       │ 2.4      │
├───────────────────────────┼──────────┤
│ 99%                       │ 2.5      │
├───────────────────────────┼──────────┤
│ 99.5%                     │ 2.5      │
├───────────────────────────┼──────────┤
│ 99.9%                     │ 2.5      │
├───────────────────────────┼──────────┤
│ max                       │ 2.5      │
├───────────────────────────┼──────────┤
│ Recordings available:     │ 10000    │
├───────────────────────────┼──────────┤
│ Features available:       │ 0        │
├───────────────────────────┼──────────┤
│ Supervisions available:   │ 10000    │
╘═══════════════════════════╧══════════╛
CUT custom fields:
- shard_id (in 10000 cuts)
- text (in 10000 cuts)
- pcstrip_text (in 10000 cuts)
- source_lang (in 10000 cuts)
- target_lang (in 10000 cuts)
- taskname (in 10000 cuts)
- pnc (in 10000 cuts)
- answer (in 10000 cuts)
Speech duration statistics:
╒══════════════════════════════╤══════════╤══════════════════════╕
│ Total speech duration        │ 06:21:17 │ 100.00% of recording │
├──────────────────────────────┼──────────┼──────────────────────┤
│ Total speaking time duration │ 06:21:17 │ 100.00% of recording │
├──────────────────────────────┼──────────┼──────────────────────┤
│ Total silence duration       │ 00:00:01 │ 0.00% of recording   │
╘══════════════════════════════╧══════════╧══════════════════════╛
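
The output above lists 29 boundaries for 30 buckets, i.e. one upper bound between each pair of neighbouring buckets. A minimal sketch of one way to derive such boundaries follows. It assumes the goal is for every bucket to hold roughly the same total audio duration; the PR text does not confirm that this is the exact method the script uses, and the function name and signature are illustrative only.

```python
from typing import Iterable, List, Optional


def estimate_duration_bins(
    durations: Iterable[float],
    num_buckets: int,
    min_duration: Optional[float] = None,
    max_duration: Optional[float] = None,
) -> List[float]:
    """Return up to num_buckets - 1 boundaries with similar total duration per bucket."""
    # Drop utterances outside the requested duration range, then sort.
    kept = sorted(
        d
        for d in durations
        if (min_duration is None or d >= min_duration)
        and (max_duration is None or d <= max_duration)
    )
    if num_buckets < 2 or not kept:
        return []
    per_bucket = sum(kept) / num_buckets  # target total duration per bucket
    bins: List[float] = []
    accumulated = 0.0
    for d in kept:
        # Once the current bucket is over its share, start a new one at this duration.
        if accumulated > per_bucket and len(bins) < num_buckets - 1:
            bins.append(d)
            accumulated = 0.0
        accumulated += d
    return bins
```

With very small samples the function can return fewer than num_buckets - 1 boundaries, which is one reason to estimate the bins on a sample of thousands of utterances, as in the example above.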

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

pzelasko · 2024-01-24T20:45:24Z

jenkins

docs/source/asr/datasets.rst

titu1994 · 2024-01-26T00:31:07Z

nemo/collections/common/data/lhotse/cutset.py

-            logging.info(
-                f"Initializing Lhotse CutSet from a single NeMo manifest (tarred): '{config.manifest_filepath}'"
-            )
+    notar_kwargs = {"dummy_mode": config.dummy_mode}


Add a comment explaining why this is necessary, for future reference.

titu1994 · 2024-01-26T00:33:29Z

nemo/collections/common/data/lhotse/nemo_adapters.py

@@ -39,24 +39,29 @@ class LazyNeMoIterator(ImitatesDict):
    - "text" (overridable with ``text_field`` argument)

    Specially supported keys are:
+    - [recommended] "sampling_rate" allows us to provide a valid Lhotse ``Recording`` object without checking the audio file


Hmm, sampling rate in every manifest row? Won't that be loads of repetition?

Depending on your setup it might, but it doesn't really make a difference if you don't need to pre-load the manifest into memory. Note that it is still an optional key -- just recommended if you want to avoid performing file header reads during iteration.

For some context - this has been my #1 struggle in maintaining the NeMo format adaptation to Lhotse. Lhotse requires the sampling rate to be known for each utterance because many transforms are applied lazily at the metadata level. If you know the duration, sampling rate, number of samples, etc., you can attach transforms at the metadata level and resolve them lazily during the actual audio loading operation.

E.g. the following is valid Lhotse code that would apply a bunch of augmentations: cut = cut.resample(8000).perturb_speed(0.9).mix(noise_cut).reverb_rir().resample(16000) -- no actual operation is performed until you call cut.load_audio(), but you can query the cut for cut.duration, cut.num_samples, and cut.sampling_rate and get reliable info. You can also save cut.to_dict() in a JSON file and it will contain these transforms as well.
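
The idea of metadata-level transforms can be illustrated with a toy model. This is not Lhotse code: LazyCut and its methods are hypothetical stand-ins that only mimic the behaviour described above, where each call updates the metadata and records a pending operation without touching any audio.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class LazyCut:
    """Toy stand-in for a cut: metadata plus a record of pending transforms."""

    duration: float  # seconds
    sampling_rate: int  # Hz
    pending: Tuple[str, ...] = ()

    @property
    def num_samples(self) -> int:
        # Derived purely from metadata, which is why the true sampling rate must be known.
        return round(self.duration * self.sampling_rate)

    def resample(self, sampling_rate: int) -> "LazyCut":
        # Only the metadata changes here; no audio is read or written.
        step = f"resample({sampling_rate})"
        return replace(self, sampling_rate=sampling_rate, pending=self.pending + (step,))

    def perturb_speed(self, factor: float) -> "LazyCut":
        # Playing audio `factor` times faster divides its duration by `factor`.
        step = f"perturb_speed({factor})"
        return replace(self, duration=self.duration / factor, pending=self.pending + (step,))


cut = LazyCut(duration=9.0, sampling_rate=16000)
cut = cut.resample(8000).perturb_speed(0.9).resample(16000)
```

After the chain, cut.duration, cut.sampling_rate and cut.num_samples already describe the audio that would be produced, and cut.pending lists the three operations that a loader would still have to execute.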

Why is it not possible to read the global cfg.sample_rate into Lhotse?

Lhotse has no concept of "dataset-level" anything. Everything is an item in a sequence with its own set of properties.

Edit: I understood your question differently on a second read. Defining a global sampling rate doesn't help us -- Lhotse needs to know both the duration and the actual sampling rate of the original audio so that it can infer the number of audio samples and perform transforms (e.g. resampling).

titu1994 · 2024-01-26T00:33:52Z

nemo/collections/common/data/lhotse/nemo_adapters.py

        self.source = LazyJsonlIterator(path)
        self.text_field = text_field
        self.lang_field = lang_field
+        self.dummy_mode = dummy_mode


Dummy mode is a poor name imo. Anything more reasonable?

The only thing I can think of is missing_sampling_rate_ok: bool = False -- I really wish this option wasn't needed :)
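
For readers following the naming discussion, the decision being named can be sketched roughly as below. This is an assumption about the intended behaviour, not the code from this PR: resolve_sampling_rate, read_header and PLACEHOLDER_SAMPLING_RATE are hypothetical names, and only the "sampling_rate" and "audio_filepath" manifest keys come from NeMo manifests.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical stand-in value; the PR text does not say which value is used.
PLACEHOLDER_SAMPLING_RATE = 16000


def resolve_sampling_rate(
    entry: Dict[str, Any],
    read_header: Callable[[str], int],
    missing_sampling_rate_ok: bool = False,
) -> int:
    """Pick a sampling rate for one manifest entry, opening the file only as a last resort."""
    rate: Optional[int] = entry.get("sampling_rate")
    if rate is not None:
        # Cheapest path: the manifest already carries the value.
        return int(rate)
    if missing_sampling_rate_ok:
        # Metadata-only iteration (e.g. estimating duration bins): no audio is ever loaded.
        return PLACEHOLDER_SAMPLING_RATE
    # Otherwise fall back to reading the audio file header.
    return read_header(entry["audio_filepath"])
```

The placeholder branch is only safe when nothing downstream loads or transforms audio, which is why the flag defaults to False in this sketch.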

scripts/speech_recognition/estimate_duration_bins.py

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

pzelasko · 2024-02-02T16:08:09Z

jenkins

titu1994

Looks great!

titu1994 · 2024-02-02T17:12:14Z

nemo/collections/common/data/lhotse/nemo_adapters.py

@@ -93,6 +102,29 @@ def __len__(self) -> int:
    def __add__(self, other):
        return LazyIteratorChain(self, other)

+    def _create_recording(self, audio_path: str, duration: float, sampling_rate: int | None = None,) -> Recording:
+        if sampling_rate is not None:
+            # TODO(pzelasko): It will only work with single-channel audio in the current shape.


How about integrating channel_selector here?

Is this option documented somewhere? I was looking for something about multi-channel support but couldn't find anything.

Hmm, it's there as a utility function in most transcribe() functions.

I see, NeMo manifests don't have a key with information about the number of channels. That's a bummer. I'll take a closer look at multi-channel NeMo+Lhotse later.

scripts/speech_recognition/estimate_duration_bins.py

pzelasko · 2024-02-12T17:52:25Z

jenkins

pzelasko · 2024-02-13T15:44:30Z

jenkins

* Script for estimating Lhotse dynamic duration buckets
  Signed-off-by: Piotr Żelasko <petezor@gmail.com>
* Improve documentation
  Signed-off-by: Piotr Żelasko <petezor@gmail.com>
* Apply suggestions from code review
  Signed-off-by: Piotr Żelasko <petezor@gmail.com>
---------
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: biscayan <skgudwn34@gmail.com>

Script for estimating Lhotse dynamic duration buckets

557980a

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

pzelasko requested a review from titu1994 January 24, 2024 20:34

github-actions bot added the common label Jan 24, 2024

Improve documentation

7a1d89f

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

github-actions bot added the ASR label Jan 24, 2024

pzelasko requested a review from krishnacpuvvada January 25, 2024 18:46

titu1994 reviewed Jan 26, 2024

pzelasko added 2 commits February 2, 2024 11:03

Apply suggestions from code review

9bba9ac

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

Merge branch 'main' into estimate-duration-bins

f15e8f3

titu1994 approved these changes Feb 2, 2024

pzelasko added 2 commits February 9, 2024 16:46

Merge branch 'main' into estimate-duration-bins

1f54db7

Merge branch 'main' into estimate-duration-bins

f244f74

pzelasko added 2 commits February 13, 2024 08:34

Merge branch 'main' into estimate-duration-bins

f532313

Merge branch 'main' into estimate-duration-bins

3c21259

pzelasko merged commit 03a7e4f into main Feb 13, 2024
15 checks passed

pzelasko deleted the estimate-duration-bins branch February 13, 2024 18:43

zpx01 added a commit to zpx01/NeMo that referenced this pull request Mar 8, 2024

Revert "Script for estimating Lhotse dynamic duration buckets (NVIDIA…

3921a0e

…#8237)" This reverts commit 4995981.
