Script for estimating Lhotse dynamic duration buckets #8237

Merged
pzelasko merged 8 commits into main from estimate-duration-bins
Feb 13, 2024

@pzelasko pzelasko commented Jan 24, 2024

What does this PR do?

The script gives the user the optimal, data-driven value of model.train_ds.bucket_duration_bins ahead of time, so that it does not have to be computed at the start of training.
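
As a rough illustration of what a data-driven estimate can look like, the sketch below picks bin boundaries so that each bucket holds about the same total duration. It is a standalone stand-in, not the script's or Lhotse's actual code; the function name `estimate_bins_equal_duration` and the toy durations are assumptions made for the example.

```python
from typing import List, Sequence


def estimate_bins_equal_duration(durations: Sequence[float], num_buckets: int) -> List[float]:
    """Return up to num_buckets - 1 boundaries splitting the sorted durations
    into buckets of roughly equal total duration (hypothetical helper)."""
    if num_buckets < 2:
        raise ValueError("num_buckets must be at least 2")
    ordered = sorted(durations)
    if not ordered:
        raise ValueError("durations must not be empty")

    # Total duration each bucket should hold on average.
    target = sum(ordered) / num_buckets

    bins: List[float] = []
    accumulated = 0.0
    for dur in ordered:
        # Once the current bucket exceeds its share, start a new one at this duration.
        if accumulated > target and len(bins) < num_buckets - 1:
            bins.append(dur)
            accumulated = 0.0
        accumulated += dur
    return bins


toy_durations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
toy_bins = estimate_bins_equal_duration(toy_durations, num_buckets=3)
print(toy_bins)  # [7.0, 10.0]
```

Because the boundaries follow the cumulative duration rather than the utterance count, long utterances end up in buckets with fewer items.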

It required small changes to enable dataset mixing with non-tarred manifests (we don't want to actually load any data here), which also gave an opportunity for a small refactor of the CutSet initialization code. Another issue was that NeMo manifests do not have sampling_rate info, but Lhotse needs it to create a Recording object. I introduced a dummy_mode for use cases like this one, where we only need to iterate over the metadata.
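
To make the "metadata only" idea concrete, here is a minimal sketch that reads durations from NeMo-style JSON-lines manifest entries without ever opening an audio file. It is not the PR's code: the helper name `iter_manifest_durations` is hypothetical, and the only assumption about the manifest is that each non-empty line is a JSON object with a "duration" key.

```python
import json
from typing import Iterable, Iterator, Optional


def iter_manifest_durations(
    lines: Iterable[str],
    min_duration: Optional[float] = None,
    max_duration: Optional[float] = None,
) -> Iterator[float]:
    """Yield utterance durations from manifest lines, applying optional filters.

    Only the JSON metadata is read; audio files are never touched.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        duration = float(json.loads(line)["duration"])
        if min_duration is not None and duration < min_duration:
            continue
        if max_duration is not None and duration > max_duration:
            continue
        yield duration


example_lines = [
    '{"audio_filepath": "a.wav", "duration": 1.5, "text": "hi"}',
    "",
    '{"audio_filepath": "b.wav", "duration": 2.2, "text": "yo"}',
    '{"audio_filepath": "c.wav", "duration": 3.0, "text": "ok"}',
]
kept = list(iter_manifest_durations(example_lines, min_duration=2.0, max_duration=2.5))
print(kept)  # [2.2]
```

With a real manifest, the same function can be fed an open file handle, since iterating a text file yields its lines.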

Collection: All speech collections

Changelog

  • Script for estimating Lhotse dynamic duration buckets

Usage

Help message

$ python scripts/speech_recognition/estimate_duration_bins.py --help
usage: estimate_duration_bins.py [-h] [-b BUCKETS] [-n NUM_EXAMPLES]
                                 [-l MIN_DURATION] [-u MAX_DURATION] [-q QUIET]
                                 input

Estimate duration bins for Lhotse dynamic bucketing using a sample of the input
dataset. The dataset is read either from one or more manifest files and supports
data weighting.

positional arguments:
  input                 Same input format as in model configs under
                        model.train_ds.manifest_filepath. Options: 1)
                        "path.json"; 2) "[path1.json,path2.json,...]"; 3)
                        "[[path1.json,weight1],[path2.json,weight2],...]"

options:
  -h, --help            show this help message and exit
  -b BUCKETS, --buckets BUCKETS
                        The desired number of buckets.
  -n NUM_EXAMPLES, --num_examples NUM_EXAMPLES
                        The number of examples (utterances) to estimate the
                        bins. -1 means use all data.
  -l MIN_DURATION, --min_duration MIN_DURATION
                        If specified, we'll filter out utterances shorter than
                        this.
  -u MAX_DURATION, --max_duration MAX_DURATION
                        If specified, we'll filter out utterances longer than
                        this.
  -q QUIET, --quiet QUIET
                        When specified, only print the estimated duration bins.
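
The three accepted forms of the input argument can be illustrated with a small parser. This is only a sketch of the format described in the help text, not how the script or NeMo actually parses it; `parse_manifest_spec` is a hypothetical name, and the sketch assumes paths contain no commas or square brackets.

```python
from typing import List, Optional, Tuple


def parse_manifest_spec(spec: str) -> List[Tuple[str, Optional[float]]]:
    """Parse a manifest spec into (path, weight) pairs; weight is None when absent."""
    spec = spec.strip()
    if not spec.startswith("["):
        # Form 1: a single path.
        return [(spec, None)]
    if not spec.endswith("]"):
        raise ValueError(f"Unbalanced brackets in: {spec!r}")

    inner = spec[1:-1].strip()
    if inner.startswith("["):
        # Form 3: [[path1,weight1],[path2,weight2],...]
        pairs: List[Tuple[str, Optional[float]]] = []
        for chunk in inner.split("],"):
            path, weight = chunk.strip().strip("[]").rsplit(",", 1)
            pairs.append((path.strip(), float(weight)))
        return pairs

    # Form 2: [path1,path2,...]
    return [(p.strip(), None) for p in inner.split(",") if p.strip()]


print(parse_manifest_spec("path.json"))
print(parse_manifest_spec("[path1.json,path2.json]"))
print(parse_manifest_spec("[[path1.json,0.7],[path2.json,0.3]]"))
```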

Example manifest and options

$ python scripts/speech_recognition/estimate_duration_bins.py -b 30 -l 2 -u 2.5 -n 10000 manifest.json
Note: we discarded 50168/60168 (83.38%) utterances due to min/max duration filtering.
Use the following options in your config:
        num_buckets=30
        bucket_duration_bins=[2.03,2.05419,2.08,2.11,2.13452,2.16,2.18,2.2,2.22,2.24,2.2531,2.27,2.285,2.3,2.3175,2.33,2.342,2.36,2.37,2.3896,2.4,2.41369,2.4295,2.44,2.45,2.46,2.47,2.48,2.491]
Computing utterance duration distribution...
Cut statistics:
╒═══════════════════════════╤══════════╕
│ Cuts count:               │ 10000    │
├───────────────────────────┼──────────┤
│ Total duration (hh:mm:ss) │ 06:21:17 │
├───────────────────────────┼──────────┤
│ mean                      │ 2.3      │
├───────────────────────────┼──────────┤
│ std                       │ 0.1      │
├───────────────────────────┼──────────┤
│ min                       │ 2.0      │
├───────────────────────────┼──────────┤
│ 25%                       │ 2.2      │
├───────────────────────────┼──────────┤
│ 50%                       │ 2.3      │
├───────────────────────────┼──────────┤
│ 75%                       │ 2.4      │
├───────────────────────────┼──────────┤
│ 99%                       │ 2.5      │
├───────────────────────────┼──────────┤
│ 99.5%                     │ 2.5      │
├───────────────────────────┼──────────┤
│ 99.9%                     │ 2.5      │
├───────────────────────────┼──────────┤
│ max                       │ 2.5      │
├───────────────────────────┼──────────┤
│ Recordings available:     │ 10000    │
├───────────────────────────┼──────────┤
│ Features available:       │ 0        │
├───────────────────────────┼──────────┤
│ Supervisions available:   │ 10000    │
╘═══════════════════════════╧══════════╛
CUT custom fields:
- shard_id (in 10000 cuts)
- text (in 10000 cuts)
- pcstrip_text (in 10000 cuts)
- source_lang (in 10000 cuts)
- target_lang (in 10000 cuts)
- taskname (in 10000 cuts)
- pnc (in 10000 cuts)
- answer (in 10000 cuts)
Speech duration statistics:
╒══════════════════════════════╤══════════╤══════════════════════╕
│ Total speech duration        │ 06:21:17 │ 100.00% of recording │
├──────────────────────────────┼──────────┼──────────────────────┤
│ Total speaking time duration │ 06:21:17 │ 100.00% of recording │
├──────────────────────────────┼──────────┼──────────────────────┤
│ Total silence duration       │ 00:00:01 │ 0.00% of recording   │
╘══════════════════════════════╧══════════╧══════════════════════╛
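
For intuition about how the printed bins are used, the sketch below maps a duration to a bucket index with a binary search over the 29 boundaries from the example output, giving 30 buckets. It is an illustration rather than Lhotse's sampler code; in particular, the choice of `bisect_right` (a duration equal to a boundary goes to the higher bucket) is an assumption.

```python
from bisect import bisect_right
from typing import Sequence

# Boundaries copied from the example output above: 29 bins define 30 buckets.
bucket_duration_bins = [
    2.03, 2.05419, 2.08, 2.11, 2.13452, 2.16, 2.18, 2.2, 2.22, 2.24,
    2.2531, 2.27, 2.285, 2.3, 2.3175, 2.33, 2.342, 2.36, 2.37, 2.3896,
    2.4, 2.41369, 2.4295, 2.44, 2.45, 2.46, 2.47, 2.48, 2.491,
]


def bucket_index(duration: float, bins: Sequence[float]) -> int:
    """Return the index of the bucket that a duration falls into."""
    return bisect_right(bins, duration)


print(len(bucket_duration_bins))                 # 29
print(bucket_index(2.0, bucket_duration_bins))   # 0  (shortest bucket)
print(bucket_index(2.31, bucket_duration_bins))  # 14
print(bucket_index(2.5, bucket_duration_bins))   # 29 (longest bucket)
```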

Jenkins CI

To run Jenkins, a NeMo User with write access must comment jenkins on the PR.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
@github-actions github-actions bot added the ASR label Jan 24, 2024
@pzelasko
Collaborator Author

jenkins

logging.info(
    f"Initializing Lhotse CutSet from a single NeMo manifest (tarred): '{config.manifest_filepath}'"
)
notar_kwargs = {"dummy_mode": config.dummy_mode}
Collaborator

Add comment as to why this is necessary for future.

@@ -39,24 +39,29 @@ class LazyNeMoIterator(ImitatesDict):
- "text" (overridable with ``text_field`` argument)

Specially supported keys are:
- [recommended] "sampling_rate" allows us to provide a valid Lhotse ``Recording`` object without checking the audio file
Collaborator

Hmm, sampling rate in every manifest row item? Won't it be loads of repetition?

Collaborator Author

@pzelasko pzelasko Jan 26, 2024

Depending on your setup it might, but it doesn't really make a difference if you don't need to pre-load the manifest into memory. Note that it is still an optional key -- just recommended if you want to avoid performing file header reads during iteration.

For some context: this has been my #1 struggle in maintaining the NeMo format adaptation to Lhotse. Lhotse requires the sampling rate to be known for each utterance because many transforms are applied lazily at the metadata level. If you know the duration, sampling rate, number of samples, etc., you can attach transforms at the metadata level and resolve them lazily during the actual audio loading operation.

E.g. the following is valid Lhotse code that would apply a bunch of augmentations: cut = cut.resample(8000).perturb_speed(0.9).mix(noise_cut).reverb_rir().resample(16000). No actual operation is performed until you call cut.load_audio(), but you can query the cut for cut.duration, cut.num_samples, and cut.sampling_rate and get reliable info. You can also save cut.to_dict() to a JSON file, and it would contain these transforms as well.
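
The lazy, metadata-level behaviour described here can be mimicked in a few lines of plain Python. This is a toy model, not Lhotse's implementation: `ToyCut` and its methods are hypothetical, only two transforms are modelled, and the rounding is simpler than what a real library would use. It shows why the original sampling rate is needed: without it, `num_samples` cannot be derived.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ToyCut:
    """Metadata-only stand-in for an utterance; no audio is ever loaded."""
    duration: float       # seconds
    sampling_rate: int    # Hz
    transforms: Tuple[str, ...] = ()  # recorded, to be applied at load time

    @property
    def num_samples(self) -> int:
        return round(self.duration * self.sampling_rate)

    def resample(self, sampling_rate: int) -> "ToyCut":
        # Resampling changes the rate (and hence num_samples), not the duration.
        return replace(
            self,
            sampling_rate=sampling_rate,
            transforms=self.transforms + (f"resample({sampling_rate})",),
        )

    def perturb_speed(self, factor: float) -> "ToyCut":
        # Playing audio at 0.9x speed makes it longer by a factor of 1 / 0.9.
        return replace(
            self,
            duration=self.duration / factor,
            transforms=self.transforms + (f"perturb_speed({factor})",),
        )


cut = ToyCut(duration=10.0, sampling_rate=16000)
augmented = cut.resample(8000).perturb_speed(0.9).resample(16000)
print(augmented.sampling_rate)  # 16000
print(augmented.num_samples)    # 177778
print(augmented.transforms)
```

The original `cut` is left untouched, and the recorded `transforms` tuple is what a loader would replay when the audio is finally read.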

Collaborator

Why is it not possible to read the global cfg.sample_rate into Lhotse?

Collaborator Author

@pzelasko pzelasko Feb 2, 2024

Lhotse has no concept of "dataset-level" anything. Everything is an item in a sequence with its own set of properties.

edit: On a second read I understood your question differently. Defining a global sampling rate doesn't help us: Lhotse needs to know both the duration and the actual sampling rate of the original audio so that it can infer the number of audio samples and perform transforms (e.g. resampling).
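
A short worked example of that inference, under assumed numbers: the sample count follows from duration times the file's true sampling rate, so substituting a global rate gives the wrong count whenever a file was recorded at a different rate. The helper below is a sketch with a hypothetical name, and its half-up rounding is an assumption; it uses `Decimal` to avoid floating-point surprises.

```python
from decimal import Decimal, ROUND_HALF_UP


def num_samples_from_metadata(duration: float, sampling_rate: int) -> int:
    """Infer the number of audio samples from duration (s) and sampling rate (Hz)."""
    exact = Decimal(str(duration)) * sampling_rate
    return int(exact.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


# A 2.5 s file that was actually recorded at 8 kHz.
true_count = num_samples_from_metadata(2.5, 8000)
# The count we would wrongly infer from a global 16 kHz setting.
assumed_count = num_samples_from_metadata(2.5, 16000)

print(true_count)     # 20000
print(assumed_count)  # 40000
```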

self.source = LazyJsonlIterator(path)
self.text_field = text_field
self.lang_field = lang_field
self.dummy_mode = dummy_mode
Collaborator

Dummy mode is a poor name imo. Anything more reasonable?

Collaborator Author

The only thing I can think of is missing_sampling_rate_ok: bool = False -- I really wish this option wasn't needed :)

@pzelasko
Collaborator Author

pzelasko commented Feb 2, 2024

jenkins

Collaborator

@titu1994 titu1994 left a comment

Looks great!

@@ -93,6 +102,29 @@ def __len__(self) -> int:
    def __add__(self, other):
        return LazyIteratorChain(self, other)

    def _create_recording(self, audio_path: str, duration: float, sampling_rate: int | None = None,) -> Recording:
        if sampling_rate is not None:
            # TODO(pzelasko): It will only work with single-channel audio in the current shape.
Collaborator

How about integrating channel_selector here?

Collaborator Author

Is this option documented somewhere? I was looking for something about multi channel but couldn't find anything.

Collaborator

Hmm it's there as a utility function in most transcribe() functions

Collaborator Author

I see, NeMo manifests don't have a key with information about the number of channels. That's a bummer. I'll take a closer look at multi-channel NeMo+Lhotse later.

@pzelasko
Collaborator Author

jenkins

@pzelasko
Collaborator Author

jenkins

@pzelasko pzelasko merged commit 03a7e4f into main Feb 13, 2024
15 checks passed
@pzelasko pzelasko deleted the estimate-duration-bins branch February 13, 2024 18:43
biscayan pushed a commit to biscayan/NeMo that referenced this pull request Feb 15, 2024
* Script for estimating Lhotse dynamic duration buckets

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Improve documentation

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Apply suggestions from code review

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

---------

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: biscayan <skgudwn34@gmail.com>
ssh-meister pushed a commit to ssh-meister/NeMo that referenced this pull request Feb 15, 2024
Signed-off-by: Sasha Meister <ameister@nvidia.com>
zpx01 pushed a commit to zpx01/NeMo that referenced this pull request Mar 8, 2024
Signed-off-by: Zeeshan Patel <zeeshanp@berkeley.edu>
zpx01 added a commit to zpx01/NeMo that referenced this pull request Mar 8, 2024
pablo-garay pushed a commit that referenced this pull request Mar 19, 2024
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024