Script for estimating Lhotse dynamic duration buckets #8237
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
jenkins
logging.info(
    f"Initializing Lhotse CutSet from a single NeMo manifest (tarred): '{config.manifest_filepath}'"
)
notar_kwargs = {"dummy_mode": config.dummy_mode}
Add comment as to why this is necessary for future.
@@ -39,24 +39,29 @@ class LazyNeMoIterator(ImitatesDict):
- "text" (overridable with ``text_field`` argument)

Specially supported keys are:
- [recommended] "sampling_rate" allows us to provide a valid Lhotse ``Recording`` object without checking the audio file
Hmm, sampling rate in every manifest row item? Won't it be loads of repetition?
Depending on your setup it might, but it doesn't really make a difference if you don't need to pre-load the manifest into memory. Note that it is still an optional key -- just recommended if you want to avoid performing file header reads during iteration.
For some context - this has been my #1 struggle in maintaining the NeMo format adaptation to Lhotse. Lhotse requires the sampling rate to be known for each utterance because many transforms are applied lazily on the metadata level. If you know the duration, sampling rate, number of samples, etc. you can attach transforms on the metadata level and resolve them lazily during the actual audio loading operation.
E.g. the following is valid lhotse code that'd apply a bunch of augmentations: cut = cut.resample(8000).perturb_speed(0.9).mix(noise_cut).reverb_rir().resample(16000) -- no actual operation is performed until you call cut.load_audio(), but you can query the cut about cut.duration, cut.num_samples, cut.sampling_rate, and get reliable info. You can also save the cut.to_dict() in a JSON file and it would also contain these transforms.
Why is it not possible to read the global cfg.sample_rate into Lhotse ?
Lhotse has no concept of "dataset-level" anything. Everything is an item in a sequence with its own set of properties.
edit: I got a different understanding of your question the second time reading it. If we define a global sampling rate it doesn't help us -- lhotse needs to know both duration and the actual sampling rate of the original audio so that it can infer the number of audio samples and perform transforms (e.g. resampling).
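A small arithmetic sketch of why a global target rate is not enough. The helper name `infer_num_samples` is hypothetical and the numbers are made up for illustration; the assumption is simply that the sample count equals duration times the file's native rate.

```python
def infer_num_samples(duration: float, native_rate: int) -> int:
    """Hypothetical helper: sample count implied by a duration and the file's own rate."""
    return round(duration * native_rate)


duration = 2.5  # seconds, as stored in the manifest
actual = infer_num_samples(duration, 8000)     # the file is really 8 kHz
guessed = infer_num_samples(duration, 16000)   # what a global 16 kHz setting would imply
print(actual, guessed)  # the guess is off by a factor of two
```

With the wrong sample count, any resampling step planned on the metadata level would also be planned from the wrong starting rate.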
self.source = LazyJsonlIterator(path)
self.text_field = text_field
self.lang_field = lang_field
self.dummy_mode = dummy_mode
Dummy mode is a poor name imo. Anything more reasonable?
The only thing I can think of is missing_sampling_rate_ok: bool = False -- I really wish this option wasn't needed :)
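For readers following the thread, the three cases under discussion can be sketched roughly as below. This is not the NeMo implementation: `recording_metadata`, `metadata_only`, and `read_sampling_rate` are hypothetical names, and leaving the rate as `None` in metadata-only mode is an assumption made for illustration. Only the manifest keys `audio_filepath`, `duration`, and the optional `sampling_rate` follow the NeMo manifest format.

```python
import json
from typing import Callable, Optional


def recording_metadata(
    entry: dict,
    metadata_only: bool = False,
    read_sampling_rate: Optional[Callable[[str], int]] = None,
) -> dict:
    """Hypothetical helper: resolve recording metadata for one manifest entry."""
    path, duration = entry["audio_filepath"], entry["duration"]
    sampling_rate = entry.get("sampling_rate")
    if sampling_rate is None and not metadata_only:
        # Slow path: the rate has to come from the audio file header.
        if read_sampling_rate is None:
            raise ValueError(f"No sampling_rate for {path} and no header reader given")
        sampling_rate = read_sampling_rate(path)
    if sampling_rate is None:
        # Metadata-only path: nothing will be loaded, so the rate stays unknown.
        return {"path": path, "duration": duration, "sampling_rate": None, "num_samples": None}
    return {
        "path": path,
        "duration": duration,
        "sampling_rate": sampling_rate,
        "num_samples": round(duration * sampling_rate),
    }


manifest = "\n".join(
    [
        json.dumps({"audio_filepath": "a.wav", "duration": 1.5, "sampling_rate": 16000}),
        json.dumps({"audio_filepath": "b.wav", "duration": 2.0}),
    ]
)
entries = [json.loads(line) for line in manifest.splitlines()]
with_rate = recording_metadata(entries[0])
placeholder = recording_metadata(entries[1], metadata_only=True)
# A stub stands in for a real header read here.
from_header = recording_metadata(entries[1], read_sampling_rate=lambda path: 8000)
print(with_rate, placeholder, from_header, sep="\n")
```

Only the first and last results are usable for audio loading; the middle one is good for iterating durations and nothing else.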
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
jenkins
Looks great !
@@ -93,6 +102,29 @@ def __len__(self) -> int:
    def __add__(self, other):
        return LazyIteratorChain(self, other)

    def _create_recording(self, audio_path: str, duration: float, sampling_rate: int | None = None,) -> Recording:
        if sampling_rate is not None:
            # TODO(pzelasko): It will only work with single-channel audio in the current shape.
How about integrating channel_selector here ?
Is this option documented somewhere? I was looking for something about multi channel but couldn't find anything.
Hmm it's there as a utility function in most transcribe() functions
I see, NeMo manifests don't have a key with information about the number of channels. That's a bummer. I'll take a closer look at multi-channel NeMo+Lhotse later.
jenkins
* Script for estimating Lhotse dynamic duration buckets Signed-off-by: Piotr Żelasko <petezor@gmail.com> * Improve documentation Signed-off-by: Piotr Żelasko <petezor@gmail.com> * Apply suggestions from code review Signed-off-by: Piotr Żelasko <petezor@gmail.com> --------- Signed-off-by: Piotr Żelasko <petezor@gmail.com> Signed-off-by: biscayan <skgudwn34@gmail.com>
What does this PR do ?
The script provides the user with the optimal, data-driven value of model.train_ds.bucket_duration_bins to avoid computing it at the start of the training. It requires tiny changes to enable dataset mixing with non-tarred manifests (as we don't want to actually load any data here) and triggered an opportunity for a small refactor of the CutSet initialization code. Another issue was that NeMo manifests do not have sampling_rate info but Lhotse must have it to create a Recording object -- I introduced a dummy_mode for use-cases like this where we only need to iterate the metadata.
Collection: All speech collections
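As a rough illustration of what "data-driven" bucket bins can mean, the sketch below picks duration boundaries so that each bucket holds about the same total audio duration. This is an assumption about the criterion, not the PR's script or Lhotse's own estimator; `estimate_duration_bins` is a hypothetical function and the durations are made up. Note that only durations are needed, which is why metadata-only iteration is sufficient.

```python
from bisect import bisect_left


def estimate_duration_bins(durations, num_buckets):
    """Hypothetical sketch: upper bounds for the first num_buckets - 1 buckets."""
    if num_buckets < 2:
        raise ValueError("num_buckets must be at least 2")
    ordered = sorted(durations)
    total = sum(ordered)
    bins = []
    cumulative = 0.0
    for d in ordered:
        cumulative += d
        # Close the current bucket once it has reached its share of the total.
        # Skewed data can yield fewer than num_buckets - 1 boundaries.
        if len(bins) < num_buckets - 1 and cumulative >= total * (len(bins) + 1) / num_buckets:
            bins.append(d)
    return bins


durations = [2.0] * 6 + [4.0] * 3 + [12.0]  # seconds; many short clips, one long one
bins = estimate_duration_bins(durations, num_buckets=3)

# Assign each duration to a bucket: bucket i holds durations <= bins[i].
totals = [0.0] * 3
for d in durations:
    totals[bisect_left(bins, d)] += d
print(bins, totals)
```

In this toy data the three buckets come out with equal total duration even though they contain six, three, and one utterance respectively.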
Changelog
Usage
Help message
Example manifest and options
Jenkins CI
To run Jenkins, a NeMo User with write access must comment jenkins on the PR.
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information