Integrate NVIDIA DALI 1.4 to NeMo ASR #2567
Conversation
@@ -278,7 +268,7 @@ def __init__(
        self.pad_value = params['pad_value'] if 'pad_value' in params else 0.0

        with self.pipe:
-           audio, transcript = dali.fn.nemo_asr_reader(
+           audio, indices = nemo_asr(
Might be a matter of personal preference, but I think it will read better as `dali.fn.readers.nemo_asr` or `fn.readers.nemo_asr`, so that it's visible that this is a DALI operator.
Agreed. I used this style because I couldn't get PyCharm autocomplete to work otherwise, but it turned out to be an `*args`/`**kwargs` type signature.
    trim: bool = False,
-   shuffle: bool = True,
+   shuffle: bool = False,
This has changed from True to False. Just checking. Was that intentional?
Yes. `shuffle=True` by default can cause test and dev DALI datasets to shuffle unknowingly: most of our configs don't have an explicit shuffle setting for the dev and test set dataloaders, and it is assumed to be false. Some of our setup methods explicitly set it to false, but it's good to show that in the constructor as well.
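The behavior described here can be sketched with a hypothetical config-resolution helper (the names are illustrative, not NeMo's actual code): when a dev/test config omits the `shuffle` key, a `False` default keeps evaluation order deterministic instead of silently shuffling.

```python
# Hypothetical sketch of resolving the shuffle flag from a dataloader config.
# Names are illustrative, not NeMo's actual API.
def resolve_shuffle(config: dict, default: bool = False) -> bool:
    # Most dev/test configs omit 'shuffle'; with a False default they
    # stay deterministic instead of shuffling unknowingly.
    return config.get('shuffle', default)

train_cfg = {'shuffle': True}
dev_cfg = {}  # no explicit shuffle setting

assert resolve_shuffle(train_cfg) is True
assert resolve_shuffle(dev_cfg) is False
```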
NVIDIA DALI pipeline that loads tensors via one or more manifest files, where each line contains a sample descriptor in JSON,
including audio files, transcripts, and durations (in seconds).
Here's an example:
The description seems to be the same as for the previous class. Shouldn't it mention something about BPE?
Good catch!
from nemo.collections.asr.data.audio_to_text_dataset import inject_dataloader_value_from_model_config
from nemo.collections.common import tokenizers
from nemo.utils import logging

try:
    import nvidia.dali as dali
As discussed earlier, we should probably check that DALI is at the version 1.4 or later.
Set up a method to validate the version.
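A minimal sketch of such a version guard, assuming simple dotted version strings like "1.4.0" (the helper name and parsing logic are assumptions, not the actual NeMo utility):

```python
# Hypothetical minimum-version check for the DALI import, assuming
# dotted version strings such as "1.4.0". Not NeMo's actual helper.
def meets_min_version(installed: str, minimum: str = "1.4") -> bool:
    def parse(version: str):
        parts = []
        for piece in version.split("."):
            digits = "".join(ch for ch in piece if ch.isdigit())
            if not digits:
                break  # stop at suffixes like "dev" or "rc1"
            parts.append(int(digits))
        return tuple(parts)
    # Tuple comparison: (1, 3, 9) < (1, 4) < (1, 4, 0)
    return parse(installed) >= parse(minimum)

assert meets_min_version("1.4.0") is True
assert meets_min_version("1.3.9") is False
```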
for orig, shuffled in zip(original_transcripts, shuffled_transcripts):
    if orig == shuffled:
        sample_matches += 1
assert sample_matches < 6  # assume after shuffling <half samples have displaced targets
I guess that this assumption is not guaranteed. An idea: You could add an optional seed
parameter to the DALI pipeline, so that you can make this test deterministic.
Yup, it's not guaranteed, but the odds of it failing are quite low. Seeding would be nice if we could set it after building the pipe, but that seems not to be an option. For now I'll just increase the threshold from 6 to 8, which would be a quite rare event (I think).
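If the pipeline did accept a seed at construction time, the test could be made deterministic along these lines. This is a generic sketch using Python's `random` module, not DALI's internal shuffling:

```python
import random

# Generic sketch: a shuffle driven by a fixed seed is reproducible,
# so a displacement-threshold check cannot flake between runs.
# This illustrates the idea only; DALI's shuffling is not Python's.
def seeded_shuffle(items, seed):
    rng = random.Random(seed)
    out = list(items)
    rng.shuffle(out)
    return out

original = [f"sample_{i}" for i in range(10)]
run_a = seeded_shuffle(original, seed=42)
run_b = seeded_shuffle(original, seed=42)

assert run_a == run_b          # same seed -> identical order
assert sorted(run_a) == sorted(original)  # still a permutation
```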
LGTM generally. Just want to know: have we tested our data layers compared to DALI speed-wise? What are the benefits of DALI, and when should we use it?
@@ -187,6 +187,19 @@ def _setup_dataloader_from_config(self, config: Optional[Dict]):
        augmentor = None

        shuffle = config['shuffle']
        device = 'gpu' if torch.cuda.is_available() else 'cpu'
Is it the right assumption to always use gpu if available for dali?
This is copied from ctc_model.py. It's not exactly correct: you may have a model on the CPU while a GPU is available, but for training this is a good enough assumption.
@@ -70,7 +107,7 @@ def __len__(self):


@experimental
-class AudioToCharDALIDataset(Iterator):
+class _AudioTextDALIDataset(Iterator):
Is there a way to merge _AudioTextDALIDataset and _AudioTextDataset and use _AudioTextDataset for DALI? Or are they that different?
Not possible. They are enormously different, due to the DALI pipeline vs NeMo processing.
Speed tests were done locally, and the speed-up was significant for audio processing, due to DALI's interleaved pipeline processing vs PyTorch's parallel processing (which has no interleaving, buffering, or prefetching). The main benefit of these datasets is faster training on local datasets where the number of files is not too large (they don't yet support tarred datasets, but that's in the plan for DALI). IO is one of the major bottlenecks for training, and DALI helps with that part.
Doc updates will be done once we reach 21.08, since until then DALI 1.4 won't be available inside the container.
for orig, shuffled in zip(original_transcripts, shuffled_transcripts):
    if orig == shuffled:
        sample_matches += 1
assert sample_matches < 8  # assume after shuffling <half samples have displaced targets
I think it is better to check that:
- at least one sample is displaced
- all samples are present
- there are no repetitions
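The three properties suggested above can be checked directly on plain lists. A generic sketch, not the actual test code:

```python
# Generic permutation checks for a shuffled dataset, as suggested in review.
# Illustrative only; not the actual NeMo test.
def check_shuffle(original, shuffled):
    # 1. at least one sample is displaced
    assert any(o != s for o, s in zip(original, shuffled))
    # 2 & 3. all samples present and no repetitions:
    # sorted sequences match exactly (multiset equality)
    assert sorted(original) == sorted(shuffled)

check_shuffle(["a", "b", "c", "d"], ["c", "a", "d", "b"])  # passes
```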
I've added the "at least one sample is displaced" check. The other two seem excessive and would make the test quite tedious.
assert count == 10

sample_matches = 0
for orig, shuffled in zip(original_transcripts, shuffled_transcripts):
The test validates only AudioToCharDALIDataset with shuffling vs no shuffling. How do you know if it returns a valid output?
Agreed; added a sanity test to ensure that the text matches after decoding for the non-shuffle case.
assert count == 10

sample_matches = 0
for orig, shuffled in zip(original_transcripts, shuffled_transcripts):
Similar remarks as in the previous test regarding the results validation.
Addressed.
]

__DALI_MINIMUM_VERSION__ = "1.4"
I would add a comment why this is the minimal DALI version.
Done
@@ -287,16 +312,14 @@ def __init__(
            min_duration=min_duration,
            max_duration=max_duration,
            read_sample_rate=False,
-           read_text=True,
+           read_text=False,
+           read_idxs=True,
            random_shuffle=shuffle,
            shard_id=self.shard_id,
            num_shards=self.num_shards,
            pad_last_batch=False,
I think you can switch it to True now: in 1.4 the bug in `nemo_asr` was fixed, so it should no longer crash. If `pad_last_batch` is set to False, `LastBatchPolicy.DROP` or `LastBatchPolicy.PARTIAL` discards useful samples from the next epoch.
Agreed, set it to True
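The effect described above can be sketched with plain arithmetic: without padding, a drop-style last-batch policy loses the remainder samples. This is a generic illustration of the trade-off, not DALI's implementation.

```python
# Generic illustration of the pad_last_batch trade-off discussed above.
# Not DALI's actual iterator logic.
def samples_seen(num_samples: int, batch_size: int, pad_last_batch: bool) -> int:
    full_batches, remainder = divmod(num_samples, batch_size)
    if pad_last_batch:
        # last batch is padded up to batch_size, so every real sample is seen
        return num_samples
    # without padding, a drop-style policy discards the remainder
    return full_batches * batch_size

assert samples_seen(10, 4, pad_last_batch=True) == 10
assert samples_seen(10, 4, pad_last_batch=False) == 8  # 2 samples lost per epoch
```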
min_duration (float): Determines the minimum allowed duration, in seconds, of the loaded audio files.
bos_id (int): Id of beginning of sequence symbol to append if not None
eos_id (int): Id of end of sequence symbol to append if not None
pad_id (int): Id used to pad the input. Defaults to 0 if not provided.
I don't see `pad_id` among the `__init__` args.
Same for `bos_id` and `eos_id`.
They aren't part of `__init__`; they are embedded in the tokenizer. If the tokenizer was built with these special args, they are used; otherwise they are set to None/0 as required. The docstring just mentions that these arguments are implicitly injected from the tokenizer. Will add that comment.
parser (str, callable): A str for an inbuilt parser, or a callable with signature f(str) -> List[int].
global_rank (int): Worker rank, used for partitioning shards. Defaults to 0.
Suggested change:

parser (str, callable): A str for an inbuilt parser, or a callable with signature f(str) -> List[int].
device_id (int): Index of the GPU to be used (local_rank). Only applicable when device == 'gpu'. Defaults to 0.
global_rank (int): Worker rank, used for partitioning shards. Defaults to 0.
`device_id` docs are missing.
@@ -312,7 +339,7 @@ def __init__(
            # No preprocessing, the output is the audio signal
            audio_len = dali.fn.shapes(dali.fn.reshape(audio, shape=[-1]))
            audio = dali.fn.pad(audio)
-           self.pipe.set_outputs(audio, audio_len, transcript, transcript_len)
+           self.pipe.set_outputs(audio, audio_len, indices)
        else:
            # Additive gaussian noise (dither)
            if self.dither > 0.0:
I think you can replace `normal_distribution` with `random.normal` to get rid of the deprecation warning.
Isn't `audio` the result of a DALI pipeline step? Would Python's random.normal work with the pipeline output like this? For now, I'll keep it as is and patch it in a later PR.
I was talking about L346: `dali.fn.normal_distribution` is replaced by `fn.random.normal` now (we moved it to a dedicated namespace).
* Initial prototype of ASR DALI integration with DALI 1.4
* Update dali support to 1.4
* Fix docs
* Address comments
* Apply suggestions from code review
* Address comments
* Correct module utils

Signed-off-by: smajumdar <titu1994@gmail.com>
Co-authored-by: Janusz Lisiecki <39967756+JanuszL@users.noreply.github.com>
Signed-off-by: Jason <jasoli@nvidia.com>
Changelog