Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate NVIDIA DALI 1.4 to NeMo ASR #2567

Merged
merged 7 commits into from Aug 3, 2021
Merged

Conversation

titu1994
Copy link
Collaborator

@titu1994 titu1994 commented Jul 28, 2021

Changelog

  • Refactor out text processing to NeMo while using DALI for audio processing.
  • Refactor DALI dataset to partial implementation, subclass for Char and Subword models
  • Add unittests for DALI datasets (if installed)
  • Add jenkins test for Citrinet based subword model dataset processing
  • Version check that DALI 1.4 is used
  • Comment out DALI tests for the time being until PT container 21.08 is released and used

@titu1994 titu1994 requested a review from VahidooX July 28, 2021 01:37
@@ -278,7 +268,7 @@ def __init__(
self.pad_value = params['pad_value'] if 'pad_value' in params else 0.0

with self.pipe:
audio, transcript = dali.fn.nemo_asr_reader(
audio, indices = nemo_asr(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be a matter of personal preference, but I think that it will read better if it was dali.fn.readers.nemo_asr or fn.readers.nemo_asr so that it's visible that this is a DALI operator.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I used this style cause I couldnt get pycharm autocomplete to work otherwise, but it turned out to be a *args **kwargs type signature.

trim: bool = False,
shuffle: bool = True,
shuffle: bool = False,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has changed from True to False. Just checking. Was that intentional?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, shuffle=True by default can cause test and dev dali datasets to shuffle unknowingly (most of our configs dont have explicit shuffle setting for dev and test set dataloaders and it is assumed to be false. Some of our setup methods explicitly set it to false, but its good to show that even in the constructor.

Comment on lines 530 to 572
NVIDIA DALI pipeline that loads tensors via one or more manifest files where each line containing a sample descriptor in JSON,
including audio files, transcripts, and durations (in seconds).
Here's an example:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description seems to be the same as for the previous class. Shouldn't it mention something about BPE?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch !

from nemo.collections.asr.data.audio_to_text_dataset import inject_dataloader_value_from_model_config
from nemo.collections.common import tokenizers
from nemo.utils import logging

try:
import nvidia.dali as dali
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed earlier, we should probably check that DALI is at the version 1.4 or later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setup a method to validation version.

for orig, shuffled in zip(original_transcripts, shuffled_transcripts):
if orig == shuffled:
sample_matches += 1
assert sample_matches < 6 # assume after shuffling <half samples have displaced targets
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that this assumption is not guaranteed. An idea: You could add an optional seed parameter to the DALI pipeline, so that you can make this test deterministic.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup its not guarenteed but the odds of it failing are quite low. Seeding would be nice if we could set it after building the pipe but that seems to not be an option. For now ill just increase the threshold from 6 to 8 which would be quite rare event (I think).

Copy link
Collaborator

@VahidooX VahidooX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM generally. Just want to know if we have tested our data layers comparing to DALI speedwise? What are the benefits of DALI and when we should use DALI?

@@ -187,6 +187,19 @@ def _setup_dataloader_from_config(self, config: Optional[Dict]):
augmentor = None

shuffle = config['shuffle']
device = 'gpu' if torch.cuda.is_available() else 'cpu'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the right assumption to always use gpu if available for dali?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is copied from the ctc_model.py. It's not exactly correct - you may have a model on cpu but GPU is available, but for training this is a good enough assumption.

@@ -70,7 +107,7 @@ def __len__(self):


@experimental
class AudioToCharDALIDataset(Iterator):
class _AudioTextDALIDataset(Iterator):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to merge _AudioTextDALIDataset and _AudioTextDataset and use _AudioTextDataset for dali? or they are that much different?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not possible. Enormously different due to Dali pipeline vs Nemo processing

@titu1994
Copy link
Collaborator Author

titu1994 commented Jul 28, 2021

Speed tests were done locally, and the speed up was significant enough for audio processing due to interleaved pipeline processing vs pytorchs parallel processing (but no interleaving or buffering or prefetching).

The main benefit of these datasets are for faster training on local datasets where the number of files are not so much (these datasets don't yet support tarred dataset, but it's in the plan for DALI). IO is one of the major bottlenecks for training and thus DALI helps that part.

@titu1994
Copy link
Collaborator Author

Doc updates will be done once we reach 21.08, since till then DALI 1.4 won't be used inside the container.

for orig, shuffled in zip(original_transcripts, shuffled_transcripts):
if orig == shuffled:
sample_matches += 1
assert sample_matches < 8 # assume after shuffling <half samples have displaced targets
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to check if:

  • at leas one sample is displaced
  • all samples are present
  • there is no repetitions

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ive added the at least one sample is displaced check. The other two seem excessive and make the test quite tedious.

assert count == 10

sample_matches = 0
for orig, shuffled in zip(original_transcripts, shuffled_transcripts):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test validates only AudioToCharDALIDataset with shuffling vs no shuffling. How do you know if it returns a valid output?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, added a sanity test to ensure that text matches after decoding for the non shuffle case.

assert count == 10

sample_matches = 0
for orig, shuffled in zip(original_transcripts, shuffled_transcripts):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar remarks as in the previous test regarding the results validation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adressed.

]

__DALI_MINIMUM_VERSION__ = "1.4"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a comment why this is the minimal DALI version.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -287,16 +312,14 @@ def __init__(
min_duration=min_duration,
max_duration=max_duration,
read_sample_rate=False,
read_text=True,
read_text=False,
read_idxs=True,
random_shuffle=shuffle,
shard_id=self.shard_id,
num_shards=self.num_shards,
pad_last_batch=False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can switch it to true now - in 1.4 the bug in the nemo_asr was fixed so it should no longer crash. If pad_last_batch is set to False LastBatchPolicy.DROP or LastBatchPolicy.PARTIAL discards useful samples from the next epoch.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, set it to True

min_duration (float): Determines the minimum allowed duration, in seconds, of the loaded audio files.
bos_id (int): Id of beginning of sequence symbol to append if not None
eos_id (int): Id of end of sequence symbol to append if not None
pad_id (int): Id used to pad the input. Defaults to 0 if not provided.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see pad_id among __init__ args.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for bos_id and eos_id.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They arent part of init - they are embedded in the tokenizer. If the tokenizer was built with these special args, they are used, otherwise set to None/0 as required. The docstring is just to mention that these arguments are implicitly injected from the tokenizer. Will add that comment.

Comment on lines 504 to 510
parser (str, callable): A str for an inbuilt parser, or a callable with signature f(str) -> List[int].
global_rank (int): Worker rank, used for partitioning shards. Defaults to 0.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
parser (str, callable): A str for an inbuilt parser, or a callable with signature f(str) -> List[int].
global_rank (int): Worker rank, used for partitioning shards. Defaults to 0.
parser (str, callable): A str for an inbuilt parser, or a callable with signature f(str) -> List[int].
device_id (int): Index of the GPU to be used (local_rank). Only applicable when device == 'gpu'. Defaults to 0.
global_rank (int): Worker rank, used for partitioning shards. Defaults to 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

device_id docs are missing

@titu1994 titu1994 force-pushed the integrate_dali branch 3 times, most recently from 85e7748 to 647490a Compare July 30, 2021 20:06
titu1994 and others added 6 commits August 2, 2021 14:36
Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: smajumdar <titu1994@gmail.com>
Co-authored-by: Janusz Lisiecki <39967756+JanuszL@users.noreply.github.com>
Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: smajumdar <titu1994@gmail.com>
@@ -312,7 +339,7 @@ def __init__(
# No preprocessing, the output is the audio signal
audio_len = dali.fn.shapes(dali.fn.reshape(audio, shape=[-1]))
audio = dali.fn.pad(audio)
self.pipe.set_outputs(audio, audio_len, transcript, transcript_len)
self.pipe.set_outputs(audio, audio_len, indices)
else:
# Additive gaussian noise (dither)
if self.dither > 0.0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can replace normal_distribution with random.normal to get rid of the deprecation warning.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isnt audio the result of a a dali pipeline step? Would python's random.normal work with the pipeline output like this? For now, ill keep it as is and patch it in a later PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was talking about L346 - dali.fn.normal_distribution is replaced by fn.random.normal now (we moved it to a dedicated namespace).

@titu1994 titu1994 merged commit d04c7e9 into NVIDIA:main Aug 3, 2021
@titu1994 titu1994 deleted the integrate_dali branch August 3, 2021 18:01
blisc pushed a commit to blisc/NeMo that referenced this pull request Aug 12, 2021
* Initial prototype of ASR DALI integration with DALI 1.4

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update dali support to 1.4

Signed-off-by: smajumdar <titu1994@gmail.com>

* Fix docs

Signed-off-by: smajumdar <titu1994@gmail.com>

* Address comments

Signed-off-by: smajumdar <titu1994@gmail.com>

* Apply suggestions from code review

Co-authored-by: Janusz Lisiecki <39967756+JanuszL@users.noreply.github.com>

* Address comments

Signed-off-by: smajumdar <titu1994@gmail.com>

* Correct module utils

Signed-off-by: smajumdar <titu1994@gmail.com>

Co-authored-by: Janusz Lisiecki <39967756+JanuszL@users.noreply.github.com>
Signed-off-by: Jason <jasoli@nvidia.com>
paarthneekhara pushed a commit to paarthneekhara/NeMo that referenced this pull request Sep 17, 2021
* Initial prototype of ASR DALI integration with DALI 1.4

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update dali support to 1.4

Signed-off-by: smajumdar <titu1994@gmail.com>

* Fix docs

Signed-off-by: smajumdar <titu1994@gmail.com>

* Address comments

Signed-off-by: smajumdar <titu1994@gmail.com>

* Apply suggestions from code review

Co-authored-by: Janusz Lisiecki <39967756+JanuszL@users.noreply.github.com>

* Address comments

Signed-off-by: smajumdar <titu1994@gmail.com>

* Correct module utils

Signed-off-by: smajumdar <titu1994@gmail.com>

Co-authored-by: Janusz Lisiecki <39967756+JanuszL@users.noreply.github.com>
Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>
jfsantos pushed a commit to jfsantos/NeMo that referenced this pull request Nov 19, 2021
* Initial prototype of ASR DALI integration with DALI 1.4

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update dali support to 1.4

Signed-off-by: smajumdar <titu1994@gmail.com>

* Fix docs

Signed-off-by: smajumdar <titu1994@gmail.com>

* Address comments

Signed-off-by: smajumdar <titu1994@gmail.com>

* Apply suggestions from code review

Co-authored-by: Janusz Lisiecki <39967756+JanuszL@users.noreply.github.com>

* Address comments

Signed-off-by: smajumdar <titu1994@gmail.com>

* Correct module utils

Signed-off-by: smajumdar <titu1994@gmail.com>

Co-authored-by: Janusz Lisiecki <39967756+JanuszL@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants