Integrate NVIDIA DALI 1.4 to NeMo ASR #2567

titu1994 · 2021-07-28T01:05:01Z

Changelog

Refactor out text processing to NeMo while using DALI for audio processing.
Refactor DALI dataset to partial implementation, subclass for Char and Subword models
Add unittests for DALI datasets (if installed)
Add jenkins test for Citrinet based subword model dataset processing
Version check that DALI 1.4 is used
Comment out DALI tests for the time being until PT container 21.08 is released and used

jantonguirao · 2021-07-28T15:17:27Z

nemo/collections/asr/data/audio_to_text_dali.py

@@ -278,7 +268,7 @@ def __init__(
            self.pad_value = params['pad_value'] if 'pad_value' in params else 0.0

        with self.pipe:
-            audio, transcript = dali.fn.nemo_asr_reader(
+            audio, indices = nemo_asr(


might be a matter of personal preference, but I think that it will read better if it was dali.fn.readers.nemo_asr or fn.readers.nemo_asr so that it's visible that this is a DALI operator.

Agreed, I used this style cause I couldnt get pycharm autocomplete to work otherwise, but it turned out to be a *args **kwargs type signature.

jantonguirao · 2021-07-28T15:20:47Z

nemo/collections/asr/data/audio_to_text_dali.py

        trim: bool = False,
-        shuffle: bool = True,
+        shuffle: bool = False,


This has changed from True to False. Just checking. Was that intentional?

Yes, shuffle=True by default can cause test and dev dali datasets to shuffle unknowingly (most of our configs dont have explicit shuffle setting for dev and test set dataloaders and it is assumed to be false. Some of our setup methods explicitly set it to false, but its good to show that even in the constructor.

jantonguirao · 2021-07-28T15:22:48Z

nemo/collections/asr/data/audio_to_text_dali.py

+    NVIDIA DALI pipeline that loads tensors via one or more manifest files where each line containing a sample descriptor in JSON,
+    including audio files, transcripts, and durations (in seconds).
+    Here's an example:


The description seems to be the same as for the previous class. Shouldn't it mention something about BPE?

Good catch !

jantonguirao · 2021-07-28T15:25:39Z

tests/collections/asr/test_asr_datasets.py

 from nemo.collections.asr.data.audio_to_text_dataset import inject_dataloader_value_from_model_config
 from nemo.collections.common import tokenizers
 from nemo.utils import logging

+try:
+    import nvidia.dali as dali


As discussed earlier, we should probably check that DALI is at the version 1.4 or later.

Setup a method to validation version.

jantonguirao · 2021-07-28T15:30:25Z

tests/collections/asr/test_asr_datasets.py

+            for orig, shuffled in zip(original_transcripts, shuffled_transcripts):
+                if orig == shuffled:
+                    sample_matches += 1
+            assert sample_matches < 6  # assume after shuffling <half samples have displaced targets


I guess that this assumption is not guaranteed. An idea: You could add an optional seed parameter to the DALI pipeline, so that you can make this test deterministic.

Yup its not guarenteed but the odds of it failing are quite low. Seeding would be nice if we could set it after building the pipe but that seems to not be an option. For now ill just increase the threshold from 6 to 8 which would be quite rare event (I think).

VahidooX

LGTM generally. Just want to know if we have tested our data layers comparing to DALI speedwise? What are the benefits of DALI and when we should use DALI?

VahidooX · 2021-07-28T23:20:35Z

nemo/collections/asr/models/ctc_bpe_models.py

@@ -187,6 +187,19 @@ def _setup_dataloader_from_config(self, config: Optional[Dict]):
            augmentor = None

        shuffle = config['shuffle']
+        device = 'gpu' if torch.cuda.is_available() else 'cpu'


Is it the right assumption to always use gpu if available for dali?

This is copied from the ctc_model.py. It's not exactly correct - you may have a model on cpu but GPU is available, but for training this is a good enough assumption.

VahidooX · 2021-07-28T23:21:46Z

nemo/collections/asr/data/audio_to_text_dali.py

@@ -70,7 +107,7 @@ def __len__(self):


 @experimental
-class AudioToCharDALIDataset(Iterator):
+class _AudioTextDALIDataset(Iterator):


Is there a way to merge _AudioTextDALIDataset and _AudioTextDataset and use _AudioTextDataset for dali? or they are that much different?

Not possible. Enormously different due to Dali pipeline vs Nemo processing

titu1994 · 2021-07-28T23:33:13Z

Speed tests were done locally, and the speed up was significant enough for audio processing due to interleaved pipeline processing vs pytorchs parallel processing (but no interleaving or buffering or prefetching).

The main benefit of these datasets are for faster training on local datasets where the number of files are not so much (these datasets don't yet support tarred dataset, but it's in the plan for DALI). IO is one of the major bottlenecks for training and thus DALI helps that part.

titu1994 · 2021-07-28T23:34:55Z

Doc updates will be done once we reach 21.08, since till then DALI 1.4 won't be used inside the container.

tests/collections/asr/test_asr_datasets.py

JanuszL · 2021-07-30T07:44:20Z

tests/collections/asr/test_asr_datasets.py

+            for orig, shuffled in zip(original_transcripts, shuffled_transcripts):
+                if orig == shuffled:
+                    sample_matches += 1
+            assert sample_matches < 8  # assume after shuffling <half samples have displaced targets


I think it is better to check if:

at leas one sample is displaced

all samples are present

there is no repetitions

Ive added the at least one sample is displaced check. The other two seem excessive and make the test quite tedious.

JanuszL · 2021-07-30T07:45:19Z

tests/collections/asr/test_asr_datasets.py

+            assert count == 10
+
+            sample_matches = 0
+            for orig, shuffled in zip(original_transcripts, shuffled_transcripts):


The test validates only AudioToCharDALIDataset with shuffling vs no shuffling. How do you know if it returns a valid output?

Agreed, added a sanity test to ensure that text matches after decoding for the non shuffle case.

tests/collections/asr/test_asr_datasets.py

JanuszL · 2021-07-30T07:47:31Z

tests/collections/asr/test_asr_datasets.py

+            assert count == 10
+
+            sample_matches = 0
+            for orig, shuffled in zip(original_transcripts, shuffled_transcripts):


Similar remarks as in the previous test regarding the results validation.

JanuszL · 2021-07-30T07:49:08Z

nemo/collections/asr/data/audio_to_text_dali.py

 ]

+__DALI_MINIMUM_VERSION__ = "1.4"


I would add a comment why this is the minimal DALI version.

JanuszL · 2021-07-30T08:05:01Z

nemo/collections/asr/data/audio_to_text_dali.py

@@ -287,16 +312,14 @@ def __init__(
                min_duration=min_duration,
                max_duration=max_duration,
                read_sample_rate=False,
-                read_text=True,
+                read_text=False,
+                read_idxs=True,
                random_shuffle=shuffle,
                shard_id=self.shard_id,
                num_shards=self.num_shards,
                pad_last_batch=False,


I think you can switch it to true now - in 1.4 the bug in the nemo_asr was fixed so it should no longer crash. If pad_last_batch is set to False LastBatchPolicy.DROP or LastBatchPolicy.PARTIAL discards useful samples from the next epoch.

Agreed, set it to True

JanuszL · 2021-07-30T08:17:24Z

nemo/collections/asr/data/audio_to_text_dali.py

+        min_duration (float): Determines the minimum allowed duration, in seconds, of the loaded audio files.
+        bos_id (int): Id of beginning of sequence symbol to append if not None
+        eos_id (int): Id of end of sequence symbol to append if not None
+        pad_id (int): Id used to pad the input. Defaults to 0 if not provided.


I don't see pad_id among __init__ args.

Same for bos_id and eos_id.

They arent part of init - they are embedded in the tokenizer. If the tokenizer was built with these special args, they are used, otherwise set to None/0 as required. The docstring is just to mention that these arguments are implicitly injected from the tokenizer. Will add that comment.

JanuszL · 2021-07-30T08:19:17Z

nemo/collections/asr/data/audio_to_text_dali.py

+        parser (str, callable): A str for an inbuilt parser, or a callable with signature f(str) -> List[int].
+        global_rank (int): Worker rank, used for partitioning shards. Defaults to 0.


Suggested change

parser (str, callable): A str for an inbuilt parser, or a callable with signature f(str) -> List[int].

global_rank (int): Worker rank, used for partitioning shards. Defaults to 0.

parser (str, callable): A str for an inbuilt parser, or a callable with signature f(str) -> List[int].

device_id (int): Index of the GPU to be used (local_rank). Only applicable when device == 'gpu'. Defaults to 0.

global_rank (int): Worker rank, used for partitioning shards. Defaults to 0.

device_id docs are missing

Signed-off-by: smajumdar <titu1994@gmail.com>

Co-authored-by: Janusz Lisiecki <39967756+JanuszL@users.noreply.github.com>

Signed-off-by: smajumdar <titu1994@gmail.com>

JanuszL · 2021-08-03T07:25:24Z

nemo/collections/asr/data/audio_to_text_dali.py

@@ -312,7 +339,7 @@ def __init__(
                # No preprocessing, the output is the audio signal
                audio_len = dali.fn.shapes(dali.fn.reshape(audio, shape=[-1]))
                audio = dali.fn.pad(audio)
-                self.pipe.set_outputs(audio, audio_len, transcript, transcript_len)
+                self.pipe.set_outputs(audio, audio_len, indices)
            else:
                # Additive gaussian noise (dither)
                if self.dither > 0.0:


I think you can replace normal_distribution with random.normal to get rid of the deprecation warning.

Isnt audio the result of a a dali pipeline step? Would python's random.normal work with the pipeline output like this? For now, ill keep it as is and patch it in a later PR.

I was talking about L346 - dali.fn.normal_distribution is replaced by fn.random.normal now (we moved it to a dedicated namespace).

* Initial prototype of ASR DALI integration with DALI 1.4 Signed-off-by: smajumdar <titu1994@gmail.com> * Update dali support to 1.4 Signed-off-by: smajumdar <titu1994@gmail.com> * Fix docs Signed-off-by: smajumdar <titu1994@gmail.com> * Address comments Signed-off-by: smajumdar <titu1994@gmail.com> * Apply suggestions from code review Co-authored-by: Janusz Lisiecki <39967756+JanuszL@users.noreply.github.com> * Address comments Signed-off-by: smajumdar <titu1994@gmail.com> * Correct module utils Signed-off-by: smajumdar <titu1994@gmail.com> Co-authored-by: Janusz Lisiecki <39967756+JanuszL@users.noreply.github.com> Signed-off-by: Jason <jasoli@nvidia.com>

* Initial prototype of ASR DALI integration with DALI 1.4 Signed-off-by: smajumdar <titu1994@gmail.com> * Update dali support to 1.4 Signed-off-by: smajumdar <titu1994@gmail.com> * Fix docs Signed-off-by: smajumdar <titu1994@gmail.com> * Address comments Signed-off-by: smajumdar <titu1994@gmail.com> * Apply suggestions from code review Co-authored-by: Janusz Lisiecki <39967756+JanuszL@users.noreply.github.com> * Address comments Signed-off-by: smajumdar <titu1994@gmail.com> * Correct module utils Signed-off-by: smajumdar <titu1994@gmail.com> Co-authored-by: Janusz Lisiecki <39967756+JanuszL@users.noreply.github.com> Signed-off-by: Paarth Neekhara <paarth.n@gmail.com>

* Initial prototype of ASR DALI integration with DALI 1.4 Signed-off-by: smajumdar <titu1994@gmail.com> * Update dali support to 1.4 Signed-off-by: smajumdar <titu1994@gmail.com> * Fix docs Signed-off-by: smajumdar <titu1994@gmail.com> * Address comments Signed-off-by: smajumdar <titu1994@gmail.com> * Apply suggestions from code review Co-authored-by: Janusz Lisiecki <39967756+JanuszL@users.noreply.github.com> * Address comments Signed-off-by: smajumdar <titu1994@gmail.com> * Correct module utils Signed-off-by: smajumdar <titu1994@gmail.com> Co-authored-by: Janusz Lisiecki <39967756+JanuszL@users.noreply.github.com>

titu1994 requested a review from VahidooX July 28, 2021 01:37

jantonguirao reviewed Jul 28, 2021

View reviewed changes

VahidooX approved these changes Jul 28, 2021

View reviewed changes

JanuszL reviewed Jul 30, 2021

View reviewed changes

tests/collections/asr/test_asr_datasets.py Outdated Show resolved Hide resolved

JanuszL reviewed Jul 30, 2021

View reviewed changes

tests/collections/asr/test_asr_datasets.py Outdated Show resolved Hide resolved

JanuszL reviewed Jul 30, 2021

View reviewed changes

tests/collections/asr/test_asr_datasets.py Outdated Show resolved Hide resolved

JanuszL reviewed Jul 30, 2021

View reviewed changes

tests/collections/asr/test_asr_datasets.py Outdated Show resolved Hide resolved

JanuszL reviewed Jul 30, 2021

View reviewed changes

tests/collections/asr/test_asr_datasets.py Outdated Show resolved Hide resolved

JanuszL reviewed Jul 30, 2021

View reviewed changes

tests/collections/asr/test_asr_datasets.py Outdated Show resolved Hide resolved

JanuszL reviewed Jul 30, 2021

View reviewed changes

tests/collections/asr/test_asr_datasets.py Outdated Show resolved Hide resolved

JanuszL reviewed Jul 30, 2021

View reviewed changes

tests/collections/asr/test_asr_datasets.py Outdated Show resolved Hide resolved

JanuszL reviewed Jul 30, 2021

View reviewed changes

titu1994 force-pushed the integrate_dali branch 3 times, most recently from 85e7748 to 647490a Compare July 30, 2021 20:06

JanuszL approved these changes Jul 30, 2021

View reviewed changes

titu1994 and others added 6 commits August 2, 2021 14:36

Initial prototype of ASR DALI integration with DALI 1.4

ed36627

Signed-off-by: smajumdar <titu1994@gmail.com>

Update dali support to 1.4

0a5df72

Signed-off-by: smajumdar <titu1994@gmail.com>

Fix docs

99ba6b5

Signed-off-by: smajumdar <titu1994@gmail.com>

Address comments

1fc5790

Signed-off-by: smajumdar <titu1994@gmail.com>

Apply suggestions from code review

42a37e8

Co-authored-by: Janusz Lisiecki <39967756+JanuszL@users.noreply.github.com>

Address comments

b5c95a4

Signed-off-by: smajumdar <titu1994@gmail.com>

Correct module utils

3f55abf

Signed-off-by: smajumdar <titu1994@gmail.com>

titu1994 force-pushed the integrate_dali branch from a4f3230 to 3f55abf Compare August 2, 2021 21:38

JanuszL reviewed Aug 3, 2021

View reviewed changes

titu1994 merged commit d04c7e9 into NVIDIA:main Aug 3, 2021

titu1994 deleted the integrate_dali branch August 3, 2021 18:01

		parser (str, callable): A str for an inbuilt parser, or a callable with signature f(str) -> List[int].
		global_rank (int): Worker rank, used for partitioning shards. Defaults to 0.

		]

		__DALI_MINIMUM_VERSION__ = "1.4"

Integrate NVIDIA DALI 1.4 to NeMo ASR #2567

Integrate NVIDIA DALI 1.4 to NeMo ASR #2567

Conversation

titu1994 commented Jul 28, 2021 • edited

Changelog

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

VahidooX left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

titu1994 commented Jul 28, 2021 • edited

titu1994 commented Jul 28, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

titu1994 commented Jul 28, 2021 •

edited

titu1994 commented Jul 28, 2021 •

edited