Support dataloader as input to `audio` for transcription #9201

titu1994 · 2024-05-15T07:04:25Z

What does this PR do ?

Enables passing a pre-constructed DataLoader as input to the model.transcribe() function.
This provides a fast path that skips all internal manifest and tensor handling and leaves it to the user, so that only the model forward pass and later steps are executed.

Collection: [ASR]

Changelog

Allows the user to provide a DataLoader object, which bypasses the internal manifest processing and dataset construction.
Places full trust in user-provided input: a user who provides a dataloader is responsible for formatting its batches so that they match the arguments of the ASR model's forward function.
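The bypass described in the changelog can be pictured as a type check at the point where the input is turned into a dataloader. The following is a minimal sketch, not the code merged in this PR: `resolve_dataloader` and `build_loader` are hypothetical names, and `build_loader` only stands in for the internal manifest and dataset setup.

```python
from torch.utils.data import DataLoader


def resolve_dataloader(audio, build_loader):
    """Return `audio` unchanged if it is already a DataLoader, else build one.

    `build_loader` is a hypothetical stand-in for the internal manifest
    processing and dataset construction.
    """
    if isinstance(audio, DataLoader):
        # Fast path: trust the user-provided loader and skip all preprocessing.
        return audio
    return build_loader(audio)


def build_loader(paths):
    # Stand-in only: a real implementation would write a manifest and
    # construct an audio dataset from the given paths.
    return DataLoader(list(paths), batch_size=1)


user_loader = DataLoader([0, 1, 2], batch_size=1)
same = resolve_dataloader(user_loader, build_loader)
built = resolve_dataloader(["a.wav", "b.wav"], build_loader)
```

In this sketch the user-provided loader is returned as the very same object, which is why its batches have to match the model's forward arguments already.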

Usage

import os
from typing import Dict, List

import numpy as np
import soundfile as sf
import torch
from torch.utils.data import DataLoader, Dataset

from nemo.collections.asr.data.audio_to_text import _speech_collate_fn
from nemo.collections.asr.models import ASRModel

model = ASRModel.from_pretrained("stt_en_conformer_ctc_small")

# Load audio files (test_data_dir must point at the root of the test data)
audio_file = os.path.join(test_data_dir, "asr", "train", "an4", "wav", "an46-mmap-b.wav")
audio, sr = sf.read(audio_file, dtype='float32')

audio_file2 = os.path.join(test_data_dir, "asr", "train", "an4", "wav", "an152-mwhw-b.wav")
audio2, sr = sf.read(audio_file2, dtype='float32')

# Create a dummy dataset to hold the tensor values
class DummyDataset(Dataset):
    def __init__(self, audio_tensors: List[np.ndarray], config: Dict = None):
        self.audio_tensors = audio_tensors
        self.config = config

    def __getitem__(self, index):
        data = self.audio_tensors[index]
        samples = torch.tensor(data)
        # Calculate seq length
        seq_len = torch.tensor(samples.shape[0], dtype=torch.long)

        # Dummy text tokens
        text_tokens = torch.tensor([0], dtype=torch.long)
        text_tokens_len = torch.tensor(1, dtype=torch.long)

        # Return a tuple that the ASR model's forward function can consume
        return (samples, seq_len, text_tokens, text_tokens_len)

    def __len__(self):
        return len(self.audio_tensors)

# Wrap the dataset into a data loader with proper collate function
dataset = DummyDataset([audio, audio2])
collate_fn = lambda x: _speech_collate_fn(x, pad_id=0)
dataloader = DataLoader(dataset, batch_size=2, shuffle=False, num_workers=0, collate_fn=collate_fn)

# DataLoader as input to audio
outputs = model.transcribe(dataloader, batch_size=1)

assert len(outputs) == 2
assert isinstance(outputs[0], str)
assert isinstance(outputs[1], str)
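The usage above relies on a collate function to turn variable-length samples into one batch. As a rough illustration of what such a function has to do, here is a minimal sketch of a padding collate for the same four-field tuples. `pad_collate` is a hypothetical helper written for this example; it is not NeMo's `_speech_collate_fn` and may differ from it in details.

```python
import torch
import torch.nn.functional as F


def pad_collate(batch, pad_id=0):
    """Hypothetical padding collate for (audio, audio_len, tokens, tokens_len) tuples."""
    audio, audio_len, tokens, tokens_len = zip(*batch)
    max_audio = max(int(n) for n in audio_len)
    max_tokens = max(int(n) for n in tokens_len)
    # Right-pad audio with zeros and tokens with pad_id so each field can be stacked.
    audio = torch.stack([F.pad(a, (0, max_audio - a.shape[0])) for a in audio])
    tokens = torch.stack([F.pad(t, (0, max_tokens - t.shape[0]), value=pad_id) for t in tokens])
    return audio, torch.stack(audio_len), tokens, torch.stack(tokens_len)


# Two fake samples of different lengths, shaped like DummyDataset items.
items = [
    (torch.ones(5), torch.tensor(5), torch.tensor([0]), torch.tensor(1)),
    (torch.ones(3), torch.tensor(3), torch.tensor([0]), torch.tensor(1)),
]
audio, audio_len, tokens, tokens_len = pad_collate(items)
```

The true lengths are returned alongside the padded tensors so that the model can ignore the padding.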

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

titu1994 · 2024-05-15T07:05:07Z

Fyi @nithinraok - this should enable fastpath execution for HF too

titu1994 · 2024-05-15T07:06:42Z

@galv @pzelasko actual change is just these 5 lines, everything else is just black formatter and a unittest at the bottom - https://github.com/NVIDIA/NeMo/pull/9201/files#diff-04e10f8fb8f7afddb360c7ea0ff9c831613c754085d2fd06961bb80a9f932a25R372-R376

tests/collections/asr/mixins/test_transcription.py

+        audio2, sr = sf.read(audio_file2, dtype='float32')
+
+        dataset = DummyDataset([audio, audio2])
+        collate_fn = lambda x: _speech_collate_fn(x, pad_id=0)


pzelasko

LGTM.

@titu1994 @nithinraok @galv I'd like to point out that if you're going to use a TensorDataset, make it an IterableDataset that yields collated mini-batches, rather than relying on DataLoader's batch_size and collate_fn to do it for you. This way you'll amortize the overhead of collation as it will happen in the background process. This change will matter for super-high RTFx models.
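A minimal sketch of the suggestion above, under stated assumptions: `PreBatchedAudio` is a hypothetical class, the waveforms are random tensors rather than real audio, and the dummy token fields simply mirror the four-field tuple used in the PR's usage example. The dataset pads and batches inside `__iter__`, and the DataLoader is created with `batch_size=None` so that it does no batching or collation of its own.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset


class PreBatchedAudio(IterableDataset):
    """Hypothetical dataset that yields already-collated mini-batches."""

    def __init__(self, waveforms, batch_size):
        self.waveforms = waveforms  # list of 1-D float tensors
        self.batch_size = batch_size

    def __iter__(self):
        for start in range(0, len(self.waveforms), self.batch_size):
            chunk = self.waveforms[start:start + self.batch_size]
            lengths = torch.tensor([w.shape[0] for w in chunk], dtype=torch.long)
            # Zero-pad every waveform to the longest one in this chunk.
            padded = torch.zeros(len(chunk), int(lengths.max()))
            for row, wav in enumerate(chunk):
                padded[row, : wav.shape[0]] = wav
            # Dummy transcript tokens so the tuple keeps the four-field layout.
            tokens = torch.zeros(len(chunk), 1, dtype=torch.long)
            token_lengths = torch.ones(len(chunk), dtype=torch.long)
            yield padded, lengths, tokens, token_lengths


waveforms = [torch.randn(16000), torch.randn(12000), torch.randn(8000)]
# batch_size=None turns off DataLoader's automatic batching and collation.
loader = DataLoader(PreBatchedAudio(waveforms, batch_size=2), batch_size=None, num_workers=0)
batches = list(loader)
```

With `num_workers=0` the collation still runs in the main process; moving it to a background process requires `num_workers > 0`, and in that case each worker receives its own copy of an IterableDataset, so the data would have to be sharded per worker (for example via `torch.utils.data.get_worker_info()`) to avoid duplicate batches.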

nithinraok

minor comments. For huggingface, we need to depend on Iterable dataset I believe.

nithinraok · 2024-05-15T15:45:43Z

nemo/collections/asr/models/aed_multitask_models.py

@@ -403,7 +403,8 @@ def transcribe(
        """
        Uses greedy decoding to transcribe audio files. Use this method for debugging and prototyping.
        Args:
-            audio: (a list) of paths to audio files. \
+            audio: (a single or list) of paths to audio files or a np.ndarray audio array.
+                Can also be a dataloader object that provides values that can be consumed by the model.


Add torch.utils.data.DataLoader to the audio types in the function signature.

nithinraok · 2024-05-15T15:46:06Z

nemo/collections/asr/models/classification_models.py

@@ -364,7 +365,8 @@ def transcribe(
        Generate class labels for provided audio files. Use this method for debugging and prototyping.

        Args:
-            audio: (a single or list) of paths to audio files or a np.ndarray audio sample. \
+            audio: (a single or list) of paths to audio files or a np.ndarray audio array.


Add torch.utils.data.DataLoader to the audio types in the function signature.

galv · 2024-05-15T20:41:34Z

@nithinraok can you investigate whether this change fixes your performance degradation issues when using transcribe() within the huggingface open asr leaderboard RTFx measurements?

nithinraok · 2024-05-15T21:29:36Z

@galv yeah, I am thinking to use iterable dataset (looks like streaming is the option for huggingface datasets to get iterable dataset) and run the HF evals.

* Support dataloader as input to `audio` for transcription
  Signed-off-by: smajumdar <titu1994@gmail.com>
* Apply isort and black reformatting
  Signed-off-by: titu1994 <titu1994@users.noreply.github.com>
* Support dataloader as input to `audio` for transcription
  Signed-off-by: smajumdar <titu1994@gmail.com>
* Update transcribe signatures
  Signed-off-by: smajumdar <titu1994@gmail.com>
* Apply isort and black reformatting
  Signed-off-by: titu1994 <titu1994@users.noreply.github.com>
---------
Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: titu1994 <titu1994@users.noreply.github.com>
(cherry picked from commit 67401ed)


github-actions bot added the ASR label May 15, 2024

titu1994 requested review from galv and pzelasko May 15, 2024 07:04

titu1994 added the Run CICD label May 15, 2024

github-advanced-security bot found potential problems May 15, 2024

titu1994 added Run CICD and removed Run CICD labels May 15, 2024

pzelasko previously approved these changes May 15, 2024

nithinraok reviewed May 15, 2024

titu1994 and others added 4 commits May 15, 2024 15:26

Support dataloader as input to audio for transcription

e557318

Signed-off-by: smajumdar <titu1994@gmail.com>

Apply isort and black reformatting

0201ec2

Signed-off-by: titu1994 <titu1994@users.noreply.github.com>

Support dataloader as input to audio for transcription

3be450f

Signed-off-by: smajumdar <titu1994@gmail.com>

Update transcribe signatures

403c0ae

Signed-off-by: smajumdar <titu1994@gmail.com>

titu1994 dismissed pzelasko’s stale review via 403c0ae May 15, 2024 22:27

titu1994 force-pushed the transcribe_dataloader branch from 4733f85 to 403c0ae Compare May 15, 2024 22:27

Apply isort and black reformatting

9a2dae5

Signed-off-by: titu1994 <titu1994@users.noreply.github.com>

nithinraok added Run CICD and removed Run CICD labels May 16, 2024

titu1994 merged commit 67401ed into main May 17, 2024
133 checks passed

titu1994 deleted the transcribe_dataloader branch May 17, 2024 16:18
