Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved shard_id parsing in LazyNemoTarredIterator, enables AIS dataloading #9077

Merged
merged 3 commits into from
May 2, 2024

Conversation

pzelasko
Copy link
Collaborator

@pzelasko pzelasko commented May 1, 2024

What does this PR do ?

Improves the path-based shard_id parsing in LazyNemoTarredIterator; removes the assumption of a rigid path structure and instead searches the string with regex pattern that must contain manifest/audio, _(shard_id), and json/tar with no slashes in between. It allows loading data directly from AIStore (pipe:ais get ais://bucket/audio_{0..N}.tar -) and other cloud storage CLIs; it also allows more path patterns such as manifest_0_punctuation.json (CC @krishnacpuvvada).

For example, it's possible to train a NeMo model from AIStore data just by passing the following config options (assuming ais binary is available):

manifest_filepath: 'pipe:ais get ais://my-bucket/sharded_manifests/manifest__OP_0..127_CL_.json -'
tarred_audio_filepaths: 'pipe:ais get ais://my-bucket/audio__OP_0..127_CL_.tar -'

Finally it improves the assertion messages about mismatches in paths/shard_ids/items between JSON and tar inputs to provide more details for debugging multi-dataset setups with wrong input specification.

Collection: ASR, multimodal

Changelog

  • Support AIStore dataloading for ASR/multimodal and allow more general path patterns

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

Jenkins CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

There's no need to comment jenkins on the PR to trigger Jenkins CI.
The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
m = pattern.match(p)
assert m is not None, f"Cannot determine shard_id from manifest path: {p}"
m = json_pattern.search(p)
assert m is not None, f"Cannot determine shard_id from manifest input specified: {p}"
Copy link
Collaborator

@krishnacpuvvada krishnacpuvvada May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we be little more verbose in error msg? i.e. tell why it failed.. ? (for example: input_0.json).
same for tar files below.

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Copy link
Collaborator

@krishnacpuvvada krishnacpuvvada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @pzelasko

@pzelasko pzelasko added Run CICD and removed Run CICD labels May 2, 2024
@pzelasko pzelasko merged commit 8005fb2 into main May 2, 2024
133 checks passed
@pzelasko pzelasko deleted the permissive-shard-id-parsing branch May 2, 2024 19:31
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
…dataloading (NVIDIA#9077)

* More permissive shard_id parsing, enables AIS dataloading

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Fix to shard id discovery

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* More informative assertion errors

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

---------

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants