
[speechm2] Support indexed sharegpt JSONL and webdataset formats#15410

Merged
pzelasko merged 4 commits into main from indexed-sharegpt-data-type-parsers
Feb 26, 2026

Conversation

@pzelasko (Collaborator) commented Feb 17, 2026

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

The first PR to support indexed datasets. It reads a binary index (a sequence of uint64 byte offsets marking the beginning of each sample in a file), generates a random permutation of indices on the fly, and looks up the corresponding sample.

This implementation pretends to be a sequential-IO dataset for compatibility, but it will serve as a building block for a resumable dataloader in the future.
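As a rough illustration of such an on-the-fly permutation, here is a minimal sketch using a Feistel network with cycle-walking (a common technique for format-preserving index shuffling). The class name, constructor parameters, and round-key derivation here are illustrative assumptions, not the PR's actual API:

```python
import random

class FeistelPermutation:
    """Sketch: a seeded bijection over [0, n) computed lazily, without
    materializing the full permutation in memory."""

    def __init__(self, n: int, seed: int = 0, num_rounds: int = 6):
        self.n = n
        bits = max(2, n.bit_length())
        if bits % 2:
            bits += 1  # balanced Feistel needs an even bit width
        self._half = bits // 2
        self._mask = (1 << self._half) - 1
        self._keys = [random.Random(seed + r).getrandbits(32) for r in range(num_rounds)]

    def _encrypt(self, x: int) -> int:
        left = (x >> self._half) & self._mask
        right = x & self._mask
        for key in self._keys:
            # Knuth multiplicative hash as the round function
            left, right = right, left ^ ((((right * 2654435761) ^ key) >> 32) & self._mask)
        return (left << self._half) | right

    def __call__(self, i: int) -> int:
        # Cycle-walk: re-encrypt until the value falls back into [0, n),
        # which preserves bijectivity on the restricted domain.
        x = self._encrypt(i)
        while x >= self.n:
            x = self._encrypt(x)
        return x
```

Because the Feistel construction is a bijection regardless of the round function, every index maps to a unique shuffled index and no epoch-sized buffer is needed.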

Supported formats:

share_gpt

  • two files data.jsonl and data.jsonl.idx
  • schema (single line):
        {
            "id": "audio_convo",
            "sound": "audio.wav",
            "conversations": [
                {"from": "human", "value": "Listen to this: <sound> What do you think?"},
                {"from": "gpt", "value": "Response"},
            ],
        }
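For illustration, random access over such a data.jsonl / data.jsonl.idx pair might look like the following sketch. It assumes one plausible index layout (one little-endian uint64 per line holding the cumulative end-of-line offset, with an implicit leading 0 sentinel); the PR's exact sentinel handling may differ, and the helper names are hypothetical:

```python
import json
import struct

def load_index(idx_path: str) -> list[int]:
    """Read little-endian uint64 offsets and prepend a start-of-file sentinel,
    so offsets[i]..offsets[i+1] delimit sample i (illustrative layout)."""
    with open(idx_path, "rb") as f:
        data = f.read()
    offsets = list(struct.unpack(f"<{len(data) // 8}Q", data))
    return [0] + offsets

def read_sample(jsonl_path: str, offsets: list[int], i: int) -> dict:
    """Seek directly to sample i instead of scanning the file sequentially."""
    with open(jsonl_path, "rb") as f:
        f.seek(offsets[i])
        raw = f.read(offsets[i + 1] - offsets[i])
    return json.loads(raw)
```

Combined with an on-the-fly index permutation, this gives shuffled random access without loading the whole JSONL file into memory.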

share_gpt_webdataset

  • directory with files like shard-0.tar + shard-0.tar.idx
  • expected directory layout:
        data_dir/
          wids-meta.json                          # shard list metadata
          0/
            shard-0.tar      shard-0.tar.idx      # tar + optional index
            ...
  • each tar archive contains paired files per sample (same basename):
        0.json   0.wav
        1.json   1.wav
        ...
  • each individual json file has one line following the share_gpt schema
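A simplified sketch of consuming such paired samples with the stdlib `tarfile` module is shown below. Note this is a sequential illustration only: per the PR, the actual implementation uses custom tar reading plus the `.idx` file for random access, and the function name here is hypothetical:

```python
import json
import tarfile
from collections import defaultdict

def iter_tar_samples(tar_path: str):
    """Group tar members by basename so 0.json / 0.wav form one sample.
    Buffers member payloads in memory; fine for a sketch, not for huge shards."""
    groups = defaultdict(dict)
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            stem, _, ext = member.name.rpartition(".")
            groups[stem][ext] = tar.extractfile(member).read()
    for stem in sorted(groups):
        files = groups[stem]
        sample = json.loads(files["json"])   # one-line share_gpt schema
        sample["audio_bytes"] = files.get("wav")
        yield sample
```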

Collection: speechlm2

Changelog

  • Data type parsers for indexed JSONL and webdataset-based share_gpt format data.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
@github-actions (bot) commented:

[🤖]: Hi @pzelasko 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc

@tbartley94 (Collaborator) left a comment:

Formatting and structural changes. General logic looks good though.

bits += 1
self._half = bits // 2
self._mask = (1 << self._half) - 1
self._rounds = 6
Collaborator:

Make this an argument in `__init__`.

left = (x >> self._half) & self._mask
right = x & self._mask
for key in self._keys:
left, right = right, left ^ (((right * 2654435761) ^ key) >> 32 & self._mask)
Collaborator:

Make this a global variable at the top of the file.

for line in f_in:
current_offset += len(line)
write_buffer.extend(struct.pack('<Q', current_offset))
if len(write_buffer) > 8 * 1024 * 1024:
Collaborator:

Very nitpicky, but just write out the full multiplication as a variable above with a comment; no need to do the extra ops for every line.
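The suggestion amounts to hoisting the threshold out of the per-line loop, roughly like this sketch (illustrative, not the PR's actual code):

```python
import struct

FLUSH_THRESHOLD = 8 * 1024 * 1024  # flush buffered offsets every 8 MiB

def create_index(jsonl_path: str) -> None:
    """Write <jsonl_path>.idx: one little-endian uint64 per line, holding the
    cumulative byte offset at the end of that line (illustrative sketch)."""
    write_buffer = bytearray()
    current_offset = 0
    with open(jsonl_path, "rb") as f_in, open(jsonl_path + ".idx", "wb") as f_out:
        for line in f_in:
            current_offset += len(line)
            write_buffer.extend(struct.pack("<Q", current_offset))
            if len(write_buffer) > FLUSH_THRESHOLD:
                f_out.write(write_buffer)
                write_buffer.clear()
        f_out.write(write_buffer)  # flush the tail
```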


{
"id": str,
"id": str, # not optional, but we will tolerate if it's missing
Collaborator:

A bit cryptic; can you point to the line where we tolerate this?

raise FileNotFoundError(f"No wids-meta.json and no .tar files found under {self.data_dir}")
self.audio_placeholders = _normalize_audio_placeholders(self.audio_placeholders)
self._has_index = all(Path(p + ".idx").exists() for p in self._shard_paths)
self.epoch = 0
Collaborator:

Hmm, is there any way we can sync this with the trainer? Having an adapter maintain its epoch on its own sounds like a pending desync issue that will be annoying to hunt down.

Collaborator (Author):

No, there is no way: each dataset iterator keeps its own count, and there is no shared concept of a global epoch in our setups.

Collaborator:

Hmm, for sanity can you add a unit test checking trainer.epoch == text_adapter.epoch (if not already done)? This seems like something potentially painful in a year.

@pyf98 (Collaborator) commented Feb 19, 2026

Thanks. This PR looks good to me!

pzelasko and others added 2 commits February 25, 2026 14:16
- Extract magic constant to module-level _KNUTH_HASH
- Make Feistel round count a configurable init arg (num_rounds)
- Refactor _load_index to always return sentinel-inclusive offsets and
  validate offset bounds, simplifying callers
- Replace raw tuple with TarSample NamedTuple for clarity
- Add docstring explaining why custom tar reading vs stdlib tarfile
- Pre-compute flush_threshold in create_index
- Raise ValueError on missing audio path instead of silent skip
- Clarify "tolerate" comment with concrete fallback reference
- Add comment explaining force_finite flag purpose

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
@github-actions (bot) commented:

[🤖]: Hi @pzelasko 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

@pzelasko pzelasko enabled auto-merge (squash) February 26, 2026 12:34
@tbartley94 (Collaborator) left a comment:

Request for a simple unit test; else LGTM.

raise FileNotFoundError(f"No wids-meta.json and no .tar files found under {self.data_dir}")
self.audio_placeholders = _normalize_audio_placeholders(self.audio_placeholders)
self._has_index = all(Path(p + ".idx").exists() for p in self._shard_paths)
self.epoch = 0
Collaborator:

For sanity, can you add a unit test checking trainer.epoch == text_adapter.epoch (if not already done)? This seems like something potentially painful in a year.

@pzelasko pzelasko merged commit 1692a8f into main Feb 26, 2026
212 of 214 checks passed
@pzelasko pzelasko deleted the indexed-sharegpt-data-type-parsers branch February 26, 2026 15:29