
[speechm2] Support indexed sharegpt JSONL and webdataset formats#15410

Merged
pzelasko merged 4 commits into main from indexed-sharegpt-data-type-parsers
Feb 26, 2026

Conversation

@pzelasko (Collaborator) commented Feb 17, 2026

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch never blocks the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do?

The first PR to support indexed datasets. It reads a binary index (a sequence of uint64 byte offsets marking the beginning of each sample in a file), generates a random permutation of indices on the fly, and looks up the corresponding sample.

This implementation pretends to be a sequential-IO dataset for compatibility, but it will serve as a building block for a resumable dataloader in the future.
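As a rough illustration of such an on-the-fly permutation, here is a minimal sketch using a Feistel network with cycle-walking (a common technique for format-preserving index shuffling). The class name, constructor parameters, and round-key derivation here are illustrative assumptions, not the PR's actual API:

```python
import random

class FeistelPermutation:
    """Sketch: a seeded bijection over [0, n) computed lazily, without
    materializing the full permutation in memory."""

    def __init__(self, n: int, seed: int = 0, num_rounds: int = 6):
        self.n = n
        bits = max(2, n.bit_length())
        if bits % 2:
            bits += 1  # balanced Feistel needs an even bit width
        self._half = bits // 2
        self._mask = (1 << self._half) - 1
        self._keys = [random.Random(seed + r).getrandbits(32) for r in range(num_rounds)]

    def _encrypt(self, x: int) -> int:
        left = (x >> self._half) & self._mask
        right = x & self._mask
        for key in self._keys:
            # Knuth multiplicative hash as the round function
            left, right = right, left ^ ((((right * 2654435761) ^ key) >> 32) & self._mask)
        return (left << self._half) | right

    def __call__(self, i: int) -> int:
        # Cycle-walk: re-encrypt until the value falls back into [0, n),
        # which preserves bijectivity on the restricted domain.
        x = self._encrypt(i)
        while x >= self.n:
            x = self._encrypt(x)
        return x
```

Because the Feistel construction is a bijection regardless of the round function, every index maps to a unique shuffled index and no epoch-sized buffer is needed.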

Supported formats:

share_gpt

  • two files data.jsonl and data.jsonl.idx
  • schema (single line):
        {
            "id": "audio_convo",
            "sound": "audio.wav",
            "conversations": [
                {"from": "human", "value": "Listen to this: <sound> What do you think?"},
                {"from": "gpt", "value": "Response"},
            ],
        }
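For illustration, random access over such a data.jsonl / data.jsonl.idx pair might look like the following sketch. It assumes one plausible index layout (one little-endian uint64 per line holding the cumulative end-of-line offset, with an implicit leading 0 sentinel); the PR's exact sentinel handling may differ, and the helper names are hypothetical:

```python
import json
import struct

def load_index(idx_path: str) -> list[int]:
    """Read little-endian uint64 offsets and prepend a start-of-file sentinel,
    so offsets[i]..offsets[i+1] delimit sample i (illustrative layout)."""
    with open(idx_path, "rb") as f:
        data = f.read()
    offsets = list(struct.unpack(f"<{len(data) // 8}Q", data))
    return [0] + offsets

def read_sample(jsonl_path: str, offsets: list[int], i: int) -> dict:
    """Seek directly to sample i instead of scanning the file sequentially."""
    with open(jsonl_path, "rb") as f:
        f.seek(offsets[i])
        raw = f.read(offsets[i + 1] - offsets[i])
    return json.loads(raw)
```

Combined with an on-the-fly index permutation, this gives shuffled random access without loading the whole JSONL file into memory.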

share_gpt_webdataset

  • directory with files like shard-0.tar + shard-0.tar.idx
  • expected directory layout:
        data_dir/
          wids-meta.json                          # shard list metadata
          0/
            shard-0.tar      shard-0.tar.idx      # tar + optional index
            ...
  • each tar archive contains paired files per sample (same basename):
        0.json   0.wav
        1.json   1.wav
        ...
  • each individual json file has one line following the share_gpt schema
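A simplified sketch of consuming such paired samples with the stdlib `tarfile` module is shown below. Note this is a sequential illustration only: per the PR, the actual implementation uses custom tar reading plus the `.idx` file for random access, and the function name here is hypothetical:

```python
import json
import tarfile
from collections import defaultdict

def iter_tar_samples(tar_path: str):
    """Group tar members by basename so 0.json / 0.wav form one sample.
    Buffers member payloads in memory; fine for a sketch, not for huge shards."""
    groups = defaultdict(dict)
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            stem, _, ext = member.name.rpartition(".")
            groups[stem][ext] = tar.extractfile(member).read()
    for stem in sorted(groups):
        files = groups[stem]
        sample = json.loads(files["json"])   # one-line share_gpt schema
        sample["audio_bytes"] = files.get("wav")
        yield sample
```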

Collection: speechlm2

Changelog

  • Data type parsers for indexed JSONL and webdataset-based share_gpt format data.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
@github-actions (bot) commented:

[🤖]: Hi @pzelasko 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

//cc @chtruong814 @ko3n1g @pablo-garay @thomasdhc

@tbartley94 (Collaborator) left a comment:

Formatting and structural changes. General logic looks good though.

bits += 1
self._half = bits // 2
self._mask = (1 << self._half) - 1
self._rounds = 6
Collaborator:

Make this an argument in `__init__`.

left = (x >> self._half) & self._mask
right = x & self._mask
for key in self._keys:
left, right = right, left ^ (((right * 2654435761) ^ key) >> 32 & self._mask)
Collaborator:

Make this a global variable at the top of the file.

for line in f_in:
current_offset += len(line)
write_buffer.extend(struct.pack('<Q', current_offset))
if len(write_buffer) > 8 * 1024 * 1024:
Collaborator:

Very nitpicky, but just write out the full multiplication as a variable above with a comment; no need to do the extra ops for every line.
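The suggestion amounts to hoisting the threshold out of the per-line loop, roughly like this sketch (illustrative, not the PR's actual code):

```python
import struct

FLUSH_THRESHOLD = 8 * 1024 * 1024  # flush buffered offsets every 8 MiB

def create_index(jsonl_path: str) -> None:
    """Write <jsonl_path>.idx: one little-endian uint64 per line, holding the
    cumulative byte offset at the end of that line (illustrative sketch)."""
    write_buffer = bytearray()
    current_offset = 0
    with open(jsonl_path, "rb") as f_in, open(jsonl_path + ".idx", "wb") as f_out:
        for line in f_in:
            current_offset += len(line)
            write_buffer.extend(struct.pack("<Q", current_offset))
            if len(write_buffer) > FLUSH_THRESHOLD:
                f_out.write(write_buffer)
                write_buffer.clear()
        f_out.write(write_buffer)  # flush the tail
```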


{
"id": str,
"id": str, # not optional, but we will tolerate if it's missing
Collaborator:

A bit cryptic; can you point to the line where we tolerate this?

raise FileNotFoundError(f"No wids-meta.json and no .tar files found under {self.data_dir}")
self.audio_placeholders = _normalize_audio_placeholders(self.audio_placeholders)
self._has_index = all(Path(p + ".idx").exists() for p in self._shard_paths)
self.epoch = 0
Collaborator:

Hmm, is there any way we can sync this with the trainer? Having an adapter maintain its epoch on its own sounds like a pending desync issue that will be annoying to hunt down.

Collaborator (Author):

No, there is no way: each dataset iterator keeps its own count, and there is no shared concept of a global epoch in our setups.

Collaborator:

Hmm, for sanity can you add a unit test checking trainer.epoch == text_adapter.epoch (if not already done)? This seems like something potentially painful in a year.

@pyf98 (Collaborator) commented Feb 19, 2026

Thanks. This PR looks good to me!

pzelasko and others added 2 commits February 25, 2026 14:16
- Extract magic constant to module-level _KNUTH_HASH
- Make Feistel round count a configurable init arg (num_rounds)
- Refactor _load_index to always return sentinel-inclusive offsets and
  validate offset bounds, simplifying callers
- Replace raw tuple with TarSample NamedTuple for clarity
- Add docstring explaining why custom tar reading vs stdlib tarfile
- Pre-compute flush_threshold in create_index
- Raise ValueError on missing audio path instead of silent skip
- Clarify "tolerate" comment with concrete fallback reference
- Add comment explaining force_finite flag purpose

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
@github-actions (bot) commented:

[🤖]: Hi @pzelasko 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

@pzelasko pzelasko enabled auto-merge (squash) February 26, 2026 12:34
@tbartley94 (Collaborator) left a comment:

Request for a simple unit test; else LGTM.

raise FileNotFoundError(f"No wids-meta.json and no .tar files found under {self.data_dir}")
self.audio_placeholders = _normalize_audio_placeholders(self.audio_placeholders)
self._has_index = all(Path(p + ".idx").exists() for p in self._shard_paths)
self.epoch = 0
Collaborator:

For sanity, can you add a unit test checking trainer.epoch == text_adapter.epoch (if not already done)? This seems like something potentially painful in a year.

@pzelasko pzelasko merged commit 1692a8f into main Feb 26, 2026
212 of 214 checks passed
@pzelasko pzelasko deleted the indexed-sharegpt-data-type-parsers branch February 26, 2026 15:29