varlendataset for thd e2e and benchmark by xiaoyao0115 · Pull Request #4832 · NVIDIA/Megatron-LM

xiaoyao0115 · 2026-05-16T21:59:30Z

What does this PR do ?

Add VarlenDataset for variable-length training over HF / local jsonl / parquet data

1. What this PR does

Adds a new dataset class VarlenDataset (and its MockVarlenDataset sibling) for variable-length training, gated by a new top-level flag --use-varlen-dataset. Designed for the packed-sequence (THD) path, both static (--sequence-packing-scheduler dp_balanced) and dynamic (--dynamic-context-parallel) variants.

Why a separate dataset class instead of extending --sft?

To support running THD end-to-end directly from Hugging Face datasets.
Each __getitem__ returns one tokenized sample in unpacked form (tokens, labels, loss_mask, position_ids, original_seq_len, padded_seq_len).
The upstream packing scheduler (dp_balanced or default_dynamic_cp) sees variable-length samples and packs them across the DP×CP grid up to --max-seqlen-per-dp-cp-rank.

This is what BasePackingScheduler.get_required_sample_keys() already expects, and what the existing comment in data_schedule_utils.py flags as the "ideal" dataset shape. SFTDataset triggers a (wasteful) unpack → repack round-trip via _unpack_batch; VarlenDataset skips it.
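As a rough illustration of the unpacked sample shape listed above, here is a minimal sketch that uses plain Python lists in place of tensors. The helper name `make_unpacked_sample`, the `pad_multiple` alignment rule, and the `IGNORE_INDEX` value of -100 are assumptions made for this example, not details taken from the PR.

```python
from typing import Dict, List, Union

IGNORE_INDEX = -100  # assumed label value for positions excluded from the loss


def make_unpacked_sample(
    token_ids: List[int], pad_id: int, pad_multiple: int = 4
) -> Dict[str, Union[List[int], List[float], int]]:
    """Build one unpacked sample with next-token labels (hypothetical helper)."""
    tokens = token_ids[:-1]  # model inputs
    labels = token_ids[1:]   # targets, shifted by one position
    original_seq_len = len(tokens)
    # Round the length up to a multiple of pad_multiple (an assumed alignment rule).
    padded_seq_len = -(-original_seq_len // pad_multiple) * pad_multiple
    n_pad = padded_seq_len - original_seq_len
    return {
        "tokens": tokens + [pad_id] * n_pad,
        "labels": labels + [IGNORE_INDEX] * n_pad,
        "loss_mask": [1.0] * original_seq_len + [0.0] * n_pad,
        "position_ids": list(range(padded_seq_len)),
        "original_seq_len": original_seq_len,
        "padded_seq_len": padded_seq_len,
    }


sample = make_unpacked_sample([5, 6, 7, 8, 9, 10], pad_id=0)
```

Because each sample carries its own `original_seq_len` and `padded_seq_len`, a scheduler can pack samples without first recovering their boundaries from `cu_seqlens`.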

Three additional framework-level fixes rolled in to make the new path work cleanly without breaking --sft:

_unpack_batch short-circuits when the sample already has padded_seq_len (no cu_seqlens-based slicing needed). --sft path unchanged.
data_samplers.py uses identity collate_fn for all packing schedulers, not just --dynamic-context-parallel. The previous gate excluded dp_balanced users.
pretrain_gpt.py:get_batch widens the is_packed_sequence check from args.sft to args.sft or args.use_varlen_dataset.
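The first two fixes can be pictured with a small sketch. It assumes an identity collate simply returns the list of per-sample dicts, and that the short-circuit keys off the presence of `padded_seq_len`. Both function names here are hypothetical; the real `_unpack_batch` logic may check more than this.

```python
from typing import Any, Dict, List


def identity_collate(batch: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the per-sample dicts unchanged; variable lengths cannot be stacked."""
    return batch


def needs_unpack(sample: Dict[str, Any]) -> bool:
    """Hypothetical check: samples that already carry padded_seq_len are
    treated as unpacked and skip cu_seqlens-based slicing."""
    return "padded_seq_len" not in sample


batch = identity_collate(
    [
        {"tokens": [1, 2, 3], "padded_seq_len": 4},
        {"tokens": [1, 2, 3, 4, 5, 6], "padded_seq_len": 8},
    ]
)
already_unpacked = [not needs_unpack(s) for s in batch]
```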

Three validate-args rules govern the new flag:

--use-varlen-dataset and --sft are mutually exclusive (both select the packed-sequence dataset family).
--use-varlen-dataset combined with --mock-data is allowed (routes to MockVarlenDataset, configured via --varlen-mock-dataset-config-json).
--use-varlen-dataset auto-picks a packing scheduler when none is given: dp_balanced by default, or default_dynamic_cp when --dynamic-context-parallel is set. --varlen-bshd-validation opts out of the packing path entirely.
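The rules above can be sketched as a small resolver over an argument namespace. This is an illustration only: `resolve_varlen_args` and `make_args` are hypothetical names, and the attribute names are assumed to be the flag names with dashes replaced by underscores.

```python
from types import SimpleNamespace


def resolve_varlen_args(args: SimpleNamespace) -> SimpleNamespace:
    """Apply the varlen rules described above (hypothetical helper)."""
    if not args.use_varlen_dataset:
        return args
    assert not args.sft, "--use-varlen-dataset and --sft are mutually exclusive"
    if args.varlen_bshd_validation:
        # BSHD reference mode: no packing, so no scheduler is selected.
        assert not args.dynamic_context_parallel
        assert args.sequence_packing_scheduler is None
        return args
    if args.sequence_packing_scheduler is None:
        args.sequence_packing_scheduler = (
            "default_dynamic_cp" if args.dynamic_context_parallel else "dp_balanced"
        )
    return args


def make_args(**overrides) -> SimpleNamespace:
    """Build a namespace with defaults for the flags used in this sketch."""
    defaults = dict(
        use_varlen_dataset=True,
        sft=False,
        mock_data=False,
        dynamic_context_parallel=False,
        sequence_packing_scheduler=None,
        varlen_bshd_validation=False,
    )
    defaults.update(overrides)
    return SimpleNamespace(**defaults)


static_args = resolve_varlen_args(make_args())
dynamic_args = resolve_varlen_args(make_args(dynamic_context_parallel=True))
bshd_args = resolve_varlen_args(make_args(varlen_bshd_validation=True))
```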

Files touched

megatron/training/datasets/varlen_dataset.py   (new, ~340 lines)
megatron/training/arguments.py                  +51   new args group + validate asserts
megatron/training/datasets/data_samplers.py     +6    identity collate for all scheduler paths
megatron/core/datasets/data_schedule_utils.py   +23   _unpack_batch short-circuit
megatron/core/datasets/gpt_dataset.py           +5    varlen_mock_dataset_config_json field
pretrain_gpt.py                                 +13   dataset_type dispatch + is_packed_sequence

Total: 5 modified + 1 new, ~96 line diff plus the new file.

2. How to use it

--use-varlen-dataset reuses the existing --data-path argument. Three input sources, all auto-detected:

# HuggingFace Hub repo id (auto-downloaded by `datasets.load_dataset`)
--use-varlen-dataset --data-path Yukang/LongAlpaca-12k
--use-varlen-dataset --data-path HuggingFaceH4/no_robots
--use-varlen-dataset --data-path databricks/databricks-dolly-15k

# Local parquet
--use-varlen-dataset --data-path /path/to/dataset.parquet

# Local jsonl
--use-varlen-dataset --data-path /path/to/dataset.jsonl
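One plausible way to auto-detect the three sources is by file extension, treating anything else as a Hub repo id. The sketch below assumes exactly that; `detect_source` is a hypothetical helper, and the PR's real detection logic may differ.

```python
def detect_source(data_path: str) -> str:
    """Classify --data-path as 'parquet', 'jsonl', or 'hf_hub' (hypothetical helper).

    Assumption: local files are recognized purely by extension and any other
    value is treated as a Hugging Face Hub repo id.
    """
    lower = data_path.lower()
    if lower.endswith(".parquet"):
        return "parquet"
    if lower.endswith(".jsonl"):
        return "jsonl"
    return "hf_hub"


# With the `datasets` library, the three cases could then be loaded as:
#   load_dataset("parquet", data_files=path, split="train")
#   load_dataset("json", data_files=path, split="train")
#   load_dataset(repo_id, split="train")
sources = [
    detect_source("Yukang/LongAlpaca-12k"),
    detect_source("/path/to/dataset.parquet"),
    detect_source("/path/to/dataset.jsonl"),
]
```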

A sequence packing scheduler is auto-selected: dp_balanced (static) by default, or default_dynamic_cp when --dynamic-context-parallel is passed. To override either default, pass --sequence-packing-scheduler explicitly.

Supported dataset schemas (auto-detected from column names)

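Each jsonl line / parquet row / HF Hub row must match one of:

A. Alpaca style: instruction / input / output columns, or synonyms such as context, response, and prompt, as in the examples below.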

{"instruction": "Summarize this paper.", "input": "Paper text...", "output": "..."}
{"instruction": "Who wrote 1984?", "context": "1984 was written...", "response": "Orwell"}
{"prompt": "Q?", "response": "A."}

B. ShareGPT style — conversations column with {"from": ..., "value": ...} entries.

{"conversations": [
    {"from": "human", "value": "Hi"},
    {"from": "gpt",   "value": "Hello"}
]}

from is mapped to chat-template roles via a small dict (human/user → user, gpt/assistant/model/chatgpt/bing/bard → assistant, tool/function/observation → tool); unknown speakers fall back to user.
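The role mapping described above could look like the following sketch. The dict contents come from the description; the names `SHAREGPT_ROLE_MAP` and `sharegpt_to_messages` are hypothetical, and any case normalization the real code may do is not shown.

```python
from typing import Dict, List

# Speaker names listed in the PR description, mapped to chat-template roles.
SHAREGPT_ROLE_MAP: Dict[str, str] = {
    "human": "user",
    "user": "user",
    "gpt": "assistant",
    "assistant": "assistant",
    "model": "assistant",
    "chatgpt": "assistant",
    "bing": "assistant",
    "bard": "assistant",
    "tool": "tool",
    "function": "tool",
    "observation": "tool",
}


def sharegpt_to_messages(conversations: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Convert ShareGPT turns to role/content messages; unknown speakers become 'user'."""
    return [
        {"role": SHAREGPT_ROLE_MAP.get(turn["from"], "user"), "content": turn["value"]}
        for turn in conversations
    ]


messages = sharegpt_to_messages(
    [
        {"from": "human", "value": "Hi"},
        {"from": "gpt", "value": "Hello"},
        {"from": "narrator", "value": "An unknown speaker."},
    ]
)
```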

C. OpenAI messages style — messages column with {"role": ..., "content": ...} entries.

{"messages": [
    {"role": "system",    "content": "Be terse."},
    {"role": "user",      "content": "Hi"},
    {"role": "assistant", "content": "Hello"}
]}

Detection priority: messages > conversations > alpaca-synonyms. Unrecognized columns raise a clear ValueError.
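A minimal sketch of that priority order follows. The synonym sets are inferred from the three alpaca examples above and are assumptions, as is the helper name `detect_schema`. The sketch covers only the three schemas in this section and ignores any further fallbacks.

```python
from typing import Iterable

# Assumed synonym sets, inferred from the example rows in this description.
PROMPT_COLUMNS = {"instruction", "prompt"}
ANSWER_COLUMNS = {"output", "response"}


def detect_schema(columns: Iterable[str]) -> str:
    """Pick a schema by priority: messages > conversations > alpaca synonyms."""
    cols = set(columns)
    if "messages" in cols:
        return "openai-messages"
    if "conversations" in cols:
        return "sharegpt"
    if cols & PROMPT_COLUMNS and cols & ANSWER_COLUMNS:
        return "alpaca"
    raise ValueError(
        f"Unrecognized dataset columns {sorted(cols)}; expected 'messages', "
        "'conversations', or alpaca-style prompt/answer columns."
    )


schemas = [
    detect_schema(["messages", "conversations"]),
    detect_schema(["conversations", "id"]),
    detect_schema(["instruction", "context", "response"]),
]
```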

Known compatible HuggingFace datasets

The schemas above cover most public SFT corpora. Examples that work out of the box (no preprocessing, just --data-path owner/repo):

| HF repo id | Schema | Approx size | Notes |
|---|---|---|---|
| `Yukang/LongAlpaca-12k` | alpaca | 12 k rows / 500 MB | Long-context SFT, many samples > 16k tokens |
| `tatsu-lab/alpaca` | alpaca | 52 k rows | The canonical Stanford Alpaca dataset |
| `vicgalle/alpaca-gpt4` | alpaca | 52 k rows | GPT-4 regenerated Alpaca |
| `databricks/databricks-dolly-15k` | alpaca (instruction + context + response) | 15 k rows | Dolly fields auto-handled via synonyms |
| `HuggingFaceH4/no_robots` | openai-messages | 10 k rows / 22 MB parquet | Multi-turn chat |
| `Open-Orca/OpenOrca` | sharegpt-style (column `conversations`) | ~3 M rows | Large; expect long load |
| `Open-Orca/SlimOrca` | sharegpt | ~500 k rows | Filtered subset of OpenOrca |
| `lmsys/lmsys-chat-1m` | openai-messages | 1 M rows | Multi-turn user/assistant |
| `cognitivecomputations/SystemChat-2.0` | sharegpt | ~7 k rows | System-prompt-led conversations |
| `nvidia/HelpSteer2` | alpaca-like (`prompt` + `response`) | ~10 k rows | Picked up via the `prompt`/`response` synonyms |

Datasets explicitly not supported (would raise on schema detect):

OpenAssistant/oasst1 — tree-structured conversation graph
Anthropic/hh-rlhf — preference pairs (chosen / rejected), not a single conversation per row
Multi-modal SFT corpora (content stored as a list of image / text parts)

Mock mode (for benchmarking)

--use-varlen-dataset --mock-data
--use-varlen-dataset --mock-data --varlen-mock-dataset-config-json \
  '{"mode":"distribution","type":"lognormal","min_seq_len":1024,"max_seq_len":8192,"mean_seq_len":4096,"lognormal_sigma":1.2}'

Three mock modes (mirroring --sft-mock-dataset-config-json):

distribution (lognormal seq-length sampling)
file (per-line lengths from a CSV)
verification (real tokens from an IndexedDataset, with lognormal sampled lengths)
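For the distribution mode, one way to turn the config fields shown above into sequence lengths is sketched here. The parameterization is an assumption: it sets the underlying normal mean to ln(mean_seq_len) - sigma^2 / 2 so that the unclipped lognormal has the requested mean, then clips to the [min_seq_len, max_seq_len] range. `sample_lengths` is a hypothetical helper, and the mock dataset in the PR may parameterize things differently.

```python
import math
import random
from typing import List


def sample_lengths(
    n: int,
    min_seq_len: int,
    max_seq_len: int,
    mean_seq_len: int,
    lognormal_sigma: float,
    seed: int = 0,
) -> List[int]:
    """Draw n lognormal sequence lengths and clip them to the allowed range."""
    rng = random.Random(seed)
    # E[lognormal(mu, sigma)] = exp(mu + sigma^2 / 2), so solve for mu.
    mu = math.log(mean_seq_len) - 0.5 * lognormal_sigma**2
    lengths = []
    for _ in range(n):
        raw = rng.lognormvariate(mu, lognormal_sigma)
        lengths.append(int(min(max(raw, min_seq_len), max_seq_len)))
    return lengths


lengths = sample_lengths(
    n=1000, min_seq_len=1024, max_seq_len=8192, mean_seq_len=4096, lognormal_sigma=1.2
)
```

Note that clipping shifts the empirical mean away from `mean_seq_len`, so the configured mean describes the unclipped distribution only.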

BSHD reference mode (for THD numerical verification)

--varlen-bshd-validation bypasses the packed-sequence path entirely: each sample is right-padded to --seq-length, no cu_seqlens, no packing scheduler. Used to obtain a BSHD reference run from the same data and same tokenization that the THD path consumes, so the two can be compared for correctness. Incompatible with --dynamic-context-parallel and --sequence-packing-scheduler.

# Side-by-side run for THD correctness validation:
--use-varlen-dataset --data-path my_data.jsonl                                              # THD (with scheduler)
--use-varlen-dataset --data-path my_data.jsonl --varlen-bshd-validation                     # BSHD reference

Tokenizer requirement

Same as --sft: needs a tokenizer with tokenize_conversation support. Pass --tokenizer-type SFTTokenizer --sft-tokenizer-prompt-format {default | nemotron-h-aligned | nemotron-nano-v2 | identity} along with --tokenizer-model <hf-tokenizer-dir>.

Limitations (raise rather than silent-mishandle)

Tree-structured (e.g. OpenAssistant oasst1) or chosen/rejected preference datasets are not supported.
Multi-modal samples (content as a list of image/text parts) are not supported.
HF Hub repos: only split="train" is loaded. Export to a local jsonl/parquet first if your dataset's primary split is named differently.

Issue tracking

For PRs from open-source community contributors:

New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code Typing guidelines
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

When your PR is ready, click Ready for Review.
An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
- Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

copy-pr-bot · 2026-05-16T21:59:34Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yuzhongw-nvidia · 2026-05-19T02:00:32Z

+            # BSHD reference mode: each sample is right-padded to
+            # sequence_length and shipped through the default non-packed
+            # pipeline. No scheduler / dynamic-cp involved.
+            assert not args.dynamic_context_parallel, (
+                "--varlen-bshd-validation is incompatible with "
+                "--dynamic-context-parallel (BSHD mode is not packed)."
+            )
+            assert args.sequence_packing_scheduler is None, (
+                "--varlen-bshd-validation does not use a sequence packing "
+                "scheduler; drop --sequence-packing-scheduler."
+            )


Can we move these checks into gpt_dataset.py?

yuzhongw-nvidia · 2026-05-19T02:11:29Z


    sft_mock_dataset_config_json: Optional[str] = None
+
+    varlen_mock_dataset_config_json: Optional[str] = None


Can we support either a JSON string or a JSON file path for sft_mock_dataset_config_json and varlen_mock_dataset_config_json? It is a little bit annoying for users to pass a JSON string via the CLI.

ok, i'll make the change~
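A sketch of how the requested behaviour could work, assuming the value is treated as a file path when it names an existing file and as inline JSON otherwise. `load_json_config` is a hypothetical helper, not the function added in this PR.

```python
import json
import os
import tempfile
from typing import Any, Optional


def load_json_config(value: Optional[str]) -> Optional[Any]:
    """Parse value as a JSON file if it names an existing file, else as inline JSON."""
    if value is None:
        return None
    if os.path.isfile(value):
        with open(value, "r", encoding="utf-8") as f:
            return json.load(f)
    return json.loads(value)


# Inline JSON string, as passed on the command line today.
inline_cfg = load_json_config('{"mode": "distribution", "type": "lognormal"}')

# The same loader reading a config from a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({"mode": "file"}, tmp)
file_cfg = load_json_config(tmp.name)
os.remove(tmp.name)
```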

xiaoyao0115 · 2026-05-29T04:08:04Z

/ok to test 1f8946a

yuzhongw-nvidia · 2026-06-01T09:27:21Z

    if args.sequence_packing_scheduler is not None:
        if args.sequence_packing_scheduler == 'dp_balanced':
            total_cp_ranks = args.context_parallel_size
        else:
            total_cp_ranks = args.data_parallel_size * args.context_parallel_size
        assert total_cp_ranks * args.max_seqlen_per_dp_cp_rank >= args.seq_length, (
            f'Packed sequence buffer size ({total_cp_ranks * args.max_seqlen_per_dp_cp_rank}) '
            f'must be >= single sequence max length ({args.seq_length})'
        )


Please move these checks below Line 1728. See https://github.com/NVIDIA/Megatron-LM/pull/4832/changes#r3333081617.
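To make the check quoted above concrete, here is a worked example with illustrative numbers (DP=4, CP=2, 4096 tokens per rank); these values are assumptions, not settings from the PR. `packed_buffer_size` is a hypothetical helper that mirrors the quoted arithmetic.

```python
def packed_buffer_size(
    scheduler: str, dp_size: int, cp_size: int, max_seqlen_per_dp_cp_rank: int
) -> int:
    """Total tokens one packed buffer can hold, following the quoted check."""
    if scheduler == "dp_balanced":
        total_cp_ranks = cp_size            # static packing spans CP ranks only
    else:
        total_cp_ranks = dp_size * cp_size  # dynamic CP can span the DP x CP grid
    return total_cp_ranks * max_seqlen_per_dp_cp_rank


seq_length = 16384
static_size = packed_buffer_size("dp_balanced", 4, 2, 4096)          # 2 * 4096
dynamic_size = packed_buffer_size("default_dynamic_cp", 4, 2, 4096)  # 8 * 4096

static_ok = static_size >= seq_length    # False: the assertion would fire
dynamic_ok = dynamic_size >= seq_length  # True
```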

yuzhongw-nvidia · 2026-06-01T09:31:13Z

+    # Identity collate for VarlenDataset and packing-scheduler paths;
+    # they emit one variable-length dict per sample, not stack-able by
+    # the default collate.
+    if args.use_varlen_dataset or args.sequence_packing_scheduler is not None:


Suggested change

-    if args.use_varlen_dataset or args.sequence_packing_scheduler is not None:
+    if (args.use_varlen_dataset and not args.varlen_bshd_validation) or args.sequence_packing_scheduler is not None:

--varlen-bshd-validation disables the scheduler, but the DataLoader still uses identity collate because args.use_varlen_dataset is true.

yuzhongw-nvidia · 2026-06-01T09:33:01Z

+            #     earlier in ``validate_args``).
+            #   * Otherwise fall back to ``dp_balanced`` (static packing).
+            if args.sequence_packing_scheduler is None:
+                args.sequence_packing_scheduler = 'dp_balanced'


The normal --use-varlen-dataset path auto-selects dp_balanced after the generic sequence-packing validation has already run. See https://github.com/NVIDIA/Megatron-LM/pull/4832/changes#r3333048405.

yuzhongw-nvidia · 2026-06-01T09:37:25Z

+        # tokenizers like Qwen3). Fall back to eod for padding — irrelevant
+        # for loss because loss_mask zeros pad positions out.
+        eod = tokenizer.eod
+        pad = tokenizer.pad if tokenizer.pad is not None else eod


When the tokenizer has no pad token, VarlenDataset falls back to pad = eod and then masks loss by token value with loss_mask[labels == pad] = 0.0.

The fallback is explicitly intended for tokenizers without an explicit pad token, but when pad == eod, value-based masking removes every EOD/EOS target from the loss, including real sequence-ending EOD tokens that are not padding. This silently changes training semantics for common raw pretraining tokenizers without a pad token: the model no longer learns the end-of-document target. The BSHD branch has the same value-based masking.

Suggestion: Track the padded tail by position instead of masking all labels equal to the pad id. For example, save the valid shifted length before padding and then zero loss_mask[valid_len:], while still masking IGNORE_INDEX. Mirror the same behavior in MockVarlenDataset, which currently does not apply the real dataset's pad fallback.
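The difference between the two masking strategies can be seen in a small sketch with plain lists. The token ids, the `IGNORE_INDEX` value of -100, and both function names are assumptions made for illustration.

```python
from typing import List

IGNORE_INDEX = -100  # assumed ignore label
EOD = 2              # illustrative end-of-document id, also used as pad here


def value_based_mask(labels: List[int], pad_id: int) -> List[float]:
    """Mask every label equal to pad_id (the behaviour the comment objects to)."""
    return [0.0 if lab in (pad_id, IGNORE_INDEX) else 1.0 for lab in labels]


def position_based_mask(labels: List[int], valid_len: int) -> List[float]:
    """Mask only the padded tail (index >= valid_len) and IGNORE_INDEX labels."""
    return [
        1.0 if i < valid_len and lab != IGNORE_INDEX else 0.0
        for i, lab in enumerate(labels)
    ]


# Two real targets, one real EOD target, then two padded positions.
labels = [11, 12, EOD, EOD, EOD]
valid_len = 3

by_value = value_based_mask(labels, pad_id=EOD)       # drops the real EOD target
by_position = position_based_mask(labels, valid_len)  # keeps the real EOD target
```

With `pad == eod`, the value-based mask zeroes index 2 even though it is a genuine end-of-document target, while the position-based mask keeps it.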

yuzhongw-nvidia · 2026-06-01T09:41:34Z

Hi @xiaoyao0115, thanks for your incredible work! I went through the PR with Codex and have several minor comments.

We also have some additional comments on the unit tests, FYI:

tests/unit_tests/data/test_varlen_dataset.py covers schema conversion and local JSONL loading, but does not instantiate VarlenDataset.__getitem__() or MockVarlenDataset.__getitem__().
No test covers --varlen-bshd-validation through DataLoader + normal get_batch_on_this_tp_rank().
No test covers the full THD handoff from VarlenDataset output through _unpack_batch() / scheduler input contract.
Empty raw text rows are accepted by _raw_text_loader() as ""; depending on the tokenizer, tokenizer.tokenize("") can return an empty list, after which tokens_list[-1] would fail.

Victarry

LGTM. Left a few minor suggestions.

Victarry · 2026-06-02T10:06:56Z

    sft_mock_dataset_config_json: Optional[str] = None
+
+    varlen_mock_dataset_config_json: Optional[str] = None
+    """Mock-dataset config (same JSON schema as ``sft_mock_dataset_config_json``)
+    used by the ``--use-varlen-dataset`` path; kept separate so the varlen path
+    does not implicitly inherit SFT-specific knobs."""
+
+    varlen_bshd_validation: bool = False
+    """When True, :class:`VarlenDataset.__getitem__` emits SBHD samples padded
+    to ``sequence_length`` (no ``cu_seqlens`` / ``original_seq_len`` /
+    ``padded_seq_len``), bypassing the packed-sequence path. Used to obtain a
+    BSHD reference run that mirrors the THD path's tokenization but skips all
+    packing — useful for THD numerical-correctness validation."""
    """This config provides the necessary information for the mock dataset."""


[SUGGESTION] Orphaned field docstring after inserting the new varlen config fields.

This PR inserts varlen_mock_dataset_config_json and varlen_bshd_validation (each with its own docstring) between sft_mock_dataset_config_json and the docstring that originally documented it. As a result:

sft_mock_dataset_config_json (line 79) no longer has a docstring.

"""This config provides the necessary information for the mock dataset.""" (line 92) now dangles after varlen_bshd_validation as a no-op string expression, where it no longer describes the field it was written for and is misleading next to a bool flag.

Suggestion: Move the docstring back under sft_mock_dataset_config_json, or delete it if redundant.

Victarry · 2026-06-02T10:07:00Z

+
+    @property
+    def schema_name(self) -> str:
+        """Detected schema name: ``alpaca`` / ``sharegpt`` / ``openai-messages``."""


[SUGGESTION] Docstrings omit the pretrain-text schema that _select_converter actually returns.

_select_converter (line 206) returns a fourth schema, pretrain-text (the text-column fallback), and _raw_text_loader returns a plain string rather than a messages list. Several higher-level docstrings were not updated when this schema was added and are now inaccurate:

This schema_name docstring (line 292) lists only alpaca / sharegpt / openai-messages.

The module docstring (line 19) and the VarlenLowLevelDataset class docstring say "three common instruction-tuning layouts ... normalized to the messages list format" — but there are four schemas, and pretrain-text is not normalized to a messages list.

Suggestion: Add pretrain-text to these docstrings and note that it returns a raw string handled separately in VarlenDataset.__getitem__.

xiaoyao0115 self-assigned this May 16, 2026

xiaoyao0115 requested review from a team as code owners May 16, 2026 21:59

xiaoyao0115 force-pushed the varlen-dataset branch from df8fad0 to 93d806a May 16, 2026 22:09

hxbai mentioned this pull request May 17, 2026: DeepSeek-V4 training support #4468 (open, 3 tasks)

yuzhongw-nvidia reviewed May 19, 2026

wplf reviewed May 19, 2026

Comment thread on megatron/training/datasets/data_samplers.py (outdated)

Victarry mentioned this pull request May 21, 2026: [ROADMAP][2026 Q2] Megatron Core MoE Roadmap #4815 (open, 71 tasks)

xuantengh mentioned this pull request May 26, 2026: Fuse per-sequence AlltoAll into a unified one in GDN forward #4913 (open, 5 tasks)

xiaoyao0115 added 3 commits May 28, 2026 08:56

73f5432 add varlendataset for thd e2e and benchmark
Signed-off-by: tailaim <tailaim@nvidia.com>

feba628 add support for dataset like https://huggingface.co/datasets/allenai/…
…dolma3_longmino_mix-100B-1125 Signed-off-by: tailaim <tailaim@nvidia.com>

1f8946a minor fixes according to the comments
Signed-off-by: tailaim <tailaim@nvidia.com>

xiaoyao0115 force-pushed the varlen-dataset branch from 93d806a to 1f8946a May 28, 2026 15:57

NVIDIA deleted a comment from copy-pr-bot Bot May 29, 2026

copy-pr-bot Bot temporarily deployed to test May 29, 2026 04:09 (inactive)

wplf approved these changes Jun 1, 2026

yuzhongw-nvidia reviewed Jun 1, 2026

yuzhongw-nvidia self-requested a review June 2, 2026 09:39

yuzhongw-nvidia approved these changes Jun 2, 2026

Victarry reviewed Jun 2, 2026

Victarry approved these changes Jun 2, 2026


		sft_mock_dataset_config_json: Optional[str] = None

		varlen_mock_dataset_config_json: Optional[str] = None

	if args.use_varlen_dataset or args.sequence_packing_scheduler is not None:
	if (args.use_varlen_dataset and not args.varlen_bshd_validation) or args.sequence_packing_scheduler is not None:

Conversation

xiaoyao0115 commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

1. What this PR does

Files touched

2. How to use it

Supported dataset schemas (auto-detected from column names)

Known compatible HuggingFace datasets

Mock mode (for benchmarking)

BSHD reference mode (for THD numerical verification)

Tokenizer requirement

Limitations (raise rather than silent-mishandle)

Issue tracking

Contribution process

Pre-checks

Code review

Step 1: Mark PR as "Ready for Review"

Step 2: Final Review

Step 3: Approved

Merge

Uh oh!

copy-pr-bot Bot commented May 16, 2026

Uh oh!

yuzhongw-nvidia May 19, 2026

Choose a reason for hiding this comment

Uh oh!

yuzhongw-nvidia May 19, 2026

Choose a reason for hiding this comment

Uh oh!

xiaoyao0115 May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

xiaoyao0115 commented May 29, 2026

Uh oh!

yuzhongw-nvidia Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yuzhongw-nvidia Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yuzhongw-nvidia Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yuzhongw-nvidia Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

yuzhongw-nvidia commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Victarry left a comment

Choose a reason for hiding this comment

Uh oh!

Victarry Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Victarry Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

xiaoyao0115 commented May 16, 2026 •

edited

Loading

yuzhongw-nvidia Jun 1, 2026 •

edited

Loading

yuzhongw-nvidia Jun 1, 2026 •

edited

Loading

yuzhongw-nvidia Jun 1, 2026 •

edited

Loading

yuzhongw-nvidia commented Jun 1, 2026 •

edited

Loading