varlendataset for thd e2e and benchmark #4832

Open: xiaoyao0115 wants to merge 3 commits into NVIDIA:dev from xiaoyao0115:varlen-dataset

Conversation

@xiaoyao0115
Contributor

@xiaoyao0115 xiaoyao0115 commented May 16, 2026

What does this PR do ?

Add VarlenDataset for variable-length training over HF / local jsonl / parquet data

1. What this PR does

Adds a new dataset class VarlenDataset (and its MockVarlenDataset sibling) for variable-length training, gated by a new top-level flag --use-varlen-dataset. Designed for the packed-sequence (THD) path, both static (--sequence-packing-scheduler dp_balanced) and dynamic (--dynamic-context-parallel) variants.

Why a separate dataset class instead of extending --sft?

  • Supports running THD end-to-end directly from a Hugging Face dataset.
  • Each __getitem__ returns one tokenized sample in unpacked form (tokens, labels, loss_mask, position_ids, original_seq_len, padded_seq_len).
  • The upstream packing scheduler (dp_balanced or default_dynamic_cp) sees variable-length samples and packs them across the DP×CP grid up to --max-seqlen-per-dp-cp-rank.

This is what BasePackingScheduler.get_required_sample_keys() already expects, and what the existing comment in data_schedule_utils.py flags as the "ideal" dataset shape. SFTDataset triggers a (wasteful) unpack → repack round-trip via _unpack_batch; VarlenDataset skips it.
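
As a rough illustration of the unpacked sample shape described above, the sketch below builds one sample dict from a list of token ids. The helper name `make_unpacked_sample`, the `pad_multiple` parameter, and the next-token shift are assumptions made for illustration, not the PR's actual implementation; plain Python lists stand in for tensors.

```python
from typing import Dict, List, Union

Sample = Dict[str, Union[List[int], List[float], int]]


def make_unpacked_sample(token_ids: List[int], pad_id: int, pad_multiple: int = 4) -> Sample:
    """Build one unpacked sample (hypothetical helper, not the PR's code)."""
    # Next-token prediction: inputs are ids[:-1], targets are ids[1:].
    tokens, labels = token_ids[:-1], token_ids[1:]
    original_seq_len = len(tokens)
    # Round up to a multiple of pad_multiple (assumed alignment rule).
    padded_seq_len = -(-original_seq_len // pad_multiple) * pad_multiple
    n_pad = padded_seq_len - original_seq_len
    return {
        "tokens": tokens + [pad_id] * n_pad,
        "labels": labels + [pad_id] * n_pad,
        # Loss is computed only on the real (unpadded) positions.
        "loss_mask": [1.0] * original_seq_len + [0.0] * n_pad,
        "position_ids": list(range(padded_seq_len)),
        "original_seq_len": original_seq_len,
        "padded_seq_len": padded_seq_len,
    }


sample = make_unpacked_sample([5, 6, 7, 8, 9, 10], pad_id=0)
print(sample["original_seq_len"], sample["padded_seq_len"])  # 5 8
```

Because every sample carries its own lengths, a scheduler can pack samples without first having to recover sequence boundaries.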

Three additional framework-level fixes rolled in to make the new path work cleanly without breaking --sft:

  1. _unpack_batch short-circuits when the sample already has padded_seq_len (no cu_seqlens-based slicing needed). --sft path unchanged.
  2. data_samplers.py uses identity collate_fn for all packing schedulers, not just --dynamic-context-parallel. The previous gate excluded dp_balanced users.
  3. pretrain_gpt.py:get_batch widens the is_packed_sequence check from args.sft to args.sft or args.use_varlen_dataset.
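
The short-circuit in fix 1 can be pictured roughly as follows. This is a minimal sketch under assumptions: the function name `unpack_batch_sketch`, the dict keys, and the list-based slicing are illustrative only and do not reproduce the real `_unpack_batch` signature.

```python
from typing import Dict, List


def unpack_batch_sketch(batch: List[Dict]) -> List[Dict]:
    """Return one dict per sample; skip slicing for already-unpacked samples."""
    out: List[Dict] = []
    for sample in batch:
        if "padded_seq_len" in sample:
            # Already one sample per dict (VarlenDataset-style): pass through.
            out.append(sample)
            continue
        # Packed sample (SFT-style): slice by cumulative sequence boundaries.
        cu = sample["cu_seqlens"]
        for start, end in zip(cu[:-1], cu[1:]):
            out.append({"tokens": sample["tokens"][start:end]})
    return out


packed = {"tokens": [1, 2, 3, 4, 5], "cu_seqlens": [0, 2, 5]}
unpacked = {"tokens": [7, 8], "padded_seq_len": 2}
result = unpack_batch_sketch([packed, unpacked])
print([r["tokens"] for r in result])  # [[1, 2], [3, 4, 5], [7, 8]]
```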

Three validate-args rules govern the new flag:

  • --use-varlen-dataset and --sft are mutually exclusive (both select the packed-sequence dataset family).
  • --use-varlen-dataset with --mock-data is allowed (routes to MockVarlenDataset, configured via --varlen-mock-dataset-config-json).
  • --use-varlen-dataset auto-picks a packing scheduler when none is given: dp_balanced by default, or default_dynamic_cp when --dynamic-context-parallel is set. --varlen-bshd-validation opts out of the packing path entirely.
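
The three rules above could be expressed along these lines. This is a sketch, not the code in `arguments.py`: the function `resolve_varlen_args` and the helper `make_args` are hypothetical, and the attribute names are assumed to be the flag names with dashes replaced by underscores.

```python
from argparse import Namespace


def make_args(**overrides) -> Namespace:
    """Hypothetical helper: default flag values for the sketch."""
    defaults = dict(
        use_varlen_dataset=True,
        sft=False,
        varlen_bshd_validation=False,
        dynamic_context_parallel=False,
        sequence_packing_scheduler=None,
    )
    defaults.update(overrides)
    return Namespace(**defaults)


def resolve_varlen_args(args: Namespace) -> Namespace:
    """Apply the flag rules described above (illustrative only)."""
    if not args.use_varlen_dataset:
        return args
    assert not args.sft, "--use-varlen-dataset and --sft are mutually exclusive"
    if args.varlen_bshd_validation:
        # BSHD reference mode opts out of packing entirely.
        assert not args.dynamic_context_parallel
        assert args.sequence_packing_scheduler is None
        return args
    if args.sequence_packing_scheduler is None:
        # Auto-pick a scheduler only when the user did not choose one.
        args.sequence_packing_scheduler = (
            "default_dynamic_cp" if args.dynamic_context_parallel else "dp_balanced"
        )
    return args


print(resolve_varlen_args(make_args()).sequence_packing_scheduler)  # dp_balanced
```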

Files touched

megatron/training/datasets/varlen_dataset.py   (new, ~340 lines)
megatron/training/arguments.py                  +51   new args group + validate asserts
megatron/training/datasets/data_samplers.py     +6    identity collate for all scheduler paths
megatron/core/datasets/data_schedule_utils.py   +23   _unpack_batch short-circuit
megatron/core/datasets/gpt_dataset.py           +5    varlen_mock_dataset_config_json field
pretrain_gpt.py                                 +13   dataset_type dispatch + is_packed_sequence

Total: 5 modified + 1 new, ~96 line diff plus the new file.

2. How to use it

--use-varlen-dataset reuses the existing --data-path argument. Three input sources, all auto-detected:

# HuggingFace Hub repo id (auto-downloaded by `datasets.load_dataset`)
--use-varlen-dataset --data-path Yukang/LongAlpaca-12k
--use-varlen-dataset --data-path HuggingFaceH4/no_robots
--use-varlen-dataset --data-path databricks/databricks-dolly-15k

# Local parquet
--use-varlen-dataset --data-path /path/to/dataset.parquet

# Local jsonl
--use-varlen-dataset --data-path /path/to/dataset.jsonl

A sequence packing scheduler is auto-selected: dp_balanced (static) by default, or default_dynamic_cp when --dynamic-context-parallel is passed. To override either default, pass --sequence-packing-scheduler explicitly.

Supported dataset schemas (auto-detected from column names)

Each jsonl line / parquet row / HF Hub row must match one of:

A. Alpaca / Dolly style — at least one of instruction | prompt | query | question, plus one of output | response | completion | answer. Optional supplementary context: input | context.

{"instruction": "Summarize this paper.", "input": "Paper text...", "output": "..."}
{"instruction": "Who wrote 1984?", "context": "1984 was written...", "response": "Orwell"}
{"prompt": "Q?", "response": "A."}

B. ShareGPT style: conversations column with {"from": ..., "value": ...} entries.

{"conversations": [
    {"from": "human", "value": "Hi"},
    {"from": "gpt",   "value": "Hello"}
]}

from is mapped to chat-template roles via a small dict (human/user -> user, gpt/assistant/model/chatgpt/bing/bard -> assistant, tool/function/observation -> tool); unknown speakers fall back to user.

C. OpenAI messages style: messages column with {"role": ..., "content": ...} entries.

{"messages": [
    {"role": "system",    "content": "Be terse."},
    {"role": "user",      "content": "Hi"},
    {"role": "assistant", "content": "Hello"}
]}

Detection priority: messages > conversations > alpaca-synonyms. Unrecognized columns raise a clear ValueError.
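
The detection priority and the ShareGPT role mapping can be sketched as below. The function names `detect_schema` and `sharegpt_to_messages` are hypothetical, and the sketch covers only the three schemas described in this section; it is not the PR's converter code.

```python
from typing import Dict, Iterable, List

PROMPT_KEYS = {"instruction", "prompt", "query", "question"}
RESPONSE_KEYS = {"output", "response", "completion", "answer"}

# Speaker names from the description above, mapped to chat-template roles.
ROLE_MAP = {
    "human": "user", "user": "user",
    "gpt": "assistant", "assistant": "assistant", "model": "assistant",
    "chatgpt": "assistant", "bing": "assistant", "bard": "assistant",
    "tool": "tool", "function": "tool", "observation": "tool",
}


def detect_schema(columns: Iterable[str]) -> str:
    """Apply the priority messages > conversations > alpaca synonyms."""
    cols = set(columns)
    if "messages" in cols:
        return "openai-messages"
    if "conversations" in cols:
        return "sharegpt"
    if cols & PROMPT_KEYS and cols & RESPONSE_KEYS:
        return "alpaca"
    raise ValueError(f"Unrecognized dataset columns: {sorted(cols)}")


def sharegpt_to_messages(conversations: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Convert ShareGPT turns to role/content messages; unknown speakers become user."""
    return [
        {"role": ROLE_MAP.get(turn["from"], "user"), "content": turn["value"]}
        for turn in conversations
    ]


print(detect_schema(["instruction", "context", "response"]))  # alpaca
```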

Known compatible HuggingFace datasets

The schemas above cover most public SFT corpora. Examples that work out of the box (no preprocessing, just --data-path owner/repo):

| HF repo id | Schema | Approx size | Notes |
| --- | --- | --- | --- |
| Yukang/LongAlpaca-12k | alpaca | 12 k rows / 500 MB | Long-context SFT, many samples > 16k tokens |
| tatsu-lab/alpaca | alpaca | 52 k rows | The canonical Stanford Alpaca dataset |
| vicgalle/alpaca-gpt4 | alpaca | 52 k rows | GPT-4 regenerated Alpaca |
| databricks/databricks-dolly-15k | alpaca (instruction + context + response) | 15 k rows | Dolly fields auto-handled via synonyms |
| HuggingFaceH4/no_robots | openai-messages | 10 k rows / 22 MB parquet | Multi-turn chat |
| Open-Orca/OpenOrca | sharegpt-style (column conversations) | ~3 M rows | Large; expect long load |
| Open-Orca/SlimOrca | sharegpt | ~500 k rows | Filtered subset of OpenOrca |
| lmsys/lmsys-chat-1m | openai-messages | 1 M rows | Multi-turn user/assistant |
| cognitivecomputations/SystemChat-2.0 | sharegpt | ~7 k rows | System-prompt-led conversations |
| nvidia/HelpSteer2 | alpaca-like (prompt + response) | ~10 k rows | Picked up via the prompt/response synonyms |

Datasets explicitly not supported (would raise on schema detect):

  • OpenAssistant/oasst1 — tree-structured conversation graph
  • Anthropic/hh-rlhf — preference pairs (chosen / rejected), not a single conversation per row
  • Multi-modal SFT corpora (content stored as a list of image / text parts)

Mock mode (for benchmarking)

--use-varlen-dataset --mock-data
--use-varlen-dataset --mock-data --varlen-mock-dataset-config-json \
  '{"mode":"distribution","type":"lognormal","min_seq_len":1024,"max_seq_len":8192,"mean_seq_len":4096,"lognormal_sigma":1.2}'

Three mock modes (mirroring --sft-mock-dataset-config-json):

  • distribution (lognormal seq-length sampling)
  • file (per-line lengths from a CSV)
  • verification (real tokens from an IndexedDataset, with lognormal sampled lengths)
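
For intuition about the distribution mode, here is a minimal sketch of lognormal length sampling with the parameters from the example config above. It assumes that mean_seq_len is the mean of the distribution and that out-of-range draws are clamped; the PR may parameterize or resample differently, and `sample_seq_len` is a hypothetical name.

```python
import math
import random


def sample_seq_len(
    rng: random.Random,
    min_seq_len: int,
    max_seq_len: int,
    mean_seq_len: float,
    lognormal_sigma: float,
) -> int:
    """Draw one sequence length from a clamped lognormal (illustrative only)."""
    # For a lognormal, mean = exp(mu + sigma^2 / 2), so solve for mu.
    mu = math.log(mean_seq_len) - 0.5 * lognormal_sigma ** 2
    length = int(rng.lognormvariate(mu, lognormal_sigma))
    # Clamp into the configured range (assumed behavior).
    return max(min_seq_len, min(max_seq_len, length))


rng = random.Random(0)
lengths = [
    sample_seq_len(rng, min_seq_len=1024, max_seq_len=8192,
                   mean_seq_len=4096, lognormal_sigma=1.2)
    for _ in range(1000)
]
print(min(lengths), max(lengths))
```

Clamping piles probability mass onto the two bounds, so the empirical mean of the clamped lengths will generally differ from mean_seq_len.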

BSHD reference mode (for THD numerical verification)

--varlen-bshd-validation bypasses the packed-sequence path entirely: each sample is right-padded to --seq-length, no cu_seqlens, no packing scheduler. Used to obtain a BSHD reference run from the same data and same tokenization that the THD path consumes, so the two can be compared for correctness. Incompatible with --dynamic-context-parallel and --sequence-packing-scheduler.

# Side-by-side run for THD correctness validation:
--use-varlen-dataset --data-path my_data.jsonl                                              # THD (with scheduler)
--use-varlen-dataset --data-path my_data.jsonl --varlen-bshd-validation                     # BSHD reference
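
The BSHD reference path can be sketched as right-padding every sample to one fixed length, which is what makes the samples stackable by an ordinary collate function. The helper `right_pad_sample` is hypothetical, and truncating over-long samples is an assumption not stated in this PR.

```python
from typing import Dict, List


def right_pad_sample(tokens: List[int], labels: List[int], seq_length: int, pad_id: int) -> Dict[str, list]:
    """Right-pad one sample to seq_length (illustrative, list-based)."""
    valid_len = min(len(tokens), seq_length)  # truncation is assumed here
    n_pad = seq_length - valid_len
    return {
        "tokens": tokens[:valid_len] + [pad_id] * n_pad,
        "labels": labels[:valid_len] + [pad_id] * n_pad,
        # Mask the padded tail by position.
        "loss_mask": [1.0] * valid_len + [0.0] * n_pad,
        "position_ids": list(range(seq_length)),
    }


batch = [
    right_pad_sample([1, 2, 3], [2, 3, 4], seq_length=6, pad_id=0),
    right_pad_sample([1, 2, 3, 4, 5], [2, 3, 4, 5, 6], seq_length=6, pad_id=0),
]
# Every field now has the same length across samples.
print({len(s["tokens"]) for s in batch})  # {6}
```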

Tokenizer requirement

Same as --sft: needs a tokenizer with tokenize_conversation support. Pass --tokenizer-type SFTTokenizer --sft-tokenizer-prompt-format {default | nemotron-h-aligned | nemotron-nano-v2 | identity} along with --tokenizer-model <hf-tokenizer-dir>.

Limitations (raise rather than silent-mishandle)

  • Tree-structured (e.g. OpenAssistant oasst1) or chosen/rejected preference datasets are not supported.
  • Multi-modal samples (content as a list of image/text parts) are not supported.
  • HF Hub repos: only split="train" is loaded. Export to a local jsonl/parquet first if your dataset's primary split is named differently.

Issue tracking

For PRs from open-source community contributors:

  • New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
  • Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

@xiaoyao0115 xiaoyao0115 self-assigned this May 16, 2026
@xiaoyao0115 xiaoyao0115 requested review from a team as code owners May 16, 2026 21:59
@copy-pr-bot

copy-pr-bot Bot commented May 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread megatron/training/arguments.py Outdated
Comment on lines +1704 to +1714
# BSHD reference mode: each sample is right-padded to
# sequence_length and shipped through the default non-packed
# pipeline. No scheduler / dynamic-cp involved.
assert not args.dynamic_context_parallel, (
    "--varlen-bshd-validation is incompatible with "
    "--dynamic-context-parallel (BSHD mode is not packed)."
)
assert args.sequence_packing_scheduler is None, (
    "--varlen-bshd-validation does not use a sequence packing "
    "scheduler; drop --sequence-packing-scheduler."
)
Contributor

Can we move these checks into gpt_dataset.py?


sft_mock_dataset_config_json: Optional[str] = None

varlen_mock_dataset_config_json: Optional[str] = None
Contributor

Can we support either a JSON string or a JSON file path for sft_mock_dataset_config_json and varlen_mock_dataset_config_json? It is a little bit annoying for users to pass a JSON string via the CLI.

Contributor Author

ok, i'll make the change~
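
One possible shape for the change requested in this thread is sketched below: treat the argument as a file path if a file with that name exists, and otherwise parse it as an inline JSON string. The function name `load_json_config` is hypothetical and this is not the change that was eventually committed.

```python
import json
import os
import tempfile
from typing import Any, Optional


def load_json_config(value: Optional[str]) -> Optional[Any]:
    """Accept either a path to a JSON file or an inline JSON string."""
    if value is None:
        return None
    if os.path.isfile(value):
        with open(value, "r", encoding="utf-8") as f:
            return json.load(f)
    # Not an existing file: parse as inline JSON (raises on malformed input).
    return json.loads(value)


# Inline string, as passed on the CLI today.
inline = load_json_config('{"mode": "distribution", "min_seq_len": 1024}')

# Same loader reading from a file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mock_config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"mode": "file"}, f)
    from_file = load_json_config(path)

print(inline["mode"], from_file["mode"])  # distribution file
```

A mistyped path falls through to `json.loads` and fails with a JSON parse error, so a real implementation might want a clearer message for that case.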

Comment thread megatron/training/datasets/data_samplers.py Outdated
Signed-off-by: tailaim <tailaim@nvidia.com>
Signed-off-by: tailaim <tailaim@nvidia.com>
@xiaoyao0115
Contributor Author

/ok to test 1f8946a

@NVIDIA NVIDIA deleted a comment from copy-pr-bot Bot May 29, 2026
Comment on lines 1563 to 1571
if args.sequence_packing_scheduler is not None:
    if args.sequence_packing_scheduler == 'dp_balanced':
        total_cp_ranks = args.context_parallel_size
    else:
        total_cp_ranks = args.data_parallel_size * args.context_parallel_size
    assert total_cp_ranks * args.max_seqlen_per_dp_cp_rank >= args.seq_length, (
        f'Packed sequence buffer size ({total_cp_ranks * args.max_seqlen_per_dp_cp_rank}) '
        f'must be >= single sequence max length ({args.seq_length})'
    )
Contributor

@yuzhongw-nvidia yuzhongw-nvidia Jun 1, 2026

Please move these checks below Line 1728. See https://github.com/NVIDIA/Megatron-LM/pull/4832/changes#r3333081617.

# Identity collate for VarlenDataset and packing-scheduler paths;
# they emit one variable-length dict per sample, not stack-able by
# the default collate.
if args.use_varlen_dataset or args.sequence_packing_scheduler is not None:
Contributor

@yuzhongw-nvidia yuzhongw-nvidia Jun 1, 2026

Suggested change
- if args.use_varlen_dataset or args.sequence_packing_scheduler is not None:
+ if (args.use_varlen_dataset and not args.varlen_bshd_validation) or args.sequence_packing_scheduler is not None:

--varlen-bshd-validation disables the scheduler, but the DataLoader still uses identity collate because args.use_varlen_dataset is true.
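
The effect of the suggested condition can be checked with a small truth-table sketch. The names `uses_identity_collate` and `identity_collate` are hypothetical stand-ins; only the boolean expression is taken from the suggestion above.

```python
from argparse import Namespace
from typing import List


def identity_collate(samples: List[dict]) -> List[dict]:
    """Return the list of per-sample dicts unchanged (no stacking)."""
    return samples


def uses_identity_collate(args: Namespace) -> bool:
    """Suggested gate: BSHD validation mode keeps the default collate."""
    return (
        args.use_varlen_dataset and not args.varlen_bshd_validation
    ) or args.sequence_packing_scheduler is not None


thd = Namespace(use_varlen_dataset=True, varlen_bshd_validation=False,
                sequence_packing_scheduler="dp_balanced")
bshd = Namespace(use_varlen_dataset=True, varlen_bshd_validation=True,
                 sequence_packing_scheduler=None)
print(uses_identity_collate(thd), uses_identity_collate(bshd))  # True False
```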

# earlier in ``validate_args``).
# * Otherwise fall back to ``dp_balanced`` (static packing).
if args.sequence_packing_scheduler is None:
    args.sequence_packing_scheduler = 'dp_balanced'
Contributor

@yuzhongw-nvidia yuzhongw-nvidia Jun 1, 2026

The normal --use-varlen-dataset path auto-selects dp_balanced after the generic sequence-packing validation has already run. See https://github.com/NVIDIA/Megatron-LM/pull/4832/changes#r3333048405.
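
The ordering concern can be reproduced in isolation. In the sketch below, `check_packed_buffer` mirrors the buffer-size check quoted earlier in this thread, wrapped in a hypothetical function that reports whether the check actually ran; the argument values are made up for illustration.

```python
from argparse import Namespace


def check_packed_buffer(args: Namespace) -> bool:
    """Run the buffer-size check; return False if it was skipped."""
    if args.sequence_packing_scheduler is None:
        return False  # nothing to validate yet
    if args.sequence_packing_scheduler == "dp_balanced":
        total_cp_ranks = args.context_parallel_size
    else:
        total_cp_ranks = args.data_parallel_size * args.context_parallel_size
    buffer_size = total_cp_ranks * args.max_seqlen_per_dp_cp_rank
    assert buffer_size >= args.seq_length, (
        f"Packed sequence buffer size ({buffer_size}) "
        f"must be >= single sequence max length ({args.seq_length})"
    )
    return True


args = Namespace(sequence_packing_scheduler=None, context_parallel_size=2,
                 data_parallel_size=4, max_seqlen_per_dp_cp_rank=4096,
                 seq_length=16384)

# Check runs first, while the scheduler is still unset: silently skipped.
checked_before = check_packed_buffer(args)

# Auto-selection happens afterwards; 2 * 4096 = 8192 < 16384 is now invalid.
args.sequence_packing_scheduler = "dp_balanced"
try:
    check_packed_buffer(args)
    caught_after = False
except AssertionError:
    caught_after = True

print(checked_before, caught_after)  # False True
```

Running the check after the default scheduler has been resolved would surface the misconfiguration at argument-validation time.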

# tokenizers like Qwen3). Fall back to eod for padding — irrelevant
# for loss because loss_mask zeros pad positions out.
eod = tokenizer.eod
pad = tokenizer.pad if tokenizer.pad is not None else eod
Contributor

When the tokenizer has no pad token, VarlenDataset falls back to pad = eod and then masks loss by token value with loss_mask[labels == pad] = 0.0.

The fallback is explicitly intended for tokenizers without an explicit pad token, but when pad == eod, value-based masking removes every EOD/EOS target from the loss, including real sequence-ending EOD tokens that are not padding. This silently changes training semantics for common raw pretraining tokenizers without a pad token: the model no longer learns the end-of-document target. The BSHD branch has the same value-based masking.

Suggestion: Track the padded tail by position instead of masking all labels equal to the pad id. For example, save the valid shifted length before padding and then zero loss_mask[valid_len:], while still masking IGNORE_INDEX. Mirror the same behavior in MockVarlenDataset, which currently does not apply the real dataset's pad fallback.
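
The difference between the two masking strategies can be seen on a toy example. This is a list-based sketch, not the PR's tensor code; the value -100 for IGNORE_INDEX and the token ids are assumptions.

```python
from typing import List

IGNORE_INDEX = -100  # assumed value for illustration
EOD = 2              # toy end-of-document id, also used as the pad fallback


def value_based_mask(labels: List[int], pad_id: int) -> List[float]:
    """Mask every label equal to pad_id (the behavior being questioned)."""
    return [0.0 if lab in (pad_id, IGNORE_INDEX) else 1.0 for lab in labels]


def position_based_mask(labels: List[int], valid_len: int) -> List[float]:
    """Mask only the padded tail, plus explicit IGNORE_INDEX positions."""
    return [
        1.0 if i < valid_len and lab != IGNORE_INDEX else 0.0
        for i, lab in enumerate(labels)
    ]


# Three real targets (the third is a genuine EOD), then two pad positions.
labels = [11, 12, EOD, EOD, EOD]
by_value = value_based_mask(labels, pad_id=EOD)
by_position = position_based_mask(labels, valid_len=3)

print(by_value)     # [1.0, 1.0, 0.0, 0.0, 0.0]  real EOD target is lost
print(by_position)  # [1.0, 1.0, 1.0, 0.0, 0.0]  real EOD target is kept
```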

@yuzhongw-nvidia
Contributor

yuzhongw-nvidia commented Jun 1, 2026

Hi @xiaoyao0115, thanks for your incredible work! I went through the PR with Codex and have several minor comments.

We also have a few extra comments about the unit tests, FYI.

  • tests/unit_tests/data/test_varlen_dataset.py covers schema conversion and local JSONL loading, but does not instantiate VarlenDataset.__getitem__() or MockVarlenDataset.__getitem__().
  • No test covers --varlen-bshd-validation through DataLoader + normal get_batch_on_this_tp_rank().
  • No test covers the full THD handoff from VarlenDataset output through _unpack_batch() / scheduler input contract.
  • Empty raw text rows are accepted by _raw_text_loader() as ""; depending on the tokenizer, tokenizer.tokenize("") can return an empty list, after which tokens_list[-1] would fail.
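
The last point can be guarded against with an explicit emptiness check before indexing. The sketch below assumes the `tokens_list[-1]` access is an end-of-document check; `tokenize_text_row` and the toy tokenizer are hypothetical, and whether to skip or raise on empty rows is left to the caller.

```python
from typing import Callable, List, Optional


def tokenize_text_row(text: str, tokenize: Callable[[str], List[int]], eod: int) -> Optional[List[int]]:
    """Tokenize one raw-text row; return None for rows with no tokens."""
    tokens = list(tokenize(text))
    if not tokens:
        # tokens[-1] would raise IndexError here, so bail out first.
        return None
    if tokens[-1] != eod:
        tokens.append(eod)
    return tokens


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer: one id per whitespace-separated word."""
    return [len(word) for word in text.split()]


print(tokenize_text_row("", toy_tokenize, eod=0))        # None
print(tokenize_text_row("ab cde", toy_tokenize, eod=0))  # [2, 3, 0]
```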

@yuzhongw-nvidia yuzhongw-nvidia self-requested a review June 2, 2026 09:39
Contributor

@Victarry Victarry left a comment

LGTM. Left a few minor suggestions.

Comment on lines 79 to 92
sft_mock_dataset_config_json: Optional[str] = None

varlen_mock_dataset_config_json: Optional[str] = None
"""Mock-dataset config (same JSON schema as ``sft_mock_dataset_config_json``)
used by the ``--use-varlen-dataset`` path; kept separate so the varlen path
does not implicitly inherit SFT-specific knobs."""

varlen_bshd_validation: bool = False
"""When True, :class:`VarlenDataset.__getitem__` emits SBHD samples padded
to ``sequence_length`` (no ``cu_seqlens`` / ``original_seq_len`` /
``padded_seq_len``), bypassing the packed-sequence path. Used to obtain a
BSHD reference run that mirrors the THD path's tokenization but skips all
packing — useful for THD numerical-correctness validation."""
"""This config provides the necessary information for the mock dataset."""
Contributor

[SUGGESTION] Orphaned field docstring after inserting the new varlen config fields.

This PR inserts varlen_mock_dataset_config_json and varlen_bshd_validation (each with its own docstring) between sft_mock_dataset_config_json and the docstring that originally documented it. As a result:

  • sft_mock_dataset_config_json (line 79) no longer has a docstring.
  • """This config provides the necessary information for the mock dataset.""" (line 92) now dangles after varlen_bshd_validation as a no-op string expression, where it no longer describes the field it was written for and is misleading next to a bool flag.

Suggestion: Move the docstring back under sft_mock_dataset_config_json, or delete it if redundant.
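
One way the first option might look, shown on a stand-in dataclass rather than the real config class in gpt_dataset.py (the class name `MockConfigSketch` is hypothetical and the new docstrings are abbreviated):

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class MockConfigSketch:
    """Stand-in for the real config class; only the three fields in question."""

    sft_mock_dataset_config_json: Optional[str] = None
    """This config provides the necessary information for the mock dataset."""

    varlen_mock_dataset_config_json: Optional[str] = None
    """Mock-dataset config used by the ``--use-varlen-dataset`` path."""

    varlen_bshd_validation: bool = False
    """When True, emit fixed-length padded samples and bypass packing."""


field_names = [f.name for f in fields(MockConfigSketch)]
print(field_names)
```

Each string literal now directly follows the field it documents, which is the convention documentation tools rely on for attribute docstrings.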


@property
def schema_name(self) -> str:
    """Detected schema name: ``alpaca`` / ``sharegpt`` / ``openai-messages``."""
Contributor

[SUGGESTION] Docstrings omit the pretrain-text schema that _select_converter actually returns.

_select_converter (line 206) returns a fourth schema, pretrain-text (the text-column fallback), and _raw_text_loader returns a plain string rather than a messages list. Several higher-level docstrings were not updated when this schema was added and are now inaccurate:

  • This schema_name docstring (line 292) lists only alpaca / sharegpt / openai-messages.
  • The module docstring (line 19) and the VarlenLowLevelDataset class docstring say "three common instruction-tuning layouts ... normalized to the messages list format" — but there are four schemas, and pretrain-text is not normalized to a messages list.

Suggestion: Add pretrain-text to these docstrings and note that it returns a raw string handled separately in VarlenDataset.__getitem__.

