varlendataset for thd e2e and benchmark#4832
Conversation
df8fad0 to
93d806a
Compare
| # BSHD reference mode: each sample is right-padded to | ||
| # sequence_length and shipped through the default non-packed | ||
| # pipeline. No scheduler / dynamic-cp involved. | ||
| assert not args.dynamic_context_parallel, ( | ||
| "--varlen-bshd-validation is incompatible with " | ||
| "--dynamic-context-parallel (BSHD mode is not packed)." | ||
| ) | ||
| assert args.sequence_packing_scheduler is None, ( | ||
| "--varlen-bshd-validation does not use a sequence packing " | ||
| "scheduler; drop --sequence-packing-scheduler." | ||
| ) |
There was a problem hiding this comment.
Can we move these checks into gpt_dataset.py?
|
|
||
| sft_mock_dataset_config_json: Optional[str] = None | ||
|
|
||
| varlen_mock_dataset_config_json: Optional[str] = None |
There was a problem hiding this comment.
Can we support either a JSON string or a JSON file path for sft_mock_dataset_config_json and varlen_mock_dataset_config_json? It is a little bit annoying for users to pass a JSON string via the CLI.
There was a problem hiding this comment.
ok, i'll make the change~
Signed-off-by: tailaim <tailaim@nvidia.com>
…dolma3_longmino_mix-100B-1125 Signed-off-by: tailaim <tailaim@nvidia.com>
Signed-off-by: tailaim <tailaim@nvidia.com>
93d806a to
1f8946a
Compare
|
/ok to test 1f8946a |
| if args.sequence_packing_scheduler is not None: | ||
| if args.sequence_packing_scheduler == 'dp_balanced': | ||
| total_cp_ranks = args.context_parallel_size | ||
| else: | ||
| total_cp_ranks = args.data_parallel_size * args.context_parallel_size | ||
| assert total_cp_ranks * args.max_seqlen_per_dp_cp_rank >= args.seq_length, ( | ||
| f'Packed sequence buffer size ({total_cp_ranks * args.max_seqlen_per_dp_cp_rank}) ' | ||
| f'must be >= single sequence max length ({args.seq_length})' | ||
| ) |
There was a problem hiding this comment.
Please move these checks below Line 1728. See https://github.com/NVIDIA/Megatron-LM/pull/4832/changes#r3333081617.
| # Identity collate for VarlenDataset and packing-scheduler paths; | ||
| # they emit one variable-length dict per sample, not stack-able by | ||
| # the default collate. | ||
| if args.use_varlen_dataset or args.sequence_packing_scheduler is not None: |
There was a problem hiding this comment.
| if args.use_varlen_dataset or args.sequence_packing_scheduler is not None: | |
| if (args.use_varlen_dataset and not args.varlen_bshd_validation) or args.sequence_packing_scheduler is not None: |
--varlen-bshd-validation disables the scheduler, but the DataLoader still uses identity collate because args.use_varlen_dataset is true.
| # earlier in ``validate_args``). | ||
| # * Otherwise fall back to ``dp_balanced`` (static packing). | ||
| if args.sequence_packing_scheduler is None: | ||
| args.sequence_packing_scheduler = 'dp_balanced' |
There was a problem hiding this comment.
The normal --use-varlen-dataset path auto-selects dp_balanced after the generic sequence-packing validation has already run. See https://github.com/NVIDIA/Megatron-LM/pull/4832/changes#r3333048405.
| # tokenizers like Qwen3). Fall back to eod for padding — irrelevant | ||
| # for loss because loss_mask zeros pad positions out. | ||
| eod = tokenizer.eod | ||
| pad = tokenizer.pad if tokenizer.pad is not None else eod |
There was a problem hiding this comment.
When the tokenizer has no pad token, VarlenDataset falls back to pad = eod and then masks loss by token value with loss_mask[labels == pad] = 0.0.
The fallback is explicitly intended for tokenizers without an explicit pad token, but when pad == eod, value-based masking removes every EOD/EOS target from the loss, including real sequence-ending EOD tokens that are not padding. This silently changes training semantics for common raw pretraining tokenizers without a pad token: the model no longer learns the end-of-document target. The BSHD branch has the same value-based masking.
Suggestion: Track the padded tail by position instead of masking all labels equal to the pad id. For example, save the valid shifted length before padding and then zero loss_mask[valid_len:], while still masking IGNORE_INDEX. Mirror the same behavior in MockVarlenDataset, which currently does not apply the real dataset's pad fallback.
|
Hi @xiaoyao0115 , thanks for your incredible work! I discuss with codex and come to several minor comments. Besides, we have some extra comments about the UT, FYI.
|
Victarry
left a comment
There was a problem hiding this comment.
LGTM. Left few minor suggestions.
| sft_mock_dataset_config_json: Optional[str] = None | ||
|
|
||
| varlen_mock_dataset_config_json: Optional[str] = None | ||
| """Mock-dataset config (same JSON schema as ``sft_mock_dataset_config_json``) | ||
| used by the ``--use-varlen-dataset`` path; kept separate so the varlen path | ||
| does not implicitly inherit SFT-specific knobs.""" | ||
|
|
||
| varlen_bshd_validation: bool = False | ||
| """When True, :class:`VarlenDataset.__getitem__` emits SBHD samples padded | ||
| to ``sequence_length`` (no ``cu_seqlens`` / ``original_seq_len`` / | ||
| ``padded_seq_len``), bypassing the packed-sequence path. Used to obtain a | ||
| BSHD reference run that mirrors the THD path's tokenization but skips all | ||
| packing — useful for THD numerical-correctness validation.""" | ||
| """This config provides the necessary information for the mock dataset.""" |
There was a problem hiding this comment.
[SUGGESTION] Orphaned field docstring after inserting the new varlen config fields.
This PR inserts varlen_mock_dataset_config_json and varlen_bshd_validation (each with its own docstring) between sft_mock_dataset_config_json and the docstring that originally documented it. As a result:
sft_mock_dataset_config_json(line 79) no longer has a docstring."""This config provides the necessary information for the mock dataset."""(line 92) now dangles aftervarlen_bshd_validationas a no-op string expression, where it no longer describes the field it was written for and is misleading next to a bool flag.
Suggestion: Move the docstring back under sft_mock_dataset_config_json, or delete it if redundant.
|
|
||
| @property | ||
| def schema_name(self) -> str: | ||
| """Detected schema name: ``alpaca`` / ``sharegpt`` / ``openai-messages``.""" |
There was a problem hiding this comment.
[SUGGESTION] Docstrings omit the pretrain-text schema that _select_converter actually returns.
_select_converter (line 206) returns a fourth schema, pretrain-text (the text-column fallback), and _raw_text_loader returns a plain string rather than a messages list. Several higher-level docstrings were not updated when this schema was added and are now inaccurate:
- This
schema_namedocstring (line 292) lists onlyalpaca / sharegpt / openai-messages. - The module docstring (line 19) and the
VarlenLowLevelDatasetclass docstring say "three common instruction-tuning layouts ... normalized to the messages list format" — but there are four schemas, andpretrain-textis not normalized to a messages list.
Suggestion: Add pretrain-text to these docstrings and note that it returns a raw string handled separately in VarlenDataset.__getitem__.
What does this PR do ?
Add
VarlenDatasetfor variable-length training over HF / local jsonl / parquet data1. What this PR does
Adds a new dataset class
VarlenDataset(and itsMockVarlenDatasetsibling) for variable-length training, gated by a new top-level flag--use-varlen-dataset. Designed for the packed-sequence (THD) path, both static (--sequence-packing-scheduler dp_balanced) and dynamic (--dynamic-context-parallel) variants.Why a separate dataset class instead of extending
--sft?__getitem__returns one tokenized sample in unpacked form (tokens,labels,loss_mask,position_ids,original_seq_len,padded_seq_len).dp_balancedordefault_dynamic_cp) sees variable-length samples and packs them across the DP×CP grid up to--max-seqlen-per-dp-cp-rank.This is what
BasePackingScheduler.get_required_sample_keys()already expects, and what the existing comment indata_schedule_utils.pyflags as the "ideal" dataset shape.SFTDatasettriggers a (wasteful) unpack → repack round-trip via_unpack_batch;VarlenDatasetskips it.Three additional framework-level fixes rolled in to make the new path work cleanly without breaking
--sft:_unpack_batchshort-circuits when the sample already haspadded_seq_len(no cu_seqlens-based slicing needed).--sftpath unchanged.data_samplers.pyuses identitycollate_fnfor all packing schedulers, not just--dynamic-context-parallel. The previous gate excludeddp_balancedusers.pretrain_gpt.py:get_batchwidens theis_packed_sequencecheck fromargs.sfttoargs.sft or args.use_varlen_dataset.Three validate-args asserts guard the new flag:
--use-varlen-dataset⊥--sft(both select the packed-sequence dataset family).--use-varlen-dataset⊥--mock-datais allowed (routes toMockVarlenDataset, configured via--varlen-mock-dataset-config-json).--use-varlen-datasetauto-picks a packing scheduler when none is given:dp_balancedby default, ordefault_dynamic_cpwhen--dynamic-context-parallelis set.--varlen-bshd-validationopts out of the packing path entirely.Files touched
Total: 5 modified + 1 new, ~96 line diff plus the new file.
2. How to use it
--use-varlen-datasetreuses the existing--data-pathargument. Three input sources, all auto-detected:A sequence packing scheduler is auto-selected:
dp_balanced(static) by default, ordefault_dynamic_cpwhen--dynamic-context-parallelis passed. To override either default, pass--sequence-packing-schedulerexplicitly.Supported dataset schemas (auto-detected from column names)
Each jsonl line / parquet row / HF Hub row must match one of:
A. Alpaca / Dolly style — at least one of
instruction | prompt | query | question, plus one ofoutput | response | completion | answer. Optional supplementary context:input | context.{"instruction": "Summarize this paper.", "input": "Paper text...", "output": "..."} {"instruction": "Who wrote 1984?", "context": "1984 was written...", "response": "Orwell"} {"prompt": "Q?", "response": "A."}B. ShareGPT style —
conversationscolumn with{"from": ..., "value": ...}entries.{"conversations": [ {"from": "human", "value": "Hi"}, {"from": "gpt", "value": "Hello"} ]}fromis mapped to chat-template roles via a small dict (human/user→user,gpt/assistant/model/chatgpt/bing/bard→assistant,tool/function/observation→tool); unknown speakers fall back touser.C. OpenAI messages style —
messagescolumn with{"role": ..., "content": ...}entries.{"messages": [ {"role": "system", "content": "Be terse."}, {"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"} ]}Detection priority:
messages>conversations> alpaca-synonyms. Unrecognized columns raise a clearValueError.Known compatible HuggingFace datasets
The schemas above cover most public SFT corpora. Examples that work out of the box (no preprocessing, just
--data-path owner/repo):Yukang/LongAlpaca-12ktatsu-lab/alpacavicgalle/alpaca-gpt4databricks/databricks-dolly-15kHuggingFaceH4/no_robotsOpen-Orca/OpenOrcaconversations)Open-Orca/SlimOrcalmsys/lmsys-chat-1mcognitivecomputations/SystemChat-2.0nvidia/HelpSteer2prompt+response)prompt/responsesynonymsDatasets explicitly not supported (would raise on schema detect):
OpenAssistant/oasst1— tree-structured conversation graphAnthropic/hh-rlhf— preference pairs (chosen / rejected), not a single conversation per rowMock mode (for benchmarking)
--use-varlen-dataset --mock-data --use-varlen-dataset --mock-data --varlen-mock-dataset-config-json \ '{"mode":"distribution","type":"lognormal","min_seq_len":1024,"max_seq_len":8192,"mean_seq_len":4096,"lognormal_sigma":1.2}'Three mock modes (mirroring
--sft-mock-dataset-config-json):distribution(lognormal seq-length sampling)file(per-line lengths from a CSV)verification(real tokens from an IndexedDataset, with lognormal sampled lengths)BSHD reference mode (for THD numerical verification)
--varlen-bshd-validationbypasses the packed-sequence path entirely: each sample is right-padded to--seq-length, nocu_seqlens, no packing scheduler. Used to obtain a BSHD reference run from the same data and same tokenization that the THD path consumes, so the two can be compared for correctness. Incompatible with--dynamic-context-paralleland--sequence-packing-scheduler.Tokenizer requirement
Same as
--sft: needs a tokenizer withtokenize_conversationsupport. Pass--tokenizer-type SFTTokenizer --sft-tokenizer-prompt-format {default | nemotron-h-aligned | nemotron-nano-v2 | identity}along with--tokenizer-model <hf-tokenizer-dir>.Limitations (raise rather than silent-mishandle)
split="train"is loaded. Export to a local jsonl/parquet first if your dataset's primary split is named differently.Issue tracking
For PRs from open-source community contributors:
Linked issue:
Contribution process
Pre-checks
Code review
Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change
megatron/core, once all expert reviewers have approved, theFinal Reviewlabel is applied automatically and final reviewers are assigned.For PRs outside
megatron/core, this step is skipped.Step 3: Approved
Once all required reviewers have approved, the
Approvedlabel is applied automatically.Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.MRs are mergable after one approval by either
eharper@nvidia.comorzijiey@nvidia.com.