feat(benchmark): HSTU E2E training benchmark suite with progressive optimizations (#340)
Merged
shijieliu merged 12 commits into NVIDIA:main (Apr 14, 2026)
Conversation
Collaborator (Author): Need to rectify the benchmark result once #313 is done
Contributor
Greptile Summary: This PR delivers an HSTU E2E training benchmark suite (5 progressive experiments reaching 3.6× speedup on H100), alongside several correctness and performance fixes: a Triton attention mask fix (bitwise `&`/`|` instead of Python `and`/`or`), a `torch.distributed.gather` destination-rank fix for non-default DP groups, and D2H-sync elimination in the embedding path.
Confidence Score: 5/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant DL as DataLoader
    participant PP as TrainPipeline
    participant WD as StackDumpWatchdog
    participant EM as ShardedEmbedding
    participant MS as MainStream
    participant SS as SideStream(DP)
    participant NC as NCCL(World)
    PP->>WD: "watched_iter(count(step), timeout=60s)"
    WD-->>PP: heartbeat on each step
    PP->>DL: next batch
    PP->>PP: batch.num_loss_tokens()
    PP->>NC: all_reduce(global_tokens) [world group]
    NC-->>PP: global_tokens (sum all ranks)
    PP->>EM: forward(kjt)
    EM->>EM: compute_dp_length_per_key() [main stream, no D2H]
    par DP embedding on side stream
        EM->>SS: DataParallelEmbeddingCollection(kjt, lpk)
        SS-->>EM: dp_embeddings
    and MP embedding on main stream
        EM->>MS: ModelParallelEmbeddingCollection(kjt)
        MS-->>EM: mp_embeddings (awaitable.wait())
    end
    EM->>MS: wait_stream(side_stream)
    EM-->>PP: merged embeddings dict
    PP->>PP: model_fwd → losses
    PP->>PP: "local_loss_sum = sum(losses)"
    PP->>PP: "(local_loss_sum * dp_size / global_tokens).backward()"
    PP->>PP: optimizer.step()
    PP->>WD: heartbeat
```
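The loss scaling shown in the diagram (`local_loss_sum * dp_size / global_tokens`) makes the gradient all-reduce, which *averages* over DP ranks, come out to an exact global per-token mean even when ranks hold different token counts. A minimal sketch of that arithmetic (function and variable names are illustrative, not the pipeline's):

```python
# Sketch of the token-weighted loss normalization in the sequence diagram.
# DP gradient all-reduce averages over ranks (divide by dp_size), so scaling
# each rank's loss sum by dp_size / global_tokens yields a global per-token
# mean regardless of per-rank token imbalance.
def normalized_loss(local_loss_sums, local_token_counts):
    dp_size = len(local_loss_sums)
    # in the real pipeline this sum is an all_reduce done *before* forward
    global_tokens = sum(local_token_counts)
    scaled = [s * dp_size / global_tokens for s in local_loss_sums]
    # stand-in for the all-reduce-mean the gradient sync performs
    return sum(scaled) / dp_size

# Uneven ranks: rank0 has 3 tokens (loss sum 6.0), rank1 has 1 token (loss 2.0)
loss = normalized_loss([6.0, 2.0], [3, 1])
assert loss == 2.0  # (6.0 + 2.0) / 4 tokens
```

Note that a naive per-rank mean followed by averaging would give (2.0 + 2.0) / 2 = 2.0 only by coincidence here; with unequal per-token losses the two schemes diverge, which is why the global token count is needed.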
Reviews (5): Last reviewed commit: "docs: update E2E benchmark results (3933..."
Force-pushed b41488b → 3a8ad56
Force-pushed c1356f2 → 93b8cf4
shijieliu reviewed (twice), Apr 7, 2026
Force-pushed e2440fc → 8485813
Force-pushed fa2e12a → ee542c6
Force-pushed bbf6103 → e26061f
…watchdog

- Fix Triton kernel mask from Python `and`/`or` to bitwise `&`/`|` (correctness)
- Fix `torch.distributed.gather` destination rank for non-default DP groups
- Handle 0-D tensor in `num_contextuals` via `.view(-1)[0].item()`
- Fix shuffler race condition, batch counter, and batch a2a support
- Scale `total_candidates_seq_len` for TP and shuffler
- Add collective watchdog for hang detection
- Avoid D2H sync in `_Split2DJaggedFunction` by precomputing split lengths

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
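The mask fix above exists because Python's `and`/`or` evaluate the truthiness of a whole array/tensor rather than operating elementwise, while `&`/`|` are elementwise. A small sketch of the difference (NumPy stands in here; inside a Triton kernel the misuse can compile yet produce a wrong mask, which is the correctness bug the commit fixes):

```python
import numpy as np

# Elementwise mask construction: `&` combines the two conditions per element.
row = np.arange(4)
col = np.arange(4)

valid = (row < 3) & (col != 1)  # correct elementwise mask
assert valid.tolist() == [True, False, True, False]

# Python `and` instead calls bool() on the whole array, which is ambiguous
# for multi-element arrays and raises rather than masking elementwise.
try:
    bad = (row < 3) and (col != 1)
except ValueError:
    bad = None
assert bad is None
```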
- Eliminate D2H sync in DP embedding forward and enable DP/MP overlap
- Optimize loss normalization: move global token count before forward, defer loss all_reduce to log intervals only
- Add MEM_DEBUG instrumentation for GPU physical memory tracking

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
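The D2H-sync elimination pattern referenced here and in the `_Split2DJaggedFunction` fix boils down to: never read split lengths off the device on the hot path; compute host-side offsets once from batch metadata instead. A CPU sketch with NumPy standing in for torch (names are illustrative):

```python
import numpy as np

# If per-segment lengths lived only on the GPU, splitting would need
# .tolist()/.item() each step, forcing a blocking device-to-host sync.
# Precomputing host-side offsets from known batch metadata avoids that.
def split_jagged(values, host_lengths):
    # offsets built from plain host ints -> no per-step D2H transfer
    offsets = np.cumsum(host_lengths)[:-1]
    return np.split(values, offsets)

values = np.arange(10)
parts = split_jagged(values, [3, 4, 3])
assert [p.tolist() for p in parts] == [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9]]
```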
Add comprehensive HSTU training benchmark infrastructure:

- Experiment generation via gin configs with configurable optimizations (CUTLASS, recompute, shuffler, caching, TP, value distribution)
- SLURM batch submission with remote clone isolation (submit_remote.sh)
- Automated result analysis, comparison plots, and MFU heatmaps
- CUTLASS attention kernel micro-benchmark (benchmark_hstu_attn_mfu.py)
- GPU memory watchdog, cache hit rate debug logging, TFLOPS/MFU utils
- Zipf and uniform value distribution support for embedding keys
- E2E_BENCHMARK.md documentation with results and optimization space

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
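One way the Zipf value distribution mentioned above can be realized is rejection sampling of `numpy` Zipf draws into the vocabulary range. This is a sketch under that assumption; the benchmark's actual sampler may differ, and all names here are hypothetical:

```python
import numpy as np

# Zipf draws are unbounded from above, so out-of-vocab samples are rejected
# and redrawn until n in-range keys are collected.
def zipf_keys(n, vocab_size, a=1.2, seed=0):
    rng = np.random.default_rng(seed)
    keys = np.empty(0, dtype=np.int64)
    while keys.size < n:
        draw = rng.zipf(a, size=n)
        keys = np.concatenate([keys, draw[draw <= vocab_size]])
    return keys[:n] - 1  # shift to 0-based ids in [0, vocab_size)

keys = zipf_keys(1000, vocab_size=128)
assert keys.size == 1000 and keys.min() >= 0 and keys.max() < 128
```

A skewed key distribution like this is what makes embedding-cache hit rates meaningful to benchmark: hot keys dominate, so a small cache captures most lookups, unlike the uniform case.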
…M_DEBUG Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ogress() signature

BaseBatch gets a default num_loss_tokens() (labels count or batch_size fallback). GPTSIDBatch overrides with candidate-specific token counting. sid_gr training.py updated to handle the 3-value return from progress().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
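The `num_loss_tokens()` contract described in this commit can be sketched minimally as follows (attribute names are illustrative, not the repo's exact fields):

```python
# Minimal sketch of the default-plus-override pattern for num_loss_tokens().
class BaseBatch:
    def __init__(self, batch_size, labels=None):
        self.batch_size = batch_size
        self.labels = labels

    def num_loss_tokens(self):
        # default: count labels if present, else fall back to batch size
        if self.labels is not None:
            return len(self.labels)
        return self.batch_size

class GPTSIDBatch(BaseBatch):
    def __init__(self, batch_size, num_candidates_per_sample):
        super().__init__(batch_size)
        self.num_candidates_per_sample = num_candidates_per_sample

    def num_loss_tokens(self):
        # candidate-specific counting overrides the default
        return sum(self.num_candidates_per_sample)

assert BaseBatch(8).num_loss_tokens() == 8
assert BaseBatch(8, labels=[1, 0, 1]).num_loss_tokens() == 3
assert GPTSIDBatch(2, [5, 7]).num_loss_tokens() == 12
```

Having the default on `BaseBatch` is what lets the pipeline call `batch.num_loss_tokens()` uniformly before forward (for the global token all-reduce) without knowing the concrete batch type.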
Force-pushed e26061f → 84c9087
Force-pushed 84c9087 → a4b7e34
shijieliu approved these changes, Apr 14, 2026
JacoCheung added a commit to JacoCheung/recsys-examples that referenced this pull request, Apr 16, 2026:
Update from 04df536 to 65bad42 which adds fake tensor implementations for torch.export (hstu_ops_gpu.py). This was missing since PR NVIDIA#340 accidentally reverted the submodule pointer. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
shijieliu pushed a commit that referenced this pull request, Apr 17, 2026:
…363)

* fix: reduce Docker image layers to avoid overlay2 max depth limit

  Aggressively merge RUN instructions in the Dockerfile to reduce total layer count from ~126 to ~119. The inference image was hitting the overlay2 128-layer limit ("failed to register layer: max depth exceeded") on CI nodes.

  devel stage: 8 RUN + 1 COPY -> 4 RUN + 1 COPY (-4 layers)
  build stage: 4 RUN + 1 COPY -> 1 RUN + 1 COPY (-3 layers)

  FBGEMM and TorchRec kept as separate layers for build cache efficiency.

* ci: add pull_request_target trigger for auto CI on PR open/sync

* fix: correct imports for the fake ops wrapper used in export

* fix: remove invalid import of hstu.hstu_ops_gpu

  The module hstu.hstu_ops_gpu does not exist as a Python module. The C++ source hstu_ops_gpu.cpp compiles into hstu/fbgemm_gpu_experimental_hstu.so, not a separate hstu_ops_gpu submodule. This import was incorrectly added in PR #327 and causes ModuleNotFoundError in CI.

* fix: update FBGEMM submodule to include hstu_ops_gpu.py fake impl

  Update from 04df536 to 65bad42, which adds fake tensor implementations for torch.export (hstu_ops_gpu.py). This was missing since PR #340 accidentally reverted the submodule pointer.

* ci: allow /build with flags by matching prefix instead of exact string

* ci: remove pull_request_target trigger, keep only /build comment

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Junyi Qiu <junyiq@nvidia.com>
Summary
Add a comprehensive HSTU end-to-end training benchmark suite that measures the impact of progressive optimizations on H100 GPUs.
- Experiment generation (`generate_gin_config.py`), SLURM submission scripts, result analysis and visualization tools

Key code changes (non-benchmark)

- Avoid D2H sync in `_Split2DJaggedFunction` by precomputing split lengths
- 0-D tensor handling for `num_contextuals`, empty batch handling
- Scale `total_candidates_seq_len` for TP and shuffler compatibility
- `num_loss_tokens()` added to `BaseBatch`/`GPTSIDBatch`, sid_gr `progress()` signature updated
- `MEM_DEBUG` GPU memory instrumentation
- `batch_allgather` dense tensor padding guard for pre-padded tensors (BaseBatch dense tensor shape convention inconsistency with padded fields (num_candidates), #361)

Benchmark experiments
Known Issues
- `pad_and_allgather_batch` dense tensor padding assumes unpadded dim-0, but some fields (e.g. `num_candidates`) are pre-padded by the dataloader → BaseBatch dense tensor shape convention inconsistency with padded fields (num_candidates), #361

Test plan
- `num_loss_tokens()` added for SID-GR compatibility

Closes #307
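The MFU figures the benchmark reports can be sanity-checked with a back-of-the-envelope calculation. This is a sketch only: the 989 TFLOPS H100 BF16 dense peak is an assumption about the hardware, and the per-step FLOP figure below is invented for illustration, not a number from this PR:

```python
# MFU = sustained model TFLOP/s divided by the accelerator's peak TFLOP/s.
def mfu(model_tflops_per_step, step_time_s, peak_tflops=989.0):
    achieved = model_tflops_per_step / step_time_s  # TFLOP/s actually sustained
    return achieved / peak_tflops

# e.g. 150 TFLOPs of model math per step at 0.5 s/step on one H100
ratio = mfu(150.0, 0.5)
assert 0.30 < ratio < 0.31
```

Under this framing, the suite's 3.6× end-to-end speedup shows up directly as a 3.6× shorter `step_time_s` for the same model FLOPs, hence 3.6× higher MFU.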
CI