[Dev] Refactor CUDA graph API: decompose cuda_graph_scope into full_iteration impl, inference scope, and per-layer capture modules#4293
Conversation
/claude strict-review
Code Review Summary
CRITICAL: 0 | IMPORTANT: 4 | SUGGESTION: 4
Key Findings
IMPORTANT:
- Deprecated scope migration sets string instead of enum (`transformer_config.py:932`, `arguments.py:248`): The `setattr(self, attr, value)` migration path sets `inference_cuda_graph_granularity` to a string (`'transformer_block'`) rather than the declared `InferenceCudaGraphGranularity` enum. This works today because `normalize_inference_cuda_graph_granularity()` is called afterwards, but it creates a fragile order-dependent invariant. Recommend converting to the final type immediately or storing typed values in `CUDA_GRAPH_SCOPE_DEPRECATIONS` (see the sketch after this list).
- `full_iteration` impl allows `cuda_graph_impl="none"` in deprecated scope validation (`cuda_graph_config.py:127`): Passing only `--cuda-graph-scope full_iteration` (without an explicit `--cuda-graph-impl`) now silently enables full-iteration CUDA graphs, whereas in the old API this combination would have been a no-op (since graphs were disabled). This is a subtle backward-incompatible behavior change in an edge case.
- State leak in `megatron_rl_inference_mode` (`rl_utils.py:1838-1843`): `inference_cuda_graph_granularity` and `cuda_graph_scope` are overwritten on context entry but never restored on exit. Benign today since `cuda_graph_impl` is restored to `"none"`, but it leaves the config in an inconsistent state.
- `dynamic_engine.py` runs warmup passes for `full_iteration` without capturing graphs (`dynamic_engine.py:338-345`): The warning does not mention that warmup forward passes still run. Combined with the matching `inference/utils.py` change that enables `num_cuda_graphs` for `full_iteration`, this burns GPU time and memory unnecessarily during inference with the `full_iteration` impl.
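On the first finding, a minimal sketch of the "store typed values" option. The map layout and the enum's module and member names are assumptions for illustration, not the PR's actual code:

```python
from enum import Enum

# Stand-in for the PR's InferenceCudaGraphGranularity; the real member names
# and module path are not shown in this thread, so this enum is illustrative.
class InferenceCudaGraphGranularity(Enum):
    NONE = "none"
    TRANSFORMER_LAYER = "transformer_layer"
    TRANSFORMER_BLOCK = "transformer_block"

# Hypothetical layout: map each deprecated cuda_graph_scope value straight to
# (target_attr, typed_value), so setattr() never leaves a raw string behind
# and correctness no longer depends on the later normalization call.
CUDA_GRAPH_SCOPE_DEPRECATIONS = {
    "full_iteration_inference": (
        "inference_cuda_graph_granularity",
        InferenceCudaGraphGranularity.TRANSFORMER_BLOCK,
    ),
}

def migrate_deprecated_scope(config, deprecated_value):
    # Assign the final type immediately at the setattr call site.
    attr, typed_value = CUDA_GRAPH_SCOPE_DEPRECATIONS[deprecated_value]
    setattr(config, attr, typed_value)
```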
Overall Assessment
Risk: Low-Medium. The PR's core design is sound — separating training capture scope from inference granularity into distinct, purpose-specific fields is the right architectural move. The backward compatibility migration is well-structured with centralized normalization in cuda_graph_config.py, proper deprecation warnings, and comprehensive test coverage for migration paths.
The findings above are not blocking — they're fragility/hygiene issues, not correctness bugs in the current code. The most actionable item is the string-vs-enum type inconsistency during migration, which could be fixed with a small change to the deprecation map or the setattr call sites.
The new documentation (cuda_graph.md) and test coverage are excellent additions.
/ok to test 145c047
/ok to test d231704
What does this PR do?
Main PR #4292
This PR decomposes the overloaded `cuda_graph_scope` field into three dedicated, semantically distinct concepts and cleans up naming throughout.
Problem
The old API overloaded `--cuda-graph-scope` with three unrelated concerns:
- `--cuda-graph-scope full_iteration` → full-iteration training graphs (a capture strategy, not a module)
- `--cuda-graph-scope full_iteration_inference` → block-owned inference graphs (inference ownership, not a module)
- `--cuda-graph-scope attn mlp` → per-layer capture regions (the actual intended use)

`CudaGraphScope` mixed iteration-level control flow with per-layer module selection. The three concepts have nothing in common and cannot be meaningfully combined in one field.
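To make the overload concrete, here is roughly what the old enum mixed together (a sketch; member spellings are assumed from the flag values above):

```python
from enum import Enum

# Approximation of the old, overloaded CudaGraphScope: one enum carrying a
# capture strategy, an inference-ownership mode, and per-layer module names.
class CudaGraphScope(Enum):
    FULL_ITERATION = "full_iteration"                      # capture strategy
    FULL_ITERATION_INFERENCE = "full_iteration_inference"  # inference ownership
    ATTN = "attn"                                          # per-layer module
    MLP = "mlp"                                            # per-layer module
```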
Solution
Four concrete changes:
1. `full_iteration` becomes its own `cuda_graph_impl` value. `--cuda-graph-impl full_iteration` replaces `--cuda-graph-impl local --cuda-graph-scope full_iteration`.
2. `--inference-cuda-graph-scope` is a new dedicated field. `InferenceCudaGraphScope` (none/layer/block) replaces `full_iteration_inference` in `cuda_graph_scope`. The default for `--cuda-graph-impl local` is `layer` (preserving prior behaviour).
3. `CudaGraphScope` is renamed to `CudaGraphModule`; `cuda_graph_scope` to `cuda_graph_modules`. With the two non-module values removed, the enum and field names now accurately reflect their purpose: selecting which per-layer modules to capture.
4. Normalization is centralized in `cuda_graph_config.py`. `normalize_cuda_graph_modules`, `normalize_inference_cuda_graph_scope`, and `validate_deprecated_cuda_graph_modules_migration_inputs` are shared between `TransformerConfig.__post_init__` and `validate_args`, removing duplication and making the migration logic testable in isolation.
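A rough sketch of that sharing pattern, assuming a list-of-strings input shape and a `CudaGraphModule` enum like the one sketched above; the real helpers also cover the deprecated scope values:

```python
from enum import Enum

class CudaGraphModule(Enum):  # stand-in, as in the earlier sketch
    ATTN = "attn"
    MLP = "mlp"

def normalize_cuda_graph_modules(modules):
    """Sketch of the shared helper in cuda_graph_config.py: coerce legacy
    string values to enum members in one place, so the migration logic is
    testable without building a full config or parsing CLI arguments."""
    return [CudaGraphModule(m) if isinstance(m, str) else m for m in (modules or [])]

# Both entry points delegate here instead of duplicating the rules:
#   TransformerConfig.__post_init__ -> normalize_cuda_graph_modules(cfg.cuda_graph_modules)
#   validate_args                   -> normalize_cuda_graph_modules(args.cuda_graph_modules)
```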
Backward compatibility
All deprecated inputs are still accepted and silently migrated at startup:

| Deprecated input | Replacement |
|---|---|
| `--enable-cuda-graph` | `--cuda-graph-impl local` |
| `--external-cuda-graph` | `--cuda-graph-impl transformer_engine` |
| `--cuda-graph-scope full_iteration` | `--cuda-graph-impl full_iteration` |
| `--cuda-graph-scope full_iteration_inference` | `--cuda-graph-impl local --inference-cuda-graph-scope block` |
| `--cuda-graph-scope attn mlp` | `--cuda-graph-modules attn mlp` |
| `from megatron.core.transformer.enums import CudaGraphScope` | `CudaGraphScope` alias kept; use `CudaGraphModule` |
| `TransformerConfig(cuda_graph_scope=...)` | `cuda_graph_scope` field kept, deprecated; use `cuda_graph_modules` |

Conflicting combinations (e.g. passing both a deprecated value and the new flag with an incompatible value) are rejected with a clear assertion.
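A sketch of what one migration test might assert, per the table above. `TransformerConfig` is the real class, but the minimal constructor fields and the list shape of `cuda_graph_scope` are assumptions here, not the PR's actual tests:

```python
from megatron.core.transformer import TransformerConfig

def test_deprecated_full_iteration_scope_migrates_to_impl():
    # Deprecated input from the table above; model fields are assumed minimal.
    config = TransformerConfig(
        num_layers=2,
        hidden_size=64,
        num_attention_heads=4,
        cuda_graph_scope=["full_iteration"],  # deprecated spelling
    )
    # Expected migration:
    # --cuda-graph-scope full_iteration  ->  --cuda-graph-impl full_iteration
    assert config.cuda_graph_impl == "full_iteration"
```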
Contribution process
Pre-checks
Code review
Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
`.github/CODEOWNERS`. Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change `megatron/core`, once all expert reviewers have approved, the `Final Review` label is applied automatically and final reviewers are assigned. For PRs outside `megatron/core`, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the `Approved` label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.