rename confusing arg #2

Closed

yaroslavvb wants to merge 1 commit into NVIDIA:master from yaroslavvb:patch-1

Conversation

@yaroslavvb

WORLD_SIZE usually refers to the total number of GPUs, while --nproc_per_node should be set to the number of GPUs per node.
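As a sketch of the convention being described (values and script name here are illustrative, not the actual script in this PR): WORLD_SIZE holds the total GPU count across all nodes, and only the per-node count goes to --nproc_per_node.

```bash
# Illustrative sketch of the conventional naming (assumed values,
# not the script under review).
GPUS_PER_NODE=8
NNODES=2
NODE_RANK=0                              # index of this node, 0..NNODES-1
WORLD_SIZE=$((GPUS_PER_NODE * NNODES))   # total GPUs across all nodes

python -m torch.distributed.launch \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    pretrain_gpt.py
```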
Contributor

@raulpuric left a comment


I agree that this isn't proper convention. If anything, I should have set a WORLD_SIZE=$GPUS_PER_NODE assignment afterwards for clarity.

Could you add that variable assignment and also make these changes to the other scripts for consistency?
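A minimal sketch of that suggestion for a single-node script (hypothetical; the actual script contents aren't shown in this thread):

```bash
# Hypothetical sketch of the suggested rename plus explicit assignment.
# On a single node the two quantities coincide, so the assignment
# mainly documents the relationship.
GPUS_PER_NODE=8
WORLD_SIZE=$GPUS_PER_NODE   # single node: total GPUs == GPUs per node

python -m torch.distributed.launch \
    --nproc_per_node $GPUS_PER_NODE \
    pretrain_gpt.py
```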

@jaredcasper
Contributor

These variables should be named correctly in our latest version so this has been applied manually. Thanks for your help!

punitkoura pushed a commit to punitkoura/Megatron-LM that referenced this pull request Jan 26, 2022
shjwudp referenced this pull request in shjwudp/Megatron-LM Apr 18, 2022
deepakn94 referenced this pull request in stanford-futuredata/Megatron-LM Jan 26, 2023
Pipeline parallelism for Switch and MoB models
jon-barker pushed a commit that referenced this pull request Jul 19, 2023
jon-barker pushed a commit that referenced this pull request Jul 19, 2023
Test #2: Memory, timing

See merge request ADLR/megatron-lm!677
chelseajohn referenced this pull request in OpenGPTX/Megatron-LM Jul 24, 2023
janEbert pushed a commit to janEbert/Megatron-LM that referenced this pull request Jul 25, 2023
haidark pushed a commit to haidark/Megatron-LM that referenced this pull request Mar 8, 2024
Edenzzzz pushed a commit to Edenzzzz/Megatron-LM that referenced this pull request Aug 20, 2024
shjwudp referenced this pull request in shjwudp/Megatron-LM Nov 8, 2024
add pre-allocation for each cpu grad and overlap CPU/CUDA step
ko3n1g added a commit that referenced this pull request Sep 3, 2025
…compatibility from 0.14.0) (Followed up on !3945)

Author: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
shjwudp referenced this pull request in shjwudp/Megatron-LM Nov 6, 2025
* add forward-mainloop and bwd_partial_dlogits kernel

Signed-off-by: Jianbing Dong <jianbingd@nvidia.com>

* skip TestFusedLinearCrossEntropyOnGptModel for single GPU

Signed-off-by: Jianbing Dong <jianbingd@nvidia.com>

* added unit-test for linear_cross_entropy on dp

Signed-off-by: Jianbing Dong <jianbingd@nvidia.com>

---------

Signed-off-by: Jianbing Dong <jianbingd@nvidia.com>
copy-pr-bot Bot pushed a commit that referenced this pull request Nov 13, 2025
shjwudp referenced this pull request in shjwudp/Megatron-LM Nov 21, 2025
nanz-nv pushed a commit to nanz-nv/Megatron-LM that referenced this pull request Feb 5, 2026
copy-pr-bot Bot pushed a commit that referenced this pull request Feb 16, 2026
…lpers

- Use BooleanOptionalAction for --inference-dynamic-batching-prefix-caching (comment #2)
- Inline 5 trivial helper methods in BlockAllocator per reviewer feedback (comments #4-8):
  set_block_hash, get_block_hash, lookup_block_by_hash, increment_ref_count, decrement_ref_count
- Update all call sites in dynamic_context.py, dynamic_engine.py, and tests
- Apply autoformat (black, isort)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
guapisolo pushed a commit to guapisolo/Megatron-LM that referenced this pull request Feb 19, 2026
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
guapisolo pushed a commit to guapisolo/Megatron-LM that referenced this pull request Feb 25, 2026
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
peter-ni-noob pushed a commit to peter-ni-noob/Megatron-LM that referenced this pull request Feb 27, 2026
guapisolo pushed a commit to guapisolo/Megatron-LM that referenced this pull request Mar 2, 2026
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
copy-pr-bot Bot pushed a commit that referenced this pull request Mar 11, 2026
…ward

Fix the fix dummy forward to avoid picking up a cudagraph
parthmannan pushed a commit to parthmannan/Megatron-LM that referenced this pull request Mar 31, 2026
copy-pr-bot Bot pushed a commit that referenced this pull request Apr 7, 2026
Fix row-parallel bias TP mode detection as replicated
Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request May 4, 2026
Resolves all eleven comments on the PR thread:

* Rename CheckpointManager → CheckpointWithoutOutputManager and update
  the docstring; the class strictly manages CheckpointWithoutOutput
  instances, so the new name avoids the broader "checkpoint" overloading.
  Updates all importers and tests. (NVIDIA#1)
* Document why subtracting the per-row max in SinkhornKnopp.forward is
  benign — Sinkhorn's first row-normalization cancels any per-row
  scalar, so the shifted and unshifted exp produce the same fixed point
  and gradient. (NVIDIA#2)
* Use NotImplementedError for the mhc + fine_grained_activation_offloading
  block — it's a known unimplemented interaction, not a config error. (NVIDIA#3)
* Drop the new __call__ override and backward_dw_cudagraph from base
  TransformerLayer; the mHC kwarg extraction now lives on
  HyperConnectionTransformerLayer.__call__, with _mhc_recompute_manager
  initialized in __init__ so forward() reads it directly without a
  getattr fallback. cuda_graphs.py reads is_decode_only() directly,
  so dropping the dynamic_inference_decode_only injection is safe. (NVIDIA#4, NVIDIA#5, NVIDIA#10)
* Rename the FineGrainedActivationOffloadingInterface alias
  off_interface → offload_interface in transformer_layer.py for clarity. (NVIDIA#6)
* Extract a _run_mlp helper on TransformerLayer that owns the MLP-call
  branching (recompute / chunked-prefill / fp8-fp4 / plain-mlp); both
  base and HC _forward_mlp call it, eliminating the previous
  ~80-line duplication. The MoE-cudagraph early-return remains in base
  _forward_mlp after the helper call (HC is guarded against MoE). (NVIDIA#8)
* Raise NotImplementedError at HyperConnectionTransformerLayer.__init__
  when is_moe_layer is True and point users at HyperConnectionHybridLayer;
  drop the dead MoE branch in _get_submodules_under_cudagraphs. (NVIDIA#9)
* No code change for the MoE composition / extensibility comment (NVIDIA#7) —
  see the PR thread reply for the rationale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>