rename confusing arg #2
Closed
yaroslavvb wants to merge 1 commit into NVIDIA:master from
Conversation
WORLD_SIZE usually refers to the total number of GPUs across all nodes, while --nproc_per_node should be set to the number of GPUs per node.
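For context, a minimal sketch of how these quantities relate under torch.distributed-style launching; the node and GPU counts below are illustrative and not taken from this PR:

```python
# Illustrative only: assumes 2 nodes with 8 GPUs each (not values from this PR).
import os

gpus_per_node = 8                        # what --nproc_per_node should be set to
num_nodes = 2                            # what --nnodes should be set to
world_size = num_nodes * gpus_per_node   # WORLD_SIZE: total GPUs/processes across all nodes

# torch.distributed.launch / torchrun export WORLD_SIZE and RANK to each process
# (torchrun also exports LOCAL_RANK), so training code typically reads the total
# process count from the environment:
world_size_env = int(os.environ.get("WORLD_SIZE", world_size))
assert world_size_env == num_nodes * gpus_per_node
```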
raulpuric (Contributor) suggested changes on Apr 25, 2019:
These variables should be named correctly in our latest version, so this has been applied manually. Thanks for your help!
punitkoura pushed a commit to punitkoura/Megatron-LM that referenced this pull request on Jan 26, 2022
deepakn94 referenced this pull request in stanford-futuredata/Megatron-LM on Jan 26, 2023: Pipeline parallelism for Switch and MoB models
jon-barker pushed a commit that referenced this pull request on Jul 19, 2023
jon-barker pushed a commit that referenced this pull request on Jul 19, 2023: Test #2: Memory, timing. See merge request ADLR/megatron-lm!677
chelseajohn referenced this pull request in OpenGPTX/Megatron-LM on Jul 24, 2023
janEbert pushed a commit to janEbert/Megatron-LM that referenced this pull request on Jul 25, 2023: Resolves NVIDIA#2.
haidark pushed a commit to haidark/Megatron-LM that referenced this pull request on Mar 8, 2024: …ar_lr_hyp_tune patch workers.
Edenzzzz pushed a commit to Edenzzzz/Megatron-LM that referenced this pull request on Aug 20, 2024: minor change on auto schedule
shjwudp referenced this pull request in shjwudp/Megatron-LM on Nov 8, 2024: add pre-allocation for each cpu grad and overlap CPU/CUDA step
ko3n1g added a commit that referenced this pull request on Sep 3, 2025: …compatibility from 0.14.0) (Followed up on !3945) Author: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
shjwudp referenced this pull request in shjwudp/Megatron-LM on Nov 6, 2025:
* add forward-mainloop and bwd_partial_dlogits kernel
* skip TestFusedLinearCrossEntropyOnGptModel for single GPU
* added unit-test for linear_cross_entropy on dp
Signed-off-by: Jianbing Dong <jianbingd@nvidia.com>
copy-pr-bot Bot pushed a commit that referenced this pull request on Nov 13, 2025: Unit and functional test for PP
shjwudp referenced this pull request in shjwudp/Megatron-LM on Nov 21, 2025 (same commit message as the Nov 6, 2025 entry above)
nanz-nv pushed a commit to nanz-nv/Megatron-LM that referenced this pull request on Feb 5, 2026
copy-pr-bot Bot pushed a commit that referenced this pull request on Feb 16, 2026: …lpers
- Use BooleanOptionalAction for --inference-dynamic-batching-prefix-caching (comment #2)
- Inline 5 trivial helper methods in BlockAllocator per reviewer feedback (comments #4-8): set_block_hash, get_block_hash, lookup_block_by_hash, increment_ref_count, decrement_ref_count
- Update all call sites in dynamic_context.py, dynamic_engine.py, and tests
- Apply autoformat (black, isort)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
guapisolo pushed a commit to guapisolo/Megatron-LM that referenced this pull request on Feb 19, 2026: Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
guapisolo pushed a commit to guapisolo/Megatron-LM that referenced this pull request on Feb 25, 2026: Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
peter-ni-noob pushed a commit to peter-ni-noob/Megatron-LM that referenced this pull request on Feb 27, 2026: add citation and readme_zh
guapisolo pushed a commit to guapisolo/Megatron-LM that referenced this pull request on Mar 2, 2026: Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
copy-pr-bot Bot pushed a commit that referenced this pull request on Mar 11, 2026: …ward Fix the fix dummy forward to avoid picking up a cudagraph
parthmannan pushed a commit to parthmannan/Megatron-LM that referenced this pull request on Mar 31, 2026: Fix mtp_detach_heads
copy-pr-bot Bot pushed a commit that referenced this pull request on Apr 7, 2026: Fix row-parallel bias TP mode detection as replicated
Connor-XY added a commit to Connor-XY/Megatron-LM that referenced this pull request on May 4, 2026:
Resolves all eleven comments on the PR thread:
* Rename CheckpointManager → CheckpointWithoutOutputManager and update the docstring; the class strictly manages CheckpointWithoutOutput instances, so the new name avoids the broader "checkpoint" overloading. Updates all importers and tests. (NVIDIA#1)
* Document why subtracting the per-row max in SinkhornKnopp.forward is benign — Sinkhorn's first row-normalization cancels any per-row scalar, so the shifted and unshifted exp produce the same fixed point and gradient. (NVIDIA#2)
* Use NotImplementedError for the mhc + fine_grained_activation_offloading block — it's a known unimplemented interaction, not a config error. (NVIDIA#3)
* Drop the new __call__ override and backward_dw_cudagraph from base TransformerLayer; the mHC kwarg extraction now lives on HyperConnectionTransformerLayer.__call__, with _mhc_recompute_manager initialized in __init__ so forward() reads it directly without a getattr fallback. cuda_graphs.py reads is_decode_only() directly, so dropping the dynamic_inference_decode_only injection is safe. (NVIDIA#4, NVIDIA#5, NVIDIA#10)
* Rename the FineGrainedActivationOffloadingInterface alias off_interface → offload_interface in transformer_layer.py for clarity. (NVIDIA#6)
* Extract a _run_mlp helper on TransformerLayer that owns the MLP-call branching (recompute / chunked-prefill / fp8-fp4 / plain-mlp); both base and HC _forward_mlp call it, eliminating the previous ~80-line duplication. The MoE-cudagraph early-return remains in base _forward_mlp after the helper call (HC is guarded against MoE). (NVIDIA#8)
* Raise NotImplementedError at HyperConnectionTransformerLayer.__init__ when is_moe_layer is True and point users at HyperConnectionHybridLayer; drop the dead MoE branch in _get_submodules_under_cudagraphs. (NVIDIA#9)
* No code change for the MoE composition / extensibility comment (NVIDIA#7) — see the PR thread reply for the rationale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Connor-XY added a second commit to Connor-XY/Megatron-LM that referenced this pull request on May 4, 2026 (same commit message as above)