fix: broadcast cu_seqlens/max_seqlen to intermediate PP stages for SFT packed sequences #4150
meinie0826 wants to merge 1 commit into NVIDIA:main
Conversation
This PR has been automatically converted to draft because all PRs must start as drafts. When you are ready for review, click "Ready for Review" to begin the review process. See the contribution guide for more details.
Pull request overview
Fixes SFT packed-sequence training with PP>2 by ensuring packed-sequence metadata is broadcast to TP ranks on intermediate pipeline stages, and adds a unit-test regression suite for the broadcast behavior (issue #4092).
Changes:
- Add intermediate-stage broadcast/receive logic in `get_batch_on_this_tp_rank` for `attention_mask`, `cu_seqlens`, and `max_seqlen`.
- Add new CUDA/distributed unit tests validating `cu_seqlens`/`max_seqlen` propagation across TP/PP configurations.
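For context on the two metadata fields being propagated: `cu_seqlens` is the cumulative boundary vector of the sequences packed into one row, and `max_seqlen` is the longest individual sequence — the inputs FlashAttention's varlen kernels expect. A minimal sketch (hypothetical helper, not part of this PR):

```python
def packed_seq_metadata(seq_lens):
    """Derive packed-sequence metadata: cumulative sequence boundaries
    (cu_seqlens) and the longest individual sequence (max_seqlen)."""
    cu_seqlens = [0]
    for n in seq_lens:
        cu_seqlens.append(cu_seqlens[-1] + n)
    return cu_seqlens, max(seq_lens)

# Three sequences of lengths 3, 5 and 2 packed into one row of 10 tokens.
cu, mx = packed_seq_metadata([3, 5, 2])
# cu == [0, 3, 8, 10], mx == 5
```

If a TP rank never receives these values, attention has no way to tell where one packed sequence ends and the next begins — which is exactly the bug this PR addresses.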
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| megatron/training/utils.py | Adds intermediate PP-stage broadcast/receive of packed-sequence metadata in `get_batch_on_this_tp_rank`. |
| tests/unit_tests/test_get_batch_on_this_tp_rank.py | Introduces regression tests for packed-sequence metadata broadcast across TP ranks and PP stages. |
```python
else:
    # Intermediate pipeline stages: broadcast attention_mask, cu_seqlens and max_seqlen
    # so that RoPE and FlashAttention on TP ranks > 0 receive the packed-sequence metadata.
    _broadcast(batch['attention_mask'])
    _broadcast_cu_seqlens(batch['cu_seqlens'])
    _broadcast(batch['max_seqlen'])
```
In the PP-last-stage path, TP ranks > 0 still never receive cu_seqlens/max_seqlen (the code only broadcasts labels, loss_mask, attention_mask). For SFT/packed sequences, the last stage still runs transformer layers and needs the packed-sequence metadata on every TP rank (see pretrain_gpt.py:get_batch / pretrain_mamba.py:get_batch, which build packed-seq params on first/last stages). Consider broadcasting cu_seqlens (via _broadcast_cu_seqlens) and max_seqlen here as well, and mirroring the receive-side logic so TP>0 doesn’t fall back to non-packed execution.
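The reviewer's point can be reduced to a per-stage decision table. The sketch below uses a hypothetical helper (the real code calls `_broadcast`/`_broadcast_cu_seqlens` per key rather than returning a list): packed-sequence metadata belongs to every stage that runs transformer layers, which is all of them, including the last.

```python
def keys_to_broadcast(is_first_stage, is_last_stage, packed_sft):
    """Which batch entries TP rank 0 should broadcast on a given PP stage.
    Illustrative model of get_batch_on_this_tp_rank's stage split."""
    keys = ["attention_mask"]
    if is_first_stage:
        keys += ["tokens", "position_ids"]
    if is_last_stage:
        keys += ["labels", "loss_mask"]
    if packed_sft:
        # The review suggestion: do NOT gate these on the stage --
        # every stage's attention layers need the packed-seq metadata.
        keys += ["cu_seqlens", "max_seqlen"]
    return keys
```

Under this model, `keys_to_broadcast(is_first_stage=False, is_last_stage=True, packed_sft=True)` still includes `cu_seqlens` and `max_seqlen`, whereas the PR's current last-stage path omits them.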
```python
else:
    # Intermediate pipeline stages: receive attention_mask, cu_seqlens and max_seqlen
    # from TP rank 0 so that RoPE and FlashAttention work correctly with packed sequences.
    tokens = None
    labels = None
    loss_mask = None
    position_ids = None
    local_cp_size = None

    _broadcast(attention_mask)
    cu_seqlens = _broadcast_cu_seqlens()
    _broadcast(max_seqlen)
```
Receive-side intermediate-stage logic now pulls cu_seqlens/max_seqlen, but the PP-last-stage branch above explicitly sets cu_seqlens = None and max_seqlen = None and never receives them. That leaves TP ranks > 0 on the last stage without packed-sequence metadata, causing pretrain_gpt.py:get_batch / pretrain_mamba.py:get_batch to treat the batch as non-packed on those ranks. Align the PP-last-stage receive logic with the sender side (and with first/intermediate stages) so cu_seqlens/max_seqlen are available on all TP ranks when args.sft is enabled.
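The failure mode described here — TP ranks silently treating the batch as non-packed — ultimately comes down to a `None` check when building attention parameters. A sketch of that downstream check (field names modeled on Megatron's `PackedSeqParams`; treat the helper itself as illustrative):

```python
def build_packed_seq_params(cu_seqlens, max_seqlen):
    """Return varlen-attention parameters, or None to signal the caller
    to fall back to regular (non-packed) attention."""
    if cu_seqlens is None or max_seqlen is None:
        # This is the branch TP ranks > 0 on the last stage hit
        # when the metadata broadcast is missing.
        return None
    return {
        "cu_seqlens_q": cu_seqlens,
        "cu_seqlens_kv": cu_seqlens,
        "max_seqlen_q": max_seqlen,
        "max_seqlen_kv": max_seqlen,
    }
```

Because the fallback is silent, the bug manifests as wrong attention masking across packed-sequence boundaries rather than an error, which is why a regression test is valuable.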
```python
world_size = tp * pp
if torch.cuda.device_count() < world_size:
    pytest.skip(f"Need {world_size} GPUs (tp={tp}, pp={pp})")

Utils.initialize_model_parallel(
    tensor_model_parallel_size=tp,
    pipeline_model_parallel_size=pp,
)
```
These tests assume that having enough visible GPUs (torch.cuda.device_count()) implies the test is running with WORLD_SIZE == tp*pp, but Utils.initialize_model_parallel(tp, pp) requires the process group world size to match (or exceed) the requested topology. When running pytest without torchrun, WORLD_SIZE defaults to 1 and this will fail instead of skipping. Consider skipping unless int(os.environ.get('WORLD_SIZE','1')) >= tp*pp (or torch.distributed.get_world_size() >= tp*pp once initialized).
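A guard along the lines suggested could look like the following sketch (`skip_reason` is a hypothetical helper, not existing test infrastructure):

```python
import os

def skip_reason(tp, pp, env=None):
    """Return a pytest skip reason unless the launcher (e.g. torchrun)
    provides at least tp * pp ranks; None means the test can run."""
    env = os.environ if env is None else env
    need = tp * pp
    world_size = int(env.get("WORLD_SIZE", "1"))
    if world_size < need:
        return f"Need WORLD_SIZE >= {need} (tp={tp}, pp={pp}), got {world_size}"
    return None
```

In the test this would run before the GPU-count check, e.g. `reason = skip_reason(tp, pp)` followed by `if reason: pytest.skip(reason)`, so a plain `pytest` invocation skips cleanly instead of failing inside `initialize_model_parallel`.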
```python
tp_rank = mpu.get_tensor_model_parallel_rank()
pp_rank = mpu.get_pipeline_model_parallel_rank()
is_first = mpu.is_pipeline_first_stage()
is_last = mpu.is_pipeline_last_stage()

data_iterator = _make_data_iterator() if tp_rank == 0 else None
batch = get_batch_on_this_tp_rank(data_iterator)

if is_last and not is_first:
    Utils.destroy_model_parallel()
    return  # last stage: cu_seqlens is intentionally None

assert batch["cu_seqlens"] is not None
assert batch["max_seqlen"] is not None

# Broadcast the TP-0 copy to all TP ranks and verify every rank agrees.
cu_ref = batch["cu_seqlens"].clone()
mx_ref = batch["max_seqlen"].clone()
torch.distributed.broadcast(
    cu_ref,
    src=mpu.get_tensor_model_parallel_src_rank(),
    group=mpu.get_tensor_model_parallel_group(),
)
torch.distributed.broadcast(
    mx_ref,
    src=mpu.get_tensor_model_parallel_src_rank(),
    group=mpu.get_tensor_model_parallel_group(),
)

assert torch.equal(batch["cu_seqlens"], cu_ref), (
    f"cu_seqlens mismatch on pp_rank={pp_rank} tp_rank={tp_rank}"
)
assert torch.equal(batch["max_seqlen"], mx_ref), (
    f"max_seqlen mismatch on pp_rank={pp_rank} tp_rank={tp_rank}"
)

Utils.destroy_model_parallel()
```
This test returns early for the last PP stage under the assumption that cu_seqlens is intentionally None there. For packed sequences, last-stage TP ranks also need the metadata to build packed-seq params for attention. Once the last-stage broadcast is fixed, remove this early-return and validate consistency across TP ranks on the last stage as well.
Suggested change (wrap the test body in try/finally and drop the last-stage early return):

```python
try:
    tp_rank = mpu.get_tensor_model_parallel_rank()
    pp_rank = mpu.get_pipeline_model_parallel_rank()
    data_iterator = _make_data_iterator() if tp_rank == 0 else None
    batch = get_batch_on_this_tp_rank(data_iterator)
    assert batch["cu_seqlens"] is not None
    assert batch["max_seqlen"] is not None
    # Broadcast the TP-0 copy to all TP ranks and verify every rank agrees.
    cu_ref = batch["cu_seqlens"].clone()
    mx_ref = batch["max_seqlen"].clone()
    torch.distributed.broadcast(
        cu_ref,
        src=mpu.get_tensor_model_parallel_src_rank(),
        group=mpu.get_tensor_model_parallel_group(),
    )
    torch.distributed.broadcast(
        mx_ref,
        src=mpu.get_tensor_model_parallel_src_rank(),
        group=mpu.get_tensor_model_parallel_group(),
    )
    assert torch.equal(batch["cu_seqlens"], cu_ref), (
        f"cu_seqlens mismatch on pp_rank={pp_rank} tp_rank={tp_rank}"
    )
    assert torch.equal(batch["max_seqlen"], mx_ref), (
        f"max_seqlen mismatch on pp_rank={pp_rank} tp_rank={tp_rank}"
    )
finally:
    Utils.destroy_model_parallel()
```
Force-pushed: 2b04b29 to 1b5fe0f
Force-pushed: 1b5fe0f to cea9d00
@chtruong814 Hi, could you take a look at this PR and give some advice?
What does this PR do?
Fix: Add an `else` branch for intermediate stages on both the sender and receiver sides of `get_batch_on_this_tp_rank` to broadcast `attention_mask`, `cu_seqlens`, and `max_seqlen`.
Fixes #4092
Contribution process
Pre-checks
Code review
Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
Reviewers are assigned according to `.github/CODEOWNERS`. Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change `megatron/core`, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned. For PRs outside `megatron/core`, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the Approved label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into the `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.