Introduce ep-as-cp customized logical rule #3656
Conversation
```yaml
# ==========================================
# Dense Activations
['activation_mlp', []],
['activation_batch', ['data', 'fsdp']],
```
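As a side note for readers, logical rule entries like those above map a logical axis name to zero or more physical mesh axes. A minimal sketch of that lookup (`resolve_logical_axis` is a hypothetical helper for illustration, not MaxText's actual implementation):

```python
# Minimal sketch of resolving a logical axis name to physical mesh axes,
# using rules shaped like the entries above. Hypothetical helper, not
# the actual MaxText implementation.

def resolve_logical_axis(rules, logical_name):
    """Return the physical mesh axes assigned to a logical axis name."""
    for name, physical_axes in rules:
        if name == logical_name:
            return tuple(physical_axes)
    return ()  # unlisted logical axes are replicated

rules = [
    ["activation_mlp", []],                  # replicated
    ["activation_batch", ["data", "fsdp"]],  # sharded over data and fsdp
]

print(resolve_logical_axis(rules, "activation_batch"))  # ('data', 'fsdp')
print(resolve_logical_axis(rules, "activation_mlp"))    # ()
```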
activation_batch is also used for attention and is a key dimension. Maybe note this as a comment in the attention section above
Good point! Added comments in base.yml
```yaml
data_sharding: [['data', 'stage', 'fsdp', 'fsdp_transpose', 'sequence', 'context', 'context_autoregressive', 'tensor', 'tensor_transpose', 'tensor_sequence', 'expert', 'autoregressive']]
input_data_sharding_logical_axes: ['activation_embed_and_logits_batch', 'activation_norm_length']
# Determines which physical axis plays the role of context parallelism for input data processing and load balancing
context_sharding: "context"
```
What values can this take? Can we remove this field, since it is implied by the logical axis rules? E.g. we need a function that takes the rules as input and outputs the value of context_sharding.
Could we list the other options, if any?
It is hard to infer which physical axis is used for CP from reading the logical rules. For example, both sequence and context are used to shard activation_length, but only context is used for data processing.
I will add comments indicating the possible values of context_sharding and add checks.
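A minimal sketch of the kind of check mentioned here. The allowed values are an assumption (the thread implies `context` is the default and that another axis such as `expert` may play the CP role), and `validate_context_sharding` is a hypothetical helper, not the PR's actual code:

```python
# Sketch of a validation check: ensure context_sharding is an allowed
# value and names a physical axis that actually exists in the mesh.
# VALID_CONTEXT_AXES is an assumption for illustration.

VALID_CONTEXT_AXES = ("context", "expert")

def validate_context_sharding(context_sharding, mesh_axes):
    if context_sharding not in VALID_CONTEXT_AXES:
        raise ValueError(
            f"context_sharding must be one of {VALID_CONTEXT_AXES}, "
            f"got {context_sharding!r}"
        )
    if context_sharding not in mesh_axes:
        raise ValueError(
            f"context_sharding {context_sharding!r} is not a mesh axis"
        )

validate_context_sharding("expert", ["data", "fsdp", "expert"])  # passes
```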
Could we list the other options, if any?
Done!
RissyRan left a comment
Thank you! Wondering if we should add this test to train_compile to prevent breakage. But I am also fine keeping it as a scheduled test since this is not frequently used. Both work for me.
```yaml
# This rule uses data, FSDP, and expert. Expert axis acts as context parallelism in
# components except core dMoE part (between EP all2all).
mesh_axes: ['data', 'fsdp', 'expert']
```
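For illustration, a sketch of constructing a JAX mesh with these three axis names. The axis sizes here are placeholders (a toy 1x1x1 mesh on a single device); in MaxText the sizes come from the configured DP/FSDP/EP parallelism degrees:

```python
# Build a toy device mesh with the axis names from the rule above.
# Sizes are illustrative; real runs use the configured parallelism degrees.
import numpy as np
import jax
from jax.sharding import Mesh

devices = np.array(jax.devices()[:1]).reshape(1, 1, 1)
mesh = Mesh(devices, ("data", "fsdp", "expert"))
print(mesh.axis_names)  # ('data', 'fsdp', 'expert')
```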
Wondering if you will have a README about the custom meshes and rules supported?
I plan to add a doc explaining custom meshes and rules. I probably won't include the doc in this PR since more changes to the custom rules are planned.
```yaml
# General Weights
['mlp', []],
['embed', ['fsdp', 'expert']],
]
```
Description
Note: This PR should be merged after #3607.
This PR introduces a new custom mesh and rule, `ep-as-cp`, and deprecates the `expert_shard_attention_option` flag.

Key Changes

- `custom_mesh_and_rule=ep-as-cp`: this rule supports DP, PP, FSDP, and EP. Under this setup, EP functions as CP everywhere except within the core MoE components (specifically, the layers between EP all-to-all communications).
- Adds a `context_sharding` config field (default `"context"`) to explicitly designate which physical axis serves as context parallelism. This is required because CP is utilized in the data pipeline and attention load balancing, and this axis mapping cannot be easily inferred from the custom logical rules alone.
- The `ep-as-cp` mesh and rule are validated by the dump sharding tests.

Tests
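For context, a hedged sketch of how a run might enable this rule. The flag names `custom_mesh_and_rule` and `context_sharding` come from this PR; the `ici_*` values and the choice of `expert` as the context axis are illustrative assumptions, not values taken from the PR:

```yaml
custom_mesh_and_rule: "ep-as-cp"
context_sharding: "expert"   # physical axis acting as CP outside core MoE (assumed)
ici_fsdp_parallelism: 2      # illustrative
ici_expert_parallelism: 2    # illustrative
```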
Functionality Check
The most straightforward way to verify the `ep-as-cp` implementation is by running an experiment using a fractional batch size. To test the implementation, we ran the following experiments:

Experiment 1: Real Training
- `model_name=deepseek3-test`

Experiment 2: Training Compilation
- `model_name=deepseek3-671b`
- `use_ring_of_experts=true`

Experiment 3: Default vs. `ep-as-cp` (debug_sharding diff)
- `pdb=1`

Experiment 4: Training compilation large
- `model_name=deepseek3-671b`
- Output: Compilation succeeds.
Correctness Check
To verify loss correctness, we evaluated the following configuration:

- `per_device_batch_size=1`
- `model_name=deepseek3-test`

Note: We compare FSDP=2 + CP=2 against FSDP=2 + EP (as CP)=2 to ensure an apples-to-apples comparison, as CP directly impacts input data pipelining (CP load balancing).
Result: The losses match within a reasonable tolerance.
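A sketch of what such a tolerance check might look like. The loss values below are made up for illustration; the real comparison uses the logged training losses from the two runs:

```python
# Compare per-step losses from the two runs within an absolute tolerance.
import numpy as np

losses_cp = np.array([9.12, 8.74, 8.41, 8.15])        # FSDP=2 + CP=2 (illustrative)
losses_ep_as_cp = np.array([9.12, 8.74, 8.42, 8.15])  # FSDP=2 + EP-as-CP=2 (illustrative)

assert np.allclose(losses_cp, losses_ep_as_cp, atol=2e-2)
print("losses match within tolerance")
```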
Performance Check
To evaluate the performance implications, we compared FSDP + EP (as FSDP) against FSDP + EP (as CP) using the following configuration:
Topology: v5p-8 (4 devices)
Sharding: FSDP=2, EP=2
Batch Size: PDB=1
Model: `model_name=deepseek3-test`

Diff: https://diff.googleplex.com/#key=u2vncodwZBrP
Result: Using EP as CP incurs a general performance penalty. However, this configuration is specifically designed for, and highly beneficial in, two scenarios:
Diff with Previous Implementation
Finally, we compared the sharding and performance of this PR against the previous implementation (using the `expert_shard_attention_option` flag). For an accurate baseline, we compared against commit `9777a4cf9574f3d10c591e25450cea1b1dde7e01` (April 3rd), which is isolated from recent changes.

Configuration:

- Topology: v5p-8
- Model: `model_name=deepseek3-test`
- Batch Size: `per_device_batch_size=1` (Fractional batch size is unsupported by the old flag.)

Diff: https://diff.googleplex.com/#key=AsrXm1JknryF
Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.