Introduce ep-as-cp customized logical rule #3656
Conversation
```yaml
# ==========================================
# Dense Activations
['activation_mlp', []],
['activation_batch', ['data', 'fsdp']],
```
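As a side note for readers, logical rule entries like those above map a logical axis name to zero or more physical mesh axes. A minimal sketch of that lookup (`resolve_logical_axis` is a hypothetical helper for illustration, not MaxText's actual implementation):

```python
# Minimal sketch of resolving a logical axis name to physical mesh axes,
# using rules shaped like the entries above. Hypothetical helper, not
# the actual MaxText implementation.

def resolve_logical_axis(rules, logical_name):
    """Return the physical mesh axes assigned to a logical axis name."""
    for name, physical_axes in rules:
        if name == logical_name:
            return tuple(physical_axes)
    return ()  # unlisted logical axes are replicated

rules = [
    ["activation_mlp", []],                  # replicated
    ["activation_batch", ["data", "fsdp"]],  # sharded over data and fsdp
]

print(resolve_logical_axis(rules, "activation_batch"))  # ('data', 'fsdp')
print(resolve_logical_axis(rules, "activation_mlp"))    # ()
```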
activation_batch is also used for attention and is a key dimension. Maybe note this as a comment in the attention section above
Good point! Added comments in base.yml
```yaml
data_sharding: [['data', 'stage', 'fsdp', 'fsdp_transpose', 'sequence', 'context', 'context_autoregressive', 'tensor', 'tensor_transpose', 'tensor_sequence', 'expert', 'autoregressive']]
input_data_sharding_logical_axes: ['activation_embed_and_logits_batch', 'activation_norm_length']
# Determines which physical axis plays the role of context parallelism for input data processing and load balancing
context_sharding: "context"
```
What values can this take? Can we remove this field, since it is implied by the logical axis rules? E.g. we need a function that takes the rules as input and outputs the value of context_sharding.
Could we list the other options, if any?
It is hard to infer which physical axis is used for CP from reading the logical rules. For example, both sequence and context are used to shard activation_length, but only context is used for data processing.
I will add comments indicating the possible values of context_sharding and add checks.
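A minimal sketch of the kind of check mentioned here. The allowed values are an assumption (the thread implies `context` is the default and that another axis such as `expert` may play the CP role), and `validate_context_sharding` is a hypothetical helper, not the PR's actual code:

```python
# Sketch of a validation check: ensure context_sharding is an allowed
# value and names a physical axis that actually exists in the mesh.
# VALID_CONTEXT_AXES is an assumption for illustration.

VALID_CONTEXT_AXES = ("context", "expert")

def validate_context_sharding(context_sharding, mesh_axes):
    if context_sharding not in VALID_CONTEXT_AXES:
        raise ValueError(
            f"context_sharding must be one of {VALID_CONTEXT_AXES}, "
            f"got {context_sharding!r}"
        )
    if context_sharding not in mesh_axes:
        raise ValueError(
            f"context_sharding {context_sharding!r} is not a mesh axis"
        )

validate_context_sharding("expert", ["data", "fsdp", "expert"])  # passes
```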
Could we list the other options, if any?
Done!
RissyRan left a comment
Thank you! Wondering if we should add this test to train_compile to prevent breakage. But I am also fine keeping it as a scheduled test since this is not frequently used. Both work for me.
```yaml
# This rule uses data, FSDP, and expert. Expert axis acts as context parallelism in
# components except core dMoE part (between EP all2all).
mesh_axes: ['data', 'fsdp', 'expert']
```
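For illustration, a sketch of constructing a JAX mesh with these three axis names. The axis sizes here are placeholders (a toy 1x1x1 mesh on a single device); in MaxText the sizes come from the configured DP/FSDP/EP parallelism degrees:

```python
# Build a toy device mesh with the axis names from the rule above.
# Sizes are illustrative; real runs use the configured parallelism degrees.
import numpy as np
import jax
from jax.sharding import Mesh

devices = np.array(jax.devices()[:1]).reshape(1, 1, 1)
mesh = Mesh(devices, ("data", "fsdp", "expert"))
print(mesh.axis_names)  # ('data', 'fsdp', 'expert')
```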
Wondering if you will have a README about the custom meshes and rules supported?
I plan to add a doc explaining custom meshes and rules. I probably won't include the doc in this PR since more changes to the custom rules are planned.
```yaml
# General Weights
['mlp', []],
['embed', ['fsdp', 'expert']],
]
```
Description
Note: This PR should be merged after #3607.
This PR introduces a new custom mesh and rule, `ep-as-cp`, and deprecates the `expert_shard_attention_option` flag.

Key Changes

- `custom_mesh_and_rule=ep-as-cp`: this rule supports DP, PP, FSDP, and EP. Under this setup, EP functions as CP everywhere except within the core MoE components (specifically, the layers between EP all-to-all communications).
- Adds a `context_sharding` config field (default `"context"`) to explicitly designate which physical axis serves as context parallelism. This is required because CP is utilized in the data pipeline and attention load balancing, and this axis mapping cannot be easily inferred from the custom logical rules alone.
- The `ep-as-cp` mesh and rule are validated by the dump sharding tests.

Tests
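For context, a hedged sketch of how a run might enable this rule. The flag names `custom_mesh_and_rule` and `context_sharding` come from this PR; the `ici_*` values and the choice of `expert` as the context axis are illustrative assumptions, not values taken from the PR:

```yaml
custom_mesh_and_rule: "ep-as-cp"
context_sharding: "expert"   # physical axis acting as CP outside core MoE (assumed)
ici_fsdp_parallelism: 2      # illustrative
ici_expert_parallelism: 2    # illustrative
```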
Functionality Check
The most straightforward way to verify the `ep-as-cp` implementation is by running an experiment using a fractional batch size. To test the implementation, we ran the following experiments:

Experiment 1: Real Training
- `model_name=deepseek3-test`

Experiment 2: Training Compilation
- `model_name=deepseek3-671b`
- `use_ring_of_experts=true`

Experiment 3: Default vs. `ep-as-cp` (debug_sharding diff)
- `pdb=1`

Experiment 4: Training compilation large
- `model_name=deepseek3-671b`
- Output: Compilation succeeds.
Correctness Check
To verify loss correctness, we evaluated the following configuration:

- `per_device_batch_size=1`
- `model_name=deepseek3-test`

Note: We compare FSDP=2 + CP=2 against FSDP=2 + EP (as CP)=2 to ensure an apples-to-apples comparison, as CP directly impacts input data pipelining (CP load balancing).
Result: The losses match within a reasonable tolerance.
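A sketch of what such a tolerance check might look like. The loss values below are made up for illustration; the real comparison uses the logged training losses from the two runs:

```python
# Compare per-step losses from the two runs within an absolute tolerance.
import numpy as np

losses_cp = np.array([9.12, 8.74, 8.41, 8.15])        # FSDP=2 + CP=2 (illustrative)
losses_ep_as_cp = np.array([9.12, 8.74, 8.42, 8.15])  # FSDP=2 + EP-as-CP=2 (illustrative)

assert np.allclose(losses_cp, losses_ep_as_cp, atol=2e-2)
print("losses match within tolerance")
```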
Performance Check
To evaluate the performance implications, we compared FSDP + EP (as FSDP) against FSDP + EP (as CP) using the following configuration:
Topology: v5p-8 (4 devices)
Sharding: FSDP=2, EP=2
Batch Size: PDB=1
Model: `model_name=deepseek3-test`

Diff: https://diff.googleplex.com/#key=u2vncodwZBrP
Result: Using EP as CP incurs a general performance penalty. However, this configuration is specifically designed for, and highly beneficial in, two scenarios:
Diff with Previous Implementation
Finally, we compared the sharding and performance of this PR against the previous implementation (using the `expert_shard_attention_option` flag). For an accurate baseline, we compared against commit `9777a4cf9574f3d10c591e25450cea1b1dde7e01` (April 3rd), which is isolated from recent changes.

Configuration:

- Topology: v5p-8
- Model: `model_name=deepseek3-test`
- Batch Size: `per_device_batch_size=1` (Fractional batch size is unsupported by the old flag.)

Diff: https://diff.googleplex.com/#key=AsrXm1JknryF
Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.