Introduce ep-as-cp customized logical rule #3656
This PR adds a new config file (77 lines) introducing a customized logical rule set that reuses the expert axis for context parallelism:

```yaml
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This rule uses data, stage, FSDP, and expert. The expert axis acts as context
# parallelism in all components except the core dMoE part (between the EP all-to-alls).
mesh_axes: ['data', 'stage', 'fsdp', 'expert']
data_sharding: [['data', 'stage', 'fsdp', 'expert']]
context_sharding: 'expert'
logical_axis_rules: [
  # ==========================================
  # Vocabulary Embedding
  # ==========================================
  # Vocab Activations
  ['activation_embed_and_logits_batch', ['data', 'stage', 'fsdp']],
  ['activation_embed_and_logits_batch_sequence', ['data', 'stage', 'fsdp', 'expert']],
  # Vocab Weights
  ['vocab', []],
  ['embed_vocab', ['fsdp', 'expert']],
  # ==========================================
  # Attention
  # ==========================================
  # Attention Activations
  ['activation_heads', []],
  ['activation_kv_heads', []],
  ['activation_attn_length', ['expert']],
  ['activation_q_length', ['expert']],
  ['activation_kv_length', []],
  ['activation_attn_embed', []],
  ['activation_kv', []],
  ['activation_kv_batch', ['data', 'fsdp']],
  ['activation_kv_head_dim', []],
  # Attention Weights
  ['heads', []],
  ['q_heads', []],
  ['kv_heads', []],
  ['qkv', []],
  ['kv', []],
  ['kv_head_dim', []],
  ['q_lora', ['fsdp']],
  ['q_lora_up_proj', []],
  ['kv_lora', ['fsdp']],
  ['kv_lora_up_proj', []],
  # ==========================================
  # Mixture of Experts (MoE)
  # ==========================================
  # MoE Activations
  ['activation_batch_moe', ['data', 'fsdp']],
  ['activation_exp', ['expert']],
  # MoE Weights
  ['exp', 'expert'],
  ['embed_moe', ['fsdp']],
  # ==========================================
  # Standard MLP / Dense Layers / Model Structure
  # ==========================================
  # Dense Activations
  ['activation_mlp', []],
  ['activation_batch', ['data', 'fsdp']],
  ['activation_length', ['expert']],
  ['activation_norm_length', ['expert']],
  ['activation_embed', []],
  ['activation_stage', 'stage'],
  # General Weights
  ['mlp', []],
  ['layers', 'stage'],
  ['embed', ['fsdp', 'expert']],
]
```

Review comments

Collaborator: activation_batch is also used for attention and is a key dimension. Maybe note this as a comment in the attention section above.

Author: Good point! Added comments in …
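As a rough illustration of how a `logical_axis_rules` list like the one above is consumed, the sketch below (a hypothetical helper, not MaxText's actual implementation) resolves a logical axis name to the mesh axes it is sharded over. The `RULES` subset is copied from the config; `resolve` is an illustrative name.

```python
# Hypothetical sketch of logical-to-physical axis resolution; not MaxText's
# actual implementation. Each rule pairs a logical axis name with zero or
# more mesh axes (an empty list means the dimension is replicated).
RULES = [
    ('activation_embed_and_logits_batch', ['data', 'stage', 'fsdp']),
    ('activation_attn_length', ['expert']),
    ('exp', 'expert'),   # a bare string is treated as a single mesh axis
    ('vocab', []),       # replicated
]

def resolve(logical_name, rules=RULES):
    """Return the list of mesh axes a logical axis is sharded over."""
    for name, axes in rules:
        if name == logical_name:
            # Normalize a bare string to a one-element list.
            return [axes] if isinstance(axes, str) else list(axes)
    return []  # unknown logical names default to replicated

print(resolve('activation_attn_length'))  # ['expert']
print(resolve('vocab'))                   # [] (replicated)
```

Under this reading, the "ep-as-cp" idea is visible directly in the rules: sequence-length activations such as `activation_attn_length` resolve to the `expert` mesh axis.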
Collaborator: What values can this take? Can we remove this field, since it is implied by the logical axis rules? E.g. we need a function that takes the rules as input and outputs the value of context_sharding.

Collaborator: Could we list the other options, if any?

Author: It is hard to infer which physical axis is used for CP just from reading the logical rules. For example, both sequence and context are used to shard activation_length, but only context is used for data processing. I will add comments indicating the possible values of context_sharding and add checks.

Author: Done!
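The reviewer's suggestion of deriving `context_sharding` from the rules could look roughly like the sketch below. This is a hypothetical illustration, not code from the PR: the function name, the set of "context-carrying" logical axes, and the error check are all assumptions.

```python
# Hypothetical sketch of inferring context_sharding from logical_axis_rules,
# per the review discussion; not code from this PR.
LOGICAL_AXIS_RULES = [
    ('activation_attn_length', ['expert']),
    ('activation_q_length', ['expert']),
    ('activation_norm_length', ['expert']),
    ('activation_batch', ['data', 'fsdp']),
]

# Assumed set of logical axes that carry the sequence (context) dimension.
CONTEXT_LOGICAL_AXES = {
    'activation_attn_length', 'activation_q_length', 'activation_norm_length',
}

def infer_context_sharding(rules):
    """Return the mesh axis used to shard sequence-length logical axes.

    Raises ValueError if the rules shard context axes over more than one
    mesh axis -- the kind of consistency check the author proposed adding.
    """
    axes = set()
    for name, mesh_axes in rules:
        if name in CONTEXT_LOGICAL_AXES:
            axes.update([mesh_axes] if isinstance(mesh_axes, str) else mesh_axes)
    if len(axes) > 1:
        raise ValueError(f'ambiguous context sharding: {sorted(axes)}')
    return axes.pop() if axes else None

print(infer_context_sharding(LOGICAL_AXIS_RULES))  # expert
```

As the author notes, this inference is ambiguous in configs where several physical axes (e.g. both a sequence and a context axis) shard the same length dimension, which is why an explicit `context_sharding` field plus validation checks was kept instead.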