Labels
AutoDeploy (<NV> AutoDeploy Backend), feature request (New feature or request. This includes new model, dtype, functionality support)
Description
🚀 The feature, motivation and pitch
The current TP sharding heuristic for detecting shardable regions (identify_regions_between_residuals) fails to properly distinguish layer types (attention/MoE/MLP/SSM) for some models: Nemotron, Llama4, Phi.
To shard the model efficiently, we need head parallelism, and applying head parallelism requires a more robust way to extract layers.
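To make the request concrete, here is a minimal, library-free sketch of one possible approach: split a toy op sequence at residual adds, then label each region by a marker op. Everything in it is an assumption for illustration only (the op names `sdpa`, `ssm_scan`, `topk_router`, `residual_add`, and the helpers `split_between_residuals`, `classify`, `extract_layers`); it is not the existing AutoDeploy implementation and does not reflect how identify_regions_between_residuals works internally.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical marker ops, checked in priority order. A real graph would
# match on actual operator targets rather than plain strings.
MARKERS = [
    ("attention", {"sdpa"}),
    ("ssm", {"ssm_scan"}),
    ("moe", {"topk_router"}),
    ("mlp", {"linear"}),  # fallback: linears with no other marker
]


@dataclass
class Region:
    kind: str
    ops: List[str]


def split_between_residuals(ops, residual_op="residual_add"):
    """Cut the op sequence into regions delimited by residual adds."""
    regions, current = [], []
    for op in ops:
        if op == residual_op:
            if current:
                regions.append(current)
            current = []
        else:
            current.append(op)
    if current:  # trailing ops after the last residual add
        regions.append(current)
    return regions


def classify(ops):
    """Return the first layer kind whose marker op appears in the region."""
    present = set(ops)
    for kind, markers in MARKERS:
        if present & markers:
            return kind
    return "unknown"


def extract_layers(ops):
    return [Region(classify(r), r) for r in split_between_residuals(ops)]


# Toy model: attention, MoE, SSM and MLP blocks, each closed by a residual add.
toy_graph = [
    "norm", "linear", "sdpa", "linear", "residual_add",
    "norm", "topk_router", "linear", "linear", "residual_add",
    "norm", "linear", "ssm_scan", "linear", "residual_add",
    "norm", "linear", "act", "linear", "residual_add",
]
kinds = [r.kind for r in extract_layers(toy_graph)]
print(kinds)
```

Knowing the layer kind per region is what would let a sharding pass choose head parallelism for attention regions and a different strategy for MoE, MLP, or SSM regions. A residual-only split like this one would still misfire on models whose blocks do not end in a single residual add, which is the robustness gap this issue is about.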
Alternatives
No response
Additional context
No response
Metadata
Status
In review