[Feature][AutoDeploy]: Improved heuristics to detect shardable regions

### 🚀 The feature, motivation and pitch

The current TP sharding heuristic for detecting shardable regions (`identify_regions_between_residuals`) fails to properly distinguish layer types (attention, MoE, MLP, SSM) for some models: Nemotron, Llama 4, Phi.

Sharding the model efficiently requires head parallelism, and applying head parallelism requires a more robust way to extract the individual layers.
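To make the request concrete, here is a toy sketch of the general idea: split a flat list of graph nodes at residual additions, label each region by the ops it contains, and assign attention heads to ranks. It is not the AutoDeploy implementation. `Node`, `split_at_residuals`, `classify_region`, `heads_for_rank`, the `is_residual` flag, and the op labels are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a graph node (not a TensorRT-LLM type)."""
    name: str
    op: str                    # assumed labels: "linear", "attention", "topk", ...
    is_residual: bool = False  # assumed flag marking a residual addition


def split_at_residuals(nodes):
    """Group nodes into regions, closing a region at each residual add."""
    regions, current = [], []
    for node in nodes:
        if node.is_residual:
            if current:
                regions.append(current)
            current = []
        else:
            current.append(node)
    if current:
        regions.append(current)
    return regions


def classify_region(region):
    """Label a region by the most specific op it contains (assumed rules)."""
    ops = {node.op for node in region}
    if "attention" in ops:
        return "attention"
    if "ssm_scan" in ops:
        return "ssm"
    if "topk" in ops:  # assume a top-k router signals an MoE block
        return "moe"
    if "linear" in ops:
        return "mlp"
    return "unknown"


def heads_for_rank(num_heads, world_size, rank):
    """Return the contiguous head indices owned by `rank` under head parallelism."""
    if num_heads % world_size != 0:
        raise ValueError("num_heads must be divisible by world_size")
    per_rank = num_heads // world_size
    return list(range(rank * per_rank, (rank + 1) * per_rank))


# A two-block toy graph: attention followed by an MLP.
toy_graph = [
    Node("norm1", "norm"),
    Node("q_proj", "linear"),
    Node("k_proj", "linear"),
    Node("v_proj", "linear"),
    Node("attn", "attention"),
    Node("o_proj", "linear"),
    Node("add1", "add", is_residual=True),
    Node("norm2", "norm"),
    Node("up_proj", "linear"),
    Node("act", "silu"),
    Node("down_proj", "linear"),
    Node("add2", "add", is_residual=True),
]

layer_kinds = [classify_region(r) for r in split_at_residuals(toy_graph)]
rank1_heads = heads_for_rank(num_heads=32, world_size=4, rank=1)
```

In this sketch, head parallelism depends on the classification step: a rank can only take a contiguous slice of heads once a region has been reliably labeled as attention, which is why a more robust layer extraction is a prerequisite.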

### Alternatives

_No response_

### Additional context

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and checked the [documentation](https://nvidia.github.io/TensorRT-LLM/) and [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for answers to frequently asked questions.

[Feature][AutoDeploy]: Improved heuristics to detect shardable regions #8946
