feat: add GPU topology discovery and unit tests #454
Pull request overview
Adds a new topology discovery module to classify GPU-GPU connectivity across nodes (intra-node, intra-rack fabric, RDMA) and introduces unit tests + optional dependency extras to validate/enable vendor-specific discovery.
Changes:
- Added `iris.topology` with GPU/node/fabric discovery, peer grouping, and distribution planning.
- Added extensive unit tests for serialization, peer partitioning, oversubscription/isolation scenarios, and (optionally) distributed execution.
- Added optional dependency groups in `pyproject.toml` for vendor tooling.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 19 comments.
| File | Description |
|---|---|
| `iris/topology.py` | Implements topology discovery, fabric detection, intra-node topology parsing, peer grouping APIs. |
| `tests/unittests/test_topology.py` | Adds unit tests for new topology classes/utilities and optional distributed tests. |
| `pyproject.toml` | Adds optional extras for NVIDIA/AMD dependency hints. |
b803f28 to 1ab3581
Pull request overview
Adds a new iris.topology module plus unit tests to discover and represent multi-GPU / multi-node interconnect topology (intra-node vs intra-rack fabric vs RDMA), including PCI-bus-ID normalization and peer grouping utilities used by Iris’ memory manager.
Changes:
- Introduces `iris/topology.py` with topology discovery, serialization, PCI/UUID helpers, and peer-group planning.
- Adds extensive unit tests for `TopologyMap`, oversubscription/isolation scenarios, and helper utilities.
- Adds `nvidia` optional dependency for NVML access.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| `iris/topology.py` | Implements topology discovery (NVML/AMDSMI hooks), data models, gather utilities, and peer/group planning. |
| `tests/unittests/test_topology.py` | Adds unit and (skipped) distributed tests covering serialization, classification, grouping, and edge cases. |
| `pyproject.toml` | Adds optional dependency group for NVIDIA NVML bindings. |
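
The PCI-bus-ID normalization mentioned in this overview can be illustrated with a minimal sketch. It assumes the goal is to reconcile the 8-hex-digit domain form that NVML typically reports (`00000000:3B:00.0`) with the lowercase 4-digit form found in sysfs (`0000:3b:00.0`). The function name `normalize_pci_bus_id` and the chosen canonical form are hypothetical and are not taken from `iris/topology.py`.

```python
import re

# Optional domain, then bus:device.function, all hexadecimal.
_PCI_RE = re.compile(
    r"^(?:([0-9a-fA-F]{1,8}):)?([0-9a-fA-F]{1,2}):([0-9a-fA-F]{1,2})\.([0-7])$"
)


def normalize_pci_bus_id(raw: str) -> str:
    """Return a lowercase 'dddd:bb:dd.f' string (hypothetical canonical form)."""
    match = _PCI_RE.match(raw.strip())
    if match is None:
        raise ValueError(f"unrecognized PCI bus ID: {raw!r}")
    domain, bus, device, function = match.groups()
    # A missing domain is treated as domain 0 (an assumption of this sketch).
    domain_value = int(domain, 16) if domain else 0
    return f"{domain_value:04x}:{int(bus, 16):02x}:{int(device, 16):02x}.{function}"
```

With a canonical form like this, bus IDs obtained from different sources can be compared by plain string equality.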
1ab3581 to 33f9fc7
Pull request overview
Adds multi-GPU topology discovery capabilities (fabric domain + intra-node link classification) and validates behavior via an extensive new unit test suite.
Changes:
- Introduces `iris.topology` with GPU/node discovery, fabric-domain grouping, intra-node topology probing, and peer/group planning APIs.
- Adds comprehensive unit tests covering serialization, PCI normalization, peer partitions, oversubscription/isolation scenarios, and (conditionally) distributed discovery.
- Adds optional dependency extras for NVIDIA/AMD environments.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `iris/topology.py` | New topology discovery implementation (GPU info, intra-node links, fabric domains, peer tiers, comm groups). |
| `tests/unittests/test_topology.py` | New unit + (skipped) distributed tests for topology discovery and helper utilities. |
| `pyproject.toml` | Adds `iris[nvidia]`/`iris[amd]` optional dependency groups for vendor tooling. |
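
Because the vendor tooling is an optional extra, discovery code has to tolerate its absence. The sketch below shows one way to do that with the NVML Python bindings (`pynvml`): it returns an empty list when the bindings or the driver library are unavailable. The function name `list_gpu_bus_ids` is hypothetical, and the fallback behavior is an assumption rather than what `iris/topology.py` actually does.

```python
from typing import List


def list_gpu_bus_ids() -> List[str]:
    """Return PCI bus IDs of visible NVIDIA GPUs, or [] if NVML is unusable."""
    try:
        import pynvml  # provided by the optional NVIDIA extra
    except ImportError:
        return []
    try:
        pynvml.nvmlInit()
    except Exception:
        # Bindings are installed but the driver library is missing or unusable.
        return []
    try:
        bus_ids = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            bus_id = pynvml.nvmlDeviceGetPciInfo(handle).busId
            # Older bindings return bytes, newer ones return str.
            if isinstance(bus_id, bytes):
                bus_id = bus_id.decode()
            bus_ids.append(bus_id)
        return bus_ids
    finally:
        pynvml.nvmlShutdown()
```

Returning an empty list instead of raising lets callers fall back to another probe (for example an AMD-specific one) without special-casing the import.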
33f9fc7 to 48e6bd0
48e6bd0 to ccd72d0
Pull request overview
Adds multi-GPU topology discovery intended to classify GPU-to-GPU connectivity across nodes (intra-node vs fabric vs RDMA) so Iris can choose optimal transports.
Changes:
- Introduces `iris.topology` with data models and discovery logic (PCI/UUID probing, intra-node topology parsing, fabric-domain grouping, peer grouping, and planning utilities).
- Adds a comprehensive unit test suite covering serialization, peer grouping, oversubscription/isolation scenarios, and basic distributed helpers.
- Adds optional extras for NVIDIA tooling in `pyproject.toml`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `iris/topology.py` | Implements topology discovery, fabric-domain classification, and helper utilities (PCI normalization, NVML/AMDSMI probing, all-gather). |
| `tests/unittests/test_topology.py` | Adds unit tests for topology map behavior and (currently skipped) distributed/e2e discovery tests. |
| `pyproject.toml` | Adds optional dependency group for NVIDIA NVML Python bindings. |
Add support for multi-GPU topology discovery with support for AMD xGMI and NVIDIA NVLink fabrics. Includes vendor-agnostic fabric domain identification and interconnect level classification.
ccd72d0 to e32be34
Pull request overview
Adds multi-GPU cluster topology discovery to classify GPU-to-GPU communication into intra-node, intra-rack fabric, and RDMA tiers, along with a comprehensive unit test suite.
Changes:
- Introduces `iris.topology` discovery logic, GPU/node/fabric data models, and peer/group planning APIs.
- Adds unit tests covering serialization, PCI bus normalization, peer grouping, oversubscription/isolation scenarios, and distributed discovery behavior (skipped when not configured).
- Adds optional dependency groups for vendor-specific tooling.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `iris/topology.py` | Implements topology discovery, fabric domain modeling, PCI/UUID helpers, and grouping/planning utilities. |
| `tests/unittests/test_topology.py` | Adds unit tests for topology map behavior, regression scenarios, and distributed-only routines. |
| `pyproject.toml` | Adds optional dependency groups for NVIDIA/AMD environments. |
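
The peer grouping described in this overview can be sketched as a partition of all other ranks into the three tiers, seen from one rank. This is a minimal illustration under assumed inputs: each rank is described only by a hostname and an optional fabric-domain identifier, and the names `RankInfo` and `partition_peers` are hypothetical rather than part of the `iris.topology` API.

```python
from typing import Dict, List, Optional, Tuple

# rank -> (hostname, fabric_domain); fabric_domain is None when the GPU
# is not part of any multi-node fabric (an assumption of this sketch).
RankInfo = Dict[int, Tuple[str, Optional[str]]]


def partition_peers(ranks: RankInfo, me: int) -> Dict[str, List[int]]:
    """Split every rank other than `me` into one of three tiers."""
    my_host, my_domain = ranks[me]
    tiers: Dict[str, List[int]] = {
        "intra_node": [],
        "intra_rack_fabric": [],
        "rdma": [],
    }
    for rank in sorted(ranks):
        if rank == me:
            continue
        host, domain = ranks[rank]
        if host == my_host:
            tiers["intra_node"].append(rank)
        elif my_domain is not None and domain == my_domain:
            tiers["intra_rack_fabric"].append(rank)
        else:
            tiers["rdma"].append(rank)
    return tiers
```

Checking the hostname before the fabric domain keeps same-node peers in the fastest tier even when they also share a fabric domain.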
mawad-amd left a comment
Looks good. Thanks, Ahmet.
Motivation
The goal is to discover the physical GPU interconnect layout across a distributed cluster. The discovery map classifies every GPU pair into one of three communication tiers: intra-node (IPC/NVLink, PCIe, etc.), intra-rack fabric (e.g., multi-node NVSwitch), or inter-node RDMA (e.g., InfiniBand), so that the Iris memory manager can pick the optimal transport for each GPU pair.
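
The three-tier classification can be sketched as a small decision function. This is an illustration only, assuming each GPU is described by its hostname and an optional fabric-domain identifier; the names `Tier`, `GpuLocation`, and `classify_pair` are hypothetical and do not come from `iris/topology.py`.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    INTRA_NODE = "intra_node"                # same host: IPC over NVLink/PCIe
    INTRA_RACK_FABRIC = "intra_rack_fabric"  # different hosts, shared fabric
    RDMA = "rdma"                            # everything else


@dataclass(frozen=True)
class GpuLocation:
    hostname: str
    fabric_domain: Optional[str]  # None if the GPU is not on a multi-node fabric


def classify_pair(a: GpuLocation, b: GpuLocation) -> Tier:
    """Pick the fastest tier that both GPUs can use (assumed ordering)."""
    if a.hostname == b.hostname:
        return Tier.INTRA_NODE
    if a.fabric_domain is not None and a.fabric_domain == b.fabric_domain:
        return Tier.INTRA_RACK_FABRIC
    return Tier.RDMA
```

Treating a missing fabric domain as "no shared fabric" means two GPUs without fabric information fall back to RDMA rather than being matched on `None`.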
Technical Details
Add multi-GPU topology discovery with support for AMD xGMI and NVIDIA NVLink fabrics. Includes vendor-agnostic fabric-domain identification and interconnect-level classification.
Test Plan
Tested via unit tests, and by running on a single node with 4 GPUs to verify topology detection across multiple NUMA domains and GPUs.
Test Result
Unit Tests