feat: add GPU topology discovery and unit tests#454

Merged
artulab merged 1 commit into main from aayildir/topology_discovery on Mar 17, 2026

Collaborator

@artulab artulab commented Mar 13, 2026

Motivation

The goal is to discover the physical GPU interconnect layout across a distributed cluster. The resulting topology map classifies every GPU pair into one of three communication tiers: intra-node (e.g. IPC, NVLink, PCIe), intra-rack fabric (e.g. multi-node NVSwitch), or inter-node RDMA (e.g. InfiniBand), so that the Iris memory manager can pick the best transport for each pair.
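
A minimal sketch of the three-tier rule described above, not the code from this PR. `Tier`, `RankInfo`, and `classify` are hypothetical names, and the sketch assumes that a hostname identifies a node and that a missing fabric domain means no multi-node fabric was detected.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    """Hypothetical names for the three communication tiers."""
    INTRA_NODE = "intra_node"
    FABRIC = "fabric"
    RDMA = "rdma"


@dataclass(frozen=True)
class RankInfo:
    """Illustrative per-rank record; the field names are assumptions."""
    rank: int
    hostname: str
    fabric_domain: Optional[str] = None  # None: no multi-node fabric detected


def classify(a: RankInfo, b: RankInfo) -> Tier:
    """Pick the closest tier that two ranks share."""
    if a.hostname == b.hostname:
        return Tier.INTRA_NODE
    # Two ranks without a fabric domain must never be grouped together.
    if a.fabric_domain is not None and a.fabric_domain == b.fabric_domain:
        return Tier.FABRIC
    return Tier.RDMA


r0 = RankInfo(0, "node-a", "domain-1")
r1 = RankInfo(1, "node-a", "domain-1")
r2 = RankInfo(2, "node-b", "domain-1")
r3 = RankInfo(3, "node-c")
r4 = RankInfo(4, "node-d")
```

With these five ranks, `classify(r0, r1)` is intra-node, `classify(r0, r2)` is fabric, and both `classify(r0, r3)` and `classify(r3, r4)` fall back to RDMA.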

Technical Details

Add multi-GPU topology discovery for AMD xGMI and NVIDIA NVLink fabrics, including vendor-agnostic fabric domain identification and interconnect-level classification.
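
One way a vendor-agnostic fabric domain key could look, as a hedged sketch. `FabricKey` and its fields `cluster_id` and `partition_id` are hypothetical and are not taken from `iris/topology.py`. The sketch assumes that each vendor backend fills the same two strings and that an empty key never matches anything, including another empty key.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FabricKey:
    """Hypothetical vendor-neutral fabric identity (field names are assumptions)."""
    cluster_id: str = ""
    partition_id: str = ""

    @property
    def is_empty(self) -> bool:
        # No cluster id means the GPU is not part of a multi-node fabric.
        return not self.cluster_id

    def same_domain(self, other: "FabricKey") -> bool:
        # Empty keys are deliberately unequal so standalone GPUs are not grouped.
        if self.is_empty or other.is_empty:
            return False
        return (self.cluster_id, self.partition_id) == (
            other.cluster_id,
            other.partition_id,
        )

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "FabricKey":
        # Missing keys fall back to the empty default instead of raising.
        return cls(
            cluster_id=data.get("cluster_id", ""),
            partition_id=data.get("partition_id", ""),
        )


key = FabricKey("cluster-1", "0")
restored = FabricKey.from_dict(key.to_dict())
```

Keeping the comparison in `same_domain` rather than relying on dataclass equality is a design choice in this sketch: two default-constructed keys still compare equal with `==`, but they are never treated as the same fabric domain.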

Test Plan

Tested via unit tests, and by running discovery on a single node with 4 GPUs and multiple NUMA domains to verify intra-node memory topology detection.

Test Result

Unit Tests

tests/unittests/test_topology.py::TestFabricInfo::test_empty_fabric_info PASSED                                                         [  1%]
tests/unittests/test_topology.py::TestFabricInfo::test_valid_fabric_info PASSED                                                         [  2%]
tests/unittests/test_topology.py::TestFabricInfo::test_domain_key_comparison PASSED                                                     [  4%]
tests/unittests/test_topology.py::TestFabricInfo::test_empty_domain_keys_are_not_equal PASSED                                           [  5%]
tests/unittests/test_topology.py::TestFabricInfo::test_serialization_roundtrip PASSED                                                   [  7%]
tests/unittests/test_topology.py::TestFabricInfo::test_from_dict_missing_keys PASSED                                                    [  8%]
tests/unittests/test_topology.py::TestGPUInfo::test_serialization_roundtrip PASSED                                                      [  9%]
tests/unittests/test_topology.py::TestGPUInfo::test_from_dict_does_not_mutate_input PASSED                                              [ 11%]
tests/unittests/test_topology.py::TestGPUInfo::test_from_dict_missing_fabric PASSED                                                     [ 12%]
tests/unittests/test_topology.py::TestNormalizePCIBusId::test_standard_format PASSED                                                    [ 14%]
tests/unittests/test_topology.py::TestNormalizePCIBusId::test_uppercase PASSED                                                          [ 15%]
tests/unittests/test_topology.py::TestNormalizePCIBusId::test_nvidia_8char_domain PASSED                                                [ 16%]
tests/unittests/test_topology.py::TestNormalizePCIBusId::test_prefix_junk PASSED                                                        [ 18%]
tests/unittests/test_topology.py::TestNormalizePCIBusId::test_no_match PASSED                                                           [ 19%]
tests/unittests/test_topology.py::TestNodeInfo::test_get_link_type_self PASSED                                                          [ 21%]
tests/unittests/test_topology.py::TestNodeInfo::test_get_link_type_out_of_bounds PASSED                                                 [ 22%]
tests/unittests/test_topology.py::TestNodeInfo::test_get_link_type_no_matrix PASSED                                                     [ 23%]
tests/unittests/test_topology.py::TestNodeInfo::test_p2p_access_out_of_bounds PASSED                                                    [ 25%]
tests/unittests/test_topology.py::TestNodeInfo::test_p2p_access_self_always_true PASSED                                                 [ 26%]
tests/unittests/test_topology.py::TestTopologyMap::test_same_rank_is_intra_node PASSED                                                  [ 28%]
tests/unittests/test_topology.py::TestTopologyMap::test_same_node_is_intra_node PASSED                                                  [ 29%]
tests/unittests/test_topology.py::TestTopologyMap::test_same_fabric_different_node_is_fabric PASSED                                     [ 30%]
tests/unittests/test_topology.py::TestTopologyMap::test_no_fabric_is_rdma PASSED                                                        [ 32%]
tests/unittests/test_topology.py::TestTopologyMap::test_node_peers PASSED                                                               [ 33%]
tests/unittests/test_topology.py::TestTopologyMap::test_fabric_domain_peers PASSED                                                      [ 35%]
tests/unittests/test_topology.py::TestTopologyMap::test_rdma_peers PASSED                                                               [ 36%]
tests/unittests/test_topology.py::TestTopologyMap::test_peer_groups_partition_world PASSED                                              [ 38%]
tests/unittests/test_topology.py::TestTopologyMap::test_comm_groups_intra_node PASSED                                                   [ 39%]
tests/unittests/test_topology.py::TestTopologyMap::test_comm_groups_fabric_includes_standalone PASSED                                   [ 40%]
tests/unittests/test_topology.py::TestTopologyMap::test_comm_groups_fabric_domain_group_content PASSED                                  [ 42%]
tests/unittests/test_topology.py::TestTopologyMap::test_comm_groups_fabric_is_not_empty_when_domains_exist PASSED                       [ 43%]
tests/unittests/test_topology.py::TestTopologyMap::test_comm_groups_rdma_is_world PASSED                                                [ 45%]
tests/unittests/test_topology.py::TestTopologyMap::test_heap_plan_completeness PASSED                                                   [ 46%]
tests/unittests/test_topology.py::TestTopologyMap::test_heap_plan_rank4 PASSED                                                          [ 47%]
tests/unittests/test_topology.py::TestTopologyMap::test_heap_plan_no_peer_overlap PASSED                                                [ 49%]
tests/unittests/test_topology.py::TestTopologyMap::test_summary_contains_all_nodes PASSED                                               [ 50%]
tests/unittests/test_topology.py::TestTopologyMap::test_ranks_for_fabric_domain PASSED                                                  [ 52%]
tests/unittests/test_topology.py::TestTopologyMap::test_ranks_for_nonexistent_domain PASSED                                             [ 53%]
tests/unittests/test_topology.py::TestOversubscription::test_num_gpus_is_physical_count PASSED                                          [ 54%]
tests/unittests/test_topology.py::TestOversubscription::test_link_type_by_gpu_id PASSED                                                 [ 56%]
tests/unittests/test_topology.py::TestOversubscription::test_p2p_by_gpu_id PASSED                                                       [ 57%]
tests/unittests/test_topology.py::TestOversubscription::test_all_ranks_are_node_peers PASSED                                            [ 59%]
tests/unittests/test_topology.py::TestIsolationCollapse::test_num_gpus_not_collapsed PASSED                                             [ 60%]
tests/unittests/test_topology.py::TestIsolationCollapse::test_both_ranks_are_node_peers PASSED                                          [ 61%]
tests/unittests/test_topology.py::TestNoFabricCluster::test_no_fabric_all_rdma PASSED                                                   [ 63%]
tests/unittests/test_topology.py::TestNoFabricCluster::test_fabric_peers_empty PASSED                                                   [ 64%]
tests/unittests/test_topology.py::TestNoFabricCluster::test_comm_groups_no_fabric_is_empty PASSED                                       [ 66%]
tests/unittests/test_topology.py::TestNoFabricCluster::test_comm_groups_intra_node_still_correct PASSED                                 [ 67%]
tests/unittests/test_topology.py::TestNoFabricCluster::test_comm_groups_rdma_still_covers_world PASSED                                  [ 69%]
tests/unittests/test_topology.py::TestNoFabricCluster::test_heap_plan_no_fabric_peers PASSED                                            [ 70%]
tests/unittests/test_topology.py::TestAllFabricCluster::test_comm_groups_fabric_spans_nodes PASSED                                      [ 71%]
tests/unittests/test_topology.py::TestAllFabricCluster::test_comm_groups_no_standalone_groups PASSED                                    [ 73%]
tests/unittests/test_topology.py::TestAllFabricCluster::test_comm_groups_intra_node_still_per_host PASSED                               [ 74%]
tests/unittests/test_topology.py::TestAllFabricCluster::test_heap_plan_fabric_peers_cross_node PASSED                                   [ 76%]
tests/unittests/test_topology.py::TestAllFabricCluster::test_interconnect_cross_node_is_fabric PASSED                                   [ 77%]
tests/unittests/test_topology.py::TestDistributed::test_all_gather_strings SKIPPED (No distributed process group)                       [ 78%]
tests/unittests/test_topology.py::TestDistributed::test_all_gather_strings_empty SKIPPED (No distributed process group)                 [ 80%]
tests/unittests/test_topology.py::TestDistributed::test_all_gather_strings_large_payload SKIPPED (No distributed process group)         [ 81%]
tests/unittests/test_topology.py::TestFullDiscovery::test_discover_returns_topology SKIPPED (No distributed process group)              [ 83%]
tests/unittests/test_topology.py::TestFullDiscovery::test_local_rank_is_unique_per_node SKIPPED (No distributed process group)          [ 84%]
tests/unittests/test_topology.py::TestFullDiscovery::test_own_rank_info_correct SKIPPED (No distributed process group)                  [ 85%]
tests/unittests/test_topology.py::TestFullDiscovery::test_interconnect_symmetry SKIPPED (No distributed process group)                  [ 87%]
tests/unittests/test_topology.py::TestFullDiscovery::test_peer_partition_exhaustive SKIPPED (No distributed process group)              [ 88%]
tests/unittests/test_topology.py::TestLogicalToPhysicalGpuIndex::test_no_env_var_returns_logical PASSED                                 [ 90%]
tests/unittests/test_topology.py::TestLogicalToPhysicalGpuIndex::test_nvidia_remapping PASSED                                           [ 91%]
tests/unittests/test_topology.py::TestLogicalToPhysicalGpuIndex::test_amd_hip_visible PASSED                                            [ 92%]
tests/unittests/test_topology.py::TestLogicalToPhysicalGpuIndex::test_amd_rocr_fallback PASSED                                          [ 94%]
tests/unittests/test_topology.py::TestLogicalToPhysicalGpuIndex::test_hip_takes_priority_over_rocr PASSED                               [ 95%]
tests/unittests/test_topology.py::TestLogicalToPhysicalGpuIndex::test_logical_out_of_range_returns_logical PASSED                       [ 97%]
tests/unittests/test_topology.py::TestLogicalToPhysicalGpuIndex::test_uuid_style_entry_returns_logical PASSED                           [ 98%]
tests/unittests/test_topology.py::TestLogicalToPhysicalGpuIndex::test_negative_index_passthrough PASSED                                 [100%]

======================================================== 63 passed, 8 skipped in 0.90s ========================================================
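
The `TestLogicalToPhysicalGpuIndex` case names in the log above suggest a mapping from a logical device index to a physical one through the visibility environment variables. The following is a sketch of that idea only. `logical_to_physical` and its `vendor` and `env` parameters are hypothetical, and the fallback behaviour is inferred from the test names, not from the implementation.

```python
import os
from typing import Mapping, Optional


def logical_to_physical(
    logical: int,
    vendor: str,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Map a logical GPU index to a physical index (illustrative only)."""
    env = os.environ if env is None else env
    # Assumed lookup order: HIP_VISIBLE_DEVICES wins over ROCR_VISIBLE_DEVICES.
    if vendor == "nvidia":
        names = ("CUDA_VISIBLE_DEVICES",)
    else:
        names = ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES")

    for name in names:
        value = env.get(name, "").strip()
        if not value:
            continue
        entries = [entry.strip() for entry in value.split(",")]
        # Out-of-range and negative indices pass through unchanged.
        if not 0 <= logical < len(entries):
            return logical
        entry = entries[logical]
        # UUID-style entries cannot be turned into an index here.
        return int(entry) if entry.isdecimal() else logical
    return logical
```

For example, with `CUDA_VISIBLE_DEVICES=2,3`, logical index 1 maps to physical GPU 3, while a UUID-style entry or an out-of-range index leaves the logical index unchanged. The sketch does not model how the ROCm runtime composes the two AMD variables when both are set.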

Submission Checklist

Copilot AI review requested due to automatic review settings March 13, 2026 19:56
@github-actions bot added the in-progress (We are working on it) and iris (Iris project issue) labels Mar 13, 2026
Contributor

Copilot AI left a comment

Pull request overview

Adds a new topology discovery module to classify GPU-GPU connectivity across nodes (intra-node, intra-rack fabric, RDMA) and introduces unit tests + optional dependency extras to validate/enable vendor-specific discovery.

Changes:

  • Added iris.topology with GPU/node/fabric discovery, peer grouping, and distribution planning.
  • Added extensive unit tests for serialization, peer partitioning, oversubscription/isolation scenarios, and (optionally) distributed execution.
  • Added optional dependency groups in pyproject.toml for vendor tooling.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 19 comments.

| File | Description |
| --- | --- |
| iris/topology.py | Implements topology discovery, fabric detection, intra-node topology parsing, peer grouping APIs. |
| tests/unittests/test_topology.py | Adds unit tests for new topology classes/utilities and optional distributed tests. |
| pyproject.toml | Adds optional extras for NVIDIA/AMD dependency hints. |

@artulab force-pushed the aayildir/topology_discovery branch from b803f28 to 1ab3581 on March 16, 2026 21:44
@artulab requested a review from Copilot on March 16, 2026 22:07
Contributor

Copilot AI left a comment

Pull request overview

Adds a new iris.topology module plus unit tests to discover and represent multi-GPU / multi-node interconnect topology (intra-node vs intra-rack fabric vs RDMA), including PCI-bus-ID normalization and peer grouping utilities used by Iris’ memory manager.

Changes:

  • Introduces iris/topology.py with topology discovery, serialization, PCI/UUID helpers, and peer-group planning.
  • Adds extensive unit tests for TopologyMap, oversubscription/isolation scenarios, and helper utilities.
  • Adds nvidia optional dependency for NVML access.
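
PCI bus ID normalization, mentioned in the overview above, usually exists because different tools report the same device in different spellings, for example an 8-digit domain from NVML versus the 4-digit form used in Linux sysfs. The sketch below shows one plausible approach under those assumptions. `normalize_pci_bus_id`, the regular expression, and the `None` return for unparseable input are hypothetical choices, not the PR's implementation.

```python
import re
from typing import Optional

# domain:bus:device.function, with a 4 to 8 hex-digit domain.
_BDF = re.compile(
    r"([0-9a-f]{4,8}):([0-9a-f]{2}):([0-9a-f]{2})\.([0-7])",
    re.IGNORECASE,
)


def normalize_pci_bus_id(raw: str) -> Optional[str]:
    """Return a lowercase 'dddd:bb:dd.f' string, or None if nothing matches."""
    match = _BDF.search(raw)  # search() tolerates junk before the address
    if match is None:
        return None
    domain, bus, device, function = match.groups()
    # Re-encode the domain so '00000000' and '0000' compare equal.
    return f"{int(domain, 16):04x}:{bus.lower()}:{device.lower()}.{function}"
```

Under this sketch, `"00000000:65:00.0"`, `"0000:65:00.0"`, and `"PCI:0000:65:00.0"` all normalize to `"0000:65:00.0"`, so GPUs reported by different sources can be matched by string equality.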

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 18 comments.

| File | Description |
| --- | --- |
| iris/topology.py | Implements topology discovery (NVML/AMDSMI hooks), data models, gather utilities, and peer/group planning. |
| tests/unittests/test_topology.py | Adds unit and (skipped) distributed tests covering serialization, classification, grouping, and edge cases. |
| pyproject.toml | Adds optional dependency group for NVIDIA NVML bindings. |

@artulab force-pushed the aayildir/topology_discovery branch from 1ab3581 to 33f9fc7 on March 17, 2026 01:37
@artulab requested a review from Copilot on March 17, 2026 01:38
Contributor

Copilot AI left a comment

Pull request overview

Adds multi-GPU topology discovery capabilities (fabric domain + intra-node link classification) and validates behavior via an extensive new unit test suite.

Changes:

  • Introduces iris.topology with GPU/node discovery, fabric-domain grouping, intra-node topology probing, and peer/group planning APIs.
  • Adds comprehensive unit tests covering serialization, PCI normalization, peer partitions, oversubscription/isolation scenarios, and (conditionally) distributed discovery.
  • Adds optional dependency extras for NVIDIA/AMD environments.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| iris/topology.py | New topology discovery implementation (GPU info, intra-node links, fabric domains, peer tiers, comm groups). |
| tests/unittests/test_topology.py | New unit + (skipped) distributed tests for topology discovery and helper utilities. |
| pyproject.toml | Adds iris[nvidia]/iris[amd] optional dependency groups for vendor tooling. |

@artulab force-pushed the aayildir/topology_discovery branch from 33f9fc7 to 48e6bd0 on March 17, 2026 02:00
@artulab requested a review from Copilot on March 17, 2026 02:02
@artulab force-pushed the aayildir/topology_discovery branch from 48e6bd0 to ccd72d0 on March 17, 2026 02:05
Contributor

Copilot AI left a comment

Pull request overview

Adds multi-GPU topology discovery intended to classify GPU-to-GPU connectivity across nodes (intra-node vs fabric vs RDMA) so Iris can choose optimal transports.

Changes:

  • Introduces iris.topology with data models and discovery logic (PCI/UUID probing, intra-node topology parsing, fabric-domain grouping, peer grouping, and planning utilities).
  • Adds a comprehensive unit test suite covering serialization, peer grouping, oversubscription/isolation scenarios, and basic distributed helpers.
  • Adds optional extras for NVIDIA tooling in pyproject.toml.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| iris/topology.py | Implements topology discovery, fabric-domain classification, and helper utilities (PCI normalization, NVML/AMDSMI probing, all-gather). |
| tests/unittests/test_topology.py | Adds unit tests for topology map behavior and (currently skipped) distributed/e2e discovery tests. |
| pyproject.toml | Adds optional dependency group for NVIDIA NVML Python bindings. |

   Add support for multi-GPU topology discovery with support
   for AMD xGMI and NVIDIA NVLink fabrics. Includes vendor-agnostic
   fabric domain identification and interconnect level classification.
@artulab force-pushed the aayildir/topology_discovery branch from ccd72d0 to e32be34 on March 17, 2026 02:19
@artulab requested a review from Copilot on March 17, 2026 02:21
Contributor

Copilot AI left a comment

Pull request overview

Adds multi-GPU cluster topology discovery to classify GPU-to-GPU communication into intra-node, intra-rack fabric, and RDMA tiers, along with a comprehensive unit test suite.

Changes:

  • Introduces iris.topology discovery logic, GPU/node/fabric data models, and peer/group planning APIs.
  • Adds unit tests covering serialization, PCI bus normalization, peer grouping, oversubscription/isolation scenarios, and distributed discovery behavior (skipped when not configured).
  • Adds optional dependency groups for vendor-specific tooling.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| iris/topology.py | Implements topology discovery, fabric domain modeling, PCI/UUID helpers, and grouping/planning utilities. |
| tests/unittests/test_topology.py | Adds unit tests for topology map behavior, regression scenarios, and distributed-only routines. |
| pyproject.toml | Adds optional dependency groups for NVIDIA/AMD environments. |

Collaborator

@mawad-amd mawad-amd left a comment

Looks good. Thanks, Ahmet.

@artulab merged commit 45c8656 into main on Mar 17, 2026
116 of 120 checks passed
@artulab deleted the aayildir/topology_discovery branch on March 17, 2026 19:33
3 participants