Fix test_cuda_device_order on some multi-GPU systems #1590
mdboom merged 3 commits into NVIDIA:main from
Conversation
Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually.

/ok to test
Pull request overview
Adjusts the NVML/CUDA device-order test to avoid failures on multi-GPU systems where device visibility differs depending on CUDA_VISIBLE_DEVICES.
Changes:
- Updates test_cuda_device_order to accept the monkeypatch fixture.
- Deletes CUDA_VISIBLE_DEVICES during the test run before querying CUDA/NVML devices.
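The two changes above can be sketched as follows. This is a hedged reconstruction, not the actual test from tests/nvml/test_cuda.py: the helpers get_cuda_devices and get_nvml_devices are hypothetical stand-ins for the real CUDA/NVML enumeration, with hard-coded data so the sketch is self-contained.

```python
def get_cuda_devices():
    # Hypothetical stand-in; the real test enumerates via the CUDA driver API.
    return [{"id": 75, "name": "NVIDIA A10G"}]

def get_nvml_devices():
    # Hypothetical stand-in; the real test enumerates via NVML.
    return [{"id": 75, "name": "NVIDIA A10G"}]

def test_cuda_device_order(monkeypatch):
    # Delete CUDA_VISIBLE_DEVICES so enumeration is not filtered by the
    # caller's environment; raising=False tolerates an already-unset variable.
    monkeypatch.delenv("CUDA_VISIBLE_DEVICES", raising=False)
    cuda_devices = get_cuda_devices()
    nvml_devices = get_nvml_devices()
    # NVML always sees all devices, so CUDA can report at most as many,
    # and each CUDA device should still appear in the NVML list.
    assert len(cuda_devices) <= len(nvml_devices)
    for cuda_device in cuda_devices:
        assert cuda_device in nvml_devices, f"CUDA device {cuda_device} not found in NVML device list"
```

Accepting the monkeypatch fixture (rather than mutating os.environ directly) means pytest restores the variable automatically after the test, so other tests still see the caller's original environment.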
/ok to test
# and each of them should still be found in NVML devices.
assert len(cuda_devices) <= len(nvml_devices)
for cuda_device in cuda_devices:
    assert cuda_device in nvml_devices, f"CUDA device {cuda_device} not found in NVML device list"
Give me a sec to experiment: does the f-string here suppress the helpful pytest default behavior or not?
This is great as-is.
I hacked the test so that it fails on my workstation. This is the diff with/OUT the f-string:
(TestVenv) smc120-0009.ipp2a2.colossus.nvidia.com:/wrk/forked/cuda-python/cuda_bindings $ diff -u $Z/withOUT_fstring $Z/with_fstring
--- /wrk/z/withOUT_fstring 2026-02-09 08:59:27.153944722 -0800
+++ /wrk/z/with_fstring 2026-02-09 08:59:11.085012520 -0800
@@ -2,7 +2,7 @@
platform linux -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /wrk/forked/cuda-python/TestVenv/bin/python
cachedir: .pytest_cache
benchmark: 5.2.3 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
-Using --randomly-seed=1514160743
+Using --randomly-seed=3206929400
rootdir: /wrk/forked/cuda-python/cuda_bindings
configfile: pyproject.toml
plugins: repeat-0.9.4, benchmark-5.2.3, mock-3.15.1, randomly-4.0.1
@@ -25,8 +25,9 @@
# and each of them should still be found in NVML devices.
assert len(cuda_devices) <= len(nvml_devices)
for cuda_device in cuda_devices:
-> assert cuda_device not in nvml_devices
-E AssertionError: assert {'id': 75, 'name': 'NVIDIA A10G'} not in [{'id': 75, 'name': 'NVIDIA A10G'}]
+> assert cuda_device not in nvml_devices, f"CUDA device {cuda_device} not found in NVML device list"
+E AssertionError: CUDA device {'name': 'NVIDIA A10G', 'id': 75} not found in NVML device list
+E assert {'id': 75, 'name': 'NVIDIA A10G'} not in [{'id': 75, 'name': 'NVIDIA A10G'}]
cuda_device = {'id': 75, 'name': 'NVIDIA A10G'}
cuda_devices = [{'id': 75, 'name': 'NVIDIA A10G'}]
@@ -34,5 +35,5 @@
tests/nvml/test_cuda.py:67: AssertionError
=========================== short test summary info ============================
-FAILED tests/nvml/test_cuda.py::test_cuda_device_order - AssertionError: asse...
-============================== 1 failed in 0.38s ===============================
+FAILED tests/nvml/test_cuda.py::test_cuda_device_order - AssertionError: CUDA...
+============================== 1 failed in 0.41s ===============================
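The diff above answers the question: the f-string message is additive. Pytest prints the custom message as the first AssertionError line and still emits its introspected comparison underneath. Outside pytest, plain Python simply attaches the message to the exception, as this minimal check shows (the device dict mirrors the example above; the empty NVML list is deliberate so the assertion fails):

```python
cuda_device = {"id": 75, "name": "NVIDIA A10G"}
nvml_devices = []  # empty on purpose so the assertion fails

try:
    assert cuda_device in nvml_devices, f"CUDA device {cuda_device} not found in NVML device list"
except AssertionError as exc:
    # The f-string becomes the exception's message; pytest's assertion
    # rewriting then appends its own introspection line below it.
    message = str(exc)

assert "not found in NVML device list" in message
assert "NVIDIA A10G" in message
```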
rwgk left a comment
LGTM except for the suspected possible/possibly typo.
Co-authored-by: Ralf W. Grosse-Kunstleve <rwgkio@gmail.com>
/ok to test
The CUDA_VISIBLE_DEVICES environment variable controls whether all devices are included in cuDeviceGetCount, or just those that are available for CUDA compute. We should make the test adaptable to either case.
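The filtering described above can be illustrated with a small model. This is a hypothetical sketch of the variable's effect, not the CUDA driver's actual parsing logic (which has additional rules, e.g. for invalid ordinals and UUIDs): when the variable is unset every device is enumerable, and when set only the listed comma-separated ordinals are visible.

```python
def visible_device_count(total_devices, cuda_visible_devices=None):
    # Hypothetical model of CUDA_VISIBLE_DEVICES filtering: unset means all
    # devices are visible; set means only the listed ordinals that actually
    # exist are counted by enumeration.
    if cuda_visible_devices is None:
        return total_devices
    ordinals = [tok.strip() for tok in cuda_visible_devices.split(",") if tok.strip()]
    return sum(1 for tok in ordinals if tok.isdigit() and int(tok) < total_devices)

assert visible_device_count(4) == 4         # unset: all four devices visible
assert visible_device_count(4, "0,2") == 2  # set: only the two listed ordinals
```

Deleting the variable before the test queries devices (as this PR does) puts every system in the "unset" case, so the CUDA and NVML device counts can be compared without depending on the caller's environment.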