
Conversation

@leofang (Member) commented Nov 16, 2025

Description

Closes #462
Closes #984
Closes #985
Closes #1024
Closes #1241

Expand the Windows test matrix to cover:

  • 6 Python versions
  • 2 CUDA versions
  • 3 driver modes
    • datacenter and Quadro cards cover TCC & MCDM
    • gaming cards cover WDDM & MCDM
  • 1 driver version
    • unlike on Linux, on Windows we can control which driver version we install for now, so I'll work on expanding this next
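The combinations implied by the matrix above can be sketched as a quick enumeration. This is purely illustrative; the concrete version lists are placeholders, not the PR's actual values:

```python
# Hypothetical sketch of the expanded Windows test matrix described above.
# The version strings are assumed placeholders, not the PR's real entries.
import itertools

python_versions = ["3.9", "3.10", "3.11", "3.12", "3.13", "3.14"]  # 6 Python versions (assumed)
cuda_versions = ["12.x", "13.x"]                                   # 2 CUDA versions
driver_modes = ["TCC", "MCDM", "WDDM"]                             # 3 driver modes

# Full cross product before any per-GPU exclusions (not every card
# supports every mode, as noted above).
matrix = list(itertools.product(python_versions, cuda_versions, driver_modes))
print(len(matrix))  # 36
```

In practice a CI matrix like this gets pruned per runner, since each card class only supports a subset of the driver modes.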

With this PR we can reproduce nvbugs 5630448, and I worked out a fix for it.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copilot AI and others added 8 commits November 16, 2025 19:57
Co-authored-by: leofang <5534781+leofang@users.noreply.github.com>
…r modes
…r mode support
@copy-pr-bot (Contributor) bot commented Nov 16, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@leofang leofang changed the title Copilot/move install gpu driver script Expand Windows test matrix Nov 16, 2025
Copilot AI and others added 2 commits November 16, 2025 20:50
@leofang (Member, Author) commented Nov 16, 2025

/ok to test 16b0e3f

@leofang leofang self-assigned this Nov 16, 2025
@leofang leofang added P0 High priority - Must do! CI/CD CI/CD infrastructure labels Nov 16, 2025
@leofang leofang added this to the cuda-python 13-next, 12-next milestone Nov 16, 2025
@leofang leofang added the enhancement Any code-related improvements label Nov 16, 2025

@rwgk (Collaborator) commented Nov 16, 2025

Awesome, thanks!

leofang and others added 2 commits November 16, 2025 16:35
- we do not have access to rtx6000ada
- rtxpro6000 is a datacenter card
- cover WDDM in at least 2 pipelines
… different modes

rtx2080, rtx4090, rtxpro6000, v100, a100, l4 (t4 nodes are too slow)
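The per-card driver-mode coverage described in the commit notes above could be expressed as a simple lookup. This is an illustrative sketch only; the mode sets follow the comments above (datacenter/Quadro cards cover TCC & MCDM, gaming cards cover WDDM & MCDM) and are not verified against the actual runner pools:

```python
# Illustrative mapping of the runner pools named above to the driver
# modes each card class can cover, per the commit notes (assumed, not
# taken from the PR's actual CI configuration).
DRIVER_MODES = {
    "rtx2080": {"WDDM", "MCDM"},    # gaming card
    "rtx4090": {"WDDM", "MCDM"},    # gaming card
    "rtxpro6000": {"TCC", "MCDM"},  # datacenter card (per the note above)
    "v100": {"TCC", "MCDM"},        # datacenter card
    "a100": {"TCC", "MCDM"},        # datacenter card
    "l4": {"TCC", "MCDM"},          # datacenter card
}

# Sanity check: every driver mode in the matrix is covered by at
# least one pool.
covered = set().union(*DRIVER_MODES.values())
assert covered == {"TCC", "WDDM", "MCDM"}
```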
@leofang (Member, Author) commented Nov 16, 2025

/ok to test f2ffbb1

@rwgk (Collaborator) commented Nov 17, 2025

Another small puzzle:

Here in the CI I see 2 failed tests:

https://github.com/NVIDIA/cuda-python/actions/runs/19411934928/job/55534848113

tests\test_memory.py::test_vmm_allocator_basic_allocation FAILED         [ 61%]
...
tests\test_memory.py::test_vmm_allocator_grow_allocation FAILED          [ 62%]
...
============ 2 failed, 522 passed, 68 skipped in 120.46s (0:02:00) ============

That's also what I saw on the TITAN RTX colossus workstation yesterday with the 591.32 driver.

I just installed the 591.34 driver and kitpick034 on my main workstation (not colossus), now there are 3 failed:

FAILED tests\test_memory.py::test_vmm_allocator_basic_allocation - cuda.core.experimental._utils.cuda_utils.CUDAError: CUDA_ERROR_INVALID_VALUE: This indicates that one or more of the parameters passed to the API call is not within an acceptable range of values.
FAILED tests\test_memory.py::test_vmm_allocator_policy_configuration - cuda.core.experimental._utils.cuda_utils.CUDAError: CUDA_ERROR_INVALID_VALUE: This indicates that one or more of the parameters passed to the API call is not within an acceptable range of values.
FAILED tests\test_memory.py::test_vmm_allocator_grow_allocation - cuda.core.experimental._utils.cuda_utils.CUDAError: CUDA_ERROR_INVALID_VALUE: This indicates that one or more of the parameters passed to the API call is not within an acceptable range of values.
============ 3 failed, 512 passed, 75 skipped in 69.32s (0:01:09) =============

@leofang (Member, Author) left a review comment

fix

Removed redundant 'Ensure GPU is working' step and kept the driver mode verification.
@rwgk (Collaborator) commented Nov 18, 2025

I tested commit da63359 here interactively on the TITAN RTX colossus workstation, with CTK 13.0.2. No failures anymore:

$ grep_pytest_summary qa_bindings_windows_2025-11-18+000725_tests_log.txt
qa_bindings_windows_2025-11-18+000725_tests_log.txt
82:rootdir: C:\Users\rgrossekunst\forked\cuda-python
217:======================= 76 passed, 2 skipped in 13.67s ========================
222:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_bindings
424:======================= 194 passed, 1 skipped in 15.04s =======================
429:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_bindings
631:======================= 194 passed, 1 skipped in 10.17s =======================
636:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_bindings\examples
710:============================= 13 passed in 23.19s =============================
715:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_bindings
730:============================== 9 passed in 1.69s ==============================
735:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_core
1385:================= 519 passed, 75 skipped in 135.21s (0:02:15) =================
1390:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_core
1414:======================== 3 passed, 8 skipped in 0.46s =========================

But on my main workstation I'm still seeing the same test_vmm_allocator_policy_configuration failure as before, with the da63359 commit.

@leofang (Member, Author) commented Nov 18, 2025

/ok to test 6c8cbcb

@leofang (Member, Author) commented Nov 18, 2025

@Andy-Jost yes I already fixed them in 6c8cbcb. You were reviewing while I was pushing the fix 🙂 Late evening caused stupid bugs like these...

@leofang leofang added bug Something isn't working cuda.core Everything related to the cuda.core module labels Nov 18, 2025
@leofang leofang changed the title Expand Windows test matrix Expand Windows test matrix to fix nvbugs 5630448 Nov 18, 2025
@leofang leofang changed the title Expand Windows test matrix to fix nvbugs 5630448 Expand Windows test matrix to reproduce and fix nvbugs 5630448 Nov 18, 2025
@leofang (Member, Author) commented Nov 18, 2025

This PR is ready for review/merge to unblock Ralf.

@rwgk (Collaborator) commented Nov 18, 2025

Interestingly, on my main workstation I'm still seeing one error (below).

But I think we should deal with that separately.

CTK = ~~13.0.2~~ 13.0.1
Driver = 591.34

rwgk-win11.localdomain:/mnt/c/Users/rgrossekunst/logs $ grep_pytest_summary qa_bindings_windows_2025-11-18+081432_tests_log.txt
qa_bindings_windows_2025-11-18+081432_tests_log.txt
92:rootdir: C:\Users\rgrossekunst\forked\cuda-python
227:======================== 76 passed, 2 skipped in 5.51s ========================
232:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_bindings
434:======================= 194 passed, 1 skipped in 6.05s ========================
439:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_bindings
641:======================= 194 passed, 1 skipped in 4.09s ========================
646:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_bindings\examples
720:============================= 13 passed in 9.24s ==============================
725:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_bindings
773:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_core
1491:============ 1 failed, 518 passed, 75 skipped in 72.39s (0:01:12) =============
1496:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_core
1520:======================== 3 passed, 8 skipped in 0.11s =========================
================================== FAILURES ===================================
___________________ test_vmm_allocator_policy_configuration ___________________

    def test_vmm_allocator_policy_configuration():
        """Test VMM allocator with different policy configurations.
    
        This test verifies that VirtualMemoryResource can be configured
        with different allocation policies and that the configuration affects
        the allocation behavior.
        """
        device = Device()
        device.set_current()
    
        # Skip if virtual memory management is not supported
        if not device.properties.virtual_memory_management_supported:
            pytest.skip("Virtual memory management is not supported on this device")
    
        # Skip if GPU Direct RDMA is supported (we want to test the unsupported case)
        if not device.properties.gpu_direct_rdma_supported:
            pytest.skip("This test requires a device that doesn't support GPU Direct RDMA")
    
        # Test with custom VMM config
        custom_config = VirtualMemoryResourceOptions(
            allocation_type="pinned",
            location_type="device",
            granularity="minimum",
            gpu_direct_rdma=True,
            handle_type="posix_fd" if not IS_WINDOWS else "win32_kmt",
            peers=(),
            self_access="rw",
            peer_access="rw",
        )
    
        vmm_mr = VirtualMemoryResource(device, config=custom_config)
    
        # Verify configuration is applied
        assert vmm_mr.config == custom_config
        assert vmm_mr.config.gpu_direct_rdma is True
        assert vmm_mr.config.granularity == "minimum"
    
        # Test allocation with custom config
        buffer = vmm_mr.allocate(8192)
        assert buffer.size >= 8192
        assert buffer.device_id == device.device_id
    
        # Test policy modification
        new_config = VirtualMemoryResourceOptions(
            allocation_type="pinned",
            location_type="device",
            granularity="recommended",
            gpu_direct_rdma=False,
            handle_type="posix_fd" if not IS_WINDOWS else "win32_kmt",
            peers=(),
            self_access="r",  # Read-only access
            peer_access="r",
        )
    
        # Modify allocation policy
>       modified_buffer = vmm_mr.modify_allocation(buffer, 16384, config=new_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests\test_memory.py:440: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cuda\core\experimental\_memory\_virtual_memory_resource.py:230: in modify_allocation
    raise_if_driver_error(res)
cuda\core\experimental\_utils\cuda_utils.pyx:67: in cuda.core.experimental._utils.cuda_utils._check_driver_error
    cpdef inline int _check_driver_error(cydriver.CUresult error) except?-1 nogil:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   raise CUDAError(f"{name.decode()}: {expl}")
E   cuda.core.experimental._utils.cuda_utils.CUDAError: CUDA_ERROR_UNKNOWN: This indicates that an unknown internal error has occurred.

cuda\core\experimental\_utils\cuda_utils.pyx:78: CUDAError
=========================== short test summary info ===========================
SKIPPED [6] tests\example_tests\utils.py:37: cupy not installed, skipping related tests
SKIPPED [1] tests\example_tests\utils.py:37: torch not installed, skipping related tests
SKIPPED [1] tests\example_tests\utils.py:43: skip C:\Users\rgrossekunst\forked\cuda-python\cuda_core\tests\example_tests\..\..\examples\thread_block_cluster.py
SKIPPED [5] tests\memory_ipc\test_errors.py:20: Device does not support IPC
SKIPPED [1] tests\memory_ipc\test_event_ipc.py:20: Device does not support IPC
SKIPPED [1] tests\memory_ipc\test_event_ipc.py:91: Device does not support IPC
SKIPPED [2] tests\memory_ipc\test_event_ipc.py:106: Device does not support IPC
SKIPPED [8] tests\memory_ipc\test_event_ipc.py:123: Device does not support IPC
SKIPPED [1] tests\memory_ipc\test_leaks.py:26: mempool allocation handle is not using fds or psutil is unavailable
SKIPPED [12] tests\memory_ipc\test_leaks.py:82: mempool allocation handle is not using fds or psutil is unavailable
SKIPPED [1] tests\memory_ipc\test_memory_ipc.py:16: Device does not support IPC
SKIPPED [1] tests\memory_ipc\test_memory_ipc.py:53: Device does not support IPC
SKIPPED [1] tests\memory_ipc\test_memory_ipc.py:103: Device does not support IPC
SKIPPED [1] tests\memory_ipc\test_memory_ipc.py:153: Device does not support IPC
SKIPPED [2] tests\memory_ipc\test_send_buffers.py:18: Device does not support IPC
SKIPPED [1] tests\memory_ipc\test_serialize.py:24: Device does not support IPC
SKIPPED [1] tests\memory_ipc\test_serialize.py:79: Device does not support IPC
SKIPPED [1] tests\memory_ipc\test_serialize.py:125: Device does not support IPC
SKIPPED [2] tests\memory_ipc\test_workerpool.py:29: Device does not support IPC
SKIPPED [2] tests\memory_ipc\test_workerpool.py:65: Device does not support IPC
SKIPPED [2] tests\memory_ipc\test_workerpool.py:109: Device does not support IPC
SKIPPED [1] tests\test_device.py:327: Test requires at least 2 CUDA devices
SKIPPED [1] tests\test_device.py:375: Test requires at least 2 CUDA devices
SKIPPED [1] tests\test_launcher.py:92: Driver or GPU not new enough for thread block clusters
SKIPPED [1] tests\test_launcher.py:122: Driver or GPU not new enough for thread block clusters
SKIPPED [2] tests\test_launcher.py:274: cupy not installed
SKIPPED [1] tests\test_linker.py:113: nvjitlink requires lto for ptx linking
SKIPPED [1] tests\test_memory.py:514: This test requires a device that doesn't support GPU Direct RDMA
SKIPPED [1] tests\test_memory.py:645: Driver rejects IPC-enabled mempool creation on this platform
SKIPPED [7] tests\test_module.py:345: Test requires numba to be installed
SKIPPED [2] tests\test_module.py:389: Device with compute capability 90 or higher is required for cluster support
SKIPPED [1] tests\test_module.py:404: Device with compute capability 90 or higher is required for cluster support
SKIPPED [2] tests\test_utils.py: got empty parameter set for (in_arr, use_stream)
SKIPPED [1] tests\test_utils.py: CuPy is not installed
FAILED tests/test_memory.py::test_vmm_allocator_policy_configuration - cuda.core.experimental._utils.cuda_utils.CUDAError: CUDA_ERROR_UNKNOWN: This indicates that an unknown internal error has occurred.
============ 1 failed, 518 passed, 75 skipped in 72.39s (0:01:12) =============

@Andy-Jost (Contributor) left a review comment

lgtm

}
pnputil /disable-device /class Display
pnputil /enable-device /class Display
# Give it a minute to settle:
Review comment from a Collaborator:

nit: moment (not minute)

@leofang (Member, Author) commented Nov 18, 2025

> Interestingly, on my main workstation I'm still seeing one error (below).

@rwgk I suspect this would be an actual issue that we should bring to the driver team, given what you described above that this test passed with 591.32 and fails now with 591.34.

@leofang leofang merged commit c4079dd into NVIDIA:main Nov 18, 2025
119 of 121 checks passed
@rwgk (Collaborator) commented Nov 18, 2025

> > Interestingly, on my main workstation I'm still seeing one error (below).
>
> @rwgk I suspect this would be an actual issue that we should bring to the driver team, given what you described above that this test passed with 591.32 and fails now with 591.34.

There were two variables:

  • Different machines (TITAN RTX vs A6000)
  • Different driver

Possibly the 591.32 vs 591.34 driver may NOT make a difference. I'd have to try it out to be sure.

@github-actions

Doc Preview CI
Preview removed because the pull request was closed or merged.

@rwgk (Collaborator) commented Nov 18, 2025

> Possibly the 591.32 vs 591.34 driver may NOT make a difference. I'd have to try it out to be sure.

Oh: in the meantime (yesterday or Sunday night, I think) I also updated the driver on the TITAN RTX machine; it's 591.34 now, too. So it's almost certain that TITAN RTX vs A6000 is what makes the difference.

@leofang (Member, Author) commented Nov 18, 2025

> There were two variables:

This is new to me. Let's continue this conversation offline.

@leofang (Member, Author) left a review comment

FYI @alliepiper I backported your changes in CCCL to here 😋
