
Conversation

@leofang (Member) commented Nov 16, 2025

Description

Closes #462
Closes #984
Closes #985
Closes #1024
Closes #1241

Expand the Windows test matrix to cover:

  • 6 Python versions
  • 2 CUDA versions
  • 3 driver modes
    • datacenter and Quadro cards cover TCC & MCDM
    • gaming cards cover WDDM & MCDM
  • 1 driver version
    • unlike on Linux, on Windows we can control which driver version we install for now, so I'll work on expanding this next
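The combinations implied by the matrix above can be sketched as a quick enumeration. This is purely illustrative; the concrete version lists are placeholders, not the PR's actual values:

```python
# Hypothetical sketch of the expanded Windows test matrix described above.
# The version strings are assumed placeholders, not the PR's real entries.
import itertools

python_versions = ["3.9", "3.10", "3.11", "3.12", "3.13", "3.14"]  # 6 Python versions (assumed)
cuda_versions = ["12.x", "13.x"]                                   # 2 CUDA versions
driver_modes = ["TCC", "MCDM", "WDDM"]                             # 3 driver modes

# Full cross product before any per-GPU exclusions (not every card
# supports every mode, as noted above).
matrix = list(itertools.product(python_versions, cuda_versions, driver_modes))
print(len(matrix))  # 36
```

In practice a CI matrix like this gets pruned per runner, since each card class only supports a subset of the driver modes.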

With this PR we can reproduce nvbugs 5630448, and I worked out a fix for it.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copilot AI and others added 8 commits November 16, 2025 19:57
Co-authored-by: leofang <5534781+leofang@users.noreply.github.com>
…r modes
…r mode support
@copy-pr-bot (Contributor) bot commented Nov 16, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@leofang leofang changed the title Copilot/move install gpu driver script Expand Windows test matrix Nov 16, 2025
Copilot AI and others added 2 commits November 16, 2025 20:50
@leofang (Member, Author) commented Nov 16, 2025

/ok to test 16b0e3f

@leofang leofang self-assigned this Nov 16, 2025
@leofang leofang added P0 High priority - Must do! CI/CD CI/CD infrastructure labels Nov 16, 2025
@leofang leofang added this to the cuda-python 13-next, 12-next milestone Nov 16, 2025
@leofang leofang added the enhancement Any code-related improvements label Nov 16, 2025

@rwgk (Collaborator) commented Nov 16, 2025

Awesome, thanks!

leofang and others added 2 commits November 16, 2025 16:35
- we do not have access to rtx6000ada
- rtxpro6000 is a datacenter card
- cover WDDM in at least 2 pipelines
… different modes

rtx2080, rtx4090, rtxpro6000, v100, a100, l4 (t4 nodes are too slow)
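The per-card driver-mode coverage described in the commit notes above could be expressed as a simple lookup. This is an illustrative sketch only; the mode sets follow the comments above (datacenter/Quadro cards cover TCC & MCDM, gaming cards cover WDDM & MCDM) and are not verified against the actual runner pools:

```python
# Illustrative mapping of the runner pools named above to the driver
# modes each card class can cover, per the commit notes (assumed, not
# taken from the PR's actual CI configuration).
DRIVER_MODES = {
    "rtx2080": {"WDDM", "MCDM"},    # gaming card
    "rtx4090": {"WDDM", "MCDM"},    # gaming card
    "rtxpro6000": {"TCC", "MCDM"},  # datacenter card (per the note above)
    "v100": {"TCC", "MCDM"},        # datacenter card
    "a100": {"TCC", "MCDM"},        # datacenter card
    "l4": {"TCC", "MCDM"},          # datacenter card
}

# Sanity check: every driver mode in the matrix is covered by at
# least one pool.
covered = set().union(*DRIVER_MODES.values())
assert covered == {"TCC", "WDDM", "MCDM"}
```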
@leofang (Member, Author) commented Nov 16, 2025

/ok to test f2ffbb1

@rwgk (Collaborator) commented Nov 17, 2025

Another small puzzle:

Here in the CI I see 2 failed tests:

https://github.com/NVIDIA/cuda-python/actions/runs/19411934928/job/55534848113

tests\test_memory.py::test_vmm_allocator_basic_allocation FAILED         [ 61%]
...
tests\test_memory.py::test_vmm_allocator_grow_allocation FAILED          [ 62%]
...
============ 2 failed, 522 passed, 68 skipped in 120.46s (0:02:00) ============

That's also what I saw on the TITAN RTX colossus workstation yesterday with the 591.32 driver.

I just installed the 591.34 driver and kitpick034 on my main workstation (not colossus), now there are 3 failed:

FAILED tests\test_memory.py::test_vmm_allocator_basic_allocation - cuda.core.experimental._utils.cuda_utils.CUDAError: CUDA_ERROR_INVALID_VALUE: This indicates that one or more of the parameters passed to the API call is not within an acceptable range of values.
FAILED tests\test_memory.py::test_vmm_allocator_policy_configuration - cuda.core.experimental._utils.cuda_utils.CUDAError: CUDA_ERROR_INVALID_VALUE: This indicates that one or more of the parameters passed to the API call is not within an acceptable range of values.
FAILED tests\test_memory.py::test_vmm_allocator_grow_allocation - cuda.core.experimental._utils.cuda_utils.CUDAError: CUDA_ERROR_INVALID_VALUE: This indicates that one or more of the parameters passed to the API call is not within an acceptable range of values.
============ 3 failed, 512 passed, 75 skipped in 69.32s (0:01:09) =============

@leofang (Member, Author) left a review comment

fix

Removed redundant 'Ensure GPU is working' step and kept the driver mode verification.
@rwgk (Collaborator) commented Nov 18, 2025

I tested commit da63359 here interactively on the TITAN RTX colossus workstation, with CTK 13.0.2. No failures anymore:

$ grep_pytest_summary qa_bindings_windows_2025-11-18+000725_tests_log.txt
qa_bindings_windows_2025-11-18+000725_tests_log.txt
82:rootdir: C:\Users\rgrossekunst\forked\cuda-python
217:======================= 76 passed, 2 skipped in 13.67s ========================
222:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_bindings
424:======================= 194 passed, 1 skipped in 15.04s =======================
429:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_bindings
631:======================= 194 passed, 1 skipped in 10.17s =======================
636:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_bindings\examples
710:============================= 13 passed in 23.19s =============================
715:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_bindings
730:============================== 9 passed in 1.69s ==============================
735:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_core
1385:================= 519 passed, 75 skipped in 135.21s (0:02:15) =================
1390:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_core
1414:======================== 3 passed, 8 skipped in 0.46s =========================

But on my main workstation I'm still seeing the same test_vmm_allocator_policy_configuration failure as before, with the da63359 commit.

@leofang (Member, Author) commented Nov 18, 2025

/ok to test 6c8cbcb

@leofang (Member, Author) commented Nov 18, 2025

@Andy-Jost yes I already fixed them in 6c8cbcb. You were reviewing while I was pushing the fix 🙂 Late evening caused stupid bugs like these...

@leofang leofang added bug Something isn't working cuda.core Everything related to the cuda.core module labels Nov 18, 2025
@leofang leofang changed the title Expand Windows test matrix Expand Windows test matrix to fix nvbugs 5630448 Nov 18, 2025
@leofang leofang changed the title Expand Windows test matrix to fix nvbugs 5630448 Expand Windows test matrix to reproduce and fix nvbugs 5630448 Nov 18, 2025
@leofang (Member, Author) commented Nov 18, 2025

This PR is ready for review/merge to unblock Ralf.

@rwgk (Collaborator) commented Nov 18, 2025

Interestingly, on my main workstation I'm still seeing one error (below).

But I think we should deal with that separately.

CTK = ~~13.0.2~~ 13.0.1
Driver = 591.34

rwgk-win11.localdomain:/mnt/c/Users/rgrossekunst/logs $ grep_pytest_summary qa_bindings_windows_2025-11-18+081432_tests_log.txt
qa_bindings_windows_2025-11-18+081432_tests_log.txt
92:rootdir: C:\Users\rgrossekunst\forked\cuda-python
227:======================== 76 passed, 2 skipped in 5.51s ========================
232:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_bindings
434:======================= 194 passed, 1 skipped in 6.05s ========================
439:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_bindings
641:======================= 194 passed, 1 skipped in 4.09s ========================
646:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_bindings\examples
720:============================= 13 passed in 9.24s ==============================
725:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_bindings
773:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_core
1491:============ 1 failed, 518 passed, 75 skipped in 72.39s (0:01:12) =============
1496:rootdir: C:\Users\rgrossekunst\forked\cuda-python\cuda_core
1520:======================== 3 passed, 8 skipped in 0.11s =========================
================================== FAILURES ===================================
___________________ test_vmm_allocator_policy_configuration ___________________

    def test_vmm_allocator_policy_configuration():
        """Test VMM allocator with different policy configurations.
    
        This test verifies that VirtualMemoryResource can be configured
        with different allocation policies and that the configuration affects
        the allocation behavior.
        """
        device = Device()
        device.set_current()
    
        # Skip if virtual memory management is not supported
        if not device.properties.virtual_memory_management_supported:
            pytest.skip("Virtual memory management is not supported on this device")
    
        # Skip if GPU Direct RDMA is supported (we want to test the unsupported case)
        if not device.properties.gpu_direct_rdma_supported:
            pytest.skip("This test requires a device that doesn't support GPU Direct RDMA")
    
        # Test with custom VMM config
        custom_config = VirtualMemoryResourceOptions(
            allocation_type="pinned",
            location_type="device",
            granularity="minimum",
            gpu_direct_rdma=True,
            handle_type="posix_fd" if not IS_WINDOWS else "win32_kmt",
            peers=(),
            self_access="rw",
            peer_access="rw",
        )
    
        vmm_mr = VirtualMemoryResource(device, config=custom_config)
    
        # Verify configuration is applied
        assert vmm_mr.config == custom_config
        assert vmm_mr.config.gpu_direct_rdma is True
        assert vmm_mr.config.granularity == "minimum"
    
        # Test allocation with custom config
        buffer = vmm_mr.allocate(8192)
        assert buffer.size >= 8192
        assert buffer.device_id == device.device_id
    
        # Test policy modification
        new_config = VirtualMemoryResourceOptions(
            allocation_type="pinned",
            location_type="device",
            granularity="recommended",
            gpu_direct_rdma=False,
            handle_type="posix_fd" if not IS_WINDOWS else "win32_kmt",
            peers=(),
            self_access="r",  # Read-only access
            peer_access="r",
        )
    
        # Modify allocation policy
>       modified_buffer = vmm_mr.modify_allocation(buffer, 16384, config=new_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

tests\test_memory.py:440: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cuda\core\experimental\_memory\_virtual_memory_resource.py:230: in modify_allocation
    raise_if_driver_error(res)
cuda\core\experimental\_utils\cuda_utils.pyx:67: in cuda.core.experimental._utils.cuda_utils._check_driver_error
    cpdef inline int _check_driver_error(cydriver.CUresult error) except?-1 nogil:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   raise CUDAError(f"{name.decode()}: {expl}")
E   cuda.core.experimental._utils.cuda_utils.CUDAError: CUDA_ERROR_UNKNOWN: This indicates that an unknown internal error has occurred.

cuda\core\experimental\_utils\cuda_utils.pyx:78: CUDAError
=========================== short test summary info ===========================
SKIPPED [6] tests\example_tests\utils.py:37: cupy not installed, skipping related tests
SKIPPED [1] tests\example_tests\utils.py:37: torch not installed, skipping related tests
SKIPPED [1] tests\example_tests\utils.py:43: skip C:\Users\rgrossekunst\forked\cuda-python\cuda_core\tests\example_tests\..\..\examples\thread_block_cluster.py
SKIPPED [5] tests\memory_ipc\test_errors.py:20: Device does not support IPC
SKIPPED [1] tests\memory_ipc\test_event_ipc.py:20: Device does not support IPC
SKIPPED [1] tests\memory_ipc\test_event_ipc.py:91: Device does not support IPC
SKIPPED [2] tests\memory_ipc\test_event_ipc.py:106: Device does not support IPC
SKIPPED [8] tests\memory_ipc\test_event_ipc.py:123: Device does not support IPC
SKIPPED [1] tests\memory_ipc\test_leaks.py:26: mempool allocation handle is not using fds or psutil is unavailable
SKIPPED [12] tests\memory_ipc\test_leaks.py:82: mempool allocation handle is not using fds or psutil is unavailable
SKIPPED [1] tests\memory_ipc\test_memory_ipc.py:16: Device does not support IPC
SKIPPED [1] tests\memory_ipc\test_memory_ipc.py:53: Device does not support IPC
SKIPPED [1] tests\memory_ipc\test_memory_ipc.py:103: Device does not support IPC
SKIPPED [1] tests\memory_ipc\test_memory_ipc.py:153: Device does not support IPC
SKIPPED [2] tests\memory_ipc\test_send_buffers.py:18: Device does not support IPC
SKIPPED [1] tests\memory_ipc\test_serialize.py:24: Device does not support IPC
SKIPPED [1] tests\memory_ipc\test_serialize.py:79: Device does not support IPC
SKIPPED [1] tests\memory_ipc\test_serialize.py:125: Device does not support IPC
SKIPPED [2] tests\memory_ipc\test_workerpool.py:29: Device does not support IPC
SKIPPED [2] tests\memory_ipc\test_workerpool.py:65: Device does not support IPC
SKIPPED [2] tests\memory_ipc\test_workerpool.py:109: Device does not support IPC
SKIPPED [1] tests\test_device.py:327: Test requires at least 2 CUDA devices
SKIPPED [1] tests\test_device.py:375: Test requires at least 2 CUDA devices
SKIPPED [1] tests\test_launcher.py:92: Driver or GPU not new enough for thread block clusters
SKIPPED [1] tests\test_launcher.py:122: Driver or GPU not new enough for thread block clusters
SKIPPED [2] tests\test_launcher.py:274: cupy not installed
SKIPPED [1] tests\test_linker.py:113: nvjitlink requires lto for ptx linking
SKIPPED [1] tests\test_memory.py:514: This test requires a device that doesn't support GPU Direct RDMA
SKIPPED [1] tests\test_memory.py:645: Driver rejects IPC-enabled mempool creation on this platform
SKIPPED [7] tests\test_module.py:345: Test requires numba to be installed
SKIPPED [2] tests\test_module.py:389: Device with compute capability 90 or higher is required for cluster support
SKIPPED [1] tests\test_module.py:404: Device with compute capability 90 or higher is required for cluster support
SKIPPED [2] tests\test_utils.py: got empty parameter set for (in_arr, use_stream)
SKIPPED [1] tests\test_utils.py: CuPy is not installed
FAILED tests/test_memory.py::test_vmm_allocator_policy_configuration - cuda.core.experimental._utils.cuda_utils.CUDAError: CUDA_ERROR_UNKNOWN: This indicates that an unknown internal error has occurred.
============ 1 failed, 518 passed, 75 skipped in 72.39s (0:01:12) =============

@Andy-Jost (Contributor) left a review comment

lgtm

}
pnputil /disable-device /class Display
pnputil /enable-device /class Display
# Give it a minute to settle:
Review comment from a Collaborator:

nit: moment (not minute)

@leofang (Member, Author) commented Nov 18, 2025

> Interestingly, on my main workstation I'm still seeing one error (below).

@rwgk I suspect this would be an actual issue that we should bring to the driver team, given what you described above that this test passed with 591.32 and fails now with 591.34.

@leofang leofang merged commit c4079dd into NVIDIA:main Nov 18, 2025
119 of 121 checks passed
@rwgk (Collaborator) commented Nov 18, 2025

> > Interestingly, on my main workstation I'm still seeing one error (below).
>
> @rwgk I suspect this would be an actual issue that we should bring to the driver team, given what you described above that this test passed with 591.32 and fails now with 591.34.

There were two variables:

  • Different machines (TITAN RTX vs A6000)
  • Different driver

Possibly the 591.32 vs 591.34 driver may NOT make a difference. I'd have to try it out to be sure.

@github-actions

Doc Preview CI
Preview removed because the pull request was closed or merged.

@rwgk (Collaborator) commented Nov 18, 2025

> Possibly the 591.32 vs 591.34 driver may NOT make a difference. I'd have to try it out to be sure.

Oh: in the meantime (yesterday or Sunday night, I think) I also updated the driver on the TITAN RTX machine; it's 591.34 now, too. So it's almost certain that TITAN RTX vs A6000 is what makes the difference.

@leofang (Member, Author) commented Nov 18, 2025

> There were two variables:

This is new to me. Let's continue this conversation offline.

@leofang (Member, Author) left a review comment

FYI @alliepiper I backported your changes in CCCL to here 😋
