Conversation

@rparolin
Collaborator

@rparolin rparolin commented Oct 22, 2025

This PR addresses issues with GPU Direct RDMA support validation in the virtual memory management (VMM) allocator and improves test coverage for memory management functionality.

  • Removed hardcoded platform checks: Eliminated Windows and WSL-specific skips in favor of device capability checks.
  • New test case: Added test_vmm_allocator_rdma_unsupported_exception() to verify proper error handling when RDMA is requested on unsupported devices.
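The capability-based gating described above can be sketched as a small, driver-free helper. This is illustrative rather than code from the PR: `virtual_memory_management_supported` is the device property discussed below, while `gpu_direct_rdma_supported` is an assumed attribute name standing in for the RDMA capability check.

```python
from types import SimpleNamespace

def vmm_skip_reason(props, rdma_requested=False):
    """Return a skip reason string, or None if the test can run."""
    if not getattr(props, "virtual_memory_management_supported", False):
        return "device does not support virtual memory management"
    if rdma_requested and not getattr(props, "gpu_direct_rdma_supported", False):
        return "device does not support GPU Direct RDMA"
    return None

# A device that supports VMM but not GPU Direct RDMA: plain VMM tests
# run, while RDMA-specific tests are skipped.
props = SimpleNamespace(
    virtual_memory_management_supported=True,
    gpu_direct_rdma_supported=False,
)
print(vmm_skip_reason(props))                       # None
print(vmm_skip_reason(props, rdma_requested=True))  # skip reason string
```

Unlike the removed Windows/WSL skips, this decision is made per device, so the same test module behaves correctly on any platform.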

@copy-pr-bot
Contributor

copy-pr-bot bot commented Oct 22, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@rparolin rparolin requested a review from leofang October 22, 2025 21:49
@rparolin rparolin marked this pull request as ready for review October 22, 2025 21:50
@rparolin
Collaborator Author

/ok to test aba3a17

@rparolin
Collaborator Author

@greptileai

@rparolin rparolin changed the title from "Checking for RDMA support before allocating via VMM" to "Checking for RDMA support before allocating via VMM in test suite" Oct 22, 2025
@github-actions

This comment has been minimized.

Member

@leofang leofang left a comment


It occurs to me that none of us (Ben, Keith, myself) read the docs when getting the VMM PR merged. The docs make it clear that there is one device attribute we should check (as is typical of all major CUDA features, and as we did in the IPC mempool test helper).
https://docs.nvidia.com/cuda/cuda-c-programming-guide/#query-for-support

Translating this to cuda.core, we need to check:

import pytest
from cuda.core.experimental import Device

dev = Device()
if not dev.properties.virtual_memory_management_supported:
    pytest.skip(...)
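A reusable form of this guard, as a sketch rather than code from the PR: `Device()` needs a working CUDA driver, so the helper takes the properties object as a parameter (in a real session you would pass `Device().properties`; any object with the same attribute works for illustration).

```python
import pytest

def skip_unless_vmm(props):
    """Skip the calling test unless the device reports VMM support.

    `props` is expected to expose the boolean attribute
    `virtual_memory_management_supported`, as Device().properties does.
    """
    if not getattr(props, "virtual_memory_management_supported", False):
        pytest.skip("device does not support virtual memory management")
```

Each VMM test can then open with `skip_unless_vmm(dev.properties)` instead of a hardcoded platform check.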

@greptile-apps
Contributor

greptile-apps bot commented Oct 22, 2025

Greptile encountered an error while reviewing this PR. Please reach out to support@greptile.com for assistance.

@rparolin
Collaborator Author

/ok to test ebc6818


@leofang leofang linked an issue Oct 22, 2025 that may be closed by this pull request
@leofang leofang added the bug (Something isn't working), P0 (High priority - Must do!), and cuda.core (Everything related to the cuda.core module) labels Oct 22, 2025
@leofang leofang added this to the cuda.core beta 8 milestone Oct 22, 2025
@rparolin
Collaborator Author

> It occurs to me that none of us (Ben, Keith, myself) read the docs when getting the VMM PR merged. The docs made it clear that there is one device attribute that we should check (which is typical to all major CUDA features, as we did in the IPC mempool test helper). https://docs.nvidia.com/cuda/cuda-c-programming-guide/#query-for-support
>
> Translate this to cuda.core, we need to check
>
>     dev = Device()
>     if not dev.properties.virtual_memory_management_supported:
>         pytest.skip(...)

As discussed in person, I migrated most of the test suite's skip checks to use dev.properties.virtual_memory_management_supported, and additionally check for device RDMA support when the user explicitly requests it via our API.

@rparolin rparolin enabled auto-merge (squash) October 22, 2025 23:34
@rparolin rparolin merged commit f3cb5a2 into NVIDIA:main Oct 23, 2025
74 checks passed
@github-actions

Doc Preview CI
Preview removed because the pull request was closed or merged.


Labels

bug: Something isn't working
cuda.core: Everything related to the cuda.core module
P0: High priority - Must do!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Investigate VMM issues on WSL

2 participants