
Conversation

@Andy-Jost (Contributor) commented Oct 31, 2025:

Major refactoring of the memory package.

Overview
This PR refactors the _memory.pyx module into a dedicated package (_memory/) to address its growing size and complexity, which were hindering further development. The primary goals are to physically separate the code into more manageable submodules, simplify the internal logic, and enhance the overall structure, including the addition of .pxd headers for better Cython integration.

Major Changes

  • Split _memory.pyx into submodules, the major ones being the following:
    • Buffers: _buffer.*
    • Device memory resources: _dmr.*
    • IPC (Inter-Process Communication): _ipc.*
    • Virtual memory management: _vmm.*
  • Introduced Cython headers (.pxd) for public definitions to improve modularity and type safety.
  • Refactored DeviceMemoryResource to isolate IPC-related code, reducing coupling.
  • Simplified IPC implementation by adding an IPCData class to encapsulate relevant data members and eliminating a redundant uuid field.
  • Streamlined the class hierarchy by removing unnecessary classes.
  • Simplified the Cython interface for memory allocation and deallocation operations.

Minor Improvements

  • Added __all__ lists to modules for explicit control over exports.
  • Extracted long implementation functions from class definitions to make classes more concise and readable.
  • Renamed various private attributes and methods for consistency (e.g., _handle instead of _mempool_handle).
  • Consolidated and alphabetized property definitions for better organization.
  • Converted additional classes and functions to Cython for performance gains.

@copy-pr-bot (bot) commented Oct 31, 2025:

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@Andy-Jost Andy-Jost requested review from cpcloud, leofang and mdboom and removed request for cpcloud October 31, 2025 22:41
@Andy-Jost Andy-Jost added the cuda.core Everything related to the cuda.core module label Nov 4, 2025
def _clear(self):
    self._ptr = 0
    self._size = 0
    self._mr = None
Collaborator:

Nit: Consider renaming _mr -> _memory_resource or mem_resource.

    stream: Stream | None = None
):
    cdef Buffer self = Buffer.__new__(cls)
    self._ptr = <intptr_t>(int(ptr))
Collaborator:

Shouldn't this be a uintptr_t or a uint64_t?
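The concern behind this question is that intptr_t is signed, so a device pointer in the upper half of the 64-bit address space comes back negative when round-tripped through it, whereas uintptr_t preserves the raw address. A minimal Python illustration using ctypes to stand in for the Cython casts:

```python
import ctypes

# A pointer value in the upper half of the 64-bit address space.
addr = 0xFFFF_FFFF_FFFF_F000

# Casting through a signed pointer-sized type reinterprets the value as
# negative, while the unsigned type preserves the raw address.
as_signed = ctypes.c_int64(addr).value
as_unsigned = ctypes.c_uint64(addr).value
```

Here as_signed comes back negative while as_unsigned keeps the original address, which is why signed casts for pointer values tend to get flagged in review.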

@Andy-Jost (Contributor Author):

/ok to test cf4dc9d

@leofang leofang added this to the cuda.core beta 9 milestone Nov 10, 2025
@leofang leofang added enhancement Any code-related improvements P0 High priority - Must do! labels Nov 10, 2025
@Andy-Jost (Contributor Author):

/ok to test 19e4b8f

if self._memory_resource is None:
    raise ValueError("a destination buffer must be provided (this "
                     "buffer does not have a memory_resource)")
dst = self._memory_resource.allocate(src_size, stream)
Collaborator:

Nit: Is this a common idiom in Python? My tendency would be to assert and require the user to pass in a buffer that meets the size and alignment expectations, rather than allocating one for them, for clarity of memory ownership. I appreciate that this is less of a concern in Python than in other systems programming languages.

Andy-Jost (Contributor Author):

My 2c: it would be better to require the buffer, as you suggest.

cdef size_t dst_size = self._size
cdef size_t src_size = src._size

if src_size != dst_size:
Collaborator:

Do we need to guard against size being zero, to avoid kicking off a memcpy operation at all in that situation?

Andy-Jost (Contributor Author) commented Nov 12, 2025:
From the POV of this change (refactoring), no need to change. More generally, I would prefer to NOT have that guard because the driver should take care of correctness and the caller can avoid the zero-size. No need to spend time checking it here, since it would pessimize the common case. I don't feel too strongly about it, though.
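The two options being weighed can be sketched side by side (the function names and the memcpy callable are illustrative, not the driver API):

```python
def copy_with_guard(memcpy, dst, src, nbytes, stream):
    # Early-out: never reaches the driver for empty copies, at the cost
    # of an extra branch on every call.
    if nbytes == 0:
        return
    memcpy(dst, src, nbytes, stream)


def copy_without_guard(memcpy, dst, src, nbytes, stream):
    # Delegates correctness to the driver, which must treat a zero-byte
    # copy as a no-op; the common non-zero path pays nothing extra.
    memcpy(dst, src, nbytes, stream)
```

The comment above argues for the second variant: the guard pessimizes every call to optimize a case callers can avoid themselves.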

"""Return True if this buffer can be accessed by the CPU, otherwise False."""
if self._memory_resource is not None:
    return self._memory_resource.is_host_accessible
raise NotImplementedError("WIP: Currently this property only supports buffers with associated MemoryResource")
Collaborator:

Is this something that we intend to change in the future? Should we remove the 'WIP' from the string?

Andy-Jost (Contributor Author) commented Nov 12, 2025:
Same comment as above. For this change I only copy code. Not saying anything directly to your point, though. If we want to change this I'm happy to prepare the PR.

"""

@abc.abstractmethod
def allocate(self, size_t size, stream: Stream = None) -> Buffer:
Collaborator:

Question: Do we need memory alignment support in these alloc functions?

Andy-Jost (Contributor Author) commented Nov 12, 2025, quoting:

# TODO: support creating this MR with flags that are later passed to cuMemHostAlloc?
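If alignment support were added to these allocators, a common approach is to over-allocate by alignment - 1 bytes and round the raw pointer up to the requested power-of-two boundary. A sketch of the rounding arithmetic (not part of the current API):

```python
def align_up(addr: int, alignment: int) -> int:
    """Round addr up to the next multiple of alignment (a power of two)."""
    assert alignment > 0 and alignment & (alignment - 1) == 0
    return (addr + alignment - 1) & ~(alignment - 1)
```

A hypothetical allocate(size, alignment) would then request size + alignment - 1 bytes from the underlying pool and hand out align_up(raw_ptr, alignment) as the usable pointer.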

def allocate(self, size, stream=None) -> Buffer:
Collaborator:

Ditto alignment comment above.

rparolin (Collaborator) left a comment:

Generally looks good to me. I left some general comments that I consider non-blocking.

@leofang Any major concerns you'd like to see addressed before this gets merged?

Comment on lines +28 to +29
if TYPE_CHECKING:
from cuda.core.experimental._memory import Buffer, MemoryResource
Member:

Not blocking, but we want to avoid using TYPE_CHECKING: #468

Andy-Jost (Contributor Author):

Is this something to eliminate? If so, I can prepare a separate change to remove it everywhere.
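Without presuming what #468 settles on, one common way to drop `if TYPE_CHECKING:` guards is deferred annotation evaluation, which keeps annotation-only names out of the runtime import graph. A minimal sketch (the function name is hypothetical):

```python
from __future__ import annotations  # annotations become lazy strings

def as_buffer(obj) -> Buffer:  # "Buffer" need not be imported at runtime
    ...
```

With deferred evaluation the annotation is stored as a string and only resolved by tools that ask for it, so the import cycle that TYPE_CHECKING was working around never occurs at runtime.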

]
d = {}
exec("from cuda.core.experimental._memory import *", d) # noqa: S102
d = {k: v for k, v in d.items() if not k.startswith("__")}
Member:

Q: Should we exclude everything starting with one underscore, not just those with two?

Andy-Jost (Contributor Author):

Could change that. When I originally wrote this _SynchronousMemoryResource was in the mix, too.
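The single-underscore variant the reviewer suggests would look like the following; it also happens to drop the __builtins__ key that exec() injects into the namespace:

```python
ns = {}
exec("public_name = 1\n_private = 2\n__dunder__ = 3", ns)  # noqa: S102

# Filtering on a single leading underscore removes private and dunder
# names alike, including exec's injected __builtins__.
filtered = {k: v for k, v in ns.items() if not k.startswith("_")}
```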

@Andy-Jost (Contributor Author):

/ok to test ff3820f

@Andy-Jost Andy-Jost enabled auto-merge (squash) November 12, 2025 19:09
Comment on lines +5 to +9
from ._buffer import * # noqa: F403
from ._device_memory_resource import * # noqa: F403
from ._ipc import * # noqa: F403
from ._legacy import * # noqa: F403
from ._virtual_memory_resource import * # noqa: F403
Member:

Style: It would be better to call out what's being imported. Maintaining __all__, or having to chase after each module for its __all__, is not fun. I don't recall us ever maintaining __all__ in any other modules.

Andy-Jost (Contributor Author):

I guess I have the opposite preference. E.g., if I add something to, say, _buffer, I find it easier to update the __all__ list there rather than in a separate file.
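For context, the mechanism being debated: a submodule's __all__ is what decides which names a star import re-exports, so each submodule stays self-describing. A self-contained demonstration using a synthetic module (fake_buffer is invented for the example):

```python
import sys
import types

# Build a throwaway module with one public and one private class.
mod = types.ModuleType("fake_buffer")
exec(
    "__all__ = ['Buffer']\n"
    "class Buffer: pass\n"
    "class _Helper: pass\n",
    mod.__dict__,
)  # noqa: S102
sys.modules["fake_buffer"] = mod

# The star import honors __all__: only 'Buffer' crosses over.
ns = {}
exec("from fake_buffer import *", ns)  # noqa: S102
```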

@Andy-Jost Andy-Jost merged commit f9df16f into NVIDIA:main Nov 12, 2025
57 checks passed
github-actions (bot) commented:
Doc Preview CI
Preview removed because the pull request was closed or merged.

