Skip to content

use pinned device_ptr to init cpu cache tensor#1287

Merged
hiworldwzj merged 5 commits into
mainfrom
wzj_ptr
May 7, 2026
Merged

use pinned device_ptr to init cpu cache tensor#1287
hiworldwzj merged 5 commits into
mainfrom
wzj_ptr

Conversation

@hiworldwzj

Copy link
Copy Markdown
Collaborator

No description provided.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the CPU cache creation logic to use device pointers obtained via cudaHostGetDevicePointer and introduces asynchronous registration. However, several issues were identified: a busy-wait loop in creator.py undermines the non-blocking parameter and is redundant when blocking; an assertion comparing host and device pointers is likely to fail and should be removed; and the device pointer is exposed before all memory chunks are fully registered, which could lead to unsafe memory access.

Comment on lines +37 to +38
while attach_handle.device_ptr is None:
time.sleep(0.01)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

这个忙等循环在 pin_no_blockingFalse 时是多余的,因为 attach_handle.wait() 已经确保了注册工作完成(且 device_ptrtasks_finished 设置前已被赋值)。更重要的是,当 pin_no_blockingTrue 时,这个循环会导致调用者阻塞,违背了该参数的设计初衷。此外,如果移除此循环且 pin_no_blockingTrue,第40行会因为 device_ptrNone 而抛出 TypeError。这表明异步固定逻辑与立即获取 device_ptr 的需求在当前设计下是冲突的。

Comment thread lightllm/common/cpu_cache/creator.py Outdated
Comment on lines 40 to 41
cpu_cache_tensor = self._build_tensor_view(shm_ptr=attach_handle.device_ptr)
assert shm_ptr == cpu_cache_tensor.data_ptr()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

此断言 assert shm_ptr == cpu_cache_tensor.data_ptr()(实际上是在检查 shm_ptr == attach_handle.device_ptr)与使用 device_ptr 的初衷相悖。如果设备指针与主机指针不同(这是使用 cudaHostGetDevicePointer 的原因),此断言将导致程序崩溃。如果两者总是相同,则无需更改为使用 device_ptr。鉴于 kv_cache_utils.py 中已移除了类似的断言,建议移除此行以支持指针不一致的系统。

Suggested change
cpu_cache_tensor = self._build_tensor_view(shm_ptr=attach_handle.device_ptr)
assert shm_ptr == cpu_cache_tensor.data_ptr()
cpu_cache_tensor = self._build_tensor_view(shm_ptr=attach_handle.device_ptr)

Comment on lines +278 to +287
if handle.device_ptr is None:
# 提前获取对应的指针对象,避免在wait后再获取,照成过长的阻塞等待。
device_ptr = ctypes.c_void_p()
host_ptr = ctypes.c_void_p(shm_ptr)
res = cuda.cudaHostGetDevicePointer(ctypes.byref(device_ptr), host_ptr, 0)
if res != 0:
raise Exception(f"cudaHostGetDevicePointer failed with error code {res}")

logger.info(f"cudaHostGetDevicePointer success, host_ptr={host_ptr.value}, device_ptr={device_ptr.value}")
handle.device_ptr = device_ptr.value

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

在仅完成第一个分块注册后就设置 handle.device_ptr 是不安全的。如果调用者在后台线程完成所有分块注册之前,就对整个 Tensor 执行 GPU 操作,可能会访问到尚未固定的内存。在某些对内存固定有严格要求的硬件上,这可能导致未定义行为或性能大幅下降。建议仅在所有分块注册完成后再暴露 device_ptr,或者确保调用者明确知道此时仅部分固定。

@hiworldwzj hiworldwzj merged commit 3d08cba into main May 7, 2026
1 check passed
@hiworldwzj hiworldwzj deleted the wzj_ptr branch May 7, 2026 02:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant