Retain CUDA IPC events in MP adapter by he-yufeng · Pull Request #3245 · LMCache/LMCache

he-yufeng · 2026-05-10T07:39:40Z

What this PR does / why we need it:

LMCacheMPWorkerAdapter now keeps the producer-side CUDA event objects alive for pending MP store and retrieve requests. The adapter already tracks the matching futures; this adds matching event references and drops them when the future completes or when pending work is drained after the server becomes unhealthy.

Without retaining the event object, the daemon can receive an IPC handle whose producer-side event has already been collected, which can make torch.cuda.Event.from_ipc_handle(...) fail with CUDA error: invalid argument.
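The retention scheme described above can be sketched roughly as follows. This is a minimal illustration, not the adapter's actual code: `PendingEventTracker`, `submit`, and `drain` are hypothetical names, and the event is treated as an opaque object whose only job is to stay referenced while its future is pending.

```python
from concurrent.futures import Future
from typing import Any, Dict, List


class PendingEventTracker:
    """Hypothetical helper: holds one event reference per pending request."""

    def __init__(self) -> None:
        self.futures: Dict[str, Future] = {}
        self.events: Dict[str, Any] = {}

    def submit(self, request_id: str, future: Future, event: Any) -> None:
        # Store the event next to its future so it cannot be collected
        # while the consumer may still open its IPC handle.
        self.futures[request_id] = future
        self.events[request_id] = event

    def get_finished(self) -> List[str]:
        # Drop the event reference as soon as the matching future is done.
        done = [rid for rid, fut in self.futures.items() if fut.done()]
        for rid in done:
            del self.futures[rid]
            self.events.pop(rid, None)
        return done

    def drain(self) -> List[str]:
        # Used when the server is unhealthy: forget all pending work,
        # including the retained events.
        pending = list(self.futures)
        self.futures.clear()
        self.events.clear()
        return pending
```

Keeping the events in a dictionary keyed the same way as the futures means a single completion check can release both.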

Special notes for your reviewers:

The unit tests use a fake CUDA event plus weakref to check both sides of the lifetime: pending requests keep the event alive, and get_finished() releases it once the matching future completes.
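A weakref-based lifetime check of this kind might look like the sketch below. It is an illustration under assumptions, not the PR's test code: the `pending` dictionary stands in for the adapter's event map, and `check_lifetime` is a hypothetical helper. It relies on CPython reference counting, with `gc.collect()` added as a safeguard.

```python
import gc
import weakref
from typing import Dict, Tuple


class FakeCudaEvent:
    """Stand-in for a CUDA event; it only needs to be weak-referenceable."""

    def ipc_handle(self) -> bytes:
        return b"fake-handle"


def check_lifetime() -> Tuple[bool, bool]:
    pending: Dict[str, FakeCudaEvent] = {}  # stand-in for the adapter's event map
    event = FakeCudaEvent()
    ref = weakref.ref(event)

    pending["req-0"] = event
    del event  # the map now holds the only strong reference
    gc.collect()
    alive_while_pending = ref() is not None

    pending.pop("req-0")  # simulates cleanup once the future completes
    gc.collect()
    released_after_completion = ref() is None
    return alive_while_pending, released_after_completion
```

The weak reference lets the test observe collection without itself keeping the event alive.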

Validation run locally on Windows with a CPU/source-only test environment:

python -m ruff check lmcache\integration\vllm\vllm_multi_process_adapter.py tests\v1\test_vllm_mp_adapter.py
python -m ruff format --check lmcache\integration\vllm\vllm_multi_process_adapter.py tests\v1\test_vllm_mp_adapter.py
.\.venv\Scripts\python.exe -m pytest tests\v1\test_vllm_mp_adapter.py -q
.\.venv\Scripts\python.exe -m py_compile lmcache\integration\vllm\vllm_multi_process_adapter.py tests\v1\test_vllm_mp_adapter.py
git diff --check
git fsck --no-progress

If applicable:

this PR contains user facing changes - docs added
this PR contains unit tests

gemini-code-assist

Code Review

This pull request modifies the LMCacheMPWorkerAdapter to maintain references to CUDA event objects during store and retrieve operations, ensuring they are not garbage collected while their IPC handles are in use. It also includes regression tests to verify that these events are correctly held and released upon request completion. I have no feedback to provide.

KuntaiDu

LGTM! @ApostaC @YaoJiayi can you also check?

he-yufeng · 2026-05-27T08:50:29Z

Follow-up pushed in 6dd0685.

Changes:

replaced direct torch.cuda.Event annotations with a small structural ipc_handle() protocol, so this adapter API no longer couples the type surface to torch.cuda;
annotated the fake heartbeat callback slot to clear the mypy assignment error.
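A structural protocol of the kind mentioned in the first change can be sketched with `typing.Protocol`. The names `IpcEventLike`, `FakeEvent`, and `export_handle` are hypothetical, and the `bytes` return type of `ipc_handle()` is an assumption made for this sketch.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class IpcEventLike(Protocol):
    """Any object exposing ipc_handle() matches, with no torch import needed."""

    def ipc_handle(self) -> bytes: ...  # return type assumed for illustration


class FakeEvent:
    """Test double that satisfies the protocol without subclassing it."""

    def ipc_handle(self) -> bytes:
        return b"\x00" * 8


def export_handle(event: IpcEventLike) -> bytes:
    # Callers depend only on the method, not on a concrete event class.
    return event.ipc_handle()
```

Because the match is structural, a real CUDA event and a test fake can both be passed to the same annotated function, which is what allows the type surface to avoid a direct `torch.cuda` dependency.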

Validation on Windows:

python -m py_compile lmcache\integration\vllm\vllm_multi_process_adapter.py tests\v1\test_vllm_mp_adapter.py
git diff --check
python -m ruff check lmcache\integration\vllm\vllm_multi_process_adapter.py tests\v1\test_vllm_mp_adapter.py
python -m mypy --config-file pyproject.toml lmcache\integration\vllm\vllm_multi_process_adapter.py tests\v1\test_vllm_mp_adapter.py

Note: python -m pytest tests\v1\test_vllm_mp_adapter.py -q is blocked locally before collection by the missing lmcache.c_ops native extension in this Windows checkout.

he-yufeng · 2026-05-27T08:53:25Z

DCO is fixed now. I rebased the two PR commits with Signed-off-by footers and force-pushed the same code diff; current head is 530fd55. The temporary bad squash was immediately reverted, and the PR diff is back to the intended two files.

he-yufeng · 2026-05-27T09:42:07Z

Rebased onto the latest dev and resolved the test conflict; current head is 383d485. The code diff is still limited to the MP adapter and its regression tests.

Validation on Windows:

python -m py_compile lmcache\integration\vllm\vllm_multi_process_adapter.py tests\v1\test_vllm_mp_adapter.py
python -m ruff check lmcache\integration\vllm\vllm_multi_process_adapter.py tests\v1\test_vllm_mp_adapter.py
python -m mypy --config-file pyproject.toml lmcache\integration\vllm\vllm_multi_process_adapter.py tests\v1\test_vllm_mp_adapter.py
git diff --check upstream/dev...HEAD

Targeted pytest is still blocked before collection by the missing lmcache.c_ops in this Windows checkout.

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

he-yufeng · 2026-05-27T18:09:32Z

Updated again in 3cd6b98.

Changes:

rebased the PR onto current dev;
fixed the CI failure in the new event-retention tests by stubbing the current TransferContext path instead of the older send_lmcache_request/to_cuda_future path;
the earlier torch_dev review concern is addressed in 872da20: the adapter retains IPC-capable events through a small _IpcEvent protocol and no longer exposes a torch.cuda.Event type dependency in this adapter API.

Validation on Windows:

python -m py_compile lmcache\integration\vllm\vllm_multi_process_adapter.py tests\v1\test_vllm_mp_adapter.py
python -m ruff check lmcache\integration\vllm\vllm_multi_process_adapter.py tests\v1\test_vllm_mp_adapter.py
python -m mypy --config-file pyproject.toml lmcache\integration\vllm\vllm_multi_process_adapter.py tests\v1\test_vllm_mp_adapter.py
git diff --check

The targeted pytest command is still blocked locally before collection in this Windows checkout by the missing native lmcache.c_ops extension; the previous CI failure itself should be covered by the Linux test job after this push.

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

he-yufeng · 2026-05-30T12:29:25Z

Pushed 195edaa to fix the Linux CI failures in the new event-retention tests. The adapter was already clearing store_events / retrieve_events; the test mock's call_args was still holding the FakeCudaEvent reference, so the weakref assertion was testing the fixture rather than adapter cleanup. Validation on Windows: py_compile for the adapter and test, ruff check on touched files, mypy on touched files, and git diff --check passed. The targeted pytest still cannot collect locally on this Windows host because lmcache.c_ops is not built.

he-yufeng · 2026-05-30T13:40:16Z

Follow-up for the CI failure: the test was still retaining the fake CUDA event through the parent transfer_ctx mock's call history. I switched the cleanup to reset the parent mock, so the weakref check now verifies the adapter no longer owns the event after get_finished() removes it.
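The fixture problem described here can be reproduced in isolation with `unittest.mock`. In this sketch, `transfer_ctx.submit` is a hypothetical method name chosen for illustration. It shows that a call made through a child mock is also recorded in the parent's call history, so resetting only the child leaves the argument alive.

```python
import gc
import weakref
from typing import Tuple
from unittest.mock import MagicMock


class FakeCudaEvent:
    """Stand-in event; only its lifetime matters here."""


def demo() -> Tuple[bool, bool, bool]:
    transfer_ctx = MagicMock()
    event = FakeCudaEvent()
    ref = weakref.ref(event)

    # The call is recorded on the child mock and on the parent's mock_calls.
    transfer_ctx.submit(event)
    del event
    gc.collect()
    held_by_mock = ref() is not None

    # Resetting only the child leaves the parent's call history intact.
    transfer_ctx.submit.reset_mock()
    gc.collect()
    held_after_child_reset = ref() is not None

    # Resetting the parent clears its history and recurses into children.
    transfer_ctx.reset_mock()
    gc.collect()
    released = ref() is None
    return held_by_mock, held_after_child_reset, released
```

Once the parent mock is reset, any remaining strong reference would have to come from the code under test, which is what the weakref assertion is meant to detect.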

Validated locally:

python -m py_compile tests\v1\test_vllm_mp_adapter.py
ruff check tests\v1\test_vllm_mp_adapter.py
mypy tests\v1\test_vllm_mp_adapter.py
git diff --check

python -m pytest tests\v1\test_vllm_mp_adapter.py -q is still blocked in my Windows checkout before test collection by ModuleNotFoundError: No module named 'lmcache.c_ops' from tests/conftest.py.

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

maobaolong

LGTM.

maobaolong · 2026-05-31T01:13:38Z

@hlin99 Would you like to double-check?

hlin99

LGTM

* fix: retain CUDA IPC events in MP adapter
  Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
* fix: avoid CUDA event type coupling
  Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
* test: use transfer context in MP adapter event tests
  Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
* test: clear MP event test mock references
  Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
* test: drop parent mock event references
  Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

---------
Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

he-yufeng requested review from ApostaC, YaoJiayi, deng451e, hickeyma, maobaolong and sammshen as code owners May 10, 2026 07:39

gemini-code-assist Bot reviewed May 10, 2026

KuntaiDu approved these changes May 20, 2026

maobaolong requested changes May 26, 2026

Comment thread lmcache/integration/vllm/vllm_multi_process_adapter.py Outdated

he-yufeng force-pushed the fix/mp-event-lifetime branch from 6dd0685 to ba4ccfb Compare May 27, 2026 08:51

he-yufeng requested review from DongDongJu, chunxiaozheng, royyhuang and yoo-kumaneko as code owners May 27, 2026 08:51

he-yufeng force-pushed the fix/mp-event-lifetime branch 2 times, most recently from 6dd0685 to 530fd55 Compare May 27, 2026 08:52

he-yufeng force-pushed the fix/mp-event-lifetime branch from 530fd55 to 383d485 Compare May 27, 2026 09:41

he-yufeng added 3 commits May 28, 2026 02:08

fix: retain CUDA IPC events in MP adapter

74ac58e

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

fix: avoid CUDA event type coupling

872da20

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

test: use transfer context in MP adapter event tests

3cd6b98

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

he-yufeng force-pushed the fix/mp-event-lifetime branch from 383d485 to 3cd6b98 Compare May 27, 2026 18:09

test: clear MP event test mock references

195edaa

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

test: drop parent mock event references

e50d972

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>

he-yufeng force-pushed the fix/mp-event-lifetime branch from 95af7d1 to e50d972 Compare May 30, 2026 13:41

maobaolong approved these changes May 31, 2026

hlin99 approved these changes Jun 1, 2026

hlin99 enabled auto-merge (squash) June 1, 2026 06:48

Merge branch 'dev' into fix/mp-event-lifetime

b77a028

github-actions Bot added the "full" label (Run comprehensive tests on this PR) Jun 1, 2026

Merge branch 'dev' into fix/mp-event-lifetime

53f9038

hlin99 merged commit 5824ab3 into LMCache:dev Jun 2, 2026
26 of 27 checks passed

This was referenced Jun 2, 2026

docs: daily drift check — multi-process mode (2026-06-02) #3497

Closed

docs: daily drift check — multi-process mode (2026-06-03) #3515

Draft

Retain CUDA IPC events in MP adapter #3245: hlin99 merged 7 commits into LMCache:dev from he-yufeng:fix/mp-event-lifetime
