
fix(mp): correct store cached requests in lmcache_mp_connector #3012

Merged
ApostaC merged 6 commits into LMCache:dev from maobaolong:fix_not_store_bugs on Apr 19, 2026

Conversation

@maobaolong
Collaborator

@maobaolong maobaolong commented Apr 12, 2026

Summary

This PR fixes two bookkeeping bugs in lmcache_mp_connector_0180.py that together cause KV cache blocks to be silently dropped during store operations.

There is a corresponding vLLM-side PR, vllm-project/vllm#39655, whose changes are identical to this PR.


Bug 1: GetStoreMetadata — incorrect min_available_blocks upper bound

Root cause (causes missed stores):

In GetStoreMetadata, min_available_blocks was computed solely from num_scheduled_tokens // vllm_block_size. However, num_stored_blocks (set during the lookup phase) already includes the retrieve-hit blocks. When a request has LMCache hits, the subtraction min_available_blocks - num_stored_blocks can go negative, preventing any new blocks from being stored.

Example (block_size=64, chunk_size=256, blocks_in_chunk=4):

Suppose a request has 4096 tokens, and LMCache lookup hits 60 blocks (3840 tokens). The scheduler then schedules the remaining 256 tokens (= 4 blocks).

| Variable | Old code | New code |
|---|---|---|
| num_scheduled_tokens // block_size | 256 / 64 = 4 | 4 |
| num_lmcache_hit_blocks | 60 | 60 |
| computed_blocks | 4 | 4 + 60 = 64 |
| min_available_blocks | 4 | 64 |
| num_stored_blocks (from lookup) | 60 | 60 |
| num_staging_blocks | 4 − 60 = −56 ❌ | 64 − 60 = 4 ✅ |

With the old code, num_staging_blocks is negative, so no new block is ever stored. The fix adds num_lmcache_hit_blocks to computed_blocks so the upper bound matches the baseline that num_stored_blocks was set against. Hit blocks are not re-stored because they are already counted in num_stored_blocks.
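The arithmetic above can be sketched as a small Python function. This is a hypothetical illustration of the fix, not the connector's actual code; the function name and `fixed` flag are invented here, and the variable names follow the PR description:

```python
# Hypothetical sketch of the Bug 1 arithmetic; names follow the PR
# description, not the actual lmcache_mp_connector_0180.py implementation.
def num_staging_blocks(num_scheduled_tokens, num_lmcache_hit_blocks,
                       num_stored_blocks, block_size, fixed=True):
    computed_blocks = num_scheduled_tokens // block_size
    if fixed:
        # Fix: include the LMCache hit blocks so the upper bound matches
        # the baseline that num_stored_blocks was set against during lookup.
        computed_blocks += num_lmcache_hit_blocks
    min_available_blocks = computed_blocks
    # Negative result means no new block is ever staged for store.
    return min_available_blocks - num_stored_blocks

# Example from the table: 256 scheduled tokens, 60 hit blocks, block_size 64.
print(num_staging_blocks(256, 60, 60, 64, fixed=False))  # -56: nothing stored
print(num_staging_blocks(256, 60, 60, 64, fixed=True))   # 4: remaining blocks stored
```

With `fixed=False` the subtraction goes negative exactly as in the table's old-code column, which is why the missed-store symptom only appears for requests with LMCache hits.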


Bug 2: _process_cached_requests — cumulative vs. incremental token count

Root cause (bookkeeping corruption):

In _process_cached_requests, num_new_tokens was read from cached_reqs.num_computed_tokens[idx], which is a cumulative value (total computed tokens so far). It was then passed to increase_num_scheduled_tokens(), which does += (accumulation). This double-counts tokens across scheduling rounds.

Example (block_size=64, chunk_size=256):

Suppose a request has 4096 tokens and the scheduler splits it into two prefill rounds of 2048 tokens each:

| Round | Source | Value passed | Tracker after += |
|---|---|---|---|
| Round 1 (new_req) | num_scheduled_tokens | 2048 | 2048 ✅ |
| Round 2 (cached_req, old) | num_computed_tokens (cumulative) | 4096 | 2048 + 4096 = 6144 ❌ |
| Round 2 (cached_req, new) | num_scheduled_tokens (incremental) | 2048 | 2048 + 2048 = 4096 ✅ |

The inflated tracker value (6144 instead of 4096) does not cause out-of-bound stores thanks to the min() guard in GetStoreMetadata, but it corrupts internal bookkeeping and is a latent bug. The fix switches to scheduler_output.num_scheduled_tokens[request_id] (incremental), consistent with _process_new_requests.
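A minimal sketch of the cumulative-vs-incremental mix-up, assuming a toy tracker whose `increase_num_scheduled_tokens` accumulates with `+=` as described above (the class and dict layout here are invented for illustration):

```python
# Hypothetical sketch of the Bug 2 bookkeeping; the tracker mimics
# increase_num_scheduled_tokens(), which accumulates with +=.
class TokenTracker:
    def __init__(self):
        self.num_scheduled_tokens = 0

    def increase_num_scheduled_tokens(self, n):
        self.num_scheduled_tokens += n

# A 4096-token request split into two prefill rounds of 2048 tokens each.
rounds = [
    {"num_scheduled_tokens": 2048, "num_computed_tokens": 2048},  # round 1
    {"num_scheduled_tokens": 2048, "num_computed_tokens": 4096},  # round 2
]

buggy, patched = TokenTracker(), TokenTracker()
# Round 1 goes through the new_req path in both versions.
buggy.increase_num_scheduled_tokens(rounds[0]["num_scheduled_tokens"])
patched.increase_num_scheduled_tokens(rounds[0]["num_scheduled_tokens"])
# Round 2: old code fed the cumulative counter into +=, new code the per-step count.
buggy.increase_num_scheduled_tokens(rounds[1]["num_computed_tokens"])    # 2048 + 4096
patched.increase_num_scheduled_tokens(rounds[1]["num_scheduled_tokens"]) # 2048 + 2048

print(buggy.num_scheduled_tokens)    # 6144 (inflated)
print(patched.num_scheduled_tokens)  # 4096 (correct)
```

Passing a cumulative value into an accumulating tracker double-counts every previously scheduled token, which is why the error grows with the number of scheduling rounds.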


Changes

  • GetStoreMetadata: include num_lmcache_hit_blocks in computed_blocks to restore the correct upper bound for num_staging_blocks.
  • _process_cached_requests: use scheduler_output.num_scheduled_tokens[request_id] (incremental) instead of cached_reqs.num_computed_tokens[idx] (cumulative).

Note

Medium Risk
Changes KV-cache store bookkeeping for cached and partially-hit requests; mistakes here could still cause missed stores or incorrect block accounting during prefill/generation.

Overview
Fixes LMCache MP store bookkeeping so KV blocks aren’t silently skipped when requests have a mix of vLLM APC hits and LMCache hits.

GetStoreMetadata now computes the storeable-block upper bound using scheduled blocks plus max(num_vllm_hit_blocks, num_lmcache_hit_blocks) (with added rationale comments), and _process_cached_requests now increments num_scheduled_tokens using per-step scheduler_output.num_scheduled_tokens[request_id] instead of a cumulative computed-token counter.

Reviewed by Cursor Bugbot for commit a8ee866.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request modifies the block calculation logic in GetStoreMetadata to include hit blocks and updates _process_cached_requests to use incremental scheduled tokens for consistency. A review comment was provided requesting the addition of regression tests for these bug fixes, as mandated by the repository style guide.

Comment thread lmcache/integration/vllm/lmcache_mp_connector_0180.py Outdated

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit d1e4179.

Comment thread lmcache/integration/vllm/lmcache_mp_connector_0180.py Outdated
@maobaolong maobaolong changed the title from "fix(mp): correct store bookkeeping for cached requests in lmcache_mp_connector_0180" to "fix(mp): correct store cached requests in lmcache_mp_connector" on Apr 12, 2026
@ApostaC
Contributor

ApostaC commented Apr 13, 2026

@maobaolong good catch on these bugs! Can you also create a PR in vLLM? Both @KuntaiDu and I can approve that for you!

Contributor

@ApostaC ApostaC left a comment


LGTM!

Comment thread lmcache/integration/vllm/lmcache_mp_connector_0180.py Outdated
@maobaolong
Collaborator Author

@ApostaC Thanks for this quick review!

maobaolong added a commit to maobaolong/LMCache that referenced this pull request Apr 13, 2026
maobaolong added a commit to maobaolong/LMCache that referenced this pull request Apr 13, 2026
Contributor

@sammshen sammshen left a comment


LGTM!

@ApostaC
Contributor

ApostaC commented Apr 14, 2026

@maobaolong Are we going to keep track with vllm-project/vllm#39719?

@maobaolong
Collaborator Author

@maobaolong Are we going to keep track with vllm-project/vllm#39719?

Yeah, I will.

@ApostaC ApostaC enabled auto-merge (squash) April 19, 2026 03:38
@github-actions github-actions Bot added the full Run comprehensive tests on this PR label Apr 19, 2026
@ApostaC ApostaC merged commit f1e0db7 into LMCache:dev Apr 19, 2026
30 of 34 checks passed
