fix(mp): correct store cached requests in lmcache_mp_connector #3012
ApostaC merged 6 commits into LMCache:dev
Conversation
Code Review
This pull request modifies the block calculation logic in GetStoreMetadata to include hit blocks and updates _process_cached_requests to use incremental scheduled tokens for consistency. A review comment was provided requesting the addition of regression tests for these bug fixes, as mandated by the repository style guide.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit d1e4179.
Force-pushed from d1e4179 to 469aace
…connector_0180 Signed-off-by: baoloongmao <baoloongmao@tencent.com>
…s bugs Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Force-pushed from adc5fd8 to befd463
@maobaolong good catch on the bugs! Can you also create a PR in vLLM? Both @KuntaiDu and I can approve that for you!
Force-pushed from 7fe7821 to 4571a84
@ApostaC Thanks for this quick review!
…tor LMCache#3012 Signed-off-by: baoloongmao <baoloongmao@tencent.com>
…tor LMCache#3012 (#15) Signed-off-by: baoloongmao <baoloongmao@tencent.com>
@maobaolong Are we going to keep track of vllm-project/vllm#39719?
Yeah, I will.
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Force-pushed from 7f26074 to a4ada01

Summary
This PR fixes two bookkeeping bugs in lmcache_mp_connector_0180.py that together cause KV cache blocks to be silently dropped during store operations. The corresponding vLLM-side PR is vllm-project/vllm#39655; its changes are identical to this PR's.
Bug 1:
GetStoreMetadata — incorrect min_available_blocks upper bound
Root cause (causes missed stores):
In GetStoreMetadata, min_available_blocks was computed solely from num_scheduled_tokens // vllm_block_size. However, num_stored_blocks (set during the lookup phase) already includes the retrieve-hit blocks. When a request has LMCache hits, the subtraction min_available_blocks - num_stored_blocks can go negative, preventing any new blocks from being stored.
Example (block_size=64, chunk_size=256, blocks_in_chunk=4):
Suppose a request has 4096 tokens, and LMCache lookup hits 60 blocks (3840 tokens). The scheduler then schedules the remaining 256 tokens (= 4 blocks).
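This arithmetic can be sketched in Python. The helper below is a minimal illustration using the names from the description above, not the actual lmcache_mp_connector_0180.py code:

```python
# Minimal sketch of the Bug 1 arithmetic (hypothetical helper, not the
# real connector code). block_size = 64, as in the example.
VLLM_BLOCK_SIZE = 64

def num_staging_blocks(num_scheduled_tokens, num_lmcache_hit_blocks,
                       num_stored_blocks, include_hit_blocks):
    computed_blocks = num_scheduled_tokens // VLLM_BLOCK_SIZE
    if include_hit_blocks:
        # The fix: count the retrieve-hit blocks in the upper bound so it
        # matches the baseline that num_stored_blocks was set against.
        computed_blocks += num_lmcache_hit_blocks
    min_available_blocks = computed_blocks
    return min_available_blocks - num_stored_blocks

# 4096-token request: lookup hits 60 blocks, 256 tokens scheduled.
old = num_staging_blocks(256, 60, 60, include_hit_blocks=False)
new = num_staging_blocks(256, 60, 60, include_hit_blocks=True)
print(old, new)  # -56 4: old code stages nothing, fixed code stages 4 blocks
```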
| | num_scheduled_tokens // block_size | num_lmcache_hit_blocks | computed_blocks | min_available_blocks | num_stored_blocks (from lookup) | num_staging_blocks |
| --- | --- | --- | --- | --- | --- | --- |
| old code | 4 | 60 | 4 | 4 | 60 | -56 |
| fixed | 4 | 60 | 64 | 64 | 60 | 4 |

With the old code, num_staging_blocks is negative, so no new block is ever stored. The fix adds num_lmcache_hit_blocks to computed_blocks so the upper bound matches the baseline that num_stored_blocks was set against. Hit blocks are not re-stored because they are already counted in num_stored_blocks.
Bug 2:
_process_cached_requests — cumulative vs. incremental token count
Root cause (bookkeeping corruption):
In _process_cached_requests, num_new_tokens was read from cached_reqs.num_computed_tokens[idx], which is a cumulative value (total computed tokens so far). It was then passed to increase_num_scheduled_tokens(), which does += (accumulation). This double-counts tokens across scheduling rounds.
Example (block_size=64, chunk_size=256):
Suppose a request has 4096 tokens and the scheduler splits it into two prefill rounds of 2048 tokens each:
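Under these assumptions, the double-count can be sketched as below. This is a minimal illustration, assuming the first round goes through _process_new_requests (incremental) and the later round through _process_cached_requests; names follow the description, not the real connector code:

```python
# Minimal sketch of the Bug 2 tracker inflation (hypothetical helper).
def track_scheduled_tokens(rounds, use_incremental):
    tracker = 0   # what increase_num_scheduled_tokens() accumulates with +=
    computed = 0  # cumulative computed-token counter
    for i, scheduled in enumerate(rounds):
        computed += scheduled
        if i == 0 or use_incremental:
            tracker += scheduled  # incremental, consistent across rounds
        else:
            tracker += computed   # old code: cumulative value double-counts
    return tracker

rounds = [2048, 2048]  # 4096-token request split into two prefill rounds
print(track_scheduled_tokens(rounds, use_incremental=False))  # 6144 (inflated)
print(track_scheduled_tokens(rounds, use_incremental=True))   # 4096 (correct)
```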
| round | num_computed_tokens (cumulative) | num_scheduled_tokens (incremental) | tracker after += (old code) | tracker after += (fixed) |
| --- | --- | --- | --- | --- |
| 1 | 2048 | 2048 | 2048 | 2048 |
| 2 | 4096 | 2048 | 6144 | 4096 |

The inflated tracker value (6144 instead of 4096) does not cause out-of-bound stores thanks to the min() guard in GetStoreMetadata, but it corrupts internal bookkeeping and is a latent bug. The fix switches to scheduler_output.num_scheduled_tokens[request_id] (incremental), consistent with _process_new_requests.
Changes
- GetStoreMetadata: include num_lmcache_hit_blocks in computed_blocks to restore the correct upper bound for num_staging_blocks.
- _process_cached_requests: use scheduler_output.num_scheduled_tokens[request_id] (incremental) instead of cached_reqs.num_computed_tokens[idx] (cumulative).
Note
Medium Risk
Changes KV-cache store bookkeeping for cached and partially-hit requests; mistakes here could still cause missed stores or incorrect block accounting during prefill/generation.
Overview
Fixes LMCache MP store bookkeeping so KV blocks aren’t silently skipped when requests have a mix of vLLM APC hits and LMCache hits.
GetStoreMetadata now computes the storeable-block upper bound using scheduled blocks plus max(num_vllm_hit_blocks, num_lmcache_hit_blocks) (with added rationale comments), and _process_cached_requests now increments num_scheduled_tokens using the per-step scheduler_output.num_scheduled_tokens[request_id] instead of a cumulative computed-token counter.
Reviewed by Cursor Bugbot for commit a8ee866.
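The upper-bound computation described in this overview can be sketched as follows. The function name and signature are illustrative assumptions, not the connector's real API:

```python
# Hedged sketch of the fixed storeable-block upper bound: scheduled blocks
# plus max(vLLM APC hit blocks, LMCache hit blocks).
def storeable_block_upper_bound(num_scheduled_tokens, num_vllm_hit_blocks,
                                num_lmcache_hit_blocks, block_size=64):
    scheduled_blocks = num_scheduled_tokens // block_size
    # Whichever hit source is larger set the baseline behind
    # num_stored_blocks, so the bound must include it to keep the later
    # subtraction non-negative.
    return scheduled_blocks + max(num_vllm_hit_blocks, num_lmcache_hit_blocks)

# Bug 1 example: 256 scheduled tokens (4 blocks), 60 LMCache hit blocks.
print(storeable_block_upper_bound(256, 0, 60))  # 64
```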