[v1] Move block_hashes from KVCacheManager to Request.block_hashes (#19728) #19728


Open

wants to merge 1 commit into main from request-block-hashes

Conversation

orozery
Contributor

@orozery orozery commented Jun 17, 2025

This PR moves the request block hashes from the KVCacheManager to the Request object itself.
In particular, this allows connectors to access the request block hashes.

Part of the work described in RFC #19854
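
For illustration only, a minimal sketch of what request-owned hashes enable on the connector side (not the PR's exact API; the connector class, block_size handling, and lookup helper are assumptions):

```python
# Sketch: with block hashes living on the Request, a connector can match
# externally cached blocks directly from request.block_hashes instead of
# reaching into KVCacheManager internals.

class SketchConnector:
    def __init__(self, block_size: int):
        self.block_size = block_size

    def get_num_new_matched_tokens(self, request, num_computed_tokens: int) -> int:
        matched_blocks = 0
        for block_hash in request.block_hashes:
            if not self._has_external(block_hash):  # hypothetical probe
                break
            matched_blocks += 1
        return matched_blocks * self.block_size

    def _has_external(self, block_hash) -> bool:
        return False  # placeholder for a real external KV-store lookup
```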



👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of tests to catch errors quickly. You can run additional CI tests by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Jun 17, 2025

mergify bot commented Jun 17, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jun 17, 2025
@orozery orozery force-pushed the request-block-hashes branch from 457540f to a34e6ac Compare June 17, 2025 06:36
@mergify mergify bot removed the needs-rebase label Jun 17, 2025
mergify bot commented Jun 19, 2025 (repeat of the merge-conflict notice above)

@mergify mergify bot added the needs-rebase label Jun 19, 2025
@orozery orozery force-pushed the request-block-hashes branch from a34e6ac to 15001b4 Compare June 19, 2025 09:19
@mergify mergify bot removed the needs-rebase label Jun 19, 2025
@orozery orozery mentioned this pull request Jun 19, 2025
mergify bot commented Jun 25, 2025 (repeat of the merge-conflict notice above)

@mergify mergify bot added the needs-rebase label Jun 25, 2025
@orozery orozery force-pushed the request-block-hashes branch from 15001b4 to 914876e Compare June 26, 2025 07:21
@mergify mergify bot removed the needs-rebase label Jun 26, 2025
Member

@njhill njhill left a comment


This looks reasonable to me but would be good to get approval from @heheda12345 and @WoosukKwon.

@njhill njhill requested a review from heheda12345 July 2, 2025 18:15
Collaborator

@heheda12345 heheda12345 left a comment


My major concern is that the block_hash list is updated here

block_hashes.append(block_hash)

It is quite far from the Request class. It makes the number of elements in the list a magic value to connectors, and hurts the modularity of the code.
Therefore, if you want to put it into the Request class, can you move this logic from "update when a block hash is needed" to "update when the token ids of the request are updated"?

@orozery
Contributor Author

orozery commented Jul 6, 2025

> My major concern is that the block_hash list is updated here
>
> block_hashes.append(block_hash)
>
> It is quite far from the Request class. It makes the number of elements in the list a magic value to connectors, and hurts the modularity of the code.
> Therefore, if you want to put it into the Request class, can you move this logic from "update when a block hash is needed" to "update when the token ids of the request are updated"?

Today, block_hashes are computed when the scheduler calls _update_waiting_for_remote_kv.
Shortly after this call, calls are made to kv_cache_manager.get_computed_blocks and connector.get_num_new_matched_tokens.
The connector must get the block hashes at this point.

I'm trying to interpret your suggestion to "update when the token ids of the request are updated".
My best guess is that you mean request.append_output_token_ids. This is too late for the connector.
I suspect I am misunderstanding you.
Can you please be more concrete and reference actual code locations?

@orozery orozery force-pushed the request-block-hashes branch from 914876e to 234f334 Compare July 6, 2025 07:56
@heheda12345
Collaborator

Yes I mean request.append_output_token_ids.
Based on the code here, the new tokens are updated in update_from_output in step k, and then used by the connector in scheduler.schedule() of step k+1. Why is it too late?

vllm/vllm/v1/engine/core.py

Lines 223 to 240 in 72d14d0

```python
def step(self) -> tuple[dict[int, EngineCoreOutputs], bool]:
    """Schedule, execute, and make output.

    Returns tuple of outputs and a flag indicating whether the model
    was executed.
    """

    # Check for any requests remaining in the scheduler - unfinished,
    # or finished and not yet removed from the batch.
    if not self.scheduler.has_requests():
        return {}, False
    scheduler_output = self.scheduler.schedule()
    model_output = self.execute_model(scheduler_output)
    engine_core_outputs = self.scheduler.update_from_output(
        scheduler_output, model_output)  # type: ignore

    return (engine_core_outputs,
            scheduler_output.total_num_scheduled_tokens > 0)
```

@orozery
Contributor Author

orozery commented Jul 8, 2025

> Yes I mean request.append_output_token_ids. Based on the code here, the new tokens are updated in update_from_output in step k, and then used by the connector in scheduler.schedule() of step k+1. Why is it too late?


@heheda12345 Consider the following scenario:

  1. The engine gets a fresh new prompt with 1000 tokens (the number does not matter).
  2. The scheduler pops the prompt from its self.waiting list.
  3. The scheduler checks whether this prompt already has computed tokens in the GPU prefix cache by calling self.kv_cache_manager.get_computed_blocks. Assume no hits are found, so num_computed_tokens = 0.
  4. Next, it queries the connector for any externally hit tokens by calling self.connector.get_num_new_matched_tokens(request, ...). At this point, the connector needs the block hashes in order to check for hits.
  5. Under your approach of delaying the setting of request.block_hashes, the connector will yield num_external_computed_tokens = 0, even though it may have all the tokens available!
  6. The request tokens will be sent to the workers, which will recompute the KV values for these 1000 tokens.
  7. Only after the workers recompute the tokens does the scheduler call update_from_output, setting the block_hashes.

To sum up: because of the delayed setting of request.block_hashes, we lose the ability to utilize offloaded KV values of tokens from an external source.
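
A minimal sketch of the ordering in steps 3-5 (only get_computed_blocks and get_num_new_matched_tokens come from the scenario above; the surrounding function and names are illustrative):

```python
def admit_new_request(scheduler, request):
    # Step 3: local GPU prefix-cache lookup.
    computed_blocks, num_computed_tokens = (
        scheduler.kv_cache_manager.get_computed_blocks(request))

    # Step 4: external lookup via the connector. The connector matches by
    # block hash, so request.block_hashes must already be populated here.
    num_external_computed_tokens = (
        scheduler.connector.get_num_new_matched_tokens(
            request, num_computed_tokens))

    # Step 5: if hashing were deferred to append_output_token_ids,
    # request.block_hashes would still be empty at this point and the
    # external lookup would always miss.
    return num_computed_tokens, num_external_computed_tokens
```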

@heheda12345
Collaborator

But I think the hash of the prompt can be initialized when the request is created.

@orozery orozery force-pushed the request-block-hashes branch from ac256d3 to 908cb7b Compare July 27, 2025 08:26
@mergify mergify bot removed the needs-rebase label Jul 27, 2025
@orozery orozery force-pushed the request-block-hashes branch 2 times, most recently from 1b6c4b5 to 7403762 Compare July 27, 2025 08:50
mergify bot commented Jul 29, 2025 (repeat of the merge-conflict notice above)

@mergify mergify bot added the needs-rebase label Jul 29, 2025
@njhill njhill requested a review from heheda12345 July 31, 2025 17:02
```
@@ -3236,6 +3236,25 @@ def sha256_cbor_64bit(input) -> int:
     return full_hash & ((1 << 64) - 1)


 @lru_cache
```
Collaborator


I think we don't need a cache. This function should only be called once during initialization.

```python
block_hash = hash_block_tokens(self.caching_hash_fn,
                               prev_block_hash_value,
                               block_tokens, extra_keys)
request.block_hashes.append(block_hash)
```
Collaborator


Is it OK to add the hash function as an attribute of a request? Then we can update the hashes during request construction and in append_output_token_ids, as sketched below.
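
A rough sketch of that suggestion (not the PR's actual code; the hasher signature and helper names are assumptions), hashing the prompt's full blocks at construction and extending the list as output tokens arrive:

```python
from typing import Callable, Optional

class Request:
    def __init__(self, prompt_token_ids: list[int],
                 block_hasher: Optional[Callable[["Request"], list[int]]] = None):
        self.all_token_ids = list(prompt_token_ids)
        self.block_hashes: list[int] = []
        self.block_hasher = block_hasher
        if self.block_hasher is not None:
            # Hash all full blocks of the prompt up front, so the connector
            # sees a populated list before any decoding happens.
            self.block_hashes.extend(self.block_hasher(self))

    def append_output_token_ids(self, token_ids: list[int]) -> None:
        self.all_token_ids.extend(token_ids)
        if self.block_hasher is not None:
            # Hash any newly completed full blocks; a partial trailing block
            # waits until enough tokens accumulate.
            self.block_hashes.extend(self.block_hasher(self))
```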

@orozery orozery force-pushed the request-block-hashes branch from 7403762 to 8d48ef9 Compare August 3, 2025 16:36
@mergify mergify bot removed the needs-rebase label Aug 3, 2025
@orozery
Contributor Author

orozery commented Aug 3, 2025

@heheda12345 @njhill Moved all hashing to be triggered by Request directly.

@njhill njhill requested a review from heheda12345 August 4, 2025 10:11
Collaborator

@heheda12345 heheda12345 left a comment


Sorry for my late response. Can you add more comments in hash_request_tokens and cache_full_blocks? I think there are many tricks that need to be explained here.

```
@@ -108,8 +111,17 @@ def __init__(
         # indicates that the output is corrupted
         self.num_nans_in_logits = 0

         self.block_hashes: list[BlockHash] = []
         self.block_hasher: Optional[Callable[[Request],
```
Collaborator


Personally I'd prefer self.block_hasher = partial(block_hasher, request=self) so that we can call it with self.block_hasher() without passing self.
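
A tiny sketch of the partial-binding idea (block_hasher here is a stand-in for the PR's actual hasher):

```python
from functools import partial

def block_hasher(request) -> list[int]:
    # Stand-in: would hash the request's newly completed full blocks.
    return []

class Request:
    def __init__(self):
        self.block_hashes: list[int] = []
        # Bind self once, so call sites need no arguments.
        self.block_hasher = partial(block_hasher, request=self)

req = Request()
new_hashes = req.block_hasher()  # equivalent to block_hasher(request=req)
```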

@orozery
Contributor Author

orozery commented Aug 10, 2025

> Can you add more comments in hash_request_tokens

You mean in get_request_block_hasher?

@heheda12345
Collaborator

I meant: move some of the comments in hash_request_tokens and cache_full_blocks to your new code.

@orozery orozery force-pushed the request-block-hashes branch 2 times, most recently from 5ea914f to 0241ee6 Compare August 10, 2025 07:09
@orozery
Contributor Author

orozery commented Aug 10, 2025

@heheda12345 I made another set of changes.
Thanks for taking the time to give such helpful comments!

Collaborator

@heheda12345 heheda12345 left a comment


LGTM! Thanks for your great work and patience.

@heheda12345 heheda12345 changed the title v1: Add Request.block_hashes [v1] Move block_hashes from KVCacheManager to Request.block_hashes (#19728) Aug 11, 2025
@heheda12345 heheda12345 enabled auto-merge (squash) August 11, 2025 16:50
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 11, 2025
@njhill
Member

njhill commented Aug 11, 2025

@orozery could you rebase on latest main? Should hopefully help with at least some of the CI failures.

This commit moves the request block hashes from the KVCacheManager
to the Request object itself.
In particular, this allows connectors to access the request block hashes.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
auto-merge was automatically disabled August 11, 2025 19:16

Head branch was pushed to by a user without write access

@orozery orozery force-pushed the request-block-hashes branch from 0241ee6 to 564a359 Compare August 11, 2025 19:16
@orozery
Contributor Author

orozery commented Aug 11, 2025

> @orozery could you rebase on latest main? Should hopefully help with at least some of the CI failures.

Done.

@heheda12345 heheda12345 enabled auto-merge (squash) August 12, 2025 17:32
Labels
ready ONLY add when PR is ready to merge/full CI is needed v1