Allow to override KV cache memory calculation #19804
Conversation
Summary of Changes

Hello @da-x, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request provides a critical escape hatch for users encountering issues with vLLM's automatic KV cache memory calculation. By introducing an environment variable override, it enables manual configuration of the available memory, which can be crucial for successfully running multiple vLLM instances or in environments where the default calculation is problematic. This change focuses on usability and robustness without altering the core memory calculation logic.

Highlights

- KV Cache Memory Override: I've introduced support for the `VLLM_AVAIL_KV_CACHE_OVERRIDE` environment variable within the `determine_available_memory` function. This allows users to manually specify the available KV cache memory in bytes, providing a workaround for situations where the automatic calculation might be incorrect or insufficient, as described in the PR body (see the illustrative sketch after this list).
- Improved Logging: I've added an `INFO`-level log message to display the automatically determined KV cache memory. Additionally, a `WARNING`-level log message will now be printed whenever the `VLLM_AVAIL_KV_CACHE_OVERRIDE` environment variable is detected and used to override the calculated memory value, making it clear when manual intervention is applied.
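To make the described flow concrete, here is a minimal, self-contained sketch of how an environment-variable override of the auto-calculated KV cache memory could look. It is illustrative only; the actual change lives in `determine_available_memory` in `vllm/v1/worker/gpu_worker.py`, and the helper name `apply_kv_cache_override` is an assumption made for this example.

```python
import logging
import os

logger = logging.getLogger(__name__)


def apply_kv_cache_override(auto_calculated_bytes: int) -> int:
    """Illustrative sketch of the override flow described above (hypothetical helper)."""
    # Log the value produced by the automatic calculation.
    logger.info("Available KV cache memory: %d bytes", auto_calculated_bytes)
    override = os.environ.get("VLLM_AVAIL_KV_CACHE_OVERRIDE")
    if override:
        # Manual override takes precedence; warn so the operator can see it was applied.
        auto_calculated_bytes = int(override)
        logger.warning(
            "Overriding amount of memory available to KV cache: %d bytes",
            auto_calculated_bytes)
    return auto_calculated_bytes
```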
Code Review

This pull request introduces an environment variable `VLLM_AVAIL_KV_CACHE_OVERRIDE` to allow manual overriding of the calculated KV cache memory. This is a useful workaround for scenarios where the automatic calculation might be incorrect, as described by the author. The added logging for determined and overridden values will also aid in debugging.

My main feedback focuses on enhancing the robustness of parsing and validating the override value from the environment variable. Specifically, I've suggested adding error handling for non-integer inputs and validation for non-negative values to prevent potential crashes or misconfigurations during worker initialization. This ensures that if the override is used, it's applied correctly or fails explicitly.
vllm/v1/worker/gpu_worker.py (Outdated)

```python
if vllm_avail_kv_cache_override:
    available_kv_cache_memory = int(vllm_avail_kv_cache_override)
    logger.warning(
        "Overriding amount of memory available to KV cache: %d bytes",
        available_kv_cache_memory)
```
The environment variable `VLLM_AVAIL_KV_CACHE_OVERRIDE` is converted to an integer directly using `int()`. This can lead to a `ValueError` if the environment variable is set to a string that cannot be converted to an integer (e.g., "abc", "12.3", or an empty string if `os.getenv` was called without a default and the variable was empty). Such an unhandled exception would cause the worker initialization to crash.

Additionally, if the override value is negative, it could lead to undefined behavior or errors downstream when allocating memory for the KV cache.

To make this override mechanism more robust, consider the following improvements:

- Wrap the `int()` conversion in a `try-except ValueError` block to handle cases where the environment variable is not a valid integer.
- Validate that the overridden value is not negative. A negative memory amount is invalid.
- If parsing fails or the value is invalid, it's crucial to raise an error to make the misconfiguration apparent. This is especially important because this override is often used as a workaround for critical startup issues, and silently falling back to a problematic auto-calculated value would defeat its purpose.
```python
try:
    overridden_value = int(vllm_avail_kv_cache_override)
    if overridden_value < 0:
        # It's generally not expected for available memory to be negative.
        # Raise an error to make misconfiguration clear.
        raise ValueError(
            f"VLLM_AVAIL_KV_CACHE_OVERRIDE cannot be negative. "
            f"Received value: {vllm_avail_kv_cache_override}")
    available_kv_cache_memory = overridden_value
    logger.warning(
        "Overriding amount of memory available to KV cache to %d bytes "
        "due to VLLM_AVAIL_KV_CACHE_OVERRIDE=%s.",
        available_kv_cache_memory, vllm_avail_kv_cache_override)
except ValueError as e:
    logger.error(
        "Failed to apply VLLM_AVAIL_KV_CACHE_OVERRIDE='%s'. "
        "Error: %s. Please ensure it's a valid non-negative integer.",
        vllm_avail_kv_cache_override, e)
    raise  # Re-raise to halt initialization if override is malformed
```
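As a quick sanity check of the suggested validation, the snippet below exercises the same parse-and-validate behavior in isolation. `parse_kv_cache_override` is a hypothetical helper written only for this example, not code from the PR or the review suggestion.

```python
def parse_kv_cache_override(raw: str) -> int:
    """Hypothetical helper mirroring the suggested validation above."""
    value = int(raw)  # raises ValueError for "abc", "12.3", or ""
    if value < 0:
        raise ValueError(
            f"VLLM_AVAIL_KV_CACHE_OVERRIDE cannot be negative. "
            f"Received value: {raw}")
    return value


# A valid input parses to an integer byte count.
assert parse_kv_cache_override("1073741824") == 1 << 30

# Invalid inputs fail loudly instead of silently misconfiguring the worker.
for bad in ("abc", "12.3", "", "-1"):
    try:
        parse_kv_cache_override(bad)
    except ValueError:
        pass  # expected
    else:
        raise AssertionError(f"{bad!r} should have been rejected")
```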
Hi @da-x, IIUC, this bug was recently fixed by #18974 (@ProExpertProg). Which version of vLLM did you use?

@WoosukKwon I used 8f4ffbd (Jun 4), which did not contain the fix you mentioned. So maybe we can close this one, unless enough people think it still has value.
When trying to run two LMCache instances on a single GPU, the available KV cache memory calculation seemed to be incorrect, even with `--gpu-memory-utilization 0.38` given to both instances. It worked for the first instance but not for the second one, in which `available_kv_cache_memory` turned out to be 0.

Without spending time to debug this, with this change I was able to force the second instance to use the same calculated value of `available_kv_cache_memory` as the first instance, and then the second instance booted correctly and was fully functional.

So it seems to me that `VLLM_AVAIL_KV_CACHE_OVERRIDE` would be handy for similar situations, even if we are not fixing this.

I should probably add more code to print the details of the calculation to see where the problem originates, but for now I'd very much like to have the `VLLM_AVAIL_KV_CACHE_OVERRIDE` environment variable supported.
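For reference, a hedged usage sketch of the scenario described above: the byte value and model name are placeholders, and the override is set via `os.environ` before constructing the engine (equivalently, it could be exported in the shell before launching the second instance).

```python
import os

# Suppose the first instance reported, e.g., 8 GiB of available KV cache memory
# in its startup log; force the second instance to use the same budget.
# The exact byte value here is a placeholder.
os.environ["VLLM_AVAIL_KV_CACHE_OVERRIDE"] = str(8 * 1024**3)

from vllm import LLM

# Placeholder model; gpu_memory_utilization matches the scenario described above.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.38)
```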