Skip to content

Allow to override KV cache memory calculation #19804

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

da-x
Copy link

@da-x da-x commented Jun 18, 2025

When trying to run two LMCache instances on a single GPU, the available KV cache memory calculation seemed to be incorrect, even with --gpu-memory-utilization 0.38 given to both instances. It worked for the first instance but not for the second one, in which available_kv_cache_memory turned out to be 0.

Without spending time to debug this, with this change I was able to force the second instance to have the same calculated value of available_kv_cache_memory of the first instance, and then the second instance booted correctly and was fully functional.

So it seems to me that VLLM_AVAIL_KV_CACHE_OVERRIDE would be handy for similar situations, even if we are not fixing this.

I should probably add more code to print the details of the calculation to see where the problem originates, but for now I'd very much like to have VLLM_AVAIL_KV_CACHE_OVERRIDE environment supported.

Copy link

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @da-x, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request provides a critical escape hatch for users encountering issues with VLLM's automatic KV cache memory calculation. By introducing an environment variable override, it enables manual configuration of the available memory, which can be crucial for successfully running multiple VLLM instances or in environments where the default calculation is problematic. This change focuses on usability and robustness without altering the core memory calculation logic.

Highlights

  • KV Cache Memory Override: I've introduced support for the VLLM_AVAIL_KV_CACHE_OVERRIDE environment variable within the determine_available_memory function. This allows users to manually specify the available KV cache memory in bytes, providing a workaround for situations where the automatic calculation might be incorrect or insufficient, as described in the PR body.
  • Improved Logging: I've added an INFO level log message to display the automatically determined KV cache memory. Additionally, a WARNING level log message will now be printed whenever the VLLM_AVAIL_KV_CACHE_OVERRIDE environment variable is detected and used to override the calculated memory value, making it clear when manual intervention is applied.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify mergify bot added the v1 label Jun 18, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an environment variable VLLM_AVAIL_KV_CACHE_OVERRIDE to allow manual overriding of the calculated KV cache memory. This is a useful workaround for scenarios where the automatic calculation might be incorrect, as described by the author. The added logging for determined and overridden values will also aid in debugging.

My main feedback focuses on enhancing the robustness of parsing and validating the override value from the environment variable. Specifically, I've suggested adding error handling for non-integer inputs and validation for non-negative values to prevent potential crashes or misconfigurations during worker initialization. This ensures that if the override is used, it's applied correctly or fails explicitly.

Comment on lines 243 to 247
if vllm_avail_kv_cache_override:
available_kv_cache_memory = int(vllm_avail_kv_cache_override)
logger.warning(
"Overriding amount of memory available to KV cache: %d bytes",
available_kv_cache_memory)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The environment variable VLLM_AVAIL_KV_CACHE_OVERRIDE is converted to an integer directly using int(). This can lead to a ValueError if the environment variable is set to a string that cannot be converted to an integer (e.g., "abc", "12.3", or an empty string if os.getenv was called without a default and the variable was empty). Such an unhandled exception would cause the worker initialization to crash.

Additionally, if the override value is negative, it could lead to undefined behavior or errors downstream when allocating memory for the KV cache.

To make this override mechanism more robust, consider the following improvements:

  1. Wrap the int() conversion in a try-except ValueError block to handle cases where the environment variable is not a valid integer.
  2. Validate that the overridden_value is not negative. A negative memory amount is invalid.
  3. If parsing fails or the value is invalid, it's crucial to raise an error to make the misconfiguration apparent. This is especially important because this override is often used as a workaround for critical startup issues, and silently falling back to a problematic auto-calculated value would defeat its purpose.
            try:
                overridden_value = int(vllm_avail_kv_cache_override)
                if overridden_value < 0:
                    # It's generally not expected for available memory to be negative.
                    # Raise an error to make misconfiguration clear.
                    raise ValueError(
                        f"VLLM_AVAIL_KV_CACHE_OVERRIDE cannot be negative. "
                        f"Received value: {vllm_avail_kv_cache_override}"
                    )
                
                available_kv_cache_memory = overridden_value
                logger.warning(
                    "Overriding amount of memory available to KV cache to %d bytes "
                    "due to VLLM_AVAIL_KV_CACHE_OVERRIDE=%s.",
                    available_kv_cache_memory, vllm_avail_kv_cache_override
                )
            except ValueError as e:
                logger.error(
                    "Failed to apply VLLM_AVAIL_KV_CACHE_OVERRIDE='%s'. "
                    "Error: %s. Please ensure it's a valid non-negative integer.",
                    vllm_avail_kv_cache_override, e
                )
                raise  # Re-raise to halt initialization if override is malformed

@da-x da-x force-pushed the kvcache-mem-override-pr branch from 41836c5 to d413fa2 Compare June 22, 2025 14:25
When trying to run two LMCache instances on a single GPU, the available
KV cache memory calculation seemed to be incorrect, even with
`--gpu-memory-utilization 0.38` given to both instances. It worked for
the first instance but not for the second one, in which
`available_kv_cache_memory` turned out to be 0.

Without spending time to debug this, with this change I was able to
force the second instance to have the same calculated value of
`available_kv_cache_memory` of the first instance, and then the second
instance booted correctly and was fully functional.

So it seems to me that VLLM_AVAIL_KV_CACHE_OVERRIDE would be handy for
similar situations, even if we are not fixing this.

I should probably add more code to print the details of the calculation
to see where the problem originates, but for now I'd very much like to
have VLLM_AVAIL_KV_CACHE_OVERRIDE environment supported.

Signed-off-by: Dan Aloni <dan.aloni@vastdata.com>
@da-x da-x force-pushed the kvcache-mem-override-pr branch from d413fa2 to fd34b3e Compare June 22, 2025 14:34
@WoosukKwon
Copy link
Collaborator

Hi @da-x, IIUC, this bug was recently fixed by #18974 (@ProExpertProg). Which version of vLLM did you use?

@da-x
Copy link
Author

da-x commented Jun 24, 2025

@WoosukKwon I used 8f4ffbd (Jun 4), which did not contain the fix you mentioned. So maybe we can close this one, unless enough people think it still has value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants