Allow to override KV cache memory calculation #19804
Conversation
Summary of Changes

Hello @da-x, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request provides a critical escape hatch for users encountering issues with vLLM's automatic KV cache memory calculation. By introducing an environment variable override, it enables manual configuration of the available memory, which can be crucial for successfully running multiple vLLM instances or in environments where the default calculation is problematic. This change focuses on usability and robustness without altering the core memory calculation logic.

Highlights

- KV Cache Memory Override: I've introduced support for the `VLLM_AVAIL_KV_CACHE_OVERRIDE` environment variable within the `determine_available_memory` function. This allows users to manually specify the available KV cache memory in bytes, providing a workaround for situations where the automatic calculation might be incorrect or insufficient, as described in the PR body (see the illustrative sketch after this list).
- Improved Logging: I've added an `INFO`-level log message to display the automatically determined KV cache memory. Additionally, a `WARNING`-level log message will now be printed whenever the `VLLM_AVAIL_KV_CACHE_OVERRIDE` environment variable is detected and used to override the calculated memory value, making it clear when manual intervention is applied.
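To make the described flow concrete, here is a minimal, self-contained sketch of how an environment-variable override of the auto-calculated KV cache memory could look. It is illustrative only; the actual change lives in `determine_available_memory` in `vllm/v1/worker/gpu_worker.py`, and the helper name `apply_kv_cache_override` is an assumption made for this example.

```python
import logging
import os

logger = logging.getLogger(__name__)


def apply_kv_cache_override(auto_calculated_bytes: int) -> int:
    """Illustrative sketch of the override flow described above (hypothetical helper)."""
    # Log the value produced by the automatic calculation.
    logger.info("Available KV cache memory: %d bytes", auto_calculated_bytes)
    override = os.environ.get("VLLM_AVAIL_KV_CACHE_OVERRIDE")
    if override:
        # Manual override takes precedence; warn so the operator can see it was applied.
        auto_calculated_bytes = int(override)
        logger.warning(
            "Overriding amount of memory available to KV cache: %d bytes",
            auto_calculated_bytes)
    return auto_calculated_bytes
```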
Code Review

This pull request introduces an environment variable `VLLM_AVAIL_KV_CACHE_OVERRIDE` to allow manual overriding of the calculated KV cache memory. This is a useful workaround for scenarios where the automatic calculation might be incorrect, as described by the author. The added logging for determined and overridden values will also aid in debugging.

My main feedback focuses on enhancing the robustness of parsing and validating the override value from the environment variable. Specifically, I've suggested adding error handling for non-integer inputs and validation for non-negative values to prevent potential crashes or misconfigurations during worker initialization. This ensures that if the override is used, it's applied correctly or fails explicitly.
vllm/v1/worker/gpu_worker.py (Outdated)

```python
if vllm_avail_kv_cache_override:
    available_kv_cache_memory = int(vllm_avail_kv_cache_override)
    logger.warning(
        "Overriding amount of memory available to KV cache: %d bytes",
        available_kv_cache_memory)
```
The environment variable `VLLM_AVAIL_KV_CACHE_OVERRIDE` is converted to an integer directly using `int()`. This can lead to a `ValueError` if the environment variable is set to a string that cannot be converted to an integer (e.g., "abc", "12.3", or an empty string if `os.getenv` was called without a default and the variable was empty). Such an unhandled exception would cause the worker initialization to crash.

Additionally, if the override value is negative, it could lead to undefined behavior or errors downstream when allocating memory for the KV cache.

To make this override mechanism more robust, consider the following improvements:

- Wrap the `int()` conversion in a `try-except ValueError` block to handle cases where the environment variable is not a valid integer.
- Validate that the overridden value is not negative. A negative memory amount is invalid.
- If parsing fails or the value is invalid, it's crucial to raise an error to make the misconfiguration apparent. This is especially important because this override is often used as a workaround for critical startup issues, and silently falling back to a problematic auto-calculated value would defeat its purpose.
```python
try:
    overridden_value = int(vllm_avail_kv_cache_override)
    if overridden_value < 0:
        # It's generally not expected for available memory to be negative.
        # Raise an error to make misconfiguration clear.
        raise ValueError(
            f"VLLM_AVAIL_KV_CACHE_OVERRIDE cannot be negative. "
            f"Received value: {vllm_avail_kv_cache_override}")
    available_kv_cache_memory = overridden_value
    logger.warning(
        "Overriding amount of memory available to KV cache to %d bytes "
        "due to VLLM_AVAIL_KV_CACHE_OVERRIDE=%s.",
        available_kv_cache_memory, vllm_avail_kv_cache_override)
except ValueError as e:
    logger.error(
        "Failed to apply VLLM_AVAIL_KV_CACHE_OVERRIDE='%s'. "
        "Error: %s. Please ensure it's a valid non-negative integer.",
        vllm_avail_kv_cache_override, e)
    raise  # Re-raise to halt initialization if override is malformed
```
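As a quick sanity check of the suggested validation, the snippet below exercises the same parse-and-validate behavior in isolation. `parse_kv_cache_override` is a hypothetical helper written only for this example, not code from the PR or the review suggestion.

```python
def parse_kv_cache_override(raw: str) -> int:
    """Hypothetical helper mirroring the suggested validation above."""
    value = int(raw)  # raises ValueError for "abc", "12.3", or ""
    if value < 0:
        raise ValueError(
            f"VLLM_AVAIL_KV_CACHE_OVERRIDE cannot be negative. "
            f"Received value: {raw}")
    return value


# A valid input parses to an integer byte count.
assert parse_kv_cache_override("1073741824") == 1 << 30

# Invalid inputs fail loudly instead of silently misconfiguring the worker.
for bad in ("abc", "12.3", "", "-1"):
    try:
        parse_kv_cache_override(bad)
    except ValueError:
        pass  # expected
    else:
        raise AssertionError(f"{bad!r} should have been rejected")
```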
Hi @da-x, IIUC, this bug was recently fixed by #18974 (@ProExpertProg). Which version of vLLM did you use?

@WoosukKwon I used 8f4ffbd (Jun 4), which did not contain the fix you mentioned. So maybe we can close this one, unless enough people think it still has value.
When trying to run two LMCache instances on a single GPU, the available KV cache memory calculation seemed to be incorrect, even with `--gpu-memory-utilization 0.38` given to both instances. It worked for the first instance but not for the second one, in which `available_kv_cache_memory` turned out to be 0.

Without spending time to debug this, with this change I was able to force the second instance to use the same calculated value of `available_kv_cache_memory` as the first instance, and then the second instance booted correctly and was fully functional.

So it seems to me that `VLLM_AVAIL_KV_CACHE_OVERRIDE` would be handy for similar situations, even if we are not fixing this.

I should probably add more code to print the details of the calculation to see where the problem originates, but for now I'd very much like to have the `VLLM_AVAIL_KV_CACHE_OVERRIDE` environment variable supported.
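For reference, a hedged usage sketch of the scenario described above: the byte value and model name are placeholders, and the override is set via `os.environ` before constructing the engine (equivalently, it could be exported in the shell before launching the second instance).

```python
import os

# Suppose the first instance reported, e.g., 8 GiB of available KV cache memory
# in its startup log; force the second instance to use the same budget.
# The exact byte value here is a placeholder.
os.environ["VLLM_AVAIL_KV_CACHE_OVERRIDE"] = str(8 * 1024**3)

from vllm import LLM

# Placeholder model; gpu_memory_utilization matches the scenario described above.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.38)
```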