
Add new CLI options #68

Merged: 9 commits merged into develop from fco/expose_more_args on Mar 19, 2025

Conversation

fcogidi (Contributor) commented Mar 14, 2025

PR Type

Feature

Short Description

Add new CLI options that can help with performance tuning (an illustrative sketch follows the list below).

  • --enable-chunked-prefill: enables chunked prefill, which prioritizes decode requests over prefill requests.
  • --max-num-batch-tokens: specifies the token budget for chunked prefill.
  • --enable-prefix-caching: enables automatic prefix caching, which reuses the KV cache of existing requests whose prefixes match new requests.
  • --compilation-config: sets the torch.compile optimization level.
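
For illustration, a minimal sketch of how these options could be declared with click; the command name, defaults, and help strings below are assumptions for the sketch, not the exact definitions added in this PR.

    import click

    @click.command()
    @click.option(
        "--enable-chunked-prefill",
        is_flag=True,
        help="Prioritize decode requests by chunking long prefills.",
    )
    @click.option(
        "--max-num-batch-tokens",
        type=int,
        default=None,
        help="Token budget per scheduling step when chunked prefill is enabled.",
    )
    @click.option(
        "--enable-prefix-caching",
        is_flag=True,
        help="Reuse KV cache for prompts that share a prefix with earlier requests.",
    )
    @click.option(
        "--compilation-config",
        type=click.Choice(["0", "3"]),
        default="0",
        help="torch.compile optimization level ('0' means no optimization).",
    )
    def launch(enable_chunked_prefill, max_num_batch_tokens, enable_prefix_caching, compilation_config):
        """Hypothetical launch command: just echoes the resolved option values."""
        click.echo(
            f"chunked_prefill={enable_chunked_prefill}, "
            f"max_num_batch_tokens={max_num_batch_tokens}, "
            f"prefix_caching={enable_prefix_caching}, "
            f"compilation_config={compilation_config}"
        )

    if __name__ == "__main__":
        launch()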

Tests Added

...

fcogidi added the enhancement (New feature or request) label on Mar 14, 2025
fcogidi requested review from amrit110 and XkunW on Mar 14, 2025 at 19:35
fcogidi self-assigned this on Mar 14, 2025
fcogidi (Contributor, Author) commented Mar 14, 2025

Found a case where the default --max-num-batch-tokens value doesn't work (Llama-2-70b-chat-hf, with a context length of 4096, throws an error).

ERROR 03-14 19:10:34 engine.py:400] ValueError: max_num_batched_tokens (2048) is smaller than max_model_len (4096). This effectively limits the maximum sequence length to max_num_batched_tokens and makes vLLM reject longer sequences. Please increase max_num_batched_tokens or decrease max_model_len.

Will fix on Monday.
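
A minimal sketch of one way to avoid this, assuming the requested token budget and the model's max_model_len are both available when resolving launch arguments (the actual fix in this PR may differ):

    from typing import Optional

    DEFAULT_MAX_NUM_BATCHED_TOKENS = 2048  # assumed default for this sketch

    def resolve_max_num_batched_tokens(requested: Optional[int], max_model_len: int) -> int:
        """Return a token budget no smaller than the model's context length."""
        budget = requested if requested is not None else DEFAULT_MAX_NUM_BATCHED_TOKENS
        # Guard against the ValueError above: vLLM rejects a budget below max_model_len.
        return max(budget, max_model_len)

    # Example: Llama-2-70b-chat-hf with a 4096-token context and no explicit budget.
    assert resolve_max_num_batched_tokens(None, 4096) == 4096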

fcogidi (Contributor, Author) commented Mar 17, 2025

@XkunW The options should be working now.

codecov-commenter commented Mar 17, 2025

Codecov Report

Attention: Patch coverage is 74.07407% with 7 lines in your changes missing coverage. Please review.

Project coverage is 79.07%. Comparing base (3794604) to head (5bf5ac7).

Files with missing lines    Patch %    Lines
vec_inf/cli/_helper.py      68.18%     7 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop      #68      +/-   ##
===========================================
- Coverage    79.50%   79.07%   -0.43%     
===========================================
  Files            4        4              
  Lines          522      540      +18     
===========================================
+ Hits           415      427      +12     
- Misses         107      113       +6     
Files with missing lines    Coverage Δ
vec_inf/cli/_cli.py         83.15% <100.00%> (+0.74%) ⬆️
vec_inf/cli/_config.py      100.00% <100.00%> (ø)
vec_inf/cli/_utils.py       85.88% <ø> (-0.64%) ⬇️
vec_inf/cli/_helper.py      74.32% <68.18%> (-0.53%) ⬇️


XkunW merged commit 9ef73c0 into develop on Mar 19, 2025
6 checks passed
XkunW deleted the fco/expose_more_args branch on Mar 19, 2025 at 18:14
    )
    @click.option(
        "--compilation-config",
-       type=click.Choice(["0", "3"]),
+       type=click.Choice(["0", "1", "2", "3"]),
        help="torch.compile optimization level, accepts '0' or '3', default to '0', which means no optimization is applied",
fcogidi (Contributor, Author) commented Mar 19, 2025

1 and 2 are meant for internal use for vLLM developers.

From the vLLM docs for --compilation-config, -O:

torch.compile configuration for the model. When it is a number (0, 1, 2, 3), it will be interpreted as the optimization level. NOTE: level 0 is the default level without any optimization. level 1 and 2 are for internal testing only. level 3 is the recommended level for production. To specify the full compilation config, use a JSON string. Following the convention of traditional compilers, using -O without space is also supported. -O3 is equivalent to -O 3.
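
For illustration only (not code from this PR), the quoted docs imply the following equivalent ways of requesting level 3 when assembling a vLLM server command; the model name is a placeholder, and the JSON form assumes the config schema exposes a "level" field:

    # Three ways to request torch.compile level 3, per the docs quoted above.
    args_spaced = ["vllm", "serve", "model-name", "-O", "3"]
    args_fused = ["vllm", "serve", "model-name", "-O3"]  # -O without a space
    args_json = ["vllm", "serve", "model-name",
                 "--compilation-config", '{"level": 3}']  # full config as a JSON string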

Contributor replied:

Good catch, missed that bit in the description, will get rid of them

Comment on lines 177 to 184
if "Prefix cache hit rate" in line:
# Parse the metric values from the line
metrics_str = line.split("] ")[1].strip()
prefix, metrics_str = metrics_str.split(": ", 1)
metrics_list = metrics_str.split(", ")
for metric in metrics_list:
key, value = metric.split(": ")
latest_metric[f"{key} {prefix}"] = value
fcogidi (Contributor, Author):

Prefix cache hit rate is not captured in the updated metrics command.

Contributor:

I don't think this metric is relevant if prefix caching isn't enabled. I can add it in if you think it is useful; otherwise it just complicates the metrics command logic.

fcogidi (Contributor, Author):

Yeah, it's not needed if prefix caching is not enabled, so no need to add it. I can still view it from the log file.

Contributor:

I don't think the log files capture the production metrics anymore? IIUC everything goes to the metrics API endpoint. I'm working on another PR right now and will add this change in there.
