
Add new CLI options #68

Merged: 9 commits merged into develop from fco/expose_more_args on Mar 19, 2025

Conversation

fcogidi (Contributor) commented Mar 14, 2025

PR Type

Feature

Short Description

Add new CLI options that can help with performance tuning (an illustrative sketch follows the list below).

  • --enable-chunked-prefill: enables chunked prefill, which prioritizes decode requests over prefill requests.
  • --max-num-batch-tokens: specifies the token budget for chunked prefill.
  • --enable-prefix-caching: enables automatic prefix caching, which reuses the KV cache of existing requests whose prefixes match new requests.
  • --compilation-config: sets the torch.compile optimization level.
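
For illustration, a minimal sketch of how these options could be declared with click; the command name, defaults, and help strings below are assumptions for the sketch, not the exact definitions added in this PR.

    import click

    @click.command()
    @click.option(
        "--enable-chunked-prefill",
        is_flag=True,
        help="Prioritize decode requests by chunking long prefills.",
    )
    @click.option(
        "--max-num-batch-tokens",
        type=int,
        default=None,
        help="Token budget per scheduling step when chunked prefill is enabled.",
    )
    @click.option(
        "--enable-prefix-caching",
        is_flag=True,
        help="Reuse KV cache for prompts that share a prefix with earlier requests.",
    )
    @click.option(
        "--compilation-config",
        type=click.Choice(["0", "3"]),
        default="0",
        help="torch.compile optimization level ('0' means no optimization).",
    )
    def launch(enable_chunked_prefill, max_num_batch_tokens, enable_prefix_caching, compilation_config):
        """Hypothetical launch command: just echoes the resolved option values."""
        click.echo(
            f"chunked_prefill={enable_chunked_prefill}, "
            f"max_num_batch_tokens={max_num_batch_tokens}, "
            f"prefix_caching={enable_prefix_caching}, "
            f"compilation_config={compilation_config}"
        )

    if __name__ == "__main__":
        launch()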

Tests Added

...

fcogidi added the enhancement (New feature or request) label on Mar 14, 2025
fcogidi requested review from amrit110 and XkunW on Mar 14, 2025 at 19:35
fcogidi self-assigned this on Mar 14, 2025
fcogidi (Contributor, Author) commented Mar 14, 2025

Found a case where the default --max-num-batch-tokens value doesn't work (Llama-2-70b-chat-hf, with a context length of 4096, throws an error).

ERROR 03-14 19:10:34 engine.py:400] ValueError: max_num_batched_tokens (2048) is smaller than max_model_len (4096). This effectively limits the maximum sequence length to max_num_batched_tokens and makes vLLM reject longer sequences. Please increase max_num_batched_tokens or decrease max_model_len.

Will fix on Monday.
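
A minimal sketch of one way to avoid this, assuming the requested token budget and the model's max_model_len are both available when resolving launch arguments (the actual fix in this PR may differ):

    from typing import Optional

    DEFAULT_MAX_NUM_BATCHED_TOKENS = 2048  # assumed default for this sketch

    def resolve_max_num_batched_tokens(requested: Optional[int], max_model_len: int) -> int:
        """Return a token budget no smaller than the model's context length."""
        budget = requested if requested is not None else DEFAULT_MAX_NUM_BATCHED_TOKENS
        # Guard against the ValueError above: vLLM rejects a budget below max_model_len.
        return max(budget, max_model_len)

    # Example: Llama-2-70b-chat-hf with a 4096-token context and no explicit budget.
    assert resolve_max_num_batched_tokens(None, 4096) == 4096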

fcogidi (Contributor, Author) commented Mar 17, 2025

@XkunW The options should be working now.

codecov-commenter commented Mar 17, 2025

Codecov Report

Attention: Patch coverage is 74.07407% with 7 lines in your changes missing coverage. Please review.

Project coverage is 79.07%. Comparing base (3794604) to head (5bf5ac7).

Files with missing lines    Patch %    Lines
vec_inf/cli/_helper.py      68.18%     7 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop      #68      +/-   ##
===========================================
- Coverage    79.50%   79.07%   -0.43%     
===========================================
  Files            4        4              
  Lines          522      540      +18     
===========================================
+ Hits           415      427      +12     
- Misses         107      113       +6     
Files with missing lines    Coverage Δ
vec_inf/cli/_cli.py         83.15% <100.00%> (+0.74%) ⬆️
vec_inf/cli/_config.py      100.00% <100.00%> (ø)
vec_inf/cli/_utils.py       85.88% <ø> (-0.64%) ⬇️
vec_inf/cli/_helper.py      74.32% <68.18%> (-0.53%) ⬇️


XkunW merged commit 9ef73c0 into develop on Mar 19, 2025
6 checks passed
XkunW deleted the fco/expose_more_args branch on Mar 19, 2025 at 18:14
    )
    @click.option(
        "--compilation-config",
-       type=click.Choice(["0", "3"]),
+       type=click.Choice(["0", "1", "2", "3"]),
        help="torch.compile optimization level, accepts '0' or '3', default to '0', which means no optimization is applied",
fcogidi (Contributor, Author) commented Mar 19, 2025

1 and 2 are meant for internal use for vLLM developers.

From the vLLM docs for --compilation-config, -O:

torch.compile configuration for the model. When it is a number (0, 1, 2, 3), it will be interpreted as the optimization level. NOTE: level 0 is the default level without any optimization. level 1 and 2 are for internal testing only. level 3 is the recommended level for production. To specify the full compilation config, use a JSON string. Following the convention of traditional compilers, using -O without space is also supported. -O3 is equivalent to -O 3.
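
For illustration only (not code from this PR), the quoted docs imply the following equivalent ways of requesting level 3 when assembling a vLLM server command; the model name is a placeholder, and the JSON form assumes the config schema exposes a "level" field:

    # Three ways to request torch.compile level 3, per the docs quoted above.
    args_spaced = ["vllm", "serve", "model-name", "-O", "3"]
    args_fused = ["vllm", "serve", "model-name", "-O3"]  # -O without a space
    args_json = ["vllm", "serve", "model-name",
                 "--compilation-config", '{"level": 3}']  # full config as a JSON string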

Contributor replied:

Good catch, missed that bit in the description, will get rid of them

Comment on lines 177 to 184
if "Prefix cache hit rate" in line:
# Parse the metric values from the line
metrics_str = line.split("] ")[1].strip()
prefix, metrics_str = metrics_str.split(": ", 1)
metrics_list = metrics_str.split(", ")
for metric in metrics_list:
key, value = metric.split(": ")
latest_metric[f"{key} {prefix}"] = value
fcogidi (Contributor, Author):

Prefix cache hit rate is not captured in the updated metrics command.

Contributor:

I don't think this metric is relevant if prefix caching isn't enabled. I can add it in if you think it is useful; otherwise it just complicates the metrics command logic.

fcogidi (Contributor, Author):

Yeah, it's not needed if prefix caching is not enabled, so no need to add it. I can still view it from the log file.

Contributor:

I don't think the log files capture the production metrics anymore? IIUC everything goes to the metrics API endpoint. I'm working on another PR right now and will add this change in there.
