I want to use the DP multi-API-server feature, so I upgraded to the latest version (0.9.2). I then ran a benchmark and found that overall performance has decreased. I want to know why this is happening. Am I missing any settings or features?
Before:
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 30.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████| 10000/10000 [07:31<00:00, 22.16it/s]
============ Serving Benchmark Result ============
Successful requests: 10000
Benchmark duration (s): 451.28
Total input tokens: 2205227
Total generated tokens: 2011692
Request throughput (req/s): 22.16
Output token throughput (tok/s): 4457.76
Total Token throughput (tok/s): 9344.38
---------------Time to First Token----------------
Mean TTFT (ms): 987.38
Median TTFT (ms): 538.83
P99 TTFT (ms): 8046.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 262.05
Median TPOT (ms): 267.88
P99 TPOT (ms): 427.91
---------------Inter-token Latency----------------
Mean ITL (ms): 245.41
Median ITL (ms): 220.48
P99 ITL (ms): 678.37
==================================================
After:
============ Serving Benchmark Result ============
Successful requests: 10000
Benchmark duration (s): 538.73
Total input tokens: 2205227
Total generated tokens: 2014470
Request throughput (req/s): 18.56
Output token throughput (tok/s): 3739.29
Total Token throughput (tok/s): 7832.67
---------------Time to First Token----------------
Mean TTFT (ms): 55239.91
Median TTFT (ms): 55251.84
P99 TTFT (ms): 124103.15
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 118.82
Median TPOT (ms): 120.96
P99 TPOT (ms): 138.40
---------------Inter-token Latency----------------
Mean ITL (ms): 117.63
Median ITL (ms): 116.30
P99 ITL (ms): 185.65
==================================================
I ran the same test on the same machine and GPUs (2× A100); overall performance was worse.
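For reference, the headline throughput figures in both reports can be re-derived from the reported request counts, durations, and token totals (to within rounding), so the gap is in wall-clock duration rather than in how the metrics are computed. A quick check in plain Python, with the values copied from the two reports above:

# Re-derive the headline numbers from the two benchmark reports above.
runs = {
    "before": {"requests": 10_000, "duration_s": 451.28, "output_tokens": 2_011_692},
    "after":  {"requests": 10_000, "duration_s": 538.73, "output_tokens": 2_014_470},
}
for name, r in runs.items():
    req_tput = r["requests"] / r["duration_s"]        # "Request throughput (req/s)"
    out_tput = r["output_tokens"] / r["duration_s"]   # "Output token throughput (tok/s)"
    print(f"{name}: {req_tput:.2f} req/s, {out_tput:.2f} output tok/s")
# before: 22.16 req/s, 4457.75 output tok/s
# after: 18.56 req/s, 3739.29 output tok/s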
Test Command:
MODEL_NAME="/data/others/*****/qwen3-moe"
BACKEND="openai-chat"
DATASET_NAME="sharegpt"
DATASET_PATH="/home/*****/dataset/ShareGPT/ShareGPT_V3_unfiltered_cleaned_split.json"
NUM_PROMPTS=10000
VLLM_LOG="vllm_server_moe.log"
BENCH_LOG="benchmark_moe.log"
nohup vllm serve ${MODEL_NAME} \
--disable-log-requests \
--tensor-parallel-size 1 \
--max-model-len 5000 \
--data-parallel-size 2 \
--gpu_memory_utilization 0.95 \
--enable_expert_parallel \
--api-server-count=2 \
--enforce-eager \
> ${VLLM_LOG} 2>&1 &
(Performance does not change with or without --api-server-count; TTFT stays around 55,000 ms either way.)
python3 ~/vllm/benchmarks/benchmark_serving.py \
--backend ${BACKEND} \
--model ${MODEL_NAME} \
--random-input-len 1500 \
--random-output-len 2500 \
--random-range-ratio 0.3 \
--endpoint /v1/chat/completions \
--dataset-name ${DATASET_NAME} \
--dataset-path ${DATASET_PATH} \
--num-prompts ${NUM_PROMPTS} \
--request-rate 30 \
> ${BENCH_LOG} 2>&1
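For context on how the load is generated: with --request-rate 30 and the default burstiness of 1.0, the benchmark dispatches requests with exponentially distributed inter-arrival gaps (the "Poisson process" line in the results header above). A rough sketch of that schedule, assuming numpy and not claiming to match benchmark_serving.py line for line:

# Sketch of a Poisson arrival schedule at 30 req/s (burstiness 1.0).
import numpy as np

request_rate = 30.0                      # mean offered load, requests per second
num_prompts = 10_000
rng = np.random.default_rng(0)
gaps = rng.exponential(scale=1.0 / request_rate, size=num_prompts)
send_times = np.cumsum(gaps)             # timestamps (s) at which requests are fired
print(f"last request dispatched around {send_times[-1]:.0f} s")   # roughly 10000/30 ~ 333 s

In other words, all 10,000 requests are offered over roughly 333 seconds regardless of how quickly the server drains them.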
I don't know what happened. I suspected the calculation logic of the benchmark_serving script, so I switched to 0.8.5's benchmark_serving.py, but it still showed the same overall regression.
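As a cross-check that does not depend on benchmark_serving.py at all, TTFT for a single request can be timed by hand against the same /v1/chat/completions endpoint. A minimal sketch, assuming the openai Python client and the server/model name from the serve command above; note that an idle-server probe like this only measures baseline TTFT, not TTFT under the benchmark's load:

import time
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="/data/others/*****/qwen3-moe",   # same name passed to `vllm serve`
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=32,
    stream=True,
)
ttft_ms = None
for chunk in stream:
    # the first chunk that carries generated text marks the time to first token
    if chunk.choices and chunk.choices[0].delta.content:
        ttft_ms = (time.perf_counter() - start) * 1000
        break
print(f"single-request TTFT: {ttft_ms:.1f} ms" if ttft_ms is not None else "no content received")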
I want to know what I missed and why the performance regressed.
Thanks
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
Sun Jun 22 23:53:40 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... On | 00000000:0B:00.0 Off | 0 |
| N/A 36C P0 51W / 300W | 693MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... On | 00000000:14:00.0 Off | 0 |
| N/A 39C P0 59W / 300W | 693MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
INFO 06-22 23:18:58 [__init__.py:244] Automatically detected platform cuda.
INFO 06-22 23:19:05 [api_server.py:1287] vLLM API server version 0.9.2.dev197+gec0db6f51
INFO 06-22 23:19:05 [cli_args.py:309] non-default args: {'model': '/data/others/adeltoosi/qwen3-moe', 'max_model_len': 5000, 'enforce_eager': True, 'data_parallel_size': 2, 'enable_expert_parallel': True, 'gpu_memory_utilization': 0.95, 'disable_log_requests': True}
INFO 06-22 23:19:13 [config.py:831] This model supports multiple tasks: {'embed', 'score', 'generate', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 06-22 23:19:13 [config.py:1444] Using max model len 5000
INFO 06-22 23:19:13 [config.py:2188] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 06-22 23:19:13 [cuda.py:102] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 06-22 23:19:13 [serve.py:230] Started DP Coordinator process (PID: 245161)
INFO 06-22 23:19:13 [utils.py:212] Started 2 API server processes
INFO 06-22 23:19:18 [__init__.py:244] Automatically detected platform cuda.
INFO 06-22 23:19:18 [__init__.py:244] Automatically detected platform cuda.
INFO 06-22 23:19:18 [__init__.py:244] Automatically detected platform cuda.
INFO 06-22 23:19:18 [__init__.py:244] Automatically detected platform cuda.
INFO 06-22 23:19:19 [__init__.py:244] Automatically detected platform cuda.
(EngineCore_1 pid=245166) INFO 06-22 23:19:23 [core.py:459] Waiting for init message from front-end.
(EngineCore_0 pid=245165) INFO 06-22 23:19:23 [core.py:459] Waiting for init message from front-end.
(EngineCore_1 pid=245166) INFO 06-22 23:19:23 [core.py:69] Initializing a V1 LLM engine (v0.9.2.dev197+gec0db6f51) with config: model='/data/others/adeltoosi/qwen3-moe', speculative_config=None, tokenizer='/data/others/adeltoosi/qwen3-moe', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=5000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data/others/adeltoosi/qwen3-moe, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
(EngineCore_0 pid=245165) INFO 06-22 23:19:23 [core.py:69] Initializing a V1 LLM engine (v0.9.2.dev197+gec0db6f51) with config: model='/data/others/adeltoosi/qwen3-moe', speculative_config=None, tokenizer='/data/others/adeltoosi/qwen3-moe', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=5000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data/others/adeltoosi/qwen3-moe, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
(EngineCore_1 pid=245166) WARNING 06-22 23:19:24 [utils.py:2756] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x155428988a40>
(EngineCore_0 pid=245165) WARNING 06-22 23:19:24 [utils.py:2756] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x155428466360>
(EngineCore_1 pid=245166) INFO 06-22 23:19:24 [parallel_state.py:934] Adjusting world_size=2 rank=1 distributed_init_method=tcp://127.0.0.1:53788 for DP
(EngineCore_0 pid=245165) INFO 06-22 23:19:24 [parallel_state.py:934] Adjusting world_size=2 rank=0 distributed_init_method=tcp://127.0.0.1:53788 for DP
(EngineCore_0 pid=245165) INFO 06-22 23:19:25 [utils.py:1136] Found nccl from library libnccl.so.2
(EngineCore_1 pid=245166) INFO 06-22 23:19:25 [utils.py:1136] Found nccl from library libnccl.so.2
(EngineCore_0 pid=245165) INFO 06-22 23:19:25 [pynccl.py:70] vLLM is using nccl==2.26.2
(EngineCore_1 pid=245166) INFO 06-22 23:19:25 [pynccl.py:70] vLLM is using nccl==2.26.2
(EngineCore_0 pid=245165) INFO 06-22 23:19:25 [cuda_communicator.py:65] Using naive all2all manager.
(EngineCore_1 pid=245166) INFO 06-22 23:19:25 [cuda_communicator.py:65] Using naive all2all manager.
(EngineCore_0 pid=245165) INFO 06-22 23:19:25 [parallel_state.py:1072] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_1 pid=245166) INFO 06-22 23:19:25 [parallel_state.py:1072] rank 1 in world size 2 is assigned as DP rank 1, PP rank 0, TP rank 0, EP rank 1
(EngineCore_0 pid=245165) WARNING 06-22 23:19:25 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_1 pid=245166) WARNING 06-22 23:19:25 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_0 pid=245165) INFO 06-22 23:19:25 [gpu_model_runner.py:1696] Starting to load model /data/others/adeltoosi/qwen3-moe...
(EngineCore_1 pid=245166) INFO 06-22 23:19:25 [gpu_model_runner.py:1696] Starting to load model /data/others/adeltoosi/qwen3-moe...
(EngineCore_0 pid=245165) INFO 06-22 23:19:25 [gpu_model_runner.py:1701] Loading model from scratch...
(EngineCore_1 pid=245166) INFO 06-22 23:19:25 [gpu_model_runner.py:1701] Loading model from scratch...
(EngineCore_0 pid=245165) INFO 06-22 23:19:25 [cuda.py:270] Using Flash Attention backend on V1 engine.
(EngineCore_1 pid=245166) INFO 06-22 23:19:25 [cuda.py:270] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/16 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 6% Completed | 1/16 [00:00<00:06, 2.25it/s]
Loading safetensors checkpoint shards: 12% Completed | 2/16 [00:00<00:06, 2.03it/s]
Loading safetensors checkpoint shards: 19% Completed | 3/16 [00:01<00:04, 3.03it/s]
Loading safetensors checkpoint shards: 25% Completed | 4/16 [00:01<00:04, 2.53it/s]
Loading safetensors checkpoint shards: 31% Completed | 5/16 [00:02<00:04, 2.33it/s]
Loading safetensors checkpoint shards: 38% Completed | 6/16 [00:02<00:04, 2.16it/s]
Loading safetensors checkpoint shards: 44% Completed | 7/16 [00:03<00:04, 2.16it/s]
Loading safetensors checkpoint shards: 50% Completed | 8/16 [00:03<00:03, 2.20it/s]
Loading safetensors checkpoint shards: 56% Completed | 9/16 [00:03<00:03, 2.16it/s]
Loading safetensors checkpoint shards: 62% Completed | 10/16 [00:04<00:02, 2.11it/s]
Loading safetensors checkpoint shards: 69% Completed | 11/16 [00:04<00:02, 2.13it/s]
Loading safetensors checkpoint shards: 75% Completed | 12/16 [00:05<00:01, 2.16it/s]
Loading safetensors checkpoint shards: 81% Completed | 13/16 [00:05<00:01, 2.15it/s]
Loading safetensors checkpoint shards: 88% Completed | 14/16 [00:06<00:00, 2.11it/s]
Loading safetensors checkpoint shards: 94% Completed | 15/16 [00:06<00:00, 2.05it/s]
(ApiServer_1 pid=245168) INFO 06-22 23:19:33 [config.py:831] This model supports multiple tasks: {'classify', 'embed', 'score', 'reward', 'generate'}. Defaulting to 'generate'.
(ApiServer_1 pid=245168) INFO 06-22 23:19:33 [config.py:1444] Using max model len 5000
(ApiServer_1 pid=245168) INFO 06-22 23:19:33 [config.py:2188] Chunked prefill is enabled with max_num_batched_tokens=2048.
(ApiServer_1 pid=245168) WARNING 06-22 23:19:33 [cuda.py:102] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
Loading safetensors checkpoint shards: 100% Completed | 16/16 [00:07<00:00, 1.98it/s]
Loading safetensors checkpoint shards: 100% Completed | 16/16 [00:07<00:00, 2.15it/s]
(EngineCore_0 pid=245165)
(EngineCore_0 pid=245165) INFO 06-22 23:19:33 [default_loader.py:272] Loading weights took 7.48 seconds
(ApiServer_0 pid=245167) INFO 06-22 23:19:33 [config.py:831] This model supports multiple tasks: {'embed', 'reward', 'classify', 'generate', 'score'}. Defaulting to 'generate'.
(ApiServer_0 pid=245167) INFO 06-22 23:19:33 [config.py:1444] Using max model len 5000
(ApiServer_0 pid=245167) INFO 06-22 23:19:33 [config.py:2188] Chunked prefill is enabled with max_num_batched_tokens=2048.
(ApiServer_0 pid=245167) WARNING 06-22 23:19:33 [cuda.py:102] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
(EngineCore_1 pid=245166) INFO 06-22 23:19:33 [default_loader.py:272] Loading weights took 7.82 seconds
(EngineCore_0 pid=245165) INFO 06-22 23:19:34 [gpu_model_runner.py:1725] Model loading took 29.8815 GiB and 7.751315 seconds
(EngineCore_1 pid=245166) INFO 06-22 23:19:34 [gpu_model_runner.py:1725] Model loading took 29.8815 GiB and 8.105841 seconds
(EngineCore_1 pid=245166) WARNING 06-22 23:19:35 [fused_moe.py:683] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/adeltoosi/vllm/vllm/model_executor/layers/fused_moe/configs/E=64,N=768,device_name=NVIDIA_A100_80GB_PCIe.json
(EngineCore_0 pid=245165) WARNING 06-22 23:19:35 [fused_moe.py:683] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/adeltoosi/vllm/vllm/model_executor/layers/fused_moe/configs/E=64,N=768,device_name=NVIDIA_A100_80GB_PCIe.json
(EngineCore_0 pid=245165) INFO 06-22 23:19:36 [gpu_worker.py:232] Available KV cache memory: 42.80 GiB
(EngineCore_1 pid=245166) INFO 06-22 23:19:36 [gpu_worker.py:232] Available KV cache memory: 42.80 GiB
(EngineCore_0 pid=245165) INFO 06-22 23:19:36 [kv_cache_utils.py:716] GPU KV cache size: 467,520 tokens
(EngineCore_0 pid=245165) INFO 06-22 23:19:36 [kv_cache_utils.py:720] Maximum concurrency for 5,000 tokens per request: 93.35x
(EngineCore_1 pid=245166) INFO 06-22 23:19:36 [kv_cache_utils.py:716] GPU KV cache size: 467,520 tokens
(EngineCore_1 pid=245166) INFO 06-22 23:19:36 [kv_cache_utils.py:720] Maximum concurrency for 5,000 tokens per request: 93.35x
(EngineCore_0 pid=245165) WARNING 06-22 23:19:36 [utils.py:101] Unable to detect current VLLM config. Defaulting to NHD kv cache layout.
(EngineCore_1 pid=245166) WARNING 06-22 23:19:36 [utils.py:101] Unable to detect current VLLM config. Defaulting to NHD kv cache layout.
(EngineCore_0 pid=245165) INFO 06-22 23:19:36 [core.py:172] init engine (profile, create kv cache, warmup model) took 2.55 seconds
(EngineCore_1 pid=245166) INFO 06-22 23:19:36 [core.py:172] init engine (profile, create kv cache, warmup model) took 2.21 seconds
INFO 06-22 23:19:37 [utils.py:569] Waiting for API servers to complete ...
(ApiServer_1 pid=245168) WARNING 06-22 23:19:37 [config.py:1371] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [serving_chat.py:118] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(ApiServer_0 pid=245167) WARNING 06-22 23:19:37 [config.py:1371] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [serving_chat.py:118] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [serving_completion.py:66] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [api_server.py:1349] Starting vLLM API server 1 on http://0.0.0.0:8000
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:29] Available routes are:
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /openapi.json, Methods: GET, HEAD
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /docs, Methods: GET, HEAD
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /redoc, Methods: GET, HEAD
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /health, Methods: GET
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /load, Methods: GET
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /ping, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /ping, Methods: GET
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /tokenize, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /detokenize, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/models, Methods: GET
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /version, Methods: GET
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/completions, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/embeddings, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /pooling, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /classify, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /score, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/score, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /rerank, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/rerank, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /v2/rerank, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /invocations, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /metrics, Methods: GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [serving_completion.py:66] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [api_server.py:1349] Starting vLLM API server 0 on http://0.0.0.0:8000
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:29] Available routes are:
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /docs, Methods: HEAD, GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /health, Methods: GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /load, Methods: GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /ping, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /ping, Methods: GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /tokenize, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /detokenize, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/models, Methods: GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /version, Methods: GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/completions, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/embeddings, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /pooling, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /classify, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /score, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/score, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /rerank, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/rerank, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /v2/rerank, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /invocations, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /metrics, Methods: GET
(ApiServer_0 pid=245167) INFO: Started server process [245167]
(ApiServer_1 pid=245168) INFO: Started server process [245168]
(ApiServer_0 pid=245167) INFO: Waiting for application startup.
(ApiServer_1 pid=245168) INFO: Waiting for application startup.
(ApiServer_1 pid=245168) INFO: Application startup complete.
(ApiServer_0 pid=245167) INFO: Application startup complete.