I want to use the DP multi-API-server feature, so I upgraded to the latest version (0.9.2). I then ran a benchmark and found that overall performance has decreased. I want to know why this is happening. Am I missing any settings or features?
Before:
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 30.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|██████████| 10000/10000 [07:31<00:00, 22.16it/s]
============ Serving Benchmark Result ============
Successful requests: 10000
Benchmark duration (s): 451.28
Total input tokens: 2205227
Total generated tokens: 2011692
Request throughput (req/s): 22.16
Output token throughput (tok/s): 4457.76
Total Token throughput (tok/s): 9344.38
---------------Time to First Token----------------
Mean TTFT (ms): 987.38
Median TTFT (ms): 538.83
P99 TTFT (ms): 8046.79
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 262.05
Median TPOT (ms): 267.88
P99 TPOT (ms): 427.91
---------------Inter-token Latency----------------
Mean ITL (ms): 245.41
Median ITL (ms): 220.48
P99 ITL (ms): 678.37
==================================================
After:
============ Serving Benchmark Result ============
Successful requests: 10000
Benchmark duration (s): 538.73
Total input tokens: 2205227
Total generated tokens: 2014470
Request throughput (req/s): 18.56
Output token throughput (tok/s): 3739.29
Total Token throughput (tok/s): 7832.67
---------------Time to First Token----------------
Mean TTFT (ms): 55239.91
Median TTFT (ms): 55251.84
P99 TTFT (ms): 124103.15
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 118.82
Median TPOT (ms): 120.96
P99 TPOT (ms): 138.40
---------------Inter-token Latency----------------
Mean ITL (ms): 117.63
Median ITL (ms): 116.30
P99 ITL (ms): 185.65
==================================================
I ran the same test on the same machine and GPUs (2× A100); overall performance was worse.
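For reference, the headline throughput figures in both reports can be re-derived from the reported request counts, durations, and token totals (to within rounding), so the gap is in wall-clock duration rather than in how the metrics are computed. A quick check in plain Python, with the values copied from the two reports above:

# Re-derive the headline numbers from the two benchmark reports above.
runs = {
    "before": {"requests": 10_000, "duration_s": 451.28, "output_tokens": 2_011_692},
    "after":  {"requests": 10_000, "duration_s": 538.73, "output_tokens": 2_014_470},
}
for name, r in runs.items():
    req_tput = r["requests"] / r["duration_s"]        # "Request throughput (req/s)"
    out_tput = r["output_tokens"] / r["duration_s"]   # "Output token throughput (tok/s)"
    print(f"{name}: {req_tput:.2f} req/s, {out_tput:.2f} output tok/s")
# before: 22.16 req/s, 4457.75 output tok/s
# after: 18.56 req/s, 3739.29 output tok/s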
Test Command:
MODEL_NAME="/data/others/*****/qwen3-moe"
BACKEND="openai-chat"
DATASET_NAME="sharegpt"
DATASET_PATH="/home/*****/dataset/ShareGPT/ShareGPT_V3_unfiltered_cleaned_split.json"
NUM_PROMPTS=10000
VLLM_LOG="vllm_server_moe.log"
BENCH_LOG="benchmark_moe.log"
nohup vllm serve ${MODEL_NAME} \
--disable-log-requests \
--tensor-parallel-size 1 \
--max-model-len 5000 \
--data-parallel-size 2 \
--gpu_memory_utilization 0.95 \
--enable_expert_parallel \
--api-server-count=2 \
--enforce-eager \
> ${VLLM_LOG} 2>&1 &
(Performance does not change with or without --api-server-count; TTFT stays around 55,000 ms either way.)
python3 ~/vllm/benchmarks/benchmark_serving.py \
--backend ${BACKEND} \
--model ${MODEL_NAME} \
--random-input-len 1500 \
--random-output-len 2500 \
--random-range-ratio 0.3 \
--endpoint /v1/chat/completions \
--dataset-name ${DATASET_NAME} \
--dataset-path ${DATASET_PATH} \
--num-prompts ${NUM_PROMPTS} \
--request-rate 30 \
> ${BENCH_LOG} 2>&1
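For context on how the load is generated: with --request-rate 30 and the default burstiness of 1.0, the benchmark dispatches requests with exponentially distributed inter-arrival gaps (the "Poisson process" line in the results header above). A rough sketch of that schedule, assuming numpy and not claiming to match benchmark_serving.py line for line:

# Sketch of a Poisson arrival schedule at 30 req/s (burstiness 1.0).
import numpy as np

request_rate = 30.0                      # mean offered load, requests per second
num_prompts = 10_000
rng = np.random.default_rng(0)
gaps = rng.exponential(scale=1.0 / request_rate, size=num_prompts)
send_times = np.cumsum(gaps)             # timestamps (s) at which requests are fired
print(f"last request dispatched around {send_times[-1]:.0f} s")   # roughly 10000/30 ~ 333 s

In other words, all 10,000 requests are offered over roughly 333 seconds regardless of how quickly the server drains them.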
I don't know what happened. I suspected the calculation logic of the benchmark_serving script, so I switched to 0.8.5's benchmark_serving.py, but it still showed the same overall regression.
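As a cross-check that does not depend on benchmark_serving.py at all, TTFT for a single request can be timed by hand against the same /v1/chat/completions endpoint. A minimal sketch, assuming the openai Python client and the server/model name from the serve command above; note that an idle-server probe like this only measures baseline TTFT, not TTFT under the benchmark's load:

import time
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="/data/others/*****/qwen3-moe",   # same name passed to `vllm serve`
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=32,
    stream=True,
)
ttft_ms = None
for chunk in stream:
    # the first chunk that carries generated text marks the time to first token
    if chunk.choices and chunk.choices[0].delta.content:
        ttft_ms = (time.perf_counter() - start) * 1000
        break
print(f"single-request TTFT: {ttft_ms:.1f} ms" if ttft_ms is not None else "no content received")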
I want to know what I missed and why the performance regressed.
Thanks
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
Sun Jun 22 23:53:40 2025
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... On | 00000000:0B:00.0 Off | 0 |
| N/A 36C P0 51W / 300W | 693MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... On | 00000000:14:00.0 Off | 0 |
| N/A 39C P0 59W / 300W | 693MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
INFO 06-22 23:18:58 [__init__.py:244] Automatically detected platform cuda.
INFO 06-22 23:19:05 [api_server.py:1287] vLLM API server version 0.9.2.dev197+gec0db6f51
INFO 06-22 23:19:05 [cli_args.py:309] non-default args: {'model': '/data/others/adeltoosi/qwen3-moe', 'max_model_len': 5000, 'enforce_eager': True, 'data_parallel_size': 2, 'enable_expert_parallel': True, 'gpu_memory_utilization': 0.95, 'disable_log_requests': True}
INFO 06-22 23:19:13 [config.py:831] This model supports multiple tasks: {'embed', 'score', 'generate', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 06-22 23:19:13 [config.py:1444] Using max model len 5000
INFO 06-22 23:19:13 [config.py:2188] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 06-22 23:19:13 [cuda.py:102] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 06-22 23:19:13 [serve.py:230] Started DP Coordinator process (PID: 245161)
INFO 06-22 23:19:13 [utils.py:212] Started 2 API server processes
INFO 06-22 23:19:18 [__init__.py:244] Automatically detected platform cuda.
INFO 06-22 23:19:18 [__init__.py:244] Automatically detected platform cuda.
INFO 06-22 23:19:18 [__init__.py:244] Automatically detected platform cuda.
INFO 06-22 23:19:18 [__init__.py:244] Automatically detected platform cuda.
INFO 06-22 23:19:19 [__init__.py:244] Automatically detected platform cuda.
(EngineCore_1 pid=245166) INFO 06-22 23:19:23 [core.py:459] Waiting for init message from front-end.
(EngineCore_0 pid=245165) INFO 06-22 23:19:23 [core.py:459] Waiting for init message from front-end.
(EngineCore_1 pid=245166) INFO 06-22 23:19:23 [core.py:69] Initializing a V1 LLM engine (v0.9.2.dev197+gec0db6f51) with config: model='/data/others/adeltoosi/qwen3-moe', speculative_config=None, tokenizer='/data/others/adeltoosi/qwen3-moe', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=5000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data/others/adeltoosi/qwen3-moe, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
(EngineCore_0 pid=245165) INFO 06-22 23:19:23 [core.py:69] Initializing a V1 LLM engine (v0.9.2.dev197+gec0db6f51) with config: model='/data/others/adeltoosi/qwen3-moe', speculative_config=None, tokenizer='/data/others/adeltoosi/qwen3-moe', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=5000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data/others/adeltoosi/qwen3-moe, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
(EngineCore_1 pid=245166) WARNING 06-22 23:19:24 [utils.py:2756] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x155428988a40>
(EngineCore_0 pid=245165) WARNING 06-22 23:19:24 [utils.py:2756] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x155428466360>
(EngineCore_1 pid=245166) INFO 06-22 23:19:24 [parallel_state.py:934] Adjusting world_size=2 rank=1 distributed_init_method=tcp://127.0.0.1:53788 for DP
(EngineCore_0 pid=245165) INFO 06-22 23:19:24 [parallel_state.py:934] Adjusting world_size=2 rank=0 distributed_init_method=tcp://127.0.0.1:53788 for DP
(EngineCore_0 pid=245165) INFO 06-22 23:19:25 [utils.py:1136] Found nccl from library libnccl.so.2
(EngineCore_1 pid=245166) INFO 06-22 23:19:25 [utils.py:1136] Found nccl from library libnccl.so.2
(EngineCore_0 pid=245165) INFO 06-22 23:19:25 [pynccl.py:70] vLLM is using nccl==2.26.2
(EngineCore_1 pid=245166) INFO 06-22 23:19:25 [pynccl.py:70] vLLM is using nccl==2.26.2
(EngineCore_0 pid=245165) INFO 06-22 23:19:25 [cuda_communicator.py:65] Using naive all2all manager.
(EngineCore_1 pid=245166) INFO 06-22 23:19:25 [cuda_communicator.py:65] Using naive all2all manager.
(EngineCore_0 pid=245165) INFO 06-22 23:19:25 [parallel_state.py:1072] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_1 pid=245166) INFO 06-22 23:19:25 [parallel_state.py:1072] rank 1 in world size 2 is assigned as DP rank 1, PP rank 0, TP rank 0, EP rank 1
(EngineCore_0 pid=245165) WARNING 06-22 23:19:25 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_1 pid=245166) WARNING 06-22 23:19:25 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_0 pid=245165) INFO 06-22 23:19:25 [gpu_model_runner.py:1696] Starting to load model /data/others/adeltoosi/qwen3-moe...
(EngineCore_1 pid=245166) INFO 06-22 23:19:25 [gpu_model_runner.py:1696] Starting to load model /data/others/adeltoosi/qwen3-moe...
(EngineCore_0 pid=245165) INFO 06-22 23:19:25 [gpu_model_runner.py:1701] Loading model from scratch...
(EngineCore_1 pid=245166) INFO 06-22 23:19:25 [gpu_model_runner.py:1701] Loading model from scratch...
(EngineCore_0 pid=245165) INFO 06-22 23:19:25 [cuda.py:270] Using Flash Attention backend on V1 engine.
(EngineCore_1 pid=245166) INFO 06-22 23:19:25 [cuda.py:270] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 0% Completed | 0/16 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 6% Completed | 1/16 [00:00<00:06, 2.25it/s]
Loading safetensors checkpoint shards: 12% Completed | 2/16 [00:00<00:06, 2.03it/s]
Loading safetensors checkpoint shards: 19% Completed | 3/16 [00:01<00:04, 3.03it/s]
Loading safetensors checkpoint shards: 25% Completed | 4/16 [00:01<00:04, 2.53it/s]
Loading safetensors checkpoint shards: 31% Completed | 5/16 [00:02<00:04, 2.33it/s]
Loading safetensors checkpoint shards: 38% Completed | 6/16 [00:02<00:04, 2.16it/s]
Loading safetensors checkpoint shards: 44% Completed | 7/16 [00:03<00:04, 2.16it/s]
Loading safetensors checkpoint shards: 50% Completed | 8/16 [00:03<00:03, 2.20it/s]
Loading safetensors checkpoint shards: 56% Completed | 9/16 [00:03<00:03, 2.16it/s]
Loading safetensors checkpoint shards: 62% Completed | 10/16 [00:04<00:02, 2.11it/s]
Loading safetensors checkpoint shards: 69% Completed | 11/16 [00:04<00:02, 2.13it/s]
Loading safetensors checkpoint shards: 75% Completed | 12/16 [00:05<00:01, 2.16it/s]
Loading safetensors checkpoint shards: 81% Completed | 13/16 [00:05<00:01, 2.15it/s]
Loading safetensors checkpoint shards: 88% Completed | 14/16 [00:06<00:00, 2.11it/s]
Loading safetensors checkpoint shards: 94% Completed | 15/16 [00:06<00:00, 2.05it/s]
(ApiServer_1 pid=245168) INFO 06-22 23:19:33 [config.py:831] This model supports multiple tasks: {'classify', 'embed', 'score', 'reward', 'generate'}. Defaulting to 'generate'.
(ApiServer_1 pid=245168) INFO 06-22 23:19:33 [config.py:1444] Using max model len 5000
(ApiServer_1 pid=245168) INFO 06-22 23:19:33 [config.py:2188] Chunked prefill is enabled with max_num_batched_tokens=2048.
(ApiServer_1 pid=245168) WARNING 06-22 23:19:33 [cuda.py:102] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
Loading safetensors checkpoint shards: 100% Completed | 16/16 [00:07<00:00, 1.98it/s]
Loading safetensors checkpoint shards: 100% Completed | 16/16 [00:07<00:00, 2.15it/s]
(EngineCore_0 pid=245165)
(EngineCore_0 pid=245165) INFO 06-22 23:19:33 [default_loader.py:272] Loading weights took 7.48 seconds
(ApiServer_0 pid=245167) INFO 06-22 23:19:33 [config.py:831] This model supports multiple tasks: {'embed', 'reward', 'classify', 'generate', 'score'}. Defaulting to 'generate'.
(ApiServer_0 pid=245167) INFO 06-22 23:19:33 [config.py:1444] Using max model len 5000
(ApiServer_0 pid=245167) INFO 06-22 23:19:33 [config.py:2188] Chunked prefill is enabled with max_num_batched_tokens=2048.
(ApiServer_0 pid=245167) WARNING 06-22 23:19:33 [cuda.py:102] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
(EngineCore_1 pid=245166) INFO 06-22 23:19:33 [default_loader.py:272] Loading weights took 7.82 seconds
(EngineCore_0 pid=245165) INFO 06-22 23:19:34 [gpu_model_runner.py:1725] Model loading took 29.8815 GiB and 7.751315 seconds
(EngineCore_1 pid=245166) INFO 06-22 23:19:34 [gpu_model_runner.py:1725] Model loading took 29.8815 GiB and 8.105841 seconds
(EngineCore_1 pid=245166) WARNING 06-22 23:19:35 [fused_moe.py:683] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/adeltoosi/vllm/vllm/model_executor/layers/fused_moe/configs/E=64,N=768,device_name=NVIDIA_A100_80GB_PCIe.json
(EngineCore_0 pid=245165) WARNING 06-22 23:19:35 [fused_moe.py:683] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/adeltoosi/vllm/vllm/model_executor/layers/fused_moe/configs/E=64,N=768,device_name=NVIDIA_A100_80GB_PCIe.json
(EngineCore_0 pid=245165) INFO 06-22 23:19:36 [gpu_worker.py:232] Available KV cache memory: 42.80 GiB
(EngineCore_1 pid=245166) INFO 06-22 23:19:36 [gpu_worker.py:232] Available KV cache memory: 42.80 GiB
(EngineCore_0 pid=245165) INFO 06-22 23:19:36 [kv_cache_utils.py:716] GPU KV cache size: 467,520 tokens
(EngineCore_0 pid=245165) INFO 06-22 23:19:36 [kv_cache_utils.py:720] Maximum concurrency for 5,000 tokens per request: 93.35x
(EngineCore_1 pid=245166) INFO 06-22 23:19:36 [kv_cache_utils.py:716] GPU KV cache size: 467,520 tokens
(EngineCore_1 pid=245166) INFO 06-22 23:19:36 [kv_cache_utils.py:720] Maximum concurrency for 5,000 tokens per request: 93.35x
(EngineCore_0 pid=245165) WARNING 06-22 23:19:36 [utils.py:101] Unable to detect current VLLM config. Defaulting to NHD kv cache layout.
(EngineCore_1 pid=245166) WARNING 06-22 23:19:36 [utils.py:101] Unable to detect current VLLM config. Defaulting to NHD kv cache layout.
(EngineCore_0 pid=245165) INFO 06-22 23:19:36 [core.py:172] init engine (profile, create kv cache, warmup model) took 2.55 seconds
(EngineCore_1 pid=245166) INFO 06-22 23:19:36 [core.py:172] init engine (profile, create kv cache, warmup model) took 2.21 seconds
INFO 06-22 23:19:37 [utils.py:569] Waiting for API servers to complete ...
(ApiServer_1 pid=245168) WARNING 06-22 23:19:37 [config.py:1371] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [serving_chat.py:118] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(ApiServer_0 pid=245167) WARNING 06-22 23:19:37 [config.py:1371] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [serving_chat.py:118] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [serving_completion.py:66] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [api_server.py:1349] Starting vLLM API server 1 on http://0.0.0.0:8000
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:29] Available routes are:
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /openapi.json, Methods: GET, HEAD
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /docs, Methods: GET, HEAD
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /redoc, Methods: GET, HEAD
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /health, Methods: GET
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /load, Methods: GET
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /ping, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /ping, Methods: GET
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /tokenize, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /detokenize, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/models, Methods: GET
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /version, Methods: GET
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/completions, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/embeddings, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /pooling, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /classify, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /score, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/score, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /rerank, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/rerank, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /v2/rerank, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /invocations, Methods: POST
(ApiServer_1 pid=245168) INFO 06-22 23:19:37 [launcher.py:37] Route: /metrics, Methods: GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [serving_completion.py:66] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [api_server.py:1349] Starting vLLM API server 0 on http://0.0.0.0:8000
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:29] Available routes are:
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /docs, Methods: HEAD, GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /redoc, Methods: HEAD, GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /health, Methods: GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /load, Methods: GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /ping, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /ping, Methods: GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /tokenize, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /detokenize, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/models, Methods: GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /version, Methods: GET
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/completions, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/embeddings, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /pooling, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /classify, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /score, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/score, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /rerank, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /v1/rerank, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /v2/rerank, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /invocations, Methods: POST
(ApiServer_0 pid=245167) INFO 06-22 23:19:37 [launcher.py:37] Route: /metrics, Methods: GET
(ApiServer_0 pid=245167) INFO: Started server process [245167]
(ApiServer_1 pid=245168) INFO: Started server process [245168]
(ApiServer_0 pid=245167) INFO: Waiting for application startup.
(ApiServer_1 pid=245168) INFO: Waiting for application startup.
(ApiServer_1 pid=245168) INFO: Application startup complete.
(ApiServer_0 pid=245167) INFO: Application startup complete.