Checklist
Describe the bug
I launched Qwen2.5-VL-7B-Instruct for inference with both the turbomind backend and the pytorch backend of lmdeploy, but the turbomind backend is roughly twice as slow, especially in TTFT. Is this because turbomind does not optimize the image processor and ViT parts, which is why the pytorch backend generates faster?
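A quick way to check whether the TTFT gap is dominated by image preprocessing / ViT encoding is to measure TTFT for a single streamed vision request at concurrency 1 against each backend. Below is a minimal sketch (not part of the original report) that does this through the api_server's OpenAI-compatible endpoint; the port matches the Reproduction section, while IMAGE_URL is a placeholder to replace with a real image.

# Minimal TTFT probe against a running lmdeploy api_server (OpenAI-compatible API).
# Assumptions: server from the Reproduction section is listening on localhost:23333,
# and IMAGE_URL points to any reachable test image.
import time
from openai import OpenAI

IMAGE_URL = "https://example.com/test.jpg"  # placeholder

client = OpenAI(base_url="http://0.0.0.0:23333/v1", api_key="none")
model = client.models.list().data[0].id

start = time.perf_counter()
stream = client.chat.completions.create(
    model=model,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image."},
            {"type": "image_url", "image_url": {"url": IMAGE_URL}},
        ],
    }],
    max_tokens=256,
    stream=True,
)
ttft = None
for chunk in stream:
    if ttft is None and chunk.choices and chunk.choices[0].delta.content:
        ttft = time.perf_counter() - start  # time until the first generated token arrives
print(f"TTFT: {ttft:.3f} s")

Running this once against each backend with the same image would show whether the gap persists without batching, which would point at vision preprocessing/encoding rather than scheduling.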
Reproduction
- Run lmdeploy-turbomind
lmdeploy serve api_server Qwen2.5-VL-7B-Instruct/ \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --backend turbomind \
    --tp 1 \
    --max-batch-size 32 \
    --cache-max-entry-count 0.6 \
    --session-len 25600
Result:
Maximum request concurrency: 4
100%|███████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:58<00:00, 1.84s/it]
============ Serving Benchmark Result ============
Successful requests:                 32
Benchmark duration (s):              58.81
Total input tokens:                  32768
Total generated tokens:              7072
Request throughput (req/s):          0.54
Output token throughput (tok/s):     120.26
Total Token throughput (tok/s):      677.48
---------------Time to First Token----------------
Mean TTFT (ms):                      5487.44
Median TTFT (ms):                    5751.33
P95 TTFT (ms):                       6553.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                      6.90
Median TPOT (ms):                    6.43
P95 TPOT (ms):                       8.75
----------------End-to-end Latency----------------
Mean E2EL (ms):                      7004.85
Median E2EL (ms):                    7198.83
P95 E2EL (ms):                       8322.10
==================================================
- Run lmdeploy-pytorch
lmdeploy serve api_server Qwen2.5-VL-7B-Instruct/ \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --backend pytorch \
    --tp 1 \
    --max-batch-size 32 \
    --cache-max-entry-count 0.6 \
    --session-len 25600
Result:
Maximum request concurrency: 4
100%|███████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:33<00:00, 1.04s/it]
============ Serving Benchmark Result ============
Successful requests:                 32
Benchmark duration (s):              33.23
Total input tokens:                  32768
Total generated tokens:              7040
Request throughput (req/s):          0.96
Output token throughput (tok/s):     211.83
Total Token throughput (tok/s):      1197.80
---------------Time to First Token----------------
Mean TTFT (ms):                      3792.02
Median TTFT (ms):                    4124.75
P95 TTFT (ms):                       4255.24
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                      1.58
Median TPOT (ms):                    0.00
P95 TPOT (ms):                       6.32
----------------End-to-end Latency----------------
Mean E2EL (ms):                      4138.02
Median E2EL (ms):                    4169.22
P95 E2EL (ms):                       4265.66
==================================================
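For reference, the ratios implied by the two result tables above (plain arithmetic on the reported numbers, no new measurements):

# Side-by-side comparison of the two benchmark runs reported above.
turbomind = {"mean_ttft_ms": 5487.44, "output_tok_s": 120.26, "mean_e2el_ms": 7004.85}
pytorch   = {"mean_ttft_ms": 3792.02, "output_tok_s": 211.83, "mean_e2el_ms": 4138.02}

for key in turbomind:
    ratio = turbomind[key] / pytorch[key]
    print(f"{key}: turbomind/pytorch = {ratio:.2f}x")
# mean_ttft_ms: ~1.45x, output_tok_s: ~0.57x, mean_e2el_ms: ~1.69x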
Environment
sys.platform: linux
Python: 3.10.12 (main, Nov 4 2025, 08:48:33) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0
PyTorch: 2.8.0+cu128
PyTorch compiling details: PyTorch built with:
- GCC 13.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.8
- NVCC architecture flags: -gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_100,code=sm_100;-gencode;arch=compute_120,code=sm_120
- CuDNN 91.0.2 (built against CUDA 12.9)
- Built with CuDNN 90.8
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, COMMIT_SHA=a1cb3cc05d46d198467bebbb6e8fba50a325d4e7, CUDA_VERSION=12.8, CUDNN_VERSION=9.8.0, CXX_COMPILER=/opt/rh/gcc-toolset-13/root/usr/bin/c++, CXX_FLAGS= -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-dangling-reference -Wno-error=dangling-reference -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, USE_XCCL=OFF, USE_XPU=OFF,
TorchVision: 0.23.0+cu128
LMDeploy: 0.11.0+
transformers: 4.57.3
fastapi: 0.123.5
pydantic: 2.12.5
triton: 3.4.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 NIC10 NIC11 NIC12 NIC13 NIC14 NIC15 NIC16 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS NODE 0-47,96-143 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE PIX NODE NODE SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE 0-47,96-143 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE 0-47,96-143 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS NODE 0-47,96-143 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS PIX NODE NODE NODE SYS 48-95,144-191 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE PIX NODE NODE SYS 48-95,144-191 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE PIX NODE SYS 48-95,144-191 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE PIX SYS 48-95,144-191 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS NODE
NIC1 NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE NODE SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE
NIC2 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE X NODE SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE
NIC3 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE X SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS NODE
NIC4 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS SYS SYS SYS PIX NODE NODE NODE SYS
NIC5 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE NODE SYS SYS SYS SYS NODE PIX NODE NODE SYS
NIC6 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE X NODE SYS SYS SYS SYS NODE NODE PIX NODE SYS
NIC7 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE X SYS SYS SYS SYS NODE NODE NODE PIX SYS
NIC8 PIX NODE NODE NODE SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS SYS SYS SYS NODE
NIC9 NODE PIX NODE NODE SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE NODE SYS SYS SYS SYS NODE
NIC10 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE X NODE SYS SYS SYS SYS NODE
NIC11 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE X SYS SYS SYS SYS NODE
NIC12 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS
NIC13 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE NODE SYS
NIC14 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE X NODE SYS
NIC15 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE X SYS
NIC16 NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_1
NIC1: mlx5_3
NIC2: mlx5_5
NIC3: mlx5_7
NIC4: mlx5_9
NIC5: mlx5_11
NIC6: mlx5_13
NIC7: mlx5_15
NIC8: mlx5_gdr_0
NIC9: mlx5_gdr_1
NIC10: mlx5_gdr_2
NIC11: mlx5_gdr_3
NIC12: mlx5_gdr_4
NIC13: mlx5_gdr_5
NIC14: mlx5_gdr_6
NIC15: mlx5_gdr_7
NIC16: mlx5_bond_0
Error traceback