I'm trying to serve Qwen2.5-3B-Instruct with tensor parallelism 4 using trtllm-serve, but the process hangs during startup:
trtllm-serve Qwen2.5-3B-Instruct --tp_size 4
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
2025-07-02 03:34:03,877 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
/home/shareduser/miniconda3/envs/mz-test-tensorrt-llm-llama/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.20.0
[07/02/2025-03:34:04] [TRT-LLM] [W] Overriding LlmArgs.max_input_len (annotation=int required=False default=1024 description='The maximum input length.') with build_config.max_input_len (1024).
[07/02/2025-03:34:04] [TRT-LLM] [W] Overriding LlmArgs.max_seq_len (annotation=Union[int, NoneType] required=False default=None description='The maximum sequence length.') with build_config.max_seq_len (None).
[07/02/2025-03:34:04] [TRT-LLM] [W] Overriding LlmArgs.max_beam_width (annotation=int required=False default=1 description='The maximum beam width.') with build_config.max_beam_width (1).
[07/02/2025-03:34:04] [TRT-LLM] [I] Compute capability: (8, 9)
[07/02/2025-03:34:04] [TRT-LLM] [I] SM count: 142
[07/02/2025-03:34:04] [TRT-LLM] [I] SM clock: 3105 MHz
[07/02/2025-03:34:04] [TRT-LLM] [I] int4 TFLOPS: 902
[07/02/2025-03:34:04] [TRT-LLM] [I] int8 TFLOPS: 451
[07/02/2025-03:34:04] [TRT-LLM] [I] fp8 TFLOPS: 451
[07/02/2025-03:34:04] [TRT-LLM] [I] float16 TFLOPS: 225
[07/02/2025-03:34:04] [TRT-LLM] [I] bfloat16 TFLOPS: 225
[07/02/2025-03:34:04] [TRT-LLM] [I] float32 TFLOPS: 112
[07/02/2025-03:34:04] [TRT-LLM] [I] Total Memory: 47 GiB
[07/02/2025-03:34:04] [TRT-LLM] [I] Memory clock: 10001 MHz
[07/02/2025-03:34:04] [TRT-LLM] [I] Memory bus width: 384
[07/02/2025-03:34:04] [TRT-LLM] [I] Memory bandwidth: 960 GB/s
[07/02/2025-03:34:04] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[07/02/2025-03:34:04] [TRT-LLM] [I] PCIe link width: 16
[07/02/2025-03:34:04] [TRT-LLM] [I] PCIe bandwidth: 5 GB/s
[07/02/2025-03:34:04] [TRT-LLM] [I] Specified dtype 'auto'; inferred dtype 'bfloat16'.
[07/02/2025-03:34:04] [TRT-LLM] [W] Implicitly setting QWenConfig.seq_length = 8192
[07/02/2025-03:34:04] [TRT-LLM] [W] Implicitly setting QWenConfig.qwen_type = qwen2
[07/02/2025-03:34:04] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_intermediate_size = 0
[07/02/2025-03:34:04] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_shared_expert_intermediate_size = 0
[07/02/2025-03:34:04] [TRT-LLM] [W] Implicitly setting QWenConfig.tie_word_embeddings = True
[07/02/2025-03:34:04] [TRT-LLM] [I] Set nccl_plugin to None.
[07/02/2025-03:34:04] [TRT-LLM] [I] start MpiSession with 4 workers
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
Authorization required, but no authorization protocol specified
<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
Multiple distributions found for package optimum. Picked distribution: optimum
Multiple distributions found for package optimum. Picked distribution: optimum
Multiple distributions found for package optimum. Picked distribution: optimum
Multiple distributions found for package optimum. Picked distribution: optimum
2025-07-02 03:34:10,672 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-07-02 03:34:10,702 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
/home/shareduser/miniconda3/envs/mz-test-tensorrt-llm-llama/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
2025-07-02 03:34:10,714 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-07-02 03:34:10,729 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
/home/shareduser/miniconda3/envs/mz-test-tensorrt-llm-llama/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
/home/shareduser/miniconda3/envs/mz-test-tensorrt-llm-llama/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
/home/shareduser/miniconda3/envs/mz-test-tensorrt-llm-llama/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.20.0
[TensorRT-LLM] TensorRT-LLM version: 0.20.0
[TensorRT-LLM] TensorRT-LLM version: 0.20.0
[TensorRT-LLM] TensorRT-LLM version: 0.20.0
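I tried different models, and the process always hangs at the same point: no further output appears after the four "TensorRT-LLM version: 0.20.0" lines that follow "start MpiSession with 4 workers". Yesterday it was working fine.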
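To confirm that the hang is reproducible and always stops at the same line, the command can be run under a timeout and the last line of output recorded. The following is only a minimal sketch using the Python standard library; `run_with_timeout` is a hypothetical helper written for this report, not part of TensorRT-LLM, and the 120-second limit in the usage comment is an arbitrary assumption.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (timed_out, last non-empty line of combined output).

    Hypothetical helper for this report; not a TensorRT-LLM API.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr so warnings stay in order
            text=True,
            timeout=timeout_s,
        )
        output, timed_out = proc.stdout, False
    except subprocess.TimeoutExpired as exc:
        # On timeout the partial output may come back as bytes or be None.
        output = exc.stdout or ""
        if isinstance(output, bytes):
            output = output.decode(errors="replace")
        timed_out = True
    lines = [line for line in output.splitlines() if line.strip()]
    return timed_out, (lines[-1] if lines else "")


# Example usage (assumed 120 s limit; adjust as needed):
# timed_out, last = run_with_timeout(
#     ["trtllm-serve", "Qwen2.5-3B-Instruct", "--tp_size", "4"], 120)
# print("hung" if timed_out else "exited", "| last line:", last)
```

Note that on timeout only the parent process is killed; any MPI worker processes it spawned may still need to be cleaned up manually.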
My system information: