TensorRT-LLM gets stuck with trtllm-serve #5677

@mpaulazamin

My system information:

[Image: screenshot of system information, not recoverable as text]

I'm trying to serve Qwen2.5-3B-Instruct with trtllm-serve and a tensor-parallel size of 4, but the process hangs during startup:

trtllm-serve Qwen2.5-3B-Instruct --tp_size 4
Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
2025-07-02 03:34:03,877 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
/home/shareduser/miniconda3/envs/mz-test-tensorrt-llm-llama/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.20.0
[07/02/2025-03:34:04] [TRT-LLM] [W] Overriding LlmArgs.max_input_len (annotation=int required=False default=1024 description='The maximum input length.') with build_config.max_input_len (1024).
[07/02/2025-03:34:04] [TRT-LLM] [W] Overriding LlmArgs.max_seq_len (annotation=Union[int, NoneType] required=False default=None description='The maximum sequence length.') with build_config.max_seq_len (None).
[07/02/2025-03:34:04] [TRT-LLM] [W] Overriding LlmArgs.max_beam_width (annotation=int required=False default=1 description='The maximum beam width.') with build_config.max_beam_width (1).
[07/02/2025-03:34:04] [TRT-LLM] [I] Compute capability: (8, 9)
[07/02/2025-03:34:04] [TRT-LLM] [I] SM count: 142
[07/02/2025-03:34:04] [TRT-LLM] [I] SM clock: 3105 MHz
[07/02/2025-03:34:04] [TRT-LLM] [I] int4 TFLOPS: 902
[07/02/2025-03:34:04] [TRT-LLM] [I] int8 TFLOPS: 451
[07/02/2025-03:34:04] [TRT-LLM] [I] fp8 TFLOPS: 451
[07/02/2025-03:34:04] [TRT-LLM] [I] float16 TFLOPS: 225
[07/02/2025-03:34:04] [TRT-LLM] [I] bfloat16 TFLOPS: 225
[07/02/2025-03:34:04] [TRT-LLM] [I] float32 TFLOPS: 112
[07/02/2025-03:34:04] [TRT-LLM] [I] Total Memory: 47 GiB
[07/02/2025-03:34:04] [TRT-LLM] [I] Memory clock: 10001 MHz
[07/02/2025-03:34:04] [TRT-LLM] [I] Memory bus width: 384
[07/02/2025-03:34:04] [TRT-LLM] [I] Memory bandwidth: 960 GB/s
[07/02/2025-03:34:04] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[07/02/2025-03:34:04] [TRT-LLM] [I] PCIe link width: 16
[07/02/2025-03:34:04] [TRT-LLM] [I] PCIe bandwidth: 5 GB/s
[07/02/2025-03:34:04] [TRT-LLM] [I] Specified dtype 'auto'; inferred dtype 'bfloat16'.
[07/02/2025-03:34:04] [TRT-LLM] [W] Implicitly setting QWenConfig.seq_length = 8192
[07/02/2025-03:34:04] [TRT-LLM] [W] Implicitly setting QWenConfig.qwen_type = qwen2
[07/02/2025-03:34:04] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_intermediate_size = 0
[07/02/2025-03:34:04] [TRT-LLM] [W] Implicitly setting QWenConfig.moe_shared_expert_intermediate_size = 0
[07/02/2025-03:34:04] [TRT-LLM] [W] Implicitly setting QWenConfig.tie_word_embeddings = True
[07/02/2025-03:34:04] [TRT-LLM] [I] Set nccl_plugin to None.
[07/02/2025-03:34:04] [TRT-LLM] [I] start MpiSession with 4 workers
Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

Authorization required, but no authorization protocol specified

<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cuda module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.driver module instead.
<frozen importlib._bootstrap_external>:1296: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
Multiple distributions found for package optimum. Picked distribution: optimum
Multiple distributions found for package optimum. Picked distribution: optimum
Multiple distributions found for package optimum. Picked distribution: optimum
Multiple distributions found for package optimum. Picked distribution: optimum
2025-07-02 03:34:10,672 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-07-02 03:34:10,702 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
/home/shareduser/miniconda3/envs/mz-test-tensorrt-llm-llama/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
2025-07-02 03:34:10,714 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
2025-07-02 03:34:10,729 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
/home/shareduser/miniconda3/envs/mz-test-tensorrt-llm-llama/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
/home/shareduser/miniconda3/envs/mz-test-tensorrt-llm-llama/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
/home/shareduser/miniconda3/envs/mz-test-tensorrt-llm-llama/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.20.0
[TensorRT-LLM] TensorRT-LLM version: 0.20.0
[TensorRT-LLM] TensorRT-LLM version: 0.20.0
[TensorRT-LLM] TensorRT-LLM version: 0.20.0

I tried different models, and startup always hangs at the same point, right after the four workers print the TensorRT-LLM version. It was working fine yesterday.
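Because the log above is dominated by warnings repeated once per MPI worker, collapsing duplicates can make it easier to see the last distinct message before the hang. The snippet below is only a rough sketch for reading the log: `collapse_repeats` is a hypothetical helper, not part of TensorRT-LLM, and it assumes the output was captured as a list of text lines.

```python
from collections import Counter


def collapse_repeats(lines):
    """Return (line, count) pairs in first-seen order, ignoring blank lines.

    Hypothetical helper for reading noisy multi-worker logs; it does not
    touch or depend on TensorRT-LLM itself.
    """
    counts = Counter()
    order = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip the empty lines between repeated messages
        if line not in counts:
            order.append(line)  # remember where each message first appeared
        counts[line] += 1
    return [(line, counts[line]) for line in order]


if __name__ == "__main__":
    # A few lines copied from the log above, used as sample input.
    sample = [
        "Authorization required, but no authorization protocol specified",
        "",
        "Authorization required, but no authorization protocol specified",
        "[TensorRT-LLM] TensorRT-LLM version: 0.20.0",
        "[TensorRT-LLM] TensorRT-LLM version: 0.20.0",
    ]
    for line, n in collapse_repeats(sample):
        print(f"{n}x {line}")
```

The last pair returned is the final distinct message emitted before output stopped, which in the log above is the per-worker version line.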
