Cannot serve modelopt quantized nvfp4 model on TensorRT LLM #187

@enisaras

Describe the bug

After quantizing the Llama-3.1-70B-Instruct model with the ModelOpt hf_ptq.py script and building a TensorRT-LLM engine from it, serving the engine fails with the following error:

[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 40794 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 5.49 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.54 GB GPU memory for decoder.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  Cannot determine size of FP4 data type (/code/tensorrt_llm/cpp/include/tensorrt_llm/common/dataType.h:40)
1       0x7f973fe01d33 /root/.local/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0xa1dd33) [0x7f973fe01d33]
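
The wording of the exception suggests that a per-element size lookup was asked about a 4-bit type, which has no whole-byte element size. The sketch below only illustrates that idea. It is not TensorRT-LLM code, and the table and function names are hypothetical.

```python
# Hypothetical illustration, NOT the TensorRT-LLM implementation.
# Bit width of one element for a few data types.
BITS_PER_ELEMENT = {"float32": 32, "bfloat16": 16, "fp8": 8, "fp4": 4}


def bytes_per_element(dtype: str) -> int:
    """Return the whole-byte size of one element; fail for sub-byte types."""
    bits = BITS_PER_ELEMENT[dtype]
    if bits % 8 != 0:
        # A 4-bit element occupies half a byte, so no integer answer exists.
        raise ValueError(f"Cannot determine size of {dtype} data type in whole bytes")
    return bits // 8


def buffer_bytes(dtype: str, num_elements: int) -> int:
    """Size of a packed buffer, computed in bits and rounded up to bytes."""
    total_bits = BITS_PER_ELEMENT[dtype] * num_elements
    return (total_bits + 7) // 8
```

Sizing a buffer from the total bit count, as in `buffer_bytes`, works for sub-byte types, while any code path that multiplies an element count by a byte size per element cannot handle them.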

Steps/Code to reproduce bug

  1. Quantize Llama-3.1-70B-Instruct using the hf_ptq.py helper script in the ModelOpt repository:
python hf_ptq.py --pyt_ckpt_path meta-llama/Llama-3.1-70B-Instruct --qformat nvfp4 --batch_size 32 --kv_cache_qformat nvfp4 
  2. Build the TensorRT-LLM engine using the trtllm-build command:
trtllm-build --checkpoint_dir exported_model/
  3. Serve the engine built in step 2:
trtllm-serve serve engine_outputs/ --tokenizer meta-llama/Llama-3.1-70B-Instruct --log_level debug --tp_size 2

Step 3 fails with the following logs:

2025-04-27 18:23:15,796 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.20.0rc1
[04/27/2025-18:23:16] [TRT-LLM] [W] Overriding LlmArgs.max_input_len (annotation=int required=False default=1024 description='The maximum input length.') with build_config.max_input_len (1024).
[04/27/2025-18:23:16] [TRT-LLM] [W] Overriding LlmArgs.max_seq_len (annotation=Union[int, NoneType] required=False default=None description='The maximum sequence length.') with build_config.max_seq_len (None).
[04/27/2025-18:23:16] [TRT-LLM] [W] Overriding LlmArgs.max_beam_width (annotation=int required=False default=1 description='The maximum beam width.') with build_config.max_beam_width (1).
[04/27/2025-18:23:16] [TRT-LLM] [I] Compute capability: (10, 0)
[04/27/2025-18:23:16] [TRT-LLM] [I] SM count: 148
[04/27/2025-18:23:16] [TRT-LLM] [I] SM clock: 1965 MHz
[04/27/2025-18:23:16] [TRT-LLM] [I] int4 TFLOPS: 0
[04/27/2025-18:23:16] [TRT-LLM] [I] int8 TFLOPS: 0
[04/27/2025-18:23:16] [TRT-LLM] [I] fp8 TFLOPS: 0
[04/27/2025-18:23:16] [TRT-LLM] [I] float16 TFLOPS: 0
[04/27/2025-18:23:16] [TRT-LLM] [I] bfloat16 TFLOPS: 0
[04/27/2025-18:23:16] [TRT-LLM] [I] float32 TFLOPS: 0
[04/27/2025-18:23:16] [TRT-LLM] [I] Total Memory: 179 GiB
[04/27/2025-18:23:16] [TRT-LLM] [I] Memory clock: 3996 MHz
[04/27/2025-18:23:16] [TRT-LLM] [I] Memory bus width: 7680
[04/27/2025-18:23:16] [TRT-LLM] [I] Memory bandwidth: 7672 GB/s
[04/27/2025-18:23:16] [TRT-LLM] [I] NVLink is active: True
[04/27/2025-18:23:16] [TRT-LLM] [I] NVLink version: 4
[04/27/2025-18:23:16] [TRT-LLM] [I] NVLink bandwidth: 450 GB/s
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.fc_after_embed = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.use_input_layernorm_in_first_layer = True
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.use_last_layernorm = True
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.layer_idx_offset = 0
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.has_partial_lora_mask = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.producer = {'name': 'modelopt', 'version': '0.27.1'}
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.share_embedding_table = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.bias = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rotary_pct = 1.0
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rank = 0
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.decoder = llama
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rmsnorm = True
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.lm_head_bias = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.tie_word_embeddings = False
[04/27/2025-18:23:16] [TRT-LLM] [I] Set dtype to bfloat16.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set gemm_plugin to nvfp4.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set explicitly_disable_gemm_plugin to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set qserve_gemm_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set identity_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set nccl_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set lora_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set dora_plugin to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set smooth_quant_plugins to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set quantize_per_token_plugin to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set quantize_tensor_plugin to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set moe_plugin to auto.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set gemm_allreduce_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set context_fmha to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set remove_input_padding to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set norm_quant_fusion to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set reduce_fusion to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set user_buffer to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set tokens_per_block to 32.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set use_paged_context_fmha to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set fuse_fp4_quant to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set multiple_profiles to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set paged_state to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set streamingllm to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set manage_weights to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set use_fused_mlp to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[04/27/2025-18:23:16] [TRT-LLM] [W] The build_config is ignored for model format of TLLM_ENGINE.
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.fc_after_embed = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.use_input_layernorm_in_first_layer = True
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.use_last_layernorm = True
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.layer_idx_offset = 0
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.has_partial_lora_mask = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.producer = {'name': 'modelopt', 'version': '0.27.1'}
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.share_embedding_table = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.bias = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rotary_pct = 1.0
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rank = 0
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.decoder = llama
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.rmsnorm = True
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.lm_head_bias = False
[04/27/2025-18:23:16] [TRT-LLM] [W] Implicitly setting LLaMAConfig.tie_word_embeddings = False
[04/27/2025-18:23:16] [TRT-LLM] [I] Set dtype to bfloat16.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set gemm_plugin to nvfp4.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set explicitly_disable_gemm_plugin to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set qserve_gemm_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set identity_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set nccl_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set lora_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set dora_plugin to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set weight_only_quant_matmul_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set smooth_quant_plugins to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set smooth_quant_gemm_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set layernorm_quantization_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set rmsnorm_quantization_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set quantize_per_token_plugin to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set quantize_tensor_plugin to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set moe_plugin to auto.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set low_latency_gemm_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set low_latency_gemm_swiglu_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set gemm_allreduce_plugin to None.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set context_fmha to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set paged_kv_cache to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set remove_input_padding to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set norm_quant_fusion to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set reduce_fusion to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set user_buffer to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set tokens_per_block to 32.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set use_paged_context_fmha to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set fuse_fp4_quant to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set multiple_profiles to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set paged_state to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set streamingllm to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set manage_weights to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set use_fused_mlp to True.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set pp_reduce_scatter to False.
[04/27/2025-18:23:16] [TRT-LLM] [I] Set nccl_plugin to None.
rank 0 using MpiPoolSession to spawn MPI processes
[04/27/2025-18:23:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_queue
[04/27/2025-18:23:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_request_error_queue
[04/27/2025-18:23:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_result_queue
[04/27/2025-18:23:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_stats_queue
[04/27/2025-18:23:17] [TRT-LLM] [I] Generating a new HMAC key for server proxy_kv_cache_events_queue
2025-04-27 18:23:23,215 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 0.20.0rc1
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Engine version 0.20.0rc1 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 2048
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 131072
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (131072) * 80
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 40815 MiB
[TensorRT-LLM][INFO] Engine load time 9934 ms
[TensorRT-LLM][INFO] Inspecting the engine to identify potential runtime issues...
[TensorRT-LLM][INFO] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1885.04 MiB for execution context memory.
[TensorRT-LLM][INFO] gatherContextLogits: 0
[TensorRT-LLM][INFO] gatherGenerationLogits: 0
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 40794 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 5.49 GB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.54 GB GPU memory for decoder.
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  Cannot determine size of FP4 data type (/code/tensorrt_llm/cpp/include/tensorrt_llm/common/dataType.h:40)
1       0x7f973fe01d33 /root/.local/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0xa1dd33) [0x7f973fe01d33]
2       0x7f9740c7fdf3 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 3635
3       0x7f9740d691ce tensorrt_llm::batch_manager::TrtGptModelFactory::create(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 654
4       0x7f9740d4dc39 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 185
5       0x7f9740d4ece5 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional<std::filesystem::__cxx11::path> const&, std::optional<std::basic_string_view<unsigned char, std::char_traits<unsigned char> > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tensorrt_llm::executor::Tensor, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tensorrt_llm::executor::Tensor> > > > const&) + 1173
6       0x7f9740d55dfa tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::__cxx11::path const&, std::optional<std::filesystem::__cxx11::path> const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 2474
7       0x7f9740d3a477 tensorrt_llm::executor::Executor::Executor(std::filesystem::__cxx11::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 87
8       0x7f976d2136e0 /root/.local/lib/python3.12/site-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x13f6e0) [0x7f976d2136e0]
9       0x7f976d190af3 /root/.local/lib/python3.12/site-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0xbcaf3) [0x7f976d190af3]
10            0x58208f /usr/bin/python() [0x58208f]
11            0x549185 _PyObject_MakeTpCall + 117
12            0x54cea7 /usr/bin/python() [0x54cea7]
13            0x59e231 /usr/bin/python() [0x59e231]
14            0x599b63 /usr/bin/python() [0x599b63]
15      0x7f976d18df4d /root/.local/lib/python3.12/site-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0xb9f4d) [0x7f976d18df4d]
16            0x549185 _PyObject_MakeTpCall + 117
17            0x5d73c9 _PyEval_EvalFrameDefault + 2697
18            0x54aa9a _PyObject_Call_Prepend + 394
19            0x59e09f /usr/bin/python() [0x59e09f]
20            0x599b63 /usr/bin/python() [0x599b63]
21            0x54924e _PyObject_MakeTpCall + 318
22            0x5d73c9 _PyEval_EvalFrameDefault + 2697
23            0x5d58eb PyEval_EvalCode + 347
24            0x5d347c /usr/bin/python() [0x5d347c]
25            0x581f0d /usr/bin/python() [0x581f0d]
26            0x549b85 PyObject_Vectorcall + 53
27            0x5d73c9 _PyEval_EvalFrameDefault + 2697
28            0x6bcce2 /usr/bin/python() [0x6bcce2]
29            0x6bc912 Py_RunMain + 562
30            0x6bc57d Py_BytesMain + 45
31      0x7f9af03bb1ca /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f9af03bb1ca]
32      0x7f9af03bb28b __libc_start_main + 139
33            0x657ce5 _start + 37
[b200enis-devel:785754] *** Process received signal ***
[b200enis-devel:785754] Signal: Aborted (6)
[b200enis-devel:785754] Signal code:  (-6)
[b200enis-devel:785754] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x7f9af03d6330]
[b200enis-devel:785754] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x11c)[0x7f9af042fb2c]
[b200enis-devel:785754] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x1e)[0x7f9af03d627e]
[b200enis-devel:785754] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xdf)[0x7f9af03b98ff]
[b200enis-devel:785754] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa5ff5)[0x7f987fdc1ff5]
[b200enis-devel:785754] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbb0da)[0x7f987fdd70da]
[b200enis-devel:785754] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_call_terminate+0x33)[0x7f987fdc18e6]
[b200enis-devel:785754] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x31a)[0x7f987fdd68ba]
[b200enis-devel:785754] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x22b06)[0x7f98b805eb06]
[b200enis-devel:785754] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311)[0x7f98b805f1f1]
[b200enis-devel:785754] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x44)[0x7f987fdd7384]
[b200enis-devel:785754] [11] /root/.local/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0xa1dcec)[0x7f973fe01cec]
[b200enis-devel:785754] [12] /root/.local/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatchingC1ESt10shared_ptrIN8nvinfer17ILoggerEERKNS_7runtime11ModelConfigERKNS6_11WorldConfigERKNS6_9RawEngineEbRKNS0_25TrtGptModelOptionalParamsE+0xe33)[0x7f9740c7fdf3]
[b200enis-devel:785754] [13] /root/.local/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager18TrtGptModelFactory6createERKNS_7runtime9RawEngineERKNS2_11ModelConfigERKNS2_11WorldConfigENS0_15TrtGptModelTypeERKNS0_25TrtGptModelOptionalParamsE+0x28e)[0x7f9740d691ce]
[b200enis-devel:785754] [14] /root/.local/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl11createModelERKNS_7runtime9RawEngineERKNS3_11ModelConfigERKNS3_11WorldConfigERKNS0_14ExecutorConfigE+0xb9)[0x7f9740d4dc39]
[b200enis-devel:785754] [15] /root/.local/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl9loadModelERKSt8optionalINSt10filesystem7__cxx114pathEERKS3_ISt17basic_string_viewIhSt11char_traitsIhEEERKNS_7runtime13GptJsonConfigERKNS0_14ExecutorConfigEbRKS3_ISt3mapINSt7__cxx1112basic_stringIcSB_IcESaIcEEENS0_6TensorESt4lessIST_ESaISt4pairIKST_SU_EEEE+0x495)[0x7f9740d4ece5]
[b200enis-devel:785754] [16] /root/.local/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4ImplC1ERKNSt10filesystem7__cxx114pathERKSt8optionalIS5_ENS0_9ModelTypeERKNS0_14ExecutorConfigE+0x9aa)[0x7f9740d55dfa]
[b200enis-devel:785754] [17] /root/.local/lib/python3.12/site-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8ExecutorC1ERKNSt10filesystem7__cxx114pathENS0_9ModelTypeERKNS0_14ExecutorConfigE+0x57)[0x7f9740d3a477]
[b200enis-devel:785754] [18] /root/.local/lib/python3.12/site-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x13f6e0)[0x7f976d2136e0]
[b200enis-devel:785754] [19] /root/.local/lib/python3.12/site-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0xbcaf3)[0x7f976d190af3]
[b200enis-devel:785754] [20] /usr/bin/python[0x58208f]
[b200enis-devel:785754] [21] /usr/bin/python(_PyObject_MakeTpCall+0x75)[0x549185]
[b200enis-devel:785754] [22] /usr/bin/python[0x54cea7]
[b200enis-devel:785754] [23] /usr/bin/python[0x59e231]
[b200enis-devel:785754] [24] /usr/bin/python[0x599b63]
[b200enis-devel:785754] [25] /root/.local/lib/python3.12/site-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0xb9f4d)[0x7f976d18df4d]
[b200enis-devel:785754] [26] /usr/bin/python(_PyObject_MakeTpCall+0x75)[0x549185]
[b200enis-devel:785754] [27] /usr/bin/python(_PyEval_EvalFrameDefault+0xa89)[0x5d73c9]
[b200enis-devel:785754] [28] /usr/bin/python(_PyObject_Call_Prepend+0x18a)[0x54aa9a]
[b200enis-devel:785754] [29] /usr/bin/python[0x59e09f]
[b200enis-devel:785754] *** End of error message ***
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
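
Because the relevant line is buried in a long log, a small helper can pull out the first C++ exception message and its source location. This is a minimal sketch that assumes the `what():  <message> (<file>:<line>)` layout seen in the log above; the function name is hypothetical.

```python
import re
from typing import Optional

# Matches lines such as:
#   what():  Cannot determine size of FP4 data type (/code/.../dataType.h:40)
WHAT_RE = re.compile(
    r"what\(\):\s+(?P<msg>.*?)\s+\((?P<file>[^():]+):(?P<line>\d+)\)\s*$"
)


def first_exception(log_text: str) -> Optional[dict]:
    """Return message, file, and line of the first 'what():' entry, if any."""
    for line in log_text.splitlines():
        match = WHAT_RE.search(line)
        if match:
            return {
                "message": match["msg"],
                "file": match["file"],
                "line": int(match["line"]),
            }
    return None
```

For the log above, this would report the FP4 size message together with the `dataType.h` location, which is the part most useful to quote when searching for related issues.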

Expected behavior

The engine starts up successfully and is able to process inference requests.

System information

  • Container used (if applicable): TensorRT LLM container built from source using instructions here
  • OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 24.04.1 LTS
  • CPU architecture (x86_64, aarch64): x86_64
  • GPU name (e.g. H100, A100, L40S): NVIDIA B200
  • GPU memory size: 179.1 GB
  • Number of GPUs: 8
  • Library versions (if applicable):
    • Python: 3.12.3
    • ModelOpt version or commit hash: 0.27.1
    • CUDA: 12.8
    • PyTorch: 2.7.0a0+7c8ec84dab.nv25.03
    • Transformers: 4.51.3
    • TensorRT-LLM: 0.20.0rc1
    • ONNXRuntime: ?
    • TensorRT: 10.9.0.34
  • Any other details that may help: I quantized both the weights and the KV cache to NVFP4. Support for this combination may be missing from TensorRT-LLM, but I am not entirely sure.
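
To confirm which quantization settings actually ended up in the exported checkpoint, its configuration can be inspected before building. This is a minimal sketch that assumes the export directory contains a `config.json` with a `quantization` section holding `quant_algo` and `kv_cache_quant_algo` keys; that layout is an assumption here and should be checked against the files in `exported_model/`. The function name is hypothetical.

```python
import json
from pathlib import Path


def read_quant_settings(checkpoint_dir: str) -> dict:
    """Read weight and KV cache quantization settings from config.json.

    Assumes a 'quantization' section with 'quant_algo' and
    'kv_cache_quant_algo' keys; missing keys are reported as None.
    """
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text())
    quant = config.get("quantization", {})
    return {
        "quant_algo": quant.get("quant_algo"),
        "kv_cache_quant_algo": quant.get("kv_cache_quant_algo"),
    }


if __name__ == "__main__":
    # Directory name taken from the build step in this report.
    print(read_quant_settings("exported_model/"))
```

If the KV cache entry reports an FP4 format, that would be consistent with the suspicion that the quantized KV cache, rather than the quantized weights, triggers the failure.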
