Description
System Info
System Information:
- OS:
- Python version:
- CUDA version:
- GPU model(s): H800
- Driver version:
- TensorRT-LLM version: 0.21
How would you like to use TensorRT-LLM
Problem Description
I am attempting to implement a special computation for the Qwen2-VL model, which requires dynamically passing a custom attention_mask during inference.
After reviewing the code, I found that the forward method in the Python-side model definition (tensorrt_llm/models/qwen/model.py#L177) does have an attention_mask parameter. However, the request interface of TensorRT-LLM's C++ Runtime (tensorrt_llm::executor::Executor) does not directly expose an attention_mask field.
As an alternative, I found the cross_attention_mask field (e.g., in tensorrt_llm::executor::Request::input). I have successfully modified the Qwen2-VL model definition (model.py) so that its forward method receives and processes the cross_attention_mask parameter.
However, when I create a request in C++ and pass cross_attention_mask, the model does not appear to receive this input, or I may not be passing it correctly.
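To make the question concrete, here is a minimal sketch of what I am trying to do on the C++ side. The setCrossAttentionMask setter name, the kBOOL data type, and the [prompt_len, prompt_len] mask layout are my assumptions rather than confirmed usage, and the engine path and token IDs are placeholders.

```cpp
// Minimal sketch: enqueue a request with a custom cross-attention mask.
// Assumptions (please correct me if wrong): Request exposes a
// setCrossAttentionMask(Tensor) setter, the mask is a boolean tensor of shape
// [prompt_len, prompt_len], and true means "attend".
#include "tensorrt_llm/executor/executor.h"

#include <vector>

namespace tle = tensorrt_llm::executor;

int main()
{
    tle::ExecutorConfig config(/* maxBeamWidth */ 1);
    tle::Executor executor("/path/to/qwen2_vl_engine", tle::ModelType::kDECODER_ONLY, config);

    // Placeholder prompt tokens.
    tle::VecTokens inputTokens{100, 200, 300, 400};
    tle::Request request(inputTokens, /* maxTokens */ 32);

    // Build the mask on the host. Shape and dtype are assumptions; this is
    // exactly what I would like to have clarified.
    auto const promptLen = static_cast<tle::SizeType32>(inputTokens.size());
    auto mask = tle::Tensor::cpu(tle::DataType::kBOOL, {promptLen, promptLen});
    auto* maskData = static_cast<bool*>(mask.getData());
    for (tle::SizeType32 i = 0; i < promptLen * promptLen; ++i)
    {
        maskData[i] = true; // fill in the custom attention pattern here
    }

    // Hypothetical setter; only valid if the installed version exposes it.
    request.setCrossAttentionMask(mask);

    auto const requestId = executor.enqueueRequest(request);
    auto const responses = executor.awaitResponses(requestId);
    return 0;
}
```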
My questions are:
1. Is the cross_attention_mask input officially supported by the C++ Runtime, or is it an internal implementation detail?
2. If it is supported, what are its expected tensor shape and data type (e.g., [batch_size, seq_len] vs. [batch_size, 1, seq_len, seq_len]; int32 vs. bool)?
3. How does passing cross_attention_mask affect the model's behavior, and are there any side effects to be aware of?
4. How should cross_attention_mask be used with the C++ Runtime? Are any additional settings required to enable it?
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.