Description
System Info
System Information:
- OS:
- Python version:
- CUDA version:
- GPU model(s): H800
- Driver version:
- TensorRT-LLM version: 0.21
How would you like to use TensorRT-LLM
Problem Description
I am attempting to implement a special computation for the Qwen2-VL model, which requires dynamically passing a custom attention_mask during inference.
After reviewing the code, I found that the forward method in the Python-side model definition (tensorrt_llm/models/qwen/model.py#L177) does have an attention_mask parameter. However, the request interface of TensorRT-LLM's C++ Runtime (tensorrt_llm::executor::Executor) does not directly expose an attention_mask field.
As an alternative, I found the cross_attention_mask field (e.g., in tensorrt_llm::executor::Request::input). I have successfully modified the Qwen2-VL model definition (model.py) so that its forward method receives and processes the cross_attention_mask parameter.
However, when I create a request in C++ and pass cross_attention_mask, the model does not appear to receive this input, or I may not be passing it correctly.
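To make the question concrete, here is a minimal sketch of what I am trying to do on the C++ side. The setCrossAttentionMask setter name, the kBOOL data type, and the [prompt_len, prompt_len] mask layout are my assumptions rather than confirmed usage, and the engine path and token IDs are placeholders.

```cpp
// Minimal sketch: enqueue a request with a custom cross-attention mask.
// Assumptions (please correct me if wrong): Request exposes a
// setCrossAttentionMask(Tensor) setter, the mask is a boolean tensor of shape
// [prompt_len, prompt_len], and true means "attend".
#include "tensorrt_llm/executor/executor.h"

#include <vector>

namespace tle = tensorrt_llm::executor;

int main()
{
    tle::ExecutorConfig config(/* maxBeamWidth */ 1);
    tle::Executor executor("/path/to/qwen2_vl_engine", tle::ModelType::kDECODER_ONLY, config);

    // Placeholder prompt tokens.
    tle::VecTokens inputTokens{100, 200, 300, 400};
    tle::Request request(inputTokens, /* maxTokens */ 32);

    // Build the mask on the host. Shape and dtype are assumptions; this is
    // exactly what I would like to have clarified.
    auto const promptLen = static_cast<tle::SizeType32>(inputTokens.size());
    auto mask = tle::Tensor::cpu(tle::DataType::kBOOL, {promptLen, promptLen});
    auto* maskData = static_cast<bool*>(mask.getData());
    for (tle::SizeType32 i = 0; i < promptLen * promptLen; ++i)
    {
        maskData[i] = true; // fill in the custom attention pattern here
    }

    // Hypothetical setter; only valid if the installed version exposes it.
    request.setCrossAttentionMask(mask);

    auto const requestId = executor.enqueueRequest(request);
    auto const responses = executor.awaitResponses(requestId);
    return 0;
}
```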
My questions are:
1. Is the cross_attention_mask input officially supported by the C++ Runtime, or is it an internal implementation detail?
2. If it is supported, what are its expected tensor shape and data type (e.g., [batch_size, seq_len] vs. [batch_size, 1, seq_len, seq_len]; int32 vs. bool)?
3. How does passing cross_attention_mask affect the model's behavior, and are there any side effects to be aware of?
4. How should cross_attention_mask be used with the C++ Runtime? Are any additional settings required to enable it?
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.