ERNIE-4.5-VL-28B model fails to load on NVIDIA RTX PRO 6000 Blackwell with CUDA errors 209 and 101 #3930

# GPU hardware information
*   **Product Name**: NVIDIA RTX PRO 6000 Blackwell Server Edition
*   **Product Brand**: NVIDIA
*   **Product Architecture**: Blackwell
*   **VRAM**: 96GB GDDR7

*   **Driver Version**: 580.82.07 (the 575 driver behaves the same)
*   **CUDA Version**: 12.9
*   **Paddle version**: paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu129

#### Compute capability (SM version) = SM120
The compute capability was queried with the following code:

```python
import torch

if torch.cuda.is_available():
    # (major, minor) compute capability of GPU 0
    capability = torch.cuda.get_device_capability(0)
    print(f"SM{capability[0]}{capability[1]}")  # prints SM120 on this GPU
```
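Since the rest of the stack is Paddle, the same check can be done without PyTorch. The following is a minimal sketch, assuming `paddle.device.cuda.get_device_capability()` is available in the installed Paddle build; `sm_label` and `query_capability_with_paddle` are hypothetical helper names introduced here for illustration.

```python
def sm_label(capability):
    """Format a (major, minor) compute capability as a string such as 'SM120'."""
    major, minor = capability
    return f"SM{major}{minor}"


def query_capability_with_paddle():
    """Return (major, minor) for the current GPU, or None if Paddle or a GPU is unavailable."""
    try:
        import paddle
    except ImportError:
        return None
    # A CPU-only build or a machine without visible GPUs cannot report a capability.
    if not paddle.device.is_compiled_with_cuda():
        return None
    if paddle.device.cuda.device_count() == 0:
        return None
    return paddle.device.cuda.get_device_capability()


capability = query_capability_with_paddle()
if capability is None:
    print("No CUDA device visible to Paddle")
else:
    print(sm_label(capability))
```

On the machine described in this report the expectation is that both frameworks agree on the capability, which would rule out a mismatch between the PyTorch and Paddle views of the device.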


#### fastdeploy-gpu-80_90 version 2.1.0 (the Nightly build has the same problem)

##### Environment verification: OK


```
import paddle

from paddle.jit.marker import unified

paddle.utils.run_check()

Running verify PaddlePaddle program ... 
I0906 02:12:41.045678 24555 pir_interpreter.cc:1524] New Executor is Running ...
W0906 02:12:41.046741 24555 gpu_resources.cc:114] Please NOTE: device: 0, GPU Compute Capability: 12.0, Driver API Version: 13.0, Runtime API Version: 12.9
I0906 02:12:41.047454 24555 pir_interpreter.cc:1547] pir interpreter is running by multi-thread mode ...
PaddlePaddle works well on 1 GPU.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.
from fastdeploy.model_executor.ops.gpu import beam_search_softmax

W0906 02:12:42.124501 24555 ir_context.cc:306] custom_op.static_op_save_output_topk_ op already registered.
W0906 02:12:42.124531 24555 custom_operator.cc:967] Operator (static_op_save_output_topk) has been registered.
W0906 02:12:42.124600 24555 ir_context.cc:306] custom_op.static_op_save_output_dynamic_ op already registered.
W0906 02:12:42.124605 24555 custom_operator.cc:967] Operator (static_op_save_output_dynamic) has been registered.
W0906 02:12:42.124691 24555 ir_context.cc:306] custom_op.static_op_save_output_ op already registered.
W0906 02:12:42.124696 24555 custom_operator.cc:967] Operator (static_op_save_output) has been registered.
W0906 02:12:42.124879 24555 ir_context.cc:306] custom_op.static_op_transfer_output op already registered.
W0906 02:12:42.124886 24555 custom_operator.cc:967] Operator (static_op_transfer_output) has been registered.
W0906 02:12:42.124979 24555 ir_context.cc:306] custom_op.static_op_get_output_topk_ op already registered.
W0906 02:12:42.124984 24555 custom_operator.cc:967] Operator (static_op_get_output_topk) has been registered.
W0906 02:12:42.125034 24555 ir_context.cc:306] custom_op.static_op_get_output_dynamic_ op already registered.
W0906 02:12:42.125039 24555 custom_operator.cc:967] Operator (static_op_get_output_dynamic) has been registered.
W0906 02:12:42.125229 24555 ir_context.cc:306] custom_op.static_op_rebuild_padding_cpu op already registered.
W0906 02:12:42.125234 24555 custom_operator.cc:967] Operator (static_op_rebuild_padding_cpu) has been registered.
W0906 02:12:42.126255 24555 ir_context.cc:306] custom_op.static_op_get_output_ op already registered.
W0906 02:12:42.126262 24555 custom_operator.cc:967] Operator (static_op_get_output) has been registered.
```

 
## Running fastdeploy

```bash
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
    --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
    --port 8180 \
    --metrics-port 8181 \
    --engine-worker-queue-port 8182 \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --max-num-seqs 128 \
    --limit-mm-per-prompt '{"image": 10, "video": 1}' \
    --reasoning-parser ernie-45-vl \
    --gpu-memory-utilization 0.9 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 384 \
    --enable-mm
```


An error occurs during model loading:

```
[2025-09-06 02:21:12,148] [    INFO] - Start load layer 27
[2025-09-06 02:21:15,817] [    INFO] - Model loading took 69.24437856674194 seconds
CUDA error 209 [/paddle/third_party/cccl/cub/cub/util_device.cuh, 83]: no kernel image is available for execution on the device
CUDA error 101 [/paddle/third_party/cccl/cub/cub/util_device.cuh, 102]: invalid device ordinal
CUDA error 209 [/paddle/third_party/cccl/cub/cub/util_device.cuh, 83]: no kernel image is available for execution on the device
```
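CUDA error 209 ("no kernel image is available for execution on the device") usually indicates that a binary was not compiled for the architecture of the running GPU. One possible reading, not confirmed in this report, is that the `fastdeploy-gpu-80_90` package only carries kernels for SM80 and SM90, while this card reports SM120. The sketch below illustrates that check. It assumes the numeric suffix of the package name lists the compiled SM targets; `parse_sm_suffix` and `is_listed` are hypothetical helpers, not part of FastDeploy.

```python
import re


def parse_sm_suffix(package_name):
    """Extract (major, minor) SM targets from a suffix such as '80_90'.

    Assumption: the trailing digit groups of the package name encode the
    compiled SM architectures, with the last digit being the minor version.
    """
    match = re.search(r"-(\d+(?:_\d+)*)$", package_name)
    if match is None:
        return []
    return [(int(tok[:-1]), int(tok[-1])) for tok in match.group(1).split("_")]


def is_listed(capability, targets):
    """Return True if the device capability appears among the listed targets."""
    return tuple(capability) in targets


targets = parse_sm_suffix("fastdeploy-gpu-80_90")
print(targets)                      # [(8, 0), (9, 0)]
print(is_listed((12, 0), targets))  # False
```

If that assumption holds, the device from this report is not among the listed targets, which would be consistent with the error 209 messages in the log above.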


Detailed log file:

[log.tar.gz](https://github.com/user-attachments/files/22184100/log.tar.gz)
