
FP16 speedup failure of TensorRT 8.2 when running custom multi conv plugin on GPU GTX 1650 #4714

@2282232637chq-netizen

Description


I've written a multi-branch convolution plugin for multi-task scenarios. The plugin accepts a task ID and selects the corresponding set of convolution weights to perform the convolution.
The plugin computes convolution + bias. To speed it up, I fused these into a single operator: convolution + bias + identity activation.
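For reference, the fused operator described above (convolution, then per-channel bias add, then an identity activation, which is a no-op) computes the following. This is only a plain-CPU sketch of the math the plugin is meant to implement, not the plugin's actual cuDNN code:

```python
# CPU reference of the fused op: 2D convolution (stride 1, no padding)
# + per-output-channel bias + identity activation (a no-op).
def conv_bias_identity(x, w, b):
    # x: [C_in][H][W], w: [C_out][C_in][kH][kW], b: [C_out]
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    c_out, kh, kw = len(w), len(w[0][0]), len(w[0][0][0])
    oh, ow = h - kh + 1, wd - kw + 1
    # Seed every output element with its channel's bias.
    y = [[[b[oc] for _ in range(ow)] for _ in range(oh)] for oc in range(c_out)]
    for oc in range(c_out):
        for i in range(oh):
            for j in range(ow):
                acc = 0.0
                for ic in range(c_in):
                    for u in range(kh):
                        for v in range(kw):
                            acc += x[ic][i + u][j + v] * w[oc][ic][u][v]
                y[oc][i][j] += acc  # identity activation: nothing more to do
    return y

# Tiny check: 1x2x2 input of ones, one 1x1 kernel of weight 2, bias 0.5
x = [[[1.0, 1.0], [1.0, 1.0]]]
w = [[[[2.0]]]]
b = [0.5]
print(conv_bias_identity(x, w, b))  # → [[[2.5, 2.5], [2.5, 2.5]]]
```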

I've found that my implementation struggles to match the speed of TensorRT's built-in convolution.
Specifically, I tested a YOLOv5s-v3.1 model on a GTX 1650: the unmodified model runs in 7.1 ms (FP16), but replacing all of its convolutions with my multi-branch convolution plugin increases the time to 12 ms.

Furthermore, I'm trying to support FP16 inside the plugin, and I noticed in the engine build logs that the plugin's FP16 path is slower than its FP32 path, so the builder chooses FP32 by default.

[03/09/2026-03:13:16] [TRT] [V] *************** Autotuning format combination: Float(102400,400,20,1) -> Float(102400,400,20,1) ***************
[MultiConvPlugin::clone] start
[MultiConvPlugin::initialize] start
[MultiConvPlugin::findAlgo] start
[MultiConvPlugin::findAlgo] selected algo = 1 workspace = 1819007
[MultiConvPlugin::initialize] end
[MultiConvPlugin::clone] end
[03/09/2026-03:13:16] [TRT] [V] --------------- Timing Runner: (Unnamed Layer* 205) [PluginV2Ext] (PluginV2)
[MultiConvPlugin::configurePlugin] start
[MultiConvPlugin] configurePlugin dtype = FP32
[MultiConvPlugin::initialize] start
[MultiConvPlugin::findAlgo] start
[MultiConvPlugin::findAlgo] selected algo = 6 workspace = 6555648
[MultiConvPlugin::initialize] end
[MultiConvPlugin::configurePlugin] start
[MultiConvPlugin::getWorkspaceSize] called!
[MultiConvPlugin::getWorkspaceSize] called!
[MultiConvPlugin::initialize] start
[MultiConvPlugin::findAlgo] start
[MultiConvPlugin::findAlgo] selected algo = 6 workspace = 6555648
[MultiConvPlugin::initialize] end
[03/09/2026-03:13:16] [TRT] [V] Tactic: 0 Time: 0.332544
[03/09/2026-03:13:16] [TRT] [V] Fastest Tactic: 0 Time: 0.332544
[03/09/2026-03:13:16] [TRT] [V] >>>>>>>>>>>>>>> Chose Runner Type: PluginV2 Tactic: 0
[03/09/2026-03:13:16] [TRT] [V] *************** Autotuning format combination: Half(102400,400,20,1) -> Half(102400,400,20,1) ***************
[MultiConvPlugin::clone] start
[MultiConvPlugin::initialize] start
[MultiConvPlugin::findAlgo] start
[MultiConvPlugin::findAlgo] selected algo = 1 workspace = 1819007
[MultiConvPlugin::initialize] end
[MultiConvPlugin::clone] end
[03/09/2026-03:13:16] [TRT] [V] --------------- Timing Runner: (Unnamed Layer* 205) [PluginV2Ext] (PluginV2)
[MultiConvPlugin::configurePlugin] start
[MultiConvPlugin] configurePlugin dtype = FP16
[MultiConvPlugin::initialize] start
[MultiConvPlugin::findAlgo] start
[MultiConvPlugin::findAlgo] selected algo = 1 workspace = 1819007
[MultiConvPlugin::initialize] end
[MultiConvPlugin::configurePlugin] start
[MultiConvPlugin::getWorkspaceSize] called!
[MultiConvPlugin::getWorkspaceSize] called!
[MultiConvPlugin::initialize] start
[MultiConvPlugin::findAlgo] start
[MultiConvPlugin::findAlgo] selected algo = 1 workspace = 1819007
[MultiConvPlugin::initialize] end
[03/09/2026-03:13:16] [TRT] [V] Tactic: 0 Time: 0.9864
[03/09/2026-03:13:16] [TRT] [V] Fastest Tactic: 0 Time: 0.9864
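Those two "Fastest Tactic" lines tell the whole story: the builder timed the plugin's Float format at 0.332544 and its Half format at 0.9864, roughly 3x slower, so it keeps FP32 for this layer. A small helper like the following (my own sketch, not a TensorRT tool) can pull the per-format timings out of a verbose build log to compare all layers at once:

```python
import re

# Pair each "Autotuning format combination" header with the
# "Fastest Tactic ... Time:" line that follows it in a verbose TRT log.
def autotune_times(log_text):
    fmt_re = re.compile(r"Autotuning format combination: (\w+)\(")
    time_re = re.compile(r"Fastest Tactic: \S+ Time: ([0-9.]+)")
    times, current = {}, None
    for line in log_text.splitlines():
        m = fmt_re.search(line)
        if m:
            current = m.group(1)  # e.g. "Float" or "Half"
            continue
        m = time_re.search(line)
        if m and current is not None:
            times[current] = float(m.group(1))
            current = None
    return times

log = """\
[TRT] [V] *************** Autotuning format combination: Float(102400,400,20,1) -> Float(102400,400,20,1) ***************
[TRT] [V] Fastest Tactic: 0 Time: 0.332544
[TRT] [V] *************** Autotuning format combination: Half(102400,400,20,1) -> Half(102400,400,20,1) ***************
[TRT] [V] Fastest Tactic: 0 Time: 0.9864
"""
print(autotune_times(log))  # → {'Float': 0.332544, 'Half': 0.9864}
```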

I would like to ask how to optimize this plugin to approach the speed of the built-in convolution, and why the plugin's FP16 path is slower than its FP32 path.

Environment

TensorRT Version: 8.2.1.8

NVIDIA GPU: GTX 1650

NVIDIA Driver Version: 550.54.15

CUDA Version: 10.2

CUDNN Version: 8.6

Operating System: Ubuntu 18.04.6

Python Version (if applicable): 3.6.9

Relevant Files

I have uploaded the source code:
multi_branch_conv_plugin.cpp

header file:
multi_conv.txt

wts2engine.py:

wts2engine_beta.py

command:

export PYTHONPATH=./:$PYTHONPATH
export CUDA_VISIBLE_DEVICES=0
python trt/wts2engine.py \
--modelType 'yolov5s' \
--CL \
--TOTAL_TASK 3 \
--CLASS_NUM "80,2,2,1,1" \
--INPUT_SIZE 640 \
--MODEL_CAPACITY "8,16,8,8,8" \
--WTS_FILE 'runs/result/cl-model-task2/weights/best.wts' \
--ENGINE_FILE 'runs/result/cl-model-task2/weights/best.engine'

Have you tried the latest release?: No; for reasons outside my control, I cannot upgrade to the latest version at the moment.

Labels: Module:Plugins (Issues when using TensorRT plugins)