Description
I've written a multi-branch convolution plugin for multi-task scenarios. The plugin accepts a task ID and selects the corresponding set of convolution weights to perform the convolution.
The plugin computes convolution + bias; to speed it up, I've fused these into a single operator: convolution, bias, and identity activation.
I've found that my implementation struggles to match the speed of the official convolution.
Specifically, I tested it on a GTX 1650 with a YOLOv5s-v3.1 model. The official model runs in 7.1 ms (FP16), but replacing ALL of the convolutions with my multi-branch convolution plugin increased the time to 12 ms.
Furthermore, I'm trying to support FP16 within the plugin, and I noticed in the engine build logs that FP16 is slower than FP32, so FP32 is chosen by default.
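To make the intended semantics concrete, here is a minimal CPU reference of what the plugin is described as doing: pick the weight set for the given task ID, then run convolution with a fused bias add and an identity activation. This is only an illustrative sketch; all names (`TaskBranch`, `multiBranchConv`) are hypothetical and the real plugin performs this on the GPU.

```cpp
#include <cstddef>
#include <vector>

// One weight set per task. Layout assumptions (illustrative):
// weights are [outC][inC][k][k] flattened, bias is [outC].
struct TaskBranch {
    std::vector<float> weights;
    std::vector<float> bias;
};

// 2D convolution (stride 1, no padding) + fused bias + identity activation,
// using the weight set selected by taskId.
std::vector<float> multiBranchConv(const std::vector<float>& input,
                                   int inC, int h, int w,
                                   const std::vector<TaskBranch>& branches,
                                   int taskId, int outC, int k) {
    const TaskBranch& b = branches.at(taskId);  // task-dependent weight selection
    int oh = h - k + 1, ow = w - k + 1;
    std::vector<float> out(static_cast<size_t>(outC) * oh * ow);
    for (int oc = 0; oc < outC; ++oc)
        for (int y = 0; y < oh; ++y)
            for (int x = 0; x < ow; ++x) {
                float acc = b.bias[oc];  // bias is fused into the accumulation
                for (int ic = 0; ic < inC; ++ic)
                    for (int ky = 0; ky < k; ++ky)
                        for (int kx = 0; kx < k; ++kx)
                            acc += input[(ic * h + y + ky) * w + x + kx] *
                                   b.weights[((oc * inC + ic) * k + ky) * k + kx];
                out[(oc * oh + y) * ow + x] = acc;  // identity activation: no-op
            }
    return out;
}
```

The GPU version presumably replaces the inner loops with a single cuDNN call per branch, which is why the log shows `findAlgo` choosing a cuDNN algorithm and workspace per data type.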
[03/09/2026-03:13:16] [TRT] [V] *************** Autotuning format combination: Float(102400,400,20,1) -> Float(102400,400,20,1) ***************
[MultiConvPlugin::clone] start
[MultiConvPlugin::initialize] start
[MultiConvPlugin::findAlgo] start
[MultiConvPlugin::findAlgo] selected algo = 1 workspace = 1819007
[MultiConvPlugin::initialize] end
[MultiConvPlugin::clone] end
[03/09/2026-03:13:16] [TRT] [V] --------------- Timing Runner: (Unnamed Layer* 205) [PluginV2Ext] (PluginV2)
[MultiConvPlugin::configurePlugin] start
[MultiConvPlugin] configurePlugin dtype = FP32
[MultiConvPlugin::initialize] start
[MultiConvPlugin::findAlgo] start
[MultiConvPlugin::findAlgo] selected algo = 6 workspace = 6555648
[MultiConvPlugin::initialize] end
[MultiConvPlugin::configurePlugin] start
[MultiConvPlugin::getWorkspaceSize] called!
[MultiConvPlugin::getWorkspaceSize] called!
[MultiConvPlugin::initialize] start
[MultiConvPlugin::findAlgo] start
[MultiConvPlugin::findAlgo] selected algo = 6 workspace = 6555648
[MultiConvPlugin::initialize] end
[03/09/2026-03:13:16] [TRT] [V] Tactic: 0 Time: 0.332544
[03/09/2026-03:13:16] [TRT] [V] Fastest Tactic: 0 Time: 0.332544
[03/09/2026-03:13:16] [TRT] [V] >>>>>>>>>>>>>>> Chose Runner Type: PluginV2 Tactic: 0
[03/09/2026-03:13:16] [TRT] [V] *************** Autotuning format combination: Half(102400,400,20,1) -> Half(102400,400,20,1) ***************
[MultiConvPlugin::clone] start
[MultiConvPlugin::initialize] start
[MultiConvPlugin::findAlgo] start
[MultiConvPlugin::findAlgo] selected algo = 1 workspace = 1819007
[MultiConvPlugin::initialize] end
[MultiConvPlugin::clone] end
[03/09/2026-03:13:16] [TRT] [V] --------------- Timing Runner: (Unnamed Layer* 205) [PluginV2Ext] (PluginV2)
[MultiConvPlugin::configurePlugin] start
[MultiConvPlugin] configurePlugin dtype = FP16
[MultiConvPlugin::initialize] start
[MultiConvPlugin::findAlgo] start
[MultiConvPlugin::findAlgo] selected algo = 1 workspace = 1819007
[MultiConvPlugin::initialize] end
[MultiConvPlugin::configurePlugin] start
[MultiConvPlugin::getWorkspaceSize] called!
[MultiConvPlugin::getWorkspaceSize] called!
[MultiConvPlugin::initialize] start
[MultiConvPlugin::findAlgo] start
[MultiConvPlugin::findAlgo] selected algo = 1 workspace = 1819007
[MultiConvPlugin::initialize] end
[03/09/2026-03:13:16] [TRT] [V] Tactic: 0 Time: 0.9864
[03/09/2026-03:13:16] [TRT] [V] Fastest Tactic: 0 Time: 0.9864
I would like to ask how to optimize this plugin to approach the official speed, and why FP16 is slower than FP32 here.
Environment
TensorRT Version: 8.2.1.8
NVIDIA GPU: GTX 1650
NVIDIA Driver Version: 550.54.15
CUDA Version: 10.2
CUDNN Version: 8.6
Operating System: Ubuntu 18.04.6
Python Version (if applicable): 3.6.9
Relevant Files
I have uploaded the source code:
multi_branch_conv_plugin.cpp
header file:
multi_conv.txt
wts2engine.py:
command:
export PYTHONPATH=./:$PYTHONPATH
export CUDA_VISIBLE_DEVICES=0
python trt/wts2engine.py \
--modelType 'yolov5s' \
--CL \
--TOTAL_TASK 3 \
--CLASS_NUM "80,2,2,1,1" \
--INPUT_SIZE 640 \
--MODEL_CAPACITY "8,16,8,8,8" \
--WTS_FILE 'runs/result/cl-model-task2/weights/best.wts' \
--ENGINE_FILE 'runs/result/cl-model-task2/weights/best.engine'
Have you tried the latest release?: No; for certain reasons I cannot try the latest version at the moment.