Update readme, add tensorrt yolov3-spp, yolov4 #5453

Merged 1 commit on May 16, 2020

Conversation

wang-xinyu

Hi @AlexeyAB , thanks for your remarkable work.

I have just implemented yolov4 in tensorrt today, and yolov3-spp weeks ago.

And I got the following speed test results on my machine:

Models Device BatchSize Mode Input Shape(HxW) FPS
YOLOv3-spp(darknet53) Xeon E5-2620/GTX1080 1 FP32 256x416 94
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 1 FP32 256x416 67

Could you merge this PR to add a link to my repo tensorrtx to your readme? It would be my pleasure :)))

regards,
xinyu

AlexeyAB (Owner) commented May 2, 2020

@wang-xinyu Hi,

  • Why do you use a non-square network resolution (width=416, height=256)?
  • Which resizing approach do you use: plain resizing or letter_box (resizing while keeping the aspect ratio)?
  • Is there a readme on how to run, for example, YOLOv4 (CSPDarknet53) with FP16 and batch=4 on a video file?

wang-xinyu (Author) commented May 2, 2020

@AlexeyAB

Hi,

  • I was using w=416, h=256 for video resolutions like 1920x1080. The input W and H are defined in yololayer.h; any value divisible by 32 is supported.

  • The resize approach is letter_box (resizing while keeping the aspect ratio, then padding), the same as the implementation in https://github.com/ultralytics/yolov3. A minimal sketch is shown after this list.

  • There is a readme on how to run yolov4: https://github.com/wang-xinyu/tensorrtx/tree/master/yolov4.

  • FP16/FP32 is selected by a macro defined in yolov4.cpp.

  • Currently it only supports batch size 1. I will implement multi-batch support in the coming days. :)
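
For illustration, a minimal letter_box sketch in C++/OpenCV could look like this (not the exact tensorrtx code; the function name and the gray fill value are just placeholders):

    // Resize while keeping the aspect ratio, then pad the remainder.
    #include <algorithm>
    #include <opencv2/opencv.hpp>

    cv::Mat letterbox(const cv::Mat& img, int input_w, int input_h) {
        float r = std::min(input_w / (float)img.cols, input_h / (float)img.rows);
        int new_w = (int)(img.cols * r);
        int new_h = (int)(img.rows * r);

        cv::Mat resized;
        cv::resize(img, resized, cv::Size(new_w, new_h));

        // Canvas at the network resolution; 128 is an arbitrary fill color here.
        cv::Mat out(input_h, input_w, CV_8UC3, cv::Scalar(128, 128, 128));
        resized.copyTo(out(cv::Rect((input_w - new_w) / 2, (input_h - new_h) / 2,
                                    new_w, new_h)));
        return out;
    }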

wang-xinyu (Author)

@AlexeyAB hello,

Update: yolov4 now supports multi-batch. I retested the speed with batch=1, 4 and 8.

Models Device BatchSize Mode Input Shape(HxW) FPS
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 1 FP32 256x416 59
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 4 FP32 256x416 74
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 8 FP32 256x416 83

The config details, including input shape, number of classes, FP16/FP32, batch size, etc., can be found here: https://github.com/wang-xinyu/tensorrtx/blob/master/yolov4/README.md#config.

AlexeyAB (Owner) commented May 3, 2020

@wang-xinyu

  1. Do you measure full-cycle FPS? Do you run pre-processing, inference, and post-processing asynchronously in 3 separate CPU threads?

  2. Can you check the FPS at 608x608 for batch=1, 4, 8?

  3. How many FPS do you get with 608x608 if you use:
    ./darknet detector demo cfg/coco.data cfg/yolov4.cfg yolov4.weights -dont_show

wang-xinyu (Author)

@AlexeyAB

The FPS tests above included inference and NMS, and did not use any multi-threading.

In the following, only the inference time is measured, excluding any pre- and post-processing.

I was using the following command and got AVG_FPS: 20.0. The demo.mp4 is 1920x1080, and the input shape is 608x608 in yolov4.cfg:

./darknet detector demo cfg/coco.data cfg/yolov4.cfg yolov4.weights ~/demo.mp4 -benchmark

I have a question: is this command using FP16 by default? My GPU is a GTX1080.

I retested the FPS for 608x608 in my tensorrt implementation.

Models Device BatchSize Mode Input Shape(HxW) FPS
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 1 FP16 608x608 23.3
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 4 FP16 608x608 23.8
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 8 FP16 608x608 24.1
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 1 FP32 608x608 23.3
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 4 FP32 608x608 24.0
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 8 FP32 608x608 24.0

There is no big difference between FP16 and FP32, and no big FPS gain compared to darknet.

I guess the mish layer prevents tensorrt from fusing layers, because mish is not natively supported in tensorrt.

I will try to optimize the mish implementation in the near future, and also try replacing mish with relu to see the FPS.

wang-xinyu (Author)

Hi @AlexeyAB

Update:

I modified the mish layer in my tensorrt implementation to use the same softplus, tanh and mish cuda kernels as your darknet implementation.

The main difference is that you are using expf(), while I was using exp().
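
For reference, a simplified mish kernel along these lines is sketched below: y = x * tanh(softplus(x)) with softplus(x) = ln(1 + e^x), computed with the single-precision expf()/logf()/tanhf(). This is only an illustration, not the exact darknet or tensorrtx kernel (which also guards softplus for numerical stability):

    __global__ void mish_kernel(const float* input, float* output, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= n) return;
        float x = input[idx];
        float sp = logf(1.0f + expf(x));   // softplus(x) = ln(1 + e^x)
        output[idx] = x * tanhf(sp);       // mish(x) = x * tanh(softplus(x))
    }

    // Launch example: mish_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);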

I retested the FPS, and it's faster now!

Models Device BatchSize Mode Input Shape(HxW) FPS
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 1 FP16 608x608 35.7
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 4 FP16 608x608 40.9
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 8 FP16 608x608 41.3

bhavitvyamalik commented May 11, 2020

@wang-xinyu did you try YOLOv4 with the Jetson Nano? It also has TensorRT 7.

wang-xinyu (Author)

@wang-xinyu did you try YOLOv4 with the Jetson Nano? It also has TensorRT 7.

No, but it should work on the Nano; you can try my repo.

bhavitvyamalik

Is there any python implementation of the same? I need to do a lot of pre- and post-processing using OpenCV and I'm not that comfortable with C++.

wang-xinyu (Author)

Is there any python implementation of the same? I need to do a lot of pre- and post-processing using OpenCV and I'm not that comfortable with C++.

@bhavitvyamalik No python; it's all C++ and CUDA in tensorrtx.

jkjung-avt

@bhavitvyamalik

did you try YOLOv4 with the Jetson Nano? It also has TensorRT 7.

I have implemented TensorRT YOLOv4 with the python API: Demo #5: YOLOv4. I tested it on Jetson Nano with JetPack-4.4 (TensorRT 7). The FPS numbers can be found in the README of my repo.

Is there any python implementation of the same? I need to do a lot of pre- and post-processing using OpenCV and I'm not that comfortable with C++.

My implementation is purely in python. Do check it out.

I have also written a blog post about some of the implementation details: TensorRT YOLOv4

@AlexeyAB THANKS for sharing the code and the YOLOv4 model.

jkjung-avt commented Jul 24, 2020

Here are the mAP numbers of my TensorRT yolov4 and yolov4-tiny implementations, as well as FPS measurements on Jetson Nano.

TensorRT engine mAP@IoU=0.5:0.95 mAP@IoU=0.5 FPS on Nano
yolov4-tiny-288 (FP16) 0.179 0.344 23.8
yolov4-tiny-416 (FP16) 0.196 0.386 16.5
yolov4-288 (FP16) 0.372 0.590 6.18
yolov4-416 (FP16) 0.454 0.698 3.50
yolov4-608 (FP16) 0.484 0.735 1.77

[2020-08-17 update] I've updated my tensorrt yolov4 implementation with a "yolo_layer" plugin. Here are the updated FPS numbers from testing on Jetson Nano (JetPack-4.4). Refer to my jkjung-avt/tensorrt_demos repo for details.

TensorRT engine mAP@IoU=0.5:0.95 mAP@IoU=0.5 FPS on Nano
yolov3-tiny-288 (FP16) 0.077 0.158 35.8
yolov3-tiny-416 (FP16) 0.096 0.201 25.5
yolov3-288 (FP16) 0.331 0.600 8.16
yolov3-416 (FP16) 0.373 0.663 4.93
yolov3-608 (FP16) 0.376 0.664 2.53
yolov3-spp-288 (FP16) 0.339 0.594 8.16
yolov3-spp-416 (FP16) 0.391 0.663 4.82
yolov3-spp-608 (FP16) 0.409 0.685 2.49
yolov4-tiny-288 (FP16) 0.178 0.344 36.6
yolov4-tiny-416 (FP16) 0.195 0.386 25.5
yolov4-288 (FP16) 0.371 0.590 7.93
yolov4-416 (FP16) 0.453 0.698 4.62
yolov4-608 (FP16) 0.483 0.735 2.35

AlexeyAB (Owner)

@jkjung-avt Hi,
Thanks!

jkjung-avt

@AlexeyAB My implementation is based on NVIDIA's original TensorRT python/yolov3_onnx sample. NVIDIA's original code does TensorRT yolov3-608x608 inference at only 0.3 FPS on Jetson Nano. I made improvements in the postprocessing code and managed to boost TensorRT yolov3-608x608 inference speed to 1.53 FPS on Nano.

The major advantages of my implementation (jkjung-avt/tensorrt_demos) are:

  • All code is implemented in python. In particular, the inference code uses TensorRT's python API. This is much easier for most AI/DL practitioners to work with.
  • My implementation directly takes darknet cfg/weights files, converts them to onnx, and then to TensorRT engines. As of now, "yolov3", "yolov3-spp", "yolov3-tiny", "yolov4" and "yolov4-tiny" models are all supported and tested. It's very easy to convert a custom-trained darknet yolov3/yolov4 model and test TensorRT inference on Jetson or x86_64 with this code.

But as you've guessed, the downside of my implementation is somewhat inferior performance. This is mainly because:

  • python code is inherently slow compared to C/C++,
  • python code cannot utilize multiple CPUs effectively, even with multithreading (GIL issue).

Why did you get only 3.5 FPS for yolov4-416 on Jetson Nano using TensorRT, while we can get 3.9 FPS using tkDNN+TensorRT? https://github.com/ceccocats/tkDNN#results

Besides the slowness of python code, I think there are probably 2 additional reasons:

  • The postprocessing code (all processing in the yolo layers, including NMS) is implemented in python and runs on the CPU. I estimate this postprocessing takes ~15% of the processing time of each frame (depending on how many candidate/target objects are present in the frame) for yolov4-416x416. A minimal NMS sketch follows this list.
  • I implemented "Mish" with "Softplus" + "Tanh" + "Mul". This runs slightly slower than a dedicated TensorRT plugin.
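
For illustration, the NMS part of that postprocessing boils down to something like the following greedy NMS sketch (written here in C++ just to show the algorithm; the actual postprocessing in my repo is python):

    #include <algorithm>
    #include <vector>

    struct Box { float x1, y1, x2, y2, score; };

    // Intersection-over-union of two axis-aligned boxes.
    static float iou(const Box& a, const Box& b) {
        float ix1 = std::max(a.x1, b.x1), iy1 = std::max(a.y1, b.y1);
        float ix2 = std::min(a.x2, b.x2), iy2 = std::min(a.y2, b.y2);
        float inter = std::max(0.f, ix2 - ix1) * std::max(0.f, iy2 - iy1);
        float area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
        float area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
        return inter / (area_a + area_b - inter);
    }

    // Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much.
    std::vector<Box> nms(std::vector<Box> boxes, float iou_thresh) {
        std::sort(boxes.begin(), boxes.end(),
                  [](const Box& a, const Box& b) { return a.score > b.score; });
        std::vector<Box> kept;
        for (const Box& b : boxes) {
            bool keep = true;
            for (const Box& k : kept)
                if (iou(b, k) > iou_thresh) { keep = false; break; }
            if (keep) kept.push_back(b);
        }
        return kept;
    }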

Why did you get only 16.5 FPS for yolov4-tiny-416 on Jetson Nano using TensorRT, while we can get 39 FPS using tkDNN+TensorRT? ceccocats/tkDNN#59 (comment)

I think it's the same reason as above. Since the CNN portion of yolov4-tiny runs much faster than that of the large yolov4 model, the effect of slow python postprocessing code gets magnified quite a bit.

Do you run async in 3 threads: 1-video capturing and pre-processing, 2-inference, 3-post-processing and drawing/showing?

The short answer is no. But let me reply to this question more properly in a separate post, since this one is getting pretty long.

jkjung-avt

@AlexeyAB Let me get back to this question.

Do you run async in 3 threads: 1-video capturing and pre-processing, 2-inference, 3-post-processing and drawing/showing?

The real answer should be yes:

So to recap, we have discussed the following for achieving better FPS for the TensorRT YOLOv4 and YOLOv4-tiny models:

  • more efficient code for preprocessing and postprocessing,
  • using more efficient plugin implementation for layers which are not supported by TensorRT directly (such as "Mish" activation),
  • multi-threading the preprocessing and postprocessing code,

But if you are really going after the best possible FPS, I think there are additional things that could be considered:

  • utilizing GPU to do preprocessing: CHW channel swapping, mean subtraction, int8-to-float32 conversion, etc.
  • parallelizing GPU/CPU memcpy and TensorRT kernel execution,
  • further pipelining of TensorRT operations (splitting the TensorRT YOLOv4 engine into 2 or 3 stages).

So I imagine the optimal design (in terms of FPS) of TensorRT YOLOv4 on Jetson is: video capturing into GPU memory directly (either through hardware H.264 decoder or by custom kernel drivers), image preprocessing by GPU, pipeline stages of TensorRT engine, postprocessing by GPU, and finally copying image and inference results to CPU for display. The data should stay in GPU memory most of the time, so there is no extra copying between GPU and CPU. The preprocessing, TensorRT pipeline stages, postprocessing and memcpy (from GPU to CPU) are all executed in different CUDA streams so they get fully parallelized.
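
As a rough illustration of that idea, GPU-side preprocessing issued on its own CUDA stream, overlapped with an async copy of the previous frame's detections on another stream, could look like the sketch below (hypothetical code, not taken from any of the repos discussed here; stream creation, error checking and the actual TensorRT enqueue call are omitted):

    #include <cuda_runtime.h>

    // uint8 HWC BGR -> float CHW RGB, normalized to [0, 1], done on the GPU.
    __global__ void preprocess_kernel(const unsigned char* src, float* dst,
                                      int w, int h) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;
        int i = y * w + x;
        dst[0 * w * h + i] = src[3 * i + 2] / 255.0f;  // R
        dst[1 * w * h + i] = src[3 * i + 1] / 255.0f;  // G
        dst[2 * w * h + i] = src[3 * i + 0] / 255.0f;  // B
    }

    void pipeline_step(const unsigned char* d_frame, float* d_input,
                       const float* d_prev_dets, float* h_prev_dets,
                       size_t det_bytes, int w, int h,
                       cudaStream_t pre_stream, cudaStream_t copy_stream) {
        dim3 block(16, 16);
        dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
        // Preprocess the current frame on one stream...
        preprocess_kernel<<<grid, block, 0, pre_stream>>>(d_frame, d_input, w, h);
        // ...while the previous frame's detections are copied back on another.
        // h_prev_dets should be pinned memory (cudaHostAlloc) for a truly async copy.
        cudaMemcpyAsync(h_prev_dets, d_prev_dets, det_bytes,
                        cudaMemcpyDeviceToHost, copy_stream);
        // The TensorRT engine would be enqueued on its own stream in between
        // (e.g. context->enqueueV2(bindings, infer_stream, nullptr)), omitted here.
    }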

That is not easy to implement, though...

PythonImageDeveloper commented Jul 28, 2020

Hello @jkjung-avt
In my opinion, your suggestions are right, but I have some questions.
a) You say video capturing should go into GPU memory directly. In this case we can't use cv2.VideoCapture + GStreamer, since that solution copies the decoded frames from the NVMM buffer to a CPU buffer, which means a duplicate copy of each decoded frame, right? Do you have a solution for decoding frames directly into GPU memory?

b) The Jetson Nano uses shared memory, so CPU and GPU memory are the same, right? Why do we need GPU memory? Isn't everything in CPU memory also in GPU memory?

c) If I use cv2.VideoCapture + GStreamer with the H.264 HW decoder, the decoded frames are copied from the NVMM buffer to a CPU buffer; in this case, does one decoded frame use 2x of the total memory?

d) If I use cv2.VideoCapture + GStreamer with the H.264 HW decoder and the decoded frames are copied from the NVMM buffer to a CPU buffer, then if I want to use the GPU for pre/post-processing, do we need to copy again from CPU memory to GPU memory? In this case, does one decoded frame use 3x of the total memory?

e) You mention the slowness of python, but inference is done in the C/C++ backend, so that part is fine. For pre/post-processing, if we use pycuda and the whole system is implemented through python wrappers, in your opinion, which solution gives more performance in terms of FPS: a python wrapper for inference + pycuda for pre/post-processing, or C++ for inference + pre/post-processing in C++ but on the CPU?

TomHeaven pushed a commit to TomHeaven/darknet that referenced this pull request on Aug 13, 2020: Update readme, add tensorrt yolov3-spp, yolov4
jkjung-avt commented Aug 17, 2020

@AlexeyAB I have updated my tensorrt yolov4 implementation as indicated in #5453 (comment).

Why did you get only 3.5 FPS for yolov4-416 on Jetson Nano using TensorRT, while we can get 3.9 FPS using tkDNN+TensorRT? https://github.com/ceccocats/tkDNN#results

TensorRT "yolov4-416" (FP16) now runs at 4.62 FPS on Jetson Nano.

ttanzhiqiang

https://github.com/ttanzhiqiang/onnx_tensorrt_project
