Update readme, add tensorrt yolov3-spp, yolov4 #5453

Merged 1 commit on May 16, 2020

Conversation

wang-xinyu

Hi @AlexeyAB , thanks for your remarkable work.

I have just implemented yolov4 in tensorrt today, and yolov3-spp weeks ago.

And I got the following speed test results on my machine:

Models Device BatchSize Mode Input Shape(HxW) FPS
YOLOv3-spp(darknet53) Xeon E5-2620/GTX1080 1 FP32 256x416 94
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 1 FP32 256x416 67

Could you merge this PR to add a link to my repo tensorrtx to your readme? It would be my pleasure :)))

regards,
xinyu

AlexeyAB (Owner) commented May 2, 2020

@wang-xinyu Hi,

  • Why do you use a non-square network resolution (width=416, height=256)?
  • Which resizing approach do you use: plain resizing or letter_box (resizing while keeping the aspect ratio)?
  • Is there a readme on how to run, for example, YOLOv4 (CSPDarknet53) with FP16 and batch=4 on a video file?

wang-xinyu (Author) commented May 2, 2020

@AlexeyAB

Hi,

  • I was using w=416, h=256 for video resolutions like 1920x1080. The input W and H are defined in yololayer.h; any value divisible by 32 is supported.

  • The resize approach is letter_box (resizing while keeping the aspect ratio, then padding), the same as the implementation in https://github.com/ultralytics/yolov3. A minimal sketch is shown after this list.

  • There is a readme on how to run yolov4: https://github.com/wang-xinyu/tensorrtx/tree/master/yolov4.

  • FP16/FP32 is selected by a macro defined in yolov4.cpp.

  • Currently it only supports batch size 1. I will implement multi-batch support in the coming days. :)
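
For illustration, a minimal letter_box sketch in C++/OpenCV could look like this (not the exact tensorrtx code; the function name and the gray fill value are just placeholders):

    // Resize while keeping the aspect ratio, then pad the remainder.
    #include <algorithm>
    #include <opencv2/opencv.hpp>

    cv::Mat letterbox(const cv::Mat& img, int input_w, int input_h) {
        float r = std::min(input_w / (float)img.cols, input_h / (float)img.rows);
        int new_w = (int)(img.cols * r);
        int new_h = (int)(img.rows * r);

        cv::Mat resized;
        cv::resize(img, resized, cv::Size(new_w, new_h));

        // Canvas at the network resolution; 128 is an arbitrary fill color here.
        cv::Mat out(input_h, input_w, CV_8UC3, cv::Scalar(128, 128, 128));
        resized.copyTo(out(cv::Rect((input_w - new_w) / 2, (input_h - new_h) / 2,
                                    new_w, new_h)));
        return out;
    }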

wang-xinyu (Author)

@AlexeyAB hello,

Update: yolov4 now supports multi-batch. I retested the speed with batch=1, 4 and 8.

Models Device BatchSize Mode Input Shape(HxW) FPS
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 1 FP32 256x416 59
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 4 FP32 256x416 74
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 8 FP32 256x416 83

The config details, including input shape, number of classes, FP16/FP32, batch size, etc., can be found here: https://github.com/wang-xinyu/tensorrtx/blob/master/yolov4/README.md#config.

AlexeyAB (Owner) commented May 3, 2020

@wang-xinyu

  1. Do you measure full-cycle FPS? Do you run pre-processing, inference, and post-processing asynchronously in 3 separate CPU threads?

  2. Can you check the FPS at 608x608 for batch=1, 4, 8?

  3. How many FPS do you get with 608x608 if you use:
    ./darknet detector demo cfg/coco.data cfg/yolov4.cfg yolov4.weights -dont_show

wang-xinyu (Author)

@AlexeyAB

The FPS tests above included inference and NMS, and did not use any multi-threading.

In the following, only the inference time is measured, excluding any pre- and post-processing.

I was using the following command and got AVG_FPS: 20.0. The demo.mp4 is 1920x1080, and the input shape is 608x608 in yolov4.cfg:

./darknet detector demo cfg/coco.data cfg/yolov4.cfg yolov4.weights ~/demo.mp4 -benchmark

I have a question: is this command using FP16 by default? My GPU is a GTX1080.

I retested the FPS for 608x608 in my tensorrt implementation.

Models Device BatchSize Mode Input Shape(HxW) FPS
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 1 FP16 608x608 23.3
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 4 FP16 608x608 23.8
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 8 FP16 608x608 24.1
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 1 FP32 608x608 23.3
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 4 FP32 608x608 24.0
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 8 FP32 608x608 24.0

There is no big difference between FP16 and FP32, and no big FPS gain compared to darknet.

I guess the mish layer prevents tensorrt from fusing layers, because mish is not natively supported in tensorrt.

I will try to optimize the mish implementation in the near future, and also try replacing mish with relu to see the FPS.

wang-xinyu (Author)

Hi @AlexeyAB

Update:

I modified the mish layer in my tensorrt implementation to use the same softplus, tanh and mish cuda kernels as your darknet implementation.

The main difference is that you are using expf(), while I was using exp().
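
For reference, a simplified mish kernel along these lines is sketched below: y = x * tanh(softplus(x)) with softplus(x) = ln(1 + e^x), computed with the single-precision expf()/logf()/tanhf(). This is only an illustration, not the exact darknet or tensorrtx kernel (which also guards softplus for numerical stability):

    __global__ void mish_kernel(const float* input, float* output, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= n) return;
        float x = input[idx];
        float sp = logf(1.0f + expf(x));   // softplus(x) = ln(1 + e^x)
        output[idx] = x * tanhf(sp);       // mish(x) = x * tanh(softplus(x))
    }

    // Launch example: mish_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);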

I retested the FPS, and it's faster now!

Models Device BatchSize Mode Input Shape(HxW) FPS
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 1 FP16 608x608 35.7
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 4 FP16 608x608 40.9
YOLOv4(CSPDarknet53) Xeon E5-2620/GTX1080 8 FP16 608x608 41.3

bhavitvyamalik commented May 11, 2020

@wang-xinyu did you try YOLOv4 with the Jetson Nano? It also has TensorRT 7.

wang-xinyu (Author)

@wang-xinyu did you try YOLOv4 with the Jetson Nano? It also has TensorRT 7.

No, but it should work on the Nano; you can try my repo.

bhavitvyamalik

Is there any python implementation of the same? I need to do a lot of pre- and post-processing using OpenCV and I'm not that comfortable with C++.

wang-xinyu (Author)

Is there any python implementation of the same? I need to do a lot of pre- and post-processing using OpenCV and I'm not that comfortable with C++.

@bhavitvyamalik No python; it's all C++ and CUDA in tensorrtx.

jkjung-avt

@bhavitvyamalik

did you try YOLOv4 with the Jetson Nano? It also has TensorRT 7.

I have implemented TensorRT YOLOv4 with the python API: Demo #5: YOLOv4. I tested it on Jetson Nano with JetPack-4.4 (TensorRT 7). The FPS numbers can be found in the README of my repo.

Is there any python implementation of the same? I need to do a lot of pre- and post-processing using OpenCV and I'm not that comfortable with C++.

My implementation is purely in python. Do check it out.

I have also written a blog post about some of the implementation details: TensorRT YOLOv4

@AlexeyAB THANKS for sharing the code and the YOLOv4 model.

jkjung-avt commented Jul 24, 2020

Here are the mAP numbers of my TensorRT yolov4 and yolov4-tiny implementations, as well as FPS measurements on Jetson Nano.

TensorRT engine mAP@IoU=0.5:0.95 mAP@IoU=0.5 FPS on Nano
yolov4-tiny-288 (FP16) 0.179 0.344 23.8
yolov4-tiny-416 (FP16) 0.196 0.386 16.5
yolov4-288 (FP16) 0.372 0.590 6.18
yolov4-416 (FP16) 0.454 0.698 3.50
yolov4-608 (FP16) 0.484 0.735 1.77

[2020-08-17 update] I've updated my tensorrt yolov4 implementation with a "yolo_layer" plugin. Here are the updated FPS numbers from testing on Jetson Nano (JetPack-4.4). Refer to my jkjung-avt/tensorrt_demos repo for details.

TensorRT engine mAP@IoU=0.5:0.95 mAP@IoU=0.5 FPS on Nano
yolov3-tiny-288 (FP16) 0.077 0.158 35.8
yolov3-tiny-416 (FP16) 0.096 0.201 25.5
yolov3-288 (FP16) 0.331 0.600 8.16
yolov3-416 (FP16) 0.373 0.663 4.93
yolov3-608 (FP16) 0.376 0.664 2.53
yolov3-spp-288 (FP16) 0.339 0.594 8.16
yolov3-spp-416 (FP16) 0.391 0.663 4.82
yolov3-spp-608 (FP16) 0.409 0.685 2.49
yolov4-tiny-288 (FP16) 0.178 0.344 36.6
yolov4-tiny-416 (FP16) 0.195 0.386 25.5
yolov4-288 (FP16) 0.371 0.590 7.93
yolov4-416 (FP16) 0.453 0.698 4.62
yolov4-608 (FP16) 0.483 0.735 2.35

AlexeyAB (Owner)

@jkjung-avt Hi,
Thanks!

jkjung-avt

@AlexeyAB My implementation is based on NVIDIA's original TensorRT python/yolov3_onnx sample. NVIDIA's original code does TensorRT yolov3-608x608 inference at only 0.3 FPS on Jetson Nano. I made improvements in the postprocessing code and managed to boost TensorRT yolov3-608x608 inference speed to 1.53 FPS on Nano.

The major advantages of my implementation (jkjung-avt/tensorrt_demos) are:

  • All code is implemented in python. In particular, the inference code uses TensorRT's python API. This is much easier for most AI/DL practitioners to work with.
  • My implementation directly takes darknet cfg/weights files, converts them to onnx, and then to TensorRT engines. As of now, "yolov3", "yolov3-spp", "yolov3-tiny", "yolov4" and "yolov4-tiny" models are all supported and tested. It's very easy to convert a custom-trained darknet yolov3/yolov4 model and test TensorRT inference on Jetson or x86_64 with this code.

But as you've guessed, the downside of my implementation is somewhat inferior performance. This is mainly because:

  • python code is inherently slow compared to C/C++,
  • python code cannot utilize multiple CPUs effectively, even with multithreading (GIL issue).

Why did you get only 3.5 FPS for yolov4-416 on Jetson Nano using TensorRT, while we can get 3.9 FPS using tkDNN+TensorRT? https://github.com/ceccocats/tkDNN#results

Besides the slowness of python code, I think there are probably 2 additional reasons:

  • The postprocessing code (all processing in the yolo layers, including NMS) is implemented in python and runs on the CPU. I estimate this postprocessing takes ~15% of the processing time of each frame (depending on how many candidate/target objects are present in the frame) for yolov4-416x416. A minimal NMS sketch follows this list.
  • I implemented "Mish" with "Softplus" + "Tanh" + "Mul". This runs slightly slower than a dedicated TensorRT plugin.
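
For illustration, the NMS part of that postprocessing boils down to something like the following greedy NMS sketch (written here in C++ just to show the algorithm; the actual postprocessing in my repo is python):

    #include <algorithm>
    #include <vector>

    struct Box { float x1, y1, x2, y2, score; };

    // Intersection-over-union of two axis-aligned boxes.
    static float iou(const Box& a, const Box& b) {
        float ix1 = std::max(a.x1, b.x1), iy1 = std::max(a.y1, b.y1);
        float ix2 = std::min(a.x2, b.x2), iy2 = std::min(a.y2, b.y2);
        float inter = std::max(0.f, ix2 - ix1) * std::max(0.f, iy2 - iy1);
        float area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
        float area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
        return inter / (area_a + area_b - inter);
    }

    // Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much.
    std::vector<Box> nms(std::vector<Box> boxes, float iou_thresh) {
        std::sort(boxes.begin(), boxes.end(),
                  [](const Box& a, const Box& b) { return a.score > b.score; });
        std::vector<Box> kept;
        for (const Box& b : boxes) {
            bool keep = true;
            for (const Box& k : kept)
                if (iou(b, k) > iou_thresh) { keep = false; break; }
            if (keep) kept.push_back(b);
        }
        return kept;
    }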

Why did you get only 16.5 FPS for yolov4-tiny-416 on Jetson Nano using TensorRT, while we can get 39 FPS using tkDNN+TensorRT? ceccocats/tkDNN#59 (comment)

I think it's the same reason as above. Since the CNN portion of yolov4-tiny runs much faster than that of the large yolov4 model, the effect of slow python postprocessing code gets magnified quite a bit.

Do you run async in 3 threads: 1-video capturing and pre-processing, 2-inference, 3-post-processing and drawing/showing?

The short answer is no. But let me reply to this question more properly in a separate post, since this one is getting pretty long.

jkjung-avt

@AlexeyAB Let me get back to this question.

Do you run async in 3 threads: 1-video capturing and pre-processing, 2-inference, 3-post-processing and drawing/showing?

The real answer should be yes:

So to recap, we have discussed the following for achieving better FPS for the TensorRT YOLOv4 and YOLOv4-tiny models:

  • more efficient code for preprocessing and postprocessing,
  • using more efficient plugin implementation for layers which are not supported by TensorRT directly (such as "Mish" activation),
  • multi-threading the preprocessing and postprocessing code,

But if you are really going after the best possible FPS, I think there are additional things that could be considered:

  • utilizing GPU to do preprocessing: CHW channel swapping, mean subtraction, int8-to-float32 conversion, etc.
  • parallelizing GPU/CPU memcpy and TensorRT kernel execution,
  • further pipelining of TensorRT operations (splitting the TensorRT YOLOv4 engine into 2 or 3 stages).

So I imagine the optimal design (in terms of FPS) of TensorRT YOLOv4 on Jetson is: video capturing into GPU memory directly (either through hardware H.264 decoder or by custom kernel drivers), image preprocessing by GPU, pipeline stages of TensorRT engine, postprocessing by GPU, and finally copying image and inference results to CPU for display. The data should stay in GPU memory most of the time, so there is no extra copying between GPU and CPU. The preprocessing, TensorRT pipeline stages, postprocessing and memcpy (from GPU to CPU) are all executed in different CUDA streams so they get fully parallelized.
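
As a rough illustration of that idea, GPU-side preprocessing issued on its own CUDA stream, overlapped with an async copy of the previous frame's detections on another stream, could look like the sketch below (hypothetical code, not taken from any of the repos discussed here; stream creation, error checking and the actual TensorRT enqueue call are omitted):

    #include <cuda_runtime.h>

    // uint8 HWC BGR -> float CHW RGB, normalized to [0, 1], done on the GPU.
    __global__ void preprocess_kernel(const unsigned char* src, float* dst,
                                      int w, int h) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;
        int i = y * w + x;
        dst[0 * w * h + i] = src[3 * i + 2] / 255.0f;  // R
        dst[1 * w * h + i] = src[3 * i + 1] / 255.0f;  // G
        dst[2 * w * h + i] = src[3 * i + 0] / 255.0f;  // B
    }

    void pipeline_step(const unsigned char* d_frame, float* d_input,
                       const float* d_prev_dets, float* h_prev_dets,
                       size_t det_bytes, int w, int h,
                       cudaStream_t pre_stream, cudaStream_t copy_stream) {
        dim3 block(16, 16);
        dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
        // Preprocess the current frame on one stream...
        preprocess_kernel<<<grid, block, 0, pre_stream>>>(d_frame, d_input, w, h);
        // ...while the previous frame's detections are copied back on another.
        // h_prev_dets should be pinned memory (cudaHostAlloc) for a truly async copy.
        cudaMemcpyAsync(h_prev_dets, d_prev_dets, det_bytes,
                        cudaMemcpyDeviceToHost, copy_stream);
        // The TensorRT engine would be enqueued on its own stream in between
        // (e.g. context->enqueueV2(bindings, infer_stream, nullptr)), omitted here.
    }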

That is not easy to implement, though...

PythonImageDeveloper commented Jul 28, 2020

Hello @jkjung-avt
In my opinion, your suggestions are right, but I have some questions.
a) You say video capturing should go into GPU memory directly. In this case we can't use cv2.VideoCapture + GStreamer, since that solution copies the decoded frames from the NVMM buffer to a CPU buffer, which means a duplicate copy of each decoded frame, right? Do you have a solution for decoding frames directly into GPU memory?

b) The Jetson Nano uses shared memory, so CPU and GPU memory are the same, right? Why do we need GPU memory? Isn't everything in CPU memory also in GPU memory?

c) If I use cv2.VideoCapture + GStreamer with the H.264 HW decoder, the decoded frames are copied from the NVMM buffer to a CPU buffer; in this case, does one decoded frame use 2x of the total memory?

d) If I use cv2.VideoCapture + GStreamer with the H.264 HW decoder and the decoded frames are copied from the NVMM buffer to a CPU buffer, then if I want to use the GPU for pre/post-processing, do we need to copy again from CPU memory to GPU memory? In this case, does one decoded frame use 3x of the total memory?

e) You mention the slowness of python, but inference is done in the C/C++ backend, so that part is fine. For pre/post-processing, if we use pycuda and the whole system is implemented through python wrappers, in your opinion, which solution gives more performance in terms of FPS: a python wrapper for inference + pycuda for pre/post-processing, or C++ for inference + pre/post-processing in C++ but on the CPU?

TomHeaven pushed a commit to TomHeaven/darknet that referenced this pull request on Aug 13, 2020: Update readme, add tensorrt yolov3-spp, yolov4
jkjung-avt commented Aug 17, 2020

@AlexeyAB I have updated my tensorrt yolov4 implementation as indicated in #5453 (comment).

Why did you get only 3.5 FPS for yolov4-416 on Jetson Nano using TensorRT, while we can get 3.9 FPS using tkDNN+TensorRT? https://github.com/ceccocats/tkDNN#results

TensorRT "yolov4-416" (FP16) now runs at 4.62 FPS on Jetson Nano.

ttanzhiqiang

https://github.com/ttanzhiqiang/onnx_tensorrt_project
