Update readme, add tensorrt yolov3-spp, yolov4 #5453
@wang-xinyu Hi,
Hi,
@AlexeyAB Hello, updates: yolov4 now supports multi-batch, and I retested the speed with batch = 1, 4 and 8.
The config details, including input shape, number of classes, FP16/FP32, batch size, etc., can be found here: https://github.com/wang-xinyu/tensorrtx/blob/master/yolov4/README.md#config
The FPS tests above included inference and NMS, without any multi-threading. In the following, we only test the inference time, excluding any pre- and post-processing. I was using the following command, and got … I have a question: is the following command using FP16 by default? My GPU is a GTX 1080.
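For reference, FP16 is not on by default in TensorRT's C++ API: the builder only produces a half-precision engine when the FP16 flag is set explicitly. An illustrative sketch (TensorRT 7 style, not the exact tensorrtx code):

```cpp
#include <NvInfer.h>

// Sketch: requesting an FP16 engine in the TensorRT 7 C++ API. FP16 is
// opt-in; without this flag the builder produces an FP32 engine.
void enableFp16(nvinfer1::IBuilder* builder, nvinfer1::IBuilderConfig* config) {
    if (builder->platformHasFastFp16()) {   // true only on GPUs with fast FP16
        config->setFlag(nvinfer1::BuilderFlag::kFP16);
    }
}
```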
I retested the FPS for 608*608 in my tensorrt implementation.
There is no big difference between FP16 and FP32, and no big FPS gain compared to darknet. I guess the mish layer prevents TensorRT from merging layers, because mish is not natively supported in TensorRT. I will try to optimize the mish implementation in the near future, and also try replacing mish with relu to see the FPS.
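For context, mish(x) = x * tanh(softplus(x)), so the plugin kernel itself can be small. A minimal illustrative sketch (not the exact tensorrtx plugin code):

```cpp
#include <cuda_runtime.h>
#include <math.h>

// Numerically stable softplus: ln(1 + e^x).
__device__ float softplus(float x) {
    if (x > 20.0f)  return x;        // ln(1+e^x) ~= x for large x
    if (x < -20.0f) return expf(x);  // ln(1+e^x) ~= e^x for very negative x
    return logf(1.0f + expf(x));
}

// mish(x) = x * tanh(softplus(x)), applied element-wise.
__global__ void mish_kernel(const float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        output[idx] = input[idx] * tanhf(softplus(input[idx]));
}
```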
Hi @AlexeyAB, updates: I modified the mish layer in my tensorrt implementation, and am now using the same … The main difference is that you are using … And I retested the FPS; it's faster now!
@wang-xinyu Did you try YOLOv4 on Jetson Nano? It also has TensorRT 7.
No, but it should work on Nano; you can try my repo.
Is there any python implementation of the same? I need to do a lot of pre- and post-processing using OpenCV and I'm not that comfortable with C++.
@bhavitvyamalik No python; it's all C++ and CUDA in tensorrtx.
I have implemented TensorRT YOLOv4 with the python API: Demo #5: YOLOv4. I tested it on Jetson Nano with JetPack-4.4 (TensorRT 7). The FPS numbers can be found in the README of my repo.
My implementation is purely in python; do check it out. I have also written a blog post about some of the implementation details: TensorRT YOLOv4. @AlexeyAB Thanks for sharing the code and the YOLOv4 model.
Here are mAP numbers of my TensorRT yolov4 and yolov4-tiny implementations, as well as FPS measurements on Jetson Nano.
[2020-08-17 update] I've updated my tensorrt yolov4 implementation with a "yolo_layer" plugin. Here are the updated FPS numbers from testing on Jetson Nano (JetPack-4.4). Refer to my jkjung-avt/tensorrt_demos repo for details.
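The point of such a plugin is to move the box decoding from python onto the GPU. A hypothetical sketch of what a "yolo_layer"-style decode kernel does (simplified to one anchor per cell; not jkjung-avt's actual plugin code, and the output layout is an assumption):

```cpp
#include <cuda_runtime.h>
#include <math.h>

__device__ float sigmoidf(float x) { return 1.0f / (1.0f + expf(x)); }

// One thread per grid cell, turning raw head output [tx, ty, tw, th, obj]
// into a box in input-image pixels. Real plugins also loop over anchors and
// scales and apply class scores; names and layout here are illustrative.
__global__ void yolo_decode(const float* raw, float* boxes, int grid_w,
                            int grid_h, int stride,
                            float anchor_w, float anchor_h) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= grid_w * grid_h) return;
    const float* p = raw + i * 5;
    float* b = boxes + i * 5;
    int cx = i % grid_w;
    int cy = i / grid_w;
    b[0] = (sigmoidf(p[0]) + cx) * stride;   // box center x
    b[1] = (sigmoidf(p[1]) + cy) * stride;   // box center y
    b[2] = expf(p[2]) * anchor_w;            // box width from anchor prior
    b[3] = expf(p[3]) * anchor_h;            // box height
    b[4] = sigmoidf(p[4]);                   // objectness score
}
```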
@jkjung-avt Hi,
@AlexeyAB My implementation is based on NVIDIA's original TensorRT python/yolov3_onnx sample. NVIDIA's original code does TensorRT yolov3-608x608 inference at only 0.3 FPS on Jetson Nano. I made improvements in the postprocessing code and managed to boost TensorRT yolov3-608x608 inference speed to 1.53 FPS on Nano. The major advantages of my implementation (jkjung-avt/tensorrt_demos) are:
But as you've guessed, the downside of my implementation is somewhat inferior performance. This is mainly due to:
Besides the slowness of the python code, I think there are probably two additional reasons:
I think it's the same reason as above. Since the CNN portion of yolov4-tiny runs much faster than that of the large yolov4 model, the effect of slow python postprocessing code gets magnified quite a bit.
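To put made-up numbers on it: if the full model's CNN takes 500 ms per frame and the python postprocessing takes 100 ms, postprocessing is only ~17% of the 600 ms total; if the tiny model's CNN takes 50 ms with the same 100 ms of postprocessing, postprocessing suddenly accounts for ~67% of the 150 ms total.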
The short answer is no. But let me reply to this question more properly in a separate post, since this one is getting pretty long.
@AlexeyAB Let me get back to this question.
The real answer should be yes:
So to recap, we have discussed the following for achieving better FPS for the TensorRT YOLOv4 and YOLOv4-tiny models:
But if you are really going after the best possible FPS, I think there are additional things that could be considered:
So I imagine the optimal design (in terms of FPS) of TensorRT YOLOv4 on Jetson is: video capture directly into GPU memory (either through the hardware H.264 decoder or custom kernel drivers), image preprocessing on the GPU, pipelined TensorRT engine stages, postprocessing on the GPU, and finally copying the image and inference results to the CPU for display. The data should stay in GPU memory most of the time, so there is no extra copying between GPU and CPU. The preprocessing, TensorRT pipeline stages, postprocessing and memcpy (from GPU to CPU) are all executed in different CUDA streams so they get fully parallelized. That is not easy to implement, though...
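To make that last point concrete, here is a bare-bones illustrative sketch of overlapping copies and compute with independent CUDA streams (sizes and names are made up; the real pipeline would enqueue preprocessing kernels and the TensorRT context's enqueue() on these streams):

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 3 * 608 * 608 * sizeof(float);  // one CHW frame (size illustrative)
    float *h_in, *h_out, *d_in, *d_out;
    cudaHostAlloc(&h_in,  bytes, cudaHostAllocDefault);  // pinned host memory,
    cudaHostAlloc(&h_out, bytes, cudaHostAllocDefault);  // required for async copies
    cudaMalloc(&d_in,  bytes);
    cudaMalloc(&d_out, bytes);

    cudaStream_t upload, download;
    cudaStreamCreate(&upload);
    cudaStreamCreate(&download);

    // Frame k+1 uploads while frame k's results download: the two copies
    // (and any kernels enqueued on these streams) can overlap.
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, upload);
    // ... preprocessing kernel + TensorRT enqueue() would go on `upload` ...
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, download);

    cudaStreamSynchronize(upload);
    cudaStreamSynchronize(download);

    cudaStreamDestroy(upload);
    cudaStreamDestroy(download);
    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}
```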
Hello @jkjung-avt,
b) Jetson Nano uses shared memory, so CPU and GPU memory are the same, right? Why do we need GPU memory? Isn't everything in CPU memory also in GPU memory?
c) If I use cv2.VideoCapture + GStreamer with the H.264 HW decoder, the decoded frames are copied from the NVMM buffer to a CPU buffer; in this case, do we use memory twice for one decoded frame?
d) In the same setup, if I then want to use the GPU for pre/post-processing, do we need to copy again from CPU memory to GPU memory, using memory three times for one decoded frame?
e) You mention the slowness of python; inference is done in the C/C++ backend, so that part is fine. For pre/post-processing, which solution gives more FPS in your opinion: a python wrapper for inference + pycuda for pre/post-processing, or C++ for inference + pre/post-processing in C++ but on the CPU?
@AlexeyAB I have updated my tensorrt yolov4 implementation as indicated in #5453 (comment).
TensorRT "yolov4-416" (FP16) now runs at 4.62 FPS on Jetson Nano. |
Hi @AlexeyAB, thanks for your remarkable work.
I have just implemented yolov4 in tensorrt today, and yolov3-spp weeks ago.
I got the following speed test results on my machine:
Could you merge this PR to add a link to my repo tensorrtx to your readme? It would be my pleasure :)))
regards,
xinyu