
Regarding mAP and latency of Yolov4 #5354

Open · mive93 opened this issue Apr 27, 2020 · 54 comments
Labels: Explanations (Explanations of the source code, algorithms or method of use)

@mive93 commented Apr 27, 2020

Dear Alexey,

first of all, thank you for your work.
I have been doing some tests with your new yolov4, and I have some questions.
I compared the performance of Yolov4, Yolov3 and CSPResNeXt50-PANet-SPP (the one I also found in your repo) on two different GPUs, using input size 416x416, and I checked the mAP on the COCO2017 validation set.

Here are the results (both FPS and mAP have been computed using your code):

GeForce RTX 2080 Ti Rev. A (while training was going on, so performance may be a bit degraded):

| Network | FPS | mAP@0.50 (val COCO2017) |
| --- | --- | --- |
| YOLOv3 | 39.0 | 66.16% |
| YOLOv4 | 38.8 | 70.22% |
| CSPResNeXt50-PANet-SPP | 37.8 | 75.88% |

GeForce GTX 1060 6GB:

| Network | FPS | FPS (CUDNN_HALF=1) |
| --- | --- | --- |
| YOLOv3 | 31.7 | 30.8 |
| YOLOv4 | 29.9 | 29.0 |
| CSPResNeXt50-PANet-SPP | 28.6 | 28.7 |

However, I have noticed that you do not compare against the third network in your paper. I was wondering what the reason is, and whether I am perhaps doing something wrong.

Thank you in advance.


[image]

@WongKinYiu (Collaborator) commented:
Hello,

We use the trainvalno5k set for training, so some images from the val set were seen during training.
CSPResNeXt50-PANet-SPP may therefore get a higher AP50 on the val set because it fits the training data better.

The comparison of YOLOv4 (CSPDarknet53-PANet-SPP, BoF-backbone, Mish, optimal setting) and CSPResNeXt50-PANet-SPP is in Table 6.
[image]

We chose CSPDarknet53 as the backbone of YOLOv4 since it gets both higher FPS and higher AP.
[image]

@mive93 (Author) commented Apr 27, 2020

Dear @WongKinYiu,

Thank you for your answer.
I see why the mAP could be better; however, I'm not seeing those better results in FPS.
I have tried again on the 2080 Ti, with the GPU unloaded, and this is what I get:

Size 512x512, GeForce RTX 2080 Ti Rev. A:

| Network | FPS |
| --- | --- |
| YOLOv3 | 55.5 |
| YOLOv4 | 60.5 |
| CSPResNeXt50-PANet-SPP | 59.6 |

So again, I don't see a big improvement with Yolov4.
Again, maybe it's my fault; I would just like to understand why I do not obtain your improvement.

@mive93 (Author) commented Apr 27, 2020

And another thing, sorry, I forgot: you said you use both the training and validation sets for training, but did you mean for CSPResNeXt50-PANet-SPP or for Yolov4?

Thanks again.

@AlexeyAB (Owner) commented Apr 27, 2020

@mive93

using input size 416x416

  • Check that you use [net] width=416 height=416 in both cfg-files

  • What command did you use for checking AP and FPS?

  • Can you show a screenshot from https://competitions.codalab.org where you get 0.75 AP50 on val2017?

  • Can you show a screenshot with the FPS?

@mive93 (Author) commented Apr 27, 2020

Dear @AlexeyAB,

Yes, I am sure the sizes were correct.

These are the commands I ran to get the AVG FPS:

```
./darknet detector demo cfg/coco.data cfg/csresnext50-panet-spp.cfg weights/csresnext50-panet-spp_final.weights ../tkDNN/demo/yolo_test.mp4 -dont_show -ext_output
./darknet detector demo cfg/coco.data cfg/yolov4.cfg weights/yolov4.weights ../tkDNN/demo/yolo_test.mp4 -dont_show -ext_output
./darknet detector demo cfg/coco.data cfg/yolov3.cfg weights/yolov3.weights ../tkDNN/demo/yolo_test.mp4 -dont_show -ext_output
```

and here are the commands to obtain the detections for codalab:

```
./darknet detector valid cfg/coco.data cfg/csresnext50-panet-spp.cfg weights/csresnext50-panet-spp_final.weights
./darknet detector valid cfg/coco.data cfg/yolov4.cfg weights/yolov4.weights
./darknet detector valid cfg/coco.data cfg/yolov3.cfg weights/yolov3.weights
```

In this folder you can find all the screenshots: https://cloud.hipert.unimore.it/s/g7KZNnytki5gExE

I summarize here the results from codalab on val2017

###############################################################################
#		YOLOV3 416x416 CODALAB res COCO2017 VAL       		      #
###############################################################################
overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.380
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.675
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.391
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.227
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.418
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.534
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.304
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.474
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.497
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.330
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.537
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.656
Done (t=81.34s)

GeForce RTX 2080 Ti Rev. A FPS: 75.5

###############################################################################
#		YOLOV4 416x416 CODALAB res COCO2017 VAL			      #
###############################################################################

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.471
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.710
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.510
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.278
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.525
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.636
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.357
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.561
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.587
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.382
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.642
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.772
Done (t=120.59s)

GeForce RTX 2080 Ti Rev. A FPS: 71.7

###############################################################################
#		CSPRESNEXT50-PANET-SPP 416x416 CODALAB res COCO2017 VAL       #
###############################################################################

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.497
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.766
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.535
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.269
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.549
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.708
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.363
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.559
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.583
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.376
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.637
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.776
Done (t=78.67s)

GeForce RTX 2080 Ti Rev. A FPS: 70.1

@AlexeyAB (Owner) commented Apr 27, 2020

Oh, you are asking about the old model csresnext50-panet-spp.cfg, not csresnext50-panet-spp-original-optimal.cfg.

Yes, it seems csresnext50-panet-spp.cfg was trained using trainvalno5k.list + 5k.list (maybe), while csresnext50-panet-spp-original-optimal.cfg and yolov4.cfg were trained using only trainvalno5k.list, without 5k.list.

I get these results on 5k.list:

  • yolov4.cfg - 416x416 - val2017:
    Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.471
    Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.710

  • csresnext50-panet-spp-original-optimal.cfg - 416x416 - val2017:
    Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.457
    Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.693

  • csresnext50-panet-spp.cfg - 416x416 - val2017:
    Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.497
    Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.766


By the way, I couldn't submit your json-files, so I just tested these models by myself again.

@WongKinYiu (Collaborator) commented:
@AlexeyAB @mive93

I will test and update the FPS on a Turing-architecture GPU in a few days.

If you use the old CSPResNeXt50-PANet-SPP, you will get higher AP at 416x416 due to the anchor settings.
#5311 (comment)

@mive93 (Author) commented Apr 28, 2020

Dear @AlexeyAB and @WongKinYiu,
sorry for my late answer. I have uploaded to the folder both my json detections and the txt list: https://cloud.hipert.unimore.it/s/g7KZNnytki5gExE. To test everything on codalab I just followed your wiki (using COCO val2017 instead of test-dev 2017).

However, if you say that you trained that network also on those data, it makes a lot of sense that the mAP is higher, even though it's not a fair comparison. So yeah, I assume Yolov4's accuracy is better, then :)

Thank you for your quick answers, and thank you for clarifying my doubts. I will wait for the FPS results then.

@mive93 (Author) commented Apr 29, 2020

Dear @AlexeyAB,

yesterday I ported your Yolov4 to TensorRT using tkDNN, a framework developed by @ceccocats, @sapienzadavide and me (you can find it here).

Below are some performance results on 2 boards, a discrete one and an embedded one. The outputs match yours, so the mAP is the same.

AVG FPS over 5000 images, input size 416x416.

AGX Xavier:

| Network | FPS (FP32) | FPS (FP16) |
| --- | --- | --- |
| yolov3 | 19.47 | 49.62 |
| yolov4 | 17.52 | 32.67 |

RTX 2080 Ti:

| Network | FPS (FP32) | FPS (FP16) |
| --- | --- | --- |
| yolov3 | 106.30 | 192.13 |
| yolov4 | 93.00 | 133.41 |

@AlexeyAB (Owner) commented Apr 29, 2020

@mive93 Hi,
Thanks!

The outputs match with yours, so the mAP is the same.

  • Does it match even for FP16?

  • Is FP32 == FP32, while FP16 == mixed-precision FP16+FP32 on Tensor Cores?

  • Did you test it with batch=1?

  • What network resolution did you use?

  • What is the advantage of tkDNN over TensorRT, i.e., what is tkDNN used for if TensorRT does the inference/quantization?

  • Is there some comparison table with FPS for different models/resolutions/float-precisions? https://github.com/ceccocats/tkDNN

Can you test YOLOv4 on an RTX 2080 Ti (or preferably on a Tesla V100) at 4 network resolutions, with batch=1 and batch=4?

  1. 320
  2. 416
  3. 512
  4. 608

@AlexeyAB added the Explanations label Apr 29, 2020
@mive93 (Author) commented Apr 29, 2020

Dear @AlexeyAB ,

Sorry for the delay.

  • These are the results of the mAP computed on codalab.
###############################################################################
#		DARKNET 416x416 CODALAB res COCO2017 VAL			      #
###############################################################################

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.471
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.710
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.510
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.278
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.525
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.636
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.357
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.561
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.587
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.382
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.642
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.772
Done (t=120.59s)


###############################################################################
		TKDNN YOLOV4 FP32 416x416 CODALAB res COCO2017 VAL			      
###############################################################################

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.449
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.701
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.481
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.235
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.507
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.626
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.343
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.556
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.329
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.618
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.758
Done (t=73.61s)

###############################################################################
		TKDNN YOLOV4 FP16 416x416 CODALAB res COCO2017 VAL			      
###############################################################################

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.449
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.701
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.481
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.235
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.507
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.626
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.343
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.555
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.330
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.758
Done (t=72.04s)

Actually there is a small loss, I guess due to different implementations of the operations. I tried to investigate further, but I couldn't find another source of mismatch. However, FP16 has the same mAP as FP32.

  • FP32 is full precision, FP16 is half precision in TensorRT. With tkDNN it is also possible to run inference at INT8 (again via TensorRT INT8 quantization) and to use the DLA. However, given that in these kinds of networks many layers are implemented via plugins rather than native TensorRT APIs, performance with DLA or INT8 may be degraded.

  • Yes, I did the tests with batch=1.

  • I was using 416x416.

  • tkDNN uses TensorRT. We just tried to achieve the best performance by exploiting it, while not depending on DeepStream, for example. We also compared our performance with DeepStream (last autumn, to be precise), and we perform better.
    [image]
    In this table batch=1 is used, and the same video was used to compute the avg FPS. Darknet is the version by Redmon. The board considered here is the NVIDIA TX2.

  • Performance results for other networks are not yet available, but we're submitting a paper next week; after that we will make them public. In any case, anyone could already reproduce them.

  • Here are the results of the tests you asked for:

FPS on RTX 2080 Ti for YOLOv4 with tkDNN (avg over 1200 images of size 640x480):

| Network | FP32, batch=1 | FP32, batch=4 | FP16, batch=1 | FP16, batch=4 |
| --- | --- | --- | --- | --- |
| yolov4 320 | 116.99 | 58.29 | 204.99 | 105.82 |
| yolov4 416 | 116.27 | 40.68 | 194.64 | 71.08 |
| yolov4 512 | 91.31 | 32.97 | 137.85 | 51.51 |
| yolov4 608 | 62.04 | 20.27 | 109.01 | 37.60 |

@AlexeyAB (Owner) commented Apr 29, 2020

@mive93 Thanks!

FPS on RTX 2080 Ti for YOLOv4 with tkDNN (avg over 1200 images of size 640x480):

| Network | FP32, batch=1 | FP32, batch=4 | FP16, batch=1 | FP16, batch=4 |
| --- | --- | --- | --- | --- |
| yolov4 320 | 116.99 | 58.29 | 204.99 | 105.82 |
| yolov4 416 | 116.27 | 40.68 | 194.64 | 71.08 |
| yolov4 512 | 91.31 | 32.97 | 137.85 | 51.51 |
| yolov4 608 | 62.04 | 20.27 | 109.01 | 37.60 |
  • Does 37.60 FPS for batch=4 actually mean that tkDNN processes 37.6 x 4 = 150.4 FPS for YOLOv4 with width=608 height=608 batch_size=4 FP16 on the RTX 2080 Ti? Usually a higher batch size increases FPS.

  • Do you measure just the inference time, or the full-cycle FPS? Do pre-processing (resizing) and post-processing (NMS) execute asynchronously in separate CPU threads, and therefore not reduce FPS?


Actually there is a small loss, I guess due to different implementation of the operations. I have tried to investigate more, but I couldn't find another source of mismatch. However, FP16 has the same mAP as FP32.

  • Do you resize before inference without keeping the aspect ratio? This repo https://github.com/AlexeyAB/darknet doesn't keep the aspect ratio (i.e., by default letter_box=0), while https://github.com/pjreddie/darknet keeps the aspect ratio.
    Just do cv::resize(src, dst, Size(608,608)); without keeping the aspect ratio.
    Resizing: keeping aspect ratio, or not #232 (comment)

  • Also, what NMS implementation do you use? (is it regular NMS or soft-NMS?)

    darknet/src/box.c, lines 812 to 844 at 36c73c5:

```c
void do_nms_sort(detection *dets, int total, int classes, float thresh)
{
    int i, j, k;
    k = total - 1;
    for (i = 0; i <= k; ++i) {
        if (dets[i].objectness == 0) {
            detection swap = dets[i];
            dets[i] = dets[k];
            dets[k] = swap;
            --k;
            --i;
        }
    }
    total = k + 1;
    for (k = 0; k < classes; ++k) {
        for (i = 0; i < total; ++i) {
            dets[i].sort_class = k;
        }
        qsort(dets, total, sizeof(detection), nms_comparator_v3);
        for (i = 0; i < total; ++i) {
            //printf(" k = %d, \t i = %d \n", k, i);
            if (dets[i].prob[k] == 0) continue;
            box a = dets[i].bbox;
            for (j = i + 1; j < total; ++j) {
                box b = dets[j].bbox;
                if (box_iou(a, b) > thresh) {
                    dets[j].prob[k] = 0;
                }
            }
        }
    }
}
```

  • Did you implement scale_x_y= in the [yolo] layer? It is a very simple addition (a decoding sketch follows after this list).

There are 3 different scale_x_y= values:

  1. scale_x_y = 1.2
  2. scale_x_y = 1.1
  3. scale_x_y = 1.05
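
For reference, a minimal sketch of how scale_x_y enters the box-center decoding in the [yolo] layer, mirroring Darknet's logistic-activation scaling (the function and variable names here are illustrative, not Darknet's exact code):

```cpp
#include <cmath>

// tx/ty are raw network outputs; (col, row) index the grid cell;
// lw/lh are the grid width and height.
void decodeCenter(float tx, float ty, int col, int row, int lw, int lh,
                  float scale_x_y, float& bx, float& by)
{
    float sx = 1.0f / (1.0f + std::exp(-tx));   // logistic activation
    float sy = 1.0f / (1.0f + std::exp(-ty));
    // scale_x_y stretches the sigmoid output around 0.5 so the predicted
    // center can actually reach the grid-cell borders:
    bx = (col + sx * scale_x_y - 0.5f * (scale_x_y - 1.0f)) / lw;
    by = (row + sy * scale_x_y - 0.5f * (scale_x_y - 1.0f)) / lh;
}
```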

@mive93 (Author) commented Apr 30, 2020

Hi @AlexeyAB,

  • Yes, that is correct.

  • You are absolutely right, it was not a 100% fair comparison, totally my bad. There were two mistakes: for batch=1 I was considering end-to-end latencies over 1200 images of size 640x480, while for batching I was only considering inference (because we had not yet integrated batched pre- and post-processing) over 1200 images of size equal to the network input size. I have performed the tests again, more fairly, on 1200 images of size equal to the network input size, timing only inference.
FPS on RTX 2080 Ti for YOLOv4 with tkDNN (avg over 1200 images of size equal to the network input size):

| Network | FP32, batch=1 | FP32, batch=4 | FP16, batch=1 | FP16, batch=4 |
| --- | --- | --- | --- | --- |
| yolov4 320 | 116.56 | 233.16 | 202.02 | 423.29 |
| yolov4 416 | 103.54 | 162.71 | 162.50 | 284.34 |
| yolov4 512 | 91.63 | 131.90 | 134.94 | 206.04 |
| yolov4 608 | 62.34 | 81.06 | 100.81 | 150.41 |
  • Moreover, here are some statistics on preprocessing/inference/postprocessing for yolov4-416:

| Platform | Network | pre (%) | inference (%) | post (%) |
| --- | --- | --- | --- | --- |
| RTX 2080 Ti | yolov4 FP32 | 10.22 | 79.60 | 10.18 |
| RTX 2080 Ti | yolov4 FP16 | 17.27 | 68.28 | 14.45 |
| AGX Xavier | yolov4 FP32 | 2.58 | 95.36 | 2.06 |
| AGX Xavier | yolov4 FP16 | 4.83 | 91.47 | 3.70 |

  • We already do so.

  • We use the NMS you reported.

  • Yes, we have implemented scale_x_y for yolov4.

@AlexeyAB (Owner) commented Apr 30, 2020

@mive93 Hi,

Performance results for other networks are not yet available, but we're submitting a paper next week. Then we will make them public. Even though anyone could reproduce them already.

Will you publish a paper on arxiv.org with an AP / FPS comparison of different models, or only an FPS comparison?


  • Do you use only FP32 (without Tensor Cores) and FP16 (with Tensor Cores), but not FP32/16 (mixed precision with Tensor Cores), because FP16 shows the same good accuracy?

  • Will you add a manual on how to measure AP / AP50 and FPS using tkDNN+TensorRT?

  • Will you add a demo on a video file that measures FPS including inference + pre- + post-processing running in 3 CPU threads, and that can use both batch=1 and batch=4?
    It should print detection results to the console and optionally show the video in a window (switchable off, since that can reduce FPS).

  • Did you compare inference time with batch=1 for tkDNN vs OpenCV-dnn? Feature-request: State-of-art Yolo v4 Detector opencv/opencv#17148

  • Do you use the same Mish implementation as in Darknet?

    output_gpu[i] = x_val * tanh_activate_kernel( softplus_kernel(x_val, MISH_THRESHOLD) );

```cpp
float softplus(float x, float threshold = 20) {
    if (x > threshold) return x;                // too large
    else if (x < -threshold) return expf(x);    // too small
    return logf(expf(x) + 1);
}

float mish_activation(float input) {
    const float MISH_THRESHOLD = 20;
    float output = input * tanh(softplus(input, MISH_THRESHOLD));
    return output;
}
```

@AlexeyAB (Owner) commented May 4, 2020

@mive93

Can you also test the AVG_FPS for YOLOv4 with Darknet (OpenCV + CUDA + cuDNN) on the same RTX 2080 Ti GPU, for network resolutions 320, 416, 512 and 608?

Using a command like this:

```
./darknet detector demo cfg/coco.data cfg/yolov4.cfg yolov4.weights yolo_test.mp4 -dont_show -ext_output
```

@mive93 (Author) commented May 4, 2020

Hi @AlexeyAB,

I am sorry for the delay, but I have work-related deadlines this week.
I will get back to you after those and try to address your requests :)
Sorry for now.

@mive93 (Author) commented May 11, 2020

Hi @AlexeyAB, sorry for the long delay.

Will you publish paper on arxiv.org with AP / FPS or only FPS comparison of different models?

We submitted to a conference, and we ran experiments in terms of mAP, latency and power consumption. As soon as it's accepted, we also plan to share the raw data.

Do you use only FP32 (without Tensor Cores) and FP16 (with Tensor Cores), but don't use FP32/16 (Mixed-precision with Tensor Cores), because FP16 shows the same good accuracy?

We use FP32 (without Tensor Cores) and FP32/16 (mixed precision with Tensor Cores) [as well as FP32/INT8], because plugins always run at FP32.

Will you add manual how to measure AP / AP50 and FPS by using TkDNN+TensorRT?

In master there is already a demo that computes the mAP for each network supported by tkDNN. However, it is a bit different from yours, because each bounding box is counted only once, with its highest probability. The README explains how to compute the mAP; every precision is supported.
Moreover, in the eval branch I added export to a json, to evaluate the performance of each network/precision on codalab.

Will you add demo on video-file that check FPS including inference+pre+post_processing which are runing in 3 CPU-threads, and can use both batch=1 and batch=4 ?
And shows detection results to the console and optionally shows video in the window (can be switched off, because can reduce FPS).

I will work on the demo with batch > 1 this week and will keep you updated when I have something working.

Did you compare inference time with batch=1 for tkDNN vs OpenCV-dnn? opencv/opencv#17148

I hadn't heard of that; I will take a look, thanks.

Do you use the same Mish-implementation as in the Darknet?

Yes

Can you also test AVG_FPS for YOLOv4 on the Darknet (OpenCV + CUDA + cuDNN) on the same GPU 2080 Ti, for these network resolutions 320, 416, 512, 608?

Here are the results:

| Size | FPS (avg) |
| --- | --- |
| 320 | 100.6 |
| 416 | 82.5 |
| 512 | 69.7 |
| 608 | 53.6 |

@AlexeyAB (Owner) commented May 11, 2020

@mive93 Hi, Thanks!

So, tkDNN accelerates yolov4 ~2x for batch=1 and 3x-4x for batch=4.

| Size | Darknet FPS (avg) | tkDNN TensorRT FP32 FPS | tkDNN TensorRT FP16 FPS | tkDNN TensorRT FP16 batch=4 FPS | Speedup |
| --- | --- | --- | --- | --- | --- |
| 320 | 100.6 | 116 | 202 | 423 | 4.2x |
| 416 | 82.5 | 103 | 162 | 284 | 3.5x |
| 512 | 69.7 | 91 | 134 | 206 | 2.9x |
| 608 | 53.6 | 62 | 100 | 150 | 2.8x |

We submitted to a conference and we run experiments in terms of mAP, latency and power consumption. As soon as it's accepted we also plan to share the raw data.

When will the conference be?

Moreover, in the eval branch I added export to a json, to evaluate the performance of each network/precision on codalab.

It would be great if in the future you could get accuracy identical to Darknet's.

@AlexeyAB (Owner) commented:
@mive93 Hi,

We use a new mish implementation and get +3% FPS with the same AP detection accuracy on MS COCO test-dev:

```cpp
__device__ float mish_yashas(float x)
{
    float e = __expf(x);
    if (x <= -18.0f)
        return x * e;
    float n = e * e + 2 * e;
    if (x <= -5.0f)
        return x * __fdividef(n, n + 2);
    return x - 2 * __fdividef(x, n + 2);
}
```

More: #5452 (comment)

So you can try to use this implementation in tkDNN.

@lazerliu commented:
Excuse me @mive93,
how did you get this? What command did you use? Thanks!

###############################################################################
#		DARKNET 416x416 CODALAB res COCO2017 VAL			      #
###############################################################################

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.471
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.710
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.510
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.278
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.525
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.636
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.357
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.561
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.587
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.382
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.642
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.772
Done (t=120.59s)

@mive93 (Author) commented May 14, 2020

Hi,
@AlexeyAB The conference will be in September, and it covers not only tkDNN but also other things (we test 5 different CNNs on 3 embedded boards with 3 different frameworks). We are also considering an arXiv paper only on tkDNN performance. When we have time, we'll probably do that. I will keep you updated if you're interested.

For the new mish function, I can include that. Thank you :)

@lazerliu I obtained those results using codalab. You first have to generate a json (in COCO format) of the detections, then submit it on the site. More info is given in this repo's wiki.
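
For reference, each entry in that json follows the standard COCO detection-results format, roughly like this (the values are purely illustrative; bbox is [x, y, width, height] in pixels):

```json
[
  {"image_id": 397133, "category_id": 1,  "bbox": [388.7, 69.9, 109.4, 277.6], "score": 0.91},
  {"image_id": 397133, "category_id": 18, "bbox": [10.5, 22.0, 60.3, 45.8],   "score": 0.35}
]
```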

@AlexeyAB (Owner) commented:
@lazerliu Please create a new topic.

@lazerliu commented:
@mive93, I just want to validate my dataset, thank you!
@AlexeyAB thanks for your reply; I've created issue #5615 about getting all the COCO mAP metrics using this repository.

@YashasSamaga commented May 24, 2020
Results for OpenCV DNN @ master (https://github.com/opencv/opencv/tree/6b0fff72d9748345c6a079e4fce49af4130d8e12):

Device: RTX 2080 Ti

| Input Size | FP32 FPS | FP16 FPS | FP32 batch=4 FPS | FP16 batch=4 FPS |
| --- | --- | --- | --- | --- |
| 320 x 320 | 129.2 | 171.2 | 198 | 384 |
| 416 x 416 | 99.9 | 146 | 139.6 | 260.5 |
| 512 x 512 | 90.3 | 125.6 | 112.8 | 190.5 |
| 608 x 608 | 56 | 103.2 | 68.5 | 133 |

Code: https://gist.github.com/YashasSamaga/48bdb167303e10f4d07b754888ddbdcf

  • average of 100 runs

  • OpenCV DNN executes all operations in FP16 when in FP16 mode

  • NMS is not included in the timings

  • 680 x 680 doesn't load in OpenCV DNN; it reports an inconsistent shape at some layer

There are currently two open PRs which affect YOLOv4 performance. Performance will mostly improve by around 5-10%.


The timings often change slightly every time the benchmark program is run. Here is the raw output from the benchmark code:


1 x 3 x 608 x 608:

YOLO v4
[CUDA FP32]
	init >> 463.515ms
	inference >> min = 16.964ms, max = 21.649ms, mean = 17.8347ms, stddev = 1.25985ms
[CUDA FP16]
	init >> 311.645ms
	inference >> min = 9.644ms, max = 9.867ms, mean = 9.69076ms, stddev = 0.0379731ms

4 x 3 x 608 x 608:

[CUDA FP32]
	init >> 625.919ms
	inference >> min = 57.811ms, max = 59.368ms, mean = 58.4264ms, stddev = 0.270633ms
[CUDA FP16]
	init >> 523.272ms
	inference >> min = 29.902ms, max = 31.423ms, mean = 30.0806ms, stddev = 0.16901ms

1 x 3 x 512 x 512:

YOLO v4
[CUDA FP32]
	init >> 432.214ms
	inference >> min = 10.87ms, max = 13.608ms, mean = 11.0792ms, stddev = 0.418999ms
[CUDA FP16]
	init >> 318.978ms
	inference >> min = 7.934ms, max = 8.003ms, mean = 7.96052ms, stddev = 0.0138107ms

4 x 3 x 512 x 512

YOLO v4
[CUDA FP32]
	init >> 551.452ms
	inference >> min = 34.908ms, max = 41.57ms, mean = 35.4624ms, stddev = 0.886297ms
[CUDA FP16]
	init >> 508.225ms
	inference >> min = 20.864ms, max = 21.621ms, mean = 21.0014ms, stddev = 0.111174ms

1 x 3 x 416 x 416:

YOLO v4
[CUDA FP32]
	init >> 379.083ms
	inference >> min = 9.701ms, max = 12.643ms, mean = 10.0155ms, stddev = 0.679755ms
[CUDA FP16]
	init >> 248.296ms
	inference >> min = 6.825ms, max = 6.91ms, mean = 6.85503ms, stddev = 0.0195312ms

4 x 3 x 416 x 416

YOLO v4
[CUDA FP32]
	init >> 462.255ms
	inference >> min = 28.082ms, max = 32.272ms, mean = 28.6683ms, stddev = 0.87224ms
[CUDA FP16]
	init >> 386.791ms
	inference >> min = 15.25ms, max = 18.449ms, mean = 15.3566ms, stddev = 0.317417ms

1 x 3 x 320 x 320:

YOLO v4
[CUDA FP32]
	init >> 377.244ms
	inference >> min = 7.506ms, max = 9.768ms, mean = 7.73995ms, stddev = 0.557712ms
[CUDA FP16]
	init >> 250.421ms
	inference >> min = 5.826ms, max = 5.879ms, mean = 5.84173ms, stddev = 0.00956832ms

4 x 3 x 320 x 320:

YOLO v4
[CUDA FP32]
	init >> 418.565ms
	inference >> min = 19.726ms, max = 24.998ms, mean = 20.1585ms, stddev = 0.504587ms
[CUDA FP16]
	init >> 336.484ms
	inference >> min = 10.383ms, max = 10.5ms, mean = 10.4267ms, stddev = 0.0210358ms

@AlexeyAB (Owner) commented:
@YashasSamaga

Can you add a column for the FPS you get using Darknet (GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1) with this command:

```
./darknet detector demo cfg/coco.data cfg/yolov4.cfg yolov4.weights test.mp4 -ext_output -dont_show -benchmark
```

680 x 680 doesn't load on OpenCV DNN. It says inconsistent shape at some layer.

Why did you try 680x680? It should be a multiple of 32.

@AlexeyAB (Owner) commented May 24, 2020

So, tkDNN accelerates yolov4 ~2x for batch=1 and 3x-4x for batch=4.
OpenCV-dnn is ~10% slower than tkDNN-TensorRT.
tkDNN: https://github.com/ceccocats/tkDNN
OpenCV: https://gist.github.com/YashasSamaga/48bdb167303e10f4d07b754888ddbdcf

| Size | Darknet FPS (avg) | tkDNN TensorRT FP32 FPS | tkDNN TensorRT FP16 FPS | OpenCV FP16 FPS | tkDNN TensorRT FP16 batch=4 FPS | OpenCV FP16 batch=4 FPS | tkDNN Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 320 | 100 | 116 | 202 | 171 | 423 | 384 | 4.2x |
| 416 | 82 | 103 | 162 | 146 | 284 | 260 | 3.5x |
| 512 | 69 | 91 | 134 | 125 | 206 | 190 | 2.9x |
| 608 | 53 | 62 | 103 | 100 | 150 | 133 | 2.8x |

@YashasSamaga commented May 27, 2020

I forgot to mention that I set nms_threshold=0 in all [yolo] blocks of the configuration file. Otherwise, NMS is done automatically in the region layers.

RTX 2070S
608 x 608

without setting nms_threshold (opencv defaults to 0.2):

YOLO v4
[CUDA FP32]
init >> 1329.51ms
inference >> min = 45.596ms, max = 49.184ms, mean = 46.7278ms, stddev = 0.57918ms
[CUDA FP16]
init >> 865.449ms
inference >> min = 37.418ms, max = 43.093ms, mean = 39.4826ms, stddev = 1.24976ms

with nms_threshold=0 in all [yolo] blocks:

YOLO v4
[CUDA FP32]
        init >> 1245.76ms
        inference >> min = 29.934ms, max = 31.181ms, mean = 30.3622ms, stddev = 0.207436ms
[CUDA FP16]
        init >> 876.087ms
        inference >> min = 22.916ms, max = 28.212ms, mean = 24.5076ms, stddev = 1.09143ms

I have written an example which performs one full NMS (not class-wise) at the end, instead of performing it three times during inference (which causes unnecessary context switches, as NMS is performed on the CPU). This barely changes the FPS.
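
A minimal sketch of what such a final, class-agnostic NMS pass could look like with OpenCV's public API, assuming boxes and scores have already been gathered from all three output heads (an illustration, not the exact code of the example mentioned above):

```cpp
#include <opencv2/dnn.hpp>
#include <vector>

// One global NMS pass after inference, instead of per-[yolo]-layer NMS.
std::vector<int> globalNms(const std::vector<cv::Rect>& boxes,
                           const std::vector<float>& scores,
                           float confThresh, float nmsThresh)
{
    std::vector<int> keep;
    cv::dnn::NMSBoxes(boxes, scores, confThresh, nmsThresh, keep);
    return keep;   // indices of the boxes that survive suppression
}
```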

@AlexeyAB (Owner) commented:
@YashasSamaga Do you think we should request such an improvement, with a switchable option, in OpenCV? That is, to use

  • either a separate NMS for each yolo detection layer,
  • or a single NMS for all yolo detection layers.

@YashasSamaga commented:
I have always wondered about the benefits of performing NMS in each yolo detection layer. Is there any advantage compared with doing one combined NMS at the end?

Doing the NMS at the end will definitely help the performance of the OpenCV CUDA backend currently, but I don't know how things will change once GPU NMS kernels are added (some work is in progress for the DetectionOutput layer at opencv/opencv#17301).

I think the best place such a thing could be introduced is DetectionModel, which is part of the high-level model API recently introduced in OpenCV DNN.

@AlexeyAB (Owner) commented:
I have always wondered about the benefits of performing NMS in each yolo detection layer. Is there any advantage of doing so compared with doing one combined NMS at the end?

I think not.
Darknet uses one NMS for all yolo layers.

@YashasSamaga commented May 28, 2020

I did a bit of investigation. The YOLOv2 PR added NMS in the region layer because there was only one region layer back then. The YOLOv3 PR reused the region layer, but this led to NMS being performed in each region layer. I think it's a bug which I thought was a feature all this time.

I have opened an issue opencv/opencv#17415

@mive93 (Author) commented Jun 4, 2020

Hi @YashasSamaga, thank you for profiling OpenCV-dnn and comparing it with tkDNN as well :)

In the last few days we have released a new version of tkDNN, with a darknet parser, the new mish, and batch handling for pre- and post-processing too. But I haven't profiled it seriously yet. If you're interested, I can do it soonish.

@mive93 (Author) commented Jun 12, 2020

Hi :)
On the tkDNN README you can now find the performance of Yolov4 on different boards.
Here's a screenshot of the table:

[image]

@YashasSamaga commented Jun 19, 2020

Dataset and the list of images taken from How to evaluate accuracy and speed of YOLOv4.

Darknet as of e08a818, with the original yolov4.cfg.

Darknet (FP32)
GTX 1050

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.435
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.657
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.473
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.267
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.467
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.342
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.549
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.580
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.403
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.713
Done (t=487.90s)

The number of detections was considerably smaller in OpenCV. I eventually figured out that OpenCV was discarding detections with low confidence scores, so I added thresh=0.001 to all [yolo] blocks in yolov4.cfg. The number of detections from Darknet and OpenCV still isn't the same, but it is closer than before. I suspect the difference is caused by the NMS method (OpenCV does globally optimal NMS).
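
Concretely, the change amounts to adding one line to every [yolo] block of yolov4.cfg (a sketch; the existing keys stay as they are):

```
[yolo]
# ... existing keys (mask, anchors, classes, num, ...) unchanged ...
# keep low-confidence detections so the evaluator can sweep thresholds:
thresh=0.001
# optionally, as in the earlier benchmarks, disable OpenCV's per-layer NMS:
nms_threshold=0
```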

Code: https://gist.github.com/YashasSamaga/077a1d69c48e4cdb9957d167b7000b98

OpenCV DNN CUDA (FP32)
RTX 2080 Ti

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.436
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.657
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.474
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.267
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.467
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.342
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.549
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.580
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.405
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.715
Done (t=329.38s)
OpenCV DNN CUDA (FP16)
RTX 2080 Ti

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.435
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.657
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.473
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.267
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.467
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.532
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.342
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.549
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.580
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.404
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.714
Done (t=325.70s)

The numbers for OpenCV are better than Darknet's. I think it's because of the NMS, but I wanted to rule out variations arising from the different convolution kernels selected on different devices (the Darknet stats were generated on a GTX 1050, while the OpenCV stats were generated on an RTX 2080 Ti).

OpenCV FP32 on GTX 1050
overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.436
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.657
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.474
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.267
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.467
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.342
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.549
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.580
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.405
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.715
Done (t=328.92s)

@YashasSamaga commented:
If I do not set thresh=0.001 in all [yolo] blocks, 0.2 is used as the confidence threshold. There is considerable performance degradation when using 0.2:

OpenCV CUDA FP16
RTX 2080 Ti
thresh = 0.2 (opencv default)

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.400
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.583
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.445
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.231
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.431
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.500
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.313
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.461
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.471
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.275
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.505
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.604
Done (t=117.94s)

I wonder if this default behaviour in OpenCV is correct.

@AlexeyAB (Owner) commented:
@YashasSamaga This is normal: for detection you should use an optimal conf-threshold of 0.2 - 0.25, while AP calculation should be done over every possible conf-threshold starting from 0.001.

@gachiemchiep commented:
Hello @AlexeyAB.
I tried your model on OpenCV DNN and got the same results as reported.
The output is a list of arrays with the following shapes:

batch_size x 17328 x 85
batch_size x 4332 x 85
batch_size x 1083 x 85

I understand that 85 corresponds to [center_x, center_y, width, height, box_confidence, class_1_score, ...].
COCO has 80 classes, so 4 + 1 + 80 = 85.
But what do 17328, 4332 and 1083 stand for? Would you mind giving me a quick hint?
Thanks.

@WongKinYiu (Collaborator) commented:
They are grid_width * grid_height * masks rows of (classes + coordinates + objectness),
so 1083 * 85 = 19 * 19 * 3 * (80 + 4 + 1).

@gachiemchiep commented Jun 24, 2020
@WongKinYiu But there are 3 of them: 17328, 4332, 1083. Do you know the meaning of the other 2?

@WongKinYiu (Collaborator) commented:
There are three yolo layers (feature pyramid):
17328 * 85 = 76 * 76 * 3 * (80 + 4 + 1)
4332 * 85 = 38 * 38 * 3 * (80 + 4 + 1)
1083 * 85 = 19 * 19 * 3 * (80 + 4 + 1)
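
For reference, a tiny sketch that reproduces those row counts from the input size and the three output strides (the stride values are the standard YOLOv4 ones; the program itself is illustrative):

```cpp
#include <cstdio>

int main()
{
    const int classes = 80, coords = 4, objectness = 1, masks = 3;
    const int input = 608;                    // network input resolution
    const int strides[3] = {8, 16, 32};       // three pyramid levels
    for (int s : strides) {
        int g = input / s;                    // grid width == grid height
        std::printf("%2d x %2d grid -> %5d x %d\n",
                    g, g, g * g * masks, classes + coords + objectness);
    }   // prints 17328, 4332 and 1083 rows of 85 values each
}
```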

@gachiemchiep commented Jun 25, 2020
@WongKinYiu Thank you for your kindness. It helped me a lot. I checked the YOLOv3 and FPN papers and found the explanation of the feature pyramid.

@YashasSamaga commented Jul 2, 2020
@mive93 Do tkDNN benchmarks include the host/device memory transfer time?

I was looking at the tkDNN source and, if I have understood correctly, the input is copied from the host to the device. The input on the device is then copied to TRT's device buffer, inference is done, and the outputs in TRT's buffer are copied to non-TRT output buffers. The outputs are then copied to the host. The time reported by tkDNN is the time it takes to copy from a device buffer to TRT's buffers and vice versa, plus the inference time. Is this correct?

@mive93 (Author) commented Jul 2, 2020
Hi @YashasSamaga,
What you said is correct. And yes, the tkDNN benchmarks I reported include only inference; preprocessing and postprocessing are left out.

@AlexeyAB (Owner) commented Jul 2, 2020
@mive93 Do you use overlapping in 3 threads/streams?

  1. pre-processing (CPU -> GPU)
  2. inference on GPU
  3. post-processing (GPU -> CPU)

Yes, it reduces latency.
But does it reduce FPS?
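
For reference, a minimal sketch of the 3-stage overlap being asked about, using std::async as a stand-in for real threads/streams (the stage functions are hypothetical placeholders, not tkDNN's API):

```cpp
#include <future>
#include <vector>

struct Frame {}; struct Tensor {}; struct Dets {};   // placeholder types

Tensor preprocess(Frame)  { return {}; }   // resize + CPU->GPU copy
Dets   infer(Tensor)      { return {}; }   // inference on the GPU
void   postprocess(Dets)  {}               // GPU->CPU copy + NMS

void run(const std::vector<Frame>& frames)
{
    if (frames.empty()) return;
    std::future<void> post;                  // post-processing of frame i-1
    std::future<Tensor> pre =                // pre-processing of frame 0
        std::async(std::launch::async, preprocess, frames[0]);

    for (size_t i = 0; i < frames.size(); ++i) {
        Tensor t = pre.get();                // wait for frame i's input
        if (i + 1 < frames.size())           // overlap frame i+1's pre-processing
            pre = std::async(std::launch::async, preprocess, frames[i + 1]);
        Dets d = infer(t);                   // frame i runs on the GPU
        if (post.valid()) post.get();        // make sure frame i-1 is done
        post = std::async(std::launch::async, postprocess, d);
    }
    if (post.valid()) post.get();
}
```

At steady state the three stages of consecutive frames run concurrently, which is why per-frame latency grows while throughput (FPS) need not drop.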

@YashasSamaga commented Jul 2, 2020
Btw, the OpenCV benchmarks I reported include the GPU-CPU transfer times. They total up to 1.1ms on an RTX 2080 Ti (0.3ms for the input, 0.53ms for output1, 0.25ms for output2 and 0.03ms for output3) for single-image inference with pinned host memory. If this extra time is deducted from the OpenCV timings I reported, I think OpenCV is faster than tkDNN on the RTX 2080 Ti for single-image inference.

OpenCV master (as of today) takes 9.5ms for single-image inference (inclusive of the 1.1ms) and tkDNN takes 9.0ms. Subtracting 1.1ms gives ~8.4ms for OpenCV, but tkDNN also makes a device-to-device copy during inference which OpenCV doesn't; D2D copies are much faster (probably negligible next to 1.1ms) than H2D or D2H copies.

Anyway, OpenCV and tkDNN are close enough that any benchmark will depend on these minute details, so it's not meaningful to compare numbers this close to each other.

@AlexeyAB (Owner) commented Jul 2, 2020
If the 3 operations are overlapped, they increase the latency but do not affect the FPS.

@AlexeyAB (Owner) commented:
@mive93 Hi,

Have you increased FPS beyond your table from a month ago? #5354 (comment)

Can you show the actual FPS for the RTX 2080 Ti and yolov4.cfg at 320-608, if the FPS has increased?

@mive93 (Author) commented Aug 6, 2020
Hi @AlexeyAB,

I'm sorry, I didn't notice the notification with your question.
No, the timings have not changed. If they do, I will let you know :)

However, I had time to check the problem with the mAP, and I finally understood why we had that accuracy drop.
Basically, it is due to 3 things:

  • an error (shame on us) caused by casting: at a certain point we were casting boxes to int, which caused a drop of 0.03 points in mAP;
  • a different resize function: we use the OpenCV one (linear), while you use your own custom one. It seems like nothing, but it changes the outcome a bit. However, the OpenCV one is faster, and given that it does not affect results that much, we'll keep it for now. Maybe in the future we will optimize yours on GPU;
  • a different batchnorm implementation: we use the TensorRT one, you use your own (the standard inference formula is sketched after this list). Again, it's not a big thing, but the output of the network is a bit different with an epsilon of 0.0001 (not with a smaller one). When using a threshold of 0.001 to compute the mAP, this of course affects the results.
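
For reference, a minimal sketch of the batch-norm inference formula whose epsilon the last point refers to (assuming the usual per-channel affine form; the names are illustrative):

```cpp
#include <cmath>

// y = gamma * (x - mean) / sqrt(var + eps) + beta
// Two implementations that agree for tiny eps can diverge slightly at
// eps = 1e-4, enough to move low-confidence detections across a 0.001
// evaluation threshold.
inline float batchNormInference(float x, float mean, float var,
                                float gamma, float beta, float eps)
{
    return gamma * (x - mean) / std::sqrt(var + eps) + beta;
}
```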

Hereafter, the new results on codalab for COCO val2017, with thresholds 0.001 and 0.3.

*******************************************************************************
darknet yolov4 416x416 t=0.001
*******************************************************************************
overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.471
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.710
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.510
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.278
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.525
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.636
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.357
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.561
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.587
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.382
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.642
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.772
*******************************************************************************
tkDNN yolov4 416x416 t=0.001 
*******************************************************************************
overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.468
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.705
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.506
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.274
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.522
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.633
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.356
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.561
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.587
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.381
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.642
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.773
*******************************************************************************
darknet yolov4 416x416 t=0.3
*******************************************************************************
overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.424
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.610
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.472
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.219
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.481
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.597
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.324
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.469
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.474
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.242
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.535
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.671
*******************************************************************************
tkDNN yolov4 416x416 t=0.3 
*******************************************************************************
overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.424
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.609
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.472
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.219
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.481
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.598
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.324
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.469
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.475
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.242
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.535
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.670

Here the results for other networks: https://github.com/ceccocats/tkDNN#map-results

@arnaud-nt2i commented:
@mive93 @YashasSamaga
The big difference, though, is that tkDNN is Linux-only 😭 😿 😢 while OpenCV works for us W10 users as well 👍

@clydebailey commented:
interested
