Regarding mAP and latency of Yolov4 #5354

mive93 · 2020-04-27T07:45:41Z

Dear Alexey,

first of all, thank you for your work.
I have been doing some tests with your new yolov4, and I have some questions.
I compared the performance of Yolov4, Yolov3 and CSPResNext50-Panet-SPP (the one I found also in your repo) on two different GPUs, using input size 416x416, and I have checked the mAP for the COCO2017 validation set.

Here are the results (both FPS and mAP have been computed using your code):

GeForce RTX 2080 Ti Rev. A (while training going on, so maybe perfromance are a bit degrated)
--------------------------------------------------------
			FPS	mAP(.50, val COCO2017)
YOLOV3			39.0	66.16 % 
YOLOV4			38.8	70.22 %
CSPRESNEXT50-SPP-PANET	37.8	75.88 %
--------------------------------------------------------
GeForce GTX 1060 6GB
--------------------------------------------------------
			FPS 	FPS(CUDNN_HALF = 1 )
YOLOV3			31.7 	30.8
YOLOV4			29.9 	29.0
CSPRESNEXT50-SPP-PANET	28.6    28.7      
--------------------------------------------------------

However, I have noticed that you do not compare with the third network in your paper. I was wondering which was the reason, and if I am doing, maybe, something wrong.

Thank you in advance.

The text was updated successfully, but these errors were encountered:

WongKinYiu · 2020-04-27T09:10:41Z

Hello,

We use trainvalno5k set for training, there are some images in val set are trained.
So CSPResNeXt50-PANet-SPP gets higher AP50 on val set may because that it is more fit to the training data.

The comparison of YOLOv4 (CSPDarknet53-PANet-SPP, BoF-backbone, Mish, optimal setting) and CSPResNeXt50-PANet-SPP are in Table 6.

We choose CSPDarknet53 as backbone of YOLOv4 since it gets both higher FPS and AP.

mive93 · 2020-04-27T11:54:24Z

Dear @WongKinYiu,

Thank you for your answer.
I see why the mAP could be better, however I'm not experiencing that better results in FPS.
I have tried again on the 2080Ti, with the GPU unloaded, and this is what I get:

Size 512x512
GeForce RTX 2080 Ti Rev. A 
--------------------------------------------------------
			FPS	
YOLOV3			55.5	
YOLOV4			60.5
CSRESNEXT50-SPP-PANET	59.6

Therefore again, I don't see a big improvement in Yolov4.
Again, maybe it's my fault, I would just like to understand why I do not obtain your improvement.

mive93 · 2020-04-27T12:02:19Z

And another thing, sorry I forgot, you said you use both training and validation for training, but you meant for CSPResNeXt50-PANet-SPP or for Yolov4?

Thanks again.

AlexeyAB · 2020-04-27T12:25:18Z

@mive93

using input size 416x416

Check that you use [net] width=416 height=416 in botch cfg-files
What command did you use for checking AP and FPS?
Can you show screenshot from https://competitions.codalab.org where do you get 0.75 AP50 on val2017?
Can you show screenshot with FPS?

mive93 · 2020-04-27T15:32:28Z

Dear @AlexeyAB,

Yes, I am sure the size were correct.

These are the commands I run to get the AVG FPS

./darknet detector demo cfg/coco.data  cfg/csresnext50-panet-spp.cfg  weights/csresnext50-panet-spp_final.weights  ../tkDNN/demo/yolo_test.mp4 -dont_show -ext_output
./darknet detector demo cfg/coco.data  cfg/yolov4.cfg  weights/yolov4.weights  ../tkDNN/demo/yolo_test.mp4 -dont_show -ext_output
./darknet detector demo cfg/coco.data  cfg/yolov3.cfg  weights/yolov3.weights  ../tkDNN/demo/yolo_test.mp4 -dont_show -ext_output

here the commands to obtain the detections for codalab

./darknet  detector valid cfg/coco.data cfg/csresnext50-panet-spp.cfg  weights/csresnext50-panet-spp_final.weights
./darknet  detector valid cfg/coco.data cfg/yolov4.cfg  weights/yolov4.weights
./darknet  detector valid cfg/coco.data cfg/yolov3.cfg  weights/yolov3.weights

In this folder you can find all the screenshots: https://cloud.hipert.unimore.it/s/g7KZNnytki5gExE

I summarize here the results from codalab on val2017

###############################################################################
#		YOLOV3 416x416 CODALAB res COCO2017 VAL       		      #
###############################################################################
overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.380
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.675
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.391
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.227
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.418
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.534
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.304
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.474
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.497
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.330
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.537
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.656
Done (t=81.34s)

GeForce RTX 2080 Ti Rev. A FPS: 75.5

###############################################################################
#		YOLOV4 416x416 CODALAB res COCO2017 VAL			      #
###############################################################################

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.471
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.710
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.510
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.278
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.525
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.636
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.357
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.561
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.587
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.382
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.642
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.772
Done (t=120.59s)

GeForce RTX 2080 Ti Rev. A FPS: 71.7

###############################################################################
#		CSPRESNEXT50-PANET-SPP 416x416 CODALAB res COCO2017 VAL       #
###############################################################################

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.497
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.766
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.535
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.269
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.549
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.708
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.363
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.559
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.583
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.376
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.637
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.776
Done (t=78.67s)

GeForce RTX 2080 Ti Rev. A FPS: 70.1

AlexeyAB · 2020-04-27T20:55:25Z

Attach your val2017.list with list of coco2017 validation images
Does it contain all or only some images from http://images.cocodataset.org/zips/val2017.zip ?
Post an URL that you used for valiation on codalab

AlexeyAB · 2020-04-27T21:32:01Z

Oh, you are about old model csresnext50-panet-spp.cfg not about csresnext50-panet-spp-original-optimal.cfg .

Yes, it seems csresnext50-panet-spp.cfg was trained by using trainvalno5k.list + 5k.list (may be), while csresnext50-panet-spp-original-optimal.cfg and yolov4.cfg were trained by using only trainvalno5k.list without 5k.list

I get this results on 5k.list:

yolov4.cfg - 416x416 - val2017:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.471
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.710
csresnext50-panet-spp-original-optimal.cfg - 416x416 - val2017:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.457
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.693
csresnext50-panet-spp.cfg - 416x416 - val2017:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.497
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.766

By the way, I can't submit your json-files, so I just tested these models by myself again.

WongKinYiu · 2020-04-28T00:50:38Z

@AlexeyAB @mive93

I will test and update FPS on Turing architecture GPU in few days.

If use old CSPResNeXt50-PANet-SPP, you will get higher AP on 416x416 due to the anchor setting.
#5311 (comment)

mive93 · 2020-04-28T07:58:08Z

Dear @AlexeyAB and @WongKinYiu.
Sorry for my late answer. I have uploaded on the folder both my json detections and the txt list: https://cloud.hipert.unimore.it/s/g7KZNnytki5gExE. To test everything on codalab I have just follower your wiki (just using COCOval2017 instead of testdev-2017).

However, if you say that you trained that network using also those data, it makes a lot of sense that the mAP is higher, even though it's not fair. So yeah, I assume Yolov4 accuracy is better then :)

Thank you for your quick answers, and thank you for clarifying my doubts. I will wait for the FPS results then.

mive93 · 2020-04-29T10:27:20Z

Dear @AlexeyAB,

yesterday I have ported your Yolov4 on tensorRT using tkDNN, a framework developed by @ceccocats, @sapienzadavide and I (you can find it here).

Some performance results on 2 boards, a discrete and an embedded one. The outputs match with yours, so the mAP is the same.

AVG FPS over 5000 images, input size 416x416.

AGX Xavier		
	FPS - FP32	FPS - FP16
yolov3	19,47		49,62
yolov4	17,52		32,67

RTX 2080 Ti		
	FPS - FP32	FPS - FP16
yolov3	106,30		192,13
yolov4	93,00		133,41

AlexeyAB · 2020-04-29T11:43:54Z

@mive93 Hi,
Thanks!

The outputs match with yours, so the mAP is the same.

Does it match even for FP16?
Is FP32 == FP32, while FP16 == Mixed Precision FP16+FP32 on TensorCores?
Did you test it with batch=1?
What network resolution did you use?
What is the advantage of tkDNN over TensorRT, and for what tkDNN is used if TensorRT is used for inference/quantization?
Is there some comparison table with FPS for different models/resolutions/float-precisions? https://github.com/ceccocats/tkDNN

Can you test YOLOv4 on RTX2080Ti (or preferably on Tesla V100) for 4 network resolutions with batch=1 and batch=4?

320
416
512
608

mive93 · 2020-04-29T20:10:48Z

Dear @AlexeyAB ,

Sorry for the delay.

These are the results of the mAP computed on codalab.

###############################################################################
#		DARKNET 416x416 CODALAB res COCO2017 VAL			      #
###############################################################################

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.471
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.710
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.510
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.278
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.525
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.636
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.357
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.561
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.587
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.382
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.642
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.772
Done (t=120.59s)


###############################################################################
		TKDNN YOLOV4 FP32 416x416 CODALAB res COCO2017 VAL			      
###############################################################################

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.449
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.701
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.481
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.235
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.507
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.626
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.343
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.556
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.329
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.618
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.758
Done (t=73.61s)

###############################################################################
		TKDNN YOLOV4 FP16 416x416 CODALAB res COCO2017 VAL			      
###############################################################################

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.449
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.701
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.481
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.235
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.507
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.626
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.343
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.555
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.330
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.758
Done (t=72.04s)

Actually there is a small loss, I guess due to different implementation of the operations. I have tried to investigate more, but I couldn't find another source of mismatch. However, FP16 has the same mAP as FP32.

FP32 is full precision, FP16 is half precision for tensorRT. Then with tkDNN is also possible to infer at INT8 (always int8 tensorRT quantization) and using DLA. However, given that in these kinds of networks a lot of layers are implemented via plugins and not native TensorRT APIs, the performance with DLA or INT8 could be degraded.
Yes, I did the tests with batch 1
I was using 416x416
TkDNN uses tensorRT. We just tried to achieve the best performances exploiting it, while not depending on deepstream for example. We also compared our performance with deepstream (to be precise last autumn), and we perform better.

In this table batch=1 is considered and the same video was used to compute the avg fps. Darknet is the one by Redmon. The considered board here is the NVIDIA Tx2.
Performance results for other networks are not yet available, but we're submitting a paper next week. Then we will make them public. Even though anyone could reproduce them already.
Here there are the results of the tests you asked for:

FPS on RTX 2080Ti of Yolov4 TkDNN (avg over 1200 img of size 640 x 480)				
		FP32 - BATCH=1	FP32 - BATCH=4	FP16 - BATCH=1	FP16 - BATCH=4
yolov4 320	116,99		58,29		204,99		105,82
yolov4 416	116,27		40,68		194,64		71,08
yolov4 512	91,31		32,97		137,85		51,51
yolov4 608	62,04		20,27		109,01		37,60

AlexeyAB · 2020-04-29T22:59:29Z

@mive93 Thanks!

FPS on RTX 2080Ti of Yolov4 TkDNN (avg over 1200 img of size 640 x 480)				
		FP32 - BATCH=1	FP32 - BATCH=4	FP16 - BATCH=1	FP16 - BATCH=4
yolov4 320	116,99		58,29		204,99		105,82
yolov4 416	116,27		40,68		194,64		71,08
yolov4 512	91,31		32,97		137,85		51,51
yolov4 608	62,04		20,27		109,01		37,60

Does 37.60 FPS for batch=4 actually mean that tkDNN process 37.6 x 4 = 150,4 FPS for YOLOv4 width=608 height=608 batch_size=4 FP16 on RTX 2080 Ti?
Usually high batch size increases FPS.
Do you measure just inference time, or do you measure full cycle fps? Just pre(resizing) and post(NMS) processing execute in separate CPU-threads asynchronously, therefore, do not reduce FPS?

Actually there is a small loss, I guess due to different implementation of the operations. I have tried to investigate more, but I couldn't find another source of mismatch. However, FP16 has the same mAP as FP32.

Do you use resizing before inference without keeping aspect ratio? This repo https://github.com/AlexeyAB/darknet doesn't keep aspect ratio (i.e. by default letter_box=0), while https://github.com/pjreddie/darknet keeps aspect ratio.
Just do cv::resize(src, dst, Size(608,608)); without keeping aspect ratio
Resizing : keeping aspect ratio, or not #232 (comment)

Also what NMS implementation do you use? (is it regular NMS or soft-NMS)

darknet/src/box.c

Lines 812 to 844 in 36c73c5

    
           void do_nms_sort(detection *dets, int total, int classes, float thresh) 
        
           { 
        
               int i, j, k; 
        
               k = total - 1; 
        
               for (i = 0; i <= k; ++i) { 
        
                   if (dets[i].objectness == 0) { 
        
                       detection swap = dets[i]; 
        
                       dets[i] = dets[k]; 
        
                       dets[k] = swap; 
        
                       --k; 
        
                       --i; 
        
                   } 
        
               } 
        
               total = k + 1; 
        
               for (k = 0; k < classes; ++k) { 
        
                   for (i = 0; i < total; ++i) { 
        
                       dets[i].sort_class = k; 
        
                   } 
        
                   qsort(dets, total, sizeof(detection), nms_comparator_v3); 
        
                   for (i = 0; i < total; ++i) { 
        
                       //printf("  k = %d, \t i = %d \n", k, i); 
        
                       if (dets[i].prob[k] == 0) continue; 
        
                       box a = dets[i].bbox; 
        
                       for (j = i + 1; j < total; ++j) { 
        
                           box b = dets[j].bbox; 
        
                           if (box_iou(a, b) > thresh) { 
        
                               dets[j].prob[k] = 0; 
        
                           } 
        
                       } 
        
                   } 
        
               } 
        
           }

Did you implement scale_x_y= in the [yolo] layer? It is very simple addition:

darknet/src/yolo_layer.c

Lines 336 to 337 in 36c73c5

    
           activate_array(l.output + index, 2 * l.w*l.h, LOGISTIC);        // x,y, 
        
           scal_add_cpu(2 * l.w*l.h, l.scale_x_y, -0.5*(l.scale_x_y - 1), l.output + index, 1);    // scale x,y

Feature-request: State-of-art Yolo v4 Detector opencv/opencv#17148

There are 3 different scale_x_y= values

darknet/cfg/yolov4.cfg

Line 973 in f14054e

scale_x_y = 1.2
darknet/cfg/yolov4.cfg

Line 1060 in f14054e

scale_x_y = 1.1
darknet/cfg/yolov4.cfg

Line 1148 in f14054e

scale_x_y = 1.05

mive93 · 2020-04-30T09:30:49Z

Hi @AlexeyAB,

Yes, it is correct.
You are absolutely right, it's not a 100% fair comparison, totally my bad. There were two mistakes, for batch=1 I was considering end-to-end latencies and 1200 images of size 640 x 480. For batching I was only considering inference (just because we still haven't integrating batching pre and post processing) with 1200 images of size = to the network input size.I have perfomed the tests again, more fairly, on 1200 images of size = to the network input size and only interence time.

FPS on RTX 2080Ti of Yolov4 TkDNN (avg over 1200 img of size = to network inputsize)				
		FP32 - BATCH=1	FP32 - BATCH=4	FP16 - BATCH=1	FP16 - BATCH=4
yolov4 320	116,56		233,16		202,02		423,29
yolov4 416	103,54		162,71		162,50		284,34
yolov4 512	91,63		131,90		134,94		206,04
yolov4 608	62,34		81,06		100,81		150,41

Moreover, hereafter some statistics about preprocessing-inference-postprocessing on yolov4-416.

				pre(%)		inference(%)	post(%)
RTX2080Ti	yolov4 FP32	10,22		79,60		10,18
RTX2080Ti	yolov4 FP16	17,27		68,28		14,45
				
AGX Xavier	yolov4 FP32	2,58		95,36		2,06
AGX Xavier	yolov4 FP16	4,83		91,47		3,70

We already do so.
We use the NM you reported.
Yes we have implemented the scale_xy for yolov4.

AlexeyAB · 2020-04-30T15:14:54Z

@mive93 Hi,

Performance results for other networks are not yet available, but we're submitting a paper next week. Then we will make them public. Even though anyone could reproduce them already.

Will you publish paper on arxiv.org with AP / FPS or only FPS comparison of different models?

Do you use only FP32 (without Tensor Cores) and FP16 (with Tensor Cores), but don't use FP32/16 (Mixed-precision with Tensor Cores), because FP16 shows the same good accuracy?
Will you add manual how to measure AP / AP50 and FPS by using TkDNN+TensorRT?
Will you add demo on video-file that check FPS including inference+pre+post_processing which are runing in 3 CPU-threads, and can use both batch=1 and batch=4 ?
And shows detection results to the console and optionally shows video in the window (can be switched off, because can reduce FPS).
Did you compare inference time with batch=1 for tkDNN vs OpenCV-dnn? Feature-request: State-of-art Yolo v4 Detector opencv/opencv#17148
Do you use the same Mish-implementation as in the Darknet?

darknet/src/activation_kernels.cu

Line 238 in f14054e

output_gpu[i] = x_val * tanh_activate_kernel( softplus_kernel(x_val, MISH_THRESHOLD) );

float softplus(float x, float threshold = 20) {
    if (x > threshold) return x;                // too large
    else if (x < -threshold) return expf(x);    // too small
    return logf(expf(x) + 1);
}

float mish_activation(float input) {
    const float MISH_THRESHOLD = 20;
    output = input * tanh( softplus(input, MISH_THRESHOLD) );
    return output;
}

AlexeyAB · 2020-05-04T18:38:16Z

@mive93

Can you also test AVG_FPS for YOLOv4 on the Darknet (OpenCV + CUDA + cuDNN) on the same GPU 2080 Ti, for these network resolutions 320, 416, 512, 608?

By using such command:
./darknet detector demo cfg/coco.data cfg/yolov4.cfg weights/yolov4.weights yolo_test.mp4 -dont_show -ext_output

mive93 · 2020-05-04T18:48:41Z

Hi @AlexeyAB,

I am sorry for the delay but I have work-related deadline for this week.
I will be back to you after those, trying to address your requests :)
Sorry for now

mive93 · 2020-05-11T10:35:52Z

Hi @AlexeyAB, sorry for the long delay.

Will you publish paper on arxiv.org with AP / FPS or only FPS comparison of different models?

We submitted to a conference and we run experiments in terms of mAP, latency and power consumption. As soon as it's accepted we also plan to share the raw data.

Do you use only FP32 (without Tensor Cores) and FP16 (with Tensor Cores), but don't use FP32/16 (Mixed-precision with Tensor Cores), because FP16 shows the same good accuracy?

We use FP32 (without Tensor Cores) and FP32/16 (Mixed-precision with Tensor Cores) [as well as FP32/INT8], because plugins are always at FP32.

Will you add manual how to measure AP / AP50 and FPS by using TkDNN+TensorRT?

In the master there is already a demo that computes the mAP for each method supported by tkDNN. However, it is a bit different from your, because bounding boxes are accounted only one time, with the highest probability. In the README is explained how to compute the mAP, each precision is supported.
Moreover, in the branch eval I added the export to a json, to evaluate the performance of each netowork/precision on codalab.

Will you add demo on video-file that check FPS including inference+pre+post_processing which are runing in 3 CPU-threads, and can use both batch=1 and batch=4 ?
And shows detection results to the console and optionally shows video in the window (can be switched off, because can reduce FPS).

I will work on the demo with batch > 1 this week, will keep you updated when I'll have something working.

Did you compare inference time with batch=1 for tkDNN vs OpenCV-dnn? opencv/opencv#17148

Never heard of that, will take a look, thanks.

Do you use the same Mish-implementation as in the Darknet?

Yes

Can you also test AVG_FPS for YOLOv4 on the Darknet (OpenCV + CUDA + cuDNN) on the same GPU 2080 Ti, for these network resolutions 320, 416, 512, 608?

Here the results:

Size	FPS (avg)
320	100.6
416	82.5
512	69.7
608	53.6

AlexeyAB · 2020-05-11T13:27:15Z

@mive93 Hi, Thanks!

So, tkDNN accelerates yolov4 ~2x for batch=1 and 3x-4x for batch=4.

Size	Darknet FPS (avg)	tkDNN TensorRT FP32 FPS	tkDNN TensorRT FP16 FPS	tkDNN TensorRT FP16 batch=4 FPS	Speedup
320	100.6	116	202	423	4.2x
416	82.5	103	162	284	3.5x
512	69.7	91	134	206	2.9x
608	53.6	62	100	150	2.8x

We submitted to a conference and we run experiments in terms of mAP, latency and power consumption. As soon as it's accepted we also plan to share the raw data.

When will the conference be?

Moreover, in the branch eval I added the export to a json, to evaluate the performance of each netowork/precision on codalab.

It would be great if you could get identical accuracy in the future like in Darknet.

AlexeyAB · 2020-05-13T22:47:50Z

@mive93 Hi,

We use new mish-implementation, and get +3% FPS with the same AP-detection accuracy on MSCOCO testdev:

darknet/src/activation_kernels.cu

Lines 235 to 246 in bef2844

    
           __device__ float mish_yashas(float x) 
        
           { 
        
               float e = __expf(x); 
        
               if (x <= -18.0f) 
        
                   return x * e; 
        
               float n = e * e + 2 * e; 
        
               if (x <= -5.0f) 
        
                   return x * __fdividef(n, n + 2); 
        
               return x - 2 * __fdividef(x, n + 2); 
        
           }

More: #5452 (comment)

So you can try to use this implementation in tkDNN.

lazerliu · 2020-05-14T15:08:43Z

excuse me ,@mive93
how can you get this,what cmd did you use?thks!

###############################################################################
#		DARKNET 416x416 CODALAB res COCO2017 VAL			      #
###############################################################################

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.471
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.710
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.510
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.278
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.525
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.636
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.357
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.561
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.587
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.382
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.642
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.772
Done (t=120.59s)

mive93 · 2020-05-14T15:25:43Z

Hi
@AlexeyAB The conference will be in September, and it covers not only tkDNN but also other stuff (we test 5 different CNN on 3 embedded boards with 3 different frameworks). We are considering also having an arxive only on tkDNN performances. When we'll have time, we'll probably do that. I will keep you updated if you're interested.

For the new mish function, I can include that. Thank you :)

@lazerliu I obtained those results using codalab. You have first to generate a json (in COCO format) of the detection, then submit it in the site. More info are given also in this repo wiki.

AlexeyAB · 2020-05-14T15:42:13Z

@lazerliu Create new topic.

lazerliu · 2020-05-14T15:55:40Z

@mive93 ,I just want to valid my dataset,thank [you!]
@AlexeyAB thanks for your reply,I've created an issue #5615 about geting all the metrics of COCO mAP using this repository.

YashasSamaga · 2020-05-24T17:27:38Z

Results for OpenCV DNN @ master (https://github.com/opencv/opencv/tree/6b0fff72d9748345c6a079e4fce49af4130d8e12):

Device: RTX 2080 Ti

Input Size	FP32 FPS	FP16 FPS	FP32 batch = 4	FP16 batch = 4
320 x 320	129.2	171.2	198	384
416 x 416	99.9	146	139.6	260.5
512 x 512	90.3	125.6	112.8	190.5
608 x 608	56	103.2	68.5	133

Code: https://gist.github.com/YashasSamaga/48bdb167303e10f4d07b754888ddbdcf

average of 100 runs
OpenCV DNN executes all operations in FP16 in FP16 mode
NMS is not included in the timings
680 x 680 doesn't load on OpenCV DNN. It says inconsistent shape at some layer.

There are currently two open PRs which affect YOLOv4 performance. The performance will mostly improve by around 5-10%.

The timings often change slightly every time the benchmark program is run. Here is the raw output from the benchmark code:

CLICK ME

1 x 3 x 608 x 608:

YOLO v4
[CUDA FP32]
	init >> 463.515ms
	inference >> min = 16.964ms, max = 21.649ms, mean = 17.8347ms, stddev = 1.25985ms
[CUDA FP16]
	init >> 311.645ms
	inference >> min = 9.644ms, max = 9.867ms, mean = 9.69076ms, stddev = 0.0379731ms

4 x 3 x 608 x 608:

[CUDA FP32]
	init >> 625.919ms
	inference >> min = 57.811ms, max = 59.368ms, mean = 58.4264ms, stddev = 0.270633ms
[CUDA FP16]
	init >> 523.272ms
	inference >> min = 29.902ms, max = 31.423ms, mean = 30.0806ms, stddev = 0.16901ms

1 x 3 x 512 x 512:

YOLO v4
[CUDA FP32]
	init >> 432.214ms
	inference >> min = 10.87ms, max = 13.608ms, mean = 11.0792ms, stddev = 0.418999ms
[CUDA FP16]
	init >> 318.978ms
	inference >> min = 7.934ms, max = 8.003ms, mean = 7.96052ms, stddev = 0.0138107ms

4 x 3 x 512 x 512

YOLO v4
[CUDA FP32]
	init >> 551.452ms
	inference >> min = 34.908ms, max = 41.57ms, mean = 35.4624ms, stddev = 0.886297ms
[CUDA FP16]
	init >> 508.225ms
	inference >> min = 20.864ms, max = 21.621ms, mean = 21.0014ms, stddev = 0.111174ms

1 x 3 x 416 x 416:

YOLO v4
[CUDA FP32]
	init >> 379.083ms
	inference >> min = 9.701ms, max = 12.643ms, mean = 10.0155ms, stddev = 0.679755ms
[CUDA FP16]
	init >> 248.296ms
	inference >> min = 6.825ms, max = 6.91ms, mean = 6.85503ms, stddev = 0.0195312ms

4 x 3 x 416 x 416

YOLO v4
[CUDA FP32]
	init >> 462.255ms
	inference >> min = 28.082ms, max = 32.272ms, mean = 28.6683ms, stddev = 0.87224ms
[CUDA FP16]
	init >> 386.791ms
	inference >> min = 15.25ms, max = 18.449ms, mean = 15.3566ms, stddev = 0.317417ms

1 x 3 x 320 x 320:

YOLO v4
[CUDA FP32]
	init >> 377.244ms
	inference >> min = 7.506ms, max = 9.768ms, mean = 7.73995ms, stddev = 0.557712ms
[CUDA FP16]
	init >> 250.421ms
	inference >> min = 5.826ms, max = 5.879ms, mean = 5.84173ms, stddev = 0.00956832ms

4 x 3 x 320 x 320:

YOLO v4
[CUDA FP32]
	init >> 418.565ms
	inference >> min = 19.726ms, max = 24.998ms, mean = 20.1585ms, stddev = 0.504587ms
[CUDA FP16]
	init >> 336.484ms
	inference >> min = 10.383ms, max = 10.5ms, mean = 10.4267ms, stddev = 0.0210358ms

AlexeyAB · 2020-05-24T18:24:58Z

@YashasSamaga

Can you add column, what FPS can you get by using Darknet GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 by using command
./darknet detector demo cfg/coco.data cfg/yolov4.cfg yolov4.weights test.mp4 -ext_output -dont_show -benchmark

680 x 680 doesn't load on OpenCV DNN. It says inconsistent shape at some layer.

Why did you try 680x680? It should be multiple of 32.

AlexeyAB · 2020-05-24T19:34:52Z

So, tkDNN accelerates yolov4 ~2x for batch=1 and 3x-4x for batch=4.
OpenCV-dnn is ~10% slower than tkDNN-TensorRT.
tkDNN: https://github.com/ceccocats/tkDNN
OpenCV: https://gist.github.com/YashasSamaga/48bdb167303e10f4d07b754888ddbdcf

Size	Darknet FPS (avg)	tkDNN TensorRT FP32 FPS	tkDNN TensorRT FP16 FPS	OpenCV FP16 FPS	tkDNN TensorRT FP16 batch=4 FPS	OpenCV FP16 batch=4 FPS	tkDNN Speedup
320	100	116	202	171	423	384	4.2x
416	82	103	162	146	284	260	3.5x
512	69	91	134	125	206	190	2.9x
608	53	62	103	100	150	133	2.8x

YashasSamaga · 2020-05-27T04:39:04Z

I forgot to mention that I had set nms_threshold=0 in all [yolo] blocks in the configuration file. Otherwise, the NMS is done automatically in region layers.

RTX 2070S
608 x 608

without setting nms_threshold (opencv defaults to 0.2):

YOLO v4
[CUDA FP32]
init >> 1329.51ms
inference >> min = 45.596ms, max = 49.184ms, mean = 46.7278ms, stddev = 0.57918ms
[CUDA FP16]
init >> 865.449ms
inference >> min = 37.418ms, max = 43.093ms, mean = 39.4826ms, stddev = 1.24976ms

with nms_threshold=0 in all [yolo] blocks:

YOLO v4
[CUDA FP32]
        init >> 1245.76ms
        inference >> min = 29.934ms, max = 31.181ms, mean = 30.3622ms, stddev = 0.207436ms
[CUDA FP16]
        init >> 876.087ms
        inference >> min = 22.916ms, max = 28.212ms, mean = 24.5076ms, stddev = 1.09143ms

I have written an example which performs full NMS (not classwise) at the end instead of performing it three times during inference (which causes unnecessary context switches as NMS is performed on CPU). This barely changes the FPS.

AlexeyAB · 2020-05-27T11:11:05Z

@YashasSamaga Do you think we should request such improvement and switchable ability in OpenCV? to use

either separate NMS for each yolo detection-layer
or single NMS for all yolo detection-layers

YashasSamaga · 2020-05-28T16:42:30Z

I have always wondered about the benefits of performing NMS in each yolo detection layer. Is there any advantage of doing so compared with doing one combined NMS at the end?

Doing the NMS at the end will definitely help improve performance of the OpenCV CUDA backend currently but I don't know how things will change once GPU NMS kernels are added (some work is in progress for DetectionOutput layer at opencv/opencv#17301).

I think the best place such a thing could be introduced is in DetectionModel which is a part of the high-level model API that was recently introduced in OpenCV DNN.

AlexeyAB · 2020-05-28T17:36:55Z

I have always wondered about the benefits of performing NMS in each yolo detection layer. Is there any advantage of doing so compared with doing one combined NMS at the end?

I think no.
Darknet uses 1 NMS for all yolo-layers.

YashasSamaga · 2020-05-28T18:15:08Z

I did a bit of investigation. YOLOv2 PR added NMS in region layer because there was only one region layer back then. YOLOv3 PR reused the region layer but this led to NMS being performed in each region layer. I think it's a bug which I thought was a feature all this time.

I have opened an issue opencv/opencv#17415

mive93 · 2020-06-04T20:32:38Z

Hi @YashasSamaga, thank you for profiling OpenCV-dnn and comparing it with tkDNN also :)

In the last days we have released a new version of tkDNN, with also a darknet parser, the new mish, and the handling of the batches also for pre-post processing. But I haven't profiled it seriously yet. If interested, I can do it soonish.

mive93 · 2020-06-12T09:42:44Z

Hi :)
On the README of tkDNN you can now find the performance of Yolov4 on different boards.
Here's the screenshot of the table

YashasSamaga · 2020-06-19T12:34:30Z

Dataset and the list of images taken from How to evaluate accuracy and speed of YOLOv4.

Darknet as of e08a818 and original yolov4.cfg

Darknet (FP32)
GTX 1050

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.435
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.657
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.473
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.267
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.467
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.342
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.549
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.580
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.403
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.713
Done (t=487.90s)

The number of detections were considerably smaller in OpenCV. I eventually figured that OpenCV was ditching detections with low confidence scores. So I added thresh=0.001 to all [yolo] blocks in yolov4.cfg. The number of detections from Darknet and OpenCV still isn't same but they are better than before. I suspect the difference is caused by NMS method (OpenCV does global optimal NMS).

Code: https://gist.github.com/YashasSamaga/077a1d69c48e4cdb9957d167b7000b98

OpenCV DNN CUDA (FP32)
RTX 2080 Ti

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.436
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.657
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.474
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.267
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.467
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.342
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.549
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.580
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.405
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.715
Done (t=329.38s)

OpenCV DNN CUDA (FP16)
RTX 2080 Ti

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.435
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.657
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.473
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.267
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.467
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.532
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.342
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.549
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.580
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.404
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.714
Done (t=325.70s)

The numbers for OpenCV are better than Darknet. I think it's because of the NMS but I wanted to rule out the possibility of variations arising due to different choices made while selecting convolution kernels on different devices (Darknet stats were generated on GTX 1050 while OpenCV stats were generated on RTX 2080 Ti).

OpenCV FP32 on GTX 1050

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.436
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.657
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.474
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.267
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.467
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.342
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.549
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.580
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.405
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.617
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.715
Done (t=328.92s)

YashasSamaga · 2020-06-19T12:42:02Z

If I do not set thresh=0.001 in all [yolo] blocks, 0.2 is used as the confidence threshold. There is considerable performance degradation with using 0.2:

OpenCV CUDA FP16
RTX 2080 Ti
thresh = 0.2 (opencv default)

overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.400
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.583
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.445
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.231
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.431
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.500
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.313
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.461
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.471
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.275
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.505
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.604
Done (t=117.94s)

I wonder if this default behaviour in OpenCV is correct.

AlexeyAB · 2020-06-19T12:46:02Z

@YashasSamaga This is normal, since for Detection you should use optimal conf-thresh 0.2 - 0.25, while AP calculation should be done for each possible conf-thresh starting from 0.001.

gachiemchiep · 2020-06-24T10:04:46Z

Hello @AlexeyAB.
I try your model on OpenCV DNN and got the same result as reported.
The output is a list of array which have format as :

batch_size x 17328 x 85
batch_size x 4332 x 85
batch_size x 1083 x 85

I understand that 85 is equal to [center_x, center_y, width, height, box_confidence, class_1_score, .... ].
Coco has 80 classes so it is 4 + 1 + 80 = 85.
But what is 17328, 4332, 1083 is stand for? Would you mind give me a quick hint for this?
Thanks.

WongKinYiu · 2020-06-24T10:12:27Z

they are grid_width * grid_height * masks * (classes+coordinates+objectness)
so 1083 * 85 = 19 * 19 * 3 * (80 + 4 + 1).

gachiemchiep · 2020-06-24T10:16:06Z

@WongKinYiu But there're 3 of them : 17328, 4332, 1083 . Do you know the meaning of other 2?

WongKinYiu · 2020-06-24T12:15:25Z

there are three yolo layers (feature pyramid).
17328* 85 = 76 * 76 * 3 * (80 + 4 + 1)
4332 * 85 = 38 * 38 * 3 * (80 + 4 + 1)
1083 * 85 = 19 * 19 * 3 * (80 + 4 + 1)

gachiemchiep · 2020-06-25T01:38:11Z

@WongKinYiu Thank you for your kindness. It help me a lot. I checked the yolov3 and FPN paper and found the explanation about the feature pyramid.

YashasSamaga · 2020-07-02T08:33:39Z

@mive93 Do tkDNN benchmarks include the host/device memory transfer time?

I was looking at tkDNN source and if I have understood correctly, the input is copied from the host to device. The input on the device is then copied to TRT's device buffer, inference is done, the outputs in TRT's buffer are copied to non-TRT output buffers. The outputs are then copied to the host. The time reported by tkDNN is the time it took for copying from a device buffer to TRT's buffers and vice-versa and the inference time. Is this correct?

mive93 · 2020-07-02T08:43:11Z

Hi @YashasSamaga,
What you said is correct. And yes, the tkDNN benchmarks I reported include only inference, preprocessing and postprocessing are left out.

AlexeyAB · 2020-07-02T16:22:05Z

@mive93 Do you use overlapping in 3 thread/steams?

pre-processing (CPU -> GPU)
inference on GPU
3 post-processing (GPU->CPU)

Yes, it reduces latency.
But does it reduce FPS?

YashasSamaga · 2020-07-02T16:37:22Z

Btw OpenCV benchmarks that I reported had included the GPU-CPU transfer times. They total up to 1.1ms on RTX 2080 Ti (0.3ms for input, 0.53ms for output1, 0.25ms for output2 and 0.03ms for output3) for single image inference with pinned host memory. If this extra time is deducted from the OpenCV timings I reported, I think OpenCV is faster than tkDNN on RTX 2080 Ti for single image inference.

OpenCV master (as of today) takes 9.5ms for single image inference (inclusive of the 1.1ms) and tkDNN takes 9.0ms. Subtracting 1.1ms gives ~8.4ms for OpenCV but tkDNN is also making a device to device copy during inference which OpenCV doesn't but D2D copies are much faster (probably very very small compared to 1.1ms) than H2D or D2H copies.

Anyway, OpenCV and tkDNN are close enough that any benchmark will depend on these minute details. So it's not meaningful to compare with numbers very close to each other.

AlexeyAB · 2020-07-02T16:48:21Z

If 3 operations are overlaped, then they increase the latency, but do not affect the FPS.

AlexeyAB · 2020-07-11T00:45:36Z

@mive93 Hi,

Have you increased FPS more than in your table a month ago? #5354 (comment)

Can you show actual FPS for RTX 2080Ti and yolov4.cfg 320-608 if FPS was increased?

mive93 · 2020-08-06T06:49:38Z

Hi @AlexeyAB,

I'm sorry, I didn't notice the notification with your question.
No, the times are not changed. In case, I will let you know :)

However, I had time to check the problem with the mAP and I finally understood why we had that accuracy drop.
Basically, it is due to 3 things:

an error (shame on us) caused by casting. At a certain point, we were casting boxes to int (that caused a drop of 0.03 point in mAP )
a different resize function used. We use the opencv one (linear), you use your custom one. It seems nothing, but it changes a bit of the outcome. However, the one by opencv is faster, and given that it does not affect that much, we'll keep that one for now. Maybe in the future, we will optimize yours on GPU.
a different batchnorm implementation. We use the one by tensorRT, you use your own one. Again, it's not that big thing, but the output of the network is a bit different if we use an epsilon of 0.0001 (not if smaller). However, when using a threshold of 0.001 to compute the mAP, ofc it affects the results.

Hereafter the new results on codalab for COCO val 2017, for threshold 0.001 and 0.3.

*******************************************************************************
darknet yolov4 416x416 t=0.001
*******************************************************************************
overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.471
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.710
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.510
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.278
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.525
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.636
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.357
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.561
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.587
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.382
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.642
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.772
*******************************************************************************
tkDNN yolov4 416x416 t=0.001 
*******************************************************************************
overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.468
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.705
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.506
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.274
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.522
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.633
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.356
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.561
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.587
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.381
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.642
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.773
*******************************************************************************
darknet yolov4 416x416 t=0.3
*******************************************************************************
overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.424
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.610
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.472
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.219
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.481
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.597
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.324
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.469
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.474
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.242
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.535
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.671
*******************************************************************************
tkDNN yolov4 416x416 t=0.3 
*******************************************************************************
overall performance
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.424
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.609
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.472
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.219
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.481
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.598
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.324
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.469
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.475
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.242
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.535
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.670

Here the results for other networks: https://github.com/ceccocats/tkDNN#map-results

arnaud-nt2i · 2020-09-11T13:20:54Z

@mive93 @YashasSamaga
The big difference tho is that tkDNN is for Linux only 😭 😿 😢 while OpenCV is for us W10 users as well 👍

clydebailey · 2022-06-11T21:53:33Z

interested

AlexeyAB mentioned this issue Apr 29, 2020

Best way to use YOLO on Jetson Xavier with max FPS #5386

Open

AlexeyAB added the Explanations Explanations of the source code, algorithms or method of use label Apr 29, 2020

AlexeyAB mentioned this issue May 7, 2020

EfficientDet: Scalable and Efficient Object Detection - 51.0% mAP@0.5...0.95 COCO #4346

Open

georgetseng1128 mentioned this issue May 18, 2020

About performance of FPS with yolov3 and yolov3-tiny on Xavier ceccocats/tkDNN#12

Closed

AlexeyAB mentioned this issue May 25, 2020

Why are most models either light or heavy? #5729

Open

ersheng-ai mentioned this issue Jul 30, 2020

Did you check accuracy of YOLOv4 by using your implementation on MSCOCO? Tianxiaomo/pytorch-YOLOv4#190

Open

This was referenced Aug 6, 2020

tkDNN and YoloV4(AlexeyAB) are not the same ceccocats/tkDNN#103

Closed

mAPs of some networks are much lower than expected. ceccocats/tkDNN#100

Closed

mive93 mentioned this issue Jan 22, 2021

My tests show that tkDNN is not faster than Darknet, why? ceccocats/tkDNN#186

Closed

Regarding mAP and latency of Yolov4 #5354

Regarding mAP and latency of Yolov4 #5354

Comments

mive93 commented Apr 27, 2020 • edited by AlexeyAB Loading

WongKinYiu commented Apr 27, 2020

mive93 commented Apr 27, 2020 • edited Loading

mive93 commented Apr 27, 2020

AlexeyAB commented Apr 27, 2020 • edited Loading

mive93 commented Apr 27, 2020 • edited Loading

AlexeyAB commented Apr 27, 2020

AlexeyAB commented Apr 27, 2020 • edited Loading

WongKinYiu commented Apr 28, 2020

mive93 commented Apr 28, 2020

mive93 commented Apr 29, 2020 • edited Loading

AlexeyAB commented Apr 29, 2020 • edited Loading

mive93 commented Apr 29, 2020

AlexeyAB commented Apr 29, 2020 • edited Loading

mive93 commented Apr 30, 2020

AlexeyAB commented Apr 30, 2020 • edited Loading

AlexeyAB commented May 4, 2020

mive93 commented May 4, 2020

mive93 commented May 11, 2020 • edited by AlexeyAB Loading

AlexeyAB commented May 11, 2020 • edited Loading

AlexeyAB commented May 13, 2020

lazerliu commented May 14, 2020

mive93 commented May 14, 2020

AlexeyAB commented May 14, 2020

lazerliu commented May 14, 2020

YashasSamaga commented May 24, 2020 • edited Loading

AlexeyAB commented May 24, 2020

AlexeyAB commented May 24, 2020 • edited Loading

YashasSamaga commented May 27, 2020 • edited Loading

AlexeyAB commented May 27, 2020

YashasSamaga commented May 28, 2020

AlexeyAB commented May 28, 2020

YashasSamaga commented May 28, 2020 • edited Loading

mive93 commented Jun 4, 2020

mive93 commented Jun 12, 2020

YashasSamaga commented Jun 19, 2020 • edited Loading

YashasSamaga commented Jun 19, 2020

AlexeyAB commented Jun 19, 2020

gachiemchiep commented Jun 24, 2020

WongKinYiu commented Jun 24, 2020

gachiemchiep commented Jun 24, 2020 • edited Loading

WongKinYiu commented Jun 24, 2020

gachiemchiep commented Jun 25, 2020 • edited Loading

YashasSamaga commented Jul 2, 2020 • edited Loading

mive93 commented Jul 2, 2020

AlexeyAB commented Jul 2, 2020

YashasSamaga commented Jul 2, 2020 • edited Loading

AlexeyAB commented Jul 2, 2020

AlexeyAB commented Jul 11, 2020

mive93 commented Aug 6, 2020 • edited Loading

arnaud-nt2i commented Sep 11, 2020

clydebailey commented Jun 11, 2022

mive93 commented Apr 27, 2020 •

edited by AlexeyAB

Loading

mive93 commented Apr 27, 2020 •

edited

Loading

AlexeyAB commented Apr 27, 2020 •

edited

Loading

mive93 commented Apr 27, 2020 •

edited

Loading

AlexeyAB commented Apr 27, 2020 •

edited

Loading

mive93 commented Apr 29, 2020 •

edited

Loading

AlexeyAB commented Apr 29, 2020 •

edited

Loading

AlexeyAB commented Apr 29, 2020 •

edited

Loading

AlexeyAB commented Apr 30, 2020 •

edited

Loading

mive93 commented May 11, 2020 •

edited by AlexeyAB

Loading

AlexeyAB commented May 11, 2020 •

edited

Loading

YashasSamaga commented May 24, 2020 •

edited

Loading

AlexeyAB commented May 24, 2020 •

edited

Loading

YashasSamaga commented May 27, 2020 •

edited

Loading

YashasSamaga commented May 28, 2020 •

edited

Loading

YashasSamaga commented Jun 19, 2020 •

edited

Loading

gachiemchiep commented Jun 24, 2020 •

edited

Loading

gachiemchiep commented Jun 25, 2020 •

edited

Loading

YashasSamaga commented Jul 2, 2020 •

edited

Loading

YashasSamaga commented Jul 2, 2020 •

edited

Loading

mive93 commented Aug 6, 2020 •

edited

Loading