The Difference of AlexeyAB/Darknet and Pjreddie/Darknet #969

yyuzhongpv · 2018-06-03T05:30:31Z

Hello,

On my test machine (GTX 1080 GPU, CentOS 7, CUDA 9.0), I have both Pjreddie's and AlexeyAB's Darknet. I used the same dataset and config file to train detection models with each. With Pjreddie's darknet, I get good performance in training and testing. However, when I switched to AlexeyAB's darknet, using the same Makefile options and the same dataset, training seemed to go well and converged quickly, but when I used that model on my test images, the accuracy was very bad. What are the main differences between these two repos, and how can I debug this? I really want to use the CUDNN_HALF optimization from Alexey.
Thanks!

AlexeyAB · 2018-06-03T12:50:47Z

Hello,

CUDNN_HALF=1 gives a speedup without a drop in accuracy only on Volta GPUs (Titan V, Tesla V100, Quadro GV100, DGX-2, HGX-2, ...) and later.

What version of CUDNN do you use?
Do you get bad accuracy when you use CUDNN_HALF=1 on GTX 1080 GPU?
Do you get bad accuracy when CUDNN_HALF=0?

yyuzhongpv · 2018-06-03T14:04:34Z

Hello Alexey,

Thanks for your quick reply.

•What version of CUDNN do you use?
cudnn 7.0

•Do you get bad accuracy when you use CUDNN_HALF=1 on GTX 1080 GPU?
•Do you get bad accuracy when CUDNN_HALF=0?

In this test, I use a GTX 1080 (a physical Dell workstation). On both Darknet repos, I set CUDNN_HALF=0 and train with my own dataset (about 3000 images, each about 2000x1500).

With Pjreddie's darknet, after training for 50000 batches, I got more than 90% precision and recall in testing.
However, with AlexeyAB's darknet, the error decreases quickly during training and everything seems good, but when I validate the trained model with the same Python code, the accuracy is very bad, and it can't detect anything in most of the images.

I will try more tests with the command line and post the results later.

Thanks!

AlexeyAB · 2018-06-03T19:28:26Z

What is the date of your code from this repository?
What dataset do you use?
What model do you use?
What mAP do you get for the two weights files trained on the original repository and on this one, using CUDNN_HALF=0?

yyuzhongpv · 2018-06-03T21:05:54Z

Thanks Alexey!
•What is the date of your code from this repository?
I used the code from the beginning of May and forget the exact date. I will check it.
•What dataset do you use?
It is my own dataset, and the image size is about 2000x1500. I trained on this dataset with Yolov3 on both Pjreddie's darknet and AlexeyAB's darknet.
•What model do you use?
Yolov3.cfg
•What mAP do you get for the two weights files trained on the original repository and on this one, using CUDNN_HALF=0?
I have only computed precision and recall so far; I will post the details later.
In short, with the original darknet, both precision and recall are 90%+ on 6 classes of objects; with this repo, however, all of them are near 0. I'm sure something is wrong. I will train the model with the latest code in this repository.

AlexeyAB · 2018-06-03T21:33:14Z

Yes, something is wrong. Try training with the latest code.

Then compare the accuracy of the models trained on the original repo and on this repo by running this command in this repo:
./darknet detector map data/obj.data yolo-obj.cfg backup/yolo-obj_50000.weights

Just set valid=valid.txt or valid=train.txt in your obj.data file.

AlexeyAB · 2018-06-03T22:14:48Z

Also what network resolution width= and height= do you use?

IlyaOvodov · 2018-06-04T11:10:32Z

This may be caused by letterbox and non-letterbox image handling being mixed across the different detector modes, at least in this fork. Training and validation are done without letterbox (the image is simply resized to the network input size), but test is done with letterbox (the image is resized keeping its aspect ratio, and the margins are filled with uniform gray). In my case this gave bad visual results in "detector test" while train and validate showed very good statistics, until I found this "not a bug but a feature" :)
See:
https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L394
https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L492
https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L602
https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L646
https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L1102
https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L1112
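
To make the mismatch concrete, here is a minimal Python sketch of the two resizing geometries described above. The function names are hypothetical and the integer arithmetic is only an approximation of what the C code does; the 2000x1500 image size and the 416x416 network size are the values reported in this thread.

```python
def stretch_scale(img_w, img_h, net_w, net_h):
    """Plain resize: each axis gets its own scale, so the aspect ratio changes."""
    return net_w / img_w, net_h / img_h


def letterbox_geometry(img_w, img_h, net_w, net_h):
    """Letterbox: one scale for both axes, the rest of the input is padding.

    Returns (new_w, new_h, pad_x, pad_y) in network-input pixels.
    """
    # net_w / img_w < net_h / img_h, written without floating point
    if net_w * img_h < net_h * img_w:
        new_w, new_h = net_w, img_h * net_w // img_w   # width is the limiting side
    else:
        new_w, new_h = img_w * net_h // img_h, net_h   # height is the limiting side
    return new_w, new_h, (net_w - new_w) // 2, (net_h - new_h) // 2


# A 2000x1500 image fed to a 416x416 network:
sx, sy = stretch_scale(2000, 1500, 416, 416)      # different scale per axis
geometry = letterbox_geometry(2000, 1500, 416, 416)
print(sx, sy, geometry)                            # geometry is (416, 312, 0, 52)
```

With plain resize the vertical scale is larger than the horizontal one, so objects look taller to the network than they do with letterbox. A model trained on one geometry and tested on the other sees objects with a different shape and position than it learned.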

AlexeyAB · 2018-06-04T11:35:13Z

@yyuzhongpv @IlyaOvodov

I think there is a problem with non-square networks and an old commit that hardcoded network resizing to a random square size when random=1; it broke any training of a non-square network.
But the latest code should work.

Also, just try to comment out this line:

darknet/src/detector.c

Line 1102 in a7ddb20

image sized = letterbox_image(im, net.w, net.h); letterbox = 1;

and un-comment this:

darknet/src/detector.c

Line 1101 in a7ddb20

//image sized = resize_image(im, net.w, net.h);

IlyaOvodov · 2018-06-04T12:35:40Z

Yes, the problem with random=1 was a different one and it is fixed now. But the inconsistency in letterbox mode is still present. At the least, one has to comment/uncomment the lines above to make the "train", "valid" and "test" commands work in the same manner.

AlexeyAB · 2018-06-04T12:40:03Z

@IlyaOvodov I just fixed it. Now LETTERBOX_DATA is disabled everywhere by default.

yyuzhongpv · 2018-06-04T14:23:59Z

Thanks @AlexeyAB @IlyaOvodov
I use YoloV3 unchanged, so the width and height are 416x416.
I will try the latest code.

yyuzhongpv · 2018-06-04T18:32:05Z

Thanks @AlexeyAB @IlyaOvodov

Based on your comments, I also found that the problem occurs in darknet.py, which calls network_predict_image directly, and that function only uses letterbox_image.

float *network_predict_image(network *net, image im)
{
    image imr = letterbox_image(im, net->w, net->h);
    set_batch_network(net, 1);
    float *p = network_predict(*net, imr.data);
    free_image(imr);
    return p;
}

A quick question: is there any difference between letterbox and no letterbox if I keep training and testing consistent?

AlexeyAB · 2018-06-04T19:04:05Z

@yyuzhongpv There are pros and cons in each case: #232 (comment)

AlexeyAB · 2018-06-04T19:36:52Z

I disabled letter_box in darknet.py by default.

Also check some differences between the original and this repository that can affect your result: #529 (comment)

yyuzhongpv · 2018-06-09T22:23:32Z

Hello @AlexeyAB,

After testing for several days, I found some more interesting things.
I use the exact same dataset, which has about 3k grayscale images of size 2000x3000. The images contain some large objects (500x1000, <=4) and also small objects (50x100, dozens). The only model I tried is yolov3-voc.cfg.

I have tested on two machines: one is the workstation with the 1080 GPU, CUDA 9.0 and CUDNN 7.0.
The other is an Azure V100 VM with CUDA 9.1 and CUDNN 7.0.

Here are my results.

I previously worked with Pjreddie's darknet on the 1080 GPU. After 50000 iterations (batch=16, subdivisions=8), I can get very good precision and recall (90%) on almost all of the objects.
I wanted to optimize the processing time on the V100, so I switched to the V100 with both Pjreddie's darknet and AlexeyAB's repo. I updated both on 06/05/2018 and used exactly the same yolov3-voc.cfg, changing only batch=64 and subdivisions=16.

A. With Pjreddie's darknet, after training, I can't detect anything with the detector command. However, if I use the detector in AlexeyAB's repo, I can detect all of the objects.

B. With AlexeyAB's repo, after training for about 30000 iterations, I got the log below. When testing the latest weights with AlexeyAB's detector, I can only detect the large object. With Pjreddie's detector, I detect nothing.
'''
30376: nan, nan avg loss, 0.001000 rate, 2.594721 seconds, 1944064 images
Loaded: 0.000113 seconds
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.001509, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: 0.746136, Class: 0.887348, Obj: 0.495092, No Obj: 0.002121, .5R: 1.000000, .75R: 0.636364, count: 11
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 26
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000262, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: 0.867755, Class: 0.999704, Obj: 0.776341, No Obj: 0.002099, .5R: 1.000000, .75R: 1.000000, count: 8
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 52
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.001928, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: 0.880615, Class: 0.999778, Obj: 0.949572, No Obj: 0.002001, .5R: 1.000000, .75R: 1.000000, count: 6
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 41
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000134, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: 0.840572, Class: 0.999879, Obj: 0.972163, No Obj: 0.000742, .5R: 1.000000, .75R: 0.500000, count: 2
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 19
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000164, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: 0.842480, Class: 0.999470, Obj: 0.610240, No Obj: 0.001531, .5R: 1.000000, .75R: 1.000000, count: 5
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 35
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000080, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: 0.835363, Class: 0.999735, Obj: 0.821422, No Obj: 0.001978, .5R: 1.000000, .75R: 1.000000, count: 6
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 41
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000196, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: 0.849555, Class: 0.999705, Obj: 0.804512, No Obj: 0.002521, .5R: 1.000000, .75R: 0.857143, count: 7
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 45
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000034, .5R: -nan, .75R: -nan, count: 0
Region 94 Avg IOU: 0.774943, Class: 0.999629, Obj: 0.699336, No Obj: 0.000951, .5R: 1.000000, .75R: 0.333333, count: 3
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 25
'''

I have been stuck on this problem for several days. The questions I want to answer are:

Why do I get different results on the 1080 and V100 GPUs?
With Pjreddie/Darknet on the V100, the training process seems good. However, why does the detector in Pjreddie/Darknet detect nothing, while the detector in AlexeyAB/Darknet can detect objects?
With AlexeyAB/Darknet, why can it only detect one kind of large object?

I am really confused about the difference between these two repos, even with the exact same yolov3-voc.cfg.

Any suggestions are most welcome!
Thanks in advance!

AlexeyAB · 2018-06-09T23:17:01Z

@yyuzhongpv

I can get very good precision and recall (90%) on almost all of the objects.

The main question is: what implementation of the precision and recall calculation do you use? Most implementations are totally wrong.

A. With Pjreddie's darknet, after training, I can't detect anything with the detector command. However, if I use the detector in AlexeyAB's repo, I can detect all of the objects.

B. With AlexeyAB's repo, after training for about 30000 iterations, I got the log below. When testing the latest weights with AlexeyAB's detector, I can only detect the large object. With Pjreddie's detector, I detect nothing.
'''
30376: nan, nan avg loss, 0.001000 rate, 2.594721 seconds, 1944064 images

What mAP can you get in this case?
As I see it, avg loss is NaN, so training went wrong.
Since you get bad results on both repos, Joseph's and mine, I think you are doing something wrong, or your dataset is broken.
Attach your cfg-file.
What parameters did you use in the Makefile for both repositories?
In the last commits I added some fixes that reject bad labels or stop training if you use inconsistent labels and cfg-files, because ~80% of issues are due to an incorrect dataset.
Do you get the files bad_labels.list and bad.list after training, in the same directory as ./darknet?
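
As an independent sanity check of a dataset, a label file can also be scanned before training. The sketch below is only an assumption of what counts as a bad label: it uses the usual Darknet label layout of one `class x_center y_center width height` line per object, with coordinates relative to the image size. The function name `find_bad_label_lines` is hypothetical, and this is not the check implemented in the repository.

```python
def find_bad_label_lines(label_text, num_classes):
    """Return (line_number, reason) for every suspicious line in one label file."""
    bad = []
    for line_no, line in enumerate(label_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore empty lines
        fields = line.split()
        try:
            if len(fields) != 5:
                raise ValueError("expected 5 fields")
            class_id = int(fields[0])
            x, y, w, h = (float(v) for v in fields[1:])
        except ValueError as err:
            bad.append((line_no, str(err)))
            continue
        if not 0 <= class_id < num_classes:
            bad.append((line_no, "class id out of range"))
        elif not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            bad.append((line_no, "center outside image"))
        elif not (0.0 < w <= 1.0 and 0.0 < h <= 1.0):
            bad.append((line_no, "width/height outside (0, 1]"))
    return bad


# Hypothetical label file for a cfg with classes=6:
sample = "0 0.5 0.5 0.2 0.2\n6 0.5 0.5 0.2 0.2\n1 1.2 0.5 0.1 0.1\n"
print(find_bad_label_lines(sample, num_classes=6))
```

A class id equal to or larger than `classes` in the cfg is the kind of label/cfg inconsistency mentioned above.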

yyuzhongpv · 2018-06-10T03:42:13Z

Thanks Alexey!

The main question is: what implementation of the precision and recall calculation do you use? Most implementations are totally wrong.

We implemented the calculation of precision and recall ourselves. For each test image, I count the predicted bounding boxes whose IoU with a ground-truth box is larger than the threshold (0.5) to get TP, and I count the predicted bounding boxes that do not overlap any ground-truth box to get FP. Precision = (sum of TP over all test images) / (sum of TP over all test images + sum of FP over all test images). Recall is similar.
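
A minimal Python sketch of this kind of calculation is shown below. It is not the script used here; it assumes boxes are (x1, y1, x2, y2) tuples, matches each ground-truth box to at most one prediction, ignores class labels and confidence ordering, and counts every unmatched prediction as FP. All function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(predictions, ground_truth, iou_thresh=0.5):
    """Return (TP, FP, FN) for one image using greedy one-to-one matching."""
    unmatched = list(ground_truth)
    tp = fp = 0
    for pred in predictions:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) > iou_thresh:
            unmatched.remove(best)   # each ground-truth box is used only once
            tp += 1
        else:
            fp += 1
    return tp, fp, len(unmatched)    # leftover ground truth = missed objects


def precision_recall(per_image_counts):
    """Sum TP/FP/FN over all images, then compute precision and recall."""
    tp = sum(c[0] for c in per_image_counts)
    fp = sum(c[1] for c in per_image_counts)
    fn = sum(c[2] for c in per_image_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Note that this yields a single precision/recall pair at one detection threshold. The mAP reported by `detector map` is a different metric that averages precision over recall levels per class, so the two numbers are not directly comparable.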

On the 1080 GPU, I already checked the test output manually by drawing the predicted bounding boxes on the test images and going through them. They were all very close to the ground truth. So I assume the calculation of precision and recall is not a big problem.

The key problem is that with Joseph's darknet on the V100, after training, ./darknet detector test ... detects nothing in my test images.

This is the cfg file for both repos. I only made small changes to yolov3-voc.cfg.

diff /mnt/test/xxx_WS/yolo.cfg ../darknet-official0605/cfg/yolov3-voc.cfg 
3,4c3,4
< # batch=1
< # subdivisions=1
---
>  batch=1
>  subdivisions=1
6,7c6,7
< batch=64
< subdivisions=16
---
> # batch=64
> # subdivisions=16
605c605
< filters=33
---
> filters=75
611c611
< classes=6
---
> classes=20
689c689
< filters=33
---
> filters=75
695c695
< classes=6
---
> classes=20
773c773
< filters=33
---
> filters=75
779c779
< classes=6
---
> classes=20
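
The values in this diff can be cross-checked with a little arithmetic. The sketch below assumes the usual YOLOv3 convention of 3 anchors per [yolo] layer, so that the convolutional layer before each [yolo] layer needs (classes + 5) * 3 filters, and it assumes the "images" counter in the training log is iterations times batch on a single GPU. The function names are hypothetical.

```python
def yolo_conv_filters(classes, anchors_per_layer=3):
    """Filters of the conv layer feeding a [yolo] layer.

    Per anchor: 4 box coordinates + 1 objectness score + one score per class.
    """
    return (classes + 5) * anchors_per_layer


def images_per_forward_pass(batch, subdivisions):
    """Images actually sent through the network at once."""
    return batch // subdivisions


def images_seen(iterations, batch):
    """Total images processed after a number of iterations (single GPU assumed)."""
    return iterations * batch


print(yolo_conv_filters(6))              # 33, as in the modified cfg
print(yolo_conv_filters(20))             # 75, as in the stock yolov3-voc.cfg
print(images_per_forward_pass(64, 16))   # 4
print(images_seen(30376, 64))            # 1944064, matching the log line for iteration 30376
```

So the filters and classes edits in the cfg are consistent with each other for 6 classes.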

Makefile of Joseph's darknet. I only changed the options in the header.

GPU=1
CUDNN=1
OPENCV=1
OPENMP=1
DEBUG=0

ARCH= -gencode arch=compute_30,code=sm_30 \
      -gencode arch=compute_35,code=sm_35 \
      -gencode arch=compute_50,code=[sm_50,compute_50] \
      -gencode arch=compute_52,code=[sm_52,compute_52]
#      -gencode arch=compute_20,code=[sm_20,sm_21] \ This one is deprecated?

# This is what I use, uncomment if you know your arch and want to specify
#ARCH= -gencode arch=compute_52,code=compute_52

VPATH=./src/:./examples
SLIB=libdarknet.so
ALIB=libdarknet.a
EXEC=darknet
OBJDIR=./obj/

CC=gcc
NVCC=nvcc
AR=ar
ARFLAGS=rcs
OPTS=-Ofast
LDFLAGS= -lm -pthread
COMMON= -Iinclude/ -Isrc/
CFLAGS=-Wall -Wno-unused-result -Wno-unknown-pragmas -Wfatal-errors -fPIC

Training command:

/home/yyuzhong/darknet-official0605/darknet detector train /mnt/test/xxxx_WS/yolo.data /mnt/test/xxxx_WS/yolo.cfg darknet53.conv.74 -dont_show -gpus 0

The training log of Joseph's darknet:

273: 39.363007, 41.344463 avg, 0.000006 rate, 2.329178 seconds, 17472 images
Loaded: 0.000065 seconds
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.007486, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.305817, Class: 0.432831, Obj: 0.007301, No Obj: 0.003944, .5R: 0.200000, .75R: 0.100000,  count: 10
Region 106 Avg IOU: 0.282727, Class: 0.417738, Obj: 0.008525, No Obj: 0.002046, .5R: 0.120690, .75R: 0.000000,  count: 58
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.007615, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.309755, Class: 0.432035, Obj: 0.011104, No Obj: 0.004084, .5R: 0.100000, .75R: 0.000000,  count: 10
Region 106 Avg IOU: 0.298968, Class: 0.503030, Obj: 0.008877, No Obj: 0.002170, .5R: 0.155556, .75R: 0.000000,  count: 45
Region 82 Avg IOU: 0.326712, Class: 0.348199, Obj: 0.011348, No Obj: 0.007550, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.263263, Class: 0.438575, Obj: 0.002804, No Obj: 0.003932, .5R: 0.000000, .75R: 0.000000,  count: 6
Region 106 Avg IOU: 0.282571, Class: 0.449923, Obj: 0.007379, No Obj: 0.002098, .5R: 0.116667, .75R: 0.000000,  count: 60
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.007522, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.286575, Class: 0.441582, Obj: 0.005456, No Obj: 0.004115, .5R: 0.000000, .75R: 0.000000,  count: 6
Region 106 Avg IOU: 0.338090, Class: 0.500828, Obj: 0.007312, No Obj: 0.002256, .5R: 0.200000, .75R: 0.000000,  count: 30
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.007382, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.298540, Class: 0.380783, Obj: 0.008436, No Obj: 0.004021, .5R: 0.300000, .75R: 0.000000,  count: 10
Region 106 Avg IOU: 0.295573, Class: 0.482574, Obj: 0.007328, No Obj: 0.002103, .5R: 0.151515, .75R: 0.015152,  count: 66
Region 82 Avg IOU: 0.260993, Class: 0.461224, Obj: 0.011011, No Obj: 0.007552, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.185033, Class: 0.445341, Obj: 0.005748, No Obj: 0.004036, .5R: 0.000000, .75R: 0.000000,  count: 7
Region 106 Avg IOU: 0.319298, Class: 0.481619, Obj: 0.006593, No Obj: 0.002172, .5R: 0.186047, .75R: 0.000000,  count: 43
Region 82 Avg IOU: 0.145409, Class: 0.610898, Obj: 0.010052, No Obj: 0.007521, .5R: 0.000000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.216485, Class: 0.421778, Obj: 0.003331, No Obj: 0.004207, .5R: 0.000000, .75R: 0.000000,  count: 12
Region 106 Avg IOU: 0.347528, Class: 0.479033, Obj: 0.006581, No Obj: 0.002305, .5R: 0.121951, .75R: 0.024390,  count: 41
Region 82 Avg IOU: 0.173740, Class: 0.442087, Obj: 0.004729, No Obj: 0.007409, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.300622, Class: 0.421547, Obj: 0.005921, No Obj: 0.004037, .5R: 0.125000, .75R: 0.000000,  count: 8
Region 106 Avg IOU: 0.292677, Class: 0.426323, Obj: 0.005118, No Obj: 0.002244, .5R: 0.170213, .75R: 0.000000,  count: 47
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.007638, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.199439, Class: 0.424072, Obj: 0.004754, No Obj: 0.004125, .5R: 0.000000, .75R: 0.000000,  count: 8
Region 106 Avg IOU: 0.323963, Class: 0.436944, Obj: 0.006814, No Obj: 0.002223, .5R: 0.183673, .75R: 0.020408,  count: 49
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.007709, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.317504, Class: 0.444540, Obj: 0.006530, No Obj: 0.004090, .5R: 0.166667, .75R: 0.000000,  count: 12
Region 106 Avg IOU: 0.365333, Class: 0.453353, Obj: 0.007359, No Obj: 0.002256, .5R: 0.276596, .75R: 0.042553,  count: 47
Region 82 Avg IOU: 0.057973, Class: 0.675667, Obj: 0.016377, No Obj: 0.007599, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.381708, Class: 0.382659, Obj: 0.005441, No Obj: 0.004044, .5R: 0.333333, .75R: 0.000000,  count: 9
Region 106 Avg IOU: 0.297493, Class: 0.423628, Obj: 0.005643, No Obj: 0.002229, .5R: 0.120000, .75R: 0.020000,  count: 50
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.007638, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.309132, Class: 0.340190, Obj: 0.005039, No Obj: 0.003864, .5R: 0.142857, .75R: 0.000000,  count: 7
Region 106 Avg IOU: 0.276905, Class: 0.450940, Obj: 0.009095, No Obj: 0.002096, .5R: 0.078431, .75R: 0.019608,  count: 51
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.007480, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.321556, Class: 0.358655, Obj: 0.005864, No Obj: 0.004140, .5R: 0.285714, .75R: 0.000000,  count: 7
Region 106 Avg IOU: 0.339810, Class: 0.425059, Obj: 0.007491, No Obj: 0.002207, .5R: 0.170213, .75R: 0.000000,  count: 47
......

648: 7.944275, 8.417584 avg, 0.000176 rate, 4.588053 seconds, 41472 images
Loaded: 0.000081 seconds
Region 82 Avg IOU: 0.652460, Class: 0.996889, Obj: 0.455688, No Obj: 0.002242, .5R: 0.500000, .75R: 0.500000,  count: 2
Region 94 Avg IOU: 0.524300, Class: 0.878305, Obj: 0.438937, No Obj: 0.001154, .5R: 0.555556, .75R: 0.000000,  count: 9
Region 106 Avg IOU: 0.596807, Class: 0.874502, Obj: 0.726350, No Obj: 0.000996, .5R: 0.812500, .75R: 0.125000,  count: 32
Region 82 Avg IOU: 0.560426, Class: 0.998632, Obj: 0.661910, No Obj: 0.000585, .5R: 1.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.655807, Class: 0.985865, Obj: 0.713722, No Obj: 0.001311, .5R: 0.833333, .75R: 0.333333,  count: 12
Region 106 Avg IOU: 0.623416, Class: 0.823355, Obj: 0.525871, No Obj: 0.001481, .5R: 0.813559, .75R: 0.220339,  count: 59
Region 82 Avg IOU: 0.748478, Class: 0.992079, Obj: 0.331865, No Obj: 0.001283, .5R: 1.000000, .75R: 0.500000,  count: 2
Region 94 Avg IOU: 0.569297, Class: 0.894295, Obj: 0.268277, No Obj: 0.001745, .5R: 0.684211, .75R: 0.052632,  count: 19
Region 106 Avg IOU: 0.617301, Class: 0.895878, Obj: 0.561171, No Obj: 0.001280, .5R: 0.864865, .75R: 0.189189,  count: 37
Region 82 Avg IOU: 0.617124, Class: 0.993862, Obj: 0.500197, No Obj: 0.002522, .5R: 1.000000, .75R: 0.000000,  count: 4
Region 94 Avg IOU: 0.601353, Class: 0.904737, Obj: 0.547848, No Obj: 0.002094, .5R: 0.863636, .75R: 0.090909,  count: 22
Region 106 Avg IOU: 0.648495, Class: 0.896536, Obj: 0.653840, No Obj: 0.001114, .5R: 0.939394, .75R: 0.212121,  count: 33
Region 82 Avg IOU: 0.835922, Class: 0.989319, Obj: 0.367180, No Obj: 0.001742, .5R: 1.000000, .75R: 1.000000,  count: 2
Region 94 Avg IOU: 0.563951, Class: 0.934366, Obj: 0.484253, No Obj: 0.001929, .5R: 0.736842, .75R: 0.157895,  count: 19
Region 106 Avg IOU: 0.661426, Class: 0.964040, Obj: 0.799646, No Obj: 0.001123, .5R: 0.925926, .75R: 0.296296,  count: 27
Region 82 Avg IOU: 0.539442, Class: 0.985715, Obj: 0.197194, No Obj: 0.001021, .5R: 0.333333, .75R: 0.000000,  count: 3
Region 94 Avg IOU: 0.579014, Class: 0.798964, Obj: 0.468050, No Obj: 0.001925, .5R: 0.687500, .75R: 0.250000,  count: 16
Region 106 Avg IOU: 0.622752, Class: 0.871247, Obj: 0.468912, No Obj: 0.001483, .5R: 0.830189, .75R: 0.188679,  count: 53
Region 82 Avg IOU: 0.725466, Class: 0.996516, Obj: 0.311251, No Obj: 0.002156, .5R: 1.000000, .75R: 0.666667,  count: 3
Region 94 Avg IOU: 0.479724, Class: 0.908354, Obj: 0.671701, No Obj: 0.001487, .5R: 0.500000, .75R: 0.000000,  count: 12
Region 106 Avg IOU: 0.608657, Class: 0.753718, Obj: 0.550959, No Obj: 0.000696, .5R: 0.678571, .75R: 0.250000,  count: 28
Region 82 Avg IOU: 0.613398, Class: 0.998283, Obj: 0.485302, No Obj: 0.001676, .5R: 1.000000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.550962, Class: 0.885224, Obj: 0.341467, No Obj: 0.001789, .5R: 0.636364, .75R: 0.045455,  count: 22
Region 106 Avg IOU: 0.638456, Class: 0.875538, Obj: 0.697060, No Obj: 0.001186, .5R: 0.833333, .75R: 0.194444,  count: 36
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000377, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.654423, Class: 0.962119, Obj: 0.541582, No Obj: 0.001542, .5R: 0.727273, .75R: 0.272727,  count: 11

Makefile of Alexey's darknet. I only changed the options in the header and set ARCH to support the V100.

GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1
AVX=1
OPENMP=1
LIBSO=1

# set GPU=1 and CUDNN=1 to speedup on GPU
# set CUDNN_HALF=1 to further speedup 3 x times (Mixed-precision using Tensor Cores) on GPU Tesla V100, Titan V, DGX-2
# set AVX=1 and OPENMP=1 to speedup on CPU (if error occurs then set AVX=0)

DEBUG=0

ARCH= -gencode arch=compute_30,code=sm_30 \
      -gencode arch=compute_35,code=sm_35 \
      -gencode arch=compute_50,code=[sm_50,compute_50] \
      -gencode arch=compute_52,code=[sm_52,compute_52] \
          -gencode arch=compute_61,code=[sm_61,compute_61]

OS := $(shell uname)

# Tesla V100
ARCH= -gencode arch=compute_70,code=[sm_70,compute_70]

Training log of Alexey's darknet. The nan avg loss shows after iteration 84

82: 23.580215, 49.005016 avg loss, 0.001000 rate, 4.531976 seconds, 5248 images
Loaded: 0.000088 seconds
Region 82 Avg IOU: 0.676631, Class: 0.657015, Obj: 0.013579, No Obj: 0.000307, .5R: 1.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.611994, Class: 0.578710, Obj: 0.002161, No Obj: 0.000327, .5R: 0.928571, .75R: 0.142857,  count: 14
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 31
Region 82 Avg IOU: 0.715527, Class: 0.669567, Obj: 0.005944, No Obj: 0.000333, .5R: 1.000000, .75R: 0.500000,  count: 2
Region 94 Avg IOU: 0.563441, Class: 0.764334, Obj: 0.006268, No Obj: 0.000308, .5R: 0.833333, .75R: 0.000000,  count: 6
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 60
Region 82 Avg IOU: 0.613072, Class: 0.618635, Obj: 0.008895, No Obj: 0.000318, .5R: 1.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.618610, Class: 0.617791, Obj: 0.003350, No Obj: 0.000328, .5R: 0.666667, .75R: 0.222222,  count: 9
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 20
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000292, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.555021, Class: 0.602331, Obj: 0.002210, No Obj: 0.000284, .5R: 1.000000, .75R: 0.000000,  count: 2
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 11
Region 82 Avg IOU: 0.567027, Class: 0.626732, Obj: 0.012662, No Obj: 0.000445, .5R: 0.750000, .75R: 0.000000,  count: 4
Region 94 Avg IOU: 0.584985, Class: 0.517245, Obj: 0.001348, No Obj: 0.000330, .5R: 1.000000, .75R: 0.000000,  count: 11
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 34
Region 82 Avg IOU: 0.389180, Class: 0.619162, Obj: 0.004810, No Obj: 0.000252, .5R: 0.000000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.512456, Class: 0.524628, Obj: 0.001560, No Obj: 0.000342, .5R: 0.500000, .75R: 0.000000,  count: 18
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 24
Region 82 Avg IOU: 0.499203, Class: 0.634386, Obj: 0.003762, No Obj: 0.000197, .5R: 0.500000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.566642, Class: 0.652625, Obj: 0.004396, No Obj: 0.000363, .5R: 0.700000, .75R: 0.000000,  count: 10
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 47
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000222, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.574259, Class: 0.686245, Obj: 0.004401, No Obj: 0.000351, .5R: 0.818182, .75R: 0.000000,  count: 11
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 39
Region 82 Avg IOU: 0.454278, Class: 0.650282, Obj: 0.010042, No Obj: 0.000290, .5R: 0.000000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.613155, Class: 0.635823, Obj: 0.004758, No Obj: 0.000443, .5R: 0.875000, .75R: 0.000000,  count: 8
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 40
Region 82 Avg IOU: 0.541684, Class: 0.626609, Obj: 0.009354, No Obj: 0.000228, .5R: 1.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.645020, Class: 0.512855, Obj: 0.000979, No Obj: 0.000298, .5R: 1.000000, .75R: 0.000000,  count: 2
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 5
Region 82 Avg IOU: 0.547379, Class: 0.628108, Obj: 0.019025, No Obj: 0.000254, .5R: 0.500000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.687717, Class: 0.591308, Obj: 0.005402, No Obj: 0.000321, .5R: 1.000000, .75R: 0.166667,  count: 6
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 30
Region 82 Avg IOU: 0.624384, Class: 0.635476, Obj: 0.003742, No Obj: 0.000256, .5R: 1.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.563145, Class: 0.719392, Obj: 0.005922, No Obj: 0.000365, .5R: 1.000000, .75R: 0.000000,  count: 6
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 39
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000173, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.573622, Class: 0.640934, Obj: 0.004082, No Obj: 0.000380, .5R: 1.000000, .75R: 0.000000,  count: 10
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 25
Region 82 Avg IOU: 0.585302, Class: 0.625714, Obj: 0.013245, No Obj: 0.000237, .5R: 1.000000, .75R: 0.000000,  count: 3
Region 94 Avg IOU: 0.556830, Class: 0.647776, Obj: 0.001029, No Obj: 0.000333, .5R: 0.500000, .75R: 0.000000,  count: 6
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 34
Region 82 Avg IOU: 0.632286, Class: 0.638544, Obj: 0.012061, No Obj: 0.000321, .5R: 0.750000, .75R: 0.250000,  count: 4
Region 94 Avg IOU: 0.647450, Class: 0.583517, Obj: 0.002508, No Obj: 0.000392, .5R: 0.714286, .75R: 0.142857,  count: 7
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 41
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000250, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.395261, Class: 0.764918, Obj: 0.005050, No Obj: 0.000304, .5R: 0.000000, .75R: 0.000000,  count: 4
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 21

 83: inf, inf avg loss, 0.001000 rate, 4.540668 seconds, 5312 images
Loaded: 0.000146 seconds
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000453, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.621627, Class: 0.755543, Obj: 0.016679, No Obj: 0.000547, .5R: 1.000000, .75R: 0.000000,  count: 2
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 13
Region 82 Avg IOU: 0.716994, Class: 0.629274, Obj: 0.010065, No Obj: 0.000501, .5R: 1.000000, .75R: 0.500000,  count: 2
Region 94 Avg IOU: 0.612688, Class: 0.666641, Obj: 0.005844, No Obj: 0.000384, .5R: 0.666667, .75R: 0.333333,  count: 6
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 40
Region 82 Avg IOU: 0.479883, Class: 0.592741, Obj: 0.004319, No Obj: 0.000646, .5R: 0.000000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.560601, Class: 0.579608, Obj: 0.001256, No Obj: 0.000300, .5R: 0.750000, .75R: 0.000000,  count: 8
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 28
Region 82 Avg IOU: 0.421037, Class: 0.615552, Obj: 0.025422, No Obj: 0.000828, .5R: 0.000000, .75R: 0.000000,  count: 4
Region 94 Avg IOU: 0.565889, Class: 0.533750, Obj: 0.000729, No Obj: 0.000305, .5R: 0.791667, .75R: 0.000000,  count: 24
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 38
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000569, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.645602, Class: 0.613226, Obj: 0.003775, No Obj: 0.000436, .5R: 0.888889, .75R: 0.000000,  count: 9
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 24
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000478, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.572421, Class: 0.593322, Obj: 0.004701, No Obj: 0.000468, .5R: 0.700000, .75R: 0.000000,  count: 10
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 32
Region 82 Avg IOU: 0.641817, Class: 0.635031, Obj: 0.035640, No Obj: 0.000784, .5R: 1.000000, .75R: 0.000000,  count: 1
...

Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000356, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.601235, Class: 0.657700, Obj: 0.003757, No Obj: 0.000395, .5R: 0.900000, .75R: 0.000000,  count: 10
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 38
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000509, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.612902, Class: 0.762554, Obj: 0.008244, No Obj: 0.000359, .5R: 0.750000, .75R: 0.250000,  count: 4
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 30
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000631, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.559490, Class: 0.648356, Obj: 0.006230, No Obj: 0.000325, .5R: 0.833333, .75R: 0.000000,  count: 6
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 20
Region 82 Avg IOU: 0.533837, Class: 0.627053, Obj: 0.018822, No Obj: 0.000553, .5R: 1.000000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.510067, Class: 0.638865, Obj: 0.002965, No Obj: 0.000325, .5R: 0.583333, .75R: 0.000000,  count: 12
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 38
Region 82 Avg IOU: 0.504176, Class: 0.630334, Obj: 0.015104, No Obj: 0.000486, .5R: 0.500000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.595772, Class: 0.559108, Obj: 0.001982, No Obj: 0.000340, .5R: 0.894737, .75R: 0.000000,  count: 19
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 24
Region 82 Avg IOU: 0.581129, Class: 0.620991, Obj: 0.015682, No Obj: 0.000537, .5R: 1.000000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.545876, Class: 0.631375, Obj: 0.002244, No Obj: 0.000349, .5R: 0.727273, .75R: 0.000000,  count: 11
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 40
Region 82 Avg IOU: 0.550739, Class: 0.615534, Obj: 0.017859, No Obj: 0.000651, .5R: 0.500000, .75R: 0.250000,  count: 4
Region 94 Avg IOU: 0.582396, Class: 0.528876, Obj: 0.001154, No Obj: 0.000286, .5R: 0.900000, .75R: 0.000000,  count: 10
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 38

 84: nan, nan avg loss, 0.001000 rate, 4.538573 seconds, 5376 images
Loaded: 0.000163 seconds
Region 82 Avg IOU: 0.605543, Class: 0.617683, Obj: 0.022642, No Obj: 0.000854, .5R: 0.666667, .75R: 0.000000,  count: 3
Region 94 Avg IOU: 0.571682, Class: 0.510673, Obj: 0.001435, No Obj: 0.000364, .5R: 0.833333, .75R: 0.000000,  count: 18
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 22
Region 82 Avg IOU: 0.549848, Class: 0.628538, Obj: 0.011007, No Obj: 0.000708, .5R: 0.500000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.573516, Class: 0.625826, Obj: 0.003037, No Obj: 0.000344, .5R: 0.750000, .75R: 0.062500,  count: 16
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 55
Region 82 Avg IOU: 0.634130, Class: 0.621600, Obj: 0.067672, No Obj: 0.000919, .5R: 1.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.543719, Class: 0.719193, Obj: 0.011762, No Obj: 0.000508, .5R: 0.666667, .75R: 0.000000,  count: 6
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 41
Region 82 Avg IOU: 0.516325, Class: 0.626551, Obj: 0.032900, No Obj: 0.000898, .5R: 0.333333, .75R: 0.000000,  count: 3
Region 94 Avg IOU: 0.601918, Class: 0.444663, Obj: 0.000872, No Obj: 0.000315, .5R: 1.000000, .75R: 0.000000,  count: 6
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 14
Region 82 Avg IOU: 0.548722, Class: 0.646741, Obj: 0.014310, No Obj: 0.000774, .5R: 1.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.518356, Class: 0.730058, Obj: 0.006983, No Obj: 0.000498, .5R: 0.571429, .75R: 0.000000,  count: 7

AlexeyAB · 2018-06-10T11:01:42Z

@yyuzhongpv

Update your code from this repository.
Try to recalculate the anchors:
./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416
Set these anchors in each of the 3 [yolo] layers, and show them here.
Try to train using my repo:
- for the first 1000 iterations, build with CUDNN_HALF=0 (run make)
- after 1000 iterations, set CUDNN_HALF=1, rebuild (run make) and continue training from yolo_1000.weights. Does nan still occur?
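
To answer the "does nan still occur" question without reading the log by eye, the per-iteration summary lines can be scanned automatically. This is a minimal sketch, assuming the log format shown in this thread (for example " 83: inf, inf avg loss, ..."); the function name `first_bad_iteration` is hypothetical and not part of Darknet.

```python
import math
import re

# Matches the per-iteration summary line printed during training, e.g.
# " 83: inf, inf avg loss, 0.001000 rate, 4.540668 seconds, 5312 images"
SUMMARY = re.compile(r"^\s*(\d+):\s*([^,\s]+),\s*([^,\s]+) avg loss")


def first_bad_iteration(log_lines):
    """Return the first iteration whose loss or avg loss is inf/nan, else None."""
    for line in log_lines:
        match = SUMMARY.match(line)
        if match is None:
            continue  # "Region ..." and "Loaded: ..." lines are skipped
        iteration = int(match.group(1))
        loss = float(match.group(2))      # float() accepts "inf" and "nan"
        avg_loss = float(match.group(3))
        if not (math.isfinite(loss) and math.isfinite(avg_loss)):
            return iteration
    return None


sample = [
    " 83: inf, inf avg loss, 0.001000 rate, 4.540668 seconds, 5312 images",
    "Loaded: 0.000146 seconds",
    " 84: nan, nan avg loss, 0.001000 rate, 4.538573 seconds, 5376 images",
]
print(first_bad_iteration(sample))  # 83
```

Running it over a saved training log (one string per line) gives the iteration at which the loss first diverged, which is a convenient point to compare between the CUDNN_HALF=0 and CUDNN_HALF=1 builds.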

yyuzhongpv · 2018-06-11T19:19:15Z

@AlexeyAB
Hello Alexey,

I followed your instructions and got these results.

I used today's code.
The anchors:
anchors = 12.5214,14.6001, 17.5892,18.6171, 26.0970,22.3336, 29.9592,28.2223, 48.2814,75.2532, 48.3668,199.5443, 45.4486,275.1175, 49.9200,286.9578, 76.5882,390.3690
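
As a quick sanity check on the anchors line above, the string can be parsed into (width, height) pairs and split across the three [yolo] layers. This is only a sketch: the mask assignment (6,7,8 for layer 82, 3,4,5 for layer 94, 0,1,2 for layer 106) is assumed from the stock yolov3 cfg files and should be checked against the cfg actually used; the helper names are hypothetical.

```python
ANCHORS = (
    "12.5214,14.6001, 17.5892,18.6171, 26.0970,22.3336, 29.9592,28.2223, "
    "48.2814,75.2532, 48.3668,199.5443, 45.4486,275.1175, 49.9200,286.9578, "
    "76.5882,390.3690"
)

# Assumed masks, as in the stock yolov3 cfg files (layer number -> anchor indices).
MASKS = {82: (6, 7, 8), 94: (3, 4, 5), 106: (0, 1, 2)}


def parse_anchors(text):
    """Turn 'w,h, w,h, ...' into a list of (width, height) tuples."""
    values = [float(v) for v in text.split(",")]
    if len(values) % 2 != 0:
        raise ValueError("anchors must come in width,height pairs")
    return list(zip(values[0::2], values[1::2]))


def anchors_per_layer(pairs, masks=MASKS):
    """Select the anchors each [yolo] layer uses according to its mask."""
    return {layer: [pairs[i] for i in idx] for layer, idx in masks.items()}


pairs = parse_anchors(ANCHORS)
for layer, selected in anchors_per_layer(pairs).items():
    print(layer, selected)
```

With these assumed masks, layer 106 gets the three smallest anchors and layer 82 the three largest, so each "Region" line in the log corresponds to one size range of objects.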

A. The training log of the first 1000 iterations is shown here. With yolo_1000.weights (CUDNN_HALF=0), I can detect objects using the detector in your repo.


1016: 3.586827, 3.088684 avg loss, 0.001000 rate, 1.378484 seconds, 65024 images
Loaded: 0.000074 seconds
Region 82 Avg IOU: 0.813881, Class: 0.999278, Obj: 0.645282, No Obj: 0.003569, .5R: 1.000000, .75R: 0.750000,  count: 4
Region 94 Avg IOU: 0.723154, Class: 0.847382, Obj: 0.649724, No Obj: 0.001934, .5R: 1.000000, .75R: 0.333333,  count: 3
Region 106 Avg IOU: 0.655643, Class: 0.970920, Obj: 0.729974, No Obj: 0.004603, .5R: 0.952381, .75R: 0.142857,  count: 42
Region 82 Avg IOU: 0.809238, Class: 0.999405, Obj: 0.361409, No Obj: 0.001196, .5R: 1.000000, .75R: 1.000000,  count: 2
Region 94 Avg IOU: 0.770336, Class: 0.913955, Obj: 0.556664, No Obj: 0.001484, .5R: 1.000000, .75R: 0.750000,  count: 4
Region 106 Avg IOU: 0.586111, Class: 0.768945, Obj: 0.601715, No Obj: 0.003149, .5R: 0.857143, .75R: 0.142857,  count: 28
Region 82 Avg IOU: 0.779693, Class: 0.999894, Obj: 0.690659, No Obj: 0.003170, .5R: 1.000000, .75R: 0.500000,  count: 2
Region 94 Avg IOU: 0.693006, Class: 0.975745, Obj: 0.582555, No Obj: 0.002297, .5R: 1.000000, .75R: 0.000000,  count: 5
Region 106 Avg IOU: 0.640808, Class: 0.922579, Obj: 0.637060, No Obj: 0.004327, .5R: 0.783784, .75R: 0.189189,  count: 37
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.001053, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.802638, Class: 0.925374, Obj: 0.509822, No Obj: 0.003300, .5R: 1.000000, .75R: 0.500000,  count: 6
Region 106 Avg IOU: 0.616160, Class: 0.886506, Obj: 0.701310, No Obj: 0.005053, .5R: 0.750000, .75R: 0.227273,  count: 44
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000327, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.754958, Class: 0.963965, Obj: 0.365209, No Obj: 0.001992, .5R: 1.000000, .75R: 0.400000,  count: 5
Region 106 Avg IOU: 0.489382, Class: 0.795860, Obj: 0.758662, No Obj: 0.002849, .5R: 0.520000, .75R: 0.160000,  count: 25
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000037, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.729015, Class: 0.967462, Obj: 0.717134, No Obj: 0.001929, .5R: 0.750000, .75R: 0.500000,  count: 4
Region 106 Avg IOU: 0.647692, Class: 0.977685, Obj: 0.846630, No Obj: 0.003124, .5R: 0.851852, .75R: 0.259259,  count: 27
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000076, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.739086, Class: 0.999300, Obj: 0.972048, No Obj: 0.001930, .5R: 1.000000, .75R: 0.250000,  count: 4
Region 106 Avg IOU: 0.661888, Class: 0.979235, Obj: 0.794365, No Obj: 0.003189, .5R: 0.793103, .75R: 0.344828,  count: 29
Region 82 Avg IOU: 0.704681, Class: 0.995547, Obj: 0.387201, No Obj: 0.004917, .5R: 0.875000, .75R: 0.375000,  count: 8
Region 94 Avg IOU: 0.774302, Class: 0.991398, Obj: 0.277391, No Obj: 0.001359, .5R: 1.000000, .75R: 1.000000,  count: 1
Region 106 Avg IOU: 0.678386, Class: 0.990535, Obj: 0.612305, No Obj: 0.004908, .5R: 0.953488, .75R: 0.325581,  count: 43
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000071, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.714987, Class: 0.998025, Obj: 0.877913, No Obj: 0.001984, .5R: 1.000000, .75R: 0.500000,  count: 4
Region 106 Avg IOU: 0.586376, Class: 0.984365, Obj: 0.728304, No Obj: 0.003214, .5R: 0.785714, .75R: 0.000000,  count: 28
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000212, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.761933, Class: 0.997826, Obj: 0.740437, No Obj: 0.002768, .5R: 1.000000, .75R: 0.833333,  count: 6
Region 106 Avg IOU: 0.660410, Class: 0.938265, Obj: 0.793664, No Obj: 0.004022, .5R: 0.925000, .75R: 0.225000,  count: 40
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000197, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.689387, Class: 0.998265, Obj: 0.673457, No Obj: 0.002302, .5R: 1.000000, .75R: 0.250000,  count: 4
Region 106 Avg IOU: 0.569088, Class: 0.866124, Obj: 0.653784, No Obj: 0.004086, .5R: 0.769231, .75R: 0.153846,  count: 26
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000461, .5R: -nan, .75R: -nan,  count: 0

B. The training log after 1000 iterations is shown here. With yolo_2000.weights (CUDNN_HALF=0), I can detect objects using the detector in your repo. However, I still detect nothing with Joseph's detector. No nan occurs in training.

2341: 2.669755, 2.127507 avg loss, 0.001000 rate, 2.177491 seconds, 149824 images
Loaded: 0.000085 seconds
Region 82 Avg IOU: 0.781696, Class: 0.999761, Obj: 0.038998, No Obj: 0.000233, .5R: 1.000000, .75R: 1.000000,  count: 1
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000661, .5R: -nan, .75R: -nan,  count: 0
Region 106 Avg IOU: 0.818872, Class: 0.996039, Obj: 0.468270, No Obj: 0.000803, .5R: 1.000000, .75R: 0.777778,  count: 9
Region 82 Avg IOU: 0.870273, Class: 0.999668, Obj: 0.886586, No Obj: 0.005818, .5R: 1.000000, .75R: 1.000000,  count: 3
Region 94 Avg IOU: 0.788085, Class: 0.995233, Obj: 0.742202, No Obj: 0.002847, .5R: 1.000000, .75R: 0.555556,  count: 9
Region 106 Avg IOU: 0.747890, Class: 0.932715, Obj: 0.574414, No Obj: 0.002961, .5R: 0.956522, .75R: 0.565217,  count: 46
Region 82 Avg IOU: 0.893271, Class: 0.999246, Obj: 0.804811, No Obj: 0.009647, .5R: 1.000000, .75R: 1.000000,  count: 6
Region 94 Avg IOU: 0.814228, Class: 0.999748, Obj: 0.966014, No Obj: 0.003996, .5R: 1.000000, .75R: 1.000000,  count: 9
Region 106 Avg IOU: 0.775236, Class: 0.994676, Obj: 0.735384, No Obj: 0.004661, .5R: 1.000000, .75R: 0.650794,  count: 63
Region 82 Avg IOU: 0.788876, Class: 0.999952, Obj: 0.975545, No Obj: 0.001319, .5R: 1.000000, .75R: 1.000000,  count: 1
Region 94 Avg IOU: 0.821742, Class: 0.997696, Obj: 0.524333, No Obj: 0.001495, .5R: 1.000000, .75R: 1.000000,  count: 3
Region 106 Avg IOU: 0.806486, Class: 0.997632, Obj: 0.714795, No Obj: 0.001924, .5R: 1.000000, .75R: 0.761905,  count: 21
Region 82 Avg IOU: 0.877605, Class: 0.999866, Obj: 0.989194, No Obj: 0.004142, .5R: 1.000000, .75R: 1.000000,  count: 2
Region 94 Avg IOU: 0.820618, Class: 0.998972, Obj: 0.845366, No Obj: 0.002911, .5R: 1.000000, .75R: 0.833333,  count: 6
Region 106 Avg IOU: 0.748662, Class: 0.929423, Obj: 0.712750, No Obj: 0.003052, .5R: 0.925000, .75R: 0.650000,  count: 40
Region 82 Avg IOU: 0.826724, Class: 0.999622, Obj: 0.973740, No Obj: 0.007880, .5R: 1.000000, .75R: 0.500000,  count: 4
Region 94 Avg IOU: 0.847491, Class: 0.993900, Obj: 0.740988, No Obj: 0.002937, .5R: 1.000000, .75R: 0.888889,  count: 9
Region 106 Avg IOU: 0.742122, Class: 0.975453, Obj: 0.788626, No Obj: 0.002403, .5R: 0.935484, .75R: 0.548387,  count: 31
Region 82 Avg IOU: 0.882419, Class: 0.999716, Obj: 0.861224, No Obj: 0.005413, .5R: 1.000000, .75R: 1.000000,  count: 6
Region 94 Avg IOU: 0.785805, Class: 0.984351, Obj: 0.680408, No Obj: 0.003579, .5R: 1.000000, .75R: 0.600000,  count: 10
Region 106 Avg IOU: 0.797855, Class: 0.997227, Obj: 0.757247, No Obj: 0.002562, .5R: 1.000000, .75R: 0.741935,  count: 31
Region 82 Avg IOU: 0.876814, Class: 0.999947, Obj: 0.972225, No Obj: 0.001938, .5R: 1.000000, .75R: 1.000000,  count: 2
Region 94 Avg IOU: 0.810955, Class: 0.998985, Obj: 0.599165, No Obj: 0.002519, .5R: 1.000000, .75R: 0.666667,  count: 6
Region 106 Avg IOU: 0.755856, Class: 0.946647, Obj: 0.765261, No Obj: 0.002008, .5R: 1.000000, .75R: 0.531250,  count: 32
Region 82 Avg IOU: 0.771740, Class: 0.999893, Obj: 0.968240, No Obj: 0.003285, .5R: 1.000000, .75R: 0.500000,  count: 2
Region 94 Avg IOU: 0.865066, Class: 0.999535, Obj: 0.951941, No Obj: 0.002444, .5R: 1.000000, .75R: 1.000000,  count: 4
Region 106 Avg IOU: 0.732979, Class: 0.996645, Obj: 0.798545, No Obj: 0.002100, .5R: 1.000000, .75R: 0.400000,  count: 20
Region 82 Avg IOU: 0.769176, Class: 0.999657, Obj: 0.035232, No Obj: 0.000095, .5R: 1.000000, .75R: 1.000000,  count: 1
Region 94 Avg IOU: 0.832877, Class: 0.999830, Obj: 0.949717, No Obj: 0.002328, .5R: 1.000000, .75R: 1.000000,  count: 4
Region 106 Avg IOU: 0.777802, Class: 0.998452, Obj: 0.897912, No Obj: 0.002340, .5R: 1.000000, .75R: 0.750000,  count: 28
Region 82 Avg IOU: 0.826240, Class: 0.999946, Obj: 0.936032, No Obj: 0.003600, .5R: 1.000000, .75R: 0.750000,  count: 4
Region 94 Avg IOU: 0.780202, Class: 0.998233, Obj: 0.889746, No Obj: 0.003007, .5R: 1.000000, .75R: 0.818182,  count: 11
Region 106 Avg IOU: 0.769170, Class: 0.994288, Obj: 0.831500, No Obj: 0.002834, .5R: 1.000000, .75R: 0.571429,  count: 28
Region 82 Avg IOU: 0.888917, Class: 0.999564, Obj: 0.777210, No Obj: 0.005676, .5R: 1.000000, .75R: 1.000000,  count: 4
Region 94 Avg IOU: 0.749526, Class: 0.995966, Obj: 0.675944, No Obj: 0.002745, .5R: 0.875000, .75R: 0.500000,  count: 8
Region 106 Avg IOU: 0.809502, Class: 0.920627, Obj: 0.648274, No Obj: 0.001870, .5R: 1.000000, .75R: 0.700000,  count: 20

The questions:

What magic did you do in this process? CUDNN_HALF matters, right?
Why does Joseph's detector still detect nothing with the same weights file that works with your detector?
What do you suggest next? How can I make both training and inference work well?

Regards,

AlexeyAB · 2018-06-12T00:17:38Z

The first 1000 iterations are the most unstable period, so the general recommendation is to use 1 GPU and Float-32. After 1000 iterations you can use multi-GPU (-gpus 0,1,2,3) and mixed precision (CUDNN_HALF=1).
What width=, height= and random= values do you use?
You wrote "With yolo_2000.weights (CUDNN_HALF=0), I can detect objects using the detector in your repo", so training and inference already work well, or what do you mean? Just train for about 12 000 iterations and check the mAP.
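
The two-stage recommendation can be written down as an explicit command sequence. The sketch below only builds the argument lists and executes nothing. Several things in it are assumptions rather than statements from this thread: passing CUDNN_HALF on the make command line instead of editing the Makefile, running make clean before each rebuild, and all file paths. Stage 1 also has to be stopped by hand once yolo_1000.weights exists.

```python
def two_stage_commands(data, cfg, pretrained, stage1_weights, gpus=(0, 1, 2, 3)):
    """Return the commands for both stages as argv lists (nothing is executed)."""
    gpu_arg = ",".join(str(g) for g in gpus)
    return [
        # Stage 1: Float-32 build, single GPU, for the unstable first 1000 iterations.
        ["make", "clean"],
        ["make", "CUDNN_HALF=0"],  # assumption: override the Makefile variable
        ["./darknet", "detector", "train", data, cfg, pretrained],
        # Stage 2: mixed-precision build, multi-GPU, resuming from the stage 1 weights.
        ["make", "clean"],
        ["make", "CUDNN_HALF=1"],
        ["./darknet", "detector", "train", data, cfg, stage1_weights, "-gpus", gpu_arg],
    ]


# All paths below are hypothetical placeholders.
commands = two_stage_commands(
    data="data/obj.data",
    cfg="cfg/yolov3-voc.cfg",
    pretrained="darknet53.conv.74",
    stage1_weights="backup/yolo_1000.weights",
)
for argv in commands:
    print(" ".join(argv))
```

Printing the commands rather than running them keeps the sketch safe to try; each list could be passed to `subprocess.run` once the paths are confirmed.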

yyuzhongpv · 2018-06-12T00:54:15Z

@AlexeyAB
Thank you so much!

Got it. I will try this two-stage training later.
I use all the default values in yolov3-voc.cfg: width=416, height=416 and random=1. I only changed the classes, filters and anchors.
Based on your suggestion, I trained the first 1000 iterations with CUDNN_HALF=0 and then set CUDNN_HALF=1. At that time I only checked a single image with very obvious objects in it, to see whether the weights made sense. If nothing is detected, something probably went wrong in training.
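
For reference, the filters value mentioned above depends on the number of classes. A minimal sketch, assuming the usual YOLOv3 rule that the convolutional layer before each [yolo] layer needs (classes + 5) filters per anchor, with 3 anchors per layer; the function name is hypothetical.

```python
def yolo_v3_filters(num_classes, anchors_per_layer=3):
    """Filters for the [convolutional] layer directly before a [yolo] layer.

    Each predicted box carries 4 coordinates, 1 objectness score and one
    score per class, and each [yolo] layer predicts `anchors_per_layer` boxes
    per grid cell.
    """
    return (num_classes + 5) * anchors_per_layer


print(yolo_v3_filters(6))   # 33, for the six classes used in this thread
print(yolo_v3_filters(20))  # 75, the VOC default
```

A mismatch between classes and filters is a common cause of a model that trains but detects nothing, so it is worth verifying alongside the anchors.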

With yolo_1000.weights, it detects most of the objects in this single image, although the confidence is not very high (30%~90%).

With yolo_8800.weights, it detects all of the objects with high confidence (85%+).
And with yolo_8800.weights, the mAP is much better now. I will train for more iterations (12000, as you suggested) and check the mAP anyway.


detections_count = 6954, unique_truth_count = 5168  
class_id = 0, name = Jointbar, 	 ap = 90.91 % 
class_id = 1, name = Bolt, 	 ap = 89.73 % 
class_id = 2, name = Hole, 	 ap = 88.50 % 
class_id = 3, name = Nut, 	 ap = 90.66 % 
class_id = 4, name = Discontinuity, 	 ap = 90.41 % 
class_id = 5, name = Crack, 	 ap = 55.96 % 
 for thresh = 0.25, precision = 0.96, recall = 0.95, F1-score = 0.96 
 for thresh = 0.25, TP = 4896, FP = 188, FN = 272, average IoU = 75.00 % 

 mean average precision (mAP) = 0.843603, or 84.36 % 
Total Detection Time: 11.000000 Seconds
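
The summary numbers in this output can be reproduced from the raw counts, which is a useful check when comparing runs. The sketch below uses the standard definitions of precision, recall and F1 and takes mAP as the plain mean of the per-class AP values; the function names are hypothetical.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 from true/false positives and false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def mean_ap(per_class_ap):
    """Mean average precision as the unweighted mean of per-class AP."""
    return sum(per_class_ap) / len(per_class_ap)


# Counts and per-class AP values copied from the yolo_8800.weights output above.
precision, recall, f1 = detection_metrics(tp=4896, fp=188, fn=272)
print(f"precision = {precision:.2f}, recall = {recall:.2f}, F1-score = {f1:.2f}")
print(f"mAP = {mean_ap([90.91, 89.73, 88.50, 90.66, 90.41, 55.96]):.2f} %")
# precision = 0.96, recall = 0.95, F1-score = 0.96
# mAP = 84.36 %
```

Note that TP + FN = 5168 equals unique_truth_count, and that the low mAP relative to the other classes comes almost entirely from the Crack class (55.96 %).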

The last question: why is Joseph's darknet not working well on the V100 now? It worked well on my 1080 GPU before. Are there any further tests you suggest I try?

Regards.

AlexeyAB · 2018-06-12T01:19:12Z

It is a good result:

 for thresh = 0.25, precision = 0.96, recall = 0.95, F1-score = 0.96 
 for thresh = 0.25, TP = 4896, FP = 188, FN = 272, average IoU = 75.00 % 

 mean average precision (mAP) = 0.843603, or 84.36 %

Try to use ARCH= -gencode arch=compute_70,code=[sm_70,compute_70] in the Makefile for Joseph's darknet

yyuzhongpv · 2018-06-12T14:29:13Z

Hello Alexey!
4. After I set ARCH= -gencode arch=compute_70,code=[sm_70,compute_70] in the Makefile for Joseph's darknet and trained 10000 iterations, I still can't detect the objects in the sample image.

However, if I use the detector in your repo, it can detect the objects in that image, and the mAP is also good. I think something is wrong with the detector test code in Joseph's darknet.

detections_count = 6532, unique_truth_count = 5038
class_id = 0, name = Jointbar, ap = 90.90 %
class_id = 1, name = Bolt, ap = 90.55 %
class_id = 2, name = Hole, ap = 89.53 %
class_id = 3, name = Nut, ap = 90.88 %
class_id = 4, name = Discontinuity, ap = 90.26 %
class_id = 5, name = Crack, ap = 88.64 %
for thresh = 0.25, precision = 0.97, recall = 0.98, F1-score = 0.97
for thresh = 0.25, TP = 4927, FP = 170, FN = 111, average IoU = 75.18 %

mean average precision (mAP) = 0.901253, or 90.13 %
Total Detection Time: 13.000000 Seconds

AlexeyAB · 2018-06-12T15:06:36Z

I think something is wrong with the detector test code in Joseph's darknet

Maybe yes.

yyuzhongpv · 2018-06-12T23:22:42Z

There is only a minor difference in mAP between these two repos.

On my dataset, with the same model configuration, training with both Joseph's darknet (GPU and CUDNN enabled) and Alexey's darknet (two stages, CUDNN_HALF enabled after 1000 iterations) produces good weights.

Joseph's darknet seems to have some issue in the detector test code with CUDNN=1. If I disable CUDNN only for testing, it can detect objects in a single image.

willbattel · 2019-04-01T21:06:55Z

The first 1000 iterations are the most unstable period, so the general recommendation is to use 1 GPU and Float-32. After 1000 iterations you can use multi-GPU (-gpus 0,1,2,3) and mixed precision (CUDNN_HALF=1).

@AlexeyAB is this still the case? If so, would it be possible to modify the training code so that I can make with CUDNN_HALF=1 and start the detection process with multiple GPUs, and the training process will automatically use only full precision with 1 GPU until iteration 1000? It seems silly to have to rebuild the program partway through a training process just to support CUDNN_HALF on multiple GPUs.

AlexeyAB · 2019-04-01T21:16:12Z

@willbattel

Currently Darknet automatically disables Tensor Cores for the first 1000-3000 iterations.
So just make once with CUDNN_HALF=1.

yyuzhongpv changed the title ~~The Difference of AlexyAB/Darknet and Pjreddie/Darknet~~ The Difference of AlexeyAB/Darknet and Pjreddie/Darknet Jun 4, 2018

yyuzhongpv closed this as completed Jun 12, 2018

AlexeyAB added the "Solved" label (The problem is solved using the correct settings) Jun 12, 2018

naivewhim mentioned this issue Jun 20, 2019

loss Nan after inf issue #3449

Closed
