The Difference of AlexeyAB/Darknet and Pjreddie/Darknet #969

Closed
yyuzhongpv opened this issue Jun 3, 2018 · 27 comments
Labels
Solved The problem is solved using the correct settings

Comments

@yyuzhongpv

yyuzhongpv commented Jun 3, 2018

Hello,

On my test machine (GTX 1080 GPU, CentOS 7, CUDA 9.0), I have both Pjreddie's and AlexeyAB's Darknet. I used the same dataset and config file to train the detection models. With Pjreddie's darknet, I get good performance in training and testing. However, when I switched to AlexeyAB's darknet, with the same options in the Makefile and the same dataset, training seemed to go well and converged quickly, but when I used that model on my test images, I got very bad accuracy. What are the main differences between these two repos, and how can I debug this? I really want to use the CUDNN_HALF optimization from Alexey.
Thanks!

@AlexeyAB
Owner

AlexeyAB commented Jun 3, 2018

Hello,

CUDNN_HALF=1 gives a speedup without a drop in accuracy only on Volta GPUs (Titan V, Tesla V100, Quadro GV100, DGX-2, HGX-2, ...) and later.

  • What version of CUDNN do you use?
  • Do you get bad accuracy when you use CUDNN_HALF=1 on GTX 1080 GPU?
  • Do you get bad accuracy when CUDNN_HALF=0?
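For context on the hardware requirement above: Volta corresponds to CUDA compute capability 7.0, while the GTX 1080 (Pascal) is 6.1. A minimal sketch of that check follows; the lookup table and the helper name `has_tensor_cores` are illustrative and are not part of Darknet.

```python
# Compute capability (major, minor) for GPUs named in this thread.
COMPUTE_CAPABILITY = {
    "GTX 1080": (6, 1),    # Pascal
    "Titan V": (7, 0),     # Volta
    "Tesla V100": (7, 0),  # Volta
}


def has_tensor_cores(gpu_name):
    """Return True if the GPU is Volta (compute capability 7.0) or newer."""
    major, _minor = COMPUTE_CAPABILITY[gpu_name]
    return major >= 7


for name in ("GTX 1080", "Tesla V100"):
    print(name, "-> CUDNN_HALF=1 advisable:", has_tensor_cores(name))
```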

@yyuzhongpv
Author

yyuzhongpv commented Jun 3, 2018

Hello Alexey,

Thanks for your quick reply.

•What version of CUDNN do you use?
cudnn 7.0

•Do you get bad accuracy when you use CUDNN_HALF=1 on GTX 1080 GPU?
•Do you get bad accuracy when CUDNN_HALF=0?

In this test, I use a GTX 1080 (a physical Dell workstation). In both Darknet repos, I set CUDNN_HALF=0 and trained with my own dataset (about 3000 images, each about 2000x1500).

With Pjreddie's darknet, after training 50000 batches, I got more than 90% on both precision and recall in testing.
With AlexeyAB's darknet, the error decreases quickly during training and everything seems good. However, when I validate the trained model with the same Python code, the accuracy is very bad, and it can't detect anything in most of the images.

I will try more tests with the command line and post the results later.

Thanks!

@AlexeyAB
Owner

AlexeyAB commented Jun 3, 2018

  • What is the date of your code from this repository?
  • What dataset do you use?
  • What model do you use?
  • What mAP do you get for each weights file, one trained on the original repository and one trained on this repository, using CUDNN_HALF=0?

@yyuzhongpv
Author

Thanks Alexey!
•What date of your code from this repository?
I used the code from the beginning of May and forgot the exact date. I will check it.
•What dataset do you use?
It is my own dataset, and the image size is about 2000x1500. I trained Yolov3 on this dataset with both Pjreddie's darknet and AlexeyAB's darknet.
•What model do you use?
Yolov3.cfg
•What mAP can you get for both weights-files that is trained on Original and This repository? Using CUDNN_HALF=0
I only computed precision and recall; I will post the details later.
In short, with the original darknet, both precision and recall are 90%+ on 6 classes of objects, whereas with this repo all of them are near 0. I'm sure something is wrong. I will train the model with the latest code in this repository.

@AlexeyAB
Owner

AlexeyAB commented Jun 3, 2018

Yes, something is wrong. Try training with the latest code.

Then compare the accuracy of the models trained on the original repo and on this repo by using this command in this repo:
./darknet detector map data/obj.data yolo-obj.cfg backup/yolo-obj_50000.weights

Just set valid=valid.txt or valid=train.txt in your obj.data file.

@AlexeyAB
Owner

AlexeyAB commented Jun 3, 2018

Also, what network resolution (width= and height=) do you use?

@IlyaOvodov

This may be caused by a mix of letterbox and non-letterbox image modes across the detector commands, at least in this fork. Training and validation are done without letterbox (the image is simply resized to the network input size), but test is done with letterbox (the image is resized keeping its aspect ratio, and the margins are filled with uniform gray). In my case this gave bad visual results in "detector test" while train and validate showed very good statistics, until I found this "not a bug but a feature" :)
See:
https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L394
https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L492
https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L602
https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L646
https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L1102
https://github.com/AlexeyAB/darknet/blob/master/src/detector.c#L1112
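The two modes described above can be compared with a little arithmetic. The sketch below follows the usual letterbox recipe (scale by the tighter side, center, pad) using integer math; the function name is hypothetical, and the 100-pixel object is an assumed example rather than a value from this thread.

```python
def letterbox_geometry(img_w, img_h, net_w, net_h):
    """Return (new_w, new_h, pad_x, pad_y) of the image inside a letterboxed input."""
    # Scale by the tighter side so the whole image fits; integer division as in C.
    if net_w * img_h < net_h * img_w:   # same as net_w/img_w < net_h/img_h
        new_w, new_h = net_w, (img_h * net_w) // img_w
    else:
        new_w, new_h = (img_w * net_h) // img_h, net_h
    return new_w, new_h, (net_w - new_w) // 2, (net_h - new_h) // 2


# A 2000x1500 image fed to a 416x416 network.
new_w, new_h, pad_x, pad_y = letterbox_geometry(2000, 1500, 416, 416)
print(new_w, new_h, pad_x, pad_y)

# Height in network pixels of an object that is 100 px tall in the original image.
obj_h = 100
print("plain resize:", obj_h * 416 / 1500)
print("letterbox:   ", obj_h * new_h / 1500)
```

A model trained on stretched images and tested on letterboxed ones therefore sees objects at a different scale and aspect ratio than it learned.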

@AlexeyAB
Owner

AlexeyAB commented Jun 4, 2018

@yyuzhongpv @IlyaOvodov

I think the problem is a non-square network combined with an old commit that hardcoded network resizing to a random square size when random=1; it broke any training of a non-square network.
The latest code should work.

Also, just try to comment this line:

image sized = letterbox_image(im, net.w, net.h); letterbox = 1;

and un-comment this:
//image sized = resize_image(im, net.w, net.h);

@IlyaOvodov

Yes, the random=1 problem was a separate one and it is fixed now. But the letterbox inconsistency is still present. At the least, one has to comment/uncomment the lines above to make the "train", "valid" and "test" commands work the same way.

@AlexeyAB
Owner

AlexeyAB commented Jun 4, 2018

@IlyaOvodov I just fixed it. LETTERBOX_DATA is now disabled everywhere by default.

@yyuzhongpv
Author

Thanks @AlexeyAB @IlyaOvodov
I use YoloV3 without changes, so the width and height are 416x416.
I will try the latest code.

@yyuzhongpv
Author

Thanks @AlexeyAB @IlyaOvodov

Based on your comments, I also found that the problem occurs in darknet.py, which calls network_predict_image directly, and that function only uses letterbox_image.

float *network_predict_image(network *net, image im)
{
    image imr = letterbox_image(im, net->w, net->h);
    set_batch_network(net, 1);
    float *p = network_predict(*net, imr.data);
    free_image(imr);
    return p;
}

A quick question: is there any difference between letterbox and no letterbox if I keep training and testing consistent?
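Whichever mode is used, box coordinates have to be interpreted with the same geometry that produced the network input. As a sketch of what that means in the letterbox case, the hypothetical helper below undoes the padding and scaling for a normalized (cx, cy, w, h) box. It is an illustration under those assumptions, not code taken from darknet.py.

```python
def letterbox_to_image(box, img_w, img_h, net_w, net_h):
    """Map a (cx, cy, w, h) box, normalized to the letterboxed input,
    to coordinates normalized to the original image."""
    if net_w * img_h < net_h * img_w:
        new_w, new_h = net_w, (img_h * net_w) // img_w
    else:
        new_w, new_h = (img_w * net_h) // img_h, net_h
    cx, cy, w, h = box
    # Fraction of the network input covered by the actual image.
    sx, sy = new_w / net_w, new_h / net_h
    # Padding offset, as a fraction of the network input.
    ox, oy = (net_w - new_w) / 2 / net_w, (net_h - new_h) / 2 / net_h
    return ((cx - ox) / sx, (cy - oy) / sy, w / sx, h / sy)


# A box covering the whole image region of a letterboxed 2000x1500 input
# (full width, 312 of 416 rows) maps back to the full original image.
box = letterbox_to_image((0.5, 0.5, 1.0, 0.75), 2000, 1500, 416, 416)
print(box)
```

Skipping this correction, or applying it to a network input that was plainly resized, shifts and rescales every predicted box.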

@AlexeyAB
Owner

AlexeyAB commented Jun 4, 2018

@yyuzhongpv There are pros and cons in each case: #232 (comment)

@AlexeyAB
Owner

AlexeyAB commented Jun 4, 2018

I disabled letter_box in darknet.py by default.

Also check some differences between the original and this repository that can affect your result: #529 (comment)

@yyuzhongpv yyuzhongpv changed the title The Difference of AlexyAB/Darknet and Pjreddie/Darknet The Difference of AlexeyAB/Darknet and Pjreddie/Darknet Jun 4, 2018
@yyuzhongpv
Author

Hello @AlexeyAB,

After several days of testing, I found some more interesting things.
I use the exact same dataset, which has about 3k grayscale images of 2000x3000. The images contain some large objects (500x1000, up to 4) and also small objects (50x100, dozens). The only model I tried is yolov3-voc.cfg.

I have tested on two machines: a workstation with a 1080 GPU, CUDA 9.0 and CUDNN 7.0, and an Azure V100 VM with CUDA 9.1 and CUDNN 7.0.

Here are my results.

  1. I worked with Pjreddie's darknet on the 1080 GPU before. After 50000 iterations (batch=16, subdivisions=8), I get very good precision and recall (90%) on almost all of the objects.

  2. I want to optimize the processing time on the V100, so I switched to the V100 with both Pjreddie's darknet and AlexeyAB's repo. I updated both on 06/05/2018 and used exactly the same yolov3-voc.cfg, changing only batch=64 and subdivisions=16.

    A. With Pjreddie's darknet, after training, I can't detect anything with its detector command. However, if I use the detector in AlexeyAB's repo, I can detect all of the objects.

    B. With AlexeyAB's repo, after training for nearly 30000 iterations, I got the log below. Testing the latest weights with AlexeyAB's detector, I can only detect the large objects. With Pjreddie's detector, I detect nothing.
    '''
    30376: nan, nan avg loss, 0.001000 rate, 2.594721 seconds, 1944064 images
    Loaded: 0.000113 seconds
    Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.001509, .5R: -nan, .75R: -nan, count: 0
    Region 94 Avg IOU: 0.746136, Class: 0.887348, Obj: 0.495092, No Obj: 0.002121, .5R: 1.000000, .75R: 0.636364, count: 11
    Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 26
    Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000262, .5R: -nan, .75R: -nan, count: 0
    Region 94 Avg IOU: 0.867755, Class: 0.999704, Obj: 0.776341, No Obj: 0.002099, .5R: 1.000000, .75R: 1.000000, count: 8
    Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 52
    Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.001928, .5R: -nan, .75R: -nan, count: 0
    Region 94 Avg IOU: 0.880615, Class: 0.999778, Obj: 0.949572, No Obj: 0.002001, .5R: 1.000000, .75R: 1.000000, count: 6
    Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 41
    Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000134, .5R: -nan, .75R: -nan, count: 0
    Region 94 Avg IOU: 0.840572, Class: 0.999879, Obj: 0.972163, No Obj: 0.000742, .5R: 1.000000, .75R: 0.500000, count: 2
    Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 19
    Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000164, .5R: -nan, .75R: -nan, count: 0
    Region 94 Avg IOU: 0.842480, Class: 0.999470, Obj: 0.610240, No Obj: 0.001531, .5R: 1.000000, .75R: 1.000000, count: 5
    Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 35
    Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000080, .5R: -nan, .75R: -nan, count: 0
    Region 94 Avg IOU: 0.835363, Class: 0.999735, Obj: 0.821422, No Obj: 0.001978, .5R: 1.000000, .75R: 1.000000, count: 6
    Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 41
    Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000196, .5R: -nan, .75R: -nan, count: 0
    Region 94 Avg IOU: 0.849555, Class: 0.999705, Obj: 0.804512, No Obj: 0.002521, .5R: 1.000000, .75R: 0.857143, count: 7
    Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 45
    Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000034, .5R: -nan, .75R: -nan, count: 0
    Region 94 Avg IOU: 0.774943, Class: 0.999629, Obj: 0.699336, No Obj: 0.000951, .5R: 1.000000, .75R: 0.333333, count: 3
    Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 25
    '''
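When reading a log like the one above, it helps to relate the cfg values to the counters. In Darknet, weights are updated once per `batch` images, each forward/backward pass handles `batch / subdivisions` images, and the last column of the summary line is the running image count. A small sketch with hypothetical helper names:

```python
def images_seen(iteration, batch):
    """Total images consumed after `iteration` weight updates."""
    return iteration * batch


def mini_batch(batch, subdivisions):
    """Images processed per forward/backward pass."""
    return batch // subdivisions


# Summary line from the log above: iteration 30376 with batch=64.
print(images_seen(30376, 64))   # matches the "1944064 images" column

# The two settings used in this thread.
print(mini_batch(16, 8))    # 1080 run: batch=16, subdivisions=8
print(mini_batch(64, 16))   # V100 run: batch=64, subdivisions=16
```

Because the two runs update weights after 16 and 64 images respectively, the same iteration count does not correspond to the same amount of training.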

I have been stuck on this problem for several days. The questions I want to check are:

  1. Why do I get different results on the 1080 and V100 GPUs?

  2. With Pjreddie/Darknet on the V100, the training process seems good. Why does the detector in Pjreddie/Darknet detect nothing, while the detector in AlexeyAB/Darknet can detect objects?

  3. With AlexeyAB/Darknet, why can it only detect one kind of large object?

I am really confused about the difference between these two repos, even with the exact same yolov3-voc.cfg.

Any suggestions are most welcome!
Thanks in advance!

@AlexeyAB
Owner

AlexeyAB commented Jun 9, 2018

@yyuzhongpv

I can get very good precision and recall (90%) on almost all of the objects.

The main question is: what implementation of the precision and recall calculation do you use? Most implementations are totally wrong.

A. With Pjreddie's darknet, after training, I can't detect anything with the command detector. However, If I use the detector in AlexeyAB's repo, I can detect all of objects.

B. With AlexeyAB's repo, after training nearly 30000 iterations, I got below log. While testing with the latest weight with AlexeyAB's detector, I can only detect the large object. With Pjreddie's detector, I detect nothing.
'''
30376: nan, nan avg loss, 0.001000 rate, 2.594721 seconds, 1944064 images

  • What mAP do you get in this case?

  • As I see, avg loss is NaN, so training went wrong.

  • Since you get bad results on both repos, Joseph's and mine, I think you are doing something wrong, or your dataset is broken.

  • Attach your cfg-file.

  • What parameters did you use in the Makefile for both repositories?

  • In the last commits I added some fixes that reject bad labels or stop training if you use inconsistent labels and cfg-files, because ~80% of issues are due to an incorrect dataset.

  • Do you get the files bad_labels.list and bad.list after training, in the same directory as ./darknet?
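Since a NaN or inf loss means every later iteration is wasted, it can be useful to find the first iteration where it appears. The sketch below scans summary lines of the form shown in this thread (`<iter>: <loss>, <avg> avg ...`); the regular expression and function name are assumptions made for illustration, and the sample lines are copied from the logs posted here.

```python
import math
import re

# Matches e.g. "82: 23.580215, 49.005016 avg loss, 0.001000 rate, ..."
LOSS_LINE = re.compile(r"^\s*(\d+):\s*(\S+),\s*(\S+)\s+avg")


def first_bad_iteration(lines):
    """Return the first iteration whose loss or average loss is not finite."""
    for line in lines:
        m = LOSS_LINE.match(line)
        if not m:
            continue  # "Region ..." and "Loaded: ..." lines are skipped
        loss, avg = float(m.group(2)), float(m.group(3))
        if not (math.isfinite(loss) and math.isfinite(avg)):
            return int(m.group(1))
    return None


sample = [
    "82: 23.580215, 49.005016 avg loss, 0.001000 rate, 4.531976 seconds, 5248 images",
    "Loaded: 0.000088 seconds",
    "Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000292, .5R: -nan, .75R: -nan,  count: 0",
    " 83: inf, inf avg loss, 0.001000 rate, 4.540668 seconds, 5312 images",
    "30376: nan, nan avg loss, 0.001000 rate, 2.594721 seconds, 1944064 images",
]
print(first_bad_iteration(sample))
```

In practice the lines would come from a saved training log, for example `first_bad_iteration(open(path))` with a path of your own.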

@yyuzhongpv
Author

yyuzhongpv commented Jun 10, 2018

Thanks Alexey!

The main question is - what implementation of calculation of precision and recall do you use? The most of implementations are totally wrong.

We implemented the calculation of precision and recall ourselves. For each predicted bounding box, I compute the IoU with the ground truth; a prediction whose IoU exceeds the threshold (0.5) counts as a TP, and a predicted box with no overlap with the ground truth counts as an FP. Precision = (sum of TP over all test images) / (sum of TP over all test images + sum of FP over all test images). Recall is computed similarly.

On the 1080 GPU, I already checked the test output manually by drawing the predicted bounding boxes on the test images and going through them. They were all very close to the ground truth, so I assume the calculation of precision and recall is not a big problem.

The key problem is that with Joseph's darknet on the V100, after training, ./darknet detector test ... detects nothing in my test images.

Cfg file for both repos. I only made small changes to yolov3-voc.cfg.
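For comparison, here is a minimal sketch of one common way to count TP/FP/FN at an IoU threshold. It is not the code used in this thread: it assumes corner-format boxes (x1, y1, x2, y2), requires the class to match, and lets each ground-truth box be matched at most once, so duplicate detections count as false positives. All names are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(preds, truths, thresh=0.5):
    """preds: [(confidence, cls, box)], truths: [(cls, box)]. Returns (tp, fp, fn)."""
    used = set()
    tp = fp = 0
    # Most confident predictions get first pick of the ground-truth boxes.
    for _conf, cls, box in sorted(preds, key=lambda p: -p[0]):
        best, best_iou = None, thresh
        for i, (tcls, tbox) in enumerate(truths):
            if i in used or tcls != cls:
                continue
            v = iou(box, tbox)
            if v > best_iou:
                best, best_iou = i, v
        if best is None:
            fp += 1
        else:
            used.add(best)
            tp += 1
    return tp, fp, len(truths) - len(used)


truths = [(0, (0, 0, 10, 10)), (1, (20, 20, 30, 30))]
preds = [
    (0.9, 0, (0, 0, 10, 10)),    # correct detection
    (0.8, 0, (1, 1, 10, 10)),    # duplicate of the same object
    (0.7, 1, (50, 50, 60, 60)),  # no overlap with any ground truth
]
tp, fp, fn = count_matches(preds, truths)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, precision, recall)
```

Summing `tp`, `fp` and `fn` over all test images before dividing gives dataset-level precision and recall; results depend on the confidence threshold used to produce `preds`.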

diff /mnt/test/xxx_WS/yolo.cfg ../darknet-official0605/cfg/yolov3-voc.cfg 
3,4c3,4
< # batch=1
< # subdivisions=1
---
>  batch=1
>  subdivisions=1
6,7c6,7
< batch=64
< subdivisions=16
---
> # batch=64
> # subdivisions=16
605c605
< filters=33
---
> filters=75
611c611
< classes=6
---
> classes=20
689c689
< filters=33
---
> filters=75
695c695
< classes=6
---
> classes=20
773c773
< filters=33
---
> filters=75
779c779
< classes=6
---
> classes=20
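The diff changes `filters` from 75 to 33 together with `classes` from 20 to 6. That is consistent with the YOLOv3 rule that the convolutional layer before each [yolo] layer needs (classes + 5) outputs per anchor, with 3 anchors per scale. A one-function sketch (the function name is hypothetical):

```python
def yolo_filters(classes, anchors_per_scale=3):
    """Filters for the conv layer feeding a [yolo] layer:
    4 box coordinates + 1 objectness + one score per class, for each anchor."""
    return (classes + 5) * anchors_per_scale


print(yolo_filters(20))  # stock yolov3-voc.cfg
print(yolo_filters(6))   # the 6-class cfg in the diff above
```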

Makefile of Joseph's darknet. I only changed the options in the header.

GPU=1
CUDNN=1
OPENCV=1
OPENMP=1
DEBUG=0

ARCH= -gencode arch=compute_30,code=sm_30 \
      -gencode arch=compute_35,code=sm_35 \
      -gencode arch=compute_50,code=[sm_50,compute_50] \
      -gencode arch=compute_52,code=[sm_52,compute_52]
#      -gencode arch=compute_20,code=[sm_20,sm_21] \ This one is deprecated?

# This is what I use, uncomment if you know your arch and want to specify
#ARCH= -gencode arch=compute_52,code=compute_52

VPATH=./src/:./examples
SLIB=libdarknet.so
ALIB=libdarknet.a
EXEC=darknet
OBJDIR=./obj/

CC=gcc
NVCC=nvcc
AR=ar
ARFLAGS=rcs
OPTS=-Ofast
LDFLAGS= -lm -pthread
COMMON= -Iinclude/ -Isrc/
CFLAGS=-Wall -Wno-unused-result -Wno-unknown-pragmas -Wfatal-errors -fPIC

Training command:

/home/yyuzhong/darknet-official0605/darknet detector train /mnt/test/xxxx_WS/yolo.data /mnt/test/xxxx_WS/yolo.cfg darknet53.conv.74 -dont_show -gpus 0

The training log of Joseph's darknet:

273: 39.363007, 41.344463 avg, 0.000006 rate, 2.329178 seconds, 17472 images
Loaded: 0.000065 seconds
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.007486, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.305817, Class: 0.432831, Obj: 0.007301, No Obj: 0.003944, .5R: 0.200000, .75R: 0.100000,  count: 10
Region 106 Avg IOU: 0.282727, Class: 0.417738, Obj: 0.008525, No Obj: 0.002046, .5R: 0.120690, .75R: 0.000000,  count: 58
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.007615, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.309755, Class: 0.432035, Obj: 0.011104, No Obj: 0.004084, .5R: 0.100000, .75R: 0.000000,  count: 10
Region 106 Avg IOU: 0.298968, Class: 0.503030, Obj: 0.008877, No Obj: 0.002170, .5R: 0.155556, .75R: 0.000000,  count: 45
Region 82 Avg IOU: 0.326712, Class: 0.348199, Obj: 0.011348, No Obj: 0.007550, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.263263, Class: 0.438575, Obj: 0.002804, No Obj: 0.003932, .5R: 0.000000, .75R: 0.000000,  count: 6
Region 106 Avg IOU: 0.282571, Class: 0.449923, Obj: 0.007379, No Obj: 0.002098, .5R: 0.116667, .75R: 0.000000,  count: 60
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.007522, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.286575, Class: 0.441582, Obj: 0.005456, No Obj: 0.004115, .5R: 0.000000, .75R: 0.000000,  count: 6
Region 106 Avg IOU: 0.338090, Class: 0.500828, Obj: 0.007312, No Obj: 0.002256, .5R: 0.200000, .75R: 0.000000,  count: 30
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.007382, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.298540, Class: 0.380783, Obj: 0.008436, No Obj: 0.004021, .5R: 0.300000, .75R: 0.000000,  count: 10
Region 106 Avg IOU: 0.295573, Class: 0.482574, Obj: 0.007328, No Obj: 0.002103, .5R: 0.151515, .75R: 0.015152,  count: 66
Region 82 Avg IOU: 0.260993, Class: 0.461224, Obj: 0.011011, No Obj: 0.007552, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.185033, Class: 0.445341, Obj: 0.005748, No Obj: 0.004036, .5R: 0.000000, .75R: 0.000000,  count: 7
Region 106 Avg IOU: 0.319298, Class: 0.481619, Obj: 0.006593, No Obj: 0.002172, .5R: 0.186047, .75R: 0.000000,  count: 43
Region 82 Avg IOU: 0.145409, Class: 0.610898, Obj: 0.010052, No Obj: 0.007521, .5R: 0.000000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.216485, Class: 0.421778, Obj: 0.003331, No Obj: 0.004207, .5R: 0.000000, .75R: 0.000000,  count: 12
Region 106 Avg IOU: 0.347528, Class: 0.479033, Obj: 0.006581, No Obj: 0.002305, .5R: 0.121951, .75R: 0.024390,  count: 41
Region 82 Avg IOU: 0.173740, Class: 0.442087, Obj: 0.004729, No Obj: 0.007409, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.300622, Class: 0.421547, Obj: 0.005921, No Obj: 0.004037, .5R: 0.125000, .75R: 0.000000,  count: 8
Region 106 Avg IOU: 0.292677, Class: 0.426323, Obj: 0.005118, No Obj: 0.002244, .5R: 0.170213, .75R: 0.000000,  count: 47
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.007638, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.199439, Class: 0.424072, Obj: 0.004754, No Obj: 0.004125, .5R: 0.000000, .75R: 0.000000,  count: 8
Region 106 Avg IOU: 0.323963, Class: 0.436944, Obj: 0.006814, No Obj: 0.002223, .5R: 0.183673, .75R: 0.020408,  count: 49
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.007709, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.317504, Class: 0.444540, Obj: 0.006530, No Obj: 0.004090, .5R: 0.166667, .75R: 0.000000,  count: 12
Region 106 Avg IOU: 0.365333, Class: 0.453353, Obj: 0.007359, No Obj: 0.002256, .5R: 0.276596, .75R: 0.042553,  count: 47
Region 82 Avg IOU: 0.057973, Class: 0.675667, Obj: 0.016377, No Obj: 0.007599, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.381708, Class: 0.382659, Obj: 0.005441, No Obj: 0.004044, .5R: 0.333333, .75R: 0.000000,  count: 9
Region 106 Avg IOU: 0.297493, Class: 0.423628, Obj: 0.005643, No Obj: 0.002229, .5R: 0.120000, .75R: 0.020000,  count: 50
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.007638, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.309132, Class: 0.340190, Obj: 0.005039, No Obj: 0.003864, .5R: 0.142857, .75R: 0.000000,  count: 7
Region 106 Avg IOU: 0.276905, Class: 0.450940, Obj: 0.009095, No Obj: 0.002096, .5R: 0.078431, .75R: 0.019608,  count: 51
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.007480, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.321556, Class: 0.358655, Obj: 0.005864, No Obj: 0.004140, .5R: 0.285714, .75R: 0.000000,  count: 7
Region 106 Avg IOU: 0.339810, Class: 0.425059, Obj: 0.007491, No Obj: 0.002207, .5R: 0.170213, .75R: 0.000000,  count: 47
......

648: 7.944275, 8.417584 avg, 0.000176 rate, 4.588053 seconds, 41472 images
Loaded: 0.000081 seconds
Region 82 Avg IOU: 0.652460, Class: 0.996889, Obj: 0.455688, No Obj: 0.002242, .5R: 0.500000, .75R: 0.500000,  count: 2
Region 94 Avg IOU: 0.524300, Class: 0.878305, Obj: 0.438937, No Obj: 0.001154, .5R: 0.555556, .75R: 0.000000,  count: 9
Region 106 Avg IOU: 0.596807, Class: 0.874502, Obj: 0.726350, No Obj: 0.000996, .5R: 0.812500, .75R: 0.125000,  count: 32
Region 82 Avg IOU: 0.560426, Class: 0.998632, Obj: 0.661910, No Obj: 0.000585, .5R: 1.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.655807, Class: 0.985865, Obj: 0.713722, No Obj: 0.001311, .5R: 0.833333, .75R: 0.333333,  count: 12
Region 106 Avg IOU: 0.623416, Class: 0.823355, Obj: 0.525871, No Obj: 0.001481, .5R: 0.813559, .75R: 0.220339,  count: 59
Region 82 Avg IOU: 0.748478, Class: 0.992079, Obj: 0.331865, No Obj: 0.001283, .5R: 1.000000, .75R: 0.500000,  count: 2
Region 94 Avg IOU: 0.569297, Class: 0.894295, Obj: 0.268277, No Obj: 0.001745, .5R: 0.684211, .75R: 0.052632,  count: 19
Region 106 Avg IOU: 0.617301, Class: 0.895878, Obj: 0.561171, No Obj: 0.001280, .5R: 0.864865, .75R: 0.189189,  count: 37
Region 82 Avg IOU: 0.617124, Class: 0.993862, Obj: 0.500197, No Obj: 0.002522, .5R: 1.000000, .75R: 0.000000,  count: 4
Region 94 Avg IOU: 0.601353, Class: 0.904737, Obj: 0.547848, No Obj: 0.002094, .5R: 0.863636, .75R: 0.090909,  count: 22
Region 106 Avg IOU: 0.648495, Class: 0.896536, Obj: 0.653840, No Obj: 0.001114, .5R: 0.939394, .75R: 0.212121,  count: 33
Region 82 Avg IOU: 0.835922, Class: 0.989319, Obj: 0.367180, No Obj: 0.001742, .5R: 1.000000, .75R: 1.000000,  count: 2
Region 94 Avg IOU: 0.563951, Class: 0.934366, Obj: 0.484253, No Obj: 0.001929, .5R: 0.736842, .75R: 0.157895,  count: 19
Region 106 Avg IOU: 0.661426, Class: 0.964040, Obj: 0.799646, No Obj: 0.001123, .5R: 0.925926, .75R: 0.296296,  count: 27
Region 82 Avg IOU: 0.539442, Class: 0.985715, Obj: 0.197194, No Obj: 0.001021, .5R: 0.333333, .75R: 0.000000,  count: 3
Region 94 Avg IOU: 0.579014, Class: 0.798964, Obj: 0.468050, No Obj: 0.001925, .5R: 0.687500, .75R: 0.250000,  count: 16
Region 106 Avg IOU: 0.622752, Class: 0.871247, Obj: 0.468912, No Obj: 0.001483, .5R: 0.830189, .75R: 0.188679,  count: 53
Region 82 Avg IOU: 0.725466, Class: 0.996516, Obj: 0.311251, No Obj: 0.002156, .5R: 1.000000, .75R: 0.666667,  count: 3
Region 94 Avg IOU: 0.479724, Class: 0.908354, Obj: 0.671701, No Obj: 0.001487, .5R: 0.500000, .75R: 0.000000,  count: 12
Region 106 Avg IOU: 0.608657, Class: 0.753718, Obj: 0.550959, No Obj: 0.000696, .5R: 0.678571, .75R: 0.250000,  count: 28
Region 82 Avg IOU: 0.613398, Class: 0.998283, Obj: 0.485302, No Obj: 0.001676, .5R: 1.000000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.550962, Class: 0.885224, Obj: 0.341467, No Obj: 0.001789, .5R: 0.636364, .75R: 0.045455,  count: 22
Region 106 Avg IOU: 0.638456, Class: 0.875538, Obj: 0.697060, No Obj: 0.001186, .5R: 0.833333, .75R: 0.194444,  count: 36
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000377, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.654423, Class: 0.962119, Obj: 0.541582, No Obj: 0.001542, .5R: 0.727273, .75R: 0.272727,  count: 11


Makefile of Alexey's darknet. I only changed the options in the header and set ARCH to support the V100.

GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1
AVX=1
OPENMP=1
LIBSO=1

# set GPU=1 and CUDNN=1 to speedup on GPU
# set CUDNN_HALF=1 to further speedup 3 x times (Mixed-precision using Tensor Cores) on GPU Tesla V100, Titan V, DGX-2
# set AVX=1 and OPENMP=1 to speedup on CPU (if error occurs then set AVX=0)

DEBUG=0

ARCH= -gencode arch=compute_30,code=sm_30 \
      -gencode arch=compute_35,code=sm_35 \
      -gencode arch=compute_50,code=[sm_50,compute_50] \
      -gencode arch=compute_52,code=[sm_52,compute_52] \
      -gencode arch=compute_61,code=[sm_61,compute_61]

OS := $(shell uname)

# Tesla V100
ARCH= -gencode arch=compute_70,code=[sm_70,compute_70]

Training log of Alexey's darknet. The nan avg loss shows after iteration 84

82: 23.580215, 49.005016 avg loss, 0.001000 rate, 4.531976 seconds, 5248 images
Loaded: 0.000088 seconds
Region 82 Avg IOU: 0.676631, Class: 0.657015, Obj: 0.013579, No Obj: 0.000307, .5R: 1.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.611994, Class: 0.578710, Obj: 0.002161, No Obj: 0.000327, .5R: 0.928571, .75R: 0.142857,  count: 14
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 31
Region 82 Avg IOU: 0.715527, Class: 0.669567, Obj: 0.005944, No Obj: 0.000333, .5R: 1.000000, .75R: 0.500000,  count: 2
Region 94 Avg IOU: 0.563441, Class: 0.764334, Obj: 0.006268, No Obj: 0.000308, .5R: 0.833333, .75R: 0.000000,  count: 6
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 60
Region 82 Avg IOU: 0.613072, Class: 0.618635, Obj: 0.008895, No Obj: 0.000318, .5R: 1.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.618610, Class: 0.617791, Obj: 0.003350, No Obj: 0.000328, .5R: 0.666667, .75R: 0.222222,  count: 9
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 20
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000292, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.555021, Class: 0.602331, Obj: 0.002210, No Obj: 0.000284, .5R: 1.000000, .75R: 0.000000,  count: 2
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 11
Region 82 Avg IOU: 0.567027, Class: 0.626732, Obj: 0.012662, No Obj: 0.000445, .5R: 0.750000, .75R: 0.000000,  count: 4
Region 94 Avg IOU: 0.584985, Class: 0.517245, Obj: 0.001348, No Obj: 0.000330, .5R: 1.000000, .75R: 0.000000,  count: 11
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 34
Region 82 Avg IOU: 0.389180, Class: 0.619162, Obj: 0.004810, No Obj: 0.000252, .5R: 0.000000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.512456, Class: 0.524628, Obj: 0.001560, No Obj: 0.000342, .5R: 0.500000, .75R: 0.000000,  count: 18
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 24
Region 82 Avg IOU: 0.499203, Class: 0.634386, Obj: 0.003762, No Obj: 0.000197, .5R: 0.500000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.566642, Class: 0.652625, Obj: 0.004396, No Obj: 0.000363, .5R: 0.700000, .75R: 0.000000,  count: 10
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 47
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000222, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.574259, Class: 0.686245, Obj: 0.004401, No Obj: 0.000351, .5R: 0.818182, .75R: 0.000000,  count: 11
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 39
Region 82 Avg IOU: 0.454278, Class: 0.650282, Obj: 0.010042, No Obj: 0.000290, .5R: 0.000000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.613155, Class: 0.635823, Obj: 0.004758, No Obj: 0.000443, .5R: 0.875000, .75R: 0.000000,  count: 8
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 40
Region 82 Avg IOU: 0.541684, Class: 0.626609, Obj: 0.009354, No Obj: 0.000228, .5R: 1.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.645020, Class: 0.512855, Obj: 0.000979, No Obj: 0.000298, .5R: 1.000000, .75R: 0.000000,  count: 2
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 5
Region 82 Avg IOU: 0.547379, Class: 0.628108, Obj: 0.019025, No Obj: 0.000254, .5R: 0.500000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.687717, Class: 0.591308, Obj: 0.005402, No Obj: 0.000321, .5R: 1.000000, .75R: 0.166667,  count: 6
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 30
Region 82 Avg IOU: 0.624384, Class: 0.635476, Obj: 0.003742, No Obj: 0.000256, .5R: 1.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.563145, Class: 0.719392, Obj: 0.005922, No Obj: 0.000365, .5R: 1.000000, .75R: 0.000000,  count: 6
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 39
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000173, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.573622, Class: 0.640934, Obj: 0.004082, No Obj: 0.000380, .5R: 1.000000, .75R: 0.000000,  count: 10
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 25
Region 82 Avg IOU: 0.585302, Class: 0.625714, Obj: 0.013245, No Obj: 0.000237, .5R: 1.000000, .75R: 0.000000,  count: 3
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 39
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000173, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.573622, Class: 0.640934, Obj: 0.004082, No Obj: 0.000380, .5R: 1.000000, .75R: 0.000000,  count: 10
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 25
Region 82 Avg IOU: 0.585302, Class: 0.625714, Obj: 0.013245, No Obj: 0.000237, .5R: 1.000000, .75R: 0.000000,  count: 3
Region 94 Avg IOU: 0.556830, Class: 0.647776, Obj: 0.001029, No Obj: 0.000333, .5R: 0.500000, .75R: 0.000000,  count: 6
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 34
Region 82 Avg IOU: 0.632286, Class: 0.638544, Obj: 0.012061, No Obj: 0.000321, .5R: 0.750000, .75R: 0.250000,  count: 4
Region 94 Avg IOU: 0.647450, Class: 0.583517, Obj: 0.002508, No Obj: 0.000392, .5R: 0.714286, .75R: 0.142857,  count: 7
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 41
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000250, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.395261, Class: 0.764918, Obj: 0.005050, No Obj: 0.000304, .5R: 0.000000, .75R: 0.000000,  count: 4
Region 106 Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 21

 83: inf, inf avg loss, 0.001000 rate, 4.540668 seconds, 5312 images
Loaded: 0.000146 seconds
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000453, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.621627, Class: 0.755543, Obj: 0.016679, No Obj: 0.000547, .5R: 1.000000, .75R: 0.000000,  count: 2
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 13
Region 82 Avg IOU: 0.716994, Class: 0.629274, Obj: 0.010065, No Obj: 0.000501, .5R: 1.000000, .75R: 0.500000,  count: 2
Region 94 Avg IOU: 0.612688, Class: 0.666641, Obj: 0.005844, No Obj: 0.000384, .5R: 0.666667, .75R: 0.333333,  count: 6
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 40
Region 82 Avg IOU: 0.479883, Class: 0.592741, Obj: 0.004319, No Obj: 0.000646, .5R: 0.000000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.560601, Class: 0.579608, Obj: 0.001256, No Obj: 0.000300, .5R: 0.750000, .75R: 0.000000,  count: 8
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 28
Region 82 Avg IOU: 0.421037, Class: 0.615552, Obj: 0.025422, No Obj: 0.000828, .5R: 0.000000, .75R: 0.000000,  count: 4
Region 94 Avg IOU: 0.565889, Class: 0.533750, Obj: 0.000729, No Obj: 0.000305, .5R: 0.791667, .75R: 0.000000,  count: 24
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 38
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000569, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.645602, Class: 0.613226, Obj: 0.003775, No Obj: 0.000436, .5R: 0.888889, .75R: 0.000000,  count: 9
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 24
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000478, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.572421, Class: 0.593322, Obj: 0.004701, No Obj: 0.000468, .5R: 0.700000, .75R: 0.000000,  count: 10
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 32
Region 82 Avg IOU: 0.641817, Class: 0.635031, Obj: 0.035640, No Obj: 0.000784, .5R: 1.000000, .75R: 0.000000,  count: 1
...

Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000356, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.601235, Class: 0.657700, Obj: 0.003757, No Obj: 0.000395, .5R: 0.900000, .75R: 0.000000,  count: 10
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 38
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000509, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.612902, Class: 0.762554, Obj: 0.008244, No Obj: 0.000359, .5R: 0.750000, .75R: 0.250000,  count: 4
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 30
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000631, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.559490, Class: 0.648356, Obj: 0.006230, No Obj: 0.000325, .5R: 0.833333, .75R: 0.000000,  count: 6
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 20
Region 82 Avg IOU: 0.533837, Class: 0.627053, Obj: 0.018822, No Obj: 0.000553, .5R: 1.000000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.510067, Class: 0.638865, Obj: 0.002965, No Obj: 0.000325, .5R: 0.583333, .75R: 0.000000,  count: 12
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 38
Region 82 Avg IOU: 0.504176, Class: 0.630334, Obj: 0.015104, No Obj: 0.000486, .5R: 0.500000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.595772, Class: 0.559108, Obj: 0.001982, No Obj: 0.000340, .5R: 0.894737, .75R: 0.000000,  count: 19
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 24
Region 82 Avg IOU: 0.581129, Class: 0.620991, Obj: 0.015682, No Obj: 0.000537, .5R: 1.000000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.545876, Class: 0.631375, Obj: 0.002244, No Obj: 0.000349, .5R: 0.727273, .75R: 0.000000,  count: 11
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 40
Region 82 Avg IOU: 0.550739, Class: 0.615534, Obj: 0.017859, No Obj: 0.000651, .5R: 0.500000, .75R: 0.250000,  count: 4
Region 94 Avg IOU: 0.582396, Class: 0.528876, Obj: 0.001154, No Obj: 0.000286, .5R: 0.900000, .75R: 0.000000,  count: 10
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 38

 84: nan, nan avg loss, 0.001000 rate, 4.538573 seconds, 5376 images
Loaded: 0.000163 seconds
Region 82 Avg IOU: 0.605543, Class: 0.617683, Obj: 0.022642, No Obj: 0.000854, .5R: 0.666667, .75R: 0.000000,  count: 3
Region 94 Avg IOU: 0.571682, Class: 0.510673, Obj: 0.001435, No Obj: 0.000364, .5R: 0.833333, .75R: 0.000000,  count: 18
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 22
Region 82 Avg IOU: 0.549848, Class: 0.628538, Obj: 0.011007, No Obj: 0.000708, .5R: 0.500000, .75R: 0.000000,  count: 2
Region 94 Avg IOU: 0.573516, Class: 0.625826, Obj: 0.003037, No Obj: 0.000344, .5R: 0.750000, .75R: 0.062500,  count: 16
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 55
Region 82 Avg IOU: 0.634130, Class: 0.621600, Obj: 0.067672, No Obj: 0.000919, .5R: 1.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.543719, Class: 0.719193, Obj: 0.011762, No Obj: 0.000508, .5R: 0.666667, .75R: 0.000000,  count: 6
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 41
Region 82 Avg IOU: 0.516325, Class: 0.626551, Obj: 0.032900, No Obj: 0.000898, .5R: 0.333333, .75R: 0.000000,  count: 3
Region 94 Avg IOU: 0.601918, Class: 0.444663, Obj: 0.000872, No Obj: 0.000315, .5R: 1.000000, .75R: 0.000000,  count: 6
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000,  count: 14
Region 82 Avg IOU: 0.548722, Class: 0.646741, Obj: 0.014310, No Obj: 0.000774, .5R: 1.000000, .75R: 0.000000,  count: 1
Region 94 Avg IOU: 0.518356, Class: 0.730058, Obj: 0.006983, No Obj: 0.000498, .5R: 0.571429, .75R: 0.000000,  count: 7
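
For anyone debugging a similar run: the iteration at which the loss diverges can be located mechanically from a log like the one above. This is a minimal Python sketch, assuming the iteration summary lines keep the `N: loss, avg avg loss, ...` shape seen in this log. The name `first_nonfinite_iteration` is hypothetical and not part of Darknet.

```python
import math
import re

# Matches iteration summary lines such as
# " 83: inf, inf avg loss, 0.001000 rate, 4.540668 seconds, 5312 images"
SUMMARY = re.compile(r"^\s*(\d+):\s*(\S+),\s*(\S+) avg loss,")

def first_nonfinite_iteration(log_text):
    """Return the first iteration whose loss or average loss is inf/nan, else None."""
    for line in log_text.splitlines():
        m = SUMMARY.match(line)
        if not m:
            continue  # skip "Region ..." and "Loaded: ..." lines
        loss = float(m.group(2))      # float() accepts "inf", "nan" and "-nan"
        avg_loss = float(m.group(3))
        if not (math.isfinite(loss) and math.isfinite(avg_loss)):
            return int(m.group(1))
    return None

sample_log = """\
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000250, .5R: -nan, .75R: -nan,  count: 0
 83: inf, inf avg loss, 0.001000 rate, 4.540668 seconds, 5312 images
Loaded: 0.000146 seconds
 84: nan, nan avg loss, 0.001000 rate, 4.538573 seconds, 5376 images
"""
print(first_nonfinite_iteration(sample_log))
```

Note that the `-nan` values on `Region` lines with `count: 0` are skipped on purpose: those lines also appear in the healthy logs later in this thread, so only the per-iteration summary line is checked.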


@AlexeyAB
Owner

@yyuzhongpv

  1. Update your code to the latest version of this repository.

  2. Recalculate the anchors:
    ./darknet detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416
    Set these anchors in each of the 3 [yolo] layers, and post them here.

  3. Train using my repo:

    • Build with CUDNN_HALF=0 (run make) and train the first 1000 iterations.

    • After 1000 iterations, set CUDNN_HALF=1, rebuild (run make) and continue training from yolo_1000.weights. Does nan still occur?
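
Step 2 asks for the same anchors in every [yolo] section of the cfg file. A minimal Python sketch of that edit follows, assuming the cfg uses the usual `[section]` headers and one `anchors = ...` line per [yolo] section. The helper name `set_yolo_anchors` and the sample cfg text are hypothetical, for illustration only.

```python
def set_yolo_anchors(cfg_text, anchors):
    """Replace the anchors line inside every [yolo] section of a Darknet cfg."""
    out = []
    section = None
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            section = stripped  # entering a new section, e.g. [convolutional] or [yolo]
        if section == "[yolo]" and stripped.replace(" ", "").startswith("anchors="):
            line = "anchors = " + anchors
        out.append(line)
    return "\n".join(out) + "\n"

# Tiny made-up cfg fragment with two [yolo] sections.
sample_cfg = (
    "[net]\nwidth=416\nheight=416\n"
    "[yolo]\nmask = 6,7,8\nanchors = 10,13, 16,30\n"
    "[convolutional]\nfilters=33\n"
    "[yolo]\nmask = 3,4,5\nanchors=10,13, 16,30\n"
)
print(set_yolo_anchors(sample_cfg, "1,2, 3,4"))
```

Only the anchors lines change; mask, classes and filters are left untouched, so they still have to match your dataset.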

@yyuzhongpv
Author

yyuzhongpv commented Jun 11, 2018

@AlexeyAB
Hello Alexey,

I followed your instructions and got these results.

  1. I used today's code.

  2. The anchors:
    anchors = 12.5214,14.6001, 17.5892,18.6171, 26.0970,22.3336, 29.9592,28.2223, 48.2814,75.2532, 48.3668,199.5443, 45.4486,275.1175, 49.9200,286.9578, 76.5882,390.3690
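
To see which of these anchors each detection layer ends up using, the line can be split into (width, height) pairs and grouped by mask. This Python sketch assumes the masks of the stock yolov3 cfg files (6,7,8 for the first [yolo] layer, then 3,4,5, then 0,1,2) and the layer numbers 82, 94 and 106 that appear in the logs. The function names are hypothetical.

```python
def parse_anchors(anchor_line):
    """Turn 'anchors = w1,h1, w2,h2, ...' into a list of (w, h) float pairs."""
    values = [float(v) for v in anchor_line.split("=", 1)[1].split(",")]
    if len(values) % 2:
        raise ValueError("anchors must come in width,height pairs")
    return list(zip(values[0::2], values[1::2]))

# Assumed masks from the stock yolov3 cfg files: layer number -> anchor indices.
STOCK_MASKS = {82: (6, 7, 8), 94: (3, 4, 5), 106: (0, 1, 2)}

def anchors_per_layer(anchor_line, masks=STOCK_MASKS):
    """Group the anchor pairs by the [yolo] layer whose mask selects them."""
    pairs = parse_anchors(anchor_line)
    return {layer: [pairs[i] for i in idx] for layer, idx in masks.items()}

line = ("anchors = 12.5214,14.6001, 17.5892,18.6171, 26.0970,22.3336, "
        "29.9592,28.2223, 48.2814,75.2532, 48.3668,199.5443, "
        "45.4486,275.1175, 49.9200,286.9578, 76.5882,390.3690")
for layer, pairs in anchors_per_layer(line).items():
    print(layer, pairs)
```

Under that assumption, layer 106 gets the three smallest anchors and layer 82 the three largest.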

A. The training log for the first 1000 iterations is shown below. With yolo_1000.weights (CUDNN_HALF=0), I can detect objects using the detector in your repo.


1016: 3.586827, 3.088684 avg loss, 0.001000 rate, 1.378484 seconds, 65024 images
Loaded: 0.000074 seconds
Region 82 Avg IOU: 0.813881, Class: 0.999278, Obj: 0.645282, No Obj: 0.003569, .5R: 1.000000, .75R: 0.750000,  count: 4
Region 94 Avg IOU: 0.723154, Class: 0.847382, Obj: 0.649724, No Obj: 0.001934, .5R: 1.000000, .75R: 0.333333,  count: 3
Region 106 Avg IOU: 0.655643, Class: 0.970920, Obj: 0.729974, No Obj: 0.004603, .5R: 0.952381, .75R: 0.142857,  count: 42
Region 82 Avg IOU: 0.809238, Class: 0.999405, Obj: 0.361409, No Obj: 0.001196, .5R: 1.000000, .75R: 1.000000,  count: 2
Region 94 Avg IOU: 0.770336, Class: 0.913955, Obj: 0.556664, No Obj: 0.001484, .5R: 1.000000, .75R: 0.750000,  count: 4
Region 106 Avg IOU: 0.586111, Class: 0.768945, Obj: 0.601715, No Obj: 0.003149, .5R: 0.857143, .75R: 0.142857,  count: 28
Region 82 Avg IOU: 0.779693, Class: 0.999894, Obj: 0.690659, No Obj: 0.003170, .5R: 1.000000, .75R: 0.500000,  count: 2
Region 94 Avg IOU: 0.693006, Class: 0.975745, Obj: 0.582555, No Obj: 0.002297, .5R: 1.000000, .75R: 0.000000,  count: 5
Region 106 Avg IOU: 0.640808, Class: 0.922579, Obj: 0.637060, No Obj: 0.004327, .5R: 0.783784, .75R: 0.189189,  count: 37
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.001053, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.802638, Class: 0.925374, Obj: 0.509822, No Obj: 0.003300, .5R: 1.000000, .75R: 0.500000,  count: 6
Region 106 Avg IOU: 0.616160, Class: 0.886506, Obj: 0.701310, No Obj: 0.005053, .5R: 0.750000, .75R: 0.227273,  count: 44
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000327, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.754958, Class: 0.963965, Obj: 0.365209, No Obj: 0.001992, .5R: 1.000000, .75R: 0.400000,  count: 5
Region 106 Avg IOU: 0.489382, Class: 0.795860, Obj: 0.758662, No Obj: 0.002849, .5R: 0.520000, .75R: 0.160000,  count: 25
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000037, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.729015, Class: 0.967462, Obj: 0.717134, No Obj: 0.001929, .5R: 0.750000, .75R: 0.500000,  count: 4
Region 106 Avg IOU: 0.647692, Class: 0.977685, Obj: 0.846630, No Obj: 0.003124, .5R: 0.851852, .75R: 0.259259,  count: 27
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000076, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.739086, Class: 0.999300, Obj: 0.972048, No Obj: 0.001930, .5R: 1.000000, .75R: 0.250000,  count: 4
Region 106 Avg IOU: 0.661888, Class: 0.979235, Obj: 0.794365, No Obj: 0.003189, .5R: 0.793103, .75R: 0.344828,  count: 29
Region 82 Avg IOU: 0.704681, Class: 0.995547, Obj: 0.387201, No Obj: 0.004917, .5R: 0.875000, .75R: 0.375000,  count: 8
Region 94 Avg IOU: 0.774302, Class: 0.991398, Obj: 0.277391, No Obj: 0.001359, .5R: 1.000000, .75R: 1.000000,  count: 1
Region 106 Avg IOU: 0.678386, Class: 0.990535, Obj: 0.612305, No Obj: 0.004908, .5R: 0.953488, .75R: 0.325581,  count: 43
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000071, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.714987, Class: 0.998025, Obj: 0.877913, No Obj: 0.001984, .5R: 1.000000, .75R: 0.500000,  count: 4
Region 106 Avg IOU: 0.586376, Class: 0.984365, Obj: 0.728304, No Obj: 0.003214, .5R: 0.785714, .75R: 0.000000,  count: 28
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000212, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.761933, Class: 0.997826, Obj: 0.740437, No Obj: 0.002768, .5R: 1.000000, .75R: 0.833333,  count: 6
Region 106 Avg IOU: 0.660410, Class: 0.938265, Obj: 0.793664, No Obj: 0.004022, .5R: 0.925000, .75R: 0.225000,  count: 40
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000197, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.689387, Class: 0.998265, Obj: 0.673457, No Obj: 0.002302, .5R: 1.000000, .75R: 0.250000,  count: 4
Region 106 Avg IOU: 0.569088, Class: 0.866124, Obj: 0.653784, No Obj: 0.004086, .5R: 0.769231, .75R: 0.153846,  count: 26
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000461, .5R: -nan, .75R: -nan,  count: 0

B. The training log after 1000 iterations is shown below. With yolo_2000.weights (CUDNN_HALF=0), I can detect objects using the detector in your repo. However, I still detect nothing with the detector in Joseph's repo. No nan occurred during training.

2341: 2.669755, 2.127507 avg loss, 0.001000 rate, 2.177491 seconds, 149824 images
Loaded: 0.000085 seconds
Region 82 Avg IOU: 0.781696, Class: 0.999761, Obj: 0.038998, No Obj: 0.000233, .5R: 1.000000, .75R: 1.000000,  count: 1
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000661, .5R: -nan, .75R: -nan,  count: 0
Region 106 Avg IOU: 0.818872, Class: 0.996039, Obj: 0.468270, No Obj: 0.000803, .5R: 1.000000, .75R: 0.777778,  count: 9
Region 82 Avg IOU: 0.870273, Class: 0.999668, Obj: 0.886586, No Obj: 0.005818, .5R: 1.000000, .75R: 1.000000,  count: 3
Region 94 Avg IOU: 0.788085, Class: 0.995233, Obj: 0.742202, No Obj: 0.002847, .5R: 1.000000, .75R: 0.555556,  count: 9
Region 106 Avg IOU: 0.747890, Class: 0.932715, Obj: 0.574414, No Obj: 0.002961, .5R: 0.956522, .75R: 0.565217,  count: 46
Region 82 Avg IOU: 0.893271, Class: 0.999246, Obj: 0.804811, No Obj: 0.009647, .5R: 1.000000, .75R: 1.000000,  count: 6
Region 94 Avg IOU: 0.814228, Class: 0.999748, Obj: 0.966014, No Obj: 0.003996, .5R: 1.000000, .75R: 1.000000,  count: 9
Region 106 Avg IOU: 0.775236, Class: 0.994676, Obj: 0.735384, No Obj: 0.004661, .5R: 1.000000, .75R: 0.650794,  count: 63
Region 82 Avg IOU: 0.788876, Class: 0.999952, Obj: 0.975545, No Obj: 0.001319, .5R: 1.000000, .75R: 1.000000,  count: 1
Region 94 Avg IOU: 0.821742, Class: 0.997696, Obj: 0.524333, No Obj: 0.001495, .5R: 1.000000, .75R: 1.000000,  count: 3
Region 106 Avg IOU: 0.806486, Class: 0.997632, Obj: 0.714795, No Obj: 0.001924, .5R: 1.000000, .75R: 0.761905,  count: 21
Region 82 Avg IOU: 0.877605, Class: 0.999866, Obj: 0.989194, No Obj: 0.004142, .5R: 1.000000, .75R: 1.000000,  count: 2
Region 94 Avg IOU: 0.820618, Class: 0.998972, Obj: 0.845366, No Obj: 0.002911, .5R: 1.000000, .75R: 0.833333,  count: 6
Region 106 Avg IOU: 0.748662, Class: 0.929423, Obj: 0.712750, No Obj: 0.003052, .5R: 0.925000, .75R: 0.650000,  count: 40
Region 82 Avg IOU: 0.826724, Class: 0.999622, Obj: 0.973740, No Obj: 0.007880, .5R: 1.000000, .75R: 0.500000,  count: 4
Region 94 Avg IOU: 0.847491, Class: 0.993900, Obj: 0.740988, No Obj: 0.002937, .5R: 1.000000, .75R: 0.888889,  count: 9
Region 106 Avg IOU: 0.742122, Class: 0.975453, Obj: 0.788626, No Obj: 0.002403, .5R: 0.935484, .75R: 0.548387,  count: 31
Region 82 Avg IOU: 0.882419, Class: 0.999716, Obj: 0.861224, No Obj: 0.005413, .5R: 1.000000, .75R: 1.000000,  count: 6
Region 94 Avg IOU: 0.785805, Class: 0.984351, Obj: 0.680408, No Obj: 0.003579, .5R: 1.000000, .75R: 0.600000,  count: 10
Region 106 Avg IOU: 0.797855, Class: 0.997227, Obj: 0.757247, No Obj: 0.002562, .5R: 1.000000, .75R: 0.741935,  count: 31
Region 82 Avg IOU: 0.876814, Class: 0.999947, Obj: 0.972225, No Obj: 0.001938, .5R: 1.000000, .75R: 1.000000,  count: 2
Region 94 Avg IOU: 0.810955, Class: 0.998985, Obj: 0.599165, No Obj: 0.002519, .5R: 1.000000, .75R: 0.666667,  count: 6
Region 106 Avg IOU: 0.755856, Class: 0.946647, Obj: 0.765261, No Obj: 0.002008, .5R: 1.000000, .75R: 0.531250,  count: 32
Region 82 Avg IOU: 0.771740, Class: 0.999893, Obj: 0.968240, No Obj: 0.003285, .5R: 1.000000, .75R: 0.500000,  count: 2
Region 94 Avg IOU: 0.865066, Class: 0.999535, Obj: 0.951941, No Obj: 0.002444, .5R: 1.000000, .75R: 1.000000,  count: 4
Region 106 Avg IOU: 0.732979, Class: 0.996645, Obj: 0.798545, No Obj: 0.002100, .5R: 1.000000, .75R: 0.400000,  count: 20
Region 82 Avg IOU: 0.769176, Class: 0.999657, Obj: 0.035232, No Obj: 0.000095, .5R: 1.000000, .75R: 1.000000,  count: 1
Region 94 Avg IOU: 0.832877, Class: 0.999830, Obj: 0.949717, No Obj: 0.002328, .5R: 1.000000, .75R: 1.000000,  count: 4
Region 106 Avg IOU: 0.777802, Class: 0.998452, Obj: 0.897912, No Obj: 0.002340, .5R: 1.000000, .75R: 0.750000,  count: 28
Region 82 Avg IOU: 0.826240, Class: 0.999946, Obj: 0.936032, No Obj: 0.003600, .5R: 1.000000, .75R: 0.750000,  count: 4
Region 94 Avg IOU: 0.780202, Class: 0.998233, Obj: 0.889746, No Obj: 0.003007, .5R: 1.000000, .75R: 0.818182,  count: 11
Region 106 Avg IOU: 0.769170, Class: 0.994288, Obj: 0.831500, No Obj: 0.002834, .5R: 1.000000, .75R: 0.571429,  count: 28
Region 82 Avg IOU: 0.888917, Class: 0.999564, Obj: 0.777210, No Obj: 0.005676, .5R: 1.000000, .75R: 1.000000,  count: 4
Region 94 Avg IOU: 0.749526, Class: 0.995966, Obj: 0.675944, No Obj: 0.002745, .5R: 0.875000, .75R: 0.500000,  count: 8
Region 106 Avg IOU: 0.809502, Class: 0.920627, Obj: 0.648274, No Obj: 0.001870, .5R: 1.000000, .75R: 0.700000,  count: 20

The questions:

  1. What magic did you do in this process? CUDNN_HALF matters, right?
  2. Why does Joseph's detector still detect nothing with the same weights file that works with your detector?
  3. What do you suggest next? How can I make it work well for both training and inference?

Regards,

@AlexeyAB
Owner

  1. The first 1000 iterations are the most unstable period, so the general recommendation is to use 1 GPU and Float-32 for them. After 1000 iterations you can use multi-GPU (-gpus 0,1,2,3) and mixed precision (CUDNN_HALF=1).

  2. What width=, height= and random= values do you use?

  3. You wrote: "With yolo_2000.weights (CUDNN_HALF=0), I can detect objects using the detector in your repo." So it already works well for training/inference, or what do you mean? Just train for about 12 000 iterations and check the mAP.
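
The two-stage recipe in point 1 can be written down as an explicit command plan. This is only a Python sketch that builds the command lists without running anything. It assumes that `CUDNN_HALF` can be overridden on the make command line after a `make clean` (editing the Makefile and running make is the route described in this thread), and every path passed in below is hypothetical.

```python
def two_stage_commands(data, cfg, pretrained, stage1_weights, gpus=(0, 1, 2, 3)):
    """Return (stage1, stage2) lists of argv-style commands; nothing is executed."""
    stage1 = [
        ["make", "clean"],
        ["make", "CUDNN_HALF=0"],  # assumed command-line override of the Makefile variable
        # Single GPU, Float-32, for the unstable first ~1000 iterations.
        ["./darknet", "detector", "train", data, cfg, pretrained],
    ]
    stage2 = [
        ["make", "clean"],
        ["make", "CUDNN_HALF=1"],
        # Continue from the stage-1 weights with mixed precision and multi-GPU.
        ["./darknet", "detector", "train", data, cfg, stage1_weights,
         "-gpus", ",".join(str(g) for g in gpus)],
    ]
    return stage1, stage2

stage1, stage2 = two_stage_commands(
    "data/obj.data",             # hypothetical paths, adjust to your setup
    "cfg/yolo-obj.cfg",
    "darknet53.conv.74",
    "backup/yolo_1000.weights",
)
for cmd in stage1 + stage2:
    print(" ".join(cmd))
```

Stage 1 has to be stopped by hand (or via max_batches in the cfg) once the 1000-iteration weights file has been written.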

@yyuzhongpv
Author

@AlexeyAB
Thank you so much!

  1. Got it. I will try this two-stage training later.

  2. All default values from yolov3-voc.cfg: width=416, height=416 and random=1. I only changed classes, filters and anchors.

  3. Based on your suggestion, I trained the first 1000 iterations with CUDNN_HALF=0 and then set CUDNN_HALF=1. At that time I only checked a single image with very obvious objects in it, to see whether the weights made sense. If it detects nothing, something probably went wrong in that training.

With yolo_1000.weights, it can detect most of the objects in this single image, although the confidence is not very high (30%~90%).

With yolo_8800.weights, it can detect all of the objects with high confidence (85%+).
With yolo_8800.weights, the mAP is also much better now. I will train for more iterations (12000, as you suggested) and check the mAP again.


detections_count = 6954, unique_truth_count = 5168  
class_id = 0, name = Jointbar, 	 ap = 90.91 % 
class_id = 1, name = Bolt, 	 ap = 89.73 % 
class_id = 2, name = Hole, 	 ap = 88.50 % 
class_id = 3, name = Nut, 	 ap = 90.66 % 
class_id = 4, name = Discontinuity, 	 ap = 90.41 % 
class_id = 5, name = Crack, 	 ap = 55.96 % 
 for thresh = 0.25, precision = 0.96, recall = 0.95, F1-score = 0.96 
 for thresh = 0.25, TP = 4896, FP = 188, FN = 272, average IoU = 75.00 % 

 mean average precision (mAP) = 0.843603, or 84.36 % 
Total Detection Time: 11.000000 Seconds 
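
As a sanity check, the precision, recall and F1-score in this output follow directly from the TP/FP/FN counts. A minimal Python sketch using the standard definitions (the function name `detection_metrics` is hypothetical, not a Darknet function):

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts reported for yolo_8800.weights at thresh = 0.25.
p, r, f1 = detection_metrics(tp=4896, fp=188, fn=272)
print(f"precision = {p:.2f}, recall = {r:.2f}, F1-score = {f1:.2f}")
# prints: precision = 0.96, recall = 0.95, F1-score = 0.96
```

TP + FN = 4896 + 272 = 5168 also matches the reported unique_truth_count.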

  4. One last question: why is Joseph's darknet not working well on the V100 now? It worked well on my 1080 GPU before. Are there any further tests you suggest I try?

Regards.

@AlexeyAB
Owner

  3. That is a good result:
 for thresh = 0.25, precision = 0.96, recall = 0.95, F1-score = 0.96 
 for thresh = 0.25, TP = 4896, FP = 188, FN = 272, average IoU = 75.00 % 

 mean average precision (mAP) = 0.843603, or 84.36 % 
  4. Try using ARCH= -gencode arch=compute_70,code=[sm_70,compute_70] in the Makefile for Joseph's darknet.

@yyuzhongpv
Author

Hello Alexey!
4. After I set ARCH= -gencode arch=compute_70,code=[sm_70,compute_70] in the Makefile for Joseph's darknet and trained for 10000 iterations, I still can't detect the objects in the sample image.

However, if I use the detector in your repo, it can detect the objects in that image, and the mAP is also good. I think something is wrong with the detector test code in Joseph's darknet.

detections_count = 6532, unique_truth_count = 5038
class_id = 0, name = Jointbar, ap = 90.90 %
class_id = 1, name = Bolt, ap = 90.55 %
class_id = 2, name = Hole, ap = 89.53 %
class_id = 3, name = Nut, ap = 90.88 %
class_id = 4, name = Discontinuity, ap = 90.26 %
class_id = 5, name = Crack, ap = 88.64 %
for thresh = 0.25, precision = 0.97, recall = 0.98, F1-score = 0.97
for thresh = 0.25, TP = 4927, FP = 170, FN = 111, average IoU = 75.18 %

mean average precision (mAP) = 0.901253, or 90.13 %
Total Detection Time: 13.000000 Seconds
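
The reported mAP is the plain mean of the per-class AP values, which makes it easy to see how much a single class moves the total. A short Python sketch using the AP values printed in this thread (the helper name is hypothetical; the inputs are the rounded percentages, so the result only agrees with the tool's output to about two decimals):

```python
def mean_average_precision(per_class_ap):
    """Mean of the per-class AP values (in percent)."""
    return sum(per_class_ap.values()) / len(per_class_ap)

ap_8800 = {"Jointbar": 90.91, "Bolt": 89.73, "Hole": 88.50,
           "Nut": 90.66, "Discontinuity": 90.41, "Crack": 55.96}
ap_10000 = {"Jointbar": 90.90, "Bolt": 90.55, "Hole": 89.53,
            "Nut": 90.88, "Discontinuity": 90.26, "Crack": 88.64}

print(f"{mean_average_precision(ap_8800):.2f} %")
print(f"{mean_average_precision(ap_10000):.2f} %")
```

Almost all of the gain between the two runs comes from the Crack class, whose AP rose from 55.96 % to 88.64 %.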

@AlexeyAB
Owner

I think something is wrong with the detector test code in Joseph's darknet.

Maybe, yes.

@yyuzhongpv
Author

yyuzhongpv commented Jun 12, 2018

There is only a minor difference in mAP between these two repos.

On my dataset, with the same model configuration, training with both Joseph's darknet (GPU and CUDNN enabled) and Alexey's darknet (two-stage, CUDNN_HALF enabled after 1000 iterations) produces good weights.

Joseph's darknet seems to have an issue in its detector test code with CUDNN=1. If I disable CUDNN for testing only, it can detect objects in a single image.

@AlexeyAB added the "Solved" label (The problem is solved using the correct settings) on Jun 12, 2018
@willbattel

  • The first 1000 iterations are the most unstable period, so the general recommendation is to use 1 GPU and Float-32 for them. After 1000 iterations you can use multi-GPU (-gpus 0,1,2,3) and mixed precision (CUDNN_HALF=1).

@AlexeyAB is this still the case? If so, would it be possible to modify the training code so that I can build with CUDNN_HALF=1 and start training with multiple GPUs, and the training process automatically uses only full precision on 1 GPU until iteration 1000? It seems silly to have to rebuild the program part-way through training just to support CUDNN_HALF on multiple GPUs.

@AlexeyAB
Owner

AlexeyAB commented Apr 1, 2019

@willbattel

Currently Darknet automatically disables Tensor Cores for the first 1000-3000 iterations.
So just make once with CUDNN_HALF=1.
