
yolov4 Training questions nan nan nan nan nan #5395

Closed
yzh1527 opened this issue Apr 29, 2020 · 42 comments
Labels: Solved (The problem is solved using the correct settings)

@yzh1527 commented Apr 29, 2020

Training is normal at the beginning. After a while, the loss suddenly increases, and then it is all NaN.
As far as I have checked, there is no problem with the dataset, and the learning rate is set to 0.001. Who can help me? Thanks.
[screenshots]

@AlexeyAB

What cfg-file do you use?
What training command do you use?

learning rate is set to 0.001

If you use 4x GPUs, then use a 4x lower learning rate: learning_rate=0.00025

https://github.com/AlexeyAB/darknet#how-to-train-with-multi-gpu
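
For illustration, a minimal sketch of the corresponding [net] section edits, assuming the default learning_rate=0.001 and 4 GPUs (the burn_in scaling follows the multi-GPU guide linked above, not this comment):

[net]
# single-GPU values: learning_rate=0.001, burn_in=1000
# divide learning_rate by the number of GPUs:
learning_rate=0.00025
# and multiply burn_in by the number of GPUs, per the multi-GPU guide:
burn_in=4000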

@deep-practice

Still got NaN using learning_rate=0.00025 @AlexeyAB

@AlexeyAB

@deep-practice

Still got NaN using learning_rate=0.00025

Show screenshot.
What dataset do you use?
What cfg-file do you use?
What training command do you use?

@samux87 commented May 1, 2020

I have a similar issue when I train on a p2.xlarge instance with a modified COCO dataset.

@AlexeyAB commented May 1, 2020

@samux87 commented May 1, 2020

I trained 1000 epochs on my PC and pushed everything to an EC2 instance. I don't think p2.xlarge is a multi-GPU instance.
Do I need to reduce the learning rate due to the NaN "issue" anyway?

@AlexeyAB commented May 1, 2020

with a modified COCO dataset

  1. Did you check that your changes are correct?
  2. What command do you use for training, and which cfg and pre-trained weights do you use?
  3. Do you follow the manual https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects and use https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov4-custom.cfg ?
  4. Try to train with

@samux87 commented May 1, 2020

  1. yes
  2. ./darknet detector train obj.data yolov4.cfg yolov4_1000.weights -map
  3. I used yolov4.cfg, I will try with the custom one; thank you for the suggestion.
  4. I will try this setting after a try with yolov4-custom.cfg; thank you again.

After configuring batch=32 and subdivisions=16, training starts, but after a few iterations NaN appears xD.

I also used a p3.2xlarge instance; same issue.

@AlexeyAB commented May 1, 2020

  1. The manual says that you should use yolov4-custom.cfg for your custom dataset.

  2. The manual says that you should set batch=64; if an OOM error occurs, increase subdivisions to 16, 32, or 64 (see the sketch after this list).

  3. What pre-trained weights file do you use initially?
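
A minimal sketch of that fallback in the cfg's [net] section, using only the values mentioned above (pick the smallest subdivisions that fits into GPU memory):

[net]
batch=64
subdivisions=16
# if a CUDA out-of-memory error occurs, raise subdivisions step by step:
# subdivisions=32
# subdivisions=64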

@samux87 commented May 1, 2020

yolov4.conv.137

@AlexeyAB commented May 1, 2020

I just fixed yolov4-custom.cfg; try to use it.

Set batch=64 subdivisions=32, or batch=64 subdivisions=64 if an OOM error occurs.

And train:
./darknet detector train obj.data yolov4.cfg yolov4.conv.137 -map

@samux87 commented May 6, 2020

It works now, thank you @AlexeyAB !

@AlexeyAB AlexeyAB added the Solved The problem is solved using the correct settings label May 6, 2020
@beizhengren commented May 8, 2020

@AlexeyAB
In build/darknet/x64/cfg/yolov4-custom.cfg
stopbackward=800
In #4728 (comment) stopbackward=1
I think putting stopbackward=1 before some layer means that the weights of all layers before the current layer will not be updated.
In #5253 (comment) stopbackward=6000
So how should I understand stopbackward=800 and stopbackward=6000?
Thanks!

@AlexeyAB commented May 8, 2020

Putting stopbackward=800 before some layer means that the weights of all layers before that layer will not be updated, but only for the first 800 iterations.
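
As a hedged illustration (the layer shown here is arbitrary, not the exact layer from yolov4-custom.cfg), the parameter sits inside a layer section of the cfg file:

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
# freeze the weights of all preceding layers for the first 800 iterations:
stopbackward=800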

@beizhengren

@AlexeyAB
Thanks!
So stopbackward=1 means that the weights of all layers before the current layer are never updated? In other words, only when stopbackward isn't 1 (for example, 2) does the value represent an iteration count?

@AlexeyAB commented May 8, 2020

@beizhengren Yes!

@beizhengren

@AlexeyAB Thanks!

@Li505358678

@AlexeyAB
I trained a detector on the MS COCO (trainvalno5k 2014) dataset.
I just did it as in https://github.com/AlexeyAB/darknet/wiki/Train-Detector-on-MS-COCO-(trainvalno5k-2014)-dataset

I followed it; all parameters in yolov4.cfg are at their defaults, except width=512 height=512.

Training command:
./darknet detector train cfg/coco.data cfg/yolov4.cfg csdarknet53-omega.conv.105

but NaN happened at iteration 10000. I use a 32 GB GPU.
[screenshot]

What's wrong with it?

@AlexeyAB commented May 23, 2020

@Li505358678

but NaN happened at iteration 10000.

Show screenshot.

And attach cfg-file.

@Li505358678 commented May 23, 2020

@Li505358678

but NaN happened at iteration 10000.

Show screenshot.

And attach cfg-file.

[screenshot]
Just like this.

chart.png: the loss is always greater than 10, so it appears as a flat line.
[screenshot]

cfg-file:
yolo.txt

@AlexeyAB

@Li505358678

  • What date (version) of Darknet do you use?
  • Try to use the latest version of Darknet.
  • Were bad.list or bad_label.list files generated?

@Li505358678

@Li505358678

  • What date (version) of Darknet do you use?
  • Try to use the latest version of Darknet.
  • Were bad.list or bad_label.list files generated?
  1. 5.13. There should be no problem with the code; have you modified it recently?
  2. Every label I downloaded from the internet via links like trainvalno5k.txt.

@Li505358678

@AlexeyAB

@Li505358678 I fixed yolov4.cfg for stable training without NaN; it will have almost the same AP: c0d6b81

Change width=512 height=512 and train: https://raw.githubusercontent.com/AlexeyAB/darknet/c0d6b81a78e204c04c9bf4277974e0dadad0c4e2/cfg/yolov4.cfg

@Li505358678

I will try. What do you mean by "almost the same AP"? The AP will not decrease and get worse?

@Li505358678

I mean: will it increase instead of decrease?

@AlexeyAB

Accuracy will be the same.

@Li505358678

Accuracy will be the same.

Thanks!I will try it.

@Li505358678

Accuracy will be the same.

@AlexeyAB

I have another question.
Why does yolov4.cfg need width=512 height=512 when training with csdarknet53-omega.conv.105,
but training with yolov4.conv.137 needs width=608 height=608?

I want to get a model just like yolov4 (yolov4.conv.137). Will this training lead to a different model from yolov4?

If what I want is a model close to yolov4, including the configuration of the cfg file, what should I do?

@AlexeyAB

Why does yolov4.cfg need width=512 height=512 when training with csdarknet53-omega.conv.105

Only if you want to reproduce our results from the paper.

but training with yolov4.conv.137 needs width=608 height=608?

yolov4.conv.137 is already trained on MS COCO, so you should use it only for custom-object training, and you can use any width and height that are multiples of 32.

I want to get a model just like yolov4 (yolov4.conv.137). Will this training lead to a different model from yolov4?
If what I want is a model close to yolov4, including the configuration of the cfg file, what should I do?

I don't understand. Do you want to check our results from the paper, train a different model on MS COCO, or train your own model for your custom objects, ...?

@Li505358678

Why does yolov4.cfg need width=512 height=512 when training with csdarknet53-omega.conv.105

Only if you want to reproduce our results from the paper.

but training with yolov4.conv.137 needs width=608 height=608?

yolov4.conv.137 is already trained on MS COCO, so you should use it only for custom-object training, and you can use any width and height that are multiples of 32.

I want to get a model just like yolov4 (yolov4.conv.137). Will this training lead to a different model from yolov4? If what I want is a model close to yolov4, including the configuration of the cfg file, what should I do?

I don't understand. Do you want to check our results from the paper, train a different model on MS COCO, or train your own model for your custom objects, ...?

I want to reproduce your results from the paper. I want to get the same model as yolov4.conv.137.

Now the question is: if I follow https://github.com/AlexeyAB/darknet/wiki/Train-Detector-on-MS-COCO-(trainvalno5k-2014)-dataset, can I get a model close to yolov4.conv.137?

The reason I ask is that the height and width in the cfg-file must be set to 512 during training, but if I directly use yolov4.conv.137 for custom-object training or for evaluating on MS COCO, I can use any width and height that are multiples of 32. If I train according to the link above (just training on MS COCO), can the resulting model also be evaluated with any width and height that are multiples of 32?

@AlexeyAB

Now the question is: if I follow https://github.com/AlexeyAB/darknet/wiki/Train-Detector-on-MS-COCO-(trainvalno5k-2014)-dataset, can I get a model close to yolov4.conv.137?

Yes.
You will get yolov4_final.weights, then you can run
./darknet partial cfg/yolov4.cfg backup/yolov4_final.weights yolov4.conv.137 137
to save the weights of only the first 137 layers into the file yolov4.conv.137.

The reason I ask is that the height and width in the cfg-file must be set to 512 during training, but if I directly use yolov4.conv.137 for custom-object training or for evaluating on MS COCO, I can use any width and height that are multiples of 32.

To reproduce paper results:

  • you should use 512x512 for Training
  • any resolution (multiple of 32) for Detection/Evaluation
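
For example, a minimal sketch of the [net] lines involved (608x608 is just an illustrative multiple of 32, not a value from this comment):

[net]
# training, to reproduce the paper:
width=512
height=512
# for detection/evaluation you may later set any multiple of 32, e.g.:
# width=608
# height=608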

@Li505358678

@Li505358678 I fixed yolov4.cfg for stable training without NaN; it will have almost the same AP: c0d6b81

Change width=512 height=512 and train: https://raw.githubusercontent.com/AlexeyAB/darknet/c0d6b81a78e204c04c9bf4277974e0dadad0c4e2/cfg/yolov4.cfg

For the new cfg: if I want to train with multi-GPU on MS COCO, do I need to change the learning_rate and burn_in in yolov4.cfg?

@Li505358678

@Li505358678 I fixed yolov4.cfg for stable training without NaN; it will have almost the same AP: c0d6b81

Change width=512 height=512 and train: https://raw.githubusercontent.com/AlexeyAB/darknet/c0d6b81a78e204c04c9bf4277974e0dadad0c4e2/cfg/yolov4.cfg

Training has run for more than 10000 iterations; why is the loss still in the hundreds?

@AlexeyAB

Training has run for more than 10000 iterations; why is the loss still in the hundreds?

Show chart.png
What command do you use?

For the new cfg: if I want to train with multi-GPU on MS COCO, do I need to change the learning_rate and burn_in in yolov4.cfg?

For 4x GPUs I would recommend:
batch=16 subdivisions=2 max_batches=2000000 steps=1600000,1800000
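
A hedged sketch of those edits in yolov4.cfg's [net] section (the learning_rate/burn_in note restates the multi-GPU guidance from earlier in this thread, not this comment):

[net]
batch=16
subdivisions=2
max_batches=2000000
steps=1600000,1800000
# per the multi-GPU guide earlier in the thread, also divide learning_rate
# by the GPU count and multiply burn_in by it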

@Li505358678

Training has run for more than 10000 iterations; why is the loss still in the hundreds?

Show chart.png
What command do you use?

For the new cfg: if I want to train with multi-GPU on MS COCO, do I need to change the learning_rate and burn_in in yolov4.cfg?

For 4x GPUs I would recommend:
batch=16 subdivisions=2 max_batches=2000000 steps=1600000,1800000

In chart.png I can only see a line at the top.

Command: ./darknet detector train cfg/coco.data cfg/yolov4.cfg csdarknet53-omega.conv.105

I just train on MS COCO following https://github.com/AlexeyAB/darknet/wiki/Train-Detector-on-MS-COCO-(trainvalno5k-2014)-dataset

If I have just 1 GPU, what should I do?

@AlexeyAB

Just follow: https://github.com/AlexeyAB/darknet/wiki/Train-Detector-on-MS-COCO-(trainvalno5k-2014)-dataset

@Li505358678

I just train on MS COCO following https://github.com/AlexeyAB/darknet/wiki/Train-Detector-on-MS-COCO-(trainvalno5k-2014)-dataset
If I have just 1 GPU, what should I do?

Just follow: https://github.com/AlexeyAB/darknet/wiki/Train-Detector-on-MS-COCO-(trainvalno5k-2014)-dataset

Yes, I followed it, and after 10000 iterations the loss was still high, over 100. Is this normal?

@AlexeyAB

Loss should be ~20. But show chart.png

@Li505358678

[screenshots]

I stopped halfway. Before, due to my carelessness, I didn't decompress the labels and only had the two txt files, and the training loss decreased normally. I found the error, followed get_coco_dataset.sh, and did it all once again, but now the training loss has stayed very high.

@AlexeyAB

  • Check your dataset: run training with the flag -show_imgs. Do you see correct bboxes?

  • Run training from the beginning.
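
For example (a sketch that reuses the training command quoted earlier in this thread):

./darknet detector train cfg/coco.data cfg/yolov4.cfg csdarknet53-omega.conv.105 -show_imgs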

@Li505358678

[screenshots]

The bboxes look good.

Can running training from the beginning work? I will try next time, due to time constraints. For now I am training with custom data, and the loss looks good.

I wonder why this happens. When I hadn't unzipped the COCO labels, I only had two txt files; why could I still get a good loss?

@cenit cenit closed this as completed Jan 21, 2021