
yolov4 Training questions nan nan nan nan nan #5395

Closed
yzh1527 opened this issue Apr 29, 2020 · 42 comments
Labels: Solved (The problem is solved using the correct settings)

@yzh1527 commented Apr 29, 2020

Training is normal at the beginning. After a while, the loss suddenly increases, and then it is all NaN.
As far as I have checked, there is no problem with the dataset, and the learning rate is set to 0.001. Who can help me? Thanks.
[screenshots]

@AlexeyAB

What cfg-file do you use?
What training command do you use?

learning rate is set to 0.001

If you use 4x GPUs, then use a 4x lower learning rate: learning_rate=0.00025

https://github.com/AlexeyAB/darknet#how-to-train-with-multi-gpu
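
For illustration, a minimal sketch of the corresponding [net] section edits, assuming the default learning_rate=0.001 and 4 GPUs (the burn_in scaling follows the multi-GPU guide linked above, not this comment):

[net]
# single-GPU values: learning_rate=0.001, burn_in=1000
# divide learning_rate by the number of GPUs:
learning_rate=0.00025
# and multiply burn_in by the number of GPUs, per the multi-GPU guide:
burn_in=4000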

@deep-practice

Still got NaN using learning_rate=0.00025 @AlexeyAB

@AlexeyAB

@deep-practice

Still got NaN using learning_rate=0.00025

Show screenshot.
What dataset do you use?
What cfg-file do you use?
What training command do you use?

@samux87 commented May 1, 2020

I have a similar issue when I train on a p2.xlarge instance with a modified COCO dataset.

@AlexeyAB commented May 1, 2020

@samux87 commented May 1, 2020

I trained 1000 epochs on my PC and pushed everything to an EC2 instance. I don't think p2.xlarge is a multi-GPU instance.
Do I need to reduce the learning rate due to the NaN "issue" anyway?

@AlexeyAB commented May 1, 2020

with a modified COCO dataset

  1. Did you check that your changes are correct?
  2. What command do you use for training, and which cfg and pre-trained weights do you use?
  3. Do you follow the manual https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects and use https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov4-custom.cfg ?
  4. Try to train with

@samux87 commented May 1, 2020

  1. yes
  2. ./darknet detector train obj.data yolov4.cfg yolov4_1000.weights -map
  3. I used yolov4.cfg, I will try with the custom one; thank you for the suggestion.
  4. I will try this setting after a try with yolov4-custom.cfg; thank you again.

After configuring batch=32 and subdivisions=16, training starts, but after a few iterations NaN appears xD.

I also used a p3.2xlarge instance; same issue.

@AlexeyAB commented May 1, 2020

  1. The manual says that you should use yolov4-custom.cfg for your custom dataset.

  2. The manual says that you should set batch=64; if an OOM error occurs, increase subdivisions to 16, 32, or 64 (see the sketch after this list).

  3. What pre-trained weights file do you use initially?
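
A minimal sketch of that fallback in the cfg's [net] section, using only the values mentioned above (pick the smallest subdivisions that fits into GPU memory):

[net]
batch=64
subdivisions=16
# if a CUDA out-of-memory error occurs, raise subdivisions step by step:
# subdivisions=32
# subdivisions=64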

@samux87 commented May 1, 2020

yolov4.conv.137

@AlexeyAB commented May 1, 2020

I just fixed yolov4-custom.cfg; try to use it.

Set batch=64 subdivisions=32, or batch=64 subdivisions=64 if an OOM error occurs.

And train:
./darknet detector train obj.data yolov4.cfg yolov4.conv.137 -map

@samux87 commented May 6, 2020

It works now, thank you @AlexeyAB !

@AlexeyAB AlexeyAB added the Solved The problem is solved using the correct settings label May 6, 2020
@beizhengren commented May 8, 2020

@AlexeyAB
In build/darknet/x64/cfg/yolov4-custom.cfg
stopbackward=800
In #4728 (comment) stopbackward=1
I think putting stopbackward=1 before some layer means that the weights of all layers before the current layer will not be updated.
In #5253 (comment) stopbackward=6000
So how should I understand stopbackward=800 and stopbackward=6000?
Thanks!

@AlexeyAB commented May 8, 2020

Putting stopbackward=800 before some layer means that the weights of all layers before that layer will not be updated, but only for the first 800 iterations.
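
As a hedged illustration (the layer shown here is arbitrary, not the exact layer from yolov4-custom.cfg), the parameter sits inside a layer section of the cfg file:

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky
# freeze the weights of all preceding layers for the first 800 iterations:
stopbackward=800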

@beizhengren

@AlexeyAB
Thanks!
So stopbackward=1 means that the weights of all layers before the current layer are never updated? In other words, only when stopbackward isn't 1 (for example, 2) does the value represent an iteration count?

@AlexeyAB commented May 8, 2020

@beizhengren Yes!

@beizhengren

@AlexeyAB Thanks!

@Li505358678

@AlexeyAB
I trained a detector on the MS COCO (trainvalno5k 2014) dataset.
I just did it as in https://github.com/AlexeyAB/darknet/wiki/Train-Detector-on-MS-COCO-(trainvalno5k-2014)-dataset

I followed it; all parameters in yolov4.cfg are at their defaults, except width=512 height=512.

Training command:
./darknet detector train cfg/coco.data cfg/yolov4.cfg csdarknet53-omega.conv.105

but NaN happened at iteration 10000. I use a 32 GB GPU.
[screenshot]

What's wrong with it?

@AlexeyAB commented May 23, 2020

@Li505358678

but NaN happened at iteration 10000.

Show screenshot.

And attach cfg-file.

@Li505358678 commented May 23, 2020

@Li505358678

but NaN happened at iteration 10000.

Show screenshot.

And attach cfg-file.

[screenshot]
Just like this.

chart.png: the loss is always greater than 10, so it appears as a flat line.
[screenshot]

cfg-file:
yolo.txt

@AlexeyAB

@Li505358678

  • What date (version) of Darknet do you use?
  • Try to use the latest version of Darknet.
  • Were bad.list or bad_label.list files generated?

@Li505358678

@Li505358678

  • What date (version) of Darknet do you use?
  • Try to use the latest version of Darknet.
  • Were bad.list or bad_label.list files generated?
  1. 5.13. There should be no problem with the code; have you modified it recently?
  2. Every label I downloaded from the internet via links like trainvalno5k.txt.

@Li505358678

@AlexeyAB

@Li505358678 I fixed yolov4.cfg for stable training without NaN; it will have almost the same AP: c0d6b81

Change width=512 height=512 and train: https://raw.githubusercontent.com/AlexeyAB/darknet/c0d6b81a78e204c04c9bf4277974e0dadad0c4e2/cfg/yolov4.cfg

@Li505358678

I will try. What do you mean by "almost the same AP"? The AP will not decrease and get worse?

@Li505358678

I mean: will it increase instead of decrease?

@AlexeyAB

Accuracy will be the same.

@Li505358678

Accuracy will be the same.

Thanks!I will try it.

@Li505358678

Accuracy will be the same.

@AlexeyAB

I have another question.
Why does yolov4.cfg need width=512 height=512 when training with csdarknet53-omega.conv.105,
but training with yolov4.conv.137 needs width=608 height=608?

I want to get a model just like yolov4 (yolov4.conv.137). Will this training lead to a different model from yolov4?

If what I want is a model close to yolov4, including the configuration of the cfg file, what should I do?

@AlexeyAB

Why does yolov4.cfg need width=512 height=512 when training with csdarknet53-omega.conv.105

Only if you want to reproduce our results from the paper.

but training with yolov4.conv.137 needs width=608 height=608?

yolov4.conv.137 is already trained on MS COCO, so you should use it only for custom-object training, and you can use any width and height that are multiples of 32.

I want to get a model just like yolov4 (yolov4.conv.137). Will this training lead to a different model from yolov4?
If what I want is a model close to yolov4, including the configuration of the cfg file, what should I do?

I don't understand. Do you want to check our results from the paper, train a different model on MS COCO, or train your own model for your custom objects, ...?

@Li505358678

Why does yolov4.cfg need width=512 height=512 when training with csdarknet53-omega.conv.105

Only if you want to reproduce our results from the paper.

but training with yolov4.conv.137 needs width=608 height=608?

yolov4.conv.137 is already trained on MS COCO, so you should use it only for custom-object training, and you can use any width and height that are multiples of 32.

I want to get a model just like yolov4 (yolov4.conv.137). Will this training lead to a different model from yolov4? If what I want is a model close to yolov4, including the configuration of the cfg file, what should I do?

I don't understand. Do you want to check our results from the paper, train a different model on MS COCO, or train your own model for your custom objects, ...?

I want to reproduce your results from the paper. I want to get the same model as yolov4.conv.137.

Now the question is: if I follow https://github.com/AlexeyAB/darknet/wiki/Train-Detector-on-MS-COCO-(trainvalno5k-2014)-dataset, can I get a model close to yolov4.conv.137?

The reason I ask is that the height and width in the cfg-file must be set to 512 during training, but if I directly use yolov4.conv.137 for custom-object training or for evaluating on MS COCO, I can use any width and height that are multiples of 32. If I train according to the link above (just training on MS COCO), can the resulting model also be evaluated with any width and height that are multiples of 32?

@AlexeyAB

Now the question is: if I follow https://github.com/AlexeyAB/darknet/wiki/Train-Detector-on-MS-COCO-(trainvalno5k-2014)-dataset, can I get a model close to yolov4.conv.137?

Yes.
You will get yolov4_final.weights, then you can run
./darknet partial cfg/yolov4.cfg backup/yolov4_final.weights yolov4.conv.137 137
to save the weights of only the first 137 layers into the file yolov4.conv.137.

The reason I ask is that the height and width in the cfg-file must be set to 512 during training, but if I directly use yolov4.conv.137 for custom-object training or for evaluating on MS COCO, I can use any width and height that are multiples of 32.

To reproduce paper results:

  • you should use 512x512 for Training
  • any resolution (multiple of 32) for Detection/Evaluation
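
For example, a minimal sketch of the [net] lines involved (608x608 is just an illustrative multiple of 32, not a value from this comment):

[net]
# training, to reproduce the paper:
width=512
height=512
# for detection/evaluation you may later set any multiple of 32, e.g.:
# width=608
# height=608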

@Li505358678

@Li505358678 I fixed yolov4.cfg for stable training without NaN; it will have almost the same AP: c0d6b81

Change width=512 height=512 and train: https://raw.githubusercontent.com/AlexeyAB/darknet/c0d6b81a78e204c04c9bf4277974e0dadad0c4e2/cfg/yolov4.cfg

For the new cfg: if I want to train with multi-GPU on MS COCO, do I need to change the learning_rate and burn_in in yolov4.cfg?

@Li505358678

@Li505358678 I fixed yolov4.cfg for stable training without NaN; it will have almost the same AP: c0d6b81

Change width=512 height=512 and train: https://raw.githubusercontent.com/AlexeyAB/darknet/c0d6b81a78e204c04c9bf4277974e0dadad0c4e2/cfg/yolov4.cfg

Training has run for more than 10000 iterations; why is the loss still in the hundreds?

@AlexeyAB

Training has run for more than 10000 iterations; why is the loss still in the hundreds?

Show chart.png
What command do you use?

For the new cfg: if I want to train with multi-GPU on MS COCO, do I need to change the learning_rate and burn_in in yolov4.cfg?

For 4x GPUs I would recommend:
batch=16 subdivisions=2 max_batches=2000000 steps=1600000,1800000
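
A hedged sketch of those edits in yolov4.cfg's [net] section (the learning_rate/burn_in note restates the multi-GPU guidance from earlier in this thread, not this comment):

[net]
batch=16
subdivisions=2
max_batches=2000000
steps=1600000,1800000
# per the multi-GPU guide earlier in the thread, also divide learning_rate
# by the GPU count and multiply burn_in by it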

@Li505358678

Training has run for more than 10000 iterations; why is the loss still in the hundreds?

Show chart.png
What command do you use?

For the new cfg: if I want to train with multi-GPU on MS COCO, do I need to change the learning_rate and burn_in in yolov4.cfg?

For 4x GPUs I would recommend:
batch=16 subdivisions=2 max_batches=2000000 steps=1600000,1800000

In chart.png I can only see a line at the top.

Command: ./darknet detector train cfg/coco.data cfg/yolov4.cfg csdarknet53-omega.conv.105

I just train on MS COCO following https://github.com/AlexeyAB/darknet/wiki/Train-Detector-on-MS-COCO-(trainvalno5k-2014)-dataset

If I have just 1 GPU, what should I do?

@AlexeyAB

Just follow: https://github.com/AlexeyAB/darknet/wiki/Train-Detector-on-MS-COCO-(trainvalno5k-2014)-dataset

@Li505358678

I just train on MS COCO following https://github.com/AlexeyAB/darknet/wiki/Train-Detector-on-MS-COCO-(trainvalno5k-2014)-dataset
If I have just 1 GPU, what should I do?

Just follow: https://github.com/AlexeyAB/darknet/wiki/Train-Detector-on-MS-COCO-(trainvalno5k-2014)-dataset

Yes, I followed it, and after 10000 iterations the loss was still high, over 100. Is this normal?

@AlexeyAB

Loss should be ~20. But show chart.png

@Li505358678

[screenshots]

I stopped halfway. Before, due to my carelessness, I didn't decompress the labels and only had the two txt files, and the training loss decreased normally. I found the error, followed get_coco_dataset.sh, and did it all once again, but now the training loss has stayed very high.

@AlexeyAB

  • Check your dataset: run training with the flag -show_imgs. Do you see correct bboxes?

  • Run training from the beginning.
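
For example (a sketch that reuses the training command quoted earlier in this thread):

./darknet detector train cfg/coco.data cfg/yolov4.cfg csdarknet53-omega.conv.105 -show_imgs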

@Li505358678

[screenshots]

The bboxes look good.

Can running training from the beginning work? I will try next time, due to time constraints. For now I am training with custom data, and the loss looks good.

I wonder why this happens. When I hadn't unzipped the COCO labels, I only had two txt files; why could I still get a good loss?

@cenit cenit closed this as completed Jan 21, 2021