loss Nan after inf issue #3449

Closed
naivewhim opened this issue Jun 20, 2019 · 3 comments
Labels
Solved The problem is solved using the correct settings

Comments

naivewhim commented Jun 20, 2019

Hi.

I have a problem where the loss becomes NaN after going to inf at a random step.
We suspect that the low (or nan/-nan) IoU and GIoU values are related to the problem, but we are not sure.

We looked for similar situations in other issues, but we don't know the exact cause.

  1. The loss inflates because the objects to detect are too small relative to the image
    => We logged the delta values produced by delta_yolo_box to track the loss cost.
    The delta value reached float max, yet NaN was not generated even as the loss continued to inflate.

ref ) #930, #2783

  2. Batch Normalize is not applied, so individual loss values have an outsized influence
    => Even when I set batch=1, NaN did not occur

  3. Unstable loss calculation when using multiple GPUs early in training
    ref ) The Difference of AlexeyAB/Darknet and Pjreddie/Darknet #969 , avg loss = -nan when tensor cores are used #2783

Below is the log we used to track the problem.
[delta_yolo_box delta] is a record of the parameters used to calculate the box delta loss.

Inf occurs at step 889, and NaN continues to occur after that.

 (next mAP calculation at 22252 iterations) 
 888: 4849889955545088.000000, 484989015687168.000000 avg loss, 0.001244 rate, 0.552078 seconds, 1776 images
Loaded: 1.735580 seconds
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 82 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 94 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 106 Avg (IOU: 0.000000, GIOU: -0.053442), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 1
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 82 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 94 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 106 Avg (IOU: 0.000000, GIOU: -0.145821), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 5

 (next mAP calculation at 22252 iterations) 
 889: inf, inf avg loss, 0.001249 rate, 0.368008 seconds, 1778 images
Loaded: 2.490613 seconds
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 82 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 94 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 106 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 1
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 82 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 94 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 106 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 1

 (next mAP calculation at 22252 iterations) 
 890: nan, nan avg loss, 0.001255 rate, 0.367751 seconds, 1780 images
Loaded: 2.041787 seconds
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 82 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 94 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0

We sincerely hope to find out the exact cause and solve it.
Thanks...

@primepake

I have the same problem. I restarted my training process and then it returned to normal.

@AlexeyAB
Owner

Use burn_in=1000 in the [net] section of the cfg-file and lower learning_rate= to avoid inf/nan.
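
A minimal sketch of what the adjusted [net] section might look like; the learning_rate value below is only an illustrative assumption, not a value given in this thread:

[net]
# burn_in ramps the learning rate up from 0 over the first 1000 iterations,
# which helps avoid inf/nan early in training
burn_in=1000
# lowered learning rate (example value; tune for your dataset)
learning_rate=0.0005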

@naivewhim
Author

When using multi-GPU, the problem was solved by applying the settings as indicated: burn_in * GPU count and learning_rate / GPU count. Thank you.
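
For example, with 2 GPUs the illustrative values from the sketch above would scale like this:

[net]
# burn_in multiplied by the GPU count (1000 * 2)
burn_in=2000
# learning_rate divided by the GPU count (0.0005 / 2)
learning_rate=0.00025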

AlexeyAB added the Solved label on Jun 27, 2019