loss Nan after inf issue #3449

Closed
naivewhim opened this issue Jun 20, 2019 · 3 comments
Labels
Solved The problem is solved using the correct settings

Comments

naivewhim commented Jun 20, 2019

Hi.

I have a problem where the loss becomes NaN after going to inf at a random step.
We suspect that the low (or nan/-nan) IoU and GIoU values are related to the problem, but we are not sure.

We looked for similar situations in other issues, but we don't know the exact cause.

  1. The loss inflates because the objects to detect are too small relative to the image
    => We logged the delta values produced by delta_yolo_box to track the loss cost.
    The delta value reached float max, yet NaN was not generated even as the loss continued to inflate.

ref ) #930, #2783

  2. Batch Normalize is not applied, so individual loss values have an outsized influence
    => Even when I set batch=1, NaN did not occur

  3. Unstable loss calculation when using multiple GPUs early in training
    ref ) The Difference of AlexeyAB/Darknet and Pjreddie/Darknet #969 , avg loss = -nan when tensor cores are used #2783

Below is the log we used to track the problem.
[delta_yolo_box delta] is a record of the parameters used to calculate the box delta loss.

Inf occurs at step 889, and NaN continues to occur after that.

 (next mAP calculation at 22252 iterations) 
 888: 4849889955545088.000000, 484989015687168.000000 avg loss, 0.001244 rate, 0.552078 seconds, 1776 images
Loaded: 1.735580 seconds
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 82 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 94 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 106 Avg (IOU: 0.000000, GIOU: -0.053442), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 1
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 82 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 94 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 106 Avg (IOU: 0.000000, GIOU: -0.145821), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 5

 (next mAP calculation at 22252 iterations) 
 889: inf, inf avg loss, 0.001249 rate, 0.368008 seconds, 1778 images
Loaded: 2.490613 seconds
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 82 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 94 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 106 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 1
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 82 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 94 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 106 Avg (IOU: 0.000000, GIOU: 0.000000), Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 1

 (next mAP calculation at 22252 iterations) 
 890: nan, nan avg loss, 0.001255 rate, 0.367751 seconds, 1780 images
Loaded: 2.041787 seconds
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 82 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
v3 (mse loss, Normalizer: (iou: 0.750000, cls: 1.000000) Region 94 Avg (IOU: -nan, GIOU: -nan), Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0

We sincerely hope to find out the exact cause and solve it.
Thanks...

@primepake

I have the same problem. I restarted my training process and then it returned to normal.

@AlexeyAB
Owner

Use burn_in=1000 in the [net] section of the cfg-file and lower learning_rate= to avoid inf/nan.
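
A minimal sketch of what the adjusted [net] section might look like; the learning_rate value below is only an illustrative assumption, not a value given in this thread:

[net]
# burn_in ramps the learning rate up from 0 over the first 1000 iterations,
# which helps avoid inf/nan early in training
burn_in=1000
# lowered learning rate (example value; tune for your dataset)
learning_rate=0.0005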

@naivewhim
Author

When using multi-GPU, the problem was solved by applying the settings as indicated: burn_in * GPU count and learning_rate / GPU count. Thank you.
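
For example, with 2 GPUs the illustrative values from the sketch above would scale like this:

[net]
# burn_in multiplied by the GPU count (1000 * 2)
burn_in=2000
# learning_rate divided by the GPU count (0.0005 / 2)
learning_rate=0.00025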

AlexeyAB added the Solved label on Jun 27, 2019