
the value of loss is NaN #31

Closed
mohammad749 opened this issue Feb 5, 2021 · 9 comments

@mohammad749

During training, at the step "python tools/Train_HICO_DET_DJR.py --model --num_iteration 400000", the value of the loss is NaN for all of the images. Why?

@mohammad749
Author

After updating the DJR.py file and re-running the program, the loss is not NaN in the first iteration, but it becomes NaN again in the following iterations!
[screenshot of the training log attached]

@Foruck
Collaborator

Foruck commented Feb 8, 2021

In our experience, this is due to the align losses, which are sometimes unstable and need careful tuning. One alleviating method is to add a value-clip function, as in the latest commit. There are also some helpful tuning techniques, including progressive loss tuning. We will release them along with the revised and cleaned data generation scripts. By the way, in our in-progress journal version we switch from KL divergence to MSE loss and re-design the align losses, which finally resolves the loss issue.
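
For concreteness, a minimal sketch of the two ideas above (value clipping and the KL-to-MSE switch), assuming TensorFlow 1.x; the names A_2D and A_3D stand for the 2D and 3D attention maps in DJR.py, and the eps/clip_max values are illustrative, not the repository's actual settings:

```python
import tensorflow as tf

def clipped_kl_align_loss(A_2D, A_3D, eps=1e-8, clip_max=1e4):
    # Clip the attention maps away from zero before the log to avoid inf/NaN,
    # then clip the resulting loss value so it stays bounded.
    p = tf.clip_by_value(A_2D, eps, 1.0)
    q = tf.clip_by_value(A_3D, eps, 1.0)
    kl = tf.reduce_sum(p * (tf.log(p) - tf.log(q)), axis=-1)
    return tf.clip_by_value(tf.reduce_mean(kl), 0.0, clip_max)

def mse_align_loss(A_2D, A_3D):
    # The MSE alternative mentioned for the journal version: no log,
    # so it cannot blow up when an attention entry approaches zero.
    return tf.reduce_mean(tf.square(A_2D - A_3D))
```

Clipping bounds the KL term when the attention maps contain near-zero entries; the MSE form avoids the log entirely, which is why it tends to be more stable.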

@mohammad749
Author

Will you correct the code? How long will it take? I'm sorry, I'm in a hurry.

@mahsa1363

How do I switch from KL divergence to MSE loss?

@Foruck
Collaborator

Foruck commented Feb 25, 2021

To address the loss issue, some training techniques might be helpful:

  1. For L552 in DJR.py, replace it with 'L_att = (A_2D - A_3D) * (A_2D - A_3D)'.
  2. You could try tuning the losses progressively. In detail, the joint training process could be divided into two stages: first, train the network with the classification losses only for 350K iterations, using SGD with momentum 0.9 and cosine learning rate restarts with an initial learning rate of 1e-3; second, add the other losses and finetune the model for another 50K iterations with a lower learning rate of 1e-4, while setting cfg.TRAIN_MODULE_UPDATE=2 (see the sketch after this list).
  3. The re-designed code for the journal version will be released when we receive the decision for our paper.
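
As referenced in item 2, a rough sketch of the two-stage schedule, assuming TensorFlow 1.x; the toy variable and stand-in losses below are placeholders for the real outputs of the network in DJR.py, and the optimizer wiring is illustrative only:

```python
import tensorflow as tf

# Toy variable and stand-in losses; in DJR.py these come from the network.
w = tf.Variable(1.0)
L_cls, L_sem, L_tri, L_att = w * 1.0, w * 0.5, w * 0.2, w * 0.1

global_step = tf.train.get_or_create_global_step()

# Stage 1: classification loss only, SGD with momentum 0.9 and cosine
# learning rate restarts starting at 1e-3 (run for ~350K iterations).
lr_stage1 = tf.train.cosine_decay_restarts(1e-3, global_step,
                                            first_decay_steps=50000)
train_stage1 = tf.train.MomentumOptimizer(lr_stage1, momentum=0.9).minimize(
    L_cls, global_step=global_step)

# Stage 2: add the remaining losses and finetune at 1e-4 (~50K iterations),
# with cfg.TRAIN_MODULE_UPDATE = 2 as described above.
loss_stage2 = L_cls + 0.01 * L_sem + 0.001 * L_tri + 0.00001 * L_att
train_stage2 = tf.train.MomentumOptimizer(1e-4, momentum=0.9).minimize(
    loss_stage2, global_step=global_step)
```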

@mahsa1363

Thanks for your reply
When I replace it with L_att = (A_2D - A_3D) * (A_2D - A_3D), I get the following error:
InvalidArgumentError (see above for traceback): tags and values not the same shape: [] != [22,17] (tag 'L_att')
[[Node: L_att = ScalarSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](L_att/tags, LOSS/ArithmeticOptimizer/ReplaceMulWithSquare_mul_6/_989)]]
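
The shape mismatch happens because the elementwise square has shape [22, 17] while the L_att scalar summary expects a single value. A minimal sketch of one way to resolve it, assuming TensorFlow 1.x and illustrative placeholder shapes taken from the error message:

```python
import tensorflow as tf

# Illustrative placeholders with the shape reported in the error.
A_2D = tf.placeholder(tf.float32, [22, 17])
A_3D = tf.placeholder(tf.float32, [22, 17])

# Reduce the elementwise squared difference to a scalar before logging it.
L_att = tf.reduce_mean(tf.square(A_2D - A_3D))
tf.summary.scalar('L_att', L_att)  # now a valid scalar summary
```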

@mahsa1363

Hello
I fixed the error, and again the result is NaN. Your code has a problem!

@Foruck Foruck closed this as completed Apr 19, 2021
@monacv

monacv commented May 25, 2021

@mahsa1363 how did you fix this problem?

@mahsa1363

@monacv I could not solve the problem, and the authors did not answer. All I did was delete the part that made the loss NaN.
L_att makes the loss NaN, so I removed that term from the loss: the line loss = L_cls + 0.01 * L_sem + 0.001 * L_tri + 0.00001 * L_att in the DJR.py file is replaced with loss = L_cls + 0.01 * L_sem + 0.001 * L_tri
