
the value of loss is NaN #31

Closed
mohammad749 opened this issue Feb 5, 2021 · 9 comments

@mohammad749

During training, at the step "python tools/Train_HICO_DET_DJR.py --model --num_iteration 400000", the value of the loss is NaN for all of the images. Why?

@mohammad749
Author

After updating the DJR.py file and re-running the program, the loss is not NaN in the first iteration, but it becomes NaN again in the following iterations!
[screenshot of the training log attached]

@Foruck
Collaborator

Foruck commented Feb 8, 2021

In our experience, this is due to the align losses, which are sometimes unstable and need careful tuning. One alleviating method is to add a value-clip function, as in the latest commit. There are also some helpful tuning techniques, including progressive loss tuning. We will release them along with the revised and cleaned data generation scripts. By the way, in our in-progress journal version we switch from KL divergence to MSE loss and re-design the align losses, which finally resolves the loss issue.
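
For concreteness, a minimal sketch of the two ideas above (value clipping and the KL-to-MSE switch), assuming TensorFlow 1.x; the names A_2D and A_3D stand for the 2D and 3D attention maps in DJR.py, and the eps/clip_max values are illustrative, not the repository's actual settings:

```python
import tensorflow as tf

def clipped_kl_align_loss(A_2D, A_3D, eps=1e-8, clip_max=1e4):
    # Clip the attention maps away from zero before the log to avoid inf/NaN,
    # then clip the resulting loss value so it stays bounded.
    p = tf.clip_by_value(A_2D, eps, 1.0)
    q = tf.clip_by_value(A_3D, eps, 1.0)
    kl = tf.reduce_sum(p * (tf.log(p) - tf.log(q)), axis=-1)
    return tf.clip_by_value(tf.reduce_mean(kl), 0.0, clip_max)

def mse_align_loss(A_2D, A_3D):
    # The MSE alternative mentioned for the journal version: no log,
    # so it cannot blow up when an attention entry approaches zero.
    return tf.reduce_mean(tf.square(A_2D - A_3D))
```

Clipping bounds the KL term when the attention maps contain near-zero entries; the MSE form avoids the log entirely, which is why it tends to be more stable.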

@mohammad749
Author

Will you correct the code? How long will it take? I'm sorry, I'm in a hurry.

@mahsa1363

How do I switch from KL divergence to MSE loss?

@Foruck
Collaborator

Foruck commented Feb 25, 2021

To address the loss issue, some training techniques might be helpful:

  1. For L552 in DJR.py, replace it with 'L_att = (A_2D - A_3D) * (A_2D - A_3D)'.
  2. You could try tuning the losses progressively. In detail, the joint training process could be divided into two stages: first, train the network with the classification losses only for 350K iterations, using SGD with momentum 0.9 and cosine learning rate restarts with an initial learning rate of 1e-3; second, add the other losses and finetune the model for another 50K iterations with a lower learning rate of 1e-4, while setting cfg.TRAIN_MODULE_UPDATE=2 (see the sketch after this list).
  3. The re-designed code for the journal version will be released when we receive the decision for our paper.
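
As referenced in item 2, a rough sketch of the two-stage schedule, assuming TensorFlow 1.x; the toy variable and stand-in losses below are placeholders for the real outputs of the network in DJR.py, and the optimizer wiring is illustrative only:

```python
import tensorflow as tf

# Toy variable and stand-in losses; in DJR.py these come from the network.
w = tf.Variable(1.0)
L_cls, L_sem, L_tri, L_att = w * 1.0, w * 0.5, w * 0.2, w * 0.1

global_step = tf.train.get_or_create_global_step()

# Stage 1: classification loss only, SGD with momentum 0.9 and cosine
# learning rate restarts starting at 1e-3 (run for ~350K iterations).
lr_stage1 = tf.train.cosine_decay_restarts(1e-3, global_step,
                                            first_decay_steps=50000)
train_stage1 = tf.train.MomentumOptimizer(lr_stage1, momentum=0.9).minimize(
    L_cls, global_step=global_step)

# Stage 2: add the remaining losses and finetune at 1e-4 (~50K iterations),
# with cfg.TRAIN_MODULE_UPDATE = 2 as described above.
loss_stage2 = L_cls + 0.01 * L_sem + 0.001 * L_tri + 0.00001 * L_att
train_stage2 = tf.train.MomentumOptimizer(1e-4, momentum=0.9).minimize(
    loss_stage2, global_step=global_step)
```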

@mahsa1363

Thanks for your reply
When I replace it with L_att = (A_2D - A_3D) * (A_2D - A_3D), I get the following error:
InvalidArgumentError (see above for traceback): tags and values not the same shape: [] != [22,17] (tag 'L_att')
[[Node: L_att = ScalarSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](L_att/tags, LOSS/ArithmeticOptimizer/ReplaceMulWithSquare_mul_6/_989)]]
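
The shape mismatch happens because the elementwise square has shape [22, 17] while the L_att scalar summary expects a single value. A minimal sketch of one way to resolve it, assuming TensorFlow 1.x and illustrative placeholder shapes taken from the error message:

```python
import tensorflow as tf

# Illustrative placeholders with the shape reported in the error.
A_2D = tf.placeholder(tf.float32, [22, 17])
A_3D = tf.placeholder(tf.float32, [22, 17])

# Reduce the elementwise squared difference to a scalar before logging it.
L_att = tf.reduce_mean(tf.square(A_2D - A_3D))
tf.summary.scalar('L_att', L_att)  # now a valid scalar summary
```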

@mahsa1363

Hello
I fixed the error, and again the result is NaN. Your code has a problem!

@Foruck Foruck closed this as completed Apr 19, 2021
@monacv

monacv commented May 25, 2021

@mahsa1363 how did you fix this problem?

@mahsa1363

@monacv I could not solve the problem, and the authors did not answer. All I did was delete the part that made the loss NaN.
L_att makes the loss NaN, so I removed that term from the loss: the line loss = L_cls + 0.01 * L_sem + 0.001 * L_tri + 0.00001 * L_att in the DJR.py file is replaced with loss = L_cls + 0.01 * L_sem + 0.001 * L_tri
