
Training issue #5

Open · 920703 opened this issue Jan 25, 2023 · 6 comments

Comments

@920703 commented Jan 25, 2023

I was training your model. I ran it for 20 epochs with the training batch size set to 5.
During training, I noticed that in the 6th epoch, at iteration 4660 of 5495, the learning rate became 0.0 and stayed there until training finished, i.e., until the 20th epoch.

[screenshot: training log showing the lr printed as 0.0]

and the last epoch's results are:

[screenshot: final-epoch training log]

What is the reason behind this?
I used all the default values; nothing was changed.

Any help will be appreciated.
Thanks

@Redaimao (Owner)

Hi,

Can you please check the format used to print the lr? Maybe the lr is just too small to show at that precision.
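
For example, a minimal sketch (assuming optimizer is the Adam instance created in train_net.py) that prints the value the optimizer is actually using:

current_lr = optimizer.param_groups[0]['lr']  # the lr the optimizer really uses
print('lr: {:.6e}'.format(current_lr))        # scientific notation, so tiny values do not round to 0.0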

@920703 (Author) commented Jan 30, 2023

@Redaimao
I have run the code again with the following changes to your train_net.py file (made according to the training details mentioned in the paper):

  1. Changed the learning rate from 0.05 to 0.001.
  2. Added weight_decay=0.5, because it was not mentioned there.
  3. Increased the step size in the scheduler to 300, because I am using train_bs=5 and 20 epochs. With those settings there are 5495 iterations in one epoch, so the learning rate will be reduced every 300th iteration.

See below:

parser.add_argument('--lr_init', type=float, default=0.001, help='learning rate for generator')
optimizer = optim.Adam(net.parameters(), lr=opt.lr_init, weight_decay=0.5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=300, gamma=0.5)  # halves the lr after every 300 scheduler.step() calls
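
Note that if scheduler.step() is called once per training iteration (which item 3 above assumes), these settings halve the lr 366 times over the full run; a quick sketch of the arithmetic:

# Rough estimate of the final lr, assuming scheduler.step() runs every iteration.
lr_init, gamma, step_size = 0.001, 0.5, 300
total_iters = 5495 * 20                  # 109,900 iterations over 20 epochs
n_halvings = total_iters // step_size    # 366
print(lr_init * gamma ** n_halvings)     # ~6.65e-114, effectively zero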

It is running now; let's see if that problem comes again.

But another problem is that I am getting loss_bt = 0 from the very start of training. Why is that? Is the model overfitting, or is it something else?

@Redaimao (Owner)

Hi,
I am not really sure why it is 0, as we didn't encounter such an issue. It may come from the configuration changes you made. Also, as I mentioned, you should check the printing format and how many decimal places are printed. You can tune the lr to see whether the performance improves. Thanks.

920703 closed this as completed Jan 31, 2023
920703 reopened this Jan 31, 2023
@920703 (Author) commented Jan 31, 2023

The learning rate (lr) was being printed as a hard-coded string, "0.001", rather than the live value.

See below:

"Training: Epoch[{:0>3}/{:0>3}] Iteration[{:0>3}/{:0>3}] lr:{} Loss: {:.4f} Loss_pair: {:.4f} Loss_bt: {:.4f} Loss_grads: {:.4f} Loss_ssim: {:.4f} ".format(epoch + 1, opt.max_epoch, i + 1, len(train_loader), '0.001', loss_avg,
loss_pair_lable_avg, loss_between_pair_avg, loss_gradients_avg, loss_ssim_avg))
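
A minimal fix (assuming the optimizer created above is in scope) is to pass the live value instead of the literal '0.001':

current_lr = optimizer.param_groups[0]['lr']  # updated by scheduler.step()
print("Training: Epoch[{:0>3}/{:0>3}] Iteration[{:0>3}/{:0>3}] lr:{:.6e} Loss: {:.4f}".format(
    epoch + 1, opt.max_epoch, i + 1, len(train_loader), current_lr, loss_avg))
# the remaining Loss_pair / Loss_bt / Loss_grads / Loss_ssim fields stay as before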

Both before and after changing the configuration, loss_bt is consistently zero, but this time the lr does not become exactly 0.

See the log below (the last epoch, with some of its final iterations):

Training: Epoch[020/020] Iteration[5370/5495] lr:1.330612450002547e-113 Loss: 0.3635 Loss_pair: 0.4194 Loss_bt: 0.0000 Loss_grads: 0.2198 Loss_ssim: 0.0601
Training: Epoch[020/020] Iteration[5380/5495] lr:1.330612450002547e-113 Loss: 0.3662 Loss_pair: 0.4213 Loss_bt: 0.0000 Loss_grads: 0.2278 Loss_ssim: 0.0636
Training: Epoch[020/020] Iteration[5390/5495] lr:1.330612450002547e-113 Loss: 0.3615 Loss_pair: 0.4209 Loss_bt: 0.0000 Loss_grads: 0.1901 Loss_ssim: 0.0573
Training: Epoch[020/020] Iteration[5400/5495] lr:6.653062250012736e-114 Loss: 0.3672 Loss_pair: 0.4200 Loss_bt: 0.0000 Loss_grads: 0.2468 Loss_ssim: 0.0653
Training: Epoch[020/020] Iteration[5410/5495] lr:6.653062250012736e-114 Loss: 0.3699 Loss_pair: 0.4189 Loss_bt: 0.0000 Loss_grads: 0.2767 Loss_ssim: 0.0707
Training: Epoch[020/020] Iteration[5420/5495] lr:6.653062250012736e-114 Loss: 0.3660 Loss_pair: 0.4192 Loss_bt: 0.0000 Loss_grads: 0.2425 Loss_ssim: 0.0638
Training: Epoch[020/020] Iteration[5430/5495] lr:6.653062250012736e-114 Loss: 0.3621 Loss_pair: 0.4200 Loss_bt: 0.0000 Loss_grads: 0.2011 Loss_ssim: 0.0604
Training: Epoch[020/020] Iteration[5440/5495] lr:6.653062250012736e-114 Loss: 0.3663 Loss_pair: 0.4196 Loss_bt: 0.0000 Loss_grads: 0.2405 Loss_ssim: 0.0656
Training: Epoch[020/020] Iteration[5450/5495] lr:6.653062250012736e-114 Loss: 0.3648 Loss_pair: 0.4203 Loss_bt: 0.0000 Loss_grads: 0.2228 Loss_ssim: 0.0635
Training: Epoch[020/020] Iteration[5460/5495] lr:6.653062250012736e-114 Loss: 0.3642 Loss_pair: 0.4193 Loss_bt: 0.0000 Loss_grads: 0.2256 Loss_ssim: 0.0618
Training: Epoch[020/020] Iteration[5470/5495] lr:6.653062250012736e-114 Loss: 0.3638 Loss_pair: 0.4192 Loss_bt: 0.0000 Loss_grads: 0.2235 Loss_ssim: 0.0609
Training: Epoch[020/020] Iteration[5480/5495] lr:6.653062250012736e-114 Loss: 0.3655 Loss_pair: 0.4192 Loss_bt: 0.0000 Loss_grads: 0.2383 Loss_ssim: 0.0625
Training: Epoch[020/020] Iteration[5490/5495] lr:6.653062250012736e-114 Loss: 0.3623 Loss_pair: 0.4197 Loss_bt: 0.0000 Loss_grads: 0.2063 Loss_ssim: 0.0587
Model saved
Finished Training
net_save_path= ./Result/Latest/01-30_08-07-43/20_net_params.pkl

@Zhaohaojie4598

This is because your lr decays too fast, which drives it arbitrarily close to 0.
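
If the intent is for step_size to count epochs rather than iterations, one common pattern (a sketch, not the repo's official fix; optimizer, train_loader, and opt are the objects from the snippets above) is to move scheduler.step() out of the inner loop:

# Sketch: step the scheduler once per epoch so step_size counts epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)  # halve every 5 epochs
for epoch in range(opt.max_epoch):
    for i, data in enumerate(train_loader):
        optimizer.zero_grad()
        # ... forward pass, compute the losses, loss.backward() ...
        optimizer.step()
    scheduler.step()  # once per epoch, not once per iteration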

@920703 (Author) commented May 14, 2023

@Redaimao @Zhaohaojie4598 I am running the model with 20 epochs, but after a few iterations in the very first epoch I am getting loss_bt=0. I am not able to understand the reason behind this. Please help.

[screenshot: first-epoch training log showing loss_bt = 0 and the lr around iteration 300]

And a second problem: I have set the step size to 300 in the learning-rate scheduler, as my batch size is 8. See above: at the 300th iteration, how does the lr become 0.00025? And immediately in the next iteration it is multiplied by 0.5 and gives 0.0005. Where does 0.00025 come from?
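
For reference, StepLR's expected schedule can be checked in isolation; a self-contained sketch (the values mirror the settings discussed above and are not taken from train_net.py):

import torch

# Standalone check: with step_size=300 and gamma=0.5, the lr should stay at
# 0.001 through step 299, drop to 0.0005 at step 300, and reach 0.00025 only at step 600.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=300, gamma=0.5)
for it in range(1, 601):
    optimizer.step()
    scheduler.step()
    if it in (299, 300, 301, 600):
        print(it, scheduler.get_last_lr()[0])
# prints: 299 0.001 / 300 0.0005 / 301 0.0005 / 600 0.00025

So a value of 0.00025 already at iteration 300 would suggest the scheduler has stepped more times than the iteration count implies (for example, if training was resumed or step() is called more than once per iteration).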

Please reply. I am waiting for your response. Thank you
