
Training issue #5

Open · 920703 opened this issue Jan 25, 2023 · 6 comments

Comments

@920703 commented Jan 25, 2023

I was training your model. I ran it for 20 epochs with the training batch size set to 5.
During training, I noticed that in the 6th epoch, at iteration 4660 of 5495, the learning rate became 0.0 and stayed there until training finished, i.e., until the 20th epoch.

[screenshot: training log showing the lr printed as 0.0]

and the last epoch's results are:

[screenshot: final-epoch training log]

What is the reason behind this?
I used all the default values; nothing was changed.

Any help will be appreciated.
Thanks

@Redaimao (Owner)

Hi,

Can you please check the format used to print the lr? Maybe the lr is just too small to show at that precision.
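
For example, a minimal sketch (assuming optimizer is the Adam instance created in train_net.py) that prints the value the optimizer is actually using:

current_lr = optimizer.param_groups[0]['lr']  # the lr the optimizer really uses
print('lr: {:.6e}'.format(current_lr))        # scientific notation, so tiny values do not round to 0.0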

@920703 (Author) commented Jan 30, 2023

@Redaimao
I have run the code again with the following changes to your train_net.py file (made according to the training details mentioned in the paper):

  1. Changed the learning rate from 0.05 to 0.001.
  2. Added weight_decay=0.5, because it was not mentioned there.
  3. Increased the step size in the scheduler to 300, because I am using train_bs=5 and 20 epochs. With those settings there are 5495 iterations in one epoch, so the learning rate will be reduced every 300th iteration.

See below:

parser.add_argument('--lr_init', type=float, default=0.001, help='learning rate for generator')
optimizer = optim.Adam(net.parameters(), lr=opt.lr_init, weight_decay=0.5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=300, gamma=0.5)  # halves the lr after every 300 scheduler.step() calls
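
Note that if scheduler.step() is called once per training iteration (which item 3 above assumes), these settings halve the lr 366 times over the full run; a quick sketch of the arithmetic:

# Rough estimate of the final lr, assuming scheduler.step() runs every iteration.
lr_init, gamma, step_size = 0.001, 0.5, 300
total_iters = 5495 * 20                  # 109,900 iterations over 20 epochs
n_halvings = total_iters // step_size    # 366
print(lr_init * gamma ** n_halvings)     # ~6.65e-114, effectively zero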

It is running now; let's see if that problem comes again.

But another problem is that I am getting loss_bt = 0 from the very start of training. Why is that? Is the model overfitting, or is it something else?

@Redaimao (Owner)

Hi,
I am not really sure why it is 0, as we didn't encounter such an issue. It may come from the configuration changes you made. Also, as I mentioned, you should check the printing format and how many decimal places are printed. You can tune the lr to see whether the performance improves. Thanks.

920703 closed this as completed Jan 31, 2023
920703 reopened this Jan 31, 2023
@920703 (Author) commented Jan 31, 2023

The learning rate (lr) was being printed as a hard-coded string, "0.001", rather than the live value.

See below:

"Training: Epoch[{:0>3}/{:0>3}] Iteration[{:0>3}/{:0>3}] lr:{} Loss: {:.4f} Loss_pair: {:.4f} Loss_bt: {:.4f} Loss_grads: {:.4f} Loss_ssim: {:.4f} ".format(epoch + 1, opt.max_epoch, i + 1, len(train_loader), '0.001', loss_avg,
loss_pair_lable_avg, loss_between_pair_avg, loss_gradients_avg, loss_ssim_avg))
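
A minimal fix (assuming the optimizer created above is in scope) is to pass the live value instead of the literal '0.001':

current_lr = optimizer.param_groups[0]['lr']  # updated by scheduler.step()
print("Training: Epoch[{:0>3}/{:0>3}] Iteration[{:0>3}/{:0>3}] lr:{:.6e} Loss: {:.4f}".format(
    epoch + 1, opt.max_epoch, i + 1, len(train_loader), current_lr, loss_avg))
# the remaining Loss_pair / Loss_bt / Loss_grads / Loss_ssim fields stay as before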

Both before and after changing the configuration, loss_bt is consistently zero, but this time the lr does not become exactly 0.

See the log below (the last epoch, with some of its final iterations):

Training: Epoch[020/020] Iteration[5370/5495] lr:1.330612450002547e-113 Loss: 0.3635 Loss_pair: 0.4194 Loss_bt: 0.0000 Loss_grads: 0.2198 Loss_ssim: 0.0601
Training: Epoch[020/020] Iteration[5380/5495] lr:1.330612450002547e-113 Loss: 0.3662 Loss_pair: 0.4213 Loss_bt: 0.0000 Loss_grads: 0.2278 Loss_ssim: 0.0636
Training: Epoch[020/020] Iteration[5390/5495] lr:1.330612450002547e-113 Loss: 0.3615 Loss_pair: 0.4209 Loss_bt: 0.0000 Loss_grads: 0.1901 Loss_ssim: 0.0573
Training: Epoch[020/020] Iteration[5400/5495] lr:6.653062250012736e-114 Loss: 0.3672 Loss_pair: 0.4200 Loss_bt: 0.0000 Loss_grads: 0.2468 Loss_ssim: 0.0653
Training: Epoch[020/020] Iteration[5410/5495] lr:6.653062250012736e-114 Loss: 0.3699 Loss_pair: 0.4189 Loss_bt: 0.0000 Loss_grads: 0.2767 Loss_ssim: 0.0707
Training: Epoch[020/020] Iteration[5420/5495] lr:6.653062250012736e-114 Loss: 0.3660 Loss_pair: 0.4192 Loss_bt: 0.0000 Loss_grads: 0.2425 Loss_ssim: 0.0638
Training: Epoch[020/020] Iteration[5430/5495] lr:6.653062250012736e-114 Loss: 0.3621 Loss_pair: 0.4200 Loss_bt: 0.0000 Loss_grads: 0.2011 Loss_ssim: 0.0604
Training: Epoch[020/020] Iteration[5440/5495] lr:6.653062250012736e-114 Loss: 0.3663 Loss_pair: 0.4196 Loss_bt: 0.0000 Loss_grads: 0.2405 Loss_ssim: 0.0656
Training: Epoch[020/020] Iteration[5450/5495] lr:6.653062250012736e-114 Loss: 0.3648 Loss_pair: 0.4203 Loss_bt: 0.0000 Loss_grads: 0.2228 Loss_ssim: 0.0635
Training: Epoch[020/020] Iteration[5460/5495] lr:6.653062250012736e-114 Loss: 0.3642 Loss_pair: 0.4193 Loss_bt: 0.0000 Loss_grads: 0.2256 Loss_ssim: 0.0618
Training: Epoch[020/020] Iteration[5470/5495] lr:6.653062250012736e-114 Loss: 0.3638 Loss_pair: 0.4192 Loss_bt: 0.0000 Loss_grads: 0.2235 Loss_ssim: 0.0609
Training: Epoch[020/020] Iteration[5480/5495] lr:6.653062250012736e-114 Loss: 0.3655 Loss_pair: 0.4192 Loss_bt: 0.0000 Loss_grads: 0.2383 Loss_ssim: 0.0625
Training: Epoch[020/020] Iteration[5490/5495] lr:6.653062250012736e-114 Loss: 0.3623 Loss_pair: 0.4197 Loss_bt: 0.0000 Loss_grads: 0.2063 Loss_ssim: 0.0587
Model saved
Finished Training
net_save_path= ./Result/Latest/01-30_08-07-43/20_net_params.pkl

@Zhaohaojie4598

This is because your lr decays too fast, which drives it arbitrarily close to 0.
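
If the intent is for step_size to count epochs rather than iterations, one common pattern (a sketch, not the repo's official fix; optimizer, train_loader, and opt are the objects from the snippets above) is to move scheduler.step() out of the inner loop:

# Sketch: step the scheduler once per epoch so step_size counts epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)  # halve every 5 epochs
for epoch in range(opt.max_epoch):
    for i, data in enumerate(train_loader):
        optimizer.zero_grad()
        # ... forward pass, compute the losses, loss.backward() ...
        optimizer.step()
    scheduler.step()  # once per epoch, not once per iteration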

@920703 (Author) commented May 14, 2023

@Redaimao @Zhaohaojie4598 I am running the model with 20 epochs, but after a few iterations in the very first epoch I am getting loss_bt=0. I am not able to understand the reason behind this. Please help.

[screenshot: first-epoch training log showing loss_bt = 0 and the lr around iteration 300]

And a second problem: I have set the step size to 300 in the learning-rate scheduler, as my batch size is 8. See above: at the 300th iteration, how does the lr become 0.00025? And immediately in the next iteration it is multiplied by 0.5 and gives 0.0005. Where does 0.00025 come from?
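
For reference, StepLR's expected schedule can be checked in isolation; a self-contained sketch (the values mirror the settings discussed above and are not taken from train_net.py):

import torch

# Standalone check: with step_size=300 and gamma=0.5, the lr should stay at
# 0.001 through step 299, drop to 0.0005 at step 300, and reach 0.00025 only at step 600.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=300, gamma=0.5)
for it in range(1, 601):
    optimizer.step()
    scheduler.step()
    if it in (299, 300, 301, 600):
        print(it, scheduler.get_last_lr()[0])
# prints: 299 0.001 / 300 0.0005 / 301 0.0005 / 600 0.00025

So a value of 0.00025 already at iteration 300 would suggest the scheduler has stepped more times than the iteration count implies (for example, if training was resumed or step() is called more than once per iteration).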

Please reply. I am waiting for your response. Thank you
