
loss explosion when training on custom Dataset #75

LokiXun opened this issue Sep 5, 2023 · 3 comments

LokiXun commented Sep 5, 2023

Hi, this is awesome work! May I ask for some help? I ran into a problem when training the model on the REDS video dataset: after about 40K iterations, the loss suddenly explodes and the predicted images become unidentifiable.
[image: training loss curve]
PS: the loss value shown in the picture is summed over the last 100 iterations.

To run on this dataset, I made the following modifications:

  1. Dataset: each clip is 100 frames at frame_size=1280x720. I crop the frames to 256x256 and add random blur. I use 7 local frames and 5 reference frames (sampled evenly from the whole video, excluding the local-frame region; see the sampling sketch after this list). Since my objective is deblurring, I do not use a mask to cover the original image.
  2. To train, I modified the SoftSplit and Transformer parameters: output_size = (64, 64) in this line, and small_window_size = (11, 11), to match the [12, 22, 22, 512] feature that SoftSplit outputs (see the shape check at the end of this post).
  3. I set no_dis: 1 in the config file to disable the adversarial and GAN losses; I suspected they might destabilize training, so I dropped them.
  4. I only have a single 24 GB 4090 GPU, so I could only use batchsize=1, and I did not change the scheduler, which means the learning rate stays at 1e-4 for the whole run.
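
For clarity, here is a minimal sketch of the frame sampling in item 1. `sample_frame_indices` is a hypothetical helper of mine, not code from this repo, and it assumes 100-frame REDS clips:

```python
import random

def sample_frame_indices(video_len=100, num_local=7, num_ref=5):
    # Local frames: a random window of consecutive frames.
    start = random.randint(0, video_len - num_local)
    local = list(range(start, start + num_local))
    # Reference frames: spaced evenly across the rest of the clip,
    # i.e. the whole video excluding the local-frame region.
    remaining = [i for i in range(video_len) if i not in local]
    stride = max(1, len(remaining) // num_ref)
    ref = remaining[::stride][:num_ref]
    return local, ref

local_idx, ref_idx = sample_frame_indices()  # 7 local + 5 reference indices
```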

The prediction at the loss-explosion iteration looks like this:
[image: grid_164_39300_030_00000000]
PS: in the first row, the first 7 pictures are local frames and the last 5 are non-local frames; the second row is the corresponding GT; the third row is the model's prediction.

Did I mistakenly modify the parameters in TimeFocalTransformer? Have you met a similar issue, and how did you solve it? Thanks.
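
For reference, the [12, 22, 22, 512] shape in item 2 can be re-derived with a quick check, assuming an E2FGVI-style SoftSplit (nn.Unfold with kernel 7, stride 3, padding 3, then a linear projection to the 512-dim hidden size); those kernel/stride values and the 128 encoder channels are my assumption, not confirmed from my modified code:

```python
import torch
import torch.nn as nn

B, T, C, H, W = 1, 12, 128, 64, 64      # 12 = 7 local + 5 reference frames; C = 128 assumed
feat = torch.randn(B * T, C, H, W)      # encoder feature at output_size=(64, 64)

# Assumed SoftSplit settings: kernel 7, stride 3, padding 3.
# Tokens per side: (64 + 2*3 - 7) // 3 + 1 = 22.
unfold = nn.Unfold(kernel_size=(7, 7), stride=(3, 3), padding=(3, 3))
to_token = nn.Linear(C * 7 * 7, 512)

tokens = unfold(feat)                   # (12, 6272, 484)
x = to_token(tokens.transpose(1, 2))    # (12, 484, 512)
x = x.view(B * T, 22, 22, 512)
print(x.shape)                          # torch.Size([12, 22, 22, 512])
# small_window_size=(11, 11) tiles the 22x22 token grid evenly (22 % 11 == 0).
```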

@asfaukas

Dear @LokiXun, have you solved this problem? The loss increases at about 40K iterations.

@stayhungry1

Does the loss increase come from the DCN layer during training?

Paper99 (Collaborator) commented Apr 9, 2024 via email
