
loss explosion when training on custom Dataset #75

LokiXun opened this issue Sep 5, 2023 · 3 comments

LokiXun commented Sep 5, 2023

Hi, this is awesome work! May I ask for some help? I ran into a problem when training the model on the REDS video dataset: after about 40K iterations, the loss suddenly explodes and the predicted images become unidentifiable.
[image: training loss curve]
PS: the loss value shown in the picture is summed over the last 100 iterations.

To run on this dataset, I made the following modifications:

  1. Dataset: each clip is 100 frames at frame_size=1280x720. I crop the frames to 256x256 and add random blur. I use 7 local frames and 5 reference frames (sampled evenly from the whole video, excluding the local-frame region; see the sampling sketch after this list). Since my objective is deblurring, I do not use a mask to cover the original image.
  2. To train, I modified the SoftSplit and Transformer parameters: output_size = (64, 64) in this line, and small_window_size = (11, 11), to match the [12, 22, 22, 512] feature that SoftSplit outputs (see the shape check at the end of this post).
  3. I set no_dis: 1 in the config file to disable the adversarial and GAN losses; I suspected they might destabilize training, so I dropped them.
  4. I only have a single 24 GB 4090 GPU, so I could only use batchsize=1, and I did not change the scheduler, which means the learning rate stays at 1e-4 for the whole run.
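
For clarity, here is a minimal sketch of the frame sampling in item 1. `sample_frame_indices` is a hypothetical helper of mine, not code from this repo, and it assumes 100-frame REDS clips:

```python
import random

def sample_frame_indices(video_len=100, num_local=7, num_ref=5):
    # Local frames: a random window of consecutive frames.
    start = random.randint(0, video_len - num_local)
    local = list(range(start, start + num_local))
    # Reference frames: spaced evenly across the rest of the clip,
    # i.e. the whole video excluding the local-frame region.
    remaining = [i for i in range(video_len) if i not in local]
    stride = max(1, len(remaining) // num_ref)
    ref = remaining[::stride][:num_ref]
    return local, ref

local_idx, ref_idx = sample_frame_indices()  # 7 local + 5 reference indices
```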

The prediction at the loss-explosion iteration looks like this:
[image: grid_164_39300_030_00000000]
PS: in the first row, the first 7 pictures are local frames and the last 5 are non-local frames; the second row is the corresponding GT; the third row is the model's prediction.

Did I mistakenly modify the parameters in TimeFocalTransformer? Have you met a similar issue, and how did you solve it? Thanks.
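
For reference, the [12, 22, 22, 512] shape in item 2 can be re-derived with a quick check, assuming an E2FGVI-style SoftSplit (nn.Unfold with kernel 7, stride 3, padding 3, then a linear projection to the 512-dim hidden size); those kernel/stride values and the 128 encoder channels are my assumption, not confirmed from my modified code:

```python
import torch
import torch.nn as nn

B, T, C, H, W = 1, 12, 128, 64, 64      # 12 = 7 local + 5 reference frames; C = 128 assumed
feat = torch.randn(B * T, C, H, W)      # encoder feature at output_size=(64, 64)

# Assumed SoftSplit settings: kernel 7, stride 3, padding 3.
# Tokens per side: (64 + 2*3 - 7) // 3 + 1 = 22.
unfold = nn.Unfold(kernel_size=(7, 7), stride=(3, 3), padding=(3, 3))
to_token = nn.Linear(C * 7 * 7, 512)

tokens = unfold(feat)                   # (12, 6272, 484)
x = to_token(tokens.transpose(1, 2))    # (12, 484, 512)
x = x.view(B * T, 22, 22, 512)
print(x.shape)                          # torch.Size([12, 22, 22, 512])
# small_window_size=(11, 11) tiles the 22x22 token grid evenly (22 % 11 == 0).
```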

@asfaukas

Dear @LokiXun, have you solved this problem? The loss increases at about 40K iterations.

@stayhungry1

Does the loss increase come from the DCN layer during training?

Paper99 (Collaborator) commented Apr 9, 2024 via email
