
Adaptive Pixel Intensity Loss generated NaN values while training #9

Closed

ThiruRJST opened this issue Jan 20, 2022 · 10 comments

@ThiruRJST

ThiruRJST commented Jan 20, 2022

I was training on a custom human dataset.
Batch size = 8
Number of training images = 3800

Number of steps trained before the error appeared = 75

After the 75th step, it raised the following error:

RuntimeError: Function 'UpsampleBilinear2DBackward1' returned nan values in its 0th output.

The model trained successfully when using BCE loss.

We also checked for NaN values with torch.autograd.set_detect_anomaly(True) enabled, but it returned False, indicating that no NaN values were found.
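
For reference, this is roughly how we checked (a minimal sketch; the model, loss, and data loader below are stand-ins, not the actual TRACER training code):

```python
import torch
import torch.nn as nn

# backward() will now raise a RuntimeError that includes the forward-pass trace
# of the operation that produced a non-finite gradient.
torch.autograd.set_detect_anomaly(True)

# Stand-ins for the real TRACER model / loss / data loader.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
criterion = nn.BCEWithLogitsLoss()
loader = [(torch.randn(2, 3, 64, 64), torch.rand(2, 1, 64, 64)) for _ in range(3)]

for step, (images, masks) in enumerate(loader):
    preds = model(images)
    loss = criterion(preds, masks)

    # set_detect_anomaly only inspects the backward pass, so NaNs appearing in
    # the loss or activations still need an explicit forward-side check.
    if not torch.isfinite(loss):
        print(f"Non-finite loss at step {step}: {loss.item()}")
        break

    loss.backward()
```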

@ThiruRJST
Author

File "/opt/conda/envs/test/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/opt/conda/envs/test/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/conda/envs/test/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jupyter/TRACER/model/TRACER.py", line 38, in forward
    features, edge = self.model.get_blocks(x, H, W)
  File "/home/jupyter/TRACER/model/EfficientNet.py", line 250, in get_blocks
    edge = F.interpolate(edge, size=(H, W), mode='bilinear')
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/nn/functional.py", line 3709, in interpolate
    return torch._C._nn.upsample_bilinear2d(input, output_size, align_corners, scale_factors)
 (function _print_stack)
 16%|███████████████▎                                                                                | 76/475 [04:51<25:29,  3.83s/it]
Traceback (most recent call last):
  File "main.py", line 49, in <module>
    main(cfg)
  File "main.py", line 34, in main
    Trainer(cfg, save_path)
  File "/home/jupyter/TRACER/trainer.py", line 59, in _init_
    train_loss, train_mae = self.training(args)
  File "/home/jupyter/TRACER/trainer.py", line 117, in training
    loss.backward()
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/autograd/_init_.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: Function 'UpsampleBilinear2DBackward1' returned nan values in its 0th output.

The entire stack trace of the error is shown above.
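
For anyone hitting the same error: the trace points at the bilinear upsampling of the edge tensor in EfficientNet.get_blocks, so a quick way to confirm whether the NaN already exists before the interpolation is a guard like the one below (a hypothetical helper with made-up names, not part of the repo):

```python
import torch
import torch.nn.functional as F

def checked_bilinear_upsample(edge: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Hypothetical wrapper around the interpolate call the trace points at
    (EfficientNet.py, line 250). If the edge tensor is already non-finite here,
    the NaN originates upstream in the edge branch, not in the upsampling op."""
    if not torch.isfinite(edge).all():
        bad = (~torch.isfinite(edge)).sum().item()
        raise RuntimeError(f"edge has {bad} non-finite values before interpolation")
    return F.interpolate(edge, size=(H, W), mode='bilinear')

# A finite input passes through unchanged.
print(checked_bilinear_upsample(torch.randn(1, 1, 32, 32), 64, 64).shape)
```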

@Karel911
Owner

Hi,
It seems the MEAM did not clearly generate the edges.
I recommend removing all lines related to the edge generation parts (e.g., generating edges or computing the edge loss).
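
For example, something along these lines in the training step (a rough sketch with illustrative names such as masks_pred, edge_pred, and edge_gt, since I don't know your exact modifications):

```python
# Rough sketch of a training step with the edge branch disabled while debugging.
# masks_pred / edge_pred / edge_gt / criterion / optimizer are illustrative names.
masks_pred, edge_pred = model(images)          # edge_pred is no longer used below

loss_mask = criterion(masks_pred, masks)       # API loss on the saliency mask only
# loss_edge = criterion(edge_pred, edge_gt)    # edge term removed for debugging
loss = loss_mask                               # previously: loss_mask + loss_edge

optimizer.zero_grad()
loss.backward()
optimizer.step()
```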

@ThiruRJST
Author

But how did it run completely fine when using BCE loss alone?

@Karel911
Copy link
Owner

I don't know exactly what dataset you used, so I'm not sure what the problem is.
But the error you posted suggests that the MEAM module could not capture the edges.
What does it report when you run with torch.autograd.set_detect_anomaly(True)?
Also, does training with the API loss work well when you exclude the lines related to the edge parts?

@ThiruRJST
Author

Actually, with torch.autograd.set_detect_anomaly enabled, the NaN check returns False for all tensors.

@Karel911
Owner

> Hi, It seems the MEAM did not clearly generate the edges. I recommend removing all lines related to the edge generation parts (e.g., generating edges or computing the edge loss).

How about this approach? Does it work?

@hackkhai

@Karel911 Can you help me with removing the edge generation parts? Because I am facing a similar issue.

@ThiruRJST
Author

@Karel911 My teammate @hackkhai is working on that.

@Karel911
Owner

> @Karel911 Can you help me with removing the edge generation parts? Because I am facing a similar issue.

I am also curious about which parts cause this issue.
I have released a version of TRACER without edge generation. Replace the existing scripts with the released ones.
I only tested it briefly, so if there is any problem, please let me know.

Thanks.

@hackkhai

Thanks, let me check this out.

@HuiqianLi mentioned this issue Jan 5, 2023