
Adaptive Pixel Intensity Loss generated NaN values while training #9

Closed

ThiruRJST opened this issue Jan 20, 2022 · 10 comments

@ThiruRJST

ThiruRJST commented Jan 20, 2022

I was training on a custom human dataset.
Batch size = 8
Number of training images = 3800

Number of steps trained before the error appeared = 75

After the 75th step, it raised the following error:

RuntimeError: Function 'UpsampleBilinear2DBackward1' returned nan values in its 0th output.

The model trained successfully when using BCE loss.

We also checked for NaN values with torch.autograd.set_detect_anomaly(True) enabled, but it returned False, indicating that no NaN values were found.
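
For reference, this is roughly how we checked (a minimal sketch; the model, loss, and data loader below are stand-ins, not the actual TRACER training code):

```python
import torch
import torch.nn as nn

# backward() will now raise a RuntimeError that includes the forward-pass trace
# of the operation that produced a non-finite gradient.
torch.autograd.set_detect_anomaly(True)

# Stand-ins for the real TRACER model / loss / data loader.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
criterion = nn.BCEWithLogitsLoss()
loader = [(torch.randn(2, 3, 64, 64), torch.rand(2, 1, 64, 64)) for _ in range(3)]

for step, (images, masks) in enumerate(loader):
    preds = model(images)
    loss = criterion(preds, masks)

    # set_detect_anomaly only inspects the backward pass, so NaNs appearing in
    # the loss or activations still need an explicit forward-side check.
    if not torch.isfinite(loss):
        print(f"Non-finite loss at step {step}: {loss.item()}")
        break

    loss.backward()
```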

@ThiruRJST
Author

File "/opt/conda/envs/test/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/opt/conda/envs/test/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/conda/envs/test/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jupyter/TRACER/model/TRACER.py", line 38, in forward
    features, edge = self.model.get_blocks(x, H, W)
  File "/home/jupyter/TRACER/model/EfficientNet.py", line 250, in get_blocks
    edge = F.interpolate(edge, size=(H, W), mode='bilinear')
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/nn/functional.py", line 3709, in interpolate
    return torch._C._nn.upsample_bilinear2d(input, output_size, align_corners, scale_factors)
 (function _print_stack)
 16%|███████████████▎                                                                                | 76/475 [04:51<25:29,  3.83s/it]
Traceback (most recent call last):
  File "main.py", line 49, in <module>
    main(cfg)
  File "main.py", line 34, in main
    Trainer(cfg, save_path)
  File "/home/jupyter/TRACER/trainer.py", line 59, in _init_
    train_loss, train_mae = self.training(args)
  File "/home/jupyter/TRACER/trainer.py", line 117, in training
    loss.backward()
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/autograd/_init_.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: Function 'UpsampleBilinear2DBackward1' returned nan values in its 0th output.

The entire stack trace of the error is shown above.
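
For anyone hitting the same error: the trace points at the bilinear upsampling of the edge tensor in EfficientNet.get_blocks, so a quick way to confirm whether the NaN already exists before the interpolation is a guard like the one below (a hypothetical helper with made-up names, not part of the repo):

```python
import torch
import torch.nn.functional as F

def checked_bilinear_upsample(edge: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Hypothetical wrapper around the interpolate call the trace points at
    (EfficientNet.py, line 250). If the edge tensor is already non-finite here,
    the NaN originates upstream in the edge branch, not in the upsampling op."""
    if not torch.isfinite(edge).all():
        bad = (~torch.isfinite(edge)).sum().item()
        raise RuntimeError(f"edge has {bad} non-finite values before interpolation")
    return F.interpolate(edge, size=(H, W), mode='bilinear')

# A finite input passes through unchanged.
print(checked_bilinear_upsample(torch.randn(1, 1, 32, 32), 64, 64).shape)
```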

@Karel911
Owner

Hi,
It seems the MEAM did not clearly generate the edges.
I recommend removing all lines related to the edge generation parts (e.g., generating edges or computing the edge loss).
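
For example, something along these lines in the training step (a rough sketch with illustrative names such as masks_pred, edge_pred, and edge_gt, since I don't know your exact modifications):

```python
# Rough sketch of a training step with the edge branch disabled while debugging.
# masks_pred / edge_pred / edge_gt / criterion / optimizer are illustrative names.
masks_pred, edge_pred = model(images)          # edge_pred is no longer used below

loss_mask = criterion(masks_pred, masks)       # API loss on the saliency mask only
# loss_edge = criterion(edge_pred, edge_gt)    # edge term removed for debugging
loss = loss_mask                               # previously: loss_mask + loss_edge

optimizer.zero_grad()
loss.backward()
optimizer.step()
```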

@ThiruRJST
Author

But how did it run completely fine when using BCE loss alone?

@Karel911
Copy link
Owner

I don't know exactly what dataset you used, so I'm not sure what the problem is.
But the error you posted suggests that the MEAM module could not capture the edges.
What does it report when you run with torch.autograd.set_detect_anomaly(True)?
Also, does training with the API loss work well when you exclude the lines related to the edge parts?

@ThiruRJST
Author

Actually, with torch.autograd.set_detect_anomaly enabled, the NaN check returns False for all tensors.

@Karel911
Owner

> Hi, It seems the MEAM did not clearly generate the edges. I recommend removing all lines related to the edge generation parts (e.g., generating edges or computing the edge loss).

How about this approach? Does it work?

@hackkhai

@Karel911 Can you help me with removing the edge generation parts? Because I am facing a similar issue.

@ThiruRJST
Author

@Karel911 My teammate @hackkhai is working on that.

@Karel911
Owner

> @Karel911 Can you help me with removing the edge generation parts? Because I am facing a similar issue.

I am also curious about which parts cause this issue.
I have released a version of TRACER without edge generation. Replace the existing scripts with the released ones.
I only tested it briefly, so if there is any problem, please let me know.

Thanks.

@hackkhai

Thanks, let me check this out.

@HuiqianLi mentioned this issue Jan 5, 2023