
The tensor output by self.vertice_mapping in the TransformerEncoder of stage1 is all nan #10

Open
FortisCK opened this issue Apr 14, 2023 · 2 comments

Comments

@FortisCK

Generally, by the second epoch of training, the model outputs are all NaN. At that point I inspected the bias and weight of the linear layer, and they are all NaN as well.

self.encoder.vertice_mapping[0]
Linear(in_features=15069, out_features=1024, bias=True)
self.encoder.vertice_mapping[0].bias
Parameter containing:
tensor([nan, nan, nan,  ..., nan, nan, nan], device='cuda:0',
       requires_grad=True)
self.encoder.vertice_mapping[0].weight
Parameter containing:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       requires_grad=True)
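As a quick way to extend the inspection above from one layer to the whole model, here is a minimal sketch in plain Python. The `model_params` dict of float lists is a stand-in for iterating `model.named_parameters()`; in real PyTorch you would test each tensor with `torch.isnan(p).any()` instead.

```python
import math

# Hedged sketch: scan every parameter and report the ones that contain
# NaN or Inf, instead of checking a single layer by hand.
# `model_params` maps parameter names to their values as plain floats;
# it is a hypothetical stand-in for dict(model.named_parameters()).
def nan_parameters(model_params):
    """Return the names of all parameters containing a non-finite value."""
    return [name for name, values in model_params.items()
            if any(not math.isfinite(v) for v in values)]

params = {
    "encoder.vertice_mapping.0.weight": [float("nan")] * 4,
    "encoder.vertice_mapping.0.bias": [float("nan")] * 2,
    "encoder.other.weight": [0.01, -0.02],
}
print(nan_parameters(params))
# → ['encoder.vertice_mapping.0.weight', 'encoder.vertice_mapping.0.bias']
```

Running a check like this after each optimizer step narrows down which layer goes bad first.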
@Doubiiu
Owner

Doubiiu commented Apr 15, 2023

I have not encountered this before. Are you using the default config for training? It might be solved by scaling down the learning rate, I guess?
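To illustrate why a smaller learning rate can fix this, here is a minimal stdlib-only sketch on a toy 1-D quadratic (not the repo's training code): with plain gradient descent the update is w ← (1 − lr·a)·w, so any lr above 2/a makes the iterates grow without bound and eventually overflow to Inf/NaN, the same failure mode a too-large learning rate can trigger in a real network.

```python
import math

# Toy problem: minimize f(w) = 0.5 * a * w^2, whose gradient is a * w.
# Gradient descent gives w_{t+1} = (1 - lr * a) * w_t, so lr > 2 / a
# diverges while lr < 2 / a converges toward 0.
def train(lr, a=100.0, steps=500, w=1.0):
    for _ in range(steps):
        grad = a * w
        w = w - lr * grad
        if not math.isfinite(w):
            return w  # diverged to Inf/NaN
    return w

diverged = train(lr=0.1)      # 0.1 > 2/100: iterates blow up
converged = train(lr=0.001)   # 0.001 < 2/100: iterates shrink to ~0
print(math.isfinite(diverged), abs(converged) < 1e-6)  # → False True
```

In a real run, gradient clipping (e.g. `torch.nn.utils.clip_grad_norm_`) is another common way to keep a marginally unstable learning rate from producing NaN.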

@youngstu

I have the same problem.

[2023-04-21 17:04:49,310 INFO train_vq.py line 181 11368]=>Epoch: [1/200][70/314] Data: 0.024 (0.035) Batch: 0.109 (0.121) Remain: 02:06:09 Loss: 0.1313 
[2023-04-21 17:04:50,289 INFO train_vq.py line 181 11368]=>Epoch: [1/200][80/314] Data: 0.026 (0.033) Batch: 0.130 (0.118) Remain: 02:03:08 Loss: 0.1381 
[2023-04-21 17:04:51,143 INFO train_vq.py line 181 11368]=>Epoch: [1/200][90/314] Data: 0.029 (0.033) Batch: 0.130 (0.114) Remain: 01:59:22 Loss: 0.1342 
[2023-04-21 17:04:51,857 INFO train_vq.py line 181 11368]=>Epoch: [1/200][100/314] Data: 0.024 (0.032) Batch: 0.063 (0.110) Remain: 01:54:52 Loss: 0.1323 
[2023-04-21 17:04:52,757 INFO train_vq.py line 181 11368]=>Epoch: [1/200][110/314] Data: 0.026 (0.031) Batch: 0.066 (0.108) Remain: 01:52:58 Loss: 0.1308 
[2023-04-21 17:04:53,606 INFO train_vq.py line 181 11368]=>Epoch: [1/200][120/314] Data: 0.025 (0.031) Batch: 0.072 (0.106) Remain: 01:50:53 Loss: 0.1322 
[2023-04-21 17:04:54,501 INFO train_vq.py line 181 11368]=>Epoch: [1/200][130/314] Data: 0.024 (0.030) Batch: 0.071 (0.105) Remain: 01:49:34 Loss: nan 
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
[2023-04-21 17:04:55,388 INFO train_vq.py line 181 11368]=>Epoch: [1/200][140/314] Data: 0.024 (0.030) Batch: 0.076 (0.104) Remain: 01:48:20 Loss: nan 
INFO:main-logger:Epoch: [1/200][140/314] Data: 0.024 (0.030) Batch: 0.076 (0.104) Remain: 01:48:20 Loss: nan 
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
[2023-04-21 17:04:56,192 INFO train_vq.py line 181 11368]=>Epoch: [1/200][150/314] Data: 0.024 (0.029) Batch: 0.071 (0.102) Remain: 01:46:41 Loss: nan 
INFO:main-logger:Epoch: [1/200][150/314] Data: 0.024 (0.029) Batch: 0.071 (0.102) Remain: 01:46:41 Loss: nan 
WARNING:root:NaN or Inf found in input tensor.
WARNING:root:NaN or Inf found in input tensor.
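A guard like the following (a hedged, stdlib-only sketch, not code from the repo) can stop a run at the first non-finite loss instead of logging "Loss: nan" for the rest of the epoch; the `losses` list stands in for per-iteration loss values coming out of the training loop.

```python
import math

# Hedged sketch: find the first iteration whose loss is NaN or Inf, so
# training can be halted (or the batch inspected) at the point of failure.
def first_bad_iteration(losses):
    """Return the index of the first non-finite loss, or None if all finite."""
    for i, loss in enumerate(losses):
        if not math.isfinite(loss):
            return i
    return None

# Loss values mirroring the log above: finite, then NaN from iteration 3 on.
losses = [0.1313, 0.1381, 0.1342, float("nan"), float("nan")]
print(first_bad_iteration(losses))  # → 3
```

In PyTorch the same per-step check is `torch.isfinite(loss).all()`, and `torch.autograd.set_detect_anomaly(True)` can additionally point at the backward op that first produced the NaN.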
