
Problems during training. #15

Closed
XuyangBai opened this issue Jul 31, 2019 · 15 comments

Comments

@XuyangBai

Hi, thanks for sharing your code. I have tried it on my own dataset. Initially everything goes well, but after several epochs the training suddenly breaks down (the accuracy becomes 1 and the loss becomes 0). I use TF 1.12.0 with CUDA 9.0 and cuDNN 7.1.4.

# conda list | grep tensorflow
tensorflow-estimator      1.13.0                     py_0    anaconda
tensorflow-gpu            1.12.0                   pypi_0    pypi
tensorflow-tensorboard    0.4.0                    pypi_0    pypi

Have you met this kind of problem? Another potential issue is that sometimes the training takes 4400 MB of GPU memory (as seen from nvidia-smi), but sometimes it takes more than 7000 MB, even though I do not change the batch size or network architecture. I am pretty confused by these problems. Could you give me some advice?

@HuguesTHOMAS
Owner

Hi @XuyangBai,

The first error is quite strange and I never encountered such behavior on my datasets. It is very unlikely that the loss really became zero if you use correct augmentation strategies. It seems more like a bug, but it will be difficult to help you without reproducing your experiments with your dataset.

The second one could be explained by your dataset. The GPU memory that you see in nvidia-smi is the memory taken by the tensors at runtime, so it depends on the input size. If you use the same network parameters but a different dataset, with denser point clouds for example, this memory will be larger. It is strange that you get different GPU memory values for the same dataset, but that might be explained by the nature of your dataset and your implementation of the dataset class.

What does your data look like? Real or artificial point clouds? Indoor or outdoor scenes? Objects?

@XuyangBai
Author

Hi @HuguesTHOMAS ,

I also think the first error may be caused by a bug in my implementation; I just wanted to check whether it could be due to the TensorFlow version. By the way, what is the situation with TF 1.13 and CUDA 10?

For the second one, I mean that for some experiments the GPU memory is always 4400 MB, and for others always 7000+ MB. It is really strange. But I checked training.txt in the results folder and found that the memory reported there is similar.

@XuyangBai
Author

Oh sorry, I think I found the reason for the second problem: I forgot about the dropout. It seems that when I use dropout = 0.5 the GPU memory is around 4400 MB, while with dropout = 1 it is 7000 MB.

Sorry for the bother.

@HuguesTHOMAS
Owner

HuguesTHOMAS commented Aug 2, 2019

Hi @XuyangBai,

If you look at the code, the dropout variable is extremely important in the implementation, because the network uses it to know whether it is in training or in test mode.

If you use a dropout < 0.99, the network is in training configuration, and if you use dropout = 1, the network is in test configuration. This is a trick I used to avoid creating a 'training/test' boolean placeholder, and that I never corrected.

It will be corrected in the next month (I currently don't have any time to spend on the code). Until then, you should not use dropout = 1 when training, as the variables will not be updated by gradient backpropagation in that case. If you have dropout blocks and don't want to use them, just remove them, or use dropout = 0.98 so that they become insignificant.
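A minimal NumPy sketch of the trick described above (not the repository's actual code): the dropout keep probability doubles as a training/test flag, and inverted dropout is only applied when it signals training mode.

```python
import numpy as np

def apply_dropout(features, keep_prob, rng):
    """Inverted dropout, active only when keep_prob < 0.99.

    keep_prob == 1.0 signals 'test configuration' and passes features
    through unchanged, mirroring the training/test trick above.
    """
    training = keep_prob < 0.99
    if not training:
        return features, training
    # Keep each activation with probability keep_prob, rescale survivors.
    mask = rng.random(features.shape) < keep_prob
    return features * mask / keep_prob, training
```

With this convention, a keep probability of 0.98 keeps the network in training mode while making the dropout itself nearly a no-op, which is exactly the workaround suggested above.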

Best,
Hugues

@XuyangBai
Author

Thanks a lot for your reply :)

@XuyangBai
Author

Hi @HuguesTHOMAS

I am reopening this issue because I went through the code to track down the bug, but I am still facing the first problem: the training breaks down (the loss becomes zero). Everything goes well in the CUDA 9, TF 1.12.0 environment. But when I run the code on my RTX 2080 Ti with CUDA 10 and TF 1.12.0 (built from source), the problem appears. I printed the input values and the variables at the moment the model broke, and found that the input values are fine but the variables are all NaN.

I have tried adding 1e-12 to operations like tf.sqrt to avoid infinite values, but the problem is still there.
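For context on that epsilon trick, here is a small illustrative sketch (plain NumPy, not the repository's code): the gradient of sqrt(x) is 1 / (2 * sqrt(x)), which diverges to infinity at x = 0, while adding a tiny epsilon inside the square root keeps it finite.

```python
import numpy as np

def safe_sqrt_grad(x, eps=1e-12):
    """Analytic gradient of sqrt(x + eps).

    Finite even at x == 0, whereas the gradient of a bare sqrt
    blows up there, which can turn into NaN during backpropagation.
    """
    return 1.0 / (2.0 * np.sqrt(x + eps))
```

Note that this only guards the sqrt itself; as discussed below, the NaNs in this issue ended up having a different origin.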

Another thing worth mentioning is that sometimes the training and validation both break down, while sometimes the training seems correct but only the validation breaks down, like the curves below.
[screenshot: training/validation curves, Aug 3 2019]

Have you ever met such a problem where the variables all become NaN? Thank you so much.

XuyangBai reopened this Aug 3, 2019
@nejcd

nejcd commented Aug 13, 2019

Hi @XuyangBai, I have noticed similar behaviour to what you describe. I have not been able to debug it, as it occurs randomly.
Since you have closed the issue, have you found a solution?

@XuyangBai
Author

Hi @nejcd, I didn't find a solution, so I just changed my environment back to CUDA 9 with TF 1.12.0, and everything works well. There might be some bug related to CUDA 10.

@HuguesTHOMAS
Owner

HuguesTHOMAS commented Aug 19, 2019

Hi,

I had some time to dig into this problem, and it seems that CUDA 10 does not work correctly with RTX 2080 Ti GPUs. Here is what I found:

Tested configurations

  • CUDA9-TF1.12 / GTX 1080ti => No bug
  • CUDA10-TF1.13 / GTX 1080ti => No bug
  • CUDA9-TF1.12 / RTX 2080ti => No bug
  • CUDA10-TF1.13 / RTX 2080ti => Bug appears only in this configuration

Origin of the bug
I tracked down the NaN values in my code and found that they appear after a tf.matmul operation:

weighted_features = tf.matmul(all_weights, neighborhood_features)

Before the NaNs appear, I noticed some weird values higher than 1e10. If you print the two matrices being multiplied and the result matrix, you will see that the result is completely wrong. This seems to be caused by an internal CUDA bug. At some point, one of these mistakes leads to a value so high that it becomes NaN and the network crashes.
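The debugging step described above can be sketched as a wrapper around the matrix product (illustrative NumPy, not the repository's TF code; the 1e10 threshold is taken from the observation above):

```python
import numpy as np

def checked_matmul(a, b, threshold=1e10):
    """Matrix product with runtime sanity checks.

    Flags NaN/Inf entries, and also the suspiciously large values
    (> 1e10) that were observed just before the crash.
    """
    out = a @ b
    if not np.all(np.isfinite(out)):
        raise FloatingPointError("NaN or Inf in matmul output")
    if np.abs(out).max() > threshold:
        raise FloatingPointError("suspiciously large matmul output")
    return out
```

In TF 1.x, wrapping the tensor with tf.check_numerics serves a similar purpose for the NaN/Inf part, although it cannot catch finite-but-wrong products.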

For now, I would just advise avoiding CUDA 10 with an RTX 2080 Ti.

@miiller

miiller commented Jan 10, 2020

I ported the model to Keras layers and tried training it on a Tesla V100 GPU (CUDA 10.2, TF 2.0), with the result of also getting NaN values after some epochs. After changing the KP influence from Gaussian to linear, everything worked fine, so I would assume the issue lies in the gradient computation for the Gaussian influence, although increasing the epsilon from 1e-9 to 1e-6 did not resolve the problem. The linear influence works just fine and in my case leads to good results with higher computational efficiency.
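For readers unfamiliar with the two influence modes, here is a rough sketch of the formulas (assumed from the KPConv paper, not copied from this repository; the 0.3 * radius sigma and the epsilon placement are assumptions). The linear influence has a bounded gradient everywhere, while the Gaussian path involves squared distances whose gradients can behave worse numerically, which may relate to the observation above.

```python
import numpy as np

def kp_influence(sq_dists, radius, mode="linear", eps=1e-9):
    """Kernel-point influence weights from squared point distances.

    linear:   max(0, 1 - d / radius), with eps inside the sqrt to
              keep the gradient finite at d == 0.
    gaussian: exp(-d^2 / (2 * sigma^2)), sigma assumed as 0.3 * radius.
    """
    sq_dists = np.asarray(sq_dists, dtype=float)
    if mode == "linear":
        d = np.sqrt(sq_dists + eps)
        return np.maximum(0.0, 1.0 - d / radius)
    sigma = 0.3 * radius
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```

Both modes give weight ~1 at the kernel point itself; the linear mode is exactly zero beyond the influence radius, whereas the Gaussian decays smoothly.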

@longmalongma

after a tf.matmul operation:

Thanks for your great work. I have had this problem for a long time; I want to know which Python version you used.

@HuguesTHOMAS
Owner

If I remember correctly, the Python version was 3.5 or 3.6. If you are willing to switch libraries, a newer implementation has been released in PyTorch.

@Arjun-NA

Just to mention my experience:

I got NaN when I used TensorFlow 1.15 with CUDA 10 and cuDNN 7.6.5,
with some specific configurations only,
on an NVIDIA Tesla P100 GPU.

@densechen

densechen commented Feb 6, 2021

I use TF 1.15 and also get NaN. You can work around this by reducing the batch size to 2.

@wuqianliang

Nice.

8 participants