
Overflow encountered in exp after 10 iters, and Segmentation fault (core dumped) after 40 iters. #20

Open
maxenceliu opened this issue Apr 14, 2017 · 15 comments


@maxenceliu

Has anyone encountered these two problems and fixed them?
Overflow in exp after 10 iters, and Segmentation fault after 40 iters.

@lihungchieh

Hi @maxenceliu, how long does each iteration take?

@maxenceliu
Author

2-3 seconds per iteration on a GTX 1080.

@Lafi

Lafi commented Apr 19, 2017

I've encountered the same problem. It seems that extreme dw and dh values cause the overflow in the exp function.

(screenshot omitted)
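A common mitigation in py-faster-rcnn-style code is to clamp dw and dh before calling np.exp, so the exponential cannot overflow. Below is a minimal sketch of bbox_transform_inv with that clamp; the constant BBOX_XFORM_CLIP and the exact array layout are assumptions based on the warnings quoted later in this thread, not this repository's actual code.

import numpy as np

# Largest allowed dw/dh; exp(log(1000/16)) caps predicted sizes at 1000/16 times the anchor size.
BBOX_XFORM_CLIP = np.log(1000.0 / 16.0)

def bbox_transform_inv_clipped(boxes, deltas):
    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dx = deltas[:, 0::4]
    dy = deltas[:, 1::4]
    dw = np.minimum(deltas[:, 2::4], BBOX_XFORM_CLIP)  # clamp before exp
    dh = np.minimum(deltas[:, 3::4], BBOX_XFORM_CLIP)  # clamp before exp

    pred_ctr_x = dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]
    pred_ctr_y = dy * heights[:, np.newaxis] + ctr_y[:, np.newaxis]
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]

    pred_boxes = np.zeros(deltas.shape, dtype=deltas.dtype)
    pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w
    pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * pred_h
    pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * pred_w
    pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * pred_h
    return pred_boxes

Note that this only bounds the decoded boxes; if the regression targets themselves diverge, the loss can still explode.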

@maxenceliu
Author

After the newest commit, total_loss explodes after 350 iterations because rpn_cls_loss explodes.

@maxenceliu
Author

The result is not stable; this time, regular_loss becomes NaN after 500 iters...

@KeyKy

KeyKy commented Apr 24, 2017

I also encountered the total_loss explosion when trying the newest commit. I implemented a Caffe version of Mask R-CNN and ran into the same problem.

Here is the loss with the newest commit:
iter 583: image-id:0272412, time:0.525(sec), regular_loss: 0.167962, total-loss 503.3103(0.0436, 488.1062, 0.000484, 14.5497, 0.6103), instances: 22, batch:(250|1016, 21|86, 21|21)

iter 584: image-id:0262213, time:0.359(sec), regular_loss: 0.177546, total-loss 739.0580(47.7280, 112.5653, 1.444757, 577.0145, 0.3054), instances: 1, batch:(1|33, 1|19, 1|1)

iter 585: image-id:0534559, time:0.429(sec), regular_loss: 0.355617, total-loss nan(nan, 1685118073183372762735414607872.0000, nan, 713696030880020606040835379691520.0000, 4810318291543261184.0000), instances: 16, batch:(128|528, 14|18, 14|14)

@Nikasa1889

Nikasa1889 commented Apr 27, 2017

I got a similar issue at iter 493; it seems to be caused by the RPN loss exploding. We might need to double-check the box matching strategy.

The error:

iter 493: image-id:0254301, time:1.600(sec), regular_loss: 0.215555, total-loss 0.9129(0.0529, 0.8338, 0.000000, 0.0263, 0.0000), instances: 29, batch:(89|372, 0|64, 0|0)
train/../libs/boxes/bbox_transform.py:62: RuntimeWarning: overflow encountered in exp
  pred_h = np.exp(dh) * heights[:, np.newaxis]
train/../libs/boxes/bbox_transform.py:62: RuntimeWarning: overflow encountered in multiply
  pred_h = np.exp(dh) * heights[:, np.newaxis]
iter 494: image-id:0115028, time:0.463(sec), regular_loss: 0.215730, total-loss 265.2788(9.3407, 255.5391, 0.000000, 0.3989, 0.0000), instances: 5, batch:(14|69, 0|64, 0|0)
train/../libs/boxes/bbox_transform.py:61: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]
train/../libs/boxes/bbox_transform.py:61: RuntimeWarning: overflow encountered in multiply
  pred_w = np.exp(dw) * widths[:, np.newaxis]
iter 495: image-id:0428026, time:0.363(sec), regular_loss: 0.221351, total-loss 773565.1250(48176.1797, 704032.8750, 0.000000, 21356.0664, 0.0000), instances: 1, batch:(17|92, 0|8, 0|0)
train/../libs/layers/sample.py:144: RuntimeWarning: invalid value encountered in greater_equal
  keep = np.where((ws >= min_size) & (hs >= min_size))[0]
iter 496: image-id:0165883, time:0.399(sec), regular_loss: 61892.210938, total-loss nan(nan, nan, 0.009564, 0.2174, 407103320661720213487616.0000), instances: 4, batch:(6|45, 12|76, 12|12)
[[ 365.89337158  268.93334961  729.21331787  530.10668945   17.        ]
 [ 447.27999878   25.89333344  759.37341309  428.58666992    1.        ]
 [ 134.04000854  234.01333618  334.67999268  385.97335815    1.        ]
 [ 353.80001831  273.25335693  852.85339355  632.81335449   60.        ]]
Traceback (most recent call last):
  File "train/train.py", line 195, in <module>
    train()
  File "train/train.py", line 178, in train
    raise
TypeError: exceptions must be old-style classes or derived from BaseException, not NoneType
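One way to keep a single bad step from killing the whole run is to check the fetched loss before continuing. A minimal sketch of such a guard, assuming a TF-1.x style loop with hypothetical names (total_loss, train_op, saver, train_dir) rather than this repo's actual train.py:

import math

def loss_is_finite(loss_value):
    # False for NaN or +/-inf, so the caller can restore a checkpoint instead of continuing.
    return not (math.isnan(loss_value) or math.isinf(loss_value))

# Inside a hypothetical training loop:
# loss_value, _ = sess.run([total_loss, train_op])
# if not loss_is_finite(loss_value):
#     saver.restore(sess, tf.train.latest_checkpoint(train_dir))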

@opikalo

opikalo commented Apr 28, 2017

I experienced the same issue, then updated to cuda_8.0.61_375.26 and cuDNN 5.1, and it went away. It could also be sporadic.

@Nikasa1889

I confirm that upgrading to CUDA 8.0 fixes the problem. Thank you very much @opikalo.

@Kongsea
Contributor

Kongsea commented May 4, 2017

I also encountered this problem.

After 29683 iters, it gives warnings:

train/../libs/boxes/bbox_transform.py:62: RuntimeWarning: overflow encountered in exp
pred_h = np.exp(dh) * heights[:, np.newaxis]
train/../libs/boxes/bbox_transform.py:61: RuntimeWarning: overflow encountered in exp
pred_w = np.exp(dw) * widths[:, np.newaxis]

Then, in iter 29684, the loss becomes unusual:

iter 29684: image-id:0094949, time:0.605(sec), regular_loss: 0.179757, total-loss 85438849024.0000(163221872.0000, 73605677056.0000, 30830362.000000, 11639122944.0000, 3.2994), instances: 8, batch:(125|524, 8|12, 8|8)
iter 29685: image-id:0357095, time:0.688(sec), regular_loss: 10989575769948160.000000, total-loss 2035863.0000(0.0033, 0.1700, 0.000137, 2035862.8750, 0.0118), instances: 2, batch:(32|152, 2|32, 2|2)
iter 29686: image-id:0094952, time:0.764(sec), regular_loss: nan, total-loss 5372209.0000(0.0358, 0.2918, 0.000548, 5372208.5000, 0.0244), instances: 9, batch:(312|1256, 9|46, 9|9)

NaN happens...

@opikalo @Nikasa1889 My cuda version is 8.0 and cudnn is 5.1 already.
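
If it helps to find exactly where the first overflow happens rather than scrolling past RuntimeWarnings, numpy can be told to raise instead of warn. A minimal sketch (np.seterr is process-wide, so only enable it while debugging):

import numpy as np

# Turn the silent overflow warning into an exception so the traceback
# points at the first offending np.exp call (e.g. bbox_transform.py:61-62).
np.seterr(over='raise', invalid='raise')

try:
    pred_h = np.exp(np.array([1000.0])) * np.array([16.0])  # deliberately overflows
except FloatingPointError as e:
    print('caught overflow:', e)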

@mrlooi

mrlooi commented May 5, 2017

I've gotten the overflow error quite a few times, all without changing anything. It seems that the overflow errors occur randomly, possibly caused by poor convergence in the weights. Unfortunately, the trick for now is to simply restart the training and pray it doesn't overflow again; that's working for me so far. @Kongsea, it seems you got pretty lucky reaching 29000+ iters before seeing the overflow; my first overflow was at < 1000 iters.

@mrlooi

mrlooi commented May 5, 2017

I would suggest changing the checkpoint interval in train/train.py from 10000 to a smaller value, e.g. 3000, so that you have more checkpoints to fall back on in case of an overflow error.
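
For reference, a rough sketch of what a smaller checkpoint interval looks like in a TF-1.x loop; the names (saver, sess, train_dir, step) and the 3000-step interval are illustrative, so adapt them to wherever train/train.py already calls saver.save:

import os
import tensorflow as tf

CKPT_EVERY = 3000  # was 10000; a smaller interval loses less work on an overflow

# saver = tf.train.Saver(max_to_keep=5)
# Inside the training loop:
# if step % CKPT_EVERY == 0:
#     saver.save(sess, os.path.join(train_dir, 'model.ckpt'), global_step=step)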

@kevinkit

kevinkit commented May 5, 2017

Can you "reproduce" the random occurrence with a certain seed for the random initialization?
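
For anyone who wants to try that, a minimal sketch of pinning the seeds at the top of train/train.py for a TF-1.x setup (note that cuDNN kernels can still be non-deterministic on GPU, so runs may not be exactly reproducible):

import random
import numpy as np
import tensorflow as tf

SEED = 1234  # arbitrary fixed seed

random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)  # graph-level seed in TF 1.x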

@tianzq

tianzq commented May 10, 2017

I had a similar error. My CUDA is 8.0 and cuDNN is 5.1.
I found that I hadn't added the CUDA paths to my environment:

export PATH=/usr/local/cuda-8.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH

After adding the paths, the problem was solved.

@meetps

meetps commented Feb 19, 2018

Check my comment here
