
Overflow encountered in exp after 10 iters, and Segmentation fault (core dumped) after 40 iters. #20

Open
maxenceliu opened this issue Apr 14, 2017 · 15 comments


@maxenceliu

Has anyone encountered these two problems and fixed them?
Overflow in exp after 10 iters, and Segmentation fault after 40 iters.

@lihungchieh

Hi @maxenceliu, how long does each iteration take?

@maxenceliu
Author

2-3 seconds per iteration on a GTX 1080.

@Lafi

Lafi commented Apr 19, 2017

I've encountered the same problem. It seems that extreme dw and dh values cause the overflow in the exp function.

(screenshot omitted)
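A common mitigation in py-faster-rcnn-style code is to clamp dw and dh before calling np.exp, so the exponential cannot overflow. Below is a minimal sketch of bbox_transform_inv with that clamp; the constant BBOX_XFORM_CLIP and the exact array layout are assumptions based on the warnings quoted later in this thread, not this repository's actual code.

import numpy as np

# Largest allowed dw/dh; exp(log(1000/16)) caps predicted sizes at 1000/16 times the anchor size.
BBOX_XFORM_CLIP = np.log(1000.0 / 16.0)

def bbox_transform_inv_clipped(boxes, deltas):
    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dx = deltas[:, 0::4]
    dy = deltas[:, 1::4]
    dw = np.minimum(deltas[:, 2::4], BBOX_XFORM_CLIP)  # clamp before exp
    dh = np.minimum(deltas[:, 3::4], BBOX_XFORM_CLIP)  # clamp before exp

    pred_ctr_x = dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]
    pred_ctr_y = dy * heights[:, np.newaxis] + ctr_y[:, np.newaxis]
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]

    pred_boxes = np.zeros(deltas.shape, dtype=deltas.dtype)
    pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w
    pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * pred_h
    pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * pred_w
    pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * pred_h
    return pred_boxes

Note that this only bounds the decoded boxes; if the regression targets themselves diverge, the loss can still explode.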

@maxenceliu
Author

After the newest commit, total_loss explodes after 350 iterations because rpn_cls_loss explodes.

@maxenceliu
Author

The result is not stable; this time, regular_loss becomes NaN after 500 iters...

@KeyKy

KeyKy commented Apr 24, 2017

I also encountered the total_loss explosion when trying the newest commit. I implemented a Caffe version of Mask R-CNN and ran into the same problem.

Here is the loss with the newest commit:
iter 583: image-id:0272412, time:0.525(sec), regular_loss: 0.167962, total-loss 503.3103(0.0436, 488.1062, 0.000484, 14.5497, 0.6103), instances: 22, batch:(250|1016, 21|86, 21|21)

iter 584: image-id:0262213, time:0.359(sec), regular_loss: 0.177546, total-loss 739.0580(47.7280, 112.5653, 1.444757, 577.0145, 0.3054), instances: 1, batch:(1|33, 1|19, 1|1)

iter 585: image-id:0534559, time:0.429(sec), regular_loss: 0.355617, total-loss nan(nan, 1685118073183372762735414607872.0000, nan, 713696030880020606040835379691520.0000, 4810318291543261184.0000), instances: 16, batch:(128|528, 14|18, 14|14)

@Nikasa1889

Nikasa1889 commented Apr 27, 2017

I got a similar issue at iter 493; it seems to be caused by the RPN loss exploding. We might need to double-check the box matching strategy.

The error:

iter 493: image-id:0254301, time:1.600(sec), regular_loss: 0.215555, total-loss 0.9129(0.0529, 0.8338, 0.000000, 0.0263, 0.0000), instances: 29, batch:(89|372, 0|64, 0|0)
train/../libs/boxes/bbox_transform.py:62: RuntimeWarning: overflow encountered in exp
  pred_h = np.exp(dh) * heights[:, np.newaxis]
train/../libs/boxes/bbox_transform.py:62: RuntimeWarning: overflow encountered in multiply
  pred_h = np.exp(dh) * heights[:, np.newaxis]
iter 494: image-id:0115028, time:0.463(sec), regular_loss: 0.215730, total-loss 265.2788(9.3407, 255.5391, 0.000000, 0.3989, 0.0000), instances: 5, batch:(14|69, 0|64, 0|0)
train/../libs/boxes/bbox_transform.py:61: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]
train/../libs/boxes/bbox_transform.py:61: RuntimeWarning: overflow encountered in multiply
  pred_w = np.exp(dw) * widths[:, np.newaxis]
iter 495: image-id:0428026, time:0.363(sec), regular_loss: 0.221351, total-loss 773565.1250(48176.1797, 704032.8750, 0.000000, 21356.0664, 0.0000), instances: 1, batch:(17|92, 0|8, 0|0)
train/../libs/layers/sample.py:144: RuntimeWarning: invalid value encountered in greater_equal
  keep = np.where((ws >= min_size) & (hs >= min_size))[0]
iter 496: image-id:0165883, time:0.399(sec), regular_loss: 61892.210938, total-loss nan(nan, nan, 0.009564, 0.2174, 407103320661720213487616.0000), instances: 4, batch:(6|45, 12|76, 12|12)
[[ 365.89337158  268.93334961  729.21331787  530.10668945   17.        ]
 [ 447.27999878   25.89333344  759.37341309  428.58666992    1.        ]
 [ 134.04000854  234.01333618  334.67999268  385.97335815    1.        ]
 [ 353.80001831  273.25335693  852.85339355  632.81335449   60.        ]]
Traceback (most recent call last):
  File "train/train.py", line 195, in <module>
    train()
  File "train/train.py", line 178, in train
    raise
TypeError: exceptions must be old-style classes or derived from BaseException, not NoneType
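One way to keep a single bad step from killing the whole run is to check the fetched loss before continuing. A minimal sketch of such a guard, assuming a TF-1.x style loop with hypothetical names (total_loss, train_op, saver, train_dir) rather than this repo's actual train.py:

import math

def loss_is_finite(loss_value):
    # False for NaN or +/-inf, so the caller can restore a checkpoint instead of continuing.
    return not (math.isnan(loss_value) or math.isinf(loss_value))

# Inside a hypothetical training loop:
# loss_value, _ = sess.run([total_loss, train_op])
# if not loss_is_finite(loss_value):
#     saver.restore(sess, tf.train.latest_checkpoint(train_dir))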

@opikalo

opikalo commented Apr 28, 2017

I experienced the same issue, then updated to cuda_8.0.61_375.26 and cuDNN 5.1, and it went away. It could also be sporadic.

@Nikasa1889

I confirm that upgrading to CUDA 8.0 fixes the problem. Thank you very much @opikalo.

@Kongsea
Contributor

Kongsea commented May 4, 2017

I also encountered this problem.

After 29683 iters, it gives warnings:

train/../libs/boxes/bbox_transform.py:62: RuntimeWarning: overflow encountered in exp
pred_h = np.exp(dh) * heights[:, np.newaxis]
train/../libs/boxes/bbox_transform.py:61: RuntimeWarning: overflow encountered in exp
pred_w = np.exp(dw) * widths[:, np.newaxis]

Then, in iter 29684, the loss becomes unusual:

iter 29684: image-id:0094949, time:0.605(sec), regular_loss: 0.179757, total-loss 85438849024.0000(163221872.0000, 73605677056.0000, 30830362.000000, 11639122944.0000, 3.2994), instances: 8, batch:(125|524, 8|12, 8|8)
iter 29685: image-id:0357095, time:0.688(sec), regular_loss: 10989575769948160.000000, total-loss 2035863.0000(0.0033, 0.1700, 0.000137, 2035862.8750, 0.0118), instances: 2, batch:(32|152, 2|32, 2|2)
iter 29686: image-id:0094952, time:0.764(sec), regular_loss: nan, total-loss 5372209.0000(0.0358, 0.2918, 0.000548, 5372208.5000, 0.0244), instances: 9, batch:(312|1256, 9|46, 9|9)

NaN happens...

@opikalo @Nikasa1889 My cuda version is 8.0 and cudnn is 5.1 already.
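
If it helps to find exactly where the first overflow happens rather than scrolling past RuntimeWarnings, numpy can be told to raise instead of warn. A minimal sketch (np.seterr is process-wide, so only enable it while debugging):

import numpy as np

# Turn the silent overflow warning into an exception so the traceback
# points at the first offending np.exp call (e.g. bbox_transform.py:61-62).
np.seterr(over='raise', invalid='raise')

try:
    pred_h = np.exp(np.array([1000.0])) * np.array([16.0])  # deliberately overflows
except FloatingPointError as e:
    print('caught overflow:', e)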

@mrlooi

mrlooi commented May 5, 2017

I've gotten the overflow error quite a few times, all without changing anything. It seems that the overflow errors occur randomly, possibly caused by poor convergence in the weights. Unfortunately, the trick for now is to simply restart the training and pray it doesn't overflow again; that's working for me so far. @Kongsea, it seems you got pretty lucky reaching 29000+ iters before seeing the overflow; my first overflow was at < 1000 iters.

@mrlooi

mrlooi commented May 5, 2017

I would suggest changing the checkpoint interval in train/train.py from 10000 to a smaller value, e.g. 3000, so that you have more checkpoints to fall back on in case of an overflow error.
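
For reference, a rough sketch of what a smaller checkpoint interval looks like in a TF-1.x loop; the names (saver, sess, train_dir, step) and the 3000-step interval are illustrative, so adapt them to wherever train/train.py already calls saver.save:

import os
import tensorflow as tf

CKPT_EVERY = 3000  # was 10000; a smaller interval loses less work on an overflow

# saver = tf.train.Saver(max_to_keep=5)
# Inside the training loop:
# if step % CKPT_EVERY == 0:
#     saver.save(sess, os.path.join(train_dir, 'model.ckpt'), global_step=step)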

@kevinkit

kevinkit commented May 5, 2017

Can you "reproduce" the random occurrence with a certain seed for the random initialization?
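
For anyone who wants to try that, a minimal sketch of pinning the seeds at the top of train/train.py for a TF-1.x setup (note that cuDNN kernels can still be non-deterministic on GPU, so runs may not be exactly reproducible):

import random
import numpy as np
import tensorflow as tf

SEED = 1234  # arbitrary fixed seed

random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)  # graph-level seed in TF 1.x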

@tianzq

tianzq commented May 10, 2017

I had a similar error. My CUDA is 8.0 and cuDNN is 5.1.
I found that I hadn't added the CUDA paths to my environment:

export PATH=/usr/local/cuda-8.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH

After adding the paths, the problem was solved.

@meetps

meetps commented Feb 19, 2018

Check my comment here
