Loss output increases and always stops at 87.3365 while learning in GPU-mode, however it decreases and I can get quite good accuracy in CPU-mode. #6130

Closed
xsmiledur opened this issue Dec 25, 2017 · 11 comments


xsmiledur commented Dec 25, 2017

Issue summary

I'm using a SoftmaxWithLoss layer to monitor the loss of my CNN.
In CPU mode I get very good accuracy.
In GPU mode, however, the loss of both the train and test nets always climbs to a high value and stops at 87.3365. This is very strange, and I'm stuck.
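
The value 87.3365 is not arbitrary. Below is a minimal sketch of where it can come from, assuming the loss layer clamps the predicted probability at FLT_MIN (the smallest positive normalized float32, 2^-126) before taking the logarithm, as Caffe's softmax loss implementation appears to do. The helper name `clamped_log_loss` is hypothetical and is not part of Caffe.

```python
import math

# Smallest positive normalized float32 (FLT_MIN in C's <float.h>): 2^-126.
FLT_MIN = 2.0 ** -126


def clamped_log_loss(prob: float) -> float:
    """Negative log-likelihood with the probability clamped at FLT_MIN.

    Assumption: this mirrors the clamping done inside the loss layer.
    """
    return -math.log(max(prob, FLT_MIN))


# If the probability assigned to the true class underflows to zero,
# the per-sample loss cannot exceed -log(FLT_MIN).
saturated_loss = clamped_log_loss(0.0)
print(f"{saturated_loss:.4f}")  # 87.3365
```

Under that assumption, a loss pinned at exactly this value suggests that the probability of the true class has collapsed to zero for the whole batch (for example because the activations diverged), rather than that training has settled on a plateau.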

I have looked into this problem many times and found many people with almost the same issue, but none of the answers address the root cause. All of them say "lower the learning rate" or "adjust the hyperparameters". I tried all of that, but nothing helped.
Most answers point to the learning rate, but when I set a much lower learning rate (e.g. 10^(-11)), the loss neither increased nor decreased at all, and the accuracy didn't change either (it stayed at e.g. 0.1).

At first I thought the cause was my environment, but I found nothing in Caffe's issues to support that, and I now think it isn't.
I looked at "I can't run Ur code correctly. [id_loss = 87.3365 (* 1 = 87.3365 loss)] #28" and tried what it suggests, but Caffe master doesn't include openmpi and nothing worked.

Surprisingly, the MNIST example works on the GPU. However, when I apply lenet_solver.prototxt and lenet_train_test.prototxt to my own data, it doesn't work. I suspect the cause lies in the difference between the MNIST data and my data (JPEG converted to LMDB or LevelDB), but I can't figure out what it is.

I'm really stuck, please help me.


For example, in GPU mode the loss reached 87.3365 within 8 iterations. Here is the loss transition (learning rate: 0.001):

I1225 12:24:44.565865 19262 solver.cpp:397] Test net output #0: accuracy = 0.511875
I1225 12:24:44.565958 19262 solver.cpp:397] Test net output #1: loss = 1.6797 (* 1 = 1.6797 loss)
I1225 12:24:44.566012 19262 solver.cpp:397] Test net output #2: per_class_accuracy = 1
I1225 12:24:44.566063 19262 solver.cpp:397] Test net output #3: per_class_accuracy = 0.543384
I1225 12:24:44.566123 19262 solver.cpp:397] Test net output #4: per_class_accuracy = 0.266863
I1225 12:24:44.566174 19262 solver.cpp:397] Test net output #5: per_class_accuracy = 0.179504
I1225 12:24:45.072780 19262 solver.cpp:218] Iteration 0 (0 iter/s, 2.23846s/1 iters), loss = 1.65501
I1225 12:24:45.072860 19262 solver.cpp:237] Train net output #0: accuracy = 0.558594
I1225 12:24:45.072885 19262 solver.cpp:237] Train net output #1: loss = 1.65501 (* 1 = 1.65501 loss)
I1225 12:24:45.072898 19262 solver.cpp:237] Train net output #2: per_class_accuracy = 1
I1225 12:24:45.072911 19262 solver.cpp:237] Train net output #3: per_class_accuracy = 0.657534
I1225 12:24:45.072923 19262 solver.cpp:237] Train net output #4: per_class_accuracy = 0.315789
I1225 12:24:45.072937 19262 solver.cpp:237] Train net output #5: per_class_accuracy = 0.196721
I1225 12:24:45.072973 19262 sgd_solver.cpp:105] Iteration 0, lr = 0.0001
I1225 12:24:45.551261 19262 solver.cpp:218] Iteration 1 (2.09054 iter/s, 0.478346s/1 iters), loss = 1.53685
I1225 12:24:45.551345 19262 solver.cpp:237] Train net output #0: accuracy = 0.457031
I1225 12:24:45.551369 19262 solver.cpp:237] Train net output #1: loss = 1.53685 (* 1 = 1.53685 loss)
I1225 12:24:45.551384 19262 solver.cpp:237] Train net output #2: per_class_accuracy = 0.507692
I1225 12:24:45.551399 19262 solver.cpp:237] Train net output #3: per_class_accuracy = 0.965517
I1225 12:24:45.551417 19262 solver.cpp:237] Train net output #4: per_class_accuracy = 0.05
I1225 12:24:45.551445 19262 solver.cpp:237] Train net output #5: per_class_accuracy = 0.342466
I1225 12:24:45.551462 19262 sgd_solver.cpp:105] Iteration 1, lr = 9.99925e-05
I1225 12:24:46.033396 19262 solver.cpp:218] Iteration 2 (2.07471 iter/s, 0.481996s/1 iters), loss = 2.32121
I1225 12:24:46.033449 19262 solver.cpp:237] Train net output #0: accuracy = 0.480469
I1225 12:24:46.033465 19262 solver.cpp:237] Train net output #1: loss = 2.32121 (* 1 = 2.32121 loss)
I1225 12:24:46.033474 19262 solver.cpp:237] Train net output #2: per_class_accuracy = 0
I1225 12:24:46.033483 19262 solver.cpp:237] Train net output #3: per_class_accuracy = 1
I1225 12:24:46.033489 19262 solver.cpp:237] Train net output #4: per_class_accuracy = 0.0384615
I1225 12:24:46.033496 19262 solver.cpp:237] Train net output #5: per_class_accuracy = 0.966667
I1225 12:24:46.033509 19262 sgd_solver.cpp:105] Iteration 2, lr = 9.9985e-05
I1225 12:24:46.526967 19262 solver.cpp:218] Iteration 3 (2.02649 iter/s, 0.493465s/1 iters), loss = 7.23706
I1225 12:24:46.527034 19262 solver.cpp:237] Train net output #0: accuracy = 0.542969
I1225 12:24:46.527056 19262 solver.cpp:237] Train net output #1: loss = 7.23706 (* 1 = 7.23706 loss)
I1225 12:24:46.527070 19262 solver.cpp:237] Train net output #2: per_class_accuracy = 0
I1225 12:24:46.527084 19262 solver.cpp:237] Train net output #3: per_class_accuracy = 1
I1225 12:24:46.527099 19262 solver.cpp:237] Train net output #4: per_class_accuracy = 0
I1225 12:24:46.527114 19262 solver.cpp:237] Train net output #5: per_class_accuracy = 1
I1225 12:24:46.527132 19262 sgd_solver.cpp:105] Iteration 3, lr = 9.99775e-05
I1225 12:24:47.011711 19262 solver.cpp:218] Iteration 4 (2.06347 iter/s, 0.484621s/1 iters), loss = 41.4759
I1225 12:24:47.011860 19262 solver.cpp:237] Train net output #0: accuracy = 0.476562
I1225 12:24:47.011926 19262 solver.cpp:237] Train net output #1: loss = 41.4759 (* 1 = 41.4759 loss)
I1225 12:24:47.011979 19262 solver.cpp:237] Train net output #2: per_class_accuracy = 0
I1225 12:24:47.012030 19262 solver.cpp:237] Train net output #3: per_class_accuracy = 1
I1225 12:24:47.012080 19262 solver.cpp:237] Train net output #4: per_class_accuracy = 0
I1225 12:24:47.012123 19262 solver.cpp:237] Train net output #5: per_class_accuracy = 1
I1225 12:24:47.012166 19262 sgd_solver.cpp:105] Iteration 4, lr = 9.997e-05
I1225 12:24:47.487167 19262 solver.cpp:218] Iteration 5 (2.10414 iter/s, 0.475254s/1 iters), loss = 68.1057
I1225 12:24:47.487270 19262 solver.cpp:237] Train net output #0: accuracy = 0.4375
I1225 12:24:47.487294 19262 solver.cpp:237] Train net output #1: loss = 68.1057 (* 1 = 68.1057 loss)
I1225 12:24:47.487313 19262 solver.cpp:237] Train net output #2: per_class_accuracy = 0
I1225 12:24:47.487325 19262 solver.cpp:237] Train net output #3: per_class_accuracy = 1
I1225 12:24:47.487340 19262 solver.cpp:237] Train net output #4: per_class_accuracy = 1
I1225 12:24:47.487352 19262 solver.cpp:237] Train net output #5: per_class_accuracy = 0
I1225 12:24:47.487367 19262 sgd_solver.cpp:105] Iteration 5, lr = 9.99625e-05
I1225 12:24:47.984722 19262 solver.cpp:218] Iteration 6 (2.01046 iter/s, 0.4974s/1 iters), loss = 69.9375
I1225 12:24:47.984786 19262 solver.cpp:237] Train net output #0: accuracy = 0.453125
I1225 12:24:47.984808 19262 solver.cpp:237] Train net output #1: loss = 69.9375 (* 1 = 69.9375 loss)
I1225 12:24:47.984823 19262 solver.cpp:237] Train net output #2: per_class_accuracy = 0
I1225 12:24:47.984838 19262 solver.cpp:237] Train net output #3: per_class_accuracy = 1
I1225 12:24:47.984853 19262 solver.cpp:237] Train net output #4: per_class_accuracy = 1
I1225 12:24:47.984869 19262 solver.cpp:237] Train net output #5: per_class_accuracy = 0
I1225 12:24:47.984887 19262 sgd_solver.cpp:105] Iteration 6, lr = 9.9955e-05
I1225 12:24:48.469370 19262 solver.cpp:218] Iteration 7 (2.06388 iter/s, 0.484524s/1 iters), loss = 62.432
I1225 12:24:48.469527 19262 solver.cpp:237] Train net output #0: accuracy = 0.527344
I1225 12:24:48.469597 19262 solver.cpp:237] Train net output #1: loss = 62.432 (* 1 = 62.432 loss)
I1225 12:24:48.469650 19262 solver.cpp:237] Train net output #2: per_class_accuracy = 0
I1225 12:24:48.469733 19262 solver.cpp:237] Train net output #3: per_class_accuracy = 1
I1225 12:24:48.469785 19262 solver.cpp:237] Train net output #4: per_class_accuracy = 0
I1225 12:24:48.469828 19262 solver.cpp:237] Train net output #5: per_class_accuracy = 1
I1225 12:24:48.469871 19262 sgd_solver.cpp:105] Iteration 7, lr = 9.99475e-05
I1225 12:24:48.966949 19262 solver.cpp:218] Iteration 8 (2.0106 iter/s, 0.497365s/1 iters), loss = 87.3365
I1225 12:24:48.967027 19262 solver.cpp:237] Train net output #0: accuracy = 1
I1225 12:24:48.967051 19262 solver.cpp:237] Train net output #1: loss = 87.3365 (* 1 = 87.3365 loss)
I1225 12:24:48.967067 19262 solver.cpp:237] Train net output #2: per_class_accuracy = 1
I1225 12:24:48.967082 19262 solver.cpp:237] Train net output #3: per_class_accuracy = 1
I1225 12:24:48.967098 19262 solver.cpp:237] Train net output #4: per_class_accuracy = 1
I1225 12:24:48.967113 19262 solver.cpp:237] Train net output #5: per_class_accuracy = 1
I1225 12:24:48.967131 19262 sgd_solver.cpp:105] Iteration 8, lr = 9.994e-05
I1225 12:24:49.451747 19262 solver.cpp:218] Iteration 9 (2.06329 iter/s, 0.484663s/1 iters), loss = 87.3365
I1225 12:24:49.451824 19262 solver.cpp:237] Train net output #0: accuracy = 1
I1225 12:24:49.451848 19262 solver.cpp:237] Train net output #1: loss = 87.3365 (* 1 = 87.3365 loss)
I1225 12:24:49.451864 19262 solver.cpp:237] Train net output #2: per_class_accuracy = 1
I1225 12:24:49.451877 19262 solver.cpp:237] Train net output #3: per_class_accuracy = 1
I1225 12:24:49.451890 19262 solver.cpp:237] Train net output #4: per_class_accuracy = 1
I1225 12:24:49.451908 19262 solver.cpp:237] Train net output #5: per_class_accuracy = 1
I1225 12:24:49.451925 19262 sgd_solver.cpp:105] Iteration 9, lr = 9.99325e-05
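
To find where a long log first hits the saturated value, the per-iteration loss can be pulled out of the `solver.cpp` lines. The following is a rough sketch that assumes the log line format shown above. The function name `first_saturated_iteration` and the tolerance are illustrative choices and are not part of Caffe's tooling.

```python
import re

# Matches lines such as:
#   "... Iteration 8 (2.0106 iter/s, 0.497365s/1 iters), loss = 87.3365"
# Lines like "Iteration 8, lr = ..." are skipped because they have no "(...)" part.
ITER_LOSS = re.compile(r"Iteration (\d+) \(.*?\), loss = ([0-9.eE+-]+)")

SATURATED = 87.3365  # the value the loss gets stuck at in this issue


def first_saturated_iteration(log_text, tol=1e-3):
    """Return the first iteration whose loss equals SATURATED, or None."""
    for match in ITER_LOSS.finditer(log_text):
        iteration = int(match.group(1))
        loss = float(match.group(2))
        if abs(loss - SATURATED) < tol:
            return iteration
    return None


sample = """\
I1225 12:24:48.469370 19262 solver.cpp:218] Iteration 7 (2.06388 iter/s, 0.484524s/1 iters), loss = 62.432
I1225 12:24:48.469871 19262 sgd_solver.cpp:105] Iteration 7, lr = 9.99475e-05
I1225 12:24:48.966949 19262 solver.cpp:218] Iteration 8 (2.0106 iter/s, 0.497365s/1 iters), loss = 87.3365
I1225 12:24:49.451747 19262 solver.cpp:218] Iteration 9 (2.06329 iter/s, 0.484663s/1 iters), loss = 87.3365
"""

print(first_saturated_iteration(sample))  # 8
```

Running the same function over a full training log read from a file gives the iteration to compare against the learning-rate schedule or a snapshot.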


I'll also show the output in CPU mode. Of course, I changed nothing except 'solver_mode: CPU #GPU'. (CPU mode is so slow that I can only show the first part of the output, but after leaving it running overnight I got an accuracy of 1.0 and a loss close to 0 in both the train and test phases.)

I1225 13:11:23.387768 20715 solver.cpp:397] Test net output #0: accuracy = 0.595
I1225 13:11:23.387887 20715 solver.cpp:397] Test net output #1: loss = 1.545 (* 1 = 1.545 loss)
I1225 13:11:23.387904 20715 solver.cpp:397] Test net output #2: per_class_accuracy = 0.973855
I1225 13:11:23.387919 20715 solver.cpp:397] Test net output #3: per_class_accuracy = 0.940514
I1225 13:11:23.387933 20715 solver.cpp:397] Test net output #4: per_class_accuracy = 0.0162946
I1225 13:11:23.387945 20715 solver.cpp:397] Test net output #5: per_class_accuracy = 0.382658
I1225 13:11:37.222299 20715 solver.cpp:218] Iteration 0 (-1.58519e-33 iter/s, 45.862s/1 iters), loss = 1.5431
I1225 13:11:37.222355 20715 solver.cpp:237] Train net output #0: accuracy = 0.621094
I1225 13:11:37.222375 20715 solver.cpp:237] Train net output #1: loss = 1.5431 (* 1 = 1.5431 loss)
I1225 13:11:37.222389 20715 solver.cpp:237] Train net output #2: per_class_accuracy = 0.969231
I1225 13:11:37.222404 20715 solver.cpp:237] Train net output #3: per_class_accuracy = 0.931507
I1225 13:11:37.222416 20715 solver.cpp:237] Train net output #4: per_class_accuracy = 0
I1225 13:11:37.222429 20715 solver.cpp:237] Train net output #5: per_class_accuracy = 0.459016
I1225 13:11:37.222458 20715 sgd_solver.cpp:105] Iteration 0, lr = 0.0001
I1225 13:11:51.407929 20715 solver.cpp:218] Iteration 1 (0.070497 iter/s, 14.185s/1 iters), loss = 1.59465
I1225 13:11:51.407979 20715 solver.cpp:237] Train net output #0: accuracy = 0.5625
I1225 13:11:51.407997 20715 solver.cpp:237] Train net output #1: loss = 1.59465 (* 1 = 1.59465 loss)
I1225 13:11:51.408012 20715 solver.cpp:237] Train net output #2: per_class_accuracy = 0.984615
I1225 13:11:51.408025 20715 solver.cpp:237] Train net output #3: per_class_accuracy = 0.896552
I1225 13:11:51.408036 20715 solver.cpp:237] Train net output #4: per_class_accuracy = 0.0333333
I1225 13:11:51.408051 20715 solver.cpp:237] Train net output #5: per_class_accuracy = 0.356164
I1225 13:11:51.408080 20715 sgd_solver.cpp:105] Iteration 1, lr = 9.99925e-05
I1225 13:12:03.350549 20715 solver.cpp:218] Iteration 2 (0.0837381 iter/s, 11.942s/1 iters), loss = 1.4225
I1225 13:12:03.350735 20715 solver.cpp:237] Train net output #0: accuracy = 0.660156
I1225 13:12:03.350762 20715 solver.cpp:237] Train net output #1: loss = 1.4225 (* 1 = 1.4225 loss)
I1225 13:12:03.350777 20715 solver.cpp:237] Train net output #2: per_class_accuracy = 0.962963
I1225 13:12:03.350790 20715 solver.cpp:237] Train net output #3: per_class_accuracy = 0.936508
I1225 13:12:03.350803 20715 solver.cpp:237] Train net output #4: per_class_accuracy = 0.0576923
I1225 13:12:03.350817 20715 solver.cpp:237] Train net output #5: per_class_accuracy = 0.483333
I1225 13:12:03.350836 20715 sgd_solver.cpp:105] Iteration 2, lr = 9.9985e-05
I1225 13:12:17.025830 20715 solver.cpp:218] Iteration 3 (0.0731261 iter/s, 13.675s/1 iters), loss = 1.48873
I1225 13:12:17.025892 20715 solver.cpp:237] Train net output #0: accuracy = 0.605469
I1225 13:12:17.025912 20715 solver.cpp:237] Train net output #1: loss = 1.48873 (* 1 = 1.48873 loss)
I1225 13:12:17.025926 20715 solver.cpp:237] Train net output #2: per_class_accuracy = 0.932203
I1225 13:12:17.025939 20715 solver.cpp:237] Train net output #3: per_class_accuracy = 0.930556
I1225 13:12:17.025964 20715 solver.cpp:237] Train net output #4: per_class_accuracy = 0.0172414
I1225 13:12:17.025979 20715 solver.cpp:237] Train net output #5: per_class_accuracy = 0.477612
I1225 13:12:17.026003 20715 sgd_solver.cpp:105] Iteration 3, lr = 9.99775e-05
I1225 13:12:30.591960 20715 solver.cpp:218] Iteration 4 (0.0737137 iter/s, 13.566s/1 iters), loss = 1.43087
I1225 13:12:30.592020 20715 solver.cpp:237] Train net output #0: accuracy = 0.617188
I1225 13:12:30.592042 20715 solver.cpp:237] Train net output #1: loss = 1.43087 (* 1 = 1.43087 loss)
I1225 13:12:30.592057 20715 solver.cpp:237] Train net output #2: per_class_accuracy = 0.9
I1225 13:12:30.592070 20715 solver.cpp:237] Train net output #3: per_class_accuracy = 0.912281
I1225 13:12:30.592083 20715 solver.cpp:237] Train net output #4: per_class_accuracy = 0.046875
I1225 13:12:30.592097 20715 solver.cpp:237] Train net output #5: per_class_accuracy = 0.615385
I1225 13:12:30.592113 20715 sgd_solver.cpp:105] Iteration 4, lr = 9.997e-05
I1225 13:12:43.663581 20715 solver.cpp:218] Iteration 5 (0.0765052 iter/s, 13.071s/1 iters), loss = 1.38299
I1225 13:12:43.663705 20715 solver.cpp:237] Train net output #0: accuracy = 0.597656
I1225 13:12:43.663733 20715 solver.cpp:237] Train net output #1: loss = 1.38299 (* 1 = 1.38299 loss)
I1225 13:12:43.663749 20715 solver.cpp:237] Train net output #2: per_class_accuracy = 0.814286
I1225 13:12:43.663763 20715 solver.cpp:237] Train net output #3: per_class_accuracy = 0.767857
I1225 13:12:43.663777 20715 solver.cpp:237] Train net output #4: per_class_accuracy = 0.0892857
I1225 13:12:43.663791 20715 solver.cpp:237] Train net output #5: per_class_accuracy = 0.648649
I1225 13:12:43.663810 20715 sgd_solver.cpp:105] Iteration 5, lr = 9.99625e-05

...

I1225 13:42:05.647958 20715 solver.cpp:218] Iteration 139 (0.0801475 iter/s, 12.477s/1 iters), loss = 1.07221
I1225 13:42:05.648006 20715 solver.cpp:237] Train net output #0: accuracy = 0.800781
I1225 13:42:05.648020 20715 solver.cpp:237] Train net output #1: loss = 1.07221 (* 1 = 1.07221 loss)
I1225 13:42:05.648030 20715 solver.cpp:237] Train net output #2: per_class_accuracy = 0.810811
I1225 13:42:05.648039 20715 solver.cpp:237] Train net output #3: per_class_accuracy = 0.777778
I1225 13:42:05.648046 20715 solver.cpp:237] Train net output #4: per_class_accuracy = 0.79661
I1225 13:42:05.648054 20715 solver.cpp:237] Train net output #5: per_class_accuracy = 0.816667
I1225 13:42:05.648066 20715 sgd_solver.cpp:105] Iteration 139, lr = 9.897e-05
I1225 13:42:18.277225 20715 solver.cpp:218] Iteration 140 (0.0791828 iter/s, 12.629s/1 iters), loss = 1.05339
I1225 13:42:18.277272 20715 solver.cpp:237] Train net output #0: accuracy = 0.839844
I1225 13:42:18.277287 20715 solver.cpp:237] Train net output #1: loss = 1.05339 (* 1 = 1.05339 loss)
I1225 13:42:18.277295 20715 solver.cpp:237] Train net output #2: per_class_accuracy = 0.837838
I1225 13:42:18.277312 20715 solver.cpp:237] Train net output #3: per_class_accuracy = 0.847458
I1225 13:42:18.277321 20715 solver.cpp:237] Train net output #4: per_class_accuracy = 0.83871
I1225 13:42:18.277328 20715 solver.cpp:237] Train net output #5: per_class_accuracy = 0.836066
I1225 13:42:18.277339 20715 sgd_solver.cpp:105] Iteration 140, lr = 9.89627e-05
I1225 13:42:30.137760 20715 solver.cpp:218] Iteration 141 (0.084317 iter/s, 11.86s/1 iters), loss = 1.06517
I1225 13:42:30.137866 20715 solver.cpp:237] Train net output #0: accuracy = 0.828125
I1225 13:42:30.137893 20715 solver.cpp:237] Train net output #1: loss = 1.06517 (* 1 = 1.06517 loss)
I1225 13:42:30.137908 20715 solver.cpp:237] Train net output #2: per_class_accuracy = 0.8
I1225 13:42:30.137923 20715 solver.cpp:237] Train net output #3: per_class_accuracy = 0.871429
I1225 13:42:30.137936 20715 solver.cpp:237] Train net output #4: per_class_accuracy = 0.76
I1225 13:42:30.137962 20715 solver.cpp:237] Train net output #5: per_class_accuracy = 0.863636
I1225 13:42:30.137980 20715 sgd_solver.cpp:105] Iteration 141, lr = 9.89554e-05
I1225 13:42:42.738821 20715 solver.cpp:218] Iteration 142 (0.0793651 iter/s, 12.6s/1 iters), loss = 1.04358
I1225 13:42:42.738862 20715 solver.cpp:237] Train net output #0: accuracy = 0.84375
I1225 13:42:42.738875 20715 solver.cpp:237] Train net output #1: loss = 1.04358 (* 1 = 1.04358 loss)
I1225 13:42:42.738884 20715 solver.cpp:237] Train net output #2: per_class_accuracy = 0.806452
I1225 13:42:42.738893 20715 solver.cpp:237] Train net output #3: per_class_accuracy = 0.846154
I1225 13:42:42.738901 20715 solver.cpp:237] Train net output #4: per_class_accuracy = 0.822581
I1225 13:42:42.738909 20715 solver.cpp:237] Train net output #5: per_class_accuracy = 0.907407
I1225 13:42:42.738919 20715 sgd_solver.cpp:105] Iteration 142, lr = 9.89481e-05

Your system configuration

Operating system: Ubuntu 14.04
Compiler: g++
CUDA version (if applicable): 8.0
CUDNN version (if applicable): unknown, but turning cuDNN off didn't change the problem.
GPU: Titan X (compute capability is sufficient)


duygusar commented Dec 26, 2017

@xsmiledur What is your batch size? Could it be too low, say <5? That is especially problematic if your batch size is only 1. I think it is a problem with NVIDIA, which would explain why you only see this on the GPU. Can you run with a larger batch size?

@xsmiledur
Author

The batch sizes I've tried are 256 and 128. Of course I checked #4602, but nothing helped.
If the cause is on NVIDIA's side, what should I do?

@xsmiledur
Author

I also tried batch sizes of 1024 and 2048 and got the same problem.

@duygusar

Oh okay, sorry, I missed that you had already tried the things in the references. What is your data like? How many samples do you have? Do you shuffle them yourself or let Caffe shuffle them? That said, it looks like an environment-related problem, so I doubt it has to do with the data.

@xsmiledur
Author

I have about 16,000 samples and am trying to classify images into 4 classes. Of course I shuffled them in the label text file, and I also built the LevelDB with the shuffle option (-shuffle 1).

@cyrilhsu

This is probably caused by the bug in the GPU implementation of the accuracy layer.
Have you checked #5981 yet?

@xsmiledur
Author

xsmiledur commented Dec 28, 2017

@cyrilhsu Thank you for your advice, I was able to fix it!
I'm really surprised that computing the training accuracy can cause such a serious bug in GPU training...


duygusar commented Jan 11, 2018

@xsmiledur Did you follow #5987 to solve this problem? I now have the same problem. My program was working well on my dataset, but then I augmented the data and ended up with a larger dataset. Although everything else is unchanged, the loss jumps to 87.3365 and stays there (even with the fix). Did you try anything else?

@xsmiledur
Author

@duygusar No, I only changed the accuracy layer so that it produces output in the test phase only. Sorry.


duygusar commented Jan 11, 2018

@xsmiledur Alright, thank you. I actually use the Accuracy layer only in the test phase as well. More specifically, during training I evaluate every 1000 iterations (the layer's phase is set to TEST only). Do you mean that I should remove it altogether? I will try that and watch the loss. Thanks for the tip, if that is what you meant.

@duygusar

@xsmiledur OK, it seems my base_lr was too high (though it used to work fine with the smaller dataset).
