NaN issue with CIFAR10 example when running on CPU only #393
Interesting -- how many iterations does it take to see the NaN? I swapped out the original ACROSS_CHANNELS pooling for WITHIN_CHANNEL pooling*, and I never ran it on CPU because it is painfully slow, so you're likely right about the bug being there. *The two pooling region types get pretty much equal accuracy, but I wanted to use the exact architecture Alex Krizhevsky used in his cuda-convnet cifar-18pct example. But if WITHIN_CHANNEL isn't working, you could try swapping ACROSS_CHANNELS back into the train/val prototxts, if that satisfies whatever you're doing.
I reproduced this; changing the pooling to ACROSS_CHANNELS fixes it.
Thanks for reporting; for now I've added a note to the bottom of cifar10_full_solver.prototxt (pushed to latest dev) saying that there seems to be a bug with CPU training, and suggesting changing the pooling to ACROSS_CHANNELS for CPU training.
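For anyone hitting this before a proper fix lands, the workaround is an edit to the LRN layers in the CIFAR-10 full train/val net prototxt. Below is a minimal sketch of one such layer in the layer-definition syntax of the time; the layer name, bottom/top names, and lrn_param values are assumed from the standard cifar10_full model and may differ slightly from what actually ships.

```
layers {
  name: "norm1"
  type: LRN
  bottom: "pool1"
  top: "norm1"
  lrn_param {
    local_size: 3
    alpha: 5e-05
    beta: 0.75
    # The example ships with WITHIN_CHANNEL; switching to ACROSS_CHANNELS
    # is the suggested workaround for the NaN seen during CPU training.
    norm_region: ACROSS_CHANNELS
  }
}
```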
Great, thanks for confirming. Any ideas about why it might fail on the CPU when using WITHIN_CHANNEL?
@jeffdonahue was this fixed by the numerical stability update you made related to this at some point? #1095 seems to say there is still an issue lurking somewhere.
@shelhamer it should be fixed, yeah; I just tested train_full for 1000 iterations on GPU and 200 iterations on CPU and didn't get a NaN. The fix isn't in master though -- not sure which branch @YutingZhang is using. (I did find out that the cifar scripts hadn't been updated to be run from the Caffe root though, which I just pushed a fix to dev for -- probably would be good to put that in the RC.)
@jeffdonahue thanks for the script fix -- if you uncover other issues in the RC you should push / PR them to master and not dev, so that they can be included now and then back-merged to dev. Remember the new PR workflow with fixes to master. Closing, since this should be fixed in the latest release (and has been in dev for some time now).
Sorry, I forgot that was the new workflow. Is it worth me cherry-picking to master now (causing a duplicate commit in dev whenever the next back-merge happens)? It was actually a little worse than the scripts simply not having been updated to run from root -- they'd been partially updated (I think just the net proto path in the solver proto, but not any of the paths in the scripts), such that they worked neither from the CIFAR directory nor from root.
A cherry-pick and resolving the next back-merge should be fine.
Ok, cherry-picked to master. I definitely don't think this case needs an immediate back-merge since the diff would literally be 0 bytes; in general I think we could just do it whenever the change in question seems substantial enough to back-merge (e.g. always for a bug fix, but never for docs changes), and then any minor ones in the interim get batched together with the substantial one.
Deal. Reasonable and low-effort policy.
Hi everyone,
I'm trying to run the CIFAR10 example (train_full.sh), which seems to work on the GPU (at least there I don't get the same issue; I'm not sure yet what value it converges to). In any case, the problem I'm having is that after the first forward-backward pass, somewhere in the pipeline a NaN is introduced. This is not an issue of the learning rate being too big, as I can even set the lr to 0 and still have the same problem.
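For reference, a minimal sketch of that sanity check in the solver prototxt; the net path and the non-zero values below are assumptions based on the standard CIFAR-10 full solver, not copied from this report:

```
# cifar10_full_solver.prototxt (remaining fields left as shipped)
net: "examples/cifar10/cifar10_full_train_test.prototxt"
base_lr: 0          # with a zero learning rate, the NaN cannot come from diverging weights
momentum: 0.9
weight_decay: 0.004
solver_mode: CPU    # the NaN is only observed when training on CPU
```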
I tried to figure out what the problem was, but without any success yet.
First of all, I wanted to ask if anyone else is having the same issue. It would be great if someone could confirm that something is going wrong here.
What I've found out so far is that this might be related to the LRN layer, because after removing it the issue disappears. However, it might also be due to the interaction between some of the other layers; I'm not sure. If anyone has some insight into this, it would be greatly appreciated.
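To make the "after removing it" experiment concrete: deleting an LRN layer from the net prototxt also means re-pointing the next layer's bottom at the pooling output. A sketch under the assumption that the net follows the standard cifar10_full layout (layer names and convolution parameters shown are illustrative):

```
# With norm1 deleted, conv2 reads pool1 directly.
layers {
  name: "conv2"
  type: CONVOLUTION
  bottom: "pool1"   # was "norm1" before the LRN layer was removed
  top: "conv2"
  convolution_param {
    num_output: 32
    pad: 2
    kernel_size: 5
    stride: 1
  }
}
```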