nan issue with CIFAR10 example when running on CPU only #393

Closed
tdomhan opened this issue May 6, 2014 · 10 comments

@tdomhan
Contributor

tdomhan commented May 6, 2014

Hi everyone,
I'm trying to run the CIFAR10 example (train_full.sh), which seems to work on the GPU (at least there I don't see the same issue; I'm not sure yet what value it converges to). The problem I'm having is that after the first forward-backward pass, a nan is introduced somewhere in the pipeline. This is not an issue of the learning rate being too big, as I can even set the lr to 0 and still have the same problem.
I tried to figure out what the problem was, but without any success yet.

First of all, I wanted to ask if anyone else is having the same issue. It would be great if someone could confirm that something is going wrong here.

What I've found so far is that this might be related to the LRN layer, because after removing it the issue disappears. However, it might also be due to interactions between some of the other layers; I'm not sure. If anyone has insight into this, it would be greatly appreciated.

@shelhamer changed the title from "nan issue when running on CPU only" to "nan issue with CIFAR10 example when running on CPU only" on May 6, 2014
@jeffdonahue
Contributor

Interesting -- how many iterations does it take to see the nan? I swapped out the original ACROSS_CHANNELS pooling for WITHIN_CHANNEL pooling,* and I never ran it on the CPU because it is painfully slow, so you're likely right that the bug is there.

*The two pooling region types get pretty much equal accuracy, but I wanted to use the exact architecture Alex Krizhevsky used in his cuda-convnet cifar-18pct example. If WITHIN_CHANNEL isn't working, you could try swapping ACROSS_CHANNELS back into the train/val prototxts, if that works for what you're doing.
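
For reference, the swap would look roughly like the sketch below in the train/val prototxts. This is only a sketch: the layer name and the local_size/alpha/beta values are placeholders rather than the exact ones shipped with the CIFAR10 example, and the prototxt layer syntax differs across Caffe versions.

```
# Hypothetical LRN layer snippet; names and parameter values are illustrative.
layer {
  name: "norm1"
  type: "LRN"
  bottom: "pool1"
  top: "norm1"
  lrn_param {
    local_size: 3
    alpha: 5e-05
    beta: 0.75
    # norm_region: WITHIN_CHANNEL   # mode that triggers the CPU nan reported here
    norm_region: ACROSS_CHANNELS    # suggested workaround for CPU training
  }
}
```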

@jeffdonahue
Contributor

I reproduced this; changing to device_id: 0 and adding random_seed: 1701 to the bottom of cifar10_full_solver.prototxt gives this result:

I0506 11:27:40.389626  6796 data_layer.cpp:117] Restarting data prefetching from start.
I0506 11:27:41.909649  6533 solver.cpp:160] Test score #0: 0.1163
I0506 11:27:41.909716  6533 solver.cpp:160] Test score #1: 2.3026
I0506 11:27:42.676936  6533 softmax_loss_layer.cpp:58] Accuracy: 0.12
I0506 11:31:17.406493  6533 solver.cpp:261] Iteration 1, lr = 0.001
I0506 11:31:17.422197  6533 solver.cpp:105] Iteration 1, loss = 2.30258
I0506 11:35:01.583555  6533 softmax_loss_layer.cpp:58] Accuracy: 0.08
I0506 11:42:08.900871  6533 solver.cpp:261] Iteration 2, lr = 0.001
I0506 11:42:08.955682  6533 solver.cpp:105] Iteration 2, loss = nan

Thanks for reporting; for now I've added a note to the bottom of cifar10_full_solver.prototxt (pushed to latest dev) saying that there seems to be a bug with CPU training, with a suggestion to change the pooling to ACROSS_CHANNELS when training on the CPU.
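
For anyone reproducing this, the solver changes above amount to roughly the following appended to cifar10_full_solver.prototxt. Sketch only: random_seed and device_id come from the comment above, while solver_mode: CPU is an assumed addition to force CPU-only training and isn't part of the change described here.

```
# Reproduction sketch for cifar10_full_solver.prototxt.
solver_mode: CPU     # assumed, to force CPU-only training
device_id: 0
random_seed: 1701
```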

@tdomhan
Contributor Author

tdomhan commented May 7, 2014

Great, thanks for confirming. Any ideas why it might fail on the CPU when using WITHIN_CHANNEL?
What's weird is that, as far as I can tell, the GPU code for WITHIN_CHANNEL uses the CPU code (WithinChannelForward is called within Forward_gpu), so it doesn't make much sense to me that it works on one architecture but not the other.
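
To illustrate the point, the dispatch I'm describing is shaped roughly like the sketch below: the GPU forward pass uses a dedicated kernel only for ACROSS_CHANNELS and falls back to the shared WithinChannelForward helper for WITHIN_CHANNEL. This is a paraphrase of the structure, not the actual Caffe source; the signature and helper names are approximate.

```cpp
// Illustrative paraphrase of the LRN layer's GPU dispatch (not the actual
// Caffe source; signatures and names are approximate).
template <typename Dtype>
void LRNLayer<Dtype>::Forward_gpu(const vector<Blob<Dtype>*>& bottom,
                                  const vector<Blob<Dtype>*>& top) {
  switch (this->layer_param_.lrn_param().norm_region()) {
  case LRNParameter_NormRegion_ACROSS_CHANNELS:
    CrossChannelForward_gpu(bottom, top);  // dedicated CUDA path
    break;
  case LRNParameter_NormRegion_WITHIN_CHANNEL:
    // Falls back to the same helper the CPU path uses, which is why a
    // CPU-only nan is surprising.
    WithinChannelForward(bottom, top);
    break;
  default:
    LOG(FATAL) << "Unknown normalization region.";
  }
}
```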

@shelhamer
Member

@jeffdonahue was this fixed by the numerical stability update you made related to this at some point? #1095 seems to say there is still an issue lurking somewhere.

@jeffdonahue
Contributor

@shelhamer it should be fixed, yeah; I just tested train_full for 1000 iterations on GPU and 200 iterations on CPU and didn't get a NaN. The fix isn't in master though -- I'm not sure which branch @YutingZhang is using.

(I did find out that the CIFAR scripts hadn't been updated to be run from the Caffe root, though, and just pushed a fix for that to dev -- it would probably be good to put that in the RC.)
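
For anyone hitting path errors in the meantime, after the fix the scripts are meant to be launched from the repository root, i.e. roughly the following. The invocation is an assumed sketch, not copied from the fixed scripts, and $CAFFE_ROOT is just a stand-in for wherever Caffe is checked out.

```sh
# Assumed invocation after the path fix: run from the repository root.
cd $CAFFE_ROOT
./examples/cifar10/train_full.sh
```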

@shelhamer
Member

@jeffdonahue thanks for the script fix -- if you uncover other issues in the RC, you should push / PR them to master and not dev, so that they can be included now and then back-merged to dev. Remember the new PR workflow with fixes going to master.

Closing, since this should be fixed in the latest release (and has been in dev for some time now).

@jeffdonahue
Contributor

Sorry, I forgot that was the new workflow. Is it worth me cherry-picking to master now (causing a duplicate commit in dev whenever the next back-merge happens)? It was actually a little worse than the scripts simply not having been updated to run from root -- they'd been partially updated (I think just the net proto path in the solver proto, but not any of the paths in the scripts), so they worked from neither the CIFAR directory nor from root.

@shelhamer
Member

A cherry-pick and resolving the next back-merge should be fine. I'm currently of two minds about back-merges, so do the back-merge whenever you feel like it, whether after each individual change or after letting a few fixes batch up in master first. Batching seems less noisy to me, but it does leave dev unchanged in the interim (in the usual case where the PR / push was to master).

@jeffdonahue
Contributor

OK, cherry-picked to master. I definitely don't think this case needs an immediate back-merge, since the diff would literally be 0 bytes. In general I think we could just do it whenever the change in question seems substantial enough to back-merge (e.g. always on a bug fix, but never on docs changes), and any minor ones in the interim get batched together with the substantial one.
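
For anyone following along, the workflow being discussed looks roughly like this (branch names as in this thread; the commit hash is a placeholder):

```sh
# Apply the fix commit from dev onto master.
git checkout master
git cherry-pick <sha-of-fix-from-dev>
git push origin master

# Later, once enough fixes have accumulated on master:
git checkout dev
git merge master        # back-merge; the cherry-picked duplicate reconciles here
git push origin dev
```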

@shelhamer
Member

Deal. Reasonable and low-effort policy.
