NaN issue with CIFAR10 example when running on CPU only #393
Interesting -- how many iterations does it take to see the NaN? I swapped out the original ACROSS_CHANNELS pooling for WITHIN_CHANNEL pooling*, and I never ran it on CPU because it is painfully slow, so you're likely right about the bug being there. *The two pooling region types get pretty much equal accuracy, but I wanted to use the exact architecture Alex Krizhevsky used in his cuda-convnet cifar-18pct example. But if WITHIN_CHANNEL isn't working, you could try swapping ACROSS_CHANNELS back into the train/val prototxts, if that satisfies whatever you're doing.
I reproduced this; changing the pooling to ACROSS_CHANNELS fixes it.
Thanks for reporting; for now I've added a note to the bottom of cifar10_full_solver.prototxt (pushed to latest dev) saying that there seems to be a bug with CPU training, and suggesting changing the pooling to ACROSS_CHANNELS for CPU training.
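For anyone hitting this before a proper fix lands, the workaround is an edit to the LRN layers in the CIFAR-10 full train/val net prototxt. Below is a minimal sketch of one such layer in the layer-definition syntax of the time; the layer name, bottom/top names, and lrn_param values are assumed from the standard cifar10_full model and may differ slightly from what actually ships.

```
layers {
  name: "norm1"
  type: LRN
  bottom: "pool1"
  top: "norm1"
  lrn_param {
    local_size: 3
    alpha: 5e-05
    beta: 0.75
    # The example ships with WITHIN_CHANNEL; switching to ACROSS_CHANNELS
    # is the suggested workaround for the NaN seen during CPU training.
    norm_region: ACROSS_CHANNELS
  }
}
```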
Great, thanks for confirming. Any ideas about why it might fail on the CPU when using WITHIN_CHANNEL?
@jeffdonahue was this fixed by the numerical stability update you made related to this at some point? #1095 seems to say there is still an issue lurking somewhere.
@shelhamer it should be fixed, yeah; I just tested train_full for 1000 iterations on GPU and 200 iterations on CPU and didn't get a NaN. The fix isn't in master though -- not sure which branch @YutingZhang is using. (I did find out that the cifar scripts hadn't been updated to be run from the Caffe root though, which I just pushed a fix to dev for -- probably would be good to put that in the RC.)
@jeffdonahue thanks for the script fix -- if you uncover other issues in the RC you should push / PR them to master and not dev, so that they can be included now and then back-merged to dev. Remember the new PR workflow with fixes to master. Closing, since this should be fixed in the latest release (and has been in dev for some time now).
Sorry, I forgot that was the new workflow. Is it worth me cherry-picking to master now (causing a duplicate commit in dev whenever the next back-merge happens)? It was actually a little worse than the scripts simply not having been updated to run from root -- they'd been partially updated (I think just the net proto path in the solver proto, but not any of the paths in the scripts), such that they worked neither from the CIFAR directory nor from root.
A cherry-pick and resolving the next back-merge should be fine.
Ok, cherry-picked to master. I definitely don't think this case needs an immediate back-merge since the diff would literally be 0 bytes; in general I think we could just do it whenever the change in question seems substantial enough to back-merge (e.g. always for a bug fix, but never for docs changes), and then any minor ones in the interim get batched together with the substantial one.
Deal. Reasonable and low-effort policy.
Hi everyone,
I'm trying to run the CIFAR10 example (train_full.sh), which seems to work on the GPU (at least there I don't get the same issue; I'm not sure yet what value it converges to). In any case, the problem I'm having is that after the first forward-backward pass, somewhere in the pipeline a NaN is introduced. This is not an issue of the learning rate being too big, as I can even set the lr to 0 and still have the same problem.
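For reference, a minimal sketch of that sanity check in the solver prototxt; the net path and the non-zero values below are assumptions based on the standard CIFAR-10 full solver, not copied from this report:

```
# cifar10_full_solver.prototxt (remaining fields left as shipped)
net: "examples/cifar10/cifar10_full_train_test.prototxt"
base_lr: 0          # with a zero learning rate, the NaN cannot come from diverging weights
momentum: 0.9
weight_decay: 0.004
solver_mode: CPU    # the NaN is only observed when training on CPU
```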
I tried to figure out what the problem was, but without any success yet.
First of all, I wanted to ask if anyone else is having the same issue. It would be great if someone could confirm that something is going wrong here.
What I've found out so far is that this might be related to the LRN layer, because after removing it the issue disappears. However, it might also be due to the interaction between some of the other layers; I'm not sure. If anyone has some insight into this, it would be greatly appreciated.
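To make the "after removing it" experiment concrete: deleting an LRN layer from the net prototxt also means re-pointing the next layer's bottom at the pooling output. A sketch under the assumption that the net follows the standard cifar10_full layout (layer names and convolution parameters shown are illustrative):

```
# With norm1 deleted, conv2 reads pool1 directly.
layers {
  name: "conv2"
  type: CONVOLUTION
  bottom: "pool1"   # was "norm1" before the LRN layer was removed
  top: "conv2"
  convolution_param {
    num_output: 32
    pad: 2
    kernel_size: 5
    stride: 1
  }
}
```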