too high loss #416

Closed
dccho opened this issue Nov 15, 2015 · 5 comments

Comments
dccho commented Nov 15, 2015

[screenshot: DIGITS loss graphs]

I'm using DIGITS 3.0.0-rc1 + NVIDIA/caffe v0.13.2.
The loss2 value for the training set (green graph) suddenly jumped very high. Here is the log from when it happened.

I1115 20:08:15.366227  6494 solver.cpp:217] Iteration 75776, loss = 10.4095
I1115 20:08:15.366282  6494 solver.cpp:234]     Train net output #0: loss = 5.77672 (* 1 = 5.77672 loss)
I1115 20:08:15.366291  6494 solver.cpp:234]     Train net output #1: loss1/loss = 3.67123 (* 0.2 = 0.734247 loss)
I1115 20:08:15.366297  6494 solver.cpp:234]     Train net output #2: loss2/loss = 3.37401 (* 0.3 = 1.0122 loss)
I1115 20:08:15.366300  6494 solver.cpp:234]     Train net output #3: loss3/loss = 5.77274 (* 0.5 = 2.88637 loss)
I1115 20:08:15.817288  6494 solver.cpp:511] Iteration 75776 (0.393751/s), lr = 0.05
I1115 20:09:36.631261  6494 solver.cpp:217] Iteration 75808, loss = 52.4953
I1115 20:09:36.631345  6494 solver.cpp:234]     Train net output #0: loss = 5.88328 (* 1 = 5.88328 loss)
I1115 20:09:36.631353  6494 solver.cpp:234]     Train net output #1: loss1/loss = 87.3365 (* 0.2 = 17.4673 loss)
I1115 20:09:36.631358  6494 solver.cpp:234]     Train net output #2: loss2/loss = 87.3365 (* 0.3 = 26.201 loss)
I1115 20:09:36.631363  6494 solver.cpp:234]     Train net output #3: loss3/loss = 5.88758 (* 0.5 = 2.94379 loss)
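For reference, the overall loss Caffe prints each iteration is just the weighted sum of the individual loss-layer outputs, so the jump from 10.41 to 52.50 comes entirely from the two auxiliary classifiers (loss1, loss2) diverging to ~87 while loss3 stays near 5.9. A minimal sketch reproducing the arithmetic, with the values copied verbatim from the log above:

```python
# Reproduce the solver's reported loss from the per-layer outputs in the log above.
# Caffe prints each raw loss and its loss_weight; the iteration loss is their weighted sum.

def total_loss(outputs):
    """Sum of raw_loss * loss_weight over all loss layers."""
    return sum(raw * weight for raw, weight in outputs)

# (raw loss, loss_weight) pairs copied from the log.
it_75776 = [(5.77672, 1.0), (3.67123, 0.2), (3.37401, 0.3), (5.77274, 0.5)]  # normal iteration
it_75808 = [(5.88328, 1.0), (87.3365, 0.2), (87.3365, 0.3), (5.88758, 0.5)]  # loss1/loss2 diverged

print(round(total_loss(it_75776), 4))  # 10.4095 -- matches "loss = 10.4095"
print(round(total_loss(it_75808), 4))  # 52.4953 -- matches "loss = 52.4953"
```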
lukeyeager added the bug label Nov 17, 2015

lukeyeager (Member) commented

Thanks for the bug report @dccho. Unfortunately, I don't know what would cause this.

Has this happened to you more than once? Is the problem reproducible if you try it again? If you can provide some extra information, the engineers working on NVIDIA/caffe should be able to help:

  • GPU type[s]
  • Driver version
  • CUDA version
  • cuDNN version
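A rough sketch (not DIGITS functionality, just a convenience) for gathering the details listed above on a Linux box; nvidia-smi and nvcc are the standard NVIDIA tools, and the cudnn.h location is an assumption that may differ per install:

```python
# Collect GPU, driver, CUDA, and cuDNN details by shelling out to standard tools.
import subprocess

def run(cmd):
    """Run a shell command and return its trimmed output, or a note on failure."""
    try:
        return subprocess.check_output(cmd, shell=True).decode().strip()
    except Exception as exc:
        return "unavailable ({})".format(exc)

print("GPU / driver :", run("nvidia-smi --query-gpu=name,driver_version --format=csv,noheader"))
print("CUDA         :", run("nvcc --version | tail -n 1"))
# cuDNN version macros live in cudnn.h; /usr/local/cuda/include is an assumed location.
print("cuDNN        :", run("grep -A 2 CUDNN_MAJOR /usr/local/cuda/include/cudnn.h | head -n 3"))
```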

dccho (Author) commented Dec 11, 2015

[screenshot from 2015-12-11: DIGITS loss graphs]

It happens frequently now. I just ran GoogLeNet from the standard networks without any changes.

GPU : 4 x Titan X
Driver : 352.63
CUDA : 7.5
cuDNN : v3

dccho (Author) commented Dec 11, 2015

It is probably a problem with the combination of several unstable libraries. I have now switched to DIGITS 3.0-rc3 with cuDNN v4 and NVCaffe-0.14, and it works fine.

lukeyeager (Member) commented

Glad you got your problem worked out - sorry I wasn't more help debugging! I'll close this now, but anyone who can add to the discussion should still feel free to do so.

gheinrich (Contributor) commented

It would be interesting to know whether the high-loss issue with NVCaffe-0.13 also occurred during single-GPU training.
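One way to check this, sketched below rather than taken from DIGITS itself, is to relaunch the same solver with the stand-alone caffe binary pinned to a single device; solver.prototxt is a placeholder for the solver file that DIGITS generates for the job:

```python
# Hypothetical helper: rerun the same solver on a single GPU (device 0) with the
# caffe CLI to see whether the loss spike still occurs outside multi-GPU training.
import subprocess

subprocess.check_call([
    "caffe", "train",
    "--solver=solver.prototxt",  # placeholder path to the DIGITS-generated solver
    "--gpu=0",                   # restrict training to one GPU
])
```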
