TestPowerGradientShiftZero, TestPowerGradient fail with certain boost #1252

Closed
shelhamer opened this Issue Oct 10, 2014 · 14 comments


@shelhamer
Member

The PowerLayer::Backward checks seem to fail with certain versions of boost on OS X / ubuntu.

boost 1.55 passes, but boost 1.56 and 1.57 fail.

[ RUN      ] PowerLayerTest/0.TestPowerGradientShiftZero
./include/caffe/test/test_gradient_check_util.hpp:166: Failure
The difference between computed_gradient and estimated_gradient is 0.16171693801879883, which exceeds threshold_ * scale, where
computed_gradient evaluates to 6.6543664932250977,
estimated_gradient evaluates to 6.8160834312438965, and
threshold_ * scale evaluates to 0.068160831928253174.
debug: (top_id, top_data_id, blob_id, feat_id)=0,65,0,65; feat = 0.027440188452601433; objective+ = 0.55363553762435913; objective- = 0.41731387376785278
[  FAILED  ] PowerLayerTest/0.TestPowerGradientShiftZero, where TypeParam = caffe::FloatCPU (3 ms)

#######################################################################################################################
[ RUN      ] PowerLayerTest/1.TestPowerGradientShiftZero
./include/caffe/test/test_gradient_check_util.hpp:166: Failure
The difference between computed_gradient and estimated_gradient is 0.66645549482483268, which exceeds threshold_ * scale, where
computed_gradient evaluates to 9.0545713301684909,
estimated_gradient evaluates to 9.7210268249933236, and
threshold_ * scale evaluates to 0.097210268249933243.
debug: (top_id, top_data_id, blob_id, feat_id)=0,66,0,66; feat = 0.016829367669263143; objective+ = 0.48941214282974049; objective- = 0.29499160632987403
./include/caffe/test/test_gradient_check_util.hpp:166: Failure
The difference between computed_gradient and estimated_gradient is 0.48462038369962279, which exceeds threshold_ * scale, where
computed_gradient evaluates to 8.4754232528829139,
estimated_gradient evaluates to 8.9600436365825367, and
threshold_ * scale evaluates to 0.089600436365825362.
debug: (top_id, top_data_id, blob_id, feat_id)=0,71,0,71; feat = 0.01869104873835549; objective+ = 0.50171265479916738; objective- = 0.32251178206751663
./include/caffe/test/test_gradient_check_util.hpp:166: Failure
The difference between computed_gradient and estimated_gradient is 0.24489777273781588, which exceeds threshold_ * scale, where
computed_gradient evaluates to 7.3061654292184715,
estimated_gradient evaluates to 7.5510632019562873, and
threshold_ * scale evaluates to 0.075510632019562873.
debug: (top_id, top_data_id, blob_id, feat_id)=0,99,0,99; feat = 0.023657563288965969; objective+ = 0.53224239845290788; objective- = 0.38122113441378214
[  FAILED  ] PowerLayerTest/1.TestPowerGradientShiftZero, where TypeParam = caffe::DoubleCPU (4 ms)

#######################################################################################################################
[ RUN      ] PowerLayerTest/1.TestPowerGradient
./include/caffe/test/test_gradient_check_util.hpp:166: Failure
The difference between computed_gradient and estimated_gradient is 1.206900511134485, which exceeds threshold_ * scale, where
computed_gradient evaluates to 10.15816285551514,
estimated_gradient evaluates to 11.365063366649625, and
threshold_ * scale evaluates to 0.11365063366649626.
debug: (top_id, top_data_id, blob_id, feat_id)=0,57,0,57; feat = 2.9055876775560447; objective+ = 0.46979585546340097; objective- = 0.24249458813040844
[  FAILED  ] PowerLayerTest/1.TestPowerGradient, where TypeParam = caffe::DoubleCPU (3 ms)
@mprat
Contributor
mprat commented Nov 13, 2014

This failed for me with the native BLAS provided by OSX 10.9, so I tried with openBLAS and it gave me the same 3 errors. Does anyone have any suggestions for getting openBLAS to work?

@II-Matto

I built caffe with MKL and also encountered such failures. To be specific, there were six failed tests. I am using the newest Boost release, 1.57.0, and Anaconda Python 2.7.

[----------] Global test environment tear-down
[==========] 838 tests from 169 test cases ran. (1664414 ms total)
[ PASSED ] 832 tests.
[ FAILED ] 6 tests, listed below:
[ FAILED ] PowerLayerTest/0.TestPowerGradientShiftZero, where TypeParam = caffe::FloatCPU
[ FAILED ] PowerLayerTest/1.TestPowerGradientShiftZero, where TypeParam = caffe::DoubleCPU
[ FAILED ] PowerLayerTest/1.TestPowerGradient, where TypeParam = caffe::DoubleCPU
[ FAILED ] PowerLayerTest/2.TestPowerGradientShiftZero, where TypeParam = caffe::FloatGPU
[ FAILED ] PowerLayerTest/3.TestPowerGradientShiftZero, where TypeParam = caffe::DoubleGPU
[ FAILED ] PowerLayerTest/3.TestPowerGradient, where TypeParam = caffe::DoubleGPU

Do the failures indicate that caffe will not work appropriately? How should I deal with them?

@mprat
Contributor
mprat commented Nov 14, 2014

I also just tried compiling with MKL for all my libraries and I am still getting the errors in TestPowerGradient. @II-Matto, you have 6 errors and I have 3 because you are using GPU compilation and I am not.

@mprat
Contributor
mprat commented Nov 14, 2014

I got it to work with MKL and Boost 1.55.

@geekan
geekan commented Nov 16, 2014

I faced the same problem and solved it.
I uninstalled Boost 1.56, installed Boost 1.55, and then reinstalled caffe; all tests passed! (with openblas)

@relh
relh commented Jan 12, 2015

Still having the same errors with Boost 1.57; downgrading to 1.55 solved the problem.

@lou-k
lou-k commented Jan 14, 2015

I think the BLAS issue here is a red herring; the tests passed for me with Atlas and Boost 1.55.

Boost 1.56 failed with both OpenBLAS and Atlas.

@relh
relh commented Jan 14, 2015

Agreed, a boost problem then

@svanschalkwyk

I'm getting it with boost1.54.0. Ubuntu 14.04, boost1.54.0, mkl from intel version 15 c++.
Any other ideas?

@shelhamer shelhamer changed the title from TestPowerGradientShiftZero, TestPowerGradient fail with vecLib on OS X to TestPowerGradientShiftZero, TestPowerGradient fail with vecLib with certain boost Jan 20, 2015
@shelhamer shelhamer changed the title from TestPowerGradientShiftZero, TestPowerGradient fail with vecLib with certain boost to TestPowerGradientShiftZero, TestPowerGradient fail with certain boost Jan 20, 2015
@lazywei
lazywei commented Jan 31, 2015

Confirm the same problem here. CentOS6, boost 1.57, mkl-203

[----------] Global test environment tear-down
[==========] 838 tests from 169 test cases ran. (98290 ms total)
[  PASSED  ] 832 tests.
[  FAILED  ] 6 tests, listed below:
[  FAILED  ] PowerLayerTest/0.TestPowerGradientShiftZero, where TypeParam = caffe::FloatCPU
[  FAILED  ] PowerLayerTest/1.TestPowerGradient, where TypeParam = caffe::DoubleCPU
[  FAILED  ] PowerLayerTest/1.TestPowerGradientShiftZero, where TypeParam = caffe::DoubleCPU
[  FAILED  ] PowerLayerTest/2.TestPowerGradientShiftZero, where TypeParam = caffe::FloatGPU
[  FAILED  ] PowerLayerTest/3.TestPowerGradient, where TypeParam = caffe::DoubleGPU
[  FAILED  ] PowerLayerTest/3.TestPowerGradientShiftZero, where TypeParam = caffe::DoubleGPU
@dgolden1
Contributor
dgolden1 commented Feb 3, 2015

@shelhamer, as users, should we be concerned about the test failures with Boost 1.57? Will Caffe give erroneous results? Or can we ignore the failures for now?

@blackyang

Same problem (6 failed tests) at first, on OS X 10.9.5 with ATLAS, Boost 1.57, and Anaconda Python 2.7. After downgrading Boost to 1.55 (everything else unchanged) and reinstalling caffe, it works now.

@shelhamer
Member

While I can't dismiss these numerical errors, the consolation is that they are isolated to PowerLayer and quite rare, with only 1-3 out of 120 elements out of tolerance. So only models that (1) define a POWER layer or (2) make the rare choice of the WITHIN_CHANNEL mode of the LRN layer are at risk -- and even these might be ok. That said, these errors are worth resolving.

@shelhamer shelhamer added a commit to shelhamer/caffe that referenced this issue Feb 6, 2015
@shelhamer shelhamer reduce step size in PowerLayer gradient checks: fix #1252
The gradient checker fails on certain elements of the PowerLayer checks,
but only 1-3 sometimes fail out of the 120 elements tested. This is not
due to any numerical issue in the PowerLayer, but the distribution of
the random inputs for the checks.

boost 1.56 switched the normal distribution RNG engine from Box-Muller
to Ziggurat.
8ec095a
@shelhamer
Member

I looked into this a little, and @jeffdonahue was quick to note that boost RNG is used by all the fillers regardless of mode. I also found this boost thread on RNG, which notes that the normal distribution RNG was rewritten for the 1.56 release. A little good old-fashioned hand calculation confirmed this is nothing more than a precision error, so #1840 fixes this by reducing the step size for the finite-differencing.

There's no need to keep to boost 1.55.

(For those who like RNG, the switch was from Box-Muller to Ziggurat.)
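
For anyone who wants to see the precision effect concretely, here is a minimal Python sketch of a central-difference gradient check. It is not the Caffe test. The objective form `2 * (scale * x) ** power` and the values `POWER = 0.37` and `SCALE = 0.83` are assumptions, chosen because they approximately reproduce the numbers in the first failure log above. The 1% tolerance and the step of 1e-2 are inferred from that log (`threshold_ * scale` and `objective+ - objective-`), and the smaller step of 1e-3 is a hypothetical value used only for comparison.

```python
# Sketch of a central-difference gradient check for a power function.
# POWER, SCALE, WEIGHT and the objective form are assumptions picked to
# approximately match the logged values; they are not quoted in this thread.
POWER = 0.37
SCALE = 0.83
WEIGHT = 2.0
THRESHOLD = 0.01  # relative tolerance, inferred from "threshold_ * scale"


def objective(x):
    """Assumed objective: WEIGHT * (SCALE * x) ** POWER, with shift = 0."""
    return WEIGHT * (SCALE * x) ** POWER


def analytic_gradient(x):
    """Exact derivative of the assumed objective with respect to x."""
    return WEIGHT * POWER * SCALE * (SCALE * x) ** (POWER - 1.0)


def estimated_gradient(x, step):
    """Central finite-difference estimate of the derivative."""
    return (objective(x + step) - objective(x - step)) / (2.0 * step)


def check(x, step):
    """Return (difference, tolerance, passed) for one input and step size."""
    computed = analytic_gradient(x)
    estimated = estimated_gradient(x, step)
    # Assumed scaling rule: tolerance grows with the gradient magnitude.
    scale = max(abs(computed), abs(estimated), 1.0)
    diff = abs(computed - estimated)
    return diff, THRESHOLD * scale, diff <= THRESHOLD * scale


if __name__ == "__main__":
    feat = 0.027440188452601433  # "feat" from the first failure in the log
    for step in (1e-2, 1e-3):
        diff, tol, passed = check(feat, step)
        print(f"step={step:g} diff={diff:.6f} tol={tol:.6f} passed={passed}")
```

Under these assumptions the input is only a few step widths away from zero, where the curvature of the power function is large, so the finite-difference estimate is off by more than the tolerance even though the analytic gradient is exact. Shrinking the step brings the estimate back within tolerance.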

@shelhamer shelhamer closed this Feb 6, 2015
@pannous pannous pushed a commit to pannous/caffe that referenced this issue Feb 6, 2015
@shelhamer shelhamer + Pannous reduce step size in PowerLayer gradient checks: fix #1252
0fb3f2a
@slayton58 slayton58 pushed a commit to slayton58/caffe that referenced this issue Mar 4, 2015
@shelhamer @NV-slayton shelhamer + NV-slayton reduce step size in PowerLayer gradient checks: fix #1252
636e8aa