Replace atlas/cblas routines with Eigen in the math functions #85

Closed

wants to merge 19 commits

Conversation

@kloudkl (Contributor) commented Feb 8, 2014

This pull request meets the requirements of issue #84. A layer-wise runtime analysis of imagenet.prototxt using @sguada's detailed net_speed_benchmark from #83 was a little disappointing in terms of Eigen's performance.

Table 1. Training time of different BLAS libraries in CPU mode and in GPU mode. All times are in seconds. The batch size in CPU mode is 256. The GPU-mode data is included for illustration only, for issue #3.

| Compute Setting | Forward Pass Time | Backward Pass Time | Total Time |
| --- | --- | --- | --- |
| Eigen (4 threads) | 29.71 | 58.64 | 88.35 |
| OpenBLAS* | 20.51 | 34.32 | 54.83 |
| MKL* | 19.78 | 34.6 | 54.38 |
| GPU** | 4.93 | 4.87 | 9.8 |
  1. * Unlike Eigen, OpenBLAS and MKL are insensitive to setting the number of threads higher than the number of physical cores.
  2. ** The GPU-mode timings are not very accurate. Due to the memory limit of the available GPU, the batch size could only be 1, and the number of iterations was set so that the total number of training samples matches the CPU mode (batch 1 × 256 iterations vs. batch 256 × 1 iteration on the CPU). The extra communication cost this incurs offsets part of the GPU's computational advantage.

Even when running with the maximum number of physical cores (4 on my machine), Eigen is still much slower than OpenBLAS and MKL. Fully exploiting Eigen's performance requires expert knowledge of how it evaluates expressions. In contrast, OpenBLAS is very low-hanging fruit: install the multi-threaded package (#80, #81), link it with your application (#82), and everything works like a charm. MKL also does a good job, but at a high price.
I would very much appreciate any other independent benchmarks.
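
For context, the Eigen-backed math functions in this PR wrap raw blob pointers in Eigen Maps; below is a minimal sketch of that pattern (illustrative only, not the exact branch code; the unaligned Map choice follows the "change Eigen Map to unaligned for lrn_layer" commit, and the function name is made up):

```cpp
#include <Eigen/Dense>

// Illustrative sketch of an Eigen-backed math function in the style of this
// PR (not the exact branch code). Raw blob pointers are wrapped in unaligned
// Maps, since pointers offset into a blob (as in lrn_layer) need not satisfy
// Eigen's 16-byte alignment assumption.
void caffe_axpy_sketch(const int n, const float alpha,
                       const float* x, float* y) {
  typedef Eigen::Map<const Eigen::VectorXf, Eigen::Unaligned> ConstVec;
  typedef Eigen::Map<Eigen::VectorXf, Eigen::Unaligned> Vec;
  ConstVec xv(x, n);
  Vec yv(y, n);
  yv += alpha * xv;  // y <- alpha * x + y, evaluated by Eigen's expression templates
}
```

Getting top performance out of this style means knowing when such expressions stay lazy and when they materialize temporaries, which is the expert knowledge referred to above.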

Eigen (boost-eigen branch head + this PR)
OMP_NUM_THREADS=4 ../build/examples/net_speed_benchmark.bin local_imagenet.prototxt 1 CPU

*** Benchmark   begins  ***
data    forward:    0.02    seconds.
conv1   forward:    4.96    seconds.
relu1   forward:    0.22    seconds.
pool1   forward:    0.17    seconds.
norm1   forward:    1.21    seconds.
pad2    forward:    0.02    seconds.
conv2   forward:    8.05    seconds.
relu2   forward:    0.14    seconds.
pool2   forward:    0.11    seconds.
norm2   forward:    0.79    seconds.
pad3    forward:    0.01    seconds.
conv3   forward:    5.06    seconds.
relu3   forward:    0.06    seconds.
pad4    forward:    0.02    seconds.
conv4   forward:    4.57    seconds.
relu4   forward:    0.05    seconds.
pad5    forward:    0.02    seconds.
conv5   forward:    2.74    seconds.
relu5   forward:    0.03    seconds.
pool5   forward:    0.03    seconds.
fc6 forward:    0.85    seconds.
relu6   forward:    0.01    seconds.
drop6   forward:    0.01    seconds.
fc7 forward:    0.4 seconds.
relu7   forward:    0   seconds.
drop7   forward:    0.01    seconds.
fc8 forward:    0.14    seconds.
loss    forward:    0.01    seconds.
Forward pass:   29.71   seconds.
loss    backward:   0   seconds.
fc8 backward:   0.18    seconds.
drop7   backward:   0.01    seconds.
relu7   backward:   0   seconds.
fc7 backward:   0.75    seconds.
drop6   backward:   0.01    seconds.
relu6   backward:   0   seconds.
fc6 backward:   2.54    seconds.
pool5   backward:   0.07    seconds.
relu5   backward:   0.01    seconds.
conv5   backward:   7.05    seconds.
pad5    backward:   0.02    seconds.
relu4   backward:   0.02    seconds.
conv4   backward:   8.31    seconds.
pad4    backward:   0.02    seconds.
relu3   backward:   0.02    seconds.
conv3   backward:   10.41   seconds.
pad3    backward:   0.01    seconds.
norm2   backward:   0.7 seconds.
pool2   backward:   0.28    seconds.
relu2   backward:   0.05    seconds.
conv2   backward:   16.63   seconds.
pad2    backward:   0.02    seconds.
norm1   backward:   1.06    seconds.
pool1   backward:   0.42    seconds.
relu1   backward:   0.08    seconds.
conv1   backward:   9.97    seconds.
data    backward:   0   seconds.
Backward    pass:   58.64   seconds.
Total   Time:   88.35   seconds.
*** Benchmark   ends    ***

OpenBLAS (boost-eigen branch head & git cherry-pick 969d0ab)

*** Benchmark   begins  ***
data    forward:    0.08    seconds.
conv1   forward:    3.89    seconds.
relu1   forward:    0.44    seconds.
pool1   forward:    0.18    seconds.
norm1   forward:    1.22    seconds.
pad2    forward:    0.02    seconds.
conv2   forward:    4.94    seconds.
relu2   forward:    0.36    seconds.
pool2   forward:    0.11    seconds.
norm2   forward:    0.78    seconds.
pad3    forward:    0.01    seconds.
conv3   forward:    2.76    seconds.
relu3   forward:    0.19    seconds.
pad4    forward:    0.08    seconds.
conv4   forward:    2.3 seconds.
relu4   forward:    0.18    seconds.
pad5    forward:    0.08    seconds.
conv5   forward:    1.81    seconds.
relu5   forward:    0.12    seconds.
pool5   forward:    0.12    seconds.
fc6 forward:    0.47    seconds.
relu6   forward:    0.01    seconds.
drop6   forward:    0.04    seconds.
fc7 forward:    0.2 seconds.
relu7   forward:    0.01    seconds.
drop7   forward:    0.03    seconds.
fc8 forward:    0.07    seconds.
loss    forward:    0.01    seconds.
Forward pass:   20.51   seconds.
loss    backward:   0   seconds.
fc8 backward:   0.11    seconds.
drop7   backward:   0.02    seconds.
relu7   backward:   0   seconds.
fc7 backward:   0.42    seconds.
drop6   backward:   0.03    seconds.
relu6   backward:   0   seconds.
fc6 backward:   0.92    seconds.
pool5   backward:   0.3 seconds.
relu5   backward:   0.02    seconds.
conv5   backward:   3.47    seconds.
pad5    backward:   0.07    seconds.
relu4   backward:   0.06    seconds.
conv4   backward:   4.68    seconds.
pad4    backward:   0.05    seconds.
relu3   backward:   0.06    seconds.
conv3   backward:   5.18    seconds.
pad3    backward:   0.04    seconds.
norm2   backward:   0.87    seconds.
pool2   backward:   0.27    seconds.
relu2   backward:   0.04    seconds.
conv2   backward:   9.84    seconds.
pad2    backward:   0.06    seconds.
norm1   backward:   1.24    seconds.
pool1   backward:   0.4 seconds.
relu1   backward:   0.06    seconds.
conv1   backward:   6.11    seconds.
data    backward:   0   seconds.
Backward    pass:   34.32   seconds.
Total   Time:   54.83   seconds.
*** Benchmark   ends    ***

MKL (master head)

Initial loss:   7.63095 
*** Benchmark   begins  ***
data    forward:    0.07    seconds.
conv1   forward:    3.32    seconds.
relu1   forward:    0.81    seconds.
pool1   forward:    0.2 seconds.
norm1   forward:    0.13    seconds.
pad2    forward:    0.08    seconds.
conv2   forward:    5.18    seconds.
relu2   forward:    0.51    seconds.
pool2   forward:    0.43    seconds.
norm2   forward:    0.18    seconds.
pad3    forward:    0.04    seconds.
conv3   forward:    2.5 seconds.
relu3   forward:    0.18    seconds.
pad4    forward:    0.09    seconds.
conv4   forward:    2.3 seconds.
relu4   forward:    0.19    seconds.
pad5    forward:    0.08    seconds.
conv5   forward:    2.25    seconds.
relu5   forward:    0.12    seconds.
pool5   forward:    0.1 seconds.
fc6 forward:    0.68    seconds.
relu6   forward:    0.01    seconds.
drop6   forward:    0.02    seconds.
fc7 forward:    0.22    seconds.
relu7   forward:    0.01    seconds.
drop7   forward:    0.02    seconds.
fc8 forward:    0.06    seconds.
loss    forward:    0   seconds.
Forward pass:   19.78   seconds.
loss    backward:   0   seconds.
fc8 backward:   0.11    seconds.
drop7   backward:   0.01    seconds.
relu7   backward:   0.01    seconds.
fc7 backward:   0.43    seconds.
drop6   backward:   0.01    seconds.
relu6   backward:   0   seconds.
fc6 backward:   0.96    seconds.
pool5   backward:   0.25    seconds.
relu5   backward:   0.07    seconds.
conv5   backward:   3.31    seconds.
pad5    backward:   0.04    seconds.
relu4   backward:   0.07    seconds.
conv4   backward:   4.38    seconds.
pad4    backward:   0.05    seconds.
relu3   backward:   0.06    seconds.
conv3   backward:   4.87    seconds.
pad3    backward:   0.03    seconds.
norm2   backward:   0.23    seconds.
pool2   backward:   1.05    seconds.
relu2   backward:   0.04    seconds.
conv2   backward:   10.69   seconds.
pad2    backward:   0.06    seconds.
norm1   backward:   0.28    seconds.
pool1   backward:   1.59    seconds.
relu1   backward:   0.07    seconds.
conv1   backward:   5.92    seconds.
data    backward:   0   seconds.
Backward    pass:   34.6    seconds.
Total   Time:   54.38   seconds.
*** Benchmark   ends    ***

GPU mode
../build/examples/net_speed_benchmark.bin local_imagenet.prototxt 256 GPU

*** Benchmark   begins  ***
data    forward:    0.22    seconds.
conv1   forward:    0.01    seconds.
relu1   forward:    0   seconds.
pool1   forward:    0.04    seconds.
norm1   forward:    0.11    seconds.
pad2    forward:    0.02    seconds.
conv2   forward:    0.16    seconds.
relu2   forward:    0.05    seconds.
pool2   forward:    0.03    seconds.
norm2   forward:    0.07    seconds.
pad3    forward:    0.02    seconds.
conv3   forward:    0.03    seconds.
relu3   forward:    0   seconds.
pad4    forward:    0.14    seconds.
conv4   forward:    0.22    seconds.
relu4   forward:    0.03    seconds.
pad5    forward:    0.06    seconds.
conv5   forward:    0.17    seconds.
relu5   forward:    0.04    seconds.
pool5   forward:    0.04    seconds.
fc6 forward:    0.06    seconds.
relu6   forward:    0   seconds.
drop6   forward:    1.09    seconds.
fc7 forward:    1.05    seconds.
relu7   forward:    0   seconds.
drop7   forward:    0.5 seconds.
fc8 forward:    0.47    seconds.
loss    forward:    0.3 seconds.
Forward pass:   4.93    seconds.
loss    backward:   0.01    seconds.
fc8 backward:   0   seconds.
drop7   backward:   0   seconds.
relu7   backward:   0.06    seconds.
fc7 backward:   0.12    seconds.
drop6   backward:   0.01    seconds.
relu6   backward:   0.26    seconds.
fc6 backward:   0.5 seconds.
pool5   backward:   0.01    seconds.
relu5   backward:   0.55    seconds.
conv5   backward:   1.29    seconds.
pad5    backward:   0.01    seconds.
relu4   backward:   0.01    seconds.
conv4   backward:   0.28    seconds.
pad4    backward:   0.02    seconds.
relu3   backward:   0.02    seconds.
conv3   backward:   0.27    seconds.
pad3    backward:   0.04    seconds.
norm2   backward:   0.03    seconds.
pool2   backward:   0.03    seconds.
relu2   backward:   0.03    seconds.
conv2   backward:   0.72    seconds.
pad2    backward:   0.05    seconds.
norm1   backward:   0.03    seconds.
pool1   backward:   0.03    seconds.
relu1   backward:   0.03    seconds.
conv1   backward:   0.46    seconds.
data    backward:   0   seconds.
Backward    pass:   4.87    seconds.
Total   Time:   9.8 seconds.
*** Benchmark   ends    ***

rodrigob and others added 19 commits January 21, 2014 18:07
- examples, test and pycaffe compile without problem (matcaffe not tested)
- tests show some errors (on cpu gradient tests), to be investigated
- random generators need to be double checked
- mkl commented code needs to be removed
Replace MKL with Boost+Eigen3

* commit '70c4320e436f92d0963b2622d20c7435b2f07f30':
  Fix test_data_layer segfault by adding destructor to join pthread
  Fix math funcs, add tests, change Eigen Map to unaligned for lrn_layer
  Fix test stochastic pooling stepsize/threshold to be same as max pooling
  Fixed FlattenLayer Backward_cpu/gpu have no return value
  Fixed uniform distribution upper bound to be inclusive
  Add python scripts to install dependent development libs

* commit '9a7d022652d65f44bebc97576a3b4f1b5e559748':
  Fix test_data_layer segfault by adding destructor to join pthread
  Fix math funcs, add tests, change Eigen Map to unaligned for lrn_layer
  Fix test stochastic pooling stepsize/threshold to be same as max pooling
  Fixed FlattenLayer Backward_cpu/gpu have no return value
  Fixed uniform distribution upper bound to be inclusive

* commit '958f038e9e0b1b1c0c62b9119b323f4d62a3832a':
  Fix test_data_layer segfault by adding destructor to join pthread
  Fix math funcs, add tests, change Eigen Map to unaligned for lrn_layer
  Fix test stochastic pooling stepsize/threshold to be same as max pooling
  Fixed FlattenLayer Backward_cpu/gpu have no return value
  Fixed uniform distribution upper bound to be inclusive
previously filled in all NaNs for me, making many tests fail)
fix bernoulli random number generation
@kloudkl mentioned this pull request Feb 10, 2014
@shelhamer (Member) commented:

Consider the discussion about removing Eigen and relying on OpenBLAS alone in #81 (comment).

If this PR is still desired, please rebase for a clean merge. Note that boost-eigen has been rebased from master @ ad08dd1 to avoid drift.

@Yangqing (Member) commented:

If you guys don't mind, please hold off on changes regarding #84 and #85 and kindly join the discussion in #81 - let's collectively decide which way to go for removing the MKL dependency.

My personal feeling is that having caffe simply depend on blas (with custom-written vsl functions) instead of eigen, and then linking against multiple backend libraries (atlas, openblas, mkl), seems the right way to go based on @kloudkl's analysis.
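
For concreteness, a plain-BLAS backend could look roughly like the hypothetical sketch below (not code from this PR; the wrapper name is made up). Any library exporting the standard cblas interface (ATLAS, OpenBLAS, or MKL) could then be selected at link time without touching the callers:

```cpp
#include <cblas.h>

// Hypothetical sketch: a Caffe-side sgemm wrapper that forwards to the
// standard cblas interface, so ATLAS, OpenBLAS, or MKL can be chosen at
// link time without changing any caller.
void caffe_cpu_sgemm_sketch(const CBLAS_TRANSPOSE trans_A,
                            const CBLAS_TRANSPOSE trans_B,
                            const int M, const int N, const int K,
                            const float alpha, const float* A,
                            const float* B, const float beta, float* C) {
  // Leading dimensions for row-major storage: the in-memory row stride of
  // each operand before any transposition is applied.
  const int lda = (trans_A == CblasNoTrans) ? K : M;
  const int ldb = (trans_B == CblasNoTrans) ? N : K;
  cblas_sgemm(CblasRowMajor, trans_A, trans_B, M, N, K,
              alpha, A, lda, B, ldb, beta, C, N);
}
```

The vsl-style random number routines are MKL-specific rather than part of BLAS, so they would need small custom replacements, presumably along the lines of the boost-based generators already in this branch.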

@kloudkl (Contributor, Author) commented Feb 11, 2014

I agree that it is more beneficial to unify the code base.

After running the benchmarks, I have a better understanding of why the standard BLAS interface was created. Issue #54 should be closed too.

@rodrigob (Contributor) commented:

Great work @kloudkl!
Could you point me to a short howto/guideline on how to reproduce these benchmarks? (Let's try to get another data point on different machines...)

@shelhamer (Member) commented:

Largely superseded by #97. Once the boost-eigen dust settles we'll note OpenBLAS in the installation documentation.

Thanks for your experiments with eigen, openmp, and openblas @kloudkl, and thanks @rodrigob for the initial PR and discussion.

@shelhamer closed this Feb 18, 2014