Replace atlas/cblas routines with Eigen in the math functions #85
Conversation
- examples, tests, and pycaffe compile without problem (matcaffe not tested)
- tests show some errors (on CPU gradient tests), to be investigated
- random generators need to be double-checked
- commented-out MKL code needs to be removed
Replace MKL with Boost+Eigen3

* commit '70c4320e436f92d0963b2622d20c7435b2f07f30':
  - Fix test_data_layer segfault by adding destructor to join pthread
  - Fix math funcs, add tests, change Eigen Map to unaligned for lrn_layer
  - Fix test stochastic pooling stepsize/threshold to be same as max pooling
  - Fixed FlattenLayer Backward_cpu/gpu have no return value
  - Fixed uniform distribution upper bound to be inclusive
  - Add python scripts to install dependent development libs
* commit '9a7d022652d65f44bebc97576a3b4f1b5e559748':
  - Fix test_data_layer segfault by adding destructor to join pthread
  - Fix math funcs, add tests, change Eigen Map to unaligned for lrn_layer
  - Fix test stochastic pooling stepsize/threshold to be same as max pooling
  - Fixed FlattenLayer Backward_cpu/gpu have no return value
  - Fixed uniform distribution upper bound to be inclusive
* commit '958f038e9e0b1b1c0c62b9119b323f4d62a3832a':
  - Fix test_data_layer segfault by adding destructor to join pthread
  - Fix math funcs, add tests, change Eigen Map to unaligned for lrn_layer
  - Fix test stochastic pooling stepsize/threshold to be same as max pooling
  - Fixed FlattenLayer Backward_cpu/gpu have no return value
  - Fixed uniform distribution upper bound to be inclusive
- Compile errors in boost-eigen branch
- make compatible with boost 1.46 and 1.55
- fix bernoulli random number generation (previously filled in all NaNs for me, making many tests fail)
Consider the discussion on removing Eigen and relying on OpenBLAS alone (#81 (comment)). If this PR is still desired, please rebase for a clean merge.
If you guys don't mind, please hold off on changes regarding #84 and #85 and kindly join the discussion in #81 - let's collectively decide which way to go for removing the MKL dependency. My personal feeling, based on @kloudkl's analysis, is that having Caffe depend simply on BLAS (with the vsl functions custom-written) and then linking against any of several backend libraries (ATLAS, OpenBLAS, MKL) is the right way to go.
I agree that it is more beneficial to unify the code base. After running the benchmarks, I gained a better understanding of why the standard BLAS interface was created. Issue #54 should be closed too.
Great work @kloudkl!
This pull request meets the requirement of issue #84. Layerwise runtime analysis of imagenet.prototxt, using @sguada's detailed net_speed_benchmark from #83, was a little disappointing in terms of Eigen's performance.
Table 1. Training time analysis of different BLAS libraries in CPU mode and in GPU mode. All times are in seconds. The batch size in CPU mode is 256. The GPU-mode data is included for illustrative purposes for issue #3 only.
Even when running with the maximum number of physical cores (4 on my machine), Eigen is still much slower than OpenBLAS and MKL. Fully exploiting Eigen's performance requires expert knowledge of how it evaluates expressions. In contrast, OpenBLAS is very low-hanging fruit: install the multi-threaded package (#80, #81), link it into your application (#82), and everything works like a charm. MKL also does a good job, but at a high price.
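Switching to multi-threaded OpenBLAS is mostly a build-configuration change. A hedged sketch of what the relevant Makefile fragment might look like (paths and variable names are assumptions for illustration, not Caffe's actual Makefile variables):

```makefile
# Assumed fragment: point the build at a multi-threaded OpenBLAS install.
BLAS_INCLUDE := /usr/include/openblas
BLAS_LIB     := /usr/lib/openblas

CXXFLAGS += -I$(BLAS_INCLUDE)
LDFLAGS  += -L$(BLAS_LIB) -lopenblas

# At run time, cap the thread count to the number of physical cores, e.g.:
#   OPENBLAS_NUM_THREADS=4 ./build/examples/net_speed_benchmark.bin ...
```

OpenBLAS reads its thread count from the `OPENBLAS_NUM_THREADS` environment variable (falling back to `OMP_NUM_THREADS` in OpenMP builds), which is why the benchmark commands below pin the thread count explicitly.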
I would appreciate any other independent benchmark very much.
Eigen (boost-eigen branch head + this PR)
OMP_NUM_THREADS=4 ../build/examples/net_speed_benchmark.bin local_imagenet.prototxt 1 CPU
OpenBLAS (boost-eigen branch head & git cherry-pick 969d0ab)
MKL (master head)
GPU mode
../build/examples/net_speed_benchmark.bin local_imagenet.prototxt 256 GPU