
Clcaffe #3355

Closed
wants to merge 22 commits

Conversation

gongzg

@gongzg gongzg commented Nov 19, 2015

This is an OpenCL backend for Caffe from Intel. In addition to seamlessly integrating OpenCL support into the original Caffe framework, this backend provides the following major features:

  • Implementations of three types of convolution engines for the forward pass:
      • GEMM (the default matrix-multiplication approach)
      • Spatial domain (the best time performer)
      • Frequency domain
  • A runtime auto-tuner for spatial-domain convolution that generates the best kernels for the hardware configuration
  • Switching among the OpenCL, CPU, and CUDA implementations (currently only build-time switching is supported)
  • Tested on Intel, AMD, and NVIDIA hardware, with support for OpenCL 1.1 and up

Dependencies

  • OpenCL BLAS: tested with clBLAS and another internal BLAS, which will be released soon
  • (optional) OpenCL FFT, needed to run frequency-domain convolution: tested with clFFT

Usage

Currently, this code base only supports the Makefile-style build. Copy Makefile.config.example to Makefile.config, then change the following line:

```
# USE_OCL := 1
```

to

```
USE_OCL := 1
```
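
For reference, the full sequence might look like the following (a sketch assuming the standard Caffe Makefile targets; adjust the job count to your machine):

```sh
# Sketch: Makefile-style build with the OpenCL backend enabled.
cp Makefile.config.example Makefile.config
# edit Makefile.config and set USE_OCL := 1 as described above
make all -j8
make test -j8 && make runtest   # optional: build and run the unit tests
```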
We can measure the performance of the default convolution engine (GEMM) with:

```
./build/tools/caffe test -model models/bvlc_alexnet/deploy.prototxt -gpu=0 --weights=1
```
To switch to the SPATIAL engine, add `engine: SPATIAL` to the convolution_param in the prototxt (see the sketch below) and rerun the test to collect performance numbers. Please note that the auto-tuner takes more time on the very first run; the tuning results are cached for subsequent runs.
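
For illustration, a convolution layer with the spatial engine selected might look like this (a minimal sketch; the layer name and dimensions are AlexNet's conv1 values, used here as placeholders):

```protobuf
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 96      # AlexNet conv1 values, for illustration only
    kernel_size: 11
    stride: 4
    engine: SPATIAL     # select the spatial-domain convolution engine
  }
}
```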

Performance reference

In our test on an Intel BDW Xeon CPU E3-12xx v4 @ 3.40 GHz with a GT3e GPU @ 1.15 GHz and 128 MB eDRAM (peak 0.88 TFLOPS), the following are the timing results using AlexNet as the benchmark:

Classification

| Configuration | img/sec |
| --- | --- |
| (ref) CPU + ATLAS | 8 |
| (ref) CPU + OpenBLAS | 10 |
| (ref) CPU + MKL | 65 |
| OpenCL backend, GEMM | 89 |
| OpenCL backend, spatial domain | 165 |
| OpenCL backend, frequency domain | 60 |

Training

| Configuration | img/sec |
| --- | --- |
| (ref) CPU + ATLAS | 4 |
| (ref) CPU + OpenBLAS | 5 |
| (ref) CPU + MKL | 28 |
| OpenCL backend, GEMM | 28 |
| OpenCL backend, spatial domain | 48 |
| OpenCL backend, frequency domain | 19 |

Known issues

Due to the last merge with master, there may be some issues related to batch_norm_layer, batch_reindex_layer, embed_layer, tile_layer, or the thread-local related functions. We are actively working on closing this gap. This branch is under active development.

jeffintc and others added 11 commits November 19, 2015 17:48
(err != CL_SUCCESS && err != true) will always be false.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
@bhack bhack mentioned this pull request Nov 19, 2015
@ozabluda

Counting only convolutional and fully-connected layers, models/bvlc_alexnet/deploy.prototxt has to perform 1449 MFLOP per image for classification (forward) and 4347 MFLOP per image for training (forward + backward). With 883.2 GFLOP/s (GT3e at 1150 MHz), the maximum possible performance is:

Classification: 610 images/s
Training: 203 images/s

The huge difference between these bounds and the measured numbers must be down to the performance of clBLAS and clFFT (both open-source libraries from AMD) on Intel's iGPU.
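
For concreteness, here is the arithmetic behind those bounds as a small C++ sketch (the FLOP counts are taken from this comment, not recomputed from the prototxt; training is forward plus backward, with backward costing roughly 2x forward):

```cpp
#include <cstdio>

int main() {
  // Peak of GT3e at 1.15 GHz, as cited above: 883.2 GFLOP/s.
  const double peak_gflops = 883.2;
  // AlexNet conv + FC layers, forward pass: 1449 MFLOP per image.
  const double fwd_gflop = 1.449;
  // Training = forward + backward; backward ~ 2x forward => 3 * 1.449 = 4.347.
  const double train_gflop = 3.0 * fwd_gflop;

  std::printf("classification bound: %.0f img/s\n", peak_gflops / fwd_gflop);   // ~610
  std::printf("training bound:       %.0f img/s\n", peak_gflops / train_gflop); // ~203
  // The measured spatial-domain numbers (165 and 48 img/s) reach only
  // about 27% and 24% of these roofline bounds.
  return 0;
}
```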

Zhigang Gong added 4 commits November 20, 2015 09:02
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
OCL still doesn't support the CMake build, but we should not
break the CMake build for other configurations.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
@bhack
Contributor

bhack commented Nov 21, 2015

The CPU+OpenBLAS vs. CPU+MKL benchmark results seem a little strange to me. @xianyi, what do you think?

@hughperkins

Per the Eigen benchmarks, OpenBLAS is very comparable to Intel MKL: http://eigen.tuxfamily.org/index.php?title=Benchmark (OpenBLAS is GotoBLAS in the Eigen benchmarks). The processor cited here seems to have 4 cores (8 hyperthreads): https://en.wikipedia.org/wiki/Broadwell_%28microarchitecture%29 . I wonder if OpenBLAS is using only 1 hyperthread here for some reason?

Edit: I remember something about core affinity being configured in OpenBLAS, actually; depending on how it's built, it only uses a single core. You can rebuild with NO_AFFINITY = 1 uncommented in Makefile.rule, and it will use all cores, as shown below.
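
For reference, that's this line in OpenBLAS's Makefile.rule (uncomment it and rebuild):

```makefile
# In OpenBLAS's Makefile.rule: with affinity disabled, OpenBLAS threads
# are no longer pinned to one core and can spread across all cores.
NO_AFFINITY = 1
```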

@bhack
Contributor

bhack commented Nov 21, 2015

We are ignoring the thread/OpenMP configuration of this build.

@andravin

@ozabluda These Intel OpenCL SGEMM implementations would likely perform much better in this context: https://software.intel.com/en-us/articles/sgemm-for-intel-processor-graphics

IIRC, they currently require matrix dimensions to be multiples of 8.

@hughperkins

Reading the blurb at the top of this thread, it seems there will be a new OpenCL BLAS released soon. That will be interesting, since that's the main speed bottleneck for im2col convolutions.

Edit: ah, per andravin's link, it looks like the Intel OpenCL BLAS might be using a proprietary extension, cl_intel_subgroups, which would make it Intel-only. Opinion: I guess hardware-specific OpenCL BLAS implementations are not a bad idea, as long as they are drop-in and pluggable, but I'm not sure that's quite the case currently?

Zhigang Gong added 2 commits November 24, 2015 16:16
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
@ptillet

ptillet commented Nov 29, 2015

@hughperkins I am the author of the upcoming BLAS library. I spent the summer at Intel optimizing the kernels for HD Graphics, but it doesn't rely on cl_intel_subgroups.

I expect a release fairly soon (during my school's winter break). I had planned to release it earlier, but found a bunch of nasty corner cases I have to fix...

@hughperkins

@ptillet Ah, nice! :-)

@xianyi

xianyi commented Nov 30, 2015

@ptillet nice work!~

@bhack, I have no idea about the performance gap; I didn't profile Caffe with OpenBLAS before.
I wonder if there are some corner cases in the multi-threading implementation.
E.g., this branch can improve some multi-threaded SGEMM calls by about 100%:
https://github.com/xianyi/OpenBLAS/tree/optimized_for_deeplearning

Xianyi

@gujunli

gujunli commented Dec 27, 2015

@ptillet Is the BLAS clBLAS? What are your optimizations focusing on? Thanks!

@bhack
Contributor

bhack commented Dec 28, 2015

@bhack
Contributor

bhack commented Feb 1, 2016

@naibaf7 What is the policy now? I think it does not make sense to have OpenCL PRs open against master now that we have an official OpenCL branch.

@gongzg
Author

gongzg commented Feb 1, 2016

@bhack I talked with @naibaf7 last month, and we will collaborate to merge this PR to the official OpenCL branch. You are right that we should close this PR now. Thanks for the reminder.

@gongzg gongzg closed this Feb 1, 2016
@gongzg gongzg deleted the clcaffe branch January 6, 2017 11:36