
Clcaffe #3355

Closed
wants to merge 22 commits

Conversation

gongzg

@gongzg gongzg commented Nov 19, 2015

This is an OpenCL backend for Caffe from Intel. In addition to seamlessly integrating OpenCL support into the original Caffe framework, this backend provides the following major features:

  • Implementations of three types of convolution engines for the forward pass:
      • GEMM (the default matrix-multiplication approach)
      • Spatial domain (the best time performer)
      • Frequency domain
  • A runtime auto-tuner for spatial-domain convolution that generates the best kernels for the hardware configuration
  • Switching among the OpenCL, CPU, and CUDA implementations (currently only build-time switching is supported)
  • Tested on Intel, AMD, and NVIDIA hardware, with support for OpenCL 1.1 and up

Dependencies

  • OpenCL BLAS: tested with clBLAS and another internal BLAS, which will be released soon
  • (optional) OpenCL FFT, needed to run frequency-domain convolution: tested with clFFT

Usage

Currently, this code base only supports the Makefile-style build. Copy Makefile.config.example to Makefile.config, then change the following line:

```
# USE_OCL := 1
```

to

```
USE_OCL := 1
```
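
For reference, the full sequence might look like the following (a sketch assuming the standard Caffe Makefile targets; adjust the job count to your machine):

```sh
# Sketch: Makefile-style build with the OpenCL backend enabled.
cp Makefile.config.example Makefile.config
# edit Makefile.config and set USE_OCL := 1 as described above
make all -j8
make test -j8 && make runtest   # optional: build and run the unit tests
```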
We can measure the performance of the default convolution engine (GEMM) with:

```
./build/tools/caffe test -model models/bvlc_alexnet/deploy.prototxt -gpu=0 --weights=1
```
To switch to the SPATIAL engine, add `engine: SPATIAL` to the convolution_param in the prototxt (see the sketch below) and rerun the test to collect performance numbers. Please note that the auto-tuner takes more time on the very first run; the tuning results are cached for subsequent runs.
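
For illustration, a convolution layer with the spatial engine selected might look like this (a minimal sketch; the layer name and dimensions are AlexNet's conv1 values, used here as placeholders):

```protobuf
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 96      # AlexNet conv1 values, for illustration only
    kernel_size: 11
    stride: 4
    engine: SPATIAL     # select the spatial-domain convolution engine
  }
}
```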

Performance reference

In our test on an Intel BDW Xeon CPU E3-12xx v4 @ 3.40 GHz with a GT3e GPU @ 1.15 GHz and 128 MB eDRAM (peak 0.88 TFLOPS), the following are the timing results using AlexNet as the benchmark:

Classification

| Configuration | img/sec |
| --- | --- |
| (ref) CPU + ATLAS | 8 |
| (ref) CPU + OpenBLAS | 10 |
| (ref) CPU + MKL | 65 |
| OpenCL backend, GEMM | 89 |
| OpenCL backend, spatial domain | 165 |
| OpenCL backend, frequency domain | 60 |

Training

| Configuration | img/sec |
| --- | --- |
| (ref) CPU + ATLAS | 4 |
| (ref) CPU + OpenBLAS | 5 |
| (ref) CPU + MKL | 28 |
| OpenCL backend, GEMM | 28 |
| OpenCL backend, spatial domain | 48 |
| OpenCL backend, frequency domain | 19 |

Known issues

Due to the last merge with master, there may be some issues related to batch_norm_layer, batch_reindex_layer, embed_layer, tile_layer, or the thread-local related functions. We are actively working on closing this gap. This branch is under active development.

jeffintc and others added 11 commits November 19, 2015 17:48
(err != CL_SUCCESS && err != true) will always be false.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
@bhack bhack mentioned this pull request Nov 19, 2015
@ozabluda

Counting only convolutional and fully-connected layers, models/bvlc_alexnet/deploy.prototxt has to perform 1449 MFLOP per image for classification (forward) and 4347 MFLOP per image for training (forward + backward). With 883.2 GFLOP/s (GT3e at 1150 MHz), the maximum possible performance is:

Classification: 610 images/s
Training: 203 images/s

The huge difference between these bounds and the measured numbers must be down to the performance of clBLAS and clFFT (both open-source libraries from AMD) on Intel's iGPU.
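
For concreteness, here is the arithmetic behind those bounds as a small C++ sketch (the FLOP counts are taken from this comment, not recomputed from the prototxt; training is forward plus backward, with backward costing roughly 2x forward):

```cpp
#include <cstdio>

int main() {
  // Peak of GT3e at 1.15 GHz, as cited above: 883.2 GFLOP/s.
  const double peak_gflops = 883.2;
  // AlexNet conv + FC layers, forward pass: 1449 MFLOP per image.
  const double fwd_gflop = 1.449;
  // Training = forward + backward; backward ~ 2x forward => 3 * 1.449 = 4.347.
  const double train_gflop = 3.0 * fwd_gflop;

  std::printf("classification bound: %.0f img/s\n", peak_gflops / fwd_gflop);   // ~610
  std::printf("training bound:       %.0f img/s\n", peak_gflops / train_gflop); // ~203
  // The measured spatial-domain numbers (165 and 48 img/s) reach only
  // about 27% and 24% of these roofline bounds.
  return 0;
}
```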

Zhigang Gong added 4 commits November 20, 2015 09:02
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
OCL still doesn't support the CMake build, but we should not
break the CMake build for other configurations.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
@bhack
Contributor

bhack commented Nov 21, 2015

The CPU+OpenBLAS vs. CPU+MKL benchmark results seem a little strange to me. @xianyi, what do you think?

@hughperkins

Per the Eigen benchmarks, OpenBLAS is very comparable to Intel MKL: http://eigen.tuxfamily.org/index.php?title=Benchmark (OpenBLAS is GotoBLAS in the Eigen benchmarks). The processor cited here seems to have 4 cores (8 hyperthreads): https://en.wikipedia.org/wiki/Broadwell_%28microarchitecture%29 . I wonder if OpenBLAS is using only 1 hyperthread here for some reason?

Edit: I remember something about core affinity being configured in OpenBLAS, actually; depending on how it's built, it only uses a single core. You can rebuild with NO_AFFINITY = 1 uncommented in Makefile.rule, and it will use all cores, as shown below.
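
For reference, that's this line in OpenBLAS's Makefile.rule (uncomment it and rebuild):

```makefile
# In OpenBLAS's Makefile.rule: with affinity disabled, OpenBLAS threads
# are no longer pinned to one core and can spread across all cores.
NO_AFFINITY = 1
```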

@bhack
Contributor

bhack commented Nov 21, 2015

We are ignoring the thread/OpenMP configuration of this build.

@andravin

@ozabluda These Intel OpenCL SGEMM implementations would likely perform much better in this context: https://software.intel.com/en-us/articles/sgemm-for-intel-processor-graphics

IIRC, they currently require matrix dimensions to be multiples of 8.

@hughperkins

Reading the blurb at the top of this thread, it seems there will be a new OpenCL BLAS released soon. That will be interesting, since that's the main speed bottleneck for im2col convolutions.

Edit: ah, per andravin's link, it looks like the Intel OpenCL BLAS might be using a proprietary extension, cl_intel_subgroups, which would make it Intel-only. Opinion: I guess hardware-specific OpenCL BLAS implementations are not a bad idea, as long as they are drop-in and pluggable, but I'm not sure that's quite the case currently?

Zhigang Gong added 2 commits November 24, 2015 16:16
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
@ptillet

ptillet commented Nov 29, 2015

@hughperkins I am the author of the upcoming BLAS library. I spent the summer at Intel optimizing the kernels for HD Graphics, but it doesn't rely on cl_intel_subgroups.

I expect a release fairly soon (during my school's winter break). I had planned to release it earlier, but found a bunch of nasty corner cases I have to fix...

@hughperkins

@ptillet Ah, nice! :-)

@xianyi

xianyi commented Nov 30, 2015

@ptillet nice work!~

@bhack, I have no idea about the performance gap; I didn't profile Caffe with OpenBLAS before.
I wonder if there are some corner cases in the multi-threading implementation.
E.g., this branch can improve some multi-threaded SGEMM calls by about 100%:
https://github.com/xianyi/OpenBLAS/tree/optimized_for_deeplearning

Xianyi

@gujunli

gujunli commented Dec 27, 2015

@ptillet Is the BLAS clBLAS? What are your optimizations focusing on? Thanks!

@bhack
Contributor

bhack commented Dec 28, 2015

@bhack
Contributor

bhack commented Feb 1, 2016

@naibaf7 What is the policy now? I think it does not make sense to have OpenCL PRs open against master now that we have an official OpenCL branch.

@gongzg
Author

gongzg commented Feb 1, 2016

@bhack I talked with @naibaf7 last month, and we will collaborate to merge this PR to the official OpenCL branch. You are right that we should close this PR now. Thanks for the reminder.

@gongzg gongzg closed this Feb 1, 2016
@gongzg gongzg deleted the clcaffe branch January 6, 2017 11:36