Clcaffe #3355
Conversation
(err != CL_SUCCESS && err != true) will always be false. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
Counting only convolutional and fully-connected layers, models/bvlc_alexnet/deploy.prototxt has to perform 1449 MFLOP per image for classification (forward) and 4347 MFLOP per image for training (forward+backward). With 883.2 GFLOP/s (GT3e at 1150 MHz), the maximum possible performance is 610 images/s for classification. The huge difference between this bound and reality must come from the performance of clBLAS and clFFT (both open-source libraries from AMD) on Intel's iGPU.
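The throughput bound quoted above can be checked with a few lines of arithmetic, using only the figures stated in the comment (1449 MFLOP forward, 4347 MFLOP forward+backward, 883.2 GFLOP/s peak):

```python
# Back-of-the-envelope throughput bounds for AlexNet on the GT3e,
# assuming the GPU sustains 100% of its 883.2 GFLOP/s peak.
PEAK_GFLOPS = 883.2

def max_images_per_sec(mflop_per_image):
    """Upper bound on images/s at full peak utilization."""
    return PEAK_GFLOPS * 1e9 / (mflop_per_image * 1e6)

print(round(max_images_per_sec(1449)))  # classification bound -> 610
print(round(max_images_per_sec(4347)))  # training bound -> 203
```

The 610 images/s figure matches the comment; the same arithmetic gives roughly 203 images/s as the corresponding training bound.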
OCL still doesn't support the cmake build, but we should not break the cmake build for other configurations. Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
The CPU-OpenBLAS vs CPU-MKL benchmark results seem a little strange to me. @xianyi, what do you think?
Per the Eigen benchmarks, OpenBLAS is very comparable to Intel MKL: http://eigen.tuxfamily.org/index.php?title=Benchmark (OpenBLAS is GotoBLAS in the Eigen benchmarks). The processor cited here seems to have 4 cores, or 8 hyperthreads: https://en.wikipedia.org/wiki/Broadwell_%28microarchitecture%29 . I wonder if OpenBLAS is using only 1 hyperthread here for some reason? Edit: I remember core affinity actually being configurable in OpenBLAS, so depending on how it's built, it may only use a single core. You can rebuild with Makefile.rule changed accordingly.
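As a sketch of that rebuild: NO_AFFINITY and NUM_THREADS are standard OpenBLAS build-time options (settable in Makefile.rule or on the make command line), and OPENBLAS_NUM_THREADS controls the thread count at runtime; the value 8 here is illustrative, matching the 8 hyperthreads mentioned above.

```shell
# Build-time: disable core-affinity pinning and allow up to 8 threads
# (run inside the OpenBLAS source tree; shown as comments since the
# surrounding text does not confirm the exact build setup):
#   make NO_AFFINITY=1 NUM_THREADS=8
#
# Runtime: the thread count can also be forced via an environment variable.
export OPENBLAS_NUM_THREADS=8
echo "OpenBLAS will use $OPENBLAS_NUM_THREADS threads"
```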
We are ignoring the thread/OpenMP configuration of this build.
@ozabluda These Intel OpenCL SGEMM implementations would likely perform much better in this context: https://software.intel.com/en-us/articles/sgemm-for-intel-processor-graphics IIRC, they currently require matrix dimensions to be multiples of 8.
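If that multiples-of-8 constraint holds, the usual workaround is to zero-pad matrix dimensions up to the next multiple of 8. A minimal sketch of the rounding step (`round_up` is a hypothetical helper, not part of any library mentioned in this thread):

```python
def round_up(n, multiple=8):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple

print(round_up(13))  # -> 16
print(round_up(96))  # -> 96 (already a multiple of 8)
```

The padded rows/columns are filled with zeros before the SGEMM call and the extra output is discarded afterwards, at the cost of some wasted FLOPs.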
Reading the blurb at the top of this thread, it seems a new OpenCL BLAS will be released soon. That will be interesting, since that's the main speed bottleneck for im2col convolutions. Edit: ah, per andravin's link, it looks like the Intel OpenCL BLAS might be using a proprietary method.
@hughperkins I am the author of the upcoming BLAS library. I spent the summer at Intel optimizing the kernels for the HD Graphics, but it doesn't rely on cl_intel_subgroups. I expect a release fairly soon (during my school's winter break). I had planned to release it earlier, but found a bunch of nasty corner cases I have to fix...
@ptillet Ah, nice! :-) |
@ptillet nice work!~ @bhack, I have no idea about the performance gap. I haven't profiled Caffe with OpenBLAS before. Xianyi
@ptillet Is the BLAS clBLAS? What are your optimizations focusing on? Thanks!
fixed invalid_context bug: set cl_state_ to be a static variable
fixed bug of "Cl buffers not released!"
@naibaf7 What is the policy now? I think it does not make sense to have OpenCL PRs opened against master now that we have an official OpenCL branch.
This is an OpenCL backend for Caffe from Intel. Besides seamless integration of OpenCL support into the original Caffe framework, this backend also provides the following major features:
Dependencies
Usage
Currently, this code base only supports the Makefile-style build. Copy Makefile.config.example to Makefile.config, then change the following line:
# USE_OCL := 1
to
USE_OCL := 1
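The edit can also be scripted. The snippet below creates a stand-in Makefile.config so it is self-contained; in a real checkout you would `cp Makefile.config.example Makefile.config` instead of the printf:

```shell
# Stand-in for the real Makefile.config.example (one relevant line):
printf '# USE_OCL := 1\n' > Makefile.config
# Uncomment the USE_OCL line in place (GNU sed):
sed -i 's/^# *USE_OCL *:= *1/USE_OCL := 1/' Makefile.config
cat Makefile.config  # -> USE_OCL := 1
```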
The default convolution engine (GEMM) performance can be measured with:
./build/tools/caffe test -model models/bvlc_alexnet/deploy.prototxt -gpu=0 --weights=1
To switch to the SPATIAL engine, add "engine: SPATIAL" to the convolution_param in the prototxt and rerun the test to collect performance. Please note that the auto-tuner takes more time on the very first run; the tuning results are cached for subsequent runs.
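A sketch of what that prototxt edit might look like, using AlexNet's first convolution layer as the example (the layer parameters are the standard AlexNet conv1 values; only the engine: SPATIAL line is the addition described above, and this backend's exact accepted engine values are assumed from the text):

```
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 96
    kernel_size: 11
    stride: 4
    engine: SPATIAL   # switch from the default GEMM engine
  }
}
```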
Performance reference
In our test on an Intel Broadwell Xeon CPU E3-12xx v4 @ 3.40GHz, with a GT3e GPU @ 1.15GHz and 128MB eDRAM (peak 0.88 TFLOPS), the following throughput was measured using AlexNet as the benchmark:
Classification
(ref) CPU+ATLAS: 8 img/sec
(ref) CPU+OpenBLAS: 10 img/sec
(ref) CPU+MKL: 65 img/sec
OpenCL backend GEMM: 89 img/sec
OpenCL backend spatial domain: 165 img/sec
OpenCL backend frequency domain: 60 img/sec
Training
(ref) CPU+ATLAS: 4 img/sec
(ref) CPU+OpenBLAS: 5 img/sec
(ref) CPU+MKL: 28 img/sec
OpenCL backend GEMM: 28 img/sec
OpenCL backend spatial domain: 48 img/sec
OpenCL backend frequency domain: 19 img/sec
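The speedups implied by the numbers above, relative to the CPU+MKL reference (65 img/s classification, 28 img/s training), can be computed directly:

```python
# Throughput figures copied from the lists above (img/s).
cls = {"GEMM": 89, "spatial": 165, "frequency": 60}
train = {"GEMM": 28, "spatial": 48, "frequency": 19}

for name, ips in cls.items():
    print(f"classification {name}: {ips / 65:.2f}x vs MKL")
for name, ips in train.items():
    print(f"training {name}: {ips / 28:.2f}x vs MKL")
```

This shows the spatial-domain engine at roughly 2.5x MKL for classification, while the frequency-domain engine is actually slower than MKL on both workloads.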
Known issues
Due to the last merge with master, there may be some issues related to batch_norm_layer, batch_reindex_layer, embed_layer, tile_layer, or the thread-local related functions. We are actively working on closing this gap. This branch is under active development.