Opencl - half floating point support and introduce layer fusion for inference #5745

Open · wants to merge 33 commits

Conversation

4 participants

gongzg commented Jul 6, 2017

This PR's major contributions are the following two features:

  1. FP16 support for both the OpenCL (OCL) device and the CPU device. Since it depends on a half-precision clBLAS, ISAAC is required to enable the FP16 feature.
  2. Layer fusion support. For inference optimization, some layers can be fused into one, for example CONV + RELU, or CONV + BN + SCALE + RELU.
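The idea behind the fusion patterns above can be sketched in plain Python (a hypothetical illustration, not the PR's kernel code): fusing ReLU into a convolution means applying the activation while the output value is still in registers, instead of writing the convolution result out and re-reading it in a second kernel pass.

```python
# Minimal sketch of CONV + RELU fusion, assuming a naive valid 1-D convolution.
# All names here are illustrative; the PR implements this inside OpenCL kernels.

def conv1d(x, w):
    """Naive valid 1-D convolution (unfused reference)."""
    n = len(x) - len(w) + 1
    return [sum(x[i + j] * w[j] for j in range(len(w))) for i in range(n)]

def relu(y):
    return [max(0.0, v) for v in y]

def conv1d_relu_fused(x, w):
    """Same convolution with ReLU fused into the output step."""
    n = len(x) - len(w) + 1
    out = []
    for i in range(n):
        acc = sum(x[i + j] * w[j] for j in range(len(w)))
        out.append(max(0.0, acc))  # activation applied before the store
    return out

x = [1.0, -2.0, 3.0, -4.0, 5.0]
w = [0.5, -1.0]
assert conv1d_relu_fused(x, w) == relu(conv1d(x, w))
```

The fused version produces bit-identical output while touching the output buffer only once, which is where the inference speedup comes from.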

The FP16 version currently achieves about a 91% test pass rate. Most of the failures are in the gradient-check and solver test cases. Inference works fine; I have already verified it on my tree with both SSD and yolo2. The FP16 version gets a very good performance gain over FP32 for the yolo2 model. I will fix those failures eventually.

gongzg and others added some commits Mar 31, 2017

@gongzg gongzg Set MKL_USE_SINGLE_DYNAMIC_LIBRARY to disabled.
Otherwise it may cause some issues.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
4556630
@gongzg gongzg Add optimized GEMM/GEMV into caffe greentea math library.
The image-based interface is much faster than ISAAC's buffer-based GEMM interface.
Until a new image-based interface is added to ISAAC, we have to keep this
implementation.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
c5ebc9e
@gongzg gongzg Prepare to support layer fusions.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
3c9c415
@gongzg gongzg Implement layer fusion in spatial convolution engine.
Support the following fusion types:
ConvolutionParameter_FuseType_FUSED_CONV_RELU - CONV + ReLU
ConvolutionParameter_FuseType_FUSED_CONV_MAX_POOLING_RELU - CONV + max pooling (without padding) + ReLU
ConvolutionParameter_FuseType_FUSED_CONV_ELTWISE_RELU - CONV + eltwise + ReLU

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
bc013a5
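The CONV + ELTWISE + RELU pattern listed above can be sketched as follows (a hypothetical pure-Python illustration, not the PR's kernel code): the convolution result is summed element-wise with another blob, as in residual connections, and passed through ReLU in the same output pass instead of three separate kernels.

```python
# Sketch of CONV + ELTWISE + RELU fusion, assuming a naive valid 1-D convolution.
# Function names are illustrative only.

def conv1d(x, w):
    n = len(x) - len(w) + 1
    return [sum(x[i + j] * w[j] for j in range(len(w))) for i in range(n)]

def conv_eltwise_relu_fused(x, w, other):
    conv = conv1d(x, w)
    # eltwise sum and ReLU applied in the same pass over the output
    return [max(0.0, c + o) for c, o in zip(conv, other)]

x = [1.0, 2.0, 3.0]
w = [1.0, -1.0]
other = [0.5, -2.0]
assert conv_eltwise_relu_fused(x, w, other) == [0.0, 0.0]
```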
@gongzg gongzg Add LRN fusion with Pooling layer.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
e270af6
@gongzg gongzg Enable image based GEMM interface for inner product layer.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
a5dd297
@gongzg gongzg Optimize BN layer for inference only.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
6abe23d
@gongzg Richman, Reuven Softmax layer CPU forward: no need to max values with themselves 052c332
@gongzg gongzg Refine zero copy support.
Since all memory in the OpenCL backend is allocated with a qualified size and
alignment, we can skip the size check.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
5a1d290
@gongzg gongzg Add a new lt option to the caffe tool.
By default, per-layer timing is no longer measured, since per-layer
timing measurement brings significant overhead for many net models.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
0e4994a
@gongzg gongzg Use an explicit constant value type rather than the default double type.
For compilers that don't support the double type, implicit double-typed
constants may cause issues.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
89f6315
@gongzg gongzg Fix a bug in the inner product layer.
This bug is triggered by two hacky cases:
1. SharedWeights is called from the RNN layer's forward-only path, which
may change an inner product layer's weight data in a hacky way after
layer setup.
2. The LSTM gradient test case sets the phase to TEST but still calls
into the inner product layer's backward path.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
4eaf108
@gongzg gongzg Reduce the maximum block size for spatial convolution engine.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
742d803
@gongzg gongzg Simplify the IDLF kernel's output logic.
Since we now use a dynamic image size, there is no need to use the last
block width and height. This patch fixes some of the performance regression
caused by the dynamic image size change, but a performance gap remains
compared to the previous constant-image-size version.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
db3da6e
@gongzg gongzg Add an inference-optimized model file for AlexNet.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
d121f01
@gongzg gongzg Disable the ViennaCL cache mechanism during the spatial engine's tuning phase.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
ba5ac27
@gongzg gongzg Fix segfault when VIENNACL_CACHE_PATH is not set.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
dab271d
@gongzg gongzg Add fused activation function macros.
These macros were missing from the previous commit.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
44b345a
@listenlink @gongzg listenlink Enable the model-fuse script to generate a merged model, and add an example to show its usage.
1af4a95
@listenlink @gongzg listenlink 1. Enable the gemm_fast_image block computing logic;
2. Refine the inner product auto-tuning logic.

Change-Id: I20b74574845a2d0d0b33fb0de340c5346d763897
dd8555a
@gongzg gongzg Always allocate zero-copy capable memory.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
6ab9944
@gongzg wzw Fix a "nan" value bug in matvec_mul.cl.
If the output buffer is not initialized, the kernel code
"result[row_gid] = alpha * work[0] + beta * result[row_gid];"
makes "result[row_gid]" become "nan" regardless of whether "beta"
is zero.
9681b26
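The bug described above reproduces in plain IEEE-754 arithmetic: if the uninitialized output slot happens to contain NaN, then beta * result is NaN even when beta is 0.0, and NaN propagates through the addition. A sketch (illustrative Python, not the kernel code):

```python
# Why reading an uninitialized output buffer poisons the result:
# 0.0 * NaN is NaN, not 0.0, so the beta == 0 case must skip the read entirely.

nan = float("nan")
alpha, beta, work0 = 1.0, 0.0, 42.0

buggy = alpha * work0 + beta * nan                      # NaN despite beta == 0
fixed = alpha * work0 if beta == 0.0 else alpha * work0 + beta * nan

assert buggy != buggy   # NaN compares unequal to itself
assert fixed == 42.0
```

The fix is therefore to write the output without reading its previous value whenever beta is zero.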
@gongzg gongzg Enable FP16 support for the OpenCL backend.
This patch depends on the FP16 version of clBLAS, which is only supported
by ISAAC, so FP16 support is only enabled when ISAAC is used.
On Intel platforms, FP16 achieves about 1.3x to 1.7x the performance of
FP32, depending on the net model and batch size.

The patch introduces FP16 support into the framework.
Currently, it passes 90% of the half-type test cases.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
5cb43d8
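A rough stdlib-only illustration of why gradient checks dominate the FP16 failures: IEEE-754 half precision keeps only 10 mantissa bits, so the tiny perturbations used by numeric gradient checking can round away entirely. (This uses Python's `struct` half-precision format code `e`; it is an illustration, not code from the PR.)

```python
import struct

def to_half(x):
    """Round a Python float through IEEE-754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# At magnitude 1024 (= 2**10), the half-precision spacing is 1.0, so a
# 0.25 perturbation vanishes while a 1.0 perturbation survives:
assert to_half(1024.0 + 0.25) == 1024.0
assert to_half(1024.0 + 1.0) == 1025.0
```

Finite-difference checks that rely on perturbations below the local spacing therefore report spurious gradient errors at FP16, while forward inference stays accurate.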
@wujunkai166 @gongzg wujunkai166 Optimize buffer based gemm_nt kernel with both float and fp16 versions. ca71f1b
@gongzg gongzg Add negative-slope support for ReLU fusion.
To support yolo2, we need negative_slope support in the fused ReLU.

Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
683bd2f
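The negative_slope parameter above gives the fused ReLU Caffe's leaky-ReLU semantics: f(x) = x for x > 0, and negative_slope * x otherwise. A minimal sketch (illustrative; the slope value 0.1 is an assumption, commonly used by YOLO-style models):

```python
# Leaky ReLU as used by the fused activation; negative_slope = 0 recovers
# the plain ReLU case.

def leaky_relu(x, negative_slope=0.1):
    return x if x > 0 else negative_slope * x

assert leaky_relu(5.0) == 5.0
assert leaky_relu(-4.0) == -0.4
assert leaky_relu(-4.0, negative_slope=0.0) == 0.0  # plain ReLU
```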
@gongzg gongzg We still need EXAMPLES_SOURCE_DIR for some test cases.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
b7f36b3
@gongzg gongzg Fix two OCL kernel compilation warnings.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
3733407
@gongzg gongzg Fix the inner product layer for non-Intel platforms.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
5072f06
@gongzg gongzg Fix kernel compilation issue for non-Intel Gen platforms.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
3031f4e
@gongzg gongzg Adjust test cases for half precision.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
1a99aea
@gongzg gongzg Fix LRN fusion for non-Intel Gen platforms.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
12f62b3
@gongzg gongzg Lint fix.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
03cd924
@gongzg gongzg Move half.hpp's license to 3rdparty/half.
e38c619
Member

naibaf7 commented Jul 6, 2017 edited

Ok, I will review this in parallel while developing the improved Caffe for MPN+MOE as discussed. This may take a while.

I will merge this into https://github.com/naibaf7/caffe right away for current development, but not yet into the OpenCL branch.

I can already tell this is great work! Thanks for the contribution :)

naibaf7 self-requested a review Jul 6, 2017

naibaf7 self-assigned this Jul 6, 2017

@naibaf7 naibaf7 requested review from shelhamer and jeffdonahue Jul 6, 2017

naibaf7 added this to the Future milestone Jul 6, 2017
