This PR's major contributions are the following two features:
FP16 support for the OCL and CPU devices. As it depends on a half-precision clBLAS, ISAAC must be used to enable the FP16 feature.
Layer fusion support. For inference optimization, some layers can be fused into one, for example CONV + RELU, CONV + BN + SCALE + RELU, etc.
The FP16 version currently achieves about a 91% pass rate. Most of the failures are in the gradient check and solver test cases. The inference path works fine, and I have already verified it on my tree with both SSD and yolo2. The FP16 version gets a very good performance gain compared with FP32 for the yolo2 model. I will fix those failures eventually.
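The CONV + BN + SCALE + RELU fusion works at inference time because BatchNorm and Scale are per-channel affine transforms, so they fold directly into the convolution's weights and bias, and ReLU becomes a clamp inside the same kernel. A minimal host-side sketch of the folding math (the `fold_bn_scale` helper is hypothetical, not this PR's actual kernel code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical inference-time folding of BatchNorm + Scale into a conv's
// weights and bias, per output channel (a sketch, not this PR's code).
//   y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta
//     = conv'(x) + b'   with  w' = w * s,  b' = (b - mean) * s + beta,
//   where s = gamma / sqrt(var + eps).
struct Channel {
    std::vector<float> w;  // conv weights for this output channel
    float b;               // conv bias
};

void fold_bn_scale(Channel& c, float mean, float var,
                   float gamma, float beta, float eps = 1e-5f) {
    float s = gamma / std::sqrt(var + eps);
    for (float& w : c.w) w *= s;           // scale the weights
    c.b = (c.b - mean) * s + beta;         // fold BN/Scale into the bias
}

// Fused ReLU is just a clamp applied to the conv output in the same kernel.
float relu(float x) { return x > 0.f ? x : 0.f; }
```

After folding, the fused kernel computes only the convolution plus the clamp, which is where the inference speedup comes from.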
The image-based interface is much faster than the buffer-based GEMM interface in ISAAC. Until a new image-based interface is added to ISAAC, we may have to keep this implementation.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
As all memory in the OpenCL backend is allocated with a qualified size and alignment, we can skip the size check.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
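The reasoning above: if every buffer the OpenCL backend hands out is already rounded up to a qualified size and alignment, a later size check can never fail and is safe to drop. A sketch of the usual power-of-two round-up (hypothetical helper, not this PR's actual allocator):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical round-up of an allocation size to a power-of-two alignment,
// as OpenCL backends commonly do; a sketch, not this PR's allocator code.
// Every returned size is a multiple of `alignment` and >= `size`.
static size_t align_up(size_t size, size_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
}
```

Because every allocation goes through such a round-up, any access within the requested size is guaranteed to fit, which is what makes the size check redundant.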
By default, we no longer measure per-layer timing, as the per-layer timing measurement brings significant overhead for many net models.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
For compilers that do not support the double type, using implicit double-type constants may cause issues.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
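The underlying issue: in OpenCL C, as in C and C++, an unsuffixed floating constant such as `0.5` has type double, so an expression like `x * 0.5` promotes a float `x` to double and breaks on compilers without double support; writing `0.5f` keeps the whole expression in float. The typing rule can be checked on the host:

```cpp
#include <cassert>
#include <type_traits>

// An unsuffixed floating constant is double; the 'f' suffix makes it float.
// In an OpenCL kernel, `x * 0.5` promotes a float x to double, which breaks
// on compilers without double support; `x * 0.5f` stays entirely in float.
static_assert(std::is_same<decltype(0.5), double>::value, "0.5 is double");
static_assert(std::is_same<decltype(0.5f), float>::value, "0.5f is float");
static_assert(std::is_same<decltype(1.0f * 0.5), double>::value,
              "a double constant promotes the float operand");
static_assert(std::is_same<decltype(1.0f * 0.5f), float>::value,
              "float constants keep the expression in float");
```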
This bug is caused by two hacky cases:
1. SharedWeights is called from the RNN layer's forward-only path, which may change an inner product layer's weight data in a hacky way after the layer setup.
2. The LSTM gradient test case sets the phase to TEST, but it still calls into the inner product layer's backward path.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
As we are using a dynamic image size now, there is no need to use the last block's width and height. This patch fixes some of the performance regression caused by the dynamic image size change, but there is still a performance gap compared with the previous constant-image-size version.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
If the output buffer is not initialized, the kernel code
"result[row_gid] = alpha * work[0] + beta * result[row_gid];"
can leave "result[row_gid]" as NaN, no matter whether "beta" is zero.
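This is the standard axpby pitfall: under IEEE-754 arithmetic, 0 * NaN is NaN, so multiplying an uninitialized buffer's garbage by beta == 0 does not discard it; the safe pattern branches on beta instead of relying on the multiply. A host-side illustration (hypothetical helpers that mirror the kernel expression, not this PR's actual fix):

```cpp
#include <cassert>
#include <cmath>

// 0 * NaN is NaN under IEEE-754, so `alpha * work + beta * result` does not
// protect against an uninitialized (NaN) result buffer even when beta == 0.
float axpby_unsafe(float alpha, float work, float beta, float result) {
    return alpha * work + beta * result;   // NaN whenever result is NaN
}

// The safe pattern: when beta is zero, ignore the old buffer contents.
float axpby_safe(float alpha, float work, float beta, float result) {
    return beta == 0.f ? alpha * work
                       : alpha * work + beta * result;
}
```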
This patch depends on an FP16 version of clBLAS, which is only supported by ISAAC, so FP16 support is only enabled when ISAAC is used.
On Intel platforms, FP16 gets about 1.3x to 1.7x the performance of FP32, depending on the net model and batch size.
The patch introduces FP16 support into the framework.
Currently, it passes 90% of the half-type test cases.
Signed-off-by: Zhigang Gong <zhigang.gong@intel.com>
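The remaining test failures mentioned above are consistent with half precision itself: FP16 keeps only a 10-bit mantissa (roughly 3 decimal digits), below what finite-difference gradient checks typically need. A simplified round trip through FP16 precision (normal numbers only, mantissa truncation instead of rounding; a sketch, not the actual ISAAC/clBLAS conversion path):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Simplified FP32 -> FP16 -> FP32 round trip: keep the sign and exponent,
// truncate the 23-bit mantissa down to FP16's 10 bits (normals only, no
// rounding, no range clamping) to show how much precision half discards.
float fp16_round_trip(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    uint32_t mant10 = (bits & 0x7FFFFFu) >> 13;            // top 10 mantissa bits
    uint32_t out = (bits & 0xFF800000u) | (mant10 << 13);  // sign + exponent kept
    float r;
    std::memcpy(&r, &out, sizeof r);
    return r;
}
```

A perturbation of 1e-4 on 1.0 vanishes entirely at this precision, which is exactly the regime where numeric gradient checks operate.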
gongzg commented Jul 6, 2017