
Centralize the use of simd intrinsic and implement scalar kernels. #2299

Merged (10 commits into PaddlePaddle:develop) Jun 5, 2017

Conversation

@Xreki (Contributor) commented on May 27, 2017

Major modifications are listed as follows:

  1. Move the SIMD-specific implementations into independent files; see hl_cpu_simd_sse.cuh and hl_cpu_simd_neon.cuh.
  2. Add a scalar implementation that does not use any extended instruction set; see hl_cpu_scalar.cuh.

As a result, hl_matrix_base_[sse/neon].cuh and hl_[sse/neon]_matrix_kernel.cuh, which were almost identical to each other, are no longer needed.

@hedaoyuan (Contributor) left a comment

Also, see whether the basic add, mul, sub, and div operations in hl_cpu_scalar.cuh, hl_cpu_simd_sse.cuh, and hl_cpu_simd_neon.cuh can be moved into hl_tensor_ops.h.

#elif defined(__SSE3__)
#include "hl_cpu_simd_sse.cuh"
#elif (defined(__ARM_NEON) || defined(__ARM_NEON__)) && !defined(__NVCC__)
#include "hl_cpu_simd_neon.cuh"
Contributor

Add a comment here explaining why the __NVCC__ macro is checked.
Also, in an ARM+GPU environment, the ARM side cannot use NEON instructions this way. A TODO should be added; this still needs to be fixed later.

Contributor Author

Done

@Xreki (Contributor Author) commented on May 27, 2017

@hedaoyuan

> See whether the basic add, mul, sub, and div operations in hl_cpu_scalar.cuh, hl_cpu_simd_sse.cuh, and hl_cpu_simd_neon.cuh can be moved into hl_tensor_ops.h.

hl_tensor_ops.cuh is currently a pure scalar implementation. Do you mean wrapping the SIMD instructions in the style of hl_tensor_ops.cuh, implementing three versions such as hl_tensor_ops_scalar.cuh, hl_tensor_ops_sse.cuh, and hl_tensor_ops_neon.cuh, and then changing hl_matrix_base_detail.cuh to call the operations defined in hl_tensor_ops_[scalar/sse/neon].cuh?

Besides, Paddle has many other kernels implemented with AVX that currently have no NEON version. One goal of wrapping the basic operations at this layer is that the wrapped interfaces can later replace direct calls to the _mm256_** intrinsics, eliminate the #ifdef __AVX__ checks scattered everywhere, and make it easier to extend to NEON or other instruction sets.

The operations in hl_matrix_base.cuh could all be implemented with AVX as well; I just don't know why an AVX version was never implemented.

One more question: the current approach does not account for runtime selection of the instruction set. Would it be better not to use unified names?

@hedaoyuan (Contributor)

> Implementing three versions such as hl_tensor_ops_scalar.cuh, hl_tensor_ops_sse.cuh, and hl_tensor_ops_neon.cuh?

Not three versions; implement a single version. hl_tensor_ops.cuh defines template classes, which can be instantiated with different parameter types.

> I just don't know why an AVX version was never implemented.

It simply was never implemented.

> The current approach does not account for runtime selection of the instruction set. Would it be better not to use unified names?

Does that matter? Without unified names, would you use if/else at the call sites to distinguish them?

@Xreki (Contributor Author) commented on May 27, 2017

@hedaoyuan

> Not three versions; implement a single version. hl_tensor_ops.cuh defines template classes, which can be instantiated with different parameter types.

I see.

> Does that matter? Without unified names, would you use if/else at the call sites to distinguish them?

I see. It no longer matters: with unified names, the versions can be distinguished by type.

@hedaoyuan (Contributor) left a comment

Let's leave it like this for now.
Later, vecType needs to be changed into a template class like the one below; that makes it possible to remove the PADDLE_TYPE_DOUBLE macro and to extend to types such as int.

template <class T, int size>
class Packet;

@Xreki Xreki merged commit e36e24d into PaddlePaddle:develop Jun 5, 2017
@Xreki Xreki moved this from In Progress to Build System in Embedded and Mobile Deployment Jun 6, 2017
@hedaoyuan hedaoyuan moved this from Build System & Build Optimize to Model Compression in Embedded and Mobile Deployment Oct 10, 2017
@hedaoyuan hedaoyuan moved this from Model Compression to Neon Optimize in Embedded and Mobile Deployment Oct 10, 2017
@Xreki Xreki deleted the support_scalar_kernels branch October 18, 2017 06:17
Labels: none
Projects: Embedded and Mobile Deployment (Neon Optimize & Low Precision)
2 participants