
Centralize the use of simd intrinsic and implement scalar kernels. #2299

Merged (10 commits into PaddlePaddle:develop) Jun 5, 2017

Conversation

@Xreki (Contributor) commented on May 27, 2017

Major modifications are listed as follows:

  1. Move the SIMD-specific implementations into independent files; see hl_cpu_simd_sse.cuh and hl_cpu_simd_neon.cuh.
  2. Add a scalar implementation that does not use any extended instruction set; see hl_cpu_scalar.cuh.

As a result, hl_matrix_base_[sse/neon].cuh and hl_[sse/neon]_matrix_kernel.cuh, which were almost identical to each other, are no longer needed.

@hedaoyuan (Contributor) left a comment

Also, see whether the basic add, mul, sub, and div operations in hl_cpu_scalar.cuh, hl_cpu_simd_sse.cuh, and hl_cpu_simd_neon.cuh can be moved into hl_tensor_ops.h.

#elif defined(__SSE3__)
#include "hl_cpu_simd_sse.cuh"
#elif (defined(__ARM_NEON) || defined(__ARM_NEON__)) && !defined(__NVCC__)
#include "hl_cpu_simd_neon.cuh"
Contributor

Add a comment here explaining why the __NVCC__ macro is checked.
Also, in an ARM+GPU environment, the ARM side cannot use NEON instructions this way. A TODO should be added; this still needs to be fixed later.

Contributor Author

Done

@Xreki (Contributor Author) commented on May 27, 2017

@hedaoyuan

> See whether the basic add, mul, sub, and div operations in hl_cpu_scalar.cuh, hl_cpu_simd_sse.cuh, and hl_cpu_simd_neon.cuh can be moved into hl_tensor_ops.h.

hl_tensor_ops.cuh is currently a pure scalar implementation. Do you mean wrapping the SIMD instructions in the style of hl_tensor_ops.cuh, implementing three versions such as hl_tensor_ops_scalar.cuh, hl_tensor_ops_sse.cuh, and hl_tensor_ops_neon.cuh, and then changing hl_matrix_base_detail.cuh to call the operations defined in hl_tensor_ops_[scalar/sse/neon].cuh?

Besides, Paddle has many other kernels implemented with AVX that currently have no NEON version. One goal of wrapping the basic operations at this layer is that the wrapped interfaces can later replace direct calls to the _mm256_** intrinsics, eliminate the #ifdef __AVX__ checks scattered everywhere, and make it easier to extend to NEON or other instruction sets.

The operations in hl_matrix_base.cuh could all be implemented with AVX as well; I just don't know why an AVX version was never implemented.

One more question: the current approach does not account for runtime selection of the instruction set. Would it be better not to use unified names?

@hedaoyuan (Contributor)

> Implementing three versions such as hl_tensor_ops_scalar.cuh, hl_tensor_ops_sse.cuh, and hl_tensor_ops_neon.cuh?

Not three versions; implement a single version. hl_tensor_ops.cuh defines template classes, which can be instantiated with different parameter types.

> I just don't know why an AVX version was never implemented.

It simply was never implemented.

> The current approach does not account for runtime selection of the instruction set. Would it be better not to use unified names?

Does that matter? Without unified names, would you use if/else at the call sites to distinguish them?

@Xreki (Contributor Author) commented on May 27, 2017

@hedaoyuan

> Not three versions; implement a single version. hl_tensor_ops.cuh defines template classes, which can be instantiated with different parameter types.

I see.

> Does that matter? Without unified names, would you use if/else at the call sites to distinguish them?

I see. It no longer matters: with unified names, the versions can be distinguished by type.

@hedaoyuan (Contributor) left a comment

Let's leave it like this for now.
Later, vecType needs to be changed into a template class like the one below; that makes it possible to remove the PADDLE_TYPE_DOUBLE macro and to extend to types such as int.

template <class T, int size>
class Packet;

@Xreki Xreki merged commit e36e24d into PaddlePaddle:develop Jun 5, 2017
@Xreki Xreki moved this from In Progress to Build System in Embedded and Mobile Deployment Jun 6, 2017
@hedaoyuan hedaoyuan moved this from Build System & Build Optimize to Model Compression in Embedded and Mobile Deployment Oct 10, 2017
@hedaoyuan hedaoyuan moved this from Model Compression to Neon Optimize in Embedded and Mobile Deployment Oct 10, 2017
@Xreki Xreki deleted the support_scalar_kernels branch October 18, 2017 06:17
Labels: none
Projects: Embedded and Mobile Deployment (Neon Optimize & Low Precision)
2 participants