
Support reduce_sum_op float16 #32966

Merged
merged 8 commits into PaddlePaddle:develop on Jun 15, 2021

Conversation

@thisjiang (Contributor) commented on May 18, 2021

PR types

New features

PR changes

OPs

Describe

Background

Mixed-precision (pure_fp16) training does not support optimizers with gradient clipping. One of the underlying problems is that reduce_sum_op fails to compile when it encounters the paddle::platform::float16 type.

Investigation

From the build log, the compilation error occurs at cub::Reduce, which means cub::Reduce does not support the paddle::platform::float16 type.

Digging further, the error is triggered by a paddle::platform::float16 to float conversion. In isolated experiments, both float(float16_num) and static_cast<float>(float16_num) compile fine, but float num = float16_num fails, and with the same error message as the build. We can therefore conclude that cub::Reduce fails because its internal implementation assigns float16 values directly (via =) to float variables.
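The experiment can be reproduced with a minimal sketch along the following lines (assuming a float16 type whose float conversion operator is explicit; this is an illustration, not the actual paddle::platform::float16 definition):

// Stand-in for a half-precision type whose conversion to float is explicit.
struct float16 {
  unsigned short x;
  explicit operator float() const { return 0.0f; }  // body omitted for brevity
};

void experiment(const float16& float16_num) {
  float a = float(float16_num);               // OK: functional-style cast
  float b = static_cast<float>(float16_num);  // OK: explicit cast
  // float c = float16_num;                   // fails: copy-initialization
  //                                          // ignores explicit conversions
  (void)a; (void)b;
}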

Solutions

Option 1

Since this is a type-conversion error, the input and output types must differ: the input is paddle::platform::float16 and the output is float. Looking at the code, at TensorReduceFunctor::apply the input type is Tx, and TransformOp reads the data and also returns Tx after processing, but the output type is Ty, so a conversion occurs whenever Ty is not equal to Tx. A simple fix is to wrap TransformOp in a conversion layer so that the input is Tx and the output is Ty:

template <typename Tx, typename Ty, typename TransformOp>
struct ConversionFunctor {
  const TransformOp& transformer;
  HOSTDEVICE explicit inline ConversionFunctor(const TransformOp& transformer)
        : transformer(transformer) {} 
  HOSTDEVICE inline Ty operator()(const Tx& x) const {
    return static_cast<Ty>(transformer(x));
  }
};
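
A hypothetical call site (ExampleTransform and the surrounding names are illustrative, not code from the PR): the original transform keeps reading and returning float16, while the wrapper hands float to the reduction, so cub::Reduce never has to assign a float16 value to a float variable internally.

// Hypothetical Tx -> Tx transform (illustrative only).
struct ExampleTransform {
  HOSTDEVICE inline paddle::platform::float16 operator()(
      const paddle::platform::float16& x) const {
    return x;
  }
};

void example(const paddle::platform::float16& x) {
  ExampleTransform transform;
  ConversionFunctor<paddle::platform::float16, float, ExampleTransform> conv(
      transform);
  float y = conv(x);  // equivalent to static_cast<float>(transform(x))
  (void)y;
}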

Option 2

Option 1 certainly works and needs no other code changes, which is convenient. The problem is that when both the input and output are float16, the accumulation itself carries a large precision error, so the better approach is to accumulate in the higher-precision float type whenever the input is float16. In code, this means introducing an MPType for the intermediate computation: float when the input is float16, and the input type itself otherwise.
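
A minimal sketch of the MPType idea (the trait name and form here are assumptions; the PR's actual implementation may differ): map float16 to float for intermediate accumulation and leave every other type unchanged.

// Default: accumulate in the input type itself.
template <typename T>
struct GetMPType {
  using type = T;
};

// float16 inputs are accumulated in float to limit rounding error.
template <>
struct GetMPType<paddle::platform::float16> {
  using type = float;
};

template <typename T>
using MPType = typename GetMPType<T>::type;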

Of course, since we cannot modify the internals of cub::Reduce, and cub::Reduce provides no parameter like MPType, we have to implement our own ReduceKernel1D. Internally, ReduceKernel1D calls cub::BlockReduce, but the computation is carried out in MPType. For consistency, MPType is also added to the other hand-written kernels, ReduceKernel and ReduceKernel2D, to preserve precision under float16.
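
A sketch of what such a hand-written kernel could look like (it reuses the MPType trait sketched above; the block size, grid-stride loop, and per-block partial output are illustrative choices, not the exact kernel in commit 5596a88). Each thread accumulates in MPType, cub::BlockReduce combines the partial sums, and only the final store casts back to Ty; a full reduction would still need a second pass over the per-block partials.

#include <cub/cub.cuh>

template <typename Tx, typename Ty, typename TransformOp, int BlockDim>
__global__ void ReduceKernel1D(const Tx* x, Ty* y, TransformOp transformer,
                               int reduce_num) {
  using MPT = typename GetMPType<Tx>::type;  // float when Tx is float16
  using BlockReduce = cub::BlockReduce<MPT, BlockDim>;
  __shared__ typename BlockReduce::TempStorage temp_storage;

  // Grid-stride loop: every thread accumulates its share in MPT precision.
  MPT local_sum = static_cast<MPT>(0);
  for (int i = blockIdx.x * BlockDim + threadIdx.x; i < reduce_num;
       i += gridDim.x * BlockDim) {
    // This is why TransformOp::operator() must accept more than one input type.
    local_sum += static_cast<MPT>(transformer(x[i]));
  }

  // Combine the per-thread partial sums within the block.
  MPT block_sum = BlockReduce(temp_storage).Sum(local_sum);
  if (threadIdx.x == 0) {
    y[blockIdx.x] = static_cast<Ty>(block_sum);  // cast back only at the end
  }
}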

The change is implemented in commit 5596a88.

Remaining issues

  1. The hand-written kernels require changing TransformOp at its definition, i.e. operator() must be allowed to accept inputs of other types. Since many ops use the TensorReduce function, the TransformOp::operator() behind each of these ops' TensorReduce calls has to be modified, which is a very large change.
  2. A hand-written kernel is unlikely to beat the mature cub::Reduce, so there may be a performance loss compared with cub::Reduce.
  3. Besides the new ReduceKernel1D, the existing ReduceKernel2D and ReduceKernel also have to be modified, which is another sizeable change.

Is there a way to avoid these three problems? For example, specializing at TensorReduceFunctor::apply: when the input and output are both float16, first force TensorReduce to output float, then cast the result back to float16?

@paddle-bot-old commented:

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@@ -301,7 +301,10 @@ template <typename T>
struct IdentityFunctor {
  HOSTDEVICE explicit inline IdentityFunctor() {}

  HOSTDEVICE inline T operator()(const T& x) const { return x; }
  template <typename T2>
Review comment (Contributor): Why use T2? Could another letter be considered? And what is T still used for?

@@ -38,7 +38,10 @@ template <typename T>
struct IdentityFunctor {
  HOSTDEVICE explicit inline IdentityFunctor() {}

  HOSTDEVICE inline T operator()(const T& x) const { return x; }
  template <typename T2>
Review comment (Contributor): Same as above.

wzzju previously approved these changes on Jun 9, 2021

@wzzju (Contributor) left a comment:

LGTM.

Xreki previously approved these changes on Jun 9, 2021

@Xreki (Contributor) left a comment:

LGTM for op benchmark ci

@@ -241,7 +241,10 @@ template <typename T>
struct IdentityFunctor {
  HOSTDEVICE explicit inline IdentityFunctor() {}

  HOSTDEVICE inline T operator()(const T& x) const { return x; }
  template <typename U>
Review comment (Contributor): One question: is the template type T in the class definition now unused?

Reply (thisjiang, Contributor, Author): Yes. U was added because without it the functor could only accept float16 arguments, so compilation would fail once float is used for accumulation. T is kept for compatibility; dropping it would require changing every place that calls IdentityFunctor.
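
The diff hunks above are truncated; a hedged sketch of the shape being discussed (the exact body of the templated operator() is an assumption) is:

template <typename T>
struct IdentityFunctor {
  HOSTDEVICE explicit inline IdentityFunctor() {}

  // Templated on U so the functor also accepts float (the MPType) when the
  // class is instantiated with T = float16; T itself is kept only so existing
  // IdentityFunctor<T> call sites continue to compile.
  template <typename U>
  HOSTDEVICE inline U operator()(const U& x) const {
    return x;
  }
};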

@thisjiang dismissed stale reviews from Xreki and wzzju via 89e4127 on June 10, 2021 11:52
@Xreki (Contributor) left a comment:

LGTM for op benchmark ci

@wzzju merged commit 606939d into PaddlePaddle:develop on Jun 15, 2021
@thisjiang deleted the optimize-reduce_sum_op_fp16 branch on July 9, 2021 03:12