Improve layer_norm speed #9355
Conversation
Transformer on a single device: step time reduces from 0.157 to 0.125.
That is great!
The Eigen-based functions are very general and are slow in many places.
#ifdef PADDLE_WITH_CUDA
template <typename T>
class RowwiseMean2D<platform::CUDADeviceContext, T> {
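To make the intent of this specialization concrete, here is a minimal CPU reference for the semantics `RowwiseMean2D` provides: for a row-major `[left x right]` matrix, output element `i` is the mean of row `i`. This is an illustrative sketch, not the PR's CUDA code; the function name `RowwiseMean2DRef` is made up here, and the note about expressing the reduction as a GEMV is an assumption about why it beats the Eigen path.

```cpp
#include <cassert>
#include <vector>

// CPU reference for the row-wise mean over a row-major [left x right]
// matrix: out[i] = mean of row i. On the GPU this reduction can be
// phrased as a matrix-vector product against a constant 1/right vector
// (an assumption about the fast path, not taken from this PR).
template <typename T>
void RowwiseMean2DRef(const T* in, int left, int right, T* out) {
  for (int i = 0; i < left; ++i) {
    T sum = 0;
    for (int j = 0; j < right; ++j) sum += in[i * right + j];
    out[i] = sum / static_cast<T>(right);
  }
}
```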
I think it might be better to write this function in math_function.
template <typename T>
class ColwiseSum2D<platform::CUDADeviceContext, T> {
 public:
  ColwiseSum2D(int left, int right, const platform::DeviceContext& dev_ctx)
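For reference, the semantics of `ColwiseSum2D` on a row-major `[left x right]` matrix are sketched below on the CPU: output element `j` is the sum of column `j`. The function name `ColwiseSum2DRef` is a hypothetical name for this sketch, not the PR's implementation.

```cpp
#include <cassert>
#include <vector>

// CPU reference for the column-wise sum over a row-major
// [left x right] matrix: out[j] = sum of column j.
template <typename T>
void ColwiseSum2DRef(const T* in, int left, int right, T* out) {
  for (int j = 0; j < right; ++j) out[j] = 0;
  for (int i = 0; i < left; ++i)
    for (int j = 0; j < right; ++j) out[j] += in[i * right + j];
}
```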
ColwiseSum is used in lstm_op, gru_op, sequence_expand_op and lstmp_op; maybe those ops' performance can be improved too.
This PR can be merged first; the review comments can be addressed in the next PR.
layer_norm forward and backward speed up 3x ~ 4x overall.
Transformer on a single device: step time reduces from 0.157 to 0.125.
The pre-commit hook also automatically reformatted some code.
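For context on what these helpers accelerate, a simplified CPU sketch of the layer_norm forward pass follows: each row is normalized by its own mean and variance (the quantities `RowwiseMean2D` computes), then scaled and shifted per feature. This is an illustrative reference, not the PR's CUDA kernel; the function name, epsilon handling, and per-column gamma/beta layout are assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Simplified layer_norm forward over a row-major [rows x cols] input:
// y[i][j] = (x[i][j] - mean_i) / sqrt(var_i + eps) * gamma[j] + beta[j]
// Names and the epsilon parameter are illustrative assumptions.
void LayerNormRef(const std::vector<float>& x, int rows, int cols,
                  const std::vector<float>& gamma,
                  const std::vector<float>& beta, float eps,
                  std::vector<float>* y) {
  y->resize(x.size());
  for (int i = 0; i < rows; ++i) {
    float mean = 0.0f, var = 0.0f;
    for (int j = 0; j < cols; ++j) mean += x[i * cols + j];
    mean /= cols;
    for (int j = 0; j < cols; ++j) {
      float d = x[i * cols + j] - mean;
      var += d * d;
    }
    var /= cols;
    float inv_std = 1.0f / std::sqrt(var + eps);
    for (int j = 0; j < cols; ++j)
      (*y)[i * cols + j] =
          (x[i * cols + j] - mean) * inv_std * gamma[j] + beta[j];
  }
}
```

Fusing the mean/variance reductions into efficient kernels, rather than going through the generic Eigen expressions, is the kind of change that plausibly yields the 3x ~ 4x speedup reported above.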