Optimize the implementation of CUDA dropout #19136
Merged
The dropout op uses the Thrust API to generate random numbers, which is too slow; using cuRAND instead speeds it up.
PaddlePaddle/benchmark#148
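Average time of the dropout op in transformer-big (enable_ce), measured with PaddlePaddle's profiler tool:
Ave Time: 1.16155 ----> 0.344537 (about 3.4x faster)
Speedup on the transformer-big model: performance improves by about 10%
Speedup on the transformer-base model: performance improves by about 12%
After the change to the CUDA implementation, there is no multi-GPU randomness problem.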
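
For readers who want a feel for what moving random number generation to cuRAND can look like, here is a minimal standalone sketch. It is not the code from this PR: the kernel name `DropoutSketch`, the grid-stride loop, the Philox generator, the seed value, and the upscale-at-train-time scaling are all assumptions made for illustration. Each thread initializes its own cuRAND state once and then draws one uniform number per element it handles.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Hypothetical kernel, not taken from the PR. Each thread owns one cuRAND
// state (its own subsequence of the same seed) and walks the input with a
// grid-stride loop, drawing one uniform number per element.
__global__ void DropoutSketch(const float* src, float* dst,
                              unsigned char* mask, size_t n,
                              float dropout_prob, unsigned long long seed) {
  size_t idx = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;

  curandStatePhilox4_32_10_t state;
  curand_init(seed, idx, 0, &state);  // subsequence = global thread index

  // Assumption: inverted dropout, i.e. kept values are scaled by 1/(1-p).
  float scale = dropout_prob < 1.0f ? 1.0f / (1.0f - dropout_prob) : 0.0f;

  for (; idx < n; idx += stride) {
    float r = curand_uniform(&state);  // uniform in (0, 1]
    bool keep = r > dropout_prob;      // dropped with probability p
    mask[idx] = keep ? 1 : 0;
    dst[idx] = keep ? src[idx] * scale : 0.0f;
  }
}

int main() {
  const size_t n = 1 << 20;
  const float p = 0.1f;  // illustrative dropout probability

  float* src = nullptr;
  float* dst = nullptr;
  unsigned char* mask = nullptr;
  if (cudaMallocManaged(&src, n * sizeof(float)) != cudaSuccess ||
      cudaMallocManaged(&dst, n * sizeof(float)) != cudaSuccess ||
      cudaMallocManaged(&mask, n * sizeof(unsigned char)) != cudaSuccess) {
    fprintf(stderr, "allocation failed\n");
    return 1;
  }
  for (size_t i = 0; i < n; ++i) src[i] = 1.0f;

  DropoutSketch<<<256, 256>>>(src, dst, mask, n, p, 1234ULL);
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // Count how many elements survived; the fraction should be close to 1 - p.
  size_t kept = 0;
  for (size_t i = 0; i < n; ++i) kept += mask[i];
  printf("kept fraction: %f\n", static_cast<double>(kept) / n);

  cudaFree(src);
  cudaFree(dst);
  cudaFree(mask);
  return 0;
}
```

The sketch can be built with `nvcc dropout_sketch.cu -o dropout_sketch` (the file name is arbitrary). Giving each thread a distinct subsequence of one seed is one common way to keep the per-thread streams independent; whether the PR uses the same scheme is not stated in the description above.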