Autotune the workspace_size_limit in conv. #40338
Conversation
Thanks for your contribution!
paddle/fluid/operators/conv_op.cc
Outdated
@@ -583,7 +583,7 @@ void Conv3DOpMaker::Make() {
      "(bool, default false) cuDNN has many algorithm to calculation "
      "convolution, whether enable exhaustive search "
      "for cuDNN convolution or not, default is False.")
-     .SetDefault(false)
+     .SetDefault(true)
This default value shouldn't be changed, right? With it set to true, exhaustive search is now enabled by default.
This change was made to make it easy to run the full model test on the Benchmark machine, saving us from manually modifying every Benchmark script. The full test has now finished; I will push a commit reverting this to the default of false.
// As the container of searchAlgorithm::Find() result.
template <typename Algo_t>
struct AlgoResult {
Algo_t -> AlgoT
AlgoResult -> SearchResult
You could also implement a constructor to make initialization convenient.
Modified as suggested.
using ScalingParamType = typename platform::CudnnDataType<T>::ScalingParamType;

// As the basic for SearchAlgorithm struct.
template <typename conv_t>
Shouldn't this be the original PerfT?
It was indeed PerfT originally; this was a typo. I'll change it back to PerfT.
};

// As the container of conv relevant descriptors.
template <typename Handle_t, typename Data_t>
typename T, typename Handle
Modified as suggested.
platform::FilterDescriptor wdesc;
platform::ConvolutionDescriptor cdesc;
const framework::Tensor *x, *w, *o;
Data_t cudnn_dtype;
platform::CudnnDataType<T>::type cudnn_dtype;
It was originally DECLARE_TYPE_FOR_GPU(dnnDataType_t, cudnnDataType_t, miopenDataType_t); CudnnDataType is consistent with miopenDataType_t. Modified as suggested.
After this change, the template arguments the code previously passed implicitly:

paddle::operators::ConvArgs<T> args{&transformed_input,
                                    &transformed_filter_channel,
                                    &transformed_output,
                                    strides,
                                    padding_common,
                                    dilations,
                                    dtype};

become explicitly passed:

paddle::operators::ConvArgs<cudnnHandle_t, T> args{&transformed_input,
                                                   &transformed_filter_channel,
                                                   &transformed_output,
                                                   strides,
                                                   padding_common,
                                                   dilations,
                                                   dtype};

I feel this change lowers the code's level of indirection.
}

// template <typename algo_t>
// struct SearchAlgorithm {};
Please delete the commented-out code at L97 - L98.
Modified as suggested.
*algo = result.algo;
algo_result->algo = result.algo;
algo_result->time = result.time;
algo_result->workspace_size = result.memory;
VLOG(3) << " algo: " << result.algo << ", time: " << result.time
        << " ms, wksp = " << result.memory
        << ", status = " << result.status;
Please also print the value of workspace_bytes limit.
Added.
workspace_size_limit, &algo);
ChooseAlgoByWorkspace<perf_t, algo_result_t>(
    perf_results.get(), kNUM_CUDNN_FWD_ALGS, workspace_size_limit,
    &result);
The logic at L133 - L141 could be optimized.
platform::dynload::
    cudnnGetConvolutionBackwardFilterAlgorithmMaxCount(args.handle,
                                                       &max_algos));
#endif
Could kNUM_CUDNN_BWD_FILTER_ALGS be used here?
struct KernelMetricsTool {
 public:
  KernelMetricsTool() {}
  size_t GpuMemoryQuery() const { return GpuAvailableMemToAlloc(); }
Will other functionality be added to this file? If it is just this one function call, there seems to be no need for a separate wrapper.
Right, the core is just calling GpuAvailableMemToAlloc(). I originally considered adding a GpuTimer component to this wrapper, but since that is already implemented elsewhere, this can be removed.
workspace_size_d =
    std::max(workspace_size_d, search1::GetWorkspaceSize(args1, data_algo));
bwd_result = search1::Find<T>(args1, exhaustive_search, deterministic, ctx);
VLOG(3) << "bwd data algo: " << bwd_result.algo << ", time "
Find already prints this information internally, so there is no need to add logging here.
Removed as suggested.
LGTM
* Using the maximum workspace_size of all algorithms to limit the workspace size in exhaustive search mode.
* Use the system cudaMalloc and cudaFree to allocate workspace during searching.
* Enable switching between the two kinds of workspace setting methods.

Co-authored-by: Liu Yiqun <liuyiqun01@baidu.com>
PR types
Performance optimization
PR changes
OPs
Describe
The changes to CudnnConvKernel mainly cover the following:

* Autotuned workspace_size_limit: by calling platform::GpuAvailableMemToAlloc and memory::StatGetCurrentValue, obtain the currently free device memory and the memory held by the allocator pool, and use this as workspace_size_limit, dynamically adjusting the upper bound of the available workspace_size. This behavior is enabled when FLAGS_conv_workspace_size_limit < 0; when FLAGS_conv_workspace_size_limit >= 0 the fixed workspace_size_limit is still used. A follow-up PR will refine the scheme and enable it by default.
* With a large workspace_size_limit, the search creates many small, differently sized, one-off workspaces, which significantly increases GPU memory usage during training when the batch size is small. Solution: during the search, allocate the workspace with cudaMalloc and cudaFree so that these workspaces are not cached by the memory pool.
* Changed the return value of the SearchAlgorithm::Find method: defined the type SearchResult, so that instead of returning only the algorithm, it returns the tuple {algorithm, time_cost, workspace_size}. ConvKernel can then obtain the algorithm chosen by cudnnFindXxx and its time cost, which makes it easy to extract the subsequent algo_profiler_result.
* Added the file conv_base_helper.h: it consolidates the reusable components of conv_cudnn_helper.h and conv_miopen_helper.h, reducing redundant code.

OP performance (measured on an A100-40G machine):

Before the change: [performance table not captured on this page]
After the change: [performance table not captured on this page]

On the SearchAlgorithm::Find return value: testing shows that with the SearchResult return type, ConvKernel can obtain the time cost (time_cost) of the optimal algorithm found by SearchAlgorithm::Find.

At the model level:

a. In the first convolution of the MLPerf ResNet50 model (figure not captured), the best performance is obtained with FLAGS_cudnn_exhaustive_search=1 without setting FLAGS_conv_workspace_size_limit.
b. Passed the full-model runnability test on the Benchmark platform.
c. Passed the runnability test, with GPU memory fully occupied, on the common Conv2d-based PaddleClas models below.

This PR needs to be tested and its performance verified together with #41313, so it is merged first and optimized afterwards.