
Autotune the workspace_size_limit in conv. #40338

Merged (48 commits) on Apr 9, 2022

Conversation

@JamesLim-sy (Contributor) commented Mar 8, 2022

PR types

Performance optimization

PR changes

OPs

Describe

  1. The changes to CudnnConvKernel mainly cover the following:
  • By calling platform::GpuAvailableMemToAlloc and memory::StatGetCurrentValue, query the currently free device memory and the memory held in the allocator pool, and use them as the workspace_size_limit, dynamically adjusting the upper bound of the usable workspace_size.

    • Enabling condition: since some models such as CycleGAN show a performance regression, the feature is currently enabled only when FLAGS_conv_workspace_size_limit < 0; when FLAGS_conv_workspace_size_limit >= 0 the fixed workspace_size_limit is still used. A follow-up PR will improve the scheme and enable it by default.
    • Note:
      • With a non-fixed workspace_size_limit, the search phase creates many small, differently-sized, one-off workspaces, which noticeably increases device memory usage during training when the batch size is small. Solution: during the search, allocate workspaces directly with cudaMalloc and cudaFree so that they are not cached by the memory pool.
  • Change the return value of SearchAlgorithm::Find: define a SearchResult type so that Find returns {algorithm, time_cost, workspace_size} instead of the algorithm alone. This lets ConvKernel obtain both the algorithm chosen by cudnnFindXxx and its time cost, making it easy to extract the algo_profiler_result later.

  • Add a new conv_base_helper.h file: consolidate the reusable components of conv_cudnn_helper.h and conv_miopen_helper.h to reduce duplicated code.

  2. Results after the change:
  • OP performance (on an A100-40G machine):
    Before: image
    After: image

  • SearchAlgorithm::Find return value:
    Testing shows that with the SearchResult return type, ConvKernel can obtain the time cost (time_cost) of the best algorithm found by SearchAlgorithm::Find.
    image

  • Model level
    a. For the first convolution of the MLPerf ResNet50 model shown below, the best performance is obtained with FLAGS_cudnn_exhaustive_search=1 and without setting FLAGS_conv_workspace_size_limit.
    image
    b. Passed the runnable test over all models on the Benchmark platform.
    c. Passed runnable tests, with batch sizes chosen to fill device memory, on the following PaddleClas models that make heavy use of Conv2d.

| Model                  | bs   | Precision | A100-40G | V100-32G |
|------------------------|------|-----------|----------|----------|
| HRNet_W48_C            | 128  | fp32      |          |          |
| MobileNetV1            | 768  | fp32      |          |          |
| MobileNetV2            | 560  | fp32      |          |          |
| ShuffleNetV2_x1_0      | 1536 | fp32      |          |          |
| MobileNetV3_large_x1_0 | 640  | fp32      |          |          |
| alt_gvt_base           | 152  | fp32      |          |          |
| SwinTransformer        | 104  | fp32      |          |          |

This PR needs to be tested and performance-verified together with #41313, so it is merged first and will be optimized in follow-ups.
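The flag-based switch described in the first bullet above can be sketched as follows. This is a minimal, self-contained illustration: `GetWorkspaceSizeLimit` and its parameters are hypothetical names for this sketch, not Paddle's actual API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch of the enabling condition: a negative
// FLAGS_conv_workspace_size_limit turns on the dynamic limit (bounded
// by the currently allocatable device memory), while a non-negative
// value keeps the old fixed limit (the flag is expressed in MB).
size_t GetWorkspaceSizeLimit(int64_t flag_mb, size_t available_bytes) {
  if (flag_mb < 0) {
    return available_bytes;                    // autotuned limit
  }
  return static_cast<size_t>(flag_mb) << 20;   // fixed limit, MB -> bytes
}
```

With this shape, existing users who set the flag keep the old behavior, and the autotuned path is strictly opt-in until it is made the default.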

paddle-bot-old (bot) commented Mar 8, 2022

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@@ -583,7 +583,7 @@ void Conv3DOpMaker::Make() {
"(bool, default false) cuDNN has many algorithm to calculation "
"convolution, whether enable exhaustive search "
"for cuDNN convolution or not, default is False.")
.SetDefault(false)
.SetDefault(true)
Contributor

This default value shouldn't be changed, should it? Setting it to true enables exhaustive search by default.

Contributor Author — @JamesLim-sy, Mar 22, 2022

This change was made so the full-model tests can run on the Benchmark machine without manually editing every Benchmark script. The full test run has now finished; I will push a commit restoring the default to false.

@JamesLim-sy changed the title from "Add basic kernel metrics tools" to "Add basic kernel metrics tool" on Mar 27, 2022

// Container of SearchAlgorithm::Find() results.
template <typename Algo_t>
struct AlgoResult {
Contributor

Algo_t -> AlgoT
AlgoResult -> SearchResult
A constructor could be added for convenient initialization.

Contributor Author

Fixed as suggested.
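A minimal sketch of what the suggested struct could look like after the rename and the convenience constructor. Field names follow the discussion above ({algo, time, workspace_size}); this is an illustration, not necessarily the merged code.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of SearchResult with the constructor the reviewer suggested.
// AlgoT would be e.g. cudnnConvolutionFwdAlgo_t in the real code.
template <typename AlgoT>
struct SearchResult {
  SearchResult() = default;
  explicit SearchResult(AlgoT algo_in, float time_in = -1.0f,
                        size_t workspace_in = 0)
      : algo(algo_in), time(time_in), workspace_size(workspace_in) {}

  AlgoT algo;                 // chosen algorithm
  float time{-1.0f};          // measured time cost in ms (-1 = unknown)
  size_t workspace_size{0};   // workspace bytes the algorithm needs
};
```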

using ScalingParamType = typename platform::CudnnDataType<T>::ScalingParamType;

// As the base for the SearchAlgorithm struct.
template <typename conv_t>
Contributor

This was originally PerfT.

Contributor Author

It was indeed PerfT originally; this was a typo. I will change it back to PerfT.

};

// Container of conv-related descriptors.
template <typename Handle_t, typename Data_t>
Contributor

typename T, typename Handle

Contributor Author

Fixed as suggested.

platform::FilterDescriptor wdesc;
platform::ConvolutionDescriptor cdesc;
const framework::Tensor *x, *w, *o;
Data_t cudnn_dtype;
Contributor

platform::CudnnDataType<T>::type cudnn_dtype;

Contributor Author

It was originally DECLARE_TYPE_FOR_GPU(dnnDataType_t, cudnnDataType_t, miopenDataType_t);
CudnnDataType is consistent with miopenDataType_t. Fixed as suggested.

Contributor Author

After this change, the code goes from implicit template argument passing,

  paddle::operators::ConvArgs<T> args{&transformed_input,
                                   &transformed_filter_channel,
                                   &transformed_output,
                                   strides,
                                   padding_common,
                                   dilations,
                                   dtype};

to explicit passing:

  paddle::operators::ConvArgs<cudnnHandle_t, T> args{&transformed_input,
                                   &transformed_filter_channel,
                                   &transformed_output,
                                   strides,
                                   padding_common,
                                   dilations,
                                   dtype};

which, I feel, lowers the code's level of indirection.

}

// template <typename algo_t>
// struct SearchAlgorithm {};
Contributor

Delete the commented-out code at L97 - L98.

Contributor Author

Fixed as suggested.

*algo = result.algo;
algo_result->algo = result.algo;
algo_result->time = result.time;
algo_result->workspace_size = result.memory;
VLOG(3) << " algo: " << result.algo << ", time: " << result.time
<< " ms, wksp = " << result.memory
<< ", status = " << result.status;
Contributor

Also print the value of the workspace_bytes limit.

Contributor Author

Added.

workspace_size_limit, &algo);
ChooseAlgoByWorkspace<perf_t, algo_result_t>(
perf_results.get(), kNUM_CUDNN_FWD_ALGS, workspace_size_limit,
&result);
Contributor

The logic at L133 - L141 could be optimized.
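The idea behind the ChooseAlgoByWorkspace call quoted above — pick the fastest candidate whose workspace fits within the limit — can be sketched as below. This is a hypothetical simplification with a mocked perf-result type; the real helper in Paddle also checks each result's status field.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a cuDNN perf result entry.
struct PerfResult {
  int algo;       // algorithm id
  float time;     // measured time in ms
  size_t memory;  // required workspace in bytes
};

// Return the index of the fastest candidate whose workspace fits
// within workspace_limit, or -1 if no candidate fits.
int ChooseAlgoByWorkspace(const std::vector<PerfResult>& results,
                          size_t workspace_limit) {
  int best = -1;
  for (size_t i = 0; i < results.size(); ++i) {
    if (results[i].memory > workspace_limit) continue;  // too large
    if (best < 0 || results[i].time < results[best].time) {
      best = static_cast<int>(i);
    }
  }
  return best;
}
```

Keeping this filter in one helper is what lets both the forward and the two backward searches share the selection logic.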

platform::dynload::
cudnnGetConvolutionBackwardFilterAlgorithmMaxCount(args.handle,
&max_algos));
#endif
Contributor

kNUM_CUDNN_BWD_FILTER_ALGS could be used here.

Contributor Author

It was like this originally, and I'm not sure whether anything is set before #if CUDNN_VERSION_MIN(7, 0, 1), so I kept int max_algos = 0;
image

struct KernelMetricsTool {
public:
KernelMetricsTool() {}
size_t GpuMemoryQuery() const { return GpuAvailableMemToAlloc(); }
Contributor

Will anything else be added to this file? If it is just this one function call, a separate wrapper seems unnecessary.

Contributor Author

Right, the core is just the GpuAvailableMemToAlloc() logic. I had considered adding a GpuTimer component to this wrapper, but since that is already implemented elsewhere, this can be removed.
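For reference, the memory query described in the PR description can be sketched as below. Both query functions are mocked here — in Paddle the real sources are platform::GpuAvailableMemToAlloc and the allocator's StatGetCurrentValue statistics — and combining them by simple addition is an assumption of this sketch, not a statement about the merged code.

```cpp
#include <cassert>
#include <cstddef>

// Mocked device queries: the real code asks the driver for free
// memory and the allocator pool for the bytes it currently holds.
size_t MockGpuAvailableMemToAlloc() { return 4UL << 30; }  // 4 GiB free
size_t MockPoolHeldBytes() { return 1UL << 30; }           // 1 GiB cached

// Sketch: bound the autotuned workspace limit by what could be
// allocated right now -- driver-free memory plus memory the pool
// could release or reuse.
size_t AutotunedWorkspaceLimit() {
  return MockGpuAvailableMemToAlloc() + MockPoolHeldBytes();
}
```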

workspace_size_d =
std::max(workspace_size_d, search1::GetWorkspaceSize(args1, data_algo));
bwd_result = search1::Find<T>(args1, exhaustive_search, deterministic, ctx);
VLOG(3) << "bwd data algo: " << bwd_result.algo << ", time "
Contributor

Find already prints this information internally; there is no need to log it here as well.

Contributor Author

Removed as suggested.

@Xreki Xreki changed the title Add basic kernel metrics tool Autotune the workspace_size_limit in conv. Apr 1, 2022
@Xreki Xreki force-pushed the add_kernel_metrics_tools branch 2 times, most recently from 5ffbdf9 to 451a3bf Compare April 7, 2022 08:58
zhangting2020 previously approved these changes Apr 8, 2022

Contributor — @zhangting2020 left a comment

LGTM

@Xreki Xreki merged commit b937cdc into PaddlePaddle:develop Apr 9, 2022
douch pushed a commit to douch/Paddle that referenced this pull request Apr 10, 2022
* Use the maximum workspace_size of all algorithms to limit the workspace size in exhaustive search mode.

* Use the system cudaMalloc and cudaFree to allocate workspace during searching.

* Enable switching between the two kinds of workspace setting methods.

Co-authored-by: Liu Yiqun <liuyiqun01@baidu.com>
Xreki added a commit to Xreki/Paddle that referenced this pull request Apr 15, 2022
* Use the maximum workspace_size of all algorithms to limit the workspace size in exhaustive search mode.

* Use the system cudaMalloc and cudaFree to allocate workspace during searching.

* Enable switching between the two kinds of workspace setting methods.

Co-authored-by: Liu Yiqun <liuyiqun01@baidu.com>