Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add fix op run order pass #34427

Merged
merged 7 commits into from Jul 29, 2021

Conversation

sneaxiy
Copy link
Collaborator

@sneaxiy sneaxiy commented Jul 27, 2021

PR types

Performance optimization

PR changes

APIs

Describe

Add fix_op_run_order pass.

MLPerf ResNet50模型优化。

使用nsys profile发现Paddle单机8卡AllReduce时间明显比MxNet慢。原因有2个:

  • MxNet一个batch只有2次AllReduce,一个FP32,一个FP16。而Paddle一个batch有8个AllReduce,4个FP32和4个FP16。本PR将FLAGS_fuse_parameter_memory_size=0的情况进行处理,将其处理成2个AllReduce,即跟MxNet一样,一个FP32,一个FP16。
  • MxNet多卡间的AllReduce步调较为一致(起AllReduce的时刻基本相同)。而Paddle多卡间的AllReduce步调不一致,导致AllReduce被拖慢。本PR加了一个fix_op_run_order_pass,固定多进程间的OP执行次序。

本PR经过上述两个改动后,MLPerf ResNet50模型单机8卡加速比从7.69提升至7.83。MxNet目前单机8卡加速比是7.92。

框架 单卡速度 单机8卡速度 加速比
MxNet 3349 26518 7.92
Paddle(优化前) 3009 23141 7.69
Paddle(优化后) 3009 23559 7.83

@paddle-bot-old
Copy link

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

}
}

VLOG(10) << "Found unchanged OpDesc " << op_to_idx.size() << ", new OpDesc "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

是node_to_idx.size()吧,op_to_idx是原始program的,可以再打印一个Origin OpDesc数量

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

auto &pending_ops = graph_view.PendingOps(cur_op);
tmp_ops.clear();
for (auto *pending_op : pending_ops) {
if (visited_ops.count(pending_op) > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

理论上是不是不会出现被访问执行多次的情况啊,感觉加个断言保险点

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个visited_ops其实可以去掉,已去掉。

auto *prev_op = sorted_ops[i - 1];
auto *cur_op = sorted_ops[i];
auto *dep_var = new details::DummyVarHandle(graph->CreateControlDepVar());
graph->Get<details::GraphDepVars>(details::kGraphDepVars).emplace(dep_var);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

提个小小的建议,可以判断一下prev_op和cur_op是否已经有var依赖了,如果有可以不加dep_var

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个其实比较起来还比较复杂。因为有var依赖其实可以是间接好几层的依赖。正确的做法是把reference_count_pass.cc里的ShrinkDepsFunctor的代码挪过来来看。但考虑到改动比较大,可以考虑下一个PR改。

// sort next ready ops by node index
std::sort(tmp_ops.begin(), tmp_ops.end(), comp);
for (auto *op : tmp_ops) {
q.push(op);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fast_threaded_ssa_graph_executor调度里面,原来有些op是高优先级的,这部分需要加进来吗

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

有道理,已添加。

// add one more thread for generate op_deps
prepare_pool_(1) {
if (ir::IsTopologySortOperationsUnique(*graph_)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个调度器感觉也可以简化一下,直接搞成顺溜执行的0.0

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个已经是有了的,里面有traced_ops_来做这个事情。

Copy link
Contributor

@wangxicoding wangxicoding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sneaxiy sneaxiy merged commit 79e758c into PaddlePaddle:develop Jul 29, 2021
@sneaxiy sneaxiy deleted the add_fix_op_run_order_pass branch July 29, 2021 13:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants