add fix op run order pass #34427
Conversation
Thanks for your contribution!
VLOG(10) << "Found unchanged OpDesc " << op_to_idx.size() << ", new OpDesc " |
Shouldn't this be node_to_idx.size()? op_to_idx belongs to the original program. It would also be worth printing the number of Origin OpDescs.
Done.
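For reference, a minimal sketch of what the adjusted log line could look like, assuming op_to_idx indexes the original program's OpDescs and node_to_idx indexes the matching nodes found in the new graph (the map key types here are simplified placeholders, not Paddle's actual types):

```cpp
#include <glog/logging.h>

#include <cstddef>
#include <map>
#include <string>

// Simplified stand-ins for the real maps: op_to_idx comes from the original
// program, node_to_idx from the nodes matched in the new graph.
void LogMatchStats(const std::map<std::string, size_t> &op_to_idx,
                   const std::map<std::string, size_t> &node_to_idx) {
  VLOG(10) << "Found unchanged OpDesc " << node_to_idx.size()
           << ", Origin OpDesc " << op_to_idx.size();
}
```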
auto &pending_ops = graph_view.PendingOps(cur_op);
tmp_ops.clear();
for (auto *pending_op : pending_ops) {
  if (visited_ops.count(pending_op) > 0) {
In theory an op should never be visited and executed more than once, right? Adding an assertion here seems safer.
visited_ops is actually unnecessary; it has been removed.
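To make the scheduling idea above concrete, here is a small self-contained sketch (toy types, not Paddle's OpHandleBase/graph_view) of the traversal: ready ops are drained from a queue, the in-degrees of their pending ops are decremented, and newly ready ops are sorted by node index before being enqueued so the resulting run order is deterministic. The assertion reflects the point raised above that no op should become ready twice.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <unordered_map>
#include <vector>

struct Op {
  size_t node_idx;                // index used to break ties deterministically
  std::vector<Op *> pending_ops;  // ops that depend on this op
};

std::vector<Op *> SortOps(const std::vector<Op *> &ready_ops,
                          std::unordered_map<Op *, size_t> deps) {
  std::vector<Op *> sorted_ops;
  std::queue<Op *> q;
  for (auto *op : ready_ops) q.push(op);

  std::vector<Op *> tmp_ops;
  while (!q.empty()) {
    auto *cur_op = q.front();
    q.pop();
    sorted_ops.push_back(cur_op);

    tmp_ops.clear();
    for (auto *pending_op : cur_op->pending_ops) {
      assert(deps[pending_op] > 0);  // each dependency is consumed exactly once
      if (--deps[pending_op] == 0) tmp_ops.push_back(pending_op);
    }
    // sort next ready ops by node index so ties resolve the same way every run
    std::sort(tmp_ops.begin(), tmp_ops.end(),
              [](Op *a, Op *b) { return a->node_idx < b->node_idx; });
    for (auto *op : tmp_ops) q.push(op);
  }
  return sorted_ops;
}
```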
auto *prev_op = sorted_ops[i - 1];
auto *cur_op = sorted_ops[i];
auto *dep_var = new details::DummyVarHandle(graph->CreateControlDepVar());
graph->Get<details::GraphDepVars>(details::kGraphDepVars).emplace(dep_var);
A small suggestion: check whether prev_op and cur_op already have a var dependency; if they do, there is no need to add dep_var.
That check is actually fairly complicated, because a var dependency can be indirect, spanning several levels of ops. The right way would be to move the ShrinkDepsFunctor code from reference_count_pass.cc over here, but since that is a fairly large change, it can be left for a follow-up PR.
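For illustration, a toy sketch of the direct check suggested above (simplified types, not Paddle's OpHandleBase/VarHandleBase): it detects whether cur_op directly consumes an output of prev_op. As the reply notes, this misses dependencies that are indirect across several ops, which is why something along the lines of ShrinkDepsFunctor would be needed for a complete check.

```cpp
#include <unordered_set>
#include <vector>

struct Var;

struct Op {
  std::vector<Var *> inputs;
  std::vector<Var *> outputs;
};

// Returns true only if cur_op reads one of prev_op's outputs directly.
bool HasDirectVarDep(const Op &prev_op, const Op &cur_op) {
  std::unordered_set<Var *> prev_outputs(prev_op.outputs.begin(),
                                         prev_op.outputs.end());
  for (auto *in : cur_op.inputs) {
    if (prev_outputs.count(in) > 0) return true;
  }
  return false;
}
```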
// sort next ready ops by node index
std::sort(tmp_ops.begin(), tmp_ops.end(), comp);
for (auto *op : tmp_ops) {
  q.push(op);
In the fast_threaded_ssa_graph_executor scheduler, some ops originally have high priority. Does that need to be handled here as well?
Good point, added.
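A toy sketch of how op priority could be folded into the comparator discussed above (the priority field and its values are assumptions, not the executor's actual API): higher-priority ops are placed first, and the node index is only used as a tie-breaker so the order stays deterministic.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Op {
  int priority;     // e.g. a "highest" vs "normal" level in the real executor
  size_t node_idx;  // deterministic tie-breaker
};

// Order ready ops so that high-priority ops run first, then by node index.
void SortReadyOps(std::vector<Op *> *tmp_ops) {
  std::sort(tmp_ops->begin(), tmp_ops->end(), [](Op *a, Op *b) {
    if (a->priority != b->priority) return a->priority > b->priority;
    return a->node_idx < b->node_idx;
  });
}
```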
// add one more thread for generate op_deps
prepare_pool_(1) {
if (ir::IsTopologySortOperationsUnique(*graph_)) {
This scheduler could probably be simplified as well, making it just run the ops sequentially.
That already exists; traced_ops_ handles it.
LGTM
PR types
Performance optimization
PR changes
APIs
Describe
Add fix_op_run_order pass.
MLPerf ResNet50 model optimization.
Profiling with nsys shows that Paddle's single-machine 8-GPU AllReduce time is noticeably longer than MxNet's. There are 2 reasons:
The case of FLAGS_fuse_parameter_memory_size=0 is handled by turning it into 2 AllReduce ops, the same as MxNet: one FP32 and one FP16.
With the above two changes, the MLPerf ResNet50 single-machine 8-GPU speedup improves from 7.69 to 7.83. MxNet's current single-machine 8-GPU speedup is 7.92.
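As a rough usage sketch, an IR pass like this is typically fetched from the pass registry by name and applied to the graph; the exact pass name string and header paths below are assumptions based on how other passes are commonly invoked, not something confirmed by this PR.

```cpp
#include "paddle/fluid/framework/ir/graph.h"
#include "paddle/fluid/framework/ir/pass.h"

// Hypothetical helper: look up the pass by name and apply it to the graph.
// "fix_op_run_order_pass" is an assumed registration name.
void ApplyFixOpRunOrder(paddle::framework::ir::Graph *graph) {
  auto pass = paddle::framework::ir::PassRegistry::Instance().Get(
      "fix_op_run_order_pass");
  pass->Apply(graph);
}
```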