TaskNode::order_in_chain #10102
Conversation
file_stream << "i : " << std::to_string(i) << " , actor id : " << std::to_string(task_id)
            << " thrd : " << std::to_string(thrd_id) << " name : " << task_id2name.at(task_id)
            << "\n chain_id : " << std::to_string(task->chain_id())
            << " order_in_chain : " << std::to_string(task->order_in_chain())
Added the chain id information and adjusted the format:
before:
order : 39 , actor id : 8796126576640 name : reduce_sum-12 thrd : 4194320 device_type : kCPU stream_index : 16 {
consume : in : <- [ reshape-11/__out_0 ] ( actor_id: 8796124479488, regst: {regust_num: 1, device: cpu, time_shape: (1,1,4), shape: (16,), dtype: kFloat} )
produce : tmp regst: {regust_num: 1, device: cpu, time_shape: (1,1,4), shape: (64,), dtype: kChar} {
}
produce : __output_tensor_0 regst: {regust_num: 1, device: cpu, time_shape: (1,1,4), shape: (), dtype: kFloat} {
-> [ pack-21 ] ( actor_id: 8796147548160 )
-> [ ones_like-13 ] ( actor_id: 8796128673792 )
}
}
order : 40 , actor id : 8796147548160 name : pack-21 thrd : 4194330 device_type : kCPU stream_index : 26 {
consume : in : <- [ reduce_sum-12/__output_tensor_0 ] ( actor_id: 8796126576640, regst: {regust_num: 1, device: cpu, time_shape: (1,1,4), shape: (), dtype: kFloat} )
produce : out regst: {regust_num: 1, device: cpu, time_shape: (1,1), shape: (4,), dtype: kFloat} {
-> [ _LinearTrainGraph_0_output.0.0.1_4 ] ( actor_id: 8796149645312 )
}
}
after:
i : 37 , actor id : 17592186044430 thrd : 8388608 name : add_n-10
chain_id : 0 order_in_chain : 4 device_type : kCUDA stream_index : 0 {
consume : in : <- [ broadcast_add-5/__z_0 ] ( actor_id: 17592186044426, regst: {regust_num: 1, device: cuda, time_shape: (1,1,4), shape: (2,8), dtype: kFloat} )
consume : in : <- [ constant-8/__out_0 ] ( actor_id: 17592186044429, regst: {regust_num: 1, device: cuda, time_shape: (1,1,4), shape: (2,8), dtype: kFloat} )
produce : __out_0 regst: {regust_num: 1, device: cuda, time_shape: (1,1,4), shape: (2,8), dtype: kFloat} {
-> [ reshape-11 ] ( actor_id: 17592186044431 )
}
}
i : 38 , actor id : 17592186044431 thrd : 8388608 name : reshape-11
chain_id : 0 order_in_chain : 5 device_type : kCUDA stream_index : 0 {
consume : in : <- [ add_n-10/__out_0 ] ( actor_id: 17592186044430, regst: {regust_num: 1, device: cuda, time_shape: (1,1,4), shape: (2,8), dtype: kFloat} )
produce : __out_0 regst: {regust_num: 1, device: cuda, time_shape: (1,1,4), shape: (16,), dtype: kFloat} {
-> [ pack-20 ] ( actor_id: 17592186044440 )
-> [ broadcast_like-14 ] ( actor_id: 17592186044434 )
-> [ reduce_sum-12 ] ( actor_id: 17592186044432 )
}
}
i : 39 , actor id : 17592186044432 thrd : 8388608 name : reduce_sum-12
chain_id : 0 order_in_chain : 7 device_type : kCUDA stream_index : 0 {
consume : in_ctrl : <- [ pack-20/out_ctrl_103 ] ( actor_id: 17592186044440, regst: {regust_num: 1, device: cuda, ctrl} )
consume : in : <- [ reshape-11/__out_0 ] ( actor_id: 17592186044431, regst: {regust_num: 1, device: cuda, time_shape: (1,1,4), shape: (16,), dtype: kFloat} )
produce : __output_tensor_0 regst: {regust_num: 1, device: cuda, time_shape: (1,1,4), shape: (), dtype: kFloat} {
-> [ pack-21 ] ( actor_id: 17592186044442 )
-> [ ones_like-13 ] ( actor_id: 17592186044433 )
}
produce : tmp regst: {regust_num: 1, device: cuda, time_shape: (1,1,4), shape: (512,), dtype: kChar} {
}
}
@@ -606,11 +603,7 @@ void StraightenNodes(TaskGraph* task_graph, std::vector<TaskNode*>* ordered_task

  std::vector<int32_t> remain_task_nums(num_classifier, 0);

  auto SetOrderInGraph = [&](TaskNode* task_node) {
Slightly refined the straighten algorithm on the physical graph by removing the order in graph concept; it now only provides the ordered task nodes. @Yipeng1994
Yes, I already saw this change in the big separate-compilation PR.
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/10102/
oneflow/core/job/task.proto (Outdated)
map<string, RegstDescProto> produced_regst_desc = 8;
map<string, RegstDescIdSet> consumed_regst_desc_id = 9;
optional bool all_register_num_eq_one_hint = 10 [default = false];
required int64 chain_id = 10;
These ids should still be reusable; for example, could 6 be used here?
The proto part is updated all the time, and compatibility has never been guaranteed anyway.
> These ids should still be reusable
6 could be used, but it would still be incompatible, because that number previously belonged to task_set_info.
The intent here is that if more fields are inserted later, e.g. some xx_id, they can go in without affecting the fields that come after chain id. See how op_conf partitions field numbers across its different types.
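The field-number strategy discussed here could look like the following. This is an illustrative sketch only, not the actual task.proto; `order_in_chain`'s number and the reserved range are hypothetical.

```proto
message TaskProtoSketch {
  // ... existing task fields occupy numbers 1-9 ...
  required int64 chain_id = 10;
  required int64 order_in_chain = 11;  // hypothetical number
  // Numbers up to 19 are deliberately left free, so that a later field such
  // as some xx_id can be inserted here without renumbering anything after
  // chain_id, similar to how op_conf partitions number ranges per op type.
}
```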
> The proto part is updated all the time, and compatibility has never been guaranteed anyway.
Right, and bigger changes are coming, since plan/job contains many redundant fields.
We can only guarantee usability within a major version; across versions, just recompile the job/plan once.
The plan cache can add a check: if a stored plan turns out to be from an old version, automatically recompile and overwrite it.
Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com>
…dev_cc_order_in_chain
Did a quick test of the straighten pass; it works correctly.
CI failed when running job: cpu-misc. PR label automerge has been removed
CI failed when running job: cuda-module. PR label automerge has been removed
Split out from the separate-compilation work:
Depends on:
Remove order_in_graph and use order_in_chain instead. With LogicalChainPass enabled (separate compilation forces LogicalChain on), the logical chain writes order_in_logical_chain into each op, so the order is read from the logical graph and the physical graph's topology information is skipped.
Refine LightPlan's output.