Auto FixPipelineStageIdPass #6204

Merged
merged 17 commits into master from dev_cc_fix_pipeline_stage
Sep 10, 2021

Conversation

@chengtbf chengtbf commented Sep 8, 2021

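A user-configured pipeline stage id does not necessarily guarantee that stages are strictly contiguous (especially for ops at module boundaries, such as to_consistent), so we need an algorithm that automatically repairs incorrect stage ids. (What the user configures is really only a pipeline stage id hint; OneFlow is free to change it internally to a more suitable value.)

Algorithm:

  1. Collect the op name and stage id for every placement, and group all ops under the same placement by their stage id;
  2. If a placement has more than one stage id group, take the largest stage id among those groups as the real stage id, and merge the ops that carry any other stage id into this real stage id.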
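
The two steps above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in this PR: `OpInfo`, `GroupByPlacementAndStage`, and `ComputeStageIdFixes` are hypothetical names, and the placement is assumed to be representable as a plain string key.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an op; this is NOT OneFlow's real OpNode.
struct OpInfo {
  std::string op_name;
  std::string placement;  // assumed: a string that identifies the placement
  int64_t stage_id_hint;  // stage id as configured by the user
};

// stage id -> names of the ops that carry it
using StageGroups = std::map<int64_t, std::vector<std::string>>;

// Step 1: for every placement, group its ops by stage id.
std::map<std::string, StageGroups> GroupByPlacementAndStage(const std::vector<OpInfo>& ops) {
  std::map<std::string, StageGroups> groups;
  for (const OpInfo& op : ops) { groups[op.placement][op.stage_id_hint].push_back(op.op_name); }
  return groups;
}

// Step 2: for placements with more than one stage id group, map every op that
// is not already on the largest stage id to that largest stage id.
// Returns op name -> corrected stage id, only for ops that need a change.
std::map<std::string, int64_t> ComputeStageIdFixes(const std::vector<OpInfo>& ops) {
  std::map<std::string, int64_t> fixes;
  const std::map<std::string, StageGroups> groups = GroupByPlacementAndStage(ops);
  for (const auto& placement_pair : groups) {
    const StageGroups& stage_groups = placement_pair.second;
    if (stage_groups.size() <= 1) { continue; }  // already consistent
    // std::map is ordered by key, so the last entry holds the largest stage id.
    const int64_t real_stage_id = stage_groups.rbegin()->first;
    assert(real_stage_id >= 0);  // assumption: valid stage ids are non-negative
    for (const auto& stage_pair : stage_groups) {
      if (stage_pair.first == real_stage_id) { continue; }
      for (const std::string& op_name : stage_pair.second) { fixes[op_name] = real_stage_id; }
    }
  }
  return fixes;
}
```

Returning a map of fixes instead of mutating the ops is just one possible design; it keeps the sketch free of any graph-rewriting details.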

@chengtbf chengtbf added the WIP work in progress label Sep 8, 2021
@chengtbf chengtbf added automerge bottleneck blocking another feature/PR feature interface system and removed WIP work in progress labels Sep 9, 2021
@chengtbf chengtbf marked this pull request as ready for review September 9, 2021 09:09
@chengtbf chengtbf mentioned this pull request Sep 9, 2021
3 tasks
CHECK_GE_OR_RETURN(max_stage_id, 0);
for (const OpNode* this_node : pair.second) {
  int64_t this_stage_id = GetStageIdHint(this_node);
  if (this_stage_id != max_stage_id) {

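I refactored the algorithm; the main logic is now much simpler:

For each placement group, find the largest stage id; every other op whose stage id is smaller than this maximum is merged into the maximum.

@leaves-zwx @strint

It works fine in my own local tests.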
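
For readers without the diff at hand, the simplified logic described in this comment might look like the sketch below. It is an assumption-laden illustration rather than the merged implementation: `Op` and `FixStageIds` are hypothetical, the placement is again assumed to be a plain string key, and the `assert` only loosely mirrors the `CHECK_GE_OR_RETURN(max_stage_id, 0)` seen in the reviewed hunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an op node; OneFlow's real OpNode is different.
struct Op {
  std::string name;
  std::string placement;  // assumed: a string that identifies the placement group
  int64_t stage_id;       // user-provided hint, rewritten in place by the fix
};

// Raises every op to the largest stage id found in its placement group.
// Returns the number of ops whose stage id was changed.
int FixStageIds(std::vector<Op>& ops) {
  // Pass 1: find the largest stage id per placement group.
  std::unordered_map<std::string, int64_t> max_stage_id;
  for (const Op& op : ops) {
    auto it = max_stage_id.find(op.placement);
    if (it == max_stage_id.end()) {
      max_stage_id.emplace(op.placement, op.stage_id);
    } else {
      it->second = std::max(it->second, op.stage_id);
    }
  }
  // Pass 2: merge every op with a smaller stage id into the group maximum.
  int num_fixed = 0;
  for (Op& op : ops) {
    const int64_t target = max_stage_id.at(op.placement);
    assert(target >= 0);  // assumption: a negative maximum indicates a bug
    if (op.stage_id < target) {
      op.stage_id = target;
      ++num_fixed;
    }
  }
  return num_fixed;
}
```

Because the target is the group maximum, testing `op.stage_id < target` here is equivalent to the `!=` comparison in the hunk.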

@chengtbf chengtbf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot September 10, 2021 09:09
@github-actions

CI failed, removing label automerge

@github-actions

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 127.5ms (= 6376.5ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 139.2ms (= 6958.4ms / 50, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.09 (= 139.2ms / 127.5ms)

OneFlow resnet50 time: 74.2ms (= 3710.8ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 83.8ms (= 4191.8ms / 50, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.13 (= 83.8ms / 74.2ms)

OneFlow resnet50 time: 53.9ms (= 2697.0ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 59.0ms (= 2948.6ms / 50, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.09 (= 59.0ms / 53.9ms)

OneFlow resnet50 time: 44.6ms (= 2229.8ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 56.2ms (= 2811.7ms / 50, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.26 (= 56.2ms / 44.6ms)

OneFlow resnet50 time: 48.1ms (= 2406.7ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 47.3ms (= 2365.4ms / 50, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 0.98 (= 47.3ms / 48.1ms)

OneFlow resnet50 time: 155.5ms (= 7774.0ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 161.8ms (= 8089.5ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.04 (= 161.8ms / 155.5ms)

OneFlow resnet50 time: 99.5ms (= 4974.3ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 106.3ms (= 5313.2ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.07 (= 106.3ms / 99.5ms)

OneFlow resnet50 time: 79.6ms (= 3978.5ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.6ms (= 3981.0ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.00 (= 79.6ms / 79.6ms)

OneFlow resnet50 time: 72.8ms (= 3641.9ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.9ms (= 3546.1ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.97 (= 70.9ms / 72.8ms)

OneFlow resnet50 time: 71.0ms (= 3550.9ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.7ms (= 3286.4ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.93 (= 65.7ms / 71.0ms)

@github-actions

CI failed, removing label automerge

@oneflow-ci-bot oneflow-ci-bot removed their request for review September 10, 2021 10:24
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot September 10, 2021 11:38
@github-actions

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 127.6ms (= 6379.7ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 141.9ms (= 7093.5ms / 50, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.11 (= 141.9ms / 127.6ms)

OneFlow resnet50 time: 74.4ms (= 3721.0ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 82.8ms (= 4138.6ms / 50, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.11 (= 82.8ms / 74.4ms)

OneFlow resnet50 time: 48.0ms (= 2400.7ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 56.7ms (= 2836.2ms / 50, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.18 (= 56.7ms / 48.0ms)

OneFlow resnet50 time: 43.1ms (= 2154.6ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 51.3ms (= 2567.0ms / 50, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.19 (= 51.3ms / 43.1ms)

OneFlow resnet50 time: 42.4ms (= 2121.7ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 39.1ms (= 1955.1ms / 50, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 0.92 (= 39.1ms / 42.4ms)

OneFlow resnet50 time: 153.0ms (= 7650.6ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 154.4ms (= 7717.9ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.01 (= 154.4ms / 153.0ms)

OneFlow resnet50 time: 99.3ms (= 4965.6ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.4ms (= 5117.8ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.03 (= 102.4ms / 99.3ms)

OneFlow resnet50 time: 78.3ms (= 3916.8ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.7ms (= 3736.6ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.95 (= 74.7ms / 78.3ms)

OneFlow resnet50 time: 73.1ms (= 3657.0ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.6ms (= 3278.7ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.90 (= 65.6ms / 73.1ms)

OneFlow resnet50 time: 72.3ms (= 3614.9ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 64.6ms (= 3232.4ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.89 (= 64.6ms / 72.3ms)

@oneflow-ci-bot oneflow-ci-bot merged commit bff3c2d into master Sep 10, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the dev_cc_fix_pipeline_stage branch September 10, 2021 12:30
4 participants