
Fix bug of Multi-Client src tick output order #6221

Merged (10 commits, Sep 11, 2021)

Conversation

@chengtbf (Contributor) commented Sep 9, 2021

This PR fixes a critical bug in multi-device execution of nn.Graph:

Each rank has its own WaitAndSendIds and CallbackNotify Op/TaskNode as that rank's sole src and sole sink. At the same time, each rank's VM only sends a JobInstance, as the actual execution of the RunLazyJob instruction, to the input/output, WaitAndSendIds, and CallbackNotify nodes that exist on the local rank. See:

buffer_mgr->Get(GetSourceTickBufferName(job_name))->Send(job_instance);

However, since the migration from Single-Client to Multi-Client, the whole computation graph has become multi-source (each rank has its own independent src and sink). So even when the correct control edges are connected in the logical graph, each rank's input/output in the physical graph is not entirely triggered by that rank's own src. As a result, in multi-device runs (especially with asymmetric pipeline parallelism and asymmetric inputs/outputs), the local rank's output in the computation graph may already have executed while the local rank's VM has not yet sent a JobInstance to the src. At that point OutputKernel fails with an "empty buffer" error:

CHECK_NE(buffer_status, kBufferStatusEmpty);
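
To make the race concrete, here is a toy model of the per-rank buffer handoff (all types and names below are illustrative stand-ins, not OneFlow's real Buffer API): the VM pushes a JobInstance into a named buffer and the output kernel pops from it, so a pop that arrives before the push sees an empty buffer.

```cpp
#include <deque>
#include <mutex>

enum BufferStatus { kBufferStatusSuccess, kBufferStatusEmpty };

// Toy stand-in for the per-rank buffer: the VM Send()s a JobInstance, the
// output kernel TryReceive()s it. If the kernel runs before the local VM has
// sent anything, TryReceive returns kBufferStatusEmpty -- the failing CHECK above.
template <typename T>
class ToyBuffer {
 public:
  void Send(const T& item) {
    std::lock_guard<std::mutex> lock(mu_);
    items_.push_back(item);
  }
  BufferStatus TryReceive(T* out) {
    std::lock_guard<std::mutex> lock(mu_);
    if (items_.empty()) { return kBufferStatusEmpty; }
    *out = items_.front();
    items_.pop_front();
    return kBufferStatusSuccess;
  }

 private:
  std::deque<T> items_;
  std::mutex mu_;
};
```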

This PR fixes it by connecting control edges on the TaskGraph from each rank's src tick to that rank's input/output, forcing the output kernel to depend at least on the local rank's src tick before it can execute.
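
A minimal sketch of the fix's shape (the types and helpers are hypothetical, not the actual TaskGraph code): for each rank, connect a control edge from that rank's src-tick TaskNode to every input/output TaskNode on the same rank.

```cpp
#include <unordered_map>
#include <vector>

// Illustrative node type; the real TaskNode carries much more state.
struct ToyTaskNode {
  int rank;
  bool is_src_tick = false;
  bool is_input_output = false;
  std::vector<ToyTaskNode*> ctrl_in;  // control-edge predecessors
};

// Force each rank's input/output to depend on that rank's src tick.
void ConnectSrcTickCtrlEdges(std::vector<ToyTaskNode*>& nodes) {
  std::unordered_map<int, ToyTaskNode*> src_tick_of_rank;
  for (ToyTaskNode* n : nodes) {
    if (n->is_src_tick) { src_tick_of_rank[n->rank] = n; }
  }
  for (ToyTaskNode* n : nodes) {
    if (!n->is_input_output) { continue; }
    auto it = src_tick_of_rank.find(n->rank);
    // Control edge: n may only run after its local rank's src tick.
    if (it != src_tick_of_rank.end()) { n->ctrl_in.push_back(it->second); }
  }
}
```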

It also fixes a bug where the input/output callback executed later than the callback notify.

@leaves-zwx (Contributor)

This is one viable implementation: handling the control edges of src ops at the task graph level. But following xinqi's earlier idea, solving this through the autotick mechanism at the op graph level has real advantages: expressing similar functionality in a single form improves maintainability and robustness. The tick mechanism has always been what drives the controlled subgraph, so it would be best to solve this new multi-client problem within the scope of ticks. Connecting tick control edges in the task graph is a special case and will add maintenance cost later.

Currently ticks are chained level by level according to function, four kinds in total. The source side is SourceTickOp -> TickOp -> SrcSubsetTickOp -> DeviceTickOp; the sink side is DeviceTickOp -> DstSubsetTickOp -> TickOp -> SinkTickOp. There is also AccTickOp, which drives ops that run at different frequencies.

  • SourceTickOp/SinkTickOp: marks the start/end of a critical section
  • TickOp: inserted between SourceTickOp and SrcSubsetTickOp; I'm not sure what it's for
  • SrcSubsetTickOp: the start/end on a node (machine)
  • DeviceTickOp: the start of a placement group

The multi-client problem, it seems, is that new src ops are introduced. Previously src ops were identified through REGISTER_AUTO_TICK registration, a static judgment; under multi-client the src ops are dynamic and cannot be determined statically. It seems feasible to search for all src ops within the same placement group and connect them to the DeviceTickOp with control edges, which would control how ops on different ranks are driven. A sketch of that idea follows.
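
A hedged sketch of that alternative (every name here is illustrative, and the next comment explains why this OpGraph-level version turns out not to work): group ops by placement, then let each group's DeviceTickOp drive the dynamically discovered src ops in the same group.

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative op node; the placement string stands in for a real ParallelDesc.
struct ToyOpNode {
  std::string placement;
  bool is_src_op = false;       // discovered dynamically, not via REGISTER_AUTO_TICK
  bool is_device_tick = false;
  std::vector<ToyOpNode*> ctrl_in;  // control-edge predecessors
};

// Connect each placement group's DeviceTickOp to the group's src ops.
void DriveSrcOpsByDeviceTick(std::vector<ToyOpNode*>& ops) {
  std::map<std::string, std::vector<ToyOpNode*>> groups;
  for (ToyOpNode* op : ops) { groups[op->placement].push_back(op); }
  for (auto& group : groups) {
    ToyOpNode* device_tick = nullptr;
    for (ToyOpNode* op : group.second) {
      if (op->is_device_tick) { device_tick = op; break; }
    }
    if (device_tick == nullptr) { continue; }
    for (ToyOpNode* op : group.second) {
      // Control edge: the src op is driven by its group's DeviceTickOp.
      if (op->is_src_op) { op->ctrl_in.push_back(device_tick); }
    }
  }
}
```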

@chengtbf (Contributor, Author) commented Sep 10, 2021

> The multi-client problem, it seems, is that new src ops are introduced. Previously src ops were identified through REGISTER_AUTO_TICK registration, a static judgment; under multi-client the src ops are dynamic and cannot be determined statically. It seems feasible to search for all src ops within the same placement group and connect them to the DeviceTickOp with control edges, which would control how ops on different ranks are driven.

I walked through it, and it seems the control edges can't be connected on the OpGraph 😂, because they might have no effect. See:

void ForEachOpGraphNecessaryCtrlEdge(

If a control edge (src -> dst) we connect on the OpGraph is already reachable on the OpGraph (meaning there is an existing data dependency), the control edge won't be inserted, right? And output and input are in fact reachable from the src tick, just not from the src tick on the local rank (reachable in the logical graph, but not actually reachable in the physical graph).
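
A sketch of the filtering behavior being described (illustrative only; the real logic lives in ForEachOpGraphNecessaryCtrlEdge): a requested control edge (src -> dst) is emitted only when dst is not already reachable from src, and since the logical-graph src tick reaches every input/output, exactly the per-rank edges this fix needs would be filtered out.

```cpp
#include <functional>
#include <utility>
#include <vector>

struct ToyOpNode {};  // stand-in for the real OpNode

// Emit only the "necessary" control edges: those not already implied by a
// data dependency (OpGraph reachability). Because a logical src tick reaches
// every input/output, the per-rank src-tick edges would all be dropped here.
void ForEachNecessaryCtrlEdge(
    const std::function<bool(const ToyOpNode*, const ToyOpNode*)>& IsReachable,
    const std::vector<std::pair<const ToyOpNode*, const ToyOpNode*>>& candidates,
    const std::function<void(const ToyOpNode*, const ToyOpNode*)>& Handler) {
  for (const auto& edge : candidates) {
    if (!IsReachable(edge.first, edge.second)) { Handler(edge.first, edge.second); }
  }
}
```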

@strint mentioned this pull request Sep 10, 2021
@leaves-zwx (Contributor)

> If a control edge (src -> dst) we connect on the OpGraph is already reachable on the OpGraph (meaning there is an existing data dependency), the control edge won't be inserted, right? And output and input are in fact reachable from the src tick, just not from the src tick on the local rank (reachable in the logical graph, but not actually reachable in the physical graph).

Got it, so for now this can't be done at the op graph level. I thought it over carefully; your approach should be fairly safe and shouldn't create hidden problems later. As for unifying the mechanism, we can redesign it when we implement distributed compilation.

@chengtbf (Contributor, Author)

Added test cases; this branch is ready.

@strint (Contributor) left a comment

I looked at the plan and it matches the implementation. The 4-device test from my PR also runs correctly when merged with this one.

loss = pp_out.mean()  # reduce the pipeline-parallel output to a scalar loss
loss.backward()  # run backward through the pipeline-parallel graph
y = x + y  # extra computation whose result becomes a "free" output below
free_out = y.to_consistent(P1, B)  # convert to a consistent tensor on placement P1 with sbp B
Contributor

Without the surrounding context it's hard to tell what this snippet is doing; adding an explanatory comment here would make it clearer.

@chengtbf (Contributor, Author)

OK, will do.

@github-actions (Contributor)

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 127.4ms (= 6370.3ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 138.9ms (= 6945.9ms / 50, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.09 (= 138.9ms / 127.4ms)

OneFlow resnet50 time: 74.2ms (= 3708.6ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.0ms (= 4199.9ms / 50, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.13 (= 84.0ms / 74.2ms)

OneFlow resnet50 time: 49.7ms (= 2485.3ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 57.4ms (= 2868.5ms / 50, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.15 (= 57.4ms / 49.7ms)

OneFlow resnet50 time: 39.8ms (= 1991.0ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 51.7ms (= 2585.2ms / 50, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.30 (= 51.7ms / 39.8ms)

OneFlow resnet50 time: 42.1ms (= 2103.6ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 43.7ms (= 2185.8ms / 50, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.04 (= 43.7ms / 42.1ms)

OneFlow resnet50 time: 150.5ms (= 7527.2ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 158.9ms (= 7945.8ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.06 (= 158.9ms / 150.5ms)

OneFlow resnet50 time: 100.8ms (= 5041.4ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.5ms (= 5074.7ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.01 (= 101.5ms / 100.8ms)

OneFlow resnet50 time: 81.5ms (= 4077.3ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 87.0ms (= 4351.8ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.07 (= 87.0ms / 81.5ms)

OneFlow resnet50 time: 76.2ms (= 3808.9ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 87.5ms (= 4376.7ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 87.5ms / 76.2ms)

OneFlow resnet50 time: 67.4ms (= 3369.9ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.8ms (= 3489.0ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.04 (= 69.8ms / 67.4ms)

@github-actions (Contributor)

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 128.4ms (= 6418.3ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 138.8ms (= 6938.3ms / 50, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.08 (= 138.8ms / 128.4ms)

OneFlow resnet50 time: 74.7ms (= 3735.6ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 85.3ms (= 4264.1ms / 50, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.14 (= 85.3ms / 74.7ms)

OneFlow resnet50 time: 48.0ms (= 2399.2ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 61.1ms (= 3056.3ms / 50, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.27 (= 61.1ms / 48.0ms)

OneFlow resnet50 time: 39.5ms (= 1977.4ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.8ms (= 2240.6ms / 50, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.13 (= 44.8ms / 39.5ms)

OneFlow resnet50 time: 40.3ms (= 2016.0ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 40.5ms (= 2022.5ms / 50, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.00 (= 40.5ms / 40.3ms)

OneFlow resnet50 time: 152.8ms (= 7638.9ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.9ms (= 8045.4ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.05 (= 160.9ms / 152.8ms)

OneFlow resnet50 time: 97.8ms (= 4887.7ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 105.8ms (= 5287.8ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.08 (= 105.8ms / 97.8ms)

OneFlow resnet50 time: 76.2ms (= 3810.3ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 73.4ms (= 3671.4ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.96 (= 73.4ms / 76.2ms)

OneFlow resnet50 time: 69.4ms (= 3467.9ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 63.8ms (= 3189.2ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.92 (= 63.8ms / 69.4ms)

OneFlow resnet50 time: 67.4ms (= 3370.3ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 62.2ms (= 3108.9ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.92 (= 62.2ms / 67.4ms)
