restore instr_local_dep_object_pool_size for nccl #6160

Merged: daquexian merged 3 commits into master from fix_ddp on Sep 6, 2021

@daquexian daquexian commented Sep 5, 2021

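Experiments show that DDP broke after #6071 was merged because that PR changed local_dep_object_pool_size from kDoubleBufferPoolSize to GetInstructionHighWaterMark().

wenxiao ran the experiment: on the debug_resnet_zwx_2 branch, which does not include #6071, changing kDoubleBufferPoolSize to GetInstructionHighWaterMark() makes DDP fail; on the debug_resnet_zwx_1 branch, which includes #6071, changing GetInstructionHighWaterMark() back to kDoubleBufferPoolSize makes the problem go away.
luyang ran a full DDP training on the 类脑 (Leinao) platform based on the restore_pool_size_based_on_zwx_1 branch; it has run 35 epochs so far and the accuracy curve looks normal.

However, the root cause is still unknown. To reproduce the problem quickly, I tried training resnet50 on two GPUs with identical initial weights and input data, expecting the losses on the two GPUs to be exactly identical. In practice, even on the debug_resnet_zwx_2 branch, the losses start to diverge after a dozen or so iterations. If I remove resnet50's residual connections, or put allreduce on the virtual machine's sync cuda stream, or manually sync after allreduce, the losses on the two GPUs stay identical. So my guess is that some other, unknown cause breaks the synchronization between nccl and computation. It used to stay within a tolerable range, but after #6071 was merged the pool size change amplified it. Further investigation is needed.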
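
The two-GPU check described above (identical initial weights and inputs, so the per-rank losses should match exactly) can be automated with a small comparison helper. This is a minimal sketch, assuming each rank records its loss per iteration into a plain list; the function name `first_divergence` and the `tol` parameter are hypothetical and not part of OneFlow.

```python
from typing import Optional, Sequence


def first_divergence(
    losses_rank0: Sequence[float],
    losses_rank1: Sequence[float],
    tol: float = 0.0,
) -> Optional[int]:
    """Return the first iteration index where the two ranks' losses differ.

    Hypothetical helper: with identical weights and inputs on both ranks,
    the losses are expected to be bit-identical, so tol defaults to 0.0.
    Returns None if no divergence is found in the compared range.
    """
    for i, (a, b) in enumerate(zip(losses_rank0, losses_rank1)):
        if abs(a - b) > tol:
            return i
    return None
```

Running it on the loss logs of both ranks gives the iteration at which synchronization first appears to break, which makes it easier to compare branches or workarounds (for example, with and without a manual sync after allreduce).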

Signed-off-by: daquexian <daquexian566@gmail.com>

yuanms2 commented Sep 5, 2021

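After this fix is committed, is the problem described as "some other, unknown cause breaks the synchronization between nccl and computation" still there?

If it is, that is not acceptable.

Yesterday I posted some thoughts at
https://github.com/Oneflow-Inc/OneTeam/issues/560

Fundamentally, we need to keep the interaction mechanism between eager and graph within a range we can understand. Right now it feels like a chaotic state.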

daquexian (author) commented

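> After this fix is committed, is the problem described as "some other, unknown cause breaks the synchronization between nccl and computation" still there?

Yes, it is still there.

> If it is, that is not acceptable.

Agreed. We still need to keep looking for the root cause.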


github-actions bot commented Sep 6, 2021

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 128.1ms (= 6406.6ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 139.2ms (= 6957.7ms / 50, input_shape=[16, 3, 224, 224])
Relative speed: 1.09 (= 139.2ms / 128.1ms)

OneFlow resnet50 time: 74.4ms (= 3719.6ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 80.9ms (= 4042.7ms / 50, input_shape=[8, 3, 224, 224])
Relative speed: 1.09 (= 80.9ms / 74.4ms)

OneFlow resnet50 time: 49.4ms (= 2469.4ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 61.0ms (= 3049.0ms / 50, input_shape=[4, 3, 224, 224])
Relative speed: 1.23 (= 61.0ms / 49.4ms)

OneFlow resnet50 time: 46.5ms (= 2325.1ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.8ms (= 2238.3ms / 50, input_shape=[2, 3, 224, 224])
Relative speed: 0.96 (= 44.8ms / 46.5ms)

OneFlow resnet50 time: 35.7ms (= 1784.1ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 43.0ms (= 2151.2ms / 50, input_shape=[1, 3, 224, 224])
Relative speed: 1.21 (= 43.0ms / 35.7ms)

OneFlow resnet50 time: 153.8ms (= 7690.2ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 161.5ms (= 8074.1ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
Relative speed: 1.05 (= 161.5ms / 153.8ms)

OneFlow resnet50 time: 102.7ms (= 5133.4ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 104.9ms (= 5245.1ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
Relative speed: 1.02 (= 104.9ms / 102.7ms)

OneFlow resnet50 time: 80.3ms (= 4013.7ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 82.8ms (= 4137.6ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
Relative speed: 1.03 (= 82.8ms / 80.3ms)

OneFlow resnet50 time: 66.0ms (= 3302.1ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 63.9ms (= 3197.2ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
Relative speed: 0.97 (= 63.9ms / 66.0ms)

OneFlow resnet50 time: 68.8ms (= 3439.9ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 60.9ms (= 3046.2ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
Relative speed: 0.89 (= 60.9ms / 68.8ms)
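
For reference, the figures in each block above follow from simple arithmetic: the per-iteration time is the total time divided by 50 iterations, and the relative speed is the PyTorch time divided by the OneFlow time (values above 1 mean OneFlow is faster). The sketch below reproduces that calculation; `relative_speed` is a hypothetical helper written for illustration, not the CI bot's actual script.

```python
def relative_speed(oneflow_total_ms: float, pytorch_total_ms: float, iters: int = 50):
    """Return (oneflow_ms, pytorch_ms, ratio) for one benchmark configuration.

    Hypothetical helper: per-iteration time = total / iters, and
    ratio = PyTorch per-iteration time / OneFlow per-iteration time.
    """
    oneflow_ms = oneflow_total_ms / iters
    pytorch_ms = pytorch_total_ms / iters
    return oneflow_ms, pytorch_ms, pytorch_ms / oneflow_ms


# First block of the report: input_shape=[16, 3, 224, 224], single GPU.
of_ms, pt_ms, ratio = relative_speed(6406.6, 6957.7)
print(f"OneFlow {of_ms:.1f}ms, PyTorch {pt_ms:.1f}ms, relative speed {ratio:.2f}")
```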

daquexian merged commit ac99bca into master Sep 6, 2021
daquexian deleted the fix_ddp branch September 6, 2021 03:22