restore instr_local_dep_object_pool_size for nccl #6160
Merged
Conversation
Signed-off-by: daquexian <daquexian566@gmail.com>
After this fix was submitted, does the issue "we suspect there is still some unknown cause making the synchronization between nccl and compute go wrong" still exist? If it does, that is not acceptable. Yesterday I ... Fundamentally, we need to guarantee that the interaction mechanism between eager and graph stays within an understandable scope; right now it feels like a chaotic state.
It still exists.
Yes, we still need to find the root cause.
lixinqi approved these changes on Sep 6, 2021
oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot on September 6, 2021 02:12
Experiments show that the reason ddp breaks after #6071 was merged is that it changed local_dep_object_pool_size from kDoubleBufferPoolSize to GetInstructionHighWaterMark().
wenxiao ran experiments: on the debug_resnet_zwx_2 branch (without #6071 merged), applying the kDoubleBufferPoolSize -> GetInstructionHighWaterMark() change reproduces the ddp problem; on the debug_resnet_zwx_1 branch (with #6071 merged), changing GetInstructionHighWaterMark() back to kDoubleBufferPoolSize makes the problem disappear.
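Why the pool size matters can be modeled abstractly: with a small pool (a double-buffer size of 2), a new instruction has to wait for an earlier instruction to release its pooled dependency object before it can reuse it, which acts as an implicit synchronization point; with a pool sized to the instruction high-water mark, that implicit serialization disappears and far more instructions can overlap. The following is a minimal, hypothetical Python sketch of this idea only (names like `DepObjectPool` are illustrative, not OneFlow's actual implementation):

```python
import threading
import time

class DepObjectPool:
    """Toy model of a fixed-size pool of dependency objects.

    Reusing an object from a small pool forces a new instruction to
    wait until an earlier one releases it -- an implicit sync point.
    """
    def __init__(self, size):
        self.size = size
        self._sem = threading.Semaphore(size)

    def acquire(self):
        self._sem.acquire()

    def release(self):
        self._sem.release()

def run_instructions(pool, n_instructions):
    """Launch n_instructions concurrently; return the peak overlap observed."""
    in_flight = 0
    peak = 0
    lock = threading.Lock()

    def instruction(i):
        nonlocal in_flight, peak
        pool.acquire()            # blocks if all dep objects are in use
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.001)         # stand-in for the kernel's execution time
        with lock:
            in_flight -= 1
        pool.release()

    threads = [threading.Thread(target=instruction, args=(i,))
               for i in range(n_instructions)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak

# With a double-buffer-sized pool, at most 2 instructions can be in
# flight at once; with a large (high-water-mark-like) pool, overlap
# is bounded only by the number of instructions.
```

In this simplified model, shrinking the pool back to the double-buffer size restores the accidental serialization, which is consistent with the observation that the small pool masks the underlying race rather than fixing it.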
luyang ran a full ddp training on the 类脑 cluster based on the restore_pool_size_based_on_zwx_1 branch; it has run for 35 epochs so far and the acc curve looks normal.
But the root cause of the problem is still unknown. To reproduce it quickly, I tried training resnet50 on two cards with identical initial weights and input data, expecting the losses on the two cards to match exactly. In practice, even on the debug_resnet_zwx_2 branch, the losses start to diverge after a dozen or so iterations. If the residual connections are removed from resnet50, or the allreduce is placed on the virtual machine's sync cuda stream, or a manual sync is inserted after the allreduce, the losses on the two cards stay consistent throughout. So we suspect there is still some unknown cause making the synchronization between nccl and compute go wrong; previously it stayed within a tolerable range, but after #6071 was merged it was amplified by the pool size change. Further investigation is needed.
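The "manual sync after allreduce" workaround amounts to inserting an explicit dependency between the nccl stream and the compute stream, so the consumer cannot read the reduced gradient before the allreduce has finished writing it. A hypothetical sketch of this failure mode, using Python threads as stand-ins for CUDA streams and a `threading.Event` in the role of a cudaEvent (this is a model of the race, not OneFlow or NCCL code):

```python
import threading

def all_reduce_then_compute(manual_sync):
    """Model of two streams: one performs an allreduce, one consumes it.

    With manual_sync=True the consumer waits for the allreduce to
    complete (like cudaStreamWaitEvent on an event recorded after
    ncclAllReduce); with manual_sync=False it may race ahead and
    read a stale gradient.
    """
    buf = {"grad": 0}                 # shared device buffer (stale value 0)
    allreduce_done = threading.Event()

    def nccl_stream():
        buf["grad"] = 42              # stand-in for the reduced gradient
        allreduce_done.set()          # like recording a cudaEvent

    def compute_stream(out):
        if manual_sync:
            allreduce_done.wait()     # explicit cross-stream dependency
        out.append(buf["grad"])       # consumer reads the gradient

    out = []
    producer = threading.Thread(target=nccl_stream)
    consumer = threading.Thread(target=compute_stream, args=(out,))
    consumer.start()
    producer.start()
    producer.join()
    consumer.join()
    return out[0]

# With the sync, the consumer always sees 42; without it, the result
# depends on thread scheduling, i.e. the outcome is nondeterministic.
```

This matches the observed symptoms: without the synchronization the divergence only shows up occasionally (after a dozen or so iterations), because the race is timing-dependent, while forcing the allreduce onto a synchronous stream or syncing manually makes the losses deterministic.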