Fix deadlock in instruction done #8897
Merged
77 commits
6e8e9c9
ThreadLocalGuard
lixinqi 08e9178
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi f59d17d
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 3eb809a
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 55c163c
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 8aa2e8f
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 7612597
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi de5f971
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 8e86949
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 2ca0707
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 8537b7e
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 55c5160
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi e643eb1
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi eccdfe6
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 043accc
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 97b0eef
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 1591853
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi ba6f2d7
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 5e1a86a
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 1ee004c
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi e853c71
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi c5afe82
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 14226d6
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 754d6a7
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi acb7c98
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 5916848
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 913f6f5
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi fa3867e
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 61bee99
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 7eb2d72
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 5862a95
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 29ad00c
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 7297192
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 0a54078
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi cec8a1d
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi b50e236
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi a6c5d07
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi b6b73a2
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 43197bb
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 4453c58
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 582e11f
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 4001637
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 7fdc675
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 1555f70
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi cea5d58
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi ccbddef
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi c914f2f
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 6b7885f
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 09489b2
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi ee14204
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 4720413
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 97b697d
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 2cccecb
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 755199c
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 31a5022
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi d690538
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi a3a6056
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi dcaacc6
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 700c39a
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 1c6f65f
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 1d3c62f
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 50bc3ed
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 3aee226
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 4de4d3b
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 8164635
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 31296f5
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 126011b
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 8aea7b1
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 610b471
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi b0b2756
merge origin
lixinqi 6ae63ef
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 42aaf06
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi cf923bc
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 4d1abe8
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi c95faf0
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi 0203d26
fix deadlock in instruction done
lixinqi fc83dd6
fix done-query in ReleaseFinishedInstructions
lixinqi
This check is too strict. In the multi-GPU case it causes a deadlock at https://github.com/Oneflow-Inc/oneflow/blob/master/oneflow/core/vm/virtual_machine_engine.cpp#L303.
It would be best to describe clearly why the original approach deadlocks.
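The code in OOMHandler is basically:
    for instruction in stream.running_instruction_list:
        wait until instruction->Done()
In stream_wait, at https://github.com/Oneflow-Inc/oneflow/pull/8571/files#diff-ade37933f507aeea6c610837a09657ef820c915ec7d6d2bd0134c4750b900295R53, the Done behavior was refactored so that an instruction counts as done only when its input edges are empty. That is fine in the normal flow, but in OOMHandler it breaks in the multi-GPU case. What we observed is the layer_norm op deadlocking in the "wait until instruction->Done()" step above. The weight input of layer_norm comes from the nccl stream, while the layer_norm op itself runs on the compute stream, so if the "for instruction in stream.running_instruction_list" loop above reaches the compute stream first, this bug appears.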
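To make the ordering problem concrete, here is a minimal toy model of the scenario, not OneFlow's actual code. The `Instruction` and `Stream` structs, `WaitUntilDone`, `Release`, `DrainStreams` and `RunScenario` are all hypothetical names invented for illustration. The sketch also assumes that an input edge is only cleared when the waiting thread itself releases the producer, which the comment above implies but does not state. The busy-wait is bounded so that the stall shows up as a `false` return instead of a real hang.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only: these types are hypothetical and are not OneFlow's classes.
struct Instruction {
  std::string name;
  bool work_finished = true;            // device-side work has completed
  int pending_in_edges = 0;             // unreleased producers feeding this instruction
  std::vector<Instruction*> consumers;  // instructions that depend on this one

  // Strict check described in the comment: finished AND no input edges left.
  bool Done() const { return work_finished && pending_in_edges == 0; }
};

struct Stream {
  std::string name;
  std::vector<Instruction*> running_instruction_list;
};

// Bounded stand-in for "wait until instruction->Done()". A real busy-wait
// would spin forever; here we give up after max_spins and report failure.
bool WaitUntilDone(const Instruction& instruction, int max_spins = 1000) {
  for (int i = 0; i < max_spins; ++i) {
    if (instruction.Done()) { return true; }
  }
  return false;
}

// Assumption of this sketch: the waiting thread is also the only one that
// releases a finished producer and thereby clears its consumers' input edges.
void Release(Instruction* instruction) {
  for (Instruction* consumer : instruction->consumers) {
    --consumer->pending_in_edges;
  }
  instruction->consumers.clear();
}

// Returns true if every instruction could be waited on, false on a stall.
bool DrainStreams(const std::vector<Stream*>& streams) {
  for (Stream* stream : streams) {
    for (Instruction* instruction : stream->running_instruction_list) {
      if (!WaitUntilDone(*instruction)) { return false; }
      Release(instruction);
    }
  }
  return true;
}

// Builds the two-stream scenario and drains it in the requested order.
bool RunScenario(bool compute_stream_first) {
  Instruction nccl_op;
  nccl_op.name = "nccl_weight";
  Instruction layer_norm;
  layer_norm.name = "layer_norm";
  nccl_op.consumers.push_back(&layer_norm);  // weight flows nccl -> layer_norm
  layer_norm.pending_in_edges = 1;

  Stream nccl_stream{"nccl", {&nccl_op}};
  Stream compute_stream{"compute", {&layer_norm}};
  if (compute_stream_first) {
    return DrainStreams({&compute_stream, &nccl_stream});
  }
  return DrainStreams({&nccl_stream, &compute_stream});
}
```

In this model `RunScenario(false)` (nccl stream first) drains both streams, while `RunScenario(true)` (compute stream first) stalls on layer_norm: its input edge from the nccl stream can never be cleared, because the only thread that could clear it is the one stuck waiting.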