Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock in instruction done #8897

Merged
merged 77 commits into from Aug 10, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
77 commits
Select commit Hold shift + click to select a range
6e8e9c9
ThreadLocalGuard
lixinqi May 12, 2022
08e9178
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi May 14, 2022
f59d17d
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi May 18, 2022
3eb809a
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi May 18, 2022
55c163c
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi May 20, 2022
8aa2e8f
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 1, 2022
7612597
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 6, 2022
de5f971
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 8, 2022
8e86949
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 9, 2022
2ca0707
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 16, 2022
8537b7e
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 16, 2022
55c5160
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 17, 2022
e643eb1
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 18, 2022
eccdfe6
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 20, 2022
043accc
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 20, 2022
97b0eef
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 20, 2022
1591853
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 23, 2022
ba6f2d7
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 23, 2022
5e1a86a
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 23, 2022
1ee004c
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 24, 2022
e853c71
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 24, 2022
c5afe82
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 24, 2022
14226d6
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 26, 2022
754d6a7
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 27, 2022
acb7c98
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 28, 2022
5916848
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jun 28, 2022
913f6f5
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 1, 2022
fa3867e
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 2, 2022
61bee99
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 3, 2022
7eb2d72
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 4, 2022
5862a95
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 6, 2022
29ad00c
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 6, 2022
7297192
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 6, 2022
0a54078
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 8, 2022
cec8a1d
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 11, 2022
b50e236
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 11, 2022
a6c5d07
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 15, 2022
b6b73a2
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 16, 2022
43197bb
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 16, 2022
4453c58
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 17, 2022
582e11f
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 18, 2022
4001637
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 19, 2022
7fdc675
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 19, 2022
1555f70
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 20, 2022
cea5d58
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 22, 2022
ccbddef
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 22, 2022
c914f2f
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 25, 2022
6b7885f
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 25, 2022
09489b2
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 26, 2022
ee14204
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 27, 2022
4720413
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 28, 2022
97b697d
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 28, 2022
2cccecb
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 28, 2022
755199c
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 29, 2022
31a5022
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 29, 2022
d690538
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 29, 2022
a3a6056
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 29, 2022
dcaacc6
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 30, 2022
700c39a
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Jul 31, 2022
1c6f65f
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Aug 1, 2022
1d3c62f
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Aug 2, 2022
50bc3ed
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Aug 2, 2022
3aee226
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Aug 2, 2022
4de4d3b
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Aug 3, 2022
8164635
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Aug 3, 2022
31296f5
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Aug 3, 2022
126011b
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Aug 4, 2022
8aea7b1
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Aug 5, 2022
610b471
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Aug 5, 2022
b0b2756
merge origin
lixinqi Aug 6, 2022
6ae63ef
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Aug 7, 2022
42aaf06
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Aug 8, 2022
cf923bc
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Aug 9, 2022
4d1abe8
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Aug 9, 2022
c95faf0
Merge branch 'master' of github.com:Oneflow-Inc/oneflow
lixinqi Aug 10, 2022
0203d26
fix deadlock in instruction done
lixinqi Aug 10, 2022
fc83dd6
fix done-query in ReleaseFinishedInstructions
lixinqi Aug 10, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
17 changes: 11 additions & 6 deletions oneflow/core/vm/instruction.cpp
Expand Up @@ -41,16 +41,21 @@ void Instruction::InitStatus() { instruction_policy_->InitInstructionStatusIf(th
Maybe<void> Instruction::Prepare() { return instruction_policy_->PrepareIf(this); }
void Instruction::Compute() { return instruction_policy_->ComputeIf(this); }

void Instruction::DeleteStatusAndClearEdges() {
OF_PROFILER_RANGE_GUARD("Instruction::DeleteStatusAndClearEdges");
void Instruction::DeleteStatusAndCheckEdges() {
OF_PROFILER_RANGE_GUARD("Instruction::DeleteStatusAndCheckEdges");
instruction_policy_->DeleteInstructionStatusIf(this);
mut_in_edges()->Clear();
mut_out_edges()->Clear();
INTRUSIVE_FOR_EACH_PTR(edge, mut_in_edges()) {
Instruction* in_instruction = edge->mut_src_instruction();
LOG(FATAL) << "unerased edge: " << in_instruction->DebugName() << " -> " << this->DebugName();
}
INTRUSIVE_FOR_EACH_PTR(edge, mut_out_edges()) {
Instruction* out_instruction = edge->mut_dst_instruction();
LOG(FATAL) << "unerased edge: " << this->DebugName() << " -> " << out_instruction->DebugName();
}
}

bool Instruction::Done() const {
return stream_policy().QueryInstructionStatusDone(stream(), status_buffer())
&& in_edges().empty();
Copy link
Contributor Author

@lixinqi lixinqi Aug 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个判断过于强烈。会导致多卡情况下在 https://github.com/Oneflow-Inc/oneflow/blob/master/oneflow/core/vm/virtual_machine_engine.cpp#L303 处死锁。

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

最好能描述清楚,原来的做法为什么会死锁

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OOMHandler那里的代码基本上是:

for instruction in stream.running_instruction_list:
    wait until instruction->Done()

在stream_wait的https://github.com/Oneflow-Inc/oneflow/pull/8571/files#diff-ade37933f507aeea6c610837a09657ef820c915ec7d6d2bd0134c4750b900295R53 这里,Done行为被重构为考虑输入边为空才算完成。这在正常的流程里是没有问题的。但OOMHandler里却会在多卡的情况下有问题。我们观察到的是layer_norm op在上述的 wait until instruction->Done()过程中死锁。因为layer_norm的weight 输入来自于nccl stream,而layer_norm op本身在compute stream上,如果上述for instruction in stream.running_instruction_list先遍历到compute stream,就会出现这个BUG。

return stream_policy().QueryInstructionStatusDone(stream(), status_buffer());
}

StreamPolicy* Instruction::mut_stream_policy() { return mut_stream()->mut_stream_policy(); }
Expand Down
2 changes: 1 addition & 1 deletion oneflow/core/vm/instruction.h
Expand Up @@ -138,7 +138,7 @@ class Instruction final : public intrusive::Base {

// methods
void InitStatus();
void DeleteStatusAndClearEdges();
void DeleteStatusAndCheckEdges();
bool Done() const;
StreamPolicy* mut_stream_policy();
const StreamPolicy& stream_policy() const;
Expand Down
5 changes: 3 additions & 2 deletions oneflow/core/vm/virtual_machine_engine.cpp
Expand Up @@ -177,13 +177,14 @@ void VirtualMachineEngine::ReleaseFinishedInstructions(const ScheduleCtx& schedu
INTRUSIVE_FOR_EACH_PTR(stream, mut_active_stream_list()) {
while (true) {
auto* instruction_ptr = stream->mut_running_instruction_list()->Begin();
if (instruction_ptr == nullptr || !instruction_ptr->Done()) { break; }
if (instruction_ptr == nullptr) { break; }
if (!(instruction_ptr->in_edges().empty() && instruction_ptr->Done())) { break; }
ReleaseInstruction(instruction_ptr);
// Prevent destructing instruction_ptr.
intrusive::shared_ptr<Instruction> instruction =
stream->mut_running_instruction_list()->Erase(instruction_ptr);
LivelyInstructionListErase(instruction_ptr);
instruction_ptr->DeleteStatusAndClearEdges();
instruction_ptr->DeleteStatusAndCheckEdges();
}
if (stream->running_instruction_list().empty()) { mut_active_stream_list()->Erase(stream); }
}
Expand Down