New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix deadlock in instruction done #8897
Conversation
oneflow/core/vm/instruction.cpp
Outdated
INTRUSIVE_FOR_EACH_PTR(edge, mut_in_edges()) { | ||
Instruction* in_instruction = edge->mut_src_instruction(); | ||
CHECK(in_instruction->Done()); | ||
in_instruction->mut_out_edges()->Erase(edge); | ||
mut_in_edges()->Erase(edge); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StreamWait由于超前调度的作用,会出现这种情况。这是正常的,因为其上游指令一定也已经完成了。
} | ||
|
||
bool Instruction::Done() const { | ||
return stream_policy().QueryInstructionStatusDone(stream(), status_buffer()) | ||
&& in_edges().empty(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
最好能描述清楚,原来的做法为什么会死锁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OOMHandler那里的代码基本上是:
for instruction in stream.running_instruction_list:
wait until instruction->Done()
在stream_wait的https://github.com/Oneflow-Inc/oneflow/pull/8571/files#diff-ade37933f507aeea6c610837a09657ef820c915ec7d6d2bd0134c4750b900295R53 这里,Done行为被重构为考虑输入边为空才算完成。这在正常的流程里是没有问题的。但OOMHandler里却会在多卡的情况下有问题。我们观察到的是layer_norm op在上述的 wait until instruction->Done()
过程中死锁。因为layer_norm的weight 输入来自于nccl stream,而layer_norm op本身在compute stream上,如果上述for instruction in stream.running_instruction_list
先遍历到compute stream,就会出现这个BUG。
Speed stats:
|
CI failed when running job: cuda-speed-test. PR label automerge has been removed |
Speed stats:
|
解决近期Instruction->Done的错误修改导致的死锁。