Fix deadlock in instruction done #8897

Merged
hjchen2 merged 77 commits into master from fix_deadlock_in_instruction_done on Aug 10, 2022

@lixinqi lixinqi commented Aug 10, 2022

Fixes a deadlock caused by a recent incorrect change to Instruction->Done.

lixinqi added 30 commits May 12, 2022 21:11
@lixinqi lixinqi requested a review from daquexian August 10, 2022 08:33
Comment on lines 47 to 52
INTRUSIVE_FOR_EACH_PTR(edge, mut_in_edges()) {
  Instruction* in_instruction = edge->mut_src_instruction();
  CHECK(in_instruction->Done());
  in_instruction->mut_out_edges()->Erase(edge);
  mut_in_edges()->Erase(edge);
}
Contributor Author

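Because StreamWait is scheduled ahead of time, this situation can occur. It is expected, because the upstream instruction must also have completed by then.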

}

bool Instruction::Done() const {
  return stream_policy().QueryInstructionStatusDone(stream(), status_buffer())
         && in_edges().empty();
Contributor Author

@lixinqi lixinqi Aug 10, 2022

This check is too strict. In the multi-GPU case it causes a deadlock at https://github.com/Oneflow-Inc/oneflow/blob/master/oneflow/core/vm/virtual_machine_engine.cpp#L303.

Contributor

It would be best to describe clearly why the original approach deadlocks.

Contributor Author

The code in OOMHandler is essentially:

for instruction in stream.running_instruction_list:
    wait until instruction->Done()

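In the stream_wait PR, at https://github.com/Oneflow-Inc/oneflow/pull/8571/files#diff-ade37933f507aeea6c610837a09657ef820c915ec7d6d2bd0134c4750b900295R53, the behavior of Done was refactored so that an instruction counts as done only once its input edges are empty. That is fine in the normal flow, but it breaks OOMHandler in the multi-GPU case. What we observed is the layer_norm op deadlocking during the "wait until instruction->Done()" step above. The weight input of layer_norm comes from the nccl stream, while the layer_norm op itself runs on the compute stream. If the "for instruction in stream.running_instruction_list" loop above reaches the compute stream first, this bug occurs.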
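
The ordering problem described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not OneFlow's code: ToyInstruction, DoneStrict, DoneStatusOnly, WaitAll and RunScenario are hypothetical names, and the model assumes that an instruction's outgoing edges are removed only after the handler has finished waiting on that instruction. A bounded spin count stands in for the real wait, so the stuck case returns false instead of hanging.

```cpp
#include <vector>

// Hypothetical toy model; these are not OneFlow's real classes.
struct ToyInstruction {
  bool status_done = false;                // stands in for QueryInstructionStatusDone(...)
  int pending_in_edges = 0;                // stands in for the size of in_edges()
  std::vector<ToyInstruction*> consumers;  // stands in for the out edges

  // Done() as refactored: additionally requires that no input edges remain.
  bool DoneStrict() const { return status_done && pending_in_edges == 0; }
  // Done() based on the status buffer only.
  bool DoneStatusOnly() const { return status_done; }
};

// Assumed handler shape: wait for each instruction in order, then release it,
// which removes its edge from every downstream instruction.
// Returns false if the bounded wait gives up (a real wait would spin forever).
bool WaitAll(const std::vector<ToyInstruction*>& order, bool strict, int max_spins = 1000) {
  for (ToyInstruction* instruction : order) {
    int spins = 0;
    while (!(strict ? instruction->DoneStrict() : instruction->DoneStatusOnly())) {
      if (++spins >= max_spins) { return false; }
    }
    for (ToyInstruction* consumer : instruction->consumers) { --consumer->pending_in_edges; }
  }
  return true;
}

// Builds the two-instruction scenario and reports whether waiting completes.
bool RunScenario(bool compute_stream_first, bool strict) {
  ToyInstruction nccl_instruction;  // produces the layer_norm weight on the nccl stream
  ToyInstruction layer_norm;        // consumes that weight on the compute stream
  nccl_instruction.status_done = true;
  layer_norm.status_done = true;    // assumption: device work has already finished
  layer_norm.pending_in_edges = 1;
  nccl_instruction.consumers.push_back(&layer_norm);

  std::vector<ToyInstruction*> order;
  if (compute_stream_first) {
    order = {&layer_norm, &nccl_instruction};
  } else {
    order = {&nccl_instruction, &layer_norm};
  }
  return WaitAll(order, strict);
}
```

In this model, RunScenario(true, true) is the only combination that fails to complete: the strict check keeps waiting on layer_norm for an edge that would only be removed after the nccl instruction is visited. With the status-only check, both visiting orders complete.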

@lixinqi lixinqi requested a review from ouyangyu August 10, 2022 08:36
@lixinqi lixinqi added the need-highest-priority Only add this when you really need it!!! Will block all other PRs. label Aug 10, 2022
@github-actions
Contributor

Speed stats:

@github-actions
Contributor

CI failed when running job: cuda-speed-test. PR label automerge has been removed

@github-actions
Contributor

Speed stats:
GPU Name: GeForce GTX 1080 

✔️ OneFlow resnet50 time: 128.2ms (= 12820.8ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 142.8ms (= 14279.4ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.11 (= 142.8ms / 128.2ms)

OneFlow resnet50 time: 75.4ms (= 7544.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 83.3ms (= 8331.8ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.10 (= 83.3ms / 75.4ms)

OneFlow resnet50 time: 48.3ms (= 9659.2ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 56.3ms (= 11253.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.17 (= 56.3ms / 48.3ms)

OneFlow resnet50 time: 35.7ms (= 7149.9ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.3ms (= 8862.5ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.24 (= 44.3ms / 35.7ms)

OneFlow resnet50 time: 28.1ms (= 5617.7ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 39.6ms (= 7927.5ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.41 (= 39.6ms / 28.1ms)

OneFlow swin dataloader time: 0.271s (= 54.223s / 200, num_workers=1)
PyTorch swin dataloader time: 0.150s (= 30.070s / 200, num_workers=1)
Relative speed: 0.555 (= 0.150s / 0.271s)

OneFlow swin dataloader time: 0.072s (= 14.307s / 200, num_workers=4)
PyTorch swin dataloader time: 0.042s (= 8.384s / 200, num_workers=4)
Relative speed: 0.586 (= 0.042s / 0.072s)

OneFlow swin dataloader time: 0.041s (= 8.214s / 200, num_workers=8)
PyTorch swin dataloader time: 0.023s (= 4.515s / 200, num_workers=8)
Relative speed: 0.550 (= 0.023s / 0.041s)

❌ OneFlow resnet50 time: 136.5ms (= 13647.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 161.6ms (= 16160.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 161.6ms / 136.5ms)

OneFlow resnet50 time: 84.5ms (= 8445.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.5ms (= 10153.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 101.5ms / 84.5ms)

OneFlow resnet50 time: 58.0ms (= 11593.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.9ms (= 15772.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.36 (= 78.9ms / 58.0ms)

OneFlow resnet50 time: 45.4ms (= 9080.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.3ms (= 14064.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.55 (= 70.3ms / 45.4ms)

OneFlow resnet50 time: 39.0ms (= 7797.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.9ms (= 13775.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.77 (= 68.9ms / 39.0ms)

@hjchen2 hjchen2 merged commit 9a1fc46 into master Aug 10, 2022
@hjchen2 hjchen2 deleted the fix_deadlock_in_instruction_done branch August 10, 2022 12:01
Labels
bug need-highest-priority Only add this when you really need it!!! Will block all other PRs. system
5 participants