Fix deadlock in instruction done #8897

lixinqi · 2022-08-10T08:33:15Z

Fixes a deadlock caused by a recent incorrect change to Instruction->Done.

lixinqi · 2022-08-10T08:35:46Z

oneflow/core/vm/instruction.cpp

+  INTRUSIVE_FOR_EACH_PTR(edge, mut_in_edges()) {
+    Instruction* in_instruction = edge->mut_src_instruction();
+    CHECK(in_instruction->Done());
+    in_instruction->mut_out_edges()->Erase(edge);
+    mut_in_edges()->Erase(edge);
+  }


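Because StreamWait is scheduled ahead of time, this situation can occur. It is expected, because its upstream instructions must already have completed.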
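The cleanup in the snippet above can be sketched on a toy graph. This is only an illustration under stated assumptions: `ToyNode`, `ToyEdge`, `DetachRemainingInEdges` and `DemoDetach` are hypothetical names, plain `std::vector` stands in for OneFlow's intrusive edge lists, and `assert` stands in for `CHECK`.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct ToyNode;

// Hypothetical stand-in for a dependency edge between two instructions.
struct ToyEdge {
  ToyNode* src;
  ToyNode* dst;
};

// Hypothetical stand-in for vm::Instruction; not OneFlow's real class.
struct ToyNode {
  bool done = false;
  std::vector<ToyEdge*> in_edges;
  std::vector<ToyEdge*> out_edges;
};

// Drops every in-edge still attached to `node`, unlinking it on both sides.
// The assert mirrors CHECK(in_instruction->Done()) in the diff.
void DetachRemainingInEdges(ToyNode* node) {
  while (!node->in_edges.empty()) {
    ToyEdge* edge = node->in_edges.back();
    assert(edge->src->done);
    std::vector<ToyEdge*>& outs = edge->src->out_edges;
    outs.erase(std::remove(outs.begin(), outs.end(), edge), outs.end());
    node->in_edges.pop_back();
  }
}

// Usage: an already finished upstream node is still linked to `downstream`.
bool DemoDetach() {
  ToyNode upstream;
  upstream.done = true;
  ToyNode downstream;
  ToyEdge edge{&upstream, &downstream};
  upstream.out_edges.push_back(&edge);
  downstream.in_edges.push_back(&edge);
  DetachRemainingInEdges(&downstream);
  return downstream.in_edges.empty() && upstream.out_edges.empty();
}
```

The edge is removed from both the source's out-edges and the target's in-edges, so neither side keeps a dangling pointer.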

lixinqi · 2022-08-10T08:35:50Z

oneflow/core/vm/instruction.cpp

 }

 bool Instruction::Done() const {
-  return stream_policy().QueryInstructionStatusDone(stream(), status_buffer())
-         && in_edges().empty();


This condition is too strict. In the multi-GPU case it causes a deadlock at https://github.com/Oneflow-Inc/oneflow/blob/master/oneflow/core/vm/virtual_machine_engine.cpp#L303.

It would be best to describe clearly why the original approach deadlocks.

The code in OOMHandler is essentially:

for instruction in stream.running_instruction_list: wait until instruction->Done()

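In the stream_wait change at https://github.com/Oneflow-Inc/oneflow/pull/8571/files#diff-ade37933f507aeea6c610837a09657ef820c915ec7d6d2bd0134c4750b900295R53, the behavior of Done was refactored so that an instruction only counts as done once its input edges are empty. That is fine in the normal flow, but it breaks in OOMHandler in the multi-GPU case. What we observed is that the layer_norm op deadlocks during the `wait until instruction->Done()` loop above. The weight input of layer_norm comes from the nccl stream, while the layer_norm op itself runs on the compute stream, so if `for instruction in stream.running_instruction_list` reaches the compute stream first, this bug appears.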
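A toy model may make the ordering problem easier to see. It is a sketch, not OneFlow code: `ToyInstruction`, `WaitAllInOrder` and `SimulateOomWait` are hypothetical names, and it assumes (my reading of the explanation above, not something the PR states) that an in-edge is only erased once the waiting loop itself has moved past the upstream instruction. Instead of spinning forever, the model returns `false` where the real loop would hang.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical stand-in for vm::Instruction; not OneFlow's real class.
struct ToyInstruction {
  bool status_done = false;               // what the stream's status query would report
  std::vector<ToyInstruction*> in_edges;  // upstream instructions not yet released

  bool DoneStrict() const { return status_done && in_edges.empty(); }  // check removed by this PR
  bool DoneRelaxed() const { return status_done; }                     // status-only check
};

// Models "for instruction in running list: wait until instruction->Done()".
// Returns false when a wait could never finish (the deadlock).
bool WaitAllInOrder(const std::vector<ToyInstruction*>& order, bool strict) {
  for (ToyInstruction* instruction : order) {
    const bool done = strict ? instruction->DoneStrict() : instruction->DoneRelaxed();
    if (!done) { return false; }  // a real loop would spin here forever
    // Assumed release step: drop this instruction from all in-edge lists.
    for (ToyInstruction* other : order) {
      std::vector<ToyInstruction*>& edges = other->in_edges;
      edges.erase(std::remove(edges.begin(), edges.end(), instruction), edges.end());
    }
  }
  return true;
}

// layer_norm (compute stream) consumes a weight produced on the nccl stream.
// Both kernels are assumed to have finished on the device already.
bool SimulateOomWait(bool compute_stream_first, bool strict) {
  ToyInstruction nccl_weight;
  nccl_weight.status_done = true;
  ToyInstruction layer_norm;
  layer_norm.status_done = true;
  layer_norm.in_edges.push_back(&nccl_weight);

  std::vector<ToyInstruction*> order{&nccl_weight, &layer_norm};
  if (compute_stream_first) { std::swap(order[0], order[1]); }
  return WaitAllInOrder(order, strict);
}
```

With the strict check, `SimulateOomWait(true, true)` reports a hang because layer_norm still holds an in-edge to an instruction the loop has not visited yet; visiting the nccl stream first, or using the status-only check, lets the loop finish.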

github-actions · 2022-08-10T09:41:19Z

Speed stats:

github-actions · 2022-08-10T09:41:21Z

CI failed when running job: cuda-speed-test. PR label automerge has been removed

github-actions · 2022-08-10T10:59:42Z

Speed stats:

GPU Name: GeForce GTX 1080 

✔️ OneFlow resnet50 time: 128.2ms (= 12820.8ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 142.8ms (= 14279.4ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.11 (= 142.8ms / 128.2ms)

OneFlow resnet50 time: 75.4ms (= 7544.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 83.3ms (= 8331.8ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.10 (= 83.3ms / 75.4ms)

OneFlow resnet50 time: 48.3ms (= 9659.2ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 56.3ms (= 11253.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.17 (= 56.3ms / 48.3ms)

OneFlow resnet50 time: 35.7ms (= 7149.9ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.3ms (= 8862.5ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.24 (= 44.3ms / 35.7ms)

OneFlow resnet50 time: 28.1ms (= 5617.7ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 39.6ms (= 7927.5ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.41 (= 39.6ms / 28.1ms)

OneFlow swin dataloader time: 0.271s (= 54.223s / 200, num_workers=1)
PyTorch swin dataloader time: 0.150s (= 30.070s / 200, num_workers=1)
Relative speed: 0.555 (= 0.150s / 0.271s)

OneFlow swin dataloader time: 0.072s (= 14.307s / 200, num_workers=4)
PyTorch swin dataloader time: 0.042s (= 8.384s / 200, num_workers=4)
Relative speed: 0.586 (= 0.042s / 0.072s)

OneFlow swin dataloader time: 0.041s (= 8.214s / 200, num_workers=8)
PyTorch swin dataloader time: 0.023s (= 4.515s / 200, num_workers=8)
Relative speed: 0.550 (= 0.023s / 0.041s)

❌ OneFlow resnet50 time: 136.5ms (= 13647.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 161.6ms (= 16160.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 161.6ms / 136.5ms)

OneFlow resnet50 time: 84.5ms (= 8445.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.5ms (= 10153.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 101.5ms / 84.5ms)

OneFlow resnet50 time: 58.0ms (= 11593.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.9ms (= 15772.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.36 (= 78.9ms / 58.0ms)

OneFlow resnet50 time: 45.4ms (= 9080.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.3ms (= 14064.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.55 (= 70.3ms / 45.4ms)

OneFlow resnet50 time: 39.0ms (= 7797.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.9ms (= 13775.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.77 (= 68.9ms / 39.0ms)

lixinqi added 30 commits May 12, 2022 21:11

ThreadLocalGuard

6e8e9c9

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

08e9178

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

f59d17d

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

3eb809a

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

55c163c

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

8aa2e8f

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

7612597

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

de5f971

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

8e86949

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

2ca0707

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

8537b7e

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

55c5160

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

e643eb1

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

eccdfe6

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

043accc

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

97b0eef

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

1591853

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

ba6f2d7

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

5e1a86a

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

1ee004c

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

e853c71

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

c5afe82

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

14226d6

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

754d6a7

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

acb7c98

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

5916848

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

913f6f5

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

fa3867e

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

61bee99

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

7eb2d72

lixinqi added 12 commits August 3, 2022 23:29

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

8164635

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

31296f5

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

126011b

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

8aea7b1

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

610b471

merge origin

b0b2756

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

6ae63ef

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

42aaf06

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

cf923bc

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

4d1abe8

Merge branch 'master' of github.com:Oneflow-Inc/oneflow

c95faf0

fix deadlock in instruction done

0203d26

lixinqi requested a review from daquexian August 10, 2022 08:33

lixinqi commented Aug 10, 2022

lixinqi requested a review from ouyangyu August 10, 2022 08:36

daquexian reviewed Aug 10, 2022

daquexian approved these changes Aug 10, 2022

lixinqi added automerge bug system labels Aug 10, 2022

lixinqi requested a review from oneflow-ci-bot August 10, 2022 08:50

lixinqi added the need-highest-priority label Aug 10, 2022

Flowingsun007 approved these changes Aug 10, 2022

github-actions bot removed the automerge label Aug 10, 2022

fix done-query in ReleaseFinishedInstructions

fc83dd6

hjchen2 merged commit 9a1fc46 into master Aug 10, 2022

hjchen2 deleted the fix_deadlock_in_instruction_done branch August 10, 2022 12:01
