Fix sync nccl and async nccl deadlock #6071

Merged

oneflow-ci-bot merged 45 commits into master from fix_sync_nccl_and_async_nccl_deadlock

Aug 31, 2021

Contributor

liufengwei0103 commented Aug 26, 2021

Fix sync nccl and async nccl deadlock

liufengwei0103 and others added 9 commits

August 25, 2021 16:31


          add loop p2b test

bceccd8


          refine

cd2f1cf


          refine

25dc209


          use more data

c21a5e4


          use more data

d664d5b


          refactor test_sync_and_async_allreduce.py

68590e1


          Merge branch 'fix_sync_nccl_and_async_nccl_deadlock' of github.com:Oneflow-Inc/oneflow

c7652d9


          sequential all comm_net ops

54cfcab


          Merge branch 'master' into fix_sync_nccl_and_async_nccl_deadlock

60ba2a3

lixinqi requested a review from oneflow-ci-bot

August 27, 2021 13:18

lixinqi added automerge bug enhancement system labels

lixinqi requested a review from daquexian

August 27, 2021 13:19

lixinqi added 2 commits

August 27, 2021 21:20


          remove unused a bash test file


          Merge branch 'fix_sync_nccl_and_async_nccl_deadlock' of github.com:Oneflow-Inc/oneflow into fix_sync_nccl_and_async_nccl_deadlock

fdb4508

lixinqi reviewed

oneflow/core/eager/local_call_opkernel_phy_instr_operand.cpp

-                auto* device_dep_object = opkernel().device()->mut_compute_local_dep_object();
-                if (opkernel().device()->type() == "nccl") {
+                const auto& device = opkernel().device();
+                const auto& opt_transport_dep_object = device->mut_transport_local_dep_object();

Contributor

lixinqi Aug 27, 2021

This serializes all transport-type instructions.
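
A minimal sketch of the idea in this comment, written in Python rather than OneFlow's C++: instructions that mutate the same dependency object are chained in issue order, while instructions on unrelated objects stay independent. The `Instruction` class, `build_dependencies`, and the dep-object names (`"transport"`, `"compute:gpu0"`) are hypothetical and only illustrate the scheduling rule, not the VM's actual data structures.

```python
class Instruction:
    def __init__(self, name, mut_deps):
        self.name = name
        self.mut_deps = tuple(mut_deps)  # names of dep objects this instruction mutates
        self.predecessors = []           # instructions that must finish before this one


def build_dependencies(instructions):
    """Chain instructions that mutate the same dep object, in issue order."""
    last_writer = {}
    for instr in instructions:
        for dep in instr.mut_deps:
            prev = last_writer.get(dep)
            if prev is not None and prev not in instr.predecessors:
                instr.predecessors.append(prev)
            last_writer[dep] = instr
    return instructions


# Two compute ops and two transport ops (hypothetical names).
instrs = build_dependencies([
    Instruction("matmul_0", ["compute:gpu0"]),
    Instruction("sync_allreduce_0", ["transport"]),
    Instruction("async_allreduce_0", ["transport"]),
    Instruction("relu_0", ["compute:gpu0"]),
])
edges = {i.name: [p.name for p in i.predecessors] for i in instrs}
```

In this toy model the two allreduce instructions are ordered against each other because they share the `"transport"` object, but neither is ordered against `matmul_0` or `relu_0`, so compute can still overlap with transport.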

oneflow/core/eager/run_lazy_job_phy_instr_operand.cpp Outdated

               void RunLazyJobPhyInstrOperand::ForEachMutMirroredObject(
                   const std::function<void(vm::MirroredObject* infer, vm::MirroredObject* compute)>& DoEach)
                   const {
+                DoEach(nullptr, CHECK_JUST(GetCommNetLocalDepObject())->mut_mirrored_object());

Contributor

lixinqi Aug 27, 2021

This serializes all lazy job execution with eager transport-type execution.

oneflow/core/framework/device.cpp Outdated

Comment on lines 106 to 108

+                    {"comm_net", Optional<std::string>("comm_net")},
+                    {"sync_launched_nccl", Optional<std::string>("comm_net")},
+                    {"async_launched_nccl", Optional<std::string>("comm_net")},

Contributor

lixinqi Aug 27, 2021

This serializes all instructions on the comm_net, sync_launched_nccl, and async_launched_nccl devices.

Contributor

daquexian Aug 27, 2021

Is it really necessary to serialize comm_net and async_launched_nccl against each other? Could this be made more precise?
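
The trade-off raised here can be sketched as a lookup table from device type to a shared dependency key. `COARSE` mirrors the three entries quoted from `device.cpp` above. `FINER` is a purely hypothetical alternative that is not taken from this PR and is not claimed to be safe. `must_serialize` and the `"cuda"` device type in the usage note are also assumptions made for illustration.

```python
from typing import Dict, Optional

# Coarse mapping: all three device types share one key, as in the quoted diff.
COARSE: Dict[str, Optional[str]] = {
    "comm_net": "comm_net",
    "sync_launched_nccl": "comm_net",
    "async_launched_nccl": "comm_net",
}

# Hypothetical finer mapping (an assumption, not the PR's code):
# only the two NCCL device types are ordered against each other.
FINER: Dict[str, Optional[str]] = {
    "comm_net": "comm_net",
    "sync_launched_nccl": "nccl",
    "async_launched_nccl": "nccl",
}


def must_serialize(mapping: Dict[str, Optional[str]], dev_a: str, dev_b: str) -> bool:
    """Two device types are serialized only if they map to the same key."""
    key_a = mapping.get(dev_a)
    key_b = mapping.get(dev_b)
    return key_a is not None and key_a == key_b
```

With `COARSE`, `must_serialize(COARSE, "comm_net", "async_launched_nccl")` is true. With `FINER` it is false, while the two NCCL device types are still ordered against each other. A device type with no entry, such as `"cuda"` here, is never serialized by this table.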

python/oneflow/test/modules/test_sync_and_async_allreduce.py

Comment on lines +44 to +45

		synced_y = sync_allreduce(sync_x)
		asynced_y = async_allreduce(async_x)

Contributor

lixinqi Aug 27, 2021

Sync and async allreduce are executed alternately; if the order is wrong, the test is guaranteed to abort.

Contributor

yuanms2 Aug 27, 2021

Is async_allreduce executed by an actor?

Contributor

lixinqi Aug 27, 2021

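No. It is executed by the async_cuda thread. It optimizes the backward-pass allreduce of traditional data parallelism so that computation and communication can overlap.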
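
A toy model of why alternating sync and async collectives is a useful stress test. It assumes that a collective only completes when every rank launches collectives in the same order, and that an async op handed to a separate thread may launch later than it was issued. The function `launch_order`, the `async_lag` parameter, and the op names are hypothetical; this is not OneFlow's scheduler.

```python
def launch_order(program, async_lag, serialized):
    """Return the order in which collectives are launched on one rank.

    program:    list of (kind, name) pairs in issue order, kind is "sync" or "async"
    async_lag:  how many steps later an async op launches when nothing orders it
    serialized: if True, every transport op waits for the previous one
    """
    if serialized:
        return [name for _, name in program]
    timed = []
    for step, (kind, name) in enumerate(program):
        launch_time = step + (async_lag if kind == "async" else 0)
        timed.append((launch_time, step, name))
    return [name for _, _, name in sorted(timed)]


program = [("sync", "s0"), ("async", "a0"), ("sync", "s1"), ("async", "a1")]

# Without serialization, two ranks with different async lag disagree on the order.
rank0_free = launch_order(program, async_lag=0, serialized=False)
rank1_free = launch_order(program, async_lag=2, serialized=False)

# With serialization, the lag no longer matters and both ranks agree.
rank0_ser = launch_order(program, async_lag=0, serialized=True)
rank1_ser = launch_order(program, async_lag=2, serialized=True)
```

In this model the unserialized ranks launch `s0, a0, s1, a1` and `s0, s1, a0, a1` respectively, which is the kind of mismatch the alternating test is meant to expose. The serialized ranks both follow issue order.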

lixinqi approved these changes

yuanms2 reviewed

oneflow/core/framework/device.cpp Outdated

lixinqi and others added 2 commits

August 27, 2021 22:05


          reset pool_size of async_launced_nccl to high water mark

c3b155f


          Merge branch 'master' into fix_sync_nccl_and_async_nccl_deadlock

e3dbe8e

oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot

August 27, 2021 14:36

Contributor

github-actions bot commented Aug 27, 2021

CI failed, removing label automerge

github-actions bot removed the automerge label

Contributor

github-actions bot commented Aug 27, 2021

CI failed, removing label automerge

1 similar comment

Contributor

github-actions bot commented Aug 27, 2021

CI failed, removing label automerge

oneflow-ci-bot removed their request for review

August 27, 2021 16:35

Contributor

yuanms2 commented Aug 28, 2021

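Xinqi, please give this some thought: we originally introduced the VM to get more pipelining and out-of-order execution. Now we have found scenarios where the order must be fixed, so dep objects are being added in many places, which amounts to giving up some of that out-of-order execution. How large is the impact, and how much of our earlier out-of-order design is still doing useful work?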

oneflow-ci-bot self-requested a review

August 30, 2021 13:59

Contributor

github-actions bot commented Aug 30, 2021

CI failed, removing label automerge

github-actions bot removed the automerge label

oneflow-ci-bot removed their request for review

August 30, 2021 16:24


          default op_device

23ef591

lixinqi requested a review from oneflow-ci-bot

August 30, 2021 16:58

lixinqi added the automerge label


          Merge branch 'master' into fix_sync_nccl_and_async_nccl_deadlock

1ad703a

oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot

August 30, 2021 17:08

Contributor

github-actions bot commented Aug 30, 2021

CI failed, removing label automerge

github-actions bot removed the automerge label

oneflow-ci-bot removed their request for review

August 30, 2021 18:21


          refactor flow.F.xxx to flow._C.xxx

lixinqi requested a review from oneflow-ci-bot

August 30, 2021 23:58

lixinqi added the automerge label

oneflow-ci-bot removed their request for review

August 31, 2021 00:57

chengtbf requested a review from oneflow-ci-bot

August 31, 2021 02:29

oneflow-ci-bot removed their request for review

August 31, 2021 03:39


          Merge branch 'master' into fix_sync_nccl_and_async_nccl_deadlock

9e02516

oneflow-ci-bot self-requested a review

August 31, 2021 03:39


          Merge branch 'master' into fix_sync_nccl_and_async_nccl_deadlock

1c22146

oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot

August 31, 2021 05:11

Contributor

github-actions bot commented Aug 31, 2021

Speed stats:

GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 127.5ms (= 6373.8ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 142.5ms (= 7123.3ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.12 (= 142.5ms / 127.5ms)

OneFlow resnet50 time: 74.2ms (= 3708.4ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 83.3ms (= 4167.0ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.12 (= 83.3ms / 74.2ms)

OneFlow resnet50 time: 47.5ms (= 2374.0ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 57.9ms (= 2892.5ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.22 (= 57.9ms / 47.5ms)

OneFlow resnet50 time: 40.5ms (= 2027.4ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 47.1ms (= 2356.0ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.16 (= 47.1ms / 40.5ms)

OneFlow resnet50 time: 37.0ms (= 1850.5ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 43.7ms (= 2184.7ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.18 (= 43.7ms / 37.0ms)

OneFlow resnet50 time: 141.0ms (= 7052.1ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 152.5ms (= 7626.3ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.08 (= 152.5ms / 141.0ms)

OneFlow resnet50 time: 86.5ms (= 4327.3ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 94.1ms (= 4705.0ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.09 (= 94.1ms / 86.5ms)

OneFlow resnet50 time: 58.7ms (= 2936.2ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 67.8ms (= 3389.5ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.15 (= 67.8ms / 58.7ms)

OneFlow resnet50 time: 58.2ms (= 2911.7ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 53.1ms (= 2653.5ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.91 (= 53.1ms / 58.2ms)

OneFlow resnet50 time: 48.9ms (= 2442.8ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 48.0ms (= 2398.9ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.98 (= 48.0ms / 48.9ms)
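
For reading the figures above, a small sketch of the arithmetic shown in each row: per-iteration time is the total divided by 50 iterations, and relative speed is the PyTorch time divided by the OneFlow time, so values above 1 mean OneFlow was faster in that run. The function names are hypothetical, and using the unrounded totals for the ratio is an assumption about how the bot computes it.

```python
def per_iteration_ms(total_ms: float, iterations: int = 50) -> float:
    """Average time per iteration in milliseconds."""
    return total_ms / iterations


def relative_speed(pytorch_total_ms: float, oneflow_total_ms: float) -> float:
    """PyTorch time divided by OneFlow time; > 1 means OneFlow was faster."""
    return pytorch_total_ms / oneflow_total_ms


# First row of the stats: input_shape=[16, 3, 224, 224].
oneflow_ms = round(per_iteration_ms(6373.8), 1)
pytorch_ms = round(per_iteration_ms(7123.3), 1)
speed = round(relative_speed(7123.3, 6373.8), 2)
```

Under the same reading, the rows reporting 0.91 and 0.98 are runs where PyTorch was slightly faster.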

oneflow-ci-bot merged commit d51d893 into master

oneflow-ci-bot deleted the fix_sync_nccl_and_async_nccl_deadlock branch

August 31, 2021 07:35

daquexian added a commit that referenced this pull request


          implement #6071 in another way

a0bf6ea

Signed-off-by: daquexian <daquexian566@gmail.com>

daquexian added a commit that referenced this pull request


          reimplement dep object part of #6071

9b6e496

Signed-off-by: daquexian <daquexian566@gmail.com>

daquexian mentioned this pull request

restore instr_local_dep_object_pool_size for nccl #6160

Merged

Labels

automerge bug enhancement system