Fix sync nccl and async nccl deadlock #6071

Merged
merged 45 commits into from
Aug 31, 2021

liufengwei0103

Fix sync nccl and async nccl deadlock

auto* device_dep_object = opkernel().device()->mut_compute_local_dep_object();
if (opkernel().device()->type() == "nccl") {
  const auto& device = opkernel().device();
  const auto& opt_transport_dep_object = device->mut_transport_local_dep_object();

This serializes all transport-type instructions.
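
The effect of a shared dependency object can be sketched with a toy scheduler. This is a minimal illustration, not OneFlow's VM: `schedule_levels` and the dep-object names (`sync_nccl_compute`, `async_nccl_compute`, `transport`) are hypothetical, and the only assumption is that an instruction must wait for every earlier instruction that mutates one of the same dependency objects.

```python
def schedule_levels(instructions):
    """Assign each instruction the earliest step at which it may run.

    `instructions` is a list of (name, dep_object_names) in issue order.
    An instruction must run after every earlier instruction that mutates
    one of the same dependency objects.
    """
    last_step = {}  # dep object name -> step of its most recent user
    steps = {}
    for name, dep_objects in instructions:
        step = 1 + max((last_step.get(d, -1) for d in dep_objects), default=-1)
        steps[name] = step
        for d in dep_objects:
            last_step[d] = step
    return steps


# Only per-device compute dep objects: the two instructions are unordered.
independent = schedule_levels([
    ("sync_allreduce", ["sync_nccl_compute"]),
    ("async_allreduce", ["async_nccl_compute"]),
])

# Adding one shared transport dep object forces them into issue order.
serialized = schedule_levels([
    ("sync_allreduce", ["sync_nccl_compute", "transport"]),
    ("async_allreduce", ["async_nccl_compute", "transport"]),
])

print(independent)  # both instructions land on step 0
print(serialized)   # sync_allreduce on step 0, async_allreduce on step 1
```

In this model the shared object trades away the freedom to reorder the two transport instructions in exchange for a deterministic order.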

void RunLazyJobPhyInstrOperand::ForEachMutMirroredObject(
    const std::function<void(vm::MirroredObject* infer, vm::MirroredObject* compute)>& DoEach)
    const {
  DoEach(nullptr, CHECK_JUST(GetCommNetLocalDepObject())->mut_mirrored_object());

This serializes all lazy execution with eager transport-type execution.

Comment on lines 106 to 108
{"comm_net", Optional<std::string>("comm_net")},
{"sync_launched_nccl", Optional<std::string>("comm_net")},
{"async_launched_nccl", Optional<std::string>("comm_net")},

This serializes all instructions on the comm_net, sync_launched_nccl, and async_launched_nccl devices.

Is it really necessary to serialize comm_net against async_launched_nccl? Could this be made more precise?
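
A small sketch of what the table in the diff implies, assuming the mapped string names the dependency object a device tag shares. The helper `serialized_together`, the `FINER_GRAINED` table, and the `"nccl"` key are hypothetical and only illustrate the reviewer's question; they are not taken from the PR.

```python
# Mirrors the three entries shown in the diff: each device tag maps to the
# name of the transport dep object it uses.
TRANSPORT_DEP_OBJECT = {
    "comm_net": "comm_net",
    "sync_launched_nccl": "comm_net",
    "async_launched_nccl": "comm_net",
}

# Hypothetical finer-grained alternative: the two NCCL tags share one object,
# while comm_net keeps its own.
FINER_GRAINED = {
    "comm_net": "comm_net",
    "sync_launched_nccl": "nccl",
    "async_launched_nccl": "nccl",
}


def serialized_together(tag_a, tag_b, table):
    """Two device tags are ordered against each other iff they share a dep object."""
    key_a = table.get(tag_a)
    key_b = table.get(tag_b)
    return key_a is not None and key_a == key_b


print(serialized_together("comm_net", "async_launched_nccl", TRANSPORT_DEP_OBJECT))
print(serialized_together("comm_net", "async_launched_nccl", FINER_GRAINED))
print(serialized_together("sync_launched_nccl", "async_launched_nccl", FINER_GRAINED))
```

Whether the finer-grained table would still avoid the deadlock depends on how comm_net and NCCL traffic interact, which this sketch does not model.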

Comment on lines +44 to +45
synced_y = sync_allreduce(sync_x)
asynced_y = async_allreduce(async_x)

Sync and async allreduce run alternately here; if the ordering is wrong, it is guaranteed to abort.

Will async_allreduce be executed by an actor?

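No. It is executed by the async_cuda thread. This is the optimization of the backward-pass allreduce in traditional data parallelism, which lets computation and communication overlap.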
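
Why ordering matters in the test above can be shown with a simplified model. It assumes that a collective completes only when every rank has reached the same collective, so ranks that issue the sync and async allreduce in different orders block each other. `collectives_complete` is a hypothetical helper and does not call NCCL or OneFlow.

```python
def collectives_complete(per_rank_order):
    """Return True if all ranks can finish their collectives, False if they hang.

    `per_rank_order` holds, for each rank, the order in which that rank
    issues its collectives. A collective finishes only when it is at the
    head of every rank's queue at the same time.
    """
    queues = [list(order) for order in per_rank_order]
    while any(queues):
        heads = {q[0] if q else None for q in queues}
        if len(heads) != 1:
            return False  # ranks are waiting inside different collectives
        for q in queues:
            q.pop(0)
    return True


# Every rank issues sync then async: both collectives complete.
print(collectives_complete([["sync", "async"], ["sync", "async"]]))

# Rank 1 issues them in the opposite order: each rank waits on the other.
print(collectives_complete([["sync", "async"], ["async", "sync"]]))
```

Serializing the two launch paths through one dependency object is one way to make every rank issue them in the same order.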

@github-actions

CI failed, removing label automerge

yuanms2 commented Aug 28, 2021

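A thought out of curiosity: we originally introduced the VM to get more pipelining and out-of-order execution. We have now found scenarios where the order must be fixed, so dep objects are being added in many places, which amounts to giving up some of that out-of-order execution. How large is the impact, and how much of our earlier out-of-order execution design is still useful?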

@github-actions

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 127.5ms (= 6373.8ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 142.5ms (= 7123.3ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.12 (= 142.5ms / 127.5ms)

OneFlow resnet50 time: 74.2ms (= 3708.4ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 83.3ms (= 4167.0ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.12 (= 83.3ms / 74.2ms)

OneFlow resnet50 time: 47.5ms (= 2374.0ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 57.9ms (= 2892.5ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.22 (= 57.9ms / 47.5ms)

OneFlow resnet50 time: 40.5ms (= 2027.4ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 47.1ms (= 2356.0ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.16 (= 47.1ms / 40.5ms)

OneFlow resnet50 time: 37.0ms (= 1850.5ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 43.7ms (= 2184.7ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.18 (= 43.7ms / 37.0ms)

OneFlow resnet50 time: 141.0ms (= 7052.1ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 152.5ms (= 7626.3ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.08 (= 152.5ms / 141.0ms)

OneFlow resnet50 time: 86.5ms (= 4327.3ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 94.1ms (= 4705.0ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.09 (= 94.1ms / 86.5ms)

OneFlow resnet50 time: 58.7ms (= 2936.2ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 67.8ms (= 3389.5ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 1.15 (= 67.8ms / 58.7ms)

OneFlow resnet50 time: 58.2ms (= 2911.7ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 53.1ms (= 2653.5ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.91 (= 53.1ms / 58.2ms)

OneFlow resnet50 time: 48.9ms (= 2442.8ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow GPU used (rank 0): 0 MiB
PyTorch resnet50 time: 48.0ms (= 2398.9ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
PyTorch GPU used (rank 0, estimated): 0 MiB
Relative speed: 0.98 (= 48.0ms / 48.9ms)

@oneflow-ci-bot oneflow-ci-bot merged commit d51d893 into master Aug 31, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the fix_sync_nccl_and_async_nccl_deadlock branch August 31, 2021 07:35
daquexian added a commit that referenced this pull request Sep 4, 2021
Signed-off-by: daquexian <daquexian566@gmail.com>
daquexian added a commit that referenced this pull request Sep 4, 2021
Signed-off-by: daquexian <daquexian566@gmail.com>
7 participants