-
Notifications
You must be signed in to change notification settings - Fork 662
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix sync nccl and async nccl deadlock #6071
Fix sync nccl and async nccl deadlock #6071
Conversation
…eflow-Inc/oneflow
…eflow-Inc/oneflow into fix_sync_nccl_and_async_nccl_deadlock
auto* device_dep_object = opkernel().device()->mut_compute_local_dep_object(); | ||
if (opkernel().device()->type() == "nccl") { | ||
const auto& device = opkernel().device(); | ||
const auto& opt_transport_dep_object = device->mut_transport_local_dep_object(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里让所有的传输类的指令全部顺序化。
void RunLazyJobPhyInstrOperand::ForEachMutMirroredObject( | ||
const std::function<void(vm::MirroredObject* infer, vm::MirroredObject* compute)>& DoEach) | ||
const { | ||
DoEach(nullptr, CHECK_JUST(GetCommNetLocalDepObject())->mut_mirrored_object()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里让所有lazy的执行和eager传输类执行顺序化。
oneflow/core/framework/device.cpp
Outdated
{"comm_net", Optional<std::string>("comm_net")}, | ||
{"sync_launched_nccl", Optional<std::string>("comm_net")}, | ||
{"async_launched_nccl", Optional<std::string>("comm_net")}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
让comm_net, sync_launched_nccl, async_launched_nccl设备上的指令全部顺序化。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comm_net 和 async_launched_nccl 之间是不是没有必要顺序化,这里有可能做得更精确吗
synced_y = sync_allreduce(sync_x) | ||
asynced_y = async_allreduce(async_x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
同步异步交替执行,如果顺序错误,一定会abort。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
async_allreduce 会由actor 来执行?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
不会。是由async_cuda线程来执行,就是优化传统数据并行后向的allreduce,能让计算和传输重叠。
CI failed, removing label automerge |
CI failed, removing label automerge |
1 similar comment
CI failed, removing label automerge |
新奇思考一下,我们原来加入vm 是为了更多的流水和乱序执行,现在发现了一些必须规定顺序的场景,所以很多地方加入dep obj,相当于又放弃了一些乱序执行,这个影响有多大,以及我们之前那些乱序执行的设计还有多少作用 |
CI failed, removing label automerge |
CI failed, removing label automerge |
Speed stats:
|
Signed-off-by: daquexian <daquexian566@gmail.com>
Signed-off-by: daquexian <daquexian566@gmail.com>
Fix sync nccl and async nccl deadlock