
force sync batch norm grad sequential #52268

Merged

Conversation

@wanghuancoder (Contributor) commented Mar 29, 2023

PR types

Bug fixes

PR changes

Others

Describe

This PR fixes hangs and accuracy errors caused by a mismatch between SyncBatchNorm's forward execution order and its backward execution order.
Because of control flow, in multi-card runs each card may execute different logic, and the number of operators certainly differs across cards.
The forward operator order, including the order of the SyncBN ops, is driven by the Python program logic.
The backward graph, however, runs in DAG topological order: the framework determines it by breadth-first search, executing an operator once all of its inputs are ready. So each SyncBNGrad's position in the schedule is determined by its "distance" from the Loss.
Because of the forward control flow, the order of SyncBNGrad ops can therefore differ across cards, which can cause hangs or incorrect results.
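The fix can be sketched as follows. This is a hypothetical, simplified model of the mechanism, not Paddle's actual implementation: the names GradNode, RunBackward, agreed_order, and the parked set are illustrative. The idea is that grad nodes normally run in BFS order, but nodes flagged as force-sequential (such as SyncBatchNormGrad) may only execute in one fixed order agreed by all cards, so the collective communication calls line up rank-for-rank:

```cpp
#include <cassert>
#include <deque>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Illustrative grad node: `force_sequential` marks nodes (e.g. a
// SyncBatchNormGrad) whose execution order must match across cards.
struct GradNode {
  std::string name;
  bool force_sequential;
};

// `ready` is the BFS-ready worklist (its order may differ across cards);
// `agreed_order` lists every force-sequential node in the order all cards
// must follow. This sketch assumes every flagged node appears in
// `agreed_order`.
std::vector<std::string> RunBackward(std::deque<GradNode*> ready,
                                     std::queue<GradNode*> agreed_order) {
  std::vector<std::string> executed;
  std::set<GradNode*> parked;  // flagged nodes that became ready too early
  while (!ready.empty()) {
    GradNode* node = ready.front();
    ready.pop_front();
    if (node->force_sequential && node != agreed_order.front()) {
      parked.insert(node);  // not its turn yet; hold it back
      continue;
    }
    executed.push_back(node->name);
    if (node->force_sequential) {
      agreed_order.pop();
      // Release the next flagged node if it was already parked.
      if (!agreed_order.empty() && parked.count(agreed_order.front())) {
        ready.push_back(agreed_order.front());
        parked.erase(agreed_order.front());
      }
    }
  }
  return executed;
}
```

With this rule, even if BFS makes one card's second SyncBNGrad ready before its first, the first still executes first on every card, so the collective ops pair up correctly instead of deadlocking or mixing gradients.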

@paddle-bot bot commented Mar 29, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI result first. See the Paddle CI Manual for details.

@paddle-bot bot commented Mar 29, 2023

❌ The PR was not created using the PR template. You can refer to this Demo.
Please use the PR template; it saves our maintainers' time and helps more developers get support.

@JiabinYang (Contributor) left a comment

some comments

// Node Construction
auto grad_node =
std::shared_ptr<SyncBatchNormGradNode>(new SyncBatchNormGradNode(6, 5));
egr::Controller::Instance().PushBackForceSequentialNodes(grad_node.get());
Contributor
AddForceSequentialNodes?

Contributor Author

I think PushBack is better; it conveys that the nodes are kept in order.

force_sequential_nodes_.pop();
}
}
void PushBackForceSequentialNodes(GradNodeBase* node) {
Contributor

AddForceSequentialNodes

@@ -111,6 +111,22 @@ std::vector<paddle::Tensor> RunBackward(
const std::vector<paddle::Tensor>& no_grad_vars = {}) {
VLOG(3) << "Start Backward";

std::queue<GradNodeBase*> force_sequential_nodes_forward_queue =
Contributor

Seal this into one function

Contributor Author

Let's handle it in the next PR.

}
};

if (force_sequential_nodes_set.count(next_node)) {
Contributor

How about adding a fast path for an empty force_sequential_nodes_set?

Contributor Author

Let's handle it in the next PR.

} else {
queue.push_back(std::move(next_node));
ready_force_sequential_nodes.insert(next_node);
continue;
Contributor

no need

Contributor Author

Let's handle it in the next PR.

@JiabinYang (Contributor) left a comment

LGTM; please fix the remaining problems in the next PR ASAP.

@XiaoguangHu01 (Contributor) left a comment

LGTM

@wanghuancoder wanghuancoder merged commit 336160c into PaddlePaddle:develop Mar 30, 2023