NNGraph input/output valid by register tensors #6240
for (const auto& op_name : cur_nn_graph->outputs_op_names()) {
  buffer_mgr->Get(GetOutputBufferName(job_name, op_name))->Send(job_instance);
for (int i = 0; i < cur_nn_graph->outputs_op_names().size(); ++i) {
  if (cur_nn_graph->outputs_valid().at(i)) {
Core logic: based on the inputs/outputs_valid interface provided by NNGraphIf, decide whether to skip Send for the Push/Pull CB.
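To make the skip logic in the comment above concrete, here is a minimal, self-contained sketch. `FakeGraph` and `OutputsToSend` are hypothetical names introduced only for illustration; they model just the two accessors quoted in the diff (`outputs_op_names()` and `outputs_valid()`) and are not OneFlow's actual classes. Instead of calling `Send` on a buffer, the sketch collects the names of the outputs that would be notified.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the graph interface (an assumption, not OneFlow's
// NNGraphIf). Only the two accessors quoted in the diff are modeled.
struct FakeGraph {
  std::vector<std::string> op_names;
  std::vector<bool> valid;
  const std::vector<std::string>& outputs_op_names() const { return op_names; }
  const std::vector<bool>& outputs_valid() const { return valid; }
};

// Returns the names of the output ops whose buffers would receive a Send.
// Outputs marked invalid on this rank are skipped.
std::vector<std::string> OutputsToSend(const FakeGraph& graph) {
  std::vector<std::string> to_send;
  const auto& names = graph.outputs_op_names();
  const auto& valid = graph.outputs_valid();
  for (std::size_t i = 0; i < names.size(); ++i) {
    // at() throws if the two lists ever disagree in length.
    if (valid.at(i)) { to_send.push_back(names[i]); }
  }
  return to_send;
}
```

The index-based loop mirrors the change in the diff: iterating by position lets the name list and the validity list be read in lockstep.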
@@ -36,6 +36,23 @@ limitations under the License.

namespace oneflow {

namespace {

Maybe<bool> GetTensorValidInCurRank(const std::shared_ptr<one::Tensor>& tensor) {
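Determines whether the tensor has a component on the current rank, which indicates whether it is valid.
This branch is ready for review. It fixes the bug found in the issue: execution beyond 128 iterations would deadlock.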
Maybe<bool> GetTensorValidInCurRank(const std::shared_ptr<one::Tensor>& tensor) {
  if (tensor->is_consistent()) {
    const auto& parallel_id = JUST(GetParallelId4CurrentProcessCtx(JUST(tensor->parallel_desc())));
    if (parallel_id->has_value()) {
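parallel_id appears to be the rank id within a placement. If the current global rank does not belong to the tensor's placement, the lookup returns an empty in-placement rank id, meaning the tensor has no component on that global rank.
Perhaps this check for whether the tensor has a component on the current rank could later be wrapped in a helper to make it more direct?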
Yes.
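The wrapper suggested in this thread might look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: `FakePlacement`, `ParallelIdForRank`, and `HasComponentOnRank` are hypothetical names, and the placement is reduced to a plain map from global rank to in-placement rank id. The sketch requires C++17 for `std::optional`.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Hypothetical, simplified placement (an assumption, not OneFlow's
// ParallelDesc): maps a global rank to its rank id inside the placement.
struct FakePlacement {
  std::unordered_map<std::int64_t, std::int64_t> global_rank_to_parallel_id;
};

// Returns the in-placement rank id for a global rank, or an empty optional
// when the global rank is not part of the placement.
std::optional<std::int64_t> ParallelIdForRank(const FakePlacement& placement,
                                              std::int64_t global_rank) {
  auto it = placement.global_rank_to_parallel_id.find(global_rank);
  if (it == placement.global_rank_to_parallel_id.end()) { return std::nullopt; }
  return it->second;
}

// The wrapper itself: a tensor placed on `placement` has a component on
// `global_rank` exactly when the lookup yields a value.
bool HasComponentOnRank(const FakePlacement& placement, std::int64_t global_rank) {
  return ParallelIdForRank(placement, global_rank).has_value();
}
```

Hiding the optional behind a boolean helper keeps call sites from having to know that an empty in-placement rank id is the signal for "no component on this rank".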
CI failed, removing label automerge
Speed stats:
Also intended to resolve:
- [ ] Check on every RunLazyNNGraph call that the input tensor meta information is valid: NNGraph RunLazyJob check static input tensor meta #6243