NNGraph input/output valid by register tensors #6240

Merged
merged 7 commits into from Sep 12, 2021


@chengtbf chengtbf commented Sep 11, 2021

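This PR addresses two problems at once:

  1. For asymmetric Consistent input/output tensors of nn.Graph, whether the tensor has a component on each rank now decides whether the Push/Pull CB is sent on that rank.
  2. nn.Graph checks that the meta information of the input tensors at every step matches the tensor meta recorded at compile time.
  • nn.Graph: register_input/output_op_names -> register_input/output_op_names_and_tensors
  • NNGraphIf provides the input/output_valid interfaces, which indicate whether each input/output tensor has a component on the current rank
  • When RunLazyJobInstructionType::Compute sends the JobInstance's Push/Pull CB to the BufferMgr, it filters by io valid and skips ops that have no component on this rank
    - [ ] Check on every RunLazyNNGraph that the input tensor meta is valid: NNGraph RunLazyJob check static input tensor meta #6243
  • The unit test for an asymmetric nn.Graph that runs past the buffer size passes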
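
The second goal above, comparing each step's input tensor meta against what was recorded at compile time, can be illustrated with a minimal sketch. `TensorMeta` and `FirstMismatchedInput` are hypothetical names invented for this illustration and are not OneFlow APIs; the actual check is tracked in #6243 and may compare different fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the meta recorded for one input tensor
// when the graph is compiled (assumption: shape and dtype only).
struct TensorMeta {
  std::vector<int64_t> shape;
  std::string dtype;

  bool operator==(const TensorMeta& other) const {
    return shape == other.shape && dtype == other.dtype;
  }
};

// Returns the index of the first input whose current meta differs from
// the compile-time meta, or -1 when every input matches.
int FirstMismatchedInput(const std::vector<TensorMeta>& compiled,
                         const std::vector<TensorMeta>& current) {
  // The number of inputs is fixed once the graph has been compiled.
  assert(compiled.size() == current.size());
  for (std::size_t i = 0; i < compiled.size(); ++i) {
    if (!(compiled[i] == current[i])) { return static_cast<int>(i); }
  }
  return -1;
}
```

A caller would run this before launching each step and report the offending input instead of letting a mismatched tensor reach the runtime.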

@chengtbf chengtbf added feature bottleneck blocking another feature/PR bug WIP work in progress system interface labels Sep 11, 2021
@chengtbf chengtbf marked this pull request as ready for review September 12, 2021 06:20
for (const auto& op_name : cur_nn_graph->outputs_op_names()) {
buffer_mgr->Get(GetOutputBufferName(job_name, op_name))->Send(job_instance);
for (int i = 0; i < cur_nn_graph->outputs_op_names().size(); ++i) {
if (cur_nn_graph->outputs_valid().at(i)) {
Contributor Author

Core logic: based on the inputs/outputs_valid interfaces provided by NNGraphIf, decide whether to skip sending the Push/Pull CB.
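
The filtering described in this comment can be sketched in isolation. `OpsToNotify` is a hypothetical helper written for this illustration; in the PR the loop calls `buffer_mgr->Get(...)->Send(job_instance)` directly, and only the index-aligned names and validity flags are modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: given op names and index-aligned validity flags,
// returns the op names whose buffers should receive the job instance on
// this rank. Ops without a local component are skipped.
std::vector<std::string> OpsToNotify(const std::vector<std::string>& op_names,
                                     const std::vector<bool>& valid) {
  // One flag per op name, in the same order.
  assert(op_names.size() == valid.size());
  std::vector<std::string> result;
  for (std::size_t i = 0; i < op_names.size(); ++i) {
    if (valid.at(i)) { result.push_back(op_names.at(i)); }
  }
  return result;
}
```

Keeping the flags index-aligned with the op names is what lets the loop in the diff switch from a range-based `for` to an indexed one.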

@@ -36,6 +36,23 @@ limitations under the License.

namespace oneflow {

namespace {

Maybe<bool> GetTensorValidInCurRank(const std::shared_ptr<one::Tensor>& tensor) {
Contributor Author

Determines whether the tensor has a component on the current rank, which is what "valid" means here.

@chengtbf
Contributor Author

chengtbf commented Sep 12, 2021

This branch is ready for review. It resolves the issue:

the bug found where execution deadlocks after more than 128 iterations

@leaves-zwx @strint @lixinqi @doombeaker

Maybe<bool> GetTensorValidInCurRank(const std::shared_ptr<one::Tensor>& tensor) {
if (tensor->is_consistent()) {
const auto& parallel_id = JUST(GetParallelId4CurrentProcessCtx(JUST(tensor->parallel_desc())));
if (parallel_id->has_value()) {
Contributor

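parallel_id appears to be the rank id within a placement. If the current global rank does not belong to the tensor's placement, the lookup returns an empty in-placement rank id, which means the tensor has no component on that global rank.

Perhaps this check for whether a tensor has a component on the current rank could later be wrapped in a more direct interface?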

Contributor Author

Yes.
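
One way the wrapper suggested in this thread could look is sketched below. `Placement`, `ParallelIdInPlacement`, and `HasComponentOnRank` are hypothetical names for this illustration only; the real code goes through `GetParallelId4CurrentProcessCtx` and OneFlow's own types, and the sketch assumes a placement can be modeled as an ordered list of global ranks. It needs C++17 for `std::optional`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical placement: the ordered global ranks that hold a piece
// of the tensor.
struct Placement {
  std::vector<int64_t> ranks;
};

// Returns the rank id inside the placement for `global_rank`, or an
// empty optional when that rank is not part of the placement.
std::optional<int64_t> ParallelIdInPlacement(const Placement& placement,
                                             int64_t global_rank) {
  assert(global_rank >= 0);
  for (std::size_t i = 0; i < placement.ranks.size(); ++i) {
    if (placement.ranks[i] == global_rank) { return static_cast<int64_t>(i); }
  }
  return std::nullopt;
}

// The more direct interface: true when the tensor has a component on
// `global_rank`.
bool HasComponentOnRank(const Placement& placement, int64_t global_rank) {
  return ParallelIdInPlacement(placement, global_rank).has_value();
}
```

With such a wrapper, call sites would ask the yes/no question directly instead of inspecting an optional parallel id.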

@chengtbf chengtbf removed the WIP work in progress label Sep 12, 2021
@chengtbf chengtbf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot September 12, 2021 14:24
@oneflow-ci-bot oneflow-ci-bot removed their request for review September 12, 2021 14:31
@github-actions
Contributor

CI failed, removing label automerge

@oneflow-ci-bot oneflow-ci-bot removed their request for review September 12, 2021 14:55
@github-actions
Contributor

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 128.0ms (= 6398.8ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 141.4ms (= 7069.4ms / 50, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.10 (= 141.4ms / 128.0ms)

OneFlow resnet50 time: 74.2ms (= 3711.3ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 83.0ms (= 4147.6ms / 50, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.12 (= 83.0ms / 74.2ms)

OneFlow resnet50 time: 47.9ms (= 2397.2ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 58.8ms (= 2942.0ms / 50, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.23 (= 58.8ms / 47.9ms)

OneFlow resnet50 time: 41.5ms (= 2077.0ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 48.8ms (= 2442.1ms / 50, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.18 (= 48.8ms / 41.5ms)

OneFlow resnet50 time: 34.6ms (= 1729.3ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 38.9ms (= 1946.9ms / 50, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.13 (= 38.9ms / 34.6ms)

OneFlow resnet50 time: 160.4ms (= 8020.4ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 166.0ms (= 8297.7ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.03 (= 166.0ms / 160.4ms)

OneFlow resnet50 time: 102.9ms (= 5144.7ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 111.8ms (= 5589.3ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.09 (= 111.8ms / 102.9ms)

OneFlow resnet50 time: 75.3ms (= 3762.8ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 80.9ms (= 4045.7ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.08 (= 80.9ms / 75.3ms)

OneFlow resnet50 time: 70.1ms (= 3506.1ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.2ms (= 3361.5ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.96 (= 67.2ms / 70.1ms)

OneFlow resnet50 time: 72.6ms (= 3629.6ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.6ms (= 3280.7ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.90 (= 65.6ms / 72.6ms)

@oneflow-ci-bot oneflow-ci-bot merged commit 8f67c6b into master Sep 12, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the dev_cc_nngraph_io_valid branch September 12, 2021 17:05
@chengtbf chengtbf restored the dev_cc_nngraph_io_valid branch September 12, 2021 17:31
@chengtbf chengtbf deleted the dev_cc_nngraph_io_valid branch September 13, 2021 14:56
Successfully merging this pull request may close these issues.

Pipeline parallelism error log: header_size == rhs->blob_desc().ByteSizeOfBlobHeader() check fails
3 participants