
Task infer blob desc support choosing method #10124

Merged
merged 18 commits into master from sep2_custom_blobdesc_infer on Apr 20, 2023

Conversation

strint
Contributor

@strint strint commented Apr 13, 2023

No description provided.

@strint strint requested a review from chengtbf as a code owner April 13, 2023 04:41
@@ -71,6 +71,9 @@ class TaskNode : public Node<TaskNode, TaskEdge> {
DeviceType device_type() const;
virtual const ParallelContext* parallel_ctx() const { return nullptr; }

// Different types of ExecNode choose different output BlobDesc inference methods
Contributor

ExecNode doesn't have a type; what's meant here should be the TaskNode type, right?

Contributor Author

> ExecNode doesn't have a type; what's meant here should be the TaskNode type, right?

I'll change it.

Contributor Author

done

@@ -72,7 +72,8 @@ class ExecNode final : public Node<ExecNode, ExecEdge> {
std::string VisualStr() const override { return op_->op_name(); }
void ToProto(const ParallelContext*, ExecNodeProto*) const;

void InferBlobDescs(const ParallelContext* parallel_ctx);
typedef void (ExecNode::*InferBlobDescsMethod)(const ParallelContext*);
void InferBlobDescsByInputs(const ParallelContext* parallel_ctx);
Contributor

Is only one method provided here?

Contributor Author

> Is only one method provided here?

Yes. This branch doesn't change the execution logic; it only provides the interface.
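
To make the new interface concrete, below is a minimal standalone sketch of the pointer-to-member dispatch being introduced; the stub ParallelContext, the InferExecNodeBlobDescs call site, and main are illustrative assumptions, not the actual OneFlow code.

#include <iostream>

// Stub standing in for oneflow::ParallelContext, just for this sketch.
struct ParallelContext {};

class ExecNode {
 public:
  // Pointer-to-member type: any ExecNode method taking a const ParallelContext*.
  typedef void (ExecNode::*InferBlobDescsMethod)(const ParallelContext*);

  // The only method provided in this PR: infer output BlobDescs from inputs.
  void InferBlobDescsByInputs(const ParallelContext* /*parallel_ctx*/) {
    std::cout << "infer output BlobDescs from input BlobDescs\n";
  }
};

class TaskNode {
 public:
  virtual ~TaskNode() = default;

  // Each TaskNode subtype overrides this to choose the inference method
  // used for its ExecNodes.
  virtual ExecNode::InferBlobDescsMethod GetInferBlobDescsMethod() const {
    return &ExecNode::InferBlobDescsByInputs;
  }

  // Hypothetical call site: dispatch through the chosen pointer-to-member.
  void InferExecNodeBlobDescs(ExecNode* exec_node, const ParallelContext* ctx) const {
    (exec_node->*GetInferBlobDescsMethod())(ctx);
  }
};

int main() {
  TaskNode task;
  ExecNode exec;
  ParallelContext ctx;
  task.InferExecNodeBlobDescs(&exec, &ctx);  // prints: infer output BlobDescs from input BlobDescs
  return 0;
}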

@@ -43,6 +43,10 @@ class CompTaskNode : public TaskNode {
// op
std::shared_ptr<const Operator> op() const { return op_node_->shared_op(); }

ExecNode::InferBlobDescsMethod GetInferBlobDescsMethod() const override {
return &ExecNode::InferBlobDescsByInputs;
Contributor

Shouldn't this also support inference from SBP and the logical shape?

Contributor Author

> Shouldn't this also support inference from SBP and the logical shape?

The from-SBP inference method is coupled with the compilation mode, so it wasn't added in this branch.

Contributor

Actually, to speed things up, couldn't master compilation also use from-SBP here? Wouldn't that be faster?

Contributor Author

> Actually, to speed things up, couldn't master compilation also use from-SBP here? Wouldn't that be faster?

Right now, using from-SBP is unlikely to be faster:

  • The cost of from-SBP inference itself is probably about the same as the existing physical BlobDesc inference.
  • It only works for user ops; making it general would touch many places, and a follow-up PR is working on that.
  • Without parallelizing the master-side inference, the speedup would not be noticeable.

Contributor

Some ops don't support from-SBP, right? For example ops that involve averaging, summation, or taking a maximum. (I remember there is currently one op like that, maybe called bn?)

Another question: what happens if the SBP changes? Currently it would have to be re-inferred. For example, auto-parallelism modifies SBP on a large scale; in that case it would be better to keep a stored logical desc.

Contributor Author

> Some ops don't support from-SBP, right? For example ops that involve averaging, summation, or taking a maximum. (I remember there is currently one op like that, maybe called bn?)

All user ops do satisfy it. And in the scenario used here, the result of the earlier physical inference is also checked against the SBP afterwards, so it is consistent as well.

> Another question: what happens if the SBP changes? Currently it would have to be re-inferred. For example, auto-parallelism modifies SBP on a large scale; in that case it would be better to keep a stored logical desc.

Because this inference happens during plan generation, which comes after auto-parallelism, the SBP should already be stable by then.
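
As a follow-up illustration of the discussion above, here is a hedged sketch of how a from-SBP method could later slot in as an alternative InferBlobDescsMethod; the method name InferBlobDescsByLogicalShapeAndSbp and the SbpAwareCompTaskNode subtype are hypothetical and not part of this PR.

#include <iostream>

struct ParallelContext {};  // stub, as in the previous sketch

class ExecNode {
 public:
  typedef void (ExecNode::*InferBlobDescsMethod)(const ParallelContext*);

  // Method actually added by this PR.
  void InferBlobDescsByInputs(const ParallelContext*) {
    std::cout << "infer from input BlobDescs\n";
  }
  // Hypothetical future method: infer from the logical shape plus SBP.
  void InferBlobDescsByLogicalShapeAndSbp(const ParallelContext*) {
    std::cout << "infer from logical shape and SBP\n";
  }
};

class TaskNode {
 public:
  virtual ~TaskNode() = default;
  virtual ExecNode::InferBlobDescsMethod GetInferBlobDescsMethod() const {
    return &ExecNode::InferBlobDescsByInputs;
  }
};

// Hypothetical TaskNode subtype that opts into from-SBP inference once the
// follow-up PR mentioned above makes it generally applicable.
class SbpAwareCompTaskNode : public TaskNode {
 public:
  ExecNode::InferBlobDescsMethod GetInferBlobDescsMethod() const override {
    return &ExecNode::InferBlobDescsByLogicalShapeAndSbp;
  }
};

int main() {
  ExecNode exec;
  ParallelContext ctx;
  SbpAwareCompTaskNode task;
  (exec.*task.GetInferBlobDescsMethod())(&ctx);  // prints: infer from logical shape and SBP
  return 0;
}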

@strint strint changed the title from "support infer desc choose method" to "Task infer blob desc support choosing method" on Apr 16, 2023
strint and others added 2 commits April 17, 2023 18:25
Co-authored-by: Yipeng Li <jamesonli1313@gmail.com>
@github-actions

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

Contributor

@Yipeng1994 Yipeng1994 left a comment

LGTM

Base automatically changed from sep0_task_proto to master April 19, 2023 21:58
@github-actions

Speed stats:
GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 141.3ms (= 14131.6ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 147.3ms (= 14725.1ms / 100, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.04 (= 147.3ms / 141.3ms)

OneFlow resnet50 time: 82.8ms (= 8283.8ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 93.5ms (= 9346.2ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.13 (= 93.5ms / 82.8ms)

OneFlow resnet50 time: 51.6ms (= 10315.5ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 70.5ms (= 14095.9ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.37 (= 70.5ms / 51.6ms)

OneFlow resnet50 time: 34.0ms (= 6805.8ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 64.7ms (= 12940.5ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.90 (= 64.7ms / 34.0ms)

OneFlow resnet50 time: 25.7ms (= 5147.8ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 61.7ms (= 12343.9ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 2.40 (= 61.7ms / 25.7ms)

OneFlow swin dataloader time: 0.242s (= 48.435s / 200, num_workers=1)
PyTorch swin dataloader time: 0.150s (= 29.976s / 200, num_workers=1)
Relative speed: 0.619 (= 0.150s / 0.242s)

OneFlow swin dataloader time: 0.072s (= 14.333s / 200, num_workers=4)
PyTorch swin dataloader time: 0.042s (= 8.374s / 200, num_workers=4)
Relative speed: 0.584 (= 0.042s / 0.072s)

OneFlow swin dataloader time: 0.046s (= 9.208s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.387s / 200, num_workers=8)
Relative speed: 0.476 (= 0.022s / 0.046s)

❌ OneFlow resnet50 time: 154.4ms (= 15436.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 165.0ms (= 16499.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 1.07 (= 165.0ms / 154.4ms)

OneFlow resnet50 time: 94.5ms (= 9453.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 104.3ms (= 10429.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.10 (= 104.3ms / 94.5ms)

OneFlow resnet50 time: 62.0ms (= 12395.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 88.7ms (= 17742.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.43 (= 88.7ms / 62.0ms)

OneFlow resnet50 time: 44.2ms (= 8834.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.2ms (= 14430.5ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.63 (= 72.2ms / 44.2ms)

OneFlow resnet50 time: 37.4ms (= 7487.9ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.8ms (= 14151.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.89 (= 70.8ms / 37.4ms)

@github-actions

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/10124/

@mergify mergify bot merged commit 2d54365 into master Apr 20, 2023
@mergify mergify bot deleted the sep2_custom_blobdesc_infer branch April 20, 2023 07:32