
Plan rank compiler #10141

Merged · 34 commits · merged into master on Jun 4, 2023
Conversation

@strint strint (Contributor) commented Apr 14, 2023

No description provided.

@strint strint requested a review from chengtbf as a code owner April 14, 2023 17:38
@strint strint changed the title from "add plan rank compiler" to "Plan rank compiler" Apr 16, 2023
Base automatically changed from sep3_fake_regst to master May 6, 2023 03:40
@github-actions:

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

@@ -622,4 +649,17 @@ void OpGraph::PrintSBPGraphDebugInfo() const {
}
}

OpGraphSingletonGuard::OpGraphSingletonGuard(const Job& job) {
Contributor Author:

Provides an RAII-style OpGraph.
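
For illustration, a minimal sketch of what such an RAII guard can look like, assuming the usual Singleton<OpGraph>::New/Delete helpers from the OneFlow codebase; the guard's actual body lives in the diff above and may differ:

class OpGraphSingletonGuard final {
 public:
  explicit OpGraphSingletonGuard(const Job& job) {
    // Create the process-wide OpGraph when the guard is constructed ...
    Singleton<OpGraph>::New(job);
  }
  ~OpGraphSingletonGuard() {
    // ... and tear it down automatically when the guard leaves scope.
    Singleton<OpGraph>::Delete();
  }
  OpGraphSingletonGuard(const OpGraphSingletonGuard&) = delete;
  OpGraphSingletonGuard& operator=(const OpGraphSingletonGuard&) = delete;
};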

comp_task_node->set_thrd_id(EncodeStreamIdToInt64(StreamId{device_id, stream_index}));
comp_task_node->set_op_node(op_node);
sorted_comp_tasks->emplace_back(comp_task_node);
sorted_comp_tasks->emplace_back(GenCompTaskNode(op_node, parallel_idx++, &GetStreamId));
Contributor Author:

This just extracts the logic above into a GenCompTaskNode function.
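
For context, a rough sketch of what the extracted helper might look like, reconstructed from the snippet above; the exact signature, the NewCompTaskNode4OpNode factory, and the assumption that GetStreamId returns a StreamId are illustrative, not the PR's literal code:

template<typename GetStreamIdT>
CompTaskNode* GenCompTaskNode(OpNode* op_node, int64_t parallel_idx,
                              const GetStreamIdT* GetStreamId) {
  CompTaskNode* comp_task_node = NewCompTaskNode4OpNode(op_node);  // assumed factory
  // Same settings as the lines above, just folded behind one helper call.
  comp_task_node->set_thrd_id(EncodeStreamIdToInt64((*GetStreamId)(op_node, parallel_idx)));
  comp_task_node->set_op_node(op_node);
  return comp_task_node;
}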

@@ -491,47 +533,6 @@ Maybe<void> RegisterCreateSubTskGphBuilderFn(DeviceType device_type,
return Maybe<void>::Ok();
}

TaskGraph::TaskGraph() {
Contributor Author:

The task graph's construction logic has been moved into the subclasses.
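
As a rough outline of the split (the class names appear elsewhere in this PR, but the return types and member lists here are placeholders, not the actual declarations):

class TaskGraph : public Graph<TaskNode, TaskEdge> {
 protected:
  TaskGraph() = default;  // the base no longer builds the whole graph in its constructor
};

class GlobalTaskGraph final : public TaskGraph {
 public:
  Maybe<void> Init();  // builds the original, complete task graph
};

class RankTaskGraph final : public TaskGraph {
 public:
  // Builds only the part of the graph a single rank needs, starting from the
  // boxing task graph proto (argument types approximated).
  static Maybe<RankTaskGraph> New(const BoxingTaskGraphProto& boxing_task_graph_proto,
                                  const HashSet<std::string>& var_op_names, int64_t rank);
};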

InplaceObasInfo safe_inplace_obas_info;
GetSafeInplaceOpBlobArgList(&safe_inplace_obas_info, dev_nodes, IsOpNameDataOrCtrlReachable);
SetTaskRegstInplaceInfo(safe_inplace_obas_info, dev_nodes);
EnableInplaceMemSharing(dev_nodes, IsOpNameDataOrCtrlReachable);
Contributor Author:

Just splits the function out.

|| (straighten_algorithm_tag == StraightenAlgorithmTag::kOverlap4Transfer
&& GlobalProcessCtx::WorldSize() == 1)) {
InitOrderedTaskNodes();
Maybe<void> GlobalTaskGraph::Init() {
Contributor Author:

The original task graph, i.e. the complete task graph.

@@ -237,7 +237,7 @@ void GenChunkForMultiNNGraphMemoryReuseInMultiClient(

} // namespace

void PlanUtil::MergeMemBlockIdByLogicalChainId(Plan* plan, const Job& job) {
void PlanUtil::MergeMemBlockIdByLogicalChainId(Plan* plan, const Job& job, int64_t limited_rank) {
Contributor Author:

During separated compilation we need to filter for the valid rank.
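
A minimal sketch of what "filter for the valid rank" could mean in practice; the helper name and the machine_id() accessor are assumptions for illustration, not the PR's actual check:

bool IsOnLimitedRank(const TaskProto& task, int64_t limited_rank) {
  // A negative limited_rank can stand for "no restriction" (the original
  // whole-plan behaviour); otherwise only tasks on that rank are considered.
  return limited_rank < 0 || task.machine_id() == limited_rank;
}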

@@ -801,10 +816,13 @@ std::function<RegstDescProto*(int64_t)> PlanUtil::MakeMutRegstDesc4Id(Plan* plan
};
}

void PlanUtil::SetForceInplaceMemBlock(Plan* plan) {
void PlanUtil::SetForceInplaceMemBlock(Plan* plan, int64_t limited_rank) {
Contributor Author:

During separated compilation we need to filter for the valid rank.

#ifdef WITH_CUDA
// Use the right device when some plan compilation needs cuda to avoid creating unnecessary cuda
// context on cuda:0.
CudaCurrentDeviceGuard guard(GetCudaDeviceIndex());
Contributor Author:

Fixes the problem of extra CUDA contexts being created during separated compilation.


} // namespace

Maybe<void> RankCompiler::Compile(const HashSet<std::string>& var_op_names, Job* job,
Contributor Author:

A per-rank compiler, as opposed to the original global compiler.
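
Conceptually, the difference could be pictured with a hypothetical driver like the one below; the constructor arguments and the trailing parameters of Compile are assumptions, since the snippet above truncates the signature after Job* job:

Maybe<void> CompilePlanForRank(int64_t rank, const HashSet<std::string>& var_op_names,
                               Job* job, Plan* rank_plan) {
  // The original flow ran one global Compiler that produced the plan for every
  // rank; here each rank only compiles its own slice of the plan.
  RankCompiler compiler(rank);                            // constructor args assumed
  JUST(compiler.Compile(var_op_names, job, rank_plan));   // trailing args assumed
  return Maybe<void>::Ok();
}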

// context on cuda:0.
CudaCurrentDeviceGuard guard(GetCudaDeviceIndex());
#endif // WITH_CUDA
auto task_gph = JUST(RankTaskGraph::New(boxing_task_graph_proto_, var_op_names, rank_));
Contributor Author:

Compilation continues from the boxing task graph.

@strint strint added the graph graph mode label May 12, 2023
@github-actions:

Speed stats:

IntraJobMemSharingUtil::InferMemBlockId4MemReusedRegst(plan);
PlanUtil::MergeMemBlockIdByLogicalChainId(plan, *job, rank_);
PlanUtil::SetUniqueMemBlockId4UnreusedMemRegst(plan);
PlanUtil::SetForceInplaceMemBlock(plan, rank_);
Contributor Author:

The rest of the flow here is similar to the previous plan compiler; only certain places need extra handling for the single-rank case.

@@ -27,26 +27,6 @@ limitations under the License.

namespace oneflow {

void CreateOpAttributeRef(Plan* plan, int64_t job_id, TaskProto* task_proto) {
Contributor Author:

Moved into PlanUtil as a shared utility function.

@github-actions:

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/10141/

@github-actions:

Speed stats:
GPU Name: NVIDIA GeForce RTX 3090 

❌ OneFlow resnet50 time: 42.5ms (= 4254.8ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 57.5ms (= 5750.7ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.35 (= 57.5ms / 42.5ms)

OneFlow resnet50 time: 26.1ms (= 2607.9ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 37.3ms (= 3733.9ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.43 (= 37.3ms / 26.1ms)

OneFlow resnet50 time: 18.6ms (= 3717.5ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 35.5ms (= 7104.8ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.91 (= 35.5ms / 18.6ms)

OneFlow resnet50 time: 18.0ms (= 3609.8ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 32.3ms (= 6451.4ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.79 (= 32.3ms / 18.0ms)

OneFlow resnet50 time: 17.6ms (= 3528.1ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 29.5ms (= 5900.4ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.67 (= 29.5ms / 17.6ms)

OneFlow swin dataloader time: 0.202s (= 40.463s / 200, num_workers=1)
PyTorch swin dataloader time: 0.129s (= 25.763s / 200, num_workers=1)
Relative speed: 0.637 (= 0.129s / 0.202s)

OneFlow swin dataloader time: 0.053s (= 10.566s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.566s / 200, num_workers=4)
Relative speed: 0.621 (= 0.033s / 0.053s)

OneFlow swin dataloader time: 0.030s (= 6.088s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.330s / 200, num_workers=8)
Relative speed: 0.547 (= 0.017s / 0.030s)

❌ OneFlow resnet50 time: 48.6ms (= 4861.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.8ms (= 6580.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.35 (= 65.8ms / 48.6ms)

OneFlow resnet50 time: 36.8ms (= 3683.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 47.2ms (= 4716.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.28 (= 47.2ms / 36.8ms)

OneFlow resnet50 time: 28.5ms (= 5703.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.5ms (= 7905.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.39 (= 39.5ms / 28.5ms)

OneFlow resnet50 time: 25.4ms (= 5087.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.5ms (= 7902.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.55 (= 39.5ms / 25.4ms)

OneFlow resnet50 time: 23.7ms (= 4740.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 36.0ms (= 7193.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.52 (= 36.0ms / 23.7ms)

@@ -399,6 +399,7 @@ message OperatorConf {
optional string loc = 11 [default = ""];
optional int64 logical_chain_id = 12 [default = -1];
optional int64 order_in_logical_chain = 13 [default = -1];
optional string calculation_pass_name = 14 [default = "forward_pass"];
Contributor:

Hmm... it looks like:

optional string pass_tag = 10;

might already be this calculation pass name; could you search for that keyword?

Contributor Author:

I checked; pass_tag takes the following values:

static const std::string kNoPassTag = "";
static const std::string kMainOp = "main_op";

while calculation_pass_name takes these values:

const std::string kForwardPass = "forward_pass";
const std::string kBackwardPass = "backward_pass";
const std::string kOptimizerPass = "optimizer_pass";

So they don't look like the same thing.

Contributor:

😂 Sorry, I misremembered.

Maybe<void> Graph<NodeType, EdgeType>::MaybeForEachEdge(
std::function<Maybe<void>(EdgeType*)> EdgeHandler) const {
for (auto& x : edges_) {
if (x->src_node() == nullptr && x->dst_node() == nullptr) { continue; }
Contributor:

Under what circumstances would this be nullptr?

Contributor Author:

Under what circumstances would this be nullptr?

This was adapted from the earlier non-Maybe version, presumably to be rigorous: by default, an edge's src_node and dst_node are both nullptr right after the edge is created.

template<typename NodeType, typename EdgeType>
void Graph<NodeType, EdgeType>::ForEachEdge(std::function<void(EdgeType*)> EdgeHandler) const {
  for (auto& x : edges_) {
    if (x->src_node() == nullptr && x->dst_node() == nullptr) { continue; }
    EdgeHandler(x.get());
  }
}

for (const auto& lbi : *lbis_) {
const auto& obn = CHECK_JUST(MapAt(*lbi2obn_, lbi));
for (const auto& ibn : CHECK_JUST(MapAt(*lbi2ibns_, lbi))) {
if (src_node()->NdSbp4BnInOp(obn) != dst_node()->NdSbp4BnInOp(ibn)) { return true; }
Contributor:

This criterion is not rigorous; it breaks down in some 2D cases. See:

// NOTE(chengcheng): nd_sbp need to be reduction like from [P, P] to [P]

A reduction check is needed here.

Contributor Author:

This criterion is not rigorous; it breaks down in some 2D cases.
A reduction check is needed here.

done in bded88f

}

void OpGraph::UpdateCachedPredicatorIsReachable() {
cached_predicator_is_reachable_ = MakePredicatorIsReachable();
Contributor:

Why is there a need to update this?

Contributor Author:

Why is there a need to update this?

Previously there was a single-process, multi-threaded compilation mode: the main thread generated a cache of the reachability relation, and the threads compiling the sub-plans could reuse it to reduce overhead.

But we have removed that single-process, multi-threaded debug mode. The default is now the multi-process mode, where each process inevitably builds its own reachability lambda anyway and nothing is reused, so this optimization no longer has value. I have removed this part.

oneflow/core/graph/op_graph.cpp (outdated comment, resolved)
} // namespace

/*static*/ bool BoxingTaskGraph::SelectTaskNodeByRank(TaskNode* task_node, int64_t rank) {
return TaskNodeVisitor<bool>(
Contributor:

This function reads as rather cryptic.

Contributor Author:

This function reads as rather cryptic.

Several call sites share the same handling logic, so this template function was provided:

  • task_node: the input task node
  • HandleTansportTaskNodeT: if the task node is a transport task node, this handler is called (visit)
  • HandleComputeTaskNodeT: if the task node is a compute task node, this handler is called
  • RetT: the return type

I will add a comment at the function declaration.

template<typename RetT, typename HandleTansportTaskNodeT, typename HandleComputeTaskNodeT>
RetT TaskNodeVisitor(TaskNode* task_node, const HandleTansportTaskNodeT& HandleTansportTaskNode,
                     const HandleComputeTaskNodeT& HandleComputeTaskNode) 
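
A hypothetical call based on the description above; the handler parameter types and the machine_id() accessor are assumptions about the surrounding code, not taken from this PR:

bool IsTaskNodeOnRank(TaskNode* task_node, int64_t rank) {
  return TaskNodeVisitor<bool>(
      task_node,
      [&](TransportTaskNode* transport_node) {
        // called when task_node is a transport task node
        return transport_node->machine_id() == rank;
      },
      [&](CompTaskNode* comp_node) {
        // called when task_node is a compute task node
        return comp_node->machine_id() == rank;
      });
}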

for (auto* out_edge : task_node->out_edges()) { TryInsertEdge(out_edge); }
}
return rank_task_edges;
}();
Contributor:

Can the parentheses just be written after the {}, like this?

lambda = [&] { ... } ();

Contributor Author:

Can the parentheses just be written after the {}, like this?

lambda = [&] { ... } ();

Here it is equivalent to lambda = [&] { ... }; followed by const auto rank_task_edges = lambda().

Contributor:

Oh, I see, it just omits the empty parameter list:

lambda = [&] () { ... };
lambda = [&] { ... };
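
A standalone example of the pattern, separate from the PR's code, showing that the trailing () invokes the lambda immediately so the variable is initialized with its return value:

#include <set>

int main() {
  const auto squares = [&] {  // the empty parameter list "()" may be omitted
    std::set<int> result;
    for (int i = 0; i < 5; ++i) { result.insert(i * i); }
    return result;
  }();  // immediately invoked: squares == {0, 1, 4, 9, 16}
  return squares.empty() ? 1 : 0;
}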

oneflow/core/graph/task_graph.cpp (comment resolved)
int64_t parallel_id) {
auto* comp_task_node = JUST(TryGetBoxingRelatedComTaskNode(op_node, parallel_id));
if (comp_task_node != nullptr) { return comp_task_node; }
auto** comp_task_node_ptr = &op_node2comp_task_node_[op_node];
Contributor:

The way this is written is also odd. Why not "find in map and return, or else create"? A pointer-to-pointer shouldn't be needed.

Contributor Author:

The way this is written is also odd. Why not "find in map and return, or else create"? A pointer-to-pointer shouldn't be needed.

Done, updated.
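
For reference, a generic find-or-create shape of the kind the reviewer suggested; the container and value types here are illustrative, not the PR's actual members:

#include <memory>
#include <unordered_map>

struct Node {};

Node* FindOrCreate(std::unordered_map<int, std::unique_ptr<Node>>* cache, int key) {
  auto it = cache->find(key);
  if (it != cache->end()) { return it->second.get(); }  // found in map: return it
  auto created = std::make_unique<Node>();              // otherwise: create and cache
  Node* raw = created.get();
  (*cache)[key] = std::move(created);
  return raw;
}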

<< "parallel_id not found.";
auto* comp_task_node = JUST(TryGetBoxingRelatedComTaskNode(op_node, parallel_id));
if (comp_task_node != nullptr) { return comp_task_node; }
return op_node2comp_task_node_[op_node];
Contributor:

Could this possibly return nullptr?

Contributor Author:

Could this possibly return nullptr?

The compute task nodes have all been generated by this point, so in principle it will not return nullptr.

I will add an error message for the case where the lookup fails.
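
One possible shape for that added check, inside the lookup shown above; the CHECK_OR_RETURN macro usage and the message text are assumptions about what the final code does, not the PR's actual lines:

auto it = op_node2comp_task_node_.find(op_node);
CHECK_OR_RETURN(it != op_node2comp_task_node_.end() && it->second != nullptr)
    << "comp task node not found for op " << op_node->op().op_name();
return it->second;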

@github-actions:

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/10141/

@github-actions:

Speed stats:
GPU Name: NVIDIA GeForce RTX 3080 Ti 

❌ OneFlow resnet50 time: 43.8ms (= 4379.6ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 66.8ms (= 6678.0ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.52 (= 66.8ms / 43.8ms)

OneFlow resnet50 time: 26.9ms (= 2686.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 47.4ms (= 4735.9ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.76 (= 47.4ms / 26.9ms)

OneFlow resnet50 time: 19.5ms (= 3902.2ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 39.9ms (= 7984.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 2.05 (= 39.9ms / 19.5ms)

OneFlow resnet50 time: 17.3ms (= 3454.0ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 35.4ms (= 7073.7ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 2.05 (= 35.4ms / 17.3ms)

OneFlow resnet50 time: 16.4ms (= 3287.0ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 32.9ms (= 6575.7ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 2.00 (= 32.9ms / 16.4ms)

OneFlow swin dataloader time: 0.204s (= 40.780s / 200, num_workers=1)
PyTorch swin dataloader time: 0.133s (= 26.566s / 200, num_workers=1)
Relative speed: 0.651 (= 0.133s / 0.204s)

OneFlow swin dataloader time: 0.064s (= 12.740s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.581s / 200, num_workers=4)
Relative speed: 0.517 (= 0.033s / 0.064s)

OneFlow swin dataloader time: 0.034s (= 6.787s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.311s / 200, num_workers=8)
Relative speed: 0.488 (= 0.017s / 0.034s)

❌ OneFlow resnet50 time: 47.7ms (= 4771.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 64.0ms (= 6398.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.34 (= 64.0ms / 47.7ms)

OneFlow resnet50 time: 32.6ms (= 3263.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 44.6ms (= 4458.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.37 (= 44.6ms / 32.6ms)

OneFlow resnet50 time: 24.1ms (= 4814.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 41.5ms (= 8298.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.72 (= 41.5ms / 24.1ms)

OneFlow resnet50 time: 22.9ms (= 4577.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.1ms (= 7627.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.67 (= 38.1ms / 22.9ms)

OneFlow resnet50 time: 21.2ms (= 4242.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 34.5ms (= 6909.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.63 (= 34.5ms / 21.2ms)

@github-actions:

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/10141/

@github-actions:

Speed stats:
GPU Name: NVIDIA GeForce RTX 3090 

❌ OneFlow resnet50 time: 42.7ms (= 4270.5ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 57.7ms (= 5772.8ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.35 (= 57.7ms / 42.7ms)

OneFlow resnet50 time: 25.7ms (= 2566.6ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 37.5ms (= 3751.5ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.46 (= 37.5ms / 25.7ms)

OneFlow resnet50 time: 19.3ms (= 3869.2ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 35.6ms (= 7125.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.84 (= 35.6ms / 19.3ms)

OneFlow resnet50 time: 19.8ms (= 3964.6ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 32.2ms (= 6437.4ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.62 (= 32.2ms / 19.8ms)

OneFlow resnet50 time: 17.7ms (= 3548.1ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 25.4ms (= 5082.2ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.43 (= 25.4ms / 17.7ms)

OneFlow swin dataloader time: 0.201s (= 40.222s / 200, num_workers=1)
PyTorch swin dataloader time: 0.130s (= 25.980s / 200, num_workers=1)
Relative speed: 0.646 (= 0.130s / 0.201s)

OneFlow swin dataloader time: 0.054s (= 10.876s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.549s / 200, num_workers=4)
Relative speed: 0.602 (= 0.033s / 0.054s)

OneFlow swin dataloader time: 0.031s (= 6.161s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.331s / 200, num_workers=8)
Relative speed: 0.541 (= 0.017s / 0.031s)

❌ OneFlow resnet50 time: 48.4ms (= 4842.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 64.5ms (= 6452.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.33 (= 64.5ms / 48.4ms)

OneFlow resnet50 time: 35.7ms (= 3572.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 47.3ms (= 4726.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 47.3ms / 35.7ms)

OneFlow resnet50 time: 28.4ms (= 5686.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.3ms (= 7858.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.38 (= 39.3ms / 28.4ms)

OneFlow resnet50 time: 25.7ms (= 5133.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 40.1ms (= 8020.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.56 (= 40.1ms / 25.7ms)

OneFlow resnet50 time: 24.1ms (= 4825.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 35.9ms (= 7188.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.49 (= 35.9ms / 24.1ms)

TaskEdge* edge = NewEdge();
Connect<TaskNode>(src_task_nodes.at(i), edge, dst_task_nodes.at(i));
src_task_nodes.at(i)->BindEdgeWithProducedRegst(edge, regst_desc_name);
ConnectCtrlEdge(src_task_nodes.at(i), dst_task_nodes.at(i));
Contributor:

Then there is actually no such overhead 😂


github-actions bot commented Jun 3, 2023

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/10141/

github-actions bot commented Jun 3, 2023

Speed stats:
GPU Name: NVIDIA GeForce RTX 3090 

❌ OneFlow resnet50 time: 42.9ms (= 4289.1ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 57.6ms (= 5764.9ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.34 (= 57.6ms / 42.9ms)

OneFlow resnet50 time: 25.7ms (= 2567.9ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 37.4ms (= 3736.3ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.46 (= 37.4ms / 25.7ms)

OneFlow resnet50 time: 18.6ms (= 3724.4ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 37.0ms (= 7397.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.99 (= 37.0ms / 18.6ms)

OneFlow resnet50 time: 20.0ms (= 4000.7ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 32.9ms (= 6570.7ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.64 (= 32.9ms / 20.0ms)

OneFlow resnet50 time: 17.5ms (= 3490.3ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 30.4ms (= 6076.3ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.74 (= 30.4ms / 17.5ms)

OneFlow swin dataloader time: 0.201s (= 40.134s / 200, num_workers=1)
PyTorch swin dataloader time: 0.129s (= 25.739s / 200, num_workers=1)
Relative speed: 0.641 (= 0.129s / 0.201s)

OneFlow swin dataloader time: 0.056s (= 11.284s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.502s / 200, num_workers=4)
Relative speed: 0.576 (= 0.033s / 0.056s)

OneFlow swin dataloader time: 0.030s (= 5.958s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.310s / 200, num_workers=8)
Relative speed: 0.556 (= 0.017s / 0.030s)

❌ OneFlow resnet50 time: 48.6ms (= 4860.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.0ms (= 6797.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.40 (= 68.0ms / 48.6ms)

OneFlow resnet50 time: 36.0ms (= 3600.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 45.9ms (= 4594.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.28 (= 45.9ms / 36.0ms)

OneFlow resnet50 time: 28.5ms (= 5704.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.6ms (= 7927.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.39 (= 39.6ms / 28.5ms)

OneFlow resnet50 time: 26.0ms (= 5199.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.1ms (= 7816.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.50 (= 39.1ms / 26.0ms)

OneFlow resnet50 time: 24.5ms (= 4904.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 36.2ms (= 7249.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.48 (= 36.2ms / 24.5ms)

github-actions bot commented Jun 3, 2023

CI failed when running job: cuda-module. PR label automerge has been removed

@github-actions github-actions bot removed the automerge label Jun 3, 2023
github-actions bot commented Jun 3, 2023

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/10141/

github-actions bot commented Jun 3, 2023

Speed stats:

@strint strint requested a review from oneflow-ci-bot June 4, 2023 01:17
github-actions bot commented Jun 4, 2023

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/10141/

github-actions bot commented Jun 4, 2023

Speed stats:
GPU Name: NVIDIA GeForce RTX 3090 

❌ OneFlow resnet50 time: 43.0ms (= 4298.6ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 56.7ms (= 5674.4ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.32 (= 56.7ms / 43.0ms)

OneFlow resnet50 time: 26.1ms (= 2613.7ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 37.6ms (= 3764.9ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.44 (= 37.6ms / 26.1ms)

OneFlow resnet50 time: 18.5ms (= 3708.6ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 35.5ms (= 7104.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.92 (= 35.5ms / 18.5ms)

OneFlow resnet50 time: 17.9ms (= 3572.3ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 33.6ms (= 6715.2ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.88 (= 33.6ms / 17.9ms)

OneFlow resnet50 time: 17.3ms (= 3459.6ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 29.0ms (= 5799.6ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.68 (= 29.0ms / 17.3ms)

OneFlow swin dataloader time: 0.200s (= 39.977s / 200, num_workers=1)
PyTorch swin dataloader time: 0.129s (= 25.798s / 200, num_workers=1)
Relative speed: 0.645 (= 0.129s / 0.200s)

OneFlow swin dataloader time: 0.055s (= 10.968s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.511s / 200, num_workers=4)
Relative speed: 0.594 (= 0.033s / 0.055s)

OneFlow swin dataloader time: 0.031s (= 6.175s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.314s / 200, num_workers=8)
Relative speed: 0.537 (= 0.017s / 0.031s)

❌ OneFlow resnet50 time: 48.4ms (= 4840.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 63.9ms (= 6390.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 63.9ms / 48.4ms)

OneFlow resnet50 time: 36.4ms (= 3637.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 46.0ms (= 4595.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.26 (= 46.0ms / 36.4ms)

OneFlow resnet50 time: 29.1ms (= 5819.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.7ms (= 7949.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.37 (= 39.7ms / 29.1ms)

OneFlow resnet50 time: 25.6ms (= 5110.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.9ms (= 7778.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.52 (= 38.9ms / 25.6ms)

OneFlow resnet50 time: 24.5ms (= 4902.9ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 35.9ms (= 7183.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.47 (= 35.9ms / 24.5ms)

@mergify mergify bot merged commit a9a339b into master Jun 4, 2023
@mergify mergify bot deleted the sep4_rank_task_graph branch June 4, 2023 01:55