
Plan rank compiler #10141

Merged · 34 commits · merged into master on Jun 4, 2023
Conversation

@strint strint (Contributor) commented Apr 14, 2023

No description provided.

@strint strint requested a review from chengtbf as a code owner April 14, 2023 17:38
@strint strint changed the title from "add plan rank compiler" to "Plan rank compiler" Apr 16, 2023
Base automatically changed from sep3_fake_regst to master May 6, 2023 03:40
@github-actions:

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

@@ -622,4 +649,17 @@ void OpGraph::PrintSBPGraphDebugInfo() const {
}
}

OpGraphSingletonGuard::OpGraphSingletonGuard(const Job& job) {
Contributor Author:

Provides an RAII-style OpGraph.
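
For illustration, a minimal sketch of what such an RAII guard can look like, assuming the usual Singleton<OpGraph>::New/Delete helpers from the OneFlow codebase; the guard's actual body lives in the diff above and may differ:

class OpGraphSingletonGuard final {
 public:
  explicit OpGraphSingletonGuard(const Job& job) {
    // Create the process-wide OpGraph when the guard is constructed ...
    Singleton<OpGraph>::New(job);
  }
  ~OpGraphSingletonGuard() {
    // ... and tear it down automatically when the guard leaves scope.
    Singleton<OpGraph>::Delete();
  }
  OpGraphSingletonGuard(const OpGraphSingletonGuard&) = delete;
  OpGraphSingletonGuard& operator=(const OpGraphSingletonGuard&) = delete;
};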

comp_task_node->set_thrd_id(EncodeStreamIdToInt64(StreamId{device_id, stream_index}));
comp_task_node->set_op_node(op_node);
sorted_comp_tasks->emplace_back(comp_task_node);
sorted_comp_tasks->emplace_back(GenCompTaskNode(op_node, parallel_idx++, &GetStreamId));
Contributor Author:

This just extracts the logic above into a GenCompTaskNode function.
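
For context, a rough sketch of what the extracted helper might look like, reconstructed from the snippet above; the exact signature, the NewCompTaskNode4OpNode factory, and the assumption that GetStreamId returns a StreamId are illustrative, not the PR's literal code:

template<typename GetStreamIdT>
CompTaskNode* GenCompTaskNode(OpNode* op_node, int64_t parallel_idx,
                              const GetStreamIdT* GetStreamId) {
  CompTaskNode* comp_task_node = NewCompTaskNode4OpNode(op_node);  // assumed factory
  // Same settings as the lines above, just folded behind one helper call.
  comp_task_node->set_thrd_id(EncodeStreamIdToInt64((*GetStreamId)(op_node, parallel_idx)));
  comp_task_node->set_op_node(op_node);
  return comp_task_node;
}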

@@ -491,47 +533,6 @@ Maybe<void> RegisterCreateSubTskGphBuilderFn(DeviceType device_type,
return Maybe<void>::Ok();
}

TaskGraph::TaskGraph() {
Contributor Author:

The task graph's construction logic has been moved into the subclasses.
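
As a rough outline of the split (the class names appear elsewhere in this PR, but the return types and member lists here are placeholders, not the actual declarations):

class TaskGraph : public Graph<TaskNode, TaskEdge> {
 protected:
  TaskGraph() = default;  // the base no longer builds the whole graph in its constructor
};

class GlobalTaskGraph final : public TaskGraph {
 public:
  Maybe<void> Init();  // builds the original, complete task graph
};

class RankTaskGraph final : public TaskGraph {
 public:
  // Builds only the part of the graph a single rank needs, starting from the
  // boxing task graph proto (argument types approximated).
  static Maybe<RankTaskGraph> New(const BoxingTaskGraphProto& boxing_task_graph_proto,
                                  const HashSet<std::string>& var_op_names, int64_t rank);
};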

InplaceObasInfo safe_inplace_obas_info;
GetSafeInplaceOpBlobArgList(&safe_inplace_obas_info, dev_nodes, IsOpNameDataOrCtrlReachable);
SetTaskRegstInplaceInfo(safe_inplace_obas_info, dev_nodes);
EnableInplaceMemSharing(dev_nodes, IsOpNameDataOrCtrlReachable);
Contributor Author:

Just splits the function out.

|| (straighten_algorithm_tag == StraightenAlgorithmTag::kOverlap4Transfer
&& GlobalProcessCtx::WorldSize() == 1)) {
InitOrderedTaskNodes();
Maybe<void> GlobalTaskGraph::Init() {
Contributor Author:

The original task graph, i.e. the complete task graph.

@@ -237,7 +237,7 @@ void GenChunkForMultiNNGraphMemoryReuseInMultiClient(

} // namespace

void PlanUtil::MergeMemBlockIdByLogicalChainId(Plan* plan, const Job& job) {
void PlanUtil::MergeMemBlockIdByLogicalChainId(Plan* plan, const Job& job, int64_t limited_rank) {
Contributor Author:

During separated compilation we need to filter for the valid rank.
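
A minimal sketch of what "filter for the valid rank" could mean in practice; the helper name and the machine_id() accessor are assumptions for illustration, not the PR's actual check:

bool IsOnLimitedRank(const TaskProto& task, int64_t limited_rank) {
  // A negative limited_rank can stand for "no restriction" (the original
  // whole-plan behaviour); otherwise only tasks on that rank are considered.
  return limited_rank < 0 || task.machine_id() == limited_rank;
}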

@@ -801,10 +816,13 @@ std::function<RegstDescProto*(int64_t)> PlanUtil::MakeMutRegstDesc4Id(Plan* plan
};
}

void PlanUtil::SetForceInplaceMemBlock(Plan* plan) {
void PlanUtil::SetForceInplaceMemBlock(Plan* plan, int64_t limited_rank) {
Contributor Author:

During separated compilation we need to filter for the valid rank.

#ifdef WITH_CUDA
// Use the right device when some plan compilation needs cuda to avoid creating unnecessary cuda
// context on cuda:0.
CudaCurrentDeviceGuard guard(GetCudaDeviceIndex());
Contributor Author:

Fixes the problem of extra CUDA contexts being created during separated compilation.


} // namespace

Maybe<void> RankCompiler::Compile(const HashSet<std::string>& var_op_names, Job* job,
Contributor Author:

A per-rank compiler, as opposed to the original global compiler.
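
Conceptually, the difference could be pictured with a hypothetical driver like the one below; the constructor arguments and the trailing parameters of Compile are assumptions, since the snippet above truncates the signature after Job* job:

Maybe<void> CompilePlanForRank(int64_t rank, const HashSet<std::string>& var_op_names,
                               Job* job, Plan* rank_plan) {
  // The original flow ran one global Compiler that produced the plan for every
  // rank; here each rank only compiles its own slice of the plan.
  RankCompiler compiler(rank);                            // constructor args assumed
  JUST(compiler.Compile(var_op_names, job, rank_plan));   // trailing args assumed
  return Maybe<void>::Ok();
}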

// context on cuda:0.
CudaCurrentDeviceGuard guard(GetCudaDeviceIndex());
#endif // WITH_CUDA
auto task_gph = JUST(RankTaskGraph::New(boxing_task_graph_proto_, var_op_names, rank_));
Contributor Author:

Compilation continues from the boxing task graph.

@strint strint added the graph graph mode label May 12, 2023
@github-actions:

Speed stats:

IntraJobMemSharingUtil::InferMemBlockId4MemReusedRegst(plan);
PlanUtil::MergeMemBlockIdByLogicalChainId(plan, *job, rank_);
PlanUtil::SetUniqueMemBlockId4UnreusedMemRegst(plan);
PlanUtil::SetForceInplaceMemBlock(plan, rank_);
Contributor Author:

The rest of the flow here is similar to the previous plan compiler; only certain places need extra handling for the single-rank case.

@@ -27,26 +27,6 @@ limitations under the License.

namespace oneflow {

void CreateOpAttributeRef(Plan* plan, int64_t job_id, TaskProto* task_proto) {
Contributor Author:

Moved into PlanUtil as a shared utility function.

@github-actions:

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/10141/

@github-actions:

Speed stats:
GPU Name: NVIDIA GeForce RTX 3090 

❌ OneFlow resnet50 time: 42.5ms (= 4254.8ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 57.5ms (= 5750.7ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.35 (= 57.5ms / 42.5ms)

OneFlow resnet50 time: 26.1ms (= 2607.9ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 37.3ms (= 3733.9ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.43 (= 37.3ms / 26.1ms)

OneFlow resnet50 time: 18.6ms (= 3717.5ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 35.5ms (= 7104.8ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.91 (= 35.5ms / 18.6ms)

OneFlow resnet50 time: 18.0ms (= 3609.8ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 32.3ms (= 6451.4ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.79 (= 32.3ms / 18.0ms)

OneFlow resnet50 time: 17.6ms (= 3528.1ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 29.5ms (= 5900.4ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.67 (= 29.5ms / 17.6ms)

OneFlow swin dataloader time: 0.202s (= 40.463s / 200, num_workers=1)
PyTorch swin dataloader time: 0.129s (= 25.763s / 200, num_workers=1)
Relative speed: 0.637 (= 0.129s / 0.202s)

OneFlow swin dataloader time: 0.053s (= 10.566s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.566s / 200, num_workers=4)
Relative speed: 0.621 (= 0.033s / 0.053s)

OneFlow swin dataloader time: 0.030s (= 6.088s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.330s / 200, num_workers=8)
Relative speed: 0.547 (= 0.017s / 0.030s)

❌ OneFlow resnet50 time: 48.6ms (= 4861.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.8ms (= 6580.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.35 (= 65.8ms / 48.6ms)

OneFlow resnet50 time: 36.8ms (= 3683.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 47.2ms (= 4716.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.28 (= 47.2ms / 36.8ms)

OneFlow resnet50 time: 28.5ms (= 5703.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.5ms (= 7905.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.39 (= 39.5ms / 28.5ms)

OneFlow resnet50 time: 25.4ms (= 5087.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.5ms (= 7902.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.55 (= 39.5ms / 25.4ms)

OneFlow resnet50 time: 23.7ms (= 4740.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 36.0ms (= 7193.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.52 (= 36.0ms / 23.7ms)

@@ -399,6 +399,7 @@ message OperatorConf {
optional string loc = 11 [default = ""];
optional int64 logical_chain_id = 12 [default = -1];
optional int64 order_in_logical_chain = 13 [default = -1];
optional string calculation_pass_name = 14 [default = "forward_pass"];
Contributor:

Hmm... it looks like:

optional string pass_tag = 10;

might already be this calculation pass name; could you search for that keyword?

Contributor Author:

I checked; pass_tag takes the following values:

static const std::string kNoPassTag = "";
static const std::string kMainOp = "main_op";

while calculation_pass_name takes these values:

const std::string kForwardPass = "forward_pass";
const std::string kBackwardPass = "backward_pass";
const std::string kOptimizerPass = "optimizer_pass";

So they don't look like the same thing.

Contributor:

😂 Sorry, I misremembered.

Maybe<void> Graph<NodeType, EdgeType>::MaybeForEachEdge(
std::function<Maybe<void>(EdgeType*)> EdgeHandler) const {
for (auto& x : edges_) {
if (x->src_node() == nullptr && x->dst_node() == nullptr) { continue; }
Contributor:

Under what circumstances would this be nullptr?

Contributor Author:

Under what circumstances would this be nullptr?

This was adapted from the earlier non-Maybe version, presumably to be rigorous: by default, an edge's src_node and dst_node are both nullptr right after the edge is created.

template<typename NodeType, typename EdgeType>
void Graph<NodeType, EdgeType>::ForEachEdge(std::function<void(EdgeType*)> EdgeHandler) const {
  for (auto& x : edges_) {
    if (x->src_node() == nullptr && x->dst_node() == nullptr) { continue; }
    EdgeHandler(x.get());
  }
}

for (const auto& lbi : *lbis_) {
const auto& obn = CHECK_JUST(MapAt(*lbi2obn_, lbi));
for (const auto& ibn : CHECK_JUST(MapAt(*lbi2ibns_, lbi))) {
if (src_node()->NdSbp4BnInOp(obn) != dst_node()->NdSbp4BnInOp(ibn)) { return true; }
Contributor:

This criterion is not rigorous; it breaks down in some 2D cases. See:

// NOTE(chengcheng): nd_sbp need to be reduction like from [P, P] to [P]

A reduction check is needed here.

Contributor Author:

This criterion is not rigorous; it breaks down in some 2D cases.
A reduction check is needed here.

done in bded88f

}

void OpGraph::UpdateCachedPredicatorIsReachable() {
cached_predicator_is_reachable_ = MakePredicatorIsReachable();
Contributor:

Why is there a need to update this?

Contributor Author:

Why is there a need to update this?

Previously there was a single-process, multi-threaded compilation mode: the main thread generated a cache of the reachability relation, and the threads compiling the sub-plans could reuse it to reduce overhead.

But we have removed that single-process, multi-threaded debug mode. The default is now the multi-process mode, where each process inevitably builds its own reachability lambda anyway and nothing is reused, so this optimization no longer has value. I have removed this part.

oneflow/core/graph/op_graph.cpp (outdated comment, resolved)
} // namespace

/*static*/ bool BoxingTaskGraph::SelectTaskNodeByRank(TaskNode* task_node, int64_t rank) {
return TaskNodeVisitor<bool>(
Contributor:

This function reads as rather cryptic.

Contributor Author:

This function reads as rather cryptic.

Several call sites share the same handling logic, so this template function was provided:

  • task_node: the input task node
  • HandleTansportTaskNodeT: if the task node is a transport task node, this handler is called (visit)
  • HandleComputeTaskNodeT: if the task node is a compute task node, this handler is called
  • RetT: the return type

I will add a comment at the function declaration.

template<typename RetT, typename HandleTansportTaskNodeT, typename HandleComputeTaskNodeT>
RetT TaskNodeVisitor(TaskNode* task_node, const HandleTansportTaskNodeT& HandleTansportTaskNode,
                     const HandleComputeTaskNodeT& HandleComputeTaskNode) 
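
A hypothetical call based on the description above; the handler parameter types and the machine_id() accessor are assumptions about the surrounding code, not taken from this PR:

bool IsTaskNodeOnRank(TaskNode* task_node, int64_t rank) {
  return TaskNodeVisitor<bool>(
      task_node,
      [&](TransportTaskNode* transport_node) {
        // called when task_node is a transport task node
        return transport_node->machine_id() == rank;
      },
      [&](CompTaskNode* comp_node) {
        // called when task_node is a compute task node
        return comp_node->machine_id() == rank;
      });
}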

for (auto* out_edge : task_node->out_edges()) { TryInsertEdge(out_edge); }
}
return rank_task_edges;
}();
Contributor:

Can the parentheses just be written after the {}, like this?

lambda = [&] { ... } ();

Contributor Author:

Can the parentheses just be written after the {}, like this?

lambda = [&] { ... } ();

Here it is equivalent to lambda = [&] { ... }; followed by const auto rank_task_edges = lambda().

Contributor:

Oh, I see, it just omits the empty parameter list:

lambda = [&] () { ... };
lambda = [&] { ... };
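
A standalone example of the pattern, separate from the PR's code, showing that the trailing () invokes the lambda immediately so the variable is initialized with its return value:

#include <set>

int main() {
  const auto squares = [&] {  // the empty parameter list "()" may be omitted
    std::set<int> result;
    for (int i = 0; i < 5; ++i) { result.insert(i * i); }
    return result;
  }();  // immediately invoked: squares == {0, 1, 4, 9, 16}
  return squares.empty() ? 1 : 0;
}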

oneflow/core/graph/task_graph.cpp (comment resolved)
int64_t parallel_id) {
auto* comp_task_node = JUST(TryGetBoxingRelatedComTaskNode(op_node, parallel_id));
if (comp_task_node != nullptr) { return comp_task_node; }
auto** comp_task_node_ptr = &op_node2comp_task_node_[op_node];
Contributor:

The way this is written is also odd. Why not "find in map and return, or else create"? A pointer-to-pointer shouldn't be needed.

Contributor Author:

The way this is written is also odd. Why not "find in map and return, or else create"? A pointer-to-pointer shouldn't be needed.

Done, updated.
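
For reference, a generic find-or-create shape of the kind the reviewer suggested; the container and value types here are illustrative, not the PR's actual members:

#include <memory>
#include <unordered_map>

struct Node {};

Node* FindOrCreate(std::unordered_map<int, std::unique_ptr<Node>>* cache, int key) {
  auto it = cache->find(key);
  if (it != cache->end()) { return it->second.get(); }  // found in map: return it
  auto created = std::make_unique<Node>();              // otherwise: create and cache
  Node* raw = created.get();
  (*cache)[key] = std::move(created);
  return raw;
}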

<< "parallel_id not found.";
auto* comp_task_node = JUST(TryGetBoxingRelatedComTaskNode(op_node, parallel_id));
if (comp_task_node != nullptr) { return comp_task_node; }
return op_node2comp_task_node_[op_node];
Contributor:

Could this possibly return nullptr?

Contributor Author:

Could this possibly return nullptr?

The compute task nodes have all been generated by this point, so in principle it will not return nullptr.

I will add an error message for the case where the lookup fails.
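
One possible shape for that added check, inside the lookup shown above; the CHECK_OR_RETURN macro usage and the message text are assumptions about what the final code does, not the PR's actual lines:

auto it = op_node2comp_task_node_.find(op_node);
CHECK_OR_RETURN(it != op_node2comp_task_node_.end() && it->second != nullptr)
    << "comp task node not found for op " << op_node->op().op_name();
return it->second;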

@github-actions:

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/10141/

@github-actions:

Speed stats:
GPU Name: NVIDIA GeForce RTX 3080 Ti 

❌ OneFlow resnet50 time: 43.8ms (= 4379.6ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 66.8ms (= 6678.0ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.52 (= 66.8ms / 43.8ms)

OneFlow resnet50 time: 26.9ms (= 2686.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 47.4ms (= 4735.9ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.76 (= 47.4ms / 26.9ms)

OneFlow resnet50 time: 19.5ms (= 3902.2ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 39.9ms (= 7984.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 2.05 (= 39.9ms / 19.5ms)

OneFlow resnet50 time: 17.3ms (= 3454.0ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 35.4ms (= 7073.7ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 2.05 (= 35.4ms / 17.3ms)

OneFlow resnet50 time: 16.4ms (= 3287.0ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 32.9ms (= 6575.7ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 2.00 (= 32.9ms / 16.4ms)

OneFlow swin dataloader time: 0.204s (= 40.780s / 200, num_workers=1)
PyTorch swin dataloader time: 0.133s (= 26.566s / 200, num_workers=1)
Relative speed: 0.651 (= 0.133s / 0.204s)

OneFlow swin dataloader time: 0.064s (= 12.740s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.581s / 200, num_workers=4)
Relative speed: 0.517 (= 0.033s / 0.064s)

OneFlow swin dataloader time: 0.034s (= 6.787s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.311s / 200, num_workers=8)
Relative speed: 0.488 (= 0.017s / 0.034s)

❌ OneFlow resnet50 time: 47.7ms (= 4771.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 64.0ms (= 6398.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.34 (= 64.0ms / 47.7ms)

OneFlow resnet50 time: 32.6ms (= 3263.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 44.6ms (= 4458.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.37 (= 44.6ms / 32.6ms)

OneFlow resnet50 time: 24.1ms (= 4814.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 41.5ms (= 8298.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.72 (= 41.5ms / 24.1ms)

OneFlow resnet50 time: 22.9ms (= 4577.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.1ms (= 7627.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.67 (= 38.1ms / 22.9ms)

OneFlow resnet50 time: 21.2ms (= 4242.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 34.5ms (= 6909.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.63 (= 34.5ms / 21.2ms)

@github-actions:

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/10141/

@github-actions:

Speed stats:
GPU Name: NVIDIA GeForce RTX 3090 

❌ OneFlow resnet50 time: 42.7ms (= 4270.5ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 57.7ms (= 5772.8ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.35 (= 57.7ms / 42.7ms)

OneFlow resnet50 time: 25.7ms (= 2566.6ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 37.5ms (= 3751.5ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.46 (= 37.5ms / 25.7ms)

OneFlow resnet50 time: 19.3ms (= 3869.2ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 35.6ms (= 7125.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.84 (= 35.6ms / 19.3ms)

OneFlow resnet50 time: 19.8ms (= 3964.6ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 32.2ms (= 6437.4ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.62 (= 32.2ms / 19.8ms)

OneFlow resnet50 time: 17.7ms (= 3548.1ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 25.4ms (= 5082.2ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.43 (= 25.4ms / 17.7ms)

OneFlow swin dataloader time: 0.201s (= 40.222s / 200, num_workers=1)
PyTorch swin dataloader time: 0.130s (= 25.980s / 200, num_workers=1)
Relative speed: 0.646 (= 0.130s / 0.201s)

OneFlow swin dataloader time: 0.054s (= 10.876s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.549s / 200, num_workers=4)
Relative speed: 0.602 (= 0.033s / 0.054s)

OneFlow swin dataloader time: 0.031s (= 6.161s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.331s / 200, num_workers=8)
Relative speed: 0.541 (= 0.017s / 0.031s)

❌ OneFlow resnet50 time: 48.4ms (= 4842.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 64.5ms (= 6452.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.33 (= 64.5ms / 48.4ms)

OneFlow resnet50 time: 35.7ms (= 3572.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 47.3ms (= 4726.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 47.3ms / 35.7ms)

OneFlow resnet50 time: 28.4ms (= 5686.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.3ms (= 7858.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.38 (= 39.3ms / 28.4ms)

OneFlow resnet50 time: 25.7ms (= 5133.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 40.1ms (= 8020.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.56 (= 40.1ms / 25.7ms)

OneFlow resnet50 time: 24.1ms (= 4825.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 35.9ms (= 7188.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.49 (= 35.9ms / 24.1ms)

TaskEdge* edge = NewEdge();
Connect<TaskNode>(src_task_nodes.at(i), edge, dst_task_nodes.at(i));
src_task_nodes.at(i)->BindEdgeWithProducedRegst(edge, regst_desc_name);
ConnectCtrlEdge(src_task_nodes.at(i), dst_task_nodes.at(i));
Contributor:

Then there is actually no such overhead 😂


github-actions bot commented Jun 3, 2023

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/10141/

github-actions bot commented Jun 3, 2023

Speed stats:
GPU Name: NVIDIA GeForce RTX 3090 

❌ OneFlow resnet50 time: 42.9ms (= 4289.1ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 57.6ms (= 5764.9ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.34 (= 57.6ms / 42.9ms)

OneFlow resnet50 time: 25.7ms (= 2567.9ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 37.4ms (= 3736.3ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.46 (= 37.4ms / 25.7ms)

OneFlow resnet50 time: 18.6ms (= 3724.4ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 37.0ms (= 7397.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.99 (= 37.0ms / 18.6ms)

OneFlow resnet50 time: 20.0ms (= 4000.7ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 32.9ms (= 6570.7ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.64 (= 32.9ms / 20.0ms)

OneFlow resnet50 time: 17.5ms (= 3490.3ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 30.4ms (= 6076.3ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.74 (= 30.4ms / 17.5ms)

OneFlow swin dataloader time: 0.201s (= 40.134s / 200, num_workers=1)
PyTorch swin dataloader time: 0.129s (= 25.739s / 200, num_workers=1)
Relative speed: 0.641 (= 0.129s / 0.201s)

OneFlow swin dataloader time: 0.056s (= 11.284s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.502s / 200, num_workers=4)
Relative speed: 0.576 (= 0.033s / 0.056s)

OneFlow swin dataloader time: 0.030s (= 5.958s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.310s / 200, num_workers=8)
Relative speed: 0.556 (= 0.017s / 0.030s)

❌ OneFlow resnet50 time: 48.6ms (= 4860.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.0ms (= 6797.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.40 (= 68.0ms / 48.6ms)

OneFlow resnet50 time: 36.0ms (= 3600.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 45.9ms (= 4594.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.28 (= 45.9ms / 36.0ms)

OneFlow resnet50 time: 28.5ms (= 5704.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.6ms (= 7927.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.39 (= 39.6ms / 28.5ms)

OneFlow resnet50 time: 26.0ms (= 5199.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.1ms (= 7816.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.50 (= 39.1ms / 26.0ms)

OneFlow resnet50 time: 24.5ms (= 4904.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 36.2ms (= 7249.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.48 (= 36.2ms / 24.5ms)

github-actions bot commented Jun 3, 2023

CI failed when running job: cuda-module. PR label automerge has been removed

@github-actions github-actions bot removed the automerge label Jun 3, 2023
github-actions bot commented Jun 3, 2023

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/10141/

github-actions bot commented Jun 3, 2023

Speed stats:

@strint strint requested a review from oneflow-ci-bot June 4, 2023 01:17
github-actions bot commented Jun 4, 2023

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/10141/

github-actions bot commented Jun 4, 2023

Speed stats:
GPU Name: NVIDIA GeForce RTX 3090 

❌ OneFlow resnet50 time: 43.0ms (= 4298.6ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 56.7ms (= 5674.4ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.32 (= 56.7ms / 43.0ms)

OneFlow resnet50 time: 26.1ms (= 2613.7ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 37.6ms (= 3764.9ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.44 (= 37.6ms / 26.1ms)

OneFlow resnet50 time: 18.5ms (= 3708.6ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 35.5ms (= 7104.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.92 (= 35.5ms / 18.5ms)

OneFlow resnet50 time: 17.9ms (= 3572.3ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 33.6ms (= 6715.2ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.88 (= 33.6ms / 17.9ms)

OneFlow resnet50 time: 17.3ms (= 3459.6ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 29.0ms (= 5799.6ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.68 (= 29.0ms / 17.3ms)

OneFlow swin dataloader time: 0.200s (= 39.977s / 200, num_workers=1)
PyTorch swin dataloader time: 0.129s (= 25.798s / 200, num_workers=1)
Relative speed: 0.645 (= 0.129s / 0.200s)

OneFlow swin dataloader time: 0.055s (= 10.968s / 200, num_workers=4)
PyTorch swin dataloader time: 0.033s (= 6.511s / 200, num_workers=4)
Relative speed: 0.594 (= 0.033s / 0.055s)

OneFlow swin dataloader time: 0.031s (= 6.175s / 200, num_workers=8)
PyTorch swin dataloader time: 0.017s (= 3.314s / 200, num_workers=8)
Relative speed: 0.537 (= 0.017s / 0.031s)

❌ OneFlow resnet50 time: 48.4ms (= 4840.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 63.9ms (= 6390.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 63.9ms / 48.4ms)

OneFlow resnet50 time: 36.4ms (= 3637.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 46.0ms (= 4595.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.26 (= 46.0ms / 36.4ms)

OneFlow resnet50 time: 29.1ms (= 5819.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 39.7ms (= 7949.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.37 (= 39.7ms / 29.1ms)

OneFlow resnet50 time: 25.6ms (= 5110.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 38.9ms (= 7778.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.52 (= 38.9ms / 25.6ms)

OneFlow resnet50 time: 24.5ms (= 4902.9ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 35.9ms (= 7183.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.47 (= 35.9ms / 24.5ms)

@mergify mergify bot merged commit a9a339b into master Jun 4, 2023
@mergify mergify bot deleted the sep4_rank_task_graph branch June 4, 2023 01:55