
Enable the detection of subgraph composed of grad ops #21223

Merged

Xreki merged 23 commits into PaddlePaddle:develop from the fusion_group_final branch on Feb 7, 2020

Conversation

@Xreki (Contributor) commented Nov 18, 2019

Background

The subgraph matching method implemented in #19884 works as follows:

  • Starting from a var node, if the var is the input X of one of the specified elementwise operators, add the var to the subgraph, then add the elementwise op nodes among the var's outputs to the subgraph.
  • For intermediate nodes reached during the search, if a node is an elementwise op node, add that op node and its input/output var nodes to the subgraph, then examine in turn the other op nodes connected to those inputs and outputs.

In short, the principle is: starting from a var node, add every elementwise op node connected to it, together with that op's input and output var nodes, to the subgraph. This matching method has a problem, illustrated in the figure below:

[Figure: A, B, and C are interconnected elementwise ops; fusing all three into one subgraph node (right) introduces a cycle]

In the left graph, A, B, and C are all elementwise op nodes and are interconnected. The original matching method would put A, B, and C into a single subgraph, turning the graph into the one shown on the right and introducing a cycle.
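For illustration only, here is a minimal C++ sketch of the greedy expansion described above, assuming a simplified graph representation; it is not Paddle's actual implementation, and the Node structure, IsElementwiseOp and GreedyExpand names are hypothetical:

#include <queue>
#include <set>
#include <string>
#include <vector>

// Simplified graph node: either an op node or a var node.
struct Node {
  bool is_op = false;
  std::string op_type;
  std::vector<Node*> inputs;
  std::vector<Node*> outputs;
};

static bool IsElementwiseOp(const Node* n) {
  return n->is_op && n->op_type.rfind("elementwise_", 0) == 0;
}

// Starting from a var node, pull in every connected elementwise op
// together with its input/output vars, repeating until no more ops can
// be added. This is the expansion that may introduce a cycle once the
// matched subgraph is collapsed into a single node.
std::set<Node*> GreedyExpand(Node* start_var) {
  std::set<Node*> subgraph{start_var};
  std::queue<Node*> pending;
  for (Node* op : start_var->outputs) {
    if (IsElementwiseOp(op)) pending.push(op);
  }
  while (!pending.empty()) {
    Node* op = pending.front();
    pending.pop();
    if (!subgraph.insert(op).second) continue;  // already visited
    for (const auto& vars : {op->inputs, op->outputs}) {
      for (Node* var : vars) {
        subgraph.insert(var);
        // Examine the other op nodes connected to this var.
        for (Node* next : var->outputs) {
          if (IsElementwiseOp(next)) pending.push(next);
        }
        for (Node* prev : var->inputs) {
          if (IsElementwiseOp(prev)) pending.push(prev);
        }
      }
    }
  }
  return subgraph;
}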

Subgraph matching has long existed in Paddle; both the TensorRT and ngraph integrations rely on it. A general-purpose subgraph matcher was originally implemented in inference/analysis/subgraph_detector.h/.cc, and #22094 moved the subgraph_detector source into the framework/ir directory.

Work done in this PR:

  • Switch subgraph matching to the general-purpose subgraph_detector, which can successfully match both forward and backward subgraphs in the graph.
  • Add an enable_auto_fusion option to BuildStrategy to control whether the fusion_group feature is turned on.
  • Add a unit test that uses the PaddingRNN static test model configuration with build_strategy.enable_auto_fusion = True.
  • Apply it to real model training tasks:
    • language_model static small training: 119 subgraphs are matched, and training throughput improves from 60 steps/s to 84 steps/s.
    • ResNet50 float training: 32 subgraphs are successfully matched. All of them are forward/backward elementwise_add + relu subgraphs; since they account for only a small fraction of the total time, overall training performance does not improve. The matched subgraph structure looks like this:
I0107 09:49:09.681634 24788 fusion_group_pass.cc:58] subgraph: {
    Node(bn5a_branch2c.output.1.tmp_2{-1x2048x7x7}), inputs:{batch_norm}, outputs:{elementwise_add, elementwise_add_grad}
    Node(bn5a_branch1.output.1.tmp_2{-1x2048x7x7}), inputs:{batch_norm}, outputs:{elementwise_add, elementwise_add_grad}
  Node(Op(elementwise_add), inputs:{X[bn5a_branch1.output.1.tmp_2], Y[bn5a_branch2c.output.1.tmp_2]}, outputs:{Out[res5a.add.output.5]}), inputs:{bn5a_branch1.output.1.tmp_2, bn5a_branch2c.output.1.tmp_2}, outputs:{res5a.add.output.5}.
    Node(res5a.add.output.5{-1x2048x7x7}), inputs:{elementwise_add}, outputs:{relu}
  Node(Op(relu), inputs:{X[res5a.add.output.5]}, outputs:{Out[res5a.add.output.5.tmp_0]}), inputs:{res5a.add.output.5}, outputs:{res5a.add.output.5.tmp_0}.
    Node(res5a.add.output.5.tmp_0{-1x2048x7x7}), inputs:{relu}, outputs:{conv2d, elementwise_add, elementwise_add_grad, conv2d_grad, relu_grad}
}
...
I0107 09:49:11.991044 24788 fusion_group_pass.cc:58] subgraph: {
    Node(res4a.add.output.5.tmp_0{-1x1024x14x14}), inputs:{fusion_group}, outputs:{conv2d, elementwise_add, elementwise_add_grad, conv2d_grad, relu_grad}
    Node(bn4b_branch2c.output.1.tmp_2{-1x1024x14x14}), inputs:{batch_norm}, outputs:{elementwise_add, elementwise_add_grad}
    Node(res4b.add.output.5.tmp_0{-1x1024x14x14}), inputs:{relu}, outputs:{conv2d, elementwise_add_grad, conv2d_grad, relu_grad, fusion_group}
    Node(res4b.add.output.5.tmp_0@GRAD{-1x1024x14x14}), inputs:{sum}, outputs:{relu_grad}
  Node(Op(relu_grad), inputs:{Out[res4b.add.output.5.tmp_0], Out@GRAD[res4b.add.output.5.tmp_0@GRAD]}, outputs:{X@GRAD[res4b.add.output.5@GRAD]}), inputs:{res4b.add.output.5.tmp_0, res4b.add.output.5.tmp_0@GRAD}, outputs:{res4b.add.output.5@GRAD}.
    Node(res4b.add.output.5@GRAD{-1x1024x14x14}), inputs:{relu_grad}, outputs:{elementwise_add_grad}
  Node(Op(elementwise_add_grad), inputs:{Out@GRAD[res4b.add.output.5@GRAD], X[res4a.add.output.5.tmp_0], Y[bn4b_branch2c.output.1.tmp_2]}, outputs:{X@GRAD[res4a.add.output.5.tmp_0@GRAD@RENAME@block0@0], Y@GRAD[bn4b_branch2c.output.1.tmp_2@GRAD]}), inputs:{res4b.add.output.5@GRAD, res4a.add.output.5.tmp_0, bn4b_branch2c.output.1.tmp_2}, outputs:{res4a.add.output.5.tmp_0@GRAD@RENAME@block0@0, bn4b_branch2c.output.1.tmp_2@GRAD}.
    Node(res4a.add.output.5.tmp_0@GRAD@RENAME@block0@0{-1x1024x14x14}), inputs:{elementwise_add_grad}, outputs:{sum}
    Node(bn4b_branch2c.output.1.tmp_2@GRAD{-1x1024x14x14}), inputs:{elementwise_add_grad}, outputs:{batch_norm_grad}
}
...

* Add the dynamic load of nvrtc, and support runtime compiling of CUDA kernel using nvrtc.
test=develop

* Call CUDA driver api to launch the kernel compiled by nvrtc.
test=develop

* Disable for mac and windows.
test=develop

* Refine the codes to support manually specified num_threads and workload_per_thread.
test=develop

* Refine the CUDA kernel to support large dims.
test=develop

* Add DeviceCodePool to manage all device codes.

* Add the first implementation of the fusion_group op.

* Add unit-test for fusion_group op.

* Add the check of result.

* Add the check of nvrtc in unit-test.
test=develop

* Add comment to explain the inputs, outputs and features of fusion_group op.
test=develop

* Disable fusion_group op for mac and windows.
test=develop

* Make the compiling of device code return status instead of hanging up.
test=develop

* Add the check of whether there is CUDA driver library, and do not core dump when failing to call the CUDA driver API.

* Unify fusion_group_op's input and output names.
test=develop

* Add the check of CUDA driver library in unittest.
test=develop

* Enable generating code for a given subgraph.

* Support sorting the subgraph.

* Remove the rearrangement of expressions because we use the sorted subgraph directly.

* Enable generating code for a subgraph which is composed of grad ops.

* Use expression information to check the accuracy in unittest.

* Separate load and store from computation expressions.
test=develop

* Improve the loading statements in generated codes.
test=develop

* Remove unused arguments from the formal parameter list.
test=develop
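
As background for the first two commits above (runtime compilation with NVRTC and launching through the CUDA driver API), here is a minimal standalone sketch of that flow. It is illustrative only and omits error checking; Paddle wraps these calls behind DeviceCodePool and loads the libraries dynamically, and CompileAndLaunch is a hypothetical name.

#include <cuda.h>
#include <nvrtc.h>
#include <string>

// Compile a CUDA C++ kernel at runtime with NVRTC, then load and
// launch it through the CUDA driver API. Real code must check every
// nvrtcResult/CUresult instead of ignoring them.
void CompileAndLaunch(const char* kernel_src, const char* kernel_name,
                      void** kernel_args, int n) {
  // 1. Runtime-compile the source to PTX.
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, kernel_src, "fused_kernel.cu", 0, nullptr,
                     nullptr);
  const char* opts[] = {"--std=c++11"};
  nvrtcCompileProgram(prog, 1, opts);
  size_t ptx_size = 0;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::string ptx(ptx_size, '\0');
  nvrtcGetPTX(prog, &ptx[0]);
  nvrtcDestroyProgram(&prog);

  // 2. Load the PTX and launch the kernel via the driver API.
  cuInit(0);
  CUdevice device;
  cuDeviceGet(&device, 0);
  CUcontext context;
  cuCtxCreate(&context, 0, device);
  CUmodule module;
  cuModuleLoadData(&module, ptx.c_str());
  CUfunction func;
  cuModuleGetFunction(&func, module, kernel_name);
  unsigned threads = 256;
  unsigned blocks = (n + threads - 1) / threads;
  cuLaunchKernel(func, blocks, 1, 1, threads, 1, 1,
                 0 /* shared memory */, nullptr /* default stream */,
                 kernel_args, nullptr);
  cuCtxSynchronize();
}
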
@Xreki force-pushed the fusion_group_final branch 4 times, most recently from 0b8ebe0 to dc8263b on January 9, 2020 08:51

@zhhsplendid (Member) left a comment

Good code. I noted some tiny readability issues; everything else LGTM.

@@ -370,6 +373,12 @@ ir::Graph *BuildStrategy::Apply(ir::Graph *graph,
"GPU, skipped.";
continue;
}
} else if (pass->Type() == "fusion_group_pass") {
pass->Set("use_gpu", new bool(use_cuda));

@zhhsplendid (Member) commented Jan 15, 2020

Tiny issue:

pass->Set("use_gpu", new bool(use_cuda))

I saw other passes write Set<bool> instead of Set. Do you need to do the same here?

@Xreki (Contributor, Author) replied

In fact, I'd like to use Set<bool>; I will make the code clearer.

y_shape = y->Var()->GetShape();
}
if (x_shape.size() == 0U || x_shape.size() != y_shape.size()) {
static bool IsEqual(const std::vector<int64_t>& l,

@zhhsplendid (Member) commented

  1. Are you implementing "equality" for std::vector<int64_t>? I think "l == r" may work, because std::vector has "==".

  2. If l.size() == 0 or r.size() == 0, this also returns "false". So if you need that behavior, I suggest renaming the function to EqualAndNotEmpty.

@Xreki (Contributor, Author) replied

Are you implementing "equality" for std::vector<int64_t>? I think "l == r" may work, because std::vector has "==".

Thanks a lot for this; it helps a lot.

If l.size() == 0 or r.size() == 0, this also returns "false". So if you need that behavior, I suggest renaming the function to EqualAndNotEmpty.

Yes, I need them to be non-empty. I changed the name to IsEqualAndNotEmpty.
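
Based on the discussion above, the renamed helper presumably looks something like the following sketch, using std::vector's operator== as suggested; the actual code in the PR may differ.

#include <cstdint>
#include <vector>

// Returns true only when both shapes are non-empty and element-wise equal.
static bool IsEqualAndNotEmpty(const std::vector<int64_t>& l,
                               const std::vector<int64_t>& r) {
  return l.size() != 0U && r.size() != 0U && l == r;
}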


class FusionGroupPaddingRNNTest(PaddingRNNTestBase):
    def set_customed_config(self):
        # Enable fusion_group_pass

@zhhsplendid (Member) commented

I feel we don't need this comment, because the following code carries the same information.

@Xreki (Contributor, Author) replied

Done.



class PaddingRNNTestBase(unittest.TestCase):
    def setUp(self):
        self.reader = Reader()
        self.device_count = 1

    def prepare_program(self, config, parallel=True):
        # Default exec_strategy

@zhhsplendid (Member) commented

Can we delete the comment "Default exec_strategy"?

You initialize a default exec_strategy but then set some values on it, so it is not really the default, I think. The same applies to "Default build_strategy".

@Xreki (Contributor, Author) replied

I mean these are the default build_strategy and exec_strategy used for this program. I will refine the comment to make it unambiguous.

@zhhsplendid (Member) left a comment

LGTM

@Xreki merged commit dcfb603 into PaddlePaddle:develop on Feb 7, 2020
@Xreki deleted the fusion_group_final branch on February 7, 2020 06:59