
Enable the detection of subgraph composed of grad ops #21223

Merged

Xreki merged 23 commits into PaddlePaddle:develop from the fusion_group_final branch on Feb 7, 2020

Conversation

@Xreki (Contributor) commented Nov 18, 2019

Background

The subgraph matching method implemented in #19884 works as follows:

  • Starting from a var node, if the var is the input X of one of the specified elementwise operators, add the var to the subgraph, then add the elementwise op nodes among the var's outputs to the subgraph.
  • For intermediate nodes reached during the search, if a node is an elementwise op node, add that op node and its input/output var nodes to the subgraph, then examine in turn the other op nodes connected to those inputs and outputs.

In short, the principle is: starting from a var node, add every elementwise op node connected to it, together with that op's input and output var nodes, to the subgraph. This matching method has a problem, illustrated in the figure below:

[Figure: A, B, and C are interconnected elementwise ops; fusing all three into one subgraph node (right) introduces a cycle]

In the left graph, A, B, and C are all elementwise op nodes and are interconnected. The original matching method would put A, B, and C into a single subgraph, turning the graph into the one shown on the right and introducing a cycle.
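For illustration only, here is a minimal C++ sketch of the greedy expansion described above, assuming a simplified graph representation; it is not Paddle's actual implementation, and the Node structure, IsElementwiseOp and GreedyExpand names are hypothetical:

#include <queue>
#include <set>
#include <string>
#include <vector>

// Simplified graph node: either an op node or a var node.
struct Node {
  bool is_op = false;
  std::string op_type;
  std::vector<Node*> inputs;
  std::vector<Node*> outputs;
};

static bool IsElementwiseOp(const Node* n) {
  return n->is_op && n->op_type.rfind("elementwise_", 0) == 0;
}

// Starting from a var node, pull in every connected elementwise op
// together with its input/output vars, repeating until no more ops can
// be added. This is the expansion that may introduce a cycle once the
// matched subgraph is collapsed into a single node.
std::set<Node*> GreedyExpand(Node* start_var) {
  std::set<Node*> subgraph{start_var};
  std::queue<Node*> pending;
  for (Node* op : start_var->outputs) {
    if (IsElementwiseOp(op)) pending.push(op);
  }
  while (!pending.empty()) {
    Node* op = pending.front();
    pending.pop();
    if (!subgraph.insert(op).second) continue;  // already visited
    for (const auto& vars : {op->inputs, op->outputs}) {
      for (Node* var : vars) {
        subgraph.insert(var);
        // Examine the other op nodes connected to this var.
        for (Node* next : var->outputs) {
          if (IsElementwiseOp(next)) pending.push(next);
        }
        for (Node* prev : var->inputs) {
          if (IsElementwiseOp(prev)) pending.push(prev);
        }
      }
    }
  }
  return subgraph;
}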

Subgraph matching has long existed in Paddle; both the TensorRT and ngraph integrations rely on it. A general-purpose subgraph matcher was originally implemented in inference/analysis/subgraph_detector.h/.cc, and #22094 moved the subgraph_detector source into the framework/ir directory.

Work done in this PR:

  • Switch subgraph matching to the general-purpose subgraph_detector, which can successfully match both forward and backward subgraphs in the graph.
  • Add an enable_auto_fusion option to BuildStrategy to control whether the fusion_group feature is turned on.
  • Add a unit test that uses the PaddingRNN static test model configuration with build_strategy.enable_auto_fusion = True.
  • Apply it to real model training tasks:
    • language_model static small training: 119 subgraphs are matched, and training throughput improves from 60 steps/s to 84 steps/s.
    • ResNet50 float training: 32 subgraphs are successfully matched. All of them are forward/backward elementwise_add + relu subgraphs; since they account for only a small fraction of the total time, overall training performance does not improve. The matched subgraph structure looks like this:
I0107 09:49:09.681634 24788 fusion_group_pass.cc:58] subgraph: {
    Node(bn5a_branch2c.output.1.tmp_2{-1x2048x7x7}), inputs:{batch_norm}, outputs:{elementwise_add, elementwise_add_grad}
    Node(bn5a_branch1.output.1.tmp_2{-1x2048x7x7}), inputs:{batch_norm}, outputs:{elementwise_add, elementwise_add_grad}
  Node(Op(elementwise_add), inputs:{X[bn5a_branch1.output.1.tmp_2], Y[bn5a_branch2c.output.1.tmp_2]}, outputs:{Out[res5a.add.output.5]}), inputs:{bn5a_branch1.output.1.tmp_2, bn5a_branch2c.output.1.tmp_2}, outputs:{res5a.add.output.5}.
    Node(res5a.add.output.5{-1x2048x7x7}), inputs:{elementwise_add}, outputs:{relu}
  Node(Op(relu), inputs:{X[res5a.add.output.5]}, outputs:{Out[res5a.add.output.5.tmp_0]}), inputs:{res5a.add.output.5}, outputs:{res5a.add.output.5.tmp_0}.
    Node(res5a.add.output.5.tmp_0{-1x2048x7x7}), inputs:{relu}, outputs:{conv2d, elementwise_add, elementwise_add_grad, conv2d_grad, relu_grad}
}
...
I0107 09:49:11.991044 24788 fusion_group_pass.cc:58] subgraph: {
    Node(res4a.add.output.5.tmp_0{-1x1024x14x14}), inputs:{fusion_group}, outputs:{conv2d, elementwise_add, elementwise_add_grad, conv2d_grad, relu_grad}
    Node(bn4b_branch2c.output.1.tmp_2{-1x1024x14x14}), inputs:{batch_norm}, outputs:{elementwise_add, elementwise_add_grad}
    Node(res4b.add.output.5.tmp_0{-1x1024x14x14}), inputs:{relu}, outputs:{conv2d, elementwise_add_grad, conv2d_grad, relu_grad, fusion_group}
    Node(res4b.add.output.5.tmp_0@GRAD{-1x1024x14x14}), inputs:{sum}, outputs:{relu_grad}
  Node(Op(relu_grad), inputs:{Out[res4b.add.output.5.tmp_0], Out@GRAD[res4b.add.output.5.tmp_0@GRAD]}, outputs:{X@GRAD[res4b.add.output.5@GRAD]}), inputs:{res4b.add.output.5.tmp_0, res4b.add.output.5.tmp_0@GRAD}, outputs:{res4b.add.output.5@GRAD}.
    Node(res4b.add.output.5@GRAD{-1x1024x14x14}), inputs:{relu_grad}, outputs:{elementwise_add_grad}
  Node(Op(elementwise_add_grad), inputs:{Out@GRAD[res4b.add.output.5@GRAD], X[res4a.add.output.5.tmp_0], Y[bn4b_branch2c.output.1.tmp_2]}, outputs:{X@GRAD[res4a.add.output.5.tmp_0@GRAD@RENAME@block0@0], Y@GRAD[bn4b_branch2c.output.1.tmp_2@GRAD]}), inputs:{res4b.add.output.5@GRAD, res4a.add.output.5.tmp_0, bn4b_branch2c.output.1.tmp_2}, outputs:{res4a.add.output.5.tmp_0@GRAD@RENAME@block0@0, bn4b_branch2c.output.1.tmp_2@GRAD}.
    Node(res4a.add.output.5.tmp_0@GRAD@RENAME@block0@0{-1x1024x14x14}), inputs:{elementwise_add_grad}, outputs:{sum}
    Node(bn4b_branch2c.output.1.tmp_2@GRAD{-1x1024x14x14}), inputs:{elementwise_add_grad}, outputs:{batch_norm_grad}
}
...

* Add the dynamic load of nvrtc, and support runtime compiling of CUDA kernel using nvrtc.
test=develop

* Call CUDA driver api to launch the kernel compiled by nvrtc.
test=develop

* Disable for mac and windows.
test=develop

* Refine the codes to support manually specified num_threads and workload_per_thread.
test=develop

* Refine the CUDA kernel to support large dims.
test=develop

* Add DeviceCodePool to manage all device codes.

* Add the first implementation of the fusion_group op.

* Add unit-test for fusion_group op.

* Add the check of result.

* Add the check of nvrtc in unit-test.
test=develop

* Add comment to explain the inputs, outputs and features of fusion_group op.
test=develop

* Disable fusion_group op for mac and windows.
test=develop

* Make the compiling of device code return status instead of hanging up.
test=develop

* Add the check of whether there is CUDA driver library, and do not core dump when failing to call the CUDA driver API.

* Unify fusion_group_op's input and output names.
test=develop

* Add the check of CUDA driver library in unittest.
test=develop

* Enable generating code for a given subgraph.

* Support sorting the subgraph.

* Remove the rearrangement of expressions because we use the sorted subgraph directly.

* Enable generating code for a subgraph which is composed of grad ops.

* Use expression information to check the accuracy in unittest.

* Separate load and store from computation expressions.
test=develop

* Improve the loading statements in generated codes.
test=develop

* Remove unused arguments from the formal parameter list.
test=develop
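
As background for the first two commits above (runtime compilation with NVRTC and launching through the CUDA driver API), here is a minimal standalone sketch of that flow. It is illustrative only and omits error checking; Paddle wraps these calls behind DeviceCodePool and loads the libraries dynamically, and CompileAndLaunch is a hypothetical name.

#include <cuda.h>
#include <nvrtc.h>
#include <string>

// Compile a CUDA C++ kernel at runtime with NVRTC, then load and
// launch it through the CUDA driver API. Real code must check every
// nvrtcResult/CUresult instead of ignoring them.
void CompileAndLaunch(const char* kernel_src, const char* kernel_name,
                      void** kernel_args, int n) {
  // 1. Runtime-compile the source to PTX.
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, kernel_src, "fused_kernel.cu", 0, nullptr,
                     nullptr);
  const char* opts[] = {"--std=c++11"};
  nvrtcCompileProgram(prog, 1, opts);
  size_t ptx_size = 0;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::string ptx(ptx_size, '\0');
  nvrtcGetPTX(prog, &ptx[0]);
  nvrtcDestroyProgram(&prog);

  // 2. Load the PTX and launch the kernel via the driver API.
  cuInit(0);
  CUdevice device;
  cuDeviceGet(&device, 0);
  CUcontext context;
  cuCtxCreate(&context, 0, device);
  CUmodule module;
  cuModuleLoadData(&module, ptx.c_str());
  CUfunction func;
  cuModuleGetFunction(&func, module, kernel_name);
  unsigned threads = 256;
  unsigned blocks = (n + threads - 1) / threads;
  cuLaunchKernel(func, blocks, 1, 1, threads, 1, 1,
                 0 /* shared memory */, nullptr /* default stream */,
                 kernel_args, nullptr);
  cuCtxSynchronize();
}
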
@Xreki force-pushed the fusion_group_final branch 4 times, most recently from 0b8ebe0 to dc8263b on January 9, 2020 08:51

@zhhsplendid (Member) left a comment

Good code. I noted some tiny readability issues; everything else LGTM.

@@ -370,6 +373,12 @@ ir::Graph *BuildStrategy::Apply(ir::Graph *graph,
"GPU, skipped.";
continue;
}
} else if (pass->Type() == "fusion_group_pass") {
pass->Set("use_gpu", new bool(use_cuda));

@zhhsplendid (Member) commented Jan 15, 2020

Tiny issue:

pass->Set("use_gpu", new bool(use_cuda))

I saw other passes write Set<bool> instead of Set. Do you need to do the same here?

@Xreki (Contributor, Author) replied

In fact, I'd like to use Set<bool>; I will make the code clearer.

y_shape = y->Var()->GetShape();
}
if (x_shape.size() == 0U || x_shape.size() != y_shape.size()) {
static bool IsEqual(const std::vector<int64_t>& l,

@zhhsplendid (Member) commented

  1. Are you implementing "equality" for std::vector<int64_t>? I think "l == r" may work, because std::vector has "==".

  2. If l.size() == 0 or r.size() == 0, this also returns "false". So if you need that behavior, I suggest renaming the function to EqualAndNotEmpty.

@Xreki (Contributor, Author) replied

Are you implementing "equality" for std::vector<int64_t>? I think "l == r" may work, because std::vector has "==".

Thanks a lot for this; it helps a lot.

If l.size() == 0 or r.size() == 0, this also returns "false". So if you need that behavior, I suggest renaming the function to EqualAndNotEmpty.

Yes, I need them to be non-empty. I changed the name to IsEqualAndNotEmpty.
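
Based on the discussion above, the renamed helper presumably looks something like the following sketch, using std::vector's operator== as suggested; the actual code in the PR may differ.

#include <cstdint>
#include <vector>

// Returns true only when both shapes are non-empty and element-wise equal.
static bool IsEqualAndNotEmpty(const std::vector<int64_t>& l,
                               const std::vector<int64_t>& r) {
  return l.size() != 0U && r.size() != 0U && l == r;
}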


class FusionGroupPaddingRNNTest(PaddingRNNTestBase):
    def set_customed_config(self):
        # Enable fusion_group_pass

@zhhsplendid (Member) commented

I feel we don't need this comment, because the following code carries the same information.

@Xreki (Contributor, Author) replied

Done.



class PaddingRNNTestBase(unittest.TestCase):
    def setUp(self):
        self.reader = Reader()
        self.device_count = 1

    def prepare_program(self, config, parallel=True):
        # Default exec_strategy

@zhhsplendid (Member) commented

Can we delete the comment "Default exec_strategy"?

You initialize a default exec_strategy but then set some values on it, so it is not really the default, I think. The same applies to "Default build_strategy".

@Xreki (Contributor, Author) replied

I mean these are the default build_strategy and exec_strategy used for this program. I will refine the comment to make it unambiguous.

@zhhsplendid (Member) left a comment

LGTM

@Xreki merged commit dcfb603 into PaddlePaddle:develop on Feb 7, 2020
@Xreki deleted the fusion_group_final branch on February 7, 2020 06:59