
Remove reshape and transpose operators from attention module #16342

Closed

Conversation

@yihuaxu (Contributor) commented Mar 21, 2019

Based on profiling of the BERT/Transformer model, this PR fuses the matmul/reshape/transpose operators to reduce memory copies.

Platform: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
Model Path: third_party/inference_demo/bert_emb128/model
Batch Size: 1
Command: ./paddle/fluid/inference/tests/api/test_analyzer_bert --infer_model=third_party/inference_demo/bert_emb128/model/ --infer_data=third_party/inference_demo/bert_emb128/data.txt --gtest_filter=Analyzer_bert.profile --paddle_num_threads=1 --repeat=10 --batch_size=1 --test_all_data
Data Source: third_party/inference_demo/bert_emb128/data.txt.

The following is a comparison of the different scenarios.

[performance comparison image]

Model Comparison:
(a) Before Optimization:
[model graph before optimization]

(b) After Optimization:
[model graph after optimization]

Reference:
Can we avoid head split_merge in Transformer.pdf
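For context, here is a hedged C++ sketch of the idea behind the fusion (illustrative only; the helper name and shapes are hypothetical, not this PR's code): instead of materializing reshape/transpose copies of the [batch, seq_len, num_heads * head_dim] activations, each (batch, head) sub-matrix can be addressed in place through a pointer array plus a leading dimension, which a batched GEMM can then consume directly.

#include <cstdint>
#include <vector>

// Hypothetical helper: build per-head pointers into a
// [batch, seq_len, num_heads * head_dim] buffer. Each (batch, head)
// sub-matrix is seq_len x head_dim with leading dimension
// num_heads * head_dim, so no reshape/transpose copy is needed.
std::vector<const float*> HeadPointers(const float* data, int batch, int seq_len,
                                       int num_heads, int head_dim) {
  std::vector<const float*> ptrs;
  ptrs.reserve(static_cast<size_t>(batch) * num_heads);
  const int64_t row_stride = static_cast<int64_t>(num_heads) * head_dim;
  for (int b = 0; b < batch; ++b) {
    for (int h = 0; h < num_heads; ++h) {
      ptrs.push_back(data + b * seq_len * row_stride + h * head_dim);
    }
  }
  return ptrs;
}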

@yihuaxu force-pushed the develop_7fbf52daa_matmul_fuse_pass branch from 89136c8 to ef3cd38 on March 21, 2019 06:36
@luotao1 added the Intel label on Mar 21, 2019
@yihuaxu force-pushed the develop_7fbf52daa_matmul_fuse_pass branch from ef3cd38 to 0eea941 on March 21, 2019 11:15
@yihuaxu force-pushed the develop_7fbf52daa_matmul_fuse_pass branch 2 times, most recently from 62a9e02 to dc40dfb on March 21, 2019 13:20
@yihuaxu force-pushed the develop_7fbf52daa_matmul_fuse_pass branch from dc40dfb to b4359f9 on March 21, 2019 22:48
@yihuaxu (Contributor, Author) commented Mar 24, 2019

start a review

@jianhang-liu (Contributor):
@tensor-tang Please help to review this PR. This is a critical patch for BERT (and it also applies to Transformer). Thanks!

@yihuaxu changed the title from "Fuse matmul/reshape/transpose operators to reduce memory's copy" to "Remove reshape and transpose operators from attention module" on Apr 16, 2019
@yihuaxu (Contributor, Author) commented Apr 24, 2019

@tensor-tang Please help us review this PR and give some suggestions. Thanks a lot!

@luotao1 (Contributor) commented Apr 25, 2019

The following is a comparison of the different scenarios.

Do you have a model-level comparison before and after this PR?

@yihuaxu (Contributor, Author) commented Apr 27, 2019

The following is a comparison of the different scenarios.

Do you have a model-level comparison before and after this PR?

Just updated the description to include the model-level comparison.

@@ -137,7 +137,8 @@ CpuPassStrategy::CpuPassStrategy() : PassStrategy({}) {
 // following two passes should be located in the last, since
 // they will work on all fused ops.
 "expected_kernel_cache_pass", //
-"runtime_context_cache_pass"});
+"runtime_context_cache_pass", //
+"fuse_reshape_transpose_scale_matmul_pass"});
Contributor:

See line 137: put fuse_reshape_transpose_scale_matmul_pass before expected_kernel_cache_pass.

Contributor Author:

done
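To make the requested ordering concrete, here is a hedged sketch (a hypothetical fragment, assuming the pass list is kept in a std::vector<std::string> member such as passes_, as the fragment's closing "});" suggests; not the merged code):

passes_.assign({/* ... earlier CPU passes ... */
                "fuse_reshape_transpose_scale_matmul_pass",  // run the fusion first
                // following two passes should be located in the last, since
                // they will work on all fused ops.
                "expected_kernel_cache_pass",   //
                "runtime_context_cache_pass"});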

this->template GEMM<T>(transA == CblasTrans, transB == CblasTrans, M, N, K,
alpha, Ak, lda, Bk, ldb, beta, Ck, ldc);
}
}
Contributor:

  • What's the difference between the old BatchedGEMM and the new BatchedGEMM in your PR?
  • I see the difference is the input format, from const T* to std::vector<const T*>* a_array.
  • Could you reuse the old one or unify them? Or why did you create a new one?

Same for the Matmul.
There is already

void Blas<DeviceContext>::MatMul(const framework::Tensor &mat_a,
                                 const MatDescriptor &dim_a,
                                 const framework::Tensor &mat_b,
                                 const MatDescriptor &dim_b, T alpha,
                                 framework::Tensor *mat_out, T beta) 

@yihuaxu (Contributor, Author) commented May 10, 2019:

  • What's the difference between the old BatchedGEMM and the new BatchedGEMM in your PR?
    The new one passes the input and output pointer arrays into BatchedGEMM directly.
  • I see the difference is the input format, from const T* to std::vector<const T*>* a_array.
    Because the transpose dimensions and the stride requirements differ, the pointer arrays for MKL's BatchedGEMM have to be built with a special calculation, so we moved that calculation inside the matmul operator.
  • Could you reuse the old one or unify them? Or why did you create a new one?
    The initial idea was to avoid complicating the common BLAS implementation. If we put this and other pointer-array calculations into blas_impl.h, it would not keep the code clean.

Same for the Matmul.
There is already

void Blas<DeviceContext>::MatMul(const framework::Tensor &mat_a,
                                 const MatDescriptor &dim_a,
                                 const framework::Tensor &mat_b,
                                 const MatDescriptor &dim_b, T alpha,
                                 framework::Tensor *mat_out, T beta) 
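To illustrate why the pointer-array form is attractive, here is a hedged sketch (not this PR's code; the function name is hypothetical) that assumes MKL's cblas_sgemm_batch: every (batch, head) sub-matrix is passed through a_array/b_array/c_array together with its leading dimension, so no transposed or reshaped copy has to be materialized before the batched GEMM.

#include <mkl.h>
#include <vector>

// Hedged sketch: one group of equally sized GEMMs driven by pointer arrays.
void BatchedGemmViaPointerArrays(std::vector<const float*>& a_array,
                                 std::vector<const float*>& b_array,
                                 std::vector<float*>& c_array,
                                 MKL_INT M, MKL_INT N, MKL_INT K,
                                 MKL_INT lda, MKL_INT ldb, MKL_INT ldc) {
  const CBLAS_TRANSPOSE trans_a = CblasNoTrans;
  const CBLAS_TRANSPOSE trans_b = CblasNoTrans;
  const float alpha = 1.0f;
  const float beta = 0.0f;
  const MKL_INT group_size = static_cast<MKL_INT>(a_array.size());
  // A single group: all GEMMs share M/N/K, alpha/beta and the leading
  // dimensions; only the per-matrix pointers differ.
  cblas_sgemm_batch(CblasRowMajor, &trans_a, &trans_b, &M, &N, &K, &alpha,
                    a_array.data(), &lda, b_array.data(), &ldb, &beta,
                    c_array.data(), &ldc, /*group_count=*/1, &group_size);
}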

.SetDefault(std::vector<int>{-1, -1, -1});
AddAttr<bool>("is_test",
"(bool, default false) Set to true for inference only, false "
"for training. Some layers may run faster when this is true.")
Contributor:

Why does matmul need an is_test attribute?
Why add a last_dim attribute?

Contributor Author:

Why does matmul need an is_test attribute?
The "is_test" attribute is added for the inference mode so that other use cases are not affected.
Why add a last_dim attribute?
To reduce the number of the matmul operator's attributes; the trade-off is that it only works for the specific reshape and transpose dimensions.

Contributor:

The "is_test" attribute is added for the inference mode so that other use cases are not affected.

Matmul is a common, basic op; it should not behave differently between training and inference.

Contributor:

@luotao1 Scoring (inference) and the forward part of training sometimes differ. The "is_test" attribute is used to distinguish between them, and it is already widely used in many OPs. For example:

  • Batch Norm: use fixed mean/variance instead of computing them per batch
  • Softmax: skip the epsilon term for better performance
  • sequence_pool: do not create the index buffer, for better performance

The attention optimization here (i.e. removing transpose/reshape via the enhanced MatMul) only needs to apply to inference at this time, so we add "is_test" to gate our code.

If this optimization needs to be applied to training as well, we can add the backward part and remove "is_test" from the forward pass.
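As a hedged illustration of this point (a hypothetical fragment, assuming Paddle's framework::ExecutionContext::Attr API; not the PR's actual kernel), the attribute simply gates the inference-only path:

// Hypothetical fragment: gate the fused inference-only path on "is_test".
template <typename T>
void MatMulComputeSketch(const framework::ExecutionContext& ctx) {
  const bool is_test = ctx.Attr<bool>("is_test");
  if (is_test) {
    // Inference: consume the folded reshape/transpose shapes and call the
    // pointer-array BatchedGEMM, so no intermediate copies are made.
  } else {
    // Training: keep the original matmul path so backward stays unaffected.
  }
}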

@jianhang-liu jianhang-liu added this to the v1.5 for Intel milestone May 13, 2019
@bingyanghuang (Contributor):
@jianhang-liu This PR should be moved to Release 1.6

template <typename T>
void MatMul(const framework::Tensor& mat_a, const MatDescriptor& dim_a,
            const framework::Tensor& mat_b, const MatDescriptor& dim_b,
            T alpha, framework::Tensor* mat_out, T beta) const;

template <typename T>
void MatMul(std::vector<const T*>* a_array, const MatDescriptor& dim_a,
            const int ld_a, std::vector<const T*>* b_array,
Contributor:

Why should the matmul input be a vector?

Contributor Author:

Bingyang is now ready to re-implement the pass in the future. This PR will be abandoned.

@@ -176,11 +177,24 @@ class Blas {
                   int K, T alpha, const T* A, const T* B, T beta, T* C,
                   int batchCount, int64_t strideA, int64_t strideB) const;

  template <typename T>
  void BatchedGEMM(CBLAS_TRANSPOSE transA, CBLAS_TRANSPOSE transB, int M, int N,
                   int K, T alpha, std::vector<const T*>* a_array,
Contributor:

const std::vector<const T*>*?

Contributor Author:

Bingyang is now ready to re-implement the pass in the future. This PR will be abandoned.

paddle/fluid/operators/matmul_op.cc (review thread resolved)
@tensor-tang (Contributor):
Such a good point, and thanks for the foresighted GEMM design!

@@ -136,7 +136,8 @@ CpuPassStrategy::CpuPassStrategy() : PassStrategy({}) {
 "is_test_pass", //
 // following two passes should be located in the last, since
 // they will work on all fused ops.
-"expected_kernel_cache_pass", //
+"expected_kernel_cache_pass", //
+"fuse_reshape_transpose_scale_matmul_pass", //
Contributor:

Why add it at the end?

This is a very big fuse; maybe it should run earlier.

Contributor Author:

Bingyang is now ready to re-implement the pass in the future. This PR will be abandoned.

@GaoWei8 (Contributor) commented Aug 26, 2019

I tested this PR on Ernie on CPU with the number of threads set to 20.
The original time after fusion is 72.5432 ms (without this PR), and the time decreases to 69.9925 ms with this PR merged.
Is this result expected?
