Remove reshape and transpose operators from attention module #16342
Conversation
@tensor-tang Please help to review this PR. This is one critical patch for BERT (and it applies to Transformer as well). Thanks!
@tensor-tang Please help us review this PR and give some suggestions. Thanks a lot!
Do you have the model-level comparison from before and after this PR?
Just updated the description to include the comparison for this model.
```diff
@@ -137,7 +137,8 @@ CpuPassStrategy::CpuPassStrategy() : PassStrategy({}) {
       // following two passes should be located in the last, since
       // they will work on all fused ops.
       "expected_kernel_cache_pass",  //
-      "runtime_context_cache_pass"});
+      "runtime_context_cache_pass",  //
+      "fuse_reshape_transpose_scale_matmul_pass"});
```
See line 137: put `fuse_reshape_transpose_scale_matmul_pass` before `expected_kernel_cache_pass`.
done
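For context, here is a minimal usage sketch of how a CPU inference user would pick up the pass once it is registered in `CpuPassStrategy`. This is an assumption-laden illustration based on the Paddle 1.x inference API (`AnalysisConfig`, `SwitchIrOptim`, `CreatePaddlePredictor`), not code from this PR; the model path is reused from the benchmark in the description below.

```cpp
#include "paddle/fluid/inference/api/paddle_inference_api.h"

int main() {
  paddle::AnalysisConfig config;
  // Model path reused from this PR's benchmark description.
  config.SetModel("third_party/inference_demo/bert_emb128/model");
  // Enable IR optimization so CpuPassStrategy (including the new
  // fuse_reshape_transpose_scale_matmul_pass) runs over the graph.
  config.SwitchIrOptim(true);
  auto predictor = paddle::CreatePaddlePredictor(config);
  return predictor != nullptr ? 0 : 1;
}
```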
```cpp
    this->template GEMM<T>(transA == CblasTrans, transB == CblasTrans, M, N, K,
                           alpha, Ak, lda, Bk, ldb, beta, Ck, ldc);
  }
}
```
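For readers without the full file open, here is a hedged reconstruction of the loop this excerpt closes (an assumption about the surrounding code, not a verbatim quote): the strided `BatchedGEMM` fallback walks the batch and issues one plain `GEMM` per slice.

```cpp
// Assumed surrounding loop: A and B advance by fixed strides per batch,
// while C is laid out as contiguous M*N output blocks.
for (int k = 0; k < batchCount; ++k) {
  const T* Ak = &A[k * strideA];  // k-th slice of A
  const T* Bk = &B[k * strideB];  // k-th slice of B
  T* Ck = &C[k * M * N];          // k-th output block
  this->template GEMM<T>(transA == CblasTrans, transB == CblasTrans, M, N, K,
                         alpha, Ak, lda, Bk, ldb, beta, Ck, ldc);
}
```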
- What's the difference between the old `BatchedGEMM` and the new `BatchedGEMM` in your PR?
- I see the difference in the input format: from `const T*` to `std::vector<const T*>* a_array`.
- Could you reuse the old one or unify them? Or why did you create a new one?

Same for the `MatMul`.
There is already:

```cpp
void Blas<DeviceContext>::MatMul(const framework::Tensor &mat_a,
                                 const MatDescriptor &dim_a,
                                 const framework::Tensor &mat_b,
                                 const MatDescriptor &dim_b, T alpha,
                                 framework::Tensor *mat_out, T beta)
```
> What's the difference between the old `BatchedGEMM` and the new `BatchedGEMM` in your PR?

The new one passes the arrays of input and output pointers into `BatchedGEMM` directly.

> I see the difference in the input format: from `const T*` to `std::vector<const T*>* a_array`.

Because the transpose dimensions differ and the strides have special requirements, the pointer arrays for the MKL batched GEMM have to be computed in a particular way, so we moved that computation inside the matmul operator (see the sketch below).

> Could you reuse the old one or unify them? Or why did you create a new one?

The initial idea was to avoid complicating the common Blas implementation: if we put this (or other pointer-array) computation into blas_impl.h, the code would not stay clean.

> Same for the `MatMul`. There is already `void Blas<DeviceContext>::MatMul(const framework::Tensor &mat_a, const MatDescriptor &dim_a, const framework::Tensor &mat_b, const MatDescriptor &dim_b, T alpha, framework::Tensor *mat_out, T beta)`.
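To illustrate the pointer-array interface described above, here is a minimal self-contained sketch of a one-group call to MKL's `cblas_sgemm_batch`, which consumes one pointer per slice instead of a single fixed stride. The function name `ArrayBatchedSGEMM` and the row-major/no-transpose setup are assumptions for the example, not the PR's actual code.

```cpp
#include <mkl.h>
#include <vector>

// Sketch: batched C_i = alpha * A_i * B_i + beta * C_i, where each slice is
// addressed by its own pointer. Arbitrary offsets produced by the fused
// reshape/transpose can therefore be encoded directly in a_array/b_array.
void ArrayBatchedSGEMM(int M, int N, int K, float alpha, float beta,
                       std::vector<const float*>* a_array,
                       std::vector<const float*>* b_array,
                       std::vector<float*>* c_array) {
  const MKL_INT m = M, n = N, k = K;
  const MKL_INT lda = K, ldb = N, ldc = N;  // row-major, no transpose
  const CBLAS_TRANSPOSE trans = CblasNoTrans;
  const MKL_INT group_size = static_cast<MKL_INT>(a_array->size());
  // One group: every multiply in the batch shares shapes and scalars.
  cblas_sgemm_batch(CblasRowMajor, &trans, &trans, &m, &n, &k, &alpha,
                    a_array->data(), &lda, b_array->data(), &ldb, &beta,
                    c_array->data(), &ldc, /*group_count=*/1, &group_size);
}
```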
```cpp
        .SetDefault(std::vector<int>{-1, -1, -1});
    AddAttr<bool>("is_test",
                  "(bool, default false) Set to true for inference only, false "
                  "for training. Some layers may run faster when this is true.")
```
Why does matmul need an `is_test` attribute?

Why add a `last_dim` attribute?
> Why does matmul need an `is_test` attribute?

The `is_test` attribute is added for inference mode and doesn't influence other use cases.

> Why add a `last_dim` attribute?

To decrease the number of the matmul operator's attributes; the trade-off is that it only works for the specific reshape and transpose dimensions (see the sketch below).
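A hypothetical sketch of what registering such an attribute could look like inside the MatMul `OpMaker` — the attribute name, help string, and default here are assumptions mirroring the `.SetDefault(std::vector<int>{-1, -1, -1})` fragment quoted above, not the PR's exact code:

```cpp
// Inside MatMulOpMaker::Make() (fragment, for illustration only):
AddAttr<std::vector<int>>(
    "last_dim",
    "(std::vector<int>, default {-1, -1, -1}) Shape hint that lets the fused "
    "kernel reproduce the removed reshape + transpose; only meaningful for "
    "the specific attention dimensions.")
    .SetDefault(std::vector<int>{-1, -1, -1});
```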
Add the "is_test" attribute for inference mode and don't influence other requirement.
Matmul is a common and base op, and it should not have the difference between train and inference.
@luotao1 Score (inference) and the forward part of training do sometimes differ, and the `is_test` attribute is used to distinguish between them; it's already widely used in many OPs. For example:

- Batch Norm: use the fixed mean/variance instead of computing them on the batch
- Softmax: skip epsilon for a performance improvement
- sequence_pool: don't create the index buffer, for a performance improvement

The optimization here for attention (i.e. removing transpose/reshape via the enhanced MatMul) only needs to apply to inference at this time, so we add `is_test` to gate our code, as sketched below. If this optimization later needs to be applied to training as well, we can add the backward part and remove the `is_test` check in the forward pass.
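A hedged sketch of that gating, assuming a standard Paddle CPU kernel structure (the two helper functions are hypothetical, not the PR's code):

```cpp
// Fragment of a hypothetical MatMul kernel's Compute():
void Compute(const paddle::framework::ExecutionContext& ctx) const {
  const bool is_test = ctx.Attr<bool>("is_test");
  if (is_test) {
    // Inference only: fused path that folds reshape/transpose into the
    // batched GEMM and therefore has no backward op.
    ComputeFusedPath(ctx);  // hypothetical helper
  } else {
    // Training (and its backward pass) keeps the original MatMul path.
    ComputePlainPath(ctx);  // hypothetical helper
  }
}
```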
@jianhang-liu This PR should be moved to Release 1.6.
```cpp
  template <typename T>
  void MatMul(const framework::Tensor& mat_a, const MatDescriptor& dim_a,
              const framework::Tensor& mat_b, const MatDescriptor& dim_b,
              T alpha, framework::Tensor* mat_out, T beta) const;

  template <typename T>
  void MatMul(std::vector<const T*>* a_array, const MatDescriptor& dim_a,
              const int ld_a, std::vector<const T*>* b_array,
```
Why should the matmul input be a vector?
Bingyang is now preparing to re-implement the pass in the future. This PR will be aborted.
```diff
@@ -176,11 +177,24 @@ class Blas {
                    int K, T alpha, const T* A, const T* B, T beta, T* C,
                    int batchCount, int64_t strideA, int64_t strideB) const;

+  template <typename T>
+  void BatchedGEMM(CBLAS_TRANSPOSE transA, CBLAS_TRANSPOSE transB, int M, int N,
+                   int K, T alpha, std::vector<const T*>* a_array,
```
`const std::vector<const T*>*`?
Bingyang is now preparing to re-implement the pass in the future. This PR will be aborted.
Such a good point, and thanks for the foresight!
```diff
@@ -136,7 +136,8 @@ CpuPassStrategy::CpuPassStrategy() : PassStrategy({}) {
       "is_test_pass",  //
       // following two passes should be located in the last, since
       // they will work on all fused ops.
-      "expected_kernel_cache_pass",  //
+      "expected_kernel_cache_pass",                //
+      "fuse_reshape_transpose_scale_matmul_pass",  //
```
Why add it at the end? This is a very big fuse; maybe it should run earlier.
Bingyang is now preparing to re-implement the pass in the future. This PR will be aborted.
This PR was tested with Ernie on CPU, with the number of threads set to 20.
Based on the performance profile of the BERT/Transformer models, this PR fuses the matmul/reshape/transpose operators to reduce memory copies.
Platform: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
Model Path: third_party/inference_demo/bert_emb128/model
Batch Size: 1
Command: ./paddle/fluid/inference/tests/api/test_analyzer_bert --infer_model=third_party/inference_demo/bert_emb128/model/ --infer_data=third_party/inference_demo/bert_emb128/data.txt --gtest_filter=Analyzer_bert.profile --paddle_num_threads=1 --repeat=10 --batch_size=1 --test_all_data
Data Source: third_party/inference_demo/bert_emb128/data.txt.
The following is a comparison across the different scenarios.
Model Comparison:
![image](https://user-images.githubusercontent.com/4131969/56845682-7cb5a280-68f7-11e9-82b2-ab459aaf3c9e.png)
(a) Before optimization:
(b) After optimization:
![image](https://user-images.githubusercontent.com/4131969/56845673-50018b00-68f7-11e9-9d11-670d67994cd5.png)
Reference:
![image](https://user-images.githubusercontent.com/6836917/56850552-bd80dc00-6936-11e9-9784-7fcfc989d52f.png)
Can we avoid head split_merge in Transformer.pdf