Softmax OP (fwd+bwd) needs to be further optimized #17268
@jczaja Do all four of the improvements above apply to softmax_mkldnn_op?
Yes. Do you suggest looking at the non-MKL-DNN path as well? @tensor-tang implemented a JIT softmax in PaddlePaddle, but only for inference. Extending his code to work for forward training (provided the instruction set it uses matches your requirements) could be useful when softmax is run without MKL-DNN.
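For context, the gap between an inference-only and a training-capable softmax kernel is the backward pass. A minimal NumPy sketch (illustrative only; the function names are not PaddlePaddle's actual JIT code):

```python
import numpy as np

def softmax_fwd(x):
    # Row-wise softmax along the last axis, with max-subtraction for
    # numerical stability; this is all an inference-only kernel needs.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax_bwd(y, grad_out):
    # The extra piece a training kernel must add:
    # dx = y * (dy - sum(dy * y, axis=-1)), reusing the forward output y.
    dot = (grad_out * y).sum(axis=-1, keepdims=True)
    return y * (grad_out - dot)
```

Note the backward reuses the forward output `y`, so extending an inference JIT kernel to training mostly means emitting this second loop rather than regenerating the forward.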
Since
@jczaja Could you create a PR when you finish the first improvement?
@luotao1 Sure. Those points are just things I will try; not all of them will necessarily be implemented in one PR. OK, we will look at softmax_mkldnn first.
@luotao1 Just talked with Jacek. The current "extremely slow" performance of Softmax FWD looks quite strange to us; something must be wrong. Jacek will prepare a branch with some debug code and ask for your help to run the model and collect logs. That should happen within today (PL time).
@luotao1 I would like you to execute the following experiments (using MKL-DNN execution) so we can gather more data on the poor softmax fwd performance.
Please send us the gathered output.
@jczaja Does step 2 depend on step 1?
@jczaja Does it need MKLDNN_VERBOSE enabled when running step 2? I told Luo Tao it's not necessary, since your code uses cout directly to print the log.
I think it may not help much.
@luotao1 Regarding step 1: please provide the full log, not only the MKLDNN_VERBOSE=1 part. I asked for MKLDNN_VERBOSE just to make sure MKL-DNN is in use, but I also need the other parts of the log, since I put printf calls into MKL-DNN. To be more precise, I need lines like this:
I don't see any log like this.
@luotao1 When running the configuration step (e.g. cmake ...), the PaddlePaddle version/commit being used is printed. Could you please paste/confirm that line? This is the branch that should be used for step 1. Please confirm.
for step1, the
@luotao1 Could you please build the step 1 branch for unit tests and run: ctest -R test_softmax_mkldnn -VV
@jczaja @jianhang-liu Sorry for the mistaken test. The new log is:
@luotao1 Softmax forward DENSE is the faster kernel unless MKL-DNN is built without MKL. PaddlePaddle is built with MKL, so this should be the faster kernel. Thanks very much for sending the logs. The next two things I want to test are:
It is using the axis param with the default value -1 in this model.
@luotao1 I have seen the reference code for softmax as used in your internal framework. SoftmaxLayer::ff performs normalization over all input data, i.e. it assumes the input tensor is one-dimensional, but ContentDNN seems to use 2D/3D data for softmax. Questions:
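To make the distinction above concrete: a layer that assumes a 1-D input normalizes over all elements, which gives a different result from the row-wise normalization expected for 2-D/3-D inputs. A small NumPy sketch (illustrative only, not either framework's code):

```python
import numpy as np

def softmax_flat(x):
    # Normalizes over *all* elements, as a 1-D-only layer would
    # behave once the tensor has been flattened.
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_rows(x):
    # Normalizes each row independently (axis=-1), the behavior
    # expected for 2-D/3-D inputs.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

On a 2-D input the flat version sums to 1 over the whole tensor, while the row-wise version sums to 1 per row, so the two are only interchangeable if the caller flattens the data appropriately first.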
No, it is called once for
No, there are 4560 calls to SeqSoftmaxLayer.
We look at the TotalValue.
The input of SeqSoftmaxLayer has already been flattened; that's why Paddle uses an additional transpose2 op to do this.
Thanks for the answers and logs; they are very helpful. Meanwhile we inspected the slowness of MKL-DNN softmax: from previous findings we know that 99% of the time of the softmax fwd op is spent in MKL-DNN. Having the MKLDNN_VERBOSE output from ContentDNN, we can see:
We modified the unit test to execute a softmax of the same dims on MKL-DNN, and the results are:
mkldnn_verbose,exec,softmax,ref:any,forward_inference,fdata:nc fdiff:undef,,mb1600ic100,21.1709
Comparing the times, the results we gathered from the UT are much lower (apart from the first one) than those taken from ContentDNN training. So I would like you to get us some more data:
The local machine I used for the presented data:
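For reference, the mb1600ic100 entry above corresponds to a 1600x100 input (minibatch 1600, 100 channels). A rough way to time that shape standalone, outside the framework — a sketch, not the actual unit test:

```python
import time
import numpy as np

def softmax_rows(x):
    # Plain row-wise softmax used only as a timing stand-in.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = np.random.rand(1600, 100).astype(np.float32)  # mb1600ic100

softmax_rows(x)  # warm-up so allocations and caches settle
t0 = time.perf_counter()
iters = 100
for _ in range(iters):
    y = softmax_rows(x)
t1 = time.perf_counter()
print("avg softmax fwd: %.3f ms" % ((t1 - t0) / iters * 1e3))
```

As in the UT-vs-training comparison above, averaging over many iterations after a warm-up run matters: the first call is typically much slower than the steady state.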
I tested on three different machines, but we have the same problem:
@luotao1 Just a small comment: the logs presented above suggest that the input data for softmax
The framework overhead is quite small on this model.
softmax_compute means the
step1_ut.log
Thanks very much for the logs; they suggest that the machines are fine. I'm sorry I did not write it very clearly. If PaddlePaddle does some reads and writes between the transpose and the corresponding softmax ops (more than the internal framework does), then the output of the transpose may no longer be in cache, and softmax has to take its input from RAM rather than from cache memory. In that case, even if JIT is used, it may run slower due to the waiting time to get data from RAM. Anyway, hopefully all is fine with JIT softmax performance for ContentDNN.
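The locality argument can be illustrated with NumPy standing in for the real ops (an analogy only, not the framework's implementation): a lazy transpose leaves the data strided, while materializing it — as a transpose2-style op does — hands the softmax a contiguous, sequentially readable input.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

a = np.arange(12.0).reshape(3, 4)
view = a.T                                 # strided view, no copy made
materialized = np.ascontiguousarray(view)  # explicit copy, row-contiguous

# Numerically identical either way...
assert np.allclose(softmax_rows(view), softmax_rows(materialized))
# ...but only the copy is laid out so the softmax reads sequential memory.
assert not view.flags['C_CONTIGUOUS'] and materialized.flags['C_CONTIGUOUS']
```

Whether the materialized copy is still in cache when softmax runs depends on how much memory traffic the ops in between generate, which is exactly the concern raised above.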
@jacek, so what's the next step on your side to further check MKLDNN Softmax FWD?
@jianhang-liu The next step should be to enable PaddlePaddle's JIT softmax for this model. If this works very fast, then the MKL-DNN softmax implementation is poor; if JIT softmax is not performing well either, then
@jczaja @tensor-tang @luotao1 Let's recap where we are after a few days' investigation.
What's ongoing now:
@luotao1 Could you please,
In both scenarios functional problems will arise, e.g. convergence problems, as those are not production-quality branches. We are checking performance.
@jczaja @jianhang-liu @tensor-tang I directly used the inference JIT softmax fwd, but it is as slow as before.
The performance is 8x slower than before.
See #17268 (comment)
@luotao1 Could you please provide the log (MKLDNN_VERBOSE + Paddle profiling) for step 1?
@luotao1 Regarding step 1: 8x slower with memcpy instead of vsexp is very surprising. I was able to look into the detailed log (VLOG output) of the ContentDNN training. I looked at what happens between transpose2 and the corresponding softmax. For the slowest one, e.g. 16x100x100, we have
On the other hand, I heard from @jianhang-liu that @tensor-tang put the internal framework's softmax into PaddlePaddle and performance was largely improved. If you could share your findings, that would be great.
Yes, no other VMs/dockers are running.
I have sent an email to you.
For step 1, it causes an error after iteration 600. Thus, I paste the log (MKLDNN_VERBOSE + Paddle profiling) for iteration 400.
@luotao1 Actually, JIT helps a little, based on @zhupengyang's test. The reason it does not help much is that ContentDNN has changing and small input sizes, and JIT kernel creation takes time. A better solution for this case is proposed at #17522.
This issue is closed by PR #17522 and #17534 from @tensor-tang. Meanwhile, the Intel MKL-DNN team is also working on a JIT version of Softmax, which will be in v1.0 (and possibly backported to v0.1x as well).
Need to optimize Softmax (fwd+bwd) for CPU training