Exception occurs when doing BERT training on CPU with transpose mkldnn op enabled #17611
We met the same problem on the content_dnn model. We use parallel_executor for training.
So we suspect there is something wrong in the transpose mkldnn op?
@LeoZhao-Intel Could you give the
A temporary fix, just for test.
@luotao1 BTW, can you share the method to enable/disable mkldnn for a specific op in Python? Thanks.
Please see #17341.
The softmax mkldnn kernel also has this issue.
content_dnn doesn't have this issue with softmax mkldnn.
Increase device_count, e.g. set CPU_NUM to a larger value, and it may happen.
@LeoZhao-Intel @jczaja I tested the temporary fix in #17611 (comment) on content_dnn 16-thread training, and it runs OK.
@luotao1 For Executor, is there an interface similar to ParallelExecutor's that can be used for setting the mkldnn op list? Thanks.
No, there is no similar interface for Executor.
So that means there is no way to set the mkldnn op list from the Python level?
For Executor, there is no such interface now, since it doesn't support passes.
Got it, thanks.
@jczaja For how to set up the environment, please refer to https://docs.google.com/document/d/1luZg1v2_leUaV7nQhXqJvgzLIEsU-OAVNiIyDuo88dM; the run script is shared here.
You can modify the mkldnn op list in the script to enable/disable ops.
@jczaja will take time to continue the investigation and provide a formal fix. @LeoZhao-Intel Please help Jacek reproduce this issue. Thanks.
Some update: I have reproduced the problem, e.g. the crash in Transpose and Softmax during BERT training, but have not fixed it yet. I'm investigating the problem and have a couple of questions regarding thread management in Paddle. The device context is holding a collection of maps. This collection (the map as well) is using cur_thread_id
Questions:
A potential fix could be calling setMkldnnThreadID/set_cur_thread_id with a unique value for each "task". Anyway, as I'm not familiar with PaddlePaddle's ParallelExecutor and PoolThread, I will look further into that, but if you could answer my questions or explain the architecture of ParallelExecutor in terms of thread management, that would be very helpful.
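To make the caching scheme under discussion concrete, here is a minimal, hypothetical C++ sketch of a device-context cache keyed by a cur_thread_id-style value (all names are illustrative, not Paddle's actual code). It also shows the failure mode discussed here: if two ParallelExecutor workers share the same id, they share one inner map and collide on cached entries.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Illustrative stand-in for the device context's "collection of maps":
// one blob map per thread id, selected by the current thread id.
class DeviceContextCache {
 public:
  void SetCurThreadId(int tid) { cur_thread_id_ = tid; }

  void SetBlob(const std::string& key, std::shared_ptr<void> blob) {
    std::lock_guard<std::mutex> lock(mtx_);
    blob_maps_[cur_thread_id_][key] = std::move(blob);
  }

  std::shared_ptr<void> GetBlob(const std::string& key) {
    std::lock_guard<std::mutex> lock(mtx_);
    auto& m = blob_maps_[cur_thread_id_];
    auto it = m.find(key);
    return it == m.end() ? nullptr : it->second;
  }

 private:
  // If two workers run with the same cur_thread_id, they share one
  // inner map -- the collision this thread is about.
  std::map<int, std::map<std::string, std::shared_ptr<void>>> blob_maps_;
  std::mutex mtx_;
  int cur_thread_id_ = 0;  // thread-local in a real implementation
};
```

Switching the id switches which inner map a lookup hits, which is why calling set_cur_thread_id with a unique value per "task" would isolate the caches.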
https://github.com/PaddlePaddle/FluidDoc/blob/develop/doc/fluid/design/concepts/parallel_executor.md
@jczaja Because during the running of an operator, it first gets the DeviceContext from the
@chengduoZH Thanks for the conceptual explanation. Status: the MKL-DNN reuse implementation is not thread safe; I'm currently implementing a prototype to make it thread safe, e.g. with locks, critical sections, etc. Once this works for BERT training via ParallelExecutor, we will think about how to make it work for all mkl-dnn ops.
@chengduoZH, regarding this issue, I'd like to ask a few questions about ParallelExecutor.
@luotao1 Is it possible from within Paddle (at operator level) to figure out which executor is used, e.g. NaiveExecutor, Executor, or ParallelExecutor?
@LeoZhao-Intel Which one is faster, use_fast_executor=true or false?
ParallelExecutor can be used for training and testing now, and it will continue to be optimized in the future.
@chengduoZH Thanks for your reply. use_fast_executor=true gives almost a 2x performance improvement; we were surprised to see this. It would be good for us to understand your strategy and roadmap, so we can optimize for Xeon accordingly.
@jczaja It is impossible at op level to figure out which executor is used, and an op doesn't need to know which executor will call it.
@luotao1 Could you please run training of ContentDNN on the following branch? If there is a failure, please provide the log (this is still work in progress; half of the mkl-dnn ops were secured).
Status: critical sections were reduced (changes are on the branch URL as previously). Currently trying to improve the code (avoid copying locks all over). If all is fine, then the first PR should be ready this week.
@jczaja Could you update the branch
@jczaja One comment on the prv-mkldnn-reuse-mt branch.
Then what will happen? Either primitive X's data handle for instance A is changed to B's, or a crash? Given that this shared primitive object is executed serially in the single-instance case, but in parallel for multiple instances, how do we ensure execution order in the mkldnn pipeline? It may be a problem, or do we use a mutex to protect execution? That looks inefficient. Coming back to the initial problem: how do we identify an op coming from different execution instances of the same graph? Previously it had a thread id, but in ParallelExecutor there is no equivalent way to set/get this thread id. So if we can identify the instance and create a shared primitive object per instance, that will be fine.
@LeoZhao-Intel That is a very good point. Currently it would be easier for me to put a critical section around setting the pointer to the memory primitive and the mkldnn execution. If there is a noticeable performance drop, then the only option would be finding a way to cache each instance's primitives/memory independently. I haven't found a per-instance identifier in ParallelExecutor, so I guess we would need to add one, which means some changes to core Paddle functionality; so if the critical-section solution is not fast enough, we will have to do that. Thanks for pointing out this serious issue.
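The critical-section option mentioned above can be sketched as follows. This is a hypothetical stand-in, not Paddle's code: the data-handle rebinding and the execution happen under one lock, so another instance cannot rebind the shared primitive's pointer mid-run.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Illustrative stand-in for a cached mkldnn primitive whose input
// pointer (data handle) is rebound on every call.
struct CachedPrimitive {
  const float* src = nullptr;
  void Execute(std::vector<float>* dst) {
    // Stand-in for the real mkldnn execution: copy src into dst.
    dst->assign(src, src + dst->size());
  }
};

class SafeRunner {
 public:
  void Run(const std::vector<float>& in, std::vector<float>* out) {
    // Without this lock, instance B could rebind prim_.src while
    // instance A is still executing -- wrong data or a crash, exactly
    // the race described above.
    std::lock_guard<std::mutex> lock(mtx_);
    prim_.src = in.data();
    prim_.Execute(out);
  }

 private:
  CachedPrimitive prim_;  // one cached object shared across instances
  std::mutex mtx_;
};
```

The design trade-off is as stated in the thread: this serializes all instances on the shared primitive, so if the contention cost is too high, per-instance caching is the remaining option.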
@chengduoZH Is ParallelExecutor used for data parallelism of the model, or also for model parallelism?
Extending the work on branch prv-mkldnn-reuse-mt to handle the case pointed out by @LeoZhao-Intel is difficult and would introduce overhead. I decided to try a different approach:
@luotao1 Could you please run training of ContentDNN on the following branch? If there is a failure, please provide the log (this is still work in progress; Relu (activations) was not updated).
@jczaja @jianhang-liu Could you create a PR for the fix of our face deployment service first?
My understanding is that #17965 solved some issues, but introduced memory leaks due to over-caching mkl-dnn data. I'm working on modifying this behaviour so that ParallelExecutor's use case is still cached per std::thread::id, but AnalysisPredictor (to be used in your service?) is cached
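A minimal sketch of per-std::thread::id caching as described above (illustrative only; all names are hypothetical, and this is not the actual #17965 change):

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <thread>

// Each worker thread of ParallelExecutor gets its own key map, so
// cached primitives are never shared between threads.
class PerThreadCache {
 public:
  void Set(const std::string& key, int value) {
    std::lock_guard<std::mutex> lock(mtx_);
    cache_[std::this_thread::get_id()][key] = value;
  }

  bool Get(const std::string& key, int* value) {
    std::lock_guard<std::mutex> lock(mtx_);
    auto& m = cache_[std::this_thread::get_id()];
    auto it = m.find(key);
    if (it == m.end()) return false;
    *value = it->second;
    return true;
  }

 private:
  // Drawback discussed in this thread: entries outlive their threads,
  // which is one way over-caching turns into memory growth.
  std::map<std::thread::id, std::map<std::string, int>> cache_;
  std::mutex mtx_;
};
```

This isolates the ParallelExecutor workers from each other, but a predictor that services each request on a fresh thread would keep accumulating per-thread entries, matching the memory-leak concern raised for #17965.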
Fixed by PR #17965.
I met an issue related to mkldnn when I tried to enable the transpose mkldnn op in ParallelExecutor for BERT training.
An exception is thrown in TransposeMKLDNNHandler::AcquireTranspose().
In the current implementation of Compute(), a hash key is used to create the TransposeMKLDNNHandler, and this key is generated by
This method is OK for the common case in Executor or NativeExecutor, since there is only one instance per op; and even for the multi-instance inference case, SetMkldnnThreadID() is designed to identify the op.
But the ParallelExecutor case is special: it duplicates program execution by a user-defined instance number, as if you had multiple GPU cards in one machine. Coming to the root cause in my case: for ParallelExecutor on CPU, there is no place to call SetMkldnnThreadID() to identify different mkldnn op instances of the same program. Simply speaking, the key generation method is no longer sufficient.
For my current temporary solution, I add the kernel instance pointer 'this' into the hash key, so instances can be distinguished by just 'this', and then it works.
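The temporary fix described above can be sketched as follows, with all names hypothetical: the kernel's `this` pointer is folded into the cache key, so two executor instances running the same graph with identical shapes and thread ids still get distinct handlers.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Illustrative stand-in for a transpose kernel's cache-key generation.
class TransposeKernelSketch {
 public:
  std::string MakeKey(const std::vector<int>& dims, int thread_id) const {
    std::ostringstream key;
    for (int d : dims) key << d << "-";
    key << thread_id << "-";
    // The temporary fix: the instance pointer makes the key unique per
    // kernel object even when dims and thread_id are identical.
    key << this;
    return key.str();
  }
};
```

Since every ParallelExecutor replica holds its own kernel object, keying on the object address separates their cached handlers without needing a per-instance id from the executor.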
I think this is not an issue just for Transpose; it may be a generic issue for all mkldnn ops under ParallelExecutor.