test_analyzer_* random crashs at compare_mkldnn #15032

Closed
luotao1 opened this issue Dec 25, 2018 · 35 comments · Fixed by #15860 or #18578

@luotao1 luotao1 added the Intel label Dec 25, 2018

luotao1 commented Dec 25, 2018

Is this the same problem as #14174?

@luotao1 luotao1 changed the title test_analyzer_small_dam random crashs at compare_mkldnn test_analyzer_* random crashs at compare_mkldnn Jan 3, 2019
@jianhang-liu

@jczaja Could you help take a look at this? Thanks!


luotao1 commented Jan 3, 2019

From the log, it seems that mkldnn hangs. Is it an OpenMP problem?


jczaja commented Jan 3, 2019

@luotao1 We will look into that and get back to you.


jczaja commented Jan 3, 2019

@luotao1 I was not able to reproduce the problem.

  1. Could you make a change so that compare_mkldnn runs without MKL-DNN, e.g. compare(false), and then run the experiments? That way we know whether MKL-DNN is the actual problem.
  2. How did you observe from the logs that mkldnn hangs? Did you attach a debugger and observe where the execution is looping?
  3. Do you run the tests in ctest's parallel mode? What ctest command line do you use to run the tests?


luotao1 commented Jan 4, 2019

Could you make a change so that compare_mkldnn runs without MKL-DNN, e.g. compare(false), and then run the experiments? That way we know whether MKL-DNN is the actual problem.

We have a compare UT which runs without mkldnn, and it always runs successfully. You can see these UTs in the URL.

[22:59:02][Step 1/1] [       OK ] Analyzer_dam.compare (553 ms)
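
For context, a sketch of how such a compare()/compare_mkldnn pair is usually structured in the analyzer tests; the SetConfig/SetInput helpers and the exact config type are assumptions based on this discussion, not verified against the dam test source:

// (inside the dam analyzer tester, which provides the helpers and types below)
void compare(bool use_mkldnn = false) {
  AnalysisConfig cfg;
  SetConfig(&cfg);                     // assumed helper: model path and common options
  if (use_mkldnn) {
    cfg.EnableMKLDNN();                // switch the kernels to the MKL-DNN path
  }
  std::vector<std::vector<PaddleTensor>> input_slots_all;
  SetInput(&input_slots_all);          // assumed helper: prepare test inputs
  CompareNativeAndAnalysis(
      reinterpret_cast<const PaddlePredictor::Config *>(&cfg), input_slots_all);
}

TEST(Analyzer_dam, compare) { compare(); }             // the variant that passes
TEST(Analyzer_dam, compare_mkldnn) { compare(true); }  // the variant that times out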

How did you observe from the logs that mkldnn hangs? Did you attach a debugger and observe where the execution is looping?

  1. All the logs look like the following. Since there is no error log, I guess mkldnn hangs. But I did not attach a debugger to observe where the execution is looping.
[22:59:02][Step 1/1] I1223 22:59:01.768762 59580 analysis_predictor.cc:345] == optimize end ==
[22:59:02][Step 1/1] I1223 22:59:01.771080 59580 tester_helper.h:181] Warm up run...
[22:59:02][Step 1/1] W1223 22:59:01.771121 59580 naive_executor.cc:43] The NaiveExecutor can not work properly if the cmake flag ON_INFER is not set.
[22:59:02][Step 1/1] W1223 22:59:01.771126 59580 naive_executor.cc:45] Unlike the training phase, all the scopes and variables will be reused to save the allocation overhead.
[22:59:02][Step 1/1] W1223 22:59:01.771132 59580 naive_executor.cc:48] Please re-compile the inference library by setting the cmake flag ON_INFER=ON if you are running Paddle Inference
[22:59:02][Step 1/1] 
[23:54:36][Step 1/1] 	138 - test_analyzer_dam (Timeout)
  2. Since some of our business jobs hang due to the OpenMP library (you can ask @jianhang-liu for more details), I guess that OpenMP may hang in our CI as well.

Do you run the tests in ctest's parallel mode? What ctest command line do you use to run the tests?

Yes, we run the tests in ctest's parallel mode. We use:

if [ ${TESTING_DEBUG_MODE:-OFF} == "ON" ] ; then
    ctest -V
else
    ctest --output-on-failure
fi

The parallel level is shown in an attached screenshot.


luotao1 commented Jan 4, 2019

@jczaja http://ci.paddlepaddle.org/viewLog.html?tab=buildLog&buildTypeId=Paddle_PrCiNight&buildId=45100&_focus=23067#_state=23743 test_analyzer_small_dam crashes again. Why does it crash more frequently than test_analyzer_mobilenet_depthwise_conv? test_analyzer_resnet50 also has compare_mkldnn, but it works normally.


jczaja commented Jan 4, 2019

@luotao1

  1. The 5117 seems to be a virtual machine with 16 sockets and 1 thread per socket (1 core per socket), so 16 logical threads. Could you please tell me how many physical cores/sockets were dedicated to that virtual machine? For example, if the 5117 is a virtual machine with 16 logical threads that underneath uses only 4 physical threads, then we would have thread oversubscription when running four MKL-DNN tasks at once. PaddlePaddle has the test timeout set to 600 s; when threads are oversubscribed, tasks running in parallel can take much longer to execute (as they fight for resources). Please check the settings of this virtual machine and let me know how many physical threads are in use for the 5117.
  2. What is the target number of threads to be used for each test? How do you set the OMP num threads so that it takes into account the number of threads on the 5117 divided by the number of ctest parallel tasks running?
  3. Could you please run the same build on this machine, but with ctest executing the tests sequentially (no parallel execution)?
  4. If the previous points give us no information, could you please disable the ctest timeout and run the tests, so we can see whether this is just very long execution or an actual hang?

@jianhang-liu

@luotao1 From the log you indicated, I found the following:
[19:19:29] : [Step 1/1] -- Do not have AVX2 intrinsics and disabled MKL-DNN
Is this the correct build option?

@jianhang-liu

@luotao1 Please ignore my comment above. I just found there are two builds in this log. The second one should be the valid one, with all features including MKL turned on.

@jianhang-liu

@luotao1 @jczaja One interesting point is that "dam" and "small dam" share the same test app, so they have exactly the same test cases, i.e. "dam" also has the compare_mkldnn test case. But according to the log, there is no issue running all the test cases of "dam" (including "compare_mkldnn", of course).
Line 22514: [19:50:36] : [Step 1/1] 154/514 Test #138: test_analyzer_dam ............................... Passed 71.89 sec

@jianhang-liu

Another hint is that "small dam" is tested immediately after "dam". This means the two models are tested concurrently due to ctest's parallel mode, which may make oversubscription of OMP cores even worse.
[19:49:24] : [Step 1/1] Start 138: test_analyzer_dam
...
[19:49:26] : [Step 1/1] Start 139: test_analyzer_small_dam


luotao1 commented Jan 7, 2019

But according to the log, there is no issue running all the test cases of "dam" (including "compare_mkldnn", of course).

@jianhang-liu test_analyzer_dam also has compare_mkldnn, see http://ci.paddlepaddle.org/viewLog.html?buildId=44600&tab=buildLog&buildTypeId=Paddle_PrCiNight&logTab=tree&filter=all&_focus=22935

When test_analyzer_dam runs successfully, there is no detailed log (listing each part of the UT) in the CI log.


luotao1 commented Jan 7, 2019

@jczaja

  1. The 5117 is a virtual machine with 16 sockets and 1 thread per socket (1 core per socket). The result of cat /proc/cpuinfo is shown in an attached screenshot. Is the number of physical threads 16?

  2. The default OpenMP num threads is 1 (a sketch of how this maps to OpenMP follows this list):

    // number of cpu math library (such as MKL, OpenBlas) threads for each
    // instance.
    int cpu_math_library_num_threads_{1};

  3. I could try running test_analyzer_small_dam, test_analyzer_dam, test_analyzer_resnet50 and test_analyzer_mobilenet_depthwise_conv in sequential order. Would changing to sequential order solve the issue?
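
For reference, a minimal sketch (not the PaddlePaddle source) of how a per-instance thread count like cpu_math_library_num_threads_ is typically applied to the OpenMP runtime used by MKL/MKL-DNN; the function name is an assumption:

#include <omp.h>

// Pin the math-library fan-out for one predictor instance; with the default
// of 1, MKL/MKL-DNN kernels should not spread across extra cores.
void SetMathLibraryThreads(int cpu_math_library_num_threads) {
  omp_set_num_threads(cpu_math_library_num_threads);
}

With ctest running several such tests in parallel, the effective thread demand is roughly this value times the number of concurrent test processes.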


jianhang-liu commented Jan 7, 2019

@luotao1 "I could try run test_analyzer_small_dam, test_analyzer_dam, test_analyzer_resnet50 and test_analyzer_mobilenet_depthwise_conv in sequential order". It somehow prove our suspect: the hang in "small dam" is possibly due to "oversubscription" of OMP cores since it's running concurrently with "dam". I think change the order of those tests could be a acceptable workaround for now. @jczaja What's your idea?


jczaja commented Jan 7, 2019

@luotao1 Yes, if the problem is thread oversubscription, then sequential execution of ctest should help.
I experimented with the MKL-DNN unit tests under ctest -j4 (parallel execution): there is thread oversubscription, the total execution time of all unit tests is 3 times longer, and individual test times are much longer. So this situation may be similar to what we observe on the 5117.
So there are two experiments that I'm interested in checking:

  1. Run ctest in sequential mode
  2. Temporarily remove the timeout for parallel execution of ctest, just to confirm whether this is slow execution or an actual hang.

Why did I write that sequential execution should help? Because I suspect that the virtual machine on the 5117 could be causing thread oversubscription. I'm not an expert on VMs, but as we can see from the logs (and @luotao1 confirmed), the VM has 16 threads. When setting up a virtual machine, it should be possible to dedicate some resources (cores, sockets and memory) to the VM, so it is possible, for example, to dedicate 4 hardware (physical) threads to run a VM with 16 threads. With such a setting we would have thread oversubscription. It is unlikely, since the overall performance of such a VM would be very poor, but possible. Hence:

  1. Please tell us how the VM is set up (what virtualization software is used?) and, if possible, check how many physical resources were dedicated to creating this VM of 16 sockets with 1 CPU and 1 thread each.
  2. How many VMs are running on the 5117? Another cause of thread oversubscription could be that more than one VM is sharing the 5117 and there are not enough threads to share among them.
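
As a small illustration of the oversubscription arithmetic discussed above (not from the original thread; the ctest job count is an assumed example), a check that compares the OpenMP fan-out with the logical CPUs visible inside the VM:

#include <cstdio>
#include <omp.h>
#include <thread>

int main() {
  int omp_threads = omp_get_max_threads();                    // threads OpenMP will use per parallel region
  unsigned hw_threads = std::thread::hardware_concurrency();  // logical CPUs visible to this process
  const int ctest_jobs = 4;                                   // assumed ctest parallel level
  std::printf("omp=%d hw=%u worst-case demand=%d\n", omp_threads, hw_threads,
              omp_threads * ctest_jobs);
  // If omp_threads * ctest_jobs exceeds hw_threads, concurrently running tests
  // fight for cores and can exceed the 600 s ctest timeout without a real hang.
  return 0;
}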


luotao1 commented Jan 7, 2019

@jczaja Thanks very much for such a detailed explanation!

Run ctest in sequential mode

#15196 makes the related ctest tests run in sequential mode.

Temporarily remove the timeout for parallel execution of ctest, just to confirm whether this is slow execution or an actual hang.

It's difficult to do this experiment since nightly stress testing runs on the develop branch.

Please tell us how the VM is set up (what virtualization software is used?)

I'm not an expert on VMs either. We will first observe the nightly stress tests for several days after #15196 is merged.


luotao1 commented Jan 11, 2019

@jczaja @jianhang-liu @yihuaxu
test_analyzer_small_dam times out at compare_mkldnn again. Does the mkldnn kernel of conv3d have a problem?
http://ci.paddlepaddle.org/viewLog.html?buildId=48242&tab=buildLog&buildTypeId=Paddle_PrCiNight&logTab=tree&filter=all&_focus=23274


luotao1 commented Jan 11, 2019

@jczaja The detailed machine configuration: http://ci.paddlepaddle.org/viewLog.html?buildId=48242&tab=buildLog&buildTypeId=Paddle_PrCiNight&logTab=tree&filter=all&state=65&_focus=75#_state=65,34

[22:56:29][Step 1/1] Architecture:          x86_64
[22:56:29][Step 1/1] CPU op-mode(s):        32-bit, 64-bit
[22:56:29][Step 1/1] Byte Order:            Little Endian
[22:56:29][Step 1/1] CPU(s):                16
[22:56:29][Step 1/1] On-line CPU(s) list:   0-15
[22:56:29][Step 1/1] Thread(s) per core:    1
[22:56:29][Step 1/1] Core(s) per socket:    1
[22:56:29][Step 1/1] Socket(s):             16
[22:56:29][Step 1/1] NUMA node(s):          1
[22:56:29][Step 1/1] Vendor ID:             GenuineIntel
[22:56:29][Step 1/1] CPU family:            6
[22:56:29][Step 1/1] Model:                 85
[22:56:29][Step 1/1] Model name:            Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz
[22:56:29][Step 1/1] Stepping:              4
[22:56:29][Step 1/1] CPU MHz:               2000.031
[22:56:29][Step 1/1] BogoMIPS:              4016.87
[22:56:29][Step 1/1] L1d cache:             32K
[22:56:29][Step 1/1] L1i cache:             32K
[22:56:29][Step 1/1] L2 cache:              4096K
[22:56:29][Step 1/1] NUMA node0 CPU(s):     0-15


jczaja commented Jan 11, 2019

@luotao1 When we look at the specification of the Xeon Gold 5117:
https://ark.intel.com/products/122460/Intel-Xeon-Gold-5117-Processor-19-25M-Cache-2-00-GHz-

It has 14 cores within a single socket, while the specification you sent (from the logs) shows 16 sockets with 1 core in each socket. So please tell me:

  1. Why are those specifications different?
  2. How was the virtualization done (what virtualization software is used)?
  3. How many virtual machines are running on the 5117?
  4. How many Docker containers run simultaneously on the 5117?


luotao1 commented Jan 15, 2019

@jczaja @jianhang-liu
Discussed with @tianshuo78520a:

  2. How was the virtualization done (what virtualization software is used)?

We use OpenStack.

  3. How many virtual machines are running on the 5117?

One or two virtual machines run on the 5117.

  4. How many Docker containers run simultaneously on the 5117?

Only one Docker container runs at a time.


jczaja commented Jan 15, 2019

@luotao1
The 5117 provides AVX-512; some other platforms I have seen only provide AVX2. My question is: do you have another platform in CI that has AVX-512? I understand the problem (timeout) manifests only on the 5117; is there any other AVX-512 platform in CI on which CI works fine?


luotao1 commented Jan 15, 2019

@jczaja #15335 disables conv3d mkldnn in dam to check whether conv3d mkldnn is the real cause.


luotao1 commented Jan 15, 2019

I understand the problem (timeout) manifests only on the 5117; is there any other AVX-512 platform in CI on which CI works fine?

I checked that all the failure logs are on the 5117 only. We also have an Intel Xeon Processor (Skylake) with AVX-512, and it is OK with respect to this issue.
http://ci.paddlepaddle.org/viewLog.html?tab=buildLog&buildTypeId=Paddle_PrCiNight&buildId=37730&_focus=75#_state=65

[21:01:39][Step 1/1] Architecture:          x86_64
[21:01:39][Step 1/1] CPU op-mode(s):        32-bit, 64-bit
[21:01:39][Step 1/1] Byte Order:            Little Endian
[21:01:39][Step 1/1] CPU(s):                26
[21:01:39][Step 1/1] On-line CPU(s) list:   0-25
[21:01:39][Step 1/1] Thread(s) per core:    1
[21:01:39][Step 1/1] Core(s) per socket:    1
[21:01:39][Step 1/1] Socket(s):             26
[21:01:39][Step 1/1] NUMA node(s):          1
[21:01:39][Step 1/1] Vendor ID:             GenuineIntel
[21:01:39][Step 1/1] CPU family:            6
[21:01:39][Step 1/1] Model:                 85
[21:01:39][Step 1/1] Model name:            Intel Xeon Processor (Skylake)
[21:01:39][Step 1/1] Stepping:              4
[21:01:39][Step 1/1] CPU MHz:               1999.993
[21:01:39][Step 1/1] BogoMIPS:              3999.98
[21:01:39][Step 1/1] Hypervisor vendor:     KVM
[21:01:39][Step 1/1] Virtualization type:   full
[21:01:39][Step 1/1] L1d cache:             32K
[21:01:39][Step 1/1] L1i cache:             32K
[21:01:39][Step 1/1] L2 cache:              4096K
[21:01:39][Step 1/1] L3 cache:              16384K
[21:01:39][Step 1/1] NUMA node0 CPU(s):     0-25


jczaja commented Jan 15, 2019

@luotao1 The log you mentioned shows that test_analyzer_small_dam also fails on the other SKX; it is just not a timeout but a computational problem (a diff in results). Perhaps it is a different outcome of the same problem.


luotao1 commented Jan 15, 2019

Yes, it is the MKL diff problem, the same as #15116 (comment).


jczaja commented Jan 18, 2019

@luotao1 Since a couple of people are looking at this issue, I'm sharing my current status. Hopefully it will be helpful.

I reproduced the problem by running test_analyzer_small_dam in a loop (up to a hundred times). We get either a crash (segfault) or a hang (timeout when running under ctest). CI on the 5117 builds Paddle WITHOUT ON_INFER=ON, and in that situation test_analyzer_small_dam can hang. If ON_INFER=ON is specified, then test_analyzer_small_dam will randomly result in a segmentation fault.

Hang:
Execution hangs on a mutex when the Var is to be fetched here:

auto* trans_var = new_scope->Var(var_name);

Crash:
Access to new_scope::vars_ causes a crash, as vars_ contains invalid data.

Workaround:
When we disable the caching of scopes in TryCreateTransferScope, no hang or crash is observed, e.g.

new_scope = TryCreateTransferScope(kernel_type_for_var,

So currently we can see that there is a problem with the transfer scope, which randomly manifests as a hang or a crash depending on whether ON_INFER is set.
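
To make the failure mode concrete, here is a simplified model (not the actual Paddle code) of why handing a cached, shared scope to concurrently running tests can show up as either a hang or a crash:

#include <string>
#include <unordered_map>

struct Scope {                                        // stand-in for framework::Scope
  std::unordered_map<std::string, int> vars_;
  int* Var(const std::string& name) { return &vars_[name]; }  // mutates vars_
};

std::unordered_map<size_t, Scope*> g_transfer_cache;  // shared across runs, no lock

Scope* GetCachedTransferScope(size_t key) {
  auto it = g_transfer_cache.find(key);
  if (it != g_transfer_cache.end()) return it->second;  // same Scope handed out twice
  Scope* fresh = new Scope();
  g_transfer_cache[key] = fresh;                         // unsynchronized insert: data race
  return fresh;
}

// If two runs obtain the same cached Scope and call Var() concurrently, vars_
// is mutated from two threads at once: with internal locking that looks like a
// hang on the mutex, without it like corrupted data and a segfault.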

@jianhang-liu

@jczaja Great! Actually, this afternoon we also found that building withOUT ON_INFER=ON is the key to reproducing it. We can now easily reproduce both types of CI error (timeout, and compare failure due to a diff) on our local 6151 server. For the timeout (hang) issue, we located the error at almost the same place as you (but not in as much detail). We wonder whether it is caused by the SCOPE_XXX_LOCK definition.

@jianhang-liu

@luotao1 @jczaja We confirmed the same root cause as Jacek, i.e. the current code in TryCreateTransferScope (which uses a thread_local cache to avoid creating transfer scopes) can fail randomly, which causes this CI failure (hang as timeout, segfault as crash). By simply commenting it out (i.e. not using the cache and always creating "new_scope"), we no longer run into any error. We tried enabling/disabling this cache several times and verified that no issue occurs when the cache is disabled.
@luotao1 Could you please look into a cleaner fix on the framework side? Thanks.


luotao1 commented Jan 21, 2019

By simply commenting it out (i.e. not using the cache and always creating "new_scope")

@jianhang-liu Could you create a simple PR showing which lines are commented out?

@Superjomn Could you help look at the cache in TryCreateTransferScope?


luotao1 commented Jan 21, 2019

Scope* TryCreateTransferScope(OpKernelType type0, OpKernelType type1,
                              const Scope* scope) {
  Scope* new_scope{nullptr};
  size_t infer_cache_key =
      CombineHash(OpKernelType::Hash()(type0), OpKernelType::Hash()(type1));
  infer_cache_key =
      CombineHash(infer_cache_key, std::hash<const Scope*>()(scope));
  // Reuse a cached transfer scope for this (kernel types, parent scope) key,
  // or create and cache a new child scope.
  auto it = global_transfer_data_cache().find(infer_cache_key);
  if (it != global_transfer_data_cache().end()) {
    new_scope = global_transfer_data_cache()[infer_cache_key];
  } else {
    new_scope = &scope->NewScope();
    global_transfer_data_cache()[infer_cache_key] = new_scope;
  }
  global_transfer_scope_cache().insert(new_scope);
  return new_scope;
}
@Superjomn Is line 45 unused?
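
If the unused value in question is the iterator returned by find (an assumption; "line 45" refers to the original source file, which is not shown here), the second map lookup could reuse it, e.g.:

  auto it = global_transfer_data_cache().find(infer_cache_key);
  if (it != global_transfer_data_cache().end()) {
    new_scope = it->second;  // reuse the iterator instead of indexing the map again
  } else {
    new_scope = &scope->NewScope();
    global_transfer_data_cache()[infer_cache_key] = new_scope;
  }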


luotao1 commented Jan 21, 2019

#15032 (comment) is hot-fixed in #15450


luotao1 commented Jul 10, 2019

@LeoZhao-Habana

Do we have any conclusion on why it hangs with TryCreateTransferScope()? Are there any race conditions or a multi-instance case?


luotao1 commented Jul 10, 2019

English Description

Having discussed with @LeoZhao-Intel and @jianhang-liu, we reached some common views:

  1. TryCreateTransferScope() was added to fix a GPU inference memory-leak problem and is not required for CPU inference. See the note by @Superjomn:
    // batches, so the `new_scope` here will result in GPU memroy explosion
    // over the running of operators.
  2. Removing TryCreateTransferScope() fixes the hang problem; see Disable cache for scope creation to workaround CI failure #15450.
  3. However, removing TryCreateTransferScope() causes a CPU inference memory-leak problem, because we do not release the transfer_scope in OperatorWithKernel::RunImpl:
    auto* transfer_scope =
    PrepareData(scope, *kernel_type_, &transfered_inplace_vars, runtime_ctx);

TODO:

  1. Remove TryCreateTransferScope() and release the transfer_scope in OperatorWithKernel::RunImpl when using NaiveExecutor for CPU inference (a sketch follows below). @LeoZhao-Intel
  2. Test the memory-leak problem in the detection model: [DO NOT MERGE] detect model test2 for dynamic shape #18372 @LeoZhao-Intel @luotao1
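
A minimal sketch of the direction in TODO item 1 (hypothetical, not the final change): release the transfer scope right after the kernel runs when it is not cached, assuming the parent scope exposes a DeleteScope-style API; enable_cache_transfer_scope_ is an assumed flag.

auto* transfer_scope =
    PrepareData(scope, *kernel_type_, &transfered_inplace_vars, runtime_ctx);
const Scope& exec_scope = transfer_scope != nullptr ? *transfer_scope : scope;

// ... run the kernel against exec_scope ...

// Hypothetical CPU-inference cleanup: when the transfer scope is not cached,
// free it here so memory does not grow batch by batch under NaiveExecutor.
if (transfer_scope != nullptr && !enable_cache_transfer_scope_) {
  scope.DeleteScope(transfer_scope);
}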

Chinese Description

A brief explanation of the cause:

  1. To fix the GPU inference memory leak, @春伟 added the TryCreateTransferScope function and stopped releasing the transfer_scope.
  2. The TryCreateTransferScope function is not needed for CPU inference; after removing it, the random timeout problem no longer occurs (verified).
  3. At present, removing the TryCreateTransferScope function makes CPU inference memory grow, because the transfer_scope is not released.

TODO (@intel @luotao):

  1. During CPU inference, do not use the TryCreateTransferScope function, and release the transfer_scope.
  2. Verify the memory-leak problem, the random timeout problem, and inference performance for CPU inference.
