
[dev] Memory leak with MKL under multi-threading #22827

Closed
jiweibo opened this issue Mar 3, 2020 · 14 comments · Fixed by #23557
Labels
developing inference Inference development Intel
Milestone

Comments

@jiweibo
Contributor

jiweibo commented Mar 3, 2020

There is a memory leak when running with MKL under multiple threads.

Test model (contains only a single mul op):
[image]

Memory usage curve after running repeatedly under multiple threads for 1 hour:
[image]

We tried switching mklml to the 2019.3 version and the leak persisted; switching to OpenBLAS shows no problem.

Intel needs to follow up on this.

Reproduction files:

How to reproduce:
unzip mul.zip
cd mul
Build, following https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/inference_deployment/inference/native_infer.html#a-name-c-c-a
sh run_impl.sh fluid_inference_install_dir mul_demo
nohup ./build/mul_demo --model_dir=mul_model --thread_num=3 --num=-1 &
Monitor memory usage:
nohup sh mem_use.sh &
Watch in real time with top:
top

After running for 1 hour, kill the process.
Check the memory changes in memlog.txt.

CPU models: E5-2620, E5-2650

mul.zip
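The mem_use.sh script referenced above ships inside mul.zip and is not shown here. A minimal sketch of what such a monitor might look like (hypothetical reconstruction; the real script may differ) samples the resident set size (RSS) of the demo process and appends it to memlog.txt, so the memory curve can be plotted afterwards:

```shell
#!/bin/sh
# Hypothetical sketch of mem_use.sh: append "timestamp rss_kb" lines for a
# process to a log file. A steadily growing RSS over the 1-hour run is what
# the memory curves above show.

monitor() {  # monitor <pid> <logfile> [interval_s] [samples; 0 = until exit]
    pid=$1 log=$2 interval=${3:-5} samples=${4:-0} n=0
    # ps exits nonzero once the process is gone, which ends the loop
    while rss=$(ps -o rss= -p "$pid" 2>/dev/null) && [ -n "$rss" ]; do
        echo "$(date +%s) $rss" >> "$log"
        n=$((n + 1))
        [ "$samples" -gt 0 ] && [ "$n" -ge "$samples" ] && break
        sleep "$interval"
    done
}

# Usage in the repro (hypothetical):
#   monitor "$(pgrep -f mul_demo | head -n1)" memlog.txt
```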

@jiweibo jiweibo created this issue from a note in Inference-dev (To do) Mar 3, 2020
@jiweibo jiweibo added developing inference Inference development labels Mar 3, 2020
@jiweibo jiweibo added this to the v2.0 milestone Mar 3, 2020
@jiweibo jiweibo moved this from To do to In progress in Inference-dev Mar 3, 2020
@jiweibo jiweibo moved this from In progress to To do in Inference-dev Mar 3, 2020
@jiweibo jiweibo changed the title from "Suspected memory leak in the inference library under multi-threading" to "[dev] Suspected memory leak in the inference library under multi-threading" Mar 3, 2020
@jiweibo jiweibo moved this from To do to In progress in Inference-dev Mar 3, 2020
@jiweibo jiweibo changed the title from "[dev] Suspected memory leak in the inference library under multi-threading" to "[dev] Memory leak with MKL under multi-threading" Mar 9, 2020
@luotao1 luotao1 added the Intel label Mar 9, 2020
@GaoWei8
Contributor

GaoWei8 commented Mar 10, 2020

@lidanqing-intel Could you help look into this? We verified that this problem happens on E5-2620 and E5-2650 on the develop branch.

@lidanqing-intel
Contributor

@lidanqing-intel Could you help look into this? We verified that this problem happens on E5-2620 and E5-2650 on the develop branch.

We are looking at it.

@wojtuss

wojtuss commented Mar 10, 2020

@jiweibo , @GaoWei8
I have followed the instructions but could not reproduce the issue on CLX 6248.
The mul_demo app did not use MKL-DNN, and the memory usage did not change during a test of more than 1 hour. If you tested it with MKL-DNN, then it would suggest MKL-DNN primitive caching as the culprit.

@GaoWei8
Contributor

GaoWei8 commented Mar 11, 2020

@jiweibo , @GaoWei8
I have followed the instructions but could not reproduce the issue on CLX 6248.
The mul_demo app did not use MKL-DNN, and the memory usage did not change during a test of more than 1 hour. If you tested it with MKL-DNN, then it would suggest MKL-DNN primitive caching as the culprit.

@wojtuss We verified this issue on both E5-2620 and E5-2650, but the memory usage did not change on CLX 6148. Could you reproduce this issue on an E5-2650?

@jiweibo
Contributor Author

jiweibo commented Mar 11, 2020

@jiweibo , @GaoWei8
I have followed the instructions but could not reproduce the issue on CLX 6248.
The mul_demo app did not use MKL-DNN, and the memory usage did not change during a test of more than 1 hour. If you tested it with MKL-DNN, then it would suggest MKL-DNN primitive caching as the culprit.

MKL-DNN is not enabled in the reproduction environment; we only used mklml. Could you reproduce this issue on an E5-2650 or E5-2620?

@yinghu5

yinghu5 commented Mar 12, 2020

@jiweibo,
I also tried to reproduce the issue on my lab CLX machines, but could not reproduce the problem either. It looks like it is related to the machine or environment.

Could you please show your lscpu output, export MKLDNN_VERBOSE=1 and print the log as below, and share OS and environment information?

Best Regards,
Ying

yhu5@clx01:~/baidu_ml/mul$ ./build/mul_demo --model_dir=mul_model --thread_num=3 --num=-1
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0312 19:57:07.908839 2618 analysis_predictor.cc:84] Profiler is deactivated, and no profiling report will be generated.
I0312 19:57:07.916256 2618 analysis_predictor.cc:833] MODEL VERSION: 1.7.0
I0312 19:57:07.916270 2618 analysis_predictor.cc:835] PREDICTOR VERSION: 1.7.0
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [attention_lstm_fuse_pass]
--- Running IR pass [seqconv_eltadd_relu_fuse_pass]
--- Running IR pass [seqpool_cvm_concat_fuse_pass]
--- Running IR pass [fc_lstm_fuse_pass]
--- Running IR pass [mul_lstm_fuse_pass]
--- Running IR pass [fc_gru_fuse_pass]
--- Running IR pass [mul_gru_fuse_pass]
--- Running IR pass [seq_concat_fc_fuse_pass]
--- Running IR pass [fc_fuse_pass]
--- Running IR pass [repeated_fc_relu_fuse_pass]
--- Running IR pass [squared_mat_sub_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [conv_transpose_bn_fuse_pass]
--- Running IR pass [conv_transpose_eltwiseadd_bn_fuse_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
I0312 19:57:07.917754 2618 analysis_predictor.cc:462] ======= optimize end =======
I0312 19:57:07.917768 2618 naive_executor.cc:105] --- skip [feed], feed -> dataY
I0312 19:57:07.917774 2618 naive_executor.cc:105] --- skip [feed], feed -> dataX
I0312 19:57:07.917805 2618 naive_executor.cc:105] --- skip [mul_0.tmp_0], fetch -> fetch
I0312 19:57:07.917824 2618 analysis_predictor.cc:84] Profiler is deactivated, and no profiling report will be generated.
I0312 19:57:07.917832 2618 naive_executor.cc:105] --- skip [feed], feed -> dataY
I0312 19:57:07.917836 2618 naive_executor.cc:105] --- skip [feed], feed -> dataX
I0312 19:57:07.917843 2618 naive_executor.cc:105] --- skip [mul_0.tmp_0], fetch -> fetch
I0312 19:57:07.917856 2618 analysis_predictor.cc:84] Profiler is deactivated, and no profiling report will be generated.
I0312 19:57:07.917862 2618 naive_executor.cc:105] --- skip [feed], feed -> dataY
I0312 19:57:07.917866 2618 naive_executor.cc:105] --- skip [feed], feed -> dataX
I0312 19:57:07.917876 2618 naive_executor.cc:105] --- skip [mul_0.tmp_0], fetch -> fetch
I0312 19:57:07.917888 2618 analysis_predictor.cc:84] Profiler is deactivated, and no profiling report will be generated.
I0312 19:57:07.917896 2618 naive_executor.cc:105] --- skip [feed], feed -> dataY
I0312 19:57:07.917899 2618 naive_executor.cc:105] --- skip [feed], feed -> dataX
I0312 19:57:07.917906 2618 naive_executor.cc:105] --- skip [mul_0.tmp_0], fetch -> fetch
MKL_VERBOSE Intel(R) MKL 2019.0 Update 2 Product build 20190118 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.30GHz lp64 intel_thread
MKL_VERBOSE SGEMM(N,N,768,168,200,0x7fa82f193e98,0x7fa800c62040,768,0x7fa800bff040,200,0x7fa82f193ea0,0x7fa800f25040,768) 29.22s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE SGEMM(N,N,768,168,200,0x7fa82f994e98,0x7fa800d90040,768,0x7fa800c41040,200,0x7fa82f994ea0,0x7fa800ea6040,768) 101.27ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0312 19:57:08.101979 2660 mul_demo.cc:90] thread_id: 2 batch: 168 predict cost: 179.352ms
I0312 19:57:08.101981 2659 mul_demo.cc:90] thread_id: 1 batch: 168 predict cost: 179.356ms
MKL_VERBOSE SGEMM(N,N,768,168,200,0x7fa830195e98,0x7fa800cf9040,768,0x7fa800c20040,200,0x7fa830195ea0,0x7fa800e27040,768) 2.99s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
I0312 19:57:08.102481 2658 mul_demo.cc:90] thread_id: 0 batch: 168 predict cost: 179.873ms
MKL_VERBOSE SGEMM(N,N,768,168,200,0x7fa82f193e98,0x7fa800cf9040,768,0x7fa800c20040,200,0x7fa82f193ea0,0x7fa800e27040,768) 941.21us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
I0312 19:57:08.107509 2790 mul_demo.cc:90] thread_id: 0 batch: 168 predict cost: 1.105ms
MKL_VERBOSE SGEMM(N,N,768,168,200,0x7fa830195e98,0x7fa800c62040,768,0x7fa800bff040,200,0x7fa830195ea0,0x7fa800f25040,768) 1.01ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
I0312 19:57:08.107705 2792 mul_demo.cc:90] thread_id: 2 batch: 168 predict cost: 1.277ms
MKL_VERBOSE SGEMM(N,N,768,168,200,0x7fa82f994e98,0x7fa800d90040,768,0x7fa800c41040,200,0x7fa82f994ea0,0x7fa800ea6040,768) 973.37us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
I0312 19:57:08.107868 2791 mul_demo.cc:90] thread_id: 1 batch: 168 predict cost: 1.435ms

@lidanqing-intel
Contributor

lidanqing-intel commented Mar 12, 2020

@yinghu5 MKLDNN is OFF and the machine is an AVX2 machine; could you please try to reproduce the issue on an AVX2 machine?

@jiweibo
Contributor Author

jiweibo commented Mar 12, 2020

@jiweibo,
I also tried to reproduce the issue on my lab CLX machines, but could not reproduce the problem either. It looks like it is related to the machine or environment.

Could you please show your lscpu output, export MKLDNN_VERBOSE=1 and print the log as below, and share OS and environment information?

Best Regards,
Ying


lscpu:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    1
Core(s) per socket:    12
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               2200.000
CPU max MHz:           2200.0000
CPU min MHz:           1200.0000
BogoMIPS:              4404.85
Virtualization:        VT-x
Hypervisor vendor:     vertical
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-23
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc


MKL-DNN is not enabled when compiling Paddle. I exported MKLDNN_VERBOSE=1, and the log is printed below.

./build/mul_demo --model_dir=mul_model/ --thread_num=3 --num=3

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0312 13:12:54.974669 20557 analysis_predictor.cc:135] Profiler is deactivated, and no profiling report will be generated.
I0312 13:12:54.983279 20557 analysis_predictor.cc:851] MODEL VERSION: 1.7.0
I0312 13:12:54.983321 20557 analysis_predictor.cc:853] PREDICTOR VERSION: 0.0.0
W0312 13:12:54.983352 20557 analysis_predictor.cc:866]  - Version incompatible (1) feed
W0312 13:12:54.983366 20557 analysis_predictor.cc:866]  - Version incompatible (1) fetch
W0312 13:12:54.983386 20557 analysis_predictor.cc:866]  - Version incompatible (2) mul
W0312 13:12:54.983395 20557 analysis_predictor.cc:191] WARNING: Results may be DIFF! Please use the corresponding version of the model and prediction library, and do not use the develop branch.
--- Running analysis [ir_graph_build_pass]
--- Running analysis [ir_graph_clean_pass]
--- Running analysis [ir_analysis_pass]
--- Running IR pass [simplify_with_basic_ops_pass]
--- Running IR pass [attention_lstm_fuse_pass]
--- Running IR pass [seqconv_eltadd_relu_fuse_pass]
--- Running IR pass [seqpool_cvm_concat_fuse_pass]
--- Running IR pass [fc_lstm_fuse_pass]
--- Running IR pass [mul_lstm_fuse_pass]
--- Running IR pass [fc_gru_fuse_pass]
--- Running IR pass [mul_gru_fuse_pass]
--- Running IR pass [seq_concat_fc_fuse_pass]
--- Running IR pass [fc_fuse_pass]
--- Running IR pass [repeated_fc_relu_fuse_pass]
--- Running IR pass [squared_mat_sub_fuse_pass]
--- Running IR pass [conv_bn_fuse_pass]
--- Running IR pass [conv_eltwiseadd_bn_fuse_pass]
--- Running IR pass [conv_transpose_bn_fuse_pass]
--- Running IR pass [conv_transpose_eltwiseadd_bn_fuse_pass]
--- Running IR pass [is_test_pass]
--- Running IR pass [runtime_context_cache_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [ir_graph_to_program_pass]
I0312 13:12:54.986470 20557 analysis_predictor.cc:480] ======= optimize end =======
I0312 13:12:54.986506 20557 naive_executor.cc:95] ---  skip [feed], feed -> dataY
I0312 13:12:54.986519 20557 naive_executor.cc:95] ---  skip [feed], feed -> dataX
I0312 13:12:54.986586 20557 naive_executor.cc:95] ---  skip [mul_0.tmp_0], fetch -> fetch
I0312 13:12:54.986632 20557 analysis_predictor.cc:135] Profiler is deactivated, and no profiling report will be generated.
I0312 13:12:54.986654 20557 naive_executor.cc:95] ---  skip [feed], feed -> dataY
I0312 13:12:54.986665 20557 naive_executor.cc:95] ---  skip [feed], feed -> dataX
I0312 13:12:54.986687 20557 naive_executor.cc:95] ---  skip [mul_0.tmp_0], fetch -> fetch
I0312 13:12:54.986721 20557 analysis_predictor.cc:135] Profiler is deactivated, and no profiling report will be generated.
I0312 13:12:54.986737 20557 naive_executor.cc:95] ---  skip [feed], feed -> dataY
I0312 13:12:54.986747 20557 naive_executor.cc:95] ---  skip [feed], feed -> dataX
I0312 13:12:54.986768 20557 naive_executor.cc:95] ---  skip [mul_0.tmp_0], fetch -> fetch
I0312 13:12:54.986793 20557 analysis_predictor.cc:135] Profiler is deactivated, and no profiling report will be generated.
I0312 13:12:54.986809 20557 naive_executor.cc:95] ---  skip [feed], feed -> dataY
I0312 13:12:54.986820 20557 naive_executor.cc:95] ---  skip [feed], feed -> dataX
I0312 13:12:54.986840 20557 naive_executor.cc:95] ---  skip [mul_0.tmp_0], fetch -> fetch
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0312 13:12:55.105252 20558 mul_demo.cc:90] thread_id: 0 batch: 168 predict cost: 115.849ms
I0312 13:12:55.105247 20560 mul_demo.cc:90] thread_id: 2 batch: 168 predict cost: 115.848ms
I0312 13:12:55.105247 20559 mul_demo.cc:90] thread_id: 1 batch: 168 predict cost: 115.865ms
I0312 13:12:55.108531 20562 mul_demo.cc:90] thread_id: 1 batch: 168 predict cost: 1.057ms
I0312 13:12:55.108608 20561 mul_demo.cc:90] thread_id: 0 batch: 168 predict cost: 1.051ms
I0312 13:12:55.108676 20563 mul_demo.cc:90] thread_id: 2 batch: 168 predict cost: 1.119ms
I0312 13:12:55.110903 20565 mul_demo.cc:90] thread_id: 1 batch: 168 predict cost: 0.94ms
I0312 13:12:55.110906 20566 mul_demo.cc:90] thread_id: 2 batch: 168 predict cost: 0.949ms
I0312 13:12:55.113029 20564 mul_demo.cc:90] thread_id: 0 batch: 168 predict cost: 0.982ms

I have reproduced the memory leak in Docker images of Ubuntu 16.04 and CentOS 6.10.

If you need any other environment information, please feel free to leave a message. Thank you very much. @yinghu5

@xw-github

xw-github commented Mar 16, 2020

Hi @jiweibo,
I have reproduced the memory leak in a Docker image of Ubuntu 16.04.

1. multi-thread
./build/mul_demo --model_dir=mul_model/ --thread_num=3 --num=-1
memlog.txt.multhread.zip

2. single-thread
./build/mul_demo --model_dir=mul_model/ --thread_num=1 --num=-1
memlog.txt.singlethread.zip

lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 1
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2200.000
CPU max MHz: 2200.0000
CPU min MHz: 1200.0000
BogoMIPS: 4404.83
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-23
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc

@yinghu5

yinghu5 commented Mar 16, 2020

@jiweibo @xw-github Thank you for reproducing the issue. I can reproduce it too and have confirmed there is a memory leak. From a first round of debugging, the problem seems to be in libiomp5.so, which allocates memory but never releases it. We are checking the details in our local lab.
If possible, could you please try several libiomp5.so versions, e.g. the latest one from MKL 2020 or an older one from MKL 2019, and see whether the problem persists?

Thanks
Ying

#12 scalable_aligned_malloc (size=0, alignment=1048576) at ../../src/tbbmalloc/frontend.cpp:3058
#13 0x00007fffecd948d9 in ___kmp_allocate (size=0) at ../../src/kmp_alloc.cpp:1837
#14 0x00007fffecdf78f6 in __kmp_register_root (initial_thread=0) at ../../src/kmp_runtime.cpp:4025
#15 0x00007fffecdf6d84 in __kmp_get_global_thread_id_reg () at ../../src/kmp_runtime.cpp:286
#16 0x00007fffecde3d1a in __kmp_api_omp_set_num_threads (arg=0) at ../../src/kmp_ftn_entry.h:345
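One way to try Ying's suggestion of testing other libiomp5.so builds without relinking Paddle is LD_PRELOAD; a sketch (the MKL install path in the example is hypothetical, so substitute the libiomp5.so you actually want to test):

```shell
#!/bin/sh
# Sketch: run the repro against a different libiomp5 build via LD_PRELOAD.
run_with_iomp() {  # run_with_iomp <libiomp5.so> <command...>
    iomp=$1
    shift
    # Refuse to run if the requested runtime is missing
    [ -f "$iomp" ] || { echo "libiomp5 not found: $iomp" >&2; return 1; }
    LD_PRELOAD=$iomp "$@"
}

# Example (hypothetical path):
# run_with_iomp /opt/intel/lib/intel64/libiomp5.so \
#     ./build/mul_demo --model_dir=mul_model --thread_num=3 --num=-1
```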

@yinghu5

yinghu5 commented Mar 17, 2020

@jiweibo @GaoWei8 @lidanqing-intel

Is it possible to build Paddle without libiomp5? For example, link libmklml_gnu.so instead of libmklml_intel.so (and, when building mkldnn, perhaps remove the -liomp5 option).

I did some investigation and found that the Paddle library links a mix of OpenMP runtimes, libiomp5.so and libgomp. Mixing threading runtimes like this can cause unexpected behavior.

On the other hand, most libiomp5 versions already integrate scalable_aligned_malloc, so we cannot get rid of it just by changing the libiomp5 version.

So the one approach I can suggest is to remove libiomp5.so and keep only the default GNU OpenMP (libgomp).

(base) [yhu5@hsw-ep01 build]$ ldd ~/baidu_mklml/paddle/paddle/lib/libpaddle_fluid.so
linux-vdso.so.1 => (0x00007ffc680e5000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f8cd9542000)
libiomp5.so => /home/yhu5/baidu_mklml/paddle_github/paddle/build_2/third_party/install/mklml/lib/libiomp5.so (0x00007f8cd9165000)
libdnnl.so.1 => /home/yhu5/baidu_mklml/paddle_github/paddle/build_2/third_party/install/mkldnn/lib64/libdnnl.so.1 (0x00007f8cd7818000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f8cd7511000)
libm.so.6 => /lib64/libm.so.6 (0x00007f8cd720e000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f8cd6ff8000)
libc.so.6 => /lib64/libc.so.6 (0x00007f8cd6c2b000)
/lib64/ld-linux-x86-64.so.2 (0x00007f8cdc4b4000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f8cd6a26000)
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f8cd6800000)
Best Regards,
Ying

@lidanqing-intel
Contributor

Hi @GaoWei8, I asked @yinghu5. The suggestion is to change the related cmake files: remove the Intel libiomp5 library and keep GNU libgomp. This should not degrade performance elsewhere, because libgomp is still there; but to make sure, some benchmarks are necessary.
I think the answers in oneapi-src/oneDNN#230 may be useful as a reference.
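A rough sketch of what such a cmake-side change could look like. This is hypothetical: the option name WITH_GNU_OPENMP is invented for illustration, and Paddle's actual third_party cmake layout may differ; only MKLML_LIB_DIR and the SET(MKLML_IOMP_LIB ...) line are taken from this thread.

```cmake
# Hypothetical sketch: choose the GNU-OpenMP flavor of mklml instead of the
# Intel one, so that libgomp (pulled in by -fopenmp) is the only OpenMP runtime.
if(WITH_GNU_OPENMP)  # hypothetical option name
  set(MKLML_LIB ${MKLML_LIB_DIR}/libmklml_gnu.so)
  set(MKLML_IOMP_LIB "")  # do not link libiomp5.so at all
else()
  set(MKLML_LIB ${MKLML_LIB_DIR}/libmklml_intel.so)
  set(MKLML_IOMP_LIB ${MKLML_LIB_DIR}/libiomp5.so)
endif()
```

Either way, benchmarking both variants (as suggested above) would confirm that dropping libiomp5 does not cost performance.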

@yinghu5

yinghu5 commented Mar 24, 2020

@GaoWei8 Thank you a lot for the test. I notice that the linked library, e.g.
SET(MKLML_IOMP_LIB ${MKLML_LIB_DIR}/libiomp5.so), still seems to be in the link command.

Could you please check the produced libpaddle_fluid.so:
$ ldd xxxx/paddle/lib/libpaddle_fluid.so
and see whether libiomp5 and libgomp are still both there?

Thanks
Ying
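The check Ying asks for can be scripted; a small sketch (the install path in the usage example is a placeholder for your actual build output):

```shell
#!/bin/sh
# List which OpenMP runtimes a shared library pulls in. Printing both
# libiomp5.so and libgomp.so.1 means the mixed-runtime link is still present;
# after the fix only one OpenMP runtime should remain.
omp_runtimes() {
    ldd "$1" 2>/dev/null | awk '/libiomp5|libgomp/ {print $1}'
}

# Example (substitute the real install dir):
# omp_runtimes fluid_inference_install_dir/paddle/lib/libpaddle_fluid.so
```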

@jiweibo
Contributor Author

jiweibo commented Apr 9, 2020

The memory leak problem has been located and resolved. Thanks to everyone for their contributions.

@jiweibo jiweibo moved this from In progress to Done in Inference-dev Apr 9, 2020