Upgrade MKL-DNN commit #15116

Closed
wants to merge 8 commits into from

Contributor

hshen14 commented Dec 29, 2018

The commit includes the optimizations for:

  • INT8 depthwise convolution on CLX
  • INT8 1x1 convolution on CLX
  • FP32/INT8 gemm convolution (group convolution)

test=develop

Contributor Author

hshen14 commented Dec 29, 2018

http://ci.paddlepaddle.org/downloadBuildLog.html?buildId=43515&plain=true

[03:43:23]W: [Step 1/1] /paddle/build/third_party/ngraph/src/extern_ngraph/src/ngraph/runtime/cpu/mkldnn_emitter.cpp:810:51: error: expected type-specifier
[03:43:23]W: [Step 1/1] size_t primitive_index = insert_primitive(new mkldnn::relu_forward(

@baojun-nervana We want to upgrade the MKL-DNN commit, which requires the relu primitive to be replaced by the elementwise primitive. Could you please update the code accordingly and include this PR? Thanks.

@luotao1

Contributor

luotao1 commented Jan 7, 2019

@hshen14 PR #15175 is merged, and the latest error is

[04:31:45][Step 1/1] /paddle/build/third_party/install/mkldnn/lib/libmkldnn.so: undefined reference to `cblas_sgemm_pack_get_size'

http://ci.paddlepaddle.org/viewLog.html?buildId=45858&tab=buildLog&buildTypeId=Paddle_PrCi&logTab=tree&filter=all&_focus=11035

Contributor Author

hshen14 commented Jan 7, 2019

@luotao1 Will try to reproduce locally first.

Contributor Author

hshen14 commented Jan 7, 2019

Confirmed with the MKL-DNN team: the new MKL-DNN commit breaks compilation because it uses the new packed BLAS API, which is not available in earlier versions of Intel MKL. So we have to upgrade MKLML to:

https://github.com/intel/mkl-dnn/releases/download/v0.17.2/mklml_lnx_2019.0.1.20181227.tgz

Is it possible to upgrade MKLML first? @luotao1
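
A quick way to confirm whether a given MKLML build already provides the packed GEMM entry point named in the linker error is to load the library and look the symbol up. The following is a minimal sketch only: `exports_symbol` is a hypothetical helper, and the library path is an assumed location that has to be adjusted to the local build tree.

```python
import ctypes
from typing import Optional


def exports_symbol(library_path: Optional[str], symbol: str) -> bool:
    """Return True if the shared library at `library_path` exports `symbol`.

    Passing None inspects the symbols already loaded into the current
    process (POSIX only), which is convenient for a quick self-check.
    """
    lib = ctypes.CDLL(library_path)
    # ctypes resolves symbols lazily; a missing symbol raises AttributeError,
    # which hasattr() turns into False.
    return hasattr(lib, symbol)


if __name__ == "__main__":
    # Assumed path for illustration; not taken from the build log.
    path = "/paddle/build/third_party/install/mklml/lib/libmklml_intel.so"
    try:
        print(exports_symbol(path, "cblas_sgemm_pack_get_size"))
    except OSError as err:
        print(f"could not load {path}: {err}")
```

If the lookup returns False for the installed MKLML, the undefined reference from libmkldnn.so is expected until MKLML is upgraded.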

Contributor

luotao1 commented Jan 7, 2019

@hshen14 I will upgrade MKLML first.

@luotao1 luotao1 mentioned this pull request Jan 7, 2019
Contributor

luotao1 commented Jan 8, 2019

@hshen14 The MKLML update in #15190 is merged; you can now update the MKL-DNN commit id.

Contributor Author

hshen14 commented Jan 9, 2019

@luotao1

PR_Windows_CI fail:
http://ci.paddlepaddle.org/viewLog.html?buildId=47144&buildTypeId=PaddleWindows_PrWindowsCi&tab=buildLog

[01:05:12] : [Step 2/5] C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\Microsoft.CppCommon.targets(171,5): error MSB6006: "cmd.exe" exited with code 1. [D:\home\BuildAgent\work\a9b0372f0aea0a80\build\python\paddle_python.vcxproj]

Another GPU failure?
http://ci.paddlepaddle.org/viewLog.html?buildId=47138&buildTypeId=Paddle_PrCiPython35&tab=buildLog

local_stderr: b'W0109 01:57:43.922690 30317 init.cc:121] AVX is available, Please re-compile on local machine\nW0109 01:57:49.484200 30317 device_context.cc:257] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 9.0, Runtime API Version: 8.0\nW0109 01:57:49.484375 30317 device_context.cc:265] device: 0, cuDNN Version: 7.0.\n\n\n\n\n\n\nTraceback (most recent call last):\n File "dist_se_resnext.py", line 258, in \n runtime_main(DistSeResneXt2x2)\n File "/paddle/build/python/paddle/fluid/tests/unittests/test_dist_base.py", line 204, in runtime_main\n model.run_trainer(args)\n File "/paddle/build/python/paddle/fluid/tests/unittests/test_dist_base.py", line 164, in run_trainer\n feed=feeder.feed(get_data()))\n File "/paddle/build/python/paddle/fluid/tests/unittests/test_dist_base.py", line 151, in get_data\n origin_batch = next(reader_generator)\n File "/paddle/build/python/paddle/batch.py", line 35, in batch_reader\n for instance in r:\n File "/paddle/build/python/paddle/reader/decorator.py", line 52, in reader\n for e in map(func, *rs):\n File "/paddle/build/python/paddle/dataset/flowers.py", line 130, in reader\n data = batch['data']\nKeyError: 'data'\n'
[01:57:50] test_dist_se_resnext failed

Contributor Author

hshen14 commented Jan 9, 2019

The test results seem to fail with different symptoms after several re-runs. Is there any way to make the CI infrastructure more stable? @luotao1

Contributor

luotao1 commented Jan 9, 2019

Error log on a local Windows machine:
image
It seems that the output directory has changed from lib to bin:
image
So we should update as follows:
image

Has mkldnn.dll been intentionally moved from lib to bin?

Contributor Author

hshen14 commented Jan 9, 2019

Need help from @yinghu5


yinghu5 commented Jan 10, 2019

Yes, the MKL-DNN team decided to move the destination folder from lib to bin/, in line with how binary libraries are generally distributed. The change applies to the current master and to later versions (after v0.17.2).
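
Because the DLL location differs between MKL-DNN versions, build or packaging scripts may need to probe both layouts. A minimal sketch, assuming an install prefix that contains `bin/` and/or `lib/`; `find_mkldnn_dll` is a hypothetical helper and not part of Paddle's CMake scripts:

```python
from pathlib import Path
from typing import Optional


def find_mkldnn_dll(install_prefix: Path, name: str = "mkldnn.dll") -> Optional[Path]:
    """Locate the DLL under an MKL-DNN install prefix.

    bin/ is checked first (newer layout), then lib/ (older layout).
    Returns None if the file is in neither directory.
    """
    for subdir in ("bin", "lib"):
        candidate = install_prefix / subdir / name
        if candidate.is_file():
            return candidate
    return None
```

Checking bin/ first means the newer layout wins if both directories happen to contain a copy.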

Contributor

luotao1 commented Jan 10, 2019

Thanks @yinghu5!
@hshen14, could you update lib to bin to pass the windows CI?

Contributor Author

hshen14 commented Jan 10, 2019

@luotao1

It seems the Windows tests fail:

http://ci.paddlepaddle.org/viewLog.html?buildId=48051&buildTypeId=PaddleWindows_PrWindowsCi&tab=buildLog

[08:30:43]W: [Step 4/5] Errors while running CTest
[08:30:43] : [Step 4/5]
[08:30:43] : [Step 4/5] 79% tests passed, 78 tests failed out of 372
[08:30:43] : [Step 4/5]
[08:30:43] : [Step 4/5] Total Test time (real) = 253.21 sec
[08:30:43] : [Step 4/5]
[08:30:43] : [Step 4/5] The following tests FAILED:
[08:30:43] : [Step 4/5] 1 - system_allocator_test (Exit code 0xc0000135)
[08:30:43] : [Step 4/5] 2 - allocator_facade_test (Exit code 0xc0000135
......
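
The exit code 0xc0000135 in the log above is the Windows NTSTATUS value STATUS_DLL_NOT_FOUND, which is consistent with the tests no longer finding mkldnn.dll after it moved from lib to bin. A small sketch for decoding such exit codes; the table is deliberately tiny and `describe_exit_code` is a hypothetical helper:

```python
# A few NTSTATUS values that commonly appear as Windows process exit codes.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC0000139: "STATUS_ENTRYPOINT_NOT_FOUND",
}


def describe_exit_code(code: int) -> str:
    """Map a process exit code to a known NTSTATUS name, if any."""
    # Some tools report the code as a signed 32-bit integer; normalize it.
    code &= 0xFFFFFFFF
    return KNOWN_NTSTATUS.get(code, f"unknown (0x{code:08X})")


print(describe_exit_code(0xC0000135))
```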

Contributor

luotao1 commented Jan 10, 2019

> Seems Windows test fail:

@wopeizl is adding the MKL-DNN bin/ directory to the CI configuration path and will rerun the Windows CI.

Contributor

luotao1 commented Jan 10, 2019

[08:42:44][Step 1/1] /paddle/paddle/fluid/inference/tests/api/tester_helper.h:91: Failure
[08:42:44][Step 1/1] The difference between pdata_ref[j] and pdata[j] is 0.0026035308837890625, which exceeds FLAGS_accuracy, where
[08:42:44][Step 1/1] pdata_ref[j] evaluates to 0.74058622121810913,
[08:42:44][Step 1/1] pdata[j] evaluates to 0.74318975210189819, and
[08:42:44][Step 1/1] FLAGS_accuracy evaluates to 0.001.
[08:42:44][Step 1/1] [  FAILED  ] Analyzer_dam.compare (2522 ms)

It seems that MKL produces a random diff, that is, the output differs from the reference nondeterministically between runs.
http://ci.paddlepaddle.org/viewLog.html?buildId=47988&tab=buildLog&buildTypeId=Paddle_PrCi&logTab=tail
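
The failing comparison can be reproduced arithmetically from the numbers in the log. This is only a sketch of an absolute-tolerance check; `exceeds_tolerance` is a hypothetical name and not the helper defined in tester_helper.h:

```python
def exceeds_tolerance(reference: float, actual: float, tolerance: float) -> bool:
    """Return True when the absolute difference is larger than the tolerance."""
    return abs(reference - actual) > tolerance


# Values copied from the Analyzer_dam.compare failure above.
pdata_ref = 0.74058622121810913
pdata = 0.74318975210189819
accuracy = 0.001  # value of FLAGS_accuracy reported in the log

diff = abs(pdata_ref - pdata)
print(diff, exceeds_tolerance(pdata_ref, pdata, accuracy))
```

The difference is roughly 2.6 times the allowed tolerance, so this is not a borderline rounding failure.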

Contributor Author

hshen14 commented Jan 10, 2019

As mentioned by @luotao1, there is a random failure in MKL. Could you please help with that? @yinghu5 Thanks.

Contributor

luotao1 commented Jan 15, 2019

The earliest occurrence of the MKL diff dates back to Dec 15th. @yinghu5 @bingyanghuang @yihuaxu

[21:24:34][Step 1/1] /paddle/paddle/fluid/inference/tests/api/tester_helper.h:87: Failure
[21:24:34][Step 1/1] The difference between pdata_ref[j] and pdata[j] is 0.0028614401817321777, which exceeds 1e-3, where
[21:24:34][Step 1/1] pdata_ref[j] evaluates to 0.74058622121810913,
[21:24:34][Step 1/1] pdata[j] evaluates to 0.74344766139984131, and
[21:24:34][Step 1/1] 1e-3 evaluates to 0.001.
[21:24:34][Step 1/1] [  FAILED  ] Analyzer_dam.compare (542 ms)

http://ci.paddlepaddle.org/viewLog.html?tab=buildLog&buildTypeId=Paddle_PrCiNight&buildId=37730&_focus=21309#_state=65

Contributor

luotao1 commented Jan 15, 2019

@hshen14 I reran PR_CI, and the new error is:

[04:54:15][Step 1/1] 328/525 Test #349: test_conv2d_int8_mkldnn_op ......................***Failed    2.31 sec
[04:54:15][Step 1/1] *** Aborted at 1547528055 (unix time) try "date -d @1547528055" if you are using GNU date ***
[04:54:15][Step 1/1] PC: @                0x0 (unknown)
[04:54:15][Step 1/1] *** SIGSEGV (@0x7fcb44d19e80) received by PID 71119 (TID 0x7fcb4ba6a700) from PID 1154588288; stack trace: ***
[04:54:15][Step 1/1]     @     0x7fcb4b649390 (unknown)
[04:54:15][Step 1/1]     @     0x7fcb06366f7b _ZN6mkldnn4impl3cpu32_gemm_x8s8s32x_convolution_fwd_tIL18mkldnn_data_type_t6ELS3_6EE8pp_ker_tclEPhPKiPKcPKffffimm
[04:54:15][Step 1/1]     @     0x7fcb0636c2bb _ZNK6mkldnn4impl3cpu32_gemm_x8s8s32x_convolution_fwd_tIL18mkldnn_data_type_t6ELS3_6EE19execute_forward_thrEiiPKhPKaPKcPhRKNS0_15memory_tracking9grantor_tE
[04:54:15][Step 1/1]     @     0x7fcb0636c60b _ZNK6mkldnn4impl3cpu32_gemm_x8s8s32x_convolution_fwd_tIL18mkldnn_data_type_t6ELS3_6EE15execute_forwardEv
[04:54:15][Step 1/1]     @     0x7fcb05c6abc9 _ZNK6mkldnn4impl3cpu32_gemm_x8s8s32x_convolution_fwd_tIL18mkldnn_data_type_t6ELS3_6EE7executeEPNS0_7event_tE
[04:54:15][Step 1/1]     @     0x7fcb05c5c183 mkldnn::impl::cpu::cpu_engine_t::submit()
[04:54:15][Step 1/1]     @     0x7fcb065bc096 mkldnn::impl::stream_eager_t::submit_impl()
[04:54:15][Step 1/1]     @     0x7fcb065bb481 mkldnn_stream::submit()
[04:54:15][Step 1/1]     @     0x7fcb065bb698 mkldnn_stream_submit
[04:54:15][Step 1/1]     @     0x7fcb079a2273 mkldnn::stream::submit()
[04:54:15][Step 1/1]     @     0x7fcb07f561ee paddle::operators::ConvMKLDNNOpKernel<>::ComputeINT8()
[04:54:15][Step 1/1]     @     0x7fcb07f58d17 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform8CPUPlaceELb0ELm0EJNS0_9operators18ConvMKLDNNOpKernelIhfEEEEclEPKcSE_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
[04:54:15][Step 1/1]     @     0x7fcb092a2850 paddle::framework::OperatorWithKernel::RunImpl()
[04:54:15][Step 1/1]     @     0x7fcb0929cdfd paddle::framework::OperatorBase::Run()
[04:54:15][Step 1/1]     @     0x7fcb0789280e paddle::framework::Executor::RunPreparedContext()
[04:54:15][Step 1/1]     @     0x7fcb07894eaa paddle::framework::Executor::Run()
[04:54:15][Step 1/1]     @     0x7fcb07755188 _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL18pybind11_init_coreERNS_6moduleEEUlRNS2_9framework8ExecutorERKNS6_11ProgramDescEPNS6_5ScopeEibbE85_vJS8_SB_SD_ibbEJNS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESV_
[04:54:15][Step 1/1]     @     0x7fcb077aaf0d pybind11::cpp_function::dispatcher()
[04:54:15][Step 1/1]     @           0x4c5326 PyEval_EvalFrameEx
[04:54:15][Step 1/1]     @           0x4b9b66 PyEval_EvalCodeEx
[04:54:15][Step 1/1]     @           0x4c1f56 PyEval_EvalFrameEx
[04:54:15][Step 1/1]     @           0x4b9b66 PyEval_EvalCodeEx
[04:54:15][Step 1/1]     @           0x4c17c6 PyEval_EvalFrameEx
[04:54:15][Step 1/1]     @           0x4b9b66 PyEval_EvalCodeEx
[04:54:15][Step 1/1]     @           0x4c17c6 PyEval_EvalFrameEx
[04:54:15][Step 1/1]     @           0x4b9b66 PyEval_EvalCodeEx
[04:54:15][Step 1/1]     @           0x4c17c6 PyEval_EvalFrameEx
[04:54:15][Step 1/1]     @           0x4b9b66 PyEval_EvalCodeEx
[04:54:15][Step 1/1]     @           0x4c1f56 PyEval_EvalFrameEx
[04:54:15][Step 1/1]     @           0x4b9b66 PyEval_EvalCodeEx
[04:54:15][Step 1/1]     @           0x4d57a3 (unknown)
[04:54:15][Step 1/1]     @           0x4a587e PyObject_Call
[04:54:15][Step 1/1] Segmentation fault

http://ci.paddlepaddle.org/viewLog.html?buildId=49403&tab=buildLog&buildTypeId=Paddle_PrCi&logTab=tree&filter=all&_focus=22857#_state=23303
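
The abort banner in the trace above suggests `date -d @1547528055` to decode the crash time. An equivalent sketch in Python, where `abort_time_utc` is a hypothetical helper name:

```python
from datetime import datetime, timezone


def abort_time_utc(unix_seconds: int) -> str:
    """Format a Unix timestamp as a UTC date and time string."""
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(abort_time_utc(1547528055))  # 2019-01-15 04:54:15
```

The decoded time matches the [04:54:15] prefix on the log lines, which suggests the CI log timestamps are in UTC.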

Contributor Author

hshen14 commented Jan 15, 2019

@xiaolil1 will take a look. BTW, is it reproducible 100% or a random issue? @luotao1

Contributor

luotao1 commented Jan 15, 2019

@hshen14 @xiaolil1 Although it's a random failure, it points to a risk in conv2d_int8_mkldnn.

Contributor Author

hshen14 commented Jan 15, 2019

> @hshen14 @xiaolil1 Though it's a random fail, it has the risk in conv2d_int8_mkldnn.

We will try to reproduce it locally and investigate the cause. The crash message points to a potential issue in the INT8 GEMM path, so the conv INT8 kernel itself should be okay.

Contributor

luotao1 commented Jan 16, 2019

Contributor Author

hshen14 commented Jan 18, 2019

@luotao1 @yinghu5 The INT8 unit test failure was caused by a potential regression in MKL-DNN; similar symptoms have been reported by other DL frameworks. We have an internal bug tracking the issue and will keep you updated.

Contributor Author

hshen14 commented Feb 1, 2019

@luotao1 One Windows failure; rerunning now:
[03:39:45] : [Step 2/5] C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\Microsoft.CppCommon.targets(171,5): error MSB4018: The "CustomBuild" task failed unexpectedly. [D:\home\BuildAgent\work\a9b0372f0aea0a80\build\paddle\fluid\operators\ctc_align_op.vcxproj]
[03:39:45] : [Step 2/5] C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\Microsoft.CppCommon.targets(171,5): error MSB4018: System.IO.IOException: The process cannot access the file "D:\home\BuildAgent\temp\buildTmp\tmp9fda56ecc91a40e6988890b6b8543a61.cmd" because it is being used by another process. [D:\home\BuildAgent\work\a9b0372f0aea0a80\build\paddle\fluid\operators\ctc_align_op.vcxproj]

http://ci.paddlepaddle.org/viewLog.html?buildId=55875&buildTypeId=PaddleWindows_PrWindowsCi&tab=buildLog

@panyx0718 We need your approval for the cmake/external change introduced by the MKL-DNN upgrade:
[03:44:37] : [Step 1/1] You must have panyx0718 approval for the api change! cmake/external
[03:44:37]W: [Step 1/1] + APPROVALS=FALSE
[03:44:37]W: [Step 1/1] + echo 'current pr 15116 got approvals: FALSE'
[03:44:37]W: [Step 1/1] + '[' FALSE == FALSE ']'
[03:44:37]W: [Step 1/1] + echo 'You must have panyx0718 approval for the api change! cmake/external'
[03:44:37]W: [Step 1/1] + exit 1

http://ci.paddlepaddle.org/viewLog.html?buildId=55852&buildTypeId=Paddle_PrCi&tab=buildLog

Contributor Author

hshen14 commented Feb 15, 2019

@luotao1 @kbinias @jianhang-liu @yinghu5 After discussion, we agreed on a two-stage MKL-DNN upgrade. @kbinias's team is working on validation with MKL-DNN v0.18rc and will prepare a PR after successful validation. I will keep this PR open for reference and close it when the new PR with v0.18rc is submitted.

Contributor Author

hshen14 commented Feb 22, 2019

The new MKL-DNN upgrade is a work in progress in #15861.

Closing this intermediate PR.
