MKLDNN implementation of batch normalization #9904

tpatejko · 2018-04-13T10:54:22Z

This PR implements MKLDNN batch normalization. It contains:

MKLDNN batch norm forward and backward passes;
support for the NCHW data layout;
unit tests for training and inference.

tpatejko · 2018-04-15T16:46:50Z

@luotao1 I'm having some trouble with the TeamCity builds. The build finishes with the following error message:

[15:41:08][Step 1/1] Traceback (most recent call last):
[15:41:08][Step 1/1]   File "/usr/bin/pip", line 9, in <module>
[15:41:08][Step 1/1]     from pip import main
[15:41:08][Step 1/1] ImportError: cannot import name main
[15:41:08][Step 1/1] The command '/bin/sh -c pip install --upgrade pip &&     pip install -U wheel &&     pip install -U docopt PyYAML sphinx==1.5.6 &&     pip install sphinx-rtd-theme==0.1.9 recommonmark' returned a non-zero code: 1
[15:41:08][Step 1/1] 
[15:41:08][Step 1/1] Process exited with code 1
[15:41:08][Step 1/1] Process exited with code 1
[15:41:08][Step 1/1] Step Build and test (Command Line) failed

Do you know what the issue might be?

luotao1 · 2018-04-16T02:22:49Z

@tpatejko This bug is a duplicate of #9927 and was fixed in #9926. You can merge the latest code.

tpatejko · 2018-04-17T12:28:54Z

@luotao1 I'm having some trouble with the unit tests in TeamCity. The failing test is test_parallel_executor.

The output for the test is as follows:

[11:40:51][Step 1/1]  94/125 Test  #91: test_parallel_executor ..........................***Exception: Other 44.62 sec
[11:40:51][Step 1/1] test_parallel_testing (test_parallel_executor.ParallelExecutorTestingDuringTraining) ... FAIL
[11:40:51][Step 1/1] test_all (test_parallel_executor.TestCRFModel) ... [171.97684 167.20038]
[11:40:51][Step 1/1] [82.76634 93.07925]
[11:40:51][Step 1/1] [87.40605 84.28883]
[11:40:51][Step 1/1] [81.66299 78.8318 ]
[11:40:51][Step 1/1] [62.7163  97.20565]
[11:40:51][Step 1/1] [84.70265  85.544266]
[11:40:51][Step 1/1] [67.59907 85.60291]
[11:40:51][Step 1/1] [72.08023 69.33337]
[11:40:51][Step 1/1] [63.721405 74.92147 ]
[11:40:51][Step 1/1] [57.358616 63.71672 ]
[11:40:51][Step 1/1] ok
[11:40:51][Step 1/1] test_batchnorm_fc (test_parallel_executor.TestMNIST) ... [2.755311  2.6013417] [0.57221764 0.8664746 ]
[11:40:51][Step 1/1] ERROR
[11:40:51][Step 1/1] test_simple_fc (test_parallel_executor.TestMNIST) ... ERROR
[11:40:51][Step 1/1] test_resnet (test_parallel_executor.TestResnet) ... ERROR
[11:40:51][Step 1/1] test_main (test_parallel_executor.TestTransformer) ... skipped 'transformer is buggy in multi gpu'
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1] ======================================================================
[11:40:51][Step 1/1] ERROR: test_batchnorm_fc (test_parallel_executor.TestMNIST)
[11:40:51][Step 1/1] ----------------------------------------------------------------------
[11:40:51][Step 1/1] Traceback (most recent call last):
[11:40:51][Step 1/1]   File "test_parallel_executor.py", line 276, in test_batchnorm_fc
[11:40:51][Step 1/1]     "label": label})
[11:40:51][Step 1/1]   File "test_parallel_executor.py", line 228, in check_network_convergence
[11:40:51][Step 1/1]     exe.run([], feed_dict=feed_dict)
[11:40:51][Step 1/1]   File "/paddle/build/python/paddle/fluid/parallel_executor.py", line 145, in run
[11:40:51][Step 1/1]     self.executor.run(fetch_list, fetch_var_name, feed_tensor_dict)
[11:40:51][Step 1/1] EnforceNotMet: an illegal memory access was encountered at [/paddle/paddle/fluid/platform/device_context.cc:179]
[11:40:51][Step 1/1] PaddlePaddle Call Stacks: 
[11:40:51][Step 1/1] 0       0x7f5acc347d3cp paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 572
[11:40:51][Step 1/1] 1       0x7f5acd234e33p paddle::platform::CUDADeviceContext::Wait() const + 515
[11:40:51][Step 1/1] 2       0x7f5acc40f20ep paddle::framework::ParallelExecutor::Run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, paddle::framework::LoDTensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, paddle::framework::LoDTensor> > > const&) + 766
[11:40:51][Step 1/1] 3       0x7f5acc39c6b3p void pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<void, paddle::framework::ParallelExecutor, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, paddle::framework::LoDTensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, paddle::framework::LoDTensor> > > const&, pybind11::name, pybind11::is_method, pybind11::sibling>(void (paddle::framework::ParallelExecutor::*)(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, paddle::framework::LoDTensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, paddle::framework::LoDTensor> > > const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(paddle::framework::ParallelExecutor*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, paddle::framework::LoDTensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, paddle::framework::LoDTensor> > > const&)#1}, void, paddle::framework::ParallelExecutor*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, paddle::framework::LoDTensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, paddle::framework::LoDTensor> > > const&, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::cpp_function::initialize<void, paddle::framework::ParallelExecutor, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >, paddle::framework::LoDTensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, paddle::framework::LoDTensor> > > const&, pybind11::name, pybind11::is_method, pybind11::sibling>(void (paddle::framework::ParallelExecutor::*)(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, paddle::framework::LoDTensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, paddle::framework::LoDTensor> > > const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(paddle::framework::ParallelExecutor*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, paddle::framework::LoDTensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, paddle::framework::LoDTensor> > > const&)#1}&&, void (*)(paddle::framework::ParallelExecutor*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, paddle::framework::LoDTensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, paddle::framework::LoDTensor> > > const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) + 451
[11:40:51][Step 1/1] 4       0x7f5acc362234p pybind11::cpp_function::dispatcher(_object*, _object*, _object*) + 1236
[11:40:51][Step 1/1] 5             0x4c37edp PyEval_EvalFrameEx + 31165
[11:40:51][Step 1/1] 6             0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 7             0x4c16e7p PyEval_EvalFrameEx + 22711
[11:40:51][Step 1/1] 8             0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 9             0x4c16e7p PyEval_EvalFrameEx + 22711
[11:40:51][Step 1/1] 10            0x4c136fp PyEval_EvalFrameEx + 21823
[11:40:51][Step 1/1] 11            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 12            0x4d55f3p
[11:40:51][Step 1/1] 13            0x4a577ep PyObject_Call + 62
[11:40:51][Step 1/1] 14            0x4bed3dp PyEval_EvalFrameEx + 12045
[11:40:51][Step 1/1] 15            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 16            0x4d54b9p
[11:40:51][Step 1/1] 17            0x4eebeep
[11:40:51][Step 1/1] 18            0x4a577ep PyObject_Call + 62
[11:40:51][Step 1/1] 19            0x548253p
[11:40:51][Step 1/1] 20            0x4c15bfp PyEval_EvalFrameEx + 22415
[11:40:51][Step 1/1] 21            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 22            0x4d55f3p
[11:40:51][Step 1/1] 23            0x4a577ep PyObject_Call + 62
[11:40:51][Step 1/1] 24            0x4bed3dp PyEval_EvalFrameEx + 12045
[11:40:51][Step 1/1] 25            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 26            0x4d54b9p
[11:40:51][Step 1/1] 27            0x4eebeep
[11:40:51][Step 1/1] 28            0x4a577ep PyObject_Call + 62
[11:40:51][Step 1/1] 29            0x548253p
[11:40:51][Step 1/1] 30            0x4c15bfp PyEval_EvalFrameEx + 22415
[11:40:51][Step 1/1] 31            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 32            0x4d55f3p
[11:40:51][Step 1/1] 33            0x4a577ep PyObject_Call + 62
[11:40:51][Step 1/1] 34            0x4bed3dp PyEval_EvalFrameEx + 12045
[11:40:51][Step 1/1] 35            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 36            0x4d54b9p
[11:40:51][Step 1/1] 37            0x4eebeep
[11:40:51][Step 1/1] 38            0x4a577ep PyObject_Call + 62
[11:40:51][Step 1/1] 39            0x548253p
[11:40:51][Step 1/1] 40            0x4c15bfp PyEval_EvalFrameEx + 22415
[11:40:51][Step 1/1] 41            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 42            0x4d55f3p
[11:40:51][Step 1/1] 43            0x4a577ep PyObject_Call + 62
[11:40:51][Step 1/1] 44            0x4bed3dp PyEval_EvalFrameEx + 12045
[11:40:51][Step 1/1] 45            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 46            0x4d54b9p
[11:40:51][Step 1/1] 47            0x4eebeep
[11:40:51][Step 1/1] 48            0x4a577ep PyObject_Call + 62
[11:40:51][Step 1/1] 49            0x548253p
[11:40:51][Step 1/1] 50            0x4c15bfp PyEval_EvalFrameEx + 22415
[11:40:51][Step 1/1] 51            0x4c136fp PyEval_EvalFrameEx + 21823
[11:40:51][Step 1/1] 52            0x4c136fp PyEval_EvalFrameEx + 21823
[11:40:51][Step 1/1] 53            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 54            0x4d55f3p
[11:40:51][Step 1/1] 55            0x4eebeep
[11:40:51][Step 1/1] 56            0x4ee7f6p
[11:40:51][Step 1/1] 57            0x4aa9abp
[11:40:51][Step 1/1] 58            0x4c15bfp PyEval_EvalFrameEx + 22415
[11:40:51][Step 1/1] 59            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 60            0x4bfa8dp PyEval_EvalFrameEx + 15453
[11:40:51][Step 1/1] 61            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 62            0x4c16e7p PyEval_EvalFrameEx + 22711
[11:40:51][Step 1/1] 63            0x4b9ab6p PyEval_EvalCodeEx + 774
[11:40:51][Step 1/1] 64            0x4d54b9p
[11:40:51][Step 1/1] 65            0x4a577ep PyObject_Call + 62
[11:40:51][Step 1/1] 66            0x519a46p
[11:40:51][Step 1/1] 67            0x493b06p Py_Main + 1590
[11:40:51][Step 1/1] 68      0x7f5b0153e830p __libc_start_main + 240
[11:40:51][Step 1/1] 69            0x4933e9p _start + 41
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1] ======================================================================
[11:40:51][Step 1/1] ERROR: test_simple_fc (test_parallel_executor.TestMNIST)
[11:40:51][Step 1/1] ----------------------------------------------------------------------
[11:40:51][Step 1/1] Traceback (most recent call last):
[11:40:51][Step 1/1]   File "test_parallel_executor.py", line 261, in test_simple_fc
[11:40:51][Step 1/1]     self.check_network_convergence(simple_fc_net)
[11:40:51][Step 1/1]   File "test_parallel_executor.py", line 218, in check_network_convergence
[11:40:51][Step 1/1]     startup_exe.run(startup)
[11:40:51][Step 1/1]   File "/paddle/build/python/paddle/fluid/executor.py", line 336, in run
[11:40:51][Step 1/1]     self.executor.run(program.desc, scope, 0, True, True)
[11:40:51][Step 1/1] RuntimeError: function_attributes(): after cudaFuncGetAttributes: an illegal memory access was encountered
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1] ======================================================================
[11:40:51][Step 1/1] ERROR: test_resnet (test_parallel_executor.TestResnet)
[11:40:51][Step 1/1] ----------------------------------------------------------------------
[11:40:51][Step 1/1] Traceback (most recent call last):
[11:40:51][Step 1/1]   File "test_parallel_executor.py", line 305, in test_resnet
[11:40:51][Step 1/1]     batch_size=batch_size)
[11:40:51][Step 1/1]   File "test_parallel_executor.py", line 218, in check_network_convergence
[11:40:51][Step 1/1]     startup_exe.run(startup)
[11:40:51][Step 1/1]   File "/paddle/build/python/paddle/fluid/executor.py", line 336, in run
[11:40:51][Step 1/1]     self.executor.run(program.desc, scope, 0, True, True)
[11:40:51][Step 1/1] RuntimeError: function_attributes(): after cudaFuncGetAttributes: an illegal memory access was encountered
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1] ======================================================================
[11:40:51][Step 1/1] FAIL: test_parallel_testing (test_parallel_executor.ParallelExecutorTestingDuringTraining)
[11:40:51][Step 1/1] ----------------------------------------------------------------------
[11:40:51][Step 1/1] Traceback (most recent call last):
[11:40:51][Step 1/1]   File "test_parallel_executor.py", line 507, in test_parallel_testing
[11:40:51][Step 1/1]     str(test_loss))
[11:40:51][Step 1/1] AssertionError: Train loss: [2.8382177 2.20449  ]
[11:40:51][Step 1/1]  Test loss:[2.8382177 2.7273917]
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1] ----------------------------------------------------------------------
[11:40:51][Step 1/1] Ran 6 tests in 17.689s
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1] FAILED (failures=1, errors=3, skipped=1)
[11:40:51][Step 1/1] terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
[11:40:51][Step 1/1]   what():  an illegal memory access was encountered at [/paddle/paddle/fluid/platform/device_context.cc:179]
[11:40:51][Step 1/1] PaddlePaddle Call Stacks: 
[11:40:51][Step 1/1] 0       0x7f5acc347d3cp paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 572
[11:40:51][Step 1/1] 1       0x7f5acd234e33p paddle::platform::CUDADeviceContext::Wait() const + 515
[11:40:51][Step 1/1] 2       0x7f5acd23507bp paddle::platform::CUDADeviceContext::~CUDADeviceContext() + 75
[11:40:51][Step 1/1] 3       0x7f5acd235881p paddle::platform::CUDADeviceContext::~CUDADeviceContext() + 17
[11:40:51][Step 1/1] 4       0x7f5acc43c50ep paddle::operators::reader::DoubleBufferReader::~DoubleBufferReader() + 62
[11:40:51][Step 1/1] 5       0x7f5acc345270p paddle::framework::Variable::PlaceholderImpl<paddle::framework::ReaderHolder>::~PlaceholderImpl() + 48
[11:40:51][Step 1/1] 6       0x7f5acd0deafcp paddle::framework::Scope::~Scope() + 188
[11:40:51][Step 1/1] 7       0x7f5acd0de9f1p paddle::framework::Scope::DropKids() + 49
[11:40:51][Step 1/1] 8       0x7f5acd0dea6ap paddle::framework::Scope::~Scope() + 42
[11:40:51][Step 1/1] 9       0x7f5acc34421ap pybind11::class_<paddle::framework::Scope>::dealloc(_object*) + 58
[11:40:51][Step 1/1] 10      0x7f5acc35bb2dp pybind11_object_dealloc + 45
[11:40:51][Step 1/1] 11            0x4fc33ap _PyModule_Clear + 1354
[11:40:51][Step 1/1] 12            0x4fbc2ep PyImport_Cleanup + 990
[11:40:51][Step 1/1] 13            0x4f8e14p Py_Finalize + 132
[11:40:51][Step 1/1] 14            0x51dc18p Py_Exit + 8
[11:40:51][Step 1/1] 15            0x51b1b7p
[11:40:51][Step 1/1] 16            0x51aaddp PyErr_PrintEx + 45
[11:40:51][Step 1/1] 17            0x519a53p
[11:40:51][Step 1/1] 18            0x493b06p Py_Main + 1590
[11:40:51][Step 1/1] 19      0x7f5b0153e830p __libc_start_main + 240
[11:40:51][Step 1/1] 20            0x4933e9p _start + 41
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1] *** Aborted at 1523965227 (unix time) try "date -d @1523965227" if you are using GNU date ***
[11:40:51][Step 1/1] PC: @                0x0 (unknown)
[11:40:51][Step 1/1] *** SIGABRT (@0x610a) received by PID 24842 (TID 0x7f5b01d1e700) from PID 24842; stack trace: ***
[11:40:51][Step 1/1]     @     0x7f5b018f9390 (unknown)
[11:40:51][Step 1/1]     @     0x7f5b01553428 gsignal
[11:40:51][Step 1/1]     @     0x7f5b0155502a abort
[11:40:51][Step 1/1]     @     0x7f5af805184d __gnu_cxx::__verbose_terminate_handler()
[11:40:51][Step 1/1]     @     0x7f5af804f6b6 (unknown)
[11:40:51][Step 1/1]     @     0x7f5af804e6a9 (unknown)
[11:40:51][Step 1/1]     @     0x7f5af804f005 __gxx_personality_v0
[11:40:51][Step 1/1]     @     0x7f5af8573f83 (unknown)
[11:40:51][Step 1/1]     @     0x7f5af8574487 _Unwind_Resume
[11:40:51][Step 1/1]     @     0x7f5acd234fd6 paddle::platform::CUDADeviceContext::Wait()
[11:40:51][Step 1/1]     @     0x7f5acd23507b paddle::platform::CUDADeviceContext::~CUDADeviceContext()
[11:40:51][Step 1/1]     @     0x7f5acd235881 paddle::platform::CUDADeviceContext::~CUDADeviceContext()
[11:40:51][Step 1/1]     @     0x7f5acc43c50e paddle::operators::reader::DoubleBufferReader::~DoubleBufferReader()
[11:40:51][Step 1/1]     @     0x7f5acc345270 paddle::framework::Variable::PlaceholderImpl<>::~PlaceholderImpl()
[11:40:51][Step 1/1]     @     0x7f5acd0deafc paddle::framework::Scope::~Scope()
[11:40:51][Step 1/1]     @     0x7f5acd0de9f1 paddle::framework::Scope::DropKids()
[11:40:51][Step 1/1]     @     0x7f5acd0dea6a paddle::framework::Scope::~Scope()
[11:40:51][Step 1/1]     @     0x7f5acc34421a pybind11::class_<>::dealloc()
[11:40:51][Step 1/1]     @     0x7f5acc35bb2d pybind11_object_dealloc
[11:40:51][Step 1/1]     @           0x4fc33a _PyModule_Clear
[11:40:51][Step 1/1]     @           0x4fbc2e PyImport_Cleanup
[11:40:51][Step 1/1]     @           0x4f8e14 Py_Finalize
[11:40:51][Step 1/1]     @           0x51dc18 Py_Exit
[11:40:51][Step 1/1]     @           0x51b1b7 (unknown)
[11:40:51][Step 1/1]     @           0x51aadd PyErr_PrintEx
[11:40:51][Step 1/1]     @           0x519a53 (unknown)
[11:40:51][Step 1/1]     @           0x493b06 Py_Main
[11:40:51][Step 1/1]     @     0x7f5b0153e830 __libc_start_main
[11:40:51][Step 1/1]     @           0x4933e9 _start
[11:40:51][Step 1/1]     @                0x0 (unknown)
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1]         Start  94: test_mul_op

Some of these tests seem to be failing because of incorrect memory accesses, but some of them seem to calculate results incorrectly. Some of them seem to be related to batch normalization.

I made some changes in plain batch norm, but I haven't touched the plain or GPU implementations of batch normalization.

Do you know the reason why these tests are failing?

PS. I noticed there is an issue with parallel_executor in the title #9984. Could it be related to the issues above?

luotao1 · 2018-04-17T13:19:40Z

Do you know the reason why these tests are failing? PS. I noticed there is an issue with parallel_executor in the title #9984. Could it be related to the issues above?

Yes, our test_parallel_executor fails randomly. You can try re-running your commit first.

tpatejko · 2018-04-17T14:13:05Z

@luotao1 Thanks for your comment. One more question: how do the test_parallel_executor tests fail?

I can see two different types of failures:

This one seems to be related to how computations are carried out:

[11:40:51][Step 1/1] test_batchnorm_fc (test_parallel_executor.TestMNIST) ... [2.755311  2.6013417] [0.57221764 0.8664746 ]
[11:40:51][Step 1/1] ERROR

or this one:

[11:40:51][Step 1/1] FAIL: test_parallel_testing (test_parallel_executor.ParallelExecutorTestingDuringTraining)
[11:40:51][Step 1/1] ----------------------------------------------------------------------
[11:40:51][Step 1/1] Traceback (most recent call last):
[11:40:51][Step 1/1]   File "test_parallel_executor.py", line 507, in test_parallel_testing
[11:40:51][Step 1/1]     str(test_loss))
[11:40:51][Step 1/1] AssertionError: Train loss: [2.8382177 2.20449  ]
[11:40:51][Step 1/1]  Test loss:[2.8382177 2.7273917]
[11:40:51][Step 1/1] 
[11:40:51][Step 1/1] ----------------------------------------------------------------------

This one seems to be related to an incorrect memory access in the GPU device context:

[11:40:51][Step 1/1] FAILED (failures=1, errors=3, skipped=1)
[11:40:51][Step 1/1] terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
[11:40:51][Step 1/1]   what():  an illegal memory access was encountered at [/paddle/paddle/fluid/platform/device_context.cc:179]

tpatejko · 2018-04-17T15:57:46Z

@luotao1 Could you have a look at the code, or point me to someone who could review this PR?

luotao1

@tensor-tang Could you help review batch_norm_mkldnn_op.cc?

luotao1 · 2018-04-20T07:13:08Z

python/paddle/fluid/tests/unittests/test_batch_norm_mkldnn_op.py

+from test_batch_norm_op import TestBatchNormOpInference, TestBatchNormOpTraining, _reference_training, _reference_grad
+
+
+class TestMKLDNNBatchNormOpTraining(TestBatchNormOpTraining):


Could the batch_norm test be written like this:

Paddle/python/paddle/fluid/tests/unittests/test_conv2d_mkldnn_op.py

Lines 20 to 22 in 5e13314

class TestMKLDNN(TestConv2dOp):
    def init_kernel_type(self):
        self.use_mkldnn = True

Only init_kernel_type needs to be set, which will run the mkldnn_batch_norm kernel.

I corrected this part. I introduced an init_kernel_type method in the batch norm test cases that sets the use_mkldnn variable.

If you have the init_kernel_type method, could lines 29-148 be removed?

@luotao1 Unfortunately, the test_with_place function in test_batch_norm_op.py inverts the saved_variance variable, because the GPU and plain CPU implementations of batch norm do that.

The batch norm operation in the MKLDNN library does not. So I had to reimplement test_with_place in order to compare saved_variance from the reference batch norm implementation with the one produced by MKLDNN.
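
To make the mismatch concrete, here is a minimal sketch in plain Python. It assumes that "inverted" means the reciprocal standard deviation, 1 / sqrt(variance + epsilon), which the thread does not spell out. The helper names and the epsilon value are hypothetical and are not taken from Paddle or MKLDNN.

```python
import math
import statistics

EPSILON = 1e-5  # hypothetical value, chosen only for this illustration


def to_inverse_std(variance, epsilon=EPSILON):
    """Convert a plain variance into the assumed 'inverted' form."""
    return 1.0 / math.sqrt(variance + epsilon)


def from_inverse_std(inv_std, epsilon=EPSILON):
    """Recover the plain variance from the assumed 'inverted' form."""
    return 1.0 / (inv_std * inv_std) - epsilon


# All values of one channel, flattened over N, H and W.
channel = [0.5, 1.5, -1.0, 2.0, 0.0, 3.0]

# Biased (population) variance of the batch for this channel.
plain = statistics.pvariance(channel)   # what a non-inverting kernel would save
inverted = to_inverse_std(plain)        # what an inverting kernel would save

# Both numbers describe the same statistic, but they are not directly
# comparable, so a test has to convert one side before asserting equality.
print(plain, inverted)
assert not math.isclose(plain, inverted)
assert math.isclose(from_inverse_std(inverted), plain, rel_tol=1e-9)
```

Under that assumption, a reference that always inverts would never match a kernel that saves the plain variance, even if both computed the batch statistics correctly.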

tpatejko · 2018-04-24T07:53:21Z

@tensor-tang Could you have a look at this PR?

tensor-tang · 2018-04-24T14:23:11Z

Hi @tpatejko and @luotao1, since I am on annual leave this week, I could only take a quick look at batch_norm_mkldnn_op.cc. The logic looks OK to me. @luotao1, could you please help double-check it and the other items? Thanks.

tpatejko · 2018-04-26T08:12:18Z

@luotao1 do you have any further remarks regarding this PR?

luotao1 · 2018-04-26T08:15:04Z

What do you think about #9904 (comment)?

tpatejko · 2018-04-27T07:19:23Z

@luotao1 I'm sorry for the late response. I've just seen your comment regarding batch norm unit tests.

luotao1 · 2018-04-27T09:40:01Z

python/paddle/fluid/tests/unittests/test_batch_norm_mkldnn_op.py

+
+        place = core.CPUPlace()
+        data_format = "NCHW"
+        test_with_place(place, data_format, [2, 3, 4, 5])


@tpatejko I see that the only difference between test_with_place in test_batch_norm_mkldnn_op.py and in test_batch_norm_op.py is lines 146-148:

+        place = core.CPUPlace()
+        data_format = "NCHW"
+        test_with_place(place, data_format, [2, 3, 4, 5])

and

Paddle/python/paddle/fluid/tests/unittests/test_batch_norm_op.py

Lines 392 to 398 in ad91bfe

places = [core.CPUPlace()]
if core.is_compiled_with_cuda() and core.op_support_gpu("batch_norm"):
    places.append(core.CUDAPlace(0))
for place in places:
    for data_format in ["NCHW", "NHWC"]:
        test_with_place(place, data_format, [2, 3, 4, 5])

Thus, how about moving test_with_place out of test_forward_backward in test_batch_norm_op.py, like:

def test_with_place(self, place, data_layout, shape):
    ....

def test_forward_backward(self):
    places = [core.CPUPlace()]
    if core.is_compiled_with_cuda() and core.op_support_gpu("batch_norm"):
        places.append(core.CUDAPlace(0))
    for place in places:
        for data_format in ["NCHW", "NHWC"]:
            self.test_with_place(place, data_format, [2, 3, 4, 5])

Then you only need to rewrite test_forward_backward in test_batch_norm_mkldnn_op.py, like:

class TestMKLDNNBatchNormOpTraining(TestBatchNormOpTraining):
    def init_kernel_type(self):
        self.use_mkldnn = True

    def test_forward_backward(self):
        place = core.CPUPlace()
        data_format = "NCHW"
        self.test_with_place(place, data_format, [2, 3, 4, 5])
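
The suggested structure can be tried without Paddle. The following is only a sketch built on the standard unittest module: the class names, the string placeholders for places, and the check_with_place helper are hypothetical stand-ins for the real test code.

```python
import unittest


class BaseBatchNormTraining(unittest.TestCase):
    """Hypothetical stand-in for the shared training test."""

    def setUp(self):
        self.use_mkldnn = False
        self.init_kernel_type()
        self.visited = []  # records which (place, layout) pairs were checked

    def init_kernel_type(self):
        pass  # subclasses flip kernel-selection flags here

    def check_with_place(self, place, data_layout, shape):
        # Placeholder for the real forward/backward comparison.
        self.visited.append((place, data_layout, self.use_mkldnn))

    def test_forward_backward(self):
        for place in ["cpu"]:
            for data_layout in ["NCHW", "NHWC"]:
                self.check_with_place(place, data_layout, [2, 3, 4, 5])
        self.assertEqual(len(self.visited), 2)


class MKLDNNBatchNormTraining(BaseBatchNormTraining):
    def init_kernel_type(self):
        self.use_mkldnn = True

    def test_forward_backward(self):
        # Only the combination this kernel supports.
        self.check_with_place("cpu", "NCHW", [2, 3, 4, 5])
        self.assertEqual(self.visited, [("cpu", "NCHW", True)])


def run_suite():
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite()
    suite.addTests(loader.loadTestsFromTestCase(BaseBatchNormTraining))
    suite.addTests(loader.loadTestsFromTestCase(MKLDNNBatchNormTraining))
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_suite()
print(result.testsRun, result.wasSuccessful())
```

The helper is named check_with_place in this sketch on purpose: with the default unittest loader, any method whose name starts with test is collected as a test case and called without extra arguments, so a method named test_with_place that takes place, data_layout, and shape would be reported as an error.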

The difference between test_with_place in TestMKLDNNBatchNormOpTraining and test_with_place in TestBatchNormOpTraining is the following:
The reference training test case computes the inverse of saved_variance:
https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/unittests/test_batch_norm_op.py#L301-L307

This is done because the GPU operator returns the saved/batch variance already inverted.

The MKLDNN implementation of batch normalization does not invert the batch variance (saved variance). So, in order to compare against the results of the _reference_training function used in test_with_place, I had to reuse the code of test_with_place with the aforementioned lines omitted:
https://github.com/tpatejko/Paddle/blob/474fa48b38b4148c5573a8811186e122532af490/python/paddle/fluid/tests/unittests/test_batch_norm_mkldnn_op.py#L53-L59

@luotao1 One more thing: you are right about moving test_with_place out of test_forward_backward and adding the data format as a parameter. I did it, but the test cases for MKLDNN were failing because of the inverted batch/saved variance.

@tpatejko Thanks for your explanation. I see. But how about adding a function _reference_training_and_grad in TestBatchNormOpTraining, like:

def _reference_training_and_grad(x, scale, bias, epsilon, data_layout):
    y, saved_mean, saved_variance = _reference_training(
        x, scale, bias, epsilon, data_layout)
    mean_out = saved_mean * (1. - momentum) + momentum * mean
    variance_out = saved_variance * (1. - momentum) + momentum * variance
    # run backward
    y_grad = np.random.random_sample(shape).astype(np.float32)
    x_grad, scale_grad, bias_grad = _reference_grad(
        x, y_grad, scale, saved_mean, saved_variance, epsilon, data_layout)
    return x_grad, scale_grad, bias_grad

Then you only need to rewrite _reference_training_and_grad in test_batch_norm_mkldnn_op.py.
The reasons are:

There is a lot of similar code in test_batch_norm_mkldnn_op.py and test_batch_norm_op.py.

It is hard to spot the actual difference: the MKLDNN implementation of batch normalization does not invert the batch variance (saved variance).
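
The idea of isolating the one differing step in an overridable function can be sketched as follows. This is not the code that was merged: the class and function names are hypothetical, the reference is reduced to a single channel, and inversion is assumed to mean 1 / sqrt(variance + epsilon).

```python
import math

EPSILON = 1e-5  # hypothetical value for illustration


def reference_training(values):
    """Toy reference for one channel: returns (mean, plain variance)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


class ReferenceCheck:
    """Hypothetical base helper that mimics the inverting kernels."""

    def convert_saved_variance(self, variance):
        # Assumption: inversion means 1 / sqrt(variance + epsilon).
        return 1.0 / math.sqrt(variance + EPSILON)

    def expected_stats(self, values):
        mean, variance = reference_training(values)
        return mean, self.convert_saved_variance(variance)


class MKLDNNReferenceCheck(ReferenceCheck):
    """Overrides only the step that differs: the variance is kept as-is."""

    def convert_saved_variance(self, variance):
        return variance


values = [1.0, 2.0, 3.0, 4.0]
default_stats = ReferenceCheck().expected_stats(values)
mkldnn_stats = MKLDNNReferenceCheck().expected_stats(values)
print(default_stats, mkldnn_stats)
```

With this layout the subclass contains a single short method, so a reader can see at a glance where the two implementations diverge, while the shared comparison logic lives in one place.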

@luotao1 I think doing reference training in a separate function is a good idea. I will try to implement it.


tpatejko · 2018-05-02T14:39:12Z

@luotao1 Could you have a look at the changes in the unit tests? I refactored them the way you requested.

I also added the use_mkldnn attribute to batch norm's Python interface.

luotao1

LGTM! Thanks very much!

luotao1 added the Intel label Apr 17, 2018

tpatejko requested a review from luotao1 April 17, 2018 15:56

luotao1 reviewed Apr 20, 2018


tpatejko requested a review from tensor-tang April 23, 2018 07:24

luotao1 reviewed Apr 27, 2018


Tomasz Patejko added 14 commits May 2, 2018 15:55

Initial implementation of forward pass for MKLDNN batch norm

e369e1f

Added attributes for MKLDNN batch norm

1bfdbe7

MKLDNN batch norm forward pass passes unittest. Started working on backward

3c6cee8

Backward pass for MKLDNN batch norm added

2267387

MKLDNN batch norm: scoring added to forward pass

3fd81c2

MKLDNN batch norm: bias as input added; handling AnyLayout when kernel is looked up

cf66959

MKLDNN batch norm: python unit tests added; mkldnn tests removed

417f7f0

MKLDNN batch norm: changes required by cpplint

1dd85f0

MKLDNN batch norm: refactoring the operator

6d005e2

MKLDNN batch norm: saved variance inversed in backward pass for correct execution of MKLDNN unit tests

21c26e2

MKLDNN batch norm: refctoring, function for static/const cast to void* added

dffda9c

MKLDNN batch norm: remove AnyLayout from batch norm

3d26d72

MKLDNN batch norm: only NCHW format is supported. Unittests refactored

3a170e5

MKDNN batch norm: use_mkldnn added to attributes

46e119c

Tomasz Patejko added 12 commits May 2, 2018 15:55

MKLDNN batch norm: AnyLayout removed from unittest

e4f6f3e

MKLDNN batch norm: added CUDNN defines to batch norm

1f685d2

MKLDNN batch norm: undefined data_format variable corrected

9180dcb

MKLDNN batch norm: use_cudnn added, use of setUp method for configuring attributes

0e02e73

MKLDNN batch norm: added use_cudnn attribute to batch norm operator

6ddeea8

MKLDNN batch norm: correcting batch norm unit tests for MKLDNN

d8a36c6

MKLDNN batch norm: MKLDNN tests moved to another file; reverting changes for saved variance not being inverted

c7bbf77

Change default layout to NCHW

dce16c5

MKLDNN batch norm: init_kernel_type method added to unit tests

4c38be7

MKLDNN batch norm: style changes

a3ed441

MKLDNN batch norm: unit tests refactored

94714c8

MKLDNN batch norm: added use_mkldnn attribute to batch norm python interface

d9c8396

luotao1 approved these changes May 3, 2018


luotao1 merged commit 4a497b8 into PaddlePaddle:develop May 3, 2018

luotao1 added this to Done in Intel Optimization on Fluid May 7, 2018


    from test_batch_norm_op import TestBatchNormOpInference, TestBatchNormOpTraining, _reference_training, _reference_grad


    class TestMKLDNNBatchNormOpTraining(TestBatchNormOpTraining):

    class TestMKLDNN(TestConv2dOp):
        def init_kernel_type(self):
            self.use_mkldnn = True
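The init_kernel_type pattern in the fragment above can be sketched in isolation with plain unittest. The class names below are hypothetical and nothing here touches Paddle; the sketch only assumes that the base test calls the hook from setUp so that a subclass can switch the kernel by overriding one method.

```python
import unittest


class BaseOpTest(unittest.TestCase):
    """Hypothetical base test: default attributes are set, then the hook runs."""

    def setUp(self):
        self.use_mkldnn = False
        self.init_kernel_type()

    def init_kernel_type(self):
        # Default kernel: nothing to change.
        pass

    def test_attrs(self):
        # A real test would pass these attributes to the operator under test.
        attrs = {"use_mkldnn": self.use_mkldnn}
        self.assertIn("use_mkldnn", attrs)


class MKLDNNOpTest(BaseOpTest):
    """Inherits every test from the base class and only flips the kernel flag."""

    def init_kernel_type(self):
        self.use_mkldnn = True
```

Because the subclass inherits all test methods, every base test is rerun against the MKLDNN configuration without duplicating any test body.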

    places = [core.CPUPlace()]
    if core.is_compiled_with_cuda() and core.op_support_gpu("batch_norm"):
        places.append(core.CUDAPlace(0))

    for place in places:
        for data_format in ["NCHW", "NHWC"]:
            test_with_place(place, data_format, [2, 3, 4, 5])
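The loop above runs the same test for both data layouts. As a rough illustration of what changes between them, the NumPy sketch below computes per-channel batch statistics for either layout; batch_stats is a hypothetical helper, not a function from the PR or from Paddle.

```python
import numpy as np


def batch_stats(x, data_layout):
    """Per-channel mean and variance of a 4-D tensor in NCHW or NHWC layout."""
    if data_layout == "NCHW":
        axes = (0, 2, 3)  # reduce over batch, height, width
    elif data_layout == "NHWC":
        axes = (0, 1, 2)  # channels are the last axis
    else:
        raise ValueError("Unknown data layout: %s" % data_layout)
    return x.mean(axis=axes), x.var(axis=axes)


# Same data in both layouts: shape [2, 3, 4, 5] as NCHW, [2, 4, 5, 3] as NHWC.
x_nchw = np.random.RandomState(0).random_sample((2, 3, 4, 5)).astype(np.float32)
x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))
```

Both calls should yield one value per channel, and the statistics agree up to floating-point rounding, so a kernel that supports only NCHW can still be checked against NHWC reference data after a transpose.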
