Implementation of MKLDNN LRN #9123

Merged 4 commits on Mar 22, 2018
178 changes: 178 additions & 0 deletions paddle/fluid/operators/lrn_mkldnn_op.cc
@@ -0,0 +1,178 @@
/* Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserve.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */

#include "paddle/fluid/framework/tensor.h"
#include "paddle/fluid/operators/lrn_op.h"
#include "paddle/fluid/platform/mkldnn_helper.h"

namespace paddle {
namespace operators {

using paddle::framework::Tensor;
using paddle::platform::MKLDNNDeviceContext;

template <typename T>
class LRNMKLDNNOpKernel : public paddle::framework::OpKernel<T> {
public:
void Compute(const paddle::framework::ExecutionContext& ctx) const override {
PADDLE_ENFORCE(std::is_same<T, float>::value,
"MKLDNN LRN must use float data.");
PADDLE_ENFORCE(paddle::platform::is_cpu_place(ctx.GetPlace()),
"MKLDNN LRN must use CPUPlace.");

auto& dev_ctx = ctx.template device_context<MKLDNNDeviceContext>();
const auto& mkldnn_engine = dev_ctx.GetEngine();

auto x = ctx.Input<Tensor>("X");
auto out = ctx.Output<Tensor>("Out");
auto mid = ctx.Output<Tensor>("MidOut");

auto input_data = x->data<T>();
auto output_data = out->mutable_data<T>(ctx.GetPlace());
mid->mutable_data<T>(ctx.GetPlace());
@tensor-tang (Contributor), Mar 19, 2018:

What is this MidOut for in the MKL-DNN implementation?

Author:

According to the documentation of the LRN operator, it's an intermediate result that is computed in the forward pass and used in the backward pass. In practice, the naive CPU implementation sets it to the value of attribute k, and the MidOut tensor is validated by the LRN unit tests.

It's not really needed by the MKLDNN primitives. However, it is part of the LRN interface, as an input tensor, so I decided to fill it in the same way as it's done in the naive CPU implementation.

Contributor:

If it's not needed by the MKLDNN primitives, you can do what sequence_pool_op does, where "MaxIndex" is only used when "pooltype" == "MAX":

if (ctx->Attrs().Get<std::string>("pooltype") == "MAX") {
  PADDLE_ENFORCE(ctx->HasOutput("MaxIndex"),
                 "Output(MaxIndex) of SequencePoolOp should not be null.");
  ctx->SetOutputDim("MaxIndex", ctx->GetInputDim("X"));
}
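
For illustration only, here is a hedged sketch of how the same conditional pattern could look for LRN's MidOut in the op's InferShape, assuming an is_test-style attribute such as the one discussed further down this thread (the code merged in this PR keeps MidOut unconditional):

if (!ctx->Attrs().Get<bool>("is_test")) {  // hypothetical attribute, see the discussion below
  PADDLE_ENFORCE(ctx->HasOutput("MidOut"),
                 "Output(MidOut) of LRNOp should not be null when training.");
  ctx->SetOutputDim("MidOut", ctx->GetInputDim("X"));
}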


const std::string key = ctx.op().Output("Out");
const std::string key_src_memory = key + "@lrn_src_memory";
const std::string key_pd = key + "@lrn_pd";
const std::string key_workspace_memory = key + "@lrn_workspace_memory";

const int n = ctx.Attr<int>("n");
const float alpha = ctx.Attr<float>("alpha");
const float beta = ctx.Attr<float>("beta");
const float k = ctx.Attr<float>("k");

auto e_mid = framework::EigenTensor<T, 4>::From(*mid);
e_mid = e_mid.constant(k);
@tensor-tang (Contributor), Mar 19, 2018:

What is e_mid, and why do we need it?

Author:

It's an Eigen TensorMap mapped onto the data of the MidOut tensor. It's used to fill MidOut with the value of attribute k.
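
For readers unfamiliar with the Eigen tensor API, a minimal standalone illustration of the same trick (plain Eigen outside Paddle; the shape and the value used for k are made up):

#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  float buf[2 * 3 * 4 * 5];  // stands in for the MidOut tensor's buffer
  Eigen::TensorMap<Eigen::Tensor<float, 4>> e_mid(buf, 2, 3, 4, 5);
  e_mid = e_mid.constant(0.0001f);  // every element now holds the value of k
  return 0;
}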


auto dims = paddle::framework::vectorize2int(x->dims());

auto src_md = paddle::platform::MKLDNNMemDesc(
dims, mkldnn::memory::data_type::f32, mkldnn::memory::format::nchw);

auto dst_md = paddle::platform::MKLDNNMemDesc(
dims, mkldnn::memory::data_type::f32, mkldnn::memory::format::nchw);

auto forward_desc = mkldnn::lrn_forward::desc{mkldnn::prop_kind::forward,
mkldnn::lrn_across_channels,
src_md,
n,
alpha,
beta,
k};

auto forward_pd = std::make_shared<mkldnn::lrn_forward::primitive_desc>(
forward_desc, mkldnn_engine);

dev_ctx.SetBlob(key_pd, forward_pd);

auto src_memory_pd = mkldnn::memory::primitive_desc{src_md, mkldnn_engine};
auto src_memory = std::make_shared<mkldnn::memory>(
src_memory_pd, static_cast<void*>(const_cast<float*>(input_data)));

dev_ctx.SetBlob(key_src_memory, src_memory);
auto dst_memory = mkldnn::memory{{dst_md, mkldnn_engine},
static_cast<void*>(output_data)};

auto workspace_md = forward_pd->workspace_primitive_desc();
auto workspace_memory = std::make_shared<mkldnn::memory>(workspace_md);

dev_ctx.SetBlob(key_workspace_memory, workspace_memory);
Contributor:

This workspace_memory would always be allocated in the forward pass, which is not the best for performance.

@tpatejko (Author), Mar 19, 2018:

@tensor-tang, you are right: it is allocated in the forward pass and used in the backward pass. What is your suggestion? Would you like it to be allocated only once, when the workspace memory is not yet present in the device context?

@tensor-tang (Contributor), Mar 20, 2018:

I think you could try to get it first, then create it only if it's empty.
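
A minimal sketch of that get-or-create flow, reusing the dev_ctx.GetBlob/SetBlob calls already present in this kernel (illustrative only, not the code merged in this PR):

auto workspace_memory = std::static_pointer_cast<mkldnn::memory>(
    dev_ctx.GetBlob(key_workspace_memory));
if (workspace_memory == nullptr) {
  // allocate the workspace once and cache it for later iterations
  workspace_memory = std::make_shared<mkldnn::memory>(
      forward_pd->workspace_primitive_desc());
  dev_ctx.SetBlob(key_workspace_memory, workspace_memory);
}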


auto forward_op = mkldnn::lrn_forward{*forward_pd, *src_memory,
*workspace_memory, dst_memory};
Contributor:

By the way, this memory and this function are used only in the training phase, so maybe you should add a TODO to optimize this once you can get the phase, as a reminder.

Author:

@tensor-tang, good point! Do you know how to check whether we are in the training phase or the scoring phase?

Contributor:

You can follow the approach of batch_norm_op, which uses is_test:

if (!is_test) {
  // saved_xx is used just for this batch of data
  EigenVectorArrayMap<T> saved_mean_e(
      saved_mean->mutable_data<T>(ctx.GetPlace()), C);
  EigenVectorArrayMap<T> saved_variance_e(
      saved_variance->mutable_data<T>(ctx.GetPlace()), C);
  saved_mean_e.setZero();
  saved_variance_e.setZero();

@tpatejko (Author), Mar 19, 2018:

@luotao1 @tensor-tang, I do see your point, but in the case of LRN, is_test would become part of the user interface not only for MKLDNN LRN but also for the other implementations of LRN. Do you think adding such an attribute would be justified for the other implementations of the LRN operator?

You could make such a case for other operators as well. We could end up in a situation where each operator sets its own local is_test attribute to maintain information about the phase, when this information is in fact global (global in the sense of being maintained at the network level).

Would it make more sense to maintain the phase information in the network executor?

I could add a TODO comment saying that the blob should not be allocated when PaddlePaddle is in the testing phase.
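
For what it's worth, a hedged sketch of what such a guard could look like inside this forward kernel, assuming an is_test attribute is eventually added to the LRN op; the prop_kind and the construction of the forward primitive would have to change accordingly, and none of this is part of the merged PR:

const bool is_test = ctx.Attr<bool>("is_test");  // assumed attribute
auto prop = is_test ? mkldnn::prop_kind::forward_scoring
                    : mkldnn::prop_kind::forward_training;
// ... build forward_desc and forward_pd with prop as in the diff above ...
std::shared_ptr<mkldnn::memory> workspace_memory;
if (!is_test) {
  // the workspace only exists to speed up the backward pass, so skip it in inference
  workspace_memory = std::make_shared<mkldnn::memory>(
      forward_pd->workspace_primitive_desc());
  dev_ctx.SetBlob(key_workspace_memory, workspace_memory);
}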

Contributor:

> Do you think adding such an attribute would be justified for the other implementations of the LRN operator?

Yes, "is_test" is also suited to the other implementations of the LRN operator.

> You could make such a case for other operators as well. We could end up in a situation where each operator sets its own local is_test attribute to maintain information about the phase, when this information is in fact global (global in the sense of being maintained at the network level). Would it make more sense to maintain the phase information in the network executor?

Currently, framework/prune.cc uses this attribute in the inference phase:

void inference_optimize_impl(proto::ProgramDesc* input, int block_id) {
  auto* op_field = input->mutable_blocks(block_id)->mutable_ops();
  for (auto& op_desc : *op_field) {
    for (auto& attr : *op_desc.mutable_attrs()) {
      if (attr.name() == "is_test") {
        attr.set_b(true);
        break;
      }
    }
  }
}

> I could add a TODO comment saying that the blob should not be allocated when PaddlePaddle is in the testing phase.

Yes, you can either add a TODO comment or fix it in this PR.

Author:

@luotao1, thanks for your reply. I will add the attribute to the operator.

But wouldn't it be more useful to maintain the testing/training information at the network level instead of keeping it at the operator level?

Contributor:

We maintain the testing/training information at the network level as well:

def clone(self, for_test=False):
    """Clone the Program object
    Set for_test to False when we want to clone the program for training.
    Set for_test to True when we want to clone the program for testing.
    Args:
        for_test(bool): Some operators, such as batch_norm and drop_out ops,
            behave differently in training and testing. If for_test is True,
            the is_test attributes in these operators will be set to True for
            testing purposes, otherwise, they remain unchanged.
    Returns(Program):
        The cloned Program object.
    """
    p = Program()
    if for_test:
        p.desc = core.inference_optimize(self.desc)
    else:
        p.desc = core.ProgramDesc(self.desc)
    p.blocks = [Block(p, i) for i in xrange(self.desc.num_blocks())]
    p.sync_with_cpp()
    p.copy_param_info_from(self)
    return p

But if the operator level doesn't have the is_test information, how could this op behave differently in the test and train phases?
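
For reference, the attribute the author agreed to add above would presumably mirror the AddAttr calls visible in the lrn_op.cc diff further down; a hedged sketch (the exact wording and default are assumptions):

AddAttr<bool>("is_test",
              "(bool, default false) Set to true for inference only; lets kernels "
              "skip training-only work such as the MKLDNN LRN workspace.")
    .SetDefault(false);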


std::vector<mkldnn::primitive> pipeline = {forward_op};
mkldnn::stream(mkldnn::stream::kind::eager).submit(pipeline).wait();
}
};

template <typename T>
class LRNMKLDNNGradOpKernel : public paddle::framework::OpKernel<T> {
public:
void Compute(const paddle::framework::ExecutionContext& ctx) const override {
PADDLE_ENFORCE(std::is_same<T, float>::value,
"MKLDNN LRN must use float data.");
PADDLE_ENFORCE(paddle::platform::is_cpu_place(ctx.GetPlace()),
"MKLDNN LRN must use CPUPlace.");

auto x = ctx.Input<Tensor>("X");

auto out_grad = ctx.Input<Tensor>(framework::GradVarName("Out"));
auto x_grad = ctx.Output<Tensor>(framework::GradVarName("X"));

const std::string key = ctx.op().Input("Out");
const std::string key_src_memory = key + "@lrn_src_memory";
const std::string key_pd = key + "@lrn_pd";
const std::string key_workspace_memory = key + "@lrn_workspace_memory";

const int n = ctx.Attr<int>("n");
const float alpha = ctx.Attr<float>("alpha");
const float beta = ctx.Attr<float>("beta");
const float k = ctx.Attr<float>("k");

auto& dev_ctx = ctx.template device_context<MKLDNNDeviceContext>();
const auto& mkldnn_engine = dev_ctx.GetEngine();

auto x_grad_data = x_grad->mutable_data<T>(ctx.GetPlace());
auto out_grad_data = out_grad->data<T>();

auto dims = paddle::framework::vectorize2int(x->dims());

auto src_md = paddle::platform::MKLDNNMemDesc(
dims, mkldnn::memory::data_type::f32, mkldnn::memory::format::nchw);

auto diff_src_md = paddle::platform::MKLDNNMemDesc(
dims, mkldnn::memory::data_type::f32, mkldnn::memory::format::nchw);

auto diff_dst_md = paddle::platform::MKLDNNMemDesc(
dims, mkldnn::memory::data_type::f32, mkldnn::memory::format::nchw);

auto diff_dst_memory =
mkldnn::memory{{diff_dst_md, mkldnn_engine},
static_cast<void*>(const_cast<float*>(out_grad_data))};

auto diff_src_memory = mkldnn::memory{{diff_src_md, mkldnn_engine},
static_cast<void*>(x_grad_data)};

auto backward_desc = mkldnn::lrn_backward::desc{
mkldnn::lrn_across_channels, src_md, diff_src_md, n, alpha, beta, k};

auto forward_pd = dev_ctx.GetBlob(key_pd);

auto backward_pd = mkldnn::lrn_backward::primitive_desc{
backward_desc, mkldnn_engine,
*static_cast<mkldnn::lrn_forward::primitive_desc*>(forward_pd.get())};

std::shared_ptr<void> workspace_memory =
dev_ctx.GetBlob(key_workspace_memory);

auto src_memory = dev_ctx.GetBlob(key_src_memory);
auto backward_op = mkldnn::lrn_backward{
backward_pd, *static_cast<mkldnn::memory*>(src_memory.get()),
diff_dst_memory, *static_cast<mkldnn::memory*>(workspace_memory.get()),
diff_src_memory};

std::vector<mkldnn::primitive> pipeline = {backward_op};
mkldnn::stream(mkldnn::stream::kind::eager).submit(pipeline).wait();
}
};
} // namespace operators
} // namespace paddle

namespace ops = paddle::operators;

REGISTER_OP_KERNEL(lrn, MKLDNN, paddle::platform::CPUPlace,
ops::LRNMKLDNNOpKernel<float>);
REGISTER_OP_KERNEL(lrn_grad, MKLDNN, paddle::platform::CPUPlace,
ops::LRNMKLDNNGradOpKernel<float>);
44 changes: 43 additions & 1 deletion paddle/fluid/operators/lrn_op.cc
@@ -13,6 +13,9 @@ See the License for the specific language governing permissions and
limitations under the License. */

#include "paddle/fluid/operators/lrn_op.h"
#ifdef PADDLE_WITH_MKLDNN
#include "paddle/fluid/platform/mkldnn_helper.h"
#endif

namespace paddle {
namespace operators {
@@ -116,6 +119,26 @@ struct LRNGradFunctor<platform::CPUDeviceContext, T> {
template struct LRNGradFunctor<platform::CPUDeviceContext, float>;
template struct LRNGradFunctor<platform::CPUDeviceContext, double>;

namespace {
framework::OpKernelType GetExpectedLRNKernel(
const framework::ExecutionContext& ctx) {
framework::LibraryType library_{framework::LibraryType::kPlain};
#ifdef PADDLE_WITH_MKLDNN
if (library_ == framework::LibraryType::kPlain &&
platform::CanMKLDNNBeUsed(ctx)) {
library_ = framework::LibraryType::kMKLDNN;
}
#endif

std::string data_format = ctx.Attr<std::string>("data_format");
// TODO(pzelazko-intel): enable MKLDNN layout when it's ready
framework::DataLayout layout_ = framework::StringToDataLayout(data_format);
return framework::OpKernelType(
framework::ToDataType(ctx.Input<Tensor>("X")->type()), ctx.GetPlace(),
layout_, library_);
}
} // namespace

class LRNOp : public framework::OperatorWithKernel {
public:
using framework::OperatorWithKernel::OperatorWithKernel;
@@ -135,6 +158,11 @@ class LRNOp : public framework::OperatorWithKernel {
ctx->SetOutputDim("MidOut", x_dim);
ctx->ShareLoD("X", /*->*/ "Out");
}

framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext& ctx) const override {
return GetExpectedLRNKernel(ctx);
}
};

template <typename T>
@@ -176,6 +204,16 @@ class LRNOpMaker : public framework::OpProtoAndCheckerMaker {
"beta is the power number.")
.SetDefault(0.75)
.GreaterThan(0.0);
AddAttr<bool>("use_mkldnn",
"(bool, default false) Only used in mkldnn kernel")
.SetDefault(false);
AddAttr<std::string>(
"data_format",
"(string, default NCHW) Only used in "
"An optional string from: \"NHWC\", \"NCHW\". "
"Defaults to \"NHWC\". Specify the data format of the output data, "
"the input will be transformed automatically. ")
.SetDefault("AnyLayout");

AddComment(R"DOC(
Local Response Normalization Operator.
@@ -223,8 +261,12 @@ class LRNOpGrad : public framework::OperatorWithKernel {
auto x_dims = ctx->GetInputDim("X");
ctx->SetOutputDim(framework::GradVarName("X"), x_dims);
}
};

framework::OpKernelType GetExpectedKernelType(
const framework::ExecutionContext& ctx) const override {
return GetExpectedLRNKernel(ctx);
}
};
} // namespace operators
} // namespace paddle

10 changes: 10 additions & 0 deletions python/paddle/fluid/tests/unittests/test_lrn_op.py
@@ -87,5 +87,15 @@ def test_check_grad_normal(self):
        self.check_grad(['X'], 'Out', max_relative_error=0.01)


class TestLRNMKLDNNOp(TestLRNOp):
Contributor:

There are two algorithms, ACROSS_CHANNELS and WITHIN_CHANNEL, but TestLRNMKLDNNOp here only tests the default one.

Author:

Thanks for this remark, @luotao1. PaddlePaddle seems to implement LRN so that it works across channels. For MKLDNN, the WITHIN_CHANNEL option does not work in the backward pass.

I decided to remove the "algorithm" attribute and fix the LRN algorithm to ACROSS_CHANNELS.

    def get_attrs(self):
        attrs = TestLRNOp.get_attrs(self)
        attrs['use_mkldnn'] = True
        return attrs

    def test_check_output(self):
        self.check_output(atol=0.002)


if __name__ == "__main__":
unittest.main()