
add the LinearChainCrf operator. #5084

Merged · 21 commits · Nov 1, 2017

Conversation

@lcy-seso (Contributor) commented Oct 25, 2017

fixes #4168
fixes #5250

Currently, this operator only runs on CPU.

  1. This operator has been manually tested against the old CRFLayer; their computation results show no differences.
  2. Currently, the linear chain CRF operator does not backpropagate the gradients it receives, so it can only be used at the end of the network. This can be easily fixed.
  3. The following things still need to be optimized:
     - It is better to use double in the unit test for this operator, but currently I run into a core dump that I haven't investigated yet. I will fix this later.
     - Some local intermediate variables could be optimized to avoid frequently allocating/freeing memory.

PADDLE_ENFORCE_LT(begin_idx, end_idx,
"Begin index must be less than end index.");
"The start row index must be less than the end row index.");
Contributor

s/less/smaller/g

Contributor Author

Done.

// Right now, we just bet that sum won't be zero. If this really happens, we
// will figure out what should be done then.
PADDLE_ENFORCE(sum,
"The unnormalized probabilites of all possible unfinished "
Contributor

These are not probabilities. Normalization is performed for numerical stability.

Contributor Author

The sum is used to normalize one row of alpha to avoid overflow or underflow, but one row of Alpha (in the forward computation) / Beta (in the backward computation) is still meaningful (alpha dotmul beta is also meaningful).

Beta is calculated in "reverse order" compared to Alpha (not a mathematically precise wording). They both capture unnormalized probabilities of unfinished sequences, and the accumulated row sums form the normalization factor.

I try to make the warning message meaningful.
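(For context, a minimal standalone sketch of the recursion being discussed, assuming x_exps and w_exps already hold the exponentiated emission and transition weights and ignoring the operator's special start/stop transition rows; this is not the PR's exact code:)

#include <cstddef>
#include <vector>

// Rescale one row so that it sums to 1; the discarded row sums
// accumulate into the normalization factor mentioned above.
template <typename T>
void NormalizeL1(T* x, size_t len) {
  T sum = 0.;
  for (size_t i = 0; i < len; ++i) sum += x[i];
  for (size_t i = 0; i < len; ++i) x[i] /= sum;
}

// Forward recursion: alpha[k * tag_num + i] holds the (rescaled) score of
// all partial tag sequences ending at position k with tag i.
template <typename T>
void ForwardAlpha(const std::vector<T>& x_exps,  // [seq_len x tag_num]
                  const std::vector<T>& w_exps,  // [tag_num x tag_num]
                  size_t seq_len, size_t tag_num, std::vector<T>* alpha) {
  alpha->assign(seq_len * tag_num, static_cast<T>(0));
  for (size_t i = 0; i < tag_num; ++i) (*alpha)[i] = x_exps[i];
  NormalizeL1(alpha->data(), tag_num);
  for (size_t k = 1; k < seq_len; ++k) {
    for (size_t i = 0; i < tag_num; ++i) {
      T sum = 0.;
      for (size_t j = 0; j < tag_num; ++j)
        sum += (*alpha)[(k - 1) * tag_num + j] * w_exps[j * tag_num + i];
      (*alpha)[k * tag_num + i] = x_exps[k * tag_num + i] * sum;
    }
    // Without this per-row rescaling the repeated products quickly
    // overflow or underflow.
    NormalizeL1(alpha->data() + k * tag_num, tag_num);
  }
}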

PADDLE_ENFORCE(sum,
"The unnormalized probabilites of all possible unfinished "
"sequences must be greater than 0.");
for (size_t i = 0; i < len; ++i) x[i] /= sum;
Contributor

I think we should retain Wei Xu's code here, because multiplication needs fewer CPU cycles than division.
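(For concreteness, the retained multiplicative form would look like this — the same hypothetical NormalizeL1 sketch as above, with the division hoisted out of the loop:)

// One division up front, then only multiplications inside the loop.
template <typename T>
void NormalizeL1(T* x, size_t len) {
  T sum = 0.;
  for (size_t i = 0; i < len; ++i) sum += x[i];
  T s = static_cast<T>(1.) / sum;
  for (size_t i = 0; i < len; ++i) x[i] *= s;
}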

"batch. A two dimensional tensor with shape [N x D], "
"denoted as \f$\alpha\f$. \f$\alpha$\f is a memo table used to "
"calculate the normalization factor in CRF. \f$\alpha[k, v]$\f stores "
"the unnormalized probabilites of all possible unfinished sequences of "
Contributor

I think ``unnormalized probabilites'' is not a proper interpretation.

Contributor Author (@lcy-seso, Oct 26, 2017)

The "unnormalized probabilities" is just from the paper.
image

Contributor

I am guessing you are referring to this paper: http://cseweb.ucsd.edu/~elkan/250Bwinter2012/loglinearCRFs.pdf

It would be valuable to cite this in the comments for reference.

Contributor Author

You are quite right. It is this paper. I will add it. Thank you very much for all your comments.

Contributor Author

Done. I added it to the operator's comments.

"tensor with shape [S x 1], where S is the sequence number in a "
"mini-batch. "
"Note: S is equal to the sequence number in a mini-batch. The "
"output "
Contributor

Merge lines 91 and 92 into a single line.

Contributor Author

Done.

}
beta_value[k * tag_num + i] = sum;
}
NormalizeL1<T>(beta_value + k * tag_num, tag_num);
Contributor

Restore comment "// normalizeL1 is to avoid underflow or overflow at (**)"

Contributor Author

Done.

auto alpha_mat = EigenMatrix<T>::From(*alpha);
auto beta_mat = EigenMatrix<T>::From(*beta);
auto x_grad_mat = EigenMatrix<T>::From(*emission_grad);
auto* place = ctx.GetEigenDevice<platform::CPUPlace>();
Contributor

change auto* to auto

Contributor Author

Using auto* is suggested by our style guide.

auto row_sum = prob.sum(Eigen::DSizes<int, 1>(1))
.reshape(Eigen::DSizes<int, 2>(seq_length, 1))
.broadcast(Eigen::DSizes<int, 2>(1, tag_num));
x_grad_mat.device(*place) = prob / row_sum;
Contributor

change *place to place

Contributor Author

auto* is suggested by our style, so I do not change this.

# the linear_chain_crf operator only supports sequence (LoD level = 1)
lod = [[0]]
for i in range(SEQ_NUM):
    lod[-1].append(lod[-1][-1] + random.randint(1, MAX_SEQ_LEN))
Contributor

Sequences of length 1, 2, and 3 must be included in the test; these are high-risk boundary conditions.

Contributor Author

I have tested all these cases offline. Actually, I agree the unit test needs some fixed boundary cases. I will fix this.

Contributor Author

Thank you. I will fix the unit test in the next PR, because the unit-test framework in the latest develop branch seems to have a problem.

from op_test import OpTest


class LinearChainCrfForward(object):
Contributor

A test case of LinearChainCrfForward() should also be included to make sure the implementation is correct. Furthermore, it is also useful for detecting unexpected errors introduced by future modifications. Reference: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/gserver/tests/test_CRFLayerGrad.cpp#L56

Contributor Author (@lcy-seso, Oct 26, 2017)

The forward computation has already been added to the unit test. I can follow how the original code does the check later.
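(For illustration, a standalone sketch of the kind of check being suggested: compare the forward recursion's partition function against brute-force enumeration over all tag sequences. It ignores the operator's start/stop transition weights and is not the gserver test's actual code; all names here are hypothetical.)

#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Brute-force partition function: sum exp(score) over every possible tag
// sequence. Feasible only for tiny seq_len and tag_num, which is exactly
// what makes it a good reference for a unit test.
double BruteForceZ(const std::vector<double>& x,  // [seq_len x tag_num]
                   const std::vector<double>& w,  // [tag_num x tag_num]
                   size_t seq_len, size_t tag_num) {
  size_t total = 1;
  for (size_t k = 0; k < seq_len; ++k) total *= tag_num;
  double z = 0.;
  for (size_t code = 0; code < total; ++code) {
    size_t c = code, prev = 0;
    double score = 0.;
    for (size_t k = 0; k < seq_len; ++k) {
      size_t tag = c % tag_num;
      c /= tag_num;
      score += x[k * tag_num + tag];
      if (k > 0) score += w[prev * tag_num + tag];
      prev = tag;
    }
    z += std::exp(score);
  }
  return z;
}

// The same quantity via the forward (alpha) recursion.
double ForwardZ(const std::vector<double>& x, const std::vector<double>& w,
                size_t seq_len, size_t tag_num) {
  std::vector<double> alpha(tag_num), next(tag_num);
  for (size_t i = 0; i < tag_num; ++i) alpha[i] = std::exp(x[i]);
  for (size_t k = 1; k < seq_len; ++k) {
    for (size_t i = 0; i < tag_num; ++i) {
      double sum = 0.;
      for (size_t j = 0; j < tag_num; ++j)
        sum += alpha[j] * std::exp(w[j * tag_num + i]);
      next[i] = std::exp(x[k * tag_num + i]) * sum;
    }
    alpha.swap(next);
  }
  double z = 0.;
  for (size_t i = 0; i < tag_num; ++i) z += alpha[i];
  return z;
}

int main() {
  const size_t seq_len = 3, tag_num = 2;
  std::vector<double> x = {0.1, 0.5, -0.3, 0.2, 0.4, -0.1};
  std::vector<double> w = {0.2, -0.4, 0.1, 0.3};
  assert(std::fabs(BruteForceZ(x, w, seq_len, tag_num) -
                   ForwardZ(x, w, seq_len, tag_num)) < 1e-9);
  return 0;
}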

Contributor Author (@lcy-seso) left a comment

follow comments.

Linear chain CRF is a special case of CRF that is useful for sequence labeling
task. Sequence labeling tasks do not assume a lot of conditional
independences among inputs. They only concern about the input and the output
being linear sequences. Thus, the graph model of CRF is a simple chain or
Contributor Author

Done.

: public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& ctx) const override {
PADDLE_ENFORCE(platform::is_cpu_place(ctx.GetPlace()),
Contributor Author

Yes, I am doing this. I will make this operator run on GPU by simply copying the inputs to GPU.


// Beta is the memo table used in dynamic programming to calculate the
// backwark vectors. For a backward vector i (the i-th row of beta), it
// captures the unnormalized probabilities of partial sequences starting at
Contributor Author

As explained above, each element of alpha and beta, and also alpha dotmul beta, is meaningful. I try to make the comments meaningful.

for (size_t i = 0; i < tag_num; ++i) {
T sum = 0.;
for (size_t j = 0; j < tag_num; ++j) {
sum += w_exps[(i + state_trans_base_idx) * tag_num + j] *
Contributor Author

Done.

// resized to its correct size in the function Compute.
ctx->SetOutputDim("LogLikelihood", {emission_dims[0], 1});

ctx->ShareLoD("Emission", /*->*/ "EmissionExps");
Contributor

If the output is not a LoDTensor, this line can be removed.

PADDLE_ENFORCE(ctx->HasOutput(framework::GradVarName("Emission")),
"Output(Emission@GRAD) should be not null.");
PADDLE_ENFORCE(ctx->HasOutput(framework::GradVarName("Transition")),
"Output(Transition@GRAD) should be not null.");
Contributor

Remove these two checks. The input gradients are not necessarily calculated every time; just like in the old framework, sometimes one branch of the network does not need to do backward.

Contributor Author

Done.


ctx->SetOutputDim(framework::GradVarName("Emission"), emission_exps_dims);
ctx->SetOutputDim(framework::GradVarName("Transition"),
transition_exps_dims);
Contributor

if (ctx->HasOutput(framework::GradVarName("Emission"))) {
   ...
}
if (ctx->HasOutput(framework::GradVarName("Transition"))) {
   ...
}

see mul_op.cc

Contributor Author

Done.

const Tensor* transition_exps, const Tensor* alpha,
const Tensor* label, Tensor* beta,
Tensor* transition_grad,
Tensor* emission_grad) const {
Contributor

For the const input arguments: const Tensor&,
For the output arguments: Tensor*

Contributor Author

Done. Thank you for pointing this out.

@@ -0,0 +1,54 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
Contributor

Will this header file also be used for the CRF decoding code? If not, it seems this header file can be removed in this PR.

Contributor Author

After our discussion, I will keep this header, move the function Compute into it, and add a GPU kernel function.

Contributor Author

Done.

for (size_t i = 0; i < tag_num; ++i) {
T sum = 0.;
for (size_t j = 0; j < tag_num; ++j) {
sum += alpha_value[(k - 1) * tag_num + j] *
Contributor (@pengli09, Oct 27, 2017)

Wei added a comment "// (*)" at the end of this line. Line 321 refers to this line by this mark.

for (size_t i = 0; i < tag_num; ++i) {
T sum = 0.;
for (size_t j = 0; j < tag_num; ++j) {
sum += w_exps[(i + state_trans_base_idx) * tag_num + j] *
Contributor

Wei added a mark "// (**)" at the end of this line, which is referred to by line 495.

framework::make_ddim({static_cast<int>(batch_size), 1}),
platform::CPUPlace());

auto place = ctx.GetEigenDevice<platform::CPUPlace>();
Contributor

Change auto to auto*, and change place to *place accordingly in the following lines.

Contributor Author

Done.

Contributor (@abhinavarora) left a comment

In addition to @pengli09's comments, there are some style issues. Apart from that, the PR looks very solid. Great job!

PADDLE_ENFORCE_LE(end_idx, dims_[0], "The end row index is out of bound.");
PADDLE_ENFORCE_LT(
begin_idx, end_idx,
"The start row index must be smaller than the end row index.");
Contributor

smaller -> lesser

Contributor Author

Done.

"See more details in the operator's comments.");
AddInput(
"Label",
"(LoDTensor, default: LoDTensor<int>). The groundtruth which is a 2-D "
Contributor

There should be a space in groundtruth. It should be ground truth.

Contributor Author

Done. Thank you.

"Emission",
"(LoDTensor, default: LoDTensor<float>). "
"The unscaled emission weight matrix for the linear chain CRF. "
"This input is a LoDTensor with shape [N x D] where N is the total "
Contributor

Can we simply write that N is the size of the minibatch?

Contributor Author

Yes, I think that is fine. Thank you. Done.

"batch. A two dimensional tensor with shape [N x D], "
"denoted as \f$\alpha\f$. \f$\alpha$\f is a memo table used to "
"calculate the normalization factor in CRF. \f$\alpha[k, v]$\f stores "
"the unnormalized probabilites of all possible unfinished sequences of "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am guessing you are referring to this paper: http://cseweb.ucsd.edu/~elkan/250Bwinter2012/loglinearCRFs.pdf

It would be valuable to cite this in the comments for reference.


Linear chain CRF is a special case of CRF that is useful for sequence labeling
task. Sequence labeling tasks do not assume a lot of conditional
independences among inputs. They only concern about the input and the output
Contributor

We can write this sentence as: The only constraint they impose is that the input and output must be linear sequences

Contributor Author

Thank you. Done.

ctx->SetOutputDim("Alpha", emission_dims);
ctx->SetOutputDim("EmissionExps", emission_dims);
ctx->SetOutputDim("TransitionExps", transition_dims);
// (TODO caoying) This is tricky. The 1st dimension of Output(LogLikelihood)
Contributor

You can follow the Google C++ stye guide for TODO comments. http://google.github.io/styleguide/cppguide.html#TODO_Comments

Contributor Author

Done, thank you.
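(In the guide's format the name goes in parentheses right after TODO, so the quoted comment would begin like this; the trailing ellipsis marks text truncated in the quote above:)

// TODO(caoying): This is tricky. The 1st dimension of Output(LogLikelihood)
// ...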


protected:
// Explicitly set that the data type of output of the linear_chain_crf_grad
// operator is determined by its input: graidents of LogLikelihood.
Contributor

Typo: graidents -> gradients

Contributor Author

Done. Thank you.

}
} // namespace

using framework::LoDTensor;
Contributor

This violates the Google Style Guide. For more details check out https://google.github.io/styleguide/cppguide.html#Namespaces

auto* label = ctx.Input<LoDTensor>("Label");

auto in_lod = emission_weights->lod();
PADDLE_ENFORCE(in_lod.size(), "Input(Emission) is not a sequence.");
Contributor

Error message language could be Input(Emission) must be a sequence

Contributor Author

Done.

auto in_lod = emission_weights->lod();
PADDLE_ENFORCE(in_lod.size(), "Input(Emission) is not a sequence.");

// TODO(caoying) The checks related to LoD information should be
Contributor

You can follow the Google C++ stye guide for TODO comments. http://google.github.io/styleguide/cppguide.html#TODO_Comments

@lcy-seso (Contributor Author)

I have not finished yet, so please do not begin the review. Thank you all.

@lcy-seso (Contributor Author) commented Oct 31, 2017

  1. After discussion with @qingqing01, to make this operator also run on a GPU device, in this PR I register a GPU kernel that simply copies the inputs and outputs between the CPU and GPU devices. This copying introduces many local variables and makes the code look less clean. I will keep optimizing it in a following PR.
  2. Unfortunately, after I merged the latest develop branch, the previously passing gradient check fails. Many related issues have been created, and some of the operators have been removed from the unit tests (for example, gru_unit_op, recurrent_op, lstm_unit_op, softmax_with_cross_entropy). It seems that some changes to the gradient-check framework make the unit test of this operator fail. I created issue #5250.
  3. I will enhance the unit test by adding more fixed boundary cases in the next PR.

Thank you all for all of the comments ~

Contributor (@qingqing01) left a comment

I approve this PR. The remaining enhancements, if any, can be done in the next PR.

if (platform::is_gpu_place(ctx.GetPlace())) {
emission_weights = &emission_weight_tensor;
transition_weights = &transition_weight_tensor;
label = &label_tensor;
Contributor

To shorten the code, the LoDTensor xx and Tensor xx declarations from line 72 to line 86 can be removed; instead, use new LoDTensor() and new Tensor() here for the pointers from line 72 to line 86, and delete these pointers at the end.


Tensor emission_row_max;
emission_row_max.mutable_data<T>(
framework::make_ddim({static_cast<int>(batch_size), 1}),
Contributor

static_cast<int64_t>

The number type in DDim is int64_t.
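(Applied to the quoted call, the suggested fix would read:)

Tensor emission_row_max;
emission_row_max.mutable_data<T>(
    framework::make_ddim({static_cast<int64_t>(batch_size), 1}),
    platform::CPUPlace());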

@@ -38,8 +39,8 @@ class CrossEntropyOp : public framework::OperatorWithKernel {
"If Attr(soft_label) == true, the 2nd dimension of "
"Input(X) and Input(Label) should be equal.");
} else {
PADDLE_ENFORCE_EQ(label_dims[1], 1,
"If Attr(soft_label) == false, the 2nd dimension of "
PADDLE_ENFORCE_EQ(label_dims[1], 1UL,
Contributor

The dimension type is int64_t.

Contributor (@pengli09) left a comment

Approve as requested by @lcy-seso. However, there are still a few places that need to be updated/fixed. Please revise the code once you have time.

@lcy-seso merged commit 8401039 into PaddlePaddle:develop Nov 1, 2017
@lcy-seso (Contributor Author) commented Nov 1, 2017

I will enhance the unit test and refine the code in a following PR. Thanks for all the comments.

@lcy-seso deleted the crf branch November 1, 2017 03:35