Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix 2078 #2165

Merged
merged 4 commits into from
May 22, 2017
Merged

Fix 2078 #2165

merged 4 commits into from
May 22, 2017

Conversation

pkuyym
Copy link
Contributor

@pkuyym pkuyym commented May 16, 2017

fixes #2078

@pkuyym
Copy link
Contributor Author

pkuyym commented May 16, 2017

@guoshengCS Thanks for testing and document writing.

this->storeLocalValues();
std::vector<std::string> buffers;
paddle::str::split(name, '.', &buffers);
auto it = this->values_.find(buffers[buffers.size() - 1]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change buffers[buffers.size() - 1] to buffers.back()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


private:
void storeLocalValues() const {
CHECK_GT(numOutputSegments_, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change CHECK_GT to CHECK_GE, numOutputSegments_ can be 0 in practice. For example, the label sequence is O O O O.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

private:
void storeLocalValues() const {
CHECK_GT(numOutputSegments_, 0);
CHECK_GT(numLabelSegments_, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change CHECK_GT to CHECK_GE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

void storeLocalValues() const {
CHECK_GT(numOutputSegments_, 0);
CHECK_GT(numLabelSegments_, 0);
double precision = (double)numCorrect_ / numOutputSegments_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change it to double precision = !numOutputSegments_ ? 0 : (double)numCorrect_ / numOutputSegments_;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

CHECK_GT(numOutputSegments_, 0);
CHECK_GT(numLabelSegments_, 0);
double precision = (double)numCorrect_ / numOutputSegments_;
double recall = (double)numCorrect_ / numLabelSegments_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change it to double recall = !numLabelSegments_ ? 0 : (double)numCorrect_ / numLabelSegments_;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

To make it clear, let's illustrate by a NER example.
Assuming that there are two named entity types including ORG and PER which are called 'chunk type' here,
if 'IOB' scheme were used, the label set will be extended to a set including B-ORG, I-ORG, B-PER, I-PER and O,
in which B-ORG for begining of ORG and I-ORG for end of ORG.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change end of ORG to inside of

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


'plain' means the whole chunk must contain exactly the same chunk label.
Realizing that the number of is chunk type is 2 and number of tag type is 2, it is easy to validate this.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should change the example to make chunk type and tag type of different number. In this case, both of them are 2 and may fail to help the users to clarify their misunderstanding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


For each label in the label sequence, we have:
To use chunk evaluator, the construction of label dict should obey the following rules:
(1) Use one of the listed labelling schemes. These schemes differ in ways indicating chunk boundry.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should define "chunk type", "tag type" before the following table. And we'd better have a running example to show how to label the words using different schemes. In fact, the following table is the protocol for assigning tag types, not the definition of the schemes. Therefore, I think we also need another table for the definitions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

The total number of different labels is numTagType*numChunkTypes+1.
We support 4 labelling scheme.
The tag type for each of the scheme is shown as follows:
(2) Map can be done correctly by the listed equations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change Map to Mapping
Change can be done to is done. I think is done is better. Because if can be done was used, it may mislead the users to think this is only one of the feasible options. However, this is the only feasible one because we hard coded it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

IOB 0 1 - -
IOE - 0 1 -
IOBES 0 1 2 3
Continue the NER example, and the label dict should like this to satify above equations:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change like to look like

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor Author

@pkuyym pkuyym left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.Override getTypeImpl instead of getType.
2.I think holding precision, recall and F1-score into an unified map could make the code cleaner and easier to maintain and the extra computation cost is trivial.
3.Revise the document following the review comments.

Copy link
Contributor

@pengli09 pengli09 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except getNames(). Please consult other members to make the final decision.

@luotao1
Copy link
Contributor

luotao1 commented May 18, 2017

请参考 http://www.paddlepaddle.org/develop/doc_cn/howto/dev/write_docs_cn.html 看一下生成出的文档格式是否正确。

IOB Two labels for chunk type X, B-X for chunk begining and I-X for chunk inside.
IOE Two labels for chunk type X, E-X for chunk ending and I-X for chunk inside.
IOBES Four labels for chunk type X, B-X for chunk begining, I-X for chunk inside, E-X for chunk end and S-X for single word chunk.
.. code-block:: python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

366行可以去掉,只需要一个.. code-block::python即可。下同。
请生成下文档看下显示是否正确。目前粗看,会有一些问题。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


For each label in the label sequence, we have:
The construction of label dict should obey the following rules:
(1) Use one of the listed labelling schemes. These schemes differ in ways indicating chunk boundry.

.. code-block:: python
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why code-block is python? It seems a plain text?

..[SPACE]code-block:: [language]
[EMPTY_LINE]
[SPACE][SPACE][SPACE]Your texts.
.. code-block:: text

   abc
   def

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See this documentation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Collaborator

@reyoung reyoung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically LGTM.

@pkuyym pkuyym merged commit d3e003b into PaddlePaddle:develop May 22, 2017
@pkuyym pkuyym deleted the fix-2078 branch May 24, 2017 06:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Evaluator cannot return the expected metrics.
4 participants