Add an example of NER task in fluid style. For issue #644 #689

jshower · 2018-03-06T08:27:08Z

This PR adds an example of the NER task in fluid style. The model structure is similar to that of the V2 version.

fixes #644

Add readme for the NER task.

guoshengCS · 2018-03-09T03:01:54Z

fluid/sequence_tagging_for_ner/README.md

@@ -0,0 +1,120 @@
+# 命名实体识别


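If this is identical to the original v2 example, I suggest deleting it. The same goes for some of the other data and script files. You can simply point readers to the original content.

There are still some differences in content, so we kept it for readers' convenience.

The README has too much duplicated content. In an official GitHub repo, text in the background section that is exactly the same must not be copied directly.
Where the text is identical, please add a link instead. Do not copy it directly.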

guoshengCS · 2018-03-09T03:10:27Z

fluid/sequence_tagging_for_ner/network_conf.py

+        learning_rate=mix_hidden_lr)
+    hidden_para_attr = fluid.ParamAttr(
+        initializer=NormalInitializer(
+            loc=0.0, scale=(1. / math.sqrt(hidden_dim) / 3), seed=0),


Let's not fix the seed in the initializer.

Removed.

guoshengCS · 2018-03-09T06:05:35Z

fluid/sequence_tagging_for_ner/network_conf.py

+import paddle.fluid as fluid
+from paddle.fluid.initializer import NormalInitializer
+from utils import logger, load_dict, get_embedding
+import math


Please follow https://www.python.org/dev/peps/pep-0008/#imports to adjust the import order and add blank lines where appropriate.

Imports should be grouped in the following order: standard library imports; related third party imports; local application/library specific imports. You should put a blank line between each group of imports.
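
To illustrate the grouping rule quoted above, here is a minimal sketch that arranges the four import lines from this diff into the three PEP 8 groups. The `STDLIB` and `LOCAL` sets and both helper functions are assumptions written for this note (a real tool would detect the groups instead of hard-coding them); any module not listed is treated as third party.

```python
# Sketch: arrange import lines into PEP 8 groups (stdlib, third party, local).
# STDLIB and LOCAL are hard-coded assumptions for this example only.
STDLIB = {"math", "os"}
LOCAL = {"utils", "reader", "network_conf"}


def top_level_module(line):
    """Return the top-level module named by an import line."""
    # "import a.b as c" -> "a"; "from a.b import c" -> "a"
    return line.split()[1].split(".")[0]


def group_imports(lines):
    """Return the lines ordered by group, with a blank line between groups."""
    groups = ([], [], [])  # stdlib, third party, local
    for line in lines:
        module = top_level_module(line)
        if module in STDLIB:
            groups[0].append(line)
        elif module in LOCAL:
            groups[2].append(line)
        else:
            groups[1].append(line)
    # Keep the original relative order inside each group; drop empty groups.
    return "\n\n".join("\n".join(g) for g in groups if g)


original = [
    "import paddle.fluid as fluid",
    "from paddle.fluid.initializer import NormalInitializer",
    "from utils import logger, load_dict, get_embedding",
    "import math",
]
print(group_imports(original))
```

With these inputs, `import math` comes first, the two `paddle` imports form the middle group, and the `utils` import comes last, each group separated by one blank line.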

guoshengCS · 2018-03-09T06:07:06Z

fluid/sequence_tagging_for_ner/train.py

+import reader
+import os
+import math
+import numpy as np


Please follow https://www.python.org/dev/peps/pep-0008/#imports to adjust the import order and add blank lines where appropriate.

Imports should be grouped in the following order: standard library imports; related third party imports; local application/library specific imports. You should put a blank line between each group of imports.

guoshengCS · 2018-03-09T06:16:33Z

fluid/sequence_tagging_for_ner/train.py

+                      ", Recall " + str(recall) + ", F1_score" + str(f1_score))
+            batch_id = batch_id + 1
+
+        pass_precision, pass_recall, pass_f1_score = test(


For the training data, the metrics accumulated since chunk_evaluator.reset() can be obtained with chunk_evaluator.eval(), so there is no need to run the test routine separately. See https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/book/test_label_semantic_roles.py#L222 for reference.
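
The reset, accumulate, then eval pattern described in this comment can be sketched in plain Python. The `ChunkMetrics` class below is a hypothetical stand-in written for this note, not Paddle's evaluator, and the per-batch counts are made-up numbers. It only shows why pass-level precision, recall and F1 fall out of the per-batch counts without a second run over the training data.

```python
class ChunkMetrics:
    """Hypothetical stand-in for an accumulating chunk evaluator."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Clear the counters at the start of each pass.
        self.num_infer = 0    # chunks predicted by the model
        self.num_label = 0    # chunks in the gold labels
        self.num_correct = 0  # predicted chunks that match the gold labels

    def update(self, num_infer, num_label, num_correct):
        # Add one batch's counts to the running totals.
        self.num_infer += num_infer
        self.num_label += num_label
        self.num_correct += num_correct

    def eval(self):
        # Precision, recall and F1 over everything seen since reset().
        correct = float(self.num_correct)
        precision = correct / self.num_infer if self.num_infer else 0.0
        recall = correct / self.num_label if self.num_label else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        return precision, recall, f1


metrics = ChunkMetrics()
metrics.reset()
# (num_infer, num_label, num_correct) per batch; illustrative numbers only.
for batch_counts in [(10, 8, 6), (10, 12, 9)]:
    metrics.update(*batch_counts)
pass_precision, pass_recall, pass_f1_score = metrics.eval()
print(pass_precision, pass_recall, pass_f1_score)
```

Because the counters are summed before dividing, the pass-level figures are computed over all chunks seen in the pass rather than averaged across batches.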

guoshengCS · 2018-03-09T06:30:18Z

fluid/sequence_tagging_for_ner/train.py

+        print("[TestSet] pass_id:" + str(pass_id) + " pass_precision:" + str(
+            pass_precision) + " pass_recall:" + str(pass_recall) +
+              " pass_f1_score:" + str(pass_f1_score))
+


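model_save_dir is not used. Please add the model-saving code; see https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepASR/train.py#L246 for reference.

Fixed.

guoshengCS · 2018-03-09T06:34:14Z

Please also add infer.py and the inference code; see https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepASR/infer.py for reference.

Please also add GPU support and multi-GPU/multi-thread code; see https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepASR/model_utils/model.py#L85 for reference.

Also, would it be possible to verify and report the convergence behavior and the learning results? Thanks.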

remove redundant code

jshower · 2018-03-13T12:46:26Z

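Fixed the issues raised above: the model was not saved, infer.py was missing, parallelism and GPU/CPU selection were missing, and the import formatting did not follow the convention. Some files are fairly similar to the original V2 example, but they differ in quite a few details, so we chose to keep them so that users can use them directly.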

jshower · 2018-03-13T13:15:53Z

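While adding the multi-threaded version of this model, I ran into a problem that appears to be a bug in Paddle's multi-threading under specific conditions. I have filed an issue; please see #732.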

guoshengCS · 2018-03-15T11:45:34Z

fluid/sequence_tagging_for_ner/README.md

+    1. 输入文本的词典
+    2. 为词典中的词语提供预训练好的词向量
+    2. 标记标签的词典
+   标记标签词典已附在`data`目录中，对应于`data/target.txt`文件。输入文本的词典以及词典中词语的预训练的词向量来自：[Stanford CS224d](http://cs224d.stanford.edu/)课程作业。**为运行本例，请首先在`data`目录下运行`download.sh`脚本下载输入文本的词典和预训练的词向量。** 完成后会将这两个文件一并放入`data`目录下，输入文本的词典和预训练的词向量分别对应：`data/vocab.txt`和`data/wordVectors.txt`这两个文件。


Please remove the data/vocab.txt file; download.sh downloads this dictionary file.

OK, removed.

guoshengCS · 2018-03-15T11:52:01Z

fluid/sequence_tagging_for_ner/README.md

+
+```
+leicestershire  B-ORG   B-LOC
+extended        O       O


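Please check and correct the format here.

The format really is three columns: the input word, the gold label, and the predicted label, separated by tabs. On commit the tabs get replaced with four spaces, but the actual output is indeed tab-separated.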
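
A minimal sketch of reading the three-column, tab-separated output discussed here. The function name `parse_tagged_output` is hypothetical, and the sample string is assembled from rows shown in the README diffs of this PR purely for illustration; the blank line between sequences follows the README text, which says sequences are separated by empty lines.

```python
def parse_tagged_output(text):
    """Parse 'word<TAB>gold<TAB>predicted' lines into a list of sentences."""
    sentences, current = [], []
    for line in text.split("\n"):
        if not line.strip():
            # A blank line ends the current sequence.
            if current:
                sentences.append(current)
                current = []
            continue
        word, gold, predicted = line.split("\t")
        current.append((word, gold, predicted))
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample: two sequences separated by a blank line.
sample = (
    "leicestershire\tB-ORG\tB-LOC\n"
    "extended\tO\tO\n"
    "\n"
    "for\tO\tO\n"
    "DGDG\tO\tO\n"
    ".\tO\tO\n"
)
sentences = parse_tagged_output(sample)
# Tokens where the predicted label differs from the gold label.
mismatches = [(w, g, p) for sent in sentences for (w, g, p) in sent if g != p]
print(len(sentences), mismatches)
```

Splitting on the tab character rather than on arbitrary whitespace is deliberate: a file in which the tabs were replaced by four spaces would raise an error here instead of being silently accepted.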

guoshengCS · 2018-03-15T12:03:29Z

fluid/sequence_tagging_for_ner/train.py

+        batch_size=BATCH_SIZE)
+
+    place = fluid.CUDAPlace(0) if use_gpu else fluid.CPUPlace()
+    #place = fluid.CPUPlace()


Please remove this commented-out line.

guoshengCS · 2018-03-15T12:05:23Z

fluid/sequence_tagging_for_ner/README.md

+
+1. 运行 `sh data/download.sh`
+2. 修改 `train.py` 的 `main` 函数，指定数据路径
+


I see the main function has been changed; please update this section to match the new code.

guoshengCS · 2018-03-15T12:14:56Z

fluid/sequence_tagging_for_ner/infer.py

+import paddle.fluid as fluid
+import paddle.v2 as paddle
+from network_conf import ner_net
+from utils import load_dict, load_reverse_dict


Please normalize the import formatting here as in the other files.

guoshengCS · 2018-03-15T12:28:56Z

fluid/sequence_tagging_for_ner/infer.py

+        batch_size=6,
+        test_data_file="data/test",
+        vocab_file="data/vocab.txt",
+        target_file="data/target.txt")


Please also add GPU support to infer, and update the corresponding infer section of the README accordingly.

Added and updated.

guoshengCS · 2018-03-15T15:51:30Z

fluid/sequence_tagging_for_ner/train.py

+              " pass_f1_score:" + str(pass_f1_score))
+
+        save_dirname = os.path.join(model_save_dir, "params_pass_%d" % pass_id)
+        fluid.io.save_inference_model(save_dirname, ['word', 'mark', 'target'],


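I suggest not including target when calling save_inference_model.

Without target, a model trained in multi-threaded mode fails when it is saved and then loaded again. This is probably because, in multi-threaded mode, linear_chain_crf inside parallel_do uses target; at inference time, even though the crf_decode output we want does not need target, the program still reports that target is missing. This does not happen in single-threaded mode. For the sake of stability, and because adding target does not really cost users much trouble or overhead, we kept target.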

guoshengCS · 2018-03-15T15:57:44Z

fluid/sequence_tagging_for_ner/infer.py

+    word = fluid.layers.data(name='word', shape=[1], dtype='int64', lod_level=1)
+    mark = fluid.layers.data(name='mark', shape=[1], dtype='int64', lod_level=1)
+    target = fluid.layers.data(
+        name='target', shape=[1], dtype='int64', lod_level=1)


Please remove these three data layers. They should have no effect: they are added to default_main_program, whereas fluid.io.load_inference_model later returns a new Program object, inference_program, and that inference_program is what the executor runs.

Removed.

guoshengCS · 2018-03-16T02:42:36Z

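Also, when you push new code, please reply briefly under each comment with "Done" or a short explanation, so it is easy to tell which of the previous comments have been addressed. This is a habit worth building, and it makes reviews more efficient. Thanks.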

jshower · 2018-03-19T03:21:19Z

I have made the requested changes and replied. Could you please check again whether this can be approved? Thanks!

lcy-seso

Please do not merge duplicated code. Keep the directory structure and reuse the existing code.

lcy-seso · 2018-03-19T08:18:52Z

fluid/sequence_tagging_for_ner/README.md

@@ -0,0 +1,120 @@
+# 命名实体识别


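The README has too much duplicated content. In an official GitHub repo, text in the background section that is exactly the same must not be copied directly.
Where the text is identical, please add a link instead. Do not copy it directly.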

lcy-seso · 2018-03-19T08:35:17Z

fluid/sequence_tagging_for_ner/README.md

+for    O    O
+DGDG    O    O
+.    O    O
+```


The markdown indentation on lines 148 ~ 173 is wrong; please add indentation.

lcy-seso · 2018-03-19T08:35:32Z

fluid/sequence_tagging_for_ner/README.md

+
+    输出分为三列，以“\t” 分隔，第一列是输入的词语，第二列是标准结果，第三列为生成的标记结果。多条输入序列之间以空行分隔。
+
+## 真实结果示例


Remove “真实” ("real") from the heading.

lcy-seso · 2018-03-19T08:38:06Z

fluid/sequence_tagging_for_ner/README.md

+## 真实结果示例
+
+<p align="center">
+<img src="imgs/convergent_curve.png" width="80%" align="center"/><br/>


Please rename “convergent_curve” to “convergence_curve”.

lcy-seso · 2018-03-19T08:50:00Z

fluid/sequence_tagging_for_ner/README.md

+
+<p align="center">
+<img src="imgs/convergent_curve.png" width="80%" align="center"/><br/>
+图1. Fluid下实验结果示例


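Please change the caption to: 学习曲线 (learning curve)

Please redraw this figure; it has the following problems:

Label what the x-axis and the y-axis each represent.

The font of the x-axis and y-axis labels is too small to read.

The legend text for the blue curve is illegible.

Do not mention Fluid in the caption.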

lcy-seso · 2018-03-19T09:03:05Z

fluid/sequence_tagging_for_ner/infer.py

@@ -0,0 +1,64 @@
+import numpy as np


Add a blank line after line 1.

lcy-seso · 2018-03-19T09:05:14Z

fluid/sequence_tagging_for_ner/reader.py

@@ -0,0 +1,65 @@
+"""


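Duplicated code cannot be merged. Please keep the directory structure and reference the existing file by relative path.

Removed this file and noted in README.md how to obtain it.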

lcy-seso · 2018-03-19T09:05:50Z

fluid/sequence_tagging_for_ner/train.py

+import math
+
+import numpy as np
+import paddle.v2 as paddle


Swap line 3 and line 4.

lcy-seso · 2018-03-19T09:08:18Z

fluid/sequence_tagging_for_ner/utils.py

@@ -0,0 +1,61 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-


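Line 1 can be removed.

Please do not merge duplicated code. Keep the directory structure and reuse the existing code.

Removed this file and noted in README.md how to obtain it. The functions that differ from the earlier version, together with the newly added ones, now live in a new utils_extend.py file.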

lcy-seso · 2018-03-19T09:10:26Z

fluid/sequence_tagging_for_ner/infer.py

+
+
+def infer(model_path, batch_size, test_data_file, vocab_file, target_file,
+          use_gpu):


Please add a docstring to this function; for style, follow the comments in reader.py.
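
As a rough sketch of what such a docstring could look like: the signature below is copied from the diff, but the field-list style (`:param:` / `:type:`) and every parameter description are assumptions inferred from the argument names, not taken from reader.py or from the merged code. The body is omitted on purpose.

```python
import inspect


def infer(model_path, batch_size, test_data_file, vocab_file, target_file,
          use_gpu):
    """
    Tag the test data with a saved model (descriptions are assumptions).

    :param model_path: directory holding the saved inference model.
    :type model_path: str
    :param batch_size: number of sequences per batch.
    :type batch_size: int
    :param test_data_file: path of the test data file.
    :type test_data_file: str
    :param vocab_file: path of the input word dictionary.
    :type vocab_file: str
    :param target_file: path of the label dictionary.
    :type target_file: str
    :param use_gpu: whether to run on GPU instead of CPU.
    :type use_gpu: bool
    """
    raise NotImplementedError("body omitted; only the docstring is shown")


# Quick check that every parameter in the signature is documented.
params = inspect.signature(infer).parameters
all_documented = all(":param %s:" % name in infer.__doc__ for name in params)
print(all_documented)
```

A check like the last three lines is a cheap way to keep a docstring in step with the signature when arguments such as `use_gpu` are added later.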

jshower · 2018-03-20T06:18:48Z

The issues raised above have been fixed. Please confirm whether this can be merged. Thanks!

lcy-seso · 2018-03-21T00:15:54Z

fluid/sequence_tagging_for_ner/README.md


-根据序列标注结果可以直接得到实体边界和实体类别。类似的，分词、词性标注、语块识别、[语义角色标注](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html)等任务都可通过序列标注来解决。使用神经网络模型解决问题的思路通常是：前层网络学习输入的特征表示，网络的最后一层在特征基础上完成最终的任务；对于序列标注问题，通常：使用基于RNN的网络结构学习特征，将学习到的特征接入CRF完成序列标注。实际上是将传统CRF中的线性模型换成了非线性神经网络。沿用CRF的出发点是：CRF使用句子级别的似然概率，能够更好的解决标记偏置问题[[2](#参考文献)]。本例也将基于此思路建立模型。虽然，这里以NER任务作为示例，但所给出的模型可以应用到其他各种序列标注任务中。
+参照https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/README.md中的数据获取方式，将获取的data目录复制到本目录下。


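Try previewing this in the web view: because there are no spaces around the link, it does not render correctly.
Avoid pasting raw URLs; please use markdown hyperlink syntax.

Please add some transitional wording and clear step-by-step instructions, and avoid pasting links directly. For example (this is only a suggestion; adjust it as you see fit):

Please refer to the data acquisition instructions in the PaddlePaddle v2 named entity recognition example, copy that example's data folder into this example's directory, and run the download.sh script in it to obtain the training and test data.

Please make the same change on line 28: do not paste the link directly, use hyperlink syntax, and add the necessary description and transitions.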

lcy-seso · 2018-03-21T00:20:15Z

fluid/sequence_tagging_for_ner/README.md

-### 编写数据读取接口
-
-自定义数据读取接口只需编写一个 Python 生成器实现从原始输入文本中解析一条训练样本的逻辑。[reader.py](./reader.py) 中的`data_reader`函数实现了读取原始数据返回类型为： `paddle.data_type.integer_value_sequence`的 3 个输入（分别对应：词语在字典的序号、是否为大写、标注结果在字典中的序号）给`network_conf.ner_net`中定义的 3 个 `data_layer` 的功能。
+本例需要使用https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/reader.py以及https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/utils.py，请将这两个文件复制到本目录下。


Please revise this according to the previous comment.

Fixed.

Formatting fixes

jshower · 2018-03-21T06:47:03Z

Revised as requested. Could you please review again to see whether it can be merged? Thanks!

lcy-seso

LGTM.

jshower and others added 5 commits March 6, 2018 07:07

add an example of the NER task in fluid style

3b77123

add an example of the NER task in fluid style

6353576

Create README.md

88910ad

Add readme for the NER task.

for yapf check

dd0fe96

Merge branch 'jzy2' of https://github.com/PaddlePaddle/models into jzy2

f522a32

lcy-seso requested review from lcy-seso, guoshengCS and pkuyym March 6, 2018 09:29

guoshengCS reviewed Mar 9, 2018

jshower and others added 8 commits March 13, 2018 11:50

for code review

95c030a

Update README.md

315bb04

Update train.py

f35cef0

remove redundant code

add img

9e544d4

for style check

338ce12

Merge branch 'jzy2' of https://github.com/PaddlePaddle/models into jzy2

bb30556

Update README.md

89bc536

for code style

ef70b62

change data

1875ab1

jshower requested a review from guru4elephant March 15, 2018 03:24

guoshengCS reviewed Mar 16, 2018

jshower added 3 commits March 19, 2018 02:50

change data

a67b25e

Merge branch 'jzy2' of https://github.com/PaddlePaddle/models into jzy2

c6aca9a

remove redundant code

0a8f16a

lcy-seso requested changes Mar 19, 2018

for code review

f88033e

for code review

f3d8ff1

lcy-seso reviewed Mar 21, 2018

jshower added 3 commits March 21, 2018 14:01

Update README.md

e29cb7b

Formatting fixes

Update README.md

1e47e04

Update README.md

e46259b

lcy-seso approved these changes Mar 21, 2018

lcy-seso merged commit 555332f into develop Mar 21, 2018

jshower deleted the jzy2 branch March 21, 2018 08:32

