Add an example of NER task in fluid style. For issue #644 #689

Merged: 22 commits merged on Mar 21, 2018

@jshower (Collaborator) commented Mar 6, 2018

This PR adds an example of the NER task in Fluid style. The model structure is similar to that of the V2 version.

fixes #644

@@ -0,0 +1,120 @@
# 命名实体识别 (Named Entity Recognition)
Collaborator

If the content is the same as the original v2 example, I suggest deleting it. The same goes for the other data and script files. You can simply state that readers should refer to the original content.

Collaborator Author

The content still differs in some places, so it has been kept for the reader's convenience.

Collaborator

The README contains too much duplicated content. In an official repo on GitHub, background text that is word-for-word identical must not be copied directly.
Please add a link for the parts that are identical instead of copying them.

learning_rate=mix_hidden_lr)
hidden_para_attr = fluid.ParamAttr(
initializer=NormalInitializer(
loc=0.0, scale=(1. / math.sqrt(hidden_dim) / 3), seed=0),
Collaborator

Let's not fix the seed in the initializer.

Collaborator Author

Removed.
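To illustrate what dropping the fixed seed changes, here is a minimal sketch that does not use Paddle. It assumes `scale` is the standard deviation of the normal distribution, as in the quoted `NormalInitializer` call; `normal_scale` and `init_weights` are hypothetical helper names, not part of the PR.

```python
import math
import random


def normal_scale(hidden_dim):
    """Standard deviation used in the quoted snippet: 1 / sqrt(hidden_dim) / 3."""
    return 1.0 / math.sqrt(hidden_dim) / 3


def init_weights(rows, cols, hidden_dim, seed=None):
    """Draw a rows x cols matrix from N(0, scale^2).

    With seed=None the generator is seeded from system entropy, so every run
    starts from different weights. A fixed seed reproduces the same weights.
    """
    rng = random.Random(seed)
    scale = normal_scale(hidden_dim)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


fixed_a = init_weights(2, 3, hidden_dim=256, seed=0)
fixed_b = init_weights(2, 3, hidden_dim=256, seed=0)
unseeded = init_weights(2, 3, hidden_dim=256)
print(fixed_a == fixed_b)  # identical draws when the seed is fixed
```

A fixed seed is useful for debugging, but in a published example it makes every user start from the same initial weights, which is presumably why the reviewer asked for it to be removed.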

import paddle.fluid as fluid
from paddle.fluid.initializer import NormalInitializer
from utils import logger, load_dict, get_embedding
import math
Collaborator

Please adjust the import order and add blank lines where appropriate, following https://www.python.org/dev/peps/pep-0008/#imports.

Imports should be grouped in the following order:

1. standard library imports
2. related third party imports
3. local application/library specific imports

You should put a blank line between each group of imports.

Collaborator Author

Adjusted.
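As an illustration of the grouping requested above, the import lines quoted in this thread can be sorted into the three PEP 8 groups. This is a minimal sketch: `STDLIB_MODULES`, `LOCAL_MODULES`, `group_imports`, and `render` are hypothetical names, the two module sets are hard-coded for this example only, and anything not in either set is assumed to be third party.

```python
# Hypothetical, hand-written module sets covering only this example.
STDLIB_MODULES = {"math", "os"}
LOCAL_MODULES = {"utils", "reader", "network_conf"}


def group_imports(lines):
    """Split import statements into PEP 8 groups: stdlib, third party, local."""
    groups = {"stdlib": [], "third_party": [], "local": []}
    for line in lines:
        # In both "import a.b as c" and "from a.b import c",
        # the second token starts with the top-level module name.
        top_level = line.split()[1].split(".")[0]
        if top_level in STDLIB_MODULES:
            groups["stdlib"].append(line)
        elif top_level in LOCAL_MODULES:
            groups["local"].append(line)
        else:
            groups["third_party"].append(line)
    return groups


def render(groups):
    """Join non-empty groups with one blank line between them."""
    order = ("stdlib", "third_party", "local")
    return "\n\n".join("\n".join(groups[name]) for name in order if groups[name])


original = [
    "import paddle.fluid as fluid",
    "from paddle.fluid.initializer import NormalInitializer",
    "from utils import logger, load_dict, get_embedding",
    "import math",
]
grouped = group_imports(original)
print(render(grouped))
```

Printed, the result starts with `import math`, followed by the two `paddle` imports, then the local `utils` import, with a blank line between groups.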

import reader
import os
import math
import numpy as np
Collaborator

Please adjust the import order and add blank lines where appropriate, following https://www.python.org/dev/peps/pep-0008/#imports.

Imports should be grouped in the following order:

1. standard library imports
2. related third party imports
3. local application/library specific imports

You should put a blank line between each group of imports.

Collaborator Author

Adjusted.

", Recall " + str(recall) + ", F1_score" + str(f1_score))
batch_id = batch_id + 1

pass_precision, pass_recall, pass_f1_score = test(
Collaborator

For the training data, the metrics accumulated since chunk_evaluator.reset() can be obtained through chunk_evaluator.eval(), so there is no need to run the test routine separately. See https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/fluid/tests/book/test_label_semantic_roles.py#L222 for reference.

Collaborator Author

Fixed.
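The suggestion above is to read the pass-level training metrics from counts accumulated during training instead of running a separate test pass. The idea can be sketched without Paddle. `ChunkCounter` below is a hypothetical stand-in, not the `chunk_evaluator` API, and the per-batch counts are made up for illustration.

```python
class ChunkCounter:
    """Hypothetical accumulator that mirrors the reset/eval idea."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Call at the start of each pass so the totals cover that pass only.
        self.num_infer = 0    # chunks predicted by the model
        self.num_label = 0    # chunks present in the gold labels
        self.num_correct = 0  # predicted chunks that match a gold chunk

    def update(self, num_infer, num_label, num_correct):
        # Add one batch's counts to the running totals.
        self.num_infer += num_infer
        self.num_label += num_label
        self.num_correct += num_correct

    def eval(self):
        # Precision, recall, and F1 over everything seen since reset().
        precision = self.num_correct / self.num_infer if self.num_infer else 0.0
        recall = self.num_correct / self.num_label if self.num_label else 0.0
        denom = precision + recall
        f1_score = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f1_score


counter = ChunkCounter()
counter.reset()
for batch_counts in [(4, 5, 3), (6, 5, 5)]:  # made-up (infer, label, correct)
    counter.update(*batch_counts)
pass_precision, pass_recall, pass_f1_score = counter.eval()
print(pass_precision, pass_recall, pass_f1_score)
```

Because the counts are summed before dividing, the pass-level F1 is computed over all chunks in the pass rather than averaged over batches.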

print("[TestSet] pass_id:" + str(pass_id) + " pass_precision:" + str(
pass_precision) + " pass_recall:" + str(pass_recall) +
" pass_f1_score:" + str(pass_f1_score))

Collaborator

model_save_dir is not used. Please add code that saves the model; see https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepASR/train.py#L246 for reference.

Collaborator Author

Fixed.

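@guoshengCS (Collaborator)

Please also add infer.py and the inference code; see https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepASR/infer.py for reference.

Please also add GPU support and multi-GPU/multi-thread code; see https://github.com/PaddlePaddle/models/blob/develop/fluid/DeepASR/model_utils/model.py#L85 for reference.

Also, would it be convenient to verify and share the convergence behavior and the learning results? Thanks.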

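@jshower (Collaborator Author) commented Mar 13, 2018

Fixed the issues raised above: the model was not being saved, infer.py was missing, parallel execution and GPU/CPU selection were missing, and the import format did not follow the convention. Some files are fairly similar to the original V2 example but differ in many details, so they have been kept to let users run the example directly.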

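@jshower (Collaborator Author) commented Mar 13, 2018

While adding the multi-thread version of this model, I ran into a problem that looks like a bug in Paddle's multi-threading under specific conditions. I have filed an issue; please follow #732.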

1. A dictionary for the input text
2. Pretrained word vectors for the words in the dictionary
3. A dictionary for the labels
The label dictionary is included in the `data` directory as `data/target.txt`. The input-text dictionary and the pretrained word vectors for its words come from the [Stanford CS224d](http://cs224d.stanford.edu/) course assignment. **To run this example, first run the `download.sh` script in the `data` directory to download the input-text dictionary and the pretrained word vectors.** When it finishes, both files are placed in the `data` directory as `data/vocab.txt` and `data/wordVectors.txt`, respectively.
Collaborator

Please remove the data/vocab.txt file; download.sh downloads this dictionary file.

Collaborator Author

OK, removed.


```
leicestershire B-ORG B-LOC
extended O O
Collaborator

Please check and correct the format here.
(image attached)

Collaborator Author

The format really is three columns: the input word, the gold label, and the predicted label, separated by tabs. When committed, the tabs were replaced with four spaces, but the actual output is tab-separated.

batch_size=BATCH_SIZE)

place = fluid.CUDAPlace(0) if use_gpu else fluid.CPUPlace()
#place = fluid.CPUPlace()
Collaborator

Please remove this commented-out line.

Collaborator Author

Removed.


1. Run `sh data/download.sh`
2. Edit the `main` function in `train.py` to specify the data paths

Collaborator

The main function has since been changed; please update this text to match the new code.

Collaborator Author

Fixed.

import paddle.fluid as fluid
import paddle.v2 as paddle
from network_conf import ner_net
from utils import load_dict, load_reverse_dict
Collaborator

Please normalize the import format here, as in the other files.

Collaborator Author

Normalized.

batch_size=6,
test_data_file="data/test",
vocab_file="data/vocab.txt",
target_file="data/target.txt")
Collaborator

Please also add GPU support to infer, and adjust the corresponding infer section of the README accordingly.

Collaborator Author

Added and adjusted.

" pass_f1_score:" + str(pass_f1_score))

save_dirname = os.path.join(model_save_dir, "params_pass_%d" % pass_id)
fluid.io.save_inference_model(save_dirname, ['word', 'mark', 'target'],
Collaborator

I suggest not including target when calling save_inference_model.

Collaborator Author

Without target, a model trained in multi-thread mode fails when it is saved, loaded again, and run. This is probably because, in multi-thread mode, linear_chain_crf inside parallel_do uses target. At prediction time, even though the crf_decode output we want does not need target, the program still reports that target is missing. This does not happen in single-thread mode. For the sake of stability, and because adding target does not really cost users much trouble or overhead, target has been kept.

word = fluid.layers.data(name='word', shape=[1], dtype='int64', lod_level=1)
mark = fluid.layers.data(name='mark', shape=[1], dtype='int64', lod_level=1)
target = fluid.layers.data(
name='target', shape=[1], dtype='int64', lod_level=1)
Collaborator

Please remove these three data layers. They should have no effect: they are added to default_main_program, whereas fluid.io.load_inference_model returns a new Program object, inference_program, and the executor runs that inference_program.

Collaborator Author

Removed.

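@guoshengCS (Collaborator) commented Mar 16, 2018

Also, when you push new code, please reply briefly under each comment with "Done" or a short explanation, so it is clear which of the previous comments have been addressed. This is a habit worth building because it makes review more efficient. Thanks.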

@jshower (Collaborator Author) commented Mar 19, 2018

I have made the requested changes and replied to each comment. Please check again whether this can be approved. Thanks!

@lcy-seso (Collaborator) left a comment

Please do not merge duplicated code. Keep the directory structure and reuse the existing code.

for O O
DGDG O O
. O O
```
Collaborator

The markdown indentation of lines 148 to 173 is wrong; please add indentation.

Collaborator Author

Fixed.


The output has three columns separated by "\t": the first column is the input word, the second is the gold label, and the third is the predicted label. Input sequences are separated from one another by a blank line.

## 真实结果示例 (Real Result Example)
Collaborator

Remove "真实" ("real").

Collaborator Author

Fixed.
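The README line quoted in this thread describes the inference output as three tab-separated columns (input word, gold label, predicted label) with a blank line between sequences. A minimal sketch of reading that format back follows. `parse_predictions` is a hypothetical helper, and `sample` is assembled from rows quoted elsewhere in this review purely for illustration; it is not claimed to be contiguous real output.

```python
def parse_predictions(text):
    """Parse 'word<TAB>gold<TAB>predicted' lines into a list of sequences.

    Blank lines mark sequence boundaries. Each sequence is a list of
    (word, gold_label, predicted_label) tuples.
    """
    sequences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # Blank line: close the current sequence, if any.
            if current:
                sequences.append(current)
                current = []
            continue
        word, gold, predicted = line.split("\t")
        current.append((word, gold, predicted))
    if current:
        sequences.append(current)
    return sequences


sample = (
    "leicestershire\tB-ORG\tB-LOC\n"
    "extended\tO\tO\n"
    "\n"
    "for\tO\tO\n"
    "DGDG\tO\tO\n"
    ".\tO\tO\n"
)
sequences = parse_predictions(sample)
mismatches = [row for seq in sequences for row in seq if row[1] != row[2]]
print(len(sequences), mismatches)
```

Splitting on the tab character rather than on arbitrary whitespace matters here: as the author notes earlier in the review, the tabs were turned into four spaces when the sample was pasted into the README, so a copy of that sample would not parse with this function.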

## 真实结果示例 (Real Result Example)

<p align="center">
<img src="imgs/convergent_curve.png" width="80%" align="center"/><br/>
Collaborator

Please change "convergent_curve" to "convergence_curve".

Collaborator Author

Fixed.


<p align="center">
<img src="imgs/convergent_curve.png" width="80%" align="center"/><br/>
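图1. Fluid下实验结果示例 (Figure 1. Example of experimental results under Fluid)
Collaborator

  1. Please change the caption to: 学习曲线 (Learning curve)
  2. Please redraw this figure; it has the following problems:
    • Label what the x-axis and y-axis each represent.
    • The font of the x-axis and y-axis labels is too small to read.
    • The legend text for the blue curve is illegible.
    • The caption should not mention Fluid.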

Collaborator Author

Fixed.

@@ -0,0 +1,64 @@
import numpy as np
Collaborator

Leave a blank line after line 1.

Collaborator Author

Fixed.

@@ -0,0 +1,65 @@
"""
Collaborator

Duplicated code cannot be merged. Please keep the directory structure and reference the existing code through a relative path.

Collaborator Author

Deleted this file and noted in README.md how to obtain it.

import math

import numpy as np
import paddle.v2 as paddle
Collaborator

Swap line 3 and line 4.

Collaborator Author

Fixed.

@@ -0,0 +1,61 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
Collaborator

Line 1 can be removed.

Please do not merge duplicated code. Keep the directory structure and reuse the existing code.

Collaborator Author

Deleted this file and noted in README.md how to obtain it. The functions that differ from the earlier version, together with the newly added functions, have been moved into a new utils_extend.py file.



def infer(model_path, batch_size, test_data_file, vocab_file, target_file,
use_gpu):
Collaborator

Please add a docstring for this function; for style, refer to the comments in reader.py.

Collaborator Author

Added.

@jshower (Collaborator Author) commented Mar 20, 2018

The issues raised above have been fixed. Please confirm whether this can be merged. Thanks!


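Entity boundaries and entity types can be read directly from the sequence labeling result. Similarly, tasks such as word segmentation, part-of-speech tagging, chunking, and [semantic role labeling](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html) can all be solved through sequence labeling. With neural network models, the usual approach is for the earlier layers to learn a feature representation of the input and for the last layer to complete the final task on top of those features. For sequence labeling, this typically means learning features with an RNN-based network and feeding the learned features into a CRF to perform the labeling. In effect, the linear model in a traditional CRF is replaced with a nonlinear neural network. The reason for keeping the CRF is that it uses a sentence-level likelihood and therefore handles the label bias problem better [[2](#参考文献)]. This example builds its model along these lines. Although NER is used as the example task here, the model can be applied to various other sequence labeling tasks.
Following the data acquisition instructions in https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/README.md, copy the resulting data directory into this directory.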
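@lcy-seso (Collaborator) Mar 21, 2018

  1. Try previewing this in the web view: because there are no surrounding spaces, the link does not render correctly.
    Avoid pasting raw URLs; use markdown hyperlink syntax instead.

  2. Please add some transitional wording and clear step-by-step instructions, and avoid pasting bare links. For example (this is only a suggestion; feel free to adjust it):

    Please refer to the data acquisition instructions in the PaddlePaddle v2 Named Entity Recognition section, copy that example's data folder into this example's directory, and run its download.sh script to fetch the training and test data.

  3. Please make the same change at line 28: do not paste the link directly, use hyperlink syntax, and add the necessary description and transition.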

Collaborator Author

Fixed.
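The README passage quoted in this thread says that entity boundaries and types follow directly from the sequence labels. A minimal sketch of that decoding step is below. It assumes a B-/I-/O tag scheme: the sample output in this review shows tags such as `B-ORG` and `B-LOC`, while the `I-` continuation prefix is an assumption. `extract_entities` and the example sentence are hypothetical.

```python
def extract_entities(words, tags):
    """Turn per-token B-/I-/O tags into (text, type, start, end) spans.

    `end` is exclusive. An I- tag that does not continue an open entity of
    the same type is ignored in this sketch.
    """
    entities = []
    start, ent_type = None, None
    # A trailing "O" sentinel flushes an entity that ends at the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not continues:
            entities.append((" ".join(words[start:i]), ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return entities


words = ["leicestershire", "extended", "john", "smith", "."]  # made-up sentence
tags = ["B-ORG", "O", "B-PER", "I-PER", "O"]
entities = extract_entities(words, tags)
print(entities)
```

The same function works for any task that uses this tag scheme, which matches the README's point that the model is not specific to NER.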

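### 编写数据读取接口 (Writing the Data Reading Interface)

A custom data reading interface only requires a Python generator that parses one training sample from the raw input text. The `data_reader` function in [reader.py](./reader.py) reads the raw data and returns 3 inputs of type `paddle.data_type.integer_value_sequence` (the word's index in the dictionary, whether the word is capitalized, and the label's index in the dictionary) for the 3 `data_layer`s defined in `network_conf.ner_net`.
This example requires https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/reader.py and https://github.com/PaddlePaddle/models/blob/develop/sequence_tagging_for_ner/utils.py; please copy these two files into this directory.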
Collaborator

Please revise this according to the previous comment.

Collaborator Author

Fixed.
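The README text quoted in this thread describes the reader as a Python generator that yields three sequences per sample: word indices, capitalization marks, and label indices. The sketch below shows that shape only. It is not the actual `reader.py`: `make_reader`, the `unk_id` parameter, the lowercase lookup, and the one-"token label"-pair-per-line input format are all assumptions made for illustration.

```python
def make_reader(lines, word_dict, label_dict, unk_id=0):
    """Return a generator function yielding (word_ids, marks, label_ids).

    Assumed input: one "token label" pair per line, with a blank line
    between sentences.
    """
    def reader():
        word_ids, marks, label_ids = [], [], []
        for line in lines:
            line = line.strip()
            if not line:
                # Blank line: emit the finished sentence, if any.
                if word_ids:
                    yield word_ids, marks, label_ids
                    word_ids, marks, label_ids = [], [], []
                continue
            token, label = line.split()
            # Unknown words map to unk_id; lookup is case-insensitive.
            word_ids.append(word_dict.get(token.lower(), unk_id))
            # 1 if the token starts with an uppercase letter, else 0.
            marks.append(1 if token[0].isupper() else 0)
            label_ids.append(label_dict[label])
        if word_ids:
            yield word_ids, marks, label_ids

    return reader


# Made-up dictionaries and input, for illustration only.
word_dict = {"leicestershire": 1, "extended": 2, "for": 3}
label_dict = {"B-ORG": 0, "O": 1}
lines = ["Leicestershire B-ORG", "extended O", "", "for O"]
samples = list(make_reader(lines, word_dict, label_dict)())
print(samples)
```

Returning a function that creates the generator, rather than the generator itself, lets the data be iterated once per training pass.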

@jshower (Collaborator Author) commented Mar 21, 2018

Revised as requested. Please review again to see whether it can be merged. Thanks!

@lcy-seso (Collaborator) left a comment

LGTM.

@lcy-seso lcy-seso merged commit 555332f into develop Mar 21, 2018
@jshower jshower deleted the jzy2 branch March 21, 2018 08:32
Successfully merging this pull request may close these issues.

Add sequence labeling model for Fluid.