add text classification demo for nested sequence data #367

Merged

peterzhang2029
Contributor

resolve #366

@peterzhang2029 changed the title from "add nest text classification" to "add text classification demo for nested sequence data" on Oct 12, 2017
Collaborator

@lcy-seso left a comment

please refine the doc first.

@@ -0,0 +1,173 @@
# Text Classification with Two-Level Sequences
Collaborator

Text Classification Based on Two-Level Sequences

@@ -0,0 +1,173 @@
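# Text Classification with Two-Level Sequences
## Introduction
Sequence data is one of the main types of input data faced by natural language processing tasks: a sentence is a sequence made up of words, and multiple sentences in turn make up a paragraph. Therefore, a paragraph can be viewed as a nested two-level sequence, each element of which is itself a sequence.
Collaborator

  1. "Sequence data" --> "Sequences"
  2. "a sentence is a sequence made up of words, and multiple sentences in turn make up a paragraph." --> "a sentence is made up of words, and multiple sentences in turn make up a paragraph."
  3. "Therefore, a paragraph can be viewed as a nested two-level sequence" --> This sentence is wrong. It should read: a paragraph can be viewed as a nested sequence (also called a two-level sequence), each element of which is itself a sequence.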
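
As a rough illustration of the two-level sequence discussed in this hunk (a sketch that is not part of the PR), the snippet below turns a paragraph into a list of sentences, each of which is a list of word ids. The helper `to_two_level_sequence`, the period-based sentence splitting, and the tiny vocabulary are all assumptions made for the example.

```python
def to_two_level_sequence(paragraph, word_dict, unk_id=0):
    """Convert a paragraph into a list of sentences, each a list of word ids."""
    nested = []
    for sentence in paragraph.split("."):
        words = sentence.lower().split()
        if words:  # skip empty fragments, e.g. after the final period
            nested.append([word_dict.get(w, unk_id) for w in words])
    return nested


# Hypothetical vocabulary; id 0 is reserved for unknown words.
word_dict = {"i": 1, "liked": 2, "the": 3, "film": 4, "it": 5, "was": 6, "fun": 7}
nested = to_two_level_sequence("I liked the film. It was fun.", word_dict)
print(nested)  # [[1, 2, 3, 4], [5, 6, 7]]
```

The outer list is the paragraph-level sequence and each inner list is a word-level sequence, which is the nesting the reviewer's wording describes.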

Two-level sequences are a very flexible way of organizing data supported by `PaddlePaddle`; this helps us better describe more complex language data such as paragraphs and multi-turn dialogues. Based on two-level sequence input, we can design a hierarchical network that encodes the input data at the word level and at the sentence level, to better accomplish some complex language understanding tasks.
Collaborator

  • "helps us" --> "can help us"
  • "Based on two-level sequence input" --> "Taking two-level sequences as input"
  • "to better accomplish" --> "and thereby better accomplish"


This example will demonstrate how to use `PaddlePaddle` to organize text data as two-level sequences and complete a text classification task.
Collaborator

  • "本示例" --> "本例" (both mean "this example"; use the shorter form)
  • This example demonstrates how to organize long text input (usually at the paragraph or document level) as two-level sequences in PaddlePaddle and complete the task of classifying long texts.

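## Model Overview
For text classification, we treat a piece of text as an array of sentences, and each sentence is in turn an array of words; this is a kind of two-level sequence input data. Each sentence of the paragraph is encoded into a vector by a convolutional neural network, and the sentence vectors are then passed through a pooling layer and encoded into a single vector, which gives the representation vector of the paragraph. For the classification task, feeding the paragraph representation vector into a classifier yields the classification result.
Collaborator

  1. Delete the opening phrase "For text classification".
  2. "单词" --> "词语" (use 词语 for "word")
  3. Delete "this is a kind of two-level sequence input data."
  4. Please use "sequence" rather than "array", and keep key terms consistent throughout the text.

We treat a piece of text as a sequence of sentences, and each sentence is in turn a sequence of words.

  1. Please start a new paragraph for the content below.

We first encode each sentence in the paragraph with a convolutional neural network; then the representation vectors of the sentences are passed through a pooling layer to obtain the encoding vector of the paragraph; finally, the paragraph encoding vector is fed into the classifier (a fully connected layer with softmax activation) to obtain the final classification result.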
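
The pipeline in the reviewer's rewrite (convolve over the words of each sentence, pool, then pool again over sentences) can be sketched in plain Python. This is only an illustration and not the code in `network_conf.py`: the function names are hypothetical, the filters are fixed rather than learned, average pooling over sentences is an assumption (the text does not say which pooling is used), and every sentence is assumed to have at least `context_len` words.

```python
def conv_max_pool(word_vectors, context_len, weights):
    """Slide a window of context_len words over a sentence and max-pool the responses."""
    responses = []
    for start in range(len(word_vectors) - context_len + 1):
        # Flatten the word vectors inside the window into one list.
        window = [x for vec in word_vectors[start:start + context_len] for x in vec]
        responses.append(sum(w * x for w, x in zip(weights, window)))
    return max(responses)


def encode_paragraph(paragraph, context_len, filters):
    """Encode each sentence with every filter, then average over sentences."""
    sentence_vectors = [
        [conv_max_pool(sentence, context_len, f) for f in filters]
        for sentence in paragraph
    ]
    count = float(len(sentence_vectors))
    return [sum(v[i] for v in sentence_vectors) / count for i in range(len(filters))]


# Two sentences of 2-dimensional word vectors; two filters spanning 2 words each.
paragraph = [[[1, 0], [0, 1], [1, 1]], [[0, 1], [1, 0]]]
filters = [[1, 1, 1, 1], [1, 0, 0, 1]]
paragraph_vector = encode_paragraph(paragraph, context_len=2, filters=filters)
print(paragraph_vector)  # [2.5, 1.0]
```

In a real model the paragraph vector would then go into the softmax classifier; that step is left out here to keep the sketch short.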


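The PaddlePaddle code that implements this network structure is in `network_conf.py`.

Regarding the processing of two-level sequences, the two-level time-series data must first be converted into single-level time-series data, and each single-level time series is then processed. PaddlePaddle provides the `recurrent_group` interface for this conversion. In this example, we use recurrent_group to break each paragraph of the text data apart, and each resulting sentence then goes through a CNN to learn its vector representation.
Collaborator

  1. Please distinguish between the use of "对于" (regarding) and "对" (to, for).
  2. This description does not reflect the design idea of recurrent_group well. Simplify this paragraph for now so that the PR can be merged soon.

In PaddlePaddle, recurrent_group is the main tool that helps us build hierarchical models for processing two-level sequences. Here we use two nested recurrent_groups. The outer recurrent_group breaks a paragraph into sentences, and the input received in its step function is a sequence of sentences; the inner recurrent_group breaks a sentence into words, and the input received in its step function is a non-sequence word.

At the word level, we use a CNN that takes word vectors as input to learn the sentence representation; at the paragraph level, the representations of the sentences are pooled to obtain the paragraph representation.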
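
To show what each step function receives under the nested arrangement described above, here is a toy sketch in plain Python. It is not PaddlePaddle's `recurrent_group` and models none of its memory or recurrence; `toy_group`, `word_step`, and `sentence_step` are hypothetical names, and word length stands in for a learned feature.

```python
def toy_group(step, sequence):
    """Apply step to every element of sequence (iteration only, no memory)."""
    return [step(element) for element in sequence]


def word_step(word):
    # Inner step: receives a single, non-sequence word.
    return len(word)


def sentence_step(sentence):
    # Outer step: receives one sentence, which is itself a sequence of words.
    word_features = toy_group(word_step, sentence)
    return max(word_features)  # pool over the words of the sentence


paragraph = [["nested", "sequences"], ["are", "flexible"]]
sentence_features = toy_group(sentence_step, paragraph)
# Pool over sentences to get one value for the whole paragraph.
paragraph_feature = sum(sentence_features) / float(len(sentence_features))
print(sentence_features, paragraph_feature)
```

The point of the sketch is the shape of the inputs: the outer step sees a whole sentence, the inner step sees one word at a time.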



hidden_size],
step=cnn_cov_group)
```
When converting with the `recurrent_group` interface, the input sequence must be passed to the `input` attribute. Because the conversion to be implemented in this example is `two-level time series => single-level time series`, we need to mark the input data as `SubsequenceInput`.
Collaborator

Delete this sentence for now. The documentation should maintain a certain level of technical accuracy.

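### Training
1. Data organization

Suppose the training data has the following format: each line is one sample, with columns separated by `\t`; the first column is the class label and the second column is the content of the input text. Here are two sample records:
Collaborator

"Suppose the training data has the following format" --> "The input data format is as follows:"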
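
For readers who want to see the format in code, a minimal sketch of parsing one such line follows. The `parse_sample` helper is hypothetical and is not the reader used in the PR; it only assumes the layout stated above (label, a tab, then the text).

```python
def parse_sample(line):
    """Split one tab-separated training line into (label, text)."""
    # Split on the first tab only, so tabs inside the text are preserved.
    label, text = line.rstrip("\n").split("\t", 1)
    return label, text


label, text = parse_sample("1\tI liked the film. I'd give the film an 8 out of 10.\n")
print(label)  # 1
print(text)   # I liked the film. I'd give the film an 8 out of 10.
```

The label is kept as a string here; mapping it to an integer id is a separate step.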

Collaborator

@lcy-seso left a comment

  1. Please add docstrings throughout.
  2. Add a requirements.txt file.


### Inference

1. Modify the following variables in `infer.py` to specify the model to use and the test data.
Collaborator

  1. Please change the usage of infer.py to take command-line arguments as well; do not make users modify the code.
  2. From now on, all models under models should be used in the same way, with run parameters specified on the command line. Please refer to the generate chinese poetry example.

Contributor Author

DONE

@@ -0,0 +1,4 @@
At this point it seems almost unnecessary to state that Jon Bon Jovi delivers a firm, strong, seamless performance as Derek Bliss. His capability as an actor has been previously established by his critical acclaim garnered in other films (The Leading Man, No Looking Back). But, in case anyone is still wondering, yes, Jon Bon Jovi can act. He can act well and that's come to be expected of him. It's easy to separate Derek from the guy who belts out hits on VH-1.<br /><br />I generally would not watch a horror movie. I've come to expect them to focus on sensationalistic gore rather than dialogue and plot. What pleased me most about this film was that there really was a viable plot being moved along. The gore is not so much as to become the focus of the film and does not have a disturbingly realistic quality of films with higher technical effects budgets. So, gore fans might be disappointed, but story fans will not.<br /><br />Unlike an action film like U-571 where the dialogue takes a back seat to the bombast, we get a chance to know "the good guys" and actually care what happens to them. A few scenes are left unexplained (like Derek's hallucinations) but you get the feeling certain aspects were as they were to lay the foundation for a sequel. Unfortunately, with the lack of interest shown by Hollywood in this film, that sequel will never happen. These few instances are forgiveable knowing that Vampires could have been a continuing series.<br /><br />Is this the best film I've ever seen in my life? No. Is it a good way to spend about two hours being entertained? Yes. It won't leave the person who fears horror movies with insomnia and it won't leave the horror movie lover completely disappointed either. If you're somewhere in between the horror genre loather and the horror genre lover, this film is for you. It reaches a happy medium with the effects and story balancing each other.<br /><br />
The original Vampires (1998) is one of my favorites. I was curious to see how a sequel would work considering they used none of the original characters. I was quite surprised at how this played out. As a rule, sequels are never as good as the original, with a few exceptions. Though this one was not a great movie, the writer did well in keeping the main themes & vampire lore from the first one in tact. Jon Bon Jovi was a drawback initially, but he proved to be a half-way decent Slayer. I doubt anyone could top James Wood's performance in the first one, though. unless you bring in Buffy!<br /><br />All in all, this was a decent watch & I would watch it again.<br /><br />I was left with two questions, though... what happened to Jack Crow & how did Derek Bliss come to be a slayer? Guess we'll just have to leave that to imagination.
Collaborator

This sample data is rather long; reduce it to the minimum amount of data needed to illustrate the point.

Contributor Author

DONE

@@ -0,0 +1,4 @@
1 I liked the film. Some of the action scenes were very interesting, tense and well done. I especially liked the opening scene which had a semi truck in it. A very tense action scene that seemed well done.<br /><br />Some of the transitional scenes were filmed in interesting ways such as time lapse photography, unusual colors, or interesting angles. Also the film is funny is several parts. I also liked how the evil guy was portrayed too. I'd give the film an 8 out of 10.
Collaborator

The sample data is too long; reduce it to the minimum amount of data needed to illustrate the point.

Contributor Author

DONE

@@ -0,0 +1,4 @@
0 I admit that I am a vampire addict: I have seen so many vampire movies I have lost count and this one is definitely in the top ten. I was very impressed by the original John Carpenter's Vampires and when I descovered there was a sequel I went straight out and bought it. This movie does not obey quite the same rules as the first, and it is not quite so dark, but it is close enough and I felt that it built nicely on the original.<br /><br />Jon Bon Jovi was very good as Derek Bliss: his performance was likeable and yet hard enough for the viewer to believe that he might actually be able to survive in the world in which he lives. One of my favourite parts was just after he meets Zoey and wanders into the bathroom of the diner to check to see if she is more than she seems. His comments are beautifully irreverant and yet emminently practical which contrast well with the rest of the scene as it unfolds.<br /><br />The other cast members were also well chosen and they knitted nicely to produce an entertaining and original film. It is not simply a rehash of the first movie and it has grown in a similar way to the way Fright Night II grew out of Fright Night. There are different elements which make it a fresh movie with a similar theme.<br /><br />If you like vampire movies I would recommend this one. If you prefer your films less bloody then choose something else.
Collaborator

The data is too long; reduce it to the minimum amount of data needed to illustrate the point.

Contributor Author

DONE

@@ -0,0 +1,82 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
Collaborator

Delete lines 1-2.

Contributor Author

DONE

test_batch = []


if __name__ == "__main__":
Collaborator

Use Python's click package (`import click`) to take command-line arguments instead; do not make users edit the code.

Contributor Author

DONE

@@ -0,0 +1,252 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Collaborator

Delete lines 1-2.

Contributor Author

DONE

@@ -0,0 +1,163 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
Collaborator

Delete lines 1-2.

Contributor Author

DONE

@@ -0,0 +1,61 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
Collaborator

Delete lines 1-2.

Contributor Author

DONE

logger.setLevel(logging.INFO)


def parse_train_cmd():
Collaborator

You can simply use the click package.

Contributor Author

DONE

Collaborator

@lcy-seso left a comment

This configuration is fairly simple for now. I hope a batch norm example on top of the text convolution can be added later; that can be done in the next PR.

@click.option(
"--model_path",
type=str,
default='models/params_pass_00000.tar.gz',
Collaborator

  1. Do not set a default for this parameter; make the field required and let the user specify it.
  2. What is needed is a directory that stores all the saved models, not a file name. Do not let the models from later passes overwrite earlier ones; keep them so that users can do model selection.

Contributor Author

DONE
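
One way to satisfy the second point is to derive a separate file name per pass inside a user-supplied directory. The sketch below is an assumption about how that could look, not the code merged in this PR: `model_save_path` is a hypothetical helper, and only the `params_pass_00000.tar.gz` naming pattern comes from the hunk above.

```python
import os
import tempfile


def model_save_path(model_dir, pass_id):
    """Return a distinct parameter file path inside model_dir for each pass."""
    if not os.path.isdir(model_dir):
        os.makedirs(model_dir)
    return os.path.join(model_dir, "params_pass_%05d.tar.gz" % pass_id)


# A temporary directory stands in for the user-supplied model directory.
model_dir = os.path.join(tempfile.mkdtemp(), "models")
paths = [model_save_path(model_dir, pass_id) for pass_id in range(3)]
print([os.path.basename(p) for p in paths])
```

Because the pass id is part of the file name, a later pass never overwrites an earlier one.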

"--data_path",
default=None,
help=("path of data for inference (default: None). "
"if this parameter is not set, "
Collaborator

A comment is a standalone sentence; please capitalize the first letter and end it with a period.

Contributor Author

DONE

@@ -6,12 +6,16 @@ def cnn_cov_group(group_input, hidden_size):
input=group_input, context_len=3, hidden_size=hidden_size)
conv4 = paddle.networks.sequence_conv_pool(
input=group_input, context_len=4, hidden_size=hidden_size)

#output_group = paddle.layer.concat(input=[conv3, conv4])
Collaborator

Is this line of code that was removed really needed? If it is, please add a comment explaining it; if not, please delete it.

Contributor Author

DONE

Collaborator

@lcy-seso left a comment

  1. Please complete all the docstrings.
  2. Restructure the code and add a configuration file: move all network-structure and learning-related hyperparameters into the configuration so that they can be changed quickly and directly, without having to look for them in the code.



def cnn_cov_group(group_input, hidden_size):
conv3 = paddle.networks.sequence_conv_pool(
Collaborator

@lcy-seso, Oct 13, 2017

  1. Following the rnn language model and GNR examples, please separate out all the hyperparameters of the network structure, including but not limited to:
    • hidden layer size; number of layers;
    • learning-related parameters such as regularization and initialization;

For new examples added under models, please keep a clear structure: users should be able to find the learning-related hyperparameters directly and modify them, without having to search through the code.

Contributor Author

Done
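
A configuration module along the lines requested here might look like the sketch below. `TrainerConfig`, `use_gpu`, `emb_size = 28`, and `hidden_size = 128` appear in hunks of this PR; the `ModelConfig` class name, the `trainer_count` attribute name (borrowed from the `paddle.init` keyword), and the `as_dict` helper are assumptions for illustration.

```python
class TrainerConfig(object):
    # Whether to use GPU in training or not.
    use_gpu = False
    # The number of computing threads.
    trainer_count = 1


class ModelConfig(object):
    # The dimension of the word embeddings.
    emb_size = 28
    # The size of the hidden layer.
    hidden_size = 128


def as_dict(config_cls):
    """Collect the public attributes of a config class into a plain dict."""
    return {
        name: value
        for name, value in vars(config_cls).items()
        if not name.startswith("_")
    }


# The training script could then read its settings from one place, e.g.
# paddle.init(use_gpu=TrainerConfig.use_gpu, trainer_count=TrainerConfig.trainer_count)
print(as_dict(TrainerConfig))
print(as_dict(ModelConfig))
```

Keeping every tunable value in such classes means a user edits one file instead of searching the network code.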

"word dictionay will be built from "
"the training data automatically."))
@click.option(
"--class_num", type=int, default=2, help="The class number (default: 2).")
Collaborator

  • Do not hard-code the number of classes. Let the user specify a label dictionary that maps label names to ids, and determine the number of classes automatically.

Contributor Author

DONE
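
The suggestion can be illustrated with a small sketch in which the class count falls out of the label dictionary. `build_label_dict` is a hypothetical helper, not the function added in the PR, and ids are assumed to be assigned in order of first appearance.

```python
def build_label_dict(labels):
    """Map each distinct label name to an integer id, in order of first appearance."""
    label_dict = {}
    for label in labels:
        if label not in label_dict:
            label_dict[label] = len(label_dict)
    return label_dict


label_dict = build_label_dict(["positive", "negative", "positive"])
class_num = len(label_dict)  # derived from the data, not hard-coded
# The reverse mapping is what inference needs to print label names.
label_reverse_dict = dict((idx, name) for name, idx in label_dict.items())
print(label_dict, class_num, label_reverse_dict)
```

With this arrangement, switching to a dataset with more classes requires no change to the code.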


dict_dim = len(word_dict)
emb_size = 28
hidden_size = 128
Collaborator

Move all the hyperparameters into a separate configuration file.

Contributor Author

DONE


logger.info("Length of word dictionary is : %d." % (dict_dim))

paddle.init(use_gpu=True, trainer_count=4)
Collaborator

  1. All examples under models should run on CPU with a single thread by default. Do not default to running on four GPUs.
  2. Put these two settings in the configuration, and note in the documentation that changing these two parameters enables multi-machine, multi-GPU training.

Contributor Author

DONE


paddle.init(use_gpu=True, trainer_count=4)

# network config
Collaborator

@lcy-seso, Oct 13, 2017

Comments in the code should also, as far as possible, be complete and grammatical sentences. Do not write just a few abbreviations that are not widely accepted.

Contributor Author

DONE

regularization=paddle.optimizer.L2Regularization(rate=1e-3),
model_average=paddle.optimizer.ModelAverage(average_window=0.5))

# create trainer
Collaborator

create the trainer instance.

Contributor Author

DONE

cost, prob, label = nest_net(
dict_dim, emb_size, hidden_size, class_num, is_infer=False)

# create parameters
Collaborator

create all the trainable parameters.

Contributor Author

DONE

update_equation=adam_optimizer)

# begin training network
feeding = {"word": 0, "label": 1}
Collaborator

This comment is in the wrong place: this line specifies the feeding dictionary, not the start of training.

Contributor Author

DONE


def _event_handler(event):
"""
Define end batch and end pass event handler
Collaborator

@lcy-seso, Oct 13, 2017

If it is a complete sentence, capitalize the first letter and end it with a period, and remember to put an article before nouns where needed.

Contributor Author

DONE

# create parameters
parameters = paddle.parameters.create(cost)

# create optimizer
Collaborator

Put the optimizer definition before the network definition.

Contributor Author

DONE

## Specifying Training Configuration Parameters

The `config.py` script contains the parameter settings for the training configuration and the model configuration. Sample code is as follows:
```
Collaborator

The `python` language tag is missing after the code fence.

```

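After modifying the launch parameters of the `train.py` script, this example can be run directly. Taking the sample data in the `data` directory as an example, run the following in a terminal: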
```bash
python train.py --train_data_dir 'data/train_data' --test_data_dir 'data/test_data' --word_dict_path 'dict.txt'
python train.py --train_data_dir 'data/train_data' --test_data_dir 'data/test_data' --word_dict_path 'word_dict.txt' --label_dict_path 'label_dict.txt'
Collaborator

Add line breaks to the command; it is too long to fit on one line.

python train.py \
  --train_data_dir 'data/train_data'  \
  --test_data_dir 'data/test_data' \
  --word_dict_path 'word_dict.txt' \
  --label_dict_path 'label_dict.txt'

```

2. Taking the sample data in the `data` directory as an example, run the following in a terminal:
```bash
python infer.py --data_path 'data/infer.txt' --word_dict_path 'dict.txt' --model_path 'models/params_pass_00000.tar.gz'
python infer.py --data_path 'data/infer.txt' --word_dict_path 'word_dict.txt' --label_dict_path 'label_dict.txt' --model_path 'models/params_pass_00000.tar.gz'
Collaborator

Add line breaks to the command; it is too long to fit on one line.

python infer.py \
  --data_path 'data/infer.txt' \
  --word_dict_path 'word_dict.txt' \
  --label_dict_path 'label_dict.txt' \
  --model_path 'models/params_pass_00000.tar.gz'


class TrainerConfig(object):

# whether to use GPU for training
Collaborator

@lcy-seso, Oct 15, 2017

whether to use GPU in training or not.


# whether to use GPU for training
use_gpu = False
# the number of threads used in one machine
Collaborator

The number of computing threads.

Covolution group definition
:param group_input: The input of this layer.
:type group_input: LayerOutput
:params hidden_size: Size of FC layer.
Collaborator

The size of the fully connected layer.

  • When a noun in a comment refers to something specific, put "the" before it.

@@ -107,6 +116,33 @@
pip install -r requirements.txt
```

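## Specifying Training Configuration Parameters

The `config.py` script contains the parameter settings for the training configuration and the model configuration. Sample code is as follows:
Collaborator

Modify the training and model configuration parameters through the config.py script. The script contains detailed explanations of the configurable parameters. An example is shown below: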

...
```
Users can set specific parameters to carry out training, for example by setting the `use_gpu` parameter to specify whether to use the GPU for training.
Collaborator

  • This sentence does not read smoothly.
  • Modify config.py to adjust the parameters. For example, change the use_gpu parameter to specify whether to use the GPU for training.

@@ -53,25 +57,30 @@ def _infer_a_batch(inferer, test_batch, ids_2_word):
word_dict = reader.imdb_word_dict()
word_reverse_dict = dict((value, key)
for key, value in word_dict.iteritems())

label_reverse_dict = {0: "positive", 1: "negative"}
Collaborator

Do not hard-code this dictionary; if it is hard-coded, things will break when users substitute their own data.

Contributor Author

This is the dictionary for the built-in imdb dataset, so it will not conflict with user-defined dictionaries. A comment has been added to explain this.

@@ -6,7 +6,7 @@
logger.setLevel(logging.INFO)


def build_dict(data_dir, save_path, use_col=1, cutoff_fre=1):
def build_word_dict(data_dir, save_path, use_col=1, cutoff_fre=1):
Collaborator

Please add docstrings to the functions in this file.

Contributor Author

DONE
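
As an illustration of the docstring style requested throughout this review, here is a sketch of a documented `build_word_dict`. Only the function signature comes from the hunk above; the body, the one-word-per-line output format, and the meaning assigned to `cutoff_fre` are assumptions and may differ from the merged code.

```python
import os
import tempfile
from collections import Counter


def build_word_dict(data_dir, save_path, use_col=1, cutoff_fre=1):
    """
    Build a word dictionary from tab-separated text files.

    :param data_dir: The directory that contains the training files.
    :type data_dir: str
    :param save_path: The path the dictionary is written to, one word per line.
    :type save_path: str
    :param use_col: The index of the column that holds the text.
    :type use_col: int
    :param cutoff_fre: Words occurring fewer times than this are dropped.
    :type cutoff_fre: int
    :return: The kept words, most frequent first.
    :rtype: list
    """
    counter = Counter()
    for name in sorted(os.listdir(data_dir)):
        with open(os.path.join(data_dir, name)) as f:
            for line in f:
                columns = line.rstrip("\n").split("\t")
                if len(columns) > use_col:
                    counter.update(columns[use_col].lower().split())
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    words = [word for word, count in ranked if count >= cutoff_fre]
    with open(save_path, "w") as f:
        for word in words:
            f.write(word + "\n")
    return words


# Demonstration on temporary files; the output lives outside the data directory.
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, "train.txt"), "w") as f:
    f.write("1\tgood good film\n0\tbad film\n")
dict_path = os.path.join(tempfile.mkdtemp(), "word_dict.txt")
all_words = build_word_dict(data_dir, dict_path)
frequent_words = build_word_dict(data_dir, dict_path, cutoff_fre=2)
print(all_words, frequent_words)
```

Each parameter gets a full sentence with an article and a closing period, matching the reviewer's comments on docstring wording.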

Collaborator

@lcy-seso left a comment

LGTM.

@lcy-seso merged commit eeb85f2 into PaddlePaddle:develop on Oct 15, 2017
@peterzhang2029 deleted the nest_text_classification branch on October 16, 2017 02:09