clean codes of language model. #128

lcy-seso · 2017-06-26T09:26:34Z

Remove the n-gram language model, because it duplicates the word embedding chapter in PaddleBook and the hsigmoid and NCE examples in Paddle Models.
Refactor the RNN language model and beam search.
Rename the directory from language model to generate_sequence_by_rnn_lm to make the name more accurate.

luotao1

Looks good overall. There are a few small issues; once they are fixed this can be merged.

luotao1 · 2017-06-28T06:34:17Z

generate_sequence_by_rnn_lm/README.md

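+## RNN Language Model
+### Introduction
+
+An RNN is a sequence model. The basic idea: at time step $t$, the hidden-layer output from the previous time step $t-1$ and the word vector at time step $t$ are fed into the hidden layer together to obtain the feature representation at time step $t$; this feature representation is then used to produce the predicted output at time t, and the process recurses along the time dimension. An RNN is therefore good at using preceding context and historical knowledge, and has a "memory" capability. In theory an RNN can capture "long-term dependencies" (that is, use knowledge from long ago), but in practice the results are not ideal, which led to many RNN variants such as the commonly used LSTM and GRU. They improve on the cell of the traditional RNN and make up for its shortcomings; this example uses LSTM and GRU. The figure below illustrates the "recurrent" idea behind RNN language models (broadly including LSTM, GRU, and so on):


In "this feature representation is then used to produce the predicted output at time t", write t as $t$.

Does "cell" mean the memory cell?

lcy-seso · 2017-06-28T07:12:17Z

Done.

Yes, "cell" is the memory cell; it has been changed to the Chinese term.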
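
The hunk above describes the recurrence only in words. One common way to write it down, assuming a plain Elman-style RNN with a softmax output layer (the symbols $x_t$, $h_t$, $y_t$, $W$, $U$, $V$, $b$, $c$ and the choice of $\tanh$ are assumptions for illustration, not taken from the README), is:

$$
\begin{aligned}
h_t &= \tanh\left(W x_t + U h_{t-1} + b\right) \\
y_t &= \mathrm{softmax}\left(V h_t + c\right)
\end{aligned}
$$

Here $x_t$ is the word vector at time step $t$, $h_{t-1}$ is the hidden-layer output from the previous time step, $h_t$ is the feature representation at time step $t$, and $y_t$ is the predicted distribution over the next word. LSTM and GRU replace the first line with a gated update of the memory cell.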

luotao1 · 2017-06-28T06:40:50Z

generate_sequence_by_rnn_lm/README.md

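+* 2. Run `python generate.py` to generate text. (The input text defaults to `data/train_data_examples.txt`, and the generated text is saved to `data/gen_result.txt` by default.)
+
+
+**To use your own corpus or customize the model, the main things to change are the `corpus` and the settings in `config.py`. The details to watch for and the required adaptations are as follows:**


`corpus` -> corpus: remove the code formatting.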

luotao1 · 2017-06-28T06:43:25Z

generate_sequence_by_rnn_lm/README.md

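+    1. `gen_file`: the input data file. Each line is the prefix of a sentence and **must be word-segmented in advance**.
+    2. `gen_result`: the output file path; the generated results are written to this file.
+    3. `max_gen_len`: the maximum length of each generated sentence. If the model fails to generate `<e>`, generation stops automatically after `max_gen_len` words.
+    4. `beam_size`: the expansion width of the Beam Search algorithm at each step


Line 136 is missing a period.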
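
To make the roles of `beam_size` and `max_gen_len` in this hunk concrete, here is a minimal, framework-free sketch of beam search. It is a simplified variant for illustration, not the code in `generate.py`; `beam_search`, `next_word_probs`, and `toy_model` are hypothetical names, and the toy probabilities are made up.

```python
import math

END = "<e>"  # end-of-sentence mark described in the README


def beam_search(prefix, next_word_probs, beam_size, max_gen_len):
    """Return up to beam_size (log_prob, words) pairs, best first.

    next_word_probs is a hypothetical callback: it takes the words so far
    and returns a dict {word: probability} for the next word.
    """
    beams = [(0.0, list(prefix))]  # (log probability, words so far)
    finished = []
    for _ in range(max_gen_len):
        candidates = []
        for log_prob, words in beams:
            for word, prob in next_word_probs(words).items():
                if prob > 0.0:
                    candidates.append((log_prob + math.log(prob), words + [word]))
        # Keep only the beam_size most probable expansions at this step.
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:
            (finished if cand[1][-1] == END else beams).append(cand)
        if not beams:
            break
    # Hypotheses still open here reached max_gen_len without producing <e>.
    results = finished + beams
    results.sort(key=lambda item: item[0], reverse=True)
    return results[:beam_size]


def toy_model(words):
    """Made-up next-word distribution that depends only on the last word."""
    if words[-1] == "a":
        return {"b": 0.6, "c": 0.4}
    if words[-1] == "b":
        return {END: 0.9, "b": 0.1}
    return {END: 0.5, "c": 0.5}


for log_prob, words in beam_search(["a"], toy_model, beam_size=2, max_gen_len=3):
    print("%.4f\t%s" % (log_prob, " ".join(words)))
```

A model that never emits `<e>` still terminates after `max_gen_len` steps, which is the forced stop that item 3 describes.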

luotao1 · 2017-06-28T06:44:07Z

generate_sequence_by_rnn_lm/README.md

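+        - The second column is the input prefix.
+    2. Lines 2 through `beam_size + 1` are the generated results, likewise split into two columns by `\t`:
+        - The first column is the log probability of the generated sequence
+        - The second column is the generated text sequence. A normal result ends with the symbol `<e>`; if it does not end with `<e>`, the maximum sequence length was exceeded and generation was forcibly terminated


Lines 158, 161, and 162 are missing periods.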
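
The output layout described in this hunk can be read back with a few lines of Python. This is a sketch under stated assumptions: every prefix yields exactly `beam_size` result lines, blank lines carry no meaning, and words in the generated text are separated by spaces. The meaning of the first column of the header line is not visible in the hunk, so it is kept as an opaque string. `parse_gen_result` is a hypothetical helper, not part of the PR.

```python
def parse_gen_result(lines, beam_size):
    """Group lines into (first_col, prefix, [(log_prob, text, finished), ...])."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    group_len = beam_size + 1  # one header line plus beam_size results
    groups = []
    for start in range(0, len(rows), group_len):
        header, *results = rows[start:start + group_len]
        first_col, prefix = header.split("\t", 1)
        parsed = []
        for row in results:
            log_prob, text = row.split("\t", 1)
            # A result that does not end with <e> was cut off at max_gen_len.
            finished = text.split()[-1:] == ["<e>"]
            parsed.append((float(log_prob), text, finished))
        groups.append((first_col, prefix, parsed))
    return groups


# Made-up sample in the described layout, for illustration only.
sample = ["0\tI am\n", "-1.5\tI am fine <e>\n", "-2.25\tI am a a a\n"]
for first_col, prefix, results in parse_gen_result(sample, beam_size=2):
    print(prefix, results)
```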

luotao1 · 2017-06-28T06:47:51Z

generate_sequence_by_rnn_lm/README.md

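+    1. `<unk>`: a word that does not appear in the dictionary
+    2. `<e>`: the end-of-sentence mark
+
+    *Note: the larger the dictionary, the richer the generated content, but the longer training takes. After Chinese word segmentation, a corpus typically contains tens of thousands to hundreds of thousands of distinct words. If `max_word_num` is too small, the share of `<unk>` becomes too high; if `max_word_num` is large, training speed suffers badly (and accuracy is affected too). For this reason, models are sometimes trained "by character": each Chinese character is treated as a word. There are only a few thousand common characters, so the dictionary stays small and little information is lost, but the same character can mean very different things in different words, which sometimes hurts model quality. Try both and choose between "word-level" and "character-level" training based on your actual situation.*


This note is very well written. Would you consider adding something similar to every README, such as an "editor's note", "XX says", or "XX shares" section for personal insights? Starting a dedicated column would also work.

lcy-seso · 2017-06-28T07:13:45Z

OK~ We can find a suitable place for it.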
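
The trade-off in the note above (dictionary size versus `<unk>` share, word-level versus character-level) can be checked on a corpus before training. The sketch below is illustrative only: `tokenize`, `build_dict`, and `unk_ratio` are hypothetical helpers, and counting the two special marks toward `max_word_num` is an assumption, not something the README states.

```python
from collections import Counter

UNK, END = "<unk>", "<e>"  # special marks listed in the README


def tokenize(sentence, by_char=False):
    """Split a pre-segmented sentence into words, or into single characters."""
    return list(sentence.replace(" ", "")) if by_char else sentence.split()


def build_dict(sentences, max_word_num, by_char=False):
    """Keep the most frequent tokens; everything else will map to <unk>."""
    counts = Counter(t for s in sentences for t in tokenize(s, by_char))
    # Assumption: the two special marks count toward max_word_num.
    kept = [tok for tok, _ in counts.most_common(max(max_word_num - 2, 0))]
    return {tok: idx for idx, tok in enumerate(kept + [UNK, END])}


def unk_ratio(sentences, word_dict, by_char=False):
    """Fraction of corpus tokens that fall outside the dictionary."""
    tokens = [t for s in sentences for t in tokenize(s, by_char)]
    return sum(t not in word_dict for t in tokens) / len(tokens)


corpus = ["我 喜欢 自然 语言", "我 喜欢 语言 模型"]  # toy pre-segmented corpus
word_dict = build_dict(corpus, max_word_num=5)
char_dict = build_dict(corpus, max_word_num=100, by_char=True)
print(unk_ratio(corpus, word_dict), unk_ratio(corpus, char_dict, by_char=True))
```

Comparing the two ratios on real data gives a quick signal for whether a given `max_word_num` is too small for word-level training.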

lcy-seso

follow comments.

clean codes of language model.

6b0f946

lcy-seso requested a review from luotao1 June 26, 2017 09:26

lcy-seso force-pushed the clean_rnn_lm_codes branch from ead19c2 to a8e4f42 Compare June 26, 2017 10:19

follow comments and rename the directory.

e9a0aa8

lcy-seso force-pushed the clean_rnn_lm_codes branch from a8e4f42 to e9a0aa8 Compare June 26, 2017 10:24

lcy-seso requested a review from llxxxll June 26, 2017 10:30

lcy-seso self-assigned this Jun 26, 2017

luotao1 approved these changes Jun 28, 2017

follow comments.

7d3a8cd

lcy-seso commented Jun 28, 2017

lcy-seso merged commit c075ae2 into PaddlePaddle:develop Jun 28, 2017

lcy-seso deleted the clean_rnn_lm_codes branch June 28, 2017 09:09
