-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clean codes of language model. #128
Conversation
ead19c2
to
a8e4f42
Compare
a8e4f42
to
e9a0aa8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
整体不错,有一些小问题改了就可以merge了。
## RNN 语言模型 | ||
### 简介 | ||
|
||
RNN是一个序列模型,基本思路是:在时刻$t$,将前一时刻$t-1$的隐藏层输出和$t$时刻的词向量一起输入到隐藏层从而得到时刻$t$的特征表示,然后用这个特征表示得到t时刻的预测输出,如此在时间维上递归下去。可以看出RNN善于使用上文信息、历史知识,具有“记忆”功能。理论上RNN能实现“长依赖”(即利用很久之前的知识),但在实际应用中发现效果并不理想,于是出现了很多RNN的变种,如常用的LSTM和GRU,它们对传统RNN的cell进行了改进,弥补了传统RNN的不足,本例中即使用了LSTM、GRU。下图是RNN(广义上包含了LSTM、GRU等)语言模型“循环”思想的示意图: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- 用这个特征表示得到t时刻:$t$
- cell指记忆单元么?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- done.
- cell 记忆单元,已经修改成中文。
* 2,运行`python generate.py`运行文本生成。(输入的文本默认为`data/train_data_examples.txt`,生成的文本默认保存到`data/gen_result.txt`中。) | ||
|
||
|
||
**如果需要使用自己的语料、定制模型,需要修改的地方主要是`语料`和`config.py`中的配置,需要注意的细节和适配工作详情如下:** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
语料
-》语料,去掉代码格式
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
1. `gen_file`:指定输入数据文件,每行是一个句子的前缀,**需要预先分词**。 | ||
2. `gen_result`:指定输出文件路径,生成结果将写入此文件。 | ||
3. `max_gen_len`:指定每一句生成的话最长长度,如果模型无法生成出`<e>`,当生成 `max_gen_len` 个词语后,生成过程会自动终止。 | ||
4. `beam_size`:Beam Search 算法每一步的展开宽度 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
136行少了一个句号。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
- 第二列是输入的前缀。 | ||
2. 第二 ~ `beam_size + 1` 行是生成结果,同样以 `\t` 分隔为两列: | ||
- 第一列是该生成序列的对数概率(log probability) | ||
- 第二列是生成的文本序列,正常的生成结果会以符号`<e>`结尾,如果没有以`<e>`结尾,意味着超过了最大序列长度,生成强制终止 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
158,161,162行少了句号。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
1. `<unk>`:不出现在字典中的词 | ||
2. `<e>`:句子的结束符 | ||
|
||
*注:需要注意的是,词典越大生成的内容越丰富,但训练耗时越久。一般中文分词之后,语料中不同的词能有几万乃至几十万,如果`max_word_num`取值过小则导致`<unk>`占比过高,如果`max_word_num`取值较大,则严重影响训练速度(对精度也有影响)。所以,也有“按字”训练模型的方式,即:把每个汉字当做一个词,常用汉字也就几千个,使得字典的大小不会太大、不会丢失太多信息,但汉语中同一个字在不同词中语义相差很大,有时导致模型效果不理想。建议多试试、根据实际情况选择是“按词训练”还是“按字训练”。* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这一段注写的很不错。是否考虑每个readme都写点类似的“编者按”、“XX说”、”XX分享“之类的私货分享呢?或者开个专栏也可以。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
好的~ 可以找一个合适的地方。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follow comments.
## RNN 语言模型 | ||
### 简介 | ||
|
||
RNN是一个序列模型,基本思路是:在时刻$t$,将前一时刻$t-1$的隐藏层输出和$t$时刻的词向量一起输入到隐藏层从而得到时刻$t$的特征表示,然后用这个特征表示得到t时刻的预测输出,如此在时间维上递归下去。可以看出RNN善于使用上文信息、历史知识,具有“记忆”功能。理论上RNN能实现“长依赖”(即利用很久之前的知识),但在实际应用中发现效果并不理想,于是出现了很多RNN的变种,如常用的LSTM和GRU,它们对传统RNN的cell进行了改进,弥补了传统RNN的不足,本例中即使用了LSTM、GRU。下图是RNN(广义上包含了LSTM、GRU等)语言模型“循环”思想的示意图: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- done.
- cell 记忆单元,已经修改成中文。
* 2,运行`python generate.py`运行文本生成。(输入的文本默认为`data/train_data_examples.txt`,生成的文本默认保存到`data/gen_result.txt`中。) | ||
|
||
|
||
**如果需要使用自己的语料、定制模型,需要修改的地方主要是`语料`和`config.py`中的配置,需要注意的细节和适配工作详情如下:** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
1. `<unk>`:不出现在字典中的词 | ||
2. `<e>`:句子的结束符 | ||
|
||
*注:需要注意的是,词典越大生成的内容越丰富,但训练耗时越久。一般中文分词之后,语料中不同的词能有几万乃至几十万,如果`max_word_num`取值过小则导致`<unk>`占比过高,如果`max_word_num`取值较大,则严重影响训练速度(对精度也有影响)。所以,也有“按字”训练模型的方式,即:把每个汉字当做一个词,常用汉字也就几千个,使得字典的大小不会太大、不会丢失太多信息,但汉语中同一个字在不同词中语义相差很大,有时导致模型效果不理想。建议多试试、根据实际情况选择是“按词训练”还是“按字训练”。* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
好的~ 可以找一个合适的地方。
1. `gen_file`:指定输入数据文件,每行是一个句子的前缀,**需要预先分词**。 | ||
2. `gen_result`:指定输出文件路径,生成结果将写入此文件。 | ||
3. `max_gen_len`:指定每一句生成的话最长长度,如果模型无法生成出`<e>`,当生成 `max_gen_len` 个词语后,生成过程会自动终止。 | ||
4. `beam_size`:Beam Search 算法每一步的展开宽度 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
- 第二列是输入的前缀。 | ||
2. 第二 ~ `beam_size + 1` 行是生成结果,同样以 `\t` 分隔为两列: | ||
- 第一列是该生成序列的对数概率(log probability) | ||
- 第二列是生成的文本序列,正常的生成结果会以符号`<e>`结尾,如果没有以`<e>`结尾,意味着超过了最大序列长度,生成强制终止 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
word embedding
in PaddleBook, andhsigmod
, andNCE
in Paddle Models.language model
intogenerate_sequence_by_rnn_lm
to make it more accurate.