Finish generating chinese poetry #439

Merged
merged 14 commits into from
Nov 20, 2017
4 changes: 2 additions & 2 deletions generate_chinese_poetry/README.md
@@ -34,12 +34,12 @@ python preprocess.py --datadir data/raw --outfile data/poems.txt --dictfile data
```

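After the script above finishes, it produces the processed training data poems.txt and the data dictionary dict.txt. Each line of poems.txt describes one Tang poem in three columns: title, author, and poem content.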
Collaborator

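  1. "数据字典" (data dictionary) --> "字典" (dictionary).
  2. How is the dictionary built by default? By word segmentation or character by character? Are character frequencies counted, and what is the default cutoff frequency? Please provide this basic information.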

Contributor Author

README updated: added a description of how the dictionary is built.

In the poem content, lines of the poem are separated by '.'.
In the poem content, lines of the poem are separated by `.`.
Collaborator

After splitting on ".", how is the training data constructed? Which part is the source and which is the target? Please explain the data strategy.

Contributor Author

README updated: added a brief description of how the training data is constructed.


Training data example:
```text
登鸛雀樓 王之渙 白日依山盡,黃河入海流.欲窮千里目,更上一層樓
觀獵 李白 太守耀清威,乘閑弄晚暉.江沙橫獵騎,山火遶行圍.箭逐雲鴻落,鷹隨月兔飛.不知白日暮,歡賞夜方歸
觀獵 李白 太守耀清威,乘閑弄晚暉.江沙橫獵騎,山火遶行圍.箭逐雲鴻落,鷹隨月兔飛.不知白日暮,歡賞夜方歸
晦日重宴 陳嘉言 高門引冠蓋,下客抱支離.綺席珍羞滿,文場翰藻摛.蓂華彫上月,柳色藹春池.日斜歸戚里,連騎勒金羈
```
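
The three-column layout and the `.` separator described above can be read back with a few lines of Python. This is only a sketch: the tab column separator, the names `Poem`, `parse_poem_line`, and `make_pairs`, and the choice to pair each poem line (source) with the line that follows it (target) are assumptions made for illustration, not something confirmed by the README text or code shown in this diff.

```python
from typing import List, NamedTuple, Tuple


class Poem(NamedTuple):
    title: str
    author: str
    sentences: List[str]


def parse_poem_line(line: str, column_sep: str = "\t", sentence_sep: str = ".") -> Poem:
    """Split one line of poems.txt into title, author, and poem lines.

    Assumes tab-separated columns; pass column_sep=" " for space-separated data.
    """
    title, author, content = line.rstrip("\n").split(column_sep, 2)
    sentences = [s for s in content.split(sentence_sep) if s]
    return Poem(title, author, sentences)


def make_pairs(poem: Poem) -> List[Tuple[str, str]]:
    """Pair each poem line (assumed source) with the line after it (assumed target)."""
    return list(zip(poem.sentences, poem.sentences[1:]))


poem = parse_poem_line("登鸛雀樓\t王之渙\t白日依山盡,黃河入海流.欲窮千里目,更上一層樓")
pairs = make_pairs(poem)
print(poem.title, poem.author, len(poem.sentences))
print(pairs)
```

Under this pairing assumption, a poem with four `.`-separated lines yields three (source, target) pairs, which matches the generation example later in the diff, where one line is given as input to predict the next.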

6 changes: 0 additions & 6 deletions generate_chinese_poetry/data/dict.txt
@@ -1,12 +1,6 @@
<<<<<<< HEAD
<s>
<e>
<unk>
Collaborator

Don't put the dictionary on GitHub; it can be built automatically by the script.

=======
<unk>
<s>
<e>
>>>>>>> 7943732ab34254df801d72b0b5e04f6f320e4127
16 changes: 2 additions & 14 deletions generate_chinese_poetry/index.html
@@ -76,12 +76,12 @@
```

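After the script above finishes, it produces the processed training data poems.txt and the data dictionary dict.txt. Each line of poems.txt describes one Tang poem in three columns: title, author, and poem content.
In the poem content, lines of the poem are separated by '.'.
In the poem content, lines of the poem are separated by `.`.

Training data example: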
```text
登鸛雀樓 王之渙 白日依山盡,黃河入海流.欲窮千里目,更上一層樓
觀獵 李白 太守耀清威,乘閑弄晚暉.江沙橫獵騎,山火遶行圍.箭逐雲鴻落,鷹隨月兔飛.不知白日暮,歡賞夜方歸
觀獵 李白 太守耀清威,乘閑弄晚暉.江沙橫獵騎,山火遶行圍.箭逐雲鴻落,鷹隨月兔飛.不知白日暮,歡賞夜方歸
晦日重宴 陳嘉言 高門引冠蓋,下客抱支離.綺席珍羞滿,文場翰藻摛.蓂華彫上月,柳色藹春池.日斜歸戚里,連騎勒金羈
```

@@ -120,11 +120,7 @@
### Running training
```bash
python train.py \
<<<<<<< HEAD
--num_passes 20 \
=======
--num_passes 10 \
>>>>>>> 7943732ab34254df801d72b0b5e04f6f320e4127
--batch_size 256 \
--use_gpu True \
--trainer_count 1 \
@@ -172,16 +168,11 @@
For example, save the line `白日依山盡,黃河入海流` in the file `input.txt` as the input for predicting the next line of the poem, then run:
```bash
python generate.py \
<<<<<<< HEAD
--model_path models/pass_00014.tar.gz \
=======
--model_path models/pass_00100.tar.gz \
>>>>>>> 7943732ab34254df801d72b0b5e04f6f320e4127
--word_dict_path data/dict.txt \
--test_data_path input.txt \
--save_file output.txt
```
<<<<<<< HEAD
The generated results are saved in the file `output.txt`. For the example input above, the generated lines are:
```text
-21.2048 不 知 身 外 事 , 何 處 是 閑 遊
@@ -190,9 +181,6 @@
-21.7312 不 知 身 外 事 , 何 事 是 何 求
-22.1956 不 知 身 外 事 , 何 處 是 人 愁
```
=======
The generated results are saved in the file `output.txt`.
>>>>>>> 7943732ab34254df801d72b0b5e04f6f320e4127
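
Each line of the sample output above starts with a number followed by space-separated characters. A small sketch for reading such lines back is given here. It assumes the leading number is a score where a higher value (closer to zero) is better, which the text in this diff does not state, and `parse_generated_line` is a hypothetical helper, not part of `generate.py`.

```python
from typing import Tuple


def parse_generated_line(line: str) -> Tuple[float, str]:
    """Split 'score ch ch ch ...' into (score, characters joined without spaces)."""
    score_text, *chars = line.split()
    return float(score_text), "".join(chars)


lines = [
    "-21.7312 不 知 身 外 事 , 何 事 是 何 求",
    "-21.2048 不 知 身 外 事 , 何 處 是 閑 遊",
]
candidates = [parse_generated_line(text_line) for text_line in lines]

# Pick the candidate with the highest score (assumed to mean "most likely").
best_score, best_text = max(candidates, key=lambda item: item[0])
print(best_score, best_text)
```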

</div>
<!-- You can change the lines below now. -->
4 changes: 0 additions & 4 deletions generate_chinese_poetry/preprocess.py
@@ -16,11 +16,7 @@ def build_vocabulary(dataset, cutoff=0):
dictionary = filter(lambda x: x[1] >= cutoff, dictionary.items())
dictionary = sorted(dictionary, key=lambda x: (-x[1], x[0]))
vocab, _ = list(zip(*dictionary))
<<<<<<< HEAD
return (u"<s>", u"<e>", u"<unk>") + vocab
=======
return (u"<unk>", u"<s>", u"<e>") + vocab
>>>>>>> 7943732ab34254df801d72b0b5e04f6f320e4127
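
The `build_vocabulary` fragment above shows only the filtering and sorting steps. A self-contained sketch of the whole function might look like the following. It assumes characters (punctuation included) are counted one at a time, which is not visible in the fragment, and it takes a simplified input (a list of poems, each a list of lines) rather than the real `dataset` structure. The special tokens are placed in the `<s>`, `<e>`, `<unk>` order; the diff shows both orders, so that choice is arbitrary here.

```python
import collections


def build_vocabulary(poems, cutoff=0, specials=("<s>", "<e>", "<unk>")):
    """Build a character vocabulary sorted by descending frequency.

    poems: iterable of poems, each a list of poem lines (str).
    cutoff: characters seen fewer than `cutoff` times are dropped.
    """
    counts = collections.Counter()
    for sentences in poems:
        for sentence in sentences:
            counts.update(sentence)  # iterating a str counts each character
    kept = [(char, n) for char, n in counts.items() if n >= cutoff]
    # Most frequent first; ties broken by the character itself.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return tuple(specials) + tuple(char for char, _ in kept)


poems = [["白日依山盡,黃河入海流", "欲窮千里目,更上一層樓"]]
vocab = build_vocabulary(poems, cutoff=1)
print(len(vocab), vocab[:4])
```

Raising `cutoff` shrinks the vocabulary; characters that are dropped would then need to map to `<unk>` when the data is encoded.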


@click.command("preprocess")