Add README for Transformer #994

Merged: 4 commits, Jul 3, 2018


guoshengCS

Add README for Transformer

guoshengCS force-pushed the add-transformer-README branch 7 times, most recently from 0340f27 to 54ee859 (June 20, 2018 12:27)
kuke requested review from kuke and ktlichkid (June 28, 2018 04:20)
```sh
perl multi_bleu.perl data/newstest2013.tok.de < prdict.tok.txt
```
You can see the following evaluation results.

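"You can see the following evaluation results" -> "You can see results similar to the following", because the evaluation results a user gets from their own training run are not fixed.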

```sh
paste -d '\t' train.tok.clean.bpe.32000.en train.tok.clean.bpe.32000.de > train.tok.clean.bpe.32000.en-de
```
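In addition, three special tokens must be added to the vocabulary file: `<s>`, `<e>`, and `<unk>`, which mark the start of a sequence, the end of a sequence, and an out-of-vocabulary word, respectively.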

Is there a command that inserts these three special tokens automatically?
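
One possible way to automate this, as a minimal sketch and not necessarily the command the README ends up using: the shell function below writes the three tokens followed by the original vocabulary into a new file. The function name `add_special_tokens`, the demo file names, and the choice to place the tokens at the top of the file are assumptions made for illustration; the quoted README text only says that the tokens must be added.

```sh
# Hypothetical helper: prepend the three special tokens to a vocabulary file.
# Putting them at the top of the file is an assumption, not something the
# quoted README text specifies.
add_special_tokens() {
    vocab_in="$1"
    vocab_out="$2"
    { printf '%s\n' '<s>' '<e>' '<unk>'; cat "$vocab_in"; } > "$vocab_out"
}

# Demo on a tiny stand-in vocabulary (the real input would be vocab.bpe.32000).
printf '%s\n' 'the' 'Haus' 'er@@' > vocab.demo
add_special_tokens vocab.demo vocab.demo.special
```

Writing to a separate output file keeps the downloaded vocabulary untouched, so the step can be rerun without inserting the tokens twice.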


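### Data preparation

Here we use the [WMT'16 EN-DE dataset](http://www.statmt.org/wmt16/translation-task.html) and, following the setup in the paper, work with data encoded with BPE (byte-pair encoding) [4], a representation that handles out-of-vocabulary (OOV) words better. The BPE data can be downloaded by following the instructions [here](https://github.com/google/seq2seq/blob/master/docs/data.md). After downloading and extracting it, `train.tok.clean.bpe.32000.en` and `train.tok.clean.bpe.32000.de` are the BPE-encoded training data (a parallel corpus, English and German respectively); `newstest2013.tok.bpe.32000.en`, `newstest2013.tok.bpe.32000.de`, and similar files are the test data (`newstest2013.tok.en`, `newstest2013.tok.de`, and the like are the corresponding test data without BPE); and `vocab.bpe.32000` is the vocabulary file, shared by the source and target languages.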
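
Because the two training files form a parallel corpus, they are expected to line up one sentence per line. A quick sanity check before merging them could look like the sketch below. The function name `check_parallel` and the demo file names are hypothetical, and the check only compares line counts; it does not verify that the sentences are actually translations of each other.

```sh
# Hypothetical check: succeed only if both files have the same number of lines.
check_parallel() {
    src_lines=$(wc -l < "$1" | tr -d ' ')
    tgt_lines=$(wc -l < "$2" | tr -d ' ')
    [ "$src_lines" -eq "$tgt_lines" ]
}

# Demo on tiny stand-in files (the real inputs would be
# train.tok.clean.bpe.32000.en and train.tok.clean.bpe.32000.de).
printf '%s\n' 'a house' 'a cat' > demo.en
printf '%s\n' 'ein Haus' 'eine Katze' > demo.de
check_parallel demo.en demo.de && echo "line counts match"
```

A mismatch here would mean that a line-by-line merge pairs some source sentences with the wrong targets, so it is worth catching before training.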

This part may need to be adjusted to cover multiple data formats.

kuke requested review from shanyi15, reyoung and guochaorong and removed request for reyoung, guochaorong and shanyi15 (June 29, 2018 08:31)

shanyi15 commented Jul 2, 2018

format LGTM

kuke left a comment

LGTM

guoshengCS merged commit b0dc90d into PaddlePaddle:develop on Jul 3, 2018