Add doc for english LM. #295

pkuyym · 2017-09-19T08:45:41Z

pkuyym · 2017-09-19T08:57:46Z

deep_speech_2/README.md

+  * Repeated whitespace are squeezed to one and the beginning whitespace are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercases.
+  * Top 400000 words by frequency are selected to build the vocabulary and all words not in the vocabulary are replaced with 'UNKNOWNWORD'.
+
+Now the preprocessing is done and we get a clean corpus to train the language model. Our released language model are pruned by '0 1 1 1 1'. To save disk storage we convert the arpa file to 'trie' binary file with parameters '-a 22 -q 8 -b 8'.


Our released language model is pruned with '0 1 1 1 1', and the maximum n-gram order is 5.

xinghai-sun · 2017-09-19T08:48:57Z

deep_speech_2/README.md

+
+The english corpus is from the [Common Crawl Repository](http://commoncrawl.org) and you can download it from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our english languge model. There are some preprocessing steps before training:
+
+  * Characters which not in [A-Za-z0-9\s'] are removed and arabic numbers are converted to english numbers like 1000 to one thousand.


Remove "which".
Why `\s`? Whitespace?

xinghai-sun · 2017-09-19T08:50:45Z

deep_speech_2/README.md

+The english corpus is from the [Common Crawl Repository](http://commoncrawl.org) and you can download it from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our english languge model. There are some preprocessing steps before training:
+
+  * Characters which not in [A-Za-z0-9\s'] are removed and arabic numbers are converted to english numbers like 1000 to one thousand.
+  * Repeated whitespace are squeezed to one and the beginning whitespace are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercases.


--> beginning and trailing ?
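What is the plural form of "whitespace"?
Should the two occurrences of "lowercase" be singular or plural?

Trailing whitespace has little impact (for example, the newline at the end of a sentence), so there is no need to remove it.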

xinghai-sun · 2017-09-19T08:55:27Z

deep_speech_2/README.md

+
+  * Characters which not in [A-Za-z0-9\s'] are removed and arabic numbers are converted to english numbers like 1000 to one thousand.
+  * Repeated whitespace are squeezed to one and the beginning whitespace are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercases.
+  * Top 400000 words by frequency are selected to build the vocabulary and all words not in the vocabulary are replaced with 'UNKNOWNWORD'.


---> Top 400,000 most frequent words are ....and the rest are replaced with ...
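
For readers following this thread, the three preprocessing bullets quoted above can be sketched roughly as follows. This is a minimal illustration, not the script used for the released model: the function names (`number_to_words`, `clean_line`, `build_vocab`, `replace_oov`) are hypothetical, the number speller only aims at ordinary non-negative integers, and the order of the steps is an assumption. Trailing whitespace is left alone, as discussed in the review.

```python
import re
from collections import Counter

_ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
         "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
_TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()


def number_to_words(n):
    """Spell out a non-negative integer (hypothetical helper, simple English)."""
    if n < 20:
        return _ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return _TENS[tens] + (" " + _ONES[ones] if ones else "")
    for value, name in ((10**6, "million"), (1000, "thousand"), (100, "hundred")):
        if n >= value:
            head, rest = divmod(n, value)
            words = number_to_words(head) + " " + name
            return words + (" " + number_to_words(rest) if rest else "")


def clean_line(line):
    """Apply the character filter, number spelling, whitespace squeeze, lowercasing."""
    # Keep only letters, digits, whitespace and apostrophes.
    line = re.sub(r"[^A-Za-z0-9\s']", "", line)
    # Replace each run of digits with its spelled-out English form.
    line = re.sub(r"\d+", lambda m: number_to_words(int(m.group())), line)
    # Squeeze repeated whitespace to one space and drop leading whitespace only.
    line = re.sub(r"\s+", " ", line).lstrip()
    return line.lower()


def build_vocab(lines, size=400000):
    """Return the `size` most frequent words over already-cleaned lines."""
    counts = Counter(word for line in lines for word in line.split())
    return {word for word, _ in counts.most_common(size)}


def replace_oov(line, vocab, unk="UNKNOWNWORD"):
    """Replace every word outside the vocabulary with the unknown-word token."""
    return " ".join(word if word in vocab else unk for word in line.split())
```

In use, one would clean every line of the corpus first, build the vocabulary from the cleaned lines, and then run `replace_oov` over them before handing the result to the language-model trainer.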

xinghai-sun · 2017-09-19T08:59:08Z

deep_speech_2/README.md

+  * Repeated whitespace are squeezed to one and the beginning whitespace are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercases.
+  * Top 400000 words by frequency are selected to build the vocabulary and all words not in the vocabulary are replaced with 'UNKNOWNWORD'.
+
+Now the preprocessing is done and we get a clean corpus to train the language model. Our released language model are pruned by '0 1 1 1 1'. To save disk storage we convert the arpa file to 'trie' binary file with parameters '-a 22 -q 8 -b 8'.


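"0 1 1 1 1" is hard to follow for readers who are not familiar with KenLM. Could you state which parameter it belongs to and explain it briefly? The same goes for "-a -q -b".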
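
As a hedged illustration of where these values plausibly go: KenLM's `lmplz` accepts `-o` for the n-gram order and `--prune` for per-order count thresholds, and `build_binary` accepts `-a`, `-q` and `-b` when building a `trie` model. The PR does not show the exact commands, so the sketch below is an assumption about how the quoted values map onto those tools; the helper names and file names are hypothetical. The functions only assemble argument lists and do not run anything.

```python
def lmplz_command(order=5, prune=(0, 1, 1, 1, 1)):
    """Assemble an lmplz invocation (reads the corpus on stdin, writes ARPA to stdout).

    --prune takes one count threshold per n-gram order: n-grams whose count is
    less than or equal to the threshold are dropped. With 0 1 1 1 1, unigrams
    are kept as they are and singleton 2- to 5-grams are pruned.
    """
    return ["lmplz", "-o", str(order), "--prune"] + [str(t) for t in prune]


def build_binary_command(arpa_path, binary_path):
    """Assemble a build_binary invocation for a quantized trie model.

    -q 8 and -b 8 quantize probabilities and backoffs to 8 bits each, and
    -a 22 allows up to 22 bits of pointer compression in the trie arrays.
    """
    return ["build_binary", "-a", "22", "-q", "8", "-b", "8",
            "trie", arpa_path, binary_path]


if __name__ == "__main__":
    # File names here are placeholders, not the names used for the released model.
    print(" ".join(lmplz_command()) + " < corpus.txt > lm.arpa")
    print(" ".join(build_binary_command("lm.arpa", "lm.binary")))
```

Keeping the commands as argument lists makes it easy to pass them to `subprocess.run` later, once the KenLM binaries are confirmed to be on the path.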

xinghai-sun · 2017-09-19T09:01:03Z

deep_speech_2/README.md

@@ -219,6 +219,18 @@ sh download_lm_ch.sh
 ```
 If you wish to train your own better language model, please refer to [KenLM](https://github.com/kpu/kenlm) for tutorials.

+Here we provide some tips to show how we prepearing our english and mandarin language models.
+
+#### English LM


"Prepare Your Own English LM"? That would distinguish it from the preceding part.
Also add `#### Download LMs` at the very beginning of the preceding part.

Add doc for english LM.

35034a3

pkuyym requested a review from xinghai-sun September 19, 2017 08:45

pkuyym commented Sep 19, 2017

xinghai-sun requested changes Sep 19, 2017

xinghai-sun approved these changes Sep 19, 2017

Refine doc.

2cff5b5

pkuyym merged commit e99d6f4 into PaddlePaddle:develop Sep 19, 2017
