-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add doc for english LM. #295
Conversation
deep_speech_2/README.md
Outdated
* Repeated whitespace are squeezed to one and the beginning whitespace are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercases. | ||
* Top 400000 words by frequency are selected to build the vocabulary and all words not in the vocabulary are replaced with 'UNKNOWNWORD'. | ||
|
||
Now the preprocessing is done and we get a clean corpus to train the language model. Our released language model are pruned by '0 1 1 1 1'. To save disk storage we convert the arpa file to 'trie' binary file with parameters '-a 22 -q 8 -b 8'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our released language model are pruned by '0 1 1 1 1' and the max order of n-gram is 5.
deep_speech_2/README.md
Outdated
|
||
The english corpus is from the [Common Crawl Repository](http://commoncrawl.org) and you can download it from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our english languge model. There are some preprocessing steps before training: | ||
|
||
* Characters which not in [A-Za-z0-9\s'] are removed and arabic numbers are converted to english numbers like 1000 to one thousand. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove which
why \s
? whitespace?
deep_speech_2/README.md
Outdated
The english corpus is from the [Common Crawl Repository](http://commoncrawl.org) and you can download it from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our english languge model. There are some preprocessing steps before training: | ||
|
||
* Characters which not in [A-Za-z0-9\s'] are removed and arabic numbers are converted to english numbers like 1000 to one thousand. | ||
* Repeated whitespace are squeezed to one and the beginning whitespace are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--> beginning and trailing ?
whitespace 的复数形式是?
两处lowercase的单复数形式?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
trailing的whitespace影响不大,比如句尾的换行符,没有必要去掉
deep_speech_2/README.md
Outdated
|
||
* Characters which not in [A-Za-z0-9\s'] are removed and arabic numbers are converted to english numbers like 1000 to one thousand. | ||
* Repeated whitespace are squeezed to one and the beginning whitespace are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercases. | ||
* Top 400000 words by frequency are selected to build the vocabulary and all words not in the vocabulary are replaced with 'UNKNOWNWORD'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
---> Top 400,000 most frequent words are ....and the rest are replaced with ...
deep_speech_2/README.md
Outdated
* Repeated whitespace are squeezed to one and the beginning whitespace are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercases. | ||
* Top 400000 words by frequency are selected to build the vocabulary and all words not in the vocabulary are replaced with 'UNKNOWNWORD'. | ||
|
||
Now the preprocessing is done and we get a clean corpus to train the language model. Our released language model are pruned by '0 1 1 1 1'. To save disk storage we convert the arpa file to 'trie' binary file with parameters '-a 22 -q 8 -b 8'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
”0 1 1 1 1“ 不熟悉kenlm的人看比较费解,能给出是哪个参数并稍微解释下吗。"-a -q -b"同上。
deep_speech_2/README.md
Outdated
@@ -219,6 +219,18 @@ sh download_lm_ch.sh | |||
``` | |||
If you wish to train your own better language model, please refer to [KenLM](https://github.com/kpu/kenlm) for tutorials. | |||
|
|||
Here we provide some tips to show how we prepearing our english and mandarin language models. | |||
|
|||
#### English LM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prepare Your Own English LM ? 以和前面区分开来?
并且在前面部分的最开始加上 #### Download LMs
fix #294