wiki-lm

Script to train an n-gram Language Model (LM) of any order for any language using Wikipedia articles. See my blog post for details.

Usage

The script is located at create_lm.sh. Type ./create_lm.sh -h to display the help.

Example usage

# create a 4-gram LM for German using the 400k most frequent words and probing as the data structure. Intermediate artifacts will be removed after estimation.
./create_lm.sh -l de -o 4 -m 400000 -d probing -r

Dependencies

The script assumes the following dependencies are available on your system:

  • KenLM: KenLM is used to estimate the LM; the script assumes that lmplz and build_binary are on $PATH. See the KenLM docs for more information about how to build those binaries.
  • pv (Pipe Viewer) to show a progress bar
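
Before running the script, a quick sanity check that these binaries are actually on $PATH might look like the following minimal Python sketch (the binary names are the ones listed above):

import shutil

# binaries create_lm.sh expects to find on $PATH
for binary in ['lmplz', 'build_binary', 'pv']:
    path = shutil.which(binary)
    print(f'{binary}: {path or "NOT FOUND"}')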

Corpus creation

The script performs all steps necessary to turn the Wikipedia articles into a corpus that can be used for estimation. Some of the logic used for creating the corpus is contained in create_corpus.py. The main steps are the following:

  1. Download the dump from Wikipedia
  2. Remove Wiki markup and extract raw text from the articles. This step uses the Wikipedia Extractor, which was not implemented by me but simply copied from its GitHub repo.
  3. Tokenize the text into sentences using NLTK. All text is normalized, and the result is written to a compressed corpus file with one sentence per line and words separated by a single whitespace (as expected by KenLM). Normalization is done as follows (a sketch of these rules follows after this list):
    • Try to convert any non-ASCII characters to their ASCII equivalents using unidecode, where possible. This is necessary to reduce possible spelling errors with accented characters and to get rid of ambiguous spelling variants (such as the German ß, which is sometimes also written as ss). Umlauts (äöü) will not be replaced.
    • Remove punctuation (including any punctuation used to mark the end of a sentence).
    • Replace any purely numeric word tokens within each sentence (year numbers etc.) with the <num> token. Such tokens usually do not carry any semantic meaning and could be replaced by any other number. Word tokens containing a digit (e.g. WW2) are processed by replacing each digit with # (e.g. WW#).
    • Remove any whitespace at the beginning and end of each sentence and collapse multiple whitespaces in between.
    • Convert everything to lowercase.
  4. Estimate the probability of n-grams using lmplz and write the estimates to an ARPA file.
  5. Create a binary KenLM model from the ARPA file using build_binary. The binary model can be loaded by the kenlm Python module and used for fast querying (see the sketch below).
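
The normalization rules from step 3 live in create_corpus.py; the following is a minimal, self-contained sketch of the same rules (function names and the exact umlaut handling are illustrative, not the script's actual code):

import re
import string

from nltk.tokenize import sent_tokenize  # requires the NLTK punkt data (nltk.download('punkt'))
from unidecode import unidecode

def to_ascii_keeping_umlauts(text):
    # convert non-ASCII characters to their ASCII equivalents (e.g. ß -> ss),
    # but leave German umlauts untouched
    return ''.join(c if c in 'äöüÄÖÜ' else unidecode(c) for c in text)

def normalize_sentence(sentence):
    sentence = to_ascii_keeping_umlauts(sentence)
    # remove punctuation, including the marks that end a sentence
    sentence = sentence.translate(str.maketrans('', '', string.punctuation))
    words = []
    for word in sentence.split():
        if word.isdigit():
            words.append('<num>')                   # purely numeric token -> <num>
        else:
            words.append(re.sub(r'\d', '#', word))  # WW2 -> WW#
    # join() collapses multiple whitespaces and trims the sentence ends
    return ' '.join(words).lower()

for sentence in sent_tokenize('Im Jahr 1995 kostete die Straße 3 Millionen.', language='german'):
    print(normalize_sentence(sentence))  # -> im jahr <num> kostete die strasse <num> millionen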
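
Steps 4 and 5 boil down to two KenLM invocations, which create_lm.sh wires together with the chosen options. A rough sketch of the equivalent calls plus loading the result (file names are illustrative):

import subprocess

import kenlm

# step 4: lmplz reads the corpus from stdin and writes an ARPA file to stdout
with open('corpus.txt') as corpus, open('wiki_de_4gram.arpa', 'w') as arpa:
    subprocess.run(['lmplz', '-o', '4'], stdin=corpus, stdout=arpa, check=True)

# step 5: compile the ARPA file into a binary model using the probing data structure
subprocess.run(['build_binary', 'probing', 'wiki_de_4gram.arpa', 'wiki_de_4gram.klm'], check=True)

# the binary model loads quickly with the kenlm Python module; score() returns
# the log10 probability of a sentence normalized like the training corpus
model = kenlm.Model('wiki_de_4gram.klm')
print(model.score('im jahr <num> wurde die stadt gegründet'))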

(c) Daniel Tiefenauer
