BoKenLm is a project for training a KenLM n-gram language model for the Tibetan language. It uses SentencePiece for tokenization and KenLM for building the language model. The toolkit is designed to make creating a language model from a large text corpus straightforward.
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd BoKenLm
  ```
- Create and activate a virtual environment (recommended):

  ```bash
  python3 -m venv .env
  source .env/bin/activate
  ```
- Install dependencies: The project uses `pyproject.toml` to manage dependencies. Install them using pip:

  ```bash
  pip install -e .
  ```
The entire training process (SentencePiece tokenization and KenLM model building) is handled by a single script.
- Prepare your data: You need a large corpus of clean Tibetan text. The corpus should be in a single `.txt` file with one sentence per line (see the preparation sketch after this list).
- Configure the training script: Open the file `src/BoKenLm/train_lm.py` and modify the configuration section inside the `main()` function:

  ```python
  # Path to the clean Tibetan corpus file.
  corpus_path = "path/to/your/clean_corpus.txt"
  # Directory to save the trained models.
  output_dir = "models/kenlm"
  # Vocabulary size for SentencePiece tokenizer.
  vocab_size = 32000
  # N-gram order for the KenLM model.
  ngram = 5
  ```

  Update `corpus_path` to point to your text file. You can also adjust `output_dir`, `vocab_size`, and the `ngram` order.
- Run training: Execute the script from the root directory of the project:

  ```bash
  python src/BoKenLm/train_lm.py
  ```
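If your raw text is not yet one sentence per line, a simple heuristic for Tibetan is to split on the shad punctuation mark (།). Below is a minimal preprocessing sketch; the file names are hypothetical, and real corpora will likely need additional cleaning:

```python
import re

# Hypothetical file names; adjust to your own data.
raw_path = "raw_tibetan.txt"
clean_path = "clean_corpus.txt"

with open(raw_path, encoding="utf-8") as fin, \
        open(clean_path, "w", encoding="utf-8") as fout:
    for line in fin:
        # Split on the Tibetan shad (།), keeping the mark attached
        # to the end of each sentence.
        for sentence in re.findall(r"[^།]+།?", line.strip()):
            sentence = sentence.strip()
            if sentence:
                fout.write(sentence + "\n")
```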
The script will first train a SentencePiece model and save it to your `output_dir`. Then, it will use that model to tokenize the corpus and train a KenLM model, saving the final `lm.arpa` file in the same directory.
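For reference, the two stages correspond roughly to the sketch below. This is an illustrative outline, not the project's actual code: it assumes KenLM's `lmplz` binary is built and on your PATH, and the intermediate `tokenized.txt` file name is made up:

```python
import subprocess
from pathlib import Path

import sentencepiece as spm

corpus_path = "path/to/your/clean_corpus.txt"
output_dir = Path("models/kenlm")
output_dir.mkdir(parents=True, exist_ok=True)

# Stage 1: train a SentencePiece tokenizer on the raw corpus.
spm.SentencePieceTrainer.train(
    input=corpus_path,
    model_prefix=str(output_dir / "tokenizer"),
    vocab_size=32000,
    character_coverage=1.0,  # keep all Tibetan characters
)

# Stage 2: tokenize the corpus and feed it to KenLM's lmplz.
sp = spm.SentencePieceProcessor(model_file=str(output_dir / "tokenizer.model"))
tokenized_path = output_dir / "tokenized.txt"  # hypothetical intermediate file
with open(corpus_path, encoding="utf-8") as fin, \
        open(tokenized_path, "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")

# lmplz reads whitespace-separated tokens and writes an ARPA file.
subprocess.run(
    ["lmplz", "-o", "5", "--text", str(tokenized_path),
     "--arpa", str(output_dir / "lm.arpa")],
    check=True,
)
```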
Once training is complete, you can use the generated SentencePiece model (`tokenizer.model` in your output directory) to tokenize new Tibetan text.
Here is an example Python snippet:
```python
import sentencepiece as spm

# Load the trained model
sp = spm.SentencePieceProcessor(model_file="models/kenlm/tokenizer.model")

# Example Tibetan text
tibetan_text = "བཀྲ་ཤིས་བདེ་ལེགས།"

# Encode text into tokens (pieces)
tokens = sp.encode_as_pieces(tibetan_text)
print(f"Tokens: {tokens}")

# Encode text into token IDs
ids = sp.encode_as_ids(tibetan_text)
print(f"Token IDs: {ids}")

# Decode from IDs back to text
decoded_text = sp.decode_ids(ids)
print(f"Decoded Text: {decoded_text}")
```
If you'd like to help out, check out our contributing guidelines.

- File an issue on the project's GitHub page.
- Email us at openpecha[at]gmail.com.
- Join our Discord.
BoKenLm is licensed under the MIT License.