BoKenLm is a project for training a KenLM n-gram language model for the Tibetan language. It uses SentencePiece for tokenization and KenLM for building the language model. The toolkit is designed to make creating a language model from a large text corpus straightforward.
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd BoKenLm
  ```
- Create and activate a virtual environment (recommended):

  ```bash
  python3 -m venv .env
  source .env/bin/activate
  ```
- Install dependencies: The project uses `pyproject.toml` to manage dependencies. Install them using pip:

  ```bash
  pip install -e .
  ```
The entire training process (SentencePiece tokenization and KenLM model building) is handled by a single script.
- Prepare your data: You need a large corpus of clean Tibetan text. The corpus should be in a single `.txt` file with one sentence per line (see the preparation sketch after this list).
- Configure the training script: Open the file `src/BoKenLm/train_lm.py` and modify the configuration section inside the `main()` function:

  ```python
  # Path to the clean Tibetan corpus file.
  corpus_path = "path/to/your/clean_corpus.txt"
  # Directory to save the trained models.
  output_dir = "models/kenlm"
  # Vocabulary size for SentencePiece tokenizer.
  vocab_size = 32000
  # N-gram order for the KenLM model.
  ngram = 5
  ```

  Update `corpus_path` to point to your text file. You can also adjust `output_dir`, `vocab_size`, and the `ngram` order.
- Run training: Execute the script from the root directory of the project:

  ```bash
  python src/BoKenLm/train_lm.py
  ```
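If your raw text is not yet one sentence per line, a simple heuristic for Tibetan is to split on the shad punctuation mark (།). Below is a minimal preprocessing sketch; the file names are hypothetical, and real corpora will likely need additional cleaning:

```python
import re

# Hypothetical file names; adjust to your own data.
raw_path = "raw_tibetan.txt"
clean_path = "clean_corpus.txt"

with open(raw_path, encoding="utf-8") as fin, \
        open(clean_path, "w", encoding="utf-8") as fout:
    for line in fin:
        # Split on the Tibetan shad (།), keeping the mark attached
        # to the end of each sentence.
        for sentence in re.findall(r"[^།]+།?", line.strip()):
            sentence = sentence.strip()
            if sentence:
                fout.write(sentence + "\n")
```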
The script will first train a SentencePiece model and save it to your `output_dir`. Then, it will use that model to tokenize the corpus and train a KenLM model, saving the final `lm.arpa` file in the same directory.
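For reference, the two stages correspond roughly to the sketch below. This is an illustrative outline, not the project's actual code: it assumes KenLM's `lmplz` binary is built and on your PATH, and the intermediate `tokenized.txt` file name is made up:

```python
import subprocess
from pathlib import Path

import sentencepiece as spm

corpus_path = "path/to/your/clean_corpus.txt"
output_dir = Path("models/kenlm")
output_dir.mkdir(parents=True, exist_ok=True)

# Stage 1: train a SentencePiece tokenizer on the raw corpus.
spm.SentencePieceTrainer.train(
    input=corpus_path,
    model_prefix=str(output_dir / "tokenizer"),
    vocab_size=32000,
    character_coverage=1.0,  # keep all Tibetan characters
)

# Stage 2: tokenize the corpus and feed it to KenLM's lmplz.
sp = spm.SentencePieceProcessor(model_file=str(output_dir / "tokenizer.model"))
tokenized_path = output_dir / "tokenized.txt"  # hypothetical intermediate file
with open(corpus_path, encoding="utf-8") as fin, \
        open(tokenized_path, "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")

# lmplz reads whitespace-separated tokens and writes an ARPA file.
subprocess.run(
    ["lmplz", "-o", "5", "--text", str(tokenized_path),
     "--arpa", str(output_dir / "lm.arpa")],
    check=True,
)
```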
Once training is complete, you can use the generated SentencePiece model (`tokenizer.model` in your output directory) to tokenize new Tibetan text.
Here is an example Python snippet:
```python
import sentencepiece as spm

# Load the trained model
sp = spm.SentencePieceProcessor(model_file="models/kenlm/tokenizer.model")

# Example Tibetan text
tibetan_text = "བཀྲ་ཤིས་བདེ་ལེགས།"

# Encode text into tokens (pieces)
tokens = sp.encode_as_pieces(tibetan_text)
print(f"Tokens: {tokens}")

# Encode text into token IDs
ids = sp.encode_as_ids(tibetan_text)
print(f"Token IDs: {ids}")

# Decode from IDs back to text
decoded_text = sp.decode_ids(ids)
print(f"Decoded Text: {decoded_text}")
```
If you'd like to help out, check out our contributing guidelines.

- File an issue on the project's GitHub page.
- Email us at openpecha[at]gmail.com.
- Join our Discord.
BoKenLm is licensed under the MIT License.