botok-rs

A fast Tibetan tokenizer written in Rust.

This is a Rust port of the Python botok library. It provides efficient tokenization of Tibetan text using a dictionary-based longest-match algorithm.

Features

  • Fast: Written in Rust; the benchmarks below show speedups of roughly 190-260x over Python botok
  • Dictionary-based: Uses a Trie data structure for efficient longest-match tokenization (see the sketch after this list)
  • Simple mode: Can also do basic syllable tokenization without a dictionary
  • CLI tool: Includes a command-line interface for quick tokenization
  • Python bindings: Use from Python with the same performance benefits
  • Library: Can be used as a library in other Rust projects
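
A minimal sketch of the longest-match behavior mentioned above, using only the WordTokenizer, add_word, and tokenize calls documented in the usage section below. The dictionary entries and printed output are illustrative, not canonical:

from botok_rs import WordTokenizer

# Minimal sketch of longest-match tokenization (illustrative entries).
wt = WordTokenizer(auto_download=False)   # start from an empty dictionary
wt.add_word("བཀྲ", pos="NOUN")            # short entry
wt.add_word("བཀྲ་ཤིས", pos="NOUN")         # longer entry sharing the same prefix

tokens = wt.tokenize("བཀྲ་ཤིས་བདེ་ལེགས།")
print([t.text for t in tokens])
# The longer entry བཀྲ་ཤིས should win over the prefix བཀྲ.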

Installation

Python

pip install git+https://github.com/OpenPecha/botok-rs.git

Or build from source:

git clone https://github.com/OpenPecha/botok-rs.git
cd botok-rs
pip install maturin
maturin build --features python --release
pip install target/wheels/botok_rs-*.whl

Rust CLI

cd botok-rs
cargo build --release

The binary will be at target/release/botok.

Usage

Python

from botok_rs import WordTokenizer

# Just works! Auto-downloads dialect pack on first use
wt = WordTokenizer()  # Downloads "general" dialect pack automatically
tokens = wt.tokenize("བཀྲ་ཤིས་བདེ་ལེགས།")
for t in tokens:
    print(f"{t.text}: {t.pos}")
# བཀྲ་ཤིས་: NOUN
# བདེ་ལེགས: NOUN
# །: None

Dialect Pack Management

from botok_rs import (
    WordTokenizer, 
    download_dialect_pack, 
    dialect_pack_exists,
    get_default_base_path
)

# Check where dialect packs are stored
print(get_default_base_path())
# ~/Documents/botok-rs/dialect_packs/

# Check if a dialect pack exists locally
if not dialect_pack_exists("general"):
    download_dialect_pack("general")

# Use a specific dialect pack
wt = WordTokenizer(dialect_name="general")

# Disable auto-download for manual dictionary management
wt = WordTokenizer(auto_download=False)
wt.load_tsv_file("my_dictionary.tsv")
wt.add_word("བཀྲ་ཤིས", pos="NOUN")

Simple Tokenization (No Dictionary)

from botok_rs import SimpleTokenizer, chunk, get_syls

# Simple syllable tokenization (no dictionary)
tokens = SimpleTokenizer.tokenize("བཀྲ་ཤིས་བདེ་ལེགས།")
for t in tokens:
    print(f"{t.text} ({t.chunk_type})")

# Get syllables
syls = get_syls("བཀྲ་ཤིས་བདེ་ལེགས")
# ['བཀྲ', 'ཤིས', 'བདེ', 'ལེགས']

# Low-level chunking
chunks = chunk("བཀྲ་ཤིས། Hello 123")
# [('TEXT', 'བཀྲ་', 0, 12), ('TEXT', 'ཤིས', 12, 9), ('PUNCT', '། ', 21, 4), ('LATIN', 'Hello 123', 25, 9)]

Token Properties

from botok_rs import WordTokenizer

wt = WordTokenizer()
wt.add_word("བཀྲ་ཤིས", pos="NOUN", lemma="བཀྲ་ཤིས", freq=1000)

tokens = wt.tokenize("བཀྲ་ཤིས།")
t = tokens[0]

print(t.text)        # 'བཀྲ་ཤིས་'
print(t.pos)         # 'NOUN'
print(t.lemma)       # 'བཀྲ་ཤིས'
print(t.freq)        # 1000
print(t.syls)        # ['བཀྲ', 'ཤིས']
print(t.chunk_type)  # 'TEXT'
print(t.start)       # 0 (byte offset)
print(t.len)         # 21 (byte length)

# Convenience methods
print(t.is_word())   # True
print(t.is_punct())  # False
print(t.to_dict())   # {'text': 'བཀྲ་ཤིས་', 'pos': 'NOUN', ...}

Command Line

# Simple syllable tokenization (no dictionary)
botok -s "བཀྲ་ཤིས་བདེ་ལེགས།"

# With a dictionary
botok -d dictionary.tsv "བཀྲ་ཤིས་བདེ་ལེགས།"

# JSON output
botok -s -j "བཀྲ་ཤིས་བདེ་ལེགས།"

# From stdin
echo "བཀྲ་ཤིས་བདེ་ལེགས།" | botok -s

As a Rust Library

use botok_rs::{Tokenizer, TrieBuilder, SimpleTokenizer};

// Option 1: Simple syllable tokenization (no dictionary)
let tokens = SimpleTokenizer::tokenize("བཀྲ་ཤིས་བདེ་ལེགས།");
for token in &tokens {
    println!("{}", token.text);
}

// Option 2: Dictionary-based tokenization
let tsv = "བཀྲ་ཤིས\tNOUN\t\t\t1000\nབདེ་ལེགས\tNOUN\t\t\t500";
let mut builder = TrieBuilder::new();
builder.load_tsv(tsv);
let trie = builder.build();

let tokenizer = Tokenizer::new(trie);
let tokens = tokenizer.tokenize("བཀྲ་ཤིས་བདེ་ལེགས།");

for token in &tokens {
    println!("{}: {:?}", token.text, token.pos);
}

Dictionary Format

The dictionary uses a TSV (tab-separated values) format:

form	pos	lemma	sense	freq
བཀྲ་ཤིས	NOUN			1000
བདེ་ལེགས	NOUN			500

Columns:

  • form: The word form (with tsek between syllables)
  • pos: Part-of-speech tag (NOUN, VERB, etc.)
  • lemma: Base form of the word (optional)
  • sense: Meaning/gloss (optional)
  • freq: Frequency count (optional)

Lines starting with # are treated as comments.
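
A minimal sketch of writing such a file and loading it from Python. The filename and entries are made up for illustration; load_tsv_file and auto_download are the calls shown in the usage section above:

from botok_rs import WordTokenizer

# Write a tiny dictionary in the TSV format described above (illustrative).
with open("my_dictionary.tsv", "w", encoding="utf-8") as f:
    f.write("# form\tpos\tlemma\tsense\tfreq\n")   # comment line
    f.write("བཀྲ་ཤིས\tNOUN\t\t\t1000\n")
    f.write("བདེ་ལེགས\tNOUN\t\t\t500\n")

wt = WordTokenizer(auto_download=False)
wt.load_tsv_file("my_dictionary.tsv")
tokens = wt.tokenize("བཀྲ་ཤིས་བདེ་ལེགས།")
print([t.text for t in tokens])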

Architecture

The tokenizer works in several stages (sketched in code after the list):

  1. Character Classification: Each character is classified (consonant, vowel, punctuation, etc.)
  2. Chunking: Text is segmented into chunks (syllables, punctuation, etc.)
  3. Trie Lookup: Syllable sequences are matched against the dictionary using longest-match
  4. Token Creation: Matched sequences become tokens with linguistic information
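
The first two stages can be inspected from Python with the chunk and get_syls helpers shown earlier; a rough end-to-end sketch (output comments are illustrative):

from botok_rs import WordTokenizer, chunk, get_syls

text = "བཀྲ་ཤིས་བདེ་ལེགས།"

# Stages 1-2: classification and chunking into typed chunks
print(chunk(text))       # e.g. [('TEXT', ...), ('PUNCT', ...)]

# The syllables that feed the trie lookup
print(get_syls(text))    # ['བཀྲ', 'ཤིས', 'བདེ', 'ལེགས']

# Stages 3-4: longest-match lookup and token creation
wt = WordTokenizer()
for t in wt.tokenize(text):
    print(t.text, t.pos, t.chunk_type)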

Performance

In the benchmarks below, the Rust implementation is roughly 190-260x faster than Python botok:

Text Size            Python botok   Rust botok-rs   Speedup
Small (17 chars)     0.60 ms        0.003 ms        240x
Medium (306 chars)   5.4 ms         0.028 ms        190x
Large (15K chars)    307 ms         1.19 ms         258x

Benchmark Details

Small text (17 chars, 1000 iterations)
  Python:  0.603 ms (mean)
  Rust:    0.003 ms (mean)
  Speedup: 240x faster

Medium text (306 chars, 500 iterations)
  Python:  5.411 ms (mean)
  Rust:    0.028 ms (mean)
  Speedup: 190x faster

Large text (15K chars, 50 iterations)
  Python:  306.891 ms (mean)
  Rust:    1.190 ms (mean)
  Speedup: 258x faster

Run the benchmark yourself:

python benchmark.py
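
If you just want a rough number without benchmark.py, a minimal timing sketch along the same lines (the text, iteration count, and resulting numbers are arbitrary and machine-dependent):

import time

from botok_rs import WordTokenizer

wt = WordTokenizer()          # downloads the "general" dialect pack on first use
text = "བཀྲ་ཤིས་བདེ་ལེགས།" * 100

runs = 200
start = time.perf_counter()
for _ in range(runs):
    wt.tokenize(text)
elapsed = time.perf_counter() - start
print(f"{elapsed / runs * 1000:.3f} ms per call")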

Why So Fast?

  • Zero-copy parsing: Rust's ownership model allows efficient string handling
  • Compiled code: No interpreter overhead
  • Cache-friendly: Trie data structure optimized for CPU cache
  • No GC pauses: Deterministic memory management

License

Apache-2.0 (same as the original botok)
