This repository provides a Rust implementation of Google's SentencePiece tokenisation algorithm. The code combines dynamic programming with an Expectation-Maximization (EM) procedure to iteratively refine token probabilities and determine the optimal token segmentation.
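The exact internals live in the source, but the dynamic-programming step can be sketched independently: given a log-probability for each candidate token, the best segmentation of a string keeps, for every prefix, the highest-scoring split that ends there. The function below is a minimal Viterbi-style illustration; the name `best_segmentation` and the `log_probs` map are assumptions made for the sketch, not part of this crate's API.

```rust
use std::collections::HashMap;

/// Highest log-probability segmentation of `text` under a unigram token model,
/// found with a Viterbi-style dynamic program. Illustrative sketch only.
fn best_segmentation(text: &str, log_probs: &HashMap<String, f64>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    // best[i] = (score of the best segmentation of chars[..i], start of its last token)
    let mut best: Vec<(f64, usize)> = vec![(f64::NEG_INFINITY, 0); n + 1];
    best[0] = (0.0, 0);
    for end in 1..=n {
        for start in 0..end {
            let piece: String = chars[start..end].iter().collect();
            if let Some(lp) = log_probs.get(&piece) {
                let score = best[start].0 + lp;
                if score > best[end].0 {
                    best[end] = (score, start);
                }
            }
        }
    }
    // Backtrack from the end to recover the chosen tokens.
    let mut tokens = Vec::new();
    let mut end = n;
    while end > 0 {
        let start = best[end].1;
        tokens.push(chars[start..end].iter().collect());
        end = start;
    }
    tokens.reverse();
    tokens
}
```

Conceptually, each EM round re-estimates token probabilities from counts gathered over segmentations like this one, and low-value tokens are then pruned to move toward the target vocabulary size.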
- Rust: Ensure you have Rust installed.
- Dependencies: resolved automatically by Cargo when you build the project.
Clone the repository and build the project using Cargo:
```bash
git clone https://github.com/Abhigyan126/SentencePiece-Tokenisation.git
cd SentencePiece-Tokenisation
cargo build --release
```
Create a new SentencePiece tokeniser (`SPT`) instance:

```rust
let mut spt = SPT::new();
```
Prepare your text, token frequency map, and character set. For example:
```rust
use std::collections::{HashMap, HashSet};

let text = "your training text here";
let mut tokens: HashMap<String, usize> = HashMap::new();
// Populate `tokens` with token frequencies, e.g., from an initial vocabulary

// Extract the set of characters from the text
let characters: HashSet<char> = text.chars().collect();
```
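If you do not already have an initial vocabulary, one common way to seed the frequency map is to count every substring of the training text up to some maximum length. The helper below is only a sketch of that idea; the name `seed_candidates`, the `max_piece_len` parameter, and the seeding strategy itself are assumptions rather than something this repository prescribes:

```rust
use std::collections::HashMap;

/// Count every substring of `text` up to `max_piece_len` characters.
/// Illustrative only: one of several possible seeding strategies.
fn seed_candidates(text: &str, max_piece_len: usize) -> HashMap<String, usize> {
    let chars: Vec<char> = text.chars().collect();
    let mut counts: HashMap<String, usize> = HashMap::new();
    for start in 0..chars.len() {
        for len in 1..=max_piece_len.min(chars.len() - start) {
            let piece: String = chars[start..start + len].iter().collect();
            *counts.entry(piece).or_insert(0) += 1;
        }
    }
    counts
}
```

With such a helper, `let mut tokens = seed_candidates(text, 4);` would replace the empty map above.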
Then choose the training parameters:

```rust
let vocab_size = 1000; // Desired vocabulary size
let delta = 0.001;     // Convergence threshold for EM
let max_iter = 100;    // Maximum iterations per EM round
let max_round = 10;    // Maximum EM rounds
```
Fit the model using the `fit` method. This runs multiple rounds of the EM algorithm and prunes tokens to maintain the vocabulary size:
```rust
spt.fit(text, &mut tokens, &characters, vocab_size, delta, max_iter, max_round)?;
```
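Because the call uses the `?` operator, it has to live in a function whose return type can absorb the error returned by `fit`. A minimal pattern, assuming the error type converts into `Box<dyn std::error::Error>` (an assumption about the crate, not documented behaviour), looks like this:

```rust
use std::collections::{HashMap, HashSet};
// `SPT` is assumed to be in scope; the exact import path depends on how you
// add this crate to your Cargo.toml.

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut spt = SPT::new();

    let text = "your training text here";
    let mut tokens: HashMap<String, usize> = HashMap::new();
    let characters: HashSet<char> = text.chars().collect();

    // Training parameters as in the snippets above.
    let (vocab_size, delta, max_iter, max_round) = (1000, 0.001, 100, 10);

    // `?` propagates any error returned by `fit` out of `main`.
    spt.fit(text, &mut tokens, &characters, vocab_size, delta, max_iter, max_round)?;
    Ok(())
}
```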
After fitting, you can tokenise new text. The tokeniser replaces spaces with underscores and supports n-best tokenisation:
```rust
let input_text = "your input text here";
let nbest_size = 5; // Number of best candidates to consider during tokenisation
let tokenization = spt.tokenize(input_text, nbest_size);
println!("Tokenization: {:?}", tokenization);
```