A Python and Rust implementation of SentencePiece, a language-independent subword tokeniser and de-tokeniser developed by Google.

SentencePiece Tokenisation in Rust

This repository provides an implementation of Google's SentencePiece tokenisation algorithm in Rust. The code employs dynamic programming along with an Expectation Maximization (EM) approach to iteratively refine token probabilities and determine optimal token segmentation.
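
At the core of this approach is a Viterbi-style dynamic program over a unigram language model: given a log probability for every candidate token, the best segmentation of a string is found by scoring all split points of each prefix. The repository's internals are not reproduced here; the following is a minimal, self-contained sketch of that idea, where the names `segment` and `log_probs` are illustrative rather than this crate's API:

use std::collections::HashMap;

/// Illustrative Viterbi-style segmentation: given per-token log probabilities,
/// return the highest-scoring split of `text` into known tokens.
fn segment(text: &str, log_probs: &HashMap<String, f64>) -> Option<Vec<String>> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    // best[i] holds (best score, start of the last token) for the prefix of length i.
    let mut best: Vec<Option<(f64, usize)>> = vec![None; n + 1];
    best[0] = Some((0.0, 0));
    for end in 1..=n {
        for start in 0..end {
            if let Some((score, _)) = best[start] {
                let piece: String = chars[start..end].iter().collect();
                if let Some(&lp) = log_probs.get(&piece) {
                    let cand = score + lp;
                    if best[end].map_or(true, |(s, _)| cand > s) {
                        best[end] = Some((cand, start));
                    }
                }
            }
        }
    }
    // Walk the table backwards to recover the winning token sequence.
    best[n]?;
    let mut tokens: Vec<String> = Vec::new();
    let mut end = n;
    while end > 0 {
        let (_, start) = best[end].unwrap();
        tokens.push(chars[start..end].iter().collect());
        end = start;
    }
    tokens.reverse();
    Some(tokens)
}

Training then alternates between segmenting the corpus with a table like this and re-estimating per-token probabilities from the resulting counts, which is, at a high level, what the EM rounds driven by `fit` (below) iterate on.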

Getting Started

Prerequisites

  • Rust: Ensure you have Rust installed.
  • Dependencies:
    • Standard library types like HashMap and HashSet
    • A custom implementation of a Trie for token management
    • The regex crate for pattern matching
    • A function (or crate) providing the digamma function for probability computations (a minimal sketch follows this list)
    • A random number generator (e.g., using the rand crate)
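
A note on the digamma dependency: unigram-model EM updates commonly normalise expected counts with the digamma function rather than a plain ratio. If you prefer not to pull in a crate for it, a standard recurrence-plus-asymptotic-series approximation is sketched below; the function name `digamma` is illustrative, and crates such as `statrs` provide an equivalent:

/// Approximation of the digamma function psi(x) for x > 0 (illustrative).
fn digamma(mut x: f64) -> f64 {
    let mut result = 0.0;
    // Recurrence psi(x) = psi(x + 1) - 1/x pushes x into the range where the
    // asymptotic expansion below is accurate.
    while x < 6.0 {
        result -= 1.0 / x;
        x += 1.0;
    }
    // Asymptotic series: psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    let inv = 1.0 / x;
    let inv2 = inv * inv;
    result + x.ln() - 0.5 * inv
        - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
}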

Installation

Clone the repository and build the project using Cargo:

git clone https://github.com/Abhigyan126/SentencePiece-Tokenisation.git
cd SentencePiece-Tokenisation
cargo build --release

Usage

1. Initialization

Create a new SentencePiece tokeniser (SPT) instance:

let mut spt = SPT::new();

2. Preparing Training Data

Prepare your text, token frequency map, and character set. For example:

use std::collections::{HashMap, HashSet};

let text = "your training text here";
let mut tokens: HashMap<String, usize> = HashMap::new();
// Populate `tokens` with token frequencies, e.g., from an initial vocabulary

// Extract the set of characters from the text
let characters: HashSet<char> = text.chars().collect();
let vocab_size = 1000;    // Desired vocabulary size
let delta = 0.001;        // Convergence threshold for EM
let max_iter = 100;       // Maximum iterations per EM round
let max_round = 10;       // Maximum EM rounds
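
How `tokens` gets populated is left open above; one common starting point for unigram-style training is to count every substring of the corpus up to some maximum length. The helper below, `seed_tokens`, is an illustrative suggestion rather than part of the repository:

use std::collections::HashMap;

/// One possible way to build the initial vocabulary: count every substring of
/// the training text up to `max_len` characters.
fn seed_tokens(text: &str, max_len: usize) -> HashMap<String, usize> {
    let chars: Vec<char> = text.chars().collect();
    let mut tokens = HashMap::new();
    for start in 0..chars.len() {
        for len in 1..=max_len.min(chars.len() - start) {
            let piece: String = chars[start..start + len].iter().collect();
            *tokens.entry(piece).or_insert(0) += 1;
        }
    }
    tokens
}

With such a helper, the empty map above could instead be initialised as `let mut tokens = seed_tokens(text, 8);`, where 8 is an arbitrary cap on piece length.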

3. Fitting the Model

Fit the model using the `fit` method. This runs multiple rounds of the EM algorithm and prunes tokens to keep the vocabulary at the requested size:

spt.fit(text, &mut tokens, &characters, vocab_size, delta, max_iter, max_round)?;
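
Conceptually, each round alternates an E-step (segment the corpus under the current probabilities and accumulate expected token counts) with an M-step (turn those counts back into log probabilities), then drops the weakest tokens until at most `vocab_size` remain. The sketch below of an M-step and a pruning pass reuses the `digamma` helper shown earlier; it is illustrative only and does not mirror this crate's internal signatures:

use std::collections::{HashMap, HashSet};

/// Illustrative M-step: turn expected counts into log probabilities using the
/// digamma-based update common in unigram language-model training.
fn m_step(expected_counts: &HashMap<String, f64>) -> HashMap<String, f64> {
    let total: f64 = expected_counts.values().sum();
    let log_total = digamma(total);
    expected_counts
        .iter()
        .map(|(tok, &count)| (tok.clone(), digamma(count) - log_total))
        .collect()
}

/// Illustrative pruning pass: keep the `vocab_size` most probable tokens, but
/// never drop single characters so that any input remains segmentable.
fn prune(log_probs: &mut HashMap<String, f64>, vocab_size: usize) {
    let mut ranked: Vec<(String, f64)> =
        log_probs.iter().map(|(t, &p)| (t.clone(), p)).collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    let keep: HashSet<String> = ranked
        .into_iter()
        .enumerate()
        .filter(|(rank, (tok, _))| *rank < vocab_size || tok.chars().count() == 1)
        .map(|(_, (tok, _))| tok)
        .collect();
    log_probs.retain(|tok, _| keep.contains(tok));
}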

4. Tokenisation

After fitting, you can tokenise new text with the `tokenize` method. The tokeniser replaces spaces with underscores and supports n-best tokenisation:

let input_text = "your input text here";
let nbest_size = 5;  // Number of best candidates to consider during tokenisation
let tokenization = spt.tokenize(input_text, nbest_size);
println!("Tokenization: {:?}", tokenization);
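
The exact return type of `tokenize` is not shown here; assuming it yields the chosen pieces as strings, and that the only preprocessing is the space-to-underscore replacement mentioned above, detokenisation is simply the inverse mapping:

/// Illustrative inverse of the preprocessing: join the pieces and map the
/// underscore word-boundary marker back to a space.
fn detokenize(pieces: &[String]) -> String {
    pieces.concat().replace('_', " ")
}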
