TamilTokenizer is a byte-level BPE tokenizer project focused on Tamil text.
It includes:
- A reusable tokenizer package (`minbpe`)
- A corpus preparation pipeline (`data/`)
- A training script to produce tokenizer artifacts (`train.py`)
- A FastAPI inference service (`server.py`)
- A Gradio visualization app (`app.py`)
- A Docker setup for API deployment
```mermaid
flowchart TD
subgraph group_core["Tokenizer core"]
node_minbpe_init["API<br/>package export<br/>[__init__.py]"]
node_minbpe_base["Base<br/>tokenizer core<br/>[base.py]"]
node_minbpe_basic["Basic BPE<br/>tokenizer variant<br/>[basic.py]"]
node_minbpe_regexs["Regex BPE<br/>tokenizer variant<br/>[regexs.py]"]
node_minbpe_tests["Tests<br/>[test_tokenizer.py]"]
end
subgraph group_pipeline["Data pipeline"]
node_download_html["Download HTML<br/>data ingest<br/>[download_html.py]"]
node_corpus_create["Build Corpus<br/>data transform<br/>[corpus_create.py]"]
end
subgraph group_artifacts["Artifacts"]
node_raw_data[("Raw data<br/>staging dir")]
node_processed_data[("Processed data<br/>staging dir")]
node_model_files[("Models<br/>artifact store")]
node_regex_model["regex.model<br/>trained model<br/>[regex.model]"]
node_regex_vocab["regex.vocab<br/>vocab artifact<br/>[regex.vocab]"]
end
subgraph group_entrypoints["Entrypoints"]
node_train_py["Train<br/>orchestrator<br/>[train.py]"]
node_app_py["App<br/>runtime entrypoint<br/>[app.py]"]
node_server_py["Server<br/>runtime entrypoint<br/>[server.py]"]
end
subgraph group_tooling["Tooling"]
node_tooling{{"Tooling"}}
end
node_minbpe_init -->|"exports"| node_minbpe_base
node_minbpe_init -->|"exports"| node_minbpe_basic
node_minbpe_init -->|"exports"| node_minbpe_regexs
node_minbpe_basic -->|"extends"| node_minbpe_base
node_minbpe_regexs -->|"extends"| node_minbpe_base
node_minbpe_tests -.->|"validates"| node_minbpe_init
node_download_html -->|"writes"| node_raw_data
node_raw_data -->|"feeds"| node_corpus_create
node_corpus_create -->|"writes"| node_processed_data
node_processed_data -->|"feeds"| node_train_py
node_train_py -->|"uses"| node_minbpe_init
node_train_py -->|"produces"| node_model_files
node_model_files -->|"contains"| node_regex_model
node_model_files -->|"contains"| node_regex_vocab
node_regex_model -->|"loads"| node_app_py
node_regex_vocab -->|"loads"| node_app_py
node_regex_model -->|"loads"| node_server_py
node_tooling -.->|"supports"| node_train_py
node_tooling -.->|"supports"| node_minbpe_tests
click node_minbpe_init "https://github.com/muthukamalan/tamiltokenizers/blob/main/minbpe/__init__.py"
click node_minbpe_base "https://github.com/muthukamalan/tamiltokenizers/blob/main/minbpe/base.py"
click node_minbpe_basic "https://github.com/muthukamalan/tamiltokenizers/blob/main/minbpe/basic.py"
click node_minbpe_regexs "https://github.com/muthukamalan/tamiltokenizers/blob/main/minbpe/regexs.py"
click node_minbpe_tests "https://github.com/muthukamalan/tamiltokenizers/blob/main/minbpe/test_tokenizer.py"
click node_download_html "https://github.com/muthukamalan/tamiltokenizers/blob/main/data/download_html.py"
click node_corpus_create "https://github.com/muthukamalan/tamiltokenizers/blob/main/data/corpus_create.py"
click node_raw_data "https://github.com/muthukamalan/tamiltokenizers/tree/main/data/raw"
click node_processed_data "https://github.com/muthukamalan/tamiltokenizers/tree/main/data/processed"
click node_train_py "https://github.com/muthukamalan/tamiltokenizers/blob/main/train.py"
click node_app_py "https://github.com/muthukamalan/tamiltokenizers/blob/main/app.py"
click node_server_py "https://github.com/muthukamalan/tamiltokenizers/blob/main/server.py"
click node_model_files "https://github.com/muthukamalan/tamiltokenizers/tree/main/models"
click node_regex_model "https://github.com/muthukamalan/tamiltokenizers/blob/main/models/regex.model"
click node_regex_vocab "https://github.com/muthukamalan/tamiltokenizers/blob/main/models/regex.vocab"
classDef toneNeutral fill:#f8fafc,stroke:#334155,stroke-width:1.5px,color:#0f172a
classDef toneBlue fill:#dbeafe,stroke:#2563eb,stroke-width:1.5px,color:#172554
classDef toneAmber fill:#fef3c7,stroke:#d97706,stroke-width:1.5px,color:#78350f
classDef toneMint fill:#dcfce7,stroke:#16a34a,stroke-width:1.5px,color:#14532d
classDef toneRose fill:#ffe4e6,stroke:#e11d48,stroke-width:1.5px,color:#881337
classDef toneIndigo fill:#e0e7ff,stroke:#4f46e5,stroke-width:1.5px,color:#312e81
classDef toneTeal fill:#ccfbf1,stroke:#0f766e,stroke-width:1.5px,color:#134e4a
class node_minbpe_init,node_minbpe_base,node_minbpe_basic,node_minbpe_regexs,node_minbpe_tests toneBlue
class node_download_html,node_corpus_create toneAmber
class node_raw_data,node_processed_data,node_model_files,node_regex_model,node_regex_vocab toneMint
class node_train_py,node_app_py,node_server_py toneRose
class node_tooling toneIndigo
```
- Byte Pair Encoding (BPE) tokenization with custom merge training
- Tamil-specific regex chunking pattern (`[\u0B80-\u0BFF]+`) for training/encoding
- Save/load tokenizer artifacts (`.model`, `.vocab`)
- REST API endpoint for tokenization
- Interactive Gradio UI for token visualization
- Optional data acquisition and corpus generation from HTML sources
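
To illustrate the chunking idea, here is a minimal sketch of how a Tamil-block pattern groups runs of Tamil characters before BPE merges are applied; the project's actual pattern and splitting logic live in `minbpe/regexs.py`:

```python
import re

# Minimal sketch: group runs of Tamil-block characters before BPE merges.
# The project's actual pattern and chunking logic live in minbpe/regexs.py.
TAMIL_CHUNK = re.compile(r"[\u0B80-\u0BFF]+")

text = "தமிழ் is written in the Tamil script: மொழி"
print(TAMIL_CHUNK.findall(text))  # ['தமிழ்', 'மொழி']
```
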
```text
.
├── app.py # Gradio tokenizer visualizer
├── data
│ ├── corpus_create.py
│ ├── download_html.py
│ ├── processed
│ └── raw
├── Dockerfile # Containerized API runtime
├── EDA.ipynb
├── LICENSE
├── Makefile
├── minbpe
│ ├── base.py
│ ├── basic.py
│ ├── __init__.py
│ ├── regexs.py
│ └── test_tokenizer.py
├── models
│ ├── regex.model # Serialized tokenizer merges/config
│ └── regex.vocab # Human-readable vocabulary dump
├── pyproject.toml # Project metadata and dependencies (uv)
├── README.md
├── requirements.txt # Dependencies for pip installs
├── server.py # FastAPI API server (/encode)
└── train.py # Tokenizer training script
6 directories, 19 files
```

- `data/download_html.py` downloads source HTML files into `data/raw/`.
- `data/corpus_create.py` parses HTML, extracts text with BeautifulSoup, and keeps characters in the Tamil Unicode range (sketched after this list).
- The final corpus is written to `data/processed/tamil_corpus.txt`.
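
A rough sketch of that corpus step, assuming BeautifulSoup and the Tamil-block filtering described above (the actual logic in `data/corpus_create.py` may differ in detail):

```python
import re
from pathlib import Path

from bs4 import BeautifulSoup

# Rough sketch of the corpus step; data/corpus_create.py is authoritative.
NON_TAMIL = re.compile(r"[^\u0B80-\u0BFF\s]+")  # anything outside Tamil block / whitespace

def html_to_tamil_text(html: str) -> str:
    text = BeautifulSoup(html, "html.parser").get_text(" ")
    text = NON_TAMIL.sub(" ", text)           # keep Tamil characters and spaces
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

Path("data/processed").mkdir(parents=True, exist_ok=True)
lines = [
    html_to_tamil_text(path.read_text(encoding="utf-8"))
    for path in sorted(Path("data/raw").glob("*.html"))
]
Path("data/processed/tamil_corpus.txt").write_text("\n".join(lines), encoding="utf-8")
```
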
- `train.py` reads `data/processed/tamil_corpus.txt`
- Trains a `RegexTokenizer` with `vocab_size=1000`
- Writes artifacts to `models/regex.model` and `models/regex.vocab`
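
In outline, that training flow looks like the sketch below, assuming a minbpe-style `train`/`save` API; `train.py` remains the authoritative script:

```python
from minbpe import RegexTokenizer

# Sketch of the training flow; train.py is the authoritative script.
with open("data/processed/tamil_corpus.txt", encoding="utf-8") as f:
    corpus = f.read()

tokenizer = RegexTokenizer()
tokenizer.train(corpus, vocab_size=1000)  # learns vocab_size - 256 merges over the byte-level base
tokenizer.save("models/regex")            # writes models/regex.model and models/regex.vocab
```
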
- `server.py` loads `models/regex.model` and exposes `POST /encode`
- `app.py` loads the same model and renders tokenized text in color via Gradio
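
The endpoint could be implemented along these lines. This is a hedged sketch matching the response shape documented below, assuming a minbpe-style `vocab` mapping of token ids to bytes; it is not the exact contents of `server.py`:

```python
from fastapi import FastAPI
from pydantic import BaseModel

from minbpe import RegexTokenizer

# Hedged sketch of the /encode endpoint; see server.py for the real code.
app = FastAPI()
tokenizer = RegexTokenizer()
tokenizer.load("models/regex.model")

class EncodeRequest(BaseModel):
    text: str

@app.post("/encode")
def encode(req: EncodeRequest) -> dict:
    ids = tokenizer.encode(req.text)
    return {
        "token_ids": ids,
        "token_details": [
            {
                "token_id": i,
                "token_bytes": str(tokenizer.vocab[i]),  # e.g. "b'...'"
                "token_text": tokenizer.vocab[i].decode("utf-8", errors="replace"),
            }
            for i in ids
        ],
        "full_text": tokenizer.decode(ids),
    }
```
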
Requirements:

- Python 3.10+ recommended
- uv (recommended package manager) or `pip`
- Docker (optional, for containerized API)
Note: `pyproject.toml` currently specifies `>=3.14`, but most tooling and dependencies are compatible with mainstream Python 3.10+ environments.
Install and run with uv:

```bash
uv sync
```

Run API server:

```bash
uv run python server.py
```

Run Gradio app:

```bash
uv run python app.py
```

Or with pip:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Run API server:

```bash
python server.py
```

Run Gradio app:

```bash
python app.py
```

Start the server (`server.py`) and call:
- Endpoint: `POST /encode`
- Default URL: `http://localhost:8000/encode`
Request:

```json
{ "text": "ஆனந்த சிலை மனம் நெகிழ கண்டார்" }
```

Response (shape):
```json
{
"token_ids": [ ... ],
"token_details": [
{
"token_id": 123,
"token_bytes": "b'...'",
"token_text": "..."
}
],
"full_text": "..."
}
```

Example with curl:
```bash
curl -X POST "http://localhost:8000/encode" -H "Content-Type: application/json" -d '{"text":"ஆதி அந்தமில்லாத காலம்"}'
```
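
The same call from Python, assuming the server is running locally and using the `requests` library:

```python
import requests

# Assumes the API from server.py is running on localhost:8000.
resp = requests.post(
    "http://localhost:8000/encode",
    json={"text": "ஆதி அந்தமில்லாத காலம்"},
    timeout=10,
)
resp.raise_for_status()
data = resp.json()
print(data["token_ids"])
print(data["full_text"])
```
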
Build image:

```bash
docker build -t tamil-tokenizer:latest .
```

Run container:
```bash
docker run --rm -p 8000:8000 tamil-tokenizer:latest
```

Test the API:
```bash
curl -X POST "http://localhost:8000/encode" \
  -H "Content-Type: application/json" \
  -d '{"text":"தமிழ் மொழி அழகு"}'
```

If you want to retrain from fresh data:
- Download raw HTML corpus: `python data/download_html.py`
- Build processed corpus: `python data/corpus_create.py`
- Train tokenizer and generate model files: `python train.py`
```bash
make download-data   # fetch raw corpus sources
make get-corpous     # build processed corpus (spelling kept as in Makefile)
make train-tokenizer # train tokenizer model
make build-image     # build Docker image (tag: tamil-tiktok:latest)
make gradio          # run Gradio app
make del-model       # remove files in models/
make clean           # remove caches and generated artifacts
```

Python usage:

```python
from minbpe import RegexTokenizer
tokenizer = RegexTokenizer()
tokenizer.load("./models/regex.model")
text = "சிந்தாமணி சிலப்பதிகாரம்"
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)
print(ids)
print(decoded)
```

Run tests with:

```bash
pytest -q
```

Notes:

- `models/regex.model` must exist before running `server.py` or `app.py`.
- The tokenizer currently emphasizes Tamil Unicode block text splitting.
- Data download uses external sources and may fail due to site/network changes.
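
For illustration only, a minimal round-trip check in the spirit of `minbpe/test_tokenizer.py` might look like this (hypothetical; the repository's actual tests may differ):

```python
from minbpe import RegexTokenizer

# Hypothetical round-trip test; the real suite is minbpe/test_tokenizer.py.
# Requires models/regex.model to exist (see the note above).
def test_encode_decode_roundtrip():
    tokenizer = RegexTokenizer()
    tokenizer.load("./models/regex.model")
    text = "சிந்தாமணி சிலப்பதிகாரம்"
    assert tokenizer.decode(tokenizer.encode(text)) == text
```
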
This project is licensed under the terms in the `LICENSE` file.

