Ukrainian text optimizer for LLMs — fewer tokens, better comprehension.
Normalizes surzhyk, slang, fillers, and maps to English for cloud LLMs. Saves 60-73% tokens while improving response quality.
UA: Оптимізація українських текстів для LLM. Нормалізує суржик, сленг, мат — і стискає в англійську для Claude/GPT. Економія 60-73% токенів, якість відповідей зростає зі 67% до 100%.
Tested on 53,351 texts (Telegram corpus + books), 12 IT prompts across 4 GPT models:
| Metric | Value |
|---|---|
| Token savings (cloud) | 73% |
| Token savings (without seq2seq) | 49% |
| Lexicon coverage | 88% |
| Seq2seq exact match | 98.2% |
| GPT response quality (original UA) | 67% |
| GPT response quality (squeezed EN) | 100% |
| Quality preservation | 150% (squeezed > original) |
```
Original UA: "блін продакшн впав після деплою, що робити першим"
Squeezed EN: "damn production crashed after deploy, what do first"
Tokens: 45 → 12 (-73%)
GPT accuracy: 67% → 100%
```
```mermaid
graph LR
A[UA text<br/>surzhyk, slang] --> B[crack_open<br/>normalize]
B --> C[compress<br/>remove fillers]
C --> D[map_to_en<br/>lexicon + seq2seq]
D --> E[EN compressed<br/>for LLM]
style A fill:#fdd,stroke:#c33
style E fill:#dfd,stroke:#3a3
```
| Layer | What it does | How |
|---|---|---|
| crack_open | surzhyk, slang, profanity → standard UA | 360 rules + pymorphy3 lemmatization |
| compress | remove fillers, intensifiers, noise | rule-based pattern matching |
| map_to_en | UA → compact English | 47K lexicon + seq2seq (28K expression pairs) |
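The rule-based filler removal behind the compress layer can be sketched in a few lines; the patterns below are illustrative stand-ins, not dormouse's actual rule set:

```python
import re

# Illustrative filler/intensifier patterns — NOT dormouse's real rules
FILLER_PATTERNS = [r"\bну\b", r"\bтипу\b", r"\bкороче\b", r"\bблін\b", r"\bваще\b"]

def compress(text: str) -> str:
    """Remove filler words, then collapse the leftover whitespace."""
    for pattern in FILLER_PATTERNS:
        text = re.sub(pattern, "", text)
    return re.sub(r"\s{2,}", " ", text).strip()

print(compress("ну типу все працює, короче все ок"))
# → "все працює, все ок"
```

The real layer applies many more patterns and runs after crack_open, so it sees already-normalized standard Ukrainian.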
```bash
pip install dormouse-ua
```

Everything works out of the box — the lexicon (47K entries), the seq2seq model (28K expression pairs), and vocab files are bundled in the package.

For embeddings search (stir/mumble/sip) you also need PyTorch:

```bash
pip install "dormouse-ua[ml]"   # + torch, sentence-transformers
pip install "dormouse-ua[all]"  # everything
```

```python
from dormouse import squeeze

# Normalize only (layers 1+2)
squeeze("шо там по баґу, пофікси плз")
# → "що там по помилці, виправ"

# Cloud mode — compress for Claude/GPT (layers 1+2+3)
squeeze("ваще нормально, канєшно зробимо", target="cloud")
# → "generally ok, sure do"
```

```python
from openai import OpenAI
from dormouse import DormouseClient

client = DormouseClient(OpenAI())  # or Anthropic()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "шо там по деплою, він ваще не робе"}],
)
# Prompt flow: squeeze → EN → GPT → unsqueeze → Ukrainian response
```

```python
from dormouse import sniff

results = sniff(
    ["Борщ український", "Чізкейк Нью-Йорк", "Лимонад"],
    {
        "Гарячі страви": "борщ суп юшка харчо крем-суп гуляш",
        "Десерти": "торт чізкейк еклер фондан мус",
        "Напої": "сік морс лимонад компот узвар",
    },
)
# → [SniffResult("Борщ український", "Гарячі страви", 0.43),
#    SniffResult("Чізкейк Нью-Йорк", "Десерти", 0.45),
#    SniffResult("Лимонад", "Напої", 0.49)]
```

Example-based classification: provide example words per category instead of just category names. Uses MiniLM-L12-v2 embeddings, no API calls. Also accepts `list[str]` for simple name-based classification.

```python
from dormouse import stir, mumble, sip

stir("report.pdf")                                   # index
results = mumble("холодні закуски")                  # search by meaning
topics = sip("data.xlsx", topics=["HR", "finance"])  # classify
```

```bash
dormouse squeeze "шо там по баґу" -t cloud
dormouse stir book.pdf
dormouse mumble "головний герой"
```

| Method | Tokens | Savings | Quality |
|---|---|---|---|
| Original UA | 1,312 | — | 4.65/5 |
| dormouse | 620 | 53% | 4.50/5 |
| LLMLingua (on UA) | 1,182 | 10% | 4.60/5 |
| dormouse + LLMLingua | 595 | 55% | 4.60/5 |
LLMLingua achieves only 10% savings on Ukrainian — its GPT-2 perplexity model doesn't understand Cyrillic. dormouse gives 5x more compression on the same texts.
The problem: Ukrainian Cyrillic costs 3-4x more tokens than equivalent English text in GPT-4/Claude.
| Tool | Ukrainian | Savings (on UA texts) | Approach |
|---|---|---|---|
| dormouse | native | 53% (tested) | normalize + compress + translate |
| LLMLingua | no | 10% (tested) | ML perplexity pruning |
| Selective Context | no | ~10-15%* | self-information filtering |
| token-reducer | no | ~10-15%* | 6-stage pipeline |
*Estimated — these tools use similar English-trained models, expected to perform comparably to LLMLingua on Cyrillic.
All existing compression tools work on already English text. dormouse solves the problem one level earlier — transforms expensive Ukrainian (3-4 tokens/word) into cheap English (1-1.5 tokens/word) while preserving meaning. No other tool specifically optimizes Ukrainian for LLMs.
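A back-of-the-envelope check of that claim — the per-word token rates below are the rough figures quoted above, not measured values:

```python
def estimated_savings(n_words: int,
                      ua_tokens_per_word: float = 3.5,
                      en_tokens_per_word: float = 1.25) -> float:
    """Fraction of input tokens saved by sending English instead of Ukrainian."""
    ua_tokens = n_words * ua_tokens_per_word
    en_tokens = n_words * en_tokens_per_word
    return 1 - en_tokens / ua_tokens

print(round(estimated_savings(100), 2))
# → 0.64, i.e. ~64% saved from translation alone
```

Filler removal and normalization on top of the translation push this into the 60-73% range the benchmarks report.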
Cost reduction — Ukrainian Cyrillic encodes into 2-4x more tokens than equivalent English. dormouse saves 60-73% on input tokens.
Chatbots & support — Users write in surzhyk/slang, dormouse normalizes before LLM, GPT gives concrete answers instead of generic responses.
RAG & document search — User searches in slang, documents are in literary language. dormouse normalizes both sides → finds by meaning.
AI agents — Long chains of actions eat the context window. 73% compression means the same history fits in 27% of the tokens, leaving roughly 3.7x more effective "memory" for the agent.
Batch processing — 10K comments through GPT for sentiment analysis. Squeeze first → cheaper and faster.
Local search & classification (no API needed) — stir/mumble/sip work fully offline. Index PDF/Excel/TXT, search by meaning, classify by topics — all on CPU with local embeddings (MiniLM-L12-v2). No cloud, no keys, no cost.
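The idea behind sniff/sip can be illustrated with plain cosine similarity; the 3-dimensional vectors below are hand-made toys standing in for real 384-dimensional MiniLM-L12-v2 embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — hand-made, NOT real MiniLM vectors
topic_vectors = {
    "Напої":   [0.2, 0.8, 0.1],
    "Десерти": [0.9, 0.1, 0.0],
}

def classify(text_vector, topics):
    """Pick the topic whose vector is most similar to the text's vector."""
    return max(topics, key=lambda t: cosine(text_vector, topics[t]))

lemonade = [0.1, 0.9, 0.0]  # stands in for an embedding of "Лимонад"
print(classify(lemonade, topic_vectors))
# → "Напої"
```

dormouse does the same thing with real sentence embeddings, which is why it needs no API calls and runs on CPU.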
Full evaluation ran for 4 days on 53,351 texts:
Corpus: 53,351 texts (Telegram + books)
Squeeze speed: 606 texts/sec (normalization)
Seq2seq model: 7.3M params, 28K expression pairs
Stir/mumble: 8,441 chunks indexed, search ~600ms
Sip classification: 99% texts classified (8 topics)
| Model | UA score | Squeezed EN | Preservation |
|---|---|---|---|
| GPT-4.1 | 4.79 | 4.86 | 102% |
| GPT-4.1-mini | 4.71 | 4.68 | 99% |
| GPT-4o-mini | 4.61 | 4.60 | 100% |
| GPT-4.1-nano | 4.58 | 4.56 | 100% |
| GPT-5.5 | 4.00 | 4.00 | 100% |
| Gemini 2.0 Flash | 4.11 | 4.10 | 100% |
Squeeze preserves 99-102% quality across all tested models. GPT-4.1 actually performs better on squeezed text.
Note on GPT-5.5 scores: GPT-5.5 shows lower absolute scores (4.0 vs 4.79 for GPT-4.1) — this is an artifact of our heuristic judge (length + structure based). GPT-5.5 produces shorter, more precise answers that score lower on this metric. A proper LLM-judge eval would likely show higher scores. Preservation ratio (100%) is the meaningful metric here.
| Model | UA score | Squeezed EN | Delta |
|---|---|---|---|
| Qwen2.5-72B | 4.9/5 | 4.5/5 | -0.4 |
| Qwen2.5-7B | 4.4/5 | 3.6/5 | -0.8 |
| Llama-3.2-1B | 2.7/5 | 2.8/5 | +0.1 |
For small models (<7B), use `brew()` with native Ukrainian — they understand UA better than squeezed EN.
```
src/dormouse/
├── optimizer.py   — squeeze() main pipeline
├── rule_engine.py — normalization (360 rules + pymorphy3)
├── compressor.py  — filler/noise removal
├── classifier.py  — sniff() embeddings-based classification
├── mapper.py      — UA→EN via lexicon + lemma + transliteration
├── seq2seq.py     — expression translator (GRU encoder-decoder)
├── teapot.py      — stir/mumble/sip/brew (search + LLM)
├── embedder.py    — sentence-transformers wrapper
├── middleware.py  — OpenAI/Anthropic SDK proxy
├── cli.py         — Click CLI
├── assets.py      — bundled data + lazy download fallback
└── data/          — lexicon.db, seq2seq model, vocab, rules
```
```bash
git clone https://github.com/ChuprinaDaria/dormouse
cd dormouse
pip install -e ".[dev,morph]"
DORMOUSE_DATA_DIR=./data pytest tests/ -v
```

MIT
Built by Daria Chuprina because she can 👾.