Cut LLM token costs by 40–75%, no API key required.
Works as a CLI tool for Claude Code, Gemini CLI, Codex, aider, and any LLM tool.
Optional Python SDK for Anthropic · OpenAI · Ollama · PySpark batch.
pip install llm-tokenoptim
# Inject caveman compression into Claude Code (auto-loaded via CLAUDE.md)
llm-tokenoptim install-claude --level full
# Pipe skill into any LLM tool manually
llm-tokenoptim skill full | pbcopy # macOS clipboard; paste into any chat
# Wrap ANY LLM CLI tool: works with gemini, codex, aider, sgpt, llm, ollama
llm-tokenoptim wrap --level full -- gemini "explain kubernetes networking"
llm-tokenoptim wrap --level ultra -- claude "write a redis cache class"
llm-tokenoptim wrap --level standard -- aider --model gpt-4o
# Compress a verbose prompt before sending
llm-tokenoptim compress "Could you please help me understand what tokenization means in the context of large language models?"

Five compression levels (no API, no GPU, <0.1ms overhead):

| Level | Output savings | Style |
|---|---|---|
| lite | ~20% | Strip pleasantries only |
| standard (default) | ~40% | Terse engineer mode |
| full | ~60% | Caveman mode: drop articles, bullets > prose |
| ultra | ~70% | Symbols + fragments |
| ancient | ~75% | Stone tablet, extreme |

Inspired by caveman. Extended with multi-provider SDK, prompt-side compression, memory management, retry, caching, and PySpark.
Every LLM API call burns money proportional to token count. Most teams waste tokens in predictable, fixable ways:
| Waste source | Typical overhead | llm-tokenoptim fix |
|---|---|---|
| Verbose prompts ("Could you please help me...") | +15–30% input tokens | Regex prompt compressor |
| Pleasantries in output ("Great question! Certainly!") | +40–75% output tokens | 6-level output compressor |
| Ballooning conversation history | Grows unbounded per turn | Memory window + auto-compaction |
| Repeated API calls for same prompt | 100% waste | Disk-backed response cache |
| Short-sighted serial API calls | N× latency at same cost | batch_chat() with concurrency |
| Suboptimal JSON for structured output | ~2× vs YAML | Output format hints |
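
To make the first row of the table concrete, here is a minimal sketch of regex-based prompt compression. The patterns and the helper names `strip_filler` and `reduction_pct` are hypothetical, chosen for illustration; they are not the rules or API shipped with llm-tokenoptim.

```python
import re

# Hypothetical filler patterns for illustration only;
# the library's real rule set is not shown in this README.
FILLER_PATTERNS = [
    r"\bcould you please\s+",
    r"\bplease\s+",
    r",?\s*if that makes sense",
]


def strip_filler(prompt: str) -> str:
    """Remove polite filler phrases and collapse leftover whitespace."""
    out = prompt
    for pattern in FILLER_PATTERNS:
        out = re.sub(pattern, "", out, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", out).strip()


def reduction_pct(before: str, after: str) -> float:
    """Character-level reduction as a percentage of the original length."""
    if not before:
        return 0.0
    return 100.0 * (len(before) - len(after)) / len(before)


original = "Could you please help me understand what tokenization means?"
compressed = strip_filler(original)
print(compressed)  # help me understand what tokenization means?
print(f"{reduction_pct(original, compressed):.1f}% fewer characters")
```

Because the rules are plain regular expressions, this style of compression needs no model and no network call, which is consistent with the sub-millisecond latency reported for the regex backend.
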
Measured on 500 real ShareGPT conversations. Run `python benchmarks/run_benchmark.py` to reproduce.

| Level | Mean reduction | Median | P90 | Latency |
|---|---|---|---|---|
| light | 6% | 0% | 23% | <0.1ms |
| medium | 12% | 11% | 24% | <0.1ms |
| full | 14% | 12% | 26% | <0.1ms |

With the LLMLingua ML backend (`pip install "llm-tokenoptim[ml]"`): 40–60% reduction on verbose prompts. Install the extra to unlock it; the library falls back to regex automatically if it is not present.
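
The automatic fallback could follow the usual optional-dependency pattern. The sketch below is an assumption about how such a check might look, not the library's actual code; `select_backend` is a hypothetical helper, and the two backend labels are the ones this README reports for `MLPromptCompressor.backend`.

```python
import importlib.util


def select_backend(module_name: str = "llmlingua") -> str:
    """Return the ML backend name if its package is importable, else the regex fallback.

    Hypothetical helper: it only checks that the top-level package can be found
    and does not import or initialise it.
    """
    if importlib.util.find_spec(module_name) is not None:
        return module_name
    return "regex-fallback"


print("Backend:", select_backend())
```

Checking with `find_spec` avoids paying the import cost of a heavy ML package just to discover whether it is installed.
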

| Level | Est. output savings | System prompt overhead |
|---|---|---|
| lite | ~20% | ~115 tokens |
| standard (default) | ~40% | ~190 tokens |
| full | ~60% | ~190 tokens |
| ultra | ~70% | ~185 tokens |
| ancient | ~75% | ~190 tokens |

Output savings are measured against base Claude Haiku responses on coding and explanation tasks. Break-even point for `standard`: any response longer than ~475 tokens.
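
The ~475-token figure follows from dividing the system prompt overhead by the savings rate (190 / 0.40). The sketch below repeats that arithmetic for every level in the table. It assumes input and output tokens cost the same; `break_even_tokens` is a hypothetical helper, not part of the library.

```python
# (estimated output savings rate, system prompt overhead in tokens), from the table above
LEVELS = {
    "lite": (0.20, 115),
    "standard": (0.40, 190),
    "full": (0.60, 190),
    "ultra": (0.70, 185),
    "ancient": (0.75, 190),
}


def break_even_tokens(savings_rate: float, overhead_tokens: int) -> float:
    """Response length at which saved output tokens equal the added prompt tokens."""
    return overhead_tokens / savings_rate


for name, (rate, overhead) in LEVELS.items():
    print(f"{name:>8}: ~{break_even_tokens(rate, overhead):.0f} tokens")
# standard comes out at ~475 tokens, matching the figure quoted above
```

If output tokens are priced higher than input tokens, as is common, the real break-even point would be lower than these estimates.
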
# Core library: zero dependencies
pip install llm-tokenoptim
# With providers
pip install "llm-tokenoptim[anthropic]" # Claude (async + streaming)
pip install "llm-tokenoptim[openai]" # OpenAI / Groq / Together
pip install "llm-tokenoptim[spark]" # PySpark batch compression
pip install "llm-tokenoptim[ml]" # LLMLingua ML compression (40-60%)
pip install "llm-tokenoptim[all]" # Everything

from tokenoptim import PromptCompressor
c = PromptCompressor(level="medium")
compressed, stats = c.compress(
    "Could you please help me understand what tokenization means? "
    "I would really like to know, perhaps with examples if that makes sense."
)
print(stats)
# Chars: 142 → 119 (16.2% ↓) | ~Tokens: 35 → 29 (17.1% ↓)

import asyncio
from tokenoptim import AsyncOptimizedClient
from tokenoptim.providers import AsyncAnthropicProvider
async def main():
    client = AsyncOptimizedClient(
        provider=AsyncAnthropicProvider(model="claude-haiku-4-5-20251001"),
        compress_prompts=True,
        output_level="full",  # ~60% output reduction
        memory_enabled=True,
        memory_max_turns=10,
        token_budget=100_000,
    )
    # Standard call
    resp = await client.chat("Explain distributed consensus algorithms")
    print(resp["content"])
    # Streaming: tokens arrive as generated
    async for chunk in client.stream("Write a Python rate limiter"):
        print(chunk, end="", flush=True)
    # Batch: 3 calls concurrently, rate-limited
    responses = await client.batch_chat(
        ["What is Redis?", "What is Kafka?", "What is Flink?"],
        max_concurrency=3,
    )
    client.counter.report()

asyncio.run(main())

from tokenoptim import OptimizedClient
from tokenoptim.providers import AnthropicProvider
client = OptimizedClient(
provider=AnthropicProvider(),
compress_prompts=True,
output_level="standard",
memory_enabled=True,
)
resp = client.chat("How does PySpark partitioning work?")
print(resp["content"])

# pip install "llm-tokenoptim[ml]"
from tokenoptim.core.ml_compressor import MLPromptCompressor
c = MLPromptCompressor(target_token_rate=0.5)  # keep 50% → 50% reduction
# very_long_prompt is a placeholder for any long string you supply
compressed, stats = c.compress(very_long_prompt)
print(stats)
# [llmlingua] ~Tokens: 1200 → 600 (50.0% ↓)

Falls back to regex automatically if llmlingua is not installed.

from tokenoptim import ResponseCache, OptimizedClient
from tokenoptim.providers import AnthropicProvider
cache = ResponseCache(directory="~/.cache/llm-tokenoptim", ttl_seconds=3600)
provider = AnthropicProvider()
messages = [{"role": "user", "content": "What is tokenization?"}]
key = cache.make_key(messages, model="claude-haiku-4-5-20251001")
if hit := cache.get(key):
    print("Cache hit! Zero API cost.")
    print(hit["content"])
else:
    resp = provider.chat(messages, max_tokens=512)
    cache.set(key, resp)
    print(resp["content"])
print(cache.stats())
# Example after one miss followed by one hit:
# {'hits': 1, 'misses': 1, 'hit_rate_pct': 50.0, 'memory_entries': 1}

client = AsyncOptimizedClient(provider=..., memory_enabled=True)
# Check memory state
print(client.memory_stats())
# {'enabled': True, 'active_window_turns': 4, 'estimated_window_tokens': 890, ...}
# Toggle off for a fast stateless call
client.toggle_memory() # → Memory OFF
# Toggle back on
client.toggle_memory() # → Memory ON
client.clear_memory() # Wipe history

Automatic exponential backoff on 429 / 5xx, configured at construction:
from tokenoptim import RetryConfig, AsyncOptimizedClient
client = AsyncOptimizedClient(
    provider=...,
    retry=RetryConfig(
        max_attempts=5,      # 4 retries
        base_delay=1.0,      # start at 1s
        max_delay=60.0,      # cap at 60s
        backoff_factor=2.0,  # double each time
        jitter=True,         # ±25% randomisation
    ),
)

Compress millions of prompts before sending to any LLM:
from pyspark.sql import SparkSession
from tokenoptim.spark import SparkTokenOptimizer
spark = SparkSession.builder.appName("llm-tokenoptim").getOrCreate()
df = spark.read.parquet("s3://bucket/raw-prompts/")
optimizer = SparkTokenOptimizer(level="full", spark=spark)
df_out = optimizer.compress_dataframe(df, prompt_col="prompt")
df_out.write.parquet("s3://bucket/compressed-prompts/")
optimizer.savings_report(df, df_out)
# ──── llm-tokenoptim PySpark Savings Report ────
# Prompts processed : 4,500,000
# Total original tokens: 892,400,000
# Total compressed : 768,000,000
# Tokens saved : 124,400,000 (13.9% avg)
# Est. cost saved : $373.20 at $3/1M tokens
# ─────────────────────────────────────────

With the [ml] extra, LLMLingua runs as a Spark UDF on each executor: 40–60% reduction at scale.
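
The figures in the sample report above can be checked with plain arithmetic. This is a minimal sketch with a hypothetical `savings_summary` helper; it is not how `savings_report` is implemented, and it computes the percentage from the totals rather than as a per-prompt average.

```python
def savings_summary(original_tokens: int, compressed_tokens: int, usd_per_million: float) -> dict:
    """Token and cost savings for a batch, given a flat price per million tokens."""
    saved = original_tokens - compressed_tokens
    return {
        "tokens_saved": saved,
        "pct_saved": 100.0 * saved / original_tokens,
        "usd_saved": saved / 1_000_000 * usd_per_million,
    }


summary = savings_summary(892_400_000, 768_000_000, usd_per_million=3.0)
print(f"Tokens saved: {summary['tokens_saved']:,} ({summary['pct_saved']:.1f}%)")
# Tokens saved: 124,400,000 (13.9%)
print(f"Est. cost saved: ${summary['usd_saved']:,.2f}")
# Est. cost saved: $373.20
```
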

# ── Skill injection (primary use, no API key needed) ──
llm-tokenoptim skill [lite|standard|full|ultra|ancient] # print skill to stdout
llm-tokenoptim install-claude --level full # append to ./CLAUDE.md
llm-tokenoptim install-global --level standard # append to ~/CLAUDE.md
# ── Wrap any LLM CLI tool ──
llm-tokenoptim wrap --level full -- gemini "explain kubernetes"
llm-tokenoptim wrap --level ultra -- claude "write a redis cache class"
llm-tokenoptim wrap --level full -- codex "refactor this function"
llm-tokenoptim wrap --level standard -- aider --model gpt-4o
llm-tokenoptim wrap --level full -- llm "summarize this doc"
llm-tokenoptim wrap --level ultra -- ollama run llama3
# ── Prompt compression (Python regex, <0.1ms, no API) ──
llm-tokenoptim compress "Could you please help me understand what a token is?"
llm-tokenoptim compress --level full --file my_prompt.txt --output compressed.txt
# ── Benchmarks ──
llm-tokenoptim bench --input prompts.txt # benchmark one prompt per line
llm-tokenoptim levels # show all levels and savings
python benchmarks/run_benchmark.py --samples 500

src/llm-tokenoptim/
├── core/
│   ├── compressor.py      # Regex prompt compression (3 levels, <0.1ms)
│   ├── ml_compressor.py   # LLMLingua ML compression (optional, 40–60%)
│   ├── output_style.py    # 6-level caveman system prompt injection
│   ├── memory.py          # Toggleable context window + auto-compaction
│   ├── counter.py         # Multi-provider token tracking + budget guard
│   ├── retry.py           # Exponential backoff (429/5xx/timeouts)
│   └── cache.py           # Disk + memory response cache (LRU, TTL)
├── providers/
│   ├── anthropic.py       # Sync Claude (prompt caching enabled)
│   ├── anthropic_async.py # Async Claude + streaming
│   ├── openai.py          # Sync OpenAI / Groq / Together
│   ├── openai_async.py    # Async OpenAI + streaming
│   └── ollama.py          # Local models via Ollama
├── spark/udf.py           # PySpark UDF + SparkTokenOptimizer
├── client.py              # OptimizedClient (sync)
├── async_client.py        # AsyncOptimizedClient (async + batch + stream)
└── cli.py                 # llm-tokenoptim CLI

The benchmark table above reflects actual measurements on ShareGPT prompts. The regex compressor achieves a 12–14% mean reduction on typical prompts at the medium and full levels, and P90 reaches 24–26% on verbose inputs.
To unlock the claimed 40–60% range, install the [ml] extra, which uses LLMLingua, a perplexity-based prompt compressor from Microsoft Research.
Output compression savings (40–75%) apply to the output side and are driven by the system prompt. They are measured against Claude Haiku with and without the compression directive, and hold consistently across coding and explanation tasks.
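
For readers comparing the mean, median, and P90 columns, the sketch below shows one way such summary statistics could be computed from per-prompt token counts. The token pairs are made-up illustration data, and the nearest-rank `percentile` helper is an assumption; the benchmark script may use a different percentile definition.

```python
import math
import statistics


def reduction_pct(original_tokens: int, compressed_tokens: int) -> float:
    """Per-prompt token reduction as a percentage."""
    return 100.0 * (original_tokens - compressed_tokens) / original_tokens


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical (original, compressed) token counts, not benchmark data
pairs = [(100, 100), (80, 72), (120, 102), (50, 50), (200, 150),
         (60, 54), (90, 81), (40, 40), (150, 135), (110, 88)]
reductions = [reduction_pct(o, c) for o, c in pairs]

print(f"mean   {statistics.mean(reductions):.1f}%")    # mean   10.0%
print(f"median {statistics.median(reductions):.1f}%")  # median 10.0%
print(f"P90    {percentile(reductions, 90):.1f}%")     # P90    20.0%
```

Terse prompts contribute 0% reductions and pull the mean down, while P90 shows what the most verbose inputs gain, which is the pattern visible in the benchmark table.
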
git clone https://github.com/manasmourya/llm-tokenoptim
cd llm-tokenoptim
pip install -e ".[dev]"
pytest tests/ -v
python benchmarks/run_benchmark.py --no-download

PRs welcome. Please run `ruff check src/ tests/` before submitting.
MIT © 2026 Manas Mourya
Inspired by caveman. Extended with multi-provider support, async/streaming, prompt-side compression (regex + LLMLingua), memory management, retry logic, response caching, and PySpark batch processing.
A complete runbook: copy-paste any block directly into your terminal or Python file.
pip install llm-tokenoptim # zero-dependency core
pip install "llm-tokenoptim[anthropic]" # + Claude support
pip install "llm-tokenoptim[openai]" # + OpenAI / Groq
pip install "llm-tokenoptim[ml]" # + LLMLingua ML compression
pip install "llm-tokenoptim[spark]" # + PySpark batch
pip install "llm-tokenoptim[all]" # everything

# Install caveman-style compression into your project permanently
tokenoptim install-claude --level full
# Install globally (applies to every project)
tokenoptim install-global --level standard
# Print any skill to stdout; pipe it anywhere
tokenoptim skill lite
tokenoptim skill standard
tokenoptim skill full
tokenoptim skill ultra
tokenoptim skill ancient
# Copy to clipboard (macOS)
tokenoptim skill full | pbcopy
# Pipe directly into a file
tokenoptim skill ultra > my_system_prompt.md

# Claude Code
tokenoptim wrap --level full -- claude "explain kubernetes networking"
# Gemini CLI
tokenoptim wrap --level ultra -- gemini "write a redis cache class in Python"
# OpenAI Codex
tokenoptim wrap --level standard -- codex "refactor this function for readability"
# Aider (AI pair programmer)
tokenoptim wrap --level full -- aider --model gpt-4o
# Simon Willison's llm CLI
tokenoptim wrap --level full -- llm "summarize this document"
# Ollama (local models)
tokenoptim wrap --level ultra -- ollama run llama3
# shell-gpt
tokenoptim wrap --level standard -- sgpt "explain docker networking"
# Show all available levels
tokenoptim levels

# Compare all levels at once
tokenoptim compress "Could you please help me understand what tokenization means in LLMs?"
# Use a specific level
tokenoptim compress "Could you please help me..." --level full
# Compress a file, save output
tokenoptim compress --level full --file my_prompt.txt --output compressed.txt
# Benchmark a file of prompts (one per line)
tokenoptim bench --input prompts.txt

from tokenoptim import PromptCompressor
# Try all three levels
for level in ["light", "medium", "full"]:
    c = PromptCompressor(level=level)
    compressed, stats = c.compress(
        "Could you please help me understand what tokenization means? "
        "I would really appreciate examples if you have them."
    )
    print(f"[{level}] {stats}")
# Compress a list of chat messages
from tokenoptim.core.compressor import PromptCompressor
c = PromptCompressor(level="full")
messages = [
{"role": "user", "content": "Could you please explain Redis?"},
{"role": "assistant", "content": "Redis is a key-value store."},
{"role": "user", "content": "Could you please give me a Python example?"},
]
compressed_messages = c.compress_messages(messages)
print(compressed_messages)

pip install "llm-tokenoptim[ml]"

from tokenoptim.core.ml_compressor import MLPromptCompressor
c = MLPromptCompressor(
    target_token_rate=0.5,  # keep 50% of tokens → ~50% reduction
    fallback_level="full",  # use regex if llmlingua not installed
)
print("Backend:", c.backend)  # "llmlingua" or "regex-fallback"
compressed, stats = c.compress("""
In the context of large language models, tokenization refers to the process
of converting raw text into numerical representations called tokens that the
model can process. Each token typically represents a word or subword unit...
""")
print(stats)

from tokenoptim import OptimizedClient, RetryConfig, ResponseCache
from tokenoptim.providers import AnthropicProvider
client = OptimizedClient(
provider=AnthropicProvider(model="claude-haiku-4-5-20251001"),
compress_prompts=True, # compress input prompts
output_level="full", # ~60% shorter responses
memory_enabled=True, # track conversation history
memory_max_turns=10, # keep last 10 turns
token_budget=50_000, # warn if exceeded
retry=RetryConfig(max_attempts=4, base_delay=1.0),
cache=ResponseCache(directory="~/.cache/tokenoptim", ttl_seconds=3600),
)
resp = client.chat("What is a transformer architecture?")
print(resp["content"])
# See token usage
client.counter.report()
# See full client status
client.status()

import asyncio
from tokenoptim import AsyncOptimizedClient
from tokenoptim.providers import AsyncAnthropicProvider
async def main():
    client = AsyncOptimizedClient(
        provider=AsyncAnthropicProvider(model="claude-haiku-4-5-20251001"),
        compress_prompts=True,
        output_level="full",
        memory_enabled=True,
    )
    # Regular async call
    resp = await client.chat("Explain gradient descent")
    print(resp["content"])
    # Streaming: tokens arrive as generated
    print("Streaming: ", end="")
    async for chunk in client.stream("Write a Python decorator example"):
        print(chunk, end="", flush=True)
    print()
    # Batch: 3 concurrent calls
    responses = await client.batch_chat(
        messages=["What is Redis?", "What is Kafka?", "What is Flink?"],
        max_concurrency=3,
    )
    for r in responses:
        print(r["content"][:100])

asyncio.run(main())

from tokenoptim import AsyncOptimizedClient
from tokenoptim.providers import AsyncAnthropicProvider
import asyncio
async def main():
    client = AsyncOptimizedClient(
        provider=AsyncAnthropicProvider(),
        memory_enabled=True,
        memory_max_turns=5,  # compacts after 5 turns
    )
    await client.chat("My name is Manas")
    await client.chat("I work in ML infrastructure")
    # Check memory state
    print(client.memory_stats())
    # {'enabled': True, 'active_window_turns': 2, 'estimated_window_tokens': 120}
    # Toggle memory off for a stateless call
    client.toggle_memory()
    resp = await client.chat("What is my name?")  # won't remember
    print(resp["content"])
    # Toggle back on
    client.toggle_memory()
    # Clear all history
    client.clear_memory()

asyncio.run(main())

from tokenoptim import ResponseCache
from tokenoptim.providers import AnthropicProvider
cache = ResponseCache(
directory="~/.cache/tokenoptim",
ttl_seconds=3600, # 1 hour TTL
max_memory_entries=256, # LRU memory cache size
)
provider = AnthropicProvider()
messages = [{"role": "user", "content": "What is tokenization?"}]
key = cache.make_key(messages, model="claude-haiku-4-5-20251001")
# First call: cache miss, hits API
if hit := cache.get(key):
    print("Cache hit:", hit["content"])
else:
    resp = provider.chat(messages, max_tokens=256)
    cache.set(key, resp)
    print("API call:", resp["content"])
# Second call: cache hit, zero API cost
if hit := cache.get(key):
    print("Cache hit:", hit["content"])
print(cache.stats())
# {'hits': 1, 'misses': 1, 'hit_rate_pct': 50.0, 'memory_entries': 1}
# Invalidate a specific entry
cache.invalidate(key)
# Clear everything
cache.clear()

from tokenoptim import RetryConfig
from tokenoptim.core.retry import with_retry
config = RetryConfig(
    max_attempts=5,      # 4 retries after first attempt
    base_delay=1.0,      # first retry waits ~1s
    max_delay=60.0,      # capped at 60s
    backoff_factor=2.0,  # doubles each retry: 1s → 2s → 4s → 8s
    jitter=True,         # ±25% randomisation to spread load
)
# Wrap any function (provider and messages as defined in the caching example)
def call_api():
    return provider.chat(messages, max_tokens=256)
response = with_retry(call_api, config)
# Or pass directly to client
from tokenoptim import OptimizedClient
client = OptimizedClient(provider=..., retry=config)

from tokenoptim import OptimizedClient
from tokenoptim.providers import AnthropicProvider
client = OptimizedClient(
provider=AnthropicProvider(),
token_budget=10_000, # warn when session exceeds this
)
client.chat("Explain neural networks")
client.chat("How does backpropagation work?")
# Check usage at any time
report = client.counter.report()
# Session total: 1,240 tokens | API calls: 2 | Budget: 10,000
# Check if over budget
print(client.counter.is_over_budget()) # False
# Detailed breakdown
total = client.counter.session_total
print(f"Input: {total.input_tokens}")
print(f"Output: {total.output_tokens}")
print(f"Total: {total.total_tokens}")

pip install "llm-tokenoptim[spark]"

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from tokenoptim.spark import SparkTokenOptimizer, compress_prompts_udf
spark = SparkSession.builder.appName("tokenoptim").getOrCreate()
# Method 1: SparkTokenOptimizer (high-level)
df = spark.read.parquet("s3://bucket/prompts/")
optimizer = SparkTokenOptimizer(level="full", spark=spark)
df_out = optimizer.compress_dataframe(df, prompt_col="prompt")
optimizer.savings_report(df, df_out)
df_out.write.parquet("s3://bucket/compressed/")
# Method 2: raw UDF (low-level, composable)
df_out = df.withColumn(
    "prompt",
    compress_prompts_udf("prompt", lit("full"))
)

from tokenoptim import OptimizedClient
from tokenoptim.providers import OpenAIProvider
# Works with any OpenAI-compatible endpoint
client = OptimizedClient(
provider=OpenAIProvider(
model="gpt-4o-mini",
api_key="sk-...", # or set OPENAI_API_KEY env var
),
compress_prompts=True,
output_level="standard",
)
resp = client.chat("Explain attention mechanisms")
print(resp["content"])

from tokenoptim import OptimizedClient
from tokenoptim.providers import OllamaProvider
client = OptimizedClient(
provider=OllamaProvider(model="llama3", base_url="http://localhost:11434"),
compress_prompts=True,
output_level="full",
)
resp = client.chat("Write a quicksort in Python")
print(resp["content"])

# Quick benchmark (built-in prompts, no download)
python benchmarks/run_benchmark.py
# Full benchmark on ShareGPT dataset
python benchmarks/run_benchmark.py --samples 500
# Benchmark your own prompts (one per line)
tokenoptim bench --input my_prompts.txt

# Install dev dependencies
pip install "llm-tokenoptim[dev]"
# Run all 52 tests
pytest tests/ -v
# Run a specific test file
pytest tests/test_compressor.py -v
pytest tests/test_cache.py -v
pytest tests/test_retry.py -v
pytest tests/test_memory.py -v
pytest tests/test_async_client.py -v
pytest tests/test_output_style.py -v
# Run with coverage
pytest tests/ --cov=tokenoptim --cov-report=term-missing