TurboQuant-powered agentic AI framework for long-context LLMs on consumer hardware.
TurboAgent is a pip-installable Python package that brings Google Research's TurboQuant KV-cache compression to open-source LLMs, enabling local agentic AI on consumer hardware. It delivers up to ~5x KV-cache memory reduction and up to 8x attention speedup with no measurable accuracy loss.
## Features

- One-line agent creation with up to ~5x KV compression -- 32k to 1M+ tokens of effective context on a single RTX 4090
- Hardware-aware auto-tuning -- detects CUDA/ROCm/Metal/CPU and selects optimal configuration
- Agentic-first primitives -- persistent multi-turn memory, RAG with vector-search, multi-agent swarms
- Multiple backends -- llama.cpp (consumer GPUs), vLLM (server throughput), PyTorch (research)
- Zero-calibration and training-free -- matching the guarantees of the TurboQuant paper
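The hardware-aware auto-tuning above can be pictured as a small decision rule. The sketch below is purely illustrative: the function name, the VRAM threshold, and the exact backend mapping are assumptions for demonstration, not TurboAgent's actual detection logic (which lives under `turboagent/hardware/`).

```python
# Illustrative sketch of hardware-aware config selection.
# NOT TurboAgent's real detection code: the 48 GB threshold and the
# backend mapping here are assumptions for demonstration only.

def pick_config(has_cuda: bool, has_rocm: bool, has_metal: bool,
                vram_gb: float) -> dict:
    """Map detected accelerators to a backend and KV quantization mode."""
    if has_cuda or has_rocm:
        # Consumer GPUs -> llama.cpp; big-VRAM servers -> vLLM throughput
        backend = "llama.cpp" if vram_gb < 48 else "vllm"
    elif has_metal:
        backend = "llama.cpp"  # Apple Silicon is served via llama.cpp
    else:
        backend = "torch"      # CPU fallback through HF Transformers
    # Tight VRAM favors the more aggressive 3.25-bpv turbo3 mode
    kv_mode = "turbo3" if vram_gb < 48 else "turbo4"
    return {"backend": backend, "kv_mode": kv_mode}

print(pick_config(has_cuda=True, has_rocm=False, has_metal=False, vram_gb=24))
# -> {'backend': 'llama.cpp', 'kv_mode': 'turbo3'}
```

Under these assumed rules, an RTX 4090 (24 GB) lands on the llama.cpp backend with `turbo3`, consistent with the quick-start example below.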
## Quick Start

```bash
pip install turboagent-ai[llama]
```

```python
from turboagent import TurboAgent

agent = TurboAgent(
    "meta-llama/Llama-3.1-70B-Instruct",
    kv_mode="turbo3",
    context=131072,
)
response = agent.run("Analyze my 50k-token research doc and suggest experiments...")
print(response)  # KV usage < 4 GB total
```

## Installation

```bash
# Core + llama.cpp backend (recommended for consumer GPUs)
pip install turboagent-ai[llama]

# With vLLM for server-style throughput
pip install turboagent-ai[vllm]

# With HuggingFace Transformers for research
pip install turboagent-ai[torch]

# With native TurboQuant C++/CUDA kernels (recommended for best performance)
pip install turboagent-ai[native]

# Development
pip install turboagent-ai[dev]
```

## CLI

```bash
# Scaffold a new agent project
turboagent init my_agent

# Detect hardware and show optimal configuration
turboagent info

# Run benchmarks
turboagent benchmark --model-size 70
```

## Multi-Agent Swarms

```python
from turboagent.agents.swarm import TurboSwarm, SwarmAgent

swarm = TurboSwarm(
    "meta-llama/Llama-3.1-70B-Instruct",
    agents=[
        SwarmAgent(name="researcher", role="deep research"),
        SwarmAgent(name="critic", role="critical review"),
        SwarmAgent(name="writer", role="clear writing"),
    ],
)
results = swarm.run("Analyze the latest advances in KV cache compression.")
```

## RAG / Vector Store

```python
from turboagent.agents.rag import TurboVectorStore

store = TurboVectorStore(embedding_dim=768)
store.add_documents(texts=chunks, embeddings=embeddings)
results = store.query(query_embedding, top_k=5)
```

## Project Structure

```
turboagent/
├── quant/      # TurboQuantKVCache (PolarQuant + QJL)
├── backends/   # llama.cpp, vLLM, PyTorch engines
├── agents/     # TurboAgent, TurboVectorStore, TurboSwarm
├── hardware/   # Auto-detection and optimal config
├── cli.py      # Project scaffolding and benchmarks
└── utils.py    # Shared helpers
```
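Conceptually, the `TurboVectorStore.query` call shown earlier is a top-k similarity search over stored embeddings. Here is a minimal pure-Python sketch of that idea -- illustrative only; the library presumably uses an optimized index, and cosine similarity as the metric is an assumption:

```python
# Pure-Python sketch of what a top-k vector query does conceptually.
# Not TurboVectorStore's actual implementation -- just the underlying idea.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def query(store, q, top_k=5):
    """Rank stored (doc_id, embedding) pairs by similarity to query q."""
    scored = [(doc_id, cosine(emb, q)) for doc_id, emb in store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

store = [("a", [1.0, 0.0]), ("b", [0.7, 0.7]), ("c", [0.0, 1.0])]
print(query(store, [1.0, 0.1], top_k=2))  # "a" ranks first
```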
## Quantization Modes

| Mode | Bits per Value | Compression (vs FP16) | Best For |
|---|---|---|---|
| turbo3 | 3.25 bpv | 4.9x | Maximum context on limited VRAM |
| turbo4 | 4.25 bpv | 3.8x | Higher quality, ample memory |
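As a back-of-envelope check on these ratios, KV-cache size scales linearly with bits per value. The sketch below assumes Llama-3.1-70B's published dimensions (80 layers, 8 grouped-query KV heads, head dim 128); it is illustrative arithmetic, not library code:

```python
# Back-of-envelope KV-cache sizing (illustrative arithmetic, not library code).
# Assumes Llama-3.1-70B dimensions: 80 layers, 8 KV heads (GQA), head dim 128.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
values_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM  # keys + values

def kv_bytes(tokens: int, bits_per_value: float) -> float:
    """Total KV-cache bytes for a context of `tokens` tokens."""
    return tokens * values_per_token * bits_per_value / 8

fp16 = kv_bytes(50_000, 16)      # uncompressed baseline
turbo3 = kv_bytes(50_000, 3.25)  # turbo3 mode
print(f"fp16:   {fp16 / 1e9:.1f} GB")    # ~16.4 GB
print(f"turbo3: {turbo3 / 1e9:.1f} GB")  # ~3.3 GB
print(f"ratio:  {fp16 / turbo3:.1f}x")   # ~4.9x, as in the table
```

Under these assumptions, a 50k-token context fits in about 3.3 GB of KV cache with `turbo3`, in line with the "< 4 GB" note in the quick-start example.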
## Requirements

- Python >= 3.10
- PyTorch >= 2.5.0
- One of: llama-cpp-python, vLLM, or HuggingFace Transformers
## Development

```bash
git clone https://github.com/TurboAgentAI/turboagent.git
cd turboagent
pip install -e ".[dev]"
pytest tests/ -v -m "not integration"
```

## Enterprise

The open-source core is free forever under the MIT license.
TurboAgent Enterprise adds commercial extensions for teams and organizations:
- SSO / SAML authentication
- Audit logging and compliance exports (SOC-2, GDPR)
- Air-gapped on-premise licensing
- SecureMultiAgentSwarm with governance policies and RBAC
- Multi-node KV cache sharing
- Priority kernels and dedicated support SLAs
```python
# Enterprise features activate with a license key
# export TURBOAGENT_LICENSE_KEY="TA-ENT-your-key-here"
from turboagent.enterprise.swarm import SecureMultiAgentSwarm
from turboagent.enterprise.audit import AuditLogger
```

Learn more: turboagent.to/enterprise | Contact: enterprise@turboagent.to
## License

MIT. The open-source core is free for commercial and personal use; commercial extensions are available under a separate license (see Enterprise above).
## Acknowledgments

Built on community TurboQuant implementations:
- tonbistudio/turboquant-pytorch (PyTorch reference)
- TheTom/llama-cpp-turboquant (llama.cpp fork)
- 0xSero/turboquant (vLLM Triton kernels)
- turboquant-kv (C++/CUDA bindings)