Iterate. Tune. Ship better AI.
IteraTūn is an open-source prompt version control and A/B testing platform for AI applications.
It gives your prompts the same discipline that Git gives your code.
Every developer building AI applications has this problem — and almost nobody talks about it.
You write a prompt, it works okay, you tweak it, maybe it gets better, maybe worse.
You have no idea. You just keep going.
```
Week 1: "Summarize this in 3 bullet points."        ← worked great
Week 2: "Summarize this in 5 bullet points."        ← seemed fine?
Week 3: "Summarize this briefly, formal tone."      ← users complaining now
Week 4: ...you can't even remember what the old prompt was
```
There is no history. No comparison. No data. No way to go back.
Your codebase is version controlled. Your database is backed up. Your prompts?
Hardcoded in a .env file somewhere, changed on a whim, lost forever.
| Problem | IteraTūn Solution |
|---|---|
| No prompt history | Every change is saved as a new version |
| No way to compare prompts | Side-by-side diff between any two versions |
| Prompt changes are guesswork | A/B test variants with real traffic |
| No performance data | Automated scoring via LLM-as-judge |
| Prompts hardcoded in codebase | Fetch prompts via API at runtime |
| Can't rollback a bad prompt | One-click rollback to any previous version |
```python
from iteratun import IteraTun

client = IteraTun(api_key="your-api-key")

# Save a new version of your prompt
client.push(
    name="summarize-document",
    content="Summarize the following text in 3 bullet points.",
    tags=["summarization", "v1"],
)
```

```python
import openai

# Instead of hardcoding prompts in your codebase
prompt = client.get("summarize-document", version="latest")

# Use it normally with any LLM
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"{prompt.content}\n\n{document}"}],
)
```

Every time your app calls `.get()`, IteraTūn logs:
- Which version was served
- The response received
- Latency and token count
- Automated quality score
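This section doesn't show an SDK call for reading those logs back, but the REST API listed further below exposes a scores endpoint. A minimal sketch using `requests`; the bearer-token header and the JSON response shape are assumptions, not confirmed by the docs:

```python
# Sketch only: pulls aggregated evaluation scores from the documented
# GET /api/prompts/{name}/scores endpoint. The auth header and the
# response shape are assumptions, not confirmed by the docs.
import requests

resp = requests.get(
    "http://localhost:8000/api/prompts/summarize-document/scores",
    headers={"Authorization": "Bearer your-api-key"},
    timeout=10,
)
resp.raise_for_status()
for point in resp.json():  # assumed: one score record per version over time
    print(point)
```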
```python
# Not sure if v1 or v2 is better? Split the traffic.
client.ab_test(
    name="summarize-document",
    variants=["v1", "v2"],
    split=[50, 50],  # 50% of traffic each
)
```

IteraTūn handles the routing. After enough samples, check the dashboard: the data tells you which prompt wins.
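While a test is running, the routing is invisible to your code: you keep calling `.get()` and IteraTūn picks the variant. A hypothetical sketch of what that looks like; both the argument-free `.get()` routing behavior and the `.version` attribute are assumptions for illustration:

```python
# Hypothetical sketch: during an A/B test, each .get() is served one of
# the variants. The .version attribute is an assumption, not documented.
from collections import Counter

served = Counter()
for _ in range(100):
    prompt = client.get("summarize-document")  # A/B engine picks v1 or v2
    served[prompt.version] += 1

print(served)  # roughly Counter({"v1": 50, "v2": 50}) with a 50/50 split
```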
```python
# Bad prompt in production? Roll back in one line.
client.rollback("summarize-document", to_version="v1")
```

```
┌─────────────────────────────────────────────────────────┐
│ Your AI App │
│ (uses IteraTūn Python SDK) │
└──────────────────────────┬──────────────────────────────┘
│ push / get / rollback
▼
┌─────────────────────────────────────────────────────────┐
│ IteraTūn API (FastAPI) │
│ │
│ ┌─────────────┐ ┌────────────┐ ┌──────────────┐ │
│ │ Prompt │ │ A/B Test │ │ Version │ │
│ │ Registry │ │ Engine │ │ Control │ │
│ └──────┬──────┘ └─────┬──────┘ └──────┬───────┘ │
│ │ │ │ │
│ └────────────────▼──────────────────┘ │
│ │ │
│ ┌──────▼──────┐ │
│ │ Kafka │ │
│ │ (async eval │ │
│ │ queue) │ │
│ └──────┬──────┘ │
│ │ │
│ ┌────────────▼────────────┐ │
│ │ Eval Engine │ │
│ │ (LLM-as-judge scores │ │
│ │ every response) │ │
│ └────────────┬────────────┘ │
└───────────────────────────┼─────────────────────────────┘
│
┌─────────────────┼──────────────────┐
│ │ │
┌──────▼──────┐ ┌───────▼──────┐ ┌───────▼──────┐
│ PostgreSQL │ │ Redis │ │ Dashboard │
│ (versions, │ │ (sessions, │ │ (WebSocket │
│ scores, │ │ caching, │ │ live view) │
│ A/B data) │ │ rate limit) │ │ │
└─────────────┘ └──────────────┘ └──────────────┘
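```

The Kafka hop is what keeps evaluation off the request path. As an illustration of the pattern (not IteraTūn's actual code; the topic name and message schema are invented), the API produces an eval job and the eval engine consumes it asynchronously:

```python
# Illustrative async-eval pattern using kafka-python. Not IteraTūn's code:
# the "eval-jobs" topic and the message fields are invented for the example.
import json
from kafka import KafkaProducer, KafkaConsumer

# API side: enqueue the job instead of scoring inline
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send("eval-jobs", {
    "prompt": "summarize-document",
    "version": "v2",
    "response": "...model output...",
})
producer.flush()

def judge(job: dict) -> dict:
    # Placeholder for the LLM-as-judge call (sketched further below).
    return {"relevance": 5}

# Eval-engine side: consume and score off the request path
consumer = KafkaConsumer(
    "eval-jobs",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode()),
)
for record in consumer:
    print(record.value["version"], judge(record.value))
```

Because scoring happens in the consumer, a slow judge model never adds latency to the calling app.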
| Layer | Technology | Purpose |
|---|---|---|
| API | FastAPI | Core REST API |
| SDK | Python package | `pip install iteratun` |
| Queue | Apache Kafka | Async evaluation pipeline |
| Database | PostgreSQL | Prompts, versions, scores, A/B data |
| Cache | Redis | Sessions, rate limiting, fast lookups |
| Auth | OAuth2 + JWT | GitHub / Google login, API keys |
| Realtime | WebSockets | Live dashboard updates |
| Deploy | Docker + Kubernetes | Scalable microservice deployment |
- Every prompt change creates an immutable version
- Full diff view between any two versions
- Timestamped history with author and changelog
- One-click rollback to any previous version
- Split traffic between prompt variants by percentage
- Automatic traffic routing via SDK
- Statistical significance tracking
- Winner detection with configurable thresholds
- LLM-as-judge scores every response automatically (a minimal sketch follows this list)
- Scores on: relevance, instruction following, tone consistency, length
- Scores aggregated per prompt version over time
- Async via Kafka — zero latency impact on your app
- Live view of prompt performance via WebSockets
- A/B test progress and winner stats
- Version history timeline
- Per-tag and per-model breakdown
- Project-based isolation
- API key management per project
- OAuth2 login via GitHub / Google
- Rate limiting per user and per project
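Here is the minimal illustration of the LLM-as-judge idea referenced in the list above: ask a strong model to grade a response on the four listed dimensions and return JSON. This is a sketch, not IteraTūn's implementation; the judge prompt and score schema are invented:

```python
# Minimal LLM-as-judge sketch; not IteraTūn's implementation. The judge
# prompt and the JSON score schema are invented for illustration.
import json
import openai

JUDGE_PROMPT = """You are a strict evaluator. Score the response against the
instruction from 1 to 5 on each dimension. Reply with JSON only:
{{"relevance": n, "instruction_following": n, "tone_consistency": n, "length": n}}

Instruction: {instruction}
Response: {response}"""

def judge(instruction: str, response: str) -> dict:
    completion = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(instruction=instruction,
                                                  response=response)}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)
```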
```
POST /api/prompts                      → create a new prompt
POST /api/prompts/{name}/versions      → push a new version
GET  /api/prompts/{name}               → get latest version
GET  /api/prompts/{name}/versions      → list all versions
GET  /api/prompts/{name}/diff          → diff between two versions
POST /api/prompts/{name}/rollback      → roll back to a version
POST /api/prompts/{name}/ab-test       → start an A/B test
GET  /api/prompts/{name}/ab-test       → get A/B test results
GET  /api/prompts/{name}/scores        → get evaluation scores over time
```
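You can also hit these endpoints without the SDK. A sketch of querying the diff endpoint with `requests`; the `from`/`to` query-parameter names and the auth header are assumptions:

```python
# Sketch of a direct REST call to the diff endpoint. Query-parameter names
# and the auth header are assumptions, not confirmed by the docs.
import requests

resp = requests.get(
    "http://localhost:8000/api/prompts/summarize-document/diff",
    params={"from": "v1", "to": "v2"},
    headers={"Authorization": "Bearer your-api-key"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```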
```bash
# Clone the repo
git clone https://github.com/AditHash/iteratun
cd iteratun

# Start all services
docker-compose up -d

# API is live at http://localhost:8000
# Dashboard at http://localhost:3000
```

```bash
pip install iteratun
```

```python
from iteratun import IteraTun

client = IteraTun(api_key="your-api-key")

# Push your first prompt
client.push(
    name="my-first-prompt",
    content="You are a helpful assistant. Answer clearly and concisely."
)

# Fetch and use it
prompt = client.get("my-first-prompt")
print(prompt.content)
```

- Prompt versioning and history
- A/B testing engine
- Automated LLM-as-judge evaluation
- Python SDK
- JavaScript / TypeScript SDK
- Prompt templates with variable injection
- Team collaboration and comments per version
- Webhook alerts on performance degradation
- Integration with LangChain and LlamaIndex
- Self-hosted one-click deploy (Railway / Render)
The name says it all.
Itera — you iterate over prompt versions, testing and refining.
Tūn — you tune prompts to peak performance using real data.
Every serious AI team does prompt iteration. Almost none of them do it with any discipline or data.
IteraTūn changes that.
IteraTūn is open source and contributions are welcome.
```bash
# Fork the repo, create a branch
git checkout -b feature/your-feature

# Make your changes, then open a PR
```

Please read CONTRIBUTING.md before submitting a pull request.
MIT License — free to use, modify, and distribute.
Built with ❤️ by AditHash