Delta Report

Delta Report: Current State → Stable Release

Comprehensive analysis of the gap between current implementation and a stable, shippable release.

Derived from: all 30 wiki pages, Conceptual Model, Features reference (450+ endpoints), Testing Plan, Test Coverage Audit, Acceptance Criteria, and direct codebase investigation.

Date: 2026-03-09 | Version: 2.0.4

1. The Three States

Current State — What's Built Today

65% complete against the full Conceptual Model (33 of 56 features fully implemented, 7 partial, 16 missing).

Production-Ready Systems

System	Status	Key Evidence
OpenAI API Compatibility	Complete	`POST /v1/chat/completions` — streaming, JSON mode, tool calling, logprobs
Anthropic API Compatibility	Complete	`POST /v1/messages` — streaming in Anthropic event format
Authentication	Complete	Privy, Google OAuth, GitHub, phone, email. Fernet encryption, HMAC lookup, temp email detection
API Key Management	Complete	Creation, rotation, scoping, IP allowlists, domain restrictions, expiration. `api_keys_new` table
3-Layer Rate Limiting	Complete	IP middleware (velocity mode), API key (Redis), anonymous. In-memory fallback when Redis down
Model Catalog	Complete	10,000+ models, 30+ providers, background sync, HuggingFace enrichment, search, dedup, trending
Provider Failover	Complete	14-provider chain, model-aware rules (OpenAI→OpenRouter only, etc.)
Circuit Breakers	Complete	CLOSED→OPEN (5 failures)→HALF_OPEN (60s timeout)→CLOSED/OPEN. Per-provider
Intelligent Routing	Complete	Code Router (SWE-bench/HumanEval tiered), General Router (quality/cost/latency/balanced)
Credit System	Complete	Pre-flight checks, deduction, idempotency (UNIQUE constraint + RPC), subscription priority, auto-refund
Stripe Payments	Complete	Checkout, payment intents, webhooks (6 events), subscriptions, refunds. 1 credit = $0.01
Plans & Trials	Complete	3-day/$5 trial, Free/Starter/Pro/Enterprise tiers, daily usage caps ($1/day configurable)
Coupons	Complete	Create, validate, redeem, per-user limits, expiration. `coupons` + `coupon_redemptions` tables
Referrals	Complete	Code generation, $10 bonus both sides on first $10+ purchase, 10 uses max
Chat History	Complete	Sessions, messages, batch save, full-text search, sharing, feedback, auto-injection
Activity Logging	Complete	User actions, API usage, security events. 90-day retention, GDPR export/anonymization
Audit System	Complete	SOC 2/HIPAA/GDPR compliance. Tamper-proof, hash chain, severity-based retention (7yr/3yr/1yr/90d)
RBAC	Complete	Admin/User/Developer/Support roles, permission decorators, scope-based key permissions
Health Monitoring	Complete	Tiered (Critical 5min/Popular 30min/Standard 2-4hr), passive capture, incidents, 50+ Prometheus metrics
Observability	Complete	Prometheus + Grafana, OpenTelemetry, Sentry, Pyroscope (cache/Redis layers)
Error Monitoring	Complete	Autonomous monitor, pattern detection, fixable classification, critical alerts
Feature Flags	Complete	Statsig gates, configs, experiments, percentage rollouts
Image Generation	Complete	Provider routing, credit deduction, multiple providers
Audio Transcription	Complete	File upload and base64
Server-Side Tools	Complete	Web search, TTS, SSRF protection
Admin	Complete	80+ endpoints: user/credit/cache/sync/role/trial/downtime/coupon management
CI/CD	Complete	Supabase migrations with destructive operation blocking, GitHub Actions

Partially Built Systems

System	What Works	What's Missing
Provider Credit Monitoring	OpenRouter: full implementation with API call, 15-min cache, threshold alerts (critical $5, warning $20, info $50), email alerts	29 other providers have TODO stubs. No preemptive deprioritization in failover chain
Response Caching	`response_cache.py` exists with SHA-256 hashing, Redis + in-memory fallback. User cache settings endpoints exist (`GET/PUT /user/cache-settings`)	Cache is metadata-only (models, providers, health). NOT wired into inference pipeline. User cache preference is stored but ignored during inference. Butter.dev proxy called regardless of preference
Load Balancing	Failover chain with priority ordering. Model selector with quality priors + real-time metrics. Hash-based sticky routing per conversation	No weighted traffic splitting. No dynamic latency-optimal selection (General Router "latency" hardcodes to `groq/llama-3.3-70b-versatile`). No cost-optimal provider selection per model
Model Quality Scoring	Hardcoded quality priors for ~20 models in `model_selector.py` (task-specific: simple_qa, code_gen, reasoning, etc.). SWE-bench/HumanEval in Code Router's `code_quality_priors.json`	Not stored in DB. Not updatable without code change. Missing MMLU, MATH, MT-Bench, LMSYS Arena ELO, LiveBench. No per-customer quality tracking
Usage Analytics	Admin-side: model usage view, chat request monitoring, request counts by model. Cache analytics via Butter	No per-API-key breakdown (`activity_log` stores `user_id` but NOT `api_key_id`). No latency percentiles for customers (p50/p95/p99 admin-only). No CSV/JSON export
Google Vertex	REST path: function calling transformation implemented (`_translate_openai_tools_to_vertex()`). Models working for standard inference	SDK (non-REST) path has TODO: "Function calling may not work correctly." Wiki notes function calling as "in progress"
Streaming Normalization	OpenAI, Gemini, Anthropic, Fireworks formats handled in `stream_normalizer.py` with dedicated normalizers	Providers returning completely non-standard format are silently dropped (returns `None`). No error/warning to client
AI-Specific Tracing	Arize config file exists. OpenTelemetry captures inference metadata	Arize Phoenix not exposed via API. Braintrust not integrated. No prompt/response pair recording for quality analysis

Not Built At All

System	Conceptual Model Section	Description
Input Guardrails	2.2	PII detection (phone, SSN, email, credit card scanning), prompt injection defense, topic restrictions, content moderation
Output Guardrails	2.2	Content filtering on responses, structured output validation (JSON schema conformance), hallucination flags (normalized safety metadata)
Semantic Cache	2.5	Vector similarity matching for semantically equivalent prompts. Requires vector DB + embedding model
Exact-Match Inference Cache	2.5	SHA-256 hash of {messages + model + params} → cached response. 20K entries, 60-min TTL, LRU eviction
Customer Webhooks	2.7	Outbound event delivery (credits.low, credits.depleted, model.degraded, rate_limit.approaching). HMAC signing, retry logic, delivery log
SLA Tracking	2.7	Per-tier SLA definitions, violation detection (P99 latency, error rate), credit-back compensation
Batch/Async Inference	2.8	`POST /v1/batch/jobs` for bulk workloads at reduced cost. Job queue, status polling, webhook on completion
Prompt Management	2.8	Template library with versioning, template variables, A/B testing, per-key default system prompts
Evaluation/Playground	2.8	Side-by-side model comparison, regression testing, interactive prompt testing UI
Geo-Aware Routing	2.11	IP geolocation, nearest-region provider selection, latency-based geographic optimization
Data Residency	2.11	GDPR compliance routing (EU customers → EU providers), data sovereignty enforcement
Multi-Region Redis	2.11	Cache replication across regions
Traffic Splitting	2.3	Weighted distribution across providers for same model (e.g., 70/30 split)
Per-Customer Quality Tracking	2.4	Per-customer success rate tracking, model preference learning, personalized routing

Conceptual State — The Full Vision

56 features across 10 layers. Includes enterprise capabilities (geo-routing, SLA credit-backs, semantic caching) and developer platform features (prompt management, batch inference, playground) that are future roadmap items.

Expected State — What Stable Release Requires

Not everything in the Conceptual Model. The expected state is: every feature that's exposed to users works correctly, safely, and predictably. No half-built features visible. No billing bugs. No security holes. No silent failures.

2. What "Stable" Means

Plain Language

A developer signs up, gets an API key, sends requests to any model through the OpenAI or Anthropic API format, gets reliable responses with automatic failover, sees exactly what they spent, pays for what they used, and never encounters a broken feature, a silent failure, a double-charge, or an exposed stack trace. Every endpoint that's reachable does what it says. Features that aren't ready yet aren't visible.

Precise Requirements

S1 — Reliability: Every inference request either succeeds or returns a clear, actionable error. Provider failures trigger failover that is transparent to the client. Circuit breakers prevent cascading failures. A Redis outage doesn't break the system. Health endpoints always return 200 (degradation is reported in the body, not the status code).

S2 — Billing Correctness: Credits deducted accurately per (prompt_tokens × prompt_price) + (completion_tokens × completion_price). Pre-flight checks prevent wasted provider calls. No double-charging on retries. Subscription allowance consumed before purchased credits. Provider 5xx auto-refunds. User 4xx does not refund. High-value models never served at default pricing.

S3 — Security: API keys encrypted at rest (Fernet AES-128). HMAC-SHA256 for key lookup. SQL/XSS/command/path injection prevented. RBAC enforced on all admin endpoints. Audit trail for security-sensitive operations. Rate limiting on all 3 layers with proper response headers.

S4 — No Ghost Features: Every user-reachable endpoint returns real, functional data. No stubs that accept configuration but do nothing. No UI toggles for non-functional features. If a feature isn't built, the endpoint shouldn't exist.

S5 — Observability: Prometheus metrics, OpenTelemetry traces, Sentry error tracking operational. Health monitoring detects provider degradation. Admin dashboard shows user counts, credit totals, API usage. Problems are detectable before users report them.

S6 — Billing Integrity: Stripe payments add correct credit amounts. Webhooks are idempotent. Trial limits enforced (3 days, $5 cap, $1/day). Expired trials blocked from paid models, allowed on :free models. Coupon redemption validates expiry, one-per-user, user-specificity.

S7 — Consistent DX: All error responses have consistent JSON format. Streaming SSE normalized across all providers. Rate limit 429 responses include standard headers. Documentation matches behavior.
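The S2 cost formula can be sketched as below. This is only an illustration of the arithmetic, not the pricing code: the function name `request_cost` and the per-token prices are hypothetical, and `Decimal` is one possible way to avoid float rounding in billing math.

```python
from decimal import Decimal


def request_cost(prompt_tokens: int, completion_tokens: int,
                 prompt_price: Decimal, completion_price: Decimal) -> Decimal:
    """(prompt_tokens x prompt_price) + (completion_tokens x completion_price)."""
    return prompt_tokens * prompt_price + completion_tokens * completion_price


# Hypothetical per-token prices, chosen only to make the arithmetic visible:
# 1000 x 0.000002 + 500 x 0.000006
cost = request_cost(1000, 500, Decimal("0.000002"), Decimal("0.000006"))
```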

3. P0 — Must Fix Before Release

These cause billing errors, security incidents, or user trust erosion if shipped as-is.

P0-1: Remove or Implement Butter.dev Cache Settings

The Problem: GET /user/cache-settings and PUT /user/cache-settings are exposed to users. They store an enable_butter_cache preference in the user's preferences JSON column. However, src/routes/chat.py (line 697) calls get_butter_pooled_async_client() without checking the user's preference. The Butter proxy is always used regardless of the setting.

Why It's P0: This is a ghost feature. Users can toggle a setting that does nothing. If a user disables caching and expects their data not to go through a third-party proxy, their expectation is violated. This erodes trust.

What to Do: Either (a) wire the preference check into the inference path so enable_butter_cache=false bypasses the Butter proxy, or (b) remove both endpoints entirely and remove the Butter preference from the user schema. Option (b) is faster and simpler.

Files: src/routes/users.py (lines 305-408), src/routes/chat.py (line 697)
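Option (a) could look roughly like the sketch below. Only `enable_butter_cache` comes from the codebase; `select_inference_client` and the two factory arguments are hypothetical stand-ins for however chat.py obtains its clients, and defaulting a missing preference to `True` is an assumption that preserves today's behavior.

```python
from typing import Any, Callable, Mapping


def select_inference_client(
    preferences: Mapping[str, Any],
    butter_client_factory: Callable[[], Any],
    direct_client_factory: Callable[[], Any],
) -> Any:
    """Route through the Butter proxy only when the user has not opted out."""
    # Assumption: a missing preference means "enabled", matching current behavior.
    if preferences.get("enable_butter_cache", True):
        return butter_client_factory()
    return direct_client_factory()
```

Passing factories rather than clients keeps the proxy client from being constructed at all for users who opted out.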

P0-2: Verify Credit Deduction Atomicity on Legacy Path

The Problem: Credit deduction has two code paths in src/db/users.py (lines 701-1106):

Atomic path (lines 862-967): Uses atomic_deduct_credits RPC stored procedure. Single PostgreSQL transaction — balance update AND transaction log happen together. This is correct.
Legacy/fallback path (lines 987-1096): Used when the RPC is unavailable. Two separate calls:
- Lines 1006-1018: Updates users table (balance deduction)
- Lines 1066-1074: Logs transaction via log_credit_transaction()
- Lines 1077-1082: If transaction logging fails, credits are already deducted. Error is logged but not re-raised.

Why It's P0: On the legacy path, a crash or DB error between the two calls creates a state where the user's balance is reduced but there's no transaction record. The user was charged but there's no audit trail. This is a billing integrity issue.

What to Do: Determine if the legacy path is still reachable in production. If the atomic_deduct_credits RPC exists in all environments (production, staging), the legacy path may be dead code. If it IS reachable, wrap both operations in a single transaction or make the legacy path re-raise the logging error (allowing the balance update to roll back). Alternatively, remove the legacy path entirely if the RPC is always available.

Idempotency is solid: Request ID check at lines 745-765 (application level) + UNIQUE constraint on credit_transactions.request_id (DB level) + atomic RPC path combines check-and-deduct. The race condition window on the application-level check is covered by the DB constraint.

Files: src/db/users.py (lines 701-1106), supabase/migrations/20260223000001_add_request_id_to_credit_transactions.sql
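The property the legacy path needs is that the balance update and the transaction log commit together or not at all. The sketch below demonstrates that property with SQLite standing in for PostgreSQL; the table layout, column names, and `deduct_credits` signature are simplified assumptions, not the real schema or the `atomic_deduct_credits` RPC.

```python
import sqlite3


def deduct_credits(conn: sqlite3.Connection, user_id: int, amount: int, request_id: str) -> bool:
    """Deduct and log in one transaction: both rows change, or neither does."""
    try:
        with conn:  # commits on success, rolls back on any exception
            cur = conn.execute(
                "UPDATE users SET credits = credits - ? WHERE id = ? AND credits >= ?",
                (amount, user_id, amount),
            )
            if cur.rowcount != 1:
                raise ValueError("insufficient credits or unknown user")
            conn.execute(
                "INSERT INTO credit_transactions (user_id, amount, request_id) VALUES (?, ?, ?)",
                (user_id, -amount, request_id),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate request_id: the UNIQUE constraint rejected the retry
        # and the balance update above was rolled back with it.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, credits INTEGER NOT NULL)")
conn.execute("CREATE TABLE credit_transactions (user_id INTEGER, amount INTEGER, request_id TEXT UNIQUE)")
conn.execute("INSERT INTO users VALUES (1, 100)")
conn.commit()

first = deduct_credits(conn, 1, 30, "req-1")
retry = deduct_credits(conn, 1, 30, "req-1")  # same request ID, simulating a client retry
balance = conn.execute("SELECT credits FROM users WHERE id = 1").fetchone()[0]
```

The retry case is the important one: the failed log insert undoes the second balance update, which is exactly what the current legacy path does not do.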

P0-3: Verify High-Value Model Pricing Protection End-to-End

The Problem: src/services/pricing.py (lines 783-839) has an explicit guard:

HIGH_VALUE_MODEL_PATTERNS = [
    "gpt-4", "gpt-5", "o1-", "o3-", "o4-",
    "claude-3", "claude-opus", "claude-sonnet-4",
    "gemini-1.5-pro", "gemini-2", "gemini-pro",
    "command-r-plus", "mixtral-8x22b"
]

When a high-value model matches AND pricing falls to the $0.00002/token default, a ValueError is raised with a Sentry alert (lines 808-839). Non-high-value models are allowed to use default pricing (lines 842-854).

Why It's P0: The guard exists but needs end-to-end verification. Questions:

Is this function called BEFORE the provider API call in chat.py? If pricing resolution happens after the inference call, the guard is too late — the provider was already called and tokens were consumed.
Are the patterns comprehensive? New models (GPT-4.1, Claude 4, Gemini 2.5) may not match existing patterns.
What happens when the ValueError is raised — does the user get a clear 4xx error or a 500?

What to Do: Trace the call chain from chat.py through pricing resolution to confirm the guard fires BEFORE the provider call. Add any missing model patterns (especially newer models). Verify the ValueError is caught and returns a clear error to the user (not a 500).

Files: src/services/pricing.py (lines 783-839), src/routes/chat.py (pricing resolution section)
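Pattern coverage can be checked mechanically. The sketch below assumes the guard uses case-insensitive substring matching, which is not confirmed here; if pricing.py uses prefix or regex matching, the results differ. The candidate model IDs are hypothetical examples, not catalog entries.

```python
HIGH_VALUE_MODEL_PATTERNS = [
    "gpt-4", "gpt-5", "o1-", "o3-", "o4-",
    "claude-3", "claude-opus", "claude-sonnet-4",
    "gemini-1.5-pro", "gemini-2", "gemini-pro",
    "command-r-plus", "mixtral-8x22b",
]


def is_high_value(model_id: str) -> bool:
    """Assumed matching rule: case-insensitive substring match."""
    name = model_id.lower()
    return any(pattern in name for pattern in HIGH_VALUE_MODEL_PATTERNS)


def unmatched(model_ids):
    """Return the model IDs that no pattern covers."""
    return [m for m in model_ids if not is_high_value(m)]


# Hypothetical IDs for illustration only.
candidates = [
    "openai/gpt-4.1",
    "anthropic/claude-opus-4",
    "google/gemini-2.5-pro",
    "anthropic/claude-4-sonnet",  # version before family name
]
gaps = unmatched(candidates)
```

Under this assumption, an ID that puts the version number before the family name slips past `claude-sonnet-4`. Running a check like this against the real catalog would turn the "are the patterns comprehensive?" question into a list.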

P0-4: Verify Provider Error Auto-Refund Path

The Problem: src/routes/chat.py (lines 1670-1742) classifies errors and refunds:

5xx errors (502, 503) → "provider_error" → refund via refund_credits()
Timeout errors → "timeout_error" → refund
4xx errors (400, 404) → "not_found_error" → no refund
Refund condition (lines 1699-1705): Only when credit_deduction_success=True, not anonymous, and total_tokens > 0

Why It's P0: The logic looks correct but needs integration testing. Edge cases:

What if the provider returns 502 but the response was partially streamed (tokens already consumed)? Is the full deduction refunded or just the unused portion?
What if refund_credits() itself fails? Is the failure logged? Is the user notified?
The condition requires total_tokens > 0 — what about requests that fail before any tokens are generated? Those should not have been charged in the first place (pre-flight check should catch them).

What to Do: Write integration tests that: (1) force a 503 from the primary provider, verify refund transaction exists; (2) force a 400, verify no refund; (3) force a timeout, verify refund; (4) verify partial stream + error still refunds correctly.

Files: src/routes/chat.py (lines 1670-1742)
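The refund rule described above reduces to a small predicate, which makes the edge cases easy to put in a table-driven test. The error type strings and the three conditions come from this report; the function name and signature are hypothetical, and the real logic in chat.py may be structured differently.

```python
REFUNDABLE_ERROR_TYPES = {"provider_error", "timeout_error"}


def should_refund(error_type: str, credit_deduction_success: bool,
                  is_anonymous: bool, total_tokens: int) -> bool:
    """Mirror of the documented refund condition, for use in tests."""
    return (
        error_type in REFUNDABLE_ERROR_TYPES
        and credit_deduction_success
        and not is_anonymous
        and total_tokens > 0
    )


# (error_type, credit_deduction_success, is_anonymous, total_tokens) -> expected
CASES = [
    (("provider_error", True, False, 120), True),    # 503 after deduction
    (("timeout_error", True, False, 120), True),     # timeout after deduction
    (("not_found_error", True, False, 120), False),  # user-side 4xx
    (("provider_error", True, True, 120), False),    # anonymous request
    (("provider_error", True, False, 0), False),     # failed before any tokens
]
results = [should_refund(*args) == expected for args, expected in CASES]
```

The last case is the one flagged above: if a deduction can ever succeed with `total_tokens == 0`, this rule leaves the user charged with no refund.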

P0-5: Fix Rate Limit Headers on Layers 2 and 3

The Problem: Three rate limit layers, three different header behaviors:

Layer	File	Headers on 429
Layer 1: IP Middleware	`security_middleware.py` (lines 647-716)	YES: `Retry-After`, `X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`, `X-RateLimit-Reason`, `X-RateLimit-Mode`
Layer 2: API Key Service	`rate_limiting.py` (lines 78-94)	PARTIAL: Header fields defined in `RateLimitResult` dataclass (`ratelimit_limit_requests`, `ratelimit_reset_requests`, etc.) but not converted to HTTP response headers
Layer 3: Anonymous Limiter	`anonymous_rate_limiter.py`	NO: No `Retry-After` or `X-RateLimit-*` headers

Why It's P0: When Layer 2 or Layer 3 triggers a 429, the client gets a bare rejection with no information about when to retry. This causes:

Clients retry immediately in a tight loop (making the problem worse)
SDKs like the OpenAI Python client expect Retry-After to implement backoff
API consumers can't implement proper retry logic

What to Do:

Layer 2: The RateLimitResult dataclass already has the right fields. Add code in the route handler (or a middleware) to convert these fields to HTTP headers when returning a 429.
Layer 3: Add Retry-After and X-RateLimit-* headers to the anonymous limiter's 429 response.

Files: src/services/rate_limiting.py (lines 78-94), src/services/anonymous_rate_limiter.py, src/routes/chat.py (where rate limit results are converted to responses)
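For Layer 2, the conversion could be as small as the sketch below. `ratelimit_limit_requests` and `ratelimit_reset_requests` are the field names cited above; `ratelimit_remaining_requests`, the `allowed` flag, and the assumption that the reset value is a Unix timestamp are guesses that need checking against the real `RateLimitResult`.

```python
import math
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RateLimitResult:
    """Simplified stand-in for the dataclass in rate_limiting.py."""
    allowed: bool                      # assumed field
    ratelimit_limit_requests: int
    ratelimit_remaining_requests: int  # assumed field name
    ratelimit_reset_requests: int      # assumed: Unix timestamp of window reset


def rate_limit_headers(result: RateLimitResult, now: Optional[float] = None) -> Dict[str, str]:
    """Build the headers a 429 response should carry."""
    now = time.time() if now is None else now
    retry_after = max(0, math.ceil(result.ratelimit_reset_requests - now))
    return {
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(result.ratelimit_limit_requests),
        "X-RateLimit-Remaining": str(max(0, result.ratelimit_remaining_requests)),
        "X-RateLimit-Reset": str(result.ratelimit_reset_requests),
    }


headers = rate_limit_headers(RateLimitResult(False, 60, 0, 1030), now=1000.0)
```

The resulting dict can be passed as the `headers` argument when the route raises or returns the 429. The same helper would serve Layer 3 if the anonymous limiter can produce the same four values.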

P0-6: Audit Admin Endpoint Auth Coverage

The Problem: The Features wiki documents that GET /admin/model-sync/providers has "No auth enforced." Direct code investigation found that POST /admin/create (user registration endpoint at src/routes/admin.py lines 52-101) is intentionally public. But the model-sync providers endpoint needs verification.

Why It's P0: Any admin endpoint accessible without auth is a security issue. The model-sync providers endpoint leaks internal infrastructure details (which 33 providers are configured, their slugs, their sync status).

What to Do: Audit every route in src/routes/admin.py and src/routes/admin_*.py for missing Depends(require_admin). The POST /admin/create endpoint is intentional (user registration). Everything else must require admin auth. Fix any gaps found.

Files: src/routes/admin.py, all files matching src/routes/admin*.py
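A first pass of the audit can be automated with a static scan. The sketch below uses Python's `ast` module to list route handlers that never reference `require_admin` in their decorators or parameter defaults. It is a heuristic under stated assumptions: it expects routes to be declared with `@<router>.get/post/...` decorators, and it cannot see dependencies attached at the router level, so every hit still needs a manual look.

```python
import ast

HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def routes_missing_admin_dep(source: str, dep_name: str = "require_admin"):
    """Return names of route handlers that never reference dep_name."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        is_route = any(
            isinstance(dec, ast.Call)
            and isinstance(dec.func, ast.Attribute)
            and dec.func.attr in HTTP_METHODS
            for dec in node.decorator_list
        )
        if not is_route:
            continue
        # Search decorators and parameter defaults, where Depends(...) normally lives.
        searched = [*node.decorator_list, *node.args.defaults, *node.args.kw_defaults]
        names = {
            sub.id
            for part in searched if part is not None
            for sub in ast.walk(part)
            if isinstance(sub, ast.Name)
        }
        if dep_name not in names:
            missing.append((node.lineno, node.name))
    return [name for _, name in sorted(missing)]


# Hypothetical handlers, for illustration only.
SAMPLE = '''
@router.get("/admin/users")
async def list_users(admin=Depends(require_admin)):
    ...

@router.get("/admin/model-sync/providers")
async def list_providers():
    ...
'''
flagged = routes_missing_admin_dep(SAMPLE)
```

Running this over each file matching src/routes/admin*.py gives a worklist; POST /admin/create is expected to appear in it and can be excluded by name.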

P0-7: Reconcile Trial Configuration

The Problem: Three sources give different trial parameters:

Source	Credits	Duration	Daily Limit
`CLAUDE.md`	$5	3 days	Not specified
Wiki (Free-Trial-System.md)	$10 / 1000 credits	3 days	Not specified
Code (`src/config/usage_limits.py`)	$5	3 days	$1/day

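Additionally, src/db/trials.py line 44 has a formula trial_days * 5 that yields $70 for a 14-day trial, which suggests the function accepts variable trial durations with a default of 3 days. By the same formula the default 3-day trial would come to $15, which matches neither the $5 nor the $10 figure above, so the formula's units and whether this path is reachable both need checking.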

Why It's P0: If the wiki says $10 but the code gives $5, users who read the docs expect $10 and get $5. Or worse — if there's a code path that gives $10 and another that gives $5, different users get different amounts depending on their signup path.

What to Do: Determine the canonical trial amount. Update all three sources (CLAUDE.md, wiki, code comments) to match. Verify there's exactly one code path for trial credit allocation and it uses the configured value.

Files: src/config/usage_limits.py, src/db/trials.py, src/db/api_keys.py, docs/CONCEPTUAL_MODEL.md

4. P1 — Should Fix Before Release

These cause bad user experience but aren't billing/security issues.

P1-1: Extend Provider Credit Monitoring Beyond OpenRouter

Current State: src/services/provider_credit_monitor.py — only OpenRouter has a real implementation (lines 33-138). It calls https://openrouter.ai/api/v1/auth/key, caches for 15 minutes, and has threshold-based alerting (critical: $5, warning: $20, info: $50). Lines 165-167 have TODO stubs for all other providers.

Impact: When a non-OpenRouter provider runs out of upstream credits, requests fail with 402 from the provider. Failover catches this (402 triggers failover), but if multiple providers exhaust credits simultaneously, the failover chain degrades. No warning before it happens.

What to Do: Implement credit checking for the top 5 providers by traffic volume. Each provider has a different API for balance checking — some may not have one at all. For providers without balance APIs, monitor for 402 response frequency as a proxy signal.

Scope: src/services/provider_credit_monitor.py
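For providers without a balance API, the 402-frequency proxy could be a sliding-window counter like the sketch below. The class name, the 5-minute window, and the threshold of 3 are all hypothetical; the caller supplies timestamps so the logic can be tested without waiting.

```python
from collections import defaultdict, deque


class PaymentRequiredMonitor:
    """Counts 402 responses per provider within a sliding time window."""

    def __init__(self, window_seconds: float = 300.0, threshold: int = 3):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = defaultdict(deque)  # provider -> timestamps of recent 402s

    def record(self, provider: str, status_code: int, now: float) -> bool:
        """Record a response; return True when recent 402s reach the threshold."""
        events = self._events[provider]
        if status_code == 402:
            events.append(now)
        # Drop 402s that have aged out of the window.
        while events and now - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold
```

A `True` return would feed the same alerting path OpenRouter already uses, and could later drive deprioritization in the failover chain.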

P1-2: Standardize Error Response Format

Current State: Across all route files:

~95% use raise HTTPException(status_code=XXX, detail="message") — produces {"detail": "message"}
~5% use JSONResponse(status_code=XXX, content={...}) — custom format
Chat endpoints use APIExceptions helper which produces OpenAI-compatible format: {"error": {"message": "...", "type": "...", "code": "..."}}

Impact: Clients parsing error responses must handle multiple formats. The OpenAI SDK expects {"error": {"message": "..."}}. FastAPI's default {"detail": "..."} breaks OpenAI SDK error handling.

What to Do: For OpenAI/Anthropic-compatible endpoints (/v1/chat/completions, /v1/messages, /v1/images/generations, /v1/audio/transcriptions), ensure all errors use OpenAI-compatible format. For other endpoints, FastAPI default is fine. The key is that inference endpoints must be SDK-compatible.

Scope: src/routes/chat.py, src/routes/messages.py, src/routes/images.py, src/routes/audio.py
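The conversion itself is a small pure function, sketched below. The target shape is the OpenAI-compatible format quoted above; the function name and the mapping from status code to `type` are assumptions, and `code` is left as `None` because no code taxonomy is defined in this report.

```python
from typing import Any, Dict


def to_openai_error(status_code: int, detail: Any) -> Dict[str, Any]:
    """Wrap a FastAPI-style `detail` value in the OpenAI-compatible error envelope."""
    # Already in the target format (e.g. produced by the APIExceptions helper).
    if isinstance(detail, dict) and "error" in detail:
        return detail
    # Assumed mapping: 4xx is the caller's problem, everything else is ours.
    error_type = "invalid_request_error" if 400 <= status_code < 500 else "api_error"
    return {"error": {"message": str(detail), "type": error_type, "code": None}}
```

Registering this in an exception handler scoped to the inference routes would fix the format in one place rather than at every `raise HTTPException(...)` call site.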

P1-3: Add Automated Catalog Gating at Sync Time

Current State: src/services/model_catalog_sync.py — during model sync, extract_pricing() (lines 136-153) returns all None values if pricing is missing. Line 368 checks if any(pricing.values()) but this is non-blocking. Models without pricing ARE synced into the catalog and become visible to users.

Impact: A model without pricing enters the catalog. When a user requests it:

If it's a high-value model, the pricing guard blocks it (ValueError) — user gets an error
If it's a non-high-value model, it falls to default pricing ($0.00002/token) — potentially under-billing
Either way, the user experience is poor: the model is in the catalog but doesn't work properly

What to Do: Add a validation gate in the sync pipeline: reject models where not any(pricing.values()) and the model is not explicitly whitelisted. Log rejected models for admin review. This prevents "dark" models from appearing in the catalog.

Scope: src/services/model_catalog_sync.py
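The gate could be sketched as follows. The `any(pricing.values())` test is the one cited above; the model dict shape (`id`, `pricing`), the function name, and the allowlist parameter are hypothetical.

```python
def gate_models(models, allowlist=frozenset()):
    """Split synced models into (accepted, rejected) based on pricing presence."""
    accepted, rejected = [], []
    for model in models:
        pricing = model.get("pricing") or {}
        if any(pricing.values()) or model["id"] in allowlist:
            accepted.append(model)
        else:
            rejected.append(model)  # log these for admin review
    return accepted, rejected
```

One consequence of reusing `any(pricing.values())`: a model priced at exactly 0 is indistinguishable from one with no pricing, so genuinely free models would have to be on the allowlist.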

P1-4: Fix Activity Log Pagination `total` Field

Current State: src/routes/users.py (lines 510-518):

"total": len(transactions),  # Returns the number of rows on this page, NOT the DB total

Impact: Any frontend using total to calculate page count gets wrong numbers. If a user has 500 transactions and requests limit=50, total returns 50 (the page size), not 500 (the actual total). Pagination shows "1 page" when there are 10.

What to Do: Add a separate count query (SELECT COUNT(*) FROM credit_transactions WHERE user_id = ...) and return that as total. Rename the current field to count or returned to avoid confusion.

Scope: src/routes/users.py (line 515), potentially src/db/credit_transactions.py
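The fix amounts to running a count query next to the page query. The sketch below shows the shape with SQLite standing in for the real database; the table columns and the `list_transactions` helper are simplified assumptions.

```python
import sqlite3


def list_transactions(conn: sqlite3.Connection, user_id: int, limit: int, offset: int = 0) -> dict:
    """Return one page of rows plus the true total across all pages."""
    rows = conn.execute(
        "SELECT id, amount FROM credit_transactions "
        "WHERE user_id = ? ORDER BY id LIMIT ? OFFSET ?",
        (user_id, limit, offset),
    ).fetchall()
    total = conn.execute(
        "SELECT COUNT(*) FROM credit_transactions WHERE user_id = ?",
        (user_id,),
    ).fetchone()[0]
    return {"transactions": rows, "returned": len(rows), "total": total}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE credit_transactions (id INTEGER PRIMARY KEY, user_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO credit_transactions (user_id, amount) VALUES (?, ?)",
    [(1, -1)] * 500,
)
page = list_transactions(conn, user_id=1, limit=50)
```

With 500 rows and `limit=50`, `returned` is the page size and `total` is the figure a frontend needs to compute 10 pages.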

P1-5: Complete Google Vertex Function Calling

Current State: src/services/google_vertex_client.py:

REST path (lines 250-402, 662-707): Function calling transformation IS implemented — _translate_openai_tools_to_vertex() converts OpenAI tool format to Vertex functionDeclarations, _translate_tool_choice_to_vertex() handles tool_choice options
SDK path (lines 585-587): TODO comment — "Function calling may not work correctly"

Impact: If Vertex models are in the catalog with supports_function_calling: true but the SDK path is used for some requests, function calling silently fails or produces wrong results.

What to Do: Either (a) ensure the REST path is always used when tools are present (route around the SDK path), or (b) implement function calling in the SDK path, or (c) mark Vertex models as supports_function_calling: false in the catalog until SDK path is complete.

Scope: src/services/google_vertex_client.py

P1-6: Define Subscription Overage Strategy

Current State: src/routes/chat.py (lines 2024-2025, 3737-3738) — when credits <= 0, a 402 Payment Required is returned. This covers the case where subscription allowance AND purchased credits are both 0.

The Gap: The wiki (Subscription-Plans.md) notes that the overage handling strategy is incomplete — "block vs. allow with notification" is undefined. Currently, the system blocks (402). But there's no:

Warning notification when credits are running low (e.g., at 20% remaining)
Grace period for subscribers (allow a few more requests while they top up)
Clear messaging in the 402 response about what to do (buy credits vs. upgrade plan)

What to Do: The block behavior (402) is correct and safe for v1. Enhance the 402 response body to include: current balance, link to purchase credits, and link to upgrade plan. Optionally, add a low-balance warning header (X-Credits-Remaining) on successful responses when balance drops below a threshold.

Scope: src/routes/chat.py, src/services/pricing.py

P1-7: Test Circuit Breaker Recovery End-to-End

Current State: src/services/circuit_breaker.py — circuit breaker uses a 60-second timeout (not 5 minutes as the Conceptual Model states). After 60 seconds in OPEN state, transitions to HALF_OPEN. Requires 2 consecutive successes in HALF_OPEN to return to CLOSED.

Discrepancy: The Conceptual Model (section 2.3) says "auto-recovers after 5 minutes of cool-down." The wiki (Testing Plan case 25.6) says "Wait 5 min after OPEN." But the code uses 60 seconds. This is either a doc error or a code error.

What to Do: Decide on the correct timeout value. 60 seconds is more aggressive recovery, 5 minutes is more conservative. Update either the code or the documentation to match. Then run an integration test that verifies the full state machine with real timing.

Scope: src/services/circuit_breaker.py (line 67), docs/CONCEPTUAL_MODEL.md, wiki Testing Plan
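An injectable clock lets the full state machine be tested without waiting out the real timeout. The sketch below is a minimal model of the behavior described in this report (5 failures to open, 60-second timeout, 2 consecutive successes to close), not the code in circuit_breaker.py. Treating the 5 failures as consecutive is an assumption, and all names are hypothetical.

```python
import time
from typing import Callable


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0,
                 success_threshold: int = 2, clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self._clock = clock
        self.state = "CLOSED"
        self._failures = 0
        self._successes = 0
        self._opened_at = 0.0

    def allow_request(self) -> bool:
        # OPEN becomes HALF_OPEN once the recovery timeout has elapsed.
        if self.state == "OPEN" and self._clock() - self._opened_at >= self.recovery_timeout:
            self.state = "HALF_OPEN"
            self._successes = 0
        return self.state != "OPEN"

    def record_success(self) -> None:
        if self.state == "HALF_OPEN":
            self._successes += 1
            if self._successes >= self.success_threshold:
                self.state = "CLOSED"
                self._failures = 0
        else:
            self._failures = 0  # assumption: only consecutive failures count

    def record_failure(self) -> None:
        if self.state == "HALF_OPEN":
            self._trip()  # any failure while probing reopens the circuit
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = "OPEN"
        self._opened_at = self._clock()
        self._failures = 0
```

In a test, pass `clock=lambda: now[0]` and advance `now[0]` by hand; the same test then works unchanged whether the agreed timeout ends up as 60 seconds or 5 minutes.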

P1-8: Verify Streaming Normalization for Edge Cases

Current State: src/services/stream_normalizer.py handles OpenAI, Gemini, Anthropic, and Fireworks formats. If a provider returns a completely unrecognized format, the normalizer returns None (line 94) — the chunk is silently dropped.

Impact: A provider updating their streaming format could cause chunks to disappear from the user's stream without any error. The user sees a truncated or empty response.

What to Do: When the normalizer drops a chunk (returns None), log a warning with the raw chunk content and provider name. This creates visibility into normalization failures. Optionally, pass unrecognized chunks through as-is rather than dropping them.

Scope: src/services/stream_normalizer.py
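The logging change could be a thin wrapper around the existing normalizer call, sketched below. The wrapper name, the `passthrough` flag, and the 200-character truncation are hypothetical; the only behavior taken from the source is that the normalizer returns `None` for unrecognized chunks.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("stream_normalizer")


def normalize_or_warn(normalize: Callable[[str], Optional[str]], raw_chunk: str,
                      provider: str, passthrough: bool = False) -> Optional[str]:
    """Call the normalizer; make dropped chunks visible instead of silent."""
    normalized = normalize(raw_chunk)
    if normalized is not None:
        return normalized
    # Truncate the raw chunk so one bad stream cannot flood the logs.
    logger.warning("Unrecognized stream chunk from %s: %r", provider, raw_chunk[:200])
    return raw_chunk if passthrough else None
```

Raw chunks can contain model output, so whether they may be logged at all depends on the logging and retention policy already in place.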

5. P2 — Nice to Have

These improve developer experience but aren't blockers.

P2-1: Add Per-API-Key Usage Breakdown

Current State: src/db/activity.py stores user_id with each activity record but NOT api_key_id. Usage is aggregated at the user level only. A user with 3 API keys (one for web app, one for mobile, one for testing) cannot see which key consumed what.

What to Do: Add api_key_id to the activity log schema. Populate it during inference request logging. Add a query endpoint: GET /user/api-keys/{key_id}/usage.

Scope: src/db/activity.py, src/routes/chat.py (logging section), src/routes/users.py

P2-2: Add Usage Export (CSV/JSON)

Current State: No export endpoint exists. Usage data is in the database but only accessible through paginated API calls.

What to Do: Add GET /user/usage/export?format=csv&start_date=...&end_date=... that returns a downloadable file with columns: date, model, provider, tokens_in, tokens_out, cost, api_key.

Scope: New endpoint in src/routes/users.py
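The CSV side of the export needs nothing beyond the standard library. In the sketch below the column list is the one proposed above; the function name and the shape of the input rows (a list of dicts) are assumptions.

```python
import csv
import io

EXPORT_COLUMNS = ["date", "model", "provider", "tokens_in", "tokens_out", "cost", "api_key"]


def usage_to_csv(rows) -> str:
    """Serialize usage rows (dicts) to CSV text with a fixed column order."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops any keys that are not export columns.
    writer = csv.DictWriter(buffer, fieldnames=EXPORT_COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Missing keys are written as empty cells, so the `api_key` column can stay blank until P2-1 adds `api_key_id` to the activity log.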

P2-3: Surface Latency Percentiles to Customers

Current State: p50/p95/p99 latency stats exist in Redis via redis_metrics.get_latency_percentiles() and are exposed at GET /api/monitoring/latency/{provider}/{model}. This endpoint uses get_optional_api_key() — likely intended as admin-only but may be publicly accessible.

What to Do: Create a user-facing endpoint: GET /user/latency?model=... returning p50/p95/p99 for models the user has used. Restrict the admin monitoring endpoints to require admin auth.

Scope: src/routes/users.py, src/routes/monitoring.py

P2-4: Improve Notification Delivery

Current State: src/services/notification.py sends emails via Resend (resend.Emails.send()). On failure: logs error, returns False, no retry, no fallback, no persistent delivery tracking. The caller continues silently — user never knows the notification failed.

What to Do: Add retry logic (2-3 attempts with backoff). Log delivery status (success/failure/retry) to a notification_deliveries table. Surface delivery history in admin dashboard.

Scope: src/services/notification.py, potentially new DB table
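The retry logic could be sketched as below. `send` stands for whatever callable wraps the Resend call and is assumed to raise on failure; the function name, the exponential schedule, and the returned dict are hypothetical. The `sleep` parameter is injectable so tests do not actually wait.

```python
import time
from typing import Callable, Dict


def send_with_retry(send: Callable[[], None], attempts: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> Dict[str, object]:
    """Try `send` up to `attempts` times, doubling the delay between tries."""
    for attempt in range(1, attempts + 1):
        try:
            send()
            return {"delivered": True, "attempts": attempt}
        except Exception:
            if attempt < attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    return {"delivered": False, "attempts": attempts}
```

The returned dict is what would be written to the proposed notification_deliveries table, giving the admin dashboard both the outcome and the number of tries.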

6. Deferred — Post-Release Roadmap

These are Conceptual Model features that require new infrastructure, not hardening. They should be communicated as "coming soon" in release notes.

#	Feature	Why Defer	Effort	Dependencies
D-1	Guardrails — PII Detection	Needs embedding/classification models, per-key config schema	Large	Moderation API or custom model
D-2	Guardrails — Prompt Injection Defense	Needs injection pattern DB, real-time classification	Large	Security research
D-3	Guardrails — Content Moderation	Needs integration with moderation classifiers (OpenAI Moderation, Perspective API)	Medium	External API
D-4	Guardrails — Output Filtering	Needs response scanning pipeline, configurable policies	Medium	Moderation API
D-5	Guardrails — Structured Output Validation	Needs JSON Schema validator in response path	Small	jsonschema library
D-6	Guardrails — Hallucination Flags	Needs normalized safety metadata schema across all providers	Medium	Provider documentation
D-7	Guardrails — Topic Restrictions	Needs per-key configuration, classifier pipeline	Medium	Classification model
D-8	Semantic Cache	Needs vector DB (Pinecone/Qdrant/Chroma), embedding model, similarity search	Large	Infrastructure
D-9	Exact-Match Inference Cache	Wire `response_cache.py` into inference path with proper invalidation, TTL, LRU	Medium	None (infra exists)
D-10	Customer Webhooks	Delivery queue, retry logic, HMAC signing, management endpoints, delivery log	Medium	Job queue
D-11	Batch/Async Inference	New API surface (`/v1/batch`), job queue (Celery/RQ), worker pool	Large	Job queue infrastructure
D-12	Prompt Management	Template storage, versioning, variable substitution, A/B testing	Medium	DB schema
D-13	Evaluation/Playground	Frontend-coupled; backend needs comparison API	Medium	Frontend
D-14	SLA Tracking & Credit-back	Per-tier definitions, violation detection, auto-compensation	Medium	Business rules
D-15	Geo-Aware Routing	IP geolocation, region-aware provider ranking	Large	GeoIP database
D-16	Data Residency (GDPR)	Legal + technical: EU-only routing, data classification	Large	Legal review
D-17	Traffic Splitting	Weighted distribution across providers, A/B provider testing	Medium	Routing changes
D-18	Dynamic Latency/Cost Routing	Real-time latency tracking per provider per model → routing decisions	Medium	Metrics pipeline
D-19	Per-Customer Quality Tracking	Success rate per customer per model, preference learning	Medium	Analytics pipeline
D-20	Provider Credit Monitoring (remaining 28)	Each provider has different API	Medium (cumulative)	Provider APIs

7. Summary

Change Counts

Priority	Count	Scope
P0 — Must fix	7	Billing integrity, ghost features, security, rate limit headers, trial config
P1 — Should fix	8	Provider monitoring, error format, catalog gating, pagination bug, Vertex, overage, circuit breaker timing, stream normalization
P2 — Nice to have	4	Per-key usage, export, latency percentiles, notification delivery
Deferred	20	Guardrails (7), caching (2), webhooks, batch, prompts, evaluation, SLA, geo-routing, GDPR, traffic splitting, dynamic routing, quality tracking, provider monitoring

Execution Order

Week 1:  P0-1 through P0-7 (ghost features, billing atomicity, pricing guard,
         refund path, rate limit headers, admin auth, trial config)

Week 2:  P1-1 through P1-4 (provider credit monitoring, error format,
         catalog gating, pagination bug)

Week 3:  P1-5 through P1-8 (Vertex function calling, overage strategy,
         circuit breaker timing, stream normalization)

Week 4:  P2-1 through P2-4 (per-key usage, export, latency, notifications)

Week 5:  Full regression testing against Testing Plan (250+ cases)
         and Acceptance Criteria (202 criteria)

The Bottom Line

The core product — inference, billing, security, failover, catalog, monitoring — is genuinely strong. 450+ endpoints, 30+ providers, 10,000+ models, comprehensive observability. The path to stable is hardening what exists (7 P0 fixes, 8 P1 fixes), not building new features. The 20 deferred items are clearly v2+ roadmap. The single highest-risk item is P0-2 (credit deduction atomicity on the legacy path) — if the RPC is unavailable and the legacy path fires, there's a window for billing inconsistency.
