Delta Report
Comprehensive analysis of the gap between current implementation and a stable, shippable release.
Derived from: all 30 wiki pages, Conceptual Model, Features reference (450+ endpoints), Testing Plan, Test Coverage Audit, Acceptance Criteria, and direct codebase investigation.
Date: 2026-03-09 | Version: 2.0.4
- The Three States
- What "Stable" Means
- P0 — Must Fix Before Release
- P1 — Should Fix Before Release
- P2 — Nice to Have
- Deferred — Post-Release Roadmap
- Summary
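65% complete against the full Conceptual Model: 33 of 56 features fully implemented, 7 partial, 16 missing. Full implementations alone account for 59% (33/56); the 65% figure is consistent with counting each partial feature as half, i.e. (33 + 3.5)/56.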
| System | Status | Key Evidence |
|---|---|---|
| OpenAI API Compatibility | Complete | POST /v1/chat/completions: streaming, JSON mode, tool calling, logprobs |
| Anthropic API Compatibility | Complete | POST /v1/messages: streaming in Anthropic event format |
| Authentication | Complete | Privy, Google OAuth, GitHub, phone, email. Fernet encryption, HMAC lookup, temp email detection |
| API Key Management | Complete | Creation, rotation, scoping, IP allowlists, domain restrictions, expiration. api_keys_new table |
| 3-Layer Rate Limiting | Complete | IP middleware (velocity mode), API key (Redis), anonymous. In-memory fallback when Redis down |
| Model Catalog | Complete | 10,000+ models, 30+ providers, background sync, HuggingFace enrichment, search, dedup, trending |
| Provider Failover | Complete | 14-provider chain, model-aware rules (OpenAI→OpenRouter only, etc.) |
| Circuit Breakers | Complete | CLOSED→OPEN (5 failures)→HALF_OPEN (60s timeout)→CLOSED/OPEN. Per-provider |
| Intelligent Routing | Complete | Code Router (SWE-bench/HumanEval tiered), General Router (quality/cost/latency/balanced) |
| Credit System | Complete | Pre-flight checks, deduction, idempotency (UNIQUE constraint + RPC), subscription priority, auto-refund |
| Stripe Payments | Complete | Checkout, payment intents, webhooks (6 events), subscriptions, refunds. 1 credit = $0.01 |
| Plans & Trials | Complete | 3-day/$5 trial, Free/Starter/Pro/Enterprise tiers, daily usage caps ($1/day configurable) |
| Coupons | Complete | Create, validate, redeem, per-user limits, expiration. coupons + coupon_redemptions tables |
| Referrals | Complete | Code generation, $10 bonus both sides on first $10+ purchase, 10 uses max |
| Chat History | Complete | Sessions, messages, batch save, full-text search, sharing, feedback, auto-injection |
| Activity Logging | Complete | User actions, API usage, security events. 90-day retention, GDPR export/anonymization |
| Audit System | Complete | SOC 2/HIPAA/GDPR compliance. Tamper-proof, hash chain, severity-based retention (7yr/3yr/1yr/90d) |
| RBAC | Complete | Admin/User/Developer/Support roles, permission decorators, scope-based key permissions |
| Health Monitoring | Complete | Tiered (Critical 5min/Popular 30min/Standard 2-4hr), passive capture, incidents, 50+ Prometheus metrics |
| Observability | Complete | Prometheus + Grafana, OpenTelemetry, Sentry, Pyroscope (cache/Redis layers) |
| Error Monitoring | Complete | Autonomous monitor, pattern detection, fixable classification, critical alerts |
| Feature Flags | Complete | Statsig gates, configs, experiments, percentage rollouts |
| Image Generation | Complete | Provider routing, credit deduction, multiple providers |
| Audio Transcription | Complete | File upload and base64 |
| Server-Side Tools | Complete | Web search, TTS, SSRF protection |
| Admin | Complete | 80+ endpoints: user/credit/cache/sync/role/trial/downtime/coupon management |
| CI/CD | Complete | Supabase migrations with destructive operation blocking, GitHub Actions |
| System | What Works | What's Missing |
|---|---|---|
| Provider Credit Monitoring | OpenRouter: full implementation with API call, 15-min cache, threshold alerts (critical $5, warning $20, info $50), email alerts | 29 other providers have TODO stubs. No preemptive deprioritization in failover chain |
| Response Caching | response_cache.py exists with SHA-256 hashing, Redis + in-memory fallback. User cache settings endpoints exist (GET/PUT /user/cache-settings) | Cache is metadata-only (models, providers, health). NOT wired into inference pipeline. User cache preference is stored but ignored during inference. Butter.dev proxy called regardless of preference |
| Load Balancing | Failover chain with priority ordering. Model selector with quality priors + real-time metrics. Hash-based sticky routing per conversation | No weighted traffic splitting. No dynamic latency-optimal selection (General Router "latency" hardcodes to groq/llama-3.3-70b-versatile). No cost-optimal provider selection per model |
| Model Quality Scoring | Hardcoded quality priors for ~20 models in model_selector.py (task-specific: simple_qa, code_gen, reasoning, etc.). SWE-bench/HumanEval in Code Router's code_quality_priors.json | Not stored in DB. Not updatable without code change. Missing MMLU, MATH, MT-Bench, LMSYS Arena ELO, LiveBench. No per-customer quality tracking |
| Usage Analytics | Admin-side: model usage view, chat request monitoring, request counts by model. Cache analytics via Butter | No per-API-key breakdown (activity_log stores user_id but NOT api_key_id). No latency percentiles for customers (p50/p95/p99 admin-only). No CSV/JSON export |
| Google Vertex | REST path: function calling transformation implemented (_translate_openai_tools_to_vertex()). Models working for standard inference | SDK (non-REST) path has TODO: "Function calling may not work correctly." Wiki notes function calling as "in progress" |
| Streaming Normalization | OpenAI, Gemini, Anthropic, Fireworks formats handled in stream_normalizer.py with dedicated normalizers | Providers returning completely non-standard format are silently dropped (returns None). No error/warning to client |
| AI-Specific Tracing | Arize config file exists. OpenTelemetry captures inference metadata | Arize Phoenix not exposed via API. Braintrust not integrated. No prompt/response pair recording for quality analysis |
| System | Conceptual Model Section | Description |
|---|---|---|
| Input Guardrails | 2.2 | PII detection (phone, SSN, email, credit card scanning), prompt injection defense, topic restrictions, content moderation |
| Output Guardrails | 2.2 | Content filtering on responses, structured output validation (JSON schema conformance), hallucination flags (normalized safety metadata) |
| Semantic Cache | 2.5 | Vector similarity matching for semantically equivalent prompts. Requires vector DB + embedding model |
| Exact-Match Inference Cache | 2.5 | SHA-256 hash of {messages + model + params} → cached response. 20K entries, 60-min TTL, LRU eviction |
| Customer Webhooks | 2.7 | Outbound event delivery (credits.low, credits.depleted, model.degraded, rate_limit.approaching). HMAC signing, retry logic, delivery log |
| SLA Tracking | 2.7 | Per-tier SLA definitions, violation detection (P99 latency, error rate), credit-back compensation |
| Batch/Async Inference | 2.8 | POST /v1/batch/jobs for bulk workloads at reduced cost. Job queue, status polling, webhook on completion |
| Prompt Management | 2.8 | Template library with versioning, template variables, A/B testing, per-key default system prompts |
| Evaluation/Playground | 2.8 | Side-by-side model comparison, regression testing, interactive prompt testing UI |
| Geo-Aware Routing | 2.11 | IP geolocation, nearest-region provider selection, latency-based geographic optimization |
| Data Residency | 2.11 | GDPR compliance routing (EU customers → EU providers), data sovereignty enforcement |
| Multi-Region Redis | 2.11 | Cache replication across regions |
| Traffic Splitting | 2.3 | Weighted distribution across providers for same model (e.g., 70/30 split) |
| Per-Customer Quality Tracking | 2.4 | Per-customer success rate tracking, model preference learning, personalized routing |
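The Exact-Match Inference Cache row above keys each entry on a SHA-256 hash of {messages + model + params}. A minimal sketch of such a key follows, assuming canonical JSON serialization (sorted keys, fixed separators) so that logically equal requests hash identically. The function name `inference_cache_key` is hypothetical and is not taken from response_cache.py.

```python
import hashlib
import json


def inference_cache_key(model: str, messages: list, params: dict) -> str:
    """Return a SHA-256 hex digest over a canonical JSON form of the request."""
    payload = {"model": model, "messages": messages, "params": params}
    # sort_keys plus fixed separators make the digest independent of dict key order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Message order is deliberately left significant, since reordering a conversation changes the prompt. Which sampling parameters belong in `params` is a design decision the report does not specify.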
56 features across 10 layers. Includes enterprise capabilities (geo-routing, SLA credit-backs, semantic caching) and developer platform features (prompt management, batch inference, playground) that are future roadmap items.
Not everything in the Conceptual Model. The expected state is: every feature that's exposed to users works correctly, safely, and predictably. No half-built features visible. No billing bugs. No security holes. No silent failures.
A developer signs up, gets an API key, sends requests to any model through the OpenAI or Anthropic API format, gets reliable responses with automatic failover, sees exactly what they spent, pays for what they used, and never encounters a broken feature, a silent failure, a double-charge, or an exposed stack trace. Every endpoint that's reachable does what it says. Features that aren't ready yet aren't visible.
S1 — Reliability: Every inference request either succeeds or returns a clear, actionable error. Provider failures silently failover. Circuit breakers prevent cascading failures. Redis going down doesn't break the system. Health endpoints always return 200 (degradation in body, not status code).
S2 — Billing Correctness: Credits deducted accurately per (prompt_tokens × prompt_price) + (completion_tokens × completion_price). Pre-flight checks prevent wasted provider calls. No double-charging on retries. Subscription allowance consumed before purchased credits. Provider 5xx auto-refunds. User 4xx does not refund. High-value models never served at default pricing.
S3 — Security: API keys encrypted at rest (Fernet AES-128). HMAC-SHA256 for key lookup. SQL/XSS/command/path injection prevented. RBAC enforced on all admin endpoints. Audit trail for security-sensitive operations. Rate limiting on all 3 layers with proper response headers.
S4 — No Ghost Features: Every user-reachable endpoint returns real, functional data. No stubs that accept configuration but do nothing. No UI toggles for non-functional features. If a feature isn't built, the endpoint shouldn't exist.
S5 — Observability: Prometheus metrics, OpenTelemetry traces, Sentry error tracking operational. Health monitoring detects provider degradation. Admin dashboard shows user counts, credit totals, API usage. Problems are detectable before users report them.
S6 — Billing Integrity: Stripe payments add correct credit amounts. Webhooks are idempotent. Trial limits enforced (3 days, $5 cap, $1/day). Expired trials blocked from paid models, allowed on :free models. Coupon redemption validates expiry, one-per-user, user-specificity.
S7 — Consistent DX: All error responses have consistent JSON format. Streaming SSE normalized across all providers. Rate limit 429 responses include standard headers. Documentation matches behavior.
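The S2 formula can be written down directly. The sketch below is illustrative only: the function name and the per-token prices are assumptions, and `Decimal` is used so that billing arithmetic avoids float rounding. The credit conversion uses the 1 credit = $0.01 rate stated in the Stripe Payments row.

```python
from decimal import Decimal

DOLLARS_PER_CREDIT = Decimal("0.01")  # from the report: 1 credit = $0.01


def request_cost(prompt_tokens: int, completion_tokens: int,
                 prompt_price: Decimal, completion_price: Decimal) -> Decimal:
    """Cost in dollars: (prompt_tokens * prompt_price) + (completion_tokens * completion_price)."""
    return prompt_tokens * prompt_price + completion_tokens * completion_price


# Illustrative per-token prices, not taken from the catalog.
cost = request_cost(1000, 500, Decimal("0.000003"), Decimal("0.000015"))
credits = cost / DOLLARS_PER_CREDIT
```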
These cause billing errors, security incidents, or user trust erosion if shipped as-is.
The Problem: GET /user/cache-settings and PUT /user/cache-settings are exposed to users. They store an enable_butter_cache preference in the user's preferences JSON column. However, src/routes/chat.py (line 697) calls get_butter_pooled_async_client() without checking the user's preference. The Butter proxy is always used regardless of the setting.
Why It's P0: This is a ghost feature. Users can toggle a setting that does nothing. If a user disables caching and expects their data not to go through a third-party proxy, their expectation is violated. This erodes trust.
What to Do: Either (a) wire the preference check into the inference path so enable_butter_cache=false bypasses the Butter proxy, or (b) remove both endpoints entirely and remove the Butter preference from the user schema. Option (b) is faster and simpler.
Files: src/routes/users.py (lines 305-408), src/routes/chat.py (line 697)
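If option (a) is chosen, the check itself is small. The sketch below assumes the preferences column deserializes to a dict and that caching is opt-in when the key is absent; both the default and the function name `should_use_butter_proxy` are assumptions, not existing code.

```python
from typing import Optional

# Assumption: treat caching as opt-in when no preference has been stored.
DEFAULT_ENABLE_BUTTER_CACHE = False


def should_use_butter_proxy(preferences: Optional[dict]) -> bool:
    """Return True only if the stored preference enables the Butter proxy."""
    if not preferences:
        return DEFAULT_ENABLE_BUTTER_CACHE
    return bool(preferences.get("enable_butter_cache", DEFAULT_ENABLE_BUTTER_CACHE))
```

The call site near line 697 would then choose between get_butter_pooled_async_client() and a direct provider client based on this result.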
The Problem: Credit deduction has two code paths in src/db/users.py (lines 701-1106):
- Atomic path (lines 862-967): Uses the atomic_deduct_credits RPC stored procedure. Single PostgreSQL transaction: balance update AND transaction log happen together. This is correct.
- Legacy/fallback path (lines 987-1096): Used when the RPC is unavailable. Two separate calls:
  - Lines 1006-1018: Updates the users table (balance deduction)
  - Lines 1066-1074: Logs the transaction via log_credit_transaction()
  - Lines 1077-1082: If transaction logging fails, credits are already deducted. Error is logged but not re-raised.
Why It's P0: On the legacy path, a crash or DB error between the two calls creates a state where the user's balance is reduced but there's no transaction record. The user was charged but there's no audit trail. This is a billing integrity issue.
What to Do: Determine if the legacy path is still reachable in production. If the atomic_deduct_credits RPC exists in all environments (production, staging), the legacy path may be dead code. If it IS reachable, wrap both operations in a single transaction or make the legacy path re-raise the logging error (allowing the balance update to roll back). Alternatively, remove the legacy path entirely if the RPC is always available.
Idempotency is solid: Request ID check at lines 745-765 (application level) + UNIQUE constraint on credit_transactions.request_id (DB level) + atomic RPC path combines check-and-deduct. The race condition window on the application-level check is covered by the DB constraint.
Files: src/db/users.py (lines 701-1106), supabase/migrations/20260223000001_add_request_id_to_credit_transactions.sql
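The property the legacy path lacks can be shown in a few lines: when the balance update and the transaction log share one transaction, a failure in the log insert rolls the deduction back. The sketch below uses sqlite3 purely so it runs anywhere; the schema is a simplified stand-in, and the production equivalent is the PostgreSQL stored procedure.

```python
import sqlite3


def deduct_credits(conn: sqlite3.Connection, user_id: int, amount: int, request_id: str) -> None:
    """Deduct and log in one transaction; any failure rolls both statements back."""
    with conn:  # commits on success, rolls back if an exception escapes the block
        cur = conn.execute(
            "UPDATE users SET credits = credits - ? WHERE id = ? AND credits >= ?",
            (amount, user_id, amount),
        )
        if cur.rowcount != 1:
            raise ValueError("insufficient credits or unknown user")
        # A UNIQUE constraint on request_id makes a retried request fail here,
        # which also undoes the UPDATE above.
        conn.execute(
            "INSERT INTO credit_transactions (user_id, amount, request_id) VALUES (?, ?, ?)",
            (user_id, -amount, request_id),
        )
```

A repeated request_id raises sqlite3.IntegrityError and leaves the balance untouched, which is the behavior the UNIQUE constraint is meant to guarantee on the real table.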
The Problem: src/services/pricing.py (lines 783-839) has an explicit guard:
HIGH_VALUE_MODEL_PATTERNS = [
"gpt-4", "gpt-5", "o1-", "o3-", "o4-",
"claude-3", "claude-opus", "claude-sonnet-4",
"gemini-1.5-pro", "gemini-2", "gemini-pro",
"command-r-plus", "mixtral-8x22b"
]
When a high-value model matches AND pricing falls to the $0.00002/token default, a ValueError is raised with a Sentry alert (lines 808-839). Non-high-value models are allowed to use default pricing (lines 842-854).
Why It's P0: The guard exists but needs end-to-end verification. Questions:
- Is this function called BEFORE the provider API call in chat.py? If pricing resolution happens after the inference call, the guard is too late: the provider was already called and tokens were consumed.
- Are the patterns comprehensive? New models (GPT-4.1, Claude 4, Gemini 2.5) may not match existing patterns.
- What happens when the ValueError is raised: does the user get a clear 4xx error or a 500?
What to Do: Trace the call chain from chat.py through pricing resolution to confirm the guard fires BEFORE the provider call. Add any missing model patterns (especially newer models). Verify the ValueError is caught and returns a clear error to the user (not a 500).
Files: src/services/pricing.py (lines 783-839), src/routes/chat.py (pricing resolution section)
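Pattern coverage can be audited mechanically. The sketch below assumes case-insensitive substring matching, which may not be how pricing.py actually matches, and the candidate model IDs are illustrative rather than taken from the catalog.

```python
HIGH_VALUE_MODEL_PATTERNS = [
    "gpt-4", "gpt-5", "o1-", "o3-", "o4-",
    "claude-3", "claude-opus", "claude-sonnet-4",
    "gemini-1.5-pro", "gemini-2", "gemini-pro",
    "command-r-plus", "mixtral-8x22b",
]


def is_high_value(model_id: str) -> bool:
    """Assumed rule: a model is high-value if any pattern is a substring of its ID."""
    name = model_id.lower()
    return any(pattern in name for pattern in HIGH_VALUE_MODEL_PATTERNS)


# Illustrative IDs only; a real audit would iterate over the synced catalog.
candidates = ["openai/gpt-4.1", "anthropic/claude-opus-4", "google/gemini-2.5-pro", "openai/o3"]
unmatched = [m for m in candidates if not is_high_value(m)]
```

Under this assumption, patterns with a trailing hyphen such as "o3-" do not match an ID that ends in "o3", so IDs without a suffix are the ones most likely to slip through.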
The Problem: src/routes/chat.py (lines 1670-1742) classifies errors and refunds:
- 5xx errors (502, 503) → "provider_error" → refund via refund_credits()
- Timeout errors → "timeout_error" → refund
- 4xx errors (400, 404) → "not_found_error" → no refund
- Refund condition (lines 1699-1705): Only when credit_deduction_success=True, not anonymous, and total_tokens > 0
Why It's P0: The logic looks correct but needs integration testing. Edge cases:
- What if the provider returns 502 but the response was partially streamed (tokens already consumed)? Is the full deduction refunded or just the unused portion?
- What if refund_credits() itself fails? Is the failure logged? Is the user notified?
- The condition requires total_tokens > 0. What about requests that fail before any tokens are generated? Those should not have been charged in the first place (pre-flight check should catch them).
What to Do: Write integration tests that: (1) force a 503 from the primary provider, verify refund transaction exists; (2) force a 400, verify no refund; (3) force a timeout, verify refund; (4) verify partial stream + error still refunds correctly.
Files: src/routes/chat.py (lines 1670-1742)
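Before the integration tests exist, the refund rule can be pinned down as a pure function and unit-tested. The sketch below restates the condition as described in this report; the function name `should_refund` and the `is_anonymous` parameter name are assumptions, and the real logic in chat.py may be structured differently.

```python
REFUNDABLE_ERROR_TYPES = {"provider_error", "timeout_error"}


def should_refund(error_type: str, credit_deduction_success: bool,
                  is_anonymous: bool, total_tokens: int) -> bool:
    """Mirror the refund condition described for lines 1699-1705."""
    return (
        error_type in REFUNDABLE_ERROR_TYPES
        and credit_deduction_success
        and not is_anonymous
        and total_tokens > 0
    )
```

Written this way, the total_tokens > 0 edge case becomes an explicit test row rather than an open question.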
The Problem: Three rate limit layers, three different header behaviors:
| Layer | File | Headers on 429 |
|---|---|---|
| Layer 1: IP Middleware | security_middleware.py (lines 647-716) | YES: Retry-After, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-RateLimit-Reason, X-RateLimit-Mode |
| Layer 2: API Key Service | rate_limiting.py (lines 78-94) | PARTIAL: Header fields defined in RateLimitResult dataclass (ratelimit_limit_requests, ratelimit_reset_requests, etc.) but not converted to HTTP response headers |
| Layer 3: Anonymous Limiter | anonymous_rate_limiter.py | NO: No Retry-After or X-RateLimit-* headers |
Why It's P0: When Layer 2 or Layer 3 triggers a 429, the client gets a bare rejection with no information about when to retry. This causes:
- Clients retry immediately in a tight loop (making the problem worse)
- SDKs like the OpenAI Python client expect Retry-After to implement backoff
- API consumers can't implement proper retry logic
What to Do:
- Layer 2: The RateLimitResult dataclass already has the right fields. Add code in the route handler (or a middleware) to convert these fields to HTTP headers when returning a 429.
- Layer 3: Add Retry-After and X-RateLimit-* headers to the anonymous limiter's 429 response.
Files: src/services/rate_limiting.py (lines 78-94), src/services/anonymous_rate_limiter.py, src/routes/chat.py (where rate limit results are converted to responses)
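The Layer 2 conversion might look like the sketch below. The dataclass is a stand-in: only ratelimit_limit_requests and ratelimit_reset_requests are named in this report, the remaining field is hypothetical, and the reset value is assumed to be seconds until the window resets.

```python
from dataclasses import dataclass


@dataclass
class RateLimitResult:
    """Stand-in for the real dataclass; field semantics here are assumptions."""
    ratelimit_limit_requests: int
    ratelimit_reset_requests: int      # assumed: seconds until the window resets
    ratelimit_remaining_requests: int  # hypothetical field name


def rate_limit_headers(result: RateLimitResult) -> dict:
    """Build the 429 response headers from a rate limit result."""
    reset_seconds = max(1, int(result.ratelimit_reset_requests))
    return {
        "Retry-After": str(reset_seconds),
        "X-RateLimit-Limit": str(result.ratelimit_limit_requests),
        "X-RateLimit-Remaining": str(max(0, result.ratelimit_remaining_requests)),
        "X-RateLimit-Reset": str(reset_seconds),
    }
```

FastAPI's HTTPException accepts a headers argument, so the returned dict can be passed straight through when raising the 429. The same helper could serve Layer 3 if the anonymous limiter produces comparable values.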
The Problem: The Features wiki documents that GET /admin/model-sync/providers has "No auth enforced." Direct code investigation found that POST /admin/create (user registration endpoint at src/routes/admin.py lines 52-101) is intentionally public. But the model-sync providers endpoint needs verification.
Why It's P0: Any admin endpoint accessible without auth is a security issue. The model-sync providers endpoint leaks internal infrastructure details (which 33 providers are configured, their slugs, their sync status).
What to Do: Audit every route in src/routes/admin.py and src/routes/admin_*.py for missing Depends(require_admin). The POST /admin/create endpoint is intentional (user registration). Everything else must require admin auth. Fix any gaps found.
Files: src/routes/admin.py, all files matching src/routes/admin*.py
The Problem: Three sources give different trial parameters:
| Source | Credits | Duration | Daily Limit |
|---|---|---|---|
| CLAUDE.md | $5 | 3 days | Not specified |
| Wiki (Free-Trial-System.md) | $10 / 1000 credits | 3 days | Not specified |
| Code (src/config/usage_limits.py) | $5 | 3 days | $1/day |
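Additionally, src/db/trials.py line 44 has a formula trial_days * 5 that yields $70 for a 14-day trial, which suggests the function accepts variable trial durations with a default of 3 days. If that formula is applied to the 3-day default it gives $15, which matches neither the $5 nor the $10 figure above, so its role in credit allocation needs checking.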
Why It's P0: If the wiki says $10 but the code gives $5, users who read the docs expect $10 and get $5. Or worse — if there's a code path that gives $10 and another that gives $5, different users get different amounts depending on their signup path.
What to Do: Determine the canonical trial amount. Update all three sources (CLAUDE.md, wiki, code comments) to match. Verify there's exactly one code path for trial credit allocation and it uses the configured value.
Files: src/config/usage_limits.py, src/db/trials.py, src/db/api_keys.py, docs/CONCEPTUAL_MODEL.md
These cause bad user experience but aren't billing/security issues.
Current State: src/services/provider_credit_monitor.py — only OpenRouter has a real implementation (lines 33-138). It calls https://openrouter.ai/api/v1/auth/key, caches for 15 minutes, and has threshold-based alerting (critical: $5, warning: $20, info: $50). Lines 165-167 have TODO stubs for all other providers.
Impact: When a non-OpenRouter provider runs out of upstream credits, requests fail with 402 from the provider. Failover catches this (402 triggers failover), but if multiple providers exhaust credits simultaneously, the failover chain degrades. No warning before it happens.
What to Do: Implement credit checking for the top 5 providers by traffic volume. Each provider has a different API for balance checking — some may not have one at all. For providers without balance APIs, monitor for 402 response frequency as a proxy signal.
Scope: src/services/provider_credit_monitor.py
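For providers without a balance API, the 402-frequency proxy can be a small sliding-window counter. The sketch below is one possible shape; the class name, the 5-minute window, and the threshold of 3 are all assumptions to be tuned against real traffic. Timestamps are passed in explicitly so the logic can be tested without waiting.

```python
from collections import defaultdict, deque


class PaymentRequiredMonitor:
    """Flag a provider when it returns too many 402s within a sliding window."""

    def __init__(self, window_seconds: float = 300.0, threshold: int = 3):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = defaultdict(deque)  # provider -> timestamps of recent 402s

    def record(self, provider: str, status_code: int, now: float) -> bool:
        """Record a response; return True if the provider looks out of credit."""
        events = self._events[provider]
        if status_code == 402:
            events.append(now)
        # Drop 402s that have aged out of the window.
        while events and now - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold
```

A True result could feed the same alerting path that the OpenRouter thresholds already use.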
Current State: Across all route files:
- ~95% use raise HTTPException(status_code=XXX, detail="message"), which produces {"detail": "message"}
- ~5% use JSONResponse(status_code=XXX, content={...}) with a custom format
- Chat endpoints use the APIExceptions helper, which produces OpenAI-compatible format: {"error": {"message": "...", "type": "...", "code": "..."}}
Impact: Clients parsing error responses must handle multiple formats. The OpenAI SDK expects {"error": {"message": "..."}}. FastAPI's default {"detail": "..."} breaks OpenAI SDK error handling.
What to Do: For OpenAI/Anthropic-compatible endpoints (/v1/chat/completions, /v1/messages, /v1/images/generations, /v1/audio/transcriptions), ensure all errors use OpenAI-compatible format. For other endpoints, FastAPI default is fine. The key is that inference endpoints must be SDK-compatible.
Scope: src/routes/chat.py, src/routes/messages.py, src/routes/images.py, src/routes/audio.py
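The translation from FastAPI's default shape to the OpenAI-compatible envelope is mechanical. In the sketch below the envelope keys come from the format quoted above, while the function name and the status-to-type mapping are illustrative assumptions rather than the behavior of the existing APIExceptions helper.

```python
from typing import Optional


def to_openai_error(status_code: int, detail: object, code: Optional[str] = None) -> dict:
    """Wrap a FastAPI-style detail value in the {"error": {...}} envelope."""
    if status_code == 429:
        error_type = "rate_limit_error"
    elif 400 <= status_code < 500:
        error_type = "invalid_request_error"
    else:
        error_type = "api_error"
    return {"error": {"message": str(detail), "type": error_type, "code": code}}
```

An exception handler scoped to the inference routes could call this for every HTTPException, leaving the non-inference endpoints on FastAPI's default format.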
Current State: src/services/model_catalog_sync.py — during model sync, extract_pricing() (lines 136-153) returns all None values if pricing is missing. Line 368 checks if any(pricing.values()) but this is non-blocking. Models without pricing ARE synced into the catalog and become visible to users.
Impact: A model without pricing enters the catalog. When a user requests it:
- If it's a high-value model, the pricing guard blocks it (ValueError) — user gets an error
- If it's a non-high-value model, it falls to default pricing ($0.00002/token) — potentially under-billing
- Either way, the user experience is poor: the model is in the catalog but doesn't work properly
What to Do: Add a validation gate in the sync pipeline: reject models where not any(pricing.values()) and the model is not explicitly whitelisted. Log rejected models for admin review. This prevents "dark" models from appearing in the catalog.
Scope: src/services/model_catalog_sync.py
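The validation gate can be a partition step in front of the catalog write. The sketch below assumes each model is a dict with "id" and "pricing" keys, which is a simplification of whatever model_catalog_sync.py actually passes around; the function and parameter names are hypothetical.

```python
from typing import FrozenSet, List, Tuple


def partition_by_pricing(models: List[dict],
                         allowlist: FrozenSet[str] = frozenset()) -> Tuple[List[dict], List[dict]]:
    """Split models into (accepted, rejected) using the any(pricing.values()) rule."""
    accepted, rejected = [], []
    for model in models:
        pricing = model.get("pricing") or {}
        if any(pricing.values()) or model.get("id") in allowlist:
            accepted.append(model)
        else:
            rejected.append(model)  # caller should log these for admin review
    return accepted, rejected
```

One caveat with the any(...) rule: numeric zero is falsy, so a genuinely free model priced at 0 is rejected unless it is explicitly whitelisted, whereas the string "0" would pass.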
Current State: src/routes/users.py (lines 510-518):
"total": len(transactions),  # Returns page count, NOT DB total
Impact: Any frontend using total to calculate page count gets wrong numbers. If a user has 500 transactions and requests limit=50, total returns 50 (the page size), not 500 (the actual total). Pagination shows "1 page" when there are 10.
What to Do: Add a separate count query (SELECT COUNT(*) FROM credit_transactions WHERE user_id = ...) and return that as total. Rename the current field to count or returned to avoid confusion.
Scope: src/routes/users.py (line 515), potentially src/db/credit_transactions.py
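The fix amounts to two queries, one for the page and one for the count. The sketch below uses sqlite3 and a simplified credit_transactions schema so that it is runnable; the column names and the function name are assumptions, and the real code would issue the equivalent queries through its own database layer.

```python
import sqlite3


def list_transactions(conn: sqlite3.Connection, user_id: int, limit: int, offset: int = 0) -> dict:
    """Return one page plus the true total for the user."""
    rows = conn.execute(
        "SELECT id, amount FROM credit_transactions WHERE user_id = ? ORDER BY id LIMIT ? OFFSET ?",
        (user_id, limit, offset),
    ).fetchall()
    total = conn.execute(
        "SELECT COUNT(*) FROM credit_transactions WHERE user_id = ?",
        (user_id,),
    ).fetchone()[0]
    return {"transactions": rows, "returned": len(rows), "total": total}
```

With 500 rows and limit=50, returned is 50 and total is 500, so a frontend can compute 10 pages.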
Current State: src/services/google_vertex_client.py:
- REST path (lines 250-402, 662-707): Function calling transformation IS implemented. _translate_openai_tools_to_vertex() converts OpenAI tool format to Vertex functionDeclarations, and _translate_tool_choice_to_vertex() handles tool_choice options
- SDK path (lines 585-587): TODO comment, "Function calling may not work correctly"
Impact: If Vertex models are in the catalog with supports_function_calling: true but the SDK path is used for some requests, function calling silently fails or produces wrong results.
What to Do: Either (a) ensure the REST path is always used when tools are present (route around the SDK path), or (b) implement function calling in the SDK path, or (c) mark Vertex models as supports_function_calling: false in the catalog until SDK path is complete.
Scope: src/services/google_vertex_client.py
Current State: src/routes/chat.py (lines 2024-2025, 3737-3738) — when credits <= 0, a 402 Payment Required is returned. This covers the case where subscription allowance AND purchased credits are both 0.
The Gap: The wiki (Subscription-Plans.md) notes that the overage handling strategy is incomplete — "block vs. allow with notification" is undefined. Currently, the system blocks (402). But there's no:
- Warning notification when credits are running low (e.g., at 20% remaining)
- Grace period for subscribers (allow a few more requests while they top up)
- Clear messaging in the 402 response about what to do (buy credits vs. upgrade plan)
What to Do: The block behavior (402) is correct and safe for v1. Enhance the 402 response body to include: current balance, link to purchase credits, and link to upgrade plan. Optionally, add a low-balance warning header (X-Credits-Remaining) on successful responses when balance drops below a threshold.
Scope: src/routes/chat.py, src/services/pricing.py
Current State: src/services/circuit_breaker.py — circuit breaker uses a 60-second timeout (not 5 minutes as the Conceptual Model states). After 60 seconds in OPEN state, transitions to HALF_OPEN. Requires 2 consecutive successes in HALF_OPEN to return to CLOSED.
Discrepancy: The Conceptual Model (section 2.3) says "auto-recovers after 5 minutes of cool-down." The wiki (Testing Plan case 25.6) says "Wait 5 min after OPEN." But the code uses 60 seconds. This is either a doc error or a code error.
What to Do: Decide on the correct timeout value. 60 seconds is more aggressive recovery, 5 minutes is more conservative. Update either the code or the documentation to match. Then run an integration test that verifies the full state machine with real timing.
Scope: src/services/circuit_breaker.py (line 67), docs/CONCEPTUAL_MODEL.md, wiki Testing Plan
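Whichever timeout is chosen, the state machine is easier to verify when the clock is injected rather than read from the system. The sketch below models the transitions described in this report (5 failures to open, a configurable cool-down, 2 successes in HALF_OPEN to close). It is not the code in circuit_breaker.py, and treating the 5 failures as consecutive is an assumption.

```python
class CircuitBreaker:
    """Minimal CLOSED -> OPEN -> HALF_OPEN -> CLOSED/OPEN model with an injected clock."""

    def __init__(self, failure_threshold: int = 5, open_timeout: float = 60.0,
                 success_threshold: int = 2):
        self.failure_threshold = failure_threshold
        self.open_timeout = open_timeout
        self.success_threshold = success_threshold
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def _trip(self, now: float) -> None:
        self.state, self.opened_at = "OPEN", now
        self.failures = self.successes = 0

    def allow_request(self, now: float) -> bool:
        # After the cool-down, let probe requests through in HALF_OPEN.
        if self.state == "OPEN" and now - self.opened_at >= self.open_timeout:
            self.state, self.successes = "HALF_OPEN", 0
        return self.state != "OPEN"

    def record_success(self) -> None:
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state, self.failures = "CLOSED", 0
        else:
            self.failures = 0  # assumption: only consecutive failures count

    def record_failure(self, now: float) -> None:
        if self.state == "HALF_OPEN":
            self._trip(now)  # a failed probe reopens the circuit
        elif self.state == "CLOSED":
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._trip(now)
```

Constructing it with open_timeout=300.0 exercises the 5-minute variant, so the same test suite can cover either decision.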
Current State: src/services/stream_normalizer.py handles OpenAI, Gemini, Anthropic, and Fireworks formats. If a provider returns a completely unrecognized format, the normalizer returns None (line 94) — the chunk is silently dropped.
Impact: A provider updating their streaming format could cause chunks to disappear from the user's stream without any error. The user sees a truncated or empty response.
What to Do: When the normalizer drops a chunk (returns None), log a warning with the raw chunk content and provider name. This creates visibility into normalization failures. Optionally, pass unrecognized chunks through as-is rather than dropping them.
Scope: src/services/stream_normalizer.py
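The logging change can be added as a thin wrapper without touching the individual normalizers. In the sketch below the wrapper name, the logger name, and the 200-byte truncation are assumptions, and the normalizer is passed in as a callable because the real function signatures are not shown in this report.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("stream_normalizer")


def normalize_or_warn(normalize: Callable[[bytes], Optional[dict]],
                      raw_chunk: bytes, provider: str) -> Optional[dict]:
    """Run the normalizer and log a warning whenever it drops a chunk."""
    chunk = normalize(raw_chunk)
    if chunk is None:
        # Truncate the raw payload so a malformed stream cannot flood the logs.
        logger.warning("dropped unrecognized stream chunk provider=%s raw=%r",
                       provider, raw_chunk[:200])
    return chunk
```

If some None results are legitimate (for example keep-alive lines), those cases would need to be filtered out before the warning so that it stays a meaningful signal.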
These improve developer experience but aren't blockers.
Current State: src/db/activity.py stores user_id with each activity record but NOT api_key_id. Usage is aggregated at the user level only. A user with 3 API keys (one for web app, one for mobile, one for testing) cannot see which key consumed what.
What to Do: Add api_key_id to the activity log schema. Populate it during inference request logging. Add a query endpoint: GET /user/api-keys/{key_id}/usage.
Scope: src/db/activity.py, src/routes/chat.py (logging section), src/routes/users.py
Current State: No export endpoint exists. Usage data is in the database but only accessible through paginated API calls.
What to Do: Add GET /user/usage/export?format=csv&start_date=...&end_date=... that returns a downloadable file with columns: date, model, provider, tokens_in, tokens_out, cost, api_key.
Scope: New endpoint in src/routes/users.py
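The CSV body for such an endpoint can be produced with the standard library. The column list below is the one proposed above; the function name and the input row shape (a list of dicts) are assumptions.

```python
import csv
import io

EXPORT_COLUMNS = ["date", "model", "provider", "tokens_in", "tokens_out", "cost", "api_key"]


def usage_to_csv(rows: list) -> str:
    """Serialize usage rows to CSV text with a fixed header order."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops any keys that are not part of the export.
    writer = csv.DictWriter(buffer, fieldnames=EXPORT_COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For large date ranges a streaming response that writes rows incrementally would avoid holding the whole export in memory. Note also that the api_key column depends on P2-1, since api_key_id is not recorded today.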
Current State: p50/p95/p99 latency stats exist in Redis via redis_metrics.get_latency_percentiles() and are exposed at GET /api/monitoring/latency/{provider}/{model}. This endpoint uses get_optional_api_key() — likely intended as admin-only but may be publicly accessible.
What to Do: Create a user-facing endpoint: GET /user/latency?model=... returning p50/p95/p99 for models the user has used. Restrict the admin monitoring endpoints to require admin auth.
Scope: src/routes/users.py, src/routes/monitoring.py
Current State: src/services/notification.py sends emails via Resend (resend.Emails.send()). On failure: logs error, returns False, no retry, no fallback, no persistent delivery tracking. The caller continues silently — user never knows the notification failed.
What to Do: Add retry logic (2-3 attempts with backoff). Log delivery status (success/failure/retry) to a notification_deliveries table. Surface delivery history in admin dashboard.
Scope: src/services/notification.py, potentially new DB table
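The retry logic can wrap the existing send call without changing it. The sketch below assumes the sender is a zero-argument callable that returns True on success, matching the False-on-failure behavior described above. The function name, the exponential schedule, and the injectable sleep are illustrative choices.

```python
import time
from typing import Callable, Tuple


def send_with_retry(send: Callable[[], bool], attempts: int = 3, base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> Tuple[bool, int]:
    """Call send() up to `attempts` times; return (delivered, attempts_used)."""
    for attempt in range(1, attempts + 1):
        try:
            if send():
                return True, attempt
        except Exception:
            pass  # treated like a False return; real code should log the exception
        if attempt < attempts:
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return False, attempts
```

The returned tuple is what a notification_deliveries row would record: final status plus the number of attempts.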
These are Conceptual Model features that require new infrastructure, not hardening. They should be communicated as "coming soon" in release notes.
| # | Feature | Why Defer | Effort | Dependencies |
|---|---|---|---|---|
| D-1 | Guardrails — PII Detection | Needs embedding/classification models, per-key config schema | Large | Moderation API or custom model |
| D-2 | Guardrails — Prompt Injection Defense | Needs injection pattern DB, real-time classification | Large | Security research |
| D-3 | Guardrails — Content Moderation | Needs integration with moderation classifiers (OpenAI Moderation, Perspective API) | Medium | External API |
| D-4 | Guardrails — Output Filtering | Needs response scanning pipeline, configurable policies | Medium | Moderation API |
| D-5 | Guardrails — Structured Output Validation | Needs JSON Schema validator in response path | Small | jsonschema library |
| D-6 | Guardrails — Hallucination Flags | Needs normalized safety metadata schema across all providers | Medium | Provider documentation |
| D-7 | Guardrails — Topic Restrictions | Needs per-key configuration, classifier pipeline | Medium | Classification model |
| D-8 | Semantic Cache | Needs vector DB (Pinecone/Qdrant/Chroma), embedding model, similarity search | Large | Infrastructure |
| D-9 | Exact-Match Inference Cache | Wire response_cache.py into inference path with proper invalidation, TTL, LRU | Medium | None (infra exists) |
| D-10 | Customer Webhooks | Delivery queue, retry logic, HMAC signing, management endpoints, delivery log | Medium | Job queue |
| D-11 | Batch/Async Inference | New API surface (/v1/batch), job queue (Celery/RQ), worker pool | Large | Job queue infrastructure |
| D-12 | Prompt Management | Template storage, versioning, variable substitution, A/B testing | Medium | DB schema |
| D-13 | Evaluation/Playground | Frontend-coupled; backend needs comparison API | Medium | Frontend |
| D-14 | SLA Tracking & Credit-back | Per-tier definitions, violation detection, auto-compensation | Medium | Business rules |
| D-15 | Geo-Aware Routing | IP geolocation, region-aware provider ranking | Large | GeoIP database |
| D-16 | Data Residency (GDPR) | Legal + technical: EU-only routing, data classification | Large | Legal review |
| D-17 | Traffic Splitting | Weighted distribution across providers, A/B provider testing | Medium | Routing changes |
| D-18 | Dynamic Latency/Cost Routing | Real-time latency tracking per provider per model → routing decisions | Medium | Metrics pipeline |
| D-19 | Per-Customer Quality Tracking | Success rate per customer per model, preference learning | Medium | Analytics pipeline |
| D-20 | Provider Credit Monitoring (remaining 28) | Each provider has different API | Medium (cumulative) | Provider APIs |
- D-9: Exact-Match Inference Cache. Highest ROI. Infrastructure exists (response_cache.py). Wire it into chat.py. Reduces provider costs immediately.
- D-10: Customer Webhooks. Enterprise requirement. Unblocks automation workflows.
- D-5: Structured Output Validation. Small effort, high value. Just add jsonschema validation in response path.
- D-3: Content Moderation. Compliance requirement. Integrate OpenAI Moderation API as first pass.
- D-11: Batch Inference. Competitive differentiator. 50% cost savings for bulk workloads.
| Priority | Count | Scope |
|---|---|---|
| P0 — Must fix | 7 | Billing integrity, ghost features, security, rate limit headers, trial config |
| P1 — Should fix | 8 | Provider monitoring, error format, catalog gating, pagination bug, Vertex, overage, circuit breaker timing, stream normalization |
| P2 — Nice to have | 4 | Per-key usage, export, latency percentiles, notification delivery |
| Deferred | 20 | Guardrails (7), caching (2), webhooks, batch, prompts, evaluation, SLA, geo-routing, GDPR, traffic splitting, dynamic routing, quality tracking, provider monitoring |
Week 1: P0-1 through P0-7 (ghost features, billing atomicity, pricing guard,
refund path, rate limit headers, admin auth, trial config)
Week 2: P1-1 through P1-4 (provider credit monitoring, error format,
catalog gating, pagination bug)
Week 3: P1-5 through P1-8 (Vertex function calling, overage strategy,
circuit breaker timing, stream normalization)
Week 4: P2-1 through P2-4 (per-key usage, export, latency, notifications)
Week 5: Full regression testing against Testing Plan (250+ cases)
and Acceptance Criteria (202 criteria)
The core product — inference, billing, security, failover, catalog, monitoring — is genuinely strong. 450+ endpoints, 30+ providers, 10,000+ models, comprehensive observability. The path to stable is hardening what exists (7 P0 fixes, 8 P1 fixes), not building new features. The 20 deferred items are clearly v2+ roadmap. The single highest-risk item is P0-2 (credit deduction atomicity on the legacy path) — if the RPC is unavailable and the legacy path fires, there's a window for billing inconsistency.