Stability Definition
Reading path: Conceptual Model | Stability Definition (you are here) | Conceptual Model Features | Features | Delta Report | Features-Acceptance-Criteria
Read after: Conceptual Model (so you know what the system is). Next: Conceptual Model Features (the full feature spec).
TL;DR — Stability = the system keeps its promise under all conditions. Three perspectives: User ("my request works, costs what I expect, I never see infrastructure"), Product Owner ("every feature works correctly and features don't break each other"), Engineer ("every component has a fallback, failures don't cascade, and we can see it happening"). The system has 6 known gaps vs the conceptual model: no guardrails, single region, no health-weighted routing, no SLA credit-backs, no customer webhooks, no traffic splitting.
The conceptual model makes one commitment:
"One API key, every AI model, automatic reliability, one bill."
Stability is the degree to which the system keeps that promise under all conditions. If a developer integrates once and never has to think about provider outages, billing errors, or routing failures — the system is stable. The moment they write workaround code, manually switch providers, or debug a billing discrepancy — stability has broken.
"Stable means I don't have to think about it."
- I send a request, I get a response. Provider outages are invisible to me.
- My OpenAI/Anthropic SDK works by changing `base_url` only. Streaming, function calling, JSON mode, logprobs: all work unchanged.
- Response latency is predictable. The model I asked for is the one that answers.
- Rate limits are communicated via headers (`X-RateLimit-Remaining`, `Retry-After`), not discovered through unexplained failures.
- Credits deduct once per request. Provider errors (5xx) are auto-refunded. My balance matches what I expect.
- When something is degraded, the status page and health endpoints reflect it.
- Trial expiration doesn't brick my app: `:free` models still work.
- Streaming connections deliver chunks or time out cleanly; they never hang.
- IP allowlists enforce exactly what's configured.
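From the client side, honoring these headers takes only a few lines. The sketch below is illustrative: `backoff_delay` is a hypothetical helper, and it assumes `Retry-After` carries a whole number of seconds (HTTP also permits an HTTP-date, which this sketch does not parse and instead replaces with a default delay).

```python
from typing import Mapping, Optional


def backoff_delay(status: int, headers: Mapping[str, str],
                  default_s: float = 1.0) -> Optional[float]:
    """Seconds to wait before the next request, or None if no wait is needed.

    Hypothetical client helper. Assumes Retry-After is an integer number of
    seconds; any other form falls back to `default_s`.
    """
    # HTTP header names are case-insensitive, so normalise them first.
    h = {k.lower(): v.strip() for k, v in headers.items()}
    exhausted = status == 429 or h.get("x-ratelimit-remaining") == "0"
    if not exhausted:
        return None
    retry_after = h.get("retry-after", "")
    return float(retry_after) if retry_after.isdigit() else default_s
```

For example, `backoff_delay(429, {"Retry-After": "7"})` returns `7.0`. A 429 is waited out rather than retried against another provider, which matches the routing rules further down this page.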
| Experience | What makes it stable |
|---|---|
| Request succeeded | 14-provider failover chain, transparent |
| It was fast | 7 cache layers, health-weighted routing, concurrency control |
| Cost was correct | Pre-flight credit check, idempotent deduction, auto-refund on 5xx |
| Not blocked unfairly | Three-layer rate limiting, authenticated user bypass, tuned thresholds |
| I can see what happened | Activity logs, usage stats, model health endpoints, status page |
"Stable means every feature works correctly, consistently, and together."
The conceptual model defines seven blocks: PROTECT, ROUTE, OPTIMIZE, BILL, CATALOG, PLATFORM, OBSERVE. Stability means all seven hold simultaneously.
| Block | Stable when... |
|---|---|
| PROTECT | Auth never leaks. Rate limits never fail open. Redis down → in-memory fallback activates, not "no limiting." |
| ROUTE | 120+ aliases resolve correctly. Failover triggers on 5xx/402, not on 400/429. Circuit breakers recover after cool-down. |
| OPTIMIZE | Health tiers classify models correctly. Passive monitoring adds zero request latency. Caches serve stale-but-valid data during provider outages. |
| BILL | Credits deduct atomically. Subscription allowance drains before purchased credits. High-value models never served at default pricing. |
| CATALOG | Models without pricing are excluded. Background sync keeps catalog fresh without hitting providers on the hot path. |
| PLATFORM | Chat history persists. Shared links remain accessible. Analytics don't lose data under load. |
| OBSERVE | Prometheus metrics reflect reality. Health dashboards show actual states. Alerts fire on real incidents, not 4xx noise. |
| Interaction | Correct behavior |
|---|---|
| Failover + Billing | Cost calculated using the provider that actually served the request |
| Rate Limiting + Velocity Mode | Paid users get less restriction than free users during velocity mode |
| Circuit Breaker + Health | Open breaker reflected in health data; HALF_OPEN validated by real test request before CLOSED |
| Caching + Credits | Cache hit = no credit deduction. Cache miss = deduction after provider response. |
| Trial + Failover | Trial users benefit from failover. Provider failure doesn't consume trial budget. |
| Catalog + Pricing | Model without pricing blocked from catalog — never served at $0 |
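The Failover + Billing, Caching + Credits, and Catalog + Pricing rows can be expressed as one small settlement function. This is a sketch under assumptions: the pricing table, the provider ids, the per-1K-token pricing unit, and the `settle`/`is_listable` names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical price table: credits per 1K tokens, keyed by (provider, model).
PRICING: Dict[Tuple[str, str], float] = {
    ("provider-a", "model-x"): 0.50,
    ("provider-b", "model-x"): 0.80,
}


@dataclass
class Outcome:
    model: str
    served_by: Optional[str]  # provider that actually answered; None on a cache hit
    tokens: int
    cache_hit: bool


def is_listable(model: str, providers: Iterable[str]) -> bool:
    """Catalog + Pricing: a model with no known price never enters the catalog."""
    return any((p, model) in PRICING for p in providers)


def settle(outcome: Outcome) -> float:
    """Credits to deduct once the request has finished."""
    if outcome.cache_hit:
        return 0.0  # Caching + Credits: a cache hit costs nothing
    price = PRICING.get((outcome.served_by, outcome.model))
    if price is None:
        # Refuse rather than bill at $0.
        raise LookupError(f"no price for {outcome.model} via {outcome.served_by}")
    # Failover + Billing: use the price of the provider that served the request,
    # not the one that was tried first.
    return price * outcome.tokens / 1000
```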
- Conformance tests (25 checks in Testing Plan, Section 25) pass continuously.
- No revenue leakage: no model served without correct pricing, no double-charges, no missed deductions.
- Features are complete relative to conceptual model sections 2.2–2.11.
"Stable means every component degrades gracefully, recovers automatically, and never cascades failure."
From the conceptual model (Section 2.5):
"No cache failure ever blocks a user request."
This applies to every layer. No single component failure produces a user-visible outage.
- IP rate limiting: residential 300 RPM, datacenter 60 RPM.
- Velocity mode triggers on 5xx only (not 4xx). Auto-deactivates after 3 min.
- Authenticated users bypass IP limits.
- Failure mode: Bad thresholds → legitimate users get 429s. (See issue #1091: 166+ blocked requests from misconfigured velocity mode.)
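A minimal in-memory sketch of the IP layer, assuming a sliding 60-second window per IP and that the caller has already classified the address as residential or datacenter (that classification is not shown). `IPRateLimiter` is a hypothetical name, and it is single-threaded as written; a production deployment would more likely keep counters in Redis, with local limiting as the fallback listed in the table further down.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Optional

# Requests per minute per IP, from the thresholds above.
LIMITS_RPM = {"residential": 300, "datacenter": 60}
WINDOW_S = 60.0


class IPRateLimiter:
    """Sliding-window request counter keyed by client IP (in-memory sketch)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str, network: str, user_id: Optional[str] = None) -> bool:
        if user_id is not None:
            return True  # authenticated users bypass the IP layer
        now = self._clock()
        hits = self._hits[ip]
        while hits and now - hits[0] >= WINDOW_S:
            hits.popleft()  # forget requests older than the window
        if len(hits) >= LIMITS_RPM[network]:
            return False  # caller answers 429 with Retry-After
        hits.append(now)
        return True
```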
- Failover on: 401, 402, 403, 404, 502, 503, 504.
- No failover on: 400 (user error), 429 (use backoff).
- Model-aware: OpenAI models → OpenAI/OpenRouter only. Open-source → all providers.
- Circuit breaker: 5 consecutive failures → OPEN. 5 min cool-down → HALF_OPEN → test → CLOSED.
- Failure mode: Wrong failover → format mismatch or cost mismatch. No failover → user sees raw provider errors.
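The failover rules and the circuit breaker fit together roughly as below. Sketch only: provider ids such as `openai` and `openrouter`, the `call` callback that returns an HTTP status, and treating every failover-eligible status as a breaker failure are assumptions, not the system's confirmed behavior.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

# Statuses that move on to the next provider (from the list above).
FAILOVER_STATUSES = {401, 402, 403, 404, 502, 503, 504}
OPENAI_ONLY = ("openai", "openrouter")  # hypothetical provider ids


def should_failover(status: int) -> bool:
    # 400 (user error) and 429 (back off) go straight back to the caller.
    return status in FAILOVER_STATUSES


class CircuitBreaker:
    """CLOSED -> OPEN after `threshold` consecutive failures.

    After `cooldown_s` the breaker goes HALF_OPEN and admits one test request;
    success closes it, failure re-opens it for another cool-down.
    """

    def __init__(self, threshold: int = 5, cooldown_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0
        self._probe_in_flight = False

    def allow_request(self) -> bool:
        if self.state == "OPEN":
            if self._clock() - self.opened_at < self.cooldown_s:
                return False
            self.state = "HALF_OPEN"
            self._probe_in_flight = False
        if self.state == "HALF_OPEN":
            if self._probe_in_flight:
                return False  # one test request at a time
            self._probe_in_flight = True
        return True

    def record(self, ok: bool) -> None:
        self._probe_in_flight = False
        if ok:
            self.state, self.failures = "CLOSED", 0
            return
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.threshold:
            self.state = "OPEN"
            self.opened_at = self._clock()


def route(is_openai_model: bool, chain: List[str],
          breakers: Dict[str, CircuitBreaker],
          call: Callable[[str], int]) -> Tuple[int, Optional[str]]:
    """Walk the failover chain; return (status, provider that answered)."""
    # Model-aware: OpenAI models only go to OpenAI-compatible providers.
    providers = [p for p in chain if not is_openai_model or p in OPENAI_ONLY]
    last: Tuple[int, Optional[str]] = (503, None)
    for provider in providers:
        if not breakers[provider].allow_request():
            continue  # breaker open: skip without calling
        status = call(provider)
        breakers[provider].record(ok=not should_failover(status))
        if not should_failover(status):
            return status, provider
        last = (status, provider)
    return last
```

Returning the serving provider alongside the status is what would let billing price the request by the provider that actually answered.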
- Tiered active checks: critical every 5 min, popular every 30 min, standard every 2 h, on-demand every 4 h.
- Passive: every real request updates health as a background task (zero overhead).
- Database-backed persistence (survives restarts).
- Failure mode: Dead providers stay in the routing chain → requests go into black holes.
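The active-check schedule reduces to a per-tier interval comparison. This sketch covers only the active side; the tier names and intervals come from the list above, while the `(model_id, tier, last_checked)` row shape and the `due_for_check` name are assumptions.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

# Active-check cadence per health tier (from the list above).
CHECK_INTERVALS = {
    "critical": timedelta(minutes=5),
    "popular": timedelta(minutes=30),
    "standard": timedelta(hours=2),
    "on-demand": timedelta(hours=4),
}


def due_for_check(models: Iterable[Tuple[str, str, datetime]],
                  now: datetime) -> List[str]:
    """Return model ids whose last active check is older than their tier's interval.

    `models` holds (model_id, tier, last_checked) rows, e.g. loaded from the
    database so the schedule survives restarts.
    """
    return [model_id for model_id, tier, last_checked in models
            if now - last_checked >= CHECK_INTERVALS[tier]]
```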
Semantic → Exact-match → External (Butter.dev) → Provider API
Supporting: Auth cache, Catalog L1/L2, DB query cache, Health cache, Local memory fallback.
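The lookup order can be sketched as a fall-through over layers, where a layer that raises counts as a miss, so a cache outage costs latency rather than availability. Layer names mirror the chain above; the callables, the `resolve` name, and returning the name of the answering layer are illustrative assumptions. Writing fresh provider responses back into the caches is omitted.

```python
from typing import Callable, List, Optional, Tuple

Lookup = Callable[[str], Optional[str]]


def resolve(key: str, layers: List[Tuple[str, Lookup]],
            provider: Callable[[str], str]) -> Tuple[str, str]:
    """Return (response, name of the layer that produced it)."""
    for name, lookup in layers:
        try:
            hit = lookup(key)
        except Exception:
            continue  # a broken cache layer is treated as a miss, never as an error
        if hit is not None:
            return hit, name
    return provider(key), "provider"
```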
- Redis failure → in-memory LRU (500 entries, 15 min TTL).
- Catalog cache has stampede protection (one rebuild at a time).
- Failure mode: All caches miss → every request hits DB + provider. Latency spikes from 5 ms to 500 ms. Under load, the connection pool is exhausted.
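A sketch of the in-memory fallback and of stampede protection, assuming a thread-based server. `TTLLRUCache`, `get_catalog`, and the `rebuild` callback are hypothetical names; the 500-entry and 15-minute defaults come from the list above.

```python
import threading
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class TTLLRUCache:
    """Bounded LRU cache whose entries also expire after `ttl_s` seconds."""

    def __init__(self, max_entries: int = 500, ttl_s: float = 15 * 60,
                 clock: Callable[[], float] = time.monotonic):
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()
        self._max = max_entries
        self._ttl = ttl_s
        self._clock = clock
        self._lock = threading.Lock()

    def get(self, key: Hashable) -> Optional[Any]:
        with self._lock:
            item = self._data.get(key)
            if item is None:
                return None
            stored_at, value = item
            if self._clock() - stored_at >= self._ttl:
                del self._data[key]  # expired
                return None
            self._data.move_to_end(key)  # mark as most recently used
            return value

    def set(self, key: Hashable, value: Any) -> None:
        with self._lock:
            self._data[key] = (self._clock(), value)
            self._data.move_to_end(key)
            while len(self._data) > self._max:
                self._data.popitem(last=False)  # evict least recently used


_rebuild_lock = threading.Lock()


def get_catalog(cache: TTLLRUCache, rebuild: Callable[[], Any]) -> Any:
    """Return the cached catalog, letting only one caller rebuild it on a miss."""
    catalog = cache.get("catalog")
    if catalog is not None:
        return catalog
    with _rebuild_lock:
        # Re-check: another thread may have finished a rebuild while we waited.
        catalog = cache.get("catalog")
        if catalog is None:
            catalog = rebuild()
            cache.set("catalog", catalog)
        return catalog
```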
- Pool: 80 primary, 30–100 read replica, separate bulk pool.
- Thread-safe singleton with double-checked locking.
- Failed init retries after 60s (no reconnect storms).
- Failure mode: Pool exhaustion → all layers fail (auth, billing, catalog, history).
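Double-checked locking with a retry cool-down might look like the sketch below. The `factory` callback stands in for whatever builds the real pool (for example, one sized at 80 primary connections); `PoolHolder` and its parameters are hypothetical.

```python
import threading
import time
from typing import Callable, Optional


class PoolHolder:
    """Lazily creates one shared connection pool, thread-safely.

    The unlocked check keeps the hot path lock-free once the pool exists;
    the locked re-check stops two threads from both creating it. After a
    failed init, new attempts are refused for `retry_after_s` seconds.
    """

    def __init__(self, factory: Callable[[], object], retry_after_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._factory = factory
        self._retry_after_s = retry_after_s
        self._clock = clock
        self._lock = threading.Lock()
        self._pool: Optional[object] = None
        self._failed_at: Optional[float] = None

    def get(self) -> object:
        pool = self._pool
        if pool is not None:  # first check, without the lock
            return pool
        with self._lock:
            if self._pool is not None:  # second check, under the lock
                return self._pool
            if (self._failed_at is not None
                    and self._clock() - self._failed_at < self._retry_after_s):
                raise RuntimeError("pool init failed recently; not retrying yet")
            try:
                self._pool = self._factory()
            except Exception:
                self._failed_at = self._clock()  # start the retry cool-down
                raise
            self._failed_at = None
            return self._pool
```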
- Semaphore: 20 concurrent, 50 queued, 10s queue timeout.
- Overload → 503 (not 429).
- Streaming endpoints exempt.
- Failure mode: Without this, traffic spikes overwhelm the event loop → death spiral.
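An asyncio sketch of the concurrency gate, assuming handlers are coroutines that return `(status, body)` tuples; `ConcurrencyGate` and that return shape are hypothetical. The limits default to the values above.

```python
import asyncio
from typing import Awaitable, Callable, Tuple

Response = Tuple[int, str]


class ConcurrencyGate:
    """At most `limit` requests run at once; up to `max_queue` more may wait.

    A request that finds the queue full, or waits longer than
    `queue_timeout_s`, gets 503 (server overloaded) rather than 429.
    Construct it inside the running event loop: before Python 3.10,
    asyncio primitives bind to the loop current at creation.
    """

    def __init__(self, limit: int = 20, max_queue: int = 50,
                 queue_timeout_s: float = 10.0):
        self._sem = asyncio.Semaphore(limit)
        self._max_queue = max_queue
        self._timeout = queue_timeout_s
        self._waiting = 0

    async def run(self, handler: Callable[[], Awaitable[Response]],
                  streaming: bool = False) -> Response:
        if streaming:
            return await handler()  # streaming endpoints are exempt
        if self._sem.locked() and self._waiting >= self._max_queue:
            return 503, "overloaded"  # queue full: shed load immediately
        self._waiting += 1
        try:
            await asyncio.wait_for(self._sem.acquire(), self._timeout)
        except asyncio.TimeoutError:
            return 503, "queue timeout"
        finally:
            self._waiting -= 1
        try:
            return await handler()
        finally:
            self._sem.release()
```

Answering 503 rather than 429 tells clients the server is overloaded, so they do not mistake it for their own quota being exhausted.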
- Pre-flight: estimate max cost before provider call. Insufficient → 402 immediately.
- Idempotent: same request ID → single deduction.
- Atomic: balance + transaction in one DB operation.
- Auto-refund: 5xx → credits returned. 4xx → credits kept.
- High-value protection: GPT-4, Claude, Gemini, and o-series models are blocked if their pricing falls back to the default.
- Failure mode: Double-charging, under-billing expensive models, or negative balances.
| Component fails | Fallback | User impact |
|---|---|---|
| Redis | In-memory LRU + local rate limiting | None |
| Primary provider | 14-provider failover | None |
| Database (momentary) | Cached auth + cached catalog | None for reads |
| Database (extended) | Degraded — health reports it | Partial (new auth/billing fail) |
| Health monitor | Passive monitoring from real requests | Reduced visibility |
| Prometheus/Grafana | System functions, not observable | None for users |
| Sentry | Errors logged locally | None for users |
| Stripe webhook | Returns 200, retries later | None (eventual consistency) |
Features described in the conceptual model but not yet fully implemented:
| Gap | Conceptual Model Section | Impact on stability |
|---|---|---|
| Input/Output guardrails (PII, injection, moderation) | 2.2 | No content safety layer yet |
| Multi-region deployment | 2.11 | Single region = single point of geographic failure |
| Health-weighted routing (route to healthiest first) | 2.3 | Always tries primary provider first, even if degraded |
| SLA tracking with credit-back | 2.7 | No automatic compensation for SLA breaches |
| Customer webhooks (`credits.low`, `model.degraded`) | 2.7 | Customers can't automate on system events |
| Traffic splitting across providers | 2.3 | Over-reliance on primary provider per model |
- User: "My request works, costs what I expect, and I never see infrastructure."
- Product Owner: "Every feature works correctly and features don't break each other."
- Engineer: "Every component has a fallback, failures don't cascade, and we can see it happening."