Current System Delta and Gaps
This document compares the Conceptual Model Features (56 required features) against the Current Implementation (450+ endpoints). It identifies what is fully implemented, what is partially implemented with specific gaps, and what is entirely missing.
Last Updated: 2026-03-05
| Status | Count | Percentage |
|---|---|---|
| Fully Implemented | 33 | 59% |
| Partially Implemented | 7 | 12% |
| Not Implemented | 16 | 29% |
| Total | 56 | 100% |
The gaps cluster in three areas:
- Input/Output Guardrails -- all 7 guardrail features are missing (the entire safety layer)
- Developer Platform -- 3 of 4 features are missing (prompt management, batch inference, playground)
- Infrastructure -- 2 of 3 features are missing (multi-region, data residency)
Table of Contents
- Fully Implemented Features (33)
- Partially Implemented Features (7)
- Not Implemented Features (16)
- Gap Analysis by Layer
- Priority Recommendations
Fully Implemented Features (33)
These features exist in the current system with endpoints and functionality that match the conceptual model requirements.
| # | Feature | Evidence |
|---|---|---|
| 1.1 | API Key Authentication | Fernet AES-128 encryption + SHA-256 HMAC hashing. Full auth chain: get_api_key() -> validate_api_key_security() -> get_user(). Key creation, listing, rotation, deletion endpoints. |
| 1.2 | Role-Based Access Control (RBAC) | Admin role enforcement via require_admin. Role CRUD: GET/POST /admin/roles/*. Audit log for role changes. Roles: admin, user (team/dev/free tiers via plans). |
| 1.3 | Per-Key IP Allowlists | Full CRUD on /api/admin/ip-whitelist. CIDR range support via ipaddress.ip_network(). Validated in auth pipeline via validate_api_key_security(). |
| 1.4 | Domain Restrictions | Domain referrer validation in auth pipeline. Stored per API key in domain_referrers field. Checked during validate_api_key_security(). |
| 1.5 | Three-Layer Rate Limiting | Layer 1: IP-level SecurityMiddleware with velocity mode (25% error threshold, 3-min window). Layer 2: Redis-backed API key limits (INCR rate_limit:{api_key_id}:{minute}). Layer 3: Anonymous IP-hash limits. In-memory fallback when Redis unavailable. |
| 2.1 | Model Resolution Pipeline | Alias normalization (normalize_model_string()), provider detection, model ID transformation (model_transformations.py). 120+ aliases. Router prefix detection (router:general:*, router:code:*). |
| 2.2 | Intelligent Routing -- General Router | NotDiamond integration. 4 modes: balanced, quality, cost, latency. Fallback models per mode. 5 endpoints: settings, models, fallback-models, stats, test. |
| 2.3 | Intelligent Routing -- Code Router | SWE-bench + HumanEval benchmarks via code_quality_priors.json. 4 modes: auto, price, quality, agentic. 4 model tiers. 5 endpoints. |
| 2.4 | Provider Failover | build_provider_failover_chain() with 14+ providers. Circuit breaker-aware (skips OPEN providers). Model-aware rules (OpenAI->OpenRouter, Anthropic->OpenRouter, open-source->all). Triggers on 401-404, 502-504. |
| 2.5 | Circuit Breakers | CLOSED/OPEN/HALF_OPEN states. Redis-backed (circuit_breaker:{provider}:*, 3600s TTL). Config: 5 failures->OPEN, 300s recovery, 3 successes->CLOSED. 4 management endpoints. Prometheus metrics on state transitions. |
| 2.6 | Health-Weighted Load Balancing | Health scores 0-100 per provider. "Health-based provider selection" as step 10 in inference pipeline. Provider health checked before routing. |
| 3.1 | Tiered Health Monitoring | Multiple check types: Quick (sub-ms), Standard (DB+Redis), Railway (comprehensive), System (memory/CPU). Background monitoring with start/stop. Per-provider and per-model health. 30+ health endpoints. |
| 3.2 | Passive Health Capture | Confirmed in inference pipeline: "Background post-processing after stream completes: credit deduction, activity log, chat history save, health capture." Zero overhead on request path. |
| 3.3 | Incident Management | Downtime tracking: incident CRUD, severity levels, Loki log capture (30s timeout, 10K entries), resolution with notes, MTTR calculation, error pattern analysis. 8 admin endpoints. |
| 3.6 | Provider Credit Monitoring | GET /api/provider-credits/balance (all providers), GET /api/provider-credits/balance/{provider}. 15-min in-memory cache. Status thresholds: critical <= $5, warning <= $20. Currently OpenRouter only. |
| 4.4 | Supporting Caches | Auth cache (5-min TTL, 512 entries), catalog L1 (in-process, 5-min), catalog L2 (Redis, 15-30 min), rate limit LRU cache, health cache (6-min), HuggingFace cache, user lookup cache. Graceful degradation: Redis down -> local memory. |
| 5.1 | Background Model Sync | 12 admin endpoints: trigger, all, full, incremental, providers-only, per-provider, reset-and-resync, flush. 33 syncable providers. Stores to models_catalog DB table. |
| 5.2 | Model Metadata Standard | Standardized fields: id, name, provider_slug, context_length, pricing, source_gateway. Model health status. Catalog CRUD endpoints. |
| 5.4 | HuggingFace Enrichment | 6 endpoints: discovery, search, author models, model details (downloads, likes, parameters), model card, file listing. Redis-cached. Admin cache management. |
| 5.5 | Model Discovery & Search | Full-text search, filtering (provider, gateway, modality), trending models, low-latency models, model comparison, batch-compare, developer views, deduplicated unique view, rankings leaderboard. 50+ endpoints. |
| 6.1 | Credit System | Token-based billing: (prompt_tokens x prompt_price) + (completion_tokens x completion_price). Pre-flight credit check. Post-inference deduction. Refund endpoint. Transaction types: trial, purchase, api_usage, admin_credit, refund, bonus. Daily usage cap. |
| 6.2 | Plans & Tiers | Trial (3 days, $5), subscription plans via Stripe. Plan entitlements, usage vs limits. Upgrade/downgrade/cancel. Partner trials (Redbeard: 14-day Pro, $100 credits). |
| 6.3 | Customer Usage Analytics | GET /user/activity/stats (requests, tokens, spend by date/model/provider), GET /user/activity/log (paginated logs), API key usage, environment usage breakdown, credit transactions. |
| 8.1 | Internal Metrics & Dashboards | Prometheus /metrics with OpenMetrics exemplar support. Grafana SimpleJSON datasource (6 endpoints). Parsed JSON metrics with P50/P95/P99. 40+ observability endpoints. Anomaly detection. |
| 8.2 | Distributed Tracing | OpenTelemetry with Tempo. TraceContextMiddleware. Exemplar linking (metrics->traces). Status, config, test-trace endpoints. Loki integration for logs. |
| 8.3 | Error Tracking | Sentry (AutoSentryMiddleware, 50% admin sampling). Loki log ingestion. Error classification pipeline (7 categories). AI fix generation via Claude (claude-3-5-sonnet). 13 error monitoring endpoints. |
| 8.6 | Customer-Facing Observability | User activity stats/logs. Public status page (9 endpoints: status, providers, models, incidents, uptime, search). API key audit logs. Model health visibility. |
| 9.1 | OpenAI-Compatible API | POST /v1/chat/completions -- full drop-in replacement. Streaming SSE, tool/function calling, JSON mode, logprobs. All standard parameters. 30+ provider routing. |
| 9.2 | Anthropic-Compatible API | POST /v1/messages -- drop-in Claude compatibility. Same routing/billing pipeline. |
| 10.3 | Multi-Target Deployment | Vercel serverless (api/index.py), Railway/Docker (start.sh), dev (src/main.py). Railway-specific health endpoint. |
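The circuit breaker thresholds listed for feature 2.5 (5 failures -> OPEN, 300s recovery, 3 successes -> CLOSED) can be illustrated with a minimal in-memory sketch. This is not the project's implementation, which is described as Redis-backed. The class name, method names, the injectable clock, and the rule that any failure during HALF_OPEN re-opens the breaker are assumptions made for illustration.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class CircuitBreaker:
    """In-memory illustration only; the documented breaker keeps its state in Redis."""

    def __init__(self, failure_threshold=5, recovery_seconds=300,
                 success_threshold=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self):
        # Once the recovery window has elapsed, let trial traffic through.
        waited = self.clock() - self.opened_at
        if self.state is State.OPEN and waited >= self.recovery_seconds:
            self.state = State.HALF_OPEN
            self.successes = 0
        return self.state is not State.OPEN

    def record_failure(self):
        self.failures += 1
        # Assumption: one failure in HALF_OPEN is enough to re-open.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
            self.failures = 0

    def record_success(self):
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0  # assumption: only consecutive failures count
```

A failover chain builder would call allow_request() for each candidate provider and skip those that return False, which matches the "skips OPEN providers" behavior described for feature 2.4.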
Additional implemented features not in the conceptual model: Coupons, referrals, chat history/sessions/sharing, feedback system, Nosana GPU computing, partner trials, server-side tools (web search, TTS), image generation, audio transcription, admin dashboard, analytics event forwarding (Statsig/PostHog).
Partially Implemented Features (7)
These features have some implementation but are missing key aspects required by the conceptual model.
Conceptual requirement: For models available on multiple providers, dynamically route to the provider with the lowest current P50 latency.
What exists:
- General Router "latency" mode exists, but routes to a single hardcoded model (groq/llama-3.3-70b-versatile), not dynamically to the lowest-latency provider
- Latency monitoring endpoints exist (GET /api/monitoring/latency-trends/{provider}, GET /api/monitoring/latency/{provider}/{model})
- Provider timing diagnostics exist (GET /api/diagnostics/provider-timing)
What's missing:
- Dynamic per-request P50 latency comparison across providers serving the same model
- Real-time latency-based provider ranking in the failover chain
- Latency data feeding back into the routing decision for arbitrary models (not just the router modes)
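The missing per-request comparison could look roughly like the sketch below: given recent latency samples per provider for one model, pick the provider with the lowest median (P50). The function name and the shape of the input mapping are hypothetical; how samples would be collected (for example from the existing latency monitoring data) is left open.

```python
from statistics import median


def lowest_p50_provider(latency_samples_ms):
    """Return the provider with the lowest median (P50) latency.

    latency_samples_ms maps provider name -> list of recent latencies in ms.
    Providers with no samples are skipped; returns None if nothing is usable.
    """
    p50 = {name: median(s) for name, s in latency_samples_ms.items() if s}
    if not p50:
        return None
    return min(p50, key=p50.get)


# Hypothetical samples for three providers serving the same model.
samples = {
    "provider-a": [420, 380, 510, 400],
    "provider-b": [300, 900, 310, 305],
    "provider-c": [],
}
best = lowest_p50_provider(samples)
```

Using the median rather than the mean keeps a single slow outlier (the 900 ms sample for provider-b) from disqualifying an otherwise fast provider.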
Conceptual requirement: When the user requests cost optimization, dynamically select the cheapest provider serving the requested model while meeting minimum quality and latency thresholds.
What exists:
- General Router "cost" mode exists, but maps to a single hardcoded model (openai/gpt-4o-mini)
- Code Router "price" mode exists with static tier assignments
- Pricing data per model per provider exists in the catalog
- Cost analysis endpoints exist (GET /api/monitoring/cost-analysis)
What's missing:
- Dynamic cost comparison across providers serving the same model at request time
- Quality/latency threshold enforcement when selecting the cheapest option
- Per-request cost-optimal provider selection (current implementation selects a cheap model, not the cheapest provider for a given model)
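A minimal sketch of per-request cost-optimal selection, assuming the token-based billing formula from feature 6.1 and per-provider quality and latency figures. The ProviderOffer type, its field names, and the 0-100 quality scale are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass
class ProviderOffer:
    provider: str
    prompt_price: float      # price per prompt token
    completion_price: float  # price per completion token
    quality: float           # hypothetical 0-100 quality score
    p50_latency_ms: float


def request_cost(offer, prompt_tokens, completion_tokens):
    # Same shape as the billing formula listed under feature 6.1.
    return (prompt_tokens * offer.prompt_price
            + completion_tokens * offer.completion_price)


def cheapest_offer(offers, prompt_tokens, completion_tokens,
                   min_quality, max_latency_ms):
    """Cheapest provider that still meets the quality and latency thresholds."""
    eligible = [o for o in offers
                if o.quality >= min_quality and o.p50_latency_ms <= max_latency_ms]
    if not eligible:
        return None
    return min(eligible,
               key=lambda o: request_cost(o, prompt_tokens, completion_tokens))


offers = [
    ProviderOffer("provider-a", 2e-6, 6e-6, 90, 400),
    ProviderOffer("provider-b", 1e-6, 3e-6, 60, 350),  # cheapest, but below quality floor
    ProviderOffer("provider-c", 1.5e-6, 5e-6, 85, 450),
]
choice = cheapest_offer(offers, 1000, 500, min_quality=80, max_latency_ms=500)
```

Filtering before minimizing is the key design point: the thresholds act as hard constraints, so the cheapest offer overall (provider-b here) is never chosen if it fails one of them.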
Conceptual requirement: Maintain quality scores for every model from standardized benchmarks (MMLU, HumanEval, MATH, MT-Bench, LMSYS Arena ELO, LiveBench, SWE-bench) blended with real-time signals.
What exists:
- SWE-bench and HumanEval benchmarks used in Code Router via code_quality_priors.json
- Model rankings/leaderboard endpoint (GET /ranking/models)
- Model health tracking (success rate, latency)
What's missing:
- MMLU, MATH, MT-Bench, LMSYS Arena ELO, LiveBench benchmark integration
- Task-specific quality priors (code, reasoning, creative writing, summarization, translation, etc.) for all models, not just coding models
- Dynamic blending of static benchmarks with real-time signals (success rate, retry rate, format compliance)
- Quality scores exposed in model catalog metadata for all models
- Periodic refresh of benchmark data (currently static: loaded once at startup and never refreshed)
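One way to picture the missing blending step is a weighted average of a static benchmark score and the real-time signals named above (success rate, retry rate, format compliance). The function name, the equal weighting of the three live signals, and the default 0.6 benchmark weight are assumptions for illustration, not values taken from the conceptual model.

```python
def blended_quality(benchmark_score, success_rate, retry_rate,
                    format_compliance, benchmark_weight=0.6):
    """Blend a static benchmark score (0-100) with live signals (each 0.0-1.0).

    A low retry rate is good, so it enters the live score as (1 - retry_rate).
    """
    live = (success_rate + (1.0 - retry_rate) + format_compliance) / 3.0
    return benchmark_weight * benchmark_score + (1.0 - benchmark_weight) * live * 100.0


# A model with a benchmark score of 80 and mixed live behavior.
score = blended_quality(80, success_rate=0.98, retry_rate=0.05, format_compliance=0.90)
```

With this shape, a model whose benchmark score is strong but whose live success rate drops would see its blended score fall without waiting for a benchmark refresh.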
Conceptual requirement: Enforce quality gates: resolvable pricing required, active provider required, valid modality required, no duplicates.
What exists:
- Model activation/deactivation endpoints (POST /catalog/models-db/{model_id}/activate|deactivate)
- Health status filtering (GET /catalog/models-db/health/{health_status})
- Pricing data in catalog
- Deduplicated view (GET /v1/models/unique)
What's missing:
- Automated gating that rejects models without resolvable pricing at sync time
- Automated validation that a model's provider is active and reachable before inclusion
- Modality validation gates
- Documentation/enforcement of the "high-value model protection" rule (blocking premium models if pricing falls through to defaults) as an automated catalog gate rather than a runtime check
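The four gates in the conceptual requirement (resolvable pricing, active provider, valid modality, no duplicates) could be applied at sync time along the lines of the sketch below. The id, provider_slug, and pricing fields follow the metadata standard listed under feature 5.2; the modality field, the pricing sub-keys, the set of valid modalities, and the function name are assumptions.

```python
VALID_MODALITIES = {"text", "image", "audio", "embedding"}  # hypothetical set


def gate_models(models, active_providers):
    """Split synced models into (accepted, rejected) using the four gates.

    rejected is a list of (model_id, reason) tuples, in input order.
    """
    accepted, rejected, seen_ids = [], [], set()
    for model in models:
        pricing = model.get("pricing") or {}
        # A price of 0 is treated as resolvable; only a missing price is rejected.
        if pricing.get("prompt") is None or pricing.get("completion") is None:
            rejected.append((model["id"], "unresolvable pricing"))
        elif model.get("provider_slug") not in active_providers:
            rejected.append((model["id"], "inactive provider"))
        elif model.get("modality") not in VALID_MODALITIES:
            rejected.append((model["id"], "invalid modality"))
        elif model["id"] in seen_ids:
            rejected.append((model["id"], "duplicate"))
        else:
            seen_ids.add(model["id"])
            accepted.append(model)
    return accepted, rejected
```

Returning the rejection reason alongside each model ID would let a sync job log why a model was excluded instead of silently dropping it.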
Conceptual requirement: Uptime tracking per provider/model/tier, SLA breach alerting, automated credit-back on violations.
What exists:
- Uptime tracking endpoints: GET /health/providers/uptime, GET /health/models/uptime, GET /v1/status/uptime/{provider}/{model_id}
- Health alerting service (health_alerting.py)
- Incident management with resolution tracking
What's missing:
- Per-customer-tier SLA definitions (different SLAs for Team vs Enterprise)
- SLA violation detection (P99 latency or error rate exceeding threshold)
- Automated credit-back compensation when SLA is breached
- Customer-visible SLA compliance reporting
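SLA violation detection and credit-back could be sketched as follows, assuming a per-tier SLA expressed as a P99 latency limit, a maximum error rate, and a credit-back percentage. All names, the dictionary keys, and the flat-percentage compensation rule are hypothetical; the conceptual model does not specify them here.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = -(-99 * len(ordered) // 100)  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def sla_breaches(latencies_ms, error_count, sla):
    """Return the names of the SLA clauses violated in one measurement window.

    latencies_ms holds latencies of successful requests; error_count is the
    number of failed requests in the same window.
    """
    breaches = []
    total = len(latencies_ms) + error_count
    if latencies_ms and p99(latencies_ms) > sla["p99_latency_ms"]:
        breaches.append("p99_latency")
    if total and error_count / total > sla["max_error_rate"]:
        breaches.append("error_rate")
    return breaches


def credit_back(window_spend, breaches, sla):
    """Credit owed for the window: a fixed share of spend if any clause was breached."""
    return window_spend * sla["credit_back_pct"] / 100 if breaches else 0.0


# Hypothetical tier definition.
team_sla = {"p99_latency_ms": 2000, "max_error_rate": 0.01, "credit_back_pct": 10}
```

Different tiers (for example Team vs Enterprise) would simply carry different dictionaries, which is what the per-customer-tier SLA definitions gap refers to.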
Conceptual requirement: LLM-specific observability via Arize Phoenix and Braintrust for prompt/response pairs, token usage, quality scoring.
What exists:
- Arize config file (src/config/arize) referenced in CLAUDE.md
- OpenTelemetry tracing captures inference requests
- Token usage tracking per request in activity logs
What's missing:
- No Arize Phoenix endpoints or dashboard integration exposed in the API
- No Braintrust integration
- No prompt/response pair recording for quality analysis (only metadata captured)
- No model performance comparison via AI-specific tracing tools
Conceptual requirement: Continuous CPU and memory profiling of hot paths via Pyroscope with operation-tagged data.
What exists:
- Pyroscope instrumentation for Redis/cache layers (recent commit: feat(profiling): instrument all Redis/cache layers with Pyroscope tags)
- Operation context tags applied to profiling data
What's missing:
- No profiling-specific API endpoints
- Profiling coverage limited to cache/Redis operations; auth, routing, and provider call paths not instrumented
- No way to view or query profiling data from within the Gatewayz system (requires external Pyroscope UI)
Not Implemented Features (16)
These features have no evidence of implementation in the current system.
Conceptual requirement: Scan prompts for PII (phone numbers, SSNs, emails, credit cards) before sending to providers. Optionally redact or block.
Current state: A sanitize_pii_for_logging() utility exists in security_validators.py that masks PII in log output, but this is for internal logging only -- it does not scan or modify user prompts before they are sent to providers. No prompt-level PII detection exists.
Gap: The entire input PII scanning and redaction pipeline is missing.
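A prompt-level scanner for the four PII types named in the requirement could start from regular expressions, as in the sketch below. The patterns are deliberately simple (US-style phone and SSN formats only) and will produce both false positives and false negatives; the function name, the placeholder format, and the redact/block switch are assumptions, and this is unrelated to the existing sanitize_pii_for_logging() utility.

```python
import re

# Simplified, illustrative patterns; production use would need broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def scan_prompt(text, mode="redact"):
    """Return (possibly redacted text, sorted list of PII kinds found).

    mode="redact" masks each match with a placeholder such as [EMAIL];
    mode="block" raises ValueError if any PII is found.
    """
    found = set()
    for kind, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.add(kind)
            text = pattern.sub(f"[{kind.upper()}]", text)
    if found and mode == "block":
        raise ValueError(f"prompt contains PII: {sorted(found)}")
    return text, sorted(found)


redacted, kinds = scan_prompt(
    "Email jane@example.com or call 555-123-4567, SSN 123-45-6789."
)
```

In a gateway, a function like this would run on each user message before provider dispatch, with the mode chosen per API key.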
Conceptual requirement: Detect and block known prompt injection patterns before they reach providers.
Current state: The codebase contains SQL injection and log injection prevention in security_validators.py, but no prompt injection detection for LLM inputs.
Gap: No prompt injection pattern library, no scanning of user messages, no blocking capability.
Conceptual requirement: Per-API-key configuration to restrict models to specific content domains.
Current state: Domain restrictions exist for HTTP referrers (network-level), but there is no content-level topic restriction on prompts.
Gap: No content classification, no per-key topic policy configuration, no prompt-level topic enforcement.
Conceptual requirement: Integration with moderation classifiers to block harmful inputs.
Current state: Some content safety references exist in provider clients (Google Vertex, Cloudflare Workers) for handling provider-side safety responses, but no pre-dispatch input moderation exists.
Gap: No moderation classifier integration, no harmful input detection, no pre-provider content scanning.
Conceptual requirement: Scan model responses for policy violations before returning to the customer.
Current state: No output content filtering exists. Responses are passed through from providers without content scanning.
Gap: No response scanning, no policy violation detection, no output blocking.
Conceptual requirement: Validate model responses conform to requested JSON schema before returning.
Current state: response_format parameter is passed through to providers (documented in chat endpoints), and model capabilities are tracked in model_capabilities.json. But no gateway-side validation of the response occurs.
Gap: No schema validation of model output at the gateway level. Validation is delegated entirely to the provider.
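Gateway-side validation could be as simple as parsing the model output and checking it against the requested schema before returning it. The sketch below uses only the standard library and supports a very small subset of JSON Schema (top-level required and per-property type); a real implementation would more likely use a full JSON Schema validator. The function name and error message wording are hypothetical.

```python
import json

JSON_TYPES = {
    "string": str, "number": (int, float), "integer": int,
    "boolean": bool, "object": dict, "array": list,
}


def validate_response(raw_text, schema):
    """Check a model response against a small subset of JSON Schema.

    Returns a list of error strings; an empty list means the response conforms.
    """
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    errors = [f"missing required field: {key}"
              for key in schema.get("required", []) if key not in data]
    for key, spec in schema.get("properties", {}).items():
        expected = JSON_TYPES.get(spec.get("type"))
        if key not in data or expected is None:
            continue
        value = data[key]
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        stray_bool = isinstance(value, bool) and spec["type"] != "boolean"
        if stray_bool or not isinstance(value, expected):
            errors.append(f"field {key} is not of type {spec['type']}")
    return errors
```

On a non-empty error list the gateway could retry, fail over to another provider, or return a structured error, rather than passing a malformed response through.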
Conceptual requirement: Standardize provider safety metadata (refusals, content flags) into one consistent format.
Current state: Minimal extraction of reasoning_content from Anthropic thinking models in anthropic_transformer.py. Safety metadata from different providers is not normalized.
Gap: No standardized safety metadata schema. Provider-specific refusal formats are not translated into a common format.
Conceptual requirement: Distribute inference load across providers for the same model (e.g., 70/30 split) to prevent over-reliance and gather performance data.
Current state: The failover chain is ordered by priority, but requests always go to the primary provider first. No weighted distribution exists.
Gap: No traffic splitting logic, no configurable split ratios, no multi-provider load distribution.
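A weighted split such as the 70/30 example can be sketched with the standard library's weighted random choice. The function name and the weights mapping are hypothetical; a real implementation would also need to respect circuit breaker state and health scores before choosing.

```python
import random


def pick_provider(weights, rng=random):
    """Pick one provider according to split weights, e.g. {"a": 70, "b": 30}.

    rng can be a seeded random.Random instance to make selection reproducible.
    """
    providers = list(weights)
    return rng.choices(providers, weights=[weights[p] for p in providers], k=1)[0]


# Roughly 70% of calls should land on provider-a over many requests.
split = {"provider-a": 70, "provider-b": 30}
chosen = pick_provider(split)
```

Random selection per request keeps the router stateless; a deterministic alternative (hashing a request ID into buckets) would make the split reproducible per request at the cost of slightly more code.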
Conceptual requirement: Track whether a model performs well for a specific customer's use case over time, enabling personalized routing recommendations.
Current state: Activity logs track per-user usage (model, tokens, cost), but there is no quality signal tracking per customer per model, and no personalized routing.
Gap: No per-customer success rate tracking, no per-customer model preference learning, no personalized routing recommendations.
Conceptual requirement: Cache inference responses and match against semantically similar prompts using vector similarity (cosine > 0.95).
Current state: No vector database, no embedding generation, no similarity matching. All caching is exact-match or metadata-level.
Gap: Entire semantic caching subsystem is missing -- embeddings, vector storage, similarity search, cache lookup integration in inference pipeline.
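The lookup step of a semantic cache can be illustrated with a linear scan over stored embeddings and the cosine > 0.95 threshold from the requirement. Embedding generation is out of scope here: the vectors are assumed to come from some embedding model, and a real system would replace the scan with a vector index. The class and method names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Linear-scan illustration; not suitable for large entry counts."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        """Return the response of the most similar entry above the threshold."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None
```

The threshold trades hit rate against the risk of returning an answer to a prompt that only looks similar, which is why the requirement sets it as high as 0.95.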
Conceptual requirement: Cache inference responses keyed by SHA-256 hash of request (messages + model + params). 20K entries, 60-min TTL, LRU eviction.
Current state: Caching exists for metadata (models, providers, health, users), but no response-level caching for inference results.
Gap: No inference response cache. Every inference request hits a provider regardless of whether the identical request was made recently.
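The exact-match cache described in the requirement (SHA-256 key over messages + model + params, 20K entries, 60-min TTL, LRU eviction) fits in a small in-process structure, sketched below. The class and function names and the injectable clock are assumptions; the canonical JSON encoding is one possible way to make the hash independent of key order.

```python
import hashlib
import json
import time
from collections import OrderedDict


def cache_key(model, messages, params):
    """SHA-256 over a canonical JSON encoding, so dict key order does not matter."""
    canonical = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ResponseCache:
    def __init__(self, max_entries=20_000, ttl_seconds=3600, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store = OrderedDict()  # key -> (expires_at, response)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        expires_at, response = item
        if self.clock() >= expires_at:
            del self._store[key]  # expired entries are dropped lazily
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return response

    def put(self, key, response):
        self._store[key] = (self.clock() + self.ttl_seconds, response)
        self._store.move_to_end(key)
        while len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
```

In the inference pipeline the lookup would sit before provider selection and the store after a successful non-streaming response; requests with nonzero sampling temperature may need to bypass the cache, since identical inputs are not expected to produce identical outputs.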
Conceptual requirement: Third-party shared LLM response caching proxy with sub-100ms cache hits.
Current state: User cache settings endpoints exist (GET/PUT /user/cache-settings referencing "Butter.dev cache preference"), indicating the UI/config layer exists, but no actual Butter.dev cache integration is present in the inference pipeline.
Gap: Config layer exists but the actual cache proxy integration in the request path is missing.
Conceptual requirement: Programmatic event notifications (credits.low, credits.depleted, model.degraded, rate_limit.approaching, batch.completed) with HMAC-signed payloads, retry logic, and delivery log.
Current state: Stripe payment webhooks exist (incoming from Stripe to Gatewayz). User notification preferences and email notifications exist. But no outbound customer webhooks (Gatewayz to customer's URL) are implemented.
Gap: No webhook delivery system, no event subscription management, no HMAC signing, no delivery log, no retry logic for outbound webhooks.
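The HMAC signing part of outbound webhooks can be sketched with the standard library. The event names come from the requirement; the payload layout, the X-Signature-SHA256 header name, and the function names are hypothetical. Delivery, retries, and the delivery log are not shown.

```python
import hashlib
import hmac
import json


def sign_payload(secret, body):
    """Hex HMAC-SHA256 of the raw request body; secret and body are bytes."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def build_webhook(secret, event_type, data):
    """Return (body, headers) for an outbound webhook POST."""
    body = json.dumps({"type": event_type, "data": data}, sort_keys=True).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Signature-SHA256": sign_payload(secret, body),  # hypothetical header name
    }
    return body, headers


def verify_webhook(secret, body, signature):
    """Receiver-side check, using a constant-time comparison."""
    return hmac.compare_digest(sign_payload(secret, body), signature)


body, headers = build_webhook(b"customer-secret", "credits.low", {"balance": 4.5})
```

Signing the exact bytes that are sent, rather than a re-serialized copy, is what lets the customer verify the signature without having to reproduce the sender's JSON formatting.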
Conceptual requirement: Template library with versioning, template variables, A/B testing, per-key prompt defaults.
Current state: No prompt management system exists. Prompts are sent ad-hoc per request.
Gap: Entire feature missing -- no templates, no versioning, no variables, no A/B testing, no per-key defaults.
Conceptual requirement: Submit prompt lists for async processing at reduced cost, with status polling and webhook completion notification.
Current state: The Features page explicitly states: "Does not provide batch/async inference (all requests are synchronous or streaming)."
Gap: Entire feature missing -- no batch job API, no async processing, no off-peak scheduling, no results download.
Conceptual requirement: Interactive web UI for testing prompts against any model.
Current state: The backend does not serve any web UI. No playground-related routes or services exist.
Gap: Entire feature missing. This would likely be a frontend application, not a backend feature, but requires backend API support for model listing, inference, and parameter exploration.
Conceptual requirement: Geo-aware provider selection to route to the nearest provider region.
Current state: Some providers have regional support (Google Vertex, Alibaba Cloud), but there is no gateway-level geo-routing logic that considers user location.
Gap: No user geolocation detection, no region-aware provider ranking, no latency-based geographic optimization.
Conceptual requirement: Route EU customers' requests to EU-based providers for GDPR compliance.
Current state: No data residency enforcement exists.
Gap: No user region classification, no EU-only provider routing, no data residency policy enforcement, no compliance documentation.
Gap Analysis by Layer
| Layer | Total Features | Implemented | Partial | Missing | Completion |
|---|---|---|---|---|---|
| Ingress Layer | 12 | 5 | 0 | 7 | 42% |
| Core Routing Engine | 9 | 6 | 2 | 1 | 78% |
| Intelligence Layer | 6 | 4 | 1 | 1 | 75% |
| Caching System | 4 | 1 | 0 | 3 | 25% |
| Model Catalog | 5 | 4 | 1 | 0 | 90% |
| Business Layer | 5 | 3 | 1 | 1 | 70% |
| Developer Platform | 4 | 0 | 1 | 3 | 12% |
| Observability | 6 | 4 | 2 | 0 | 83% |
| API Compatibility | 2 | 2 | 0 | 0 | 100% |
| Infrastructure | 3 | 1 | 0 | 2 | 33% |
| Total | 56 | 33 | 7 | 16 | 65% |
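The Completion column is not defined in the table itself. Every per-layer row is consistent with counting each partially implemented feature as half a feature, so the sketch below assumes that formula; it is an inference from the numbers, not a stated definition. The function name is hypothetical.

```python
def completion_pct(implemented, partial, total):
    """Completion as a whole percentage, counting each partial feature as half.

    Python's round() rounds exact halves to the nearest even integer,
    so 12.5 becomes 12.
    """
    return round((implemented + 0.5 * partial) / total * 100)


# (implemented, partial, total) per layer, copied from the table.
layers = {
    "Ingress Layer": (5, 0, 12),
    "Core Routing Engine": (6, 2, 9),
    "Intelligence Layer": (4, 1, 6),
    "Caching System": (1, 0, 4),
    "Model Catalog": (4, 1, 5),
    "Business Layer": (3, 1, 5),
    "Developer Platform": (0, 1, 4),
    "Observability": (4, 2, 6),
    "API Compatibility": (2, 0, 2),
    "Infrastructure": (1, 0, 3),
}
completion = {name: completion_pct(*counts) for name, counts in layers.items()}
```

For example, the Core Routing Engine row works out to (6 + 0.5 x 2) / 9 = 77.8%, which rounds to the 78% shown.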
Strongest areas:
- API Compatibility (100%) -- Both OpenAI and Anthropic APIs fully implemented
- Model Catalog (90%) -- Discovery, sync, enrichment, search all working
- Observability (83%) -- Prometheus, Grafana, OpenTelemetry, Sentry, Loki, error monitoring all operational
- Core Routing Engine (78%) -- Model resolution, failover, circuit breakers, intelligent routing all working
Weakest areas:
- Developer Platform (12%) -- Only model comparison exists; prompt management, batch inference, and playground are all missing
- Caching System (25%) -- Supporting caches work, but all three inference response caching layers (semantic, exact-match, Butter.dev) are missing
- Infrastructure (33%) -- Only multi-target deployment exists; multi-region and data residency are missing
- Ingress Layer (42%) -- Auth and rate limiting are strong, but all 7 guardrail features (PII, injection, topic, moderation, content filtering, schema validation, hallucination flags) are missing
Priority Recommendations
- Inference Response Caching (4.2) -- Implementing exact-match response caching would immediately reduce provider costs and improve latency for repeated queries. Low complexity, high ROI.
- Customer Webhooks (6.4) -- Required for enterprise customers who need programmatic notifications for credits, health events, and rate limits. Blocks enterprise adoption.
- Input Content Moderation (1.9) -- Prevents platform misuse and is often required for enterprise compliance. Integration with an external moderation API (e.g., OpenAI moderation endpoint) would be relatively straightforward.
- Batch / Async Inference (7.2) -- Common requirement for data processing workloads. Competitors (OpenAI, Anthropic) offer batch APIs at a 50% discount.
- Dynamic Cost-Optimal Selection (2.8) -- Completing this would differentiate from competitors by automatically finding the cheapest provider per model per request.
- Dynamic Latency-Optimal Selection (2.7) -- Same differentiation for latency-sensitive workloads.
- Prompt Management (7.1) -- High-value developer experience feature. Template versioning and A/B testing are table stakes for production AI deployments.
- Semantic Cache (4.1) -- High complexity (requires embedding infrastructure) but significant cost reduction potential.
- Guardrails Suite (1.6-1.8, 1.10-1.12) -- Important for enterprise but can be initially addressed by recommending customers use provider-side safety features.
- Multi-Region / Data Residency (10.1, 10.2) -- Required for EU enterprise customers but can be deferred until geographic expansion demands it.
Source: Conceptual Model Features | Features (Current Implementation) | API Mappings