Current System Delta and Gaps

arminrad edited this page Mar 6, 2026 · 2 revisions

This document compares the Conceptual Model Features (56 required features) against the Current Implementation (450+ endpoints). It identifies what is fully implemented, what is partially implemented (with specific gaps), and what is entirely missing.

Last Updated: 2026-03-05


Summary

| Status | Count | Percentage |
| --- | --- | --- |
| Fully Implemented | 30 | 54% |
| Partially Implemented | 8 | 14% |
| Not Implemented | 18 | 32% |
| Total | 56 | 100% |

The gaps cluster in three areas:

  1. Input/Output Guardrails -- all 7 guardrail features are missing (the entire safety layer)
  2. Developer Platform -- 3 of 4 features are missing (prompt management, batch inference, playground)
  3. Infrastructure -- 2 of 3 features are missing (multi-region, data residency)


Fully Implemented Features (30)

These features exist in the current system with endpoints and functionality that match the conceptual model requirements.

| # | Feature | Evidence |
| --- | --- | --- |
| 1.1 | API Key Authentication | Fernet AES-128 encryption + SHA-256 HMAC hashing. Full auth chain: get_api_key() -> validate_api_key_security() -> get_user(). Key creation, listing, rotation, deletion endpoints. |
| 1.2 | Role-Based Access Control (RBAC) | Admin role enforcement via require_admin. Role CRUD: GET/POST /admin/roles/*. Audit log for role changes. Roles: admin, user (team/dev/free tiers via plans). |
| 1.3 | Per-Key IP Allowlists | Full CRUD on /api/admin/ip-whitelist. CIDR range support via ipaddress.ip_network(). Validated in auth pipeline via validate_api_key_security(). |
| 1.4 | Domain Restrictions | Domain referrer validation in auth pipeline. Stored per API key in domain_referrers field. Checked during validate_api_key_security(). |
| 1.5 | Three-Layer Rate Limiting | Layer 1: IP-level SecurityMiddleware with velocity mode (25% error threshold, 3-min window). Layer 2: Redis-backed API key limits (INCR rate_limit:{api_key_id}:{minute}). Layer 3: Anonymous IP-hash limits. In-memory fallback when Redis unavailable. |
| 2.1 | Model Resolution Pipeline | Alias normalization (normalize_model_string()), provider detection, model ID transformation (model_transformations.py). 120+ aliases. Router prefix detection (router:general:*, router:code:*). |
| 2.2 | Intelligent Routing -- General Router | NotDiamond integration. 4 modes: balanced, quality, cost, latency. Fallback models per mode. 5 endpoints: settings, models, fallback-models, stats, test. |
| 2.3 | Intelligent Routing -- Code Router | SWE-bench + HumanEval benchmarks via code_quality_priors.json. 4 modes: auto, price, quality, agentic. 4 model tiers. 5 endpoints. |
| 2.4 | Provider Failover | build_provider_failover_chain() with 14+ providers. Circuit breaker-aware (skips OPEN providers). Model-aware rules (OpenAI->OpenRouter, Anthropic->OpenRouter, open-source->all). Triggers on 401-404, 502-504. |
| 2.5 | Circuit Breakers | CLOSED/OPEN/HALF_OPEN states. Redis-backed (circuit_breaker:{provider}:*, 3600s TTL). Config: 5 failures->OPEN, 300s recovery, 3 successes->CLOSED. 4 management endpoints. Prometheus metrics on state transitions. |
| 2.6 | Health-Weighted Load Balancing | Health scores 0-100 per provider. "Health-based provider selection" as step 10 in inference pipeline. Provider health checked before routing. |
| 3.1 | Tiered Health Monitoring | Multiple check types: Quick (sub-ms), Standard (DB+Redis), Railway (comprehensive), System (memory/CPU). Background monitoring with start/stop. Per-provider and per-model health. 30+ health endpoints. |
| 3.2 | Passive Health Capture | Confirmed in inference pipeline: "Background post-processing after stream completes: credit deduction, activity log, chat history save, health capture." Zero overhead on request path. |
| 3.3 | Incident Management | Downtime tracking: incident CRUD, severity levels, Loki log capture (30s timeout, 10K entries), resolution with notes, MTTR calculation, error pattern analysis. 8 admin endpoints. |
| 3.6 | Provider Credit Monitoring | GET /api/provider-credits/balance (all providers), GET /api/provider-credits/balance/{provider}. 15-min in-memory cache. Status thresholds: critical <= $5, warning <= $20. Currently OpenRouter only. |
| 4.4 | Supporting Caches | Auth cache (5-min TTL, 512 entries), catalog L1 (in-process, 5-min), catalog L2 (Redis, 15-30 min), rate limit LRU cache, health cache (6-min), HuggingFace cache, user lookup cache. Graceful degradation: Redis down -> local memory. |
| 5.1 | Background Model Sync | 12 admin endpoints: trigger, all, full, incremental, providers-only, per-provider, reset-and-resync, flush. 33 syncable providers. Stores to models_catalog DB table. |
| 5.2 | Model Metadata Standard | Standardized fields: id, name, provider_slug, context_length, pricing, source_gateway. Model health status. Catalog CRUD endpoints. |
| 5.4 | HuggingFace Enrichment | 6 endpoints: discovery, search, author models, model details (downloads, likes, parameters), model card, file listing. Redis-cached. Admin cache management. |
| 5.5 | Model Discovery & Search | Full-text search, filtering (provider, gateway, modality), trending models, low-latency models, model comparison, batch-compare, developer views, deduplicated unique view, rankings leaderboard. 50+ endpoints. |
| 6.1 | Credit System | Token-based billing: (prompt_tokens x prompt_price) + (completion_tokens x completion_price). Pre-flight credit check. Post-inference deduction. Refund endpoint. Transaction types: trial, purchase, api_usage, admin_credit, refund, bonus. Daily usage cap. |
| 6.2 | Plans & Tiers | Trial (3 days, $5), subscription plans via Stripe. Plan entitlements, usage vs limits. Upgrade/downgrade/cancel. Partner trials (Redbeard: 14-day Pro, $100 credits). |
| 6.3 | Customer Usage Analytics | GET /user/activity/stats (requests, tokens, spend by date/model/provider), GET /user/activity/log (paginated logs), API key usage, environment usage breakdown, credit transactions. |
| 8.1 | Internal Metrics & Dashboards | Prometheus /metrics with OpenMetrics exemplar support. Grafana SimpleJSON datasource (6 endpoints). Parsed JSON metrics with P50/P95/P99. 40+ observability endpoints. Anomaly detection. |
| 8.2 | Distributed Tracing | OpenTelemetry with Tempo. TraceContextMiddleware. Exemplar linking (metrics->traces). Status, config, test-trace endpoints. Loki integration for logs. |
| 8.3 | Error Tracking | Sentry (AutoSentryMiddleware, 50% admin sampling). Loki log ingestion. Error classification pipeline (7 categories). AI fix generation via Claude (claude-3-5-sonnet). 13 error monitoring endpoints. |
| 8.6 | Customer-Facing Observability | User activity stats/logs. Public status page (9 endpoints: status, providers, models, incidents, uptime, search). API key audit logs. Model health visibility. |
| 9.1 | OpenAI-Compatible API | POST /v1/chat/completions -- full drop-in replacement. Streaming SSE, tool/function calling, JSON mode, logprobs. All standard parameters. 30+ provider routing. |
| 9.2 | Anthropic-Compatible API | POST /v1/messages -- drop-in Claude compatibility. Same routing/billing pipeline. |
| 10.3 | Multi-Target Deployment | Vercel serverless (api/index.py), Railway/Docker (start.sh), dev (src/main.py). Railway-specific health endpoint. |
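
To make the circuit breaker configuration in row 2.5 concrete (5 failures -> OPEN, 300s recovery, 3 successes -> CLOSED), here is a minimal in-memory sketch of the three-state machine. It is not the Gatewayz code: the class and method names are hypothetical, the Redis-backed state is omitted, and counting consecutive failures (rather than failures in a window) is an assumption.

```python
import time


class CircuitBreaker:
    """Illustrative CLOSED/OPEN/HALF_OPEN breaker (hypothetical names)."""

    def __init__(self, failure_threshold=5, recovery_seconds=300,
                 success_threshold=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self):
        # Once the recovery window has elapsed, let probe traffic through.
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.recovery_seconds:
            self.state = "HALF_OPEN"
            self.successes = 0
        return self.state != "OPEN"

    def record_failure(self):
        self.failures += 1
        # A failed probe, or too many failures while closed, opens the breaker.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
            self.failures = 0

    def record_success(self):
        if self.state == "HALF_OPEN":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "CLOSED"
                self.failures = 0
        else:
            self.failures = 0  # assumption: only consecutive failures count
```

A failover chain that is "circuit breaker-aware" would call `allow_request()` for each candidate provider and skip those that return `False`.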

Additional implemented features not in the conceptual model: Coupons, referrals, chat history/sessions/sharing, feedback system, Nosana GPU computing, partner trials, server-side tools (web search, TTS), image generation, audio transcription, admin dashboard, analytics event forwarding (Statsig/PostHog).


Partially Implemented Features (7)

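These features have some implementation but are missing key aspects required by the conceptual model. Seven are detailed here; the Gap Analysis by Layer table also counts one partial feature in the Developer Platform (model comparison).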


2.7 Latency-Optimal Selection

Conceptual requirement: For models available on multiple providers, dynamically route to the provider with the lowest current P50 latency.

What exists:

  • General Router "latency" mode exists, but routes to a single hardcoded model (groq/llama-3.3-70b-versatile), not dynamically to the lowest-latency provider
  • Latency monitoring endpoints exist (GET /api/monitoring/latency-trends/{provider}, GET /api/monitoring/latency/{provider}/{model})
  • Provider timing diagnostics exist (GET /api/diagnostics/provider-timing)

What's missing:

  • Dynamic per-request P50 latency comparison across providers serving the same model
  • Real-time latency-based provider ranking in the failover chain
  • Latency data feeding back into the routing decision for arbitrary models (not just the router modes)
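
The missing P50 comparison could look roughly like the sketch below. It assumes recent latency samples are already available per provider for a given model; the function name and the sample data are hypothetical.

```python
from statistics import median


def pick_lowest_p50(latencies_ms):
    """Return the provider with the lowest median (P50) latency.

    latencies_ms maps a provider name to recent latency samples (ms)
    for one model. Providers with no samples are skipped.
    """
    p50 = {provider: median(samples)
           for provider, samples in latencies_ms.items() if samples}
    if not p50:
        return None
    return min(p50, key=p50.get)


samples = {
    "provider_a": [420, 380, 510],
    "provider_b": [300, 900, 310],  # one slow outlier does not move the median much
    "provider_c": [],               # no data yet, so not eligible
}
print(pick_lowest_p50(samples))
```

Using the median rather than the mean keeps a single slow request from disqualifying an otherwise fast provider.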

2.8 Cost-Optimal Selection

Conceptual requirement: When the user requests cost optimization, dynamically select the cheapest provider serving the requested model while meeting minimum quality and latency thresholds.

What exists:

  • General Router "cost" mode exists, but maps to a single hardcoded model (openai/gpt-4o-mini)
  • Code Router "price" mode exists with static tier assignments
  • Pricing data per model per provider exists in the catalog
  • Cost analysis endpoints exist (GET /api/monitoring/cost-analysis)

What's missing:

  • Dynamic cost comparison across providers serving the same model at request time
  • Quality/latency threshold enforcement when selecting the cheapest option
  • Per-request cost-optimal provider selection (current implementation selects a cheap model, not the cheapest provider for a given model)
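
A minimal sketch of per-request cost-optimal selection with threshold enforcement, under the assumption that price, quality, and latency are known per provider for the requested model. The `Offer` type, its field names, and the units are illustrative, not part of the current catalog schema.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    provider: str
    price_per_1k_tokens: float  # USD (hypothetical unit)
    quality: float              # 0-100 score (hypothetical scale)
    p50_latency_ms: float


def cheapest_offer(offers, min_quality, max_latency_ms):
    """Pick the cheapest offer that meets both thresholds, or None."""
    eligible = [
        o for o in offers
        if o.quality >= min_quality and o.p50_latency_ms <= max_latency_ms
    ]
    return min(eligible, key=lambda o: o.price_per_1k_tokens, default=None)


offers = [
    Offer("provider_a", 0.50, 90, 400),
    Offer("provider_b", 0.20, 60, 300),  # cheapest, but below the quality floor
    Offer("provider_c", 0.30, 85, 350),
]
best = cheapest_offer(offers, min_quality=80, max_latency_ms=500)
```

Filtering before taking the minimum is what separates this from the current behavior of mapping "cost" mode to one fixed cheap model.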

3.4 Model Quality Scoring & Benchmarks

Conceptual requirement: Maintain quality scores for every model from standardized benchmarks (MMLU, HumanEval, MATH, MT-Bench, LMSYS Arena ELO, LiveBench, SWE-bench) blended with real-time signals.

What exists:

  • SWE-bench and HumanEval benchmarks used in Code Router via code_quality_priors.json
  • Model rankings/leaderboard endpoint (GET /ranking/models)
  • Model health tracking (success rate, latency)

What's missing:

  • MMLU, MATH, MT-Bench, LMSYS Arena ELO, LiveBench benchmark integration
  • Task-specific quality priors (code, reasoning, creative writing, summarization, translation, etc.) for all models, not just coding models
  • Dynamic blending of static benchmarks with real-time signals (success rate, retry rate, format compliance)
  • Quality scores exposed in model catalog metadata for all models
  • Periodic refresh of benchmark data (it is currently static: loaded once at startup and never refreshed)
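
One possible shape for the missing blending step is a weighted average of a static benchmark score and a live score built from the real-time signals named above. The equal weighting of the three signals and the 0.6 benchmark weight are assumptions chosen for illustration only.

```python
def blended_quality(benchmark_score, success_rate, retry_rate,
                    format_compliance, benchmark_weight=0.6):
    """Blend a static benchmark score (0-100) with live signals (each 0-1).

    The live score averages success rate, (1 - retry rate), and format
    compliance, then scales to 0-100. All weights are illustrative.
    """
    live = 100 * (success_rate + (1 - retry_rate) + format_compliance) / 3
    return benchmark_weight * benchmark_score + (1 - benchmark_weight) * live


# A model with strong benchmarks but a poor live success rate scores lower.
healthy = blended_quality(80, success_rate=1.0, retry_rate=0.0, format_compliance=1.0)
degraded = blended_quality(80, success_rate=0.7, retry_rate=0.2, format_compliance=1.0)
```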

5.3 Catalog Inclusion Requirements

Conceptual requirement: Enforce quality gates: resolvable pricing required, active provider required, valid modality required, no duplicates.

What exists:

  • Model activation/deactivation endpoints (POST /catalog/models-db/{model_id}/activate|deactivate)
  • Health status filtering (GET /catalog/models-db/health/{health_status})
  • Pricing data in catalog
  • Deduplicated view (GET /v1/models/unique)

What's missing:

  • Automated gating that rejects models without resolvable pricing at sync time
  • Automated validation that a model's provider is active and reachable before inclusion
  • Modality validation gates
  • Documentation/enforcement of the "high-value model protection" rule (blocking premium models if pricing falls through to defaults) as an automated catalog gate rather than a runtime check
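
The four gates in the conceptual requirement could be checked at sync time along these lines. The `id`, `provider_slug`, and `pricing` field names come from the metadata standard in feature 5.2; the `modality` field, the `prompt`/`completion` pricing keys, and the modality set are assumptions.

```python
VALID_MODALITIES = {"text", "image", "audio", "embedding"}  # hypothetical set


def inclusion_errors(model, active_providers, seen_ids):
    """Return the reasons a model fails the catalog gates (empty list = pass)."""
    errors = []
    pricing = model.get("pricing") or {}
    if pricing.get("prompt") is None or pricing.get("completion") is None:
        errors.append("unresolved pricing")
    if model.get("provider_slug") not in active_providers:
        errors.append("inactive provider")
    if model.get("modality") not in VALID_MODALITIES:
        errors.append("invalid modality")
    if model.get("id") in seen_ids:
        errors.append("duplicate id")
    return errors


good = {"id": "m1", "provider_slug": "openrouter", "modality": "text",
        "pricing": {"prompt": 0.001, "completion": 0.002}}
bad = {"id": "m1", "provider_slug": "gone", "modality": "video", "pricing": {}}
```

Returning every failed gate, rather than stopping at the first, gives a sync job enough detail to log why a model was rejected.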

6.5 SLA Tracking

Conceptual requirement: Uptime tracking per provider/model/tier, SLA breach alerting, automated credit-back on violations.

What exists:

  • Uptime tracking endpoints: GET /health/providers/uptime, GET /health/models/uptime, GET /v1/status/uptime/{provider}/{model_id}
  • Health alerting service (health_alerting.py)
  • Incident management with resolution tracking

What's missing:

  • Per-customer-tier SLA definitions (different SLAs for Team vs Enterprise)
  • SLA violation detection (P99 latency or error rate exceeding threshold)
  • Automated credit-back compensation when SLA is breached
  • Customer-visible SLA compliance reporting

8.4 AI-Specific Tracing

Conceptual requirement: LLM-specific observability via Arize Phoenix and Braintrust for prompt/response pairs, token usage, quality scoring.

What exists:

  • Arize config file (src/config/arize) referenced in CLAUDE.md
  • OpenTelemetry tracing captures inference requests
  • Token usage tracking per request in activity logs

What's missing:

  • Arize Phoenix endpoints or dashboard integration exposed in the API
  • Braintrust integration
  • Prompt/response pair recording for quality analysis (only metadata is captured)
  • Model performance comparison via AI-specific tracing tools

8.5 Profiling

Conceptual requirement: Continuous CPU and memory profiling of hot paths via Pyroscope with operation-tagged data.

What exists:

  • Pyroscope instrumentation for Redis/cache layers (recent commit: feat(profiling): instrument all Redis/cache layers with Pyroscope tags)
  • Operation context tags applied to profiling data

What's missing:

  • Profiling-specific API endpoints
  • Profiling coverage beyond cache/Redis operations; auth, routing, and provider call paths are not instrumented
  • A way to view or query profiling data from within the Gatewayz system (currently requires the external Pyroscope UI)

Not Implemented Features (18)

These features have no evidence of implementation in the current system.


1.6 Input Guardrails -- PII Detection

Conceptual requirement: Scan prompts for PII (phone numbers, SSNs, emails, credit cards) before sending to providers. Optionally redact or block.

Current state: A sanitize_pii_for_logging() utility exists in security_validators.py that masks PII in log output, but this is for internal logging only -- it does not scan or modify user prompts before they are sent to providers. No prompt-level PII detection exists.

Gap: The entire input PII scanning and redaction pipeline is missing.
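
A prompt-level scan could start from simple pattern matching, as in this sketch. The regular expressions are deliberately rough and US-centric and would produce both false positives and misses; a production pipeline would likely need a dedicated PII library. The function and pattern names are hypothetical and unrelated to sanitize_pii_for_logging().

```python
import re

# Illustrative patterns only; not exhaustive and not locale-aware.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact_pii(prompt):
    """Replace matches with a [LABEL] placeholder; return (text, labels found)."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        prompt, count = pattern.subn(f"[{label.upper()}]", prompt)
        if count:
            found.append(label)
    return prompt, found


text, labels = redact_pii("Mail jane@example.com, SSN 123-45-6789.")
```

The returned label list supports the "optionally redact or block" requirement: a per-key policy could forward the redacted text or reject the request when `labels` is non-empty.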


1.7 Input Guardrails -- Prompt Injection Defense

Conceptual requirement: Detect and block known prompt injection patterns before they reach providers.

Current state: The codebase contains SQL injection and log injection prevention in security_validators.py, but no prompt injection detection for LLM inputs.

Gap: No prompt injection pattern library, no scanning of user messages, no blocking capability.


1.8 Input Guardrails -- Topic Restrictions

Conceptual requirement: Per-API-key configuration to restrict models to specific content domains.

Current state: Domain restrictions exist for HTTP referrers (network-level), but there is no content-level topic restriction on prompts.

Gap: No content classification, no per-key topic policy configuration, no prompt-level topic enforcement.


1.9 Input Guardrails -- Content Moderation

Conceptual requirement: Integration with moderation classifiers to block harmful inputs.

Current state: Some content safety references exist in provider clients (Google Vertex, Cloudflare Workers) for handling provider-side safety responses, but no pre-dispatch input moderation exists.

Gap: No moderation classifier integration, no harmful input detection, no pre-provider content scanning.


1.10 Output Guardrails -- Content Filtering

Conceptual requirement: Scan model responses for policy violations before returning to the customer.

Current state: No output content filtering exists. Responses are passed through from providers without content scanning.

Gap: No response scanning, no policy violation detection, no output blocking.


1.11 Output Guardrails -- Structured Output Validation

Conceptual requirement: Validate model responses conform to requested JSON schema before returning.

Current state: response_format parameter is passed through to providers (documented in chat endpoints), and model capabilities are tracked in model_capabilities.json. But no gateway-side validation of the response occurs.

Gap: No schema validation of model output at the gateway level. Validation is delegated entirely to the provider.
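
Gateway-side validation could run on the provider's response before it is returned. The sketch below uses only the standard library and supports a small subset of JSON Schema ("required" plus per-property "type" on a flat object); a real implementation would more likely use a full validator. All names are hypothetical.

```python
import json

_TYPES = {"string": str, "number": (int, float), "integer": int,
          "boolean": bool, "object": dict, "array": list}


def _matches(value, type_name):
    if type_name in ("number", "integer") and isinstance(value, bool):
        return False  # bool is a subclass of int in Python
    return isinstance(value, _TYPES[type_name])


def validate_output(raw, schema):
    """Return a list of problems with a model response (empty list = valid)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = [f"missing required field: {key}"
                for key in schema.get("required", []) if key not in data]
    for key, spec in schema.get("properties", {}).items():
        type_name = spec.get("type")
        if key in data and type_name in _TYPES and not _matches(data[key], type_name):
            problems.append(f"wrong type for field: {key}")
    return problems


schema = {"required": ["name", "age"],
          "properties": {"name": {"type": "string"}, "age": {"type": "integer"}}}
```

A non-empty result could trigger a retry or a structured error instead of passing malformed output through to the customer.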


1.12 Output Guardrails -- Hallucination Flags

Conceptual requirement: Standardize provider safety metadata (refusals, content flags) into one consistent format.

Current state: Minimal extraction of reasoning_content from Anthropic thinking models in anthropic_transformer.py. Safety metadata from different providers is not normalized.

Gap: No standardized safety metadata schema. Provider-specific refusal formats are not translated into a common format.


2.9 Traffic Splitting

Conceptual requirement: Distribute inference load across providers for the same model (e.g., 70/30 split) to prevent over-reliance and gather performance data.

Current state: The failover chain is ordered by priority, but requests always go to the primary provider first. No weighted distribution exists.

Gap: No traffic splitting logic, no configurable split ratios, no multi-provider load distribution.
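
A weighted split such as 70/30 can be implemented with weighted random selection per request, as in this sketch. The provider names and the `split` mapping are illustrative; how such ratios would be configured in Gatewayz is not specified in the source.

```python
import random


def pick_provider(split, rng=random):
    """Choose a provider at random in proportion to its configured weight.

    split maps provider name -> relative weight, e.g. {"a": 70, "b": 30}.
    """
    providers = list(split)
    weights = list(split.values())
    return rng.choices(providers, weights=weights, k=1)[0]


split = {"provider_a": 70, "provider_b": 30}
rng = random.Random(0)  # seeded only to make the example repeatable
picks = [pick_provider(split, rng) for _ in range(10_000)]
share_a = picks.count("provider_a") / len(picks)
```

Over many requests `share_a` converges on 0.7, while each individual request still lands on exactly one provider and can fall back to the failover chain on error.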


3.5 Per-Customer Quality Tracking

Conceptual requirement: Track whether a model performs well for a specific customer's use case over time, enabling personalized routing recommendations.

Current state: Activity logs track per-user usage (model, tokens, cost), but there is no quality signal tracking per customer per model, and no personalized routing.

Gap: No per-customer success rate tracking, no per-customer model preference learning, no personalized routing recommendations.


4.1 Semantic Cache

Conceptual requirement: Cache inference responses and match against semantically similar prompts using vector similarity (cosine > 0.95).

Current state: No vector database, no embedding generation, no similarity matching. All caching is exact-match or metadata-level.

Gap: Entire semantic caching subsystem is missing -- embeddings, vector storage, similarity search, cache lookup integration in inference pipeline.
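
The lookup at the heart of a semantic cache is a similarity search over stored prompt embeddings with the cosine > 0.95 cutoff. The sketch below assumes embeddings are produced elsewhere (the embedding model is out of scope) and uses a linear scan for clarity where a real system would use a vector index. Class and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # (embedding, response) pairs

    def put(self, embedding, response):
        self.entries.append((embedding, response))

    def get(self, embedding):
        """Return the response of the most similar entry above the threshold."""
        best = max(self.entries, key=lambda e: cosine(embedding, e[0]), default=None)
        if best is not None and cosine(embedding, best[0]) > self.threshold:
            return best[1]
        return None


cache = SemanticCache()
cache.put([1.0, 0.0], "cached answer")
```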


4.2 Exact-Match Response Cache

Conceptual requirement: Cache inference responses keyed by SHA-256 hash of request (messages + model + params). 20K entries, 60-min TTL, LRU eviction.

Current state: Caching exists for metadata (models, providers, health, users), but no response-level caching for inference results.

Gap: No inference response cache. Every inference request hits a provider regardless of whether the identical request was made recently.
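
The requirement (SHA-256 key over messages + model + params, 20K entries, 60-min TTL, LRU eviction) maps onto a small in-process structure, sketched here. The canonical-JSON key format and all names are assumptions; a shared deployment would presumably back this with Redis rather than process memory.

```python
import hashlib
import json
import time
from collections import OrderedDict


def cache_key(model, messages, params):
    """SHA-256 over a canonical JSON encoding of the request."""
    payload = json.dumps({"model": model, "messages": messages, "params": params},
                         sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    def __init__(self, max_entries=20_000, ttl_seconds=3600, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store = OrderedDict()  # key -> (expires_at, response)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        expires_at, response = item
        if self.clock() >= expires_at:
            del self._store[key]  # expired
            return None
        self._store.move_to_end(key)  # mark as most recently used
        return response

    def put(self, key, response):
        self._store[key] = (self.clock() + self.ttl_seconds, response)
        self._store.move_to_end(key)
        while len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used
```

Sorting keys before hashing means two requests that differ only in parameter order share one cache entry.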


4.3 External Cache (Butter.dev)

Conceptual requirement: Third-party shared LLM response caching proxy with sub-100ms cache hits.

Current state: User cache settings endpoints exist (GET/PUT /user/cache-settings referencing "Butter.dev cache preference"), indicating the UI/config layer exists, but no actual Butter.dev cache integration is present in the inference pipeline.

Gap: Config layer exists but the actual cache proxy integration in the request path is missing.


6.4 Customer Webhooks

Conceptual requirement: Programmatic event notifications (credits.low, credits.depleted, model.degraded, rate_limit.approaching, batch.completed) with HMAC-signed payloads, retry logic, and delivery log.

Current state: Stripe payment webhooks exist (incoming from Stripe to Gatewayz). User notification preferences and email notifications exist. But no outbound customer webhooks (Gatewayz to customer's URL) are implemented.

Gap: No webhook delivery system, no event subscription management, no HMAC signing, no delivery log, no retry logic for outbound webhooks.
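
HMAC signing of outbound payloads could follow the common pattern below: serialize the event, sign the exact bytes with a per-customer secret, and let the receiver recompute and compare. The payload fields are hypothetical; only the event name comes from the requirement above.

```python
import hashlib
import hmac
import json


def sign_payload(secret, payload):
    """Serialize the event and return (body bytes, hex HMAC-SHA256 signature)."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body, signature


def verify_signature(secret, body, signature):
    """Receiver side: recompute the signature and compare in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


secret = b"per-customer-secret"  # placeholder value
body, signature = sign_payload(secret, {"event": "credits.low", "remaining": 4.2})
```

The signature must be computed over the bytes actually sent, which is why `sign_payload` returns the serialized body alongside it.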


7.1 Prompt Management

Conceptual requirement: Template library with versioning, template variables, A/B testing, per-key prompt defaults.

Current state: No prompt management system exists. Prompts are sent ad-hoc per request.

Gap: Entire feature missing -- no templates, no versioning, no variables, no A/B testing, no per-key defaults.


7.2 Batch / Async Inference

Conceptual requirement: Submit prompt lists for async processing at reduced cost, with status polling and webhook completion notification.

Current state: The Features page explicitly states: "Does not provide batch/async inference (all requests are synchronous or streaming)."

Gap: Entire feature missing -- no batch job API, no async processing, no off-peak scheduling, no results download.


7.4 Playground

Conceptual requirement: Interactive web UI for testing prompts against any model.

Current state: The backend does not serve any web UI. No playground-related routes or services exist.

Gap: Entire feature missing. This would likely be a frontend application, not a backend feature, but requires backend API support for model listing, inference, and parameter exploration.


10.1 Multi-Region Routing

Conceptual requirement: Geo-aware provider selection to route to the nearest provider region.

Current state: Some providers have regional support (Google Vertex, Alibaba Cloud), but there is no gateway-level geo-routing logic that considers user location.

Gap: No user geolocation detection, no region-aware provider ranking, no latency-based geographic optimization.


10.2 Data Residency

Conceptual requirement: Route EU customers' requests to EU-based providers for GDPR compliance.

Current state: No data residency enforcement exists.

Gap: No user region classification, no EU-only provider routing, no data residency policy enforcement, no compliance documentation.


Gap Analysis by Layer

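| Layer | Total Features | Implemented | Partial | Missing | Completion |
| --- | --- | --- | --- | --- | --- |
| Ingress Layer | 12 | 5 | 0 | 7 | 42% |
| Core Routing Engine | 9 | 6 | 2 | 1 | 78% |
| Intelligence Layer | 6 | 4 | 1 | 1 | 75% |
| Caching System | 4 | 1 | 0 | 3 | 25% |
| Model Catalog | 5 | 4 | 1 | 0 | 90% |
| Business Layer | 5 | 3 | 1 | 1 | 70% |
| Developer Platform | 4 | 0 | 1 | 3 | 12% |
| Observability | 6 | 4 | 2 | 0 | 83% |
| API Compatibility | 2 | 2 | 0 | 0 | 100% |
| Infrastructure | 3 | 1 | 0 | 2 | 33% |
| Total | 56 | 30 | 8 | 18 | 61% |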
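
The per-layer Completion values are consistent with counting each partial feature at half weight. That formula is inferred from the rows rather than stated in the source, so treat this check as a sketch.

```python
def completion_pct(implemented, partial, total):
    """Completion with partial features counted at half weight (inferred)."""
    # round() uses round-half-to-even, so 12.5 becomes 12, matching the table.
    return round(100 * (implemented + 0.5 * partial) / total)


# layer -> (total, implemented, partial), copied from the table rows
layers = {
    "Ingress Layer": (12, 5, 0),
    "Core Routing Engine": (9, 6, 2),
    "Intelligence Layer": (6, 4, 1),
    "Caching System": (4, 1, 0),
    "Model Catalog": (5, 4, 1),
    "Business Layer": (5, 3, 1),
    "Developer Platform": (4, 0, 1),
    "Observability": (6, 4, 2),
    "API Compatibility": (2, 2, 0),
    "Infrastructure": (3, 1, 0),
}

for name, (total, implemented, partial) in layers.items():
    print(f"{name}: {completion_pct(implemented, partial, total)}%")
```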

Strongest Areas (>75% complete)

  • API Compatibility (100%) -- Both OpenAI and Anthropic APIs fully implemented
  • Model Catalog (90%) -- Discovery, sync, enrichment, search all working
  • Observability (83%) -- Prometheus, Grafana, OpenTelemetry, Sentry, Loki, error monitoring all operational
  • Core Routing Engine (78%) -- Model resolution, failover, circuit breakers, intelligent routing all working

Weakest Areas (<50% complete)

  • Developer Platform (12%) -- Only model comparison exists; prompt management, batch inference, and playground are all missing
  • Caching System (25%) -- Supporting caches work, but all three inference response caching layers (semantic, exact-match, Butter.dev) are missing
  • Infrastructure (33%) -- Only multi-target deployment exists; multi-region and data residency are missing
  • Ingress Layer (42%) -- Auth and rate limiting are strong, but all 7 guardrail features (PII, injection, topic, moderation, content filtering, schema validation, hallucination flags) are missing

Priority Recommendations

High Priority (Revenue/Safety Impact)

  1. Inference Response Caching (4.2) -- Implementing exact-match response caching would immediately reduce provider costs and improve latency for repeated queries. Low complexity, high ROI.

  2. Customer Webhooks (6.4) -- Required for enterprise customers who need programmatic notifications for credits, health events, and rate limits. Blocks enterprise adoption.

  3. Input Content Moderation (1.9) -- Prevents platform misuse and is often required for enterprise compliance. Integration with an external moderation API (e.g., OpenAI moderation endpoint) would be relatively straightforward.

Medium Priority (Competitive Differentiation)

  1. Batch / Async Inference (7.2) -- Common requirement for data processing workloads. Competitors (OpenAI, Anthropic) offer batch APIs at a 50% discount.

  2. Dynamic Cost-Optimal Selection (2.8) -- Completing this would differentiate from competitors by automatically finding the cheapest provider per model per request.

  3. Dynamic Latency-Optimal Selection (2.7) -- Same differentiation for latency-sensitive workloads.

  4. Prompt Management (7.1) -- High-value developer experience feature. Template versioning and A/B testing are table-stakes for production AI deployments.

Lower Priority (Future Growth)

  1. Semantic Cache (4.1) -- High complexity (requires embedding infrastructure) but significant cost reduction potential.

  2. Guardrails Suite (1.6-1.8, 1.10-1.12) -- Important for enterprise but can be initially addressed by recommending customers use provider-side safety features.

  3. Multi-Region / Data Residency (10.1, 10.2) -- Required for EU enterprise customers but can be deferred until geographic expansion demands it.


Source: Conceptual Model Features | Features (Current Implementation) | API Mappings
