-
Notifications
You must be signed in to change notification settings - Fork 1
Conceptual Model Features
Reading path: Conceptual Model | Stability Definition | Conceptual Model Features (you are here) | Features | Delta Report | Features-Acceptance-Criteria
Read after: Stability Definition (so you know what "stable" means) Next: Features (to see what's actually built today)
TL;DR — This is the full spec. 56 features across 10 layers that the Conceptual Model says the system must have. Each feature includes what it should do AND what it explicitly should NOT do (boundaries). The layers: Ingress (12 features including 7 guardrails), Core Routing (9), Intelligence (6), Caching (4), Model Catalog (5), Business (5), Developer Platform (4), Observability (6), API Compatibility (2), Infrastructure (3). Use this as a checklist — compare it against Features and Delta Report to see what's done vs missing.
Last Updated: 2026-03-04
-
Ingress Layer
- 1.1 API Key Authentication
- 1.2 Role-Based Access Control (RBAC)
- 1.3 Per-Key IP Allowlists
- 1.4 Domain Restrictions
- 1.5 Three-Layer Rate Limiting
- 1.6 Input Guardrails — PII Detection
- 1.7 Input Guardrails — Prompt Injection Defense
- 1.8 Input Guardrails — Topic Restrictions
- 1.9 Input Guardrails — Content Moderation
- 1.10 Output Guardrails — Content Filtering
- 1.11 Output Guardrails — Structured Output Validation
- 1.12 Output Guardrails — Hallucination Flags
- Core Routing Engine
- Intelligence Layer
- Caching System
- Model Catalog
- Business Layer
- Developer Platform
- Observability
- API Compatibility
- Infrastructure & Deployment
The ingress layer is the security and quality boundary. Every request passes through it before reaching any business logic. Its job is to authenticate, authorize, rate-limit, and validate requests -- and optionally apply safety guardrails on both inputs and outputs.
What it does: Authenticates every API request using API keys that are encrypted at rest with AES-128 Fernet encryption. Keys are looked up via HMAC-SHA256 hashing for fast retrieval without needing to decrypt every key in the database. The system validates that the key is active, not expired, and not rate-limited before allowing the request to proceed.
What it does NOT do:
- Does not manage user identity or sessions (API keys are the sole auth mechanism for API requests; user identity management is a separate concern handled by the auth system)
- Does not support OAuth or JWT-based authentication for API requests (only API key Bearer tokens)
- Does not rotate keys automatically (key rotation is a manual user action)
- Does not support multi-key authentication (one key per request, not key combinations)
What it does: Assigns roles (admin, team, dev, free) to users, each with distinct permissions controlling what endpoints and operations they can access. Permissions are checked at the dependency-injection level before any route handler executes. Role changes are logged in an audit trail with reasons.
What it does NOT do:
- Does not provide granular resource-level permissions (e.g., "can only access model X" -- permissions are role-wide, not per-model or per-provider)
- Does not support custom roles (only the predefined set)
- Does not support team-level RBAC (roles are per-user, not per-team or per-organization)
- Does not enforce permissions at the provider level (all authenticated users with sufficient role can access all providers)
What it does: Allows users to restrict an API key so it can only be used from specific IP addresses or CIDR ranges. Requests from non-allowlisted IPs are rejected before any processing occurs. Supports both IPv4 single addresses and CIDR notation.
What it does NOT do:
- Does not provide geo-based restrictions (only IP-based; no country or region blocking)
- Does not automatically detect or suggest IPs to allowlist
- Does not support IPv6 range matching
- Does not apply allowlists retroactively to existing sessions or connections
What it does: Limits which HTTP referrer domains can use a specific API key. This prevents API keys embedded in frontend applications from being stolen and used on unauthorized domains.
What it does NOT do:
- Does not validate domain ownership (trusts the Referer/Origin header)
- Does not provide subdomain wildcard matching
- Does not block server-side usage (domain restrictions only apply when a Referer header is present)
What it does: Enforces rate limits at three distinct levels to protect the system from abuse:
- Layer 1 — IP-level: Network edge protection with behavioral analysis and velocity detection. Detects anomalous patterns (sudden traffic spikes) and activates velocity mode (temporarily halving rate limits system-wide when error rates exceed 25%).
- Layer 2 — API key-level: Redis-backed per-key limits (requests per minute, tokens per day/month) tied to the user's plan tier.
- Layer 3 — Anonymous: Separate, stricter limits for unauthenticated requests using IP-hash-based counters.
If Redis is unavailable, an in-memory fallback rate limiter activates. Requests are never blocked due to infrastructure failure.
What it does NOT do:
- Does not provide per-model or per-provider rate limits (limits are per-key and per-IP, not per-model)
- Does not support burst allowances or token bucket algorithms (uses sliding window counters)
- Does not coordinate rate limit state across multiple gateway instances in real-time (each instance maintains its own IP-level state; only API key limits are shared via Redis)
- Does not bill for rate-limited requests (rejected requests consume zero credits)
What it does: Scans prompts for personally identifiable information (phone numbers, SSNs, emails, credit card numbers) before sending them to external providers. Can be configured to redact the PII automatically or block the request entirely.
What it does NOT do:
- Does not detect all forms of PII across all languages and formats (pattern-based detection, not ML-based)
- Does not store or log detected PII (detection is ephemeral, in-request only)
- Does not apply PII detection to model responses (that's output guardrails)
- Does not provide HIPAA or SOC2 certified PII handling (detection is best-effort)
What it does: Detects and blocks known prompt injection patterns that attempt to override system prompts, extract hidden instructions, or manipulate model behavior. Applies pattern matching against a library of known injection techniques.
What it does NOT do:
- Does not guarantee protection against novel or sophisticated injection attacks (pattern-based, not adversarially trained)
- Does not modify or sanitize prompts (either blocks the request or allows it through)
- Does not apply to tool/function calling arguments (only scans the message content)
- Does not learn from new injection attempts automatically
What it does: Allows per-API-key configuration to restrict models to specific domains (e.g., "only answer customer support questions"). Requests outside the allowed topic domain are rejected before reaching any provider.
What it does NOT do:
- Does not understand nuanced topic boundaries (uses classifier-based detection, not deep semantic understanding)
- Does not apply topic restrictions to system prompts (only user messages)
- Does not provide topic restriction templates (configuration is custom per key)
- Does not modify the request to steer it back on-topic (rejects or allows, no rewriting)
What it does: Integrates with moderation classifiers to block harmful, illegal, or policy-violating inputs before they reach any AI provider. Prevents the platform from being used to generate harmful content.
What it does NOT do:
- Does not train or host its own moderation models (integrates with external classifiers)
- Does not apply different moderation policies per user or key (system-wide policy)
- Does not provide moderation explanations to the user (blocks with a generic rejection)
- Does not moderate in real-time during streaming (checks input before dispatch, not tokens as they arrive)
What it does: Scans model responses for policy violations, harmful content, or off-topic answers before returning them to the customer. Acts as a safety net for cases where the model produces inappropriate output despite clean input.
What it does NOT do:
- Does not rewrite or sanitize problematic responses (blocks the response and returns an error)
- Does not apply during streaming (full response must be available for analysis, which conflicts with SSE streaming)
- Does not provide configurable sensitivity levels per customer
- Does not cache filtered responses (each response is checked independently)
What it does:
When a customer requests JSON schema output (via response_format parameter), validates that the model's response conforms to the specified schema before returning it. Prevents malformed JSON from reaching the customer's application.
What it does NOT do:
- Does not fix or repair malformed JSON (validates and rejects only)
- Does not support non-JSON structured formats (XML, YAML, CSV)
- Does not retry with a corrective prompt if validation fails (returns error to customer)
- Does not validate semantic correctness (only structural/schema compliance)
What it does: Surfaces provider-side safety metadata (refusals, safety filter triggers, content flags) in a standardized format regardless of which provider generated the response. Normalizes the different safety signal formats from OpenAI, Anthropic, Google, etc. into one consistent schema.
What it does NOT do:
- Does not detect hallucinations independently (relies on provider-reported metadata)
- Does not verify factual accuracy of responses
- Does not block responses based on hallucination flags (surfaces the metadata, letting the customer decide)
- Does not provide confidence scores or uncertainty estimates
The core routing engine is the central nervous system of Gatewayz. Every inference request must be resolved to a specific provider and model ID, with fallback chains, load balancing, and intelligent model selection.
What it does: Translates any model identifier a user sends into a specific provider and that provider's native model ID format, through a three-stage pipeline:
-
Alias Normalization — Maps shorthand names to canonical model IDs (e.g.,
"r1"→"deepseek/deepseek-r1","gpt-4o"→"openai/gpt-4o"). Supports 120+ aliases. - Provider Detection — Determines which provider serves the model, following a strict priority: explicit overrides → format-based rules → mapping tables → org-prefix fallbacks.
-
Model ID Transformation — Translates the canonical ID to the provider's native naming format (e.g., Fireworks uses
accounts/fireworks/models/...).
What it does NOT do:
- Does not validate that the model is actually available at the resolved provider at request time (availability is checked separately by the failover system)
- Does not support user-defined aliases or custom model mappings
- Does not resolve model versions or snapshots (uses the provider's default/latest version)
- Does not handle multi-modal routing differently from text routing (same pipeline for all modalities)
What it does:
ML-powered model selection for general tasks using NotDiamond integration. When a user sends a request with router:general:<mode>, the system analyzes the prompt content and picks the optimal model. Four optimization modes: quality (best output), cost (cheapest capable model), latency (fastest response), balanced (tradeoff of all three).
Falls back to mode-specific default models when NotDiamond is unavailable (e.g., quality → openai/gpt-4o, cost → openai/gpt-4o-mini, latency → groq/llama-3.3-70b-versatile).
What it does NOT do:
- Does not learn from user feedback or usage patterns (relies on NotDiamond's analysis per request)
- Does not support custom model pools or user-defined routing rules
- Does not guarantee the selected model is the objectively best choice (heuristic-based)
- Does not support routing constraints (e.g., "only open-source models" or "only models with >128K context")
What it does:
Benchmark-driven model selection specifically for coding tasks. Classifies task complexity and matches requests to tiered models scored by SWE-bench and HumanEval benchmarks. Four modes: auto (complexity-based tier selection), price (cheapest capable model), quality (highest benchmark score), agentic (optimized for multi-step tool-using agents).
Models are organized into 4 tiers ranked by benchmark scores and pricing.
What it does NOT do:
- Does not execute or benchmark code itself (uses pre-computed static benchmark data)
- Does not learn or adapt from user feedback (static benchmark data, loaded once at startup)
- Does not support custom model pools or user-defined tiers
- Does not detect programming language to optimize model selection (analyzes task complexity, not language)
What it does: When a provider fails during an inference request, the system automatically retries with the next provider in a prioritized 14-provider failover chain. The user never sees the failure — the response comes back as if nothing went wrong.
Failover triggers on: 401, 402 (provider out of credits), 403, 404, 502, 503, 504 from the provider. Failover does NOT trigger on: 400 (user error — the same bad request would fail at every provider) or 429 (rate limit — retries with exponential backoff instead).
Model-aware routing rules apply: OpenAI models only failover to OpenAI → OpenRouter. Anthropic models only to Anthropic → OpenRouter. Open-source models can failover across all providers.
What it does NOT do:
- Does not retry on user-caused errors (4xx except 401-404)
- Does not failover mid-stream (if streaming has started and the provider fails partway, the stream is terminated)
- Does not guarantee the failover provider has the same pricing (cost may differ across providers)
- Does not track which provider ultimately served the request in the billing (charges based on the provider that succeeded)
- Does not support user-configured failover chains (the chain is system-defined)
What it does: Implements the circuit breaker pattern per provider to prevent cascading failures. Each provider's circuit breaker tracks consecutive success/failure counts and transitions between three states:
- CLOSED (normal): Requests flow through. Failures increment the counter.
- OPEN (blocking): After 5 consecutive failures, all requests to this provider are immediately rejected without attempting a call. Lasts for 5 minutes.
- HALF_OPEN (testing): After the cool-down period, one test request is allowed through. If it succeeds (3 consecutive successes needed), the breaker closes. If it fails, it reopens.
What it does NOT do:
- Does not auto-configure thresholds per provider (all providers use the same 5-failure/5-minute/3-success defaults)
- Does not consider error type (a 502 and a timeout count the same)
- Does not alert operators when a circuit opens (only emits Prometheus metrics)
- Does not persist state to the database (Redis + in-memory only; state is lost if both Redis and the process restart simultaneously)
What it does: Before attempting a request, checks the primary provider's health score. If the provider's uptime is below a threshold, a healthier provider is promoted to the front of the failover chain. This prevents routing to a provider that is technically up but performing poorly.
What it does NOT do:
- Does not split traffic proportionally by health score (binary decision: promote or don't)
- Does not consider per-model health (health is tracked at the provider level)
- Does not predict future health based on trends (uses current point-in-time health only)
What it does: For models available on multiple providers simultaneously, routes to the provider with the lowest current P50 latency. This ensures users get the fastest response for their chosen model without needing to know which provider is fastest.
What it does NOT do:
- Does not consider user's geographic location in latency calculations
- Does not account for queue depth or provider load (only historical latency measurements)
- Does not provide latency guarantees or SLAs per provider
- Does not optimize for time-to-first-token vs total response time separately
What it does: When the user requests cost optimization, selects the cheapest provider that serves the requested model and meets minimum quality and latency thresholds. Prevents cost optimization from degrading response quality below acceptable levels.
What it does NOT do:
- Does not negotiate prices with providers (uses published pricing)
- Does not consider volume discounts or committed-use contracts
- Does not factor in the user's remaining credit balance to optimize cost
- Does not support per-request cost caps ("don't spend more than $X on this request")
What it does: Distributes inference load across multiple providers for the same model (e.g., 70/30 split) to prevent over-reliance on any single provider and to continuously gather performance data from all available providers. This ensures the system always has fresh health and latency data for every provider.
What it does NOT do:
- Does not support user-configured split ratios
- Does not guarantee deterministic routing (the same request may go to different providers on retry)
- Does not split individual requests across providers (one request goes to one provider)
- Does not consider cost differences when splitting (split ratios are reliability-focused, not cost-focused)
The intelligence layer continuously monitors the health, quality, and cost of every model across every provider. It feeds data into the routing engine to make informed decisions.
What it does: Runs a continuous monitoring system that checks model health at different intervals based on usage importance:
| Tier | Coverage | Check Interval |
|---|---|---|
| Critical | Top 5% by usage | Every 5 minutes |
| Popular | Next 20% | Every 30 minutes |
| Standard | Remaining 75% | Every 2-4 hours |
| On-Demand | New/rare models | Only when requested |
Health checks verify that a model is responding, within latency bounds, and returning valid outputs.
What it does NOT do:
- Does not perform load testing or synthetic inference (health checks are lightweight probes, not full inference)
- Does not check response quality (only checks availability and latency)
- Does not support custom check intervals per customer
- Does not run health checks from multiple geographic regions (checks from the gateway's region only)
What it does: Every real inference request contributes health data as a background task — success/failure, latency, token throughput, provider response codes. This happens after the response is returned to the user, adding zero overhead to the request path. Over time, this creates a rich, real-world health picture from actual production traffic.
What it does NOT do:
- Does not capture prompt or response content (only metadata: latency, tokens, status)
- Does not attribute health data to specific customers (aggregated per model/provider)
- Does not trigger alerts on individual request failures (only on patterns/thresholds)
What it does: Automatically creates incidents when provider health degrades below thresholds. Incidents have severity levels (Critical, High, Medium, Low), timestamps, affected providers, captured logs, and resolution tracking. Supports manual resolution with notes and automatic MTTR (Mean Time To Recovery) calculation.
What it does NOT do:
- Does not automatically remediate incidents (detection and tracking only; resolution is manual)
- Does not send external notifications to customers about incidents (internal system only)
- Does not integrate with PagerDuty, OpsGenie, or other incident management platforms natively
- Does not perform root cause analysis (provides log capture and pattern analysis, but diagnosis is human-driven)
What it does: Maintains quality scores for every model in the catalog, drawn from standardized benchmarks (MMLU, HumanEval, MATH, MT-Bench, LMSYS Arena ELO, LiveBench, SWE-bench) and blended with real-time signals (success rate, retry rate, format compliance rate, average response time). Provides task-specific scores (code generation, reasoning, creative writing, summarization, translation, data extraction, Q&A) to help users and the routing engine make informed model selections.
What it does NOT do:
- Does not run its own benchmarks (consumes external benchmark data)
- Does not guarantee benchmark scores are current (scores may lag behind model updates)
- Does not factor in prompt-specific quality (scores are general, not per-prompt)
- Does not provide quality comparisons across modalities (text-to-text scores don't compare with text-to-image)
What it does: Tracks whether a model performs well for a specific customer's use case over time. By analyzing success rates, retry patterns, and feedback signals per customer per model, the system can provide personalized routing recommendations — suggesting models that work best for each customer's specific workload.
What it does NOT do:
- Does not access or analyze prompt/response content (only tracks outcome signals)
- Does not train custom models per customer
- Does not override explicit model selection by the customer (recommendations only, never forced routing)
- Does not share one customer's quality data with another
What it does: Continuously tracks upstream provider credit balances (e.g., OpenRouter, DeepInfra account balances). When a provider's credits are running low, preemptively deprioritizes it in the failover chain before it starts returning 402 errors — preventing customer-visible failures due to Gatewayz's own provider billing issues.
What it does NOT do:
- Does not auto-refill provider credits (alerts operators to take action)
- Does not expose provider credit data to customers
- Does not monitor all providers equally (currently limited to providers with balance-check APIs)
- Does not predict when credits will run out based on usage trends
A multi-layer caching architecture that minimizes latency, reduces costs, and never blocks a request if a cache layer fails. Every layer degrades gracefully.
What it does: Caches inference responses and matches them against semantically similar future prompts using vector similarity (cosine threshold > 0.95). This means "What's the capital of France?" and "Tell me France's capital city" would return the same cached response without hitting a provider.
What it does NOT do:
- Does not cache responses for prompts with high variability or creativity requirements
- Does not consider conversation context (only the current message, not the full history)
- Does not support cache invalidation by topic or content change
- Does not work for streaming responses (cache hit returns the full response immediately, bypassing SSE)
- Does not guarantee semantic equivalence (similarity threshold is a heuristic)
What it does: Caches inference responses keyed by a SHA-256 hash of the exact request (messages + model + parameters). Stores up to 20,000 entries with 60-minute TTL and LRU eviction. Identical requests get instant responses without provider calls.
What it does NOT do:
- Does not match semantically similar prompts (exact byte-level match only)
- Does not cache partial or streaming responses
- Does not share cache across gateway instances (in-process memory only)
- Does not respect per-customer cache isolation (same prompt from different customers hits the same cache entry)
What it does: Integrates with Butter.dev, a third-party LLM response caching proxy. Identical prompts across all Gatewayz customers hit a shared external cache, achieving sub-100ms response times on cache hits (vs 1-5 seconds from providers). This is an opt-in feature configurable per user.
What it does NOT do:
- Does not guarantee cache hits (depends on other customers having made the same request)
- Does not provide cache isolation between customers (shared cache by design)
- Does not cache sensitive or PII-containing prompts (no content filtering before caching)
- Does not work when the Butter.dev service is unavailable (request falls through to provider)
What it does: Maintains several operational caches that reduce database and service load:
| Cache | Purpose | TTL |
|---|---|---|
| Auth cache | API key → user data lookup | 5-10 min |
| Catalog cache L1 | Full serialized catalog HTTP response | 5 min |
| Catalog cache L2 | Per-provider model lists in Redis | 15-30 min |
| DB query cache | User, plan, pricing, rate limit lookups | 1-30 min |
| Health cache | Model health data for routing | 6 min |
| Local memory cache | Redis fallback (LRU, 500 entries) | 15 min |
All caches degrade gracefully: Redis down → local memory. All caches miss → database or provider directly. No cache failure ever blocks a user request.
What it does NOT do:
- Does not provide cross-instance cache consistency (each process has its own L1 cache)
- Does not support manual cache invalidation per customer
- Does not guarantee cache freshness (stale data served until TTL expires)
- Does not encrypt cached data at rest (in-memory and Redis caches store plaintext)
The model catalog is the system's inventory — it knows what models exist, where they're hosted, what they cost, and what they can do.
What it does: A scheduled background process calls each provider's API to refresh the model catalog, storing results in the database. User-facing requests only read from cache → database, never hitting provider APIs on the hot path. If a provider's API is down, the system serves the last successfully synced catalog. Supports full sync (delete + reimport), incremental sync (delta), per-provider sync, and provider-metadata-only sync.
What it does NOT do:
- Does not sync in real-time (scheduled intervals with manual trigger option)
- Does not detect new models between sync cycles
- Does not verify model functionality during sync (only catalogs metadata)
- Does not automatically remove models that providers have deprecated (requires explicit flush or full resync)
What it does: Ensures every model in the catalog carries a standardized set of metadata: canonical ID, display name, provider slug, context length, modality (text→text, text→image, etc.), pricing (prompt + completion per token), streaming support, function calling support, vision support, health status, benchmark scores, and HuggingFace community metrics.
What it does NOT do:
- Does not guarantee all metadata fields are populated for every model (some providers don't expose all fields)
- Does not standardize model versioning (uses whatever version the provider publishes)
- Does not track model deprecation dates or migration paths
- Does not include training data information or model licenses
What it does: Enforces quality gates for catalog inclusion: models must have resolvable pricing (to prevent under-billing), an active provider, a valid modality, and must not be a duplicate. Models without pricing from any source (database, manual file, cross-reference) are excluded to prevent users from running expensive models at default rates.
What it does NOT do:
- Does not verify model quality or capability before inclusion (only metadata completeness)
- Does not require human approval for new models (automated based on criteria)
- Does not support provisional or beta model listings
- Does not apply different inclusion criteria per provider
What it does: Models with a HuggingFace ID receive additional community data: download count, likes, parameter count, pipeline tag, author information, avatar, and available inference providers. This data helps users evaluate model popularity and community trust.
What it does NOT do:
- Does not pull model weights or files from HuggingFace (metadata only)
- Does not update HuggingFace data in real-time (cached with TTL)
- Does not verify HuggingFace data accuracy
- Does not use HuggingFace metrics in routing decisions (informational only)
What it does: Provides multiple ways to find models: full-text search, filtering by provider/gateway/modality, trending models (ranked by requests, tokens, users, cost, speed), low-latency optimized models, model comparison across providers, batch comparison, and developer/organization views. Supports both a full catalog view (all providers) and a deduplicated unique model view.
What it does NOT do:
- Does not provide natural language model search ("find me a good coding model" — use the routers for that)
- Does not support saved searches or alerts for new models matching criteria
- Does not recommend models based on user history (use per-customer quality tracking for that)
- Does not show pricing history or price trend data
The business layer handles everything related to money, plans, and commercial operations.
What it does:
The atomic unit of billing. Every API request consumes credits calculated as: (prompt_tokens x prompt_price) + (completion_tokens x completion_price). Credits are deducted in order: subscription allowance first, then purchased credits. The system enforces safety rails:
- Pre-flight credit check: Estimates max cost before calling any provider. Insufficient credits → 402 immediately.
- Idempotent deduction: Every deduction carries a unique request ID. Retries never double-charge.
- Auto-refund: Provider errors (5xx, timeouts, empty streams) are automatically refunded. User errors (4xx) are not.
- High-value model protection: Premium models are blocked if pricing falls through to defaults (prevents under-billing).
- Daily usage cap: Safety limit to prevent runaway costs.
What it does NOT do:
- Does not support real-time credit streaming (credits are deducted after the full response, not token-by-token during streaming)
- Does not support credit expiration (purchased credits never expire)
- Does not roll over unused subscription allowance (resets monthly)
- Does not support credit transfers between users
- Does not support multiple currencies (USD only)
- Does not provide spending alerts during a request ("you've spent $X so far in this session")
What it does: Defines subscription tiers with different capabilities:
| Tier | Billing | Allowance | Target |
|---|---|---|---|
| Trial | Free, 3 days | $5 cap, 1M tokens, 10K requests | New users evaluating |
| Dev | Pay-as-you-go | Optional monthly allowance | Individual developers |
| Team | Subscription | Monthly credit allowance | Teams and startups |
| Enterprise | Custom | Negotiated | Large organizations |
Trial users can still access :free suffix models after expiration. Purchased credits survive plan changes.
What it does NOT do:
- Does not provide annual billing discounts
- Does not support plan previewing ("what would my bill have been on Team tier last month")
- Does not automatically upgrade users based on usage patterns
- Does not support mid-cycle plan changes with prorated refunds for all scenarios
- Does not support team billing (one bill for multiple users under one organization)
What it does: Gives customers full visibility into their usage: spend by model, by API key, by day. Token counts, request counts, error rates. Cost attribution (which key, which team member consumed what). Latency percentiles (P50, P95, P99) per model. Time-series data for dashboard rendering (hourly and daily).
What it does NOT do:
- Does not provide CSV/JSON export (the conceptual model calls for this, but it needs implementation)
- Does not support custom date ranges beyond 365 days
- Does not provide cost forecasting or budget projection
- Does not compare usage across team members (no team/org analytics layer)
- Does not track per-request prompt/response content (only metadata: tokens, cost, latency)
What it does: Provides programmatic event notifications so customers can build automations:
| Event | Trigger |
|---|---|
credits.low |
Balance drops below configurable threshold |
credits.depleted |
Balance reaches zero |
credits.added |
Credits purchased or granted |
model.degraded |
A model the customer uses becomes unhealthy |
rate_limit.approaching |
Usage approaching rate limit threshold |
batch.completed |
Async batch job finished |
Webhooks are delivered with retry logic and exponential backoff, signed with HMAC-SHA256 for verification, and include a delivery log for debugging.
What it does NOT do:
- Does not support custom webhook event types
- Does not guarantee exactly-once delivery (at-least-once with idempotency keys)
- Does not support webhook filtering (all subscribed events are delivered; no per-model or per-key filters)
- Does not provide a webhook testing/debugging UI
- Does not support alternative delivery mechanisms (email, SMS, Slack)
What it does: Tracks uptime per provider, per model, and per customer plan tier. Maintains a historical incident log visible to customers. Monitors P99 latency and error rates against plan-specific SLA thresholds. Provides automatic credit-back compensation when SLA thresholds are violated.
What it does NOT do:
- Does not guarantee SLAs for upstream providers (Gatewayz SLA covers the gateway layer, not provider availability)
- Does not negotiate custom SLAs dynamically
- Does not provide legally binding SLA documentation (operational tracking, not contractual)
- Does not account for planned maintenance windows in uptime calculations
Tools beyond basic inference that help developers build, test, and optimize their AI applications.
What it does: A centralized system for managing, versioning, and testing prompts:
- Template library — Store and version system prompts. Retrieve by ID or name.
-
Template variables —
{{customer_name}},{{context}},{{language}}— filled at request time. - A/B testing — Run two prompt variants side by side, measure which produces better outcomes.
- Per-key defaults — Attach a default system prompt to an API key so it's injected on every request.
What it does NOT do:
- Does not provide prompt optimization or rewriting suggestions
- Does not support prompt chaining or multi-step workflows
- Does not integrate with version control systems (Git)
- Does not provide a visual prompt builder (API and config-based only)
- Does not track prompt performance metrics automatically (A/B testing requires manual metric definition)
What it does: Allows submission of large lists of prompts for asynchronous processing. Jobs run off-peak at reduced cost (typically 50% cheaper). Users can poll for status or receive a webhook on completion, then download results. Essential for document processing, data extraction, bulk evaluation, and dataset generation.
What it does NOT do:
- Does not guarantee completion time (best-effort scheduling)
- Does not support real-time streaming of batch results
- Does not provide partial results (all-or-nothing per batch)
- Does not support priority queuing or rush processing
- Does not retry individual failed items within a batch automatically
What it does: Provides tools for model evaluation:
- Model comparison — Send the same prompt to multiple models, compare outputs side-by-side.
- Regression testing — Define test cases, run them against model updates, flag quality regressions.
What it does NOT do:
- Does not provide automated quality scoring of outputs (comparison is visual/manual)
- Does not support scheduled regression test runs (manual trigger only)
- Does not integrate with CI/CD pipelines
- Does not benchmark latency or throughput (functional quality testing only)
What it does: An interactive web UI for testing prompts against any model in the catalog. Developers can experiment with different models, parameters, and system prompts without writing code. Supports both streaming and non-streaming modes.
What it does NOT do:
- Does not save playground sessions (ephemeral testing only)
- Does not support collaborative playground sessions (single-user)
- Does not provide code generation from playground interactions ("export to code")
- Does not support file uploads or image inputs in the playground
Full visibility into system behavior for both the Gatewayz team and customers.
What it does: Exposes comprehensive Prometheus metrics scraped by Grafana: request rates, latencies (P50/P95/P99), error rates, cache hit rates, credit usage, provider health scores, token throughput, circuit breaker states, concurrency utilization, and cost-per-request. Supports both standard Prometheus text format and OpenMetrics format with exemplar support for trace-to-metric linking.
What it does NOT do:
- Does not store metric history in the gateway (Prometheus server handles retention)
- Does not provide alerting rules (configured in Grafana/Prometheus, not in the gateway)
- Does not aggregate metrics across multiple gateway instances (each instance exposes its own)
- Does not provide pre-built dashboard templates (Grafana dashboards are configured separately)
What it does: Full request lifecycle tracing via OpenTelemetry, exported to Tempo. Every request gets a trace ID that links across middleware, auth, routing, provider calls, credit deduction, and cache operations. Supports exemplar linking from Prometheus metrics to traces for drill-down debugging.
What it does NOT do:
- Does not trace into provider APIs (trace ends at the HTTP call boundary)
- Does not provide trace-based alerting
- Does not support customer-provided trace context propagation (W3C trace-context)
- Does not retain traces in the gateway (exported to Tempo for storage and querying)
What it does: Captures exceptions with full stack traces, breadcrumbs, and context via Sentry. Supports automatic alerting on new or regression errors. Integrates with Loki for log correlation. Provides autonomous error pattern detection and AI-generated fix suggestions using Claude.
What it does NOT do:
- Does not automatically apply generated fixes (suggestions require human review)
- Does not persist error patterns to a database (in-memory only, lost on restart)
- Does not correlate errors across multiple gateway instances
- Does not provide customer-facing error detail (customers see sanitized error messages)
What it does: LLM-specific observability via Arize Phoenix and Braintrust: prompt/response pairs, token usage breakdowns, quality scoring, model performance comparison, and cost attribution. Provides insights specific to AI workloads that generic APM tools miss.
What it does NOT do:
- Does not store prompt/response content long-term in the gateway (exported to external tools)
- Does not provide fine-tuning recommendations based on traces
- Does not support custom evaluation metrics per customer
- Does not integrate with customer-provided evaluation frameworks
What it does: Continuous CPU and memory profiling of hot paths via Pyroscope. Tags profiling data with operation context (cache operations, auth, routing, provider calls) for targeted performance analysis. Helps identify bottlenecks and memory leaks in production.
What it does NOT do:
- Does not profile provider API calls (only gateway-side code)
- Does not provide automated performance regression alerts
- Does not support on-demand profiling activation (always on, sampling-based)
- Does not expose profiling data to customers
What it does: Provides customers with:
- Usage dashboard — Real-time and historical view of spend, tokens, requests, errors.
- Model health status — Which models are healthy, degraded, or down.
- Status page — Historical uptime, incident timeline.
- Request logs — Per-request detail: model, provider, tokens, cost, latency, status.
What it does NOT do:
- Does not provide raw provider responses or full prompt/response logs (only metadata)
- Does not support custom dashboard creation
- Does not provide API access to observability data (dashboard only)
- Does not offer SLA compliance reporting (see SLA Tracking for that)
Drop-in replacement compatibility with the two most popular AI APIs.
What it does:
Exposes POST /v1/chat/completions that accepts the exact same request format as the OpenAI Chat Completions API. Any application built for OpenAI works with Gatewayz by changing the base URL — no code changes required. Supports streaming (SSE), tool/function calling, JSON mode, logprobs, and all standard parameters. Responses are normalized to the OpenAI response format regardless of which provider actually served the request.
What it does NOT do:
- Does not support the OpenAI Assistants API or Threads API
- Does not support the OpenAI Files API or fine-tuning endpoints
- Does not support the OpenAI Embeddings API (inference only)
- Does not guarantee identical token counting as OpenAI (different tokenizers per provider)
- Does not support OpenAI-specific features like structured outputs with
strict: trueenforcement across all providers
What it does:
Exposes POST /v1/messages that accepts the exact same request format as the Anthropic Messages API. Applications built for Anthropic work with Gatewayz by changing the base URL. Supports streaming and all standard parameters. Responses are normalized to the Anthropic response format.
What it does NOT do:
- Does not support Anthropic-specific features like computer use or extended thinking across all providers
- Does not support the Anthropic Batch API format
- Does not guarantee identical token counting as Anthropic
- Does not support Anthropic-specific headers (x-api-key style auth; uses Bearer token instead)
How the system is deployed and operated across environments.
What it does: Routes requests to the nearest provider region for lowest latency. Implements geo-aware provider selection so that a user in Europe is routed to European provider endpoints when available, reducing round-trip time.
What it does NOT do:
- Does not deploy gateway instances in multiple regions (the gateway is single-region; geo-routing is at the provider selection level)
- Does not support user-specified region preferences per request
- Does not guarantee all models are available in all regions
- Does not provide region-specific pricing
What it does: Routes EU customers' requests to EU-based providers for GDPR compliance. Ensures that prompt and response data for data-residency-sensitive customers never leaves the specified geographic region.
What it does NOT do:
- Does not provide data residency for non-EU regions (EU-only initially)
- Does not guarantee that all models are available in the EU region
- Does not handle data deletion requests (GDPR right to erasure) through the API
- Does not provide data residency certification or compliance documentation
What it does: Supports deployment to multiple targets:
| Target | Use Case |
|---|---|
| Vercel (serverless) | Quick deployment, auto-scaling |
| Railway / Docker (container) | Full control, persistent connections |
| Self-hosted | Enterprise on-prem deployment |
What it does NOT do:
- Does not provide a managed/hosted SaaS offering with zero deployment
- Does not support Kubernetes-native deployment manifests (Docker-based only)
- Does not provide deployment automation or infrastructure-as-code templates
- Does not support hot code reload in production (requires restart for configuration changes)
| Category | Features | Purpose |
|---|---|---|
| Ingress Layer | 12 | Security, authentication, rate limiting, input/output guardrails |
| Core Routing Engine | 9 | Model resolution, intelligent routing, failover, load balancing |
| Intelligence Layer | 6 | Health monitoring, quality scoring, incident management |
| Caching System | 4 | Multi-layer caching for speed and cost reduction |
| Model Catalog | 5 | Model discovery, metadata, sync, enrichment |
| Business Layer | 5 | Credits, plans, analytics, webhooks, SLA tracking |
| Developer Platform | 4 | Prompt management, batch inference, eval, playground |
| Observability | 6 | Metrics, tracing, error tracking, profiling, customer dashboards |
| API Compatibility | 2 | OpenAI and Anthropic drop-in replacement |
| Infrastructure | 3 | Multi-region, data residency, multi-target deployment |
| Total | 56 |
Source: Conceptual Model | Features (Current Implementation) | API Mappings
Reading Path (start here, in order)
- Conceptual Model
- Stability Definition
- Conceptual Model Features
- Features
- Delta Report
- Features-Acceptance-Criteria
Testing
Security & Access
Billing
Monitoring
Features
Providers
Operations
Data References