-
Notifications
You must be signed in to change notification settings - Fork 1
Conceptual Model
Reading path: Conceptual Model (you are here) | Stability Definition | Conceptual Model Features | Features | Delta Report | Features-Acceptance-Criteria
Read this first. This is the foundation — everything else builds on it. Next: Stability Definition (what "done" looks like)
TL;DR — Gatewayz is a universal AI gateway. One API key gives you access to 10,000+ models from 30+ providers. Automatic failover, intelligent routing, one credit-based bill, enterprise security. Think "Stripe for AI inference." The architecture has 10 layers: Ingress (auth/rate limiting/guardrails), Core Routing (model resolution/failover/load balancing), Intelligence (health monitoring/quality scoring), Caching (7 layers), Model Catalog (10K+ models), Business (credits/plans/billing), Developer Platform (prompt management/batch/playground), Observability (Prometheus/OpenTelemetry/Sentry), API Compatibility (OpenAI + Anthropic drop-in), and Infrastructure (multi-region/deployment).
Gatewayz is a universal AI gateway. It sits between applications and every major AI model provider in the world, giving developers access to thousands of AI models through a single API.
Think of it like this:
-
Without Gatewayz: A company that wants to use AI models from OpenAI, Google, Anthropic, Meta, Mistral, and others needs to build and maintain a separate integration for each provider. Each has its own API format, its own billing account, its own authentication, and its own quirks. If one goes down, the application goes down with it.
-
With Gatewayz: The company integrates once. One API call, one API key, one bill. Gatewayz handles the rest — routing the request to the right provider, translating between formats, switching to a backup if something fails, and tracking every token and dollar.
Gatewayz is to AI providers what Stripe is to payment processors.
Stripe lets businesses accept payments from Visa, Mastercard, Amex, and dozens of other networks through one integration. Businesses don't think about which card network to use — Stripe handles routing, retries, and reconciliation.
Gatewayz does the same for AI inference. Developers don't think about which provider serves which model, or what happens if that provider has an outage. They send a request, get a response, and see the cost on one bill.
Your Application ──► Gatewayz ──► OpenAI
──► Anthropic
──► Google
──► Meta (via providers)
──► Mistral (via providers)
──► DeepSeek (via providers)
──► ... 30+ more providers
-
One API, every model — Send a standard API request. Gatewayz figures out which provider serves that model and routes accordingly.
-
Automatic failover — If a provider goes down mid-request, Gatewayz silently retries with another provider that serves the same model. The developer never sees the failure.
-
Intelligent routing — Don't know which model to use? Ask Gatewayz to pick the best one for your task — optimized for quality, cost, speed, or a balance of all three.
-
One bill — Every model from every provider is billed through one credit balance. Pay-as-you-go, subscription, or trial.
-
Full visibility — Every request is tracked: which model, which provider, how many tokens, how much it cost, how fast it responded, whether it succeeded.
| Audience | What they get |
|---|---|
| Developers | One SDK integration instead of 30. Drop-in compatible with OpenAI and Anthropic formats — existing code works unchanged. |
| Engineering teams | Automatic failover, health monitoring, and rate limiting without building it themselves. |
| Product teams | Access to every model for experimentation. Switch models by changing a string, not rewriting code. |
| Finance / Ops | One vendor, one invoice, clear per-request cost attribution. |
| Enterprise | Security (encrypted keys, IP allowlists, audit logs), compliance, and SLA-backed reliability. |
One API key, every AI model, automatic reliability, one bill.
This section describes what Gatewayz aims to be — the complete, optimal system. It covers both what exists today and what the system should evolve into. This is the target architecture.
┌──────────────────────────────────────────────────────────────────────────┐
│ GATEWAYZ GATEWAY │
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌───────────────────┐ │
│ │ Ingress │ │ Core │ │ Intel- │ │ Business │ │
│ │ Layer │ │ Routing │ │ ligence │ │ Layer │ │
│ │ │ │ Engine │ │ Layer │ │ │ │
│ │ Auth │ │ │ │ │ │ Credits & Billing │ │
│ │ Rate Limit │ │ Provider │ │ Health │ │ Plans & Trials │ │
│ │ Guardrails │ │ Resolution │ │ Monitoring │ │ Usage Analytics │ │
│ │ Validation │ │ Failover │ │ Benchmarks │ │ Webhooks │ │
│ │ │ │ Load Bal. │ │ Quality │ │ SLA Tracking │ │
│ │ │ │ Smart Rtr │ │ Scoring │ │ │ │
│ └────────────┘ └────────────┘ └────────────┘ └───────────────────┘ │
│ │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌───────────────────┐ │
│ │ Caching │ │ Model │ │ Observa- │ │ Developer │ │
│ │ System │ │ Catalog │ │ bility │ │ Platform │ │
│ │ │ │ │ │ │ │ │ │
│ │ Semantic │ │ Discovery │ │ Metrics │ │ Prompt Mgmt │ │
│ │ Response │ │ Metadata │ │ Tracing │ │ Batch Inference │ │
│ │ Catalog │ │ Pricing │ │ Alerts │ │ Eval & Testing │ │
│ │ Auth │ │ Enrichment │ │ Dashboards │ │ Playgrounds │ │
│ └────────────┘ └────────────┘ └────────────┘ └───────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────┘
│
▼
30+ AI Model Provider Gateways
Every request passes through the ingress layer before anything else. This is the security and quality boundary.
- API key authentication with keys encrypted at rest (AES-128 Fernet)
- HMAC-SHA256 key hashing for fast lookup without decryption
- Role-based access control (RBAC) — admin, team, dev, free tiers with distinct permissions
- Per-key IP allowlists — restrict an API key to specific IP addresses or ranges
- Domain restrictions — limit which domains can use a key
- Layer 1 — IP-level: Protects against abuse at the network edge. Behavioral analysis and velocity detection for anomalous patterns.
- Layer 2 — API key-level: Redis-backed per-key limits (requests per minute, tokens per day/month). Tied to the user's plan tier.
- Layer 3 — Anonymous: Separate, stricter limits for unauthenticated requests.
- Graceful degradation: If Redis is unavailable, an in-memory fallback rate limiter activates. Requests are never blocked due to infrastructure failure.
- PII detection — Scan prompts for personally identifiable information (phone numbers, SSNs, emails, credit cards) before sending to external providers. Optionally redact or block.
- Prompt injection defense — Detect and block known injection patterns that attempt to override system prompts.
- Topic restrictions — Per-API-key configuration to restrict models to specific domains (e.g., "only answer customer support questions").
- Content moderation — Integration with moderation classifiers to block harmful or policy-violating inputs before they reach any provider.
- Content filtering — Scan model responses for policy violations, harmful content, or off-topic answers before returning to the customer.
- Structured output validation — When the customer requests JSON schema output, validate the response conforms before returning it.
- Hallucination flags — Surface provider-side safety metadata (refusals, safety filter triggers) in a standardized format regardless of which provider generated the response.
This is the central nervous system of Gatewayz. Every request must be resolved to a specific provider and model ID.
User sends: model = "deepseek-r1"
│
▼
┌─ Alias Normalization ─┐
│ "deepseek-r1" │
│ → "deepseek/deepseek-r1" │
└───────────┬───────────┘
▼
┌─ Provider Detection ──┐
│ Check overrides │
│ Check format rules │
│ Check registry │
│ → Provider: "fireworks" │
└───────────┬───────────┘
▼
┌─ Model ID Transform ──┐
│ Translate to native │
│ provider format │
│ → "accounts/fireworks/│
│ models/deepseek-r1"│
└───────────┬───────────┘
▼
Provider API call
-
120+ aliases map shorthand names to canonical model IDs (
"r1"→"deepseek/deepseek-r1","gpt-4o"→"openai/gpt-4o") - Provider detection follows a strict priority: explicit overrides → format-based rules → mapping tables → org-prefix fallbacks
- Model ID transformation translates canonical IDs to each provider's native format (every provider has different naming conventions)
When the user doesn't specify a model, Gatewayz picks the optimal one:
| Router | Syntax | What it does |
|---|---|---|
| General Router | router:general:quality |
ML-powered model selection (via NotDiamond). Analyzes the prompt content and picks the best model for: quality, cost, latency, or balanced. |
| Code Router | router:code:agentic |
Benchmark-driven code model selection. Classifies task complexity, matches to tiered models scored by SWE-bench and code benchmarks. Modes: auto, price, quality, agentic. |
When a provider fails, the request automatically retries with the next provider in a prioritized chain:
Primary (Fireworks) ──FAIL──► OpenRouter ──FAIL──► Together ──SUCCESS──► Response
- 14-provider failover chain ordered by reliability
- Triggers on: 401, 402 (provider out of credits), 403, 404, 502, 503, 504
- Does not trigger on: 400 (user error), 429 (rate limit — retry with backoff instead)
- Circuit breakers per provider: after 5 consecutive failures, the provider is temporarily removed from the chain. Auto-recovers after 60 seconds of cool-down.
- Model-aware rules: OpenAI models only failover to OpenAI → OpenRouter. Anthropic models only to Anthropic → OpenRouter. Open-source models can failover across all providers.
For models available on multiple providers simultaneously:
- Health-weighted routing — Before attempting a request, check the primary provider's health. If uptime < threshold, promote a healthier provider to the front of the chain.
- Latency-optimal selection — For the same model on multiple providers, route to the provider with the lowest current P50 latency.
- Cost-optimal selection — When the user requests cost optimization, select the cheapest provider that serves the model and meets minimum quality/latency thresholds.
- Traffic splitting — Distribute load across providers to prevent over-reliance on any single one (e.g., 70/30 split) and to continuously gather performance data from all providers.
A continuous, tiered monitoring system that watches every model across every provider:
| Tier | Coverage | Check interval | Examples |
|---|---|---|---|
| Critical | Top 5% by usage | Every 5 minutes | GPT-4o, Claude Sonnet, Gemini Pro |
| Popular | Next 20% | Every 30 minutes | Llama-3.3-70B, Mistral Large |
| Standard | Remaining 75% | Every 2-4 hours | Long-tail models |
| On-Demand | New/rare models | Only when requested | Niche or newly added models |
- Passive health capture: Every real inference request contributes health data as a background task — zero overhead on the request path.
- Circuit breaker states: CLOSED (healthy) → OPEN (failing, blocked) → HALF_OPEN (testing recovery).
- Incident management: Severity levels (Critical/High/Medium/Low) with automatic incident creation.
Every model in the catalog should carry quality scores that help users and the routing engine make informed decisions:
- Benchmark integration — Pull scores from standardized benchmarks: MMLU, HumanEval, MATH, MT-Bench, LMSYS Arena ELO, LiveBench, SWE-bench.
- Task-specific quality priors — Per-model scores for: code generation, reasoning, creative writing, summarization, translation, data extraction, simple Q&A.
- Real-time quality signals — Blend static benchmarks with live data: success rate, retry rate, format compliance rate, average response time.
- Per-customer quality tracking — Track whether a model performs well for a specific customer's use case over time, enabling personalized routing recommendations.
- Track upstream provider credit balances continuously.
- When a provider's credits are low, preemptively deprioritize it in the failover chain before it starts returning 402 errors.
A multi-layer caching architecture that minimizes latency, reduces costs, and never blocks a request if a cache layer fails.
Request
│
▼
┌─ Semantic Cache ──────────────────────────────────────────┐
│ "What's the capital of France?" ≈ "Tell me France's │
│ capital city" → same cached response │
│ (Vector similarity, cosine threshold > 0.95) │
└──────────────┬────────────────────────────────────────────┘
│ miss
▼
┌─ Exact-Match Response Cache ──────────────────────────────┐
│ SHA-256 hash of {messages + model + params} │
│ 20K entries, 60-min TTL, LRU eviction │
└──────────────┬────────────────────────────────────────────┘
│ miss
▼
┌─ External Cache (Butter.dev) ─────────────────────────────┐
│ Third-party LLM response caching proxy │
│ Identical prompts across all customers → shared cache │
│ Sub-100ms response on hit vs 1-5s from provider │
└──────────────┬────────────────────────────────────────────┘
│ miss
▼
Provider API call
Supporting caches:
| Cache | What it stores | TTL | Purpose |
|---|---|---|---|
| Auth cache | API key → user data | 5-10 min | Reduces auth latency from 50-150ms to 1-5ms |
| Catalog cache (L1) | Full serialized catalog HTTP response | 5 min | Sub-10ms catalog responses with stampede protection |
| Catalog cache (L2) | Per-provider model lists in Redis | 15-30 min | Avoids rebuilding catalog on every request |
| DB query cache | User, plan, pricing, rate limit lookups | 1-30 min | 60-80% database load reduction |
| Health cache | Model health data | 6 min | Feeds health-based routing decisions |
| Local memory cache | Redis fallback (LRU, 500 entries) | 15 min | Ensures system works when Redis is down |
Design principle: Every cache layer degrades gracefully. If Redis goes down, local memory takes over. If all caches miss, the request goes to the database or provider directly. No cache failure ever blocks a user request.
The model catalog is the system's inventory — it knows what models exist, where they're hosted, what they cost, and what they can do.
Models are not fetched from providers on each user request. Instead:
Background sync (scheduled) ──► Provider APIs ──► models_catalog DB table
│
User request ──► Cache L1 ──► Cache L2 ──► Database ─────┘
- A scheduled background process calls each provider's API to refresh the catalog.
- Results are stored in the database.
- User-facing requests only read from cache → database, never hitting provider APIs on the hot path.
- If a provider's API is down, the system serves the last successfully synced catalog.
Every model in the catalog has:
| Field | Description | Example |
|---|---|---|
id |
Canonical identifier | meta-llama/Llama-3.3-70B-Instruct |
name |
Display name | Llama 3.3 70B Instruct |
provider_slug |
Which gateway serves it | fireworks |
context_length |
Maximum token window | 131072 |
modality |
Input → output type |
text→text, text→image, image→text
|
pricing |
Cost per token (prompt + completion) | $0.00000055 / token |
supports_streaming |
SSE streaming support | true |
supports_function_calling |
Tool/function use | true |
supports_vision |
Image input support | false |
health_status |
Current health |
healthy, degraded, down
|
benchmark_scores |
Quality scores by task | {code: 92, reasoning: 88, ...} |
huggingface_metrics |
Downloads, likes, parameters | Community engagement data |
A model must meet these requirements to appear in the catalog:
- Resolvable pricing — Models without pricing data from any source (database, manual file, cross-reference) are excluded. This prevents users from running expensive models at default rates.
- Active provider — The model's provider must be registered and reachable.
- Valid modality — The model must have a known input/output modality.
- Not duplicate — When the same model is available from multiple providers, the catalog supports both a unique (deduplicated) view and a full (all providers) view.
Models with a HuggingFace ID receive additional community data:
- Download count, likes, parameter count
- Pipeline tag (text-generation, text-to-image, etc.)
- Author information and avatar
- Available inference providers
The atomic unit of billing. Every API request consumes credits based on token usage.
Cost = (prompt_tokens × prompt_price) + (completion_tokens × completion_price)
Deduction order:
- Subscription allowance (monthly credits included in plan) — used first
- Purchased credits (top-ups) — used after allowance is exhausted
Safety rails:
- Pre-flight credit check: Before calling any provider, estimate max cost. If insufficient credits → 402 immediately (no wasted provider call).
- Idempotent deduction: Every deduction carries a unique request ID. Retries never double-charge.
- Atomic transactions: Balance update and transaction record happen in a single database transaction.
- Auto-refund: Provider errors (5xx, timeouts, empty streams) are automatically refunded. User errors (4xx) are not.
- High-value model protection: Premium models (GPT-4, Claude, Gemini, o1/o3/o4) are blocked from serving if pricing falls through to default — prevents massive under-billing.
- Daily usage cap: Safety limit to prevent runaway costs.
| Tier | Billing | Allowance | Limits | Target |
|---|---|---|---|---|
| Trial | Free, 14 days | $5 credit cap, 1M tokens, 10K requests | Strict | New users evaluating the platform |
| Dev | Pay-as-you-go | Optional monthly allowance | Standard | Individual developers |
| Team | Subscription | Monthly credit allowance | Higher concurrency, higher rate limits | Teams and startups |
| Enterprise | Custom | Negotiated | Custom SLAs, dedicated support | Large organizations |
- Trial users can still access
:freesuffix models after trial expiration. - Unused subscription allowance does not roll over — it resets monthly.
- Purchased credits never expire and survive plan changes.
Customers should have full visibility into their usage:
- Usage breakdown — Spend by model, by API key, by day. Token counts, request counts, error rates.
- Cost attribution — Which API key, which team member, which application consumed what.
- Latency percentiles — P50, P95, P99 response times per model.
- Time-series data — Hourly and daily usage trends for dashboard rendering.
- Exportable — CSV/JSON export for finance teams and internal reporting.
Programmatic event notifications so customers can build automations:
| Event | Trigger |
|---|---|
credits.low |
Balance drops below configurable threshold |
credits.depleted |
Balance reaches zero |
credits.added |
Credits purchased or granted |
model.degraded |
A model the customer uses becomes unhealthy |
rate_limit.approaching |
Usage approaching rate limit threshold |
batch.completed |
Async batch job finished |
- Delivery with retry logic and exponential backoff.
- HMAC-SHA256 signed payloads for verification.
- Delivery log for debugging.
- Uptime calculation per provider, per model, per customer plan tier.
- Historical incident log — customer-visible timeline of outages and degradations.
- SLA breach alerting — notify customer when P99 latency or error rate exceeds their plan's SLA.
- Credit-back — automatic compensation when SLA thresholds are violated.
A centralized system for managing, versioning, and testing prompts:
- Template library — Store and version system prompts. Retrieve by ID or name.
-
Template variables —
{{customer_name}},{{context}},{{language}}— filled at request time. - A/B testing — Run two prompt variants side by side, measure which produces better outcomes.
- Per-key defaults — Attach a default system prompt to an API key so it's injected on every request.
For workloads that don't need real-time responses:
POST /v1/batch/jobs
→ Submit list of prompts
→ Job runs off-peak (cheaper)
→ Poll status or receive webhook on completion
→ Download results
- Typically 50% cheaper than synchronous inference.
- Essential for: document processing, data extraction, bulk evaluation, dataset generation.
- Model comparison — Send the same prompt to multiple models, compare outputs side-by-side.
- Regression testing — Define test cases, run them against model updates, flag quality regressions.
- Playground — Interactive web UI for testing prompts against any model in the catalog.
| Layer | Tool | What it tracks |
|---|---|---|
| Metrics | Prometheus + Grafana | Request rates, latencies, error rates, cache hit rates, credit usage, provider health, token throughput |
| Tracing | OpenTelemetry | Full request lifecycle traces across all services |
| Error tracking | Sentry | Exceptions, stack traces, breadcrumbs with automatic alerting |
| AI-specific tracing | Arize Phoenix + Braintrust | LLM-specific observability: prompt/response pairs, token usage, quality scoring |
| Profiling | Pyroscope | CPU and memory profiling of hot paths (cache operations, auth, routing) |
- Usage dashboard — Real-time and historical view of spend, tokens, requests, errors.
- Model health status — Which models are healthy, degraded, or down right now.
- Status page — Historical uptime, incident timeline, SLA compliance.
- Request logs — Per-request detail: model used, provider, tokens, cost, latency, status.
Gatewayz exposes two API-compatible interfaces:
| Format | Endpoint | What it means |
|---|---|---|
| OpenAI-compatible | POST /v1/chat/completions |
Any application built for the OpenAI API works with Gatewayz by changing the base URL. No code changes. |
| Anthropic-compatible | POST /v1/messages |
Any application built for the Anthropic API works with Gatewayz by changing the base URL. No code changes. |
Both formats support streaming (SSE) and non-streaming responses. Responses are normalized to the expected format regardless of which provider actually served the request.
- Geo-aware routing — Route requests to the nearest provider region for lowest latency.
- Data residency — EU customers' requests routed to EU-based providers for GDPR compliance.
- Multi-region Redis — Cache replication across regions for consistent performance.
- Edge deployment — HTTP termination at the edge, application logic in regional clusters.
| Target | Use case |
|---|---|
| Vercel (serverless) | Quick deployment, auto-scaling |
| Railway / Docker (container) | Full control, persistent connections |
| Self-hosted | Enterprise on-prem deployment |
┌─────────────────────────────────────────────────────────────────┐
│ THE CUSTOMER │
│ │
│ "I want to use any AI model, reliably, at the best price, │
│ with full visibility, through one integration." │
└──────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ GATEWAYZ │
│ │
│ PROTECT ROUTE OPTIMIZE BILL │
│ ─────── ───── ──────── ──── │
│ Auth Model resolve Health monitor Credits │
│ Rate limit Provider detect Smart routing Plans │
│ Guardrails Failover chain Caching (7+ Usage analytics │
│ Validation Load balancing layers) Webhooks │
│ Smart routing Benchmarks SLA tracking │
│ Cost optimize │
│ │
│ CATALOG PLATFORM OBSERVE │
│ ─────── ──────── ─────── │
│ 10,000+ models Prompt mgmt Metrics Status page │
│ Auto-sync Batch inference Tracing Customer logs │
│ Pricing Eval & testing Alerts Dashboards │
│ Enrichment Playgrounds Profiling │
│ │
└──────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ 30+ AI PROVIDER GATEWAYS │
│ │
│ OpenAI Anthropic Google Groq Fireworks Together Meta │
│ DeepInfra Cerebras HuggingFace Featherless Cloudflare │
│ xAI Alibaba NEAR Fal Helicone AiHubMix Morpheus ... │
└─────────────────────────────────────────────────────────────────┘
Any developer or company can use any AI model from any provider through one API key and one bill — with automatic reliability, cost optimization, quality-aware routing, full visibility, and enterprise-grade security.
Gatewayz becomes the default infrastructure layer through which the world consumes AI — not by locking anyone into a single provider, but by making every provider accessible, reliable, and observable through one unified gateway.
Reading Path (start here, in order)
- Conceptual Model
- Stability Definition
- Conceptual Model Features
- Features
- Delta Report
- Features-Acceptance-Criteria
Testing
Security & Access
Billing
Monitoring
Features
Providers
Operations
Data References