A production-ready FastAPI application that provides a unified, multi-provider LLM inference proxy with automatic API key fallback, rate limiting, structured logging, and health monitoring. Supports OpenAI and Anthropic APIs with seamless format conversion and cross-provider routing.
Warning: This repo is functional but incomplete and may undergo further restructuring; model schemas, API handling, and execution may change as development progresses.
- Multi-Provider Support: Route requests to OpenAI or Anthropic based on model configuration
- API Key Fallback: Automatic failover to backup API keys when rate limits or errors occur
- Circuit Breaker Pattern: Failed keys enter a cooldown period before being retried (see the sketch after this list)
- Format Conversion: Seamless conversion between OpenAI and Anthropic API formats (not perfect yet)
- Streaming Support: Full Server-Sent Events (SSE) streaming for both providers
- Structured Logging: Comprehensive request/response logging to SQLite database
- Rate Limiting: Configurable per-client rate limits (requests and tokens per minute)
- Health Checks: Basic and detailed health monitoring endpoints
- CORS Support: Configurable Cross-Origin Resource Sharing
- Error Handling: Provider-standardized error responses matching OpenAI/Anthropic formats
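The fallback and circuit-breaker behaviour can be pictured with a small sketch. This is illustrative only, not the project's actual implementation; `KeyPool` and its methods are invented names, and the real proxy reads the cooldown length from `KEY_COOLDOWN_SECONDS`:

```python
import time


class KeyPool:
    """Rotate through configured API keys, skipping keys that are cooling down."""

    def __init__(self, keys: list[str], cooldown_seconds: int = 300):
        self.keys = keys
        self.cooldown_seconds = cooldown_seconds
        self.cooldown_until: dict[str, float] = {}  # key -> timestamp when it becomes usable again

    def next_key(self) -> str:
        now = time.monotonic()
        for key in self.keys:
            if self.cooldown_until.get(key, 0.0) <= now:
                return key
        raise RuntimeError("all API keys are cooling down")

    def mark_failed(self, key: str) -> None:
        # Circuit-breaker style: a failed key sits out for the cooldown period before retry.
        self.cooldown_until[key] = time.monotonic() + self.cooldown_seconds
```

On a rate-limit or upstream error the proxy marks the current key as failed and retries with the next healthy key, as described in the fallback feature above.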
- Clone the repository:
  ```bash
  git clone https://github.com/BenItBuhner/model-proxy.git
  cd centralized-inference-endpoint
  ```
- Install dependencies:
  ```bash
  # Install uv if you don't have it
  pip install uv

  # Install project dependencies
  uv sync
  ```
- Set up environment variables (see Configuration below)
Create a `.env` file (see the example below) or set the following environment variables:

- `CLIENT_API_KEY`: API key for client authentication (required for all requests)
- `OPENAI_API_KEY`: Primary OpenAI API key (or `OPENAI_API_KEY_1`)
- `OPENAI_API_KEY_1`, `OPENAI_API_KEY_2`, ...: Additional OpenAI API keys for fallback
- `ANTHROPIC_API_KEY`: Primary Anthropic API key (or `ANTHROPIC_API_KEY_1`)
- `ANTHROPIC_API_KEY_1`, `ANTHROPIC_API_KEY_2`, ...: Additional Anthropic API keys for fallback
- `KEY_COOLDOWN_SECONDS`: Cooldown period for failed API keys (default: 300 seconds / 5 minutes)
- `REQUIRE_CLIENT_API_KEY`: Set to "true" to fail startup if `CLIENT_API_KEY` is missing (default: "false")
- `FAIL_ON_STARTUP_VALIDATION`: Set to "true" to fail startup on validation errors (default: "false")
- `CORS_ORIGINS`: Comma-separated list of allowed CORS origins (default: "*")
- `RATE_LIMIT_REQUESTS_PER_MINUTE`: Maximum requests per minute per client (default: 60)
- `RATE_LIMIT_TOKENS_PER_MINUTE`: Maximum tokens per minute per client (default: 100000)
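A minimal `.env` might look like this (all values are placeholders):

```
CLIENT_API_KEY=change-me
OPENAI_API_KEY_1=sk-your-openai-key
OPENAI_API_KEY_2=sk-your-backup-openai-key
ANTHROPIC_API_KEY_1=sk-ant-your-anthropic-key
KEY_COOLDOWN_SECONDS=300
CORS_ORIGINS=https://app.example.com
RATE_LIMIT_REQUESTS_PER_MINUTE=60
RATE_LIMIT_TOKENS_PER_MINUTE=100000
```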
Provider settings are configured in JSON files under `config/providers/`:

- `config/providers/openai.json`: OpenAI provider configuration
- `config/providers/anthropic.json`: Anthropic provider configuration
Each provider config includes the following sections (illustrated in the sketch after this list):

- `endpoints`: Base URL and endpoint paths
- `authentication`: Header format and authentication method
- `api_key_env_patterns`: Environment variable patterns for API keys
- `request_config`: Timeouts, retries, and default parameters
- `proxy_support`: Optional proxy URL override for OpenAI-compatible endpoints
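A provider config is shaped roughly like the sketch below. The top-level keys come from the list above; the nested field names and values are illustrative, so treat the shipped `config/providers/openai.json` as the authoritative schema:

```json
{
  "endpoints": {
    "base_url": "https://api.openai.com",
    "chat_completions": "/v1/chat/completions"
  },
  "authentication": {
    "header": "Authorization",
    "scheme": "Bearer"
  },
  "api_key_env_patterns": ["OPENAI_API_KEY", "OPENAI_API_KEY_{n}"],
  "request_config": {
    "timeout_seconds": 60,
    "max_retries": 2
  },
  "proxy_support": {
    "proxy_url": null
  }
}
```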
Models are defined as individual routing configuration files under `config/models/`. Each logical model has its own JSON file named `<logical_model>.json` that describes routing (primary provider, fallbacks, timeouts, and optional overrides for API keys or wire protocol). To add a new model, create a new file in `config/models/` following this pattern.
Example `config/models/gpt-5-2.json` (simplified):
```json
{
  "logical_name": "gpt-5.2",
  "timeout_seconds": 60,
  "model_routings": [
    {
      "id": "primary",
      "provider": "openai",
      "model": "gpt-5.2"
    },
    {
      "id": "secondary",
      "provider": "azure",
      "model": "gpt-5.2"
    }
  ],
  "fallback_model_routings": ["gpt-5.1"]
}
```

Notes:
- The new routing system reads per-model JSON files in `config/models/` using `app.routing.config_loader.ModelConfigLoader`.
- Use `config_loader.get_available_models()` to list logical models programmatically (see the sketch after these notes).
- `wire_protocol` and `api_key_env` are optional per-route overrides. If omitted, the provider config determines the wire protocol and API key env var patterns.
- If you previously used a single `config/models.json` (the legacy flat mapping), migrate to per-model files by creating one JSON file per logical model in `config/models/`. A migration script can be added to automate splitting the flat mapping; otherwise create the files by hand using the example above.
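As a quick check that new per-model files are picked up, the loader can be queried from a Python shell. This assumes `ModelConfigLoader` can be constructed without arguments and reads `config/models/` from the repository root; adjust to the actual constructor if it differs:

```python
from app.routing.config_loader import ModelConfigLoader

# Assumes a no-argument constructor; pass the config directory explicitly if required.
config_loader = ModelConfigLoader()
print(config_loader.get_available_models())  # e.g. ["gpt-5.2", "gpt-5.1"]
```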
Run the server locally:

```bash
uv run uvicorn app.main:app --port 9876
```

The API will be available at `http://localhost:9876`.
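A quick smoke test against the running server (assumes `CLIENT_API_KEY` is exported in your shell and a `gpt-5.2` model file exists under `config/models/`):

```bash
curl -s http://localhost:9876/v1/chat/completions \
  -H "Authorization: Bearer $CLIENT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-5.2", "messages": [{"role": "user", "content": "Hello!"}]}'
```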
Build the Docker image:
```bash
docker build -t centralized-inference-endpoint .
```

Run the container:

```bash
docker run -d -p 9876:9876 \
  -e CLIENT_API_KEY=your_client_key \
  -e OPENAI_API_KEY_1=your_openai_key \
  -e ANTHROPIC_API_KEY_1=your_anthropic_key \
  centralized-inference-endpoint
```

For local development:

```bash
docker-compose up
```

For production:

```bash
docker-compose -f docker-compose.prod.yml up -d
```

`POST /v1/chat/completions`: OpenAI-compatible chat completions endpoint (non-streaming).
Request:
```json
{
  "model": "gpt-5.2",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 100
}
```

Response: Standard OpenAI chat completion response format.
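Since the endpoint is OpenAI-compatible, the official `openai` Python SDK should work when pointed at the proxy. This is a sketch rather than a tested integration; the SDK is not a dependency of this project:

```python
from openai import OpenAI

# The api_key here is the proxy's CLIENT_API_KEY, not a provider key.
client = OpenAI(base_url="http://localhost:9876/v1", api_key="your_client_key")

response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
print(response.choices[0].message.content)
```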
OpenAI-compatible streaming chat completions endpoint.
Request: Same as /v1/chat/completions but returns Server-Sent Events stream.
Response: SSE stream with OpenAI-formatted chunks.
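One way to consume the stream is raw SSE parsing with `httpx`. The sketch below assumes OpenAI-style `data:` lines ending with a `[DONE]` sentinel; `STREAM_URL` is a placeholder because this README does not spell out the streaming route's path, so check the routes registered in `app.main`:

```python
import json

import httpx

STREAM_URL = "http://localhost:9876/v1/chat/completions"  # placeholder: substitute the streaming route
headers = {"Authorization": "Bearer your_client_key"}
payload = {"model": "gpt-5.2", "messages": [{"role": "user", "content": "Hello!"}]}

with httpx.stream("POST", STREAM_URL, json=payload, headers=headers, timeout=60) as response:
    for line in response.iter_lines():
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            print(delta, end="", flush=True)
```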
`POST /v1/messages`: Anthropic-compatible messages endpoint (non-streaming).
Request:
```json
{
  "model": "claude-4.5-opus",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 100,
  "temperature": 0.7
}
```

Response: Standard Anthropic message response format.
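For example, with curl (note that the proxy authenticates with the `Authorization` header rather than Anthropic's `x-api-key`):

```bash
curl -s http://localhost:9876/v1/messages \
  -H "Authorization: Bearer $CLIENT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "claude-4.5-opus", "max_tokens": 100, "messages": [{"role": "user", "content": "Hello!"}]}'
```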
Anthropic-compatible streaming messages endpoint.
Request: Same as /v1/messages but returns Server-Sent Events stream.
Response: SSE stream with Anthropic-formatted chunks.
All endpoints require authentication via the Authorization header:
```
Authorization: Bearer <CLIENT_API_KEY>
```

Or simply:

```
Authorization: <CLIENT_API_KEY>
```
The Bearer prefix is optional and case-insensitive.
