InkByteStudio/llm-budget-proxy


llm-budget-proxy


Lightweight reverse proxy for OpenAI API rate limiting, per-key token budgets, and cost dashboards. Deploy in 5 minutes with Docker.

Companion resources: Blog post · Tutorial

Architecture

┌──────────┐     ┌──────────────────────────────────────────────┐     ┌──────────┐
│  Client  │────▶│  llm-budget-proxy                            │────▶│  OpenAI  │
│          │◀────│                                              │◀────│  API     │
└──────────┘     │  ┌──────┐ ┌───────────┐ ┌────────┐ ┌───────┐ │     └──────────┘
                 │  │ Auth │▶│Rate Limit │▶│Budget  │▶│Cache  │ │
                 │  └──────┘ └───────────┘ └────────┘ └───────┘ │
                 │                                              │
                 │  ┌───────────┐  ┌───────────┐  ┌──────────┐  │
                 │  │ SQLite DB │  │ Dashboard │  │ Webhooks │  │
                 │  └───────────┘  └───────────┘  └──────────┘  │
                 └──────────────────────────────────────────────┘

Quick start

git clone https://github.com/InkByteStudio/llm-budget-proxy.git
cd llm-budget-proxy
cp .env.example .env
# Edit .env — add your OPENAI_API_KEY

# Option A: Docker (recommended)
docker compose up --build

# Option B: Local development
npm install
npm run seed -- alice team-a 10.00   # Create an API key
npm run dev

API key management

API keys use the lbp_ prefix and are stored as SHA-256 hashes. The plaintext key is shown once at creation time.
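
The hash-at-rest scheme can be sketched as follows. This is illustrative code, not the project's actual module; the function names are hypothetical:

```typescript
import { createHash, randomBytes } from "node:crypto";

// Hash a plaintext key; only this digest is persisted.
function hashKey(plaintext: string): string {
  return createHash("sha256").update(plaintext).digest("hex");
}

// Generate a new key: random bytes, base64url-encoded, with the lbp_ prefix.
// The plaintext is returned exactly once, at creation time.
function generateKey(): { plaintext: string; hash: string } {
  const plaintext = "lbp_" + randomBytes(24).toString("base64url");
  return { plaintext, hash: hashKey(plaintext) };
}

// At request time: hash the presented key and compare against the stored digest.
function verifyKey(presented: string, storedHash: string): boolean {
  return hashKey(presented) === storedHash;
}
```

Because only the digest is stored, a leaked database does not reveal usable keys, but a lost plaintext key cannot be recovered and must be reissued.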

Create a key via seed script:

npm run seed -- <name> [team] [daily-budget]
npm run seed -- alice team-a 10.00

Create a key via API:

curl -X POST http://localhost:3000/api/keys \
  -H "Authorization: Bearer $ADMIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "alice", "team": "team-a", "budgetPeriod": "daily", "budgetLimit": 10.00}'

Use a key:

curl http://localhost:3000/v1/chat/completions \
  -H "Authorization: Bearer lbp_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Configuration

All configuration is in config/config.yml. Environment variables are substituted using ${VAR_NAME} syntax.

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| server.port | Port | 3000 | Server port |
| server.adminKey | — | required | Admin API key for dashboard and management |
| provider.apiKey | — | required | OpenAI API key |
| rateLimits.default.rpm | RPM | 60 | Requests per minute per key |
| rateLimits.default.tpm | TPM | 100000 | Tokens per minute per key |
| budgets.defaultDaily | Budget | 10.00 | Default daily budget in USD |
| budgets.defaultMonthly | Budget | 100.00 | Default monthly budget in USD |
| modelDowngrade.enabled | Flag | false | Enable model downgrade on budget pressure |
| cache.enabled | Flag | true | Enable exact-match response caching |
| cache.defaultTtlSeconds | TTL | 3600 | Cache entry lifetime |
| alerts.webhookUrl | URL | (none) | Webhook URL for budget alerts |
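
A minimal config.yml using these keys might look like the following (values illustrative; each ${...} reference is substituted from the environment at load time):

```yaml
server:
  port: 3000
  adminKey: ${ADMIN_API_KEY}     # substituted from the environment

provider:
  apiKey: ${OPENAI_API_KEY}

rateLimits:
  default:
    rpm: 60
    tpm: 100000

budgets:
  defaultDaily: 10.00
  defaultMonthly: 100.00

cache:
  enabled: true
  defaultTtlSeconds: 3600
```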

Rate limit overrides

Match API keys by name pattern:

rateLimits:
  overrides:
    - keyPattern: "premium-*"
      rpm: 120
      tpm: 500000
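
The glob-style keyPattern matching can be sketched like this (hypothetical helper names, not the project's actual code):

```typescript
interface RateLimit { rpm: number; tpm: number; }

// Convert a glob like "premium-*" into an anchored RegExp:
// "*" matches any run of characters; everything else is literal.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

// Pick the first matching override, falling back to the defaults.
function resolveLimits(
  keyName: string,
  defaults: RateLimit,
  overrides: Array<{ keyPattern: string } & RateLimit>,
): RateLimit {
  const hit = overrides.find((o) => globToRegExp(o.keyPattern).test(keyName));
  return hit ? { rpm: hit.rpm, tpm: hit.tpm } : defaults;
}
```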

Budget alert thresholds

budgets:
  alertThresholds:
    - percent: 80
      action: warn        # X-Budget-Warning header
    - percent: 95
      action: downgrade   # Switch to cheaper model (if enabled)
    - percent: 100
      action: block       # Reject request (402)
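
Threshold evaluation amounts to picking the action for the highest threshold the current spend has crossed. A sketch under that assumption (names hypothetical):

```typescript
type Action = "warn" | "downgrade" | "block";
interface Threshold { percent: number; action: Action; }

// Return the action for the highest crossed threshold, or null if spend
// is below every threshold.
function budgetAction(
  spentUsd: number,
  limitUsd: number,
  thresholds: Threshold[],
): Action | null {
  const pct = (spentUsd / limitUsd) * 100;
  const crossed = thresholds
    .filter((t) => pct >= t.percent)
    .sort((a, b) => b.percent - a.percent);
  return crossed.length > 0 ? crossed[0].action : null;
}
```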

Model downgrade rules

Disabled by default; opt in via config:

modelDowngrade:
  enabled: true
  rules:
    - from: "gpt-4o"
      to: "gpt-4o-mini"

When triggered, the response includes X-Model-Downgraded: true and X-Original-Model headers.
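
Applying a downgrade rule and reporting it via the documented headers could look like this sketch (function name and shapes are hypothetical):

```typescript
interface DowngradeRule { from: string; to: string; }

// If a rule matches the requested model, rewrite it to the cheaper model and
// surface the original via X-Model-Downgraded / X-Original-Model headers.
function applyDowngrade(
  model: string,
  rules: DowngradeRule[],
): { model: string; headers: Record<string, string> } {
  const rule = rules.find((r) => r.from === model);
  if (!rule) return { model, headers: {} };
  return {
    model: rule.to,
    headers: { "X-Model-Downgraded": "true", "X-Original-Model": model },
  };
}
```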

Pricing manifest

Model pricing is in config/pricing.yml. Update this file when OpenAI changes pricing.

version: "2026-03-14"
provider: openai
models:
  gpt-4o:
    inputPer1k: 0.0025
    outputPer1k: 0.01
    cachedInputPer1k: 0.00125
    maxOutputTokens: 16384
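
Cost for a request is tokens divided by 1000, multiplied by the per-1k rate, summed across input and output; cached input tokens (when the provider reports them) bill at the cheaper rate. A sketch of that arithmetic (function name hypothetical):

```typescript
interface ModelPricing {
  inputPer1k: number;
  outputPer1k: number;
  cachedInputPer1k?: number;
}

// Cost = (tokens / 1000) × per-1k rate, summed across fresh input,
// cached input, and output.
function requestCost(
  p: ModelPricing,
  inputTokens: number,
  outputTokens: number,
  cachedInputTokens = 0,
): number {
  const freshInput = inputTokens - cachedInputTokens;
  return (
    (freshInput / 1000) * p.inputPer1k +
    (cachedInputTokens / 1000) * (p.cachedInputPer1k ?? p.inputPer1k) +
    (outputTokens / 1000) * p.outputPer1k
  );
}
```

With the gpt-4o rates above, 1000 input tokens plus 500 output tokens comes to 0.0025 + 0.005 = 0.0075 USD.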

Response headers

Every proxied response includes:

| Header | Description |
|--------|-------------|
| X-Request-Cost | Actual cost of this request in USD |
| X-Estimated-Cost | Pre-flight estimated worst-case cost |
| X-Input-Tokens | Input token count |
| X-Tokens-Used | Total tokens (input + output) |
| X-Budget-Remaining | Remaining budget in USD |
| X-Budget-Period | Budget period (daily/monthly) |
| X-Budget-Warning | Set when approaching budget limit |
| X-Cache | HIT or MISS |
| X-Model-Downgraded | true if model was downgraded |
| X-RateLimit-Limit-RPM | RPM limit for this key |
| X-RateLimit-Remaining-RPM | Remaining RPM |
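
Server-side, assembling these headers from a request's accounting record might look like this sketch (the record shape and function name are hypothetical):

```typescript
interface Usage {
  cost: number;
  inputTokens: number;
  totalTokens: number;
  remaining: number;
  period: "daily" | "monthly";
  cacheHit: boolean;
}

// Build the per-request header map from the accounting record.
function budgetHeaders(u: Usage): Record<string, string> {
  return {
    "X-Request-Cost": u.cost.toFixed(6),
    "X-Input-Tokens": String(u.inputTokens),
    "X-Tokens-Used": String(u.totalTokens),
    "X-Budget-Remaining": u.remaining.toFixed(2),
    "X-Budget-Period": u.period,
    "X-Cache": u.cacheHit ? "HIT" : "MISS",
  };
}
```

On the client side, `curl -i` (or `-D -`) will print these headers alongside the response body.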

Dashboard

Open http://localhost:3000/dashboard and enter your admin API key. The dashboard shows:

  • Cost by API key (bar chart)
  • Cost over time (line chart)
  • Budget status (doughnut chart)
  • Recent requests table

Alerting

Configure a webhook URL to receive budget notifications:

alerts:
  webhookUrl: "https://hooks.slack.com/services/xxx/yyy/zzz"
  events:
    - budgetWarning
    - budgetExceeded

Alerts are debounced (same event + key fires at most once per hour).
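
The debounce rule can be sketched as follows (hypothetical helper, not the project's code): track the last fire time per (event, key) pair and suppress repeats within the window.

```typescript
const DEBOUNCE_MS = 60 * 60 * 1000; // one hour
const lastFired = new Map<string, number>();

// Fire at most once per window for each (event, key) pair.
function shouldFire(event: string, keyName: string, now = Date.now()): boolean {
  const id = `${event}:${keyName}`;
  const prev = lastFired.get(id);
  if (prev !== undefined && now - prev < DEBOUNCE_MS) return false;
  lastFired.set(id, now);
  return true;
}
```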

How it compares to LiteLLM

LiteLLM (~39k stars) is a mature, full-featured LLMOps platform with 100+ provider integrations, virtual keys, per-key budgets, load balancing, guardrails, and a Postgres-backed dashboard.

llm-budget-proxy is deliberately simpler:

|  | LiteLLM | llm-budget-proxy |
|--|---------|------------------|
| Providers | 100+ | OpenAI only (MVP) |
| Database | Postgres/Redis | SQLite |
| Deployment | Multi-service | Single container |
| Setup time | ~30 min | ~5 min |
| Dashboard | Full admin UI | Single-page Chart.js |
| Use case | Enterprise, multi-provider | Dev/staging, single-provider, learning |

Use LiteLLM when you need enterprise scale, multi-provider support, or a full observability platform.

Use llm-budget-proxy when you want a lightweight, self-contained proxy you can understand, modify, and deploy in minutes.

Limitations

  • Single-instance only — SQLite does not support multi-node deployment. For horizontal scaling, migrate to Postgres or Redis.
  • OpenAI only — This MVP proxies OpenAI's /v1/chat/completions endpoint. Anthropic support is a documented future extension.
  • Estimated cost — Pre-flight cost checks use estimated input tokens + worst-case output ceiling. Actual cost is recorded after the response completes.
  • No semantic caching — Cache uses exact request-body matching only. Semantic similarity caching requires embeddings and vector search, which is out of scope.
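
The pre-flight estimate described above can be sketched like this (chars/4 is a common rough heuristic for English-text token counts; function name hypothetical):

```typescript
// Worst case = estimated input tokens plus the model's full output ceiling,
// both priced at the manifest's per-1k rates.
function estimateWorstCaseCost(
  promptChars: number,
  maxOutputTokens: number,
  inputPer1k: number,
  outputPer1k: number,
): number {
  const estInputTokens = Math.ceil(promptChars / 4); // crude heuristic
  return (
    (estInputTokens / 1000) * inputPer1k +
    (maxOutputTokens / 1000) * outputPer1k
  );
}
```

Because the output ceiling dominates, the pre-flight figure is deliberately pessimistic; the actual X-Request-Cost recorded afterwards is usually much lower.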

Development

npm install
npm run dev         # Start with hot reload
npm test            # Run tests
npm run test:watch  # Watch mode
npm run build       # Compile TypeScript

License

MIT — AGR Group