FreeLLMAPI

One OpenAI-compatible endpoint. Eleven free LLM providers. ~1B+ tokens per month.

Aggregate the free tiers from Google, Groq, Cerebras, SambaNova, NVIDIA, Mistral, OpenRouter, GitHub Models, Cohere, Cloudflare, and Z.ai (Zhipu) behind a single /v1/chat/completions endpoint. Keys are stored encrypted. A Thompson-sampling bandit router picks the best available model for each request, learning from success rate and speed across providers. It fails over to the next model when one is rate-limited, and it tracks per-key usage so you stay under every free-tier cap.

Why this exists

Every serious AI lab now offers a free tier — a few million tokens a month, a few thousand requests a day. On its own each tier is a toy. Stacked together, they add up to roughly 1.3 billion tokens per month of working inference capacity, across dozens of models from small-and-fast to reasonably capable.

The problem is that stacking them by hand is painful: eleven different SDKs, eleven different rate limits, eleven places a request can fail. FreeLLMAPI collapses that into one OpenAI-compatible endpoint. Point any OpenAI client library at your local server, and it routes transparently across whichever providers you've added keys for.

Supported providers

Provider	Models
Google	Gemini 2.5 Flash · 3.x previews
Groq	Llama 3.3, Llama 4, GPT-OSS, Qwen3
Cerebras	Qwen3 235B
SambaNova	DeepSeek V3.x · Llama 4 · Gemma 3
Mistral	Large 3 · Medium 3.5 · Codestral · Devstral
OpenRouter	19 free-tier models
GitHub Models	GPT-4.1 · GPT-4o
Cloudflare	Kimi K2 · GLM-4.7 · GPT-OSS · Granite 4
Cohere	Command R+ · Command-A (trial)
Z.ai (Zhipu)	GLM-4.5 · GLM-4.7 Flash
NVIDIA NIM	(disabled by default)

Adding another? See Contributing.

Features

OpenAI-compatible — POST /v1/chat/completions and GET /v1/models work with the official OpenAI SDKs and any OpenAI-compatible client (LangChain, LlamaIndex, Continue, Hermes, etc.). Just change base_url.
Streaming and non-streaming — Server-Sent Events for stream: true, JSON response otherwise. Every provider adapter implements both.
Tool calling — OpenAI-style tools / tool_choice requests are passed through, and assistant tool_calls + tool role follow-up messages round-trip across providers.
Thompson-sampling router — Each request draws a score from each model's Beta posterior (Beta(successes + 2, failures + 2)), adds a normalized tok/s speed term, and subtracts any active rate-limit penalty. The stochastic draw means better models win more often without locking out unproven ones — exploration is automatic and proportional to uncertainty. The dashboard shows the deterministic Bayesian mean of the same posterior for human readability. Rate-limit penalties are model-scoped but only applied once all keys for that model are exhausted — a single key hitting a 429 does not down-rank the model if other keys remain available.
Two routing modes — freellmapi/auto (default) balances speed, reliability, and intelligence. freellmapi/auto-smart prioritizes model capability (60% intelligence weight) over raw speed — better for complex reasoning tasks where you want the smartest available model even if it streams more slowly.
Automatic failover — If the chosen provider returns a 429 or 5xx, or times out, the router skips it, puts the key on a short cooldown, and retries on the next model in your fallback chain (up to 20 attempts).
Per-key rate tracking — RPM, RPD, TPM, and TPD counters per (platform, model, key) so the router always picks a key that's under its caps.
Sticky sessions — Multi-turn conversations keep talking to the same model for 30 minutes to avoid the hallucination spike that comes from mid-conversation model switches.
Encrypted key storage — API keys are encrypted with AES-256-GCM before hitting SQLite; decryption happens in-memory just before a request.
Unified API key — Clients authenticate to your proxy with a single freellmapi-… bearer token. You never expose upstream provider keys to your apps.
Two-key auth — Dashboard routes (/api/*) require a separate ADMIN_DASHBOARD_KEY bearer token; proxy routes (/v1/*) require the unified key. The two keys cannot cross routes. /api/ping is the only public endpoint.
Hardened production mode — Helmet CSP/HSTS headers enabled, CORS locked to configured origins, generic 500 messages (no stack traces), sensitive request/response logging opt-in only (LOG_SENSITIVE_DATA=true).
Health checks — Periodic probes mark keys as healthy, rate_limited, invalid, or error so the router skips dead ones automatically.
Admin dashboard — React + Vite UI to manage keys, inspect live routing stats, browse analytics, and run prompts in a playground. Dark mode included.
Analytics — Per-request logging with latency, token counts, success rate, and per-provider breakdowns.
Deploys to a Raspberry Pi — Runs happily on a Pi 4 under PM2 behind nginx. ~40 MB RSS at idle.
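
As an illustration of the per-key rate tracking described above, here is a minimal Python sketch of RPM/RPD/TPM/TPD counters keyed by (platform, model, key). It is not the project's implementation (that lives in the TypeScript server). The RateLedger class, its method names, the sliding-window approach, and the example model name are all assumptions made for this sketch.

```python
import time
from collections import defaultdict, deque

DAY_S = 86400  # longest window tracked (RPD / TPD)


class RateLedger:
    """Hypothetical sliding-window counters keyed by (platform, model, key_id)."""

    def __init__(self):
        # Each value is a deque of (timestamp, tokens) events, oldest first.
        self._events = defaultdict(deque)

    def record(self, platform, model, key_id, tokens, now=None):
        """Log one completed request and the tokens it consumed."""
        now = time.time() if now is None else now
        self._events[(platform, model, key_id)].append((now, tokens))

    def usage(self, platform, model, key_id, window_s, now=None):
        """Return (requests, tokens) seen in the last window_s seconds."""
        now = time.time() if now is None else now
        events = self._events[(platform, model, key_id)]
        # Events older than a day can never count again, so drop them.
        while events and events[0][0] <= now - DAY_S:
            events.popleft()
        recent = [(t, n) for t, n in events if t > now - window_s]
        return len(recent), sum(n for _, n in recent)

    def under_caps(self, platform, model, key_id, caps, now=None):
        """caps maps any of 'rpm', 'tpm', 'rpd', 'tpd' to a limit."""
        rpm, tpm = self.usage(platform, model, key_id, 60, now)
        rpd, tpd = self.usage(platform, model, key_id, DAY_S, now)
        used = {"rpm": rpm, "tpm": tpm, "rpd": rpd, "tpd": tpd}
        return all(used[name] < limit for name, limit in caps.items())


ledger = RateLedger()
ledger.record("groq", "example-model", "key-1", tokens=100, now=1000.0)
ledger.record("groq", "example-model", "key-1", tokens=200, now=1030.0)
print(ledger.under_caps("groq", "example-model", "key-1", {"rpm": 3}, now=1050.0))
```

A router built this way would call under_caps before choosing a key and record after each response. Whether the real ledger uses sliding or fixed windows is not stated here, so treat the window logic as one plausible choice.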

Not yet supported

The scope is deliberately narrow. If a feature isn't in the Features list above and isn't listed below, assume it isn't there yet.

Embeddings (/v1/embeddings)
Image generation (/v1/images/*)
Audio / speech (/v1/audio/*)
Vision / multimodal inputs — message content is text-only
Legacy completions (/v1/completions) — only the chat endpoint is implemented
Moderation (/v1/moderations)
n > 1 (multiple completions per request)
Per-user billing / multi-tenant auth — single-user by design

PRs that add any of these are very welcome. See Contributing.

Quick start

Prerequisites: Node.js 22, pnpm (or use Volta — versions are pinned in package.json).

git clone https://github.com/tashfeenahmed/freellmapi.git
cd freellmapi
pnpm install

# Create .env from the template
cp .env.example .env

# Generate an encryption key for at-rest key storage, plus an admin dashboard key
echo "ENCRYPTION_KEY=$(node -e "console.log(require('crypto').randomBytes(32).toString('hex'))")" >> .env
echo "ADMIN_DASHBOARD_KEY=$(node -e "console.log('freellmapi-admin-' + require('crypto').randomBytes(32).toString('hex'))")" >> .env

# Start server + dashboard together
pnpm dev

Open http://localhost:5173 (the Vite dev UI), add your provider keys on the Keys page, and grab your unified API key from the Keys page header. That unified key is what your OpenAI SDK authenticates with.

For a production build:

pnpm build
node server/dist/index.js     # server + dashboard both served on :3001

For production, set ADMIN_DASHBOARD_KEY in .env and keep it private. The dashboard prompts for this key on first load and stores it in browser local storage to authenticate /api/* calls. /v1/* clients use the separate unified freellmapi-… key shown on the Keys page — the two keys cannot cross routes.
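
To make the "two keys cannot cross routes" rule concrete, here is a small Python sketch of the check. It is an illustration only: the real server is Express/TypeScript, and the authorize function, its parameters, and the /api/keys path used in the example are hypothetical. It uses a constant-time comparison, in the spirit of the timing-safe key compare credited under Contributors.

```python
import hmac


def authorize(path: str, bearer: str, unified_key: str, admin_key: str) -> bool:
    """Return True if the bearer token is valid for the given route."""
    if path == "/api/ping":
        return True  # the only public endpoint
    if path.startswith("/api/"):
        expected = admin_key  # dashboard routes take the admin key only
    elif path.startswith("/v1/"):
        expected = unified_key  # proxy routes take the unified key only
    else:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(bearer.encode(), expected.encode())


unified = "freellmapi-example-unified"
admin = "freellmapi-admin-example"
print(authorize("/v1/chat/completions", unified, unified, admin))  # unified key on /v1
print(authorize("/api/keys", unified, unified, admin))  # unified key on /api
```

The first call is accepted and the second is rejected, because each route prefix is checked against exactly one key.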

All .env variables:

Variable	Required	Description
`ENCRYPTION_KEY`	Yes	64-char hex key for AES-256-GCM at-rest key encryption.
`ADMIN_DASHBOARD_KEY`	Yes (prod)	Bearer token for all `/api/*` dashboard routes. Min 24 chars. Omitting it only works in `NODE_ENV=development`.
`ADMIN_CORS_ORIGINS`	No	Comma-separated browser origins allowed to call `/api/*` cross-origin (e.g. `http://localhost:5173`). Same-origin deployments don't need this.
`DISABLE_HSTS`	No	Set `true` to skip HSTS headers — useful when terminating TLS at a reverse proxy.
`LOG_SENSITIVE_DATA`	No	Set `true` to log full request/response bodies. Off by default; never enable in production.
`PORT`	No	Server port (default `3001`).
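
The constraints in the table can be checked before starting the server. The following Python sketch is a hypothetical pre-flight helper, not something shipped with the project: the check_env name and the error strings are made up here, and the rules are taken only from the table above.

```python
import re


def check_env(env: dict) -> list:
    """Return a list of problems found in a dict of .env values."""
    problems = []

    # ENCRYPTION_KEY: 64 hex characters (32 bytes for AES-256).
    if not re.fullmatch(r"[0-9a-fA-F]{64}", env.get("ENCRYPTION_KEY", "")):
        problems.append("ENCRYPTION_KEY must be 64 hex characters")

    # ADMIN_DASHBOARD_KEY: at least 24 chars; may be omitted only in development.
    admin = env.get("ADMIN_DASHBOARD_KEY")
    if admin is None:
        if env.get("NODE_ENV") != "development":
            problems.append("ADMIN_DASHBOARD_KEY is required outside development")
    elif len(admin) < 24:
        problems.append("ADMIN_DASHBOARD_KEY must be at least 24 characters")

    # PORT: optional, defaults to 3001.
    port = env.get("PORT", "3001")
    if not port.isdigit() or not 1 <= int(port) <= 65535:
        problems.append("PORT must be a number between 1 and 65535")

    return problems


print(check_env({"ENCRYPTION_KEY": "ab" * 32, "ADMIN_DASHBOARD_KEY": "x" * 24}))
```

An empty list means the values satisfy the documented constraints. It says nothing about whether the keys are the ones the database was encrypted with.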

Using the API

Any OpenAI-compatible client works. Examples:

Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3001/v1",
    api_key="freellmapi-your-unified-key",
)

# with_raw_response exposes the HTTP headers alongside the parsed completion
raw = client.chat.completions.with_raw_response.create(
    model="freellmapi/auto",  # let the router pick; or specify e.g. "gemini-2.5-flash"
    messages=[{"role": "user", "content": "Summarise the fall of Rome in one sentence."}],
)
resp = raw.parse()
print(resp.choices[0].message.content)
print("Routed via:", raw.headers.get("x-routed-via"))

Choosing a routing mode

messages = [{"role": "user", "content": "Explain Thompson sampling in two sentences."}]

# Balanced (default): optimizes for speed, reliability, and basic capability
client.chat.completions.create(model="freellmapi/auto", messages=messages)

# Smart: prioritizes intelligence (60% weight); better for reasoning, coding, analysis
# May trade speed for capability, preferring models like Gemini 2.5 Pro or GPT-4o
client.chat.completions.create(model="freellmapi/auto-smart", messages=messages)

curl

curl http://localhost:3001/v1/chat/completions \
  -H "Authorization: Bearer freellmapi-your-unified-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "freellmapi/auto",
    "messages": [{"role": "user", "content": "hi"}]
  }'

Streaming

stream = client.chat.completions.create(
    model="freellmapi/auto",
    messages=[{"role": "user", "content": "Stream me a haiku about SQLite."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices:  # defensive: skip any chunk that carries no choices
        continue
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Tool calling

Pass OpenAI-style tools and tool_choice; the assistant response round-trips back through the proxy exactly like the OpenAI API. Multi-step flows (assistant tool_calls → tool role follow-up → final answer) work across every provider the router can reach.

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# 1. Model asks for a tool call
first = client.chat.completions.create(
    model="freellmapi/auto",
    messages=[{"role": "user", "content": "What's the weather in Karachi?"}],
    tools=tools,
    tool_choice="required",
)
call = first.choices[0].message.tool_calls[0]

# 2. You execute the tool, feed the result back
final = client.chat.completions.create(
    model="freellmapi/auto",
    messages=[
        {"role": "user", "content": "What's the weather in Karachi?"},
        first.choices[0].message,
        {"role": "tool", "tool_call_id": call.id, "content": '{"temp_c": 32, "cond": "sunny"}'},
    ],
    tools=tools,
)
print(final.choices[0].message.content)

Works with stream=True as well: you'll get delta.tool_calls chunks followed by a closing chunk whose finish_reason is "tool_calls". Under the hood, OpenAI-compatible providers (Groq, Cerebras, SambaNova, Mistral, OpenRouter, GitHub Models, Cloudflare, Cohere compat) get the request passed through; Gemini requests get translated into Google's functionDeclarations / functionResponse shape and the response is translated back.
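
When streaming, tool calls arrive in fragments that the client has to stitch together. Here is a minimal sketch of that step, assuming the fragments follow the usual OpenAI streaming shape (an index on every fragment, the id and function name on the first, and the arguments string split across chunks). The merge_tool_call_deltas helper is hypothetical, and it works on plain dicts so it can run without a server.

```python
def merge_tool_call_deltas(deltas):
    """Merge streamed tool-call fragments into complete calls.

    deltas: iterable with one entry per chunk; each entry is a list of
    fragment dicts (or None when the chunk carried no tool calls).
    """
    calls = {}
    for fragments in deltas:
        for frag in fragments or []:
            slot = calls.setdefault(
                frag["index"], {"id": None, "name": "", "arguments": ""}
            )
            if frag.get("id"):
                slot["id"] = frag["id"]
            fn = frag.get("function") or {}
            # Name and arguments are concatenated in arrival order.
            slot["name"] += fn.get("name") or ""
            slot["arguments"] += fn.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]


chunks = [
    [{"index": 0, "id": "call_1", "function": {"name": "get_weather", "arguments": ""}}],
    [{"index": 0, "function": {"arguments": '{"city": '}}],
    [{"index": 0, "function": {"arguments": '"Karachi"}'}}],
    None,  # e.g. the closing chunk with finish_reason "tool_calls"
]
print(merge_tool_call_deltas(chunks))
```

With the OpenAI SDK, each entry would come from chunk.choices[0].delta.tool_calls, converted to plain dicts first. The merged arguments string is JSON and should be parsed only after the stream has finished.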

Every response carries an X-Routed-Via: <platform>/<model> header so you can see which provider actually served each call. If a request failed over between providers, you'll also see X-Fallback-Attempts: N.
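
A small helper can turn those headers into structured values for logging. This is a sketch: parse_routing_headers is a hypothetical name, the "openrouter" platform label in the example is an assumption, and the split happens on the first slash only because some model IDs (such as OpenRouter's) contain slashes themselves.

```python
def parse_routing_headers(headers) -> dict:
    """Extract platform, model, and fallback count from response headers."""
    # Header names are case-insensitive, so normalise before looking them up.
    lowered = {k.lower(): v for k, v in headers.items()}
    platform, _, model = lowered.get("x-routed-via", "").partition("/")
    attempts = lowered.get("x-fallback-attempts")
    return {
        "platform": platform or None,
        "model": model or None,
        # The header is only present when a fallback happened.
        "fallback_attempts": int(attempts) if attempts is not None else 0,
    }


info = parse_routing_headers({
    "X-Routed-Via": "openrouter/minimax/minimax-m2.5:free",
    "X-Fallback-Attempts": "2",
})
print(info)
```

With the Python SDK, the headers object returned by with_raw_response can be passed in, since it supports items().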

Screenshots

Keys

Manage provider credentials and grab the unified API key your apps connect with. Each key shows a status dot and when it was last health-checked.

Playground

Send a chat completion through the router and see which provider served it, with the model ID and latency printed right on the message.

Analytics

Request volume, success rate, tokens in and out, average latency, and per-provider breakdowns over 24h / 7d / 30d windows.

How it works

┌──────────────────┐   Bearer freellmapi-…   ┌─────────────────────────┐
│  OpenAI SDK /    │ ──────────────────────▶ │  Express proxy (:3001)  │
│  curl / any      │ ◀────────────────────── │  /v1/chat/completions   │
│  OpenAI client   │      streamed tokens    └────────────┬────────────┘
└──────────────────┘                                      │
                                                          ▼
                             ┌──────────────────────────────────────────────────────┐
                             │  Router (Thompson-sampling bandit)                   │
                             │   1. For each enabled model, sample a score:         │
                             │        score = Beta(wins+2, losses+2) sample         │
                             │              + INTELLIGENCE_WEIGHT × normalized rank │
                             │              + SPEED_WEIGHT × (tok/s / max tok/s)    │
                             │              + TTFB_WEIGHT × ttfb_score              │
                             │              - slow-model penalty (if < 10 tok/s)    │
                             │              - rate-limit penalty × 0.05             │
                             │      (balanced: intelligence 10%, smart mode: 60%)   │
                             │   2. Sort descending; sticky session pins preferred. │
                             │   3. First model with a healthy, under-limit key     │
                             │      wins; decrypt key, call provider SDK.           │
                             │   4. On 429/5xx → key cooldown + retry next key.     │
                             │      Model penalty only fires when all keys for      │
                             │      that model are exhausted by 429s.               │
                             └──────────────────────────────────────────────────────┘
                                          │
   ┌──────────────┬────────────┬──────────┴─────────┬─────────────┬──────────┐
   ▼              ▼            ▼                    ▼             ▼          ▼
 Google         Groq        Cerebras           OpenRouter      Mistral    …6 more
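
The score in step 1 of the diagram can be sketched in a few lines of Python. This is an illustration of the formula as drawn, not the code in router.ts. Only the intelligence weights (10% balanced, 60% smart), the Beta(wins+2, losses+2) prior, the 10 tok/s threshold, and the 0.05 rate-limit factor come from this README. The speed, TTFB, and slow-penalty values below are placeholders, and all function names are hypothetical.

```python
import random

# "speed", "ttfb" and "slow_penalty" are placeholder values for this sketch.
BALANCED = {"intelligence": 0.10, "speed": 0.20, "ttfb": 0.10, "slow_penalty": 0.20}
SMART = {**BALANCED, "intelligence": 0.60}


def sample_score(wins, losses, intelligence_rank, tok_s, max_tok_s,
                 ttfb_score, rate_limit_penalty, weights, rng=random):
    """Draw one Thompson-sampling score for a model."""
    # Stochastic part: a draw from the model's Beta posterior over success rate.
    score = rng.betavariate(wins + 2, losses + 2)
    score += weights["intelligence"] * intelligence_rank  # normalized to 0..1
    score += weights["speed"] * (tok_s / max_tok_s if max_tok_s else 0.0)
    score += weights["ttfb"] * ttfb_score
    if tok_s < 10:
        score -= weights["slow_penalty"]
    score -= rate_limit_penalty * 0.05
    return score


def posterior_mean(wins, losses):
    """Deterministic mean of the same posterior, as shown on the dashboard."""
    return (wins + 2) / (wins + losses + 4)


rng = random.Random(0)
reliable = sample_score(90, 10, 0.5, 50, 100, 0.5, 0, BALANCED, rng)
unproven = sample_score(0, 0, 0.5, 50, 100, 0.5, 0, BALANCED, rng)
print(round(posterior_mean(90, 10), 3), round(posterior_mean(0, 0), 3))
```

Because the unproven model's posterior is wide, its draws occasionally beat the reliable model's, which is where the automatic exploration comes from. As evidence accumulates the posterior narrows and the draws settle near the mean.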

Router (server/src/services/router.ts) — Thompson-sampling multi-armed bandit. Samples from each model's Beta posterior over success rate, adds a normalized tok/s speed reward (models below 10 tok/s receive an active penalty), and subtracts a time-decaying rate-limit penalty for recent 429s. The bandit penalty is model-scoped and fires only when all keys for a model are exhausted by 429s in the current retry loop — a single key rate-limiting does not demote the model if other keys remain. Stochastic selection means the router naturally explores new models while converging on faster, more reliable ones as data accumulates.
Rate-limit ledger (server/src/services/ratelimit.ts) — in-memory RPM/RPD/TPM/TPD counters backed by SQLite, with cooldowns on 429s.
Provider adapters (server/src/providers/*.ts) — one file per provider, implementing the Provider base class: chatCompletion() and streamChatCompletion().
Health service (server/src/services/health.ts) — periodic probe keeps key status fresh.
Dashboard (client/) — React + Vite + shadcn/ui admin surface.
Storage — SQLite (better-sqlite3) with AES-256-GCM envelope encryption for keys.
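
Steps 3 and 4 of the router (try keys in order, cool down a failing key, and penalize a model only when every one of its keys was rate-limited) can be sketched as follows. This is a simplified Python illustration, not the TypeScript implementation. UpstreamError, route, and the keys_for / call callbacks are hypothetical names, and timeouts and time-based cooldown expiry are left out.

```python
class UpstreamError(Exception):
    """Raised by the (hypothetical) provider call with the upstream HTTP status."""

    def __init__(self, status):
        super().__init__(f"upstream returned {status}")
        self.status = status


def route(ranked_models, keys_for, call, max_attempts=20):
    """Try models in ranked order, moving to the next key on 429/5xx."""
    attempts = 0
    cooling = set()   # keys put on cooldown during this request
    penalized = []    # models whose keys were all exhausted by 429s
    for model in ranked_models:
        keys = [k for k in keys_for(model) if k not in cooling]
        rate_limited = 0
        for key in keys:
            if attempts == max_attempts:
                raise RuntimeError("gave up after max_attempts")
            attempts += 1
            try:
                result = call(model, key)
            except UpstreamError as err:
                cooling.add(key)
                if err.status == 429:
                    rate_limited += 1
                continue
            return {"result": result,
                    "fallback_attempts": attempts - 1,
                    "penalized": penalized}
        # A single 429 does not demote the model; only full exhaustion does.
        if keys and rate_limited == len(keys):
            penalized.append(model)
    raise RuntimeError("no healthy key left for any model")


keys = {"model-a": ["k1", "k2"], "model-b": ["k3"]}


def always_limited_on_a(model, key):
    if model == "model-a":
        raise UpstreamError(429)
    return f"{model} via {key}"


print(route(["model-a", "model-b"], keys.get, always_limited_on_a))
```

In this run both keys for model-a return 429, so model-a is penalized and the request is served by model-b after two fallback attempts.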

Limitations

Stacking free tiers has real trade-offs. Be honest with yourself about them:

No frontier models. The free-tier catalog tops out around Llama 3.3 70B, GLM-4.5, Qwen 3 Coder, and Gemini 2.5 Pro. You will not get GPT-5 or Claude Opus class reasoning through this. For hard problems, pay for a real API.
Intelligence degrades as the day progresses. Your top-ranked models (usually Gemini 2.5 Pro, GPT-4o via GitHub Models) have the lowest daily caps. Once they hit their limits, the router falls down your priority chain to smaller/weaker models. Expect the effective intelligence of the endpoint to drop in the late hours of each day — then reset at UTC midnight.
Latency is highly variable. Cerebras and Groq are extremely fast; others are not. You get whichever one is available.
Free tiers can change without notice. Providers regularly tighten, loosen, or remove free tiers. When that happens you'll see 429s or auth errors until you update the catalog. Re-seed scripts live in server/src/scripts/.
No SLA, by definition. If you need reliability, use a paid provider with a contract.
Local-first. There's no multi-tenant auth. Run this for yourself; don't expose it to the internet.

Contributing

Contributors very welcome! Good first PRs:

Add a provider — copy server/src/providers/openai-compat.ts as a template, wire it into server/src/providers/index.ts, seed its models in server/src/db/index.ts, add a test in server/src/__tests__/providers/.
Add an endpoint — embeddings, images, moderations. The provider base class can grow new methods; adapters declare which they support.
Improve the router — cost-aware routing (cheapest-healthy-fastest tradeoffs), better latency-weighted priority, regional pinning.
Dashboard polish — charts on the Analytics page, key rotation UX, batch import of keys from .env.
Docs — more examples, client library snippets for Go/Rust/etc., a deployment recipe for Docker or Fly.

Development loop:

pnpm install
pnpm dev         # server on :3001, dashboard on :5173, both with HMR
pnpm test        # vitest — 75 tests across providers, routes, router, ratelimit

PRs should include a test, keep the existing test suite green, and match the .editorconfig / tsconfig defaults already in the repo. Issues and discussions are open.

Contributors

Thanks to everyone who's helped improve FreeLLMAPI:

@moaaz12-web — tool-calling support across providers (#3)
@lukasulc — better-sqlite3 bump to fix npm install on Node 24+ (#12)
@VinhPhamAI — root .env PORT now propagates to server + Vite dev proxy + UI base URL (#27)
@deadc — preserve Gemini thoughtSignature so multi-turn function calling stops 400-ing (#32); router model-first key-exhaustion tests + per-model limits hoist (#42)
@zhangyu1324 — requested Ollama Cloud integration, now in the V10 catalog (#14 / #41)
@jtbrennan-git — security review (#35) and Phase 1 hardening: parameterized analytics queries, sort-preset whitelist, timing-safe API key compare, mid-stream error sanitization
@praveenkumarpranjal — guard Gemini SSE JSON.parse so a malformed frame no longer aborts the whole stream, plus first streaming tests for the Google provider (#47)

Terms of Service review

A self-hosted, single-user, personal-use setup was re-reviewed against each provider's ToS (May 2026). Summary:

Provider	Verdict	Notes
Google Gemini	⚠️ Caution	March 2026 ToS narrows scope to "professional or business purposes, not for consumer use" — a self-hosted developer proxy is still defensible, but the clause is new.
Groq	✅ Likely OK	GroqCloud Services Agreement permits Customer Application integration.
Cerebras	✅ Likely OK	Permitted; explicitly forbids selling/transferring API keys.
Mistral	✅ Likely OK	APIs allowed for personal/internal business use.
OpenRouter	✅ Likely OK	April 2026 ToS sharpens the no-resale / no-competing-service clause; private single-user proxy still fine.
SambaNova	⚠️ Ambiguous	EULA §1.5(c) blocks resale and "service bureau" use; single-user with no third-party access is fine.
Cloudflare Workers AI	⚠️ Ambiguous	No anti-proxy clause; covered by general Self-Serve Subscription Agreement.
NVIDIA NIM	⚠️ Caution	Trial ToS §1.2 / §1.4: "evaluation only, not production." Disabled in default catalog.
GitHub Models	⚠️ Caution	Free tier explicitly scoped to "experimentation" and "prototyping."
Cohere	❌ Avoid	Terms §14 still forbids "personal, family or household purposes."
Zhipu (open.bigmodel.cn)	✅ Likely OK	Personal/non-commercial research carve-out still in the platform docs.
Z.ai (api.z.ai)	⚠️ Caution	New row — Singapore entity (distinct from Zhipu CN). §III.3(l) anti-traffic-redirect clause could plausibly be read against a proxy; no explicit personal-use carve-out.
Ollama Cloud	✅ Likely OK	New row — Free plan permits cloud-model access (1 concurrent, 5-hour session caps). No anti-proxy / anti-resale clauses found. (Integration tracked in #14.)

Rules of thumb that keep most providers happy: one account per provider, no reselling, no sharing your endpoint with other humans, and don't hammer a free tier as if it were a paid production backend. This is informational, not legal advice: read each provider's ToS and make your own call.

Removed since the April 2026 review: Hugging Face, Moonshot, and MiniMax direct integrations were dropped from the catalog (HF — tool-call format issues; Moonshot — moved to paid only; MiniMax — superseded by the OpenRouter minimax/minimax-m2.5:free route).

Disclaimer

This project is for personal experimentation and learning, not production. Free tiers exist so developers can prototype against them; they aren't a stable, supported inference substrate and shouldn't be treated as one. If you build something real on top of FreeLLMAPI, swap in a paid API before you ship. Your relationship with each upstream provider is governed by the terms you accepted when you created your account — those terms still apply when the traffic is proxied through this project, and you're responsible for complying with them.

License

MIT
