This stack runs three michaelfeil/infinity inference services behind an authenticating nginx gateway: two embedders (BAAI/bge-m3, intfloat/multilingual-e5-large) and one reranker (BAAI/bge-reranker-v2-m3). Clients hit a single host:port, present a bearer token, and are routed to the right backend by URL prefix.

The Infinity image lives at michaelf34/infinity on Docker Hub (the GitHub user is michaelfeil, the Docker Hub user is michaelf34).
flowchart LR
client([client])
nginx["nginx :8080"]
authz["authz :8000<br/>verifies bearer token"]
bge["bge-m3 :7997"]
rerank["bge-reranker :7997"]
e5["e5-large :7997<br/>(default)"]
client -- "Authorization: Bearer <client-token>" --> nginx
nginx -. "auth_request /_authz" .-> authz
authz -. "204 ok / 401 deny" .-> nginx
nginx -- "/bge/*" --> bge
nginx -- "/reranker/*" --> rerank
nginx -- "/" --> e5
| Service | Image | Internal port | External surface |
|---|---|---|---|
| nginx | `nginxinc/nginx-unprivileged:alpine` | 8080 | `localhost:<random>` (local) / Traefik (server) |
| authz | `python:3.12-alpine` | 8000 | none (only nginx talks to it) |
| bge-m3 | `michaelf34/infinity:*` | 7997 | none (proxied via `/bge/*`) |
| bge-reranker | `michaelf34/infinity:*` | 7997 | none (proxied via `/reranker/*`) |
| e5-large | `michaelf34/infinity:*` | 7997 | none (proxied via `/`) |
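The prefix routing in the table can be modelled in a few lines of Python. This is only an illustrative sketch: the real routing lives in the nginx configuration, the `route` function and `ROUTES` mapping are hypothetical names, and stripping the prefix before proxying is an assumption inferred from the request paths used elsewhere in this document.

```python
# Hypothetical model of the gateway's prefix routing (not the actual nginx config).
ROUTES = {
    "/bge/": "bge-m3:7997",
    "/reranker/": "bge-reranker:7997",
}
DEFAULT_BACKEND = "e5-large:7997"  # everything else falls through to e5-large


def route(path):
    """Return (backend, upstream_path) for an incoming request path."""
    for prefix, backend in ROUTES.items():
        if path.startswith(prefix):
            # Assumption: the prefix is stripped before the request is proxied.
            return backend, "/" + path[len(prefix):]
    return DEFAULT_BACKEND, path


print(route("/bge/v1/embeddings"))
print(route("/v1/embeddings"))
```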
The model containers share a host bind mount at ./.docker/data/hf_cache (mounted at
/hf-cache, HF_HOME=/hf-cache) for the Hugging Face cache so they don't double-download
tokenizers and configs. The directory is committed empty via .docker/data/.gitignore.
| File | Purpose | Device | Dtype |
|---|---|---|---|
| `docker-compose.yml` | Local dev on a laptop | CPU | float32 |
| `docker-compose.server.yml` | GPU override layered onto the base | CUDA | float16 |
The server file is an override, not a standalone stack — it patches command:, adds the
deploy.resources GPU reservation, drops published ports, and switches the gateway to Traefik
HTTPS routing. Always pass both files when running on the server.
Single-key model, enforced at the gateway:
- Client tokens live in `.docker/authz/api_keys.txt`, one per line: `<token> # <label>`. The `authz` sidecar re-reads the file on every request, so adding or revoking tokens is live; no container restart needed.
- The Infinity upstreams are not configured with their own API key. They're only reachable on the internal `app` Docker network, so nginx is the sole authenticator and the client's `Authorization` header is forwarded upstream as-is.
Missing or unknown client token → authz returns 401 → nginx blocks the request.
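The token check can be pictured with a short Python sketch. This is not the sidecar's actual source: the function names are hypothetical, and the parsing rules (blank and comment-only lines ignored, case-insensitive `Bearer` scheme) are assumptions. Only the `<token> # <label>` line format, the 204/401 verdicts, and the re-read-per-request behaviour come from the description above.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_tokens(text: str) -> Dict[str, str]:
    """Parse '<token> # <label>' lines into {token: label}."""
    tokens = {}
    for line in text.splitlines():
        token, _, label = line.partition("#")
        token = token.strip()
        if token:  # assumption: blank and comment-only lines are skipped
            tokens[token] = label.strip()
    return tokens


def check_header(authorization: Optional[str], tokens: Dict[str, str]) -> int:
    """Return 204 for a known bearer token, 401 otherwise."""
    if not authorization:
        return 401
    scheme, _, token = authorization.partition(" ")
    if scheme.lower() != "bearer" or token.strip() not in tokens:
        return 401
    return 204


def verify(authorization: Optional[str], key_file: Path) -> int:
    """Re-read the key file on every call so edits apply immediately."""
    return check_header(authorization, parse_tokens(key_file.read_text(encoding="utf-8")))
```

Reading the file on every call trades a little I/O for live revocation, which matches the no-restart behaviour described above.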
- Docker Engine ≥ 24 with the Compose v2 plugin
- Server only: NVIDIA driver + `nvidia-container-toolkit` (verify with `docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi`)
- The external `frontend` Docker network must exist: `docker network create frontend`
- An `api_keys.txt` with at least one client token (see First-time setup below). The file must exist before `up -d`, or Docker bind-mounts a directory and every request 500s.
- Disk: ~7 GB for combined model weights under `./.docker/data/hf_cache` (bge-m3 ~2.3 GB, e5-large ~2.2 GB, bge-reranker ~2.3 GB)
- Server VRAM: target deployment is a single 16 GB GPU shared by all three services. Steady-state fp16 footprint is ~9–12 GB; peaks under long bge-m3 inputs (8 k-token context) or large rerank pair batches can edge close to the 16 GB ceiling at the default batch sizes (`bge-m3=16`, `e5-large=32`, `bge-reranker=8`). Raise batch sizes cautiously and watch `nvidia-smi` for the first heavy run; the first lever if it OOMs is dropping `BGE_M3_BATCH_SIZE` to `8`.
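The `api_keys.txt` prerequisite is easy to get wrong, so a small preflight check may help. The sketch below is a hypothetical helper, not part of the repository; it only distinguishes the three failure cases implied above (path is a directory, file missing, file has no tokens).

```python
from pathlib import Path


def check_key_file(path: Path) -> str:
    """Return 'ok' or a short description of what is wrong with the key file."""
    if path.is_dir():
        return "is a directory (possibly created by Docker on an earlier start)"
    if not path.is_file():
        return "missing"
    # A line counts as a token if anything precedes the optional '# label' part.
    has_token = any(
        line.split("#", 1)[0].strip()
        for line in path.read_text(encoding="utf-8").splitlines()
    )
    return "ok" if has_token else "contains no tokens"


if __name__ == "__main__":
    print(check_key_file(Path(".docker/authz/api_keys.txt")))
```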
cp .env.example .env
$EDITOR .env # set COMPOSE_DOMAIN, COMPOSE_SERVER_DOMAIN, INFINITY_IMAGE_VERSION, etc.
cp .docker/authz/api_keys.txt.example .docker/authz/api_keys.txt
echo "tok_$(openssl rand -hex 16) # alice <alice@example.com>" >> .docker/authz/api_keys.txt

docker compose pull
docker compose up -d
docker compose logs -f

CPU inference is slow on these models (seconds per request, not milliseconds). It's a smoke-test path: fine for wiring up clients, not for load.
The nginx port is published as "8080" without a host binding, so Docker assigns a random host
port:
docker compose port nginx 8080 # e.g. 0.0.0.0:49154

Use that, or point a local Traefik at ${COMPOSE_DOMAIN}.
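If the port lookup needs to happen from a script, the mapping string can be parsed as sketched below. The helper names are hypothetical. Splitting on the last colon is a defensive choice so that a bracketed IPv6 form such as `[::]:49154` would also parse, should the command ever print one.

```python
def host_port(mapping: str) -> int:
    """Extract the port from `docker compose port` output such as '0.0.0.0:49154'."""
    # rpartition splits on the last colon, so '[::]:49154' is handled too.
    _, _, port = mapping.strip().rpartition(":")
    return int(port)


def base_url(mapping: str) -> str:
    """Build the local gateway URL from the mapping string."""
    return f"http://localhost:{host_port(mapping)}"


print(base_url("0.0.0.0:49154"))
```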
docker compose -f docker-compose.yml -f docker-compose.server.yml pull
docker compose -f docker-compose.yml -f docker-compose.server.yml up -d
docker compose -f docker-compose.yml -f docker-compose.server.yml logs -f

To make the GPU stack the default for every docker compose invocation in this directory, add to `.env`:
COMPOSE_FILE=docker-compose.yml:docker-compose.server.yml
…then docker compose up -d / down / logs work without the -f flags.
On the server every published port is dropped (!reset []) and traffic enters via Traefik on
${COMPOSE_SERVER_DOMAIN} with HTTPS-redirect from web → websecure.
The path prefix selects the model; model: in the request body is informational (Infinity
matches by served-model-name).
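For Python clients, the prefix-plus-path convention can be wrapped in a small request builder. This is a minimal sketch using only the standard library; `embedding_request` is a hypothetical helper, and the host, port, and token in the usage comment are placeholders.

```python
import json
import urllib.request


def embedding_request(host, token, model, texts, prefix=""):
    """Build (but do not send) a POST to <host><prefix>/v1/embeddings."""
    body = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}{prefix}/v1/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Usage (requires the stack to be running; values are placeholders):
# req = embedding_request("http://localhost:49154", "tok_abc123",
#                         "bge-m3", ["hej fra Aarhus"], prefix="/bge")
# with urllib.request.urlopen(req) as resp:
#     payload = json.load(resp)
```

Leaving `prefix` empty targets the default e5-large backend; `prefix="/bge"` targets bge-m3.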
TOKEN="tok_abc123" # from .docker/authz/api_keys.txt
HOST="http://localhost:$(docker compose port nginx 8080 | cut -d: -f2)"
# On the server: HOST="https://${COMPOSE_SERVER_DOMAIN}"
# e5-large (default at /)
curl -sS "$HOST/v1/embeddings" \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{"model":"multilingual-e5-large","input":["query: hvad er en vektor?"]}'
# bge-m3 (under /bge prefix)
curl -sS "$HOST/bge/v1/embeddings" \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{"model":"bge-m3","input":["hej fra Aarhus","hello from Aarhus"]}'

bge-m3 natively produces three representations (dense, sparse, and ColBERT multi-vector). `/v1/embeddings` returns dense only; for the other two, hit Infinity's native routes under the same `/bge/` prefix:
# Sparse / lexical weights (token_id -> weight)
curl -sS "$HOST/bge/embeddings_sparse" \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{"model":"bge-m3","input":["hej fra Aarhus"]}'
# ColBERT / multi-vector (one vector per token)
curl -sS "$HOST/bge/embeddings/colbert" \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{"model":"bge-m3","input":["hej fra Aarhus"]}'

The reranker is a cross-encoder: given a query and a set of candidate documents, it returns a relevance score per document. Its request shape differs from the embedders: no `input` array, no vectors back. The standard pipeline is to recall with bge-m3 / e5-large, then rerank the top-K with this service.
curl -sS "$HOST/reranker/rerank" \
-H "Authorization: Bearer $TOKEN" \
-H 'Content-Type: application/json' \
-d '{
"model": "bge-reranker-v2-m3",
"query": "hvor ligger Aarhus?",
"documents": [
"Aarhus er Danmarks næststørste by.",
"Paris er hovedstaden i Frankrig.",
"Aarhus ligger i Jylland."
],
"top_n": 3
}'

The response is a list of `{index, relevance_score}` entries sorted by score, descending.

Endpoint paths follow the running Infinity version; `docker compose exec bge-m3 wget -qO- http://localhost:7997/openapi.json` shows the live spec.
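On the client side, each rerank entry's `index` points back into the `documents` array that was sent. The sketch below pairs them up; `ranked_documents` is a hypothetical helper, the scores are made-up illustrative values rather than real model output, and it assumes the list of entries has already been extracted from the response body (the exact envelope may depend on the Infinity version).

```python
def ranked_documents(entries, documents):
    """Pair each rerank entry with its source document, best score first."""
    # Sort defensively, even though the service is described as returning sorted entries.
    ordered = sorted(entries, key=lambda e: e["relevance_score"], reverse=True)
    return [(documents[e["index"]], e["relevance_score"]) for e in ordered]


documents = [
    "Aarhus er Danmarks næststørste by.",
    "Paris er hovedstaden i Frankrig.",
    "Aarhus ligger i Jylland.",
]
# Illustrative scores only, not actual reranker output.
entries = [
    {"index": 2, "relevance_score": 0.9},
    {"index": 0, "relevance_score": 0.5},
    {"index": 1, "relevance_score": 0.1},
]
for text, score in ranked_documents(entries, documents):
    print(score, text)
```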
# Status — all five services should be healthy (nginx, authz, bge-m3, e5-large, bge-reranker)
docker compose ps
# Inspect token verdicts (204 ok / 401 deny, with client IP and label)
docker compose logs -f authz
# Internal probes
docker compose exec authz python -c \
"import urllib.request; print(urllib.request.urlopen('http://localhost:8000/healthz').status)"
docker compose exec nginx wget -qO- http://localhost:8080/v1/models # proxied to e5-large
# Restart one piece without restarting the rest
docker compose restart bge-m3
docker compose restart nginx # picks up template edits
docker compose down # stop services
rm -rf .docker/data/hf_cache/* # wipe HF cache (now a host bind mount, so `down -v` won't touch it)

- Multiple GPUs (not the current server): set `BGE_M3_GPU`, `E5_LARGE_GPU`, and `BGE_RERANKER_GPU` independently in `.env` to spread services across cards. On the production server (single 16 GB card) leave all three at `0`.
- Batch sizes: defaults are CPU 8/8/8, GPU 16/32/8 (bge-m3 conservative because of its 8 k-token context, reranker conservative because (query, doc) pair scoring inflates the effective batch). Pin `BGE_M3_BATCH_SIZE` / `E5_LARGE_BATCH_SIZE` / `BGE_RERANKER_BATCH_SIZE` in `.env` to override.
- Quantisation: change `--dtype=float16` to `int8` in `docker-compose.server.yml` to roughly halve VRAM at a small accuracy cost.
- Pin the image: replace `michaelf34/infinity:latest` with a tagged release for reproducible builds.
- Token rotation: edit `.docker/authz/api_keys.txt` and the next request will see the new set; no restart, no reload signal.
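Where `openssl` is not available, a token line in the same `tok_<32 hex chars> # <label>` shape as the setup command can be produced with Python's `secrets` module. The helper names below are hypothetical; the `tok_` prefix simply mirrors the example in this document and is not known to be required by the sidecar.

```python
import secrets


def new_key_line(label: str) -> str:
    """Return one '<token> # <label>' line with 16 random bytes, hex-encoded."""
    return f"tok_{secrets.token_hex(16)} # {label}"


def append_key(path, label):
    """Append a fresh token line to the key file."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(new_key_line(label) + "\n")


print(new_key_line("alice <alice@example.com>"))
```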