AarhusAI/embedding_infinity_docker

embedding_infinity_docker

Three michaelfeil/infinity inference services — two embedders (BAAI/bge-m3, intfloat/multilingual-e5-large) and one reranker (BAAI/bge-reranker-v2-m3) — sitting behind an authenticating nginx gateway. Clients hit a single host:port, present a bearer token, and get routed to the right backend by URL prefix.

The Infinity image lives at michaelf34/infinity on Docker Hub (the GitHub user is michaelfeil, the Docker Hub user is michaelf34).

Topology

flowchart LR
    client([client])
    nginx["nginx :8080"]
    authz["authz :8000<br/>verifies bearer token"]
    bge["bge-m3 :7997"]
    rerank["bge-reranker :7997"]
    e5["e5-large :7997<br/>(default)"]

    client -- "Authorization: Bearer &lt;client-token&gt;" --> nginx
    nginx -. "auth_request /_authz" .-> authz
    authz -. "204 ok / 401 deny" .-> nginx
    nginx -- "/bge/*" --> bge
    nginx -- "/reranker/*" --> rerank
    nginx -- "/" --> e5
| Service | Image | Internal port | External surface |
| --- | --- | --- | --- |
| nginx | nginxinc/nginx-unprivileged:alpine | 8080 | localhost:<random> (local) / Traefik (server) |
| authz | python:3.12-alpine | 8000 | none (only nginx talks to it) |
| bge-m3 | michaelf34/infinity:* | 7997 | none (proxied via /bge/*) |
| bge-reranker | michaelf34/infinity:* | 7997 | none (proxied via /reranker/*) |
| e5-large | michaelf34/infinity:* | 7997 | none (proxied via /) |
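
The prefix routing in the diagram and table can be modelled in a few lines of Python. This is only an illustrative sketch: the real rules live in the nginx configuration, which is not shown here, and `ROUTES`, `DEFAULT_BACKEND`, and `pick_backend` are hypothetical names. Whether nginx strips the prefix before proxying is not stated in this README, so the sketch only picks a backend and leaves the path untouched.

```python
# Hypothetical model of the gateway's prefix routing (not the actual nginx config).
ROUTES = [
    ("/bge/", "bge-m3"),
    ("/reranker/", "bge-reranker"),
]
DEFAULT_BACKEND = "e5-large"  # everything else falls through to "/"


def pick_backend(path: str) -> str:
    """Return the service name a request path would be routed to."""
    for prefix, backend in ROUTES:
        if path.startswith(prefix):
            return backend
    return DEFAULT_BACKEND
```

For example, `pick_backend("/bge/v1/embeddings")` selects bge-m3, while `pick_backend("/v1/embeddings")` falls through to e5-large.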

The model containers share a host bind mount at ./.docker/data/hf_cache (mounted at /hf-cache, HF_HOME=/hf-cache) for the Hugging Face cache so they don't double-download tokenizers and configs. The directory is committed empty via .docker/data/.gitignore.

Two profiles, one repo

| File | Purpose | Device | Dtype |
| --- | --- | --- | --- |
| docker-compose.yml | Local dev on a laptop | CPU | float32 |
| docker-compose.server.yml | GPU override layered onto the base | CUDA | float16 |

The server file is an override, not a standalone stack — it patches command:, adds the deploy.resources GPU reservation, drops published ports, and switches the gateway to Traefik HTTPS routing. Always pass both files when running on the server.

Authentication

Single-key model, enforced at the gateway:

  • Client tokens live in .docker/authz/api_keys.txt, one per line — <token> # <label>. The authz sidecar re-reads the file on every request, so adding or revoking tokens is live; no container restart needed.
  • The Infinity upstreams are not configured with their own API key. They're only reachable on the internal app Docker network, so nginx is the sole authenticator and the client's Authorization header is forwarded upstream as-is.

Missing or unknown client token → authz returns 401 → nginx blocks the request.
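
The token check described above can be sketched as follows. This is not the sidecar's actual source, which is not shown here. It assumes the `<token> # <label>` line format, treats lines without a token (blank or comment-only) as ignorable, and uses the hypothetical helper names `parse_api_keys` and `verdict`. A real sidecar would re-read the file and call something like this on every request.

```python
import hmac
from typing import Dict, Optional


def parse_api_keys(text: str) -> Dict[str, str]:
    """Parse '<token> # <label>' lines into {token: label}; lines without a token are skipped."""
    tokens = {}
    for line in text.splitlines():
        token, _, label = line.partition("#")
        token = token.strip()
        if token:
            tokens[token] = label.strip()
    return tokens


def verdict(authorization: Optional[str], tokens: Dict[str, str]) -> int:
    """Return 204 for a known bearer token and 401 for anything else."""
    if not authorization:
        return 401
    scheme, _, presented = authorization.partition(" ")
    presented = presented.strip()
    if scheme.lower() != "bearer" or not presented:
        return 401
    ok = False
    for known in tokens:
        # compare_digest avoids leaking how many leading characters matched.
        if hmac.compare_digest(presented.encode(), known.encode()):
            ok = True
    return 204 if ok else 401
```

Because the key set is rebuilt from the file text on each call, a token added or removed on disk takes effect on the next request, which matches the live-reload behaviour described above.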

Prerequisites

  • Docker Engine ≥ 24 with the Compose v2 plugin
  • Server only: NVIDIA driver + nvidia-container-toolkit (verify with docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi)
  • The external frontend Docker network must exist:
    docker network create frontend
  • An api_keys.txt with at least one client token (see First-time setup below). The file must exist before up -d; otherwise Docker creates a directory at the bind-mount path and every request returns 500.
  • Disk: ~7 GB for combined model weights under ./.docker/data/hf_cache (bge-m3 ~2.3 GB, e5-large ~2.2 GB, bge-reranker ~2.3 GB)
  • Server VRAM: target deployment is a single 16 GB GPU shared by all three services. Steady-state fp16 footprint is ~9–12 GB; peaks under long bge-m3 inputs (8 k-token context) or large rerank pair batches can edge close to the 16 GB ceiling at the default batch sizes (bge-m3=16, e5-large=32, bge-reranker=8). Raise batch sizes cautiously and watch nvidia-smi for the first heavy run; the first lever if it OOMs is dropping BGE_M3_BATCH_SIZE to 8.
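
The api_keys.txt prerequisite is the easiest one to get wrong, so a small preflight check can help. The sketch below is a hypothetical helper (`check_api_keys` is not part of this repo) that reports the failure modes mentioned above: the path is missing, it has become a directory, or it contains no token.

```python
from pathlib import Path
from typing import List


def check_api_keys(path: str) -> List[str]:
    """Return a list of problems with the key file; an empty list means it looks usable."""
    p = Path(path)
    if p.is_dir():
        return [f"{path} is a directory (was the stack started before the file existed?)"]
    if not p.is_file():
        return [f"{path} does not exist"]
    # Keep only the part before '#', so comment-only lines count as empty.
    tokens = [
        line.split("#", 1)[0].strip()
        for line in p.read_text(encoding="utf-8").splitlines()
    ]
    if not any(tokens):
        return [f"{path} contains no tokens"]
    return []
```

Running `check_api_keys(".docker/authz/api_keys.txt")` before `docker compose up -d` should return an empty list.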

First-time setup

cp .env.example .env
$EDITOR .env                                                      # set COMPOSE_DOMAIN, COMPOSE_SERVER_DOMAIN, INFINITY_IMAGE_VERSION, etc.

cp .docker/authz/api_keys.txt.example .docker/authz/api_keys.txt
echo "tok_$(openssl rand -hex 16) # alice <alice@example.com>" >> .docker/authz/api_keys.txt
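
Where openssl is not available, the same kind of key line can be produced from Python. This is a minimal sketch: `new_key_line` is a hypothetical helper, and the `tok_` prefix simply mirrors the shell example above.

```python
import secrets


def new_key_line(label: str, nbytes: int = 16) -> str:
    """Build one api_keys.txt line in the '<token> # <label>' format."""
    if "\n" in label or "\r" in label:
        raise ValueError("label must be a single line")
    # token_hex(16) yields 32 hex characters, like `openssl rand -hex 16`.
    return f"tok_{secrets.token_hex(nbytes)} # {label}"


if __name__ == "__main__":
    print(new_key_line("alice <alice@example.com>"))
```

Append the printed line to .docker/authz/api_keys.txt.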

Run — local (CPU)

docker compose pull
docker compose up -d
docker compose logs -f

CPU inference is slow on these models (seconds-per-request, not ms). It's a smoke-test path — fine for wiring up clients, not for load.

The nginx port is published as "8080" without a host binding, so Docker assigns a random host port:

docker compose port nginx 8080      # e.g. 0.0.0.0:49154

Use that, or point a local Traefik at ${COMPOSE_DOMAIN}.
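
When scripting against the random port, the output of `docker compose port` has to be parsed. A minimal sketch, assuming output of the form shown above (`0.0.0.0:49154`); `host_port` is a hypothetical helper. It splits on the last colon, so it would also cope with an IPv6-style `[::]:49154` if that form ever appears.

```python
def host_port(port_output: str) -> int:
    """Extract the port number from `docker compose port` output."""
    first_line = port_output.strip().splitlines()[0]
    # Split on the LAST colon so bracketed IPv6 hosts do not confuse the parse.
    _, _, port = first_line.rpartition(":")
    return int(port)
```

With the example output above, `host_port("0.0.0.0:49154")` gives the integer to use in `http://localhost:<port>`.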

Run — server (GPU)

docker compose -f docker-compose.yml -f docker-compose.server.yml pull
docker compose -f docker-compose.yml -f docker-compose.server.yml up -d
docker compose -f docker-compose.yml -f docker-compose.server.yml logs -f

To make the GPU stack the default for every docker compose invocation in this directory, add to .env:

COMPOSE_FILE=docker-compose.yml:docker-compose.server.yml

…then docker compose up -d / down / logs work without the -f flags.

On the server every published port is dropped (!reset []) and traffic enters via Traefik on ${COMPOSE_SERVER_DOMAIN}, with an HTTPS redirect from web to websecure.

Calling the API

The path prefix selects the model; model: in the request body is informational (Infinity matches by served-model-name).

TOKEN="tok_abc123"                          # from .docker/authz/api_keys.txt
HOST="http://localhost:$(docker compose port nginx 8080 | cut -d: -f2)"
# On the server: HOST="https://${COMPOSE_SERVER_DOMAIN}"

# e5-large (default at /)
curl -sS "$HOST/v1/embeddings" \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"model":"multilingual-e5-large","input":["query: hvad er en vektor?"]}'

# bge-m3 (under /bge prefix)
curl -sS "$HOST/bge/v1/embeddings" \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"model":"bge-m3","input":["hej fra Aarhus","hello from Aarhus"]}'
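
The same calls can be made from Python with only the standard library. The sketch below assumes the OpenAI-style response shape (a `data` list whose items carry an `embedding` field); `embeddings_request` and `embed` are hypothetical helper names, and the 60-second timeout is an arbitrary choice for slow CPU runs.

```python
import json
import urllib.request
from typing import List


def embeddings_request(host: str, token: str, model: str,
                       texts: List[str], prefix: str = "") -> urllib.request.Request:
    """Build a POST to <host><prefix>/v1/embeddings with the bearer token attached."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{host.rstrip('/')}{prefix}/v1/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def embed(req: urllib.request.Request) -> List[List[float]]:
    """Send the request and return one vector per input (assumes an OpenAI-style body)."""
    with urllib.request.urlopen(req, timeout=60) as resp:
        payload = json.load(resp)
    return [item["embedding"] for item in payload["data"]]
```

Pass `prefix="/bge"` with model `bge-m3` for the bge-m3 backend, or leave `prefix` empty for e5-large. A 401 from the gateway surfaces as `urllib.error.HTTPError`.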

BGE-M3 sparse + ColBERT vectors

bge-m3 natively produces three representations. /v1/embeddings returns dense only — for the other two, hit Infinity's native routes under the same /bge/ prefix:

# Sparse / lexical weights (token_id -> weight)
curl -sS "$HOST/bge/embeddings_sparse" \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"model":"bge-m3","input":["hej fra Aarhus"]}'

# ColBERT / multi-vector (one vector per token)
curl -sS "$HOST/bge/embeddings/colbert" \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"model":"bge-m3","input":["hej fra Aarhus"]}'
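
Once the sparse weights and ColBERT vectors are parsed out of the responses, they are scored differently from dense embeddings. The sketch below shows the usual scoring rules: a sum of weight products over shared token ids for sparse output, and ColBERT-style MaxSim for multi-vector output. Response parsing is deliberately left out, since the exact JSON layout depends on the running Infinity version; both function names are hypothetical.

```python
from typing import Dict, List


def sparse_score(query: Dict[int, float], doc: Dict[int, float]) -> float:
    """Lexical match: sum of weight products over token ids present in both sides."""
    return sum(w * doc[t] for t, w in query.items() if t in doc)


def maxsim_score(query_vecs: List[List[float]], doc_vecs: List[List[float]]) -> float:
    """Late interaction: each query token takes its best-matching document token
    (by dot product), and those maxima are summed. doc_vecs must be non-empty."""
    def dot(a: List[float], b: List[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

Pure-Python loops are fine for a handful of candidates; for larger sets the same arithmetic is usually done as a matrix product.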

Reranking with bge-reranker-v2-m3

The reranker is a cross-encoder: given a query and a set of candidate documents it returns a relevance score per document. Different shape from the embedders — no input array, no vectors back. Standard pipeline is to recall with bge-m3 / e5-large, then rerank the top-K with this service.

curl -sS "$HOST/reranker/rerank" \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "bge-reranker-v2-m3",
        "query": "hvor ligger Aarhus?",
        "documents": [
          "Aarhus er Danmarks næststørste by.",
          "Paris er hovedstaden i Frankrig.",
          "Aarhus ligger i Jylland."
        ],
        "top_n": 3
      }'

Response is a list of {index, relevance_score} entries sorted by score, descending.
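
To turn those entries back into text, map each `index` onto the `documents` array that was sent. A minimal sketch, assuming the entries have already been extracted from the response body as a list of dicts; `ranked_documents` and its `min_score` cutoff are hypothetical, and the sort is repeated defensively in case the entries arrive unordered.

```python
from typing import Dict, List, Tuple


def ranked_documents(documents: List[str], results: List[Dict],
                     min_score: float = float("-inf")) -> List[Tuple[str, float]]:
    """Pair each result entry with its source document, best score first."""
    ordered = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [
        (documents[r["index"]], r["relevance_score"])
        for r in ordered
        if r["relevance_score"] >= min_score
    ]
```

In a recall-then-rerank pipeline, `documents` would be the top-K candidates returned by the embedding search.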

Endpoint paths follow the running Infinity version — docker compose exec bge-m3 wget -qO- http://localhost:7997/openapi.json shows the live spec.

Operations

# Status — all five services should be healthy (nginx, authz, bge-m3, e5-large, bge-reranker)
docker compose ps

# Inspect token verdicts (204 ok / 401 deny, with client IP and label)
docker compose logs -f authz

# Internal probes
docker compose exec authz python -c \
  "import urllib.request; print(urllib.request.urlopen('http://localhost:8000/healthz').status)"
docker compose exec nginx wget -qO- http://localhost:8080/v1/models   # proxied to e5-large

# Restart one piece without restarting the rest
docker compose restart bge-m3
docker compose restart nginx                                          # picks up template edits

docker compose down                       # stop services
rm -rf .docker/data/hf_cache/*            # wipe HF cache (now a host bind mount, so `down -v` won't touch it)

Tuning

  • Multiple GPUs (not the current server): set BGE_M3_GPU, E5_LARGE_GPU, and BGE_RERANKER_GPU independently in .env to spread services across cards. On the production server (single 16 GB card) leave all three at 0.
  • Batch sizes: defaults are CPU 8/8/8, GPU 16/32/8 (bge-m3 conservative because of its 8 k-token context, reranker conservative because (query, doc) pair scoring inflates the effective batch). Pin BGE_M3_BATCH_SIZE / E5_LARGE_BATCH_SIZE / BGE_RERANKER_BATCH_SIZE in .env to override.
  • Quantisation: change --dtype=float16 to int8 in docker-compose.server.yml to roughly halve VRAM at a small accuracy cost.
  • Pin the image: replace michaelf34/infinity:latest with a tagged release for reproducible builds.
  • Token rotation: edit .docker/authz/api_keys.txt and the next request will see the new set — no restart, no reload signal.
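
Token rotation can also be scripted. The sketch below removes every key line carrying a given label, assuming the `<token> # <label>` format; `revoke` is a hypothetical helper, not part of this repo. Comment-only lines are left alone.

```python
def revoke(text: str, label: str) -> str:
    """Return api_keys.txt content with every token line carrying `label` removed."""
    kept = []
    for line in text.splitlines():
        token, sep, line_label = line.partition("#")
        # Only drop real token lines; a comment-only line has an empty token part.
        if token.strip() and sep and line_label.strip() == label:
            continue
        kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")
```

Write the result back to the same file in place. If the file is bind-mounted individually (an assumption, since the compose file is not shown here), replacing it via a rename can leave the container reading the old copy.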
