Deploy GGUF (llama-cpp-python) models to RunPod or Replicate with one command.
pip install infera-deploy
infera init my-project
cd my-project
cp ~/Downloads/llama.gguf models/
infera deploy runpod  # or: replicate

That's it. No Dockerfile, no Cog config, no GraphQL: infera writes the runtime, builds the image, uploads the model, and registers the serverless endpoint.

The package name on PyPI is infera-deploy; the Python module and CLI are both infera.
- Python 3.10+
- A .gguf model file (e.g. from TheBloke on Hugging Face)
- For RunPod: Docker daemon, RunPod API key, Docker Hub login (docker login)
- For Replicate: cog (Linux/macOS or WSL), cog login
- Bundles a runtime tailored to the provider (Dockerfile + handler for RunPod, predict.py + cog.yaml for Replicate)
- Builds and pushes the container image
- (RunPod) Creates a network volume and uploads .gguf files to it; this step is idempotent and skips unchanged models via MD5
- Registers / upserts the serverless endpoint
- Smoke-tests it and prints the URL
Re-runs are idempotent: same template, same volume, only changed bits get re-shipped.
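The MD5-based skip can be pictured with the sketch below. It is not infera's actual code: `file_md5`, `models_to_upload`, and the `remote_md5` mapping (filename to the checksum already recorded for the volume) are hypothetical names used only for illustration.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 of a file, reading in chunks so large models are not loaded whole."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def models_to_upload(models_dir: Path, remote_md5: dict[str, str]) -> list[Path]:
    """List local .gguf files whose checksum is missing from, or differs from, remote_md5."""
    return [
        path
        for path in sorted(models_dir.glob("*.gguf"))
        if remote_md5.get(path.name) != file_md5(path)
    ]
```

With a check of this kind, a second run over an unchanged models/ directory yields an empty list, so nothing is re-shipped.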
The job input is OpenAI-ish:
{
  "input": {
    "messages": [{"role": "user", "content": "Hello"}],
    "model": "llama",
    "temperature": 0.7,
    "max_tokens": 512
  }
}

model is optional: it's the filename stem (e.g. llama-3.2-1b for llama-3.2-1b.gguf). If omitted, the first model alphabetically is used.
For embeddings: "endpoint": "embeddings" and "input": "text" (or a list).
For function calling / structured output: pass tools, response_format, or grammar (GBNF) the same way you would to OpenAI.
RunPod: POST https://api.runpod.ai/v2/<endpoint>/runsync with Authorization: Bearer <RUNPOD_KEY>.
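As an illustration, the RunPod call can be assembled with the standard library alone. `build_runsync_request` is a hypothetical helper name, and the `Content-Type` header is an assumption on top of what is stated above; the endpoint ID and key are placeholders you supply.

```python
import json
import urllib.request


def build_runsync_request(
    endpoint_id: str, api_key: str, job_input: dict
) -> urllib.request.Request:
    """Build (but do not send) a POST to the endpoint's /runsync URL."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    # The job payload is wrapped under the top-level "input" key.
    body = json.dumps({"input": job_input}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumed; JSON body
        },
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request; the response body can then be parsed with `json.load`.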
Replicate: standard Replicate API. messages and tools are JSON-encoded strings (Cog limitation).
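One way to prepare the same job input for Replicate is to JSON-encode the structured fields before sending. This is a sketch under the assumption that only messages and tools need encoding, as stated above; `to_replicate_input` is a hypothetical helper, not an infera function.

```python
import json


def to_replicate_input(job_input: dict) -> dict:
    """Return a copy of job_input with "messages" and "tools" serialized to JSON strings."""
    encoded = dict(job_input)
    for key in ("messages", "tools"):
        # Leave values that are already strings untouched to avoid double encoding.
        if key in encoded and not isinstance(encoded[key], str):
            encoded[key] = json.dumps(encoded[key])
    return encoded
```

Scalar fields such as temperature and max_tokens pass through unchanged.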
cp another.gguf models/
infera deploy runpod

Idempotent: only the new .gguf gets uploaded. Multiple models live side-by-side on the volume; pick one per request via the model field.

The first infera deploy <provider> run drops <provider>.yaml into the project root. Edit it and re-deploy.
# runpod.yaml
gpu: AMPERE_16,AMPERE_24
gpu_vram_min: 8
workers_min: 0
workers_max: 1
idle_timeout: 5
datacenter: EU-RO-1

from infera import Engine
engine = Engine("./models")
print(engine.chat([{"role": "user", "content": "Hello"}]))

If infera saved you an afternoon of Dockerfile yak-shaving, consider buying me a coffee.
MIT
