infera

Deploy GGUF (llama-cpp-python) models to RunPod or Replicate with one command.

pip install infera-deploy

infera init my-project
cd my-project
cp ~/Downloads/llama.gguf models/
infera deploy runpod        # or: replicate

That's it. No Dockerfile, no Cog config, no GraphQL — infera writes the runtime, builds the image, uploads the model, and registers the serverless endpoint.

The package name on PyPI is infera-deploy; the Python module and the CLI are both named infera.

What you'll need

Python 3.10+
A .gguf model file (e.g. from TheBloke on Hugging Face)
For RunPod: Docker daemon, RunPod API key, Docker Hub login (docker login)
For Replicate: cog (Linux/macOS or WSL), cog login

What `infera deploy` actually does

Bundles a runtime tailored to the provider (Dockerfile + handler for RunPod, predict.py + cog.yaml for Replicate)
Builds and pushes the container image
(RunPod) Creates a network volume and uploads .gguf files to it — idempotent, skips unchanged models via MD5
Registers / upserts the serverless endpoint
Smoke-tests it and prints the URL

Re-runs are idempotent: same template, same volume, only changed bits get re-shipped.
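
The MD5-based skip in the upload step can be pictured roughly as follows. This is a minimal sketch, not infera's actual code: the names `file_md5` and `models_to_upload`, and the `remote_hashes` mapping of filename to previously uploaded hash, are assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def models_to_upload(models_dir: Path, remote_hashes: Dict[str, str]) -> List[Path]:
    """List .gguf files whose MD5 is missing from, or differs from, remote_hashes.

    remote_hashes is a hypothetical {filename: md5} record of what is already
    on the volume; how infera stores this is not specified in this README.
    """
    changed = []
    for path in sorted(models_dir.glob("*.gguf")):
        if remote_hashes.get(path.name) != file_md5(path):
            changed.append(path)
    return changed
```

With a record like this, a re-run that finds every hash unchanged has nothing to upload, which matches the idempotent behavior described above.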

Calling a deployed endpoint

The job input is OpenAI-ish:

{
  "input": {
    "messages":    [{"role": "user", "content": "Hello"}],
    "model":       "llama",
    "temperature": 0.7,
    "max_tokens":  512
  }
}

model is optional — it's the filename stem (e.g. llama-3.2-1b for llama-3.2-1b.gguf). If omitted, the first model alphabetically is used.
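
The selection rule above (filename stem, falling back to the first model alphabetically) could look something like this. It is a sketch under stated assumptions: `resolve_model` is a hypothetical helper, and the exact sort order infera uses (for example, case sensitivity) is not specified here, so plain string ordering is assumed.

```python
from pathlib import Path
from typing import Optional


def resolve_model(models_dir: Path, requested: Optional[str] = None) -> Path:
    """Map a model name (filename stem) to its .gguf path.

    Hypothetical helper: when no name is given, fall back to the first
    stem in sorted order, mirroring the rule described in the text.
    """
    by_stem = {p.stem: p for p in models_dir.glob("*.gguf")}
    if not by_stem:
        raise FileNotFoundError(f"no .gguf files in {models_dir}")
    if requested is None:
        return by_stem[sorted(by_stem)[0]]
    if requested not in by_stem:
        raise KeyError(f"unknown model {requested!r}; available: {sorted(by_stem)}")
    return by_stem[requested]
```

Note that only the final `.gguf` suffix is stripped, so `llama-3.2-1b.gguf` maps to the stem `llama-3.2-1b` even though the name contains other dots.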

For embeddings, set "endpoint": "embeddings" and pass the text as "input": "text" (or a list of texts).

For function calling / structured output: pass tools, response_format, or grammar (GBNF) the same way you would to OpenAI.

RunPod: POST to https://api.runpod.ai/v2/<endpoint>/runsync with the header Authorization: Bearer <RUNPOD_KEY>.

Replicate: use the standard Replicate API. There, messages and tools must be passed as JSON-encoded strings (a Cog limitation).
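
The RunPod call and the Replicate string encoding described above can be sketched with the standard library alone. The helper names `build_runsync_request` and `to_replicate_input` are hypothetical, and the Content-Type header is an assumption (the usual choice for a JSON body); only the URL shape, the Bearer header, and the `input` wrapper come from this README.

```python
import json
import urllib.request


def build_runsync_request(endpoint_id: str, api_key: str, job_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to the RunPod runsync URL."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    body = json.dumps({"input": job_input}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            # Assumption: JSON content type, as is usual for a JSON body.
            "Content-Type": "application/json",
        },
    )


def to_replicate_input(job_input: dict) -> dict:
    """Return a copy with messages and tools JSON-encoded as strings."""
    encoded = dict(job_input)
    for key in ("messages", "tools"):
        if key in encoded:
            encoded[key] = json.dumps(encoded[key])
    return encoded
```

To send the request, pass the result to `urllib.request.urlopen` and decode the response body as JSON. The response schema is not documented in this README, so inspect it before relying on specific fields.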

Adding a model to a deployed project

cp another.gguf models/
infera deploy runpod

Idempotent — only the new .gguf gets uploaded. Multiple models live side-by-side on the volume; pick one per request via the model field.

Provider configs

The first infera deploy <provider> run writes <provider>.yaml to the project root. Edit it and re-deploy to apply the changes.

# runpod.yaml
gpu:           AMPERE_16,AMPERE_24
gpu_vram_min:  8
workers_min:   0
workers_max:   1
idle_timeout:  5
datacenter:    EU-RO-1

Using the engine locally (advanced)

from infera import Engine

engine = Engine("./models")
print(engine.chat([{"role": "user", "content": "Hello"}]))

Support

If infera saved you an afternoon of Dockerfile yak-shaving, consider buying me a coffee:

License

MIT
