Ga0512/infera

infera — deploy & chill

infera

Deploy GGUF (llama-cpp-python) models to RunPod or Replicate with one command.

pip install infera-deploy

infera init my-project
cd my-project
cp ~/Downloads/llama.gguf models/
infera deploy runpod        # or: replicate

That's it. No Dockerfile, no Cog config, no GraphQL — infera writes the runtime, builds the image, uploads the model, and registers the serverless endpoint.

Package name on PyPI is infera-deploy; the Python module and CLI are both infera.

What you'll need

  • Python 3.10+
  • A .gguf model file (e.g. from TheBloke on Hugging Face)
  • For RunPod: Docker daemon, RunPod API key, Docker Hub login (docker login)
  • For Replicate: cog (Linux/macOS or WSL), cog login

What infera deploy actually does

  1. Bundles a runtime tailored to the provider (Dockerfile + handler for RunPod, predict.py + cog.yaml for Replicate)
  2. Builds and pushes the container image
  3. (RunPod) Creates a network volume and uploads .gguf files to it — idempotent, skips unchanged models via MD5
  4. Registers / upserts the serverless endpoint
  5. Smoke-tests it and prints the URL

Re-runs are idempotent: same template, same volume, only changed bits get re-shipped.
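
The MD5-based skip in step 3 can be sketched as follows. This is a minimal illustration, not infera's actual code: the helper names md5_of and models_to_upload, and the remote_md5 mapping of file name to digest, are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large .gguf never sits fully in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def models_to_upload(models_dir: Path, remote_md5: dict[str, str]) -> list[Path]:
    """Return the .gguf files that are new or whose content changed.

    remote_md5 maps file name -> hex digest of the copy already on the volume.
    How infera records those digests is not documented here; the mapping is
    an assumption for illustration.
    """
    return [
        path
        for path in sorted(models_dir.glob("*.gguf"))
        if remote_md5.get(path.name) != md5_of(path)
    ]
```

With this shape, a re-run where nothing changed yields an empty list, so no upload happens.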

Calling a deployed endpoint

The job input is OpenAI-ish:

{
  "input": {
    "messages":    [{"role": "user", "content": "Hello"}],
    "model":       "llama",
    "temperature": 0.7,
    "max_tokens":  512
  }
}

model is optional — it's the filename stem (e.g. llama-3.2-1b for llama-3.2-1b.gguf). If omitted, the first model alphabetically is used.

For embeddings, set "endpoint": "embeddings" and pass the text as "input" (a single string or a list).

For function calling / structured output: pass tools, response_format, or grammar (GBNF) the same way you would to OpenAI.

RunPod: POST https://api.runpod.ai/v2/<endpoint>/runsync with the header Authorization: Bearer <RUNPOD_KEY>.

Replicate: use the standard Replicate API. There, messages and tools must be passed as JSON-encoded strings (a Cog limitation).

Adding a model to a deployed project

cp another.gguf models/
infera deploy runpod

Idempotent — only the new .gguf gets uploaded. Multiple models live side-by-side on the volume; pick one per request via the model field.

Provider configs

The first infera deploy <provider> run drops <provider>.yaml into the project root. Edit it and re-deploy to apply the changes.

# runpod.yaml
gpu:           AMPERE_16,AMPERE_24
gpu_vram_min:  8
workers_min:   0
workers_max:   1
idle_timeout:  5
datacenter:    EU-RO-1
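
Before re-deploying, an edited runpod.yaml can be sanity-checked with a short script. The sketch below is illustrative only: parse_flat_yaml handles just the flat key: value layout shown above (it is not a general YAML parser), and the rules in check_runpod_config are assumptions for the example, not validation that infera is known to perform.

```python
def parse_flat_yaml(text: str) -> dict[str, str]:
    """Parse flat `key: value` lines, skipping comments and blank lines."""
    config: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing or full-line comments
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'key: value', got {raw!r}")
        config[key.strip()] = value.strip()
    return config


def check_runpod_config(config: dict[str, str]) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    workers_min = int(config["workers_min"])
    workers_max = int(config["workers_max"])
    if workers_min < 0:
        problems.append("workers_min must not be negative")
    if workers_max < workers_min:
        problems.append("workers_max must be >= workers_min")
    if not [gpu for gpu in config["gpu"].split(",") if gpu.strip()]:
        problems.append("gpu must list at least one GPU type")
    return problems
```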

Using the engine locally (advanced)

from infera import Engine

engine = Engine("./models")
print(engine.chat([{"role": "user", "content": "Hello"}]))

Support

If infera saved you an afternoon of Dockerfile yak-shaving, consider buying me a coffee:

Buy Me A Coffee

License

MIT
