v0.7.0 — vLLM provider

@dratner dratner released this 31 May 20:03
· 1 commit to main since this release
e15081a

Headline: self-hosted GPU inference via vLLM. The toolkit now wraps five chat providers behind one stable interface: Anthropic, OpenAI (Responses), Google, Ollama, vLLM.

llms/providers/vllm (ADR-0015)

Leaf package alongside openai, using the OpenAI Chat Completions surface (not Responses) via openai-go with a configurable base URL. Zero new dependencies — openai-go is already imported for the existing OpenAI Responses adapter.

import "github.com/SnapdragonPartners/maestro-llms/llms/providers/vllm"

client, _ := vllm.New(
    vllm.WithBaseURL("http://my-vllm:8000"),
    vllm.WithModel("mistralai/Ministral-3-14B-Instruct-2512"),
    // WithAPIKey is OPTIONAL — vLLM's default deployment has no auth.
)
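For readers who want to see what the optional API key means on the wire, here is a minimal standard-library sketch of a Chat Completions request against a configurable base URL. It is not the adapter's code: the helper name newChatRequest and the way /v1/chat/completions is joined onto the base URL are assumptions made for illustration. The only behavior taken from these notes is that an empty key is valid and simply means no Authorization header.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// chatMessage and chatRequest mirror the minimal Chat Completions request body.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

// newChatRequest (hypothetical helper) builds, but does not send, a POST to
// <baseURL>/v1/chat/completions. An empty apiKey is a valid configuration:
// the request is built without an Authorization header.
func newChatRequest(baseURL, apiKey, model, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, err
	}
	// Assumption: the base URL carries no /v1 suffix, as in the example above.
	url := strings.TrimRight(baseURL, "/") + "/v1/chat/completions"
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if apiKey != "" {
		req.Header.Set("Authorization", "Bearer "+apiKey)
	}
	return req, nil
}

func main() {
	req, err := newChatRequest("http://my-vllm:8000", "", "mistralai/Ministral-3-14B-Instruct-2512", "Hello")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Println("has auth header:", req.Header.Get("Authorization") != "")
}
```

Treating the empty key as "no header" rather than as an error is what lets the same code path serve both an unauthenticated default deployment and one placed behind a key.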

Key behaviors:

  • No-auth by default is the distinguishing feature vs hosted providers. An empty WithAPIKey is a valid configuration, not a config error.
  • ModelLister is implemented; LatestInFamily is not, because HuggingFace-style names have no canonical family convention. This matches the Ollama adapter. A regression-guard test keeps the omission binding.
  • ModelInfo.Created is the load time on the vLLM instance, NOT the upstream HuggingFace release date. Don't surface it as freshness.
  • Tool calling works through the standard tools / tool_choice request fields, but actual emission depends on the vLLM server's per-model --tool-call-parser config.
  • Streaming deferred per ADR-0003.

The live integration test is gated on MAESTRO_VLLM, which must hold the full base URL. The optional MAESTRO_VLLM_MODEL overrides the model ID; when it is unset, the test uses the first model that /v1/models reports. make test-integration picks the test up automatically.
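The gating and model-selection rule can be sketched with the standard library alone. This is an illustration under stated assumptions, not the repository's test code: resolveModel and modelList are hypothetical names, and the sample payload is a hand-written stand-in for a /v1/models response in the OpenAI-compatible list shape.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

// modelList decodes only the fields this sketch needs from a /v1/models response.
type modelList struct {
	Data []struct {
		ID      string `json:"id"`
		Created int64  `json:"created"` // load time on the instance, not a release date
	} `json:"data"`
}

// resolveModel (hypothetical helper) returns the override when it is set,
// otherwise the first model ID found in the /v1/models response body.
func resolveModel(override string, modelsJSON []byte) (string, error) {
	if override != "" {
		return override, nil
	}
	var list modelList
	if err := json.Unmarshal(modelsJSON, &list); err != nil {
		return "", fmt.Errorf("decode /v1/models: %w", err)
	}
	if len(list.Data) == 0 {
		return "", errors.New("server reported no models")
	}
	return list.Data[0].ID, nil
}

func main() {
	baseURL := os.Getenv("MAESTRO_VLLM")
	if baseURL == "" {
		fmt.Println("MAESTRO_VLLM not set; skipping live test")
		return
	}
	// Illustrative payload; a real test would fetch it from baseURL + "/v1/models".
	sample := []byte(`{"object":"list","data":[{"id":"mistralai/Ministral-3-14B-Instruct-2512","object":"model","created":1700000000}]}`)
	model, err := resolveModel(os.Getenv("MAESTRO_VLLM_MODEL"), sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("base URL:", baseURL)
	fmt.Println("model:", model)
}
```

Keeping the fallback in one small function makes the "first reported model" rule easy to unit-test without a running vLLM instance.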

ADR-0013 amendment (PR #44)

Append-only clarification that the future opt-in TextEstimator design space includes API-backed variants (Anthropic count_tokens, Gemini CountTokens) — both zero-new-dep wrappers since those SDKs are already imported — not just embedded tokenizers like tiktoken-go. Decision unchanged; framing widened so the future ADR doesn't get retconned.

MAESTRO_DIVERGENCES.md

Informational V1-V5 rows for vLLM (greenfield, no Maestro behavior to diverge from; rows document vLLM-specific behaviors a cut-over consumer should know about).

Not in this tag (separate module)

examples/chat (PR #46) is a runs-locally web demo that wires every provider through one UI (RecommendedChat middleware, per-provider ListModels, EstimateTextTokens). It lives under its own go.mod so its dependency closure doesn't leak into the toolkit. To build and run it, change into examples/chat and invoke: go run .

Compatibility

Additive throughout. No core llms type changes. Existing v0.6.x consumers are unaffected. govulncheck clean.

Pre-1.0; v0.x minor versions may introduce breaking changes.

🤖 Generated with Claude Code