v0.7.0 — vLLM provider
Headline: self-hosted GPU inference via vLLM. The toolkit now wraps five chat providers behind one stable interface: Anthropic, OpenAI (Responses), Google, Ollama, vLLM.
llms/providers/vllm (ADR-0015)
Leaf package alongside openai, using the OpenAI Chat Completions surface (not Responses) via openai-go with a configurable base URL. Zero new dependencies — openai-go is already imported for the existing OpenAI Responses adapter.
    import "github.com/SnapdragonPartners/maestro-llms/llms/providers/vllm"

    client, _ := vllm.New(
        vllm.WithBaseURL("http://my-vllm:8000"),
        vllm.WithModel("mistralai/Ministral-3-14B-Instruct-2512"),
        // WithAPIKey is OPTIONAL: vLLM's default deployment has no auth.
    )

Key behaviors:
- No-auth by default is the distinguishing feature vs hosted providers. An empty WithAPIKey is a valid configuration, not a config error.
- ModelLister implemented; LatestInFamily is not, because HuggingFace-style names have no canonical family convention. Same shape as Ollama. Regression-guard test ensures this stays binding.
- ModelInfo.Created is the load time on the vLLM instance, NOT the upstream HuggingFace release date. Don't surface it as freshness.
- Tool calling works through the standard tools/tool_choice request fields, but actual emission depends on the vLLM server's per-model --tool-call-parser config.
- Streaming deferred per ADR-0003.
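To make the no-auth default concrete, here is a minimal stdlib-only sketch of a Chat Completions request where the Authorization header is attached only when a key is supplied. This is not the adapter's code (the adapter goes through openai-go). The helper name newChatRequest and the request structs are hypothetical, and the sketch assumes the base URL is given without the /v1 suffix.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// chatMessage and chatRequest cover only the fields this sketch needs
// from the OpenAI-compatible Chat Completions request body.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

// newChatRequest (hypothetical helper) builds a POST to /v1/chat/completions.
// An empty apiKey is treated as valid: the Authorization header is omitted.
func newChatRequest(baseURL, apiKey, model, prompt string) (*http.Request, error) {
	body, err := json.Marshal(chatRequest{
		Model:    model,
		Messages: []chatMessage{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return nil, fmt.Errorf("encode request body: %w", err)
	}

	// Assumption: baseURL does not already end in /v1.
	url := strings.TrimRight(baseURL, "/") + "/v1/chat/completions"
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Content-Type", "application/json")
	if apiKey != "" {
		req.Header.Set("Authorization", "Bearer "+apiKey)
	}
	return req, nil
}

func main() {
	// No API key: mirrors a default vLLM deployment with auth disabled.
	req, err := newChatRequest("http://my-vllm:8000", "", "mistralai/Ministral-3-14B-Instruct-2512", "ping")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	fmt.Printf("Authorization header set: %t\n", req.Header.Get("Authorization") != "")
}
```

The point of the design is that the missing key is handled at request-construction time rather than rejected during configuration, so the same code path serves both secured and unsecured servers.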
Live integration test gated on MAESTRO_VLLM (full base URL). Optional MAESTRO_VLLM_MODEL overrides the model ID, defaulting to the first one /v1/models reports. Picked up automatically by make test-integration.
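The model-selection rule for the live test can be sketched as follows. This is an illustration under stated assumptions, not the repository's test code: pickModel and modelList are hypothetical names, and the canned JSON stands in for a real /v1/models response, of which only the data[].id field is read.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

// modelList mirrors the subset of an OpenAI-compatible /v1/models
// response that this sketch uses.
type modelList struct {
	Data []struct {
		ID string `json:"id"`
	} `json:"data"`
}

// pickModel (hypothetical helper) returns override when it is set,
// otherwise the first model ID found in listBody.
func pickModel(override string, listBody []byte) (string, error) {
	if override != "" {
		return override, nil
	}
	var ml modelList
	if err := json.Unmarshal(listBody, &ml); err != nil {
		return "", fmt.Errorf("decode model list: %w", err)
	}
	if len(ml.Data) == 0 {
		return "", errors.New("server reported no models")
	}
	return ml.Data[0].ID, nil
}

func main() {
	// The live test is gated on MAESTRO_VLLM; without it there is nothing to do.
	if os.Getenv("MAESTRO_VLLM") == "" {
		fmt.Println("MAESTRO_VLLM unset; a live test would be skipped")
		return
	}

	// Canned body standing in for a real /v1/models response.
	sample := []byte(`{"object":"list","data":[{"id":"example/model-a"},{"id":"example/model-b"}]}`)

	model, err := pickModel(os.Getenv("MAESTRO_VLLM_MODEL"), sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("model:", model)
}
```

Keeping the selection logic separate from the HTTP call makes the fallback order easy to check without a running vLLM server.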
ADR-0013 amendment (PR #44)
Append-only clarification that the future opt-in TextEstimator design space includes API-backed variants (Anthropic count_tokens, Gemini CountTokens) — both zero-new-dep wrappers since those SDKs are already imported — not just embedded tokenizers like tiktoken-go. Decision unchanged; framing widened so the future ADR doesn't get retconned.
MAESTRO_DIVERGENCES.md
Informational V1-V5 rows for vLLM (greenfield, no Maestro behavior to diverge from; rows document vLLM-specific behaviors a cut-over consumer should know about).
Not in this tag (separate module)
examples/chat (PR #46) is a web demo that runs locally and wires every provider through one UI (RecommendedChat middleware, per-provider ListModels, EstimateTextTokens). It lives under its own go.mod so its dependency closure doesn't leak into the toolkit. To build and run it, execute go run . from the examples/chat directory.
Compatibility
Additive throughout. No core llms type changes. Existing v0.6.x consumers are unaffected. govulncheck clean.
Pre-1.0; v0.x minor versions may include breaking changes.
🤖 Generated with Claude Code