Releases: SnapdragonPartners/maestro-llms
v0.7.1 — reasoning + billable output normalization (ADR-0016)
Patch release. Investigated as a Gemini bug, shipped as the cross-provider normalization the toolkit was missing.
The bug
A chat to gemini-3.1-pro-preview-customtools with MaxTokens=1024 returned OutputTokens=36, StopReason=MAX_TOKENS. Confusing because 36 is nowhere near 1024 — the model had silently burned ~988 tokens on thinking and the toolkit dropped them. Worse, the bug wasn't Gemini-specific: every provider's wire output_tokens had a different meaning, so Usage.OutputTokens was silently inconsistent across providers.
Fix (ADR-0016)
Two new fields on llms.Usage:
```go
ReasoningTokens      int // separately-metered thinking; ADDITIVE to OutputTokens
BillableOutputTokens int // what the provider charges as "output"
```
OutputTokens is redefined as visible-only across every adapter. Budget math holds: a length truncation fires when Input + Output + Reasoning approaches the cap. Cost math: read BillableOutputTokens (one field, one semantic, no per-provider branching).
| Provider | OutputTokens | ReasoningTokens | BillableOutputTokens |
|---|---|---|---|
| Gemini | CandidatesTokenCount | ThoughtsTokenCount | Candidates + Thoughts |
| OpenAI Responses | wire − OutputTokensDetails.ReasoningTokens | OutputTokensDetails.ReasoningTokens | wire.OutputTokens |
| OpenAI Chat Completions (vLLM) | same subtraction via CompletionTokensDetails | same | wire.CompletionTokens |
| Anthropic | wire.OutputTokens (includes thinking when on; not separable) | 0 (SDK has no field) | mirrors OutputTokens |
| Ollama | eval_count | 0 | mirrors OutputTokens |
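The table rows can be read as three small mapping rules. The sketch below is a minimal illustration of those rules using a local `Usage` type and hypothetical helper names (`fromGemini`, `fromOpenAI`, `fromOpaque`); it is not the adapters' actual code, and the wire counts are passed in as plain integers rather than SDK structs.

```go
package main

import "fmt"

// Usage is a local stand-in for the output-side fields described above.
// It is illustrative only and is not the toolkit's llms.Usage type.
type Usage struct {
	OutputTokens         int // visible-only
	ReasoningTokens      int // separately metered thinking
	BillableOutputTokens int // what the provider charges as output
}

// fromGemini assumes the wire reports visible and thinking counts separately.
func fromGemini(candidates, thoughts int) Usage {
	return Usage{
		OutputTokens:         candidates,
		ReasoningTokens:      thoughts,
		BillableOutputTokens: candidates + thoughts,
	}
}

// fromOpenAI assumes the wire output total already includes reasoning tokens,
// so the visible count is recovered by subtraction.
func fromOpenAI(wireOutput, reasoning int) Usage {
	return Usage{
		OutputTokens:         wireOutput - reasoning,
		ReasoningTokens:      reasoning,
		BillableOutputTokens: wireOutput,
	}
}

// fromOpaque covers rows where thinking is not separable: reasoning stays 0
// and the billable count mirrors the visible count.
func fromOpaque(wireOutput int) Usage {
	return Usage{OutputTokens: wireOutput, BillableOutputTokens: wireOutput}
}

func main() {
	fmt.Printf("gemini: %+v\n", fromGemini(102, 918))
	fmt.Printf("openai: %+v\n", fromOpenAI(1020, 918))
	fmt.Printf("opaque: %+v\n", fromOpaque(250))
}
```

With a non-reasoning OpenAI call, `reasoning` is 0 and `fromOpenAI` degenerates to the same shape as `fromOpaque`, which is why such calls see no change.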
Breaking semantic for OpenAI o-series
Pre-v0.7.1, Usage.OutputTokens was the OpenAI wire output_tokens, which for o-series silently included reasoning. Now it's visible-only. Migration: code reading OutputTokens as the billing-relevant number should read BillableOutputTokens instead (carries the exact same value the old field did). Non-reasoning OpenAI calls are unaffected — reasoning_tokens=0 makes the subtraction a no-op.
Pre-1.0 is the right time to make this consistent; carrying the inconsistency to v1.0 would have been worse.
Demo bonus (examples/chat)
The bubble footer surfaces the breakdown so a max_tokens stop on a small visible output is self-explanatory:
gemini-3.1-pro-preview-customtools · max tokens · 12 in / 102 out / 918 reasoning · 1020 billable · 10875 ms
Non-reasoning models keep the terse <in> in / <out> out form (billable would just duplicate out).
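As a rough illustration of the two footer forms, the sketch below formats only the token portion of the footer. `tokenFooter` and the local `Usage` struct are hypothetical names for this example; the demo's actual rendering code may differ.

```go
package main

import "fmt"

// Usage is a local stand-in carrying just the counts the footer needs.
type Usage struct {
	InputTokens          int
	OutputTokens         int
	ReasoningTokens      int
	BillableOutputTokens int
}

// tokenFooter returns the terse form when no reasoning was metered and the
// full breakdown otherwise. Switching on ReasoningTokens == 0 is an
// assumption about how "non-reasoning" is detected.
func tokenFooter(u Usage) string {
	if u.ReasoningTokens == 0 {
		return fmt.Sprintf("%d in / %d out", u.InputTokens, u.OutputTokens)
	}
	return fmt.Sprintf("%d in / %d out / %d reasoning · %d billable",
		u.InputTokens, u.OutputTokens, u.ReasoningTokens, u.BillableOutputTokens)
}

func main() {
	fmt.Println(tokenFooter(Usage{InputTokens: 12, OutputTokens: 102, ReasoningTokens: 918, BillableOutputTokens: 1020}))
	fmt.Println(tokenFooter(Usage{InputTokens: 12, OutputTokens: 40, BillableOutputTokens: 40}))
}
```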
Compatibility
Additive new fields. Existing fields keep their names. The only semantic shift is OpenAI o-series OutputTokens. No MAESTRO_DIVERGENCES.md row — Maestro never surfaced either field; nothing to diverge from.
Pre-1.0; v0.x minor versions may break.
🤖 Generated with Claude Code
v0.7.0 — vLLM provider
Headline: self-hosted GPU inference via vLLM. The toolkit now wraps five chat providers behind one stable interface: Anthropic, OpenAI (Responses), Google, Ollama, vLLM.
llms/providers/vllm (ADR-0015)
Leaf package alongside openai, using the OpenAI Chat Completions surface (not Responses) via openai-go with a configurable base URL. Zero new dependencies — openai-go is already imported for the existing OpenAI Responses adapter.
```go
import "github.com/SnapdragonPartners/maestro-llms/llms/providers/vllm"

client, _ := vllm.New(
	vllm.WithBaseURL("http://my-vllm:8000"),
	vllm.WithModel("mistralai/Ministral-3-14B-Instruct-2512"),
	// WithAPIKey is OPTIONAL: vLLM's default deployment has no auth.
)
```
Key behaviors:
- No-auth by default is the distinguishing feature vs hosted providers. An empty WithAPIKey is a valid configuration, not a config error.
- ModelLister implemented; LatestInFamily is not, because HuggingFace-style names have no canonical family convention. Same shape as Ollama. Regression-guard test ensures this stays binding.
- ModelInfo.Created is the load time on the vLLM instance, NOT the upstream HuggingFace release date. Don't surface it as freshness.
- Tool calling works through the standard tools/tool_choice request fields, but actual emission depends on the vLLM server's per-model --tool-call-parser config.
- Streaming deferred per ADR-0003.
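The no-auth default can be pictured with a small functional-options sketch. Everything here is a local stand-in that only borrows the option names from the snippet above: the `config` type, `newConfig`, and `newRequest` are hypothetical, and treating the base URL as the only required setting is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// config is a hypothetical stand-in for the adapter's internal settings.
type config struct{ baseURL, model, apiKey string }

// Option mutates a config; this mirrors the functional-options style above.
type Option func(*config)

func WithBaseURL(u string) Option { return func(c *config) { c.baseURL = u } }
func WithModel(m string) Option   { return func(c *config) { c.model = m } }
func WithAPIKey(k string) Option  { return func(c *config) { c.apiKey = k } }

// newConfig applies options. An empty API key is accepted on purpose;
// requiring only the base URL is an assumption made for this sketch.
func newConfig(opts ...Option) (config, error) {
	var c config
	for _, opt := range opts {
		opt(&c)
	}
	if c.baseURL == "" {
		return config{}, errors.New("vllm sketch: base URL is required")
	}
	return c, nil
}

// newRequest attaches a bearer token only when a key was configured.
func (c config) newRequest(method, path string) (*http.Request, error) {
	req, err := http.NewRequest(method, c.baseURL+path, nil)
	if err != nil {
		return nil, err
	}
	if c.apiKey != "" {
		req.Header.Set("Authorization", "Bearer "+c.apiKey)
	}
	return req, nil
}

func main() {
	c, err := newConfig(WithBaseURL("http://my-vllm:8000"))
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	req, err := c.newRequest(http.MethodGet, "/v1/models")
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	fmt.Printf("%s %s auth=%q\n", req.Method, req.URL, req.Header.Get("Authorization"))
}
```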
Live integration test gated on MAESTRO_VLLM (full base URL). Optional MAESTRO_VLLM_MODEL overrides the model ID, defaulting to the first one /v1/models reports. Picked up automatically by make test-integration.
ADR-0013 amendment (PR #44)
Append-only clarification that the future opt-in TextEstimator design space includes API-backed variants (Anthropic count_tokens, Gemini CountTokens) — both zero-new-dep wrappers since those SDKs are already imported — not just embedded tokenizers like tiktoken-go. Decision unchanged; framing widened so the future ADR doesn't get retconned.
MAESTRO_DIVERGENCES.md
Informational V1-V5 rows for vLLM (greenfield, no Maestro behavior to diverge from; rows document vLLM-specific behaviors a cut-over consumer should know about).
Not in this tag (separate module)
examples/chat (PR #46) is a runs-locally web demo wiring every provider through one UI (RecommendedChat middleware, per-provider ListModels, EstimateTextTokens). Lives under its own go.mod so its dependency closure doesn't leak into the toolkit. Build/run from examples/chat with go run ..
Compatibility
Additive throughout. No core llms type changes. Existing v0.6.x consumers are unaffected. govulncheck clean.
Pre-1.0; v0.x minor versions may break.
v0.6.0 — model listing & text token estimator
Two additive features, no core type breakage, no MAESTRO_DIVERGENCES.md rows. Adds maestro-cms as a third consumer of the toolkit alongside Maestro and Morris.
llms.ModelLister + per-provider LatestInFamily (ADR-0012)
Optional capability for listing a provider's catalog and detecting newer models in the same family as a pinned ID. Surfaces "newer model available, upgrade?" — never auto-updates.
```go
lister, ok := client.(llms.ModelLister)
if !ok { return } // provider doesn't expose a list (e.g. future vLLM)
models, _ := lister.ListModels(ctx)
newer, found := anthropic.LatestInFamily(currentID, models)
if found {
	fmt.Printf("Newer model: %s (released %s)\n", newer.ID, newer.Created.Format("2006-01-02"))
}
// Or one-shot:
newer, found, err = client.LatestInFamily(ctx, currentID)
```
- Anthropic: family claude-{opus|sonnet|haiku}, crosses generations (claude-3-5-sonnet-… and claude-sonnet-4-5-… are both claude-sonnet). Ordered by CreatedAt.
- OpenAI: family = ID with -YYYY-MM-DD stripped. Self-filtering by family-prefix means embedding/image models in the catalog don't collide with gpt-* queries. Ordered by Created (Unix).
- Google: family gemini-{pro|flash|nano|ultra}. genai exposes no created date, so ordered by parsed numeric version in the ID.
- Ollama: ListModels only (local pulls have no canonical family). Created is local pull time, not provider release time.
Permissive family parsing by design. Callers wanting major-version pinning filter the list themselves.
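To make the OpenAI-style rule concrete, here is a minimal sketch of "strip the date suffix, then pick the newest member of the family". The names `family`, `latestInFamily`, `ModelInfo`, and `sampleCatalog` are local to this example, the creation dates in the catalog are made up, and matching on exact family equality is an assumption; the shipped helpers may match differently.

```go
package main

import (
	"fmt"
	"regexp"
	"time"
)

// ModelInfo is a local stand-in with only the fields this sketch needs.
type ModelInfo struct {
	ID      string
	Created time.Time
}

// dateSuffix matches a trailing -YYYY-MM-DD release stamp.
var dateSuffix = regexp.MustCompile(`-\d{4}-\d{2}-\d{2}$`)

// family returns the ID with any trailing date stamp removed.
func family(id string) string { return dateSuffix.ReplaceAllString(id, "") }

// latestInFamily reports the newest model sharing currentID's family,
// but only when it is strictly newer than the current one.
func latestInFamily(currentID string, models []ModelInfo) (ModelInfo, bool) {
	fam := family(currentID)
	var current, best ModelInfo
	for _, m := range models {
		if family(m.ID) != fam {
			continue // other families (embeddings, images) never match
		}
		if m.ID == currentID {
			current = m
		}
		if m.Created.After(best.Created) {
			best = m
		}
	}
	if best.ID == "" || best.ID == currentID || !best.Created.After(current.Created) {
		return ModelInfo{}, false
	}
	return best, true
}

// sampleCatalog is illustrative data; the dates are placeholders.
func sampleCatalog() []ModelInfo {
	day := func(y int, m time.Month, d int) time.Time {
		return time.Date(y, m, d, 0, 0, 0, 0, time.UTC)
	}
	return []ModelInfo{
		{ID: "gpt-4o-2024-05-13", Created: day(2024, 5, 13)},
		{ID: "gpt-4o-2024-08-06", Created: day(2024, 8, 6)},
		{ID: "text-embedding-3-small", Created: day(2024, 1, 25)},
	}
}

func main() {
	if newer, ok := latestInFamily("gpt-4o-2024-05-13", sampleCatalog()); ok {
		fmt.Printf("Newer model: %s (released %s)\n", newer.ID, newer.Created.Format("2006-01-02"))
	}
}
```

The function only reports a candidate; consistent with the "never auto-updates" stance, acting on it is left to the caller.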
llms.EstimateTextTokens (ADR-0013)
Exported free function for budget-aware text chunking. Zero new dependencies.
```go
budget := llms.EstimateTextTokens(s) // approx token count
// Directly assignable as func(string) int, e.g. for maestro-cms chunk injection.
```
- Neutral bias (~4 chars/token, rune-counted), intentionally distinct from the middleware TokenEstimator's high bias.
- Two estimators, two purposes: over-reserving at the limiter is safe; over-estimating at chunk time wastes API calls. ADR-0013 makes the split binding.
- Tokenizer-backed / model-aware variant deferred to a future ADR when a consumer needs the fidelity.
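A rune-counted estimate at roughly 4 characters per token, and its use as an injectable `func(string) int`, might look like the sketch below. `estimateTextTokens` and `chunkWords` are hypothetical local functions; in particular, rounding up is an assumption, since the notes only state the approximate ratio.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// estimateTextTokens counts runes (not bytes) and divides by 4.
// Rounding up is an assumption made for this sketch.
func estimateTextTokens(s string) int {
	return (utf8.RuneCountInString(s) + 3) / 4
}

// chunkWords greedily packs whitespace-separated words into chunks whose
// estimate stays within budget. A single word larger than the budget still
// becomes its own (over-budget) chunk.
func chunkWords(text string, estimate func(string) int, budget int) []string {
	var chunks []string
	cur := ""
	for _, w := range strings.Fields(text) {
		next := w
		if cur != "" {
			next = cur + " " + w
		}
		if cur != "" && estimate(next) > budget {
			chunks = append(chunks, cur)
			next = w
		}
		cur = next
	}
	if cur != "" {
		chunks = append(chunks, cur)
	}
	return chunks
}

func main() {
	// The estimator is passed as a plain func(string) int.
	for i, c := range chunkWords("aaaa bbbb cccc dddd", estimateTextTokens, 3) {
		fmt.Printf("chunk %d (%d est. tokens): %q\n", i, estimateTextTokens(c), c)
	}
}
```

Counting runes rather than bytes keeps multi-byte text from being over-estimated, which matters for the neutral bias described above.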
Compatibility
Additive new surface throughout. Existing TokenEstimator, ChatClient, rate-limiter behavior all unchanged. govulncheck clean (bumped golang.org/x/net to v0.55.0 to resolve GO-2026-5026 along the way).
Pre-1.0; v0.x minor versions may break.
v0.5.0 — Tool loop helper (ADR-0011)
First non-port addition past v0.4.x: a small, app-neutral helper that wraps the provider-protocol tool round-trip over an llms.ChatClient.
What's new
llms/toolloop — synchronous tool-loop helper. Run(ctx, Config) Outcome over any ChatClient. App-supplied Tool{Definition, Execute}; the loop never executes side effects itself.
```go
out := toolloop.Run(ctx, toolloop.Config{
	Client:  client,
	Request: req,
	Tools:   []toolloop.Tool{weather},
})
switch out.Kind {
case toolloop.OutcomeFinalAnswer:
	fmt.Println(out.Response.Text)
case toolloop.OutcomeMaxIterations, toolloop.OutcomeLLMError,
	toolloop.OutcomeToolError, toolloop.OutcomeCanceled:
	// inspect out.Err, out.Messages, out.TotalUsage
}
```
Full design: docs/toolloop-proposal.md. Binding scope and non-goals: ADR-0011.
Behaviors pinned
- Verbatim assistant-message round-trip preserves ToolCall.ProviderSignature (ADR-0010), so Gemini 3 multi-turn tool loops work end-to-end.
- Fail-closed config (*toolloop.ConfigError): nil Client, empty Messages, Request.Tools set, duplicate tool names, nil Execute, both ToolChoices non-zero, unknown ToolChoice.Type, RequiresTools() with no tools, ToolChoiceTool naming an unknown tool. All are rejected before any provider call.
- Pre-execute MaxIterations stop: the limit-hitting assistant turn is appended to Outcome.Messages as diagnostic state; its tool calls are not executed and the transcript is not directly re-feedable into Complete without repairing the unresolved pairing.
- Defensive ToolCall copy before dispatch and event emission so a misbehaving executor or observer cannot corrupt the round-trip through shared slice backing (cloned Parameters and ProviderSignature).
- Failure split: ToolResult{IsError: true} is model-visible (loop continues); a non-nil Execute error is loop-visible → OutcomeToolError. An executor returning context.Canceled → OutcomeCanceled.
- Cancellation detected via errors.Is(err, context.Canceled) (and ctx.Err() == context.Canceled), matching the toolkit's X5 contract that caller cancel is not converted into a *llms.ProviderError.
- Unknown-tool recovery is the default: an unregistered tool name appends an error tool result and continues, so the model can self-correct when the provider can't enforce tool_choice.
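The control flow behind several of these behaviors (pre-execute iteration stop, unknown-tool recovery, the failure split, cancellation detection) can be sketched with a scripted fake model. Every type and function below is a hypothetical stand-in rather than the toolloop API, the transcript is simplified to strings, and the exact point at which the iteration limit is counted is an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Hypothetical stand-ins; not the toolloop package's types.
type ToolCall struct{ Name, Args string }

type Reply struct {
	Text  string
	Calls []ToolCall // empty means the model gave a final answer
}

type Model func(ctx context.Context, transcript []string) (Reply, error)
type Tool func(ctx context.Context, args string) (string, error)
type Outcome string

const (
	FinalAnswer   Outcome = "final_answer"
	MaxIterations Outcome = "max_iterations"
	LLMError      Outcome = "llm_error"
	ToolError     Outcome = "tool_error"
	Canceled      Outcome = "canceled"
)

// classify reports Canceled for caller cancellation, else the fallback.
func classify(err error, fallback Outcome) Outcome {
	if errors.Is(err, context.Canceled) {
		return Canceled
	}
	return fallback
}

// run executes at most maxIter tool rounds (an assumed counting rule).
func run(ctx context.Context, model Model, tools map[string]Tool, maxIter int) (Outcome, []string, error) {
	var transcript []string
	for rounds := 0; ; rounds++ {
		reply, err := model(ctx, transcript)
		if err != nil {
			return classify(err, LLMError), transcript, err
		}
		if len(reply.Calls) == 0 {
			return FinalAnswer, append(transcript, "assistant: "+reply.Text), nil
		}
		transcript = append(transcript, fmt.Sprintf("assistant: %d tool call(s)", len(reply.Calls)))
		if rounds >= maxIter {
			// Pre-execute stop: the turn is recorded, its calls are not run.
			return MaxIterations, transcript, nil
		}
		for _, call := range reply.Calls {
			tool, ok := tools[call.Name]
			if !ok {
				// Model-visible error result; the loop keeps going.
				transcript = append(transcript, "tool_error: unknown tool "+call.Name)
				continue
			}
			out, err := tool(ctx, call.Args)
			if err != nil {
				return classify(err, ToolError), transcript, err // loop-visible failure
			}
			transcript = append(transcript, "tool_result: "+out)
		}
	}
}

// runScript drives run with a fake model that replays fixed replies.
func runScript(maxIter int, tools map[string]Tool, replies ...Reply) (Outcome, []string, error) {
	i := 0
	model := func(context.Context, []string) (Reply, error) {
		if i >= len(replies) {
			return Reply{}, errors.New("script exhausted")
		}
		i++
		return replies[i-1], nil
	}
	return run(context.Background(), model, tools, maxIter)
}

// echoTools registers a single tool that returns its arguments.
func echoTools() map[string]Tool {
	return map[string]Tool{
		"echo": func(_ context.Context, args string) (string, error) { return args, nil },
	}
}

func main() {
	outcome, transcript, err := runScript(3, echoTools(),
		Reply{Calls: []ToolCall{{Name: "nope"}}}, Reply{Text: "done"})
	fmt.Println(outcome, transcript, err)
}
```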
Non-goals (binding per ADR-0011)
This is a tool loop, not an agent loop. No agent state, terminal/state-transition tools, ProcessEffect, story IDs, request IDs, audit taxonomy, persistence, authorization, tenant isolation, schema-generation, tool registries, moderation hooks, automatic streaming (deferred per ADR-0003), or prompt management. PRs that try to grow IterationEvent/ToolCallEvent with app-specific fields require a superseding ADR.
Compatibility
Additive new surface. No core llms type changes; no MAESTRO_DIVERGENCES.md row required. Existing v0.4.x consumers are unaffected.
Pre-1.0; v0.x minor versions may break.
v0.4.2 — Gemini thought_signature round-trip (G1 resolved)
Patch release. A release-blocking Gemini fix the Maestro team surfaced during cut-over (their PR #220 / migration §5 G1).
Fixed
- Gemini tool-call round-trip: the opaque functionCall thought_signature is now carried through the message model via the new ToolCall.ProviderSignature, captured on response parse and replayed on the next request. Without it, Gemini 3 Pro (gemini-3-pro-preview) hard-400 INVALID_ARGUMENTs on every 2nd+ tool-loop turn, making it unusable for any multi-turn agentic tool use. The signature travels in the conversation history the app already round-trips: stateless, no per-client cache (restores pre-extraction Maestro parity without the contract violation that motivated dropping the original responseCache). Divergence G1 → RESOLVED; ADR-0010.
Added
- llms.ToolCall.ProviderSignature []byte: opaque, provider-owned, never interpreted by core; round-trips provider-required per-tool-call state. Additive and zero-value-safe; Anthropic/OpenAI/Ollama leave it nil and ignore it (same neutral pattern as ContentPart.CacheBreakpoint).
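The stateless round-trip amounts to copying an opaque byte slice from the parsed response into the history that is sent back. The sketch below shows that idea with a local `ToolCall` type; `cloneToolCall` and `replay` are hypothetical helpers, and representing `Parameters` as raw bytes is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
)

// ToolCall is a local stand-in, not the toolkit's llms.ToolCall.
type ToolCall struct {
	ID                string
	Name              string
	Parameters        []byte // raw JSON arguments (assumed representation)
	ProviderSignature []byte // opaque; nil for providers that do not use it
}

// cloneToolCall copies the byte slices so later mutation of the copy cannot
// reach the original through shared backing arrays. bytes.Clone keeps nil
// as nil, so providers without a signature stay zero-valued.
func cloneToolCall(tc ToolCall) ToolCall {
	tc.Parameters = bytes.Clone(tc.Parameters)
	tc.ProviderSignature = bytes.Clone(tc.ProviderSignature)
	return tc
}

// replay appends the call captured from a response to the history that the
// next request will carry. No per-client cache is involved.
func replay(history []ToolCall, fromResponse ToolCall) []ToolCall {
	return append(history, cloneToolCall(fromResponse))
}

func main() {
	parsed := ToolCall{
		ID:                "call-1",
		Name:              "weather",
		Parameters:        []byte(`{"city":"Oslo"}`),
		ProviderSignature: []byte("opaque-signature"),
	}
	history := replay(nil, parsed)
	fmt.Printf("replayed %s with %d signature bytes\n", history[0].Name, len(history[0].ProviderSignature))
}
```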
Maestro consumers: bumping to v0.4.2 unblocks Gemini for the migration — multi-turn Gemini tool loops now work; migration §5 G1 clears (Anthropic + OpenAI were unaffected).
No breaking changes. Pre-1.0: v0.x minor versions may break.
v0.4.1 — OpenAI incomplete StopReason fix
Patch release. A cut-over fix the Maestro team surfaced while switching onto the toolkit.
Fixed
- OpenAI (Responses) StopReason: on status:"incomplete" the adapter now surfaces incomplete_details.reason ("max_output_tokens"/"content_filter") instead of the envelope status. This makes OpenAI consistent with the raw-finish-reason passthrough of the other adapters (cf. divergences G3/OL3), so consumers can detect length-truncation / content-filter without reaching into Raw (outside the stability contract). Tool calls on a truncated response are preserved. Divergence OC4; no ADR (bugfix aligning to an existing convention). Cross-ref Maestro PR #220 / spec §9.

  Maestro consumers: on bumping to v0.4.1 you can delete the rawStopReason Raw.(*responses.Response) workaround + its two guard tests. normalizeStopReason already maps these reasons; no other change needed.
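The selection rule is small enough to show directly. In the sketch below, `responseEnvelope`, `stopReasonFor`, and `truncatedByLength` are hypothetical names; falling back to the envelope status when no incomplete reason is present is an assumption about the non-incomplete case.

```go
package main

import "fmt"

// responseEnvelope is a local stand-in for the two wire fields involved.
type responseEnvelope struct {
	Status           string // e.g. "completed" or "incomplete"
	IncompleteReason string // incomplete_details.reason; empty when absent
}

// stopReasonFor prefers the specific incomplete reason over the generic
// envelope status, so callers can tell truncation from content filtering.
func stopReasonFor(r responseEnvelope) string {
	if r.Status == "incomplete" && r.IncompleteReason != "" {
		return r.IncompleteReason
	}
	return r.Status
}

// truncatedByLength is the kind of check a consumer can now make without
// inspecting the raw provider response.
func truncatedByLength(stop string) bool { return stop == "max_output_tokens" }

func main() {
	r := responseEnvelope{Status: "incomplete", IncompleteReason: "max_output_tokens"}
	stop := stopReasonFor(r)
	fmt.Println(stop, truncatedByLength(stop))
}
```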
Docs
- ADR-0009: recorded the 2026-05-18 end-to-end live validation of the v0.4 Vertex path (Anthropic via anthropicvertex + Gemini embeddings) against real Google Vertex AI (append-only; design validated as built).
No API changes. Pre-1.0: v0.x minor versions may break.
v0.4.0 — Vertex AI backend & task-typed embeddings
Google Vertex AI backend + task-typed embeddings (ADR-0009). The package now covers Morris's managed-AI posture — Claude and Gemini embeddings over Vertex/PSC, app-supplied auth, no static provider keys.
Added since v0.3.0
- llms/providers/anthropic/anthropicvertex: Claude via Vertex AI as a separate leaf package so the base anthropic package stays Google-dependency-free; returns the same *anthropic.Client (all request/response/tool/cache translation + middleware reused).
- anthropic.WithRequestOptions: low-level SDK escape hatch; API key optional when request options supply auth.
- google.NewEmbeddings: Gemini/Vertex embeddings (gemini-embedding-001), order/ID-preserving, per-request dimension override.
- llms.EmbeddingTask + EmbeddingInput.Title: provider-neutral, advisory; honored on Gemini, ignored on OpenAI (same pattern as CacheBreakpoint).
- App-supplied credentials + PSC endpoint/transport injection; no ADC discovery. The auth-vs-PSC-transport precedence is documented and covered by option-order tests.
- Fail-closed truncation: genai cannot send autoTruncate:false, so AutoTruncate=true is Vertex-only and AutoTruncate=false requires a client-side MaxInputBytes guard. NewEmbeddings fails rather than look safe while Vertex silently truncates.
- gemini-embedding-001 is single-input: a multi-input request returns a typed bad_request, with no fan-out and no hidden chunking exception (the app owns chunking).
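The fail-closed truncation and single-input rules can be read as two validation steps, one at construction time and one per request. The sketch below is one interpretation of those rules; `EmbeddingsConfig`, `Backend`, `validateConfig`, and `checkRequest` are hypothetical names, and the exact conditions the real constructor enforces may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// Backend distinguishes the two deployment targets in this sketch.
type Backend int

const (
	GeminiAPI Backend = iota
	VertexAI
)

// EmbeddingsConfig is a local stand-in for the relevant options.
type EmbeddingsConfig struct {
	Backend       Backend
	AutoTruncate  bool
	MaxInputBytes int
}

var errBadRequest = errors.New("bad_request")

// validateConfig fails at construction time instead of letting a
// configuration look safe while the server truncates silently.
func validateConfig(c EmbeddingsConfig) error {
	switch {
	case c.AutoTruncate && c.Backend != VertexAI:
		return errors.New("AutoTruncate=true is only available on Vertex")
	case !c.AutoTruncate && c.MaxInputBytes <= 0:
		return errors.New("AutoTruncate=false requires a positive MaxInputBytes guard")
	}
	return nil
}

// checkRequest enforces the single-input rule and the client-side byte cap.
func checkRequest(c EmbeddingsConfig, inputs []string) error {
	if len(inputs) != 1 {
		return fmt.Errorf("%w: expected exactly one input, got %d", errBadRequest, len(inputs))
	}
	if !c.AutoTruncate && len(inputs[0]) > c.MaxInputBytes {
		return fmt.Errorf("%w: input is %d bytes, limit is %d", errBadRequest, len(inputs[0]), c.MaxInputBytes)
	}
	return nil
}

func main() {
	cfg := EmbeddingsConfig{Backend: VertexAI, AutoTruncate: false, MaxInputBytes: 16}
	fmt.Println("config:", validateConfig(cfg))
	fmt.Println("short input:", checkRequest(cfg, []string{"hello"}))
	fmt.Println("two inputs:", checkRequest(cfg, []string{"a", "b"}))
}
```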
Decisions of record
ADR-0009 (Vertex/PSC/Gemini-embeddings): the leaf-package shape, app-supplied-auth-only, the precedence rule, and the auto-truncate fail-closed refinement.
Out of scope (Morris / OpenTofu infrastructure)
Vertex API enablement, aiplatform IAM, PSC/restricted-API DNS, VPC-SC perimeter, egress lockdown, model/region config. Streaming and vLLM remain deliberately deferred.
Pre-1.0: v0.x minor versions may break.
v0.3.0 — middleware complete
The provider-neutral middleware line is complete. Every chat/embedding client can now be composed with the full resilience + observability stack behind the same stable interfaces.
Added since v0.2.0
- llms/middleware, the full set:
  - ValidationChat (structural/app-neutral: text-only System, role/part legality, tool-call↔result pairing)
  - RetryChat / RetryEmbeddings (classify via llms.Retryable, honor RetryAfter, backoff + optional jitter)
  - TimeoutChat / TimeoutEmbeddings (per-attempt deadline)
  - CircuitChat / CircuitEmbeddings (3-state breaker, single-flight half-open, non-retryable *CircuitOpenError)
  - MetricsChat / MetricsEmbeddings (narrow app-neutral Observer, one Event per attempt)
  - RecommendedChat / RecommendedEmbeddings: wires the spec's recommended composition order; ChainChat stays the primitive
- llms.RetryAfter accessor, symmetric with llms.Retryable
- apierr: context.Canceled returned as-is (non-retryable, errors.Is-matchable); context.DeadlineExceeded stays a retryable timeout
- Docs: ADR log adopted (docs/adr/, 0001–0006); MAESTRO_DIVERGENCES extended; README rewritten as a project + usage guide
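Composition order is easiest to see with a tracing example. The sketch below builds a chain primitive in which the first middleware listed ends up outermost; `ChatClient`, `ChatFunc`, `Middleware`, `chain`, and `trace` are all hypothetical local definitions, and the layer names and their order are illustrative rather than the spec's recommended composition.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// ChatClient is a deliberately tiny stand-in interface.
type ChatClient interface {
	Complete(ctx context.Context, prompt string) (string, error)
}

// ChatFunc adapts a function to ChatClient.
type ChatFunc func(ctx context.Context, prompt string) (string, error)

func (f ChatFunc) Complete(ctx context.Context, prompt string) (string, error) {
	return f(ctx, prompt)
}

// Middleware wraps one client in another.
type Middleware func(ChatClient) ChatClient

// chain applies middlewares right to left so the first one listed is the
// outermost layer and therefore sees the call first.
func chain(base ChatClient, mws ...Middleware) ChatClient {
	for i := len(mws) - 1; i >= 0; i-- {
		base = mws[i](base)
	}
	return base
}

// trace records the layer name when the call passes through it.
func trace(name string, log *[]string) Middleware {
	return func(next ChatClient) ChatClient {
		return ChatFunc(func(ctx context.Context, prompt string) (string, error) {
			*log = append(*log, name)
			return next.Complete(ctx, prompt)
		})
	}
}

// callOrder runs one request through three traced layers and reports the
// order in which they were entered.
func callOrder() string {
	var log []string
	provider := ChatFunc(func(context.Context, string) (string, error) {
		log = append(log, "provider")
		return "ok", nil
	})
	client := chain(provider, trace("validation", &log), trace("retry", &log), trace("timeout", &log))
	if _, err := client.Complete(context.Background(), "hi"); err != nil {
		return "error: " + err.Error()
	}
	return strings.Join(log, " > ")
}

func main() {
	fmt.Println(callOrder())
}
```

Placing a timeout layer inside a retry layer, as in this trace, is what makes the deadline apply per attempt rather than to the whole retried call.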
Decisions of record
ADR-0003 (middleware is Complete/Embed-only; streaming deferred), ADR-0004 (single error classifier), ADR-0005 (non-retryable CircuitOpenError), ADR-0006 (structural ValidationError).
Not in this release — Maestro cut-over readiness gate
v0.3.0 is a complete middleware milestone, not a "cut-over ready" claim. Two tracked, separate readiness gates remain before the Maestro cut-over:
- ToolChoiceRequired: "the model must call one of the offered tools" (general capability; divergences OC2/G2)
- Provider-neutral prompt-cache hint: restores Maestro's CacheControl behavior (divergence A5)
Streaming and vLLM remain deliberately deferred. Pre-1.0: v0.x minor versions may break.
v0.2.0 — provider chat ports
Provider chat ports complete. All four providers now offer live, tool-use-capable chat behind the app-neutral interfaces, with the full live integration suite (simple-chat + tool-use, every provider) green as the Maestro cut-over acceptance gate.
Added since v0.1.0
- llms/providers/openai: OpenAI chat via the Responses API (structured message / function_call / function_call_output items; honors caller ToolChoice)
- llms/providers/google: Gemini chat via genai (stateless / concurrency-safe; real finish reason + usage)
- llms/providers/ollama: hand-rolled /api/chat client, no SDK dependency (avoids the ollama module's unfixed CVEs); raw done_reason + token usage
- Shared error classifier (apierr): typed HTTP-status classification across anthropic + openai + google + ollama
- Live integration suite: build-tagged tests for all four providers, each with simple-chat and a full tool-use round trip; OS-aware make test-integration (macOS ad-hoc codesign path, Linux/CI plain go test); manual workflow_dispatch CI
- docs/MAESTRO_DIVERGENCES.md: living cut-over acceptance checklist
Deferred to v0.3 (not in this release)
Middleware: retry, timeout, circuit, metrics, validation. Shipped middleware remains ChainChat/ChainEmbeddings, token estimator, and ratelimit (from v0.1).
Pre-1.0: v0.x minor versions may break.
v0.1.0 — first extraction milestone
First extraction milestone: an app-neutral Go LLM/embedding toolkit shared by Maestro and Morris.
Included
- llms: app-neutral chat + embedding interfaces; Message/ContentPart (tool calls/results as content parts), Usage (incl. cache tokens), classified ProviderError/LimitError; optional StreamingChatClient capability interface.
- llms/testllm: deterministic, concurrency-safe chat + embedding fakes (recordings deep-copied; stable hash embeddings).
- llms/ratelimit: Limiter/Reservation reservation protocol + in-memory token-bucket + concurrency limiter (lazy clock-based refill, no background goroutine; reconciliation contract).
- llms/middleware: ChainChat/ChainEmbeddings (first arg outermost), DefaultEstimator, rate-limit middleware (reserve → release on cancellation-surviving ctx → commit actual usage).
- llms/providers/anthropic: Anthropic chat with typed SDK-error classification, cache-token usage, raw tool-call params, strict-alternation translation.
- llms/providers/openai: OpenAI embeddings, input-order/ID preserving, with per-request dimension override and batch-limit validation.
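The "lazy clock-based refill, no background goroutine" idea and the reserve → commit reconciliation can be sketched as a token bucket that tops itself up only when it is touched. The `bucket` type and its methods are hypothetical and much simpler than a real Limiter/Reservation protocol; the clock is passed in explicitly so the behavior is deterministic.

```go
package main

import (
	"fmt"
	"math"
	"sync"
	"time"
)

// bucket is an illustrative token bucket, not the ratelimit package's type.
type bucket struct {
	mu       sync.Mutex
	capacity float64
	perSec   float64
	tokens   float64
	last     time.Time
}

func newBucket(capacity, perSec float64, now time.Time) *bucket {
	return &bucket{capacity: capacity, perSec: perSec, tokens: capacity, last: now}
}

// refillLocked credits tokens for the time elapsed since the last touch.
// The caller must hold mu. No timer or goroutine is involved.
func (b *bucket) refillLocked(now time.Time) {
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens = math.Min(b.capacity, b.tokens+elapsed*b.perSec)
		b.last = now
	}
}

// reserve takes an estimated amount up front, or reports false if the
// bucket cannot cover it at this instant.
func (b *bucket) reserve(now time.Time, estimate float64) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.refillLocked(now)
	if b.tokens < estimate {
		return false
	}
	b.tokens -= estimate
	return true
}

// commit reconciles the reservation with actual usage: over-estimates are
// refunded, under-estimates are charged (the balance may go negative).
func (b *bucket) commit(estimate, actual float64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tokens = math.Min(b.capacity, b.tokens+estimate-actual)
}

// release refunds a reservation whose call never consumed anything.
func (b *bucket) release(estimate float64) { b.commit(estimate, 0) }

// at builds a deterministic timestamp for demos and tests.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	b := newBucket(100, 10, at(0))
	fmt.Println("reserve 80:", b.reserve(at(0), 80))
	fmt.Println("reserve 30:", b.reserve(at(0), 30))
	fmt.Println("reserve 30 one second later:", b.reserve(at(1), 30))
	b.commit(30, 10)
	fmt.Println("reserve 20 after reconciliation:", b.reserve(at(1), 20))
}
```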
Requirements
Go 1.26.3+ (includes the patched net and net/http stdlib advisories GO-2026-4971 / GO-2026-4918).
Not included (roadmap)
- v0.2: retry / timeout / circuit-breaker middleware, metrics hook interfaces, richer error classification.
- v0.3: Google + Ollama providers, optional streaming, shared provider error-classifier.
Distributed (e.g. PostgreSQL) limiters are implemented by applications against the ratelimit.Limiter interface; none ship here yet by design.