Releases: SnapdragonPartners/maestro-llms

v0.7.1 — reasoning + billable output normalization (ADR-0016)

31 May 21:25
6d9a7aa

Patch release. Investigated as a Gemini bug, shipped as the cross-provider normalization the toolkit was missing.

The bug

A chat to gemini-3.1-pro-preview-customtools with MaxTokens=1024 returned OutputTokens=36, StopReason=MAX_TOKENS. Confusing because 36 is nowhere near 1024 — the model had silently burned ~988 tokens on thinking and the toolkit dropped them. Worse, the bug wasn't Gemini-specific: every provider's wire output_tokens had a different meaning, so Usage.OutputTokens was silently inconsistent across providers.

Fix (ADR-0016)

Two new fields on llms.Usage:

ReasoningTokens      int  // separately-metered thinking; ADDITIVE to OutputTokens
BillableOutputTokens int  // what the provider charges as "output"

OutputTokens is redefined as visible-only across every adapter. Budget math holds: a length truncation fires when Input + Output + Reasoning approaches the cap. Cost math: read BillableOutputTokens (one field, one semantic, no per-provider branching).

| Provider | OutputTokens | ReasoningTokens | BillableOutputTokens |
| --- | --- | --- | --- |
| Gemini | CandidatesTokenCount | ThoughtsTokenCount | Candidates + Thoughts |
| OpenAI Responses | wire − OutputTokensDetails.ReasoningTokens | OutputTokensDetails.ReasoningTokens | wire.OutputTokens |
| OpenAI Chat Completions (vLLM) | same subtraction via CompletionTokensDetails | same | wire.CompletionTokens |
| Anthropic | wire.OutputTokens (includes thinking when on; not separable) | 0 (SDK has no field) | mirrors OutputTokens |
| Ollama | eval_count | 0 | mirrors OutputTokens |
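
As a rough illustration of the Gemini and OpenAI rows, the sketch below maps wire counts onto the three output-side fields. The Usage struct here is a local stand-in rather than the toolkit's llms.Usage, and fromGemini / fromOpenAI are hypothetical helper names; the counts are illustrative values in the spirit of the bug report, not measured data.

```go
package main

import "fmt"

// Usage is a local stand-in for the three output-side fields described above.
// It is NOT the toolkit's llms.Usage type.
type Usage struct {
	OutputTokens         int // visible output only
	ReasoningTokens      int // separately metered thinking, additive to OutputTokens
	BillableOutputTokens int // what the provider charges as "output"
}

// fromGemini (hypothetical helper): visible and thinking tokens arrive
// separately, so the billable figure is their sum.
func fromGemini(candidates, thoughts int) Usage {
	return Usage{
		OutputTokens:         candidates,
		ReasoningTokens:      thoughts,
		BillableOutputTokens: candidates + thoughts,
	}
}

// fromOpenAI (hypothetical helper): the wire output count already includes
// reasoning, so visible output is recovered by subtraction.
func fromOpenAI(wireOutput, reasoning int) Usage {
	return Usage{
		OutputTokens:         wireOutput - reasoning,
		ReasoningTokens:      reasoning,
		BillableOutputTokens: wireOutput,
	}
}

func main() {
	fmt.Printf("%+v\n", fromGemini(36, 988))
	fmt.Printf("%+v\n", fromOpenAI(1024, 988))
}
```

Both calls yield the same normalized Usage, which is the point of the table: the per-provider arithmetic differs, the resulting fields do not.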

Breaking semantic for OpenAI o-series

Pre-v0.7.1, Usage.OutputTokens was the OpenAI wire output_tokens, which for o-series silently included reasoning. Now it's visible-only. Migration: code reading OutputTokens as the billing-relevant number should read BillableOutputTokens instead (carries the exact same value the old field did). Non-reasoning OpenAI calls are unaffected — reasoning_tokens=0 makes the subtraction a no-op.
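
A minimal sketch of the migration, assuming a caller that prices output tokens itself. The Usage struct is a local stand-in, and outputCostMicros with its per-token rate are hypothetical names introduced only for this example.

```go
package main

import "fmt"

// Usage is a local stand-in for the fields named in the migration note.
type Usage struct {
	OutputTokens         int
	ReasoningTokens      int
	BillableOutputTokens int
}

// outputCostMicros (hypothetical) prices the output side of one call in
// millionths of a currency unit. pricePerTokenMicros is caller-supplied.
func outputCostMicros(u Usage, pricePerTokenMicros int64) int64 {
	// Read the billable field. OutputTokens is visible-only and would
	// under-count calls that spent tokens on reasoning.
	return int64(u.BillableOutputTokens) * pricePerTokenMicros
}

func main() {
	u := Usage{OutputTokens: 102, ReasoningTokens: 918, BillableOutputTokens: 1020}
	fmt.Println(outputCostMicros(u, 10))
}
```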

Pre-1.0 is the right time to make this consistent; carrying the inconsistency to v1.0 would have been worse.

Demo bonus (examples/chat)

The bubble footer surfaces the breakdown so a max_tokens stop on a small visible output is self-explanatory:

gemini-3.1-pro-preview-customtools · max tokens · 12 in / 102 out / 918 reasoning · 1020 billable · 10875 ms

Non-reasoning models keep the terse <in> in / <out> out form (billable would just duplicate out).
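
The two footer forms could be produced along these lines. This is a sketch, not the demo's code: usageFooter is a hypothetical name, the struct is a local stand-in, and treating "ReasoningTokens > 0" as the switch between the forms is an assumption.

```go
package main

import "fmt"

// Usage is a local stand-in for the counters shown in the footer.
type Usage struct {
	InputTokens          int
	OutputTokens         int
	ReasoningTokens      int
	BillableOutputTokens int
}

// usageFooter renders the token portion of the bubble footer.
// Assumption: a call is shown in the long form only when ReasoningTokens > 0.
func usageFooter(u Usage) string {
	if u.ReasoningTokens == 0 {
		// Terse form: billable would just duplicate out.
		return fmt.Sprintf("%d in / %d out", u.InputTokens, u.OutputTokens)
	}
	return fmt.Sprintf("%d in / %d out / %d reasoning · %d billable",
		u.InputTokens, u.OutputTokens, u.ReasoningTokens, u.BillableOutputTokens)
}

func main() {
	fmt.Println(usageFooter(Usage{InputTokens: 12, OutputTokens: 102, ReasoningTokens: 918, BillableOutputTokens: 1020}))
	fmt.Println(usageFooter(Usage{InputTokens: 12, OutputTokens: 40, BillableOutputTokens: 40}))
}
```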

Compatibility

Additive new fields. Existing fields keep their names. The only semantic shift is OpenAI o-series OutputTokens. No MAESTRO_DIVERGENCES.md row — Maestro never surfaced either field; nothing to diverge from.

Pre-1.0; v0.x minor versions may break.

🤖 Generated with Claude Code

v0.7.0 — vLLM provider

31 May 20:03
e15081a

Choose a tag to compare

Headline: self-hosted GPU inference via vLLM. The toolkit now wraps five chat providers behind one stable interface: Anthropic, OpenAI (Responses), Google, Ollama, vLLM.

llms/providers/vllm (ADR-0015)

Leaf package alongside openai, using the OpenAI Chat Completions surface (not Responses) via openai-go with a configurable base URL. Zero new dependencies — openai-go is already imported for the existing OpenAI Responses adapter.

import "github.com/SnapdragonPartners/maestro-llms/llms/providers/vllm"

client, _ := vllm.New(
    vllm.WithBaseURL("http://my-vllm:8000"),
    vllm.WithModel("mistralai/Ministral-3-14B-Instruct-2512"),
    // WithAPIKey is OPTIONAL — vLLM's default deployment has no auth.
)

Key behaviors:

  • No-auth by default is the distinguishing feature vs hosted providers. An empty WithAPIKey is a valid configuration, not a config error.
  • ModelLister implemented; LatestInFamily is not — HuggingFace-style names have no canonical family convention. Same shape as Ollama. Regression-guard test ensures this stays binding.
  • ModelInfo.Created is the load time on the vLLM instance, NOT the upstream HuggingFace release date. Don't surface it as freshness.
  • Tool calling works through the standard tools / tool_choice request fields, but actual emission depends on the vLLM server's per-model --tool-call-parser config.
  • Streaming deferred per ADR-0003.
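
To make the no-auth behavior concrete, here is a minimal sketch using only net/http. It is not the adapter's implementation (which goes through openai-go); newModelsRequest is a hypothetical helper, and sending the key as a Bearer token is an assumption based on the OpenAI-compatible surface.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// newModelsRequest (hypothetical) builds a GET <baseURL>/v1/models request.
// An empty apiKey is valid: the Authorization header is attached only when
// a key is actually configured.
func newModelsRequest(baseURL, apiKey string) (*http.Request, error) {
	url := strings.TrimRight(baseURL, "/") + "/v1/models"
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if apiKey != "" {
		req.Header.Set("Authorization", "Bearer "+apiKey)
	}
	return req, nil
}

func main() {
	req, err := newModelsRequest("http://my-vllm:8000", "")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.URL.String(), "auth header set:", req.Header.Get("Authorization") != "")
}
```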

Live integration test gated on MAESTRO_VLLM (full base URL). Optional MAESTRO_VLLM_MODEL overrides the model ID, defaulting to the first one /v1/models reports. Picked up automatically by make test-integration.
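
The gating rule reads naturally as a small pure function. The sketch below is an assumption about the shape of that logic, not the test's actual code: resolveTarget is a hypothetical name, and getenv is injected so the example runs without touching the real environment.

```go
package main

import (
	"fmt"
	"os"
)

// resolveTarget (hypothetical) decides whether the live test should run and
// against which model. listed stands in for the IDs /v1/models reports.
func resolveTarget(getenv func(string) string, listed []string) (baseURL, model string, run bool) {
	baseURL = getenv("MAESTRO_VLLM")
	if baseURL == "" {
		return "", "", false // gate closed: skip the live test
	}
	if m := getenv("MAESTRO_VLLM_MODEL"); m != "" {
		return baseURL, m, true // explicit override wins
	}
	if len(listed) == 0 {
		return baseURL, "", false // nothing to default to
	}
	return baseURL, listed[0], true // default: first reported model
}

func main() {
	base, model, run := resolveTarget(os.Getenv, []string{"mistralai/Ministral-3-14B-Instruct-2512"})
	fmt.Println(base, model, run)
}
```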

ADR-0013 amendment (PR #44)

Append-only clarification that the future opt-in TextEstimator design space includes API-backed variants (Anthropic count_tokens, Gemini CountTokens) — both zero-new-dep wrappers since those SDKs are already imported — not just embedded tokenizers like tiktoken-go. Decision unchanged; framing widened so the future ADR doesn't get retconned.

MAESTRO_DIVERGENCES.md

Informational V1-V5 rows for vLLM (greenfield, no Maestro behavior to diverge from; rows document vLLM-specific behaviors a cut-over consumer should know about).

Not in this tag (separate module)

examples/chat (PR #46) is a runs-locally web demo wiring every provider through one UI (RecommendedChat middleware, per-provider ListModels, EstimateTextTokens). Lives under its own go.mod so its dependency closure doesn't leak into the toolkit. Build/run with go run . from the examples/chat directory.

Compatibility

Additive throughout. No core llms type changes. Existing v0.6.x consumers are unaffected. govulncheck clean.

Pre-1.0; v0.x minor versions may break.

v0.6.0 — model listing & text token estimator

30 May 21:41
70aff1b

Two additive features, no core type breakage, no MAESTRO_DIVERGENCES.md rows. Adds maestro-cms as a third consumer of the toolkit alongside Maestro and Morris.

llms.ModelLister + per-provider LatestInFamily (ADR-0012)

Optional capability for listing a provider's catalog and detecting newer models in the same family as a pinned ID. Surfaces "newer model available, upgrade?" — never auto-updates.

lister, ok := client.(llms.ModelLister)
if !ok { return } // provider doesn't expose a list (e.g. future vLLM)

models, _ := lister.ListModels(ctx)
newer, found := anthropic.LatestInFamily(currentID, models)
if found {
    fmt.Printf("Newer model: %s (released %s)\n", newer.ID, newer.Created.Format("2006-01-02"))
}

// Or one-shot (err is new here, so := is required):
newer, found, err := client.LatestInFamily(ctx, currentID)

  • Anthropic — family claude-{opus|sonnet|haiku}, crosses generations (claude-3-5-sonnet-… and claude-sonnet-4-5-… are both claude-sonnet). Ordered by CreatedAt.
  • OpenAI — family = ID with -YYYY-MM-DD stripped. Self-filtering by family-prefix means embedding/image models in the catalog don't collide with gpt-* queries. Ordered by Created (Unix).
  • Google — family gemini-{pro|flash|nano|ultra}. genai exposes no created date, so ordered by parsed numeric version in the ID.
  • Ollama — ListModels only (local pulls have no canonical family). Created is local pull time, not provider release time.

Permissive family parsing by design. Callers wanting major-version pinning filter the list themselves.
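
The OpenAI rule (strip a trailing -YYYY-MM-DD, then order by Created) can be sketched as follows. Model, familyOf and latestInFamily are local, hypothetical names; matching families by exact equality after stripping is an assumption, and the adapter's real matching may be looser.

```go
package main

import (
	"fmt"
	"regexp"
)

// Model is a local stand-in carrying only what the comparison needs.
type Model struct {
	ID      string
	Created int64 // Unix seconds
}

var dateSuffix = regexp.MustCompile(`-\d{4}-\d{2}-\d{2}$`)

// familyOf strips a trailing -YYYY-MM-DD snapshot date from a model ID.
func familyOf(id string) string {
	return dateSuffix.ReplaceAllString(id, "")
}

// latestInFamily returns the newest model in current's family, but only if
// it is a different, strictly newer entry than current itself.
func latestInFamily(current string, models []Model) (Model, bool) {
	family := familyOf(current)
	var cur, best Model
	var haveCur, haveBest bool
	for _, m := range models {
		if familyOf(m.ID) != family {
			continue // other families (embeddings, images) never collide
		}
		if m.ID == current {
			cur, haveCur = m, true
		}
		if !haveBest || m.Created > best.Created {
			best, haveBest = m, true
		}
	}
	if !haveBest || best.ID == current {
		return Model{}, false
	}
	if haveCur && best.Created <= cur.Created {
		return Model{}, false
	}
	return best, true
}

func main() {
	models := []Model{
		{ID: "gpt-4o-2024-05-13", Created: 100},
		{ID: "gpt-4o-2024-08-06", Created: 200},
		{ID: "text-embedding-3-small", Created: 300},
	}
	newer, found := latestInFamily("gpt-4o-2024-05-13", models)
	fmt.Println(newer.ID, found)
}
```

The Created values in main are placeholders chosen for ordering only.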

llms.EstimateTextTokens (ADR-0013)

Exported free function for budget-aware text chunking. Zero new dependencies.

budget := llms.EstimateTextTokens(s) // approx token count
// Directly assignable as func(string) int — e.g. for maestro-cms chunk injection.

  • Neutral bias (~4 chars/token, rune-counted) — intentionally distinct from the middleware TokenEstimator's high bias.
  • Two estimators, two purposes: over-reserving at the limiter is safe; over-estimating at chunk time wastes API calls. ADR-0013 makes the split binding.
  • Tokenizer-backed / model-aware variant deferred to a future ADR when a consumer needs the fidelity.
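
For a feel of the "~4 chars/token, rune-counted" heuristic and why a func(string) int is convenient at chunk time, here is a self-contained sketch. estimateTokens is not llms.EstimateTextTokens; rounding up is an assumption, and chunkByBudget is a hypothetical consumer.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// estimateTokens approximates a token count at ~4 characters per token,
// counting runes rather than bytes so multi-byte text is not over-counted.
// Rounding up is an assumption of this sketch.
func estimateTokens(s string) int {
	return (utf8.RuneCountInString(s) + 3) / 4
}

// chunkByBudget (hypothetical) groups words into space-joined chunks whose
// estimated size stays within budget. A single oversized word still becomes
// its own chunk.
func chunkByBudget(words []string, budget int, estimate func(string) int) []string {
	var chunks []string
	current := ""
	for _, w := range words {
		candidate := w
		if current != "" {
			candidate = current + " " + w
		}
		if current != "" && estimate(candidate) > budget {
			chunks = append(chunks, current)
			current = w
			continue
		}
		current = candidate
	}
	if current != "" {
		chunks = append(chunks, current)
	}
	return chunks
}

func main() {
	words := []string{"aaaa", "bbbb", "cccc"}
	fmt.Println(estimateTokens("héllo wörld"))
	fmt.Println(chunkByBudget(words, 3, estimateTokens))
}
```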

Compatibility

Additive new surface throughout. Existing TokenEstimator, ChatClient, rate-limiter behavior all unchanged. govulncheck clean (bumped golang.org/x/net to v0.55.0 to resolve GO-2026-5026 along the way).

Pre-1.0; v0.x minor versions may break.

v0.5.0 — Tool loop helper (ADR-0011)

19 May 23:41
5b5bee9

First non-port addition past v0.4.x: a small, app-neutral helper that wraps the provider-protocol tool round-trip over an llms.ChatClient.

What's new

llms/toolloop — synchronous tool-loop helper. Run(ctx, Config) Outcome over any ChatClient. App-supplied Tool{Definition, Execute}; the loop never executes side effects itself.

out := toolloop.Run(ctx, toolloop.Config{
    Client:  client,
    Request: req,
    Tools:   []toolloop.Tool{weather},
})

switch out.Kind {
case toolloop.OutcomeFinalAnswer:
    fmt.Println(out.Response.Text)
case toolloop.OutcomeMaxIterations, toolloop.OutcomeLLMError,
    toolloop.OutcomeToolError, toolloop.OutcomeCanceled:
    // inspect out.Err, out.Messages, out.TotalUsage
}

Full design: docs/toolloop-proposal.md. Binding scope and non-goals: ADR-0011.

Behaviors pinned

  • Verbatim assistant-message round-trip preserves ToolCall.ProviderSignature (ADR-0010), so Gemini 3 multi-turn tool loops work end-to-end.
  • Fail-closed config (*toolloop.ConfigError): nil Client, empty Messages, Request.Tools set, duplicate tool names, nil Execute, both ToolChoices non-zero, unknown ToolChoice.Type, RequiresTools() with no tools, ToolChoiceTool naming an unknown tool — all rejected before any provider call.
  • Pre-execute MaxIterations stop: the limit-hitting assistant turn is appended to Outcome.Messages as diagnostic state; its tool calls are not executed and the transcript is not directly re-feedable into Complete without repairing the unresolved pairing.
  • Defensive ToolCall copy before dispatch and event emission so a misbehaving executor or observer cannot corrupt the round-trip through shared slice backing (cloned Parameters and ProviderSignature).
  • Failure split: ToolResult{IsError: true} is model-visible (loop continues); a non-nil Execute error is loop-visible → OutcomeToolError. An executor returning context.Canceled → OutcomeCanceled.
  • Cancellation detected via errors.Is(err, context.Canceled) (and ctx.Err() == context.Canceled), matching the toolkit's X5 contract that caller cancel is not converted into a *llms.ProviderError.
  • Unknown-tool recovery is the default: an unregistered tool name appends an error tool result and continues, so the model can self-correct when the provider can't enforce tool_choice.

Non-goals (binding per ADR-0011)

This is a tool loop, not an agent loop. No agent state, terminal/state-transition tools, ProcessEffect, story IDs, request IDs, audit taxonomy, persistence, authorization, tenant isolation, schema-generation, tool registries, moderation hooks, automatic streaming (deferred per ADR-0003), or prompt management. PRs that try to grow IterationEvent/ToolCallEvent with app-specific fields require a superseding ADR.

Compatibility

Additive new surface. No core llms type changes; no MAESTRO_DIVERGENCES.md row required. Existing v0.4.x consumers are unaffected.

Pre-1.0; v0.x minor versions may break.

v0.4.2 — Gemini thought_signature round-trip (G1 resolved)

19 May 12:53
570d023

Patch release. A release-blocking Gemini fix the Maestro team surfaced during cut-over (their PR #220 / migration §5 G1).

Fixed

  • Gemini tool-call round-trip — the opaque functionCall thought_signature is now carried through the message model via the new ToolCall.ProviderSignature, captured on response parse and replayed on the next request. Without it, Gemini 3 Pro (gemini-3-pro-preview) hard-400 INVALID_ARGUMENTs on every 2nd+ tool-loop turn, making it unusable for any multi-turn agentic tool use. The signature travels in the conversation history the app already round-trips — stateless, no per-client cache (restores pre-extraction Maestro parity without the contract violation that motivated dropping the original responseCache). Divergence G1 → RESOLVED; ADR-0010.

Added

  • llms.ToolCall.ProviderSignature []byte — opaque, provider-owned, never interpreted by core; round-trips provider-required per-tool-call state. Additive and zero-value-safe; Anthropic/OpenAI/Ollama leave it nil and ignore it (same neutral pattern as ContentPart.CacheBreakpoint).
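
A minimal sketch of what "opaque and round-tripped" means for the caller, using local stand-in types rather than the toolkit's. appendAssistantTurn and cloneCall are hypothetical helpers; the only real API used is bytes.Clone from the standard library.

```go
package main

import (
	"bytes"
	"fmt"
)

// ToolCall is a local stand-in; only ProviderSignature matters here.
type ToolCall struct {
	ID                string
	Name              string
	ProviderSignature []byte // opaque: carried along, never interpreted
}

// Message is a minimal history entry for the sketch.
type Message struct {
	Role      string
	ToolCalls []ToolCall
}

// cloneCall copies the call including its signature bytes, so later mutation
// of the source slice cannot change what is replayed. bytes.Clone(nil) is nil,
// which keeps providers without signatures zero-value-safe.
func cloneCall(c ToolCall) ToolCall {
	c.ProviderSignature = bytes.Clone(c.ProviderSignature)
	return c
}

// appendAssistantTurn (hypothetical) records the assistant's tool calls in
// history verbatim, signatures included, ready to be sent on the next request.
func appendAssistantTurn(history []Message, calls []ToolCall) []Message {
	copied := make([]ToolCall, len(calls))
	for i, c := range calls {
		copied[i] = cloneCall(c)
	}
	return append(history, Message{Role: "assistant", ToolCalls: copied})
}

func main() {
	sig := []byte{0xde, 0xad}
	history := appendAssistantTurn(nil, []ToolCall{{ID: "call-1", Name: "weather", ProviderSignature: sig}})
	sig[0] = 0x00 // mutating the source does not affect the stored copy
	fmt.Printf("%x\n", history[0].ToolCalls[0].ProviderSignature)
}
```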

Maestro consumers: bumping to v0.4.2 unblocks Gemini for the migration — multi-turn Gemini tool loops now work; migration §5 G1 clears (Anthropic + OpenAI were unaffected).

No breaking changes. Pre-1.0: v0.x minor versions may break.

v0.4.1 — OpenAI incomplete StopReason fix

19 May 05:24
92eceb8

Patch release. A cut-over fix the Maestro team surfaced while switching onto the toolkit.

Fixed

  • OpenAI (Responses) StopReason — on status:"incomplete" the adapter now surfaces incomplete_details.reason ("max_output_tokens" / "content_filter") instead of the envelope status. This makes OpenAI consistent with the raw-finish-reason passthrough of the other adapters (cf. divergences G3/OL3), so consumers can detect length-truncation / content-filter without reaching into Raw (outside the stability contract). Tool calls on a truncated response are preserved. Divergence OC4; no ADR (bugfix aligning to an existing convention). Cross-ref Maestro PR #220 / spec §9.

    Maestro consumers: on bumping to v0.4.1 you can delete the rawStopReason Raw.(*responses.Response) workaround + its two guard tests — normalizeStopReason already maps these reasons; no other change needed.
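
The selection rule in the fix is small enough to state as code. This is a sketch under assumptions: stopReasonFor is a hypothetical name, and its two string parameters stand in for the Responses API's status and incomplete_details.reason fields.

```go
package main

import "fmt"

// stopReasonFor (hypothetical) picks the raw string to surface as StopReason.
// On an incomplete response the specific reason is preferred over the
// envelope status, so callers can tell truncation from content filtering.
func stopReasonFor(status, incompleteReason string) string {
	if status == "incomplete" && incompleteReason != "" {
		return incompleteReason
	}
	return status
}

func main() {
	fmt.Println(stopReasonFor("incomplete", "max_output_tokens"))
	fmt.Println(stopReasonFor("incomplete", "content_filter"))
	fmt.Println(stopReasonFor("completed", ""))
}
```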

Docs

  • ADR-0009 — recorded the 2026-05-18 end-to-end live validation of the v0.4 Vertex path (Anthropic via anthropicvertex + Gemini embeddings) against real Google Vertex AI (append-only; design validated as built).

No API changes. Pre-1.0: v0.x minor versions may break.

v0.4.0 — Vertex AI backend & task-typed embeddings

18 May 02:14
cab919c

Google Vertex AI backend + task-typed embeddings (ADR-0009). The package now covers Morris's managed-AI posture — Claude and Gemini embeddings over Vertex/PSC, app-supplied auth, no static provider keys.

Added since v0.3.0

  • llms/providers/anthropic/anthropicvertex — Claude via Vertex AI as a separate leaf package so the base anthropic package stays Google-dependency-free; returns the same *anthropic.Client (all request/response/tool/cache translation + middleware reused).
  • anthropic.WithRequestOptions — low-level SDK escape hatch; API key optional when request options supply auth.
  • google.NewEmbeddings — Gemini/Vertex embeddings (gemini-embedding-001), order/ID-preserving, per-request dimension override.
  • llms.EmbeddingTask + EmbeddingInput.Title — provider-neutral, advisory; honored on Gemini, ignored on OpenAI (same pattern as CacheBreakpoint).
  • App-supplied credentials + PSC endpoint/transport injection; no ADC discovery. The auth-vs-PSC-transport precedence is documented and covered by option-order tests.
  • Fail-closed truncation: genai cannot send autoTruncate:false, so AutoTruncate=true is Vertex-only and AutoTruncate=false requires a client-side MaxInputBytes guard — NewEmbeddings fails rather than look safe while Vertex silently truncates.
  • gemini-embedding-001 is single-input: a multi-input request returns a typed bad_request — no fan-out, no hidden chunking exception (the app owns chunking).
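
One reading of the fail-closed rules in the last two bullets, as a sketch: embedConfig and its field names are hypothetical and do not reflect the real google.NewEmbeddings options. The error values are illustrative stand-ins for the typed bad_request the release describes.

```go
package main

import (
	"errors"
	"fmt"
)

// embedConfig is a hypothetical option set; the field names are illustrative.
type embedConfig struct {
	Vertex        bool // true when the backend is Vertex AI
	AutoTruncate  bool
	MaxInputBytes int // client-side size guard; 0 means unset
}

var (
	errTruncateNeedsVertex = errors.New("AutoTruncate=true is Vertex-only")
	errMissingGuard        = errors.New("AutoTruncate=false requires MaxInputBytes")
	errMultiInput          = errors.New("bad_request: exactly one input per request")
	errInputTooLarge       = errors.New("bad_request: input exceeds MaxInputBytes")
)

// validateConfig fails at construction time instead of appearing safe while
// the backend truncates silently.
func validateConfig(c embedConfig) error {
	if c.AutoTruncate && !c.Vertex {
		return errTruncateNeedsVertex
	}
	if !c.AutoTruncate && c.MaxInputBytes <= 0 {
		return errMissingGuard
	}
	return nil
}

// validateInputs enforces the single-input rule and the byte guard per request.
func validateInputs(c embedConfig, inputs []string) error {
	if len(inputs) != 1 {
		return errMultiInput // no fan-out: the app owns chunking
	}
	if !c.AutoTruncate && len(inputs[0]) > c.MaxInputBytes {
		return errInputTooLarge
	}
	return nil
}

func main() {
	cfg := embedConfig{Vertex: true, AutoTruncate: false, MaxInputBytes: 8}
	fmt.Println(validateConfig(cfg))
	fmt.Println(validateInputs(cfg, []string{"short"}))
	fmt.Println(validateInputs(cfg, []string{"a", "b"}))
}
```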

Decisions of record

ADR-0009 (Vertex/PSC/Gemini-embeddings): the leaf-package shape, app-supplied-auth-only, the precedence rule, and the auto-truncate fail-closed refinement.

Out of scope (Morris / OpenTofu infrastructure)

Vertex API enablement, aiplatform IAM, PSC/restricted-API DNS, VPC-SC perimeter, egress lockdown, model/region config. Streaming and vLLM remain deliberately deferred.

Pre-1.0: v0.x minor versions may break.

v0.3.0 — middleware complete

17 May 20:19
6023418

The provider-neutral middleware line is complete. Every chat/embedding client can now be composed with the full resilience + observability stack behind the same stable interfaces.

Added since v0.2.0

  • llms/middleware — the full set:
    • ValidationChat (structural/app-neutral: text-only System, role/part legality, tool-call↔result pairing)
    • RetryChat/RetryEmbeddings (classify via llms.Retryable, honor RetryAfter, backoff + optional jitter)
    • TimeoutChat/TimeoutEmbeddings (per-attempt deadline)
    • CircuitChat/CircuitEmbeddings (3-state breaker, single-flight half-open, non-retryable *CircuitOpenError)
    • MetricsChat/MetricsEmbeddings (narrow app-neutral Observer, one Event per attempt)
    • RecommendedChat/RecommendedEmbeddings — wires the spec's recommended composition order; ChainChat stays the primitive
  • llms.RetryAfter accessor, symmetric with llms.Retryable
  • apierr: context.Canceled returned as-is (non-retryable, errors.Is-matchable); context.DeadlineExceeded stays a retryable timeout
  • Docs: ADR log adopted (docs/adr/, 0001–0006); MAESTRO_DIVERGENCES extended; README rewritten as a project + usage guide
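
The retry bullets (classify, honor RetryAfter, back off) and the apierr rule for context errors can be sketched together. Everything here is a local stand-in: retryable and nextDelay are hypothetical, the doubling schedule and the 5-second cap are assumptions, and jitter is omitted.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

const (
	baseDelay = 100 * time.Millisecond
	maxDelay  = 5 * time.Second // assumption: cap chosen for the sketch
)

// Sample errors, wrapped the way a provider call might wrap them.
var (
	errCanceled = fmt.Errorf("complete: %w", context.Canceled)
	errDeadline = fmt.Errorf("complete: %w", context.DeadlineExceeded)
)

// retryable (hypothetical): caller cancellation is never retried, while a
// per-attempt deadline counts as a retryable timeout.
func retryable(err error) bool {
	switch {
	case err == nil:
		return false
	case errors.Is(err, context.Canceled):
		return false
	case errors.Is(err, context.DeadlineExceeded):
		return true
	default:
		return false // assumption: unknown errors are not retried here
	}
}

// nextDelay honors a provider-supplied retry-after when present; otherwise it
// doubles base once per prior attempt (attempt is zero-based), up to maxDelay.
func nextDelay(attempt int, base, retryAfter time.Duration) time.Duration {
	if retryAfter > 0 {
		return retryAfter
	}
	d := base
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

func main() {
	fmt.Println(retryable(errCanceled), retryable(errDeadline))
	fmt.Println(nextDelay(0, baseDelay, 0), nextDelay(2, baseDelay, 0), nextDelay(1, baseDelay, 2*time.Second))
}
```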

Decisions of record

ADR-0003 (middleware is Complete/Embed-only; streaming deferred), ADR-0004 (single error classifier), ADR-0005 (non-retryable CircuitOpenError), ADR-0006 (structural ValidationError).

Not in this release — Maestro cut-over readiness gate

v0.3.0 is a complete middleware milestone, not a "cut-over ready" claim. Two tracked, separate readiness gates remain before the Maestro cut-over:

  • ToolChoiceRequired — "the model must call one of the offered tools" (general capability; divergences OC2/G2)
  • Provider-neutral prompt-cache hint — restores Maestro's CacheControl behavior (divergence A5)

Streaming and vLLM remain deliberately deferred. Pre-1.0: v0.x minor versions may break.

v0.2.0 — provider chat ports

16 May 21:31
746fb71

Provider chat ports complete. All four providers now offer live, tool-use-capable chat behind the app-neutral interfaces, with the full live integration suite (simple-chat + tool-use, every provider) green as the Maestro cut-over acceptance gate.

Added since v0.1.0

  • llms/providers/openai — OpenAI chat via the Responses API (structured message / function_call / function_call_output items; honors caller ToolChoice)
  • llms/providers/google — Gemini chat via genai (stateless / concurrency-safe; real finish reason + usage)
  • llms/providers/ollama — hand-rolled /api/chat client, no SDK dependency (avoids the ollama module's unfixed CVEs); raw done_reason + token usage
  • Shared error classifier (apierr) — typed HTTP-status classification across anthropic + openai + google + ollama
  • Live integration suite — build-tagged tests for all four providers, each with simple-chat and a full tool-use round trip; OS-aware make test-integration (macOS ad-hoc codesign path, Linux/CI plain go test); manual workflow_dispatch CI
  • docs/MAESTRO_DIVERGENCES.md — living cut-over acceptance checklist

Deferred to v0.3 (not in this release)

Middleware: retry, timeout, circuit, metrics, validation. Shipped middleware remains ChainChat/ChainEmbeddings, token estimator, and ratelimit (from v0.1).

Pre-1.0: v0.x minor versions may break.

v0.1.0 — first extraction milestone

16 May 01:26
342528a

First extraction milestone: an app-neutral Go LLM/embedding toolkit shared by Maestro and Morris.

Included

  • llms — app-neutral chat + embedding interfaces; Message/ContentPart (tool calls/results as content parts), Usage (incl. cache tokens), classified ProviderError/LimitError; optional StreamingChatClient capability interface.
  • llms/testllm — deterministic, concurrency-safe chat + embedding fakes (recordings deep-copied; stable hash embeddings).
  • llms/ratelimit — Limiter/Reservation reservation protocol + in-memory token-bucket + concurrency limiter (lazy clock-based refill, no background goroutine; reconciliation contract).
  • llms/middleware — ChainChat/ChainEmbeddings (first arg outermost), DefaultEstimator, rate-limit middleware (reserve → release on cancellation-surviving ctx → commit actual usage).
  • llms/providers/anthropic — Anthropic chat: typed SDK-error classification, cache-token usage, raw tool-call params, strict-alternation translation.
  • llms/providers/openai — OpenAI embeddings: input-order/ID preserving, per-request dimension override, batch-limit validation.
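
The "first arg outermost" ordering of the chain helpers is easy to get backwards, so here is a sketch of the convention with local stand-in types. Handler, Middleware, chain and tag are hypothetical and do not mirror ChainChat's real signature.

```go
package main

import (
	"fmt"
	"strings"
)

// Handler and Middleware are local stand-ins for a chat call and its wrappers.
type Handler func(prompt string) string
type Middleware func(next Handler) Handler

// chain wraps h so that the first middleware listed is the outermost layer,
// meaning it sees the call first and the response last.
func chain(h Handler, mws ...Middleware) Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// tag records entry order into trace so the nesting is observable.
func tag(name string, trace *[]string) Middleware {
	return func(next Handler) Handler {
		return func(p string) string {
			*trace = append(*trace, name)
			return next(p)
		}
	}
}

func main() {
	var trace []string
	echo := func(p string) string { return "echo:" + p }
	h := chain(echo, tag("retry", &trace), tag("timeout", &trace))
	fmt.Println(h("hi"), strings.Join(trace, " > "))
}
```

Applying the wrappers in reverse is what makes the first listed one outermost: the last wrapper applied is the first to receive the call.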

Requirements

Go 1.26.3+ (patched net and net/http stdlib advisories GO-2026-4971 / GO-2026-4918).

Not included (roadmap)

  • v0.2: retry / timeout / circuit-breaker middleware, metrics hook interfaces, richer error classification.
  • v0.3: Google + Ollama providers, optional streaming, shared provider error-classifier.

Distributed (e.g. PostgreSQL) limiters are implemented by applications against the ratelimit.Limiter interface; none ship here yet by design.