Releases · SnapdragonPartners/maestro-llms

31 May 21:25

dratner

v0.7.1

6d9a7aa

v0.7.1 — reasoning + billable output normalization (ADR-0016)


Patch release. Investigated as a Gemini bug, shipped as the cross-provider normalization the toolkit was missing.

The bug

A chat to gemini-3.1-pro-preview-customtools with MaxTokens=1024 returned OutputTokens=36, StopReason=MAX_TOKENS. That was confusing because 36 is nowhere near 1024: the model had silently spent ~988 tokens on thinking, and the toolkit dropped that count. Worse, the bug wasn't Gemini-specific: every provider's wire output_tokens had a different meaning, so Usage.OutputTokens was silently inconsistent across providers.

Fix (ADR-0016)

Two new fields on llms.Usage:

ReasoningTokens      int  // separately-metered thinking; ADDITIVE to OutputTokens
BillableOutputTokens int  // what the provider charges as "output"

OutputTokens is redefined as visible-only across every adapter. Budget math holds: a length truncation fires when Input + Output + Reasoning approaches the cap. Cost math: read BillableOutputTokens (one field, one semantic, no per-provider branching).

| Provider | OutputTokens | ReasoningTokens | BillableOutputTokens |
| --- | --- | --- | --- |
| Gemini | `CandidatesTokenCount` | `ThoughtsTokenCount` | `Candidates + Thoughts` |
| OpenAI Responses | `wire − OutputTokensDetails.ReasoningTokens` | `OutputTokensDetails.ReasoningTokens` | `wire.OutputTokens` |
| OpenAI Chat Completions (vLLM) | same subtraction via `CompletionTokensDetails` | same | `wire.CompletionTokens` |
| Anthropic | `wire.OutputTokens` (includes thinking when on; not separable) | 0 (SDK has no field) | mirrors OutputTokens |
| Ollama | `eval_count` | 0 | mirrors OutputTokens |
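As a rough illustration of the mapping in the table, the sketch below normalizes a Gemini-style and an OpenAI-style usage report into the same shape. The `Usage` struct and the `fromGemini` / `fromOpenAI` helpers are local stand-ins invented for this example; they are not the toolkit's `llms.Usage` or its adapter code, and the numbers are taken from the demo footer in these notes.

```go
package main

import "fmt"

// Usage is a local stand-in mirroring the fields described in the notes.
// It is NOT the toolkit's llms.Usage type.
type Usage struct {
	InputTokens          int
	OutputTokens         int // visible output only
	ReasoningTokens      int // separately metered thinking, additive to OutputTokens
	BillableOutputTokens int // what the provider charges as "output"
}

// fromGemini (hypothetical helper) maps counts that the wire reports separately.
func fromGemini(prompt, candidates, thoughts int) Usage {
	return Usage{
		InputTokens:          prompt,
		OutputTokens:         candidates,
		ReasoningTokens:      thoughts,
		BillableOutputTokens: candidates + thoughts,
	}
}

// fromOpenAI (hypothetical helper) maps a wire total that already includes reasoning.
func fromOpenAI(input, wireOutput, reasoning int) Usage {
	return Usage{
		InputTokens:          input,
		OutputTokens:         wireOutput - reasoning,
		ReasoningTokens:      reasoning,
		BillableOutputTokens: wireOutput,
	}
}

func main() {
	g := fromGemini(12, 102, 918)
	o := fromOpenAI(12, 1020, 918)
	fmt.Println(g == o)                                                    // prints true
	fmt.Println(g.OutputTokens+g.ReasoningTokens == g.BillableOutputTokens) // prints true
}
```

Both wire shapes end up with the same visible, reasoning, and billable counts, which is the property the normalization is meant to provide.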

Breaking semantic for OpenAI o-series

Pre-v0.7.1, Usage.OutputTokens was the OpenAI wire output_tokens, which for o-series silently included reasoning. Now it's visible-only. Migration: code reading OutputTokens as the billing-relevant number should read BillableOutputTokens instead (carries the exact same value the old field did). Non-reasoning OpenAI calls are unaffected — reasoning_tokens=0 makes the subtraction a no-op.
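A minimal sketch of the migration for cost code, assuming a local `Usage` stand-in and a hypothetical price expressed in micro-USD per million output tokens (neither the helper nor the price comes from the toolkit):

```go
package main

import "fmt"

// Usage is a local stand-in for the normalized usage fields; not the toolkit's type.
type Usage struct {
	InputTokens          int
	OutputTokens         int
	ReasoningTokens      int
	BillableOutputTokens int
}

// outputCostMicroUSD prices output from BillableOutputTokens.
// microUSDPerMTok is a hypothetical price: micro-USD per one million tokens.
func outputCostMicroUSD(u Usage, microUSDPerMTok int64) int64 {
	return int64(u.BillableOutputTokens) * microUSDPerMTok / 1_000_000
}

func main() {
	u := Usage{InputTokens: 12, OutputTokens: 102, ReasoningTokens: 918, BillableOutputTokens: 1020}

	// Pricing from the visible-only field would undercount a reasoning call.
	visibleOnly := int64(u.OutputTokens) * 10_000_000 / 1_000_000

	fmt.Println(outputCostMicroUSD(u, 10_000_000)) // prints 10200
	fmt.Println(visibleOnly)                       // prints 1020
}
```

For a non-reasoning call the two numbers coincide, so switching the read to BillableOutputTokens should be safe for both kinds of request.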

Pre-1.0 is the right time to make this consistent; carrying the inconsistency to v1.0 would have been worse.

Demo bonus (`examples/chat`)

The bubble footer surfaces the breakdown so a max_tokens stop on a small visible output is self-explanatory:

gemini-3.1-pro-preview-customtools · max tokens · 12 in / 102 out / 918 reasoning · 1020 billable · 10875 ms

Non-reasoning models keep the terse <in> in / <out> out form (billable would just duplicate out).
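The footer rule can be sketched roughly as follows. The `usageFooter` helper and the local `Usage` struct are hypothetical and only cover the token portion of the line; the demo's actual rendering code may differ.

```go
package main

import "fmt"

// Usage is a local stand-in for the normalized usage fields; not the toolkit's type.
type Usage struct {
	InputTokens          int
	OutputTokens         int
	ReasoningTokens      int
	BillableOutputTokens int
}

// usageFooter (hypothetical) renders the token part of the bubble footer.
// The reasoning/billable breakdown appears only when reasoning was metered.
func usageFooter(u Usage) string {
	if u.ReasoningTokens == 0 {
		return fmt.Sprintf("%d in / %d out", u.InputTokens, u.OutputTokens)
	}
	return fmt.Sprintf("%d in / %d out / %d reasoning · %d billable",
		u.InputTokens, u.OutputTokens, u.ReasoningTokens, u.BillableOutputTokens)
}

func main() {
	fmt.Println(usageFooter(Usage{InputTokens: 12, OutputTokens: 102, ReasoningTokens: 918, BillableOutputTokens: 1020}))
	fmt.Println(usageFooter(Usage{InputTokens: 12, OutputTokens: 40, BillableOutputTokens: 40}))
}
```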

Compatibility

Additive new fields. Existing fields keep their names. The only semantic shift is OpenAI o-series OutputTokens. No MAESTRO_DIVERGENCES.md row — Maestro never surfaced either field; nothing to diverge from.

Pre-1.0; v0.x minor versions may break.

🤖 Generated with Claude Code


31 May 20:03

dratner

v0.7.0

e15081a

v0.7.0 — vLLM provider

Headline: self-hosted GPU inference via vLLM. The toolkit now wraps five chat providers behind one stable interface: Anthropic, OpenAI (Responses), Google, Ollama, vLLM.

`llms/providers/vllm` (ADR-0015)

Leaf package alongside openai, using the OpenAI Chat Completions surface (not Responses) via openai-go with a configurable base URL. Zero new dependencies — openai-go is already imported for the existing OpenAI Responses adapter.

import "github.com/SnapdragonPartners/maestro-llms/llms/providers/vllm"

client, _ := vllm.New(
    vllm.WithBaseURL("http://my-vllm:8000"),
    vllm.WithModel("mistralai/Ministral-3-14B-Instruct-2512"),
    // WithAPIKey is OPTIONAL — vLLM's default deployment has no auth.
)

Key behaviors:

No-auth by default is the distinguishing feature vs hosted providers. An empty WithAPIKey is a valid configuration, not a config error.
ModelLister implemented; LatestInFamily is not — HuggingFace-style names have no canonical family convention. Same shape as Ollama. Regression-guard test ensures this stays binding.
ModelInfo.Created is the load time on the vLLM instance, NOT the upstream HuggingFace release date. Don't surface it as freshness.
Tool calling works through the standard tools / tool_choice request fields, but actual emission depends on the vLLM server's per-model --tool-call-parser config.
Streaming deferred per ADR-0003.

Live integration test gated on MAESTRO_VLLM (full base URL). Optional MAESTRO_VLLM_MODEL overrides the model ID, defaulting to the first one /v1/models reports. Picked up automatically by make test-integration.
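One way the environment gating could look, as a sketch only: `integrationTarget` is a hypothetical helper, and the `served` slice stands in for whatever `/v1/models` returned. The real test setup is not shown in these notes and may be organized differently.

```go
package main

import (
	"fmt"
	"os"
)

// integrationTarget (hypothetical) resolves the base URL and model for a live test.
// ok is false when MAESTRO_VLLM is unset, meaning the test should be skipped.
// served stands in for the model IDs reported by the server's /v1/models.
func integrationTarget(getenv func(string) string, served []string) (baseURL, model string, ok bool) {
	baseURL = getenv("MAESTRO_VLLM")
	if baseURL == "" {
		return "", "", false
	}
	model = getenv("MAESTRO_VLLM_MODEL")
	if model == "" {
		if len(served) == 0 {
			// Assumption: with no override and no served models there is nothing to test.
			return "", "", false
		}
		model = served[0]
	}
	return baseURL, model, true
}

func main() {
	served := []string{"mistralai/Ministral-3-14B-Instruct-2512"}
	baseURL, model, ok := integrationTarget(os.Getenv, served)
	if !ok {
		fmt.Println("MAESTRO_VLLM not set; skipping live test")
		return
	}
	fmt.Println("testing", model, "at", baseURL)
}
```

Passing the lookup function in, rather than calling `os.Getenv` directly, keeps the gating logic testable without touching the process environment.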

ADR-0013 amendment (PR #44)

Append-only clarification that the future opt-in TextEstimator design space includes API-backed variants (Anthropic count_tokens, Gemini CountTokens) — both zero-new-dep wrappers since those SDKs are already imported — not just embedded tokenizers like tiktoken-go. Decision unchanged; framing widened so the future ADR doesn't get retconned.

`MAESTRO_DIVERGENCES.md`

Informational V1-V5 rows for vLLM (greenfield, no Maestro behavior to diverge from; rows document vLLM-specific behaviors a cut-over consumer should know about).

Not in this tag (separate module)

examples/chat (PR #46) is a runs-locally web demo wiring every provider through one UI (RecommendedChat middleware, per-provider ListModels, EstimateTextTokens). It lives under its own go.mod so its dependency closure doesn't leak into the toolkit. Build and run it from examples/chat with `go run .`

Compatibility

Additive throughout. No core llms type changes. Existing v0.6.x consumers are unaffected. govulncheck clean.

Pre-1.0; v0.x minor versions may break.

🤖 Generated with Claude Code


30 May 21:41

dratner

v0.6.0

70aff1b

v0.6.0 — model listing & text token estimator

Two additive features, no core type breakage, no MAESTRO_DIVERGENCES.md rows. Adds maestro-cms as a third consumer of the toolkit alongside Maestro and Morris.

`llms.ModelLister` + per-provider `LatestInFamily` (ADR-0012)

Optional capability for listing a provider's catalog and detecting newer models in the same family as a pinned ID. Surfaces "newer model available, upgrade?" — never auto-updates.

lister, ok := client.(llms.ModelLister)
if !ok { return } // provider doesn't expose a list (e.g. future vLLM)

models, _ := lister.ListModels(ctx)
newer, found := anthropic.LatestInFamily(currentID, models)
if found {
    fmt.Printf("Newer model: %s (released %s)\n", newer.ID, newer.Created.Format("2006-01-02"))
}

// Or one-shot:
newer, found, err := client.LatestInFamily(ctx, currentID) // err is new here, so := is required

Anthropic — family claude-{opus|sonnet|haiku}, crosses generations (claude-3-5-sonnet-… and claude-sonnet-4-5-… are both claude-sonnet). Ordered by CreatedAt.
OpenAI — family = ID with -YYYY-MM-DD stripped. Self-filtering by family-prefix means embedding/image models in the catalog don't collide with gpt-* queries. Ordered by Created (Unix).
Google — family gemini-{pro|flash|nano|ultra}. genai exposes no created date, so ordered by parsed numeric version in the ID.
Ollama — ListModels only (local pulls have no canonical family). Created is local pull time, not provider release time.

Permissive family parsing by design. Callers wanting major-version pinning filter the list themselves.
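The family rules for Anthropic and OpenAI can be approximated as below. This is a sketch of the described behavior, not the toolkit's parser: `anthropicFamily` and `openAIFamily` are hypothetical names, and the model IDs are only example inputs.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// dateSuffix matches a trailing -YYYY-MM-DD on a model ID.
var dateSuffix = regexp.MustCompile(`-\d{4}-\d{2}-\d{2}$`)

// openAIFamily (hypothetical) strips a trailing date, leaving the family name.
func openAIFamily(id string) string {
	return dateSuffix.ReplaceAllString(id, "")
}

// anthropicFamily (hypothetical) looks for the tier name anywhere in the ID,
// so old-style and new-style IDs land in the same family.
func anthropicFamily(id string) (string, bool) {
	if !strings.HasPrefix(id, "claude-") {
		return "", false
	}
	for _, part := range strings.Split(id, "-") {
		switch part {
		case "opus", "sonnet", "haiku":
			return "claude-" + part, true
		}
	}
	return "", false
}

func main() {
	fmt.Println(openAIFamily("gpt-4o-2024-08-06")) // prints gpt-4o
	for _, id := range []string{"claude-3-5-sonnet-20241022", "claude-sonnet-4-5-20250929"} {
		family, ok := anthropicFamily(id)
		fmt.Println(family, ok) // prints claude-sonnet true for both IDs
	}
}
```

Because the tier match ignores the version segments, a caller that wants to stay within one major version would still need to filter the candidate list itself, as noted above.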

`llms.EstimateTextTokens` (ADR-0013)

Exported free function for budget-aware text chunking. Zero new dependencies.

budget := llms.EstimateTextTokens(s) // approx token count
// Directly assignable as func(string) int — e.g. for maestro-cms chunk injection.

Neutral bias (~4 chars/token, rune-counted) — intentionally distinct from the middleware TokenEstimator's high bias.
Two estimators, two purposes: over-reserving at the limiter is safe; over-estimating at chunk time wastes API calls. ADR-0013 makes the split binding.
Tokenizer-backed / model-aware variant deferred to a future ADR when a consumer needs the fidelity.
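For intuition, a rune-counted estimate at roughly 4 characters per token might look like the sketch below. The lowercase `estimateTextTokens` is a local illustration, and rounding up is an assumption; the notes do not state how `llms.EstimateTextTokens` rounds.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// estimateTextTokens is an illustrative stand-in, not llms.EstimateTextTokens.
// It counts runes (not bytes) and divides by 4, rounding up (assumed rounding).
func estimateTextTokens(s string) int {
	return (utf8.RuneCountInString(s) + 3) / 4
}

func main() {
	// The function value is directly assignable as func(string) int.
	var estimate func(string) int = estimateTextTokens

	s := "héllo wörld!" // 12 runes, 14 bytes in UTF-8
	fmt.Println(estimate(s))    // prints 3
	fmt.Println((len(s) + 3) / 4) // byte-based count would print 4
}
```

Counting runes keeps the estimate from inflating on non-ASCII text, where one character can occupy several bytes.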

Compatibility

Additive new surface throughout. Existing TokenEstimator, ChatClient, rate-limiter behavior all unchanged. govulncheck clean (bumped golang.org/x/net to v0.55.0 to resolve GO-2026-5026 along the way).

Pre-1.0; v0.x minor versions may break.

🤖 Generated with Claude Code


19 May 23:41

dratner

v0.5.0

5b5bee9

v0.5.0 — Tool loop helper (ADR-0011)

First non-port addition past v0.4.x: a small, app-neutral helper that wraps the provider-protocol tool round-trip over an llms.ChatClient.

What's new

llms/toolloop — synchronous tool-loop helper. Run(ctx, Config) Outcome over any ChatClient. App-supplied Tool{Definition, Execute}; the loop never executes side effects itself.

out := toolloop.Run(ctx, toolloop.Config{
    Client:  client,
    Request: req,
    Tools:   []toolloop.Tool{weather},
})

switch out.Kind {
case toolloop.OutcomeFinalAnswer:
    fmt.Println(out.Response.Text)
case toolloop.OutcomeMaxIterations, toolloop.OutcomeLLMError,
    toolloop.OutcomeToolError, toolloop.OutcomeCanceled:
    // inspect out.Err, out.Messages, out.TotalUsage
}

Full design: docs/toolloop-proposal.md. Binding scope and non-goals: ADR-0011.

Behaviors pinned

Verbatim assistant-message round-trip preserves ToolCall.ProviderSignature (ADR-0010), so Gemini 3 multi-turn tool loops work end-to-end.
Fail-closed config (*toolloop.ConfigError): nil Client, empty Messages, Request.Tools set, duplicate tool names, nil Execute, both ToolChoices non-zero, unknown ToolChoice.Type, RequiresTools() with no tools, ToolChoiceTool naming an unknown tool — all rejected before any provider call.
Pre-execute MaxIterations stop: the limit-hitting assistant turn is appended to Outcome.Messages as diagnostic state; its tool calls are not executed and the transcript is not directly re-feedable into Complete without repairing the unresolved pairing.
Defensive ToolCall copy before dispatch and event emission so a misbehaving executor or observer cannot corrupt the round-trip through shared slice backing (cloned Parameters and ProviderSignature).
Failure split: ToolResult{IsError: true} is model-visible (loop continues); a non-nil Execute error is loop-visible → OutcomeToolError. An executor returning context.Canceled → OutcomeCanceled.
Cancellation detected via errors.Is(err, context.Canceled) (and ctx.Err() == context.Canceled), matching the toolkit's X5 contract that caller cancel is not converted into a *llms.ProviderError.
Unknown-tool recovery is the default: an unregistered tool name appends an error tool result and continues, so the model can self-correct when the provider can't enforce tool_choice.
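The failure split and cancellation check described in this list can be sketched with the standard library alone. The `outcome` type and `classify` function are hypothetical simplifications; the real loop returns the `toolloop.Outcome` kinds shown earlier.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// outcome is a simplified, hypothetical stand-in for the loop's outcome kinds.
type outcome string

const (
	loopContinues outcome = "continue"
	toolError     outcome = "tool_error"
	canceled      outcome = "canceled"
)

// classify decides what an Execute error means for the loop.
// A nil error covers model-visible failures too: those travel in the tool
// result (IsError: true), so the loop simply continues.
func classify(err error) outcome {
	switch {
	case errors.Is(err, context.Canceled): // also matches wrapped cancellation
		return canceled
	case err != nil:
		return toolError
	default:
		return loopContinues
	}
}

// demo runs the three cases described in the notes.
func demo() []outcome {
	return []outcome{
		classify(nil),
		classify(errors.New("disk full")),
		classify(fmt.Errorf("weather tool: %w", context.Canceled)),
	}
}

func main() {
	fmt.Println(demo()) // prints [continue tool_error canceled]
}
```

Using `errors.Is` rather than `==` means an executor that wraps the cancellation with extra context is still recognized as a caller cancel.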

Non-goals (binding per ADR-0011)

This is a tool loop, not an agent loop. No agent state, terminal/state-transition tools, ProcessEffect, story IDs, request IDs, audit taxonomy, persistence, authorization, tenant isolation, schema-generation, tool registries, moderation hooks, automatic streaming (deferred per ADR-0003), or prompt management. PRs that try to grow IterationEvent/ToolCallEvent with app-specific fields require a superseding ADR.

Compatibility

Additive new surface. No core llms type changes; no MAESTRO_DIVERGENCES.md row required. Existing v0.4.x consumers are unaffected.

Pre-1.0; v0.x minor versions may break.

🤖 Generated with Claude Code


19 May 12:53

dratner

v0.4.2

570d023

v0.4.2 — Gemini thought_signature round-trip (G1 resolved)

Patch release. A release-blocking Gemini fix the Maestro team surfaced during cut-over (their PR #220 / migration §5 G1).

Fixed

Gemini tool-call round-trip — the opaque functionCall thought_signature is now carried through the message model via the new ToolCall.ProviderSignature, captured on response parse and replayed on the next request. Without it, Gemini 3 Pro (gemini-3-pro-preview) hard-400 INVALID_ARGUMENTs on every 2nd+ tool-loop turn, making it unusable for any multi-turn agentic tool use. The signature travels in the conversation history the app already round-trips — stateless, no per-client cache (restores pre-extraction Maestro parity without the contract violation that motivated dropping the original responseCache). Divergence G1 → RESOLVED; ADR-0010.

Added

llms.ToolCall.ProviderSignature []byte — opaque, provider-owned, never interpreted by core; round-trips provider-required per-tool-call state. Additive and zero-value-safe; Anthropic/OpenAI/Ollama leave it nil and ignore it (same neutral pattern as ContentPart.CacheBreakpoint).

Maestro consumers: bumping to v0.4.2 unblocks Gemini for the migration — multi-turn Gemini tool loops now work; migration §5 G1 clears (Anthropic + OpenAI were unaffected).

No breaking changes. Pre-1.0: v0.x minor versions may break.


19 May 05:24

dratner

v0.4.1

92eceb8

v0.4.1 — OpenAI incomplete StopReason fix

Patch release. A cut-over fix the Maestro team surfaced while switching onto the toolkit.

Fixed

OpenAI (Responses) StopReason — on status:"incomplete" the adapter now surfaces incomplete_details.reason ("max_output_tokens" / "content_filter") instead of the envelope status. This makes OpenAI consistent with the raw-finish-reason passthrough of the other adapters (cf. divergences G3/OL3), so consumers can detect length-truncation / content-filter without reaching into Raw (outside the stability contract). Tool calls on a truncated response are preserved. Divergence OC4; no ADR (bugfix aligning to an existing convention). Cross-ref Maestro PR #220 / spec §9.
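In outline, the fix amounts to preferring the detail reason over the envelope status. The `stopReason` helper below is a hypothetical reduction to plain strings; falling back to the status when no reason is present is an assumption, not something these notes specify.

```go
package main

import "fmt"

// stopReason (hypothetical) picks the value to surface as StopReason.
// status is the response envelope status; incompleteReason is
// incomplete_details.reason, empty when absent.
func stopReason(status, incompleteReason string) string {
	if status == "incomplete" && incompleteReason != "" {
		return incompleteReason
	}
	// Assumption: with no detail reason, the envelope status is all we have.
	return status
}

func main() {
	fmt.Println(stopReason("incomplete", "max_output_tokens")) // prints max_output_tokens
	fmt.Println(stopReason("incomplete", "content_filter"))    // prints content_filter
	fmt.Println(stopReason("completed", ""))                   // prints completed
}
```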

Maestro consumers: on bumping to v0.4.1 you can delete the rawStopReason Raw.(*responses.Response) workaround + its two guard tests — normalizeStopReason already maps these reasons; no other change needed.

Docs

ADR-0009 — recorded the 2026-05-18 end-to-end live validation of the v0.4 Vertex path (Anthropic via anthropicvertex + Gemini embeddings) against real Google Vertex AI (append-only; design validated as built).

No API changes. Pre-1.0: v0.x minor versions may break.


18 May 02:14

dratner

v0.4.0

cab919c

v0.4.0 — Vertex AI backend & task-typed embeddings

Google Vertex AI backend + task-typed embeddings (ADR-0009). The package now covers Morris's managed-AI posture — Claude and Gemini embeddings over Vertex/PSC, app-supplied auth, no static provider keys.

Added since v0.3.0

llms/providers/anthropic/anthropicvertex — Claude via Vertex AI as a separate leaf package so the base anthropic package stays Google-dependency-free; returns the same *anthropic.Client (all request/response/tool/cache translation + middleware reused).
anthropic.WithRequestOptions — low-level SDK escape hatch; API key optional when request options supply auth.
google.NewEmbeddings — Gemini/Vertex embeddings (gemini-embedding-001), order/ID-preserving, per-request dimension override.
llms.EmbeddingTask + EmbeddingInput.Title — provider-neutral, advisory; honored on Gemini, ignored on OpenAI (same pattern as CacheBreakpoint).
App-supplied credentials + PSC endpoint/transport injection; no ADC discovery. The auth-vs-PSC-transport precedence is documented and covered by option-order tests.
Fail-closed truncation: genai cannot send autoTruncate:false, so AutoTruncate=true is Vertex-only and AutoTruncate=false requires a client-side MaxInputBytes guard — NewEmbeddings fails rather than look safe while Vertex silently truncates.
gemini-embedding-001 is single-input: a multi-input request returns a typed bad_request — no fan-out, no hidden chunking exception (the app owns chunking).
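The fail-closed truncation rule can be read as a construction-time check along these lines. `embeddingsConfig` and `validate` are hypothetical names chosen for the sketch, and the exact conditions are one interpretation of the bullet above rather than the toolkit's option handling.

```go
package main

import (
	"errors"
	"fmt"
)

// embeddingsConfig is a hypothetical stand-in for the relevant options.
type embeddingsConfig struct {
	Vertex        bool // true when targeting the Vertex backend
	AutoTruncate  bool // let the server truncate long inputs
	MaxInputBytes int  // client-side guard; 0 means unset
}

// validate fails at construction instead of letting truncation happen silently.
func validate(c embeddingsConfig) error {
	if c.AutoTruncate && !c.Vertex {
		return errors.New("AutoTruncate=true is only available on the Vertex backend")
	}
	if !c.AutoTruncate && c.MaxInputBytes <= 0 {
		return errors.New("AutoTruncate=false requires a positive MaxInputBytes guard")
	}
	return nil
}

func main() {
	fmt.Println(validate(embeddingsConfig{Vertex: true, AutoTruncate: true}))
	fmt.Println(validate(embeddingsConfig{Vertex: true, MaxInputBytes: 8192}))
	fmt.Println(validate(embeddingsConfig{Vertex: true}))
}
```

The third call is the case the notes warn about: truncation is nominally off, but with no client-side guard the server could still shorten the input, so the configuration is rejected.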

Decisions of record

ADR-0009 (Vertex/PSC/Gemini-embeddings): the leaf-package shape, app-supplied-auth-only, the precedence rule, and the auto-truncate fail-closed refinement.

Out of scope (Morris / OpenTofu infrastructure)

Vertex API enablement, aiplatform IAM, PSC/restricted-API DNS, VPC-SC perimeter, egress lockdown, model/region config. Streaming and vLLM remain deliberately deferred.

Pre-1.0: v0.x minor versions may break.


17 May 20:19

dratner

v0.3.0

6023418

v0.3.0 — middleware complete

The provider-neutral middleware line is complete. Every chat/embedding client can now be composed with the full resilience + observability stack behind the same stable interfaces.

Added since v0.2.0

llms/middleware — the full set:
- ValidationChat (structural/app-neutral: text-only System, role/part legality, tool-call↔result pairing)
- RetryChat/RetryEmbeddings (classify via llms.Retryable, honor RetryAfter, backoff + optional jitter)
- TimeoutChat/TimeoutEmbeddings (per-attempt deadline)
- CircuitChat/CircuitEmbeddings (3-state breaker, single-flight half-open, non-retryable *CircuitOpenError)
- MetricsChat/MetricsEmbeddings (narrow app-neutral Observer, one Event per attempt)
- RecommendedChat/RecommendedEmbeddings — wires the spec's recommended composition order; ChainChat stays the primitive
llms.RetryAfter accessor, symmetric with llms.Retryable
apierr: context.Canceled returned as-is (non-retryable, errors.Is-matchable); context.DeadlineExceeded stays a retryable timeout
Docs: ADR log adopted (docs/adr/, 0001–0006); MAESTRO_DIVERGENCES extended; README rewritten as a project + usage guide
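To make the retry timing concrete, here is a sketch of a delay calculation that honors a provider-supplied retry-after hint and otherwise backs off exponentially. `retryDelay` and its parameters are hypothetical; the middleware's real defaults, its cap behavior, and its jitter are not described in these notes, and jitter is omitted here.

```go
package main

import (
	"fmt"
	"time"
)

// retryDelay (hypothetical) returns how long to wait before the next attempt.
// attempt is zero-based. A positive retryAfter from the provider wins outright;
// otherwise the delay doubles per attempt, capped at maxDelay.
func retryDelay(attempt int, base, maxDelay, retryAfter time.Duration) time.Duration {
	if retryAfter > 0 {
		return retryAfter
	}
	d := base
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

func main() {
	base, maxDelay := 100*time.Millisecond, 2*time.Second
	fmt.Println(retryDelay(0, base, maxDelay, 0))              // prints 100ms
	fmt.Println(retryDelay(3, base, maxDelay, 0))              // prints 800ms
	fmt.Println(retryDelay(10, base, maxDelay, 0))             // prints 2s
	fmt.Println(retryDelay(1, base, maxDelay, 5*time.Second)) // prints 5s
}
```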

Decisions of record

ADR-0003 (middleware is Complete/Embed-only; streaming deferred), ADR-0004 (single error classifier), ADR-0005 (non-retryable CircuitOpenError), ADR-0006 (structural ValidationError).

Not in this release — Maestro cut-over readiness gate

v0.3.0 is a complete middleware milestone, not a "cut-over ready" claim. Two tracked, separate readiness gates remain before the Maestro cut-over:

ToolChoiceRequired — "the model must call one of the offered tools" (general capability; divergences OC2/G2)
Provider-neutral prompt-cache hint — restores Maestro's CacheControl behavior (divergence A5)

Streaming and vLLM remain deliberately deferred. Pre-1.0: v0.x minor versions may break.


16 May 21:31

dratner

v0.2.0

746fb71

v0.2.0 — provider chat ports

Provider chat ports complete. All four providers now offer live, tool-use-capable chat behind the app-neutral interfaces, with the full live integration suite (simple-chat + tool-use, every provider) green as the Maestro cut-over acceptance gate.

Added since v0.1.0

llms/providers/openai — OpenAI chat via the Responses API (structured message / function_call / function_call_output items; honors caller ToolChoice)
llms/providers/google — Gemini chat via genai (stateless / concurrency-safe; real finish reason + usage)
llms/providers/ollama — hand-rolled /api/chat client, no SDK dependency (avoids the ollama module's unfixed CVEs); raw done_reason + token usage
Shared error classifier (apierr) — typed HTTP-status classification across anthropic + openai + google + ollama
Live integration suite — build-tagged tests for all four providers, each with simple-chat and a full tool-use round trip; OS-aware make test-integration (macOS ad-hoc codesign path, Linux/CI plain go test); manual workflow_dispatch CI
docs/MAESTRO_DIVERGENCES.md — living cut-over acceptance checklist

Deferred to v0.3 (not in this release)

Middleware: retry, timeout, circuit, metrics, validation. Shipped middleware remains ChainChat/ChainEmbeddings, token estimator, and ratelimit (from v0.1).

Pre-1.0: v0.x minor versions may break.


16 May 01:26

dratner

v0.1.0

342528a

v0.1.0 — first extraction milestone

First extraction milestone: an app-neutral Go LLM/embedding toolkit shared by Maestro and Morris.

Included

llms — app-neutral chat + embedding interfaces; Message/ContentPart (tool calls/results as content parts), Usage (incl. cache tokens), classified ProviderError/LimitError; optional StreamingChatClient capability interface.
llms/testllm — deterministic, concurrency-safe chat + embedding fakes (recordings deep-copied; stable hash embeddings).
llms/ratelimit — Limiter/Reservation reservation protocol + in-memory token-bucket + concurrency limiter (lazy clock-based refill, no background goroutine; reconciliation contract).
llms/middleware — ChainChat/ChainEmbeddings (first arg outermost), DefaultEstimator, rate-limit middleware (reserve → release on cancellation-surviving ctx → commit actual usage).
llms/providers/anthropic — Anthropic chat: typed SDK-error classification, cache-token usage, raw tool-call params, strict-alternation translation.
llms/providers/openai — OpenAI embeddings: input-order/ID preserving, per-request dimension override, batch-limit validation.
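The "first arg outermost" ordering of the chain helpers can be demonstrated with plain function types. The `handler`, `middleware`, `chain`, and `tag` names below are illustrative only and are much simpler than the toolkit's ChatClient-based middleware.

```go
package main

import "fmt"

// handler and middleware are simplified stand-ins for a chat client and its wrappers.
type handler func(req string) string
type middleware func(next handler) handler

// chain composes middlewares so that the first argument is the outermost wrapper.
func chain(mws ...middleware) middleware {
	return func(next handler) handler {
		// Wrap from the last middleware inward so mws[0] ends up outermost.
		for i := len(mws) - 1; i >= 0; i-- {
			next = mws[i](next)
		}
		return next
	}
}

// tag returns a middleware that marks its position around the inner result.
func tag(name string) middleware {
	return func(next handler) handler {
		return func(req string) string {
			return name + "(" + next(req) + ")"
		}
	}
}

// echo is the innermost handler; it returns the request unchanged.
func echo(req string) string { return req }

func main() {
	h := chain(tag("retry"), tag("timeout"))(echo)
	fmt.Println(h("x")) // prints retry(timeout(x))
}
```

With this ordering, a call written as chain(retry, timeout) reads in the same order the request passes through the layers.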

Requirements

Go 1.26.3+ (patched net and net/http stdlib advisories GO-2026-4971 / GO-2026-4918).

Not included (roadmap)

v0.2: retry / timeout / circuit-breaker middleware, metrics hook interfaces, richer error classification.
v0.3: Google + Ollama providers, optional streaming, shared provider error-classifier.

Distributed (e.g. PostgreSQL) limiters are implemented by applications against the ratelimit.Limiter interface; none ship here yet by design.
