
refactor(core): migrate LLM providers from Vercel AI SDK to @mariozechner/pi-ai #1205

@christso

Description

Objective

Replace the Vercel AI SDK (@ai-sdk/*, ai, @openrouter/ai-sdk-provider) with @mariozechner/pi-ai as the LLM provider layer for grader/rubric/agentv-provider call sites in packages/core. AgentV already depends on @mariozechner/pi-coding-agent (which sits on top of pi-ai), so this consolidates onto a single LLM stack and removes ~6 SDK packages from the dependency graph.

Background

packages/core/src/evaluation/providers/ai-sdk.ts (559 lines) wraps Vercel AI SDK to expose 5 providers: OpenAIProvider, AzureProvider, OpenRouterProvider, AnthropicProvider, GeminiProvider. All converge on a single generateText() call per invoke() (stateless RPC shape).

pi-ai covers the same provider surface natively:

  • OpenAI, Azure OpenAI (Responses), Anthropic, Google, OpenRouter, plus Vertex/Bedrock/Mistral/Groq/xAI/Cerebras/etc.
  • Dedicated provider files in pi-ai/dist/providers/ including azure-openai-responses.js.
  • Unified complete(model, context) and stream(model, context) APIs.
  • Built-in token usage + cost tracking.
  • Unified reasoning: 'minimal'|'low'|'medium'|'high'|'xhigh' thinking control.

OpenRouter is a first-class supported provider with its own routing config (openRouterRouting).

Call sites in scope

  1. packages/core/src/evaluation/providers/ai-sdk.ts — 5 provider classes
  2. packages/core/src/evaluation/providers/agentv-provider.ts — built-in grader provider
  3. packages/core/src/evaluation/graders/llm-grader.ts — generateText() + filesystem tool() definitions
  4. packages/core/src/evaluation/graders/composite.ts — generateText()
  5. packages/core/src/evaluation/generators/rubric-generator.ts — generateText()
  6. packages/core/src/evaluation/providers/index.ts — registry wiring
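
The tool-definition port for item 3 is mostly mechanical because TypeBox's Type.Object() compiles to plain JSON Schema at runtime. A sketch of the before/after shape using a hypothetical read_file tool (the tool name, description, and fields are illustrative, not taken from llm-grader.ts):

```typescript
// Hypothetical "read_file" grader tool, before and after the port.
//
// Before (Vercel AI SDK + Zod):
//   tool({
//     description: 'Read a file from the eval workspace',
//     parameters: z.object({ path: z.string().describe('Relative file path') }),
//     execute: async ({ path }) => fs.readFileSync(path, 'utf8'),
//   })
//
// After: TypeBox's Type.Object() emits a plain JSON Schema object, so the
// target parameter schema is equivalent to this literal at runtime:
const readFileParams = {
  type: 'object',
  properties: {
    path: { type: 'string', description: 'Relative file path' },
  },
  required: ['path'],
} as const;
```

Only the schema layer moves; the execute function and tool description carry over unchanged.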

Design latitude

  • Keep the existing Provider.invoke(request) -> response contract. Implement provider classes as thin adapters over pi-ai's complete(). Don't refactor call sites to a session-based shape — that's a much larger change and pi-coding-agent's session model is heavier than graders need.
  • Tool definitions move from Zod (ai-sdk's tool()) to TypeBox (pi-ai's Type.Object()). Mechanical port for the small set of filesystem tools in llm-grader.
  • Anthropic thinking budget: today the config takes a numeric budgetTokens; pi-ai exposes a 5-bucket reasoning enum. Map numeric budgets to the closest bucket and document the change.
  • Retry/backoff: ai-sdk.ts lines 520–559 have a custom exponential-backoff loop. Either preserve as a wrapper around complete() or accept pi-ai's defaults.
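
The numeric-to-bucket mapping in the Anthropic bullet can live in one small pure function. The threshold values below are illustrative assumptions, not values from pi-ai or the current config — the real cutoffs should be chosen and documented during the port:

```typescript
// pi-ai's 5-bucket reasoning enum (per the provider surface listed above).
type ReasoningEffort = 'minimal' | 'low' | 'medium' | 'high' | 'xhigh';

// Map a legacy numeric budgetTokens value to the closest reasoning bucket.
// Thresholds are illustrative placeholders; pick and document real cutoffs
// as part of the migration.
function budgetToReasoning(budgetTokens: number): ReasoningEffort {
  if (budgetTokens <= 1024) return 'minimal';
  if (budgetTokens <= 4096) return 'low';
  if (budgetTokens <= 16384) return 'medium';
  if (budgetTokens <= 32768) return 'high';
  return 'xhigh';
}
```

Keeping the thresholds in one exported function gives the "document the change" bullet a single place to point at.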

Spike scope (first PR)

Single-provider PoC to de-risk the migration before scoping the full port:

  • Port OpenAIProvider only to a pi-ai adapter behind the existing Provider interface.
  • Leave the other 4 providers and all consumers unchanged.
  • Run the existing grader-score baselines (scripts/check-grader-scores.ts) against an OpenAI-targeted eval and confirm scores stay within range.
  • Capture findings on: token-usage shape mapping, retry-loop placement, tool-definition port complexity, any Azure/Anthropic-specific gotchas observed while reading pi-ai source.

The spike PR is not intended to remove @ai-sdk/openai from package.json — both libraries co-exist for the duration of the spike.
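
If the spike lands on preserving the bespoke backoff from ai-sdk.ts, one option is a provider-agnostic wrapper around any complete()-style call. A sketch with an injectable sleep so it stays unit-testable; the retry count and delay schedule are assumptions, not the values currently hard-coded in ai-sdk.ts:

```typescript
type Sleep = (ms: number) => Promise<void>;
const defaultSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry an async call with exponential backoff (base, 2*base, 4*base, ...).
// Rethrows the last error once attempts are exhausted.
async function withBackoff<T>(
  fn: () => Promise<T>,
  { retries = 3, baseMs = 500, sleep = defaultSleep } = {},
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === retries) break;
      await sleep(baseMs * 2 ** attempt);
    }
  }
  throw lastError;
}

// Hypothetical usage at an adapter call site:
//   const result = await withBackoff(() => complete(model, context));
```

This keeps the retry decision orthogonal to the adapter classes, so "wrap or replace" stays a one-line change per provider.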

Acceptance signals (full migration)

  • All 5 provider classes in ai-sdk.ts reimplemented over pi-ai.
  • llm-grader, composite, rubric-generator, agentv-provider updated.
  • @ai-sdk/anthropic, @ai-sdk/azure, @ai-sdk/google, @ai-sdk/openai, ai, @openrouter/ai-sdk-provider removed from all package.json files.
  • All grader-score baselines under examples/**/*.grader-scores.yaml pass.
  • At least one live eval per provider (OpenAI, Azure, Anthropic, Google, OpenRouter) produces correct scores[].type, scores in expected range, and non-zero token usage.
  • Anthropic thinking budget config: numeric → bucket mapping documented in skill files and code header.
  • No regressions in bun run test or bun run validate:examples.

Risks / unknowns

  • Token-usage object shape. pi-ai returns {input, output, cost}; ai-sdk surfaces {inputTokens, outputTokens, cachedInputTokens, reasoningTokens}. JSONL output and Studio aggregation may need adjustment if any consumer relies on cached/reasoning fields.
  • Azure Responses API parity. useDeploymentBasedUrls + apiFormat: 'responses' switching needs verification with real deployment.
  • Anthropic thinking. Going from numeric budget to 5-bucket enum is a lossy API change for anyone setting fine-grained budgets — call out as a behavior change in the PR.
  • Retry semantics. ai-sdk.ts has bespoke backoff; pi-ai's behavior differs. Decide: wrap or replace.
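
The first risk above can be contained behind a single mapping function at the adapter boundary. Field names follow the shapes described in this issue; cached/reasoning token counts have no pi-ai source, so they are left undefined rather than fabricated:

```typescript
// pi-ai usage shape as described in this issue.
interface PiAiUsage {
  input: number;
  output: number;
  cost: number;
}

// The ai-sdk-style shape current consumers (JSONL output, Studio) read.
interface LegacyUsage {
  inputTokens: number;
  outputTokens: number;
  cachedInputTokens?: number;
  reasoningTokens?: number;
}

// Adapt pi-ai usage to the legacy shape. {input, output, cost} carries no
// cached/reasoning breakdown, so those fields stay undefined and downstream
// aggregation must tolerate their absence.
function toLegacyUsage(usage: PiAiUsage): LegacyUsage {
  return { inputTokens: usage.input, outputTokens: usage.output };
}
```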

Non-goals

  • No streaming. Current call sites are non-streaming; don't add streaming as part of this migration.
  • No move to pi-coding-agent's session model — keep grader calls stateless.
  • Not changing the public Provider interface or ProviderRequest/ProviderResponse shapes consumed elsewhere in core.
  • Not adding new providers exposed by pi-ai (Bedrock, Vertex, Mistral, etc.) in this issue — separate work.

Metadata

Labels: core (Anything pertaining to core functionality of AgentV), refactor
Status: In progress