
[EPIC] core azd evals #7661

@kristenwomack

Description


Problem statement

azd is shipping AI-powered experiences (agentic init flows, Copilot-assisted commands, hosted agent lifecycle management). As these capabilities grow, we want a systematic way to measure their quality: are AI experiences working well, improving over time, and meeting developer expectations? Today, quality signals for agentic flows come primarily from user feedback rather than automated metrics. By integrating eval tooling into our CI pipeline, we can proactively track quality trends and make data-driven product decisions as AI scenarios expand across azd.

We plan to use eval tooling that already exists, wire it into our CI, and use the results to make better product decisions. (We aren't building eval infrastructure.)

Vision

Every agentic flow in azd has a defined eval scenario. CI runs those evals automatically and reports success metrics. The team knows, quantitatively and not just anecdotally, whether AI-assisted experiences are improving with each release.

Who this helps

  • Copilot CLI users: Benefit directly; eval-backed quality means the AI-assisted experiences they depend on are tested, measured, and continuously improving.
  • New Azure developers: Agentic init flows are often their first touch with azd. Evals help ensure that first experience stays high-quality over time.
  • azd contributors: Get a concrete quality signal for AI features, enabling the team to ship with confidence and catch regressions early.

Goals (in scope)

  • Define eval scenarios for azd's agentic flows: what does "success" look like for each AI-assisted command?
  • Integrate the eval CLI (NPX tool) into azd's CI pipeline so eval scenarios run automatically
  • Report scenario success metrics: pass/fail rates, quality scores, trend over time
  • Establish the pattern for adding new eval scenarios as new AI features ship
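As a rough sketch of how the CI integration goal might look (assuming a GitHub Actions pipeline; the `<eval-cli>` placeholder, flags, and output path below are hypothetical, since the source doesn't name the actual NPX tool or its interface):

```yaml
# Hypothetical nightly CI job that runs eval scenarios and publishes results.
# The npx package name, flags, and paths are illustrative placeholders;
# the real eval CLI may expose a different interface.
name: azd-evals
on:
  schedule:
    - cron: "0 6 * * *"   # nightly, to keep cost down vs. per-PR runs
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Run eval scenarios
        run: npx <eval-cli> run --scenarios ./evals/scenarios --report ./evals/results.json
      - name: Upload eval results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: ./evals/results.json
```

Starting with a scheduled (nightly) trigger rather than per-PR keeps the cost question from the goals above open until real run costs are known.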

Non-goals (out of scope)

  • Owning or building the eval platform; azd is a consumer, not a platform owner
  • Building custom eval infrastructure; we use what's available (the NPX eval CLI tool)
  • Evaluating non-azd AI experiences (Foundry portal, VS Code extensions, etc.)
  • Blocking releases on eval results initially; start with reporting, gate later once confidence is established

Success criteria

  • Eval scenarios defined for at least the top 2-3 agentic flows in azd
  • Eval CLI integrated into azd CI pipeline (running nightly or per-PR, depending on cost)
  • Scenario success metrics reported and visible to the team
  • First eval results available by June 2026
  • Pattern documented: how to add an eval scenario for a new AI feature
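One possible shape for a documented eval scenario, to make the "pattern documented" criterion concrete (the schema, field names, and sample values below are purely illustrative assumptions; the real eval CLI may define its own format):

```yaml
# Hypothetical eval scenario definition for an agentic init flow.
# All field names and values are illustrative, not the eval CLI's schema.
id: agentic-init-node-webapp
description: >
  Given a Node.js web app repo, agent-assisted `azd init` should produce a
  working azure.yaml and infra templates without manual fixes.
input:
  repo: samples/todo-nodejs   # hypothetical sample path
  command: azd init
grading:
  - check: azure.yaml exists and parses
  - check: generated infra provisions without errors
  - score: rubric-based quality score (0-1), pass threshold 0.8
```

A file like this per agentic flow would give both the pass/fail signal and the quality score that the metrics reporting goal calls for.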

Dependencies

  • Eval tooling maturity: This epic's timeline is contingent on the eval CLI/NPX tool being ready for integration. If tooling isn't mature by May, the timeline shifts to June or later.
  • Agentic flow stability: Eval scenarios need stable-enough AI commands to evaluate. If agentic flows are still rapidly changing, evals will be noisy.
  • Cross-team coordination: Eval scope and definition need input from stakeholders across the azd team; this is a follow-up from initial planning discussions.
  • Boundary with Test Pipelines: If this epic and Test Pipelines are merged, evals become a sub-workstream of the broader quality initiative. If separate (recommended), the boundary is: Test Pipelines = code reliability, Evals = AI scenario quality.
