Objective
Add a first-party team-eval example pack that shows how to evaluate coordinated multi-agent workflows in AgentV without requiring users to reverse-engineer the pattern from research notes.
Why this is needed
Current AgentV examples cover single-agent and single-test flows well, but team-of-agents scenarios keep recurring in frontier benchmarks and user requests:
coordinated specialist agents
judgeable intermediate artifacts
role adherence and division of labor
end-to-end scoring of the team result
Even before dependency-aware DAG execution lands, AgentV can already demonstrate useful team-eval patterns with existing primitives: multi-turn transcripts, composite evaluators, code graders, tool trajectory checks, and imported session data.
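As a sketch of how these primitives could compose, a composite evaluator might gate on hard constraints (e.g. a code grader's tool-use check) and blend soft signals into one team score. All names and weights below are hypothetical illustrations, not AgentV's actual API:

```python
from dataclasses import dataclass

# Hypothetical signal shapes; AgentV's real evaluator interfaces may differ.
@dataclass
class TeamSignals:
    outcome_quality: float      # rubric / judge score in [0, 1]
    tool_constraints_ok: bool   # code grader: no forbidden tool calls
    role_adherence: float       # fraction of turns where each agent stayed in role

def composite_team_score(s: TeamSignals,
                         w_outcome: float = 0.6,
                         w_roles: float = 0.4) -> float:
    """Combine per-signal scores into one team score.

    Hard constraints gate the score to zero; soft signals are a
    weighted average. Weights here are illustrative defaults.
    """
    if not s.tool_constraints_ok:
        return 0.0
    return w_outcome * s.outcome_quality + w_roles * s.role_adherence

# Example: strong outcome, minor role drift, no constraint violations.
score = composite_team_score(TeamSignals(0.9, True, 0.75))
# 0.6 * 0.9 + 0.4 * 0.75 = 0.84
```

The gate-then-blend shape keeps constraint violations from being averaged away by a high outcome score, which is the usual failure mode when everything is a weighted sum.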
Suggested example coverage
Two-role handoff example — planner -> implementer, scored on final output + role adherence
Team transcript import example — evaluate an existing multi-agent / multi-role transcript offline
Composite team score example — combine outcome quality, tool-use constraints, and collaboration-specific rubric signals
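For the handoff and transcript-import examples, a role-adherence check can run as a plain code grader over an imported session. The transcript schema below is a hypothetical stand-in for whatever shape AgentV's session import produces:

```python
from typing import Iterable

# Hypothetical per-role tool allowlists for a planner -> implementer handoff.
ALLOWED_TOOLS = {
    "planner": set(),                          # planner should not touch tools
    "implementer": {"write_file", "run_tests"},
}

def role_adherence(transcript: Iterable[dict]) -> float:
    """Fraction of tool-calling turns that respect each role's allowlist."""
    tool_turns = [t for t in transcript if t["type"] == "tool_call"]
    if not tool_turns:
        return 1.0
    ok = sum(1 for t in tool_turns
             if t["tool"] in ALLOWED_TOOLS.get(t["agent"], set()))
    return ok / len(tool_turns)

# Example transcript: the final planner turn violates the division of labor.
transcript = [
    {"agent": "planner", "type": "message", "content": "Plan: implement, then test."},
    {"agent": "implementer", "type": "tool_call", "tool": "write_file"},
    {"agent": "implementer", "type": "tool_call", "tool": "run_tests"},
    {"agent": "planner", "type": "tool_call", "tool": "write_file"},  # role violation
]
# role_adherence(transcript) -> 2/3
```

The same grader works offline against an imported multi-agent transcript or inline against a freshly generated run, and its output slots into the composite score as the role-adherence signal.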
Acceptance signals
Non-goals
Related