Add P3 ecosystem: multi-judge, versioning, Maven plugin, GitHub Actions #7

Merged
pratyush618 merged 1 commit into main from feat/p3-ecosystem on Mar 12, 2026
@pratyush618
Collaborator

Summary

  • Multi-model judge consensus: fan-out to multiple LLM judges with configurable consensus strategies (MAJORITY, AVERAGE, WEIGHTED_AVERAGE, UNANIMOUS), virtual thread parallelism, ThreadLocal response access
  • Golden set versioning: git metadata resolution, dataset version tagging/loading/listing with JSON persistence and Jackson JSR-310 support
  • Maven plugin: agenteval:evaluate goal (default phase VERIFY), metric/report resolution, YAML config integration
  • GitHub Actions integration: composite action with GFM markdown reporter, PR comment posting/updating via marker-based detection, fat JAR via shade plugin
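
For readers unfamiliar with the four consensus strategies, here is a minimal sketch of how they could combine per-judge scores. It is an illustration under stated assumptions, not the code in this PR: `ConsensusSketch`, `Verdict`, `Consensus`, and the pass threshold are hypothetical names, and the fan-out to judges is left out.

```java
import java.util.List;

/** Hypothetical sketch; class, record, and parameter names are not taken from this PR. */
final class ConsensusSketch {

    enum Strategy { MAJORITY, AVERAGE, WEIGHTED_AVERAGE, UNANIMOUS }

    /** One judge's verdict: a score in [0, 1] and a weight used only by WEIGHTED_AVERAGE. */
    record Verdict(double score, double weight) {}

    /** The combined score and whether it passes under the chosen strategy. */
    record Consensus(double score, boolean passed) {}

    static Consensus combine(List<Verdict> verdicts, Strategy strategy, double threshold) {
        if (verdicts.isEmpty()) {
            throw new IllegalArgumentException("at least one verdict is required");
        }
        double mean = verdicts.stream().mapToDouble(Verdict::score).average().orElseThrow();
        long passing = verdicts.stream().filter(v -> v.score() >= threshold).count();
        return switch (strategy) {
            // More than half of the judges must individually pass.
            case MAJORITY -> new Consensus(mean, passing * 2 > verdicts.size());
            // Every judge must individually pass.
            case UNANIMOUS -> new Consensus(mean, passing == verdicts.size());
            // The unweighted mean score must reach the threshold.
            case AVERAGE -> new Consensus(mean, mean >= threshold);
            // The weight-normalized mean score must reach the threshold.
            case WEIGHTED_AVERAGE -> {
                double totalWeight = verdicts.stream().mapToDouble(Verdict::weight).sum();
                if (totalWeight <= 0) {
                    throw new IllegalArgumentException("weights must sum to a positive value");
                }
                double weighted = verdicts.stream()
                        .mapToDouble(v -> v.score() * v.weight())
                        .sum() / totalWeight;
                yield new Consensus(weighted, weighted >= threshold);
            }
        };
    }

    public static void main(String[] args) {
        // Two judges pass with weight 1 each; one heavily weighted judge fails.
        List<Verdict> verdicts = List.of(
                new Verdict(1.0, 1.0), new Verdict(1.0, 1.0), new Verdict(0.0, 6.0));
        for (Strategy s : Strategy.values()) {
            System.out.println(s + " -> passed=" + combine(verdicts, s, 0.5).passed());
        }
    }
}
```

With the sample verdicts in `main`, MAJORITY and AVERAGE pass while UNANIMOUS and WEIGHTED_AVERAGE fail, which shows why the strategy choice matters when judges disagree.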

Stats: 15 modules, 490 tests, 2735 lines added across 33 files. 0 checkstyle violations, 0 SpotBugs issues.

Test plan

  • mvn clean install — all 15 modules build, 490 tests pass
  • Multi-judge: verify all 4 consensus strategies, partial/total failure, token summing, ThreadLocal storage
  • Versioning: tag/load/list versions, git metadata resolution, non-git directory handling
  • Maven plugin: metric name resolution (case-insensitive), report format resolution
  • GitHub Actions: markdown table rendering, PR comment create/update/auth, mock HTTP tests
  • SpotBugs + Checkstyle: 0 violations across all modules
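
As a rough illustration of the case-insensitive metric name resolution covered by the test plan, the sketch below maps a user-supplied name onto an enum constant. The `MetricNameSketch` class and the `EXAMPLE_*` metric names are hypothetical placeholders, and the actual MetricResolver in this PR may resolve names differently.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical sketch; the metric names below are placeholders, not the library's metrics. */
final class MetricNameSketch {

    enum Metric { EXAMPLE_ACCURACY, EXAMPLE_LATENCY }

    /** Resolves a configured name to a metric, ignoring case and surrounding whitespace. */
    static Optional<Metric> resolve(String name) {
        if (name == null || name.isBlank()) {
            return Optional.empty();
        }
        // Locale.ROOT keeps upper-casing stable regardless of the JVM's default locale.
        String key = name.trim().toUpperCase(Locale.ROOT);
        try {
            return Optional.of(Metric.valueOf(key));
        } catch (IllegalArgumentException unknownName) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("example_accuracy"));
        System.out.println(resolve("Example_Latency"));
        System.out.println(resolve("unknown"));
    }
}
```

Returning an empty `Optional` for unknown names leaves it to the caller to decide whether an unrecognized metric should fail the build or only log a warning.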

… GitHub Actions

Add 4 P3 features across 2 new modules and 2 existing modules (15 modules, 490 tests):

- Multi-model judge consensus (MAJORITY/AVERAGE/WEIGHTED_AVERAGE/UNANIMOUS strategies,
  virtual thread fan-out, ThreadLocal MultiJudgeResponse, JudgeModels.multi() factory)
- Golden set versioning (GitResolver for git metadata, DatasetVersioner with
  tag/load/listVersions/latest, VersionedDataset record, EvalDataset.tagVersion())
- Maven plugin (agenteval-maven-plugin: EvaluateMojo @Mojo(evaluate, VERIFY),
  MetricResolver, ReportFormatResolver, SpotBugs skipped)
- GitHub Actions integration (agenteval-github-actions: MarkdownReporter with GFM tables,
  GitHubPrCommenter with marker-based update, GitHubActionRunner, action.yml, fat JAR)
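
The marker-based update mentioned above can be sketched as follows: embed a hidden HTML comment in the report, then look for it among the PR's existing comments to decide between creating and updating. This is a sketch under assumptions, not GitHubPrCommenter itself. The marker text, `MarkerCommentSketch`, `Comment`, and `Request` are hypothetical, and no HTTP call is made. The two paths are the GitHub REST endpoints for creating and updating issue comments.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch; the marker text and all type names are assumptions. */
final class MarkerCommentSketch {

    /** Hidden HTML comment that identifies a report posted by a previous run. */
    static final String MARKER = "<!-- agenteval-report -->";

    /** An existing PR comment, reduced to the fields needed here. */
    record Comment(long id, String body) {}

    /** The HTTP request that would be sent to the GitHub REST API. */
    record Request(String method, String path, String body) {}

    /** Plans an update when a marked comment already exists, otherwise a create. */
    static Request plan(String repo, int prNumber, List<Comment> existing, String report) {
        String body = MARKER + "\n" + report;
        Optional<Comment> previous = existing.stream()
                .filter(c -> c.body() != null && c.body().contains(MARKER))
                .findFirst();
        return previous
                // Update the earlier report in place so the PR keeps a single comment.
                .map(c -> new Request("PATCH",
                        "/repos/" + repo + "/issues/comments/" + c.id(), body))
                // No earlier report found: post a new comment on the PR.
                .orElseGet(() -> new Request("POST",
                        "/repos/" + repo + "/issues/" + prNumber + "/comments", body));
    }

    public static void main(String[] args) {
        List<Comment> comments = List.of(
                new Comment(41, "Looks good to me"),
                new Comment(42, MARKER + "\nold report"));
        Request request = plan("owner/repo", 7, comments, "new report");
        System.out.println(request.method() + " " + request.path());
    }
}
```

Keeping the decision separate from the HTTP client makes the create/update branch testable without a mock server.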
@pratyush618 merged commit 546f243 into main on Mar 12, 2026
@pratyush618 deleted the feat/p3-ecosystem branch on March 31, 2026 at 17:12