
P4: snapshot testing, benchmark mode, Gradle plugin, IntelliJ plugin #8

Merged
pratyush618 merged 4 commits into main from
p4-deferred-features
Mar 13, 2026

@pratyush618
Collaborator

Summary

  • Snapshot testing: persist EvalResult as named JSON snapshots in agenteval-reporting, compare against baselines, fail on regressions (SnapshotStore, SnapshotReporter, SnapshotData records)
  • Benchmark mode: run a dataset against multiple config variants, with an isolated deep copy of the test cases per variant, parallel execution on virtual threads, and a console reporter with [BEST]/[WORST] labels (Benchmark, BenchmarkVariant, BenchmarkResult, BenchmarkReporter, BenchmarkComparison, AgentTestCase.toBuilder())
  • Gradle plugin: new agenteval-gradle-plugin mirroring Maven plugin with managed Property fields, plugin ID com.agenteval.evaluate (AgentEvalPlugin, EvaluateTask, extension)
  • IntelliJ plugin: new agenteval-intellij with JSON report parser (no core dependency), tool window, gutter icon provider for @Metric annotations, SVG icons, plugin.xml descriptor

Stats: 17 modules, 590 tests, 0 Checkstyle/SpotBugs violations.

Test plan

  • mvn clean install — all 17 modules build, all tests pass
  • Checkstyle: 0 violations across all modules
  • SpotBugs: 0 bugs (skipped for gradle-plugin and intellij modules)
  • Snapshot round-trip: save → load → compare with RegressionComparison
  • Benchmark isolation: mutations in one variant do not leak to others
  • Gradle plugin: plugin applies, extension defaults wired, task registered
  • IntelliJ ReportParser: single/multi-case, malformed JSON, metric pass/fail extraction

Persist EvalResult as named JSON snapshots and compare against prior
baselines to detect regressions. SnapshotStore handles save/load with
path traversal validation, SnapshotReporter implements EvalReporter
with baseline/compare/update modes, SnapshotRegressionException
carries the RegressionReport on failure.
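The path traversal validation mentioned above can be illustrated with a minimal sketch. This is not the project's SnapshotStore; the class name `SnapshotPathSketch`, the `resolve` method, and the `.json` suffix handling are all hypothetical, chosen only to show one common way of rejecting snapshot names that would escape the base directory.

```java
import java.nio.file.Path;

/** Hypothetical helper for illustration; not the actual SnapshotStore API. */
final class SnapshotPathSketch {
    private SnapshotPathSketch() {}

    /**
     * Resolves "<name>.json" under baseDir and rejects any name whose
     * normalized path would land outside baseDir (e.g. "../evil").
     */
    static Path resolve(Path baseDir, String name) {
        if (name == null || name.isBlank()) {
            throw new IllegalArgumentException("snapshot name must not be blank");
        }
        // Normalize both sides so ".." segments are collapsed before comparing.
        Path base = baseDir.toAbsolutePath().normalize();
        Path target = base.resolve(name + ".json").normalize();
        if (!target.startsWith(base)) {
            throw new IllegalArgumentException(
                    "snapshot name escapes base directory: " + name);
        }
        return target;
    }

    public static void main(String[] args) {
        Path base = Path.of("snapshots");
        System.out.println(resolve(base, "baseline"));
        try {
            resolve(base, "../evil");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Comparing normalized absolute paths with `startsWith` also rejects absolute names, because `Path.resolve` returns an absolute argument unchanged.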

Run the same dataset against multiple config variants and compare
results. AgentTestCase.toBuilder() enables isolated deep-copies per
variant. Benchmark supports parallel execution via virtual threads
with Semaphore-based concurrency control. BenchmarkReporter outputs
console tables with [BEST]/[WORST] labels and per-metric breakdown.
BenchmarkComparison bridges to RegressionComparison for variant diffs.
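As a rough illustration of virtual threads combined with Semaphore-based concurrency control, the sketch below runs one task per input on a virtual thread while a semaphore caps how many run at once. It assumes Java 21 or later. `BoundedVirtualRunSketch`, `runAll`, and `maxConcurrent` are hypothetical names; the actual Benchmark class may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

/** Hypothetical sketch; not the project's Benchmark implementation. */
final class BoundedVirtualRunSketch {
    private BoundedVirtualRunSketch() {}

    /** Applies work to every input, at most maxConcurrent at a time, preserving input order. */
    static <T, R> List<R> runAll(List<T> inputs, Function<T, R> work, int maxConcurrent) {
        if (maxConcurrent < 1) {
            throw new IllegalArgumentException("maxConcurrent must be >= 1");
        }
        Semaphore permits = new Semaphore(maxConcurrent);
        // One cheap virtual thread per task; the semaphore is what limits parallelism.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<R>> futures = new ArrayList<>();
            for (T input : inputs) {
                futures.add(executor.submit(() -> {
                    permits.acquire();
                    try {
                        return work.apply(input);
                    } finally {
                        permits.release();
                    }
                }));
            }
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        List<String> variants = List.of("variant-a", "variant-b", "variant-c");
        System.out.println(runAll(variants, String::toUpperCase, 2));
    }
}
```

Virtual threads are cheap to create, so the thread count is not bounded directly. The semaphore instead bounds how many tasks hold a permit at once, which is useful when each task calls a rate-limited backend.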

New agenteval-gradle-plugin with plugin ID com.agenteval.evaluate.
AgentEvalPlugin registers an agenteval extension and agentEvaluate
task in the verification group. Uses dev.gradleplugins:gradle-api:8.5
from Maven Central. MetricResolver and ReportFormatResolver are
duplicated from the Maven plugin (small, not worth a shared module).

New agenteval-intellij with lightweight JSON report parser (no
agenteval-core dependency). ReportModel/ReportParser parse
agenteval-report.json using Jackson. Includes tool window, gutter
icon provider for @Metric annotations, VFS file watcher, SVG icons,
and plugin.xml descriptor. IntelliJ Platform-dependent UI sources
are excluded from Maven compilation (require IDE SDK).
@pratyush618 pratyush618 merged commit 715694b into main Mar 13, 2026
@pratyush618 pratyush618 self-assigned this Mar 13, 2026
@pratyush618 pratyush618 deleted the p4-deferred-features branch March 31, 2026 17:12