P4: snapshot testing, benchmark mode, Gradle plugin, IntelliJ plugin by pratyush618 · Pull Request #8 · ByteVeda/agenteval

pratyush618 · 2026-03-13T06:21:15Z

Summary

Snapshot testing: persist EvalResult as named JSON snapshots in agenteval-reporting, compare against baselines, fail on regressions (SnapshotStore, SnapshotReporter, SnapshotData records)
Benchmark mode: run a dataset against multiple config variants, each on an isolated deep copy of the test cases, with parallel execution on virtual threads and a console reporter that labels variants [BEST]/[WORST] (Benchmark, BenchmarkVariant, BenchmarkResult, BenchmarkReporter, BenchmarkComparison, AgentTestCase.toBuilder())
Gradle plugin: new agenteval-gradle-plugin mirroring Maven plugin with managed Property fields, plugin ID com.agenteval.evaluate (AgentEvalPlugin, EvaluateTask, extension)
IntelliJ plugin: new agenteval-intellij with JSON report parser (no core dependency), tool window, gutter icon provider for @Metric annotations, SVG icons, plugin.xml descriptor

Stats: 17 modules, 590 tests, 0 checkstyle/SpotBugs violations.

Test plan

mvn clean install — all 17 modules build, all tests pass
Checkstyle: 0 violations across all modules
SpotBugs: 0 bugs (skipped for gradle-plugin and intellij modules)
Snapshot round-trip: save → load → compare with RegressionComparison
Benchmark isolation: mutations in one variant do not leak to others
Gradle plugin: plugin applies, extension defaults wired, task registered
IntelliJ ReportParser: single/multi-case, malformed JSON, metric pass/fail extraction
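
The benchmark isolation check in the test plan can be illustrated with a minimal sketch of a copying toBuilder(). CaseSketch, its two fields, and its builder methods are hypothetical stand-ins; the real AgentTestCase has its own fields and builder API, which this PR page does not show.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for AgentTestCase; field names are assumptions.
final class CaseSketch {
    private final String input;
    private final List<String> context;

    private CaseSketch(Builder b) {
        this.input = b.input;
        // Each built instance owns its list, so it is not shared with the builder.
        this.context = new ArrayList<>(b.context);
    }

    String input() { return input; }

    // Deliberately returns the mutable list so the isolation property is observable.
    List<String> context() { return context; }

    /** Returns a builder pre-populated with copies of this instance's state. */
    Builder toBuilder() {
        Builder b = new Builder();
        b.input = input;
        b.context = new ArrayList<>(context);
        return b;
    }

    static Builder builder() { return new Builder(); }

    static final class Builder {
        private String input = "";
        private List<String> context = new ArrayList<>();

        Builder input(String value) { this.input = value; return this; }
        Builder addContext(String value) { this.context.add(value); return this; }
        CaseSketch build() { return new CaseSketch(this); }
    }
}
```

Copying the list is enough here only because its elements are immutable strings. Mutable nested objects would need their own copies for the result to be a true deep copy.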

Persist EvalResult as named JSON snapshots and compare against prior baselines to detect regressions. SnapshotStore handles save/load with path traversal validation, SnapshotReporter implements EvalReporter with baseline/compare/update modes, SnapshotRegressionException carries the RegressionReport on failure.
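
As a rough illustration of the path traversal validation mentioned above, the following sketch resolves a snapshot name against a base directory and rejects names that escape it. SnapshotPaths and resolve are hypothetical names, and the ".json" suffix is an assumption; the actual SnapshotStore may validate differently.

```java
import java.nio.file.Path;

// Hypothetical helper; not the SnapshotStore implementation from this PR.
final class SnapshotPaths {
    private SnapshotPaths() {}

    /** Resolves a snapshot name to a .json file inside baseDir, rejecting names that escape it. */
    static Path resolve(Path baseDir, String name) {
        if (name == null || name.isBlank()) {
            throw new IllegalArgumentException("Snapshot name must not be blank");
        }
        Path root = baseDir.toAbsolutePath().normalize();
        // normalize() collapses ".." segments, so an escaping name ends up outside root.
        // An absolute name replaces root entirely when resolved, and is rejected the same way.
        Path candidate = root.resolve(name + ".json").normalize();
        if (!candidate.startsWith(root)) {
            throw new IllegalArgumentException("Snapshot name escapes base directory: " + name);
        }
        return candidate;
    }
}
```

With this sketch, resolve(dir, "baseline") returns a path under dir, while resolve(dir, "../evil") throws IllegalArgumentException. The check is purely lexical and does not follow symbolic links.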

Run the same dataset against multiple config variants and compare results. AgentTestCase.toBuilder() enables isolated deep-copies per variant. Benchmark supports parallel execution via virtual threads with Semaphore-based concurrency control. BenchmarkReporter outputs console tables with [BEST]/[WORST] labels and per-metric breakdown. BenchmarkComparison bridges to RegressionComparison for variant diffs.
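
The combination of virtual threads and a Semaphore described above might look roughly like the sketch below (Java 21+). BoundedVirtualRunner, runAll, and the maxConcurrency parameter are illustrative assumptions, not the Benchmark API from this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;

// Hypothetical helper; names are illustrative only.
final class BoundedVirtualRunner {
    private BoundedVirtualRunner() {}

    /** Runs each task on its own virtual thread, with at most maxConcurrency executing at once. */
    static <T> List<T> runAll(List<Callable<T>> tasks, int maxConcurrency) throws Exception {
        if (maxConcurrency < 1) {
            throw new IllegalArgumentException("maxConcurrency must be at least 1");
        }
        Semaphore permits = new Semaphore(maxConcurrency);
        // The executor is closed by try-with-resources, which waits for submitted tasks.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<T>> futures = new ArrayList<>();
            for (Callable<T> task : tasks) {
                futures.add(executor.submit(() -> {
                    permits.acquire(); // parks the virtual thread until a permit is free
                    try {
                        return task.call();
                    } finally {
                        permits.release();
                    }
                }));
            }
            // Collect results in submission order, regardless of completion order.
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get());
            }
            return results;
        }
    }
}
```

Virtual threads are cheap to create, so the semaphore rather than a fixed pool size is what limits how many variants hit the model or agent at the same time.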

New agenteval-gradle-plugin with plugin ID com.agenteval.evaluate. AgentEvalPlugin registers an agenteval extension and an agentEvaluate task in the verification group. The module compiles against dev.gradleplugins:gradle-api:8.5 from Maven Central. MetricResolver and ReportFormatResolver are duplicated from the Maven plugin because they are small and not worth a shared module.

New agenteval-intellij with a lightweight JSON report parser (no agenteval-core dependency). ReportModel/ReportParser parse agenteval-report.json using Jackson. Includes a tool window, a gutter icon provider for @Metric annotations, a VFS file watcher, SVG icons, and a plugin.xml descriptor. UI sources that depend on the IntelliJ Platform are excluded from Maven compilation because they require the IDE SDK.

pratyush618 added 4 commits March 13, 2026 11:32

pratyush618 merged commit 715694b into main Mar 13, 2026

pratyush618 self-assigned this Mar 13, 2026

pratyush618 deleted the p4-deferred-features branch March 31, 2026 17:12
