P4: snapshot testing, benchmark mode, Gradle plugin, IntelliJ plugin #8
Merged
pratyush618 merged 4 commits into main on Mar 13, 2026
Persist EvalResult as named JSON snapshots and compare against prior baselines to detect regressions. SnapshotStore handles save/load with path traversal validation; SnapshotReporter implements EvalReporter with baseline/compare/update modes; SnapshotRegressionException carries the RegressionReport on failure.
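As an illustration of the path traversal validation mentioned above, here is a minimal sketch of how a snapshot name might be checked before it is turned into a file path. The class name `SnapshotPaths`, the `resolve` method, and the allowed-character rule are assumptions made for this example; they are not taken from the actual SnapshotStore in this PR.

```java
import java.nio.file.Path;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not the SnapshotStore API from this PR. */
final class SnapshotPaths {
    // Assumed naming rule: starts with a letter or digit, then letters, digits, '.', '_' or '-'.
    // Path separators are never allowed, so "../x" and "a/b" are rejected up front.
    private static final Pattern SAFE_NAME = Pattern.compile("[A-Za-z0-9][A-Za-z0-9._-]*");

    private SnapshotPaths() {}

    /** Maps a snapshot name to "<baseDir>/<name>.json", rejecting names that could escape baseDir. */
    static Path resolve(Path baseDir, String name) {
        if (name == null || !SAFE_NAME.matcher(name).matches()) {
            throw new IllegalArgumentException("Invalid snapshot name: " + name);
        }
        Path base = baseDir.toAbsolutePath().normalize();
        Path target = base.resolve(name + ".json").normalize();
        // Defense in depth: the normalized target must still be inside the base directory.
        if (!target.startsWith(base)) {
            throw new IllegalArgumentException("Snapshot escapes base directory: " + name);
        }
        return target;
    }
}
```

Checking both the name pattern and the normalized result is redundant by design: the pattern gives a clear error for bad names, and the `startsWith` check still holds if the pattern is later relaxed.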
Run the same dataset against multiple config variants and compare results. AgentTestCase.toBuilder() enables isolated deep copies per variant. Benchmark supports parallel execution via virtual threads with Semaphore-based concurrency control. BenchmarkReporter outputs console tables with [BEST]/[WORST] labels and a per-metric breakdown. BenchmarkComparison bridges to RegressionComparison for variant diffs.
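The concurrency pattern described above (one virtual thread per variant, with a Semaphore capping how many run at once) can be sketched roughly as follows. This assumes Java 21 or later. `BoundedVariantRunner`, `runAll`, and the use of plain `String` variant names are hypothetical stand-ins for this example; the real Benchmark class may be structured differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

/** Hypothetical sketch; not the Benchmark implementation from this PR. */
final class BoundedVariantRunner {
    private BoundedVariantRunner() {}

    /** Evaluates every variant on its own virtual thread, with at most maxConcurrent running at once. */
    static <R> Map<String, R> runAll(List<String> variants, int maxConcurrent,
                                     Function<String, R> evaluate) {
        Semaphore permits = new Semaphore(maxConcurrent);
        Map<String, Future<R>> futures = new LinkedHashMap<>();
        // Virtual threads are cheap to create, so the Semaphore (not a pool size) bounds concurrency.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (String variant : variants) {
                futures.put(variant, executor.submit(() -> {
                    permits.acquire();
                    try {
                        return evaluate.apply(variant);
                    } finally {
                        permits.release();
                    }
                }));
            }
        } // close() blocks until all submitted tasks have finished

        // Collect results in the original variant order.
        Map<String, R> results = new LinkedHashMap<>();
        for (Map.Entry<String, Future<R>> entry : futures.entrySet()) {
            try {
                results.put(entry.getKey(), entry.getValue().get());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while collecting results", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("Variant failed: " + entry.getKey(), e.getCause());
            }
        }
        return results;
    }
}
```

A Semaphore is a common choice here because the limit usually reflects an external constraint, such as rate limits on the model being evaluated, rather than a shortage of threads.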
New agenteval-gradle-plugin with plugin ID com.agenteval.evaluate. AgentEvalPlugin registers an agenteval extension and agentEvaluate task in the verification group. Uses dev.gradleplugins:gradle-api:8.5 from Maven Central. MetricResolver and ReportFormatResolver are duplicated from the Maven plugin (small, not worth a shared module).
New agenteval-intellij with a lightweight JSON report parser (no agenteval-core dependency). ReportModel/ReportParser parse agenteval-report.json using Jackson. Includes a tool window, a gutter icon provider for @Metric annotations, a VFS file watcher, SVG icons, and a plugin.xml descriptor. IntelliJ Platform-dependent UI sources are excluded from Maven compilation because they require the IDE SDK.
Summary
com.agenteval.evaluate (AgentEvalPlugin, EvaluateTask, extension)
Stats: 17 modules, 590 tests, 0 checkstyle/SpotBugs violations.
Test plan
mvn clean install: all 17 modules build, all tests pass