Add Tier 2 differentiators: cost metrics, root cause, replay, mutation, fingerprint by pratyush618 · Pull Request #66 · ByteVeda/agenteval

pratyush618 · 2026-04-07T07:56:35Z

Summary

Five Tier 2 differentiating features that, as far as I know, no competing eval tool offers:

Cost-Normalized Metrics (agenteval-metrics/cost): CostNormalizedMetric, LatencyNormalizedMetric, CostEfficiencyAnalyzer with Pareto frontier computation — 29 tests
Regression Root Cause Analysis (agenteval-reporting/regression/rootcause): RootCauseAnalyzer clusters regressed cases by failure pattern, detects output/tool/cost/latency changes, ranks by impact — 11 tests
Deterministic Replay (agenteval-replay): Record agent+judge interactions, replay without API calls ($0 regression tests). RecordingJudgeModel/ReplayJudgeModel decorators, RecordingStore persistence, ReplaySuite orchestrator — 32 tests
Mutation Testing (agenteval-mutation): Sealed Mutator interface with 5 built-in mutators (weaken constraints, remove safety, inject contradiction, etc.), MutationSuite orchestrator measures eval detection rate — 22 tests
Capability Fingerprinting (agenteval-fingerprint): CapabilityProfiler evaluates agents across 8 dimensions, CapabilityComparison for side-by-side profiles, CapabilityReporter for console output — 17 tests

58 new files, ~4,850 lines, 111 new tests — all passing.

Test plan

mvn test -pl agenteval-metrics — cost metrics pass (29 new)
mvn test -pl agenteval-reporting — root cause analysis pass (11 new)
mvn test -pl agenteval-replay — replay module pass (32 tests)
mvn test -pl agenteval-mutation — mutation module pass (22 tests)
mvn test -pl agenteval-fingerprint — fingerprint module pass (17 tests)
All pre-commit hooks pass (checkstyle, editorconfig, spotbugs)
Verify full reactor build: mvn clean install -Denforcer.skip=true

pratyush618 added 7 commits April 7, 2026 13:22

Add cost-normalized metrics for cost/latency-aware evaluation

4e7ee6d

CostNormalizedMetric, LatencyNormalizedMetric, CostEfficiencyAnalyzer, ParetoFrontier in agenteval-metrics/cost package, 29 tests.
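
To illustrate what a Pareto frontier over cost and quality means here, the following is a minimal sketch, not the ParetoFrontier class from this commit. The `ParetoSketch` class, the `Candidate` record, and its fields are hypothetical names chosen for the example. It assumes higher score is better and lower cost is better.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the agenteval ParetoFrontier API. */
final class ParetoSketch {

    /** One evaluated configuration: a label, its quality score, and its cost in USD. */
    record Candidate(String name, double score, double costUsd) {}

    /** Returns the candidates that no other candidate beats on both score and cost. */
    static List<Candidate> frontier(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        // Cheapest first; among equal cost, highest score first.
        sorted.sort((a, b) -> {
            int byCost = Double.compare(a.costUsd(), b.costUsd());
            return byCost != 0 ? byCost : Double.compare(b.score(), a.score());
        });
        List<Candidate> result = new ArrayList<>();
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Candidate c : sorted) {
            // Keep a candidate only if it scores strictly higher than every cheaper one.
            if (c.score() > bestScore) {
                result.add(c);
                bestScore = c.score();
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Candidate> all = List.of(
                new Candidate("A", 0.80, 0.10),
                new Candidate("B", 0.90, 0.50),
                new Candidate("C", 0.85, 0.60),
                new Candidate("D", 0.70, 0.20));
        for (Candidate c : frontier(all)) {
            System.out.println(c.name());
        }
    }
}
```

In this example C and D are dropped because a cheaper candidate already scores at least as high, leaving A and B on the frontier.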

Add regression root cause analysis

ce6930c

RootCauseAnalyzer clusters regressed cases by failure pattern, detects output/tool/cost/latency changes, ranks by impact, 11 tests.
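
The clustering-and-ranking idea can be sketched as follows. This is an illustration under stated assumptions, not the RootCauseAnalyzer from this commit: the `RootCauseSketch` class, the `Regression` and `Cluster` records, the pattern labels, and the definition of impact as the summed score drop per cluster are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not the agenteval RootCauseAnalyzer API. */
final class RootCauseSketch {

    /** A regressed test case tagged with the kind of change detected. */
    record Regression(String caseId, String pattern, double scoreDrop) {}

    /** Regressed cases sharing a pattern, with an assumed impact measure. */
    record Cluster(String pattern, int caseCount, double totalDrop) {}

    /** Groups regressions by pattern and ranks clusters by total score drop, largest first. */
    static List<Cluster> rank(List<Regression> regressions) {
        // Accumulator per pattern: index 0 = case count, index 1 = summed score drop.
        Map<String, double[]> acc = new LinkedHashMap<>();
        for (Regression r : regressions) {
            double[] a = acc.computeIfAbsent(r.pattern(), k -> new double[2]);
            a[0] += 1;
            a[1] += r.scoreDrop();
        }
        List<Cluster> clusters = new ArrayList<>();
        acc.forEach((pattern, a) -> clusters.add(new Cluster(pattern, (int) a[0], a[1])));
        clusters.sort((x, y) -> Double.compare(y.totalDrop(), x.totalDrop()));
        return clusters;
    }

    public static void main(String[] args) {
        List<Regression> regressions = List.of(
                new Regression("c1", "TOOL_CHANGE", 0.3),
                new Regression("c2", "OUTPUT_CHANGE", 0.1),
                new Regression("c3", "TOOL_CHANGE", 0.2),
                new Regression("c4", "LATENCY_CHANGE", 0.05));
        for (Cluster c : rank(regressions)) {
            System.out.println(c.pattern() + " cases=" + c.caseCount());
        }
    }
}
```

Summing the score drop is only one plausible impact measure; case count or a weighted combination would slot into the same structure.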

Add agenteval-replay module for deterministic evaluation replay

a07e968

RecordingJudgeModel/AgentWrapper decorators, ReplayJudgeModel/AgentWrapper for $0 regression tests, RecordingStore persistence, ReplaySuite orchestrator, 32 tests.
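
The record-then-replay decorator pattern described above can be sketched roughly like this. The `Judge` interface, the two wrapper classes, and the in-memory map standing in for persistence are hypothetical simplifications, not the signatures of RecordingJudgeModel, ReplayJudgeModel, or RecordingStore.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; not the agenteval-replay API. */
final class ReplaySketch {

    /** Stand-in for a judge model that would normally call a paid API. */
    interface Judge {
        String judge(String prompt);
    }

    /** Decorator that forwards to the live judge and records each prompt/response pair. */
    static final class RecordingJudge implements Judge {
        private final Judge delegate;
        private final Map<String, String> store;

        RecordingJudge(Judge delegate, Map<String, String> store) {
            this.delegate = delegate;
            this.store = store;
        }

        @Override
        public String judge(String prompt) {
            String response = delegate.judge(prompt);
            store.put(prompt, response);
            return response;
        }
    }

    /** Serves recorded responses only; never touches the live judge. */
    static final class ReplayJudge implements Judge {
        private final Map<String, String> store;

        ReplayJudge(Map<String, String> store) {
            this.store = store;
        }

        @Override
        public String judge(String prompt) {
            String response = store.get(prompt);
            if (response == null) {
                throw new IllegalStateException("No recording for prompt: " + prompt);
            }
            return response;
        }
    }

    public static void main(String[] args) {
        AtomicInteger liveCalls = new AtomicInteger();
        Judge live = prompt -> {
            liveCalls.incrementAndGet();
            return "verdict for: " + prompt;
        };
        Map<String, String> store = new LinkedHashMap<>();
        new RecordingJudge(live, store).judge("Is the answer grounded?");
        String replayed = new ReplayJudge(store).judge("Is the answer grounded?");
        System.out.println(replayed + " (live calls: " + liveCalls.get() + ")");
    }
}
```

Failing loudly on an unrecorded prompt, rather than falling back to the live judge, is what keeps a replayed run deterministic and free of API cost.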

Add agenteval-mutation module for prompt mutation testing

b214608

Sealed Mutator interface with 5 built-in mutators, PluggableMutator, MutationSuite orchestrator, AgentFactory, 22 tests.
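
As a rough illustration of what an eval detection rate measures, the sketch below applies a few prompt mutators and counts how many mutants the eval rejects. Everything here is hypothetical: the `MutationSketch` class, the use of `UnaryOperator<String>` in place of the sealed Mutator interface, and the predicate standing in for "an agent built from this prompt passes the eval suite".

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Hypothetical sketch; not the agenteval-mutation API. */
final class MutationSketch {

    /**
     * Fraction of mutated prompts that the eval fails, i.e. mutants it detects.
     * evalPasses stands in for running the full eval suite against a mutated agent.
     */
    static double detectionRate(String basePrompt,
                                List<UnaryOperator<String>> mutators,
                                Predicate<String> evalPasses) {
        if (mutators.isEmpty()) {
            throw new IllegalArgumentException("At least one mutator is required");
        }
        int detected = 0;
        for (UnaryOperator<String> mutator : mutators) {
            String mutant = mutator.apply(basePrompt);
            if (!evalPasses.test(mutant)) {
                detected++;
            }
        }
        return (double) detected / mutators.size();
    }

    public static void main(String[] args) {
        String base = "You must refuse harmful requests. Always answer in JSON.";
        List<UnaryOperator<String>> mutators = List.of(
                p -> p.replace("You must refuse harmful requests. ", ""), // remove safety
                p -> p.replace("must", "may"),                            // weaken constraint
                p -> p.replace("Always", "Sometimes"));                   // weaken format rule
        // Toy eval: only checks that the safety instruction survived.
        Predicate<String> evalPasses = p -> p.contains("must refuse");
        System.out.println(detectionRate(base, mutators, evalPasses));
    }
}
```

The toy eval catches the two safety-related mutants but misses the format change, so two of three mutants are detected. A low rate like this points at a gap in the eval suite rather than in the agent.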

Add agenteval-fingerprint module for agent capability profiling

af7acf8

CapabilityDimension enum (8 dimensions), CapabilityProfiler orchestrator, CapabilityComparison, CapabilityReporter, 17 tests.

Register replay, mutation, fingerprint modules in parent POM and BOM

cd67744

Add documentation for new modules

ba66066

Update README module structure, add 6 doc pages under docs/advanced for contract testing, chaos engineering, statistical analysis, deterministic replay, mutation testing, and capability fingerprinting.

pratyush618 merged commit 0c55ea5 into main Apr 7, 2026
8 checks passed

pratyush618 merged 7 commits into main from feature/tier2-differentiators