Add P2 growth phase: parallel eval, 4 metrics, 3 new modules, reporting by pratyush618 · Pull Request #6 · ByteVeda/agenteval

pratyush618 · 2026-03-12T15:08:51Z

Summary

Virtual thread parallel evaluation with bounded concurrency (Semaphore + VirtualThreadPerTaskExecutor), progress callback with ETA, console progress bar
4 new metrics: ToolResultUtilization (embedding-based), StepLevelErrorLocalization (LLM judge per reasoning step), TopicDriftDetection (embedding-based conversation), ConversationResolution (LLM judge conversation)
Custom HTTP embedding model support for any embedding API
HTML reporter (self-contained single-file) and regression comparison (per-metric deltas, new failure/pass detection)
Synthetic dataset generation from documents, variations, and adversarial inputs via LLM
PromptTemplate moved from metrics to core module (deprecated delegate preserved)
agenteval-langgraph4j module for graph execution capture with configurable node mapping
agenteval-mcp module for framework-agnostic MCP tool call capture and JSON Schema validation
agenteval-redteam module with 20 attack templates across 4 categories, LLM-scored resistance evaluation

13 modules, 439 tests, 0 checkstyle violations, 0 SpotBugs bugs.

Test plan

mvn clean install passes all 13 modules
439 tests pass (up from 359 in P1)
Checkstyle: 0 violations
SpotBugs: 0 bugs
Each commit passes pre-commit hook (full mvn verify)

Adds bounded-concurrency parallel evaluation using virtual threads (Executors.newVirtualThreadPerTaskExecutor + Semaphore), progress callback interface with ETA estimation, and a console progress bar implementation. Configurable via AgentEvalConfig and YAML.
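The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `BoundedParallelEval`, `Progress`, and `evaluateAll` are hypothetical names, and the ETA is assumed to be a simple linear extrapolation from elapsed time. It requires Java 21 for virtual threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical helper (not the agenteval API): bounded parallel map on virtual threads. */
final class BoundedParallelEval {

    /** Hypothetical callback; the PR's real progress interface may differ. */
    interface Progress {
        void onProgress(int completed, int total, long etaMillis);
    }

    static <I, O> List<O> evaluateAll(List<I> inputs, Function<I, O> evaluator,
                                      int maxConcurrency, Progress progress) {
        Semaphore permits = new Semaphore(maxConcurrency);
        AtomicInteger completed = new AtomicInteger();
        int total = inputs.size();
        long start = System.nanoTime();
        List<Future<O>> futures = new ArrayList<>(total);

        // One virtual thread per task; the semaphore caps how many evaluate at once.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (I input : inputs) {
                futures.add(executor.submit(() -> {
                    permits.acquire();
                    try {
                        return evaluator.apply(input);
                    } finally {
                        permits.release();
                        int done = completed.incrementAndGet();
                        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                        // Assumed ETA model: average time per finished task times tasks left.
                        long etaMs = elapsedMs * (total - done) / done;
                        progress.onProgress(done, total, etaMs);
                    }
                }));
            }
        } // close() blocks until every submitted task has finished

        // Collect results in input order; all futures are already complete here.
        List<O> results = new ArrayList<>(total);
        for (Future<O> future : futures) {
            try {
                results.add(future.get());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("evaluation failed", e.getCause());
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<Integer> lengths = evaluateAll(List.of("a", "bb", "ccc"), String::length, 2,
                (done, total, eta) -> System.out.println(done + "/" + total + " eta=" + eta + "ms"));
        System.out.println(lengths);
    }
}
```

Virtual threads are cheap to create, so the semaphore rather than the pool size is what limits pressure on the downstream LLM or embedding API.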

… TopicDriftDetection, ConversationResolution

Embedding-based ToolResultUtilization measures how well tool results are reflected in agent output. LLM-judge StepLevelErrorLocalization pinpoints the first failing step in the reasoning trace. TopicDriftDetection tracks conversation coherence via embedding similarity to the initial topic. ConversationResolution judges whether the conversation achieved its goal. Also extracts a VectorMath utility from SemanticSimilarityMetric.
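As a rough illustration of "embedding similarity to the initial topic", the sketch below computes cosine similarity and reports drift for each later turn as one minus its similarity to the first turn. The class and method names are hypothetical, and the exact scoring formula used by TopicDriftDetection and VectorMath is not shown in this PR, so the `1 - cosine` definition is an assumption.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; not the VectorMath or TopicDriftDetection code from the PR. */
final class TopicDriftSketch {

    /** Cosine similarity of two equal-length vectors; returns 0 if either has zero norm. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector lengths differ");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Assumed drift definition: 1 - cosine(turn, first turn) for every turn after the first. */
    static double[] driftFromInitialTopic(List<double[]> turnEmbeddings) {
        int laterTurns = Math.max(0, turnEmbeddings.size() - 1);
        double[] drift = new double[laterTurns];
        for (int i = 0; i < laterTurns; i++) {
            drift[i] = 1.0 - cosine(turnEmbeddings.get(0), turnEmbeddings.get(i + 1));
        }
        return drift;
    }

    public static void main(String[] args) {
        // Toy 2-dimensional "embeddings"; a real model would return hundreds of dimensions.
        List<double[]> turns = List.of(
                new double[] {1.0, 0.0},
                new double[] {1.0, 0.0},
                new double[] {0.0, 1.0});
        System.out.println(Arrays.toString(driftFromInitialTopic(turns)));
    }
}
```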

Adds CustomHttpEmbeddingModel with configurable request template, JSON path extraction for embeddings, and optional auth header. Allows users to integrate any embedding API without a dedicated provider implementation.
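To make the "request template plus optional auth header" idea concrete, here is a hedged sketch built only on the JDK's `java.net.http` API. It is not the CustomHttpEmbeddingModel implementation: the `{{input}}` placeholder syntax, the class name, and the method names are all assumptions, and the JSON path extraction step is left out because it would need a JSON parser.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical sketch; the real CustomHttpEmbeddingModel may use different names and syntax. */
final class EmbeddingRequestSketch {

    /** Minimal JSON string escaping so user text cannot break the request template. */
    static String escapeJson(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> {
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
                }
            }
        }
        return sb.toString();
    }

    /** Replaces the assumed {{input}} placeholder with the escaped text. */
    static String renderBody(String template, String input) {
        return template.replace("{{input}}", escapeJson(input));
    }

    /** Builds a POST request; the Authorization header is added only when a value is given. */
    static HttpRequest buildRequest(String url, String template, String input, String authHeaderValue) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(renderBody(template, input)));
        if (authHeaderValue != null) {
            builder.header("Authorization", authHeaderValue);
        }
        return builder.build();
    }

    public static void main(String[] args) {
        // example.invalid is a placeholder host; nothing is sent in this sketch.
        String template = "{\"model\":\"my-model\",\"input\":\"{{input}}\"}";
        HttpRequest request = buildRequest("https://example.invalid/embed", template, "hello", "Bearer token");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(renderBody(template, "hello"));
    }
}
```

Sending the request and pulling the vector out of the response would be the next steps; both depend on the target API's response shape.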

Self-contained single-file HTML report with embedded CSS/JS and JSON data injection. Regression comparison matches test cases by input, computes per-metric deltas, and identifies new failures/passes. Includes RegressionReporter for human-readable output.
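The regression comparison described above might look roughly like the following. The record and method names are hypothetical and the real classes in this PR may be shaped differently; the sketch only shows the three stated behaviours: matching cases by input, computing per-metric deltas, and flagging new failures and new passes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a baseline-vs-current comparison; not the PR's actual classes. */
final class RegressionSketch {

    record CaseResult(String input, boolean passed, Map<String, Double> scores) {}

    record Comparison(Map<String, Map<String, Double>> deltasByInput,
                      List<String> newFailures,
                      List<String> newPasses) {}

    static Comparison compare(List<CaseResult> baseline, List<CaseResult> current) {
        // Index the baseline run by input text so current cases can be matched to it.
        Map<String, CaseResult> baselineByInput = new LinkedHashMap<>();
        for (CaseResult result : baseline) {
            baselineByInput.put(result.input(), result);
        }

        Map<String, Map<String, Double>> deltasByInput = new LinkedHashMap<>();
        List<String> newFailures = new ArrayList<>();
        List<String> newPasses = new ArrayList<>();

        for (CaseResult now : current) {
            CaseResult before = baselineByInput.get(now.input());
            if (before == null) {
                continue; // no baseline counterpart, so nothing to compare
            }
            // Delta = current score - baseline score, for metrics present in both runs.
            Map<String, Double> deltas = new LinkedHashMap<>();
            for (Map.Entry<String, Double> entry : now.scores().entrySet()) {
                Double old = before.scores().get(entry.getKey());
                if (old != null) {
                    deltas.put(entry.getKey(), entry.getValue() - old);
                }
            }
            deltasByInput.put(now.input(), deltas);

            if (before.passed() && !now.passed()) {
                newFailures.add(now.input());
            } else if (!before.passed() && now.passed()) {
                newPasses.add(now.input());
            }
        }
        return new Comparison(deltasByInput, newFailures, newPasses);
    }

    public static void main(String[] args) {
        List<CaseResult> baseline = List.of(new CaseResult("q1", true, Map.of("relevance", 0.75)));
        List<CaseResult> current = List.of(new CaseResult("q1", false, Map.of("relevance", 0.5)));
        System.out.println(compare(baseline, current));
    }
}
```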

SyntheticDatasetGenerator creates test cases from documents, generates variations of existing cases, and produces adversarial inputs using an LLM judge. PromptTemplate is relocated from agenteval-metrics to agenteval-core, with a deprecated delegate left in the original location. Existing metric classes are updated to use the core PromptTemplate.

agenteval-langgraph4j captures graph execution into ReasoningSteps with configurable node-to-step-type mapping. agenteval-mcp provides framework-agnostic MCP tool call capture, test case building, and lightweight JSON Schema validation. Both modules use provided-scope dependencies for their respective frameworks.

RedTeamSuite runs configurable attack categories (prompt injection, data leakage, boundary testing, robustness) against an agent function. Includes 20 built-in attack templates loaded from classpath JSON, LLM-based attack variation generation, and judge-scored resistance evaluation. Also updates SpotBugs exclusions for P2 modules.


…tyMetric

The method already has a proper @Deprecated annotation, so the dep-ann suppression is redundant and causes an IDE diagnostic.

pratyush618 added 8 commits March 12, 2026 20:25

Add custom HTTP embedding model support

8a8a454


Remove unnecessary @SuppressWarnings("dep-ann") from SemanticSimilari…

9bada4f


pratyush618 merged commit 3796e3c into main Mar 12, 2026

pratyush618 deleted the p2-growth-phase branch March 31, 2026 17:12

pratyush618 merged 8 commits into main from p2-growth-phase
