Add P2 growth phase: parallel eval, 4 metrics, 3 new modules, reporting #6

Merged
pratyush618 merged 8 commits into main from p2-growth-phase
Mar 12, 2026

@pratyush618
Collaborator

Summary

  • Virtual thread parallel evaluation with bounded concurrency (Semaphore + Executors.newVirtualThreadPerTaskExecutor), progress callback with ETA, console progress bar
  • 4 new metrics: ToolResultUtilization (embedding-based), StepLevelErrorLocalization (LLM judge per reasoning step), TopicDriftDetection (embedding-based conversation), ConversationResolution (LLM judge conversation)
  • Custom HTTP embedding model support for any embedding API
  • HTML reporter (self-contained single-file) and regression comparison (per-metric deltas, new failure/pass detection)
  • Synthetic dataset generation from documents, variations, and adversarial inputs via LLM
  • PromptTemplate moved from metrics to core module (deprecated delegate preserved)
  • agenteval-langgraph4j module for graph execution capture with configurable node mapping
  • agenteval-mcp module for framework-agnostic MCP tool call capture and JSON Schema validation
  • agenteval-redteam module with 20 attack templates across 4 categories, LLM-scored resistance evaluation

13 modules, 439 tests, 0 checkstyle violations, 0 SpotBugs bugs.

Test plan

  • mvn clean install passes all 13 modules
  • 439 tests pass (up from 359 in P1)
  • Checkstyle: 0 violations
  • SpotBugs: 0 bugs
  • Each commit passes pre-commit hook (full mvn verify)

Adds bounded-concurrency parallel evaluation using virtual threads
(Executors.newVirtualThreadPerTaskExecutor + Semaphore), progress
callback interface with ETA estimation, and a console progress bar
implementation. Configurable via AgentEvalConfig and YAML.
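
The bounded-concurrency pattern described above can be sketched roughly as follows. This is a minimal illustration that assumes Java 21+, not the code from this PR: `BoundedParallelEvalSketch`, `ProgressCallback`, and `evaluateAll` are hypothetical names, and the ETA is a naive average-based estimate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch; names are illustrative and not AgentEval's actual API.
final class BoundedParallelEvalSketch {

    /** Hypothetical callback: completed count, total count, estimated millis remaining. */
    interface ProgressCallback {
        void onProgress(int completed, int total, long etaMillis);
    }

    static <I, O> List<O> evaluateAll(List<I> inputs, Function<I, O> evaluator,
                                      int maxConcurrency, ProgressCallback callback)
            throws InterruptedException, ExecutionException {
        Semaphore permits = new Semaphore(maxConcurrency);
        AtomicInteger completed = new AtomicInteger();
        long startNanos = System.nanoTime();
        int total = inputs.size();

        // One virtual thread per task; the semaphore caps how many evaluate at once.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<O>> futures = new ArrayList<>(total);
            for (I input : inputs) {
                futures.add(executor.submit(() -> {
                    permits.acquire();
                    try {
                        return evaluator.apply(input);
                    } finally {
                        permits.release();
                        int done = completed.incrementAndGet();
                        long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000;
                        // Naive ETA: average time per finished task times tasks remaining.
                        long etaMillis = elapsedMillis * (total - done) / done;
                        callback.onProgress(done, total, etaMillis);
                    }
                }));
            }
            // Collect results in input order; get() rethrows any evaluator failure.
            List<O> results = new ArrayList<>(total);
            for (Future<O> future : futures) {
                results.add(future.get());
            }
            return results;
        }
    }
}
```

In this sketch the callback runs on the worker threads, so a console progress bar built on it would need to tolerate concurrent calls.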
… TopicDriftDetection, ConversationResolution

Embedding-based ToolResultUtilization measures how well tool results
are reflected in agent output. LLM-judge StepLevelErrorLocalization
pinpoints the first reasoning trace failure. TopicDriftDetection
tracks conversation coherence via embedding similarity to the initial
topic. ConversationResolution judges whether the conversation achieved
its goal. Also extracts VectorMath utility from SemanticSimilarityMetric.
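
As a rough illustration of the embedding-based idea behind TopicDriftDetection, the sketch below scores each turn by its cosine distance from the first turn's embedding. The commit message does not spell out the actual scoring, so `TopicDriftSketch` and its methods are hypothetical, and the real VectorMath utility may look different.

```java
import java.util.List;

// Hypothetical sketch; not the PR's VectorMath or TopicDriftDetection code.
final class TopicDriftSketch {

    /** Cosine similarity of two equal-length vectors; returns 0 if either is all zeros. */
    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Drift per turn, defined here as 1 minus similarity to the first turn's embedding. */
    static double[] driftFromInitialTopic(List<double[]> turnEmbeddings) {
        double[] drift = new double[turnEmbeddings.size()];
        if (turnEmbeddings.isEmpty()) {
            return drift;
        }
        double[] initial = turnEmbeddings.get(0);
        for (int i = 0; i < drift.length; i++) {
            drift[i] = 1.0 - cosineSimilarity(initial, turnEmbeddings.get(i));
        }
        return drift;
    }
}
```

A turn whose embedding points the same way as the first turn gets a drift near 0, and an orthogonal one gets a drift near 1.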

Adds CustomHttpEmbeddingModel with configurable request template,
JSON path extraction for embeddings, and optional auth header.
Allows users to integrate any embedding API without a dedicated
provider implementation.

Self-contained single-file HTML report with embedded CSS/JS and JSON
data injection. Regression comparison matches test cases by input,
computes per-metric deltas, and identifies new failures/passes.
Includes RegressionReporter for human-readable output.
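
The comparison logic described above (match by input, per-metric deltas, new failures and passes) might look roughly like this. It is a sketch under assumptions: `CaseResult`, `Comparison`, and `compare` are hypothetical types invented for illustration, and cases with no baseline match are simply skipped here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the PR's regression comparison types are not shown in the source.
final class RegressionComparisonSketch {

    record CaseResult(String input, Map<String, Double> metricScores, boolean passed) {}

    record Comparison(Map<String, Map<String, Double>> deltasByInput,
                      List<String> newFailures, List<String> newPasses) {}

    static Comparison compare(List<CaseResult> baseline, List<CaseResult> current) {
        // Index the baseline run by input so current cases can be matched to it.
        Map<String, CaseResult> baselineByInput = new LinkedHashMap<>();
        for (CaseResult result : baseline) {
            baselineByInput.put(result.input(), result);
        }

        Map<String, Map<String, Double>> deltasByInput = new LinkedHashMap<>();
        List<String> newFailures = new ArrayList<>();
        List<String> newPasses = new ArrayList<>();

        for (CaseResult now : current) {
            CaseResult before = baselineByInput.get(now.input());
            if (before == null) {
                continue; // no baseline counterpart; ignored in this sketch
            }
            // Delta = current score minus baseline score, for metrics present in both runs.
            Map<String, Double> deltas = new LinkedHashMap<>();
            for (Map.Entry<String, Double> entry : now.metricScores().entrySet()) {
                Double old = before.metricScores().get(entry.getKey());
                if (old != null) {
                    deltas.put(entry.getKey(), entry.getValue() - old);
                }
            }
            deltasByInput.put(now.input(), deltas);

            if (before.passed() && !now.passed()) {
                newFailures.add(now.input());
            } else if (!before.passed() && now.passed()) {
                newPasses.add(now.input());
            }
        }
        return new Comparison(deltasByInput, newFailures, newPasses);
    }
}
```

A negative delta means the metric dropped relative to the baseline run.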

SyntheticDatasetGenerator creates test cases from documents, generates
variations of existing cases, and produces adversarial inputs using
an LLM judge. PromptTemplate relocated from agenteval-metrics to
agenteval-core with a deprecated delegate in the original location.
Existing metric classes updated to use core PromptTemplate.

agenteval-langgraph4j captures graph execution into ReasoningSteps
with configurable node-to-step-type mapping. agenteval-mcp provides
framework-agnostic MCP tool call capture, test case building, and
lightweight JSON Schema validation. Both modules use provided-scope
dependencies for their respective frameworks.

RedTeamSuite runs configurable attack categories (prompt injection,
data leakage, boundary testing, robustness) against an agent function.
Includes 20 built-in attack templates loaded from classpath JSON,
LLM-based attack variation generation, and judge-scored resistance
evaluation. Also updates SpotBugs exclusions for P2 modules.
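
One way to picture the suite's flow: filter attacks by enabled category, run each prompt through the agent function, and average the judge's resistance scores per category. Everything in this sketch is hypothetical (`RedTeamSketch`, `AttackTemplate`, `ResistanceJudge`, and the assumed 0.0 to 1.0 score range). The real RedTeamSuite loads templates from classpath JSON and scores with an LLM judge, neither of which is reproduced here.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch; not the RedTeamSuite API from agenteval-redteam.
final class RedTeamSketch {

    enum Category { PROMPT_INJECTION, DATA_LEAKAGE, BOUNDARY_TESTING, ROBUSTNESS }

    record AttackTemplate(Category category, String prompt) {}

    /** Stand-in for the LLM judge; assumed to return a resistance score in [0.0, 1.0]. */
    interface ResistanceJudge {
        double score(AttackTemplate attack, String agentResponse);
    }

    /** Runs enabled attacks against the agent and averages resistance per category. */
    static Map<Category, Double> averageResistance(List<AttackTemplate> attacks,
                                                   Set<Category> enabled,
                                                   Function<String, String> agent,
                                                   ResistanceJudge judge) {
        Map<Category, Double> sums = new EnumMap<>(Category.class);
        Map<Category, Integer> counts = new EnumMap<>(Category.class);
        for (AttackTemplate attack : attacks) {
            if (!enabled.contains(attack.category())) {
                continue; // category not selected for this run
            }
            String response = agent.apply(attack.prompt());
            sums.merge(attack.category(), judge.score(attack, response), Double::sum);
            counts.merge(attack.category(), 1, Integer::sum);
        }
        Map<Category, Double> averages = new EnumMap<>(Category.class);
        for (Map.Entry<Category, Double> entry : sums.entrySet()) {
            averages.put(entry.getKey(), entry.getValue() / counts.get(entry.getKey()));
        }
        return averages;
    }
}
```

Keeping the judge behind an interface lets tests substitute a deterministic stub for the LLM call.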
…tyMetric

The method already has a proper @Deprecated annotation, so the
dep-ann suppression is redundant and causes an IDE diagnostic.

pratyush618 merged commit 3796e3c into main on Mar 12, 2026
pratyush618 deleted the p2-growth-phase branch on March 31, 2026, 17:12