Add P2 growth phase: parallel eval, 4 metrics, 3 new modules, reporting #6
Merged
pratyush618 merged 8 commits into main on Mar 12, 2026
Adds bounded-concurrency parallel evaluation using virtual threads (Executors.newVirtualThreadPerTaskExecutor + Semaphore), progress callback interface with ETA estimation, and a console progress bar implementation. Configurable via AgentEvalConfig and YAML.
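The bounded-concurrency pattern named above can be sketched roughly as follows. This is a minimal illustration assuming Java 21+, not the code from this PR: the class `BoundedParallelEvalSketch`, the method `evaluateAll`, and its parameters are hypothetical names, and progress callbacks and ETA estimation are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Hypothetical sketch; not the project's actual evaluator class.
final class BoundedParallelEvalSketch {

    /** Evaluates every input on its own virtual thread, with at most maxConcurrency running at once. */
    static <I, R> List<R> evaluateAll(List<I> inputs, Function<I, R> evaluator, int maxConcurrency) {
        Semaphore permits = new Semaphore(maxConcurrency);
        // The executor is AutoCloseable; close() waits for submitted tasks to finish.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<R>> futures = new ArrayList<>();
            for (I input : inputs) {
                futures.add(executor.submit(() -> {
                    permits.acquire(); // blocks this virtual thread until a permit is free
                    try {
                        return evaluator.apply(input);
                    } finally {
                        permits.release();
                    }
                }));
            }
            // Collect in submission order so results line up with inputs.
            List<R> results = new ArrayList<>(futures.size());
            for (Future<R> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Evaluation interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Evaluation failed", e.getCause());
        }
    }
}
```

Virtual threads are cheap to create, so the semaphore rather than the pool size is what caps how many evaluations (for example, LLM calls) run at the same time.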
… TopicDriftDetection, ConversationResolution

Embedding-based ToolResultUtilization measures how well tool results are reflected in agent output. LLM-judge StepLevelErrorLocalization pinpoints the first reasoning trace failure. TopicDriftDetection tracks conversation coherence via embedding similarity to the initial topic. ConversationResolution judges whether the conversation achieved its goal. Also extracts VectorMath utility from SemanticSimilarityMetric.
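As a rough illustration of the embedding-similarity idea behind the drift metric, the sketch below computes cosine similarity and a per-turn drift score relative to the first turn. It assumes cosine similarity is the measure used and that drift is one minus similarity; both are assumptions, and `VectorMathSketch` and its methods are hypothetical names rather than the extracted VectorMath API.

```java
import java.util.List;

// Hypothetical sketch; the real VectorMath utility may differ.
final class VectorMathSketch {

    /** Cosine similarity of two equal-length vectors; returns 0.0 if either has zero magnitude. */
    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Assumed drift definition: 1 - similarity between each turn and the first turn. */
    static double[] driftFromInitialTopic(List<double[]> turnEmbeddings) {
        double[] drift = new double[turnEmbeddings.size()];
        for (int i = 0; i < drift.length; i++) {
            drift[i] = 1.0 - cosineSimilarity(turnEmbeddings.get(0), turnEmbeddings.get(i));
        }
        return drift;
    }
}
```

A turn whose embedding points in the same direction as the opening turn gets a drift near 0, while an unrelated turn moves toward 1.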
Adds CustomHttpEmbeddingModel with configurable request template, JSON path extraction for embeddings, and optional auth header. Allows users to integrate any embedding API without a dedicated provider implementation.
Self-contained single-file HTML report with embedded CSS/JS and JSON data injection. Regression comparison matches test cases by input, computes per-metric deltas, and identifies new failures/passes. Includes RegressionReporter for human-readable output.
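The regression comparison described above could look roughly like this. It is a sketch under assumptions: the records `CaseResult` and `Comparison` and the method `compare` are hypothetical, each test case is assumed to carry a pass/fail flag plus a map of metric scores, and cases without a matching input in the baseline are simply skipped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not the project's regression comparison classes.
final class RegressionSketch {

    record CaseResult(String input, Map<String, Double> scores, boolean passed) {}

    record Comparison(Map<String, Map<String, Double>> deltasByInput,
                      List<String> newFailures,
                      List<String> newPasses) {}

    static Comparison compare(List<CaseResult> baseline, List<CaseResult> current) {
        // Index the baseline run by input so current cases can be matched to it.
        Map<String, CaseResult> baselineByInput = new LinkedHashMap<>();
        for (CaseResult result : baseline) {
            baselineByInput.put(result.input(), result);
        }

        Map<String, Map<String, Double>> deltasByInput = new LinkedHashMap<>();
        List<String> newFailures = new ArrayList<>();
        List<String> newPasses = new ArrayList<>();

        for (CaseResult now : current) {
            CaseResult before = baselineByInput.get(now.input());
            if (before == null) {
                continue; // no baseline case with this input; nothing to compare
            }
            // Delta = current score - baseline score, for metrics present in both runs.
            Map<String, Double> deltas = new TreeMap<>();
            for (Map.Entry<String, Double> entry : now.scores().entrySet()) {
                Double previous = before.scores().get(entry.getKey());
                if (previous != null) {
                    deltas.put(entry.getKey(), entry.getValue() - previous);
                }
            }
            deltasByInput.put(now.input(), deltas);

            if (before.passed() && !now.passed()) {
                newFailures.add(now.input());
            } else if (!before.passed() && now.passed()) {
                newPasses.add(now.input());
            }
        }
        return new Comparison(deltasByInput, newFailures, newPasses);
    }
}
```

Matching by input text keeps the comparison independent of test case ordering, at the cost of treating any edited input as a new case.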
SyntheticDatasetGenerator creates test cases from documents, generates variations of existing cases, and produces adversarial inputs using an LLM judge. PromptTemplate is relocated from agenteval-metrics to agenteval-core, with a deprecated delegate left in the original location. Existing metric classes are updated to use the core PromptTemplate.
agenteval-langgraph4j captures graph execution into ReasoningSteps with configurable node-to-step-type mapping. agenteval-mcp provides framework-agnostic MCP tool call capture, test case building, and lightweight JSON Schema validation. Both modules use provided-scope dependencies for their respective frameworks.
RedTeamSuite runs configurable attack categories (prompt injection, data leakage, boundary testing, robustness) against an agent function. Includes 20 built-in attack templates loaded from classpath JSON, LLM-based attack variation generation, and judge-scored resistance evaluation. Also updates SpotBugs exclusions for P2 modules.
…tyMetric

The method already has a proper @Deprecated annotation, so the dep-ann suppression is redundant and causes an IDE diagnostic.
Summary
13 modules, 439 tests, 0 checkstyle violations, 0 SpotBugs bugs.
Test plan
mvn clean install passes all 13 modules
… mvn verify)