Add P1 features: metrics, providers, datasets, integrations #5
Merged
pratyush618 merged 9 commits into main on Mar 12, 2026
Conversation
Three new LLM-judge metrics extending LLMJudgeMetric:
- BiasMetric: evaluates output for bias across configurable dimensions (gender, race, religion, political, socioeconomic), threshold=0.5
- ConcisenessMetric: evaluates response brevity, threshold=0.5
- CoherenceMetric: evaluates logical flow and consistency, threshold=0.7
Includes prompt templates and unit tests for all three metrics.
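To illustrate the shared threshold behavior these metrics inherit, here is a minimal sketch of a template-method judge metric: the base class owns the threshold and the pass/fail decision, while subclasses supply the raw score. The class and method names (`JudgeMetricSketch`, `score`, `passes`) are illustrative assumptions, not the library's actual `LLMJudgeMetric` API.

```java
// Hypothetical sketch of the threshold pattern; names are illustrative.
abstract class JudgeMetricSketch {
    private final double threshold;

    protected JudgeMetricSketch(double threshold) {
        this.threshold = threshold;
    }

    // Template method: subclasses supply a judge score in [0, 1];
    // the base class decides pass/fail against the configured threshold.
    protected abstract double score(String input, String actualOutput);

    public final boolean passes(String input, String actualOutput) {
        return score(input, actualOutput) >= threshold;
    }
}

// Stub subclass returning a fixed score, e.g. for testing the base class.
class FixedScoreMetric extends JudgeMetricSketch {
    private final double fixed;

    FixedScoreMetric(double threshold, double fixed) {
        super(threshold);
        this.fixed = fixed;
    }

    @Override
    protected double score(String input, String actualOutput) {
        return fixed;
    }
}
```

With this shape, a CoherenceMetric-style threshold of 0.7 fails a 0.65 score while a 0.5 threshold passes it.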
…evancy

Three new LLM-judge metrics for evaluating retrieval-augmented generation:
- ContextualPrecisionMetric: measures relevance of retrieved context to the expected output; validates retrievalContext + expectedOutput
- ContextualRecallMetric: measures coverage of the expected output by the retrieved context
- ContextualRelevancyMetric: measures relevance of retrieved context to the input query
All use numbered context formatting ([1] doc1, [2] doc2) and threshold=0.7.
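The numbered context formatting mentioned above can be sketched as a small helper that renders each retrieved document on its own `[N]`-prefixed line. The class name `ContextFormat` is a hypothetical stand-in, not the library's real formatter.

```java
import java.util.List;

final class ContextFormat {
    private ContextFormat() {}

    // Formats retrieved documents as "[1] doc1\n[2] doc2" — the numbered
    // style the contextual metrics are described as feeding to the judge.
    static String numbered(List<String> docs) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < docs.size(); i++) {
            if (i > 0) sb.append('\n');
            sb.append('[').append(i + 1).append("] ").append(docs.get(i));
        }
        return sb.toString();
    }
}
```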
…e, RetrievalCompleteness

Four new metrics for evaluating agent behavior:
- ToolArgumentCorrectnessMetric: deterministic metric comparing actual vs. expected tool-call arguments, with optional strict mode
- PlanQualityMetric: LLM-judge metric evaluating reasoning-trace quality
- PlanAdherenceMetric: LLM-judge metric checking execution against the plan
- RetrievalCompletenessMetric: supports EXACT (set-intersection) and SEMANTIC (LLM-judge) match modes for context completeness
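The EXACT mode described for RetrievalCompletenessMetric can be read as a set-intersection score: the fraction of expected context items found verbatim in the retrieved context. This is a sketch under that reading; the real metric's exact scoring and edge-case handling may differ.

```java
import java.util.HashSet;
import java.util.Set;

final class ExactCompleteness {
    private ExactCompleteness() {}

    // EXACT mode as described above: |expected ∩ retrieved| / |expected|.
    // An empty expected set is treated as trivially complete (assumption).
    static double score(Set<String> expected, Set<String> retrieved) {
        if (expected.isEmpty()) return 1.0;
        Set<String> hits = new HashSet<>(expected);
        hits.retainAll(retrieved);
        return (double) hits.size() / expected.size();
    }
}
```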
New ConversationMetric interface in core for multi-turn evaluation, with an LLMConversationMetric abstract base class following the same template-method pattern as LLMJudgeMetric.
- ConversationCoherenceMetric: evaluates logical flow across turns
- ContextRetentionMetric: evaluates whether the agent retains context from earlier turns
Formats conversation turns as "Turn N [USER/AGENT]: ..." for prompts. Updates SpotBugs exclusions for the constructor-throw pattern.
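The "Turn N [USER/AGENT]: ..." prompt layout can be sketched as a small renderer over a list of turns. `TurnFormat` and the nested `Turn` record are hypothetical names chosen for this example, not the library's types.

```java
import java.util.List;
import java.util.Locale;

final class TurnFormat {
    // One conversation turn: who spoke and what was said.
    record Turn(String role, String text) {}

    // Renders turns as "Turn 1 [USER]: ...\nTurn 2 [AGENT]: ..." — the
    // layout the conversation metrics are described as using in prompts.
    static String render(List<Turn> turns) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < turns.size(); i++) {
            if (i > 0) sb.append('\n');
            Turn t = turns.get(i);
            sb.append("Turn ").append(i + 1)
              .append(" [").append(t.role().toUpperCase(Locale.ROOT)).append("]: ")
              .append(t.text());
        }
        return sb.toString();
    }
}
```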
New agenteval-embeddings module with OpenAI and Ollama embedding providers using the java.net.http client. Includes an EmbeddingModels factory, a config builder, and an HTTP transport layer.

SemanticSimilarityMetric in agenteval-metrics uses cosine similarity between the embedded actual and expected outputs (deterministic, no LLM judge).

Updates the root POM with the new module and dependency-management entries. Adds an optional agenteval-embeddings dependency to agenteval-metrics.
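The cosine-similarity comparison that SemanticSimilarityMetric is described as using is a standard computation; here is a self-contained version. The zero-vector convention (return 0.0) is an assumption of this sketch, not necessarily the library's behavior.

```java
final class Cosine {
    private Cosine() {}

    // Cosine similarity between two embedding vectors:
    // dot(a, b) / (|a| * |b|), in [-1, 1] for nonzero vectors.
    static double similarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("embedding dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // Assumption: define similarity against a zero vector as 0.
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```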
Dataset formats:
- CsvDatasetLoader/Writer: RFC 4180 CSV with pipe-separated lists
- JsonlDatasetLoader/Writer: one JSON object per line
- DatasetFormat enum with auto-detection by file extension
- DatasetLoaders factory using DatasetFormat.detect()
- EvalDataset.save(Path, DatasetFormat) overload

Reporting:
- JsonReporter: serializes EvalResult to JSON via Jackson

Updates DatasetArgumentsProvider to auto-detect .json/.jsonl/.csv. Adds a jackson-databind dependency to agenteval-reporting.
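Extension-based detection like `DatasetFormat.detect()` can be sketched as an enum with a lookup on the lowercased file name. The enum name, constant set, and error handling here are guesses at the shape described, not the actual implementation; note `.jsonl` must be checked before `.json`.

```java
import java.util.Locale;

// Hypothetical reconstruction of extension-based format detection.
enum DatasetFormatSketch {
    CSV, JSONL, JSON;

    static DatasetFormatSketch detect(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".csv")) return CSV;
        if (lower.endsWith(".jsonl")) return JSONL; // check before .json
        if (lower.endsWith(".json")) return JSONL == null ? null : JSON;
        throw new IllegalArgumentException("Unsupported dataset extension: " + fileName);
    }
}
```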
Ollama:
- OllamaJudgeModel: POST /api/chat with JSON format, no API key required
- JudgeModels.ollama() factory methods
- JudgeConfig.apiKey now nullable (Ollama doesn't need one);
null checks moved to OpenAI/Anthropic provider constructors
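Ollama's `/api/chat` endpoint takes a JSON body with `model`, `messages`, and an optional `format: "json"` field, and needs no Authorization header. A sketch of how such a request might be assembled with `java.net.http` follows; the class name and the unescaped prompt interpolation are simplifications for illustration, not the library's `OllamaJudgeModel` code.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative request construction for Ollama's /api/chat endpoint.
final class OllamaRequestSketch {
    private OllamaRequestSketch() {}

    static HttpRequest build(String baseUrl, String model, String prompt) {
        // Simplification: a real implementation must JSON-escape the prompt.
        String body = """
            {"model":"%s","stream":false,"format":"json",
             "messages":[{"role":"user","content":"%s"}]}
            """.formatted(model, prompt);
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/chat"))
            .header("Content-Type", "application/json") // no API key needed
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }
}
```

Sending the request would use `HttpClient.send(...)`; building it, as here, needs no running server.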
Cost tracking:
- PricingModel, CostSummary, CostTracker (thread-safe with atomics)
- CostTrackingJudgeModel decorator wrapping any JudgeModel
- BudgetExceededException for budget enforcement
- AgentEvalConfig gains costBudget and pricingModel fields
- EvalResult gains costSummary accessor
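The thread-safe-with-atomics cost tracking and budget enforcement described above can be sketched with a `DoubleAdder` and a best-effort budget check. The class name, the use of `IllegalStateException` in place of BudgetExceededException, and the USD units are assumptions of this sketch.

```java
import java.util.concurrent.atomic.DoubleAdder;

// Illustrative thread-safe cost tracker with budget enforcement.
class CostTrackerSketch {
    private final DoubleAdder spentUsd = new DoubleAdder();
    private final double budgetUsd;

    CostTrackerSketch(double budgetUsd) {
        this.budgetUsd = budgetUsd;
    }

    // Records one call's cost; throws once accumulated spend exceeds the
    // budget. The check is best-effort: concurrent callers may each observe
    // the overrun, which is acceptable for enforcement purposes.
    void record(double callCostUsd) {
        spentUsd.add(callCostUsd);
        double total = spentUsd.sum();
        if (total > budgetUsd) {
            throw new IllegalStateException(
                "Budget exceeded: spent " + total + " of " + budgetUsd);
        }
    }

    double spent() {
        return spentUsd.sum();
    }
}
```

A decorator like CostTrackingJudgeModel would call `record(...)` after each judge invocation, letting the exception abort the run.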
YAML config:
- AgentEvalConfigLoader: loads agenteval.yaml with ${ENV_VAR} resolution
- YamlConfigModel POJO for Jackson YAML deserialization
- Optional jackson-dataformat-yaml dependency in core
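The `${ENV_VAR}` resolution mentioned for AgentEvalConfigLoader can be sketched as a regex substitution pass over the loaded text. Here the lookup is an injected map for testability (the real loader would presumably consult `System.getenv()`); failing on undefined variables is this sketch's choice, not confirmed behavior.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EnvResolver {
    private EnvResolver() {}

    // Matches ${NAME} where NAME is a conventional env-var identifier.
    private static final Pattern VAR =
        Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*)}");

    // Replaces each ${NAME} placeholder from the lookup map; an unresolved
    // name is treated as a configuration error (assumption of this sketch).
    static String resolve(String text, Map<String, String> env) {
        Matcher m = VAR.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = env.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("Undefined variable: " + m.group(1));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```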
agenteval-spring-ai:
- SpringAiCapture: wraps ChatModel calls as AgentTestCase
- SpringAiTestCaseBuilder: converts ChatResponse to AgentTestCase
- SpringAiAdvisorInterceptor: CallAdvisor capturing RAG retrieval context
- AgentEvalAutoConfiguration for Spring Boot auto-config
- Uses Spring AI 1.0 GA artifacts (spring-ai-model, spring-ai-client-chat)

agenteval-langchain4j:
- LangChain4jCapture: wraps ChatLanguageModel calls as AgentTestCase
- LangChain4jTestCaseBuilder: converts an AiMessage response to AgentTestCase
- LangChain4jContentRetrieverCapture: wraps ContentRetriever for context

Both modules use provided-scope dependencies so users bring their own framework version.
Match Spring AI's CallAdvisor and Advisor interface null contracts.
Summary
- YAML configuration loading (agenteval.yaml)
- New embeddings module (agenteval-embeddings) with OpenAI and Ollama providers
- Spring AI integration (agenteval-spring-ai) with advisor interceptor and auto-configuration
- LangChain4j integration (agenteval-langchain4j) with chat model and content retriever capture

Test plan
- Full build with unit tests (mvn clean install)