Add P1 features: metrics, providers, datasets, integrations#5

Merged
pratyush618 merged 9 commits into main from feat/p1-features
Mar 12, 2026
Conversation

@pratyush618
Collaborator

Summary

  • Response quality metrics: Bias, Conciseness, Coherence, SemanticSimilarity (with embeddings module)
  • RAG metrics: ContextualPrecision, ContextualRecall, ContextualRelevancy
  • Agent metrics: ToolArgumentCorrectness, PlanQuality, PlanAdherence, RetrievalCompleteness
  • Conversation metrics: ConversationCoherence, ContextRetention
  • Ollama judge provider for local LLM evaluation
  • Cost tracking with budget limits and per-metric usage
  • YAML configuration loading (agenteval.yaml)
  • CSV/JSONL dataset formats with auto-detection and writers
  • JSON reporter for machine-readable evaluation output
  • Embeddings module (agenteval-embeddings) with OpenAI and Ollama providers
  • Spring AI integration (agenteval-spring-ai) with advisor interceptor and auto-configuration
  • LangChain4j integration (agenteval-langchain4j) with chat model and content retriever capture

Test plan

  • All 10 modules build successfully (mvn clean install)
  • Unit tests pass for all new metrics, providers, datasets, and integrations
  • SpotBugs and Checkstyle pass with no violations

Commits

Three new LLM-judge metrics extending LLMJudgeMetric:
- BiasMetric: evaluates output for bias across configurable dimensions
  (gender, race, religion, political, socioeconomic), threshold=0.5
- ConcisenessMetric: evaluates response brevity, threshold=0.5
- CoherenceMetric: evaluates logical flow and consistency, threshold=0.7

Includes prompt templates and unit tests for all three metrics.
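As a rough illustration of the template method pattern these metrics extend, here is a minimal sketch. The class and method names (`buildPrompt`, the injected judge `Function`, the `evaluate` shape) and the prompt wording are assumptions for illustration, not the library's actual API; only the class names and thresholds come from the commit message.

```java
import java.util.function.Function;

// Hypothetical sketch of an LLM-judge metric base class: subclasses supply the
// judge prompt, the base class applies the judge and the pass/fail threshold.
abstract class LLMJudgeMetricSketch {
    private final double threshold;

    protected LLMJudgeMetricSketch(double threshold) {
        this.threshold = threshold;
    }

    // Hook: subclasses build the prompt sent to the judge model.
    protected abstract String buildPrompt(String actualOutput);

    // Template method: prompt the judge, compare its 0..1 score to the threshold.
    public final boolean evaluate(String actualOutput, Function<String, Double> judge) {
        double score = judge.apply(buildPrompt(actualOutput));
        return score >= threshold;
    }
}

class ConcisenessMetricSketch extends LLMJudgeMetricSketch {
    ConcisenessMetricSketch() {
        super(0.5); // threshold from the description above
    }

    @Override
    protected String buildPrompt(String actualOutput) {
        return "Rate the conciseness of this response from 0 to 1:\n" + actualOutput;
    }
}
```

The judge is injected as a plain `Function` here so the sketch stays self-contained; the real metrics presumably call a configured JudgeModel instead.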

Three new LLM-judge metrics for evaluating retrieval-augmented generation:
- ContextualPrecisionMetric: measures relevance of retrieved context to
  the expected output, validates retrievalContext + expectedOutput
- ContextualRecallMetric: measures coverage of expected output by
  retrieved context
- ContextualRelevancyMetric: measures relevance of retrieved context to
  the input query

All use numbered context formatting ([1] doc1, [2] doc2) and threshold=0.7.
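The numbered context formatting mentioned above could look like the following sketch; the helper class and method names are hypothetical.

```java
import java.util.List;

// Formats retrieved context documents as "[1] doc1\n[2] doc2" for judge prompts.
final class ContextFormat {
    static String numbered(List<String> retrievalContext) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < retrievalContext.size(); i++) {
            if (i > 0) sb.append('\n');
            sb.append('[').append(i + 1).append("] ").append(retrievalContext.get(i));
        }
        return sb.toString();
    }
}
```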

Four new metrics for evaluating agent behavior:
- ToolArgumentCorrectnessMetric: deterministic metric comparing actual vs
  expected tool call arguments with optional strict mode
- PlanQualityMetric: LLM-judge metric evaluating reasoning trace quality
- PlanAdherenceMetric: LLM-judge metric checking execution against plan
- RetrievalCompletenessMetric: supports EXACT (set-intersection) and
  SEMANTIC (LLM-judge) match modes for context completeness
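The EXACT (set-intersection) match mode can be read as: the score is the fraction of expected context items that appear verbatim in the retrieved set. A minimal sketch, with hypothetical names:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// EXACT-mode completeness: |expected ∩ retrieved| / |expected|.
final class ExactCompleteness {
    static double score(List<String> expected, List<String> retrieved) {
        if (expected.isEmpty()) return 1.0; // nothing required, trivially complete
        Set<String> hits = new HashSet<>(expected);
        hits.retainAll(new HashSet<>(retrieved));
        return (double) hits.size() / expected.size();
    }
}
```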

New ConversationMetric interface in core for multi-turn evaluation,
with LLMConversationMetric abstract base class following the same
template method pattern as LLMJudgeMetric.

- ConversationCoherenceMetric: evaluates logical flow across turns
- ContextRetentionMetric: evaluates whether the agent retains context
  from earlier turns

Formats conversation turns as "Turn N [USER/AGENT]: ..." for prompts.
Updates SpotBugs exclusions for constructor-throw pattern.
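The "Turn N [USER/AGENT]: ..." prompt formatting described above could be sketched as follows; the `Turn` record and helper are hypothetical stand-ins for the library's conversation types.

```java
import java.util.List;

// Renders a multi-turn conversation for a judge prompt, one line per turn.
final class ConversationFormat {
    record Turn(String role, String content) {}

    static String render(List<Turn> turns) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < turns.size(); i++) {
            Turn t = turns.get(i);
            sb.append("Turn ").append(i + 1)
              .append(" [").append(t.role()).append("]: ")
              .append(t.content()).append('\n');
        }
        return sb.toString();
    }
}
```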

New agenteval-embeddings module with OpenAI and Ollama embedding providers
using java.net.http client. Includes EmbeddingModels factory, config
builder, and HTTP transport layer.

SemanticSimilarityMetric in agenteval-metrics uses cosine similarity
between embedded actual and expected outputs (deterministic, no LLM judge).

Updates root POM with new module and dependency management entries.
Adds optional agenteval-embeddings dependency to agenteval-metrics.
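Cosine similarity between two embedding vectors is standard; this is purely illustrative of the scoring the commit describes, not the module's actual code.

```java
// Cosine similarity: dot(a, b) / (|a| * |b|), giving 1.0 for parallel vectors
// and 0.0 for orthogonal ones. Assumes equal-length, non-zero vectors.
final class Cosine {
    static double similarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```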

Dataset formats:
- CsvDatasetLoader/Writer: RFC 4180 CSV with pipe-separated lists
- JsonlDatasetLoader/Writer: one JSON object per line
- DatasetFormat enum with auto-detection by file extension
- DatasetLoaders factory using DatasetFormat.detect()
- EvalDataset.save(Path, DatasetFormat) overload
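Extension-based auto-detection presumably reduces to a suffix check along these lines; the enum and method here are a sketch, and the real DatasetFormat API may differ (for example in how unknown extensions are reported).

```java
import java.nio.file.Path;
import java.util.Locale;

// Detects a dataset format from a file's extension, case-insensitively.
enum DatasetFormatSketch {
    CSV, JSONL, JSON;

    static DatasetFormatSketch detect(Path path) {
        String name = path.getFileName().toString().toLowerCase(Locale.ROOT);
        if (name.endsWith(".csv")) return CSV;
        if (name.endsWith(".jsonl")) return JSONL;
        if (name.endsWith(".json")) return JSON;
        throw new IllegalArgumentException("Unrecognized dataset extension: " + name);
    }
}
```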

Reporting:
- JsonReporter: serializes EvalResult to JSON via Jackson

Updates DatasetArgumentsProvider to auto-detect .json/.jsonl/.csv.
Adds jackson-databind dependency to agenteval-reporting.

Ollama:
- OllamaJudgeModel: POST /api/chat with JSON format, no API key required
- JudgeModels.ollama() factory methods
- JudgeConfig.apiKey now nullable (Ollama doesn't need one);
  null checks moved to OpenAI/Anthropic provider constructors
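The kind of request OllamaJudgeModel is described as making (POST /api/chat with JSON format, no Authorization header) can be sketched with `java.net.http`. The request is built but not sent; the base URL and model name are placeholders, and real code would JSON-escape the prompt.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Builds an Ollama /api/chat request: JSON body, no API key header.
final class OllamaRequestSketch {
    static HttpRequest build(String baseUrl, String model, String prompt) {
        // NOTE: prompt is interpolated verbatim; production code must JSON-escape it.
        String body = """
            {"model":"%s","stream":false,"format":"json",
             "messages":[{"role":"user","content":"%s"}]}""".formatted(model, prompt);
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/chat"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```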

Cost tracking:
- PricingModel, CostSummary, CostTracker (thread-safe with atomics)
- CostTrackingJudgeModel decorator wrapping any JudgeModel
- BudgetExceededException for budget enforcement
- AgentEvalConfig gains costBudget and pricingModel fields
- EvalResult gains costSummary accessor
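A thread-safe tracker with atomics and budget enforcement, as described, could look roughly like this. Class and method names are assumptions, the currency unit is arbitrary, and `IllegalStateException` stands in for the PR's BudgetExceededException.

```java
import java.util.concurrent.atomic.AtomicLong;

// Accumulates per-call cost atomically and throws once the budget is exceeded.
final class CostTrackerSketch {
    private final AtomicLong totalMicroCents = new AtomicLong();
    private final long budgetMicroCents;

    CostTrackerSketch(long budgetMicroCents) {
        this.budgetMicroCents = budgetMicroCents;
    }

    /** Records one call's cost; throws when the running total exceeds the budget. */
    void record(long microCents) {
        long total = totalMicroCents.addAndGet(microCents);
        if (total > budgetMicroCents) {
            // Stand-in for BudgetExceededException.
            throw new IllegalStateException("Budget exceeded: " + total + " > " + budgetMicroCents);
        }
    }

    long total() {
        return totalMicroCents.get();
    }
}
```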

YAML config:
- AgentEvalConfigLoader: loads agenteval.yaml with ${ENV_VAR} resolution
- YamlConfigModel POJO for Jackson YAML deserialization
- Optional jackson-dataformat-yaml dependency in core
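The `${ENV_VAR}` resolution mentioned above amounts to placeholder substitution over the loaded YAML values. A self-contained sketch, with the lookup injected as a function so it does not depend on real environment variables (names here are hypothetical):

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Replaces ${VAR} placeholders in a config value using the supplied lookup.
final class EnvResolver {
    private static final Pattern VAR = Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*)}");

    static String resolve(String value, Function<String, String> env) {
        Matcher m = VAR.matcher(value);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = env.apply(m.group(1));
            if (replacement == null) {
                throw new IllegalArgumentException("Unset variable: " + m.group(1));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

In production use the lookup would be `System::getenv`; injecting it keeps the sketch testable.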

agenteval-spring-ai:
- SpringAiCapture: wraps ChatModel calls as AgentTestCase
- SpringAiTestCaseBuilder: converts ChatResponse to AgentTestCase
- SpringAiAdvisorInterceptor: CallAdvisor capturing RAG retrieval context
- AgentEvalAutoConfiguration for Spring Boot auto-config
- Uses Spring AI 1.0 GA artifacts (spring-ai-model, spring-ai-client-chat)

agenteval-langchain4j:
- LangChain4jCapture: wraps ChatLanguageModel calls as AgentTestCase
- LangChain4jTestCaseBuilder: converts AiMessage response to AgentTestCase
- LangChain4jContentRetrieverCapture: wraps ContentRetriever for context

Both modules use provided-scope dependencies so users bring their own
framework version.

Match Spring AI's CallAdvisor and Advisor interface null contracts.
pratyush618 merged commit 5b62535 into main on Mar 12, 2026.
pratyush618 deleted the feat/p1-features branch on March 31, 2026 at 17:12.