Add P1 features: metrics, providers, datasets, integrations#5

Merged
pratyush618 merged 9 commits into main from feat/p1-features
Mar 12, 2026
Conversation

@pratyush618
Collaborator

Summary

  • Response quality metrics: Bias, Conciseness, Coherence, SemanticSimilarity (with embeddings module)
  • RAG metrics: ContextualPrecision, ContextualRecall, ContextualRelevancy
  • Agent metrics: ToolArgumentCorrectness, PlanQuality, PlanAdherence, RetrievalCompleteness
  • Conversation metrics: ConversationCoherence, ContextRetention
  • Ollama judge provider for local LLM evaluation
  • Cost tracking with budget limits and per-metric usage
  • YAML configuration loading (agenteval.yaml)
  • CSV/JSONL dataset formats with auto-detection and writers
  • JSON reporter for machine-readable evaluation output
  • Embeddings module (agenteval-embeddings) with OpenAI and Ollama providers
  • Spring AI integration (agenteval-spring-ai) with advisor interceptor and auto-configuration
  • LangChain4j integration (agenteval-langchain4j) with chat model and content retriever capture

Test plan

  • All 10 modules build successfully (mvn clean install)
  • Unit tests pass for all new metrics, providers, datasets, and integrations
  • SpotBugs and Checkstyle pass with no violations

Commits

Three new LLM-judge metrics extending LLMJudgeMetric:
- BiasMetric: evaluates output for bias across configurable dimensions
  (gender, race, religion, political, socioeconomic), threshold=0.5
- ConcisenessMetric: evaluates response brevity, threshold=0.5
- CoherenceMetric: evaluates logical flow and consistency, threshold=0.7

Includes prompt templates and unit tests for all three metrics.
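As a rough illustration of the template method pattern these metrics extend, here is a minimal sketch. The class and method names (`buildPrompt`, the injected judge `Function`, the `evaluate` shape) and the prompt wording are assumptions for illustration, not the library's actual API; only the class names and thresholds come from the commit message.

```java
import java.util.function.Function;

// Hypothetical sketch of an LLM-judge metric base class: subclasses supply the
// judge prompt, the base class applies the judge and the pass/fail threshold.
abstract class LLMJudgeMetricSketch {
    private final double threshold;

    protected LLMJudgeMetricSketch(double threshold) {
        this.threshold = threshold;
    }

    // Hook: subclasses build the prompt sent to the judge model.
    protected abstract String buildPrompt(String actualOutput);

    // Template method: prompt the judge, compare its 0..1 score to the threshold.
    public final boolean evaluate(String actualOutput, Function<String, Double> judge) {
        double score = judge.apply(buildPrompt(actualOutput));
        return score >= threshold;
    }
}

class ConcisenessMetricSketch extends LLMJudgeMetricSketch {
    ConcisenessMetricSketch() {
        super(0.5); // threshold from the description above
    }

    @Override
    protected String buildPrompt(String actualOutput) {
        return "Rate the conciseness of this response from 0 to 1:\n" + actualOutput;
    }
}
```

The judge is injected as a plain `Function` here so the sketch stays self-contained; the real metrics presumably call a configured JudgeModel instead.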

Three new LLM-judge metrics for evaluating retrieval-augmented generation:
- ContextualPrecisionMetric: measures relevance of retrieved context to
  the expected output, validates retrievalContext + expectedOutput
- ContextualRecallMetric: measures coverage of expected output by
  retrieved context
- ContextualRelevancyMetric: measures relevance of retrieved context to
  the input query

All use numbered context formatting ([1] doc1, [2] doc2) and threshold=0.7.
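The numbered context formatting mentioned above could look like the following sketch; the helper class and method names are hypothetical.

```java
import java.util.List;

// Formats retrieved context documents as "[1] doc1\n[2] doc2" for judge prompts.
final class ContextFormat {
    static String numbered(List<String> retrievalContext) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < retrievalContext.size(); i++) {
            if (i > 0) sb.append('\n');
            sb.append('[').append(i + 1).append("] ").append(retrievalContext.get(i));
        }
        return sb.toString();
    }
}
```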

Four new metrics for evaluating agent behavior:
- ToolArgumentCorrectnessMetric: deterministic metric comparing actual vs
  expected tool call arguments with optional strict mode
- PlanQualityMetric: LLM-judge metric evaluating reasoning trace quality
- PlanAdherenceMetric: LLM-judge metric checking execution against plan
- RetrievalCompletenessMetric: supports EXACT (set-intersection) and
  SEMANTIC (LLM-judge) match modes for context completeness
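The EXACT (set-intersection) match mode can be read as: the score is the fraction of expected context items that appear verbatim in the retrieved set. A minimal sketch, with hypothetical names:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// EXACT-mode completeness: |expected ∩ retrieved| / |expected|.
final class ExactCompleteness {
    static double score(List<String> expected, List<String> retrieved) {
        if (expected.isEmpty()) return 1.0; // nothing required, trivially complete
        Set<String> hits = new HashSet<>(expected);
        hits.retainAll(new HashSet<>(retrieved));
        return (double) hits.size() / expected.size();
    }
}
```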

New ConversationMetric interface in core for multi-turn evaluation,
with LLMConversationMetric abstract base class following the same
template method pattern as LLMJudgeMetric.

- ConversationCoherenceMetric: evaluates logical flow across turns
- ContextRetentionMetric: evaluates whether the agent retains context
  from earlier turns

Formats conversation turns as "Turn N [USER/AGENT]: ..." for prompts.
Updates SpotBugs exclusions for constructor-throw pattern.
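The "Turn N [USER/AGENT]: ..." prompt formatting described above could be sketched as follows; the `Turn` record and helper are hypothetical stand-ins for the library's conversation types.

```java
import java.util.List;

// Renders a multi-turn conversation for a judge prompt, one line per turn.
final class ConversationFormat {
    record Turn(String role, String content) {}

    static String render(List<Turn> turns) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < turns.size(); i++) {
            Turn t = turns.get(i);
            sb.append("Turn ").append(i + 1)
              .append(" [").append(t.role()).append("]: ")
              .append(t.content()).append('\n');
        }
        return sb.toString();
    }
}
```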

New agenteval-embeddings module with OpenAI and Ollama embedding providers
using java.net.http client. Includes EmbeddingModels factory, config
builder, and HTTP transport layer.

SemanticSimilarityMetric in agenteval-metrics uses cosine similarity
between embedded actual and expected outputs (deterministic, no LLM judge).

Updates root POM with new module and dependency management entries.
Adds optional agenteval-embeddings dependency to agenteval-metrics.
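Cosine similarity between two embedding vectors is standard; this is purely illustrative of the scoring the commit describes, not the module's actual code.

```java
// Cosine similarity: dot(a, b) / (|a| * |b|), giving 1.0 for parallel vectors
// and 0.0 for orthogonal ones. Assumes equal-length, non-zero vectors.
final class Cosine {
    static double similarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```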

Dataset formats:
- CsvDatasetLoader/Writer: RFC 4180 CSV with pipe-separated lists
- JsonlDatasetLoader/Writer: one JSON object per line
- DatasetFormat enum with auto-detection by file extension
- DatasetLoaders factory using DatasetFormat.detect()
- EvalDataset.save(Path, DatasetFormat) overload
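Extension-based auto-detection presumably reduces to a suffix check along these lines; the enum and method here are a sketch, and the real DatasetFormat API may differ (for example in how unknown extensions are reported).

```java
import java.nio.file.Path;
import java.util.Locale;

// Detects a dataset format from a file's extension, case-insensitively.
enum DatasetFormatSketch {
    CSV, JSONL, JSON;

    static DatasetFormatSketch detect(Path path) {
        String name = path.getFileName().toString().toLowerCase(Locale.ROOT);
        if (name.endsWith(".csv")) return CSV;
        if (name.endsWith(".jsonl")) return JSONL;
        if (name.endsWith(".json")) return JSON;
        throw new IllegalArgumentException("Unrecognized dataset extension: " + name);
    }
}
```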

Reporting:
- JsonReporter: serializes EvalResult to JSON via Jackson

Updates DatasetArgumentsProvider to auto-detect .json/.jsonl/.csv.
Adds jackson-databind dependency to agenteval-reporting.

Ollama:
- OllamaJudgeModel: POST /api/chat with JSON format, no API key required
- JudgeModels.ollama() factory methods
- JudgeConfig.apiKey now nullable (Ollama doesn't need one);
  null checks moved to OpenAI/Anthropic provider constructors
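The kind of request OllamaJudgeModel is described as making (POST /api/chat with JSON format, no Authorization header) can be sketched with `java.net.http`. The request is built but not sent; the base URL and model name are placeholders, and real code would JSON-escape the prompt.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Builds an Ollama /api/chat request: JSON body, no API key header.
final class OllamaRequestSketch {
    static HttpRequest build(String baseUrl, String model, String prompt) {
        // NOTE: prompt is interpolated verbatim; production code must JSON-escape it.
        String body = """
            {"model":"%s","stream":false,"format":"json",
             "messages":[{"role":"user","content":"%s"}]}""".formatted(model, prompt);
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/chat"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```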

Cost tracking:
- PricingModel, CostSummary, CostTracker (thread-safe with atomics)
- CostTrackingJudgeModel decorator wrapping any JudgeModel
- BudgetExceededException for budget enforcement
- AgentEvalConfig gains costBudget and pricingModel fields
- EvalResult gains costSummary accessor
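A thread-safe tracker with atomics and budget enforcement, as described, could look roughly like this. Class and method names are assumptions, the currency unit is arbitrary, and `IllegalStateException` stands in for the PR's BudgetExceededException.

```java
import java.util.concurrent.atomic.AtomicLong;

// Accumulates per-call cost atomically and throws once the budget is exceeded.
final class CostTrackerSketch {
    private final AtomicLong totalMicroCents = new AtomicLong();
    private final long budgetMicroCents;

    CostTrackerSketch(long budgetMicroCents) {
        this.budgetMicroCents = budgetMicroCents;
    }

    /** Records one call's cost; throws when the running total exceeds the budget. */
    void record(long microCents) {
        long total = totalMicroCents.addAndGet(microCents);
        if (total > budgetMicroCents) {
            // Stand-in for BudgetExceededException.
            throw new IllegalStateException("Budget exceeded: " + total + " > " + budgetMicroCents);
        }
    }

    long total() {
        return totalMicroCents.get();
    }
}
```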

YAML config:
- AgentEvalConfigLoader: loads agenteval.yaml with ${ENV_VAR} resolution
- YamlConfigModel POJO for Jackson YAML deserialization
- Optional jackson-dataformat-yaml dependency in core
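The `${ENV_VAR}` resolution mentioned above amounts to placeholder substitution over the loaded YAML values. A self-contained sketch, with the lookup injected as a function so it does not depend on real environment variables (names here are hypothetical):

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Replaces ${VAR} placeholders in a config value using the supplied lookup.
final class EnvResolver {
    private static final Pattern VAR = Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*)}");

    static String resolve(String value, Function<String, String> env) {
        Matcher m = VAR.matcher(value);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = env.apply(m.group(1));
            if (replacement == null) {
                throw new IllegalArgumentException("Unset variable: " + m.group(1));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

In production use the lookup would be `System::getenv`; injecting it keeps the sketch testable.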

agenteval-spring-ai:
- SpringAiCapture: wraps ChatModel calls as AgentTestCase
- SpringAiTestCaseBuilder: converts ChatResponse to AgentTestCase
- SpringAiAdvisorInterceptor: CallAdvisor capturing RAG retrieval context
- AgentEvalAutoConfiguration for Spring Boot auto-config
- Uses Spring AI 1.0 GA artifacts (spring-ai-model, spring-ai-client-chat)

agenteval-langchain4j:
- LangChain4jCapture: wraps ChatLanguageModel calls as AgentTestCase
- LangChain4jTestCaseBuilder: converts AiMessage response to AgentTestCase
- LangChain4jContentRetrieverCapture: wraps ContentRetriever for context

Both modules use provided-scope dependencies so users bring their own
framework version.

Match Spring AI's CallAdvisor and Advisor interface null contracts.
pratyush618 merged commit 5b62535 into main on Mar 12, 2026.
pratyush618 deleted the feat/p1-features branch on March 31, 2026 at 17:12.