Add judge and metrics modules (Phase 2)#3

Merged
pratyush618 merged 1 commit into main from feat/judge-and-metrics
Mar 12, 2026
Conversation

@pratyush618 (Collaborator)

Summary

  • Add agenteval-judge module with OpenAI and Anthropic LLM-as-judge providers, HTTP client with exponential backoff retry, and JSON-first response parsing
  • Add agenteval-metrics module with all 7 P0 evaluation metrics: AnswerRelevancy, Faithfulness, Correctness (G-Eval), Hallucination, Toxicity, ToolSelectionAccuracy, TaskCompletion
  • Add shared infrastructure: LLMJudgeMetric abstract base class, PromptTemplate classpath loader, JudgeModels static factory

Details

agenteval-judge (12 source files, 6 test files)

  • JudgeConfig builder with apiKey, model, baseUrl, timeout, maxRetries, temperature
  • HttpJudgeClient with exponential backoff + jitter, Retry-After header support
  • JudgeResponseParser: 3-tier score extraction (clean JSON → regex JSON block → bare score)
  • OpenAiJudgeModel (POST /v1/chat/completions, json_object response format)
  • AnthropicJudgeModel (POST /v1/messages, x-api-key + anthropic-version headers)
  • JudgeModels.openai() / .anthropic() factory with env var API key resolution
  • Unchecked exception hierarchy: JudgeException, JudgeRateLimitException, JudgeTimeoutException
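
The retry behavior described for HttpJudgeClient can be sketched roughly as below. This is a minimal illustration, not the module's actual code; the class and method names are assumptions, and the real client presumably threads this through its HTTP call loop:

```java
import java.util.OptionalLong;
import java.util.concurrent.ThreadLocalRandom;

/** Sketch of exponential backoff with full jitter and Retry-After override. */
final class BackoffSketch {
    static long nextDelayMillis(int attempt, long baseMillis, long capMillis,
                                OptionalLong retryAfterMillis) {
        // A server-provided Retry-After value takes precedence over the computed delay.
        if (retryAfterMillis.isPresent()) {
            return Math.min(retryAfterMillis.getAsLong(), capMillis);
        }
        // Exponential growth: base * 2^attempt, capped to avoid overflow and runaway waits.
        long exp = Math.min(baseMillis * (1L << Math.min(attempt, 20)), capMillis);
        // Full jitter: a uniform draw in [0, exp] de-synchronizes concurrent retriers.
        return ThreadLocalRandom.current().nextLong(exp + 1);
    }
}
```

Full jitter (rather than a fixed multiplier) is a common choice here because it avoids thundering-herd retries when many clients hit a rate limit simultaneously.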

agenteval-metrics (10 source files, 6 prompt templates, 9 test files)

  • LLMJudgeMetric abstract base with final evaluate() template method lifecycle
  • PromptTemplate with classpath resource loading, {{variable}} substitution, caching
  • 5 response quality metrics: AnswerRelevancy, Faithfulness, Correctness (G-Eval), Hallucination, Toxicity
  • 2 agent metrics: ToolSelectionAccuracy (deterministic F1/LCS), TaskCompletion (LLM-as-judge)
  • 6 prompt templates as classpath .txt resources
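
The {{variable}} substitution in PromptTemplate can be approximated with a regex pass like the following. This is a sketch under assumed names (the real class also does classpath loading and caching, which are omitted here):

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of {{variable}} substitution for prompt templates. */
final class TemplateSketch {
    private static final Pattern VAR = Pattern.compile("\\{\\{(\\w+)\\}\\}");

    static String render(String template, Map<String, String> vars) {
        Matcher m = VAR.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = vars.get(m.group(1));
            if (value == null) {
                // Failing fast on a missing variable surfaces template bugs in tests.
                throw new IllegalArgumentException("Missing variable: " + m.group(1));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, `render("Rate this answer: {{answer}}", Map.of("answer", "42"))` yields `"Rate this answer: 42"`.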

Other changes

  • Root POM: added modules, mockito 5.14.2, inter-module dependency management
  • spotbugs-exclude.xml: added exclusion for abstract constructor throw pattern

Test plan

  • mvn clean install passes — 180 tests (71 core + 39 judge + 70 metrics), 0 failures
  • All unit tests use mocked JudgeModel — no API keys needed for CI
  • Checkstyle, SpotBugs, and -Werror all pass
  • Verify deterministic ToolSelectionAccuracyMetric scoring with edge cases
  • Verify prompt templates render correctly with all variable combinations
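
The deterministic ToolSelectionAccuracy scoring mentioned above could look roughly like this set-based F1 over tool names. This is an illustrative sketch, not the module's implementation; the actual metric also reportedly uses LCS to account for call ordering, which is not shown here:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of a set-based F1 score over expected vs. actual tool names. */
final class ToolF1Sketch {
    static double f1(List<String> expected, List<String> actual) {
        // Edge case: nothing expected and nothing called counts as a perfect match.
        if (expected.isEmpty() && actual.isEmpty()) return 1.0;
        if (expected.isEmpty() || actual.isEmpty()) return 0.0;
        Set<String> exp = new HashSet<>(expected);
        Set<String> act = new HashSet<>(actual);
        long hits = act.stream().filter(exp::contains).count();
        if (hits == 0) return 0.0;
        double precision = (double) hits / act.size();
        double recall = (double) hits / exp.size();
        // Harmonic mean of precision and recall.
        return 2 * precision * recall / (precision + recall);
    }
}
```

The empty/empty edge case is exactly the kind of boundary the test plan's "edge cases" item would need to pin down, since either 0.0 or 1.0 is defensible and the choice must be deterministic.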

Implement the judge module with OpenAI and Anthropic LLM providers,
and 7 P0 evaluation metrics for end-to-end agent evaluation.

Judge module:
- JudgeConfig builder, HttpJudgeClient with exponential backoff retry
- JudgeResponseParser (JSON-first, regex-fallback score extraction)
- OpenAiJudgeModel (/v1/chat/completions, json_object mode)
- AnthropicJudgeModel (/v1/messages, x-api-key auth)
- JudgeModels static factory with env var API key resolution
- Unchecked exception hierarchy (JudgeException, RateLimit, Timeout)

Metrics module:
- LLMJudgeMetric abstract base with template method lifecycle
- PromptTemplate classpath loader with {{variable}} substitution
- 5 response metrics: AnswerRelevancy, Faithfulness, Correctness
  (G-Eval), Hallucination, Toxicity
- 2 agent metrics: ToolSelectionAccuracy (deterministic F1/LCS),
  TaskCompletion (LLM-as-judge)
- 6 prompt templates as classpath .txt resources

180 tests pass (71 core + 39 judge + 70 metrics), no API keys needed.
@pratyush618 pratyush618 merged commit 6fe0bc0 into main Mar 12, 2026
@pratyush618 pratyush618 deleted the feat/judge-and-metrics branch March 31, 2026 17:13