Add judge and metrics modules (Phase 2)#3

Merged
pratyush618 merged 1 commit into main from feat/judge-and-metrics
Mar 12, 2026
Conversation

@pratyush618 (Collaborator)

Summary

  • Add agenteval-judge module with OpenAI and Anthropic LLM-as-judge providers, HTTP client with exponential backoff retry, and JSON-first response parsing
  • Add agenteval-metrics module with all 7 P0 evaluation metrics: AnswerRelevancy, Faithfulness, Correctness (G-Eval), Hallucination, Toxicity, ToolSelectionAccuracy, TaskCompletion
  • Add shared infrastructure: LLMJudgeMetric abstract base class, PromptTemplate classpath loader, JudgeModels static factory

Details

agenteval-judge (12 source files, 6 test files)

  • JudgeConfig builder with apiKey, model, baseUrl, timeout, maxRetries, temperature
  • HttpJudgeClient with exponential backoff + jitter, Retry-After header support
  • JudgeResponseParser: 3-tier score extraction (clean JSON → regex JSON block → bare score)
  • OpenAiJudgeModel (POST /v1/chat/completions, json_object response format)
  • AnthropicJudgeModel (POST /v1/messages, x-api-key + anthropic-version headers)
  • JudgeModels.openai() / .anthropic() factory with env var API key resolution
  • Unchecked exception hierarchy: JudgeException, JudgeRateLimitException, JudgeTimeoutException
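
The retry behavior described for HttpJudgeClient can be sketched roughly as below. This is a minimal illustration, not the module's actual code; the class and method names are assumptions, and the real client presumably threads this through its HTTP call loop:

```java
import java.util.OptionalLong;
import java.util.concurrent.ThreadLocalRandom;

/** Sketch of exponential backoff with full jitter and Retry-After override. */
final class BackoffSketch {
    static long nextDelayMillis(int attempt, long baseMillis, long capMillis,
                                OptionalLong retryAfterMillis) {
        // A server-provided Retry-After value takes precedence over the computed delay.
        if (retryAfterMillis.isPresent()) {
            return Math.min(retryAfterMillis.getAsLong(), capMillis);
        }
        // Exponential growth: base * 2^attempt, capped to avoid overflow and runaway waits.
        long exp = Math.min(baseMillis * (1L << Math.min(attempt, 20)), capMillis);
        // Full jitter: a uniform draw in [0, exp] de-synchronizes concurrent retriers.
        return ThreadLocalRandom.current().nextLong(exp + 1);
    }
}
```

Full jitter (rather than a fixed multiplier) is a common choice here because it avoids thundering-herd retries when many clients hit a rate limit simultaneously.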

agenteval-metrics (10 source files, 6 prompt templates, 9 test files)

  • LLMJudgeMetric abstract base with final evaluate() template method lifecycle
  • PromptTemplate with classpath resource loading, {{variable}} substitution, caching
  • 5 response quality metrics: AnswerRelevancy, Faithfulness, Correctness (G-Eval), Hallucination, Toxicity
  • 2 agent metrics: ToolSelectionAccuracy (deterministic F1/LCS), TaskCompletion (LLM-as-judge)
  • 6 prompt templates as classpath .txt resources
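
The {{variable}} substitution in PromptTemplate can be approximated with a regex pass like the following. This is a sketch under assumed names (the real class also does classpath loading and caching, which are omitted here):

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of {{variable}} substitution for prompt templates. */
final class TemplateSketch {
    private static final Pattern VAR = Pattern.compile("\\{\\{(\\w+)\\}\\}");

    static String render(String template, Map<String, String> vars) {
        Matcher m = VAR.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = vars.get(m.group(1));
            if (value == null) {
                // Failing fast on a missing variable surfaces template bugs in tests.
                throw new IllegalArgumentException("Missing variable: " + m.group(1));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, `render("Rate this answer: {{answer}}", Map.of("answer", "42"))` yields `"Rate this answer: 42"`.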

Other changes

  • Root POM: added modules, mockito 5.14.2, inter-module dependency management
  • spotbugs-exclude.xml: added exclusion for abstract constructor throw pattern

Test plan

  • mvn clean install passes — 180 tests (71 core + 39 judge + 70 metrics), 0 failures
  • All unit tests use mocked JudgeModel — no API keys needed for CI
  • Checkstyle, SpotBugs, and -Werror all pass
  • Verify deterministic ToolSelectionAccuracyMetric scoring with edge cases
  • Verify prompt templates render correctly with all variable combinations
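
The deterministic ToolSelectionAccuracy scoring mentioned above could look roughly like this set-based F1 over tool names. This is an illustrative sketch, not the module's implementation; the actual metric also reportedly uses LCS to account for call ordering, which is not shown here:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of a set-based F1 score over expected vs. actual tool names. */
final class ToolF1Sketch {
    static double f1(List<String> expected, List<String> actual) {
        // Edge case: nothing expected and nothing called counts as a perfect match.
        if (expected.isEmpty() && actual.isEmpty()) return 1.0;
        if (expected.isEmpty() || actual.isEmpty()) return 0.0;
        Set<String> exp = new HashSet<>(expected);
        Set<String> act = new HashSet<>(actual);
        long hits = act.stream().filter(exp::contains).count();
        if (hits == 0) return 0.0;
        double precision = (double) hits / act.size();
        double recall = (double) hits / exp.size();
        // Harmonic mean of precision and recall.
        return 2 * precision * recall / (precision + recall);
    }
}
```

The empty/empty edge case is exactly the kind of boundary the test plan's "edge cases" item would need to pin down, since either 0.0 or 1.0 is defensible and the choice must be deterministic.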

Implement the judge module with OpenAI and Anthropic LLM providers,
and 7 P0 evaluation metrics for end-to-end agent evaluation.

Judge module:
- JudgeConfig builder, HttpJudgeClient with exponential backoff retry
- JudgeResponseParser (JSON-first, regex-fallback score extraction)
- OpenAiJudgeModel (/v1/chat/completions, json_object mode)
- AnthropicJudgeModel (/v1/messages, x-api-key auth)
- JudgeModels static factory with env var API key resolution
- Unchecked exception hierarchy (JudgeException, RateLimit, Timeout)

Metrics module:
- LLMJudgeMetric abstract base with template method lifecycle
- PromptTemplate classpath loader with {{variable}} substitution
- 5 response metrics: AnswerRelevancy, Faithfulness, Correctness
  (G-Eval), Hallucination, Toxicity
- 2 agent metrics: ToolSelectionAccuracy (deterministic F1/LCS),
  TaskCompletion (LLM-as-judge)
- 6 prompt templates as classpath .txt resources

180 tests pass (71 core + 39 judge + 70 metrics), no API keys needed.
@pratyush618 pratyush618 merged commit 6fe0bc0 into main Mar 12, 2026
@pratyush618 pratyush618 deleted the feat/judge-and-metrics branch March 31, 2026 17:13