
Add agenteval-judge and agenteval-metrics modules (Phase 2)#2

Closed
pratyush618 wants to merge 1 commit into main from feat/judge-and-metrics

Conversation

@pratyush618
Collaborator

Implement the judge module with OpenAI and Anthropic LLM providers, and 7 P0 evaluation metrics for end-to-end agent evaluation.

Judge module:

  • JudgeConfig builder, HttpJudgeClient with exponential backoff retry
  • JudgeResponseParser (JSON-first, regex-fallback score extraction)
  • OpenAiJudgeModel (/v1/chat/completions, json_object mode)
  • AnthropicJudgeModel (/v1/messages, x-api-key auth)
  • JudgeModels static factory with env var API key resolution
  • Unchecked exception hierarchy (JudgeException, RateLimit, Timeout)
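
The retry and parsing behavior above can be sketched as follows. This is an illustrative approximation, not the PR's actual code: the class and method names (`JudgeSketch`, `extractScore`, `backoffMillis`) are assumptions, but the JSON-first/regex-fallback order and exponential backoff shape match the description.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of JSON-first, regex-fallback score extraction and
// exponential backoff; names are illustrative, not the PR's actual API.
public class JudgeSketch {

    // Try a strict "score": <number> JSON field first, then fall back to
    // the first bare number in free-form judge prose.
    static OptionalDouble extractScore(String raw) {
        Matcher json = Pattern
                .compile("\"score\"\\s*:\\s*(-?\\d+(?:\\.\\d+)?)")
                .matcher(raw);
        if (json.find()) {
            return OptionalDouble.of(Double.parseDouble(json.group(1)));
        }
        Matcher loose = Pattern.compile("(-?\\d+(?:\\.\\d+)?)").matcher(raw);
        if (loose.find()) {
            return OptionalDouble.of(Double.parseDouble(loose.group(1)));
        }
        return OptionalDouble.empty();
    }

    // Exponential backoff delay: base * 2^attempt, capped at a maximum.
    static long backoffMillis(int attempt, long baseMillis, long capMillis) {
        return Math.min(capMillis, baseMillis * (1L << attempt));
    }

    public static void main(String[] args) {
        System.out.println(extractScore("{\"score\": 0.85, \"reason\": \"ok\"}"));
        System.out.println(extractScore("I would rate this 7 out of 10"));
        System.out.println(backoffMillis(3, 250, 10_000));
    }
}
```

The two-stage parse is defensive: `json_object` mode (OpenAI) makes the JSON branch the common path, while the regex fallback keeps Anthropic or malformed responses from failing the whole evaluation.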

Metrics module:

  • LLMJudgeMetric abstract base with template method lifecycle
  • PromptTemplate classpath loader with {{variable}} substitution
  • 5 response metrics: AnswerRelevancy, Faithfulness, Correctness (G-Eval), Hallucination, Toxicity
  • 2 agent metrics: ToolSelectionAccuracy (deterministic F1/LCS), TaskCompletion (LLM-as-judge)
  • 6 prompt templates as classpath .txt resources
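
A minimal sketch of two of the mechanisms listed above: {{variable}} substitution and a deterministic F1 over tool names. The class and method names (`MetricsSketch`, `render`, `toolSelectionF1`) are hypothetical; the PR's real implementation may differ (e.g. in how it weighs ordering via LCS, which is omitted here).

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch, not the PR's actual API: {{variable}} prompt
// substitution and an order-insensitive F1 over selected tool names.
public class MetricsSketch {

    // Replace each {{key}} placeholder with its value from the map.
    static String render(String template, Map<String, String> vars) {
        String out = template;
        for (Map.Entry<String, String> e : vars.entrySet()) {
            out = out.replace("{{" + e.getKey() + "}}", e.getValue());
        }
        return out;
    }

    // F1 over expected vs. actual tool names, using set semantics.
    static double toolSelectionF1(List<String> expected, List<String> actual) {
        long tp = actual.stream().distinct().filter(expected::contains).count();
        if (tp == 0) return 0.0;
        double precision = (double) tp / actual.stream().distinct().count();
        double recall = (double) tp / expected.stream().distinct().count();
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        System.out.println(render("Rate the answer: {{answer}}",
                Map.of("answer", "42")));
        System.out.println(toolSelectionF1(
                List.of("search", "calculator"), List.of("search")));
    }
}
```

Because ToolSelectionAccuracy is deterministic, it needs no LLM call, which is consistent with the test suite running without API keys.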

@pratyush618 pratyush618 self-assigned this Mar 12, 2026
180 tests pass (71 core + 39 judge + 70 metrics), no API keys needed.
@pratyush618 pratyush618 force-pushed the feat/judge-and-metrics branch from e8a814e to 47e6c18 on March 12, 2026 at 11:05
@pratyush618
Collaborator Author

Recreating PR with clean history
