
Add agenteval-judge and agenteval-metrics modules (Phase 2)#2

Closed
pratyush618 wants to merge 1 commit into main from feat/judge-and-metrics

Conversation

@pratyush618
Collaborator

Implement the judge module with OpenAI and Anthropic LLM providers, and 7 P0 evaluation metrics for end-to-end agent evaluation.

Judge module:

  • JudgeConfig builder, HttpJudgeClient with exponential backoff retry
  • JudgeResponseParser (JSON-first, regex-fallback score extraction)
  • OpenAiJudgeModel (/v1/chat/completions, json_object mode)
  • AnthropicJudgeModel (/v1/messages, x-api-key auth)
  • JudgeModels static factory with env var API key resolution
  • Unchecked exception hierarchy (JudgeException, RateLimit, Timeout)
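
The retry and parsing behavior above can be sketched as follows. This is an illustrative approximation, not the PR's actual code: the class and method names (`JudgeSketch`, `extractScore`, `backoffMillis`) are assumptions, but the JSON-first/regex-fallback order and exponential backoff shape match the description.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of JSON-first, regex-fallback score extraction and
// exponential backoff; names are illustrative, not the PR's actual API.
public class JudgeSketch {

    // Try a strict "score": <number> JSON field first, then fall back to
    // the first bare number in free-form judge prose.
    static OptionalDouble extractScore(String raw) {
        Matcher json = Pattern
                .compile("\"score\"\\s*:\\s*(-?\\d+(?:\\.\\d+)?)")
                .matcher(raw);
        if (json.find()) {
            return OptionalDouble.of(Double.parseDouble(json.group(1)));
        }
        Matcher loose = Pattern.compile("(-?\\d+(?:\\.\\d+)?)").matcher(raw);
        if (loose.find()) {
            return OptionalDouble.of(Double.parseDouble(loose.group(1)));
        }
        return OptionalDouble.empty();
    }

    // Exponential backoff delay: base * 2^attempt, capped at a maximum.
    static long backoffMillis(int attempt, long baseMillis, long capMillis) {
        return Math.min(capMillis, baseMillis * (1L << attempt));
    }

    public static void main(String[] args) {
        System.out.println(extractScore("{\"score\": 0.85, \"reason\": \"ok\"}"));
        System.out.println(extractScore("I would rate this 7 out of 10"));
        System.out.println(backoffMillis(3, 250, 10_000));
    }
}
```

The two-stage parse is defensive: `json_object` mode (OpenAI) makes the JSON branch the common path, while the regex fallback keeps Anthropic or malformed responses from failing the whole evaluation.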

Metrics module:

  • LLMJudgeMetric abstract base with template method lifecycle
  • PromptTemplate classpath loader with {{variable}} substitution
  • 5 response metrics: AnswerRelevancy, Faithfulness, Correctness (G-Eval), Hallucination, Toxicity
  • 2 agent metrics: ToolSelectionAccuracy (deterministic F1/LCS), TaskCompletion (LLM-as-judge)
  • 6 prompt templates as classpath .txt resources
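
A minimal sketch of two of the mechanisms listed above: {{variable}} substitution and a deterministic F1 over tool names. The class and method names (`MetricsSketch`, `render`, `toolSelectionF1`) are hypothetical; the PR's real implementation may differ (e.g. in how it weighs ordering via LCS, which is omitted here).

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch, not the PR's actual API: {{variable}} prompt
// substitution and an order-insensitive F1 over selected tool names.
public class MetricsSketch {

    // Replace each {{key}} placeholder with its value from the map.
    static String render(String template, Map<String, String> vars) {
        String out = template;
        for (Map.Entry<String, String> e : vars.entrySet()) {
            out = out.replace("{{" + e.getKey() + "}}", e.getValue());
        }
        return out;
    }

    // F1 over expected vs. actual tool names, using set semantics.
    static double toolSelectionF1(List<String> expected, List<String> actual) {
        long tp = actual.stream().distinct().filter(expected::contains).count();
        if (tp == 0) return 0.0;
        double precision = (double) tp / actual.stream().distinct().count();
        double recall = (double) tp / expected.stream().distinct().count();
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        System.out.println(render("Rate the answer: {{answer}}",
                Map.of("answer", "42")));
        System.out.println(toolSelectionF1(
                List.of("search", "calculator"), List.of("search")));
    }
}
```

Because ToolSelectionAccuracy is deterministic, it needs no LLM call, which is consistent with the test suite running without API keys.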

@pratyush618 pratyush618 self-assigned this Mar 12, 2026
180 tests pass (71 core + 39 judge + 70 metrics), no API keys needed.
@pratyush618 pratyush618 force-pushed the feat/judge-and-metrics branch from e8a814e to 47e6c18 on March 12, 2026 at 11:05
@pratyush618
Collaborator Author

Recreating PR with clean history
