
Add 4 judge providers and TrajectoryOptimality metric#9

Merged
pratyush618 merged 3 commits into main from feat/p5-providers-and-metrics
Mar 13, 2026

Conversation

@pratyush618
Collaborator

Summary

  • Google Gemini judge provider — generateContent API with x-goog-api-key header auth and responseMimeType: application/json
  • Azure OpenAI judge provider — Azure deployments endpoint with api-key header, configurable apiVersion
  • Amazon Bedrock judge provider — AWS SigV4 request signing, Anthropic Messages API format for Claude models
  • Custom HTTP judge provider — OpenAI-compatible endpoint for vLLM, LiteLLM, LocalAI, and other self-hosted servers
  • TrajectoryOptimality metric — LLM-as-judge metric evaluating agent path efficiency, penalizing redundant tool calls, circular reasoning, and unnecessary steps; configurable maxSteps
  • JudgeModels factory updated with google(), azure(), bedrock(), custom() methods
  • 38 new tests, 632 total (all passing)
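
As a rough illustration of the Gemini provider's request shape, here is a self-contained sketch. The endpoint path, `x-goog-api-key` header, and `responseMimeType` field follow the public Gemini `generateContent` REST API; the class, method names, and model name are hypothetical and not the PR's actual `GoogleJudgeModel` code.

```java
// Illustrative sketch of a Gemini generateContent request, not the PR's
// actual GoogleJudgeModel. The URL and body shapes follow the public
// Gemini REST API; class/method names here are hypothetical.
public class GeminiRequestSketch {

    static String buildUrl(String model) {
        // The API key travels in the x-goog-api-key header, not the URL
        return "https://generativelanguage.googleapis.com/v1beta/models/"
                + model + ":generateContent";
    }

    static String buildBody(String prompt) {
        // responseMimeType asks Gemini to return strict JSON, which makes
        // judge verdicts easy to parse
        return "{"
                + "\"contents\":[{\"parts\":[{\"text\":\"" + prompt + "\"}]}],"
                + "\"generationConfig\":{\"responseMimeType\":\"application/json\"}"
                + "}";
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("gemini-1.5-pro"));
        System.out.println(buildBody("Rate this answer from 0 to 1."));
    }
}
```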

Test plan

  • mvn clean install — all 17 modules build, 632 tests pass
  • Checkstyle + SpotBugs: 0 violations
  • Google: request URL, auth header, content/token extraction, empty candidates
  • Azure: deployment URL with api-version, api-key header, default version fallback
  • Bedrock: SigV4 auth headers, region extraction, session token, content parsing
  • Custom: optional Bearer auth, OpenAI-compatible format, no-key mode
  • TrajectoryOptimality: optimal/suboptimal scoring, maxSteps, validation, defaults
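
The Bedrock tests above exercise SigV4 auth headers. The core of that signature is a four-step HMAC-SHA256 key-derivation chain defined by AWS (date, then region, then service, then the literal `aws4_request`). A minimal self-contained sketch of that chain, not the PR's `BedrockJudgeModel` implementation:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

// Minimal sketch of the AWS SigV4 signing-key derivation chain, as
// documented by AWS. This is not the PR's BedrockJudgeModel code.
public class SigV4Sketch {

    static byte[] hmacSha256(byte[] key, String data) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return mac.doFinal(data.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Derives the signing key: secret -> date -> region -> service -> "aws4_request". */
    static byte[] signingKey(String secret, String date, String region, String service) {
        byte[] kDate = hmacSha256(("AWS4" + secret).getBytes(StandardCharsets.UTF_8), date);
        byte[] kRegion = hmacSha256(kDate, region);
        byte[] kService = hmacSha256(kRegion, service);
        return hmacSha256(kService, "aws4_request");
    }

    public static void main(String[] args) {
        byte[] key = signingKey("EXAMPLEKEY", "20260313", "us-east-1", "bedrock-runtime");
        System.out.println("signing key length: " + key.length); // HMAC-SHA256 -> 32 bytes
    }
}
```

The derived key then signs the canonical request's string-to-sign; region and service must match the endpoint being called.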

Commits

- GoogleJudgeModel: Gemini generateContent API with x-goog-api-key auth
- AzureOpenAiJudgeModel: Azure deployments endpoint with api-key header
- BedrockJudgeModel: AWS SigV4 signing with Anthropic Messages format
- CustomHttpJudgeModel: OpenAI-compatible endpoint for vLLM/LiteLLM/LocalAI
- JudgeModels factory updated with google(), azure(), bedrock(), custom()
- 30 new tests across all providers
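
For the custom provider, "OpenAI-compatible" and "no-key mode" have a concrete shape: a chat-completions JSON body, with the `Authorization: Bearer` header sent only when a key is configured. A hypothetical sketch under those assumptions (not the PR's `CustomHttpJudgeModel`):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of an OpenAI-compatible chat request for a self-hosted server
// (vLLM, LiteLLM, LocalAI). Hypothetical helper, not CustomHttpJudgeModel.
public class CustomHttpSketch {

    /** Builds headers; Authorization is omitted when no key is set ("no-key mode"). */
    static Map<String, String> headers(String apiKeyOrNull) {
        Map<String, String> h = new LinkedHashMap<>();
        h.put("Content-Type", "application/json");
        if (apiKeyOrNull != null && !apiKeyOrNull.isEmpty()) {
            h.put("Authorization", "Bearer " + apiKeyOrNull);
        }
        return h;
    }

    static String body(String model, String prompt) {
        // Standard OpenAI chat-completions body; self-hosted servers accept
        // whatever model name they were launched with
        return "{\"model\":\"" + model + "\","
                + "\"messages\":[{\"role\":\"user\",\"content\":\"" + prompt + "\"}]}";
    }

    public static void main(String[] args) {
        System.out.println(headers(null));        // no Authorization header
        System.out.println(headers("sk-local"));  // Bearer auth
        System.out.println(body("llama-3-8b", "Score this trajectory."));
    }
}
```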
- LLM-as-judge metric that penalizes redundant tool calls, circular
  reasoning, and unnecessary steps
- Configurable maxSteps parameter for step count bounds
- Validates on reasoningTrace or toolCalls (actualOutput optional)
- 8 tests covering optimal/suboptimal trajectories, validation, defaults
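
The metric itself is LLM-judged, but "redundant tool calls" has one concrete reading worth illustrating: the same tool invoked again with identical arguments. A deterministic sketch of counting such repeats in a trajectory (illustrative only; this is not the PR's scoring logic):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: the PR's TrajectoryOptimality metric delegates scoring
// to an LLM judge. This sketch shows one concrete notion of "redundant tool
// call" -- the same tool invoked twice with identical arguments.
public class TrajectorySketch {

    record ToolCall(String tool, String args) { }

    static int redundantCalls(List<ToolCall> trajectory) {
        Set<ToolCall> seen = new HashSet<>();
        int redundant = 0;
        for (ToolCall call : trajectory) {
            if (!seen.add(call)) {
                redundant++; // exact repeat of an earlier call
            }
        }
        return redundant;
    }

    public static void main(String[] args) {
        List<ToolCall> path = List.of(
                new ToolCall("search", "{\"q\":\"weather\"}"),
                new ToolCall("search", "{\"q\":\"weather\"}"), // redundant
                new ToolCall("fetch", "{\"url\":\"a\"}"));
        System.out.println(redundantCalls(path)); // prints 1
    }
}
```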
- Gradle plugin tests: replace findByName/findByType with getByName/getByType
  to eliminate nullable dereference warnings (AssertJ isNotNull doesn't
  satisfy IDE null analysis)
- Remove unused AZURE_OPENAI_API_KEY_ENV constant from JudgeModels
pratyush618 merged commit 08754f4 into main on Mar 13, 2026.
pratyush618 deleted the feat/p5-providers-and-metrics branch on Mar 31, 2026.