feat: opt-in function-calling for the Gemini SDK judge by IsmailMehdi · Pull Request #409 · GoogleCloudPlatform/evalbench

IsmailMehdi · 2026-06-03T02:56:06Z

Summary

LLM-judged scorers (BinaryRubricScorer, LlmRater, GoalCompletionRate, etc.) currently call self.model.generate(prompt) and get back text. The SDK-backed Gemini judge (gemini.py) wraps client.models.generate_content(...) with no tool-use loop, so it cannot ground rubric questions in external state — e.g. "did the agent name the most-recent Apache Beam version listed on beam.apache.org?" This PR adds an opt-in function-calling loop to the Gemini judge with one built-in tool (fetch_url).

New generators/models/tools/ package: Tool dataclass + name registry + fetch_url implementation.
GeminiGenerator reads a tools: list from YAML, builds the native types.Tool config once, and runs a loop bounded at MAX_TOOL_ITERATIONS=5. Configs without tools: take the original single-shot codepath unchanged.
Retry/backoff on rate-limit errors is extracted into a single _call_generate_content helper shared by both paths.
fetch_url: HTTPS only, SSRF guard via ipaddress against private/loopback/link-local/multicast/reserved IPs, 10s timeout, 50KB body cap, HTML→text via stdlib html.parser. Tool errors are returned as \"Error: ...\" strings so the model can react rather than crashing the judge.

Configuration

generator: gcp_vertex_gemini
vertex_model: gemini-2.5-pro
base_prompt: \"\"
tools:
  - fetch_url

Existing configs without tools: keep their current behavior bit-for-bit.

Out of scope

Any tools beyond fetch_url.
Plumbing tools through the CLI generators (they have MCP).

Known follow-ups

Redirect SSRF: urlopen follows 3xx without re-validating each hop's IP. A malicious site can 302 to a private host and bypass the upfront check. Mitigation requires a custom HTTPRedirectHandler or disabling redirects. Out of scope for this PR but flagged.
DNS TOCTOU: _is_blocked_host resolves once; urlopen resolves again. A rebinding attack can flip between the two. Out of scope.

Test plan

`uv run pytest evalbench/test/tools_fetch_url_test.py evalbench/test/gemini_tools_test.py evalbench/test/binaryrubricscorer_test.py` — 21/21 pass
- 12 unit tests for fetch_url: scheme enforcement, SSRF reject (private/loopback/unresolvable), size cap, HTML strip, HTTP error, timeout, URL error.
- 6 unit tests for the Gemini loop: no-tools single-shot preserved, unknown tool fails fast, no-call-emit returns text, tool invoked + FunctionResponse threaded back, iteration cap, tool exception → error string.
- 3 existing rubric tests still pass.
`uv run pycodestyle --max-line-length 100 evalbench/generators/models/gemini.py evalbench/generators/models/tools/ evalbench/test/gemini_tools_test.py evalbench/test/tools_fetch_url_test.py` — clean.
Manual smoke against live Vertex with a rubric like "did the agent's final answer mention Apache Beam version 2.x" — not yet run.

Adds a tools/ package (Tool dataclass, registry) and a fetch_url tool that the GeminiGenerator can invoke via the google.genai function-calling loop. Tools are opt-in per judge via a `tools:` YAML list; configs without the key take the existing single-shot codepath unchanged. The fetch_url tool restricts to public HTTPS hosts (SSRF guard via ipaddress on each resolved IP), times out at 10s, caps responses at 50KB, and strips HTML to text. Tool errors are returned as `Error: ...` strings so the model can react rather than the judge crashing. The retry/backoff loop on rate-limit errors is extracted into a single _call_generate_content helper shared by both the single-shot and tool-loop paths. The tool loop is bounded at MAX_TOOL_ITERATIONS=5. Tested: 12 unit tests for fetch_url (scheme, SSRF, size cap, HTML strip, HTTP error, timeout, URL error); 6 unit tests for the Gemini loop (single-shot preserved, unknown tool fails fast, no-call-emit returns text, tool invoked + FunctionResponse threaded back, iteration cap, tool-exception-as-error-string); existing binaryrubricscorer tests still pass. Claude judge tool support deferred to a follow-up.

IsmailMehdi · 2026-06-03T02:57:16Z

/gcbrun

… to add tools Documents the feature added in the previous commit: activation via the tools: YAML key, the fetch_url tool behavior and constraints, an end-to-end example wiring it to BinaryRubricScorer for a Beam-version rubric, security notes (HTTPS-only, SSRF guard, known redirect/TOCTOU gaps), and a recipe for adding new tools.

…test import, assert retry-loop unreachable github-code-quality bot flagged three items on #409: 1. fetch_url.py: empty 'except: pass' on HTML extraction now logs at debug with exc_info, with a comment that fallback to raw text is intentional. 2. tools_fetch_url_test.py: unused 'import pytest' removed. 3. gemini.py: _call_generate_content fell off the retry loop with an implicit None return. The loop always returns or raises, but the invariant is now explicit via a raise RuntimeError after the loop so static analyzers do not flag the fallthrough.

IsmailMehdi · 2026-06-03T03:07:48Z

/gcbrun

* feat: opt-in function-calling for the Gemini SDK judge Adds a tools/ package (Tool dataclass, registry) and a fetch_url tool that the GeminiGenerator can invoke via the google.genai function-calling loop. Tools are opt-in per judge via a `tools:` YAML list; configs without the key take the existing single-shot codepath unchanged. The fetch_url tool restricts to public HTTPS hosts (SSRF guard via ipaddress on each resolved IP), times out at 10s, caps responses at 50KB, and strips HTML to text. Tool errors are returned as `Error: ...` strings so the model can react rather than the judge crashing. The retry/backoff loop on rate-limit errors is extracted into a single _call_generate_content helper shared by both the single-shot and tool-loop paths. The tool loop is bounded at MAX_TOOL_ITERATIONS=5. Tested: 12 unit tests for fetch_url (scheme, SSRF, size cap, HTML strip, HTTP error, timeout, URL error); 6 unit tests for the Gemini loop (single-shot preserved, unknown tool fails fast, no-call-emit returns text, tool invoked + FunctionResponse threaded back, iteration cap, tool-exception-as-error-string); existing binaryrubricscorer tests still pass. Claude judge tool support deferred to a follow-up. * docs: add judge_tools.md covering opt-in tool use, fetch_url, and how to add tools Documents the feature added in the previous commit: activation via the tools: YAML key, the fetch_url tool behavior and constraints, an end-to-end example wiring it to BinaryRubricScorer for a Beam-version rubric, security notes (HTTPS-only, SSRF guard, known redirect/TOCTOU gaps), and a recipe for adding new tools. * address review bot nits: log HTML extraction failures, drop unused pytest import, assert retry-loop unreachable github-code-quality bot flagged three items on #409: 1. fetch_url.py: empty 'except: pass' on HTML extraction now logs at debug with exc_info, with a comment that fallback to raw text is intentional. 2. tools_fetch_url_test.py: unused 'import pytest' removed. 3. gemini.py: _call_generate_content fell off the retry loop with an implicit None return. The loop always returns or raises, but the invariant is now explicit via a raise RuntimeError after the loop so static analyzers do not flag the fallthrough.

* feat: opt-in function-calling for the Gemini SDK judge Adds a tools/ package (Tool dataclass, registry) and a fetch_url tool that the GeminiGenerator can invoke via the google.genai function-calling loop. Tools are opt-in per judge via a `tools:` YAML list; configs without the key take the existing single-shot codepath unchanged. The fetch_url tool restricts to public HTTPS hosts (SSRF guard via ipaddress on each resolved IP), times out at 10s, caps responses at 50KB, and strips HTML to text. Tool errors are returned as `Error: ...` strings so the model can react rather than the judge crashing. The retry/backoff loop on rate-limit errors is extracted into a single _call_generate_content helper shared by both the single-shot and tool-loop paths. The tool loop is bounded at MAX_TOOL_ITERATIONS=5. Tested: 12 unit tests for fetch_url (scheme, SSRF, size cap, HTML strip, HTTP error, timeout, URL error); 6 unit tests for the Gemini loop (single-shot preserved, unknown tool fails fast, no-call-emit returns text, tool invoked + FunctionResponse threaded back, iteration cap, tool-exception-as-error-string); existing binaryrubricscorer tests still pass. Claude judge tool support deferred to a follow-up. * docs: add judge_tools.md covering opt-in tool use, fetch_url, and how to add tools Documents the feature added in the previous commit: activation via the tools: YAML key, the fetch_url tool behavior and constraints, an end-to-end example wiring it to BinaryRubricScorer for a Beam-version rubric, security notes (HTTPS-only, SSRF guard, known redirect/TOCTOU gaps), and a recipe for adding new tools. * address review bot nits: log HTML extraction failures, drop unused pytest import, assert retry-loop unreachable github-code-quality bot flagged three items on #409: 1. fetch_url.py: empty 'except: pass' on HTML extraction now logs at debug with exc_info, with a comment that fallback to raw text is intentional. 2. tools_fetch_url_test.py: unused 'import pytest' removed. 3. gemini.py: _call_generate_content fell off the retry loop with an implicit None return. The loop always returns or raises, but the invariant is now explicit via a raise RuntimeError after the loop so static analyzers do not flag the fallthrough. * docs: add architecture documentation including external dependency graph mapping

IsmailMehdi requested a review from prernakakkar-google June 3, 2026 02:57

github-code-quality Bot found potential problems Jun 3, 2026

View reviewed changes

Comment thread evalbench/generators/models/tools/fetch_url.py Fixed

Comment thread evalbench/test/tools_fetch_url_test.py Fixed

Comment thread evalbench/generators/models/gemini.py Fixed

IsmailMehdi added 2 commits June 2, 2026 20:01

prernakakkar-google approved these changes Jun 3, 2026

View reviewed changes

IsmailMehdi merged commit d97f511 into main Jun 3, 2026
10 checks passed

release-please Bot mentioned this pull request Jun 3, 2026

chore(main): release 1.8.0 #380

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: opt-in function-calling for the Gemini SDK judge#409

feat: opt-in function-calling for the Gemini SDK judge#409
IsmailMehdi merged 3 commits into
mainfrom
tools-support

IsmailMehdi commented Jun 3, 2026 •

edited

Loading

Uh oh!

IsmailMehdi commented Jun 3, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

IsmailMehdi commented Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

IsmailMehdi commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Configuration

Out of scope

Known follow-ups

Test plan

Uh oh!

IsmailMehdi commented Jun 3, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

IsmailMehdi commented Jun 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

IsmailMehdi commented Jun 3, 2026 •

edited

Loading