Skip to content

feat: opt-in function-calling for the Gemini SDK judge#409

Merged
IsmailMehdi merged 3 commits into
mainfrom
tools-support
Jun 3, 2026
Merged

feat: opt-in function-calling for the Gemini SDK judge#409
IsmailMehdi merged 3 commits into
mainfrom
tools-support

Conversation

@IsmailMehdi
Copy link
Copy Markdown
Collaborator

@IsmailMehdi IsmailMehdi commented Jun 3, 2026

Summary

LLM-judged scorers (BinaryRubricScorer, LlmRater, GoalCompletionRate, etc.) currently call self.model.generate(prompt) and get back text. The SDK-backed Gemini judge (gemini.py) wraps client.models.generate_content(...) with no tool-use loop, so it cannot ground rubric questions in external state — e.g. "did the agent name the most-recent Apache Beam version listed on beam.apache.org?" This PR adds an opt-in function-calling loop to the Gemini judge with one built-in tool (fetch_url).

  • New generators/models/tools/ package: Tool dataclass + name registry + fetch_url implementation.
  • GeminiGenerator reads a tools: list from YAML, builds the native types.Tool config once, and runs a loop bounded at MAX_TOOL_ITERATIONS=5. Configs without tools: take the original single-shot codepath unchanged.
  • Retry/backoff on rate-limit errors is extracted into a single _call_generate_content helper shared by both paths.
  • fetch_url: HTTPS only, SSRF guard via ipaddress against private/loopback/link-local/multicast/reserved IPs, 10s timeout, 50KB body cap, HTML→text via stdlib html.parser. Tool errors are returned as \"Error: ...\" strings so the model can react rather than crashing the judge.

Configuration

generator: gcp_vertex_gemini
vertex_model: gemini-2.5-pro
base_prompt: \"\"
tools:
  - fetch_url

Existing configs without tools: keep their current behavior bit-for-bit.

Out of scope

  • Any tools beyond fetch_url.
  • Plumbing tools through the CLI generators (they have MCP).

Known follow-ups

  • Redirect SSRF: urlopen follows 3xx without re-validating each hop's IP. A malicious site can 302 to a private host and bypass the upfront check. Mitigation requires a custom HTTPRedirectHandler or disabling redirects. Out of scope for this PR but flagged.
  • DNS TOCTOU: _is_blocked_host resolves once; urlopen resolves again. A rebinding attack can flip between the two. Out of scope.

Test plan

  • `uv run pytest evalbench/test/tools_fetch_url_test.py evalbench/test/gemini_tools_test.py evalbench/test/binaryrubricscorer_test.py` — 21/21 pass
    • 12 unit tests for fetch_url: scheme enforcement, SSRF reject (private/loopback/unresolvable), size cap, HTML strip, HTTP error, timeout, URL error.
    • 6 unit tests for the Gemini loop: no-tools single-shot preserved, unknown tool fails fast, no-call-emit returns text, tool invoked + FunctionResponse threaded back, iteration cap, tool exception → error string.
    • 3 existing rubric tests still pass.
  • `uv run pycodestyle --max-line-length 100 evalbench/generators/models/gemini.py evalbench/generators/models/tools/ evalbench/test/gemini_tools_test.py evalbench/test/tools_fetch_url_test.py` — clean.
  • Manual smoke against live Vertex with a rubric like "did the agent's final answer mention Apache Beam version 2.x" — not yet run.

Adds a tools/ package (Tool dataclass, registry) and a fetch_url tool that
the GeminiGenerator can invoke via the google.genai function-calling loop.
Tools are opt-in per judge via a `tools:` YAML list; configs without the
key take the existing single-shot codepath unchanged.

The fetch_url tool restricts to public HTTPS hosts (SSRF guard via
ipaddress on each resolved IP), times out at 10s, caps responses at 50KB,
and strips HTML to text. Tool errors are returned as `Error: ...` strings
so the model can react rather than the judge crashing.

The retry/backoff loop on rate-limit errors is extracted into a single
_call_generate_content helper shared by both the single-shot and tool-loop
paths. The tool loop is bounded at MAX_TOOL_ITERATIONS=5.

Tested: 12 unit tests for fetch_url (scheme, SSRF, size cap, HTML strip,
HTTP error, timeout, URL error); 6 unit tests for the Gemini loop
(single-shot preserved, unknown tool fails fast, no-call-emit returns
text, tool invoked + FunctionResponse threaded back, iteration cap,
tool-exception-as-error-string); existing binaryrubricscorer tests still
pass.

Claude judge tool support deferred to a follow-up.
@IsmailMehdi
Copy link
Copy Markdown
Collaborator Author

/gcbrun

Comment thread evalbench/generators/models/tools/fetch_url.py Fixed
Comment thread evalbench/test/tools_fetch_url_test.py Fixed
Comment thread evalbench/generators/models/gemini.py Fixed
… to add tools

Documents the feature added in the previous commit: activation via the tools: YAML key, the fetch_url tool behavior and constraints, an end-to-end example wiring it to BinaryRubricScorer for a Beam-version rubric, security notes (HTTPS-only, SSRF guard, known redirect/TOCTOU gaps), and a recipe for adding new tools.
…test import, assert retry-loop unreachable

github-code-quality bot flagged three items on #409:

1. fetch_url.py: empty 'except: pass' on HTML extraction now logs at debug with exc_info, with a comment that fallback to raw text is intentional.

2. tools_fetch_url_test.py: unused 'import pytest' removed.

3. gemini.py: _call_generate_content fell off the retry loop with an implicit None return. The loop always returns or raises, but the invariant is now explicit via a raise RuntimeError after the loop so static analyzers do not flag the fallthrough.
@IsmailMehdi
Copy link
Copy Markdown
Collaborator Author

/gcbrun

@IsmailMehdi IsmailMehdi merged commit d97f511 into main Jun 3, 2026
10 checks passed
erinlimbogoogle pushed a commit that referenced this pull request Jun 5, 2026
* feat: opt-in function-calling for the Gemini SDK judge

Adds a tools/ package (Tool dataclass, registry) and a fetch_url tool that
the GeminiGenerator can invoke via the google.genai function-calling loop.
Tools are opt-in per judge via a `tools:` YAML list; configs without the
key take the existing single-shot codepath unchanged.

The fetch_url tool restricts to public HTTPS hosts (SSRF guard via
ipaddress on each resolved IP), times out at 10s, caps responses at 50KB,
and strips HTML to text. Tool errors are returned as `Error: ...` strings
so the model can react rather than the judge crashing.

The retry/backoff loop on rate-limit errors is extracted into a single
_call_generate_content helper shared by both the single-shot and tool-loop
paths. The tool loop is bounded at MAX_TOOL_ITERATIONS=5.

Tested: 12 unit tests for fetch_url (scheme, SSRF, size cap, HTML strip,
HTTP error, timeout, URL error); 6 unit tests for the Gemini loop
(single-shot preserved, unknown tool fails fast, no-call-emit returns
text, tool invoked + FunctionResponse threaded back, iteration cap,
tool-exception-as-error-string); existing binaryrubricscorer tests still
pass.

Claude judge tool support deferred to a follow-up.

* docs: add judge_tools.md covering opt-in tool use, fetch_url, and how to add tools

Documents the feature added in the previous commit: activation via the tools: YAML key, the fetch_url tool behavior and constraints, an end-to-end example wiring it to BinaryRubricScorer for a Beam-version rubric, security notes (HTTPS-only, SSRF guard, known redirect/TOCTOU gaps), and a recipe for adding new tools.

* address review bot nits: log HTML extraction failures, drop unused pytest import, assert retry-loop unreachable

github-code-quality bot flagged three items on #409:

1. fetch_url.py: empty 'except: pass' on HTML extraction now logs at debug with exc_info, with a comment that fallback to raw text is intentional.

2. tools_fetch_url_test.py: unused 'import pytest' removed.

3. gemini.py: _call_generate_content fell off the retry loop with an implicit None return. The loop always returns or raises, but the invariant is now explicit via a raise RuntimeError after the loop so static analyzers do not flag the fallthrough.
IsmailMehdi added a commit that referenced this pull request Jun 6, 2026
* feat: opt-in function-calling for the Gemini SDK judge

Adds a tools/ package (Tool dataclass, registry) and a fetch_url tool that
the GeminiGenerator can invoke via the google.genai function-calling loop.
Tools are opt-in per judge via a `tools:` YAML list; configs without the
key take the existing single-shot codepath unchanged.

The fetch_url tool restricts to public HTTPS hosts (SSRF guard via
ipaddress on each resolved IP), times out at 10s, caps responses at 50KB,
and strips HTML to text. Tool errors are returned as `Error: ...` strings
so the model can react rather than the judge crashing.

The retry/backoff loop on rate-limit errors is extracted into a single
_call_generate_content helper shared by both the single-shot and tool-loop
paths. The tool loop is bounded at MAX_TOOL_ITERATIONS=5.

Tested: 12 unit tests for fetch_url (scheme, SSRF, size cap, HTML strip,
HTTP error, timeout, URL error); 6 unit tests for the Gemini loop
(single-shot preserved, unknown tool fails fast, no-call-emit returns
text, tool invoked + FunctionResponse threaded back, iteration cap,
tool-exception-as-error-string); existing binaryrubricscorer tests still
pass.

Claude judge tool support deferred to a follow-up.

* docs: add judge_tools.md covering opt-in tool use, fetch_url, and how to add tools

Documents the feature added in the previous commit: activation via the tools: YAML key, the fetch_url tool behavior and constraints, an end-to-end example wiring it to BinaryRubricScorer for a Beam-version rubric, security notes (HTTPS-only, SSRF guard, known redirect/TOCTOU gaps), and a recipe for adding new tools.

* address review bot nits: log HTML extraction failures, drop unused pytest import, assert retry-loop unreachable

github-code-quality bot flagged three items on #409:

1. fetch_url.py: empty 'except: pass' on HTML extraction now logs at debug with exc_info, with a comment that fallback to raw text is intentional.

2. tools_fetch_url_test.py: unused 'import pytest' removed.

3. gemini.py: _call_generate_content fell off the retry loop with an implicit None return. The loop always returns or raises, but the invariant is now explicit via a raise RuntimeError after the loop so static analyzers do not flag the fallthrough.

* docs: add architecture documentation including external dependency graph mapping
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants