feat: opt-in function-calling for the Gemini SDK judge #409
Merged
Adds a tools/ package (Tool dataclass, registry) and a fetch_url tool that the GeminiGenerator can invoke via the google.genai function-calling loop. Tools are opt-in per judge via a `tools:` YAML list; configs without the key take the existing single-shot codepath unchanged.

The fetch_url tool restricts to public HTTPS hosts (SSRF guard via ipaddress on each resolved IP), times out at 10s, caps responses at 50KB, and strips HTML to text. Tool errors are returned as `Error: ...` strings so the model can react rather than the judge crashing.

The retry/backoff loop on rate-limit errors is extracted into a single _call_generate_content helper shared by both the single-shot and tool-loop paths. The tool loop is bounded at MAX_TOOL_ITERATIONS=5.

Tested: 12 unit tests for fetch_url (scheme, SSRF, size cap, HTML strip, HTTP error, timeout, URL error); 6 unit tests for the Gemini loop (single-shot preserved, unknown tool fails fast, no-call-emit returns text, tool invoked + FunctionResponse threaded back, iteration cap, tool-exception-as-error-string); existing binaryrubricscorer tests still pass.

Claude judge tool support deferred to a follow-up.
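For readers who want a feel for the control flow, here is a minimal sketch of a bounded tool loop with errors surfaced as `Error: ...` strings. It deliberately avoids the google.genai SDK: `ModelTurn`, `run_tool_loop`, and the `ask_model` callable are hypothetical stand-ins, and raising when the cap is hit is an assumption, not necessarily what gemini.py does.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

MAX_TOOL_ITERATIONS = 5  # same bound the PR describes


@dataclass
class ModelTurn:
    """Hypothetical stand-in for one model reply: final text or a tool call."""
    text: Optional[str] = None
    tool_name: Optional[str] = None
    tool_args: Optional[dict] = None


History = List[Tuple[str, str]]


def run_tool_loop(
    ask_model: Callable[[History], ModelTurn],
    tools: Dict[str, Callable[..., str]],
    prompt: str,
) -> str:
    """Ask the model, run any requested tool, feed the result back, repeat."""
    history: History = [("user", prompt)]
    for _ in range(MAX_TOOL_ITERATIONS):
        turn = ask_model(history)
        if turn.tool_name is None:
            # No tool call emitted: the reply text is the judge's answer.
            return turn.text or ""
        try:
            result = tools[turn.tool_name](**(turn.tool_args or {}))
        except Exception as exc:
            # Tool failures become text the model can react to.
            result = f"Error: {exc}"
        history.append(("tool", f"{turn.tool_name} -> {result}"))
    # Assumption: this sketch fails loudly once the cap is exhausted.
    raise RuntimeError(f"no final answer after {MAX_TOOL_ITERATIONS} tool iterations")
```

The point of the `except` branch is that a flaky fetch degrades the judgment rather than aborting the whole scoring run.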
/gcbrun
docs: add judge_tools.md covering opt-in tool use, fetch_url, and how to add tools

Documents the feature added in the previous commit: activation via the `tools:` YAML key, the fetch_url tool behavior and constraints, an end-to-end example wiring it to BinaryRubricScorer for a Beam-version rubric, security notes (HTTPS-only, SSRF guard, known redirect/TOCTOU gaps), and a recipe for adding new tools.
address review bot nits: log HTML extraction failures, drop unused pytest import, assert retry-loop unreachable

github-code-quality bot flagged three items on #409:
1. fetch_url.py: empty 'except: pass' on HTML extraction now logs at debug with exc_info, with a comment that fallback to raw text is intentional.
2. tools_fetch_url_test.py: unused 'import pytest' removed.
3. gemini.py: _call_generate_content fell off the retry loop with an implicit None return. The loop always returns or raises, but the invariant is now explicit via a raise RuntimeError after the loop so static analyzers do not flag the fallthrough.
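Item 3 is easier to see in code. The following is a sketch of a retry helper whose loop always returns or raises, with the invariant made explicit after the loop. `RateLimitError`, `call_with_retry`, and the backoff parameters are hypothetical; the real `_call_generate_content` catches whatever the SDK raises and may use different delays.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the SDK's rate-limit exception."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff on rate-limit errors."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    # For max_attempts >= 1 the loop above always returns or raises, so this
    # line only runs if max_attempts < 1. Raising here keeps the function
    # from silently returning None.
    raise RuntimeError("retry loop exited without returning or raising")
```

Passing `sleep` as a parameter is only a testing convenience in this sketch; it lets a test record the delays instead of waiting.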
/gcbrun
prernakakkar-google approved these changes Jun 3, 2026
erinlimbogoogle pushed a commit that referenced this pull request Jun 5, 2026
* feat: opt-in function-calling for the Gemini SDK judge
* docs: add judge_tools.md covering opt-in tool use, fetch_url, and how to add tools
* address review bot nits: log HTML extraction failures, drop unused pytest import, assert retry-loop unreachable
IsmailMehdi added a commit that referenced this pull request Jun 6, 2026
* feat: opt-in function-calling for the Gemini SDK judge
* docs: add judge_tools.md covering opt-in tool use, fetch_url, and how to add tools
* address review bot nits: log HTML extraction failures, drop unused pytest import, assert retry-loop unreachable
* docs: add architecture documentation including external dependency graph mapping
Summary

LLM-judged scorers (`BinaryRubricScorer`, `LlmRater`, `GoalCompletionRate`, etc.) currently call `self.model.generate(prompt)` and get back text. The SDK-backed Gemini judge (`gemini.py`) wraps `client.models.generate_content(...)` with no tool-use loop, so it cannot ground rubric questions in external state, e.g. "did the agent name the most-recent Apache Beam version listed on beam.apache.org?" This PR adds an opt-in function-calling loop to the Gemini judge with one built-in tool (`fetch_url`).

* `generators/models/tools/` package: `Tool` dataclass + name registry + `fetch_url` implementation.
* `GeminiGenerator` reads a `tools:` list from YAML, builds the native `types.Tool` config once, and runs a loop bounded at `MAX_TOOL_ITERATIONS=5`. Configs without `tools:` take the original single-shot codepath unchanged.
* The retry/backoff loop on rate-limit errors is extracted into a single `_call_generate_content` helper shared by both paths.
* `fetch_url`: HTTPS only, SSRF guard via `ipaddress` against private/loopback/link-local/multicast/reserved IPs, 10s timeout, 50KB body cap, HTML→text via stdlib `html.parser`. Tool errors are returned as `"Error: ..."` strings so the model can react rather than crashing the judge.

Configuration
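As a rough illustration of how a `tools:` list might be resolved against a name registry, with unknown names failing fast, consider the sketch below. The `Tool` field names, `register`, `resolve_tools`, and the placeholder `fetch_url` body are all hypothetical; the PR's actual dataclass and registry may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Tool:
    """Hypothetical shape: a name, a description for the model, a callable."""
    name: str
    description: str
    func: Callable[..., str]


_REGISTRY: Dict[str, Tool] = {}


def register(tool: Tool) -> Tool:
    """Make a tool available under its name."""
    _REGISTRY[tool.name] = tool
    return tool


def resolve_tools(config: dict) -> List[Tool]:
    """Map the config's `tools:` names to Tool objects; reject unknown names."""
    names = config.get("tools") or []
    unknown = [n for n in names if n not in _REGISTRY]
    if unknown:
        raise ValueError(f"unknown tool(s) {unknown}; known: {sorted(_REGISTRY)}")
    return [_REGISTRY[n] for n in names]


# Placeholder registration so the sketch is self-contained; the real
# fetch_url performs the HTTPS fetch described in the summary.
register(Tool("fetch_url", "Fetch a public HTTPS page as text", lambda url: ""))
```

An empty result from `resolve_tools` is what would let a generator fall through to the single-shot path.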
Existing configs without `tools:` keep their current behavior bit-for-bit.

Out of scope
* … `fetch_url`.

Known follow-ups
* `urlopen` follows 3xx without re-validating each hop's IP. A malicious site can 302 to a private host and bypass the upfront check. Mitigation requires a custom `HTTPRedirectHandler` or disabling redirects. Out of scope for this PR but flagged.
* `_is_blocked_host` resolves once; `urlopen` resolves again. A rebinding attack can flip between the two. Out of scope.

Test plan
fetch_url: scheme enforcement, SSRF reject (private/loopback/unresolvable), size cap, HTML strip, HTTP error, timeout, URL error.
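The scheme and SSRF cases above can be pictured against a guard shaped roughly like the following. This is a sketch using only the stdlib, not the PR's `_is_blocked_host`: `is_blocked_ip`, `check_url`, the error wording, and the injectable `resolve` parameter are assumptions made so the checks can run without network access. It does not address the redirect or DNS-rebinding gaps listed under known follow-ups.

```python
import ipaddress
import socket
from typing import Callable, Optional
from urllib.parse import urlsplit


def is_blocked_ip(ip_text: str) -> bool:
    """True for addresses a public-only fetcher should refuse."""
    ip = ipaddress.ip_address(ip_text.split("%")[0])  # drop any IPv6 zone id
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
    )


def check_url(url: str, resolve: Callable = socket.getaddrinfo) -> Optional[str]:
    """Return an 'Error: ...' string if the URL must be refused, else None."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return "Error: only https:// URLs are allowed"
    host = parts.hostname
    if not host:
        return "Error: URL has no host"
    try:
        infos = resolve(host, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return f"Error: cannot resolve {host}: {exc}"
    for info in infos:
        addr = info[4][0]  # sockaddr tuple; first element is the IP string
        if is_blocked_ip(addr):
            return f"Error: {host} resolves to a non-public address ({addr})"
    return None
```

Checking every resolved address, rather than only the first, matters because a hostname can return a mix of public and private records.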