Testbot: fix bazelisk, add observability, broaden allowlist - 80% cost cut #904
Conversation
- Install bazelisk directly from the release asset URL. The setup-bazel action's `listReleases` call against bazelbuild/bazelisk now returns an empty array (the `/releases/tags/v<X>` endpoint still works), so every fresh ubuntu-latest run was failing at setup with "Unable to find Bazelisk version 1.27.0".
- Bump Claude Code `--max-turns` to 200 in both workflows and update the README config tables. Recent runs were hitting the previous caps (50 / 75) on non-trivial requests.
- Surface generate-side diagnostics like respond does: switch to `--output-format stream-json --verbose` so each turn and tool use streams into the action log live, and add a `Claude Code diagnostics` group that parses the final `result` message for `subtype`, `is_error`, `num_turns`, `duration_ms`, `total_cost_usd`, and the result text. `set +e` ensures the summary still prints when Claude exits non-zero (timeout, max-turns, etc.).
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

CI workflows now install Bazelisk directly and increase Claude Code agent turn limits to 200; Claude Code runs in streamed JSONL mode with tee'd logs, exit-status capture, and post-parse of the final `result` message.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant WF as Workflow
    participant Claude as Claude Code (agent)
    participant Log as JSONL Log File
    participant Parser as Post-processor
    WF->>Claude: invoke Claude Code (streamed JSONL, verbose, allowedTools)
    Claude-->>WF: stream JSONL chunks (stdout)
    WF->>Log: tee stream to log file
    alt Claude exits (any status)
        WF->>Parser: parse JSONL log for final `result` & diagnostics
        Parser-->>WF: parsed result + diagnostics
        WF->>WF: propagate Claude exit status
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/testbot-respond.yaml:
- Around line 75-80: The "Install Bazelisk" step is writing directly to
/usr/local/bin and can fail due to missing write permissions; change the install
to download the binary to a temporary path (e.g., /tmp/bazelisk), then use sudo
to move it into /usr/local/bin and set execute permissions and the symlink
(update commands that reference /usr/local/bin/bazelisk and /usr/local/bin/bazel
accordingly), or pipe the curl through sudo tee into /usr/local/bin/bazelisk
before running sudo chmod and sudo ln -sf so the step reliably succeeds on
ubuntu-latest.
In @.github/workflows/testbot.yaml:
- Around line 90-95: The "Install Bazelisk" GitHub Actions step writes directly
to /usr/local/bin without sudo, which can fail; change the download to write
with elevated permissions (e.g., stream the curl output into sudo tee to
/usr/local/bin/bazelisk or download to a temp file and sudo mv it), keep sudo
chmod +x /usr/local/bin/bazelisk and sudo ln -sf /usr/local/bin/bazelisk
/usr/local/bin/bazel so the binary is owned/installed correctly; update the
"Install Bazelisk" step to use that approach.
- Around line 142-150: The pipeline captures the wrong exit code after running
the npx `@anthropic-ai/claude-code` ... | tee "$STREAM_LOG" pipeline because it
reads `$?` (which reflects tee) while `set +e` is active; update the code that
sets `status` (currently `status=$?`) to instead capture the first pipeline
command's status via `PIPESTATUS[0]` so failures of the npx command are
preserved, and ensure the later `exit $status` uses that corrected variable;
look for the npx `@anthropic-ai/claude-code` invocation, the `tee "$STREAM_LOG"`
usage, the `status=$?` assignment, and the `exit $status` call to make this
change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9b0dd86e-5052-481b-81df-e67de5ed2fb8
📒 Files selected for processing (3)
- .github/workflows/testbot-respond.yaml
- .github/workflows/testbot.yaml
- src/scripts/testbot/README.md
curl can't write directly to /usr/local/bin without sudo on the runner (curl exit 23 / "Failure writing output to destination"). Download to a writable scratch path first and use `sudo install` to place the binary with correct mode.
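A sketch of that suggestion as a workflow step (the asset URL assumes bazelisk's standard `bazelisk-linux-amd64` release naming):

```yaml
- name: Install Bazelisk
  run: |
    # Download to a runner-writable scratch path, then place with sudo.
    curl -fsSL -o "$RUNNER_TEMP/bazelisk" \
      "https://github.com/bazelbuild/bazelisk/releases/download/v1.27.0/bazelisk-linux-amd64"
    sudo install -m 0755 "$RUNNER_TEMP/bazelisk" /usr/local/bin/bazelisk
    sudo ln -sf /usr/local/bin/bazelisk /usr/local/bin/bazel
```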
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #904      +/-   ##
==========================================
+ Coverage   42.45%   42.54%   +0.08%
==========================================
  Files         211      211
  Lines       26831    26833       +2
  Branches     4195     4196       +1
==========================================
+ Hits        11392    11416      +24
+ Misses      14834    14807      -27
- Partials      605      610       +5
```

Flags with carried forward coverage won't be shown.
Actionable comments posted: 1
♻️ Duplicate comments (1)
.github/workflows/testbot.yaml (1)
143-150: ⚠️ Potential issue | 🔴 Critical

Capture Claude's exit code from `${PIPESTATUS[0]}` (not `$?`). Line 149 uses `$?` after a pipeline, which can report `tee`'s status and mask Claude failures. Capture immediately with `${PIPESTATUS[0]}` to preserve the `npx` exit code.

✅ Minimal fix

```diff
   npx @anthropic-ai/claude-code@2.1.91 --print \
     --model "$ANTHROPIC_MODEL" \
     --output-format stream-json --verbose \
     --allowedTools "Read,Edit,Write,Bash(bazel test *),Bash(pnpm --dir src/ui test *),Bash(pnpm --dir src/ui validate),Bash(pnpm --dir src/ui format),Glob,Grep" \
     --max-turns ${{ inputs.max_turns || '200' }} \
     "$PROMPT" | tee "$STREAM_LOG"
-  status=$?
+  status=${PIPESTATUS[0]}
```

```bash
#!/bin/bash
set -euo pipefail
echo "=== workflow snippet ==="
sed -n '138,151p' .github/workflows/testbot.yaml
echo "=== reproduce pipeline status under non-pipefail shell ==="
bash -e -c 'set +e; false | tee /tmp/t.out >/dev/null; echo "status_from_dollar_q=$?"'
echo "=== compare with PIPESTATUS capture ==="
bash -e -c 'set +e; false | tee /tmp/t.out >/dev/null; s=${PIPESTATUS[0]}; echo "status_from_PIPESTATUS_0=$s"'
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/testbot.yaml around lines 143 - 150, The pipeline is capturing the wrong exit code with "$?" after the piped Claude command; replace that with capturing PIPESTATUS[0] immediately after the pipeline so you record the npx command's exit status rather than tee's. Specifically, after the npx `@anthropic-ai/claude-code`... | tee "$STREAM_LOG" pipeline, set status=${PIPESTATUS[0]} (or a similarly named variable) and use that for subsequent checks instead of status=$?; ensure this assignment happens on the next line without intervening commands so PIPESTATUS[0] reflects the npx process.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/testbot.yaml:
- Around line 92-95: The workflow downloads and installs bazelisk directly (the
curl -> "$RUNNER_TEMP/bazelisk" and sudo install -> /usr/local/bin/bazelisk)
without verifying integrity; update the steps in both testbot.yaml and
testbot-respond.yaml to verify the binary before installing by either (1)
upgrading to a bazelisk release that publishes official SHA256 checksums and
validating the downloaded file with sha256sum against a pinned hash, or (2)
computing and pinning the sha256 of the downloaded "$RUNNER_TEMP/bazelisk" and
failing the job if it does not match, or (3) replace the direct download with a
verified package manager installation if available; ensure the step logs which
method was used and aborts instead of running sudo install when checksum
verification fails.
---
Duplicate comments:
In @.github/workflows/testbot.yaml:
- Around line 143-150: The pipeline is capturing the wrong exit code with "$?"
after the piped Claude command; replace that with capturing PIPESTATUS[0]
immediately after the pipeline so you record the npx command's exit status
rather than tee's. Specifically, after the npx `@anthropic-ai/claude-code`... |
tee "$STREAM_LOG" pipeline, set status=${PIPESTATUS[0]} (or a similarly named
variable) and use that for subsequent checks instead of status=$?; ensure this
assignment happens on the next line without intervening commands so
PIPESTATUS[0] reflects the npx process.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 36cb026d-db65-4856-a635-4adba2c14766
📒 Files selected for processing (2)
- .github/workflows/testbot-respond.yaml
- .github/workflows/testbot.yaml
The previous run on this branch (run 25026758807, 57 turns / $29.73
for one UI test file) burned ~50 turns on tool-call permission denials.
Two patterns dominated:
- 15 calls of `pnpm --dir <abs-path> ...` blocked because the allowlist
was `pnpm --dir src/ui test *` — literal `src/ui` and only the `test`
subcommand. Claude reached for `type-check`, `lint`, `format:check`,
and used the runner's absolute path.
- 23 calls blocked by the multi-op / pipe rule (`cd X && Y`, `cmd 2>&1`,
`cmd | head`). The matcher rejects every variant.
Replace the narrow `pnpm --dir src/ui ...` entries with `Bash(pnpm *)`
to cover all flag/subcommand variants regardless of working directory,
and add `Bash(bazel build *)`, `Bash(npx vitest *)`, `Bash(npx tsc *)`,
plus direct `./node_modules/.bin/{vitest,tsc} *` paths for the cases
where Claude bypasses pnpm. Risk surface is unchanged: pnpm scripts
are bounded by the repo's package.json (Claude can't add new ones via
the post-run guardrails filter), git/gh/network access stays excluded.
Add a one-line guidance block at the top of TESTBOT_RULES.md
"Verification" so both generate and respond stop reaching for
`cd && ...` and `2>&1`.
🧹 Nitpick comments (1)
src/scripts/testbot/respond.py (1)
35-37: Consider scoping the `pnpm` allowlist to `src/ui` for least privilege.

`Bash(pnpm *)` materially broadens command access beyond the documented UI pattern (`pnpm --dir src/ui ...`). Tightening this pattern would keep the workflow flexible while reducing accidental/risky command surface.

Suggested adjustment

```diff
 ALLOWED_TOOLS = (
     "Read,Edit,Write,Glob,Grep,"
     "Bash(bazel test *),Bash(bazel build *),"
-    "Bash(pnpm *),Bash(npx vitest *),Bash(npx tsc *),"
+    "Bash(pnpm --dir src/ui *),Bash(npx vitest *),Bash(npx tsc *),"
     "Bash(./node_modules/.bin/vitest *),Bash(./node_modules/.bin/tsc *),"
     "Bash(gh pr view *),Bash(gh pr diff *),Bash(gh pr checks *)"
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/scripts/testbot/respond.py` around lines 35 - 37, The allowlist entry "Bash(pnpm *)" is too broad; restrict it to the UI workspace by replacing that pattern with a scoped form such as "Bash(pnpm --dir src/ui *)" (or the equivalent flag your repo uses, e.g. --workspace/--prefix) so commands can only run within src/ui; update the allowlist string in respond.py where "Bash(pnpm *)" appears to the scoped pattern and run tests to confirm no other patterns depend on the global pnpm form.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/scripts/testbot/respond.py`:
- Around line 35-37: The allowlist entry "Bash(pnpm *)" is too broad; restrict
it to the UI workspace by replacing that pattern with a scoped form such as
"Bash(pnpm --dir src/ui *)" (or the equivalent flag your repo uses, e.g.
--workspace/--prefix) so commands can only run within src/ui; update the
allowlist string in respond.py where "Bash(pnpm *)" appears to the scoped
pattern and run tests to confirm no other patterns depend on the global pnpm
form.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4202b456-584b-4120-b286-458f65cd0d90
📒 Files selected for processing (3)
- .github/workflows/testbot.yaml
- src/scripts/testbot/TESTBOT_RULES.md
- src/scripts/testbot/respond.py
✅ Files skipped from review due to trivial changes (1)
- src/scripts/testbot/TESTBOT_RULES.md
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/testbot.yaml
Add `Bash(cd *)` to the allowlist for both generate and respond. `cd` itself executes nothing — it just sets the working directory for the following segment in a compound command. The matcher splits on `&&`, so `cd src/ui && pnpm test` only works if both segments are individually allowed; `pnpm` is already covered by `Bash(pnpm *)`. This eliminates the largest remaining denial pattern (~13 of 67 in the prior run) at zero added attack surface — `cd` cannot exec, read, or write anything by itself, and every command that runs after it still has to clear its own pattern. Update the TESTBOT_RULES.md verification note accordingly: `cd && <cmd>` and `2>&1` are now explicitly allowed; pipes and output redirects (`>`, `>>`) remain blocked since the action log already captures full output.
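For illustration only, a toy version of such a segment matcher; the PR's actual matcher and pattern set are not shown here, so the names below are assumptions:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a split-on-'&&' matcher: a compound command is
# allowed only if every '&&'-separated segment matches some glob on its own.
allowed=('cd *' 'pnpm *' 'bazel test *')
cmd='cd src/ui && pnpm test'

ok=true
while IFS= read -r seg; do
  hit=false
  for pat in "${allowed[@]}"; do
    [[ $seg == $pat ]] && hit=true && break   # unquoted $pat => glob match
  done
  $hit || ok=false
done < <(sed 's/ && /\n/g' <<<"$cmd")

echo "allowed: $ok"   # "true": 'cd *' covers the cd segment, 'pnpm *' the rest
```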
Force-pushed 926fbf3 to 0de31e9
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/testbot.yaml:
- Around line 151-179: Generate a random token in the shell before invoking the
Python block and pass it into the script (e.g., via an env var or extra arg),
then in the script emit a GitHub Actions stop-commands wrapper around the
untrusted model text: before printing the model output (the code that prints
"result:" and print(result[:2000]) where result is derived from final), print
the stop marker ::stop-commands::<TOKEN> and after printing the model text print
::<TOKEN>:: to re-enable command parsing; use the generated token variable so
the markers are unique and surround only the model output.
- Around line 143-147: Validate inputs.max_turns with a short run step before
invoking the npx command (e.g., a shell step that checks the value matches
^[0-9]+$ and fails the workflow if not), store the safe value in an env/output
like VALIDATED_MAX_TURNS, and then pass it quoted to the CLI (replace
--max-turns ${{ inputs.max_turns || '200' }} with --max-turns '${{
env.VALIDATED_MAX_TURNS }}' or the output variable); this ensures only digits
reach the --max-turns flag and prevents command injection when running npx
`@anthropic-ai/claude-code`.
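A minimal sketch of the stop-commands wrapper the first comment asks for; `RESULT_TEXT` is a hypothetical variable holding the model output:

```bash
# Generate a per-run token and disable workflow-command parsing while the
# untrusted model text is printed; re-enable afterwards.
token="$(openssl rand -hex 16)"
echo "::stop-commands::${token}"
printf '%s\n' "$RESULT_TEXT"   # RESULT_TEXT: hypothetical var with Claude's `result` text
echo "::${token}::"
```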
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e4ff2ebd-1657-4cf3-ad58-0eae3068e94f
📒 Files selected for processing (3)
- .github/workflows/testbot.yaml
- src/scripts/testbot/TESTBOT_RULES.md
- src/scripts/testbot/respond.py
✅ Files skipped from review due to trivial changes (1)
- src/scripts/testbot/TESTBOT_RULES.md
🚧 Files skipped from review as they are similar to previous changes (1)
- src/scripts/testbot/respond.py
…jection

Three findings on PR #904:

1. **Pipe exit code masked Claude failures.** `status=$?` after `npx ... | tee "$STREAM_LOG"` reads tee's exit, not npx's. With `set +e` active, a Claude failure surfaced as exit 0. Fix: use `${PIPESTATUS[0]}` to capture the first segment's status.
2. **Command injection via `inputs.max_turns`.** The value was interpolated into the run script before the shell saw it, so a malicious dispatcher (anyone with workflow_dispatch perms) could pass `200; <cmd>`. workflow_dispatch is already gated by repo write access, so the practical risk is low, but the fix is cheap: pass the input via `env: MAX_TURNS: ...` (literal string), then reject anything not matching `^[0-9]+$` before passing it to npx.
3. **`::workflow-command::` injection from Claude output.** The diagnostics block prints up to 2KB of `result` text. If Claude's text contains `::warning::`, `::add-mask::`, `::endgroup::`, etc., the runner interprets them as workflow commands. Fix: wrap the untrusted output in `::stop-commands::<random-token>` / `::<random-token>::` so command parsing is disabled while the result is being printed.

Skipping two other findings:
- "No SHA256 verification of bazelisk": bazelisk doesn't publish checksums for v1.27.0, and the existing setup-bazel action also trusts the same TLS-protected GH release URL without verification.
- The duplicate "/usr/local/bin write permission" findings were already resolved in 13372f3.
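A sketch of the finding-2 mitigation as described, with an assumed step layout:

```yaml
- name: Run Claude Code
  env:
    MAX_TURNS: ${{ inputs.max_turns || '200' }}  # env assignment passes the value as a literal string
  run: |
    # Only digits may reach --max-turns; fail fast on anything else.
    [[ "$MAX_TURNS" =~ ^[0-9]+$ ]] || { echo "::error::invalid max_turns: $MAX_TURNS"; exit 1; }
    # ...later in the step: --max-turns "$MAX_TURNS"
```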
Two cleanups from /simplify review:

- Stream-json `result` is always the final event. Iterate in reverse and break on the first match instead of parsing every assistant/user/tool message. With `max_turns=200` the prior approach JSON-decoded ~1000 lines per run; now it touches one.
- Drop the redundant comment on the `MAX_TURNS:` env entry; the validation block in the run script already explains the indirection.
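The same bottom-up scan, sketched in shell for illustration (the PR does this in the workflow's inline Python; assumes every line of `$STREAM_LOG` is a JSON object):

```bash
# `result` is the final stream-json event, so reading the log in reverse
# decodes one line instead of the whole transcript.
final="$(tac "$STREAM_LOG" | jq -c 'select(.type == "result")' | head -n1)"
jq -r '"subtype=\(.subtype) turns=\(.num_turns) cost=\(.total_cost_usd)"' <<<"$final"
```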
Earlier review flagged that the workflow downloads and sudo-installs bazelisk into /usr/local/bin without integrity verification. I had skipped this on the basis that bazelisk doesn't publish official checksums for v1.27.0 and the existing setup-bazel action trusts the same TLS-protected URL — but the threat is real: a tampered binary runs in the runner with access to SVC_OSMO_CI_TOKEN and could push backdoored code. TLS doesn't protect against artifact replacement. Pin the SHA256 we observed and verify with `sha256sum -c -` before the sudo install. Hash is stable across re-fetch (e1508323f347ad1465a887bc5d2bfb91cffc232d11e8e997b623227c6b32fb76). On version bump, recompute and update both env entries.
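Sketched with the pinned hash above; `BAZELISK_URL` stands in for the workflow's pinned asset URL:

```bash
BAZELISK_SHA256="e1508323f347ad1465a887bc5d2bfb91cffc232d11e8e997b623227c6b32fb76"
curl -fsSL -o "$RUNNER_TEMP/bazelisk" "$BAZELISK_URL"
# sha256sum -c reads "hash  filename" pairs from stdin and exits non-zero
# on mismatch, aborting the step before the sudo install runs.
echo "${BAZELISK_SHA256}  ${RUNNER_TEMP}/bazelisk" | sha256sum -c -
sudo install -m 0755 "$RUNNER_TEMP/bazelisk" /usr/local/bin/bazelisk
```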
The verbose stream-json output was useful for debugging but drowned out the diagnostics summary in the action log UI. Wrap the stream in ::group::/::endgroup:: so it collapses by default; reviewers see the short diagnostics summary first and click to expand the turn-by-turn log when needed.
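Roughly (the stand-in `run_claude` marks where the real npx invocation goes):

```bash
echo "::group::Claude Code stream-json"   # collapsed by default in the Actions log UI
run_claude | tee "$STREAM_LOG"            # run_claude: hypothetical stand-in for the npx pipeline
status=${PIPESTATUS[0]}
echo "::endgroup::"
echo "diagnostics summary (exit: $status)"  # prints outside the fold, visible at a glance
```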
📖 Docs preview: https://d3in15bfzp49i0.cloudfront.net/904/index.html
Root-cause analysis of run 25069219888 (201 turns / $88.80 / max_turns hit). Claude wrote `datasets-hooks.test.ts`, then realized the test needed `.test.tsx` to use JSX, and tried to rename it. With `mv`, `rm`, and `git rm` all blocked, Claude burned ~180 turns retrying variants before max_turns finally cut it off. Fix: add `Bash(mv *)` to the allowlist (generate workflow + respond.py shared list). Risk surface stays bounded — runner is ephemeral and guardrails.py discards non-test file changes after the run. Also lower the default `max_turns` from 200 to 100 to cap the blast radius if Claude does loop. The earlier 200 was set to absorb worst cases; with `mv` available the recovery cost is one turn, so 100 is plenty for legitimate work and ~10x cheaper on a runaway.
Description
Fixes a hard outage on testbot, brings generate-side observability up to par
with respond, and broadens the tool allowlist for ~80% lower spend.
1. Bazelisk install workaround. `setup-bazel@0.15.0` resolves bazelisk via GitHub's `listReleases` API on `bazelbuild/bazelisk`, which returns `[]` as of 2026-04-27 (the `tags/v1.27.0` endpoint and asset URLs still work). Every fresh ubuntu-latest run failed with `Unable to find Bazelisk version 1.27.0` (run 25012620541). Workaround: download from the asset URL into `$RUNNER_TEMP`, verify SHA256 against a pinned hash, `sudo install` to `/usr/local/bin`. setup-bazel still runs for caching.
2. Generate-side observability. Switch to `--output-format stream-json --verbose` so each turn streams live; wrap in a foldable `::group::` so it collapses by default. After the run, parse the final `result` event and print `subtype` / `is_error` / `num_turns` / `duration_ms` / `duration_api_ms` / `total_cost_usd` / result text in a "Claude Code diagnostics" group. Reverse-iterate + break for O(1) parsing. `set +e` + `${PIPESTATUS[0]}` so the diagnostics print on Claude failure and the step exit code matches Claude's.
3. Broaden tool allowlist. Replace the narrow `Bash(pnpm --dir src/ui ...)` with `Bash(pnpm *)`; add `Bash(bazel build *)`, `Bash(npx vitest *)`, `Bash(npx tsc *)`, direct `./node_modules/.bin/{vitest,tsc} *`, `Bash(cd *)`, and `Bash(mv *)`. `cd` and `mv` are zero-exec by themselves; following segments still need their own pattern. Pipes, output redirects, `rm`, `tee`, `curl`, `git`, and `gh` write remain blocked. `mv` was added after a runaway run (Claude wrote `.test.ts`, needed `.test.tsx`, looped 180 turns trying `mv`/`rm`/`git rm`).
4. Security hardening (CodeRabbit). `${PIPESTATUS[0]}` instead of `$?` so tee doesn't mask Claude failures. `inputs.max_turns` routed through `env:` (literal) and validated against `^[0-9]+$` (defense vs. shell injection from a malicious workflow_dispatch). Untrusted Claude `result` text wrapped in `::stop-commands::<random-token>` so it can't inject `::warning::`/`::add-mask::` etc.
5. `max_turns` calibrated to 100. Was 50 (too low), bumped to 200 (turned a runaway into ~$88), now 100: covers legitimate work (best run was 21 turns) and caps blast radius. The `mv` allowance means filename mistakes recover in 1 turn instead of 200.
Verification — A/B on this branch
Both runs: `max_targets=1`, `max_uncovered=300`, dry run, opus-4-5.

[A/B table columns: `num_turns` | `total_cost_usd` | `duration_api_ms`]

Subsequent run 25069219888 verified SHA256 install + foldable stream group + diagnostics summary end-to-end and surfaced the `mv` denial loop that drove fix #5.

Issue #760
Checklist
🤖 Generated with Claude Code