
Testbot: fix bazelisk, add observability, broaden allowlist - 80% cost cut#904

Merged
jiaenren merged 9 commits into main from jiaenr/testbot-bazelisk-fix
Apr 28, 2026

Conversation


@jiaenren jiaenren commented Apr 27, 2026

Description

Fixes a hard outage on testbot, brings generate-side observability up to par
with respond, and broadens the tool allowlist for ~80% lower spend.

1. Bazelisk install workaround. setup-bazel@0.15.0 resolves bazelisk
via GitHub's listReleases API on bazelbuild/bazelisk, which returns []
as of 2026-04-27 (the tags/v1.27.0 endpoint and asset URLs still work).
Every fresh ubuntu-latest run failed with "Unable to find Bazelisk version 1.27.0" (run 25012620541).
Workaround: download the release asset into $RUNNER_TEMP, verify its SHA256
against a pinned hash, then sudo install it to /usr/local/bin. setup-bazel
still runs for caching.

2. Generate-side observability. Switch to --output-format stream-json --verbose so each turn streams live, and wrap the stream in a foldable
::group:: so it collapses by default. After the run, parse the final
result event and print subtype/is_error/num_turns/duration_ms/
duration_api_ms/total_cost_usd plus the result text in a "Claude Code
diagnostics" group. Reverse-iterate and break on the first match so only
one line is decoded. set +e plus ${PIPESTATUS[0]} keep the diagnostics
printing on Claude failure and make the step's exit code match Claude's.
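
Roughly the shape of the run-and-parse step (a sketch; the embedded Python and the variable names are illustrative, not the workflow's exact code):

set +e   # keep going on Claude failure so diagnostics still print
echo "::group::Claude Code stream"
npx @anthropic-ai/claude-code@2.1.91 --print \
  --output-format stream-json --verbose \
  "$PROMPT" | tee "$STREAM_LOG"
status=${PIPESTATUS[0]}   # Claude's exit code, not tee's
echo "::endgroup::"

echo "::group::Claude Code diagnostics"
python3 - "$STREAM_LOG" <<'PY'
import json, sys

lines = open(sys.argv[1], encoding="utf-8").read().splitlines()
# The "result" event is the stream's final line, so reverse-iterate and
# break on the first hit instead of decoding the whole log.
for line in reversed(lines):
    if not line.strip():
        continue
    event = json.loads(line)
    if event.get("type") == "result":
        for key in ("subtype", "is_error", "num_turns", "duration_ms",
                    "duration_api_ms", "total_cost_usd"):
            print(f"{key}: {event.get(key)}")
        break
PY
echo "::endgroup::"
exit "$status"   # re-raise Claude's exit status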

3. Broaden tool allowlist. Replace the narrow Bash(pnpm --dir src/ui ...)
with Bash(pnpm *); add Bash(bazel build *), Bash(npx vitest *),
Bash(npx tsc *), direct ./node_modules/.bin/{vitest,tsc} *,
Bash(cd *), and Bash(mv *). cd and mv execute nothing by themselves;
following segments still need their own pattern. Pipes, output redirects,
rm, tee, curl, git, and gh write remain blocked. mv was added after
a runaway run (Claude wrote .test.ts, needed .test.tsx, and looped ~180
turns trying mv/rm/git rm).

4. Security hardening (CodeRabbit). ${PIPESTATUS[0]} instead of $?
so tee doesn't mask Claude failures. inputs.max_turns is routed through
env: (as a literal string) and validated against ^[0-9]+$ (defense against
shell injection from a malicious workflow_dispatch). Untrusted Claude
result text is wrapped in ::stop-commands::<random-token> /
::<random-token>:: so it can't inject ::warning::, ::add-mask::, etc.
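
Sketched together, the two mitigations look something like this (MAX_TURNS and RESULT_TEXT are illustrative names; MAX_TURNS is assumed to arrive via env: as a literal string):

# Reject anything that isn't purely numeric before it reaches the CLI.
if ! [[ "$MAX_TURNS" =~ ^[0-9]+$ ]]; then
  echo "::error::max_turns must match ^[0-9]+$, got '$MAX_TURNS'"
  exit 1
fi

# Disable workflow-command parsing while untrusted model text is printed,
# so embedded ::warning:: / ::add-mask:: lines render inertly.
token="$(openssl rand -hex 16)"
echo "::stop-commands::${token}"
printf '%s\n' "$RESULT_TEXT"     # untrusted Claude output (illustrative var)
echo "::${token}::"              # resume command parsing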

5. max_turns calibrated to 100. Started at 50 (too low), was bumped to 200
(which let one runaway cost ~$88), and now sits at 100: enough for
legitimate work (the best run took 21 turns) while capping blast radius.
With mv allowed, a filename mistake recovers in one turn instead of
burning 200.

Verification — A/B on this branch

Both runs: max_targets=1, max_uncovered=300, dry run, opus-4-5.

| Metric | Before allowlist (run 25026758807) | After (run 25028953472) |
| --- | --- | --- |
| num_turns | 57 | 21 |
| total_cost_usd | $29.73 | $5.91 |
| duration_api_ms | 461s | 119s |
| Permission denials | 67 of 161 (42%) | 0 of 9 (0%) |
| Step wall time | 7m 53s | 6m 39s |

Subsequent run 25069219888
verified SHA256 install + foldable stream group + diagnostics summary
end-to-end and surfaced the mv denial loop that drove fix #5.

Issue #760

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

🤖 Generated with Claude Code

- Install bazelisk directly from the release asset URL. The setup-bazel
  action's `listReleases` call against bazelbuild/bazelisk now returns
  an empty array (the `/releases/tags/v<X>` endpoint still works), so
  every fresh ubuntu-latest run was failing at setup with
  "Unable to find Bazelisk version 1.27.0".
- Bump Claude Code `--max-turns` to 200 in both workflows and update
  the README config tables. Recent runs were hitting the previous
  caps (50 / 75) on non-trivial requests.
- Surface generate-side diagnostics like respond does: switch to
  `--output-format stream-json --verbose` so each turn and tool use
  streams into the action log live, and add a `Claude Code diagnostics`
  group that parses the final `result` message for `subtype`,
  `is_error`, `num_turns`, `duration_ms`, `total_cost_usd`, and the
  result text. `set +e` ensures the summary still prints when Claude
  exits non-zero (timeout, max-turns, etc.).
@jiaenren jiaenren requested a review from a team as a code owner April 27, 2026 22:27

coderabbitai Bot commented Apr 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

CI workflows now install Bazelisk directly and increase Claude Code agent turn limits to 200; Claude Code runs in streamed JSONL mode with tee'd logs, exit-status capture, and post-parse of the final result. Testbot docs, rules, and allowed-tools for Claude were updated.

Changes

- Workflows: Bazelisk install & turn limits (.github/workflows/testbot.yaml, .github/workflows/testbot-respond.yaml): Adds explicit Bazelisk download/install (v1.27.0) and symlinks bazel; removes bazelisk-version usage. Increases max_turns/--max-turns from 50/75 to 200.
- Workflows: Claude streaming, logging & diagnostics (.github/workflows/testbot.yaml): Runs Claude Code in streamed JSONL + verbose mode, stores the prompt in a variable, tees the JSONL stream to a log file, continues diagnostics when Claude exits non-zero, captures and re-raises Claude's exit status, and parses the JSONL to print the final result and diagnostics.
- Scripts: allowed tools & testbot runtime (src/scripts/testbot/respond.py): Expands the Claude Code --allowedTools allowlist: adds cd *, bazel build *, a broader pnpm *, and allows vitest/tsc via npx and node_modules/.bin, while keeping existing Bazel test entries.
- Docs & Rules (src/scripts/testbot/README.md, src/scripts/testbot/TESTBOT_RULES.md): Updates README examples/config to the 200-agent-turn default. Adds TESTBOT_RULES restricting verification shell commands to a single allowed command (or cd <dir> && <single-cmd>), disallowing piping/chaining/redirection.

Sequence Diagram(s)

sequenceDiagram
  participant WF as Workflow
  participant Claude as Claude Code (agent)
  participant Log as JSONL Log File
  participant Parser as Post-processor

  WF->>Claude: invoke Claude Code (streamed JSONL, verbose, allowedTools)
  Claude-->>WF: stream JSONL chunks (stdout)
  WF->>Log: tee stream to log file
  alt Claude exits (any status)
    WF->>Parser: parse JSONL log for final `result` & diagnostics
    Parser-->>WF: parsed result + diagnostics
    WF->>WF: propagate Claude exit status
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


🚥 Pre-merge checks: 4 passed, 1 inconclusive

❌ Inconclusive (1)

- Title check: The title covers multiple changes (bazelisk fix, observability, allowlist) but only vaguely references the main change, and the '80% cost cut' claim is not substantiated by the PR objectives, which discuss bazelisk fixes and diagnostics instead. Resolution: revise the title to highlight the primary change (bazelisk fix + workflow unblocking) and verify or remove the cost-cut claim, or provide clearer context for what 'cost cut' refers to.

✅ Passed (4)

- Docstring Coverage: No functions found in the changed files to evaluate; check skipped.
- Linked Issues check: Skipped because no linked issues were found for this pull request.
- Out of Scope Changes check: Skipped because no linked issues were found for this pull request.
- Description Check: Skipped; CodeRabbit's high-level summary is enabled.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/testbot-respond.yaml:
- Around line 75-80: The "Install Bazelisk" step is writing directly to
/usr/local/bin and can fail due to missing write permissions; change the install
to download the binary to a temporary path (e.g., /tmp/bazelisk), then use sudo
to move it into /usr/local/bin and set execute permissions and the symlink
(update commands that reference /usr/local/bin/bazelisk and /usr/local/bin/bazel
accordingly), or pipe the curl through sudo tee into /usr/local/bin/bazelisk
before running sudo chmod and sudo ln -sf so the step reliably succeeds on
ubuntu-latest.

In @.github/workflows/testbot.yaml:
- Around line 90-95: The "Install Bazelisk" GitHub Actions step writes directly
to /usr/local/bin without sudo, which can fail; change the download to write
with elevated permissions (e.g., stream the curl output into sudo tee to
/usr/local/bin/bazelisk or download to a temp file and sudo mv it), keep sudo
chmod +x /usr/local/bin/bazelisk and sudo ln -sf /usr/local/bin/bazelisk
/usr/local/bin/bazel so the binary is owned/installed correctly; update the
"Install Bazelisk" step to use that approach.
- Around line 142-150: The pipeline captures the wrong exit code after running
the npx `@anthropic-ai/claude-code` ... | tee "$STREAM_LOG" pipeline because it
reads `$?` (which reflects tee) while `set +e` is active; update the code that
sets `status` (currently `status=$?`) to instead capture the first pipeline
command's status via `PIPESTATUS[0]` so failures of the npx command are
preserved, and ensure the later `exit $status` uses that corrected variable;
look for the npx `@anthropic-ai/claude-code` invocation, the `tee "$STREAM_LOG"`
usage, the `status=$?` assignment, and the `exit $status` call to make this
change.


Reviewing files that changed from the base of the PR and between 72c36d0 and dc90590.

📒 Files selected for processing (3)
  • .github/workflows/testbot-respond.yaml
  • .github/workflows/testbot.yaml
  • src/scripts/testbot/README.md

curl can't write directly to /usr/local/bin without sudo on the runner
(curl exit 23 / "Failure writing output to destination"). Download to a
writable scratch path first and use `sudo install` to place the binary
with correct mode.

codecov Bot commented Apr 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 42.54%. Comparing base (b9dd4ee) to head (91854a0).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #904      +/-   ##
==========================================
+ Coverage   42.45%   42.54%   +0.08%     
==========================================
  Files         211      211              
  Lines       26831    26833       +2     
  Branches     4195     4196       +1     
==========================================
+ Hits        11392    11416      +24     
+ Misses      14834    14807      -27     
- Partials      605      610       +5     
| Flag | Coverage | Δ |
| --- | --- | --- |
| backend | 43.32% <ø> | (ø) |

Flags with carried forward coverage won't be shown.
see 1 file with indirect coverage changes



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/testbot.yaml (1)

143-150: ⚠️ Potential issue | 🔴 Critical

Capture Claude’s exit code from PIPESTATUS[0] (not $?).

Line 149 uses $? after a pipeline, which can report tee’s status and mask Claude failures. Capture immediately with PIPESTATUS[0] to preserve the npx exit code.

✅ Minimal fix
           npx @anthropic-ai/claude-code@2.1.91 --print \
             --model "$ANTHROPIC_MODEL" \
             --output-format stream-json --verbose \
             --allowedTools "Read,Edit,Write,Bash(bazel test *),Bash(pnpm --dir src/ui test *),Bash(pnpm --dir src/ui validate),Bash(pnpm --dir src/ui format),Glob,Grep" \
             --max-turns ${{ inputs.max_turns || '200' }} \
             "$PROMPT" | tee "$STREAM_LOG"
-          status=$?
+          status=${PIPESTATUS[0]}

A verification script from the same review comment:

#!/bin/bash
set -euo pipefail

echo "=== workflow snippet ==="
sed -n '138,151p' .github/workflows/testbot.yaml

echo "=== reproduce pipeline status under non-pipefail shell ==="
bash -e -c 'set +e; false | tee /tmp/t.out >/dev/null; echo "status_from_dollar_q=$?"'

echo "=== compare with PIPESTATUS capture ==="
bash -e -c 'set +e; false | tee /tmp/t.out >/dev/null; s=${PIPESTATUS[0]}; echo "status_from_PIPESTATUS_0=$s"'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/testbot.yaml around lines 143 - 150, The pipeline is
capturing the wrong exit code with "$?" after the piped Claude command; replace
that with capturing PIPESTATUS[0] immediately after the pipeline so you record
the npx command's exit status rather than tee's. Specifically, after the npx
`@anthropic-ai/claude-code`... | tee "$STREAM_LOG" pipeline, set
status=${PIPESTATUS[0]} (or a similarly named variable) and use that for
subsequent checks instead of status=$?; ensure this assignment happens on the
next line without intervening commands so PIPESTATUS[0] reflects the npx
process.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/testbot.yaml:
- Around line 92-95: The workflow downloads and installs bazelisk directly (the
curl -> "$RUNNER_TEMP/bazelisk" and sudo install -> /usr/local/bin/bazelisk)
without verifying integrity; update the steps in both testbot.yaml and
testbot-respond.yaml to verify the binary before installing by either (1)
upgrading to a bazelisk release that publishes official SHA256 checksums and
validating the downloaded file with sha256sum against a pinned hash, or (2)
computing and pinning the sha256 of the downloaded "$RUNNER_TEMP/bazelisk" and
failing the job if it does not match, or (3) replace the direct download with a
verified package manager installation if available; ensure the step logs which
method was used and aborts instead of running sudo install when checksum
verification fails.

---

Duplicate comments:
In @.github/workflows/testbot.yaml:
- Around line 143-150: The pipeline is capturing the wrong exit code with "$?"
after the piped Claude command; replace that with capturing PIPESTATUS[0]
immediately after the pipeline so you record the npx command's exit status
rather than tee's. Specifically, after the npx `@anthropic-ai/claude-code`... |
tee "$STREAM_LOG" pipeline, set status=${PIPESTATUS[0]} (or a similarly named
variable) and use that for subsequent checks instead of status=$?; ensure this
assignment happens on the next line without intervening commands so
PIPESTATUS[0] reflects the npx process.


Reviewing files that changed from the base of the PR and between dc90590 and 13372f3.

📒 Files selected for processing (2)
  • .github/workflows/testbot-respond.yaml
  • .github/workflows/testbot.yaml

@jiaenren jiaenren temporarily deployed to testbot-generate April 28, 2026 00:19 — with GitHub Actions Inactive
The previous run on this branch (run 25026758807, 57 turns / $29.73
for one UI test file) burned ~50 turns on tool-call permission denials.
Two patterns dominated:

- 15 calls of `pnpm --dir <abs-path> ...` blocked because the allowlist
  was `pnpm --dir src/ui test *` — literal `src/ui` and only the `test`
  subcommand. Claude reached for `type-check`, `lint`, `format:check`,
  and used the runner's absolute path.
- 23 calls blocked by the multi-op / pipe rule (`cd X && Y`, `cmd 2>&1`,
  `cmd | head`). The matcher rejects every variant.

Replace the narrow `pnpm --dir src/ui ...` entries with `Bash(pnpm *)`
to cover all flag/subcommand variants regardless of working directory,
and add `Bash(bazel build *)`, `Bash(npx vitest *)`, `Bash(npx tsc *)`,
plus direct `./node_modules/.bin/{vitest,tsc} *` paths for the cases
where Claude bypasses pnpm. Risk surface is unchanged: pnpm scripts
are bounded by the repo's package.json (Claude can't add new ones via
the post-run guardrails filter), git/gh/network access stays excluded.

Add a one-line guidance block at the top of TESTBOT_RULES.md
"Verification" so both generate and respond stop reaching for
`cd && ...` and `2>&1`.
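
Illustratively, what changes under the broadened list (hypothetical commands; each pattern is matched independently):

# Allowed now:
pnpm --dir /home/runner/work/repo/src/ui type-check   # Bash(pnpm *) covers any dir/subcommand
npx vitest run src/ui/foo.test.tsx                    # Bash(npx vitest *)
bazel build //src/ui/...                              # Bash(bazel build *)

# Still denied:
pnpm test | head -50        # pipes remain blocked
curl https://example.com    # network access stays excluded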

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
src/scripts/testbot/respond.py (1)

35-37: Consider scoping pnpm allowlist to src/ui for least-privilege.

Bash(pnpm *) materially broadens command access beyond the documented UI pattern (pnpm --dir src/ui ...). Tightening this pattern would keep the workflow flexible while reducing accidental/risky command surface.

Suggested adjustment
 ALLOWED_TOOLS = (
     "Read,Edit,Write,Glob,Grep,"
     "Bash(bazel test *),Bash(bazel build *),"
-    "Bash(pnpm *),Bash(npx vitest *),Bash(npx tsc *),"
+    "Bash(pnpm --dir src/ui *),Bash(npx vitest *),Bash(npx tsc *),"
     "Bash(./node_modules/.bin/vitest *),Bash(./node_modules/.bin/tsc *),"
     "Bash(gh pr view *),Bash(gh pr diff *),Bash(gh pr checks *)"
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/scripts/testbot/respond.py` around lines 35 - 37, The allowlist entry
"Bash(pnpm *)" is too broad; restrict it to the UI workspace by replacing that
pattern with a scoped form such as "Bash(pnpm --dir src/ui *)" (or the
equivalent flag your repo uses, e.g. --workspace/--prefix) so commands can only
run within src/ui; update the allowlist string in respond.py where "Bash(pnpm
*)" appears to the scoped pattern and run tests to confirm no other patterns
depend on the global pnpm form.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/scripts/testbot/respond.py`:
- Around line 35-37: The allowlist entry "Bash(pnpm *)" is too broad; restrict
it to the UI workspace by replacing that pattern with a scoped form such as
"Bash(pnpm --dir src/ui *)" (or the equivalent flag your repo uses, e.g.
--workspace/--prefix) so commands can only run within src/ui; update the
allowlist string in respond.py where "Bash(pnpm *)" appears to the scoped
pattern and run tests to confirm no other patterns depend on the global pnpm
form.


Reviewing files that changed from the base of the PR and between 13372f3 and 08849cb.

📒 Files selected for processing (3)
  • .github/workflows/testbot.yaml
  • src/scripts/testbot/TESTBOT_RULES.md
  • src/scripts/testbot/respond.py
✅ Files skipped from review due to trivial changes (1)
  • src/scripts/testbot/TESTBOT_RULES.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/testbot.yaml

Add `Bash(cd *)` to the allowlist for both generate and respond. `cd`
itself executes nothing — it just sets the working directory for the
following segment in a compound command. The matcher splits on `&&`,
so `cd src/ui && pnpm test` only works if both segments are individually
allowed; `pnpm` is already covered by `Bash(pnpm *)`.

This eliminates the largest remaining denial pattern (~13 of 67 in the
prior run) at zero added attack surface — `cd` cannot exec, read, or
write anything by itself, and every command that runs after it still
has to clear its own pattern.

Update the TESTBOT_RULES.md verification note accordingly: `cd && <cmd>`
and `2>&1` are now explicitly allowed; pipes and output redirects (`>`,
`>>`) remain blocked since the action log already captures full output.
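
For example, under the split-on-&& matcher (hypothetical commands):

cd src/ui && pnpm test       # allowed: Bash(cd *) and Bash(pnpm *) each match
cd src/ui && rm -rf dist     # denied: rm has no allowlist pattern
pnpm test 2>&1               # allowed: 2>&1 is explicitly permitted
pnpm test > out.log          # denied: output redirects (>, >>) stay blocked
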
@jiaenren jiaenren force-pushed the jiaenr/testbot-bazelisk-fix branch from 926fbf3 to 0de31e9 Compare April 28, 2026 01:29
@jiaenren jiaenren temporarily deployed to testbot-generate April 28, 2026 01:31 — with GitHub Actions Inactive

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/testbot.yaml:
- Around line 151-179: Generate a random token in the shell before invoking the
Python block and pass it into the script (e.g., via an env var or extra arg),
then in the script emit a GitHub Actions stop-commands wrapper around the
untrusted model text: before printing the model output (the code that prints
"result:" and print(result[:2000]) where result is derived from final), print
the stop marker ::stop-commands::<TOKEN> and after printing the model text print
::<TOKEN>:: to re-enable command parsing; use the generated token variable so
the markers are unique and surround only the model output.
- Around line 143-147: Validate inputs.max_turns with a short run step before
invoking the npx command (e.g., a shell step that checks the value matches
^[0-9]+$ and fails the workflow if not), store the safe value in an env/output
like VALIDATED_MAX_TURNS, and then pass it quoted to the CLI (replace
--max-turns ${{ inputs.max_turns || '200' }} with --max-turns '${{
env.VALIDATED_MAX_TURNS }}' or the output variable); this ensures only digits
reach the --max-turns flag and prevents command injection when running npx
`@anthropic-ai/claude-code`.


Reviewing files that changed from the base of the PR and between 926fbf3 and 0de31e9.

📒 Files selected for processing (3)
  • .github/workflows/testbot.yaml
  • src/scripts/testbot/TESTBOT_RULES.md
  • src/scripts/testbot/respond.py
✅ Files skipped from review due to trivial changes (1)
  • src/scripts/testbot/TESTBOT_RULES.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/scripts/testbot/respond.py

@jiaenren changed the title from "testbot: unblock workflows + add generate-side diagnostics, raise max-turns" to "Testbot: unblock CI, add diagnostics, cut wasted spend 80%" on Apr 28, 2026
@jiaenren changed the title from "Testbot: unblock CI, add diagnostics, cut wasted spend 80%" to "Testbot: fix bazelisk outage, add diagnostics, broaden allowlist" on Apr 28, 2026
@jiaenren changed the title from "Testbot: fix bazelisk outage, add diagnostics, broaden allowlist" to "Testbot: fix bazelisk, add diagnostics, broaden allowlist" on Apr 28, 2026
@jiaenren changed the title from "Testbot: fix bazelisk, add diagnostics, broaden allowlist" to "Testbot: fix bazelisk, add observability, broaden allowlist - 80% cost cut" on Apr 28, 2026
…jection

Three findings on PR #904:

1. **Pipe exit code masked Claude failures.** `status=$?` after
   `npx ... | tee "$STREAM_LOG"` reads tee's exit, not npx's. With
   `set +e` active, a Claude failure surfaced as exit 0. Fix: use
   `${PIPESTATUS[0]}` to capture the first segment's status.

2. **Command injection via `inputs.max_turns`.** The value was
   interpolated into the run script before the shell saw it, so a
   malicious dispatcher (anyone with workflow_dispatch perms) could
   pass `200; <cmd>`. workflow_dispatch is already gated by repo
   write access so the practical risk is low, but the fix is cheap:
   pass the input via `env: MAX_TURNS: ...` (literal string), then
   reject anything not matching `^[0-9]+$` before passing to npx.

3. **`::workflow-command::` injection from Claude output.** The
   diagnostics block prints up to 2KB of `result` text. If Claude's
   text contains `::warning::`, `::add-mask::`, `::endgroup::`, etc.,
   the runner interprets them as workflow commands. Fix: wrap the
   untrusted output in `::stop-commands::<random-token>` /
   `::<random-token>::` so command parsing is disabled while
   the result is being printed.

Skipping two other findings:
- "No SHA256 verification of bazelisk" — bazelisk doesn't publish
  checksums for v1.27.0 and the existing setup-bazel action also
  trusts the same TLS-protected GH release URL without verification.
- The duplicate "/usr/local/bin write permission" findings were
  already resolved in 13372f3.
Two cleanups from /simplify review:

- Stream-json `result` is always the final event. Iterate in reverse
  and break on first match instead of parsing every assistant/user/
  tool message. With `max_turns=200` the prior approach json-decoded
  ~1000 lines per run; now it touches one.
- Drop the redundant comment on the `MAX_TURNS:` env entry — the
  validation block in the run script already explains the indirection.
Earlier review flagged that the workflow downloads and sudo-installs
bazelisk into /usr/local/bin without integrity verification. I had
skipped this on the basis that bazelisk doesn't publish official
checksums for v1.27.0 and the existing setup-bazel action trusts the
same TLS-protected URL — but the threat is real: a tampered binary
runs in the runner with access to SVC_OSMO_CI_TOKEN and could push
backdoored code. TLS doesn't protect against artifact replacement.

Pin the SHA256 we observed and verify with `sha256sum -c -` before
the sudo install. Hash is stable across re-fetch
(e1508323f347ad1465a887bc5d2bfb91cffc232d11e8e997b623227c6b32fb76).
On version bump, recompute and update both env entries.
The verbose stream-json output was useful for debugging but drowned
out the diagnostics summary in the action log UI. Wrap the stream in
::group::/::endgroup:: so it collapses by default; reviewers see the
short diagnostics summary first and click to expand the turn-by-turn
log when needed.

github-actions Bot commented Apr 28, 2026

📖 Docs preview: https://d3in15bfzp49i0.cloudfront.net/904/index.html

Root-cause analysis of run 25069219888 (201 turns / $88.80 / max_turns
hit). Claude wrote `datasets-hooks.test.ts`, then realized the test
needed `.test.tsx` to use JSX, and tried to rename it. With `mv`, `rm`,
and `git rm` all blocked, Claude burned ~180 turns retrying variants
before max_turns finally cut it off.

Fix: add `Bash(mv *)` to the allowlist (generate workflow + respond.py
shared list). Risk surface stays bounded — runner is ephemeral and
guardrails.py discards non-test file changes after the run.

Also lower the default `max_turns` from 200 to 100 to cap the blast
radius if Claude does loop. The earlier 200 was set to absorb worst
cases; with `mv` available the recovery cost is one turn, so 100 is
plenty for legitimate work and ~10x cheaper on a runaway.
@jiaenren jiaenren merged commit 636a866 into main Apr 28, 2026
14 of 15 checks passed
@jiaenren jiaenren deleted the jiaenr/testbot-bazelisk-fix branch April 28, 2026 21:02