feat(ci): headless PR review agent (phase 1)#28
Merged
Conversation
A code-review agent that fires after `ci` passes on a PR and posts
a single PR review (APPROVE / COMMENT / REQUEST_CHANGES) before a
human reviewer picks it up. The intent: triage the obvious-in-
hindsight stuff (schema mirror drift, body-column scans, missing
queryKey deps, classifier rules sensitive to window width, …) so the
human reviewer arrives at a PR with the easy 80% already surfaced.
## Pieces
* `.github/workflows/pr-review.yml`
Trigger = `workflow_run` on `ci` completion, gated on
`conclusion == 'success'` and `event == 'pull_request'`. Manual
`workflow_dispatch` (with pr_number input) is kept as a re-run
hatch. Concurrency-keyed per PR so a re-trigger cancels the
in-flight job.
* `scripts/pr-review/run_review.sh`
Substitutes PR_NUMBER / HEAD_SHA / BASE_REF into the prompt
template, pre-flights LiteLLM with a 5 s curl, runs `claude -p`
in print mode with a read-only tool allowlist + 1800 s outer
timeout, drops the model's stdout into
`/tmp/pr-review-${N}-out.md`.
* `scripts/pr-review/prompt.md`
Repo-specific reviewer prompt. Carries the crate map, the schema-
mirror rules (Rust serde ↔ console TS types), and an explicit
"things this repo has been bitten by" section that the agent must
actively look for (body-column scans like commit bf4887f, window-
width-sensitive classifier heuristics like fea1d83, etc).
Output format is strict — empty sections must be omitted, every
Blocking/Suggestion item must cite file:line.
* `scripts/pr-review/allowed_tools.txt`
Read / Grep / Glob / Bash(gh pr diff:*) / Bash(rg:*) / …
No Edit, no Write, no unrestricted Bash(*) — the agent is
read-only.
* `scripts/pr-review/post_review.py`
Parses the agent's markdown for which sections are populated,
picks the gh-review event accordingly (Blocking →
REQUEST_CHANGES, Suggestions/Questions only → COMMENT, none →
APPROVE), posts via `gh pr review`. Falls back to a plain
`gh pr comment` when the bot can't review its own PRs.
* `docs/pr-review-agent.md`
Architecture + ops notes — runner setup, cost/latency budget,
failure modes, phasing.
## Model routing
Agent → LiteLLM (172.16.103.81:4200) → GLM-5 SGLang (172.16.103.81
:9000). LiteLLM rewrites the Anthropic-shaped
`claude-3-5-sonnet-20241022` calls onto GLM-5, which has
sglang's `glm47` tool-call parser configured. No new infrastructure
needed.
## Phasing
Phase 1 (this PR): manual + post-CI auto-trigger. Test on a few
real PRs to calibrate the prompt. Phase 2: tune prompt against a
labeled set of past PRs + reviewer comments. Phase 3: structured
inline review comments once line numbers prove reliable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
One-off probe runnable via workflow_dispatch. Verifies the self-hosted runner has: * claude CLI on PATH (Claude Code) * gh CLI installed + authenticated * python3 + envsubst * network path to LiteLLM at 172.16.103.81:4200 * round-trip claude → LiteLLM → GLM-5 returns expected token Run before merging pr-review.yml or after re-imaging the runner so prereq failures surface as clearly-labeled step failures instead of opaque agent crashes in the real review path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A code-review agent that fires after
cipasses on a PR and posts a single PR review (APPROVE / COMMENT / REQUEST_CHANGES) before a human reviewer picks it up. The agent runs on the existingtokenscopeself-hosted runner; routes Anthropic-shaped Claude Code calls through LiteLLM (172.16.103.81:4200) onto GLM-5 (SGLang:9000). No new infrastructure to stand up — all the model serving is already running on wuneng.The point isn't to replace human review. It's to triage the obvious-in-hindsight stuff (schema mirror drift, body-column scans on wide windows, missing tanstack-query
queryKeydeps, classifier rules sensitive to window width — recent footguns this repo has hit) so the human reviewer arrives at a PR with the easy 80% already flagged.Trigger sequence
If CI fails, the agent never runs.
Files
.github/workflows/pr-review.ymlworkflow_runtrigger gated on CI success +workflow_dispatchmanual re-run; concurrency-keyed per PRscripts/pr-review/run_review.shclaude -pwith read-only tools + 1800 s outer timeoutscripts/pr-review/prompt.mdscripts/pr-review/allowed_tools.txtBash(gh pr diff:*)/ etc — no Edit, no Write, no unrestricted Bashscripts/pr-review/post_review.pydocs/pr-review-agent.mdRunner setup (one-time, not in this PR)
On the
tokenscope-ciVM:The runner is already labeled
tokenscope(same one theciworkflow uses). Network path to LiteLLM at172.16.103.81:4200already exists via the libvirt bridge.Cost / latency
GLM-5 runs on-prem — no per-request cost, the constraint is GPU minutes.
Test plan
workflow_dispatchpath).gh workflow run pr-review.yml -f pr_number=<test PR>against a small recent PR; verify it posts a review comment.workflow_runauto-trigger once the prompt produces reviewer-grade output on the three test PRs.🤖 Generated with Claude Code