chore(ci): shadow CI jobs on blacksmith for runner trial#54559
chore(ci): shadow CI jobs on blacksmith for runner trial#54559
Conversation
Runs each Depot-labeled CI job in parallel on a matching Blacksmith tier, gated by repo variable BLACKSMITH_SHADOW_ENABLED so the trial can be flipped off without a code change. Shadows are continue-on-error and not wired into any required-check needs list. Approach: - Single-shot jobs use a 2-entry matrix on `runner` (adds ~6 lines each). - Jobs with dynamic/complex matrices get a separate `-blacksmith` or `-shadow` job with a trimmed matrix in ci-blacksmith-shadow.yml. - Artifact uploads, cache saves, and side effects are gated to depot only. - Container-build / CD workflows are intentionally skipped because depot/build-push-action offloads the build to Depot's cloud builder, so the runner is just an orchestrator and timing would be near-identical. Post-trial pairing via .github/scripts/compare-ci-runners.py, which scrapes gh job durations, pairs (workflow, job, sha) across runners, and reports median/p95 speedup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Folds the playwright shadow into the main workflow via a 2-entry runner matrix gated on BLACKSMITH_SHADOW_ENABLED. Shadow leg is continue-on-error; artifact uploads, Visual Review run, Cloudflare deploy, PR comment, and screenshot patches remain depot-only via !matrix.is_shadow. Drops the redundant playwright-shadow stub. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
🎭 Playwright report · View test results →
These issues are not necessarily caused by your changes. |
Prompt To Fix All With AIThis is a comment left during a code review.
Path: .github/scripts/compare-ci-runners.py
Line: 59-68
Comment:
**`-L 10` will truncate data to hours, not 7 days**
The fetch limit of `10` means only the last 10 workflow runs are retrieved before the `since` date filter is applied. On a repo running CI many times per day, 10 runs covers a few hours at most, making the `--days 7` window useless in practice. The comment directly above (which says `-L 1000`) suggests this was the intended value.
```suggestion
"-L",
"1000",
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: .github/workflows/ci-e2e-playwright.yml
Line: 173-174
Comment:
**Matrix output race condition silently breaks Visual Review**
`playwright` is now a 2-entry matrix job. GitHub Actions resolves matrix job-level outputs from whichever entry finishes last. The shadow entry's `vr-create` step is skipped (`!matrix.is_shadow`), so its `vr_run_id` is `""`. If the shadow entry completes after the depot entry — which is unpredictable — `needs.playwright.outputs.vr_run_id` becomes `""`, and `handle-screenshots` skips both the "Complete Visual Review run" step and the VR CLI install (both gated on `needs.playwright.outputs.vr_run_id != ''`), silently abandoning every VR run.
A safe fix is to pin the output to only the depot matrix entry using a dedicated `outputs` step that is skipped for shadows, or move the VR run ID to a separate job output artifact instead of a matrix output.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: .github/workflows/ci-e2e-playwright.yml
Line: 741-742
Comment:
**Unrelated job rename may break branch protection**
`capture-run-time` has been renamed to `calculate-running-time`. This change is unrelated to the shadow trial but would silently remove the old job name from any branch protection required-status-check list. If `capture-run-time` was a required check, merging to master would become unguarded until the protection rule is updated.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "cleanup" | Re-trigger Greptile |
There was a problem hiding this comment.
Pull request overview
Adds a repo-variable–gated “shadow” lane to run most compute-heavy CI jobs on Blacksmith in parallel with the existing Depot runners, enabling apples-to-apples timing comparisons on the same SHAs.
Changes:
- Matrix-expands many existing CI jobs to run on both Depot and Blacksmith (shadow runs are
continue-on-errorand avoid conflicting side effects like artifact uploads / cache saves). - Adds a dedicated
ci-blacksmith-shadow.ymlworkflow for jobs that can’t be safely matrix-expanded (complex/dynamic matrices, side effects). - Adds
.github/scripts/compare-ci-runners.pyto scrape GitHub Actions job durations and report paired Depot vs Blacksmith statistics.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| .github/workflows/ci-storybook.yml | Matrix-expands Storybook build to add an optional Blacksmith shadow and gates cache/artifact side effects. |
| .github/workflows/ci-rust.yml | Matrix-expands Rust build/lint; adds a separate Blacksmith-only shadow test job subset. |
| .github/workflows/ci-rust-flags-integration.yml | Matrix-expands Rust /flags integration tests to add an optional Blacksmith shadow. |
| .github/workflows/ci-python.yml | Matrix-expands Python quality job; gates cache/test-result artifact writes for shadows. |
| .github/workflows/ci-proto.yml | Matrix-expands proto lint/breaking/codegen checks to add optional Blacksmith shadows. |
| .github/workflows/ci-nodejs.yml | Matrix-expands Node lint/build and adds a runner dimension to sharded tests for optional Blacksmith shadows. |
| .github/workflows/ci-mcp.yml | Matrix-expands MCP integration tests to add an optional Blacksmith shadow. |
| .github/workflows/ci-e2e-playwright.yml | Matrix-expands Playwright job to add an optional Blacksmith shadow and disables shadow side effects (VR, artifacts, comments, deploy). |
| .github/workflows/ci-dagster.yml | Adds runner dimension to Dagster test matrix and avoids artifact writes for shadows. |
| .github/workflows/ci-blacksmith-shadow.yml | New workflow containing Blacksmith-only trimmed shadows for complex backend jobs. |
| .github/workflows/ci-backend.yml | Matrix-expands select backend jobs (repo checks, async migrations) for optional Blacksmith shadows and gates artifact writes. |
| .github/scripts/compare-ci-runners.py | New CLI to fetch CI job durations via gh and compute paired Depot vs Blacksmith medians/p95/speedup. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Speedup was silently contaminated by failed runs — a crash-fast job paired against a successful slow job was reported as a speedup. Now: - Speedup uses only success↔success pairs - Per-runner failure counts (x/y) surface runner-specific instability - Mixed-conclusion pairs are reported separately - New "Unpaired runs" table shows where the shadow didn't fire - CSV gains depot_conclusion/blacksmith_conclusion columns Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Problem
We want to evaluate Blacksmith runners but need apples-to-apples performance data to make an informed cost/performance decision. A one-shot cut-over would make it impossible to compare runners on identical commits and workloads, and a narrow shadow of a few jobs would only cover part of our CI spend.
Changes
For most
runs-onjob that does actual compute on the runner, this PR runs a parallel Blacksmith shadow on the same commit. The trial is gated behind repo variableBLACKSMITH_SHADOW_ENABLEDso it can be toggled without code changes, and every shadow iscontinue-on-error: trueand excluded from required-check lists.Approach chosen per job shape:
runnerdimension. ~6 lines of diff each; side effects (artifact uploads, cache saves) gated to the entry so shadows don't collide.ci-python.yml,ci-proto.yml(3 jobs),ci-mcp.yml,ci-dagster.yml(added runner dim to existing CH-version matrix),ci-rust-flags-integration.yml,ci-nodejs.yml(3 jobs including sharded tests),ci-rust.yml(build + linting),ci-backend.yml(repo-checks + async-migrations),ci-storybook.yml-blacksmithshadow job with a trimmed hardcoded subset (1-2 shards) — fusing dimensions into afromJsonmatrix breaks downstreamneeds.X.outputssemantics.ci-rust.yml→test-blacksmith(2 representative packages)ci-blacksmith-shadow.yml(new file) →turbo-discover-shadow,turbo-tests-shadow,check-migrations-shadow,django-shadowDeliberately skipped: the 11 container-build / CD workflows (
_rust-build-images.yml,cd-*-image.yml,container-images-*.yml,ci-*-container.yml,livestream-docker-image.yml,llm-gateway-cd.yml). They usebuild-push-action, which offloads the build to a cloud builder — the runner is just orchestrator, so Blacksmith would produce near-identical timings. Shadowing them would also risk double-pushing images.Apples-to-apples comparison tool:
.github/scripts/compare-ci-runners.pyscrapes job durations viagh, pairs(workflow, base_job_name, sha)across runners, and reports median / p95 / speedup in markdown or CSV. Pairs work because shadows run on the same SHA as originals and encode the runner label in the job name (for matrix-expanded jobs) or use a-blacksmith/shadowsuffix (for dedicated shadow jobs).Enable: set repo variable
BLACKSMITH_SHADOW_ENABLED=truein Settings → Variables → Actions. Flip tofalseto halt shadows mid-trial without a code change. After a few days, run:How did you test this code?
yaml.safe_loadacross all 10 edited workflow files.github/scripts/check-ci-timeouts.py) — all jobs havetimeout-minutesactionlintvia docker — no new errors introduced (only pre-existing shellcheck/label warnings the repo already tolerates)Open items for human follow-up:
compare-ci-runners.py --days 1after one day's data to sanity-check pairing before leaving shadows on for the full week.Publish to changelog?
no
Docs update
skip-inkeep-docs
🤖 LLM context
Authored with Claude Opus 4.6 (1M context)