Skip to content

[build] Improve CI speed and add job timeouts#17590

Closed
AutomatedTester wants to merge 1 commit into
trunkfrom
worktree-misty-sprouting-cupcake
Closed

[build] Improve CI speed and add job timeouts#17590
AutomatedTester wants to merge 1 commit into
trunkfrom
worktree-misty-sprouting-cupcake

Conversation

@AutomatedTester
Copy link
Copy Markdown
Member

Summary

  • Merge check+read-targets into a single native job (ci.yml): the affected-targets detection is a shell script that reads a cached index file — it doesn't need the full bazel.yml setup (free-disk-space, setup-bazel, disk-checks, artifact upload/download). Collapsing these two sequential jobs into one native job saves ~3–5 min off the critical path on every PR.
  • Remove needs: build from Ruby lint (ci-ruby.yml): lint runs ./go rb:lint which is source-level analysis and doesn't need the compiled artifacts. Now runs in parallel with build, matching the Python pattern.
  • Add timeout-minutes to bazel.yml (default 60) with per-caller overrides: 15 min for lint, 20 min for builds, 30 min for unit tests, 90 min for browser/integration tests. Prevents hung browser tests from holding a runner for 6 hours (the GitHub Actions default).
  • Merge Rust platform builds (ci-rust.yml): consolidate {windows,linux,macos}-{stable,debug} from 6 jobs into 3 combined jobs that build both profiles in one runner, sharing toolchain setup and halving the fan-out.

Test plan

  • Verify Check Targets job completes without Bazel setup (check the job log — no "Setup Bazel" step should appear)
  • Verify Ruby lint and build start at the same time on a Ruby PR
  • Verify a Rust PR shows Windows (stable + debug), Linux (stable + debug), MacOS (stable + debug) jobs instead of 6 separate jobs
  • Confirm artifact names for Rust binaries are unchanged (selenium-manager-windows, selenium-manager-windows-debug, etc.)
  • Confirm ci-success still passes when non-affected language jobs are skipped

- Merge check+read-targets into a single native job in ci.yml, skipping
  the full bazel.yml setup (disk-free, setup-bazel, disk-checks) for
  what is essentially a shell script. Saves 3-5 min off the critical
  path on every PR by eliminating one job + one artifact upload/download.

- Remove needs:build from Ruby lint so lint and build run in parallel,
  matching the Python pattern already in place.

- Add timeout-minutes input to bazel.yml (default 60) and set explicit
  limits per caller: 15-20 min for lint/build, 30 min for unit tests,
  90 min for browser/integration tests. Prevents hung runners from
  holding concurrency slots for 6 hours.

- Merge Rust {windows,linux,macos}-{stable,debug} into three combined
  build jobs that produce both artifacts in one runner, halving the
  platform build fan-out from 6 jobs to 3 and sharing toolchain setup.
@selenium-ci selenium-ci added the B-build Includes scripting, bazel and CI integrations label May 29, 2026
@titusfortner
Copy link
Copy Markdown
Member

Most of this doesn't do what we need and is mistaken in a few places. Things are a lot better optimized than the AI agent thinks they are.

The only we'll actually move the needle on CI speed is to stop running tests that aren't providing value (#17539).
I've been doing a lot of work on Ruby to show the template I'd like to see in other languages (#17550).

The specifics on this PR:

  1. affected_targets is a shell script that calls bazel in 2 places, so it can't be removed from bazel.yml, and we don't want to repeat the bazel.yml logic in ci.yml anyway, the checkout logic alone makes it worth running through bazel.yml. Fetch-depth 2 would break in a number of places. Besides, the bazel.yml overhead with the conditionals we have in place is about ~10s without the disk cleanup script that we need to remove, anyway.
  2. Ruby Lint needs bazel as well
  3. Merging Rust platform builds will actually just slow down the wall-clock because of parallelization. That said, I don't think we need to run debug at all most of the time - @bonigarcia how often do you work with the debug artifacts on PRs? We could gate generation with the same logic as we do for the following release step. But still this is a few minutes of runner time, the real answer to rust performance is, like you said, moving things to bazel ([rust] Allow cross-compilation of selenium-manager on all platforms #17586).
  4. I don't think hard coded job-level timeouts are the right answer either. The rerun with debug step requires the bazel runs to complete so we can get extra info about the failure if we need. If we bail early then we might not have the info we need to actually fix things. Do we have any data to support the benefit here? Looking at the past 1000 jobs, most of the issues look to be Safari related, or MacOS availability starvation which gets us back to reducing how much we're running right now.

@diemol
Copy link
Copy Markdown
Member

diemol commented May 29, 2026

I'm with @titusfortner on this. No one has spent more time optimizing the build than him, and some of these changes are overriding changes he has made. I don't think we need to merge this.

@bonigarcia
Copy link
Copy Markdown
Member

how often do you work with the debug artifacts on PRs?

SM is quite stable now, so the debug artifacts are not strictly required. We can stop generating them, IMO.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

B-build Includes scripting, bazel and CI integrations

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants