Skip to content

.pr_agent_auto_best_practices

qodo-merge-bot edited this page May 27, 2026 · 19 revisions

Pattern 1: When a test or helper mutates global process state (e.g., ENV variables or shared singletons), preserve the original value and restore it in ensure/finally instead of unconditionally deleting/disposing.

Example code before:

old = ENV["FOO"]
ENV["FOO"] = "x"
begin
  # test...
ensure
  ENV.delete("FOO") # clobbers pre-existing value
end

Example code after:

old = ENV.key?("FOO") ? ENV["FOO"] : :__missing__
ENV["FOO"] = "x"
begin
  # test...
ensure
  old == :__missing__ ? ENV.delete("FOO") : ENV["FOO"] = old
end
Relevant past accepted suggestions:
Suggestion 1: [reliability] Test clobbers `SE_CHROMEDRIVER`
Test clobbers `SE_CHROMEDRIVER` The new DriverFinder env-precedence unit test overwrites `ENV['SE_CHROMEDRIVER']` and then unconditionally deletes it in `ensure`, which can clobber a pre-existing value from a developer/CI environment and leak side effects into later examples. This makes the suite potentially order-dependent and flaky across environments, undermining the expectation that tests remain isolated and reliable.

Issue description

The DriverFinder env-precedence spec sets ENV['SE_CHROMEDRIVER'] and then unconditionally calls ENV.delete('SE_CHROMEDRIVER') in ensure, which can wipe a pre-existing value from the parent environment and leak state across examples.

Issue Context

This test is intended to validate env-var precedence, but to comply with the expectation that test changes are reliable across environments (PR Compliance ID 13) and to avoid order-dependent failures, it should not permanently mutate global process environment. Other unit specs in this repo follow a safer pattern by preserving/restoring prior env values rather than always deleting them.

Fix Focus Areas

  • rb/spec/unit/selenium/webdriver/common/driver_finder_spec.rb[36-48]

Suggestion 2: [reliability] Shared driver disposed unsafely
Shared driver disposed unsafely DriverTestFixture disposes the shared EnvironmentManager-held driver without clearing EnvironmentManager's internal driver reference, so later GetCurrentDriver() calls can return a disposed driver. ResetOnError() also disposes before calling CreateFreshDriver(), which then calls CloseCurrentDriver()/Quit on an already-disposed instance and can throw, masking the real test failure.

Issue description

DriverTestFixture currently calls driver.Dispose() in [OneTimeTearDown] and in ResetOnError(). Because the driver is owned/tracked by EnvironmentManager (singleton), disposing the driver directly can leave EnvironmentManager holding a disposed instance and cause later calls to reuse or quit a disposed driver.

Issue Context

  • EnvironmentManager.GetCurrentDriver() returns the cached driver field if non-null.
  • EnvironmentManager.CloseCurrentDriver() is the method that both quits the driver and clears the cached reference (driver = null).

Fix Focus Areas

  • dotnet/test/webdriver/DriverTestFixture.cs[128-148]

Suggested fix

  1. In DriverTestFixture.TearDown() (the [OneTimeTearDown]), replace driver?.Dispose(); with EnvironmentManager.Instance.CloseCurrentDriver(); (and optionally set driver = null; for clarity).
  2. In ResetOnError(), remove the explicit driver?.Dispose(); and just call driver = EnvironmentManager.Instance.CreateFreshDriver(); (since CreateFreshDriver() already closes the current driver via CloseCurrentDriver()).
  3. Ensure there is no remaining path where EnvironmentManager.driver can remain non-null while the underlying driver is disposed.

Suggestion 3: [reliability] close() can hang waiters
close() can hang waiters Index.waitForConnection() depends on a WebSocket 'close'/'error' listener to reject when the socket dies before opening, but Index.close() removes all 'close' listeners before initiating ws.close(), which can remove waitForConnection()’s onFailure handler. If close() is called while a send() is awaiting waitForConnection(), that wait can remain pending indefinitely.

Issue description

waitForConnection() uses this._ws.once('close'|'error', ...) to reject when the socket fails before opening. But close() calls this._ws.removeAllListeners('close') before triggering this._ws.close(), which can remove waitForConnection()’s failure handler if a caller closes while another caller is awaiting connection.

Issue Context

This can lead to an indefinitely pending waitForConnection() (and therefore send()) promise during connection setup/cancellation scenarios.

Fix Focus Areas

  • javascript/selenium-webdriver/bidi/index.js[133-155]
  • javascript/selenium-webdriver/bidi/index.js[272-291]

Suggested approach

  • Avoid removeAllListeners('close') before the socket actually closes. Instead:
    • Remove only the internal close handler(s) you own (store them as named/bound functions so they can be removed explicitly), or
    • Defer removeAllListeners() until inside the once('close', ...) callback (after other listeners like waitForConnection() have had a chance to run).
  • Optionally, make _failPending() also reject/resolve a tracked “connect promise” if you introduce one, so close() can reliably unblock waitForConnection() without relying on WS listener ordering.

Suggestion 4: [reliability] `logs` ArrayList used concurrently
`logs` ArrayList used concurrently A non-thread-safe `ArrayList` is mutated from an async console log handler while being read/cleared on the test thread, which can cause data races and flaky or failing tests. This violates the requirement to ensure concurrent code is race-safe with correct synchronization/visibility.

Issue description

canUnpinScript() uses a non-thread-safe ArrayList (logs) that is mutated by an async console log handler while also being read/cleared by the test thread, creating a data race and potential flakiness.

Issue Context

The console message handler registered via script.addConsoleMessageHandler(...) can be invoked asynchronously. The test concurrently performs assertThat(logs)... and logs.clear() while the handler may still be appending.

Fix Focus Areas

  • java/test/org/openqa/selenium/WebScriptTest.java[260-292]

Pattern 2: In CI/workflows and automation scripts, validate preconditions and external command outputs (artifacts, JSON, API results) and fail-safe on errors (explicit outputs, correct quoting, pagination, and integrity mismatch handling).

Example code before:

echo "exists=$(gh api ... | jq -e 'any(.[]; .name == "x")' && echo "true" || echo "false")" >> "$GITHUB_OUTPUT"
gh api "repos/$REPO/items" | jq '.[] | .id'   # no pagination, schema assumptions
if [ ! -s file.patch ]; then exit 0; fi      # missing treated as empty

Example code after:

set -euo pipefail
exists=false
if json="$(gh api --paginate "repos/$REPO/items" 2>/tmp/err)"; then
  if jq -e --arg n "x" 'any(.[]?; .name==$n)' <<<"$json" >/dev/null; then
    exists=true
  fi
else
  echo "::warning::API failed; using safe default"
  exists=true
fi
echo "exists=$exists" >> "$GITHUB_OUTPUT"
test -f file.patch || { echo "::error::missing patch artifact"; exit 1; }
Relevant past accepted suggestions:
Suggestion 1: [reliability] Cache listing not paginated
Cache listing not paginated `prune-codeql-caches.sh` fetches only 100 caches (`gh cache list --limit 100`), so any matching caches beyond that are never considered for deletion. Since the script itself notes CodeQL creates a new cache per commit, older caches can accumulate beyond 100 and remain unpruned.

Issue description

The script only lists up to 100 caches and prunes within that partial set, which can leave most stale caches untouched once the repository accumulates more than 100 CodeQL caches.

Issue Context

The script comments say CodeQL creates a new overlay-base-database cache per commit; on an active repository, that can exceed 100 entries quickly.

Fix Focus Areas

  • scripts/github-actions/prune-codeql-caches.sh[3-6]
  • scripts/github-actions/prune-codeql-caches.sh[24-28]

Suggested fix

  • Replace the single gh cache list ... --limit 100 call with logic that retrieves all matching caches.
    • Option A: increase --limit to the maximum supported by gh (if sufficiently high for this repo’s scale).
    • Option B (more robust): implement pagination via the GitHub API (gh api) to iterate through all pages of caches, building the rows array from the full result set.
  • Keep the rest of the grouping/deletion logic unchanged so behavior stays “keep newest per group”.

Suggestion 2: [reliability] `git ls-remote origin` without checkout
`git ls-remote origin` without checkout The `detect-branch` job runs `git ls-remote ... origin ...` without checking out the repository or otherwise configuring an `origin` remote, so the step is likely to fail or produce an incorrect `exists` output that prevents downstream `evaluate`/`commit-pins`/`promote` jobs from running even when the target branch exists. This violates the requirement to harden CI/scripts so they behave deterministically and robustly.

Issue description

The detect-branch job calls git ls-remote ... origin ... even though no repository checkout (or equivalent remote setup) occurs, so origin is undefined and the branch-existence detection becomes unreliable; this can incorrectly skip downstream evaluate/commit-pins/promote jobs.

Issue Context

detect-branch runs on a fresh GitHub-hosted runner; without actions/checkout, there is no local git repository and no configured remote named origin, so git ls-remote origin ... can fail or yield incorrect results. Other jobs/workflows that successfully use git ls-remote ... origin do so after actions/checkout, which configures origin, indicating the missing prerequisite here.

Fix Focus Areas

  • .github/workflows/renovate-dependency-pr.yml[25-44]
  • .github/workflows/renovate-dependency-pr.yml[40-44]

Suggestion 3: [reliability] `gh api` output quoting broken
`gh api` output quoting broken In `renovate-dependency-pr.yml`, the `detect-branch` step uses nested, unescaped double-quotes inside an `echo "exists=$(...)"` command substitution when writing to `$GITHUB_OUTPUT`, which can break shell parsing and cause the workflow to fail before setting the `exists` output. Because downstream evaluation/pinning/promotion is gated on `steps.detect.outputs.exists`, this can lead to jobs being skipped or failing, violating the requirement to harden CI scripts with safe, deterministic shell handling.

Issue description

The detect-branch step writes exists=... to $GITHUB_OUTPUT using a command substitution containing nested, unescaped double-quotes (via echo "true" / echo "false" inside an outer echo "exists=$(...)"), which can break shell parsing and prevent the exists output from being set, causing downstream jobs gated on this value to skip or fail.

Issue Context

This is part of a CI workflow and must be robust and deterministic (per PR Compliance ID 17) to avoid silent failures or broken runs. The problematic pattern is effectively echo "exists=$( ... && echo "true" || echo "false")", where the inner quotes can prematurely terminate the outer string; downstream logic depends on needs.detect-branch.outputs.exists being 'true' for the evaluate job (and related evaluation/pinning/promotion) to run.

Fix Focus Areas

  • .github/workflows/renovate-dependency-pr.yml[35-46]

Suggestion 4: [reliability] Missing patch treated as ok
Missing patch treated as ok The new guard `if [ ! -s changes.patch ]; then` treats a missing `changes.patch` the same as an intentionally empty patch, so an artifact download failure/misconfiguration can be silently skipped as “no changes”. Because the download step is `continue-on-error: true`, this can produce a green job without indicating that the expected artifact wasn’t applied.

Issue description

The workflow currently checks only file size (-s) and exits 0 when the patch is missing/empty. With continue-on-error: true on the download step, an actual artifact download failure can be silently treated as a successful no-op.

Issue Context

This reusable workflow is invoked from release/CI workflows where a missing patch may indicate an upstream failure or wrong artifact name.

Fix Focus Areas

  • .github/workflows/commit-changes.yml[41-59]

Suggested fix

  • Emit a ::notice:: before exiting when no patch is present.
  • Optionally, distinguish:
    • if [ ! -f changes.patch ]; then ... (artifact missing)
    • elif [ ! -s changes.patch ]; then ... (empty patch) and decide whether “missing” should fail or remain a no-op based on intended caller behavior.
  • If you want to keep it as a no-op, at minimum log clearly that commit/push is being skipped due to missing/empty patch.

Suggestion 5: [reliability] CI may skip all tests
CI may skip all tests `ci.yml` downloads the `check-targets` artifact with `digest-mismatch: warn` and then treats a missing/unreadable `bazel-targets.txt` as an empty string, which can cause every language job to be skipped. The final `ci-success` gate only fails on failures/cancellations, so a run with all test jobs skipped can still pass.

Issue description

The Read Targets job can succeed while effectively selecting zero test jobs if bazel-targets.txt is missing/unreadable (it falls back to an empty string). Combined with digest-mismatch: warn, this makes it more likely for a corrupted/invalid artifact to allow the workflow to proceed and still report overall success.

Issue Context

Downstream test jobs are guarded by if: needs.read-targets.outputs.<lang> != '', and ci-success only checks for failures/cancellations (skipped jobs do not fail the workflow).

Fix Focus Areas

  • .github/workflows/ci.yml[68-92]
  • .github/workflows/ci.yml[130-141]

Suggested fix

Make the Read Targets job fail (or fall back to running a safe default like the full test suite) when bazel-targets.txt is missing/empty/unparseable. Also consider making digest mismatch fatal for the check-targets artifact specifically, since it drives which tests run.


Suggestion 6: [security] `digest-mismatch` set to `warn`
`digest-mismatch` set to `warn` Multiple CI workflows configure `actions/download-artifact@v8` with `digest-mismatch: warn`, allowing artifact integrity mismatches to proceed without failing the job and weakening determinism. In release/publish flows this is especially risky because the workflows can immediately publish or upload the downloaded artifacts, potentially distributing corrupted outputs despite a detected mismatch.

Issue description

Workflows set digest-mismatch: warn when downloading artifacts via actions/download-artifact@v8, allowing digest mismatches to pass without failing the workflow; in release/publish jobs, this can result in publishing or releasing artifacts even when an integrity mismatch was detected.

Issue Context

The PR updates actions/download-artifact to v8 and explicitly configures digest mismatch handling to warning; per build hardening guidance (PR Compliance ID 14), integrity mismatches should fail deterministically rather than be downgraded. This is particularly important for jobs that publish to PyPI or upload assets to GitHub releases (and other release-like flows), because they use the downloaded artifacts for distribution without any subsequent integrity check.

Fix Focus Areas

  • .github/workflows/ci-build-index.yml[27-30]
  • .github/workflows/ci-rust.yml[238-277]
  • .github/workflows/ci-rust.yml[245-248]
  • .github/workflows/ci.yml[69-72]
  • .github/workflows/commit-changes.yml[49-52]
  • .github/workflows/nightly.yml[80-102]
  • .github/workflows/nightly.yml[89-93]
  • .github/workflows/pin-browsers.yml[35-38]
  • .github/workflows/pre-release.yml[260-264]
  • .github/workflows/release.yml[160-229]
  • .github/workflows/release.yml[171-175]
  • .github/workflows/release.yml[198-203]
  • .github/workflows/release.yml[206-210]
  • .github/workflows/update-documentation.yml[99-103]

Suggestion 7: [reliability] Missing report validation
Missing report validation scripts/update_py_deps.py treats missing packages in pip’s JSON report as “no change”, so it can print “All repo tooling requirements are up to date!” even when version resolution didn’t return entries for some/all requested requirements. This can silently leave py/requirements.txt stale while the command exits successfully.

Issue description

scripts/update_py_deps.py builds installed from report.get("install", []) and then uses installed.get(name_normalized); missing entries are treated as unchanged, and if updates stays empty the script prints a success message and exits. This can mask cases where the report is empty/partial and no versions were actually resolved.

Issue Context

This script is used to update pinned versions in py/requirements.txt and is invoked via Bazel (bazel run //scripts:update_py_deps). A silent success with no updates when resolution data is missing makes dependency maintenance unreliable.

Fix Focus Areas

  • scripts/update_py_deps.py[54-73]

Suggested fix

  • After parsing the report, validate that every name_normalized from packages exists in installed.
    • If any are missing, raise a RuntimeError that includes a helpful diagnostic (e.g., mention missing names and include result.stderr and a truncated result.stdout).
  • Optionally wrap json.loads(result.stdout) in try/except json.JSONDecodeError and raise a clearer error including stderr/stdout context.

Suggestion 8: [reliability] `pip --report` JSON unvalidated
`pip --report` JSON unvalidated The new dependency resolution flow parses `pip --report -` output with `json.loads(result.stdout)` and then assumes specific keys exist, which can crash with non-actionable `JSONDecodeError`/`KeyError` if pip emits non-JSON output or the report schema differs. This violates the requirement to validate external/protocol inputs and raise deterministic, descriptive exceptions.

Issue description

scripts/update_py_deps.py parses pip --report - output without validating that the output is valid JSON or that expected keys exist, which can produce unclear crashes (e.g., JSONDecodeError, KeyError).

Issue Context

This script consumes external/protocol-derived data (pip report JSON). Per compliance, failures should be deterministic and actionable (explicit exceptions indicating what was wrong and how to fix).

Fix Focus Areas

  • scripts/update_py_deps.py[54-56]

Suggested implementation notes

  • Wrap json.loads(result.stdout) in try/except json.JSONDecodeError and raise a RuntimeError (or custom exception) that includes a short message and relevant stderr/stdout context.
  • Validate report is a dict and that report.get("install") is a list.
  • When building installed, use .get() accessors (or explicit checks) and raise a descriptive exception if required fields like metadata.name/metadata.version are missing.

Suggestion 9: [reliability] Release check not fail-safe
Release check not fail-safe The nightly workflow only sets `release-in-progress=true` when the rulesets query succeeds and returns true; if `gh api` fails (e.g., auth/API error), the condition evaluates false and the output is left unset, so nightly releases proceed despite an unknown release state. This can cause the nightly pipeline to run during an active release or when the ruleset check is unavailable.

Issue description

The prepare job’s “Check if release ruleset is active” step only writes release-in-progress=true in the success path. If the gh api call fails, the step produces no output, and downstream logic treats it as not in progress.

Issue Context

needs.prepare.outputs.release-in-progress gates the nightly release job. An unset output is not equal to 'true', so the nightly release will run even when the ruleset check failed.

Fix Focus Areas

  • .github/workflows/nightly.yml[52-61]

Suggested fix

Update the step to be fail-safe and to always write an explicit output value.

For example:

  • Call gh api and check its exit status.
  • If the API call fails, emit a warning and set release-in-progress=true (safe default).
  • Otherwise set release-in-progress=true|false based on the jq result.

(Any equivalent approach that distinguishes “false” from “error” and always sets the output is fine.)


Suggestion 10: [reliability] `gh api` missing pagination
`gh api` missing pagination The workflow captures `existing` rulesets via `gh api` without `--paginate`, which can silently omit rulesets beyond the first page. This can cause non-deterministic behavior (e.g., failing to detect or delete some release rulesets), potentially leaving trunk incorrectly restricted/unrestricted.

Issue description

The workflow lists existing rulesets using gh api "repos/$GH_REPO/rulesets" but does not use --paginate, so it may only read the first page of results.

Issue Context

The create/delete logic relies on the existing JSON to determine whether to POST new rulesets and which IDs to DELETE. If the repository has more rulesets than fit in one API page, some relevant rulesets may be missed.

Fix Focus Areas

  • .github/workflows/restrict-trunk.yml[62-63]
  • .github/workflows/restrict-trunk.yml[77-78]

Suggestion 11: [reliability] Unvalidated `$id` in delete
Unvalidated `$id` in delete The workflow deletes rulesets by using a `$id` derived from `gh api` name matching without validating that exactly one ruleset ID was found, so empty or multiple matches can yield an invalid DELETE URL and non-deterministic failures that can block trunk unlock. This violates the requirement to harden scripts by validating preconditions and handling external/config inputs deterministically.

Issue description

Harden the ruleset deletion logic so it does not blindly construct and call DELETE .../rulesets/$id from gh api output; instead, it must validate that the name lookup yields a safe, deterministic set of IDs (and handle empty/multiple matches explicitly) to avoid invalid URLs and non-deterministic failures during unlock.

Issue Context

This is release automation: the unlock path (Delete release rulesets) runs when unlock-trunk.yml calls the workflow with restrict: false. Rulesets are created via POST in a loop and can be recreated on reruns, so duplicate names/IDs are possible; if the lookup returns empty output (not found) or multiple newline-separated IDs (duplicates), the current deletion step can fail unpredictably, produce unclear errors, and block trunk unlock. Consider also validating that the glob .github/rulesets/release-*.json matches at least one file before looping, and fail with a clear message if none are found.

Fix Focus Areas

  • .github/workflows/restrict-trunk.yml[58-64]
  • .github/workflows/restrict-trunk.yml[65-73]
  • .github/workflows/unlock-trunk.yml[8-16]

Suggestion 12: [reliability] `download-artifact` missing `merge-multiple`
`download-artifact` missing `merge-multiple` The updated `actions/download-artifact@v4` step uses `pattern: documentation-*` but does not set `merge-multiple: true` or a deterministic extraction path, which can place files under an extra per-artifact directory. This can prevent the later `git add docs/api/` from staging the generated docs, breaking the workflow.

Issue description

The workflow now downloads artifacts via pattern: documentation-* without merge-multiple/path, which can extract files into an unexpected per-artifact directory and cause git add docs/api/ to miss the generated documentation.

Issue Context

Previously, the workflow downloaded a single artifact by name. After this PR, artifact names are language-scoped, but the download step should still ensure the extracted files end up where subsequent steps expect them.

Fix Focus Areas

  • .github/workflows/update-documentation.yml[95-105]

Suggested change

Prefer downloading the single expected artifact explicitly (e.g., name: documentation-${{ needs.parse.outputs.language }}), or keep pattern but set merge-multiple: true and an explicit path that matches the later git add docs/api/ staging expectations.


Pattern 3: Keep public API contracts and typed/signature surfaces consistent: avoid breaking removals/behavior changes without deprecation, and ensure signature/type files match runtime methods (including setters) and interface changes are non-breaking.

Example code before:

# Runtime method exists but signature missing / or public method removed without deprecation
class Options
  def browser_name=(v) = (@browser_name = v)
end
# signature file: only declares browser_name, not browser_name=

Example code after:

# Keep runtime + signature aligned, and deprecate before removal
class Options
  def browser_name=(v) = (@browser_name = v)
end
# signature: def browser_name=: (String value) -> void
# If removing a public API: keep it with deprecation warning for a cycle.
Relevant past accepted suggestions:
Suggestion 1: [maintainability] RBS missing browser_name=
RBS missing browser_name= `Safari::Options` now defines `browser_name=` at runtime, but `rb/sig/.../safari/options.rbs` only declares `browser_name` and not the setter, so typed consumers cannot call `options.browser_name = ...` without type errors.

Issue description

Selenium::WebDriver::Safari::Options defines a browser_name= setter in Ruby, but the RBS signature file does not declare this method. This creates a mismatch between runtime API and the typed interface.

Issue Context

The PR introduced/updated Safari::Options#browser_name= in Ruby, and updated the RBS to include browser_name, but the setter signature was not added.

Fix Focus Areas

  • rb/lib/selenium/webdriver/safari/options.rb[39-45]
  • rb/sig/lib/selenium/webdriver/safari/options.rbs[13-16]

Suggested fix

Add a setter signature in rb/sig/lib/selenium/webdriver/safari/options.rbs, e.g.:

  • def browser_name=: (String value) -> void (or return String if you want to model Ruby’s assignment return value).

Suggestion 2: [correctness] `Safari.path` now nil
`Safari.path` now nil `Selenium::WebDriver::Safari.path` now defaults to `nil`, removing the previous behavior of providing/validating a default Safari binary path and raising actionable errors. This is a user-visible behavior change that can break callers relying on a non-nil default or on the earlier deterministic exceptions.

Issue description

Safari.path now memoizes nil, changing the public method’s behavior so it no longer provides a default Safari path nor performs early validation/OS checks. This is a backward-incompatible, user-visible behavior change.

Issue Context

Safari.path is a public singleton method. With the new implementation, any caller expecting a default path (or early, actionable errors) will instead receive nil and may fail later with less clear errors.

Fix Focus Areas

  • rb/lib/selenium/webdriver/safari.rb[48-50]

Suggested approach

  • Reintroduce the previous default path and validation/error behavior (e.g., default to the Safari binary on macOS and raise a clear WebDriverError when unsupported/unavailable).
  • If the new nil default is intentional, add a deprecation path and update callers/documentation accordingly so upgrades do not silently change behavior.

Suggestion 3: [correctness] `Service#env_path` removed
`Service#env_path` removed `Selenium::WebDriver::Service#env_path` was removed without a deprecation path, which is a breaking change for callers and leaves the RBI/RBS contract inconsistent with the runtime implementation. This can trigger runtime `NoMethodError` for existing consumers and also break RBS/Steep validation during upgrades.

Issue description

Selenium::WebDriver::Service#env_path was removed from the Ruby implementation without a deprecation path, but the corresponding RBS signature still declares it, creating an API break for callers and leaving the type/signature contract out of sync with runtime behavior.

Issue Context

  • The Ruby signature file still exposes env_path as a public method (def env_path).
  • The Service implementation no longer defines env_path, so existing callers will hit NoMethodError at runtime.
  • The mismatch will also fail RBS/Steep (or other signature validation) because the method is declared but not implemented.
  • Compliance requires maintaining API compatibility and deprecating public functionality before removal.

Fix Focus Areas

  • rb/lib/selenium/webdriver/common/service.rb[89-105]
  • rb/sig/lib/selenium/webdriver/common/service.rbs[48-60]

Suggestion 4: [maintainability] `waitForConnection()` now rejects
`waitForConnection()` now rejects `waitForConnection()` previously only resolved (or could hang) but now rejects when the socket is closed or closes before opening, which can break callers that do not handle promise rejection. This changes the public API contract/behavior for a user-facing interface.

Issue description

waitForConnection() now rejects on closed/failed connections, which is a user-visible behavioral change that may break downstream code that previously awaited it without handling rejections.

Issue Context

This class is exported from selenium-webdriver/bidi, so changes to promise rejection behavior can be compatibility-impacting.

Fix Focus Areas

  • javascript/selenium-webdriver/bidi/index.js[134-155]

Suggestion 5: [correctness] `Selenium.WebDriver` now signed
`Selenium.WebDriver` now signed The PR enables strong-name signing on existing user-facing assemblies, changing .NET assembly identity (strong-name) and risking ABI incompatibilities for consumers during upgrade. This violates the requirement to maintain API/ABI compatibility such that upgrades should not require user code changes.

Issue description

Enabling strong-name signing on existing user-facing assemblies changes the assembly identity/ABI for consumers. Compliance requires upgrades not to require user code changes due to API/ABI incompatibilities.

Issue Context

The diff adds SignAssembly and AssemblyOriginatorKeyFile to existing .csproj files, which will produce strongly named assemblies under the same package identity.

Fix Focus Areas

  • dotnet/src/webdriver/Selenium.WebDriver.csproj[9-10]
  • dotnet/src/support/Selenium.WebDriver.Support.csproj[10-11]

Suggestion 6: [maintainability] Added methods to `IBiDi`
Added methods to `IBiDi` New members were added to the public `IBiDi` interface, which is a breaking API/ABI change for any downstream implementers of the interface. This violates the requirement to keep public interfaces backward-compatible.

Issue description

IBiDi gained new members, which is a breaking change for any external implementers.

Issue Context

Adding methods to a public interface is an ABI/API breaking change in .NET.

Fix Focus Areas

  • dotnet/src/webdriver/BiDi/IBiDi.cs[60-70]

Pattern 4: Add explicit timeouts/cancellation and deterministic failure paths to async waits (streams, websockets, HTTP calls) so operations cannot hang indefinitely and failures surface promptly with actionable errors.

Example code before:

# C#
var msg = await stream.Where(x => x.Type == "done").FirstAsync(); // can hang forever

# Ruby
body = Net::HTTP.get(URI(url)) # no timeouts

Example code after:

// C#
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
var msg = await stream.Where(x => x.Type == "done").FirstAsync(cts.Token);

# Ruby
uri = URI(url)
Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https",
                open_timeout: 5, read_timeout: 10) do |http|
  http.get(uri.request_uri).body
end
Relevant past accepted suggestions:
Suggestion 1: [reliability] No HTTP timeouts
No HTTP timeouts maven_stable_release uses Net::HTTP.get without open/read timeouts, so `rake java:update` can hang indefinitely on network stalls. This makes dependency updates flaky and can block CI/developer workflows.

Issue description

maven_stable_release fetches maven-metadata.xml via Net::HTTP.get without any timeouts, which can cause rake java:update to hang indefinitely if Maven Central is slow or the connection stalls.

Issue Context

This rake file already uses Net::HTTP.start with explicit open_timeout and read_timeout for other HTTP calls, so timeouts are an established reliability pattern here.

Fix Focus Areas

  • rake_tasks/java.rake[310-323]
  • rake_tasks/java.rake[101-114]

Suggested fix

Replace Net::HTTP.get(URI(url)) with a Net::HTTP.start call that sets use_ssl: true plus open_timeout / read_timeout, and read the response body via Net::HTTP::Get so the call fails fast instead of hanging.


Suggestion 2: [reliability] Hanging stream LINQ query
Hanging stream LINQ query NetworkTests.CanDisownData awaits FirstAsync() on an event stream without any cancellation/timeout, so if the matching ResponseCompleted event never arrives the test can hang indefinitely and stall the suite.

Issue description

CanDisownData awaits an async stream LINQ query ending in FirstAsync() without providing a CancellationToken or any timeout. If no matching event is produced, the test can hang indefinitely.

Issue Context

Other tests in this repo bound async stream waits using a CancellationTokenSource(TimeSpan...) (e.g., sub.FirstAsync(cts.Token)) or WaitAsync(TimeSpan...) to avoid CI stalls.

Fix Focus Areas

  • dotnet/test/webdriver/BiDi/Network/NetworkTests.cs[290-304]

Suggested change

Wrap the FirstAsync with a timeout, e.g.:

  • create using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
  • call await stream.Where(...).Select(...).FirstAsync(cts.Token); (or equivalently use WaitAsync(TimeSpan.FromSeconds(5)) on the returned task/value-task).

Suggestion 3: [reliability] send() can hang forever
send() can hang forever After a socket 'close'/'error', _failPending() sets connected=false, but send() will then await waitForConnection(), which only resolves on a future 'open' event; on a closed/failed WebSocket this can leave new send() calls pending indefinitely. This can stall higher-level BiDi helpers that call bidi.send() without checking connection state.

Issue description

After _failPending() runs, connected becomes false, and future send() calls can await waitForConnection() forever because there will be no future 'open' event on a closed/failed socket.

Issue Context

_failPending() is triggered on the underlying WebSocket 'close'/'error' events and marks the connection closed. Many BiDi helper classes call bidi.send() directly; if the connection drops and a command is attempted afterward, the call can hang indefinitely.

Fix Focus Areas

  • javascript/selenium-webdriver/bidi/index.js[91-101]

Suggested fix

Add a fast-fail guard in send() (or waitForConnection()):

  • If this._closed is true, immediately reject/throw a clear error.
  • Also consider checking this._ws.readyState and rejecting when CLOSING/CLOSED.
  • Optionally, have waitForConnection() reject on 'close'/'error' if it is waiting for 'open'.

Suggestion 4: [reliability] Unexpected close stalls sends
Unexpected close stalls sends Index only rejects pending send() calls on timeout or explicit close(); it does not handle underlying WebSocket 'close'/'error' events. If the peer disconnects mid-request, callers can hang until RESPONSE_TIMEOUT (30s) instead of failing immediately.

Issue description

Index tracks in-flight requests in _pending, but it does not subscribe to the underlying WebSocket 'close' / 'error' events. If the connection drops without an explicit Index.close() call, pending send() promises will only fail after RESPONSE_TIMEOUT, delaying failure propagation.

Issue Context

  • close() now correctly rejects _pending when invoked, but unexpected disconnects still leave callers waiting up to 30 seconds.
  • Since there is now a single dispatcher and centralized _pending, the most reliable place to fail outstanding requests is in a single socket-level close/error handler.

Fix Focus Areas

  • javascript/selenium-webdriver/bidi/index.js[33-63]
  • javascript/selenium-webdriver/bidi/index.js[217-241]

Suggested approach

  • Add _ws.on('close', ...) and _ws.on('error', ...) handlers in the constructor.
  • In those handlers:
    • set this.connected = false
    • iterate _pending.values() and clearTimeout(timeoutId) then reject(new Error(...))
    • _pending.clear()
  • Ensure handlers are idempotent (e.g., guard with a boolean like this._closed = true) so close() and 'close' event don’t double-reject.
  • (Optional) extend tests to simulate server-side disconnect and assert pending sends reject promptly (without waiting for the timeout).

Pattern 5: In concurrent/resource-heavy code paths, make cleanup exception-safe and bounded: ensure Dispose/teardown ordering is correct, remove per-consumer registrations, and cap buffers/queues while releasing ref-counted resources on all terminal paths.

Example code before:

// Pseudocode
pending = new Deque()
onMessage(frame): pending.add(frame)  // unbounded
onClose(): closed = true; return      // pending frames never released if upgrade never happens

async DisposeAsync():
  dispatcher.Dispose()               // disposed before processor drains
  await processingTask

Example code after:

const int MaxPending = 1000;
onMessage(frame):
  if (pending.size >= MaxPending) { frame.release(); markClosed(); return; }
  pending.add(frame)

onClose():
  closed = true
  while (pending.notEmpty()) pending.pop().release()

async DisposeAsync():
  try { await processingTask; }
  finally {
    await dispatcher.DisposeAsync();
    await transport.DisposeAsync();
  }
Relevant past accepted suggestions:
Suggestion 1: [reliability] Pending frames leak onClose
Pending frames leak onClose On a pre-handshake upstream `onClose`, the listener records the close and closes the `HttpClient` but intentionally keeps buffered `WebSocketFrame`s for a future `onUpgrade` drain. If the client-side upgrade never completes (e.g., client disconnects mid-handshake), those buffered ref-counted frames are never released, risking a Netty buffer leak.

Issue description

DirectForwardingListener keeps pre-handshake buffered WebSocketFrames on upstream onClose so they can be drained during onUpgrade. If the client-side upgrade never happens, those buffered (ref-counted) frames never get released, creating a potential memory/leak-detector issue.

Issue Context

The code already treats buffered frames as ref-counted (it calls frame.release() on other terminal paths like overflow), but the pre-handshake onClose path retains the buffer without any guaranteed later release.

Fix Focus Areas

  • java/src/org/openqa/selenium/grid/router/ProxyWebsocketsIntoGrid.java[333-353]
  • java/src/org/openqa/selenium/grid/router/ProxyWebsocketsIntoGrid.java[295-313]

Suggestion 2: [reliability] Unbounded frame buffer
Unbounded frame buffer DirectForwardingListener buffers pre-upgrade frames in an unbounded deque with no cap, so if the client upgrade is delayed/stalled while upstream is producing frames, memory can grow without bound (including retaining ref-counted frame buffers). This risks router memory pressure or OOM under pathological conditions.

Issue description

DirectForwardingListener uses an unbounded Deque<WebSocketFrame> to buffer frames arriving before onUpgrade(Channel) is called. If the downstream upgrade stalls while upstream continues sending, the deque can grow without limit, retaining ref-counted frames and risking memory exhaustion.

Issue Context

  • Buffering is correct for ordering, but needs a safety valve.

Fix approach

  • Add a hard limit (by frame count and/or total bytes) for pending.
  • When the limit is exceeded, choose a deterministic policy: drop newest/oldest with release(), and/or close the downstream channel once available; also log at WARNING with session context.
  • Consider a time-based cutoff (e.g., if upgrade hasn’t completed within N seconds, discard pending and mark closed).

Fix Focus Areas

  • java/src/org/openqa/selenium/grid/router/ProxyWebsocketsIntoGrid.java[220-285]
  • java/src/org/openqa/selenium/grid/router/ProxyWebsocketsIntoGrid.java[331-336]

Suggestion 3: [reliability] Unsubscribe cleanup not exception-safe
Unsubscribe cleanup not exception-safe Broker.UnsubscribeAsync only removes the subscription from the local registry after the remote UnsubscribeAsync succeeds, so if the transport call throws the subscription remains registered and still traversed for every event. Subscription.DisposeAsync also sets its disposed flag before awaiting UnsubscribeAsync, preventing retry after a transient failure and increasing the chance of a permanent local leak.

Issue description

Broker.UnsubscribeAsync performs local cleanup (removing from _subscriptions) only after the remote unsubscribe completes successfully. If the remote call throws, the local subscription remains in the registry, and the broker continues iterating it on every event.

Additionally, Subscription.DisposeAsync sets _disposed before awaiting UnsubscribeAsync(). If UnsubscribeAsync() throws, the subscription becomes non-retryable for disposal/unsubscribe, making the local leak harder to recover from.

Issue Context

  • Local cleanup should be resilient to transport/remote failures.
  • Even if the remote unsubscribe fails, local detachment should happen to stop further event delivery and avoid retaining stale subscriptions.

Fix Focus Areas

  • dotnet/src/webdriver/BiDi/Broker.cs[93-101]
  • dotnet/src/webdriver/BiDi/Subscription.cs[50-57]

Suggested change

  1. Wrap the remote unsubscribe in try/finally so local registry removal always runs:
public async ValueTask UnsubscribeAsync(Subscription subscription, CancellationToken cancellationToken)
{
  try
  {
    await _bidi.Session.UnsubscribeAsync([subscription.SubscriptionId], null, cancellationToken).ConfigureAwait(false);
  }
  finally
  {
    if (_subscriptions.TryGetValue(subscription.EventName, out var registry))
      registry.Remove(subscription);
  }
}
  1. Consider making Subscription.DisposeAsync retry-safe by only setting _disposed after a successful unsubscribe, or resetting _disposed on failure (so callers can retry), e.g. try { ... } catch { Interlocked.Exchange(ref _disposed, 0); throw; }.

This prevents permanent local leaks when the transport is flaky or shutting down.


Suggestion 4: [reliability] Enumerator child-channel leak
Enumerator child-channel leak Subscription.GetAsyncEnumerator() adds a per-enumerator child Channel to _children but never removes it when enumeration ends or is cancelled, so DrainAsync keeps writing events into abandoned channels for the lifetime of the subscription. This can cause unbounded memory growth and retain references even after consumers stop iterating.

Issue description

Subscription<TEventArgs>.GetAsyncEnumerator() registers a per-enumerator child Channel<TEventArgs> in _children but never unregisters it when the consumer stops iterating (natural completion, DisposeAsync, or cancellation). As a result, DrainAsync() continues to write every event into channels that no longer have active readers, leading to unbounded buffering/memory growth and retained references.

Issue Context

  • Child channels are stored in _children and written-to on every event.
  • Cancellation of the returned async iterator does not trigger any code that removes the child from _children.

Fix Focus Areas

  • dotnet/src/webdriver/BiDi/Subscription.cs[91-107]
  • dotnet/src/webdriver/BiDi/Subscription.cs[118-157]

Suggested change

Implement enumeration via an instance async IAsyncEnumerable<TEventArgs> that:

  1. creates and registers the child channel,
  2. yields items from child.Reader.ReadAllAsync(cancellationToken),
  3. in a finally block removes the child from _children and completes the child writer.

Example shape:

public IAsyncEnumerator<TEventArgs> GetAsyncEnumerator(CancellationToken ct = default)
  => Enumerate(ct).GetAsyncEnumerator(ct);

private async IAsyncEnumerable<TEventArgs> Enumerate([EnumeratorCancellation] CancellationToken ct)
{
  var child = Channel.CreateUnbounded<TEventArgs>(new UnboundedChannelOptions { SingleReader = true, SingleWriter = true });
  lock (_children) _children.Add(child);
  try
  {
    await foreach (var item in child.Reader.ReadAllAsync(ct).ConfigureAwait(false))
      yield return item;
  }
  finally
  {
    lock (_children) _children.Remove(child);
    child.Writer.TryComplete();
  }
}

This ensures abandoned enumerators do not keep accumulating writes.


Suggestion 5: [reliability] `DisposeAsync` disposes `_eventDispatcher` early
`DisposeAsync` disposes `_eventDispatcher` early `Broker.DisposeAsync()` disposes `_eventDispatcher` before awaiting `_processingTask`, so `ProcessMessagesAsync()` can still call `_eventDispatcher.EnqueueEvent()` after the dispatcher has completed its channel, silently dropping in-flight events. This is an unsafe shutdown ordering in concurrent code that can lead to lost/partial event delivery during disposal.

Issue description

Broker.DisposeAsync() disposes _eventDispatcher before _processingTask has completed. Since ProcessMessagesAsync() can still call _eventDispatcher.EnqueueEvent() while draining _receivedMessages, events can be silently dropped because EventDispatcher.DisposeAsync() completes _pendingEvents.Writer.

Issue Context

The new producer/consumer split means message processing can outlive receiving and can continue enqueuing events during shutdown. Disposal ordering should ensure the processing pipeline has fully drained (or is explicitly prevented from enqueuing) before tearing down the dispatcher.

Fix Focus Areas

  • dotnet/src/webdriver/BiDi/Broker.cs[121-137]
  • dotnet/src/webdriver/BiDi/Broker.cs[353-376]
  • dotnet/src/webdriver/BiDi/EventDispatcher.cs[74-77]
  • dotnet/src/webdriver/BiDi/EventDispatcher.cs[118-127]

Suggestion 6: [reliability] Dispose skips cleanup on fault
Dispose skips cleanup on fault DisposeAsync will abort before disposing the transport and pooled buffers if either background task faults with a non-cancellation exception. This can leak the underlying transport and pooled arrays when ReceiveMessagesAsync throws (it rethrows on non-cancellation failures).

Issue description

Broker.DisposeAsync() can exit early (throw) when _receivingTask or _processingTask faults, skipping disposal of _transport and pooled buffers. This can leak sockets and ArrayPool-rented memory.

Issue Context

ReceiveMessagesAsync rethrows on non-cancellation exceptions, so the receiving task can fault during normal runtime failures (e.g., remote closes websocket).

Fix Focus Areas

  • dotnet/src/webdriver/BiDi/Broker.cs[121-148]
  • dotnet/src/webdriver/BiDi/Broker.cs[299-351]

Suggested approach

  • Restructure DisposeAsync using try/finally so that transport and buffer cleanup always runs.
  • Optionally capture exceptions from _receivingTask/_processingTask and rethrow after cleanup (or aggregate them) so disposal remains deterministic but still reports failures.
  • Ensure _processingTask is awaited/observed even if _receivingTask faults, to avoid leaving work running in the background.

[Auto-generated best practices - 2026-05-27]

Clone this wiki locally