-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
.pr_agent_auto_best_practices
Pattern 1: When a test or helper mutates global process state (e.g., ENV variables or shared singletons), preserve the original value and restore it in ensure/finally instead of unconditionally deleting/disposing.
Example code before:
old = ENV["FOO"]
ENV["FOO"] = "x"
begin
# test...
ensure
ENV.delete("FOO") # clobbers pre-existing value
end
Example code after:
old = ENV.key?("FOO") ? ENV["FOO"] : :__missing__
ENV["FOO"] = "x"
begin
# test...
ensure
old == :__missing__ ? ENV.delete("FOO") : ENV["FOO"] = old
end
Relevant past accepted suggestions:
Suggestion 1:
[reliability] Test clobbers `SE_CHROMEDRIVER`
Test clobbers `SE_CHROMEDRIVER`
The new DriverFinder env-precedence unit test overwrites `ENV['SE_CHROMEDRIVER']` and then unconditionally deletes it in `ensure`, which can clobber a pre-existing value from a developer/CI environment and leak side effects into later examples. This makes the suite potentially order-dependent and flaky across environments, undermining the expectation that tests remain isolated and reliable.The DriverFinder env-precedence spec sets ENV['SE_CHROMEDRIVER'] and then unconditionally calls ENV.delete('SE_CHROMEDRIVER') in ensure, which can wipe a pre-existing value from the parent environment and leak state across examples.
This test is intended to validate env-var precedence, but to comply with the expectation that test changes are reliable across environments (PR Compliance ID 13) and to avoid order-dependent failures, it should not permanently mutate global process environment. Other unit specs in this repo follow a safer pattern by preserving/restoring prior env values rather than always deleting them.
- rb/spec/unit/selenium/webdriver/common/driver_finder_spec.rb[36-48]
Suggestion 2:
[reliability] Shared driver disposed unsafely
Shared driver disposed unsafely
DriverTestFixture disposes the shared EnvironmentManager-held driver without clearing EnvironmentManager's internal driver reference, so later GetCurrentDriver() calls can return a disposed driver. ResetOnError() also disposes before calling CreateFreshDriver(), which then calls CloseCurrentDriver()/Quit on an already-disposed instance and can throw, masking the real test failure.DriverTestFixture currently calls driver.Dispose() in [OneTimeTearDown] and in ResetOnError(). Because the driver is owned/tracked by EnvironmentManager (singleton), disposing the driver directly can leave EnvironmentManager holding a disposed instance and cause later calls to reuse or quit a disposed driver.
-
EnvironmentManager.GetCurrentDriver()returns the cacheddriverfield if non-null. -
EnvironmentManager.CloseCurrentDriver()is the method that both quits the driver and clears the cached reference (driver = null).
- dotnet/test/webdriver/DriverTestFixture.cs[128-148]
- In
DriverTestFixture.TearDown()(the[OneTimeTearDown]), replacedriver?.Dispose();withEnvironmentManager.Instance.CloseCurrentDriver();(and optionally setdriver = null;for clarity). - In
ResetOnError(), remove the explicitdriver?.Dispose();and just calldriver = EnvironmentManager.Instance.CreateFreshDriver();(sinceCreateFreshDriver()already closes the current driver viaCloseCurrentDriver()). - Ensure there is no remaining path where
EnvironmentManager.drivercan remain non-null while the underlying driver is disposed.
Suggestion 3:
[reliability] close() can hang waiters
close() can hang waiters
Index.waitForConnection() depends on a WebSocket 'close'/'error' listener to reject when the socket dies before opening, but Index.close() removes all 'close' listeners before initiating ws.close(), which can remove waitForConnection()’s onFailure handler. If close() is called while a send() is awaiting waitForConnection(), that wait can remain pending indefinitely.waitForConnection() uses this._ws.once('close'|'error', ...) to reject when the socket fails before opening. But close() calls this._ws.removeAllListeners('close') before triggering this._ws.close(), which can remove waitForConnection()’s failure handler if a caller closes while another caller is awaiting connection.
This can lead to an indefinitely pending waitForConnection() (and therefore send()) promise during connection setup/cancellation scenarios.
- javascript/selenium-webdriver/bidi/index.js[133-155]
- javascript/selenium-webdriver/bidi/index.js[272-291]
- Avoid
removeAllListeners('close')before the socket actually closes. Instead:- Remove only the internal close handler(s) you own (store them as named/bound functions so they can be removed explicitly), or
- Defer
removeAllListeners()until inside theonce('close', ...)callback (after other listeners likewaitForConnection()have had a chance to run).
- Optionally, make
_failPending()also reject/resolve a tracked “connect promise” if you introduce one, soclose()can reliably unblockwaitForConnection()without relying on WS listener ordering.
Suggestion 4:
[reliability] `logs` ArrayList used concurrently
`logs` ArrayList used concurrently
A non-thread-safe `ArrayList` is mutated from an async console log handler while being read/cleared on the test thread, which can cause data races and flaky or failing tests. This violates the requirement to ensure concurrent code is race-safe with correct synchronization/visibility.canUnpinScript() uses a non-thread-safe ArrayList (logs) that is mutated by an async console log handler while also being read/cleared by the test thread, creating a data race and potential flakiness.
The console message handler registered via script.addConsoleMessageHandler(...) can be invoked asynchronously. The test concurrently performs assertThat(logs)... and logs.clear() while the handler may still be appending.
- java/test/org/openqa/selenium/WebScriptTest.java[260-292]
Pattern 2: In CI/workflows and automation scripts, validate preconditions and external command outputs (artifacts, JSON, API results) and fail-safe on errors (explicit outputs, correct quoting, pagination, and integrity mismatch handling).
Example code before:
echo "exists=$(gh api ... | jq -e 'any(.[]; .name == "x")' && echo "true" || echo "false")" >> "$GITHUB_OUTPUT"
gh api "repos/$REPO/items" | jq '.[] | .id' # no pagination, schema assumptions
if [ ! -s file.patch ]; then exit 0; fi # missing treated as empty
Example code after:
set -euo pipefail
exists=false
if json="$(gh api --paginate "repos/$REPO/items" 2>/tmp/err)"; then
if jq -e --arg n "x" 'any(.[]?; .name==$n)' <<<"$json" >/dev/null; then
exists=true
fi
else
echo "::warning::API failed; using safe default"
exists=true
fi
echo "exists=$exists" >> "$GITHUB_OUTPUT"
test -f file.patch || { echo "::error::missing patch artifact"; exit 1; }
Relevant past accepted suggestions:
Suggestion 1:
[reliability] Cache listing not paginated
Cache listing not paginated
`prune-codeql-caches.sh` fetches only 100 caches (`gh cache list --limit 100`), so any matching caches beyond that are never considered for deletion. Since the script itself notes CodeQL creates a new cache per commit, older caches can accumulate beyond 100 and remain unpruned.The script only lists up to 100 caches and prunes within that partial set, which can leave most stale caches untouched once the repository accumulates more than 100 CodeQL caches.
The script comments say CodeQL creates a new overlay-base-database cache per commit; on an active repository, that can exceed 100 entries quickly.
- scripts/github-actions/prune-codeql-caches.sh[3-6]
- scripts/github-actions/prune-codeql-caches.sh[24-28]
- Replace the single
gh cache list ... --limit 100call with logic that retrieves all matching caches.- Option A: increase
--limitto the maximum supported bygh(if sufficiently high for this repo’s scale). - Option B (more robust): implement pagination via the GitHub API (
gh api) to iterate through all pages of caches, building therowsarray from the full result set.
- Option A: increase
- Keep the rest of the grouping/deletion logic unchanged so behavior stays “keep newest per group”.
Suggestion 2:
[reliability] `git ls-remote origin` without checkout
`git ls-remote origin` without checkout
The `detect-branch` job runs `git ls-remote ... origin ...` without checking out the repository or otherwise configuring an `origin` remote, so the step is likely to fail or produce an incorrect `exists` output that prevents downstream `evaluate`/`commit-pins`/`promote` jobs from running even when the target branch exists. This violates the requirement to harden CI/scripts so they behave deterministically and robustly.The detect-branch job calls git ls-remote ... origin ... even though no repository checkout (or equivalent remote setup) occurs, so origin is undefined and the branch-existence detection becomes unreliable; this can incorrectly skip downstream evaluate/commit-pins/promote jobs.
detect-branch runs on a fresh GitHub-hosted runner; without actions/checkout, there is no local git repository and no configured remote named origin, so git ls-remote origin ... can fail or yield incorrect results. Other jobs/workflows that successfully use git ls-remote ... origin do so after actions/checkout, which configures origin, indicating the missing prerequisite here.
- .github/workflows/renovate-dependency-pr.yml[25-44]
- .github/workflows/renovate-dependency-pr.yml[40-44]
Suggestion 3:
[reliability] `gh api` output quoting broken
`gh api` output quoting broken
In `renovate-dependency-pr.yml`, the `detect-branch` step uses nested, unescaped double-quotes inside an `echo "exists=$(...)"` command substitution when writing to `$GITHUB_OUTPUT`, which can break shell parsing and cause the workflow to fail before setting the `exists` output. Because downstream evaluation/pinning/promotion is gated on `steps.detect.outputs.exists`, this can lead to jobs being skipped or failing, violating the requirement to harden CI scripts with safe, deterministic shell handling.The detect-branch step writes exists=... to $GITHUB_OUTPUT using a command substitution containing nested, unescaped double-quotes (via echo "true" / echo "false" inside an outer echo "exists=$(...)"), which can break shell parsing and prevent the exists output from being set, causing downstream jobs gated on this value to skip or fail.
This is part of a CI workflow and must be robust and deterministic (per PR Compliance ID 17) to avoid silent failures or broken runs. The problematic pattern is effectively echo "exists=$( ... && echo "true" || echo "false")", where the inner quotes can prematurely terminate the outer string; downstream logic depends on needs.detect-branch.outputs.exists being 'true' for the evaluate job (and related evaluation/pinning/promotion) to run.
- .github/workflows/renovate-dependency-pr.yml[35-46]
Suggestion 4:
[reliability] Missing patch treated as ok
Missing patch treated as ok
The new guard `if [ ! -s changes.patch ]; then` treats a missing `changes.patch` the same as an intentionally empty patch, so an artifact download failure/misconfiguration can be silently skipped as “no changes”. Because the download step is `continue-on-error: true`, this can produce a green job without indicating that the expected artifact wasn’t applied.The workflow currently checks only file size (-s) and exits 0 when the patch is missing/empty. With continue-on-error: true on the download step, an actual artifact download failure can be silently treated as a successful no-op.
This reusable workflow is invoked from release/CI workflows where a missing patch may indicate an upstream failure or wrong artifact name.
- .github/workflows/commit-changes.yml[41-59]
- Emit a
::notice::before exiting when no patch is present. - Optionally, distinguish:
-
if [ ! -f changes.patch ]; then ...(artifact missing) -
elif [ ! -s changes.patch ]; then ...(empty patch) and decide whether “missing” should fail or remain a no-op based on intended caller behavior.
-
- If you want to keep it as a no-op, at minimum log clearly that commit/push is being skipped due to missing/empty patch.
Suggestion 5:
[reliability] CI may skip all tests
CI may skip all tests
`ci.yml` downloads the `check-targets` artifact with `digest-mismatch: warn` and then treats a missing/unreadable `bazel-targets.txt` as an empty string, which can cause every language job to be skipped. The final `ci-success` gate only fails on failures/cancellations, so a run with all test jobs skipped can still pass.The Read Targets job can succeed while effectively selecting zero test jobs if bazel-targets.txt is missing/unreadable (it falls back to an empty string). Combined with digest-mismatch: warn, this makes it more likely for a corrupted/invalid artifact to allow the workflow to proceed and still report overall success.
Downstream test jobs are guarded by if: needs.read-targets.outputs.<lang> != '', and ci-success only checks for failures/cancellations (skipped jobs do not fail the workflow).
- .github/workflows/ci.yml[68-92]
- .github/workflows/ci.yml[130-141]
Make the Read Targets job fail (or fall back to running a safe default like the full test suite) when bazel-targets.txt is missing/empty/unparseable. Also consider making digest mismatch fatal for the check-targets artifact specifically, since it drives which tests run.
Suggestion 6:
[security] `digest-mismatch` set to `warn`
`digest-mismatch` set to `warn`
Multiple CI workflows configure `actions/download-artifact@v8` with `digest-mismatch: warn`, allowing artifact integrity mismatches to proceed without failing the job and weakening determinism. In release/publish flows this is especially risky because the workflows can immediately publish or upload the downloaded artifacts, potentially distributing corrupted outputs despite a detected mismatch.Workflows set digest-mismatch: warn when downloading artifacts via actions/download-artifact@v8, allowing digest mismatches to pass without failing the workflow; in release/publish jobs, this can result in publishing or releasing artifacts even when an integrity mismatch was detected.
The PR updates actions/download-artifact to v8 and explicitly configures digest mismatch handling to warning; per build hardening guidance (PR Compliance ID 14), integrity mismatches should fail deterministically rather than be downgraded. This is particularly important for jobs that publish to PyPI or upload assets to GitHub releases (and other release-like flows), because they use the downloaded artifacts for distribution without any subsequent integrity check.
- .github/workflows/ci-build-index.yml[27-30]
- .github/workflows/ci-rust.yml[238-277]
- .github/workflows/ci-rust.yml[245-248]
- .github/workflows/ci.yml[69-72]
- .github/workflows/commit-changes.yml[49-52]
- .github/workflows/nightly.yml[80-102]
- .github/workflows/nightly.yml[89-93]
- .github/workflows/pin-browsers.yml[35-38]
- .github/workflows/pre-release.yml[260-264]
- .github/workflows/release.yml[160-229]
- .github/workflows/release.yml[171-175]
- .github/workflows/release.yml[198-203]
- .github/workflows/release.yml[206-210]
- .github/workflows/update-documentation.yml[99-103]
Suggestion 7:
[reliability] Missing report validation
Missing report validation
scripts/update_py_deps.py treats missing packages in pip’s JSON report as “no change”, so it can print “All repo tooling requirements are up to date!” even when version resolution didn’t return entries for some/all requested requirements. This can silently leave py/requirements.txt stale while the command exits successfully.scripts/update_py_deps.py builds installed from report.get("install", []) and then uses installed.get(name_normalized); missing entries are treated as unchanged, and if updates stays empty the script prints a success message and exits. This can mask cases where the report is empty/partial and no versions were actually resolved.
This script is used to update pinned versions in py/requirements.txt and is invoked via Bazel (bazel run //scripts:update_py_deps). A silent success with no updates when resolution data is missing makes dependency maintenance unreliable.
- scripts/update_py_deps.py[54-73]
- After parsing the report, validate that every
name_normalizedfrompackagesexists ininstalled.- If any are missing, raise a
RuntimeErrorthat includes a helpful diagnostic (e.g., mention missing names and includeresult.stderrand a truncatedresult.stdout).
- If any are missing, raise a
- Optionally wrap
json.loads(result.stdout)intry/except json.JSONDecodeErrorand raise a clearer error including stderr/stdout context.
Suggestion 8:
[reliability] `pip --report` JSON unvalidated
`pip --report` JSON unvalidated
The new dependency resolution flow parses `pip --report -` output with `json.loads(result.stdout)` and then assumes specific keys exist, which can crash with non-actionable `JSONDecodeError`/`KeyError` if pip emits non-JSON output or the report schema differs. This violates the requirement to validate external/protocol inputs and raise deterministic, descriptive exceptions.scripts/update_py_deps.py parses pip --report - output without validating that the output is valid JSON or that expected keys exist, which can produce unclear crashes (e.g., JSONDecodeError, KeyError).
This script consumes external/protocol-derived data (pip report JSON). Per compliance, failures should be deterministic and actionable (explicit exceptions indicating what was wrong and how to fix).
- scripts/update_py_deps.py[54-56]
- Wrap
json.loads(result.stdout)intry/except json.JSONDecodeErrorand raise aRuntimeError(or custom exception) that includes a short message and relevantstderr/stdoutcontext. - Validate
reportis a dict and thatreport.get("install")is a list. - When building
installed, use.get()accessors (or explicit checks) and raise a descriptive exception if required fields likemetadata.name/metadata.versionare missing.
Suggestion 9:
[reliability] Release check not fail-safe
Release check not fail-safe
The nightly workflow only sets `release-in-progress=true` when the rulesets query succeeds and returns true; if `gh api` fails (e.g., auth/API error), the condition evaluates false and the output is left unset, so nightly releases proceed despite an unknown release state. This can cause the nightly pipeline to run during an active release or when the ruleset check is unavailable.The prepare job’s “Check if release ruleset is active” step only writes release-in-progress=true in the success path. If the gh api call fails, the step produces no output, and downstream logic treats it as not in progress.
needs.prepare.outputs.release-in-progress gates the nightly release job. An unset output is not equal to 'true', so the nightly release will run even when the ruleset check failed.
- .github/workflows/nightly.yml[52-61]
Update the step to be fail-safe and to always write an explicit output value.
For example:
- Call
gh apiand check its exit status. - If the API call fails, emit a warning and set
release-in-progress=true(safe default). - Otherwise set
release-in-progress=true|falsebased on the jq result.
(Any equivalent approach that distinguishes “false” from “error” and always sets the output is fine.)
Suggestion 10:
[reliability] `gh api` missing pagination
`gh api` missing pagination
The workflow captures `existing` rulesets via `gh api` without `--paginate`, which can silently omit rulesets beyond the first page. This can cause non-deterministic behavior (e.g., failing to detect or delete some release rulesets), potentially leaving trunk incorrectly restricted/unrestricted.The workflow lists existing rulesets using gh api "repos/$GH_REPO/rulesets" but does not use --paginate, so it may only read the first page of results.
The create/delete logic relies on the existing JSON to determine whether to POST new rulesets and which IDs to DELETE. If the repository has more rulesets than fit in one API page, some relevant rulesets may be missed.
- .github/workflows/restrict-trunk.yml[62-63]
- .github/workflows/restrict-trunk.yml[77-78]
Suggestion 11:
[reliability] Unvalidated `$id` in delete
Unvalidated `$id` in delete
The workflow deletes rulesets by using a `$id` derived from `gh api` name matching without validating that exactly one ruleset ID was found, so empty or multiple matches can yield an invalid DELETE URL and non-deterministic failures that can block trunk unlock. This violates the requirement to harden scripts by validating preconditions and handling external/config inputs deterministically.Harden the ruleset deletion logic so it does not blindly construct and call DELETE .../rulesets/$id from gh api output; instead, it must validate that the name lookup yields a safe, deterministic set of IDs (and handle empty/multiple matches explicitly) to avoid invalid URLs and non-deterministic failures during unlock.
This is release automation: the unlock path (Delete release rulesets) runs when unlock-trunk.yml calls the workflow with restrict: false. Rulesets are created via POST in a loop and can be recreated on reruns, so duplicate names/IDs are possible; if the lookup returns empty output (not found) or multiple newline-separated IDs (duplicates), the current deletion step can fail unpredictably, produce unclear errors, and block trunk unlock. Consider also validating that the glob .github/rulesets/release-*.json matches at least one file before looping, and fail with a clear message if none are found.
- .github/workflows/restrict-trunk.yml[58-64]
- .github/workflows/restrict-trunk.yml[65-73]
- .github/workflows/unlock-trunk.yml[8-16]
Suggestion 12:
[reliability] `download-artifact` missing `merge-multiple`
`download-artifact` missing `merge-multiple`
The updated `actions/download-artifact@v4` step uses `pattern: documentation-*` but does not set `merge-multiple: true` or a deterministic extraction path, which can place files under an extra per-artifact directory. This can prevent the later `git add docs/api/` from staging the generated docs, breaking the workflow.The workflow now downloads artifacts via pattern: documentation-* without merge-multiple/path, which can extract files into an unexpected per-artifact directory and cause git add docs/api/ to miss the generated documentation.
Previously, the workflow downloaded a single artifact by name. After this PR, artifact names are language-scoped, but the download step should still ensure the extracted files end up where subsequent steps expect them.
- .github/workflows/update-documentation.yml[95-105]
Prefer downloading the single expected artifact explicitly (e.g., name: documentation-${{ needs.parse.outputs.language }}), or keep pattern but set merge-multiple: true and an explicit path that matches the later git add docs/api/ staging expectations.
Pattern 3: Keep public API contracts and typed/signature surfaces consistent: avoid breaking removals/behavior changes without deprecation, and ensure signature/type files match runtime methods (including setters) and interface changes are non-breaking.
Example code before:
# Runtime method exists but signature missing / or public method removed without deprecation
class Options
def browser_name=(v) = (@browser_name = v)
end
# signature file: only declares browser_name, not browser_name=
Example code after:
# Keep runtime + signature aligned, and deprecate before removal
class Options
def browser_name=(v) = (@browser_name = v)
end
# signature: def browser_name=: (String value) -> void
# If removing a public API: keep it with deprecation warning for a cycle.
Relevant past accepted suggestions:
Suggestion 1:
[maintainability] RBS missing browser_name=
RBS missing browser_name=
`Safari::Options` now defines `browser_name=` at runtime, but `rb/sig/.../safari/options.rbs` only declares `browser_name` and not the setter, so typed consumers cannot call `options.browser_name = ...` without type errors.Selenium::WebDriver::Safari::Options defines a browser_name= setter in Ruby, but the RBS signature file does not declare this method. This creates a mismatch between runtime API and the typed interface.
The PR introduced/updated Safari::Options#browser_name= in Ruby, and updated the RBS to include browser_name, but the setter signature was not added.
- rb/lib/selenium/webdriver/safari/options.rb[39-45]
- rb/sig/lib/selenium/webdriver/safari/options.rbs[13-16]
Add a setter signature in rb/sig/lib/selenium/webdriver/safari/options.rbs, e.g.:
-
def browser_name=: (String value) -> void(or returnStringif you want to model Ruby’s assignment return value).
Suggestion 2:
[correctness] `Safari.path` now nil
`Safari.path` now nil
`Selenium::WebDriver::Safari.path` now defaults to `nil`, removing the previous behavior of providing/validating a default Safari binary path and raising actionable errors. This is a user-visible behavior change that can break callers relying on a non-nil default or on the earlier deterministic exceptions.Safari.path now memoizes nil, changing the public method’s behavior so it no longer provides a default Safari path nor performs early validation/OS checks. This is a backward-incompatible, user-visible behavior change.
Safari.path is a public singleton method. With the new implementation, any caller expecting a default path (or early, actionable errors) will instead receive nil and may fail later with less clear errors.
- rb/lib/selenium/webdriver/safari.rb[48-50]
- Reintroduce the previous default path and validation/error behavior (e.g., default to the Safari binary on macOS and raise a clear
WebDriverErrorwhen unsupported/unavailable). - If the new
nildefault is intentional, add a deprecation path and update callers/documentation accordingly so upgrades do not silently change behavior.
Suggestion 3:
[correctness] `Service#env_path` removed
`Service#env_path` removed
`Selenium::WebDriver::Service#env_path` was removed without a deprecation path, which is a breaking change for callers and leaves the RBI/RBS contract inconsistent with the runtime implementation. This can trigger runtime `NoMethodError` for existing consumers and also break RBS/Steep validation during upgrades.Selenium::WebDriver::Service#env_path was removed from the Ruby implementation without a deprecation path, but the corresponding RBS signature still declares it, creating an API break for callers and leaving the type/signature contract out of sync with runtime behavior.
- The Ruby signature file still exposes
env_pathas a public method (def env_path). - The
Serviceimplementation no longer definesenv_path, so existing callers will hitNoMethodErrorat runtime. - The mismatch will also fail RBS/Steep (or other signature validation) because the method is declared but not implemented.
- Compliance requires maintaining API compatibility and deprecating public functionality before removal.
- rb/lib/selenium/webdriver/common/service.rb[89-105]
- rb/sig/lib/selenium/webdriver/common/service.rbs[48-60]
Suggestion 4:
[maintainability] `waitForConnection()` now rejects
`waitForConnection()` now rejects
`waitForConnection()` previously only resolved (or could hang) but now rejects when the socket is closed or closes before opening, which can break callers that do not handle promise rejection. This changes the public API contract/behavior for a user-facing interface.waitForConnection() now rejects on closed/failed connections, which is a user-visible behavioral change that may break downstream code that previously awaited it without handling rejections.
This class is exported from selenium-webdriver/bidi, so changes to promise rejection behavior can be compatibility-impacting.
- javascript/selenium-webdriver/bidi/index.js[134-155]
Suggestion 5:
[correctness] `Selenium.WebDriver` now signed
`Selenium.WebDriver` now signed
The PR enables strong-name signing on existing user-facing assemblies, changing .NET assembly identity (strong-name) and risking ABI incompatibilities for consumers during upgrade. This violates the requirement to maintain API/ABI compatibility such that upgrades should not require user code changes.Enabling strong-name signing on existing user-facing assemblies changes the assembly identity/ABI for consumers. Compliance requires upgrades not to require user code changes due to API/ABI incompatibilities.
The diff adds SignAssembly and AssemblyOriginatorKeyFile to existing .csproj files, which will produce strongly named assemblies under the same package identity.
- dotnet/src/webdriver/Selenium.WebDriver.csproj[9-10]
- dotnet/src/support/Selenium.WebDriver.Support.csproj[10-11]
Suggestion 6:
[maintainability] Added methods to `IBiDi`
Added methods to `IBiDi`
New members were added to the public `IBiDi` interface, which is a breaking API/ABI change for any downstream implementers of the interface. This violates the requirement to keep public interfaces backward-compatible.IBiDi gained new members, which is a breaking change for any external implementers.
Adding methods to a public interface is an ABI/API breaking change in .NET.
- dotnet/src/webdriver/BiDi/IBiDi.cs[60-70]
Pattern 4: Add explicit timeouts/cancellation and deterministic failure paths to async waits (streams, websockets, HTTP calls) so operations cannot hang indefinitely and failures surface promptly with actionable errors.
Example code before:
# C#
var msg = await stream.Where(x => x.Type == "done").FirstAsync(); // can hang forever
# Ruby
body = Net::HTTP.get(URI(url)) # no timeouts
Example code after:
// C#
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
var msg = await stream.Where(x => x.Type == "done").FirstAsync(cts.Token);
# Ruby
uri = URI(url)
Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https",
open_timeout: 5, read_timeout: 10) do |http|
http.get(uri.request_uri).body
end
Relevant past accepted suggestions:
Suggestion 1:
[reliability] No HTTP timeouts
No HTTP timeouts
maven_stable_release uses Net::HTTP.get without open/read timeouts, so `rake java:update` can hang indefinitely on network stalls. This makes dependency updates flaky and can block CI/developer workflows.maven_stable_release fetches maven-metadata.xml via Net::HTTP.get without any timeouts, which can cause rake java:update to hang indefinitely if Maven Central is slow or the connection stalls.
This rake file already uses Net::HTTP.start with explicit open_timeout and read_timeout for other HTTP calls, so timeouts are an established reliability pattern here.
- rake_tasks/java.rake[310-323]
- rake_tasks/java.rake[101-114]
Replace Net::HTTP.get(URI(url)) with a Net::HTTP.start call that sets use_ssl: true plus open_timeout / read_timeout, and read the response body via Net::HTTP::Get so the call fails fast instead of hanging.
Suggestion 2:
[reliability] Hanging stream LINQ query
Hanging stream LINQ query
NetworkTests.CanDisownData awaits FirstAsync() on an event stream without any cancellation/timeout, so if the matching ResponseCompleted event never arrives the test can hang indefinitely and stall the suite.CanDisownData awaits an async stream LINQ query ending in FirstAsync() without providing a CancellationToken or any timeout. If no matching event is produced, the test can hang indefinitely.
Other tests in this repo bound async stream waits using a CancellationTokenSource(TimeSpan...) (e.g., sub.FirstAsync(cts.Token)) or WaitAsync(TimeSpan...) to avoid CI stalls.
- dotnet/test/webdriver/BiDi/Network/NetworkTests.cs[290-304]
Wrap the FirstAsync with a timeout, e.g.:
- create
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5)); - call
await stream.Where(...).Select(...).FirstAsync(cts.Token);(or equivalently useWaitAsync(TimeSpan.FromSeconds(5))on the returned task/value-task).
Suggestion 3:
[reliability] send() can hang forever
send() can hang forever
After a socket 'close'/'error', _failPending() sets connected=false, but send() will then await waitForConnection(), which only resolves on a future 'open' event; on a closed/failed WebSocket this can leave new send() calls pending indefinitely. This can stall higher-level BiDi helpers that call bidi.send() without checking connection state.After _failPending() runs, connected becomes false, and future send() calls can await waitForConnection() forever because there will be no future 'open' event on a closed/failed socket.
_failPending() is triggered on the underlying WebSocket 'close'/'error' events and marks the connection closed. Many BiDi helper classes call bidi.send() directly; if the connection drops and a command is attempted afterward, the call can hang indefinitely.
- javascript/selenium-webdriver/bidi/index.js[91-101]
Add a fast-fail guard in send() (or waitForConnection()):
- If
this._closedis true, immediately reject/throw a clear error. - Also consider checking
this._ws.readyStateand rejecting whenCLOSING/CLOSED. - Optionally, have
waitForConnection()reject on'close'/'error'if it is waiting for'open'.
Suggestion 4:
[reliability] Unexpected close stalls sends
Unexpected close stalls sends
Index only rejects pending send() calls on timeout or explicit close(); it does not handle underlying WebSocket 'close'/'error' events. If the peer disconnects mid-request, callers can hang until RESPONSE_TIMEOUT (30s) instead of failing immediately.Index tracks in-flight requests in _pending, but it does not subscribe to the underlying WebSocket 'close' / 'error' events. If the connection drops without an explicit Index.close() call, pending send() promises will only fail after RESPONSE_TIMEOUT, delaying failure propagation.
-
close()now correctly rejects_pendingwhen invoked, but unexpected disconnects still leave callers waiting up to 30 seconds. - Since there is now a single dispatcher and centralized
_pending, the most reliable place to fail outstanding requests is in a single socket-level close/error handler.
- javascript/selenium-webdriver/bidi/index.js[33-63]
- javascript/selenium-webdriver/bidi/index.js[217-241]
- Add
_ws.on('close', ...)and_ws.on('error', ...)handlers in the constructor. - In those handlers:
- set
this.connected = false - iterate
_pending.values()andclearTimeout(timeoutId)thenreject(new Error(...)) _pending.clear()
- set
- Ensure handlers are idempotent (e.g., guard with a boolean like
this._closed = true) soclose()and'close'event don’t double-reject. - (Optional) extend tests to simulate server-side disconnect and assert pending sends reject promptly (without waiting for the timeout).
Pattern 5: In concurrent/resource-heavy code paths, make cleanup exception-safe and bounded: ensure Dispose/teardown ordering is correct, remove per-consumer registrations, and cap buffers/queues while releasing ref-counted resources on all terminal paths.
Example code before:
// Pseudocode
pending = new Deque()
onMessage(frame): pending.add(frame) // unbounded
onClose(): closed = true; return // pending frames never released if upgrade never happens
async DisposeAsync():
dispatcher.Dispose() // disposed before processor drains
await processingTask
Example code after:
const int MaxPending = 1000;
onMessage(frame):
if (pending.size >= MaxPending) { frame.release(); markClosed(); return; }
pending.add(frame)
onClose():
closed = true
while (pending.notEmpty()) pending.pop().release()
async DisposeAsync():
try { await processingTask; }
finally {
await dispatcher.DisposeAsync();
await transport.DisposeAsync();
}
Relevant past accepted suggestions:
Suggestion 1:
[reliability] Pending frames leak onClose
Pending frames leak onClose
On a pre-handshake upstream `onClose`, the listener records the close and closes the `HttpClient` but intentionally keeps buffered `WebSocketFrame`s for a future `onUpgrade` drain. If the client-side upgrade never completes (e.g., client disconnects mid-handshake), those buffered ref-counted frames are never released, risking a Netty buffer leak.DirectForwardingListener keeps pre-handshake buffered WebSocketFrames on upstream onClose so they can be drained during onUpgrade. If the client-side upgrade never happens, those buffered (ref-counted) frames never get released, creating a potential memory/leak-detector issue.
The code already treats buffered frames as ref-counted (it calls frame.release() on other terminal paths like overflow), but the pre-handshake onClose path retains the buffer without any guaranteed later release.
- java/src/org/openqa/selenium/grid/router/ProxyWebsocketsIntoGrid.java[333-353]
- java/src/org/openqa/selenium/grid/router/ProxyWebsocketsIntoGrid.java[295-313]
Suggestion 2:
[reliability] Unbounded frame buffer
Unbounded frame buffer
DirectForwardingListener buffers pre-upgrade frames in an unbounded deque with no cap, so if the client upgrade is delayed/stalled while upstream is producing frames, memory can grow without bound (including retaining ref-counted frame buffers). This risks router memory pressure or OOM under pathological conditions.DirectForwardingListener uses an unbounded Deque<WebSocketFrame> to buffer frames arriving before onUpgrade(Channel) is called. If the downstream upgrade stalls while upstream continues sending, the deque can grow without limit, retaining ref-counted frames and risking memory exhaustion.
- Buffering is correct for ordering, but needs a safety valve.
- Add a hard limit (by frame count and/or total bytes) for
pending. - When the limit is exceeded, choose a deterministic policy: drop newest/oldest with
release(), and/or close the downstream channel once available; also log at WARNING with session context. - Consider a time-based cutoff (e.g., if upgrade hasn’t completed within N seconds, discard pending and mark closed).
- java/src/org/openqa/selenium/grid/router/ProxyWebsocketsIntoGrid.java[220-285]
- java/src/org/openqa/selenium/grid/router/ProxyWebsocketsIntoGrid.java[331-336]
Suggestion 3:
[reliability] Unsubscribe cleanup not exception-safe
Unsubscribe cleanup not exception-safe
Broker.UnsubscribeAsync only removes the subscription from the local registry after the remote UnsubscribeAsync succeeds, so if the transport call throws the subscription remains registered and still traversed for every event. Subscription.DisposeAsync also sets its disposed flag before awaiting UnsubscribeAsync, preventing retry after a transient failure and increasing the chance of a permanent local leak.Broker.UnsubscribeAsync performs local cleanup (removing from _subscriptions) only after the remote unsubscribe completes successfully. If the remote call throws, the local subscription remains in the registry, and the broker continues iterating it on every event.
Additionally, Subscription.DisposeAsync sets _disposed before awaiting UnsubscribeAsync(). If UnsubscribeAsync() throws, the subscription becomes non-retryable for disposal/unsubscribe, making the local leak harder to recover from.
- Local cleanup should be resilient to transport/remote failures.
- Even if the remote unsubscribe fails, local detachment should happen to stop further event delivery and avoid retaining stale subscriptions.
- dotnet/src/webdriver/BiDi/Broker.cs[93-101]
- dotnet/src/webdriver/BiDi/Subscription.cs[50-57]
- Wrap the remote unsubscribe in
try/finallyso local registry removal always runs:
public async ValueTask UnsubscribeAsync(Subscription subscription, CancellationToken cancellationToken)
{
try
{
await _bidi.Session.UnsubscribeAsync([subscription.SubscriptionId], null, cancellationToken).ConfigureAwait(false);
}
finally
{
if (_subscriptions.TryGetValue(subscription.EventName, out var registry))
registry.Remove(subscription);
}
}- Consider making
Subscription.DisposeAsyncretry-safe by only setting_disposedafter a successful unsubscribe, or resetting_disposedon failure (so callers can retry), e.g.try { ... } catch { Interlocked.Exchange(ref _disposed, 0); throw; }.
This prevents permanent local leaks when the transport is flaky or shutting down.
Suggestion 4:
[reliability] Enumerator child-channel leak
Enumerator child-channel leak
Subscription.GetAsyncEnumerator() adds a per-enumerator child Channel to _children but never removes it when enumeration ends or is cancelled, so DrainAsync keeps writing events into abandoned channels for the lifetime of the subscription. This can cause unbounded memory growth and retain references even after consumers stop iterating.Subscription<TEventArgs>.GetAsyncEnumerator() registers a per-enumerator child Channel<TEventArgs> in _children but never unregisters it when the consumer stops iterating (natural completion, DisposeAsync, or cancellation). As a result, DrainAsync() continues to write every event into channels that no longer have active readers, leading to unbounded buffering/memory growth and retained references.
- Child channels are stored in
_childrenand written-to on every event. - Cancellation of the returned async iterator does not trigger any code that removes the child from
_children.
- dotnet/src/webdriver/BiDi/Subscription.cs[91-107]
- dotnet/src/webdriver/BiDi/Subscription.cs[118-157]
Implement enumeration via an instance async IAsyncEnumerable<TEventArgs> that:
- creates and registers the child channel,
- yields items from
child.Reader.ReadAllAsync(cancellationToken), -
in a
finallyblock removes the child from_childrenand completes the child writer.
Example shape:
public IAsyncEnumerator<TEventArgs> GetAsyncEnumerator(CancellationToken ct = default)
=> Enumerate(ct).GetAsyncEnumerator(ct);
private async IAsyncEnumerable<TEventArgs> Enumerate([EnumeratorCancellation] CancellationToken ct)
{
var child = Channel.CreateUnbounded<TEventArgs>(new UnboundedChannelOptions { SingleReader = true, SingleWriter = true });
lock (_children) _children.Add(child);
try
{
await foreach (var item in child.Reader.ReadAllAsync(ct).ConfigureAwait(false))
yield return item;
}
finally
{
lock (_children) _children.Remove(child);
child.Writer.TryComplete();
}
}This ensures abandoned enumerators do not keep accumulating writes.
Suggestion 5:
[reliability] `DisposeAsync` disposes `_eventDispatcher` early
`DisposeAsync` disposes `_eventDispatcher` early
`Broker.DisposeAsync()` disposes `_eventDispatcher` before awaiting `_processingTask`, so `ProcessMessagesAsync()` can still call `_eventDispatcher.EnqueueEvent()` after the dispatcher has completed its channel, silently dropping in-flight events. This is an unsafe shutdown ordering in concurrent code that can lead to lost/partial event delivery during disposal.Broker.DisposeAsync() disposes _eventDispatcher before _processingTask has completed. Since ProcessMessagesAsync() can still call _eventDispatcher.EnqueueEvent() while draining _receivedMessages, events can be silently dropped because EventDispatcher.DisposeAsync() completes _pendingEvents.Writer.
The new producer/consumer split means message processing can outlive receiving and can continue enqueuing events during shutdown. Disposal ordering should ensure the processing pipeline has fully drained (or is explicitly prevented from enqueuing) before tearing down the dispatcher.
- dotnet/src/webdriver/BiDi/Broker.cs[121-137]
- dotnet/src/webdriver/BiDi/Broker.cs[353-376]
- dotnet/src/webdriver/BiDi/EventDispatcher.cs[74-77]
- dotnet/src/webdriver/BiDi/EventDispatcher.cs[118-127]
Suggestion 6:
[reliability] Dispose skips cleanup on fault
Dispose skips cleanup on fault
DisposeAsync will abort before disposing the transport and pooled buffers if either background task faults with a non-cancellation exception. This can leak the underlying transport and pooled arrays when ReceiveMessagesAsync throws (it rethrows on non-cancellation failures).Broker.DisposeAsync() can exit early (throw) when _receivingTask or _processingTask faults, skipping disposal of _transport and pooled buffers. This can leak sockets and ArrayPool-rented memory.
ReceiveMessagesAsync rethrows on non-cancellation exceptions, so the receiving task can fault during normal runtime failures (e.g., remote closes websocket).
- dotnet/src/webdriver/BiDi/Broker.cs[121-148]
- dotnet/src/webdriver/BiDi/Broker.cs[299-351]
- Restructure
DisposeAsyncusingtry/finallyso that transport and buffer cleanup always runs. - Optionally capture exceptions from
_receivingTask/_processingTaskand rethrow after cleanup (or aggregate them) so disposal remains deterministic but still reports failures. - Ensure
_processingTaskis awaited/observed even if_receivingTaskfaults, to avoid leaving work running in the background.
[Auto-generated best practices - 2026-05-27]
This wiki is not where you want to be! Visit the Wiki Home for more useful links
Getting Involved
Triaging Issues
Releasing Selenium
Ruby Development
Python Bindings
Ruby Bindings
WebDriverJs
This content is being evaluated for where it belongs
Architectural Overview
Automation Atoms
HtmlUnitDriver
Lift Style API
LoadableComponent
Logging
PageFactory
RemoteWebDriver
Xpath In WebDriver
Moved to Official Documentation
Bot Style Tests
Buck
Continuous Integration
Crazy Fun Build
Design Patterns
Desired Capabilities
Developer Tips
Domain Driven Design
Firefox Driver
Firefox Driver Internals
Focus Stealing On Linux
Frequently Asked Questions
Google Summer Of Code
Grid Platforms
History
Internet Explorer Driver
InternetExplorerDriver Internals
Next Steps
PageObjects
RemoteWebDriverServer
Roadmap
Scaling WebDriver
SeIDE Release Notes
Selenium Emulation
Selenium Grid 4
Selenium Help
Shipping Selenium 3
The Team
TLC Meetings
Untrusted SSL Certificates
WebDriver For Mobile Browsers
Writing New Drivers