Skip to content

Conversation

@VietND96
Copy link
Member

@VietND96 VietND96 commented Jan 16, 2026

🔗 Related Issues

💥 What does this PR do?

Adds a configurable threshold to automatically mark a Node as DOWN after consecutive session creation failures. This helps detect and isolate unhealthy Nodes that consistently fail to create sessions, improving Grid reliability


--node-down-failure-threshold

Behavior

  • When 0 (default): Feature disabled, node accepts sessions regardless of failure count
  • When > 0: After N consecutive failures, the node:
    • Marks itself as DOWN in status/availability checks
    • Immediately rejects new session creation requests
    • Can be recovered using resetConsecutiveSessionFailures() method

Additional Improvements Included

  1. Race condition fix: Atomic session counting prevents multiple threads from exceeding maxSessionCount
  2. Lock-free slot selection: Uses tryReserve() instead of synchronized block for better concurrency
  3. Double-checked locking: Early drain check before acquiring lock for fast path
  4. Session count restoration: restoreSessionCount() prevents premature draining on failed session attempts
  5. JMX monitoring: ConsecutiveSessionFailures attribute exposed for monitoring
  6. Recovery method: resetConsecutiveSessionFailures() allows programmatic recovery

🔧 Implementation Notes

💡 Additional Considerations

🔄 Types of changes

  • Cleanup (formatting, renaming)
  • Bug fix (backwards compatible)
  • New feature (non-breaking change which adds functionality and tests!)
  • Breaking change (fix or feature that would cause existing functionality to change)

Signed-off-by: Viet Nguyen Duc <nguyenducviet4496@gmail.com>
@selenium-ci selenium-ci added B-grid Everything grid and server related C-java Java Bindings labels Jan 16, 2026
@qodo-code-review
Copy link
Contributor

qodo-code-review bot commented Jan 16, 2026

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🟡
🎫 #5678
🔴 Prevent or resolve repeated "Error: ConnectFailure (Connection refused)" errors when
instantiating multiple ChromeDriver instances (first instance OK, subsequent instances
fail) on Ubuntu/Chrome/Selenium versions described.
🟡
🎫 #1234
🔴 Ensure that calling click() triggers JavaScript in a link's href as it did in Selenium
2.47.1 (regression in 2.48.x) on Firefox 42.
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

🔴
Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Missing input validation: The new session-creation-retry-limit configuration value is read without validating
boundary/allowed values (e.g., only -1 or > 0), which can lead to ambiguous behavior
for 0 or other negatives.

Referred Code
public int getSessionCreationRetryLimit() {
  return config
      .getInt(NODE_SECTION, "session-creation-retry-limit")
      .orElse(DEFAULT_SESSION_CREATION_RETRY_LIMIT);
}

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Unvalidated external input: The externally supplied config/CLI value for session-creation-retry-limit is accepted
without sanitization/validation (e.g., rejecting unsupported values like 0 or < -1),
which can cause unexpected runtime states.

Referred Code
public int getSessionCreationRetryLimit() {
  return config
      .getInt(NODE_SECTION, "session-creation-retry-limit")
      .orElse(DEFAULT_SESSION_CREATION_RETRY_LIMIT);
}

Learn more about managing compliance generic rules or creating your own custom rules

  • Update
Compliance status legend 🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-code-review
Copy link
Contributor

qodo-code-review bot commented Jan 16, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Short-circuit newSession on failure limit
Suggestion Impact:The commit adds an early check in newSession (before acquiring the drain lock) to reject new sessions when consecutive failures exceed a configured threshold (renamed to nodeDownFailureThreshold). It also adds a second re-check after acquiring the lock and updates related naming/docs, aligning with the suggested fast-path rejection behavior.

code diff:

@@ -491,6 +491,15 @@
           new RetrySessionRequestException("The node is draining. Cannot accept new sessions."));
     }
 
+    // Early check for failure threshold (fast path for unhealthy nodes)
+    if (nodeDownFailureThreshold > 0
+        && consecutiveSessionFailures.get() >= nodeDownFailureThreshold) {
+      return Either.left(
+          new RetrySessionRequestException(
+              "The node is marked as DOWN due to exceeding the failure threshold. Cannot accept new"
+                  + " sessions."));
+    }
+
     Lock lock = drainLock.readLock();
     lock.lock();
 
@@ -509,6 +518,18 @@
                 "The node is draining. Cannot accept new sessions."));
         return Either.left(
             new RetrySessionRequestException("The node is draining. Cannot accept new sessions."));
+      }
+
+      // Re-check failure threshold after acquiring lock
+      if (nodeDownFailureThreshold > 0
+          && consecutiveSessionFailures.get() >= nodeDownFailureThreshold) {
+        span.setStatus(
+            Status.UNAVAILABLE.withDescription(
+                "The node is marked as DOWN due to exceeding the failure threshold."));
+        return Either.left(
+            new RetrySessionRequestException(
+                "The node is marked as DOWN due to exceeding the failure threshold. "
+                    + "Cannot accept new sessions."));
       }

In the newSession method, add an early check to reject new session requests if
the sessionCreationRetryLimit has been exceeded.

java/src/org/openqa/selenium/grid/node/local/LocalNode.java [488-492]

 // Early check before acquiring lock (fast path for draining nodes)
 if (isDraining()) {
   return Either.left(
       new RetrySessionRequestException("The node is draining. Cannot accept new sessions."));
 }
+// Early check for retry limit exceeded
+if (sessionCreationRetryLimit > 0
+    && consecutiveSessionFailures.get() >= sessionCreationRetryLimit) {
+  return Either.left(
+      new RetrySessionRequestException("Node marked DOWN due to too many session failures."));
+}

[Suggestion processed]

Suggestion importance[1-10]: 7

__

Why: This suggestion improves efficiency by adding a fast-path rejection for new sessions when the node is already marked as DOWN, preventing unnecessary lock acquisition and processing.

Medium
Learned
best practice
Validate configuration input values

Validate the configured value so only -1 (disabled) or a positive integer is
accepted; reject 0 and values < -1 to avoid ambiguous/unsafe behavior.

java/src/org/openqa/selenium/grid/node/config/NodeOptions.java [281-285]

 public int getSessionCreationRetryLimit() {
-  return config
-      .getInt(NODE_SECTION, "session-creation-retry-limit")
-      .orElse(DEFAULT_SESSION_CREATION_RETRY_LIMIT);
+  int limit =
+      config
+          .getInt(NODE_SECTION, "session-creation-retry-limit")
+          .orElse(DEFAULT_SESSION_CREATION_RETRY_LIMIT);
+
+  if (limit != -1) {
+    Require.positive("Session creation retry limit", limit);
+  }
+  return limit;
 }
  • Apply / Chat
Suggestion importance[1-10]: 6

__

Why:
Relevant best practice - Add explicit validation/guards at configuration boundaries to prevent invalid values from silently changing runtime behavior.

Low
General
Validate retry limit input

Add validation to the sessionCreationRetryLimit builder method to ensure the
provided limit is not less than -1.

java/src/org/openqa/selenium/grid/node/local/LocalNode.java [1369-1372]

 public Builder sessionCreationRetryLimit(int limit) {
+  if (limit < -1) {
+    throw new IllegalArgumentException("sessionCreationRetryLimit must be -1 or greater");
+  }
   this.sessionCreationRetryLimit = limit;
   return this;
 }
  • Apply / Chat
Suggestion importance[1-10]: 5

__

Why: This suggestion adds input validation to the builder method, preventing invalid configuration values and making the API more robust.

Low
Reset session counter on node close

Reset the reservedOrActiveSessionCount to zero in the close() method of
LocalNode to ensure a clean state upon shutdown.

java/src/org/openqa/selenium/grid/node/local/LocalNode.java [331-334]

 @Override
 public void close() {
   shutdown.run();
+  reservedOrActiveSessionCount.set(0);
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 3

__

Why: This is a good practice for ensuring a clean state on node shutdown, which improves the reliability of monitoring during the shutdown process, although it is not a critical bug.

Low
  • Update

Signed-off-by: Viet Nguyen Duc <nguyenducviet4496@gmail.com>
Signed-off-by: Viet Nguyen Duc <nguyenducviet4496@gmail.com>
Signed-off-by: Viet Nguyen Duc <nguyenducviet4496@gmail.com>
Signed-off-by: Viet Nguyen Duc <nguyenducviet4496@gmail.com>
@VietND96 VietND96 merged commit e845e55 into SeleniumHQ:trunk Jan 16, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

B-grid Everything grid and server related C-java Java Bindings Review effort 4/5

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants