[🐛 Bug]: Facing "Unable to find session with id" errors during initial executions of the test in the Selenium grid #15347

Open
rookieInTraining opened this issue Feb 28, 2025 · 11 comments

@rookieInTraining
Contributor

rookieInTraining commented Feb 28, 2025

What happened?

Hi team! I've been debugging a probable issue with my Selenium Grid setup. A few seconds or minutes into my tests I get a lot of "Unable to find session with ID" exceptions, and a large chunk of my test suite fails. The behaviour looks similar to the comment mentioned here, except that in my case, since the grid is dynamic, it could be triggered by an event where a node is initialized.

The configurations that I'm using for the hub:

      - env:
        - name: SE_ENABLE_TRACING
          value: "true"
        - name: SE_OPTS
          value: --log-level FINE
        - name: SE_NODE_SESSION_TIMEOUT
          value: "1800"

Any pointers to debug this issue further would help.

How can we reproduce the issue?

The framework in use is a variant of the code here (with the latest Selenium version and JDK 17 as requirements): https://github.com/rookieInTraining/selenium-testng-boilerplate/tree/master

Relevant log output

Unable to find session with ID: 7a6e3f0073bd7b9f385a5aa2da3d5521
Build info: version: '4.28.1', revision: '73f5ad48a2'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.1.58+', java.version: '17.0.13'
Driver info: driver.version: unknown
Build info: version: '4.28.0', revision: 'ac342546e9'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '5.14.0-362.24.2.el9_3.x86_64', java.version: '17.0.9'
Driver info: org.openqa.selenium.remote.RemoteWebDriver
Command: [7a6e3f0073bd7b9f385a5aa2da3d5521, get {url=https://space-prod0-automation.sprinklr.com//workforce-planner-app/home}]
Capabilities {acceptInsecureCerts: true, browserName: chrome, browserVersion: 132.0.6834.159, chrome: {chromedriverVersion: 132.0.6834.159 (2d77d3fc445..., userDataDir: /tmp/.org.chromium.Chromium...}, fedcm:accounts: true, goog:chromeOptions: {debuggerAddress: localhost:42597}, goog:loggingPrefs: {browser: ALL}, networkConnectionEnabled: false, pageLoadStrategy: none, platformName: linux, proxy: Proxy(), se:bidiEnabled: false, se:cdp: wss://qa6-selenium-grid-car..., se:cdpVersion: 132.0.6834.159, se:containerName: care-chrome-node-579655886b..., se:name: workforce-management/smoke-..., se:noVncPort: 7900, se:vnc: wss://qa6-selenium-grid-car..., se:vncEnabled: true, se:vncLocalAddress: ws://10.102.35.210:7900, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: accept, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:extension:minPinLength: true, webauthn:extension:prf: true, webauthn:virtualAuthenticators: true}
Session ID: 7a6e3f0073bd7b9f385a5aa2da3d5521
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
org.openqa.selenium.remote.ErrorCodec.decode(ErrorCodec.java:167)
org.openqa.selenium.remote.codec.w3c.W3CHttpResponseCodec.decode(W3CHttpResponseCodec.java:138)
org.openqa.selenium.remote.codec.w3c.W3CHttpResponseCodec.decode(W3CHttpResponseCodec.java:50)
org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:215)
org.openqa.selenium.remote.TracedCommandExecutor.execute(TracedCommandExecutor.java:53)
org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:545)
org.openqa.selenium.remote.RemoteWebDriver.get(RemoteWebDriver.java:313)
com.spr.tests.workforcemanagement.ShiftActivityTest.goToActivities(ShiftActivityTest.java:20)
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.base/java.lang.reflect.Method.invoke(Method.java:568)
org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:139)
org.testng.internal.invokers.MethodInvocationHelper.invokeMethodConsideringTimeout(MethodInvocationHelper.java:69)
org.testng.internal.invokers.ConfigInvoker.invokeConfigurationMethod(ConfigInvoker.java:390)
org.testng.internal.invokers.ConfigInvoker.invokeConfigurations(ConfigInvoker.java:325)
org.testng.internal.invokers.TestInvoker.runConfigMethods(TestInvoker.java:810)
org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:577)
org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:227)
org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50)
org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:957)
org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:200)
org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:148)
org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
java.base/java.lang.Thread.run(Thread.java:842)

Operating System

Linux (Docker)

Selenium version

4.28.1

What are the browser(s) and version(s) where you see this issue?

Chrome

What are the browser driver(s) and version(s) where you see this issue?

132.0.6834.159

Are you using Selenium Grid?

4.28.1


@rookieInTraining, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then the I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

@VietND96
Member

VietND96 commented Mar 4, 2025

Can you refer to https://www.selenium.dev/documentation/webdriver/drivers/http_client/
and add withRetries() to the ClientConfig when creating the RemoteWebDriver, to see whether it helps?
Ignore this.

@VietND96
Member

VietND96 commented Mar 4, 2025

Btw, PR #15348 adds trace data for session stop, so we can track whether or not the session was closed due to a timeout.

@joerg1985
Member

@rookieInTraining I think your framework code is broken, and this might be the root cause.

@BeforeMethod creates a new ThreadLocal on every call. This resets the state of the other threads, and because you do not use synchronization, a thread might see an old ThreadLocal instance. You should create one static final ThreadLocal and use set() instead of initialValue(); the @AfterMethod should call remove() on the thread local (after quitting the driver).
https://github.com/rookieInTraining/selenium-testng-boilerplate/blob/master/src/main/java/com/framework/factory/DriverFactory.java#L27-L34

@AfterSuite should not be needed, as the driver should be terminated in @AfterMethod.
Could you disable it and check whether the issue is gone after these two changes?
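
The suggestion above could be sketched roughly as follows. This is only an illustration, not the code from the linked repository: `FakeDriver` is a stand-in for `WebDriver` so the snippet runs without Selenium, and `DriverHolder`, `start()`, `current()` and `stop()` are hypothetical names. Note that `ThreadLocal` exposes `remove()` for dropping the calling thread's value.

```java
// Stand-in for org.openqa.selenium.WebDriver (assumption, keeps the sketch dependency-free).
final class FakeDriver {
    private boolean quit;

    void quit() { quit = true; }
    boolean isQuit() { return quit; }
}

// Hypothetical holder: ONE static final ThreadLocal shared by all test threads.
// The ThreadLocal object itself is never replaced; only per-thread values change.
final class DriverHolder {
    private static final ThreadLocal<FakeDriver> DRIVER = new ThreadLocal<>();

    private DriverHolder() {}

    // Would be called from @BeforeMethod: set() a value for the calling thread only.
    static void start() {
        DRIVER.set(new FakeDriver());
    }

    // Returns the calling thread's driver, or null if none was set.
    static FakeDriver current() {
        return DRIVER.get();
    }

    // Would be called from @AfterMethod: quit the driver first, then drop the reference.
    static void stop() {
        FakeDriver driver = DRIVER.get();
        if (driver != null) {
            driver.quit();
        }
        DRIVER.remove();
    }
}
```

With this shape, a thread that never called `start()` gets `null` from `current()` rather than another thread's driver, and after `stop()` the calling thread no longer holds a reference to a quit session.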

@rookieInTraining
Contributor Author

rookieInTraining commented Mar 5, 2025

@joerg1985, the reason I'm not clearing the thread locals is to re-use the same browser session across my test suite. I'm basically creating a pool of browsers that I can leverage for the lifetime of the suite. I've been logging the threads these sessions are called from and have not seen the issues you mentioned, but I'll definitely take a look at your feedback.

@VietND96 - is there a possibility to test this out via a nightly build?
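
For readers following along, the session-reuse idea described in this comment might look roughly like the sketch below. It is an assumption-laden illustration, not the reporter's actual framework code: `PooledSession` stands in for a `WebDriver`, and `SessionPool`, `get()`, `size()` and `quitAll()` are hypothetical names.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Stand-in for a browser session (assumption, keeps the sketch dependency-free).
final class PooledSession {
    private volatile boolean closed;

    void quit() { closed = true; }
    boolean isClosed() { return closed; }
}

// Hypothetical pool: each worker thread lazily creates one session and keeps reusing it.
final class SessionPool {
    // Registry of every session created, so a suite-level teardown can quit them all.
    private static final Queue<PooledSession> ALL = new ConcurrentLinkedQueue<>();

    private static final ThreadLocal<PooledSession> PER_THREAD =
            ThreadLocal.withInitial(() -> {
                PooledSession session = new PooledSession();
                ALL.add(session);
                return session;
            });

    private SessionPool() {}

    // Same thread, same session, for as long as the thread lives.
    static PooledSession get() {
        return PER_THREAD.get();
    }

    static int size() {
        return ALL.size();
    }

    // Would be called once at the end of the suite (e.g. from @AfterSuite).
    // Threads still hold references to the now-closed sessions afterwards.
    static void quitAll() {
        PooledSession session;
        while ((session = ALL.poll()) != null) {
            session.quit();
        }
    }
}
```

In a design like this, a session lives as long as the suite, so any server-side expiry of that session (for example a session timeout on the node) would surface to the test only on the next command sent with the stale session ID.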

@joerg1985
Member

@VietND96 after reading the javadoc of CacheBuilder, I have a bad feeling about the cache itself:
https://guava.dev/releases/33.0.0-jre/api/docs/com/google/common/cache/CacheBuilder.html

The prologue ends with: Caffeine offers better performance, more features (including asynchronous loading), and fewer [bugs](https://github.com/google/guava/issues?q=is%3Aopen+is%3Aissue+label%3Apackage%3Dcache+label%3Atype%3Ddefect).

@VietND96
Member

VietND96 commented Mar 5, 2025

Do you think it is something in LocalNode, where the drain event is fired? I saw that when the node drains after the configured session count is reached, .cleanup() and .invalidateAll() are called. Could something be wrong around these?
Btw, I also saw one PR from you to remove Guava, but it looks like complex scenarios need to be considered there.

@VietND96
Member

VietND96 commented Mar 5, 2025

I also just walked through a few examples of migrating from Guava to Caffeine (e.g. https://opendev.org/opendev/gerrit/commit/06c86046fefc6555b98d81f3726dd664020aeb28). Do you want to make this transition as part of #12737?

@joerg1985
Member

joerg1985 commented Mar 5, 2025

> Do you think it is something in LocalNode, where the drain event is fired? I saw that when the node drains after the configured session count is reached, .cleanup() and .invalidateAll() are called. Could something be wrong around these?

.cleanup() and .invalidateAll() are part of the shutdown after all sessions are gone, and were added recently with 0338677 in 4.27.0. But I think the issue is older, isn't it?

> Do you want to make this transition as part of #12737?

I think it might be best to move to Caffeine, as the Guava Cache has 17 unfixed issues, 6 of which date back to 2014.
So I will make a PR for this and see what others think about replacing it.

@VietND96
Member

VietND96 commented Mar 6, 2025

@joerg1985, the intermittent issue we are discussing in #15370 looks like it first appeared in 4.28.0.

@rookieInTraining
Contributor Author

One question here, @VietND96 @joerg1985: if the issue is caused by the local cache implemented in the Grid, could an external caching system like Redis be used to help mitigate this problem?
