Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[🐛 Bug]: On grid, session is started but never used #11881

Closed
bhecquet opened this issue Apr 11, 2023 · 14 comments
Closed

[🐛 Bug]: On grid, session is started but never used #11881

bhecquet opened this issue Apr 11, 2023 · 14 comments
Labels
I-question Applied to questions. Issues should be closed and send the user to community resources.

Comments

@bhecquet
Copy link
Contributor

bhecquet commented Apr 11, 2023

What happened?

I encounter an issue when trying to execute more tests than a Selenium Grid can handle

My setup:

  • a grid Hub: java -jar selenium-server-4.8.3.jar hub --session-request-timeout 30 --port 4448
  • a grid node: java -jar selenium-server-4.8.3.jar node --max-sessions 1 --port 6666 --grid-url http://selenium-server.company.com:4448 --publish-events tcp://selenium-server.company.com:4442 --subscribe-events tcp://selenium-server.company.com:4443
  • 4 instances of a test (see Java+TestNG test script below)

The script adds a wait to force new sessions to be queued
So when starting the 4 instances (methods test1, test2, test3 and test4), I can see them run, or queued, but at some point, a session is created but not used: on the node, browser is started, but URL not loaded.
In that case, we need to wait 5 mins (default session timeout) for next test to be started

From what I observed, I would say that there is a race condition between the session request timeout and the session creation thread, as the "stalled" session is logged as timed out in test script logs: Cannot create driver: Could not start a new session. Response code 500. Message: Could not start a new session. New session request timed out

Thanks for your help

How can we reproduce the issue?

package company;

import org.openqa.selenium.By;
import org.openqa.selenium.Capabilities;
import org.openqa.selenium.MutableCapabilities;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;

import java.net.MalformedURLException;
import java.net.URL;

public class Test {

    @org.testng.annotations.Test
    public void testMethod(String name) throws MalformedURLException, InterruptedException {
        for (int i = 0; i < 10; i++) {
            MutableCapabilities caps = new ChromeOptions();
            caps.setCapability("cs:test", String.format(name + "-" + i));
            try {
                System.out.println(String.format("Starting session %s: %d", name, i));
                WebDriver driver = new RemoteWebDriver(new URL("http://selenium-server.company.com:4448/wd/hub"), caps);
                System.out.println(String.format("Started session %s: %d", name, i));
                try {
                    int wait = (int) (Math.random() * 30000);
                    driver.get("https://www.selenium.dev/");
                    driver.manage().window().maximize();
                    driver.findElement(By.linkText("Documentation")).click();
                    Thread.sleep(wait);
                    System.out.println("Test OK");
                } finally {
                    driver.quit();
                }
            } catch (Exception e) {
                System.out.println("Cannot create driver: " + e.getMessage());
            }
        }
    }

    @org.testng.annotations.Test
    public void test() throws MalformedURLException, InterruptedException {
        testMethod("test1");
    }
    @org.testng.annotations.Test
    public void test2() throws MalformedURLException, InterruptedException {
        testMethod("test2");
    }
    @org.testng.annotations.Test
    public void test3() throws MalformedURLException, InterruptedException {
        testMethod("test3");
    }
    @org.testng.annotations.Test
    public void test4() throws MalformedURLException, InterruptedException {
        testMethod("test4");
    }
}

Relevant log output

Hub: 
15:47:37.325 INFO [LocalDistributor.newSession] - Session created by the Distributor. Id: 056d017f54d5e55bc24fc8dd03b06316
 Caps: Capabilities {acceptInsecureCerts: false, browserName: chrome, browserVersion: 104.0.5112.102, chrome: {chromedriverVersion: 104.0.5112.79 (3cf3e8c8a07d..., userDataDir: C:\Users\ZSSELE~1\AppData\L...}, cs:test: test1-2, goog:chromeOptions: {debuggerAddress: localhost:54467}, networkConnectionEnabled: false, pageLoadStrategy: normal, platformName: Windows 10, proxy: {}, se:bidiEnabled: false, se:cdp: ws://selenium-server..., se:cdpVersion: 104.0.5112.102, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:virtualAuthenticators: true}
15:47:39.333 WARN [SeleniumSpanExporter$1.lambda$export$3] - {"traceId": "4467cec474a962ae70f33c9b68cba777","eventTime": 1681220859332195377,"eventName": "HTTP request execution complete","attributes": {"http.flavor": 1,"http.handler_class": "org.openqa.selenium.grid.sessionqueue.local.LocalNewSessionQueue","http.host": "selenium-server.company.com:4448","http.method": "POST","http.request_content_length": "421","http.scheme": "HTTP","http.status_code": 500,"http.target": "\u002fsession","http.user_agent": "selenium\u002f4.7.2 (java windows)"}}

15:48:05.054 WARN [SeleniumSpanExporter$1.lambda$export$3] - {"traceId": "c7dbb6ffaea0c77307032001fd43f947","eventTime": 1681220885053873705,"eventName": "HTTP request execution complete","attributes": {"http.flavor": 1,"http.handler_class": "org.openqa.selenium.grid.sessionqueue.local.LocalNewSessionQueue","http.host": "selenium-server.company.com:4448","http.method": "POST","http.request_content_length": "421","http.scheme": "HTTP","http.status_code": 500,"http.target": "\u002fsession","http.user_agent": "selenium\u002f4.7.2 (java windows)"}}

Moreover an other test showed the following events:

  • 09:19:10 => session request through new RemoteWebDriver
  • 09:19:10 => "session request received by the distributor" => no free slot
  • 09:19:25 => "session request received by the distributor" => no free slot
  • 09:19:39 => "session request received by the distributor"
  • 09:19:40 => timeout on session creation (from client point of view
  • 09:19:41 => "Session created by the Distributor"
    The point here seems to be the delay between the slot reservation, when distributor receives the new session request and the session creation which is done after the timeout occured

Operating System

Hub: RHE 7.9, Node: Windows 10

Selenium version

4.7.2 (script), 4.8.3 (Hub & node)

What are the browser(s) and version(s) where you see this issue?

chrome 104

What are the browser driver(s) and version(s) where you see this issue?

chromedriver 104

Are you using Selenium Grid?

4.8.3

@github-actions
Copy link

@bhecquet, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

@extravio
Copy link

I've got a similar issue.
Environment: Grid deployed with Helm (4.8.1 - fully distributed - no extra config) - 2 chrome nodes.

5 concurrent tests, each test has a 200 second delay.
result: 1 failure

response = {'status': 500, 'value': '{\n  "value": {\n    "error": "timeout",\n    "message": "java.net.http.HttpTimeoutException...n\\tat org.openqa.selenium.remote.http.jdk.JdkHttpClient.execute(JdkHttpClient.java:292)\\n\\t... 36 more\\n"\n  }\n}'}
[...]
=========================== short test summary info ============================
FAILED selenium-test.py::test_qahss03_4 - selenium.common.exceptions.TimeoutE...
=================== 1 failed, 4 passed in 404.93s (0:06:44) ====================

To avoid the timeout, I add the following configuration

extraEnvironmentVariables:
    - name: SE_SESSION_REQUEST_TIMEOUT
      value: "1200"
    - name: SE_SESSION_RETRY_INTERVAL
      value: "5"`

result: 4 tests PASSED, no response is received for the 5th test and the test suite never completes

All our test suites that use concurrent tests are affected by timeout issues:
[TestNG] org.openqa.selenium.SessionNotCreatedException: Could not start a new session. Response code 500. Message: java.util.concurrent.TimeoutException

@bhecquet
Copy link
Contributor Author

An other information that may help to resolve this bug: if I remove the "--session-request-timeout 30" parameter from hub, I don't have the bug anymore.
But this is not a solution for me as I have 2 hubs in parallel (for redundancy purpose) and I don't want my new session request being stuck on one of them

@diemol
Copy link
Member

diemol commented Apr 24, 2023

I do not see the issue. There are three timeouts into play here:

  • The client side timeout, which is what the client waits for a command to be executed, 3 minutes by default.
  • The session request queue timeout, which is how long a session request can stay on the queue before it gets created, default is 5 minutes.
  • The session timeout, which is how long a running session can stay idle on the Node, without executing any commands, 5 minutes by default.

@bhecquet
In your case, you are setting 30 seconds for the session request queue timeout, therefore any new session request will stave because you are leaving the running sessions idle and waiting for the session timeout to happen.

I recommend to check the different flags for configuration here https://www.selenium.dev/documentation/grid/configuration/cli_options/, also the first steps here https://www.selenium.dev/documentation/grid/getting_started/

If you have more questions, I will leave some links below you can use to join our user chat community.

@diemol diemol added I-question Applied to questions. Issues should be closed and send the user to community resources. and removed I-defect needs-triaging labels Apr 24, 2023
@github-actions
Copy link

💬 Please ask questions at:

@bhecquet
Copy link
Contributor Author

Hello,

thanks for your reply.
I think you missed something: I do not let the session idle, in the problematic case, the client request is "aborted" whereas, at the same time, the session is created on the node, so, there is no way to use the created session

@diemol
Copy link
Member

diemol commented Apr 24, 2023

The reason is the same, you need to figure out what timeouts work for you. If the client times out before the queue times out, then you will have orphan sessions. That is why we left it configurable and not hard coded.

@bhecquet
Copy link
Contributor Author

So I do not understand why I get a staled session on the node, corresponding to a client request which has already timeout (after 30 secs only).

Tell me if I understand correctly the process:

  1. when client needs a session, this session is put in session queue
  2. At regular (and configurable) intervals, distributor ask the session queue for pending sessions, if it can handle at least one
  3. The first pending session is pulled from session queue by distributor which creates the session (I don't know here if it's really removed or not from session queue). I observed that this process can take 2 seconds
  4. The session is created on node

Is there a moment where session is still in session queue and also created on node ?

@diemol
Copy link
Member

diemol commented Apr 24, 2023

That seems correct, but what you are seeing:

Cannot create driver: Could not start a new session. Response code 500. Message: Could not start a new session. New session request timed out 

is because the session request was in the queue for over 30 seconds and no Node slot was released in the meantime.

@bhecquet
Copy link
Contributor Author

Yes, I agree with that, but, in my test code (provided in the issue) I set a custom capability ("cs:test<some_id>") that help me identify which session is currently started on node, and what I observe is:

  • client request browser with "cs:test3" (3 is the loop index)
  • after 30 secs, client receives the message Cannot create driver: Could not start a new session. Response code 500. Message: Could not start a new session. New session request timed out
  • At the same time, a browser is started on node with capability "cs:test3"

That's what surprise me

@diemol
Copy link
Member

diemol commented Apr 24, 2023

I do not think you can correlate that because you are executing a loop, 10 times, for each test. So you do not have a single request with cs:test3, you are actually creating 10.

@bhecquet
Copy link
Contributor Author

bhecquet commented Apr 24, 2023

caps.setCapability("cs:test", String.format(name + "-" + i));
I simplified :)
the capability holds the test name and the loop index

I looked at the "LocalNewSessionQueue" code and see that while distributor takes effectively the session request to create it, it's removed from the queue, so theorically, what I observe is not possible :(

I will do more tests to understand better what happens and let you know

Thanks for your time

@bhecquet
Copy link
Contributor Author

bhecquet commented May 3, 2023

Hello @diemol

I think I've found the root cause of my problem and it's in LocalNewSessionQueue class

In fact, when session request arrives, a wait is performed

if (data.latch.await(requestTimeout.toMillis(), MILLISECONDS)) {

During this wait, LocalDistributor calls

public List<SessionRequest> getNextAvailable(Map<Capabilities, Long> stereotypes) {

to get sessions in queue

But nothing prevents the session request to timeout while the session is being created

I've added more logs to grid code (whole file attached)
logs_reduced.log

08:48:44.243 INFO [LocalNewSessionQueue.addToQueue] - 7f03040b-a214-48bd-94b3-47df03cc21c5: Adding session request to queue
08:48:44.243 INFO [LocalNewSessionQueue.injectIntoQueue] - 7f03040b-a214-48bd-94b3-47df03cc21c5: Adding request to queue
08:48:44.243 INFO [LocalNewSessionQueue.addToQueue] - 7f03040b-a214-48bd-94b3-47df03cc21c5: Waiting for session to be created
...
08:49:13.947 INFO [LocalNewSessionQueue.lambda$getNextAvailable$7] - available request: 7f03040b-a214-48bd-94b3-47df03cc21c5
...
08:49:14.244 INFO [LocalNewSessionQueue.addToQueue] - 7f03040b-a214-48bd-94b3-47df03cc21c5: Session request timed out
08:49:14.244 INFO [LocalNewSessionQueue.addToQueue] - 7f03040b-a214-48bd-94b3-47df03cc21c5: session creation error sent to client 
...
08:49:15.786 INFO [LocalDistributor.newSession] - Session created by the Distributor. Id: 934ed7006d17484db7ca3b67695672bf [request 7f03040b-a214-48bd-94b3-47df03cc21c5] 
 Caps: Capabilities {acceptInsecureCerts: false, browserName: chrome, browserVersion: 104.0.5112.102, chrome: {chromedriverVersion: 104.0.5112.79 (3cf3e8c8a07d..., userDataDir: C:\Users\ZSSELE~1\AppData\L...}, cs:test: test3-1, goog:chromeOptions: {debuggerAddress: localhost:57386}, networkConnectionEnabled: false, pageLoadStrategy: normal, platformName: Windows 10, proxy: {}, se:bidiEnabled: false, se:cdp: ws://selenium-server-dev.co..., se:cdpVersion: 104.0.5112.102, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:virtualAuthenticators: true}

You can look at the modified code (logs added) here: https://github.com/bhecquet/selenium/blob/grid_concurrency/java/src/org/openqa/selenium/grid/sessionqueue/local/LocalNewSessionQueue.java

diemol added a commit that referenced this issue May 22, 2023
…e (corrects issue #11881) (#12014)

* Add logs

* add more logs

* Control the status of session request before timeout

* Increase request session timeout by 5 seconds

This allow the extended waiting to be done without being interrupted by
timeoutSessions() method

* remove logs LocalNewSessionQueue.java

* remove logs

* remove logger import

* Add tests to check timeout behaviour

* timeoutSessions method now checks if session is in queue before deleting it

---------

Co-authored-by: Diego Molina <diemol@users.noreply.github.com>
Copy link

github-actions bot commented Dec 9, 2023

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
I-question Applied to questions. Issues should be closed and send the user to community resources.
Projects
None yet
Development

No branches or pull requests

3 participants