New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When a Windows client is killed, Grid nodes can leave sessions/browsers open indefinitely #2332

Closed
jessehudl opened this Issue Jun 21, 2016 · 5 comments

Comments

Projects
None yet
5 participants
@jessehudl
Copy link

jessehudl commented Jun 21, 2016

Meta -

OS: Windows
Selenium Version: 2.53

Expected Behavior -

When a client has disconnected for any reason, an attempt is always made to cleanup the session on the remote node, at least by the expiration of the idle timeout.

Actual Behavior -

If the Selenium Grid hub detects a client "is gone" or "Socket timed out", it terminates/frees the session on the Grid, but no attempt is made to delete the session on the remote node. This is handled in RequestHandler.java:#L131-136:

        } catch (ClientGoneException e) {
          log.log(Level.WARNING, "The client is gone for session " + session + ", terminating");
          registry.terminate(session, SessionTerminationReason.CLIENT_GONE);
        } catch (SocketTimeoutException e) {
          log.log(Level.SEVERE, "Socket timed out for session " + session + ", " + e.getMessage());
          registry.terminate(session, SessionTerminationReason.SO_TIMEOUT);

registry.terminate() specifies this in the docstring:

  /**
   * Ends this test session for the hub, releasing the resources in the hub / registry. It does not
   * release anything on the remote.

This means sessions are left on the node and will never be cleaned up, which can cause dozens of web browsers left running eventually leading to resource exhaustion on the node and slow/failing tests. maxSession does not matter in this case.

We see this in our environment when our continuous integration server (TeamCity) kills a build on a Windows agent during a Selenium test, either manually or due to overrunning the allowed run time.

Steps to reproduce -

I can only reproduce this using Windows as a client, I am not sure if there's something different in the way it handles connections, or why exactly. I imagine it's possible on other OS, but not as easy to reproduce.

Since this is a Grid hub specific issue, there are a few steps:

  1. Start a grid server with an idle timeout of 10 seconds:

    java -jar selenium-server-standalone-2.53.0.jar -role hub -timeout 10
    
  2. Start a Grid node on the same machine:

    java -jar selenium-server-standalone-2.53.0.jar -role node
    
  3. On a Windows machine from command prompt run a script equivalent to this (replace localhost with the Grid IP, if different):

    # Start a Grid session and browse to google.com 30 times
    from selenium import webdriver
    
    driver = webdriver.Remote('http://localhost:4444/wd/hub', {'browserName':'chrome'})
    for i in range(30):
       driver.get('http://www.google.com')
  4. While the script is running close the terminal window (do not ctrl-c or kill the process)

You should see this in the log:

13:35:12.062 WARN - The client is gone for session ext. key 59d33b92-47ac-4b44-a5a1-b53e32b4455e

This means the session is freed on the Grid, but the session is left running on the node. You can confirm this by browsing to your node's URL (e.g. http://localhost:5555/wd/hub/static/resource/hub.html).

Solutions -

I have a (very small) commit on my forked branch which removes the session.terminate() calls when handling these types of client disconnects - jessehudl@fe5fafb

Removing the session.terminate() allows the Grid to clean up the sessions like it usually would - after the idle timeout has expired. If you re-run the "Steps to reproduce", you will see that after the client is gone, the session is still cleaned up after 10 seconds and the browser is closed on the node.

This is working well for us internally but I wanted to gain some insight from the core devs before submitting a PR:

  1. Is there a reason these sessions are currently handled the way they are?
  2. Is there a better way to clean them up?
  3. Since 2.X development is closed, would this have to be submitted to 3.X?

@jleyba jleyba added the C-grid label Jun 22, 2016

@mach6

This comment has been minimized.

Copy link
Member

mach6 commented Oct 4, 2016

@jessehudl bump - I'd like to see a PR raised for this (submitted for 3.x), assuming it still an issue

@cgoldberg

This comment has been minimized.

Copy link
Member

cgoldberg commented Oct 5, 2016

@jessehudl
Your PR removes termination from the hub... Do those ever get cleaned up on the server? (or is this a memory leak by abandoning sessions). I'm also not sure why removing termination actually solves your issue. Perhaps clients clean up succesfully when the server side drops the connection, but not when it encounters the case you described?

Is there a better way to clean them up?

This has some good ideas (and code) for killing remaining/crashed browsers before each test run:
http://elementalselenium.com/tips/70-grid-extras

@jessehudl

This comment has been minimized.

Copy link
Author

jessehudl commented Oct 20, 2016

Sorry guys, I actually took another job so wasn't getting these notifications since they were tied to my old email.

@cgoldberg: my proposed change which I implemented at my now-former job was to specifically remove the session deletions on the hub in these cases where the client disconnects unexpectedly. What was happening was when the client session was terminated/killed, the hub would simply delete the session on the hub only and not the node. This left sessions and browsers open indefinitely on the nodes causing resource exhaustion and slowness. The comment I quoted in the original post seems to confirm this is intentional, in registry.terminate():

   * Ends this test session for the hub, releasing the resources in the hub / registry. It does not
   * release anything on the remote. The resources are released in a separate thread, so the call
   * returns immediately. It allows release with long duration not to block the test while the hub is
   * releasing the resource.

So it removes sessions from the Grid but not the node.

By removing these registry.terminate() calls, this allows the hub to expire the sessions on the nodes like normal, instead of just leaving them there forever. This means the Grid session isn't freed immediately, but it will be freed after the timeout period, as if a client never sent another command (which it doesn't). The case for this is the client being killed (by either a build being cancelled (our case), terminal being closed (my repro example), etc.). The client is now gone and won't cleanup the session, and the hub also won't clean up the session because it's already deleted it locally without notifying the nodes.

I'm not sure what the intended use case for this is, that's the reason I opened the discussion. Perhaps it was simply the benefit of freeing the Grid session immediately, but I don't see why an attempt wouldn't be made to free the session on the node. Maybe that would be an even better fix than what I proposed - cleaning up the session on the hub and the node. I think I did investigate that approach, but didn't see an easy way to kill the session on the node from this area of the code. This was the first time I ever looked at the Grid code and I'm no longer working on this or using Selenium grid regularly.

Selenium Grid Extras is fine as an add-on, but if we could fix this in the core code, why not? Maybe this is a bug that the Extras is working around?

I have no problem opening a PR for the fix I originally proposed if @mach6 agrees it's the best approach.

@jessehudl

This comment has been minimized.

Copy link
Author

jessehudl commented Mar 16, 2017

I tried reproducing this tonight with v3.3.1 and was unable to. It appears that this scenario is covered by SessionCleaner on the node now. I noticed when the node starts:

00:37:36.778 INFO - SessionCleaner initialized with insideBrowserTimeout 0 and clientGoneTimeout 10000 polling every 1000

SessionCleaner did not output that it was enabled in v2.53.0, possibly because the log line to say it was enabled was added in 9f88503.

Running the same repro steps outlined in the OP I see the hub detect the client gone first as usual (the node is again not affected):

00:54:26.194 WARN - The client is gone for session ext. key e8136b36-f613-40fa-b079-dba47a076df8, terminating

However the node does now detect a client timeout and closes the session (and the browser):

00:54:35.527 INFO - Session e8136b36-f613-40fa-b079-dba47a076df8 deleted due to client timeout

I also suspect commit 705be71 fixed this issue (maybe)?

I think this issue can be closed as it's no longer occurring on v3.3.1, unless the fix should be backported to 2.x

@barancev

This comment has been minimized.

Copy link
Member

barancev commented Dec 20, 2017

Checked again in 3.8.1 -- the issue seems to be fixed. Yes, the hub starts all pending sessions, but the node kills the by timeout, one by one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment