time-box the docker client to avoid ever getting stuck #773

Merged
merged 4 commits into from Jan 4, 2016

Conversation

@ssalinas
Member

ssalinas commented Nov 23, 2015

It seems that we can get into an odd state when contacting the docker daemon. Even with appropriate timeouts set (the Apache client defaults, in this case), we can still end up contacting the docker daemon and hanging there forever. While the root problem lies in the docker daemon, it shouldn't stop the executor from being able to shut down properly.

This optionally creates a time limit for all calls to the docker daemon, so a hung call can't leave an executor process that simply waits forever doing nothing.
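Judging by the `UncheckedTimeoutException` in the diff below, the PR uses Guava's `TimeLimiter` for this. A minimal sketch of the same idea using only the JDK (the `inspectContainer` stand-in and its return value are hypothetical, not Singularity's actual API): run the daemon call on a worker thread and abandon it after a wall-clock deadline, regardless of what the underlying HTTP connection is doing.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeBoxedCall {

    // Hypothetical stand-in for a docker daemon call that may never return.
    static String inspectContainer(long hangMillis) throws InterruptedException {
        Thread.sleep(hangMillis);
        return "pid:1234";
    }

    // Run the call on a worker thread and give up after the timeout, so a hung
    // daemon cannot pin the caller forever. Guava's TimeLimiter does the same
    // thing, throwing UncheckedTimeoutException instead of TimeoutException.
    static String callWithTimeout(Callable<String> call, long timeoutMillis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(call);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            future.cancel(true);   // interrupt the worker if it is still running
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A fast call completes normally.
        System.out.println(callWithTimeout(() -> inspectContainer(10), 1_000));

        // A hung call is abandoned after the timeout instead of blocking forever.
        try {
            callWithTimeout(() -> inspectContainer(60_000), 200);
        } catch (TimeoutException te) {
            System.out.println("Timed out trying to reach docker daemon");
        }
    }
}
```

Note the time-boxed call itself returns or fails within the deadline even though the worker thread may linger until its blocking call responds to interruption, which is the resource-leak concern raised in the discussion below.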

@jhaber
Member

jhaber commented Nov 23, 2015

Would you still have a stuck thread? If so, doesn't this just hide a resource leak and eventually you could run out of threads or HTTP connections?

@ssalinas
Member

ssalinas commented Nov 23, 2015

From my testing while things were in the stuck state, interrupting the thread / killing the process will still stop it correctly. Maybe "stuck" was a bad word; the docker daemon call just never returns and never times out.

@jhaber
Member

jhaber commented Nov 23, 2015

Interesting, maybe we should PR the spotify docker client if it's not respecting timeouts

@ssalinas
Member

ssalinas commented Nov 23, 2015

The spotify docker client is just using the Apache HTTP client underneath; default timeouts for connect and read get set to 5 and 30 seconds. I think the current bug with the docker daemon is that it keeps the connection open in such a way that the timeouts do not get hit, hence the usage of TimeLimiter here instead (docker CLI calls and calls to the docker daemon from any source/client hang as well).

The bug is a rare case, and the purpose of the PR is more to keep our executor from being pinned by it. If we can't launch something because docker is in a bad state, call it failed and move on.
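The two timeouts mentioned above can be illustrated with the JDK's own HTTP client (the endpoint URL here is a hypothetical tcp daemon address; docker commonly listens on a unix socket instead). The key caveat is that a read timeout is an *inactivity* timeout: it only fires when no bytes arrive at all, so a daemon that holds the connection open without ever going fully silent can evade it, which is consistent with wrapping the whole call in a wall-clock limit instead.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class DockerHttpTimeouts {

    // Configure connect/read timeouts on a connection to a hypothetical
    // docker daemon endpoint. These mirror the 5s/30s Apache client defaults
    // mentioned in the discussion.
    static HttpURLConnection configure(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5_000);   // fail if the TCP connect takes more than 5s
        conn.setReadTimeout(30_000);     // fail only if no bytes arrive for 30s straight
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // openConnection() does not touch the network, so this only shows the config.
        HttpURLConnection conn = configure("http://localhost:2375/containers/json");
        System.out.println(conn.getConnectTimeout() + "/" + conn.getReadTimeout());
    }
}
```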

@ssalinas ssalinas added the hs_staging label Dec 28, 2015

...a/com/hubspot/singularity/executor/SingularityExecutorThreadChecker.java
@@ -124,6 +125,8 @@ private int getNumUsedThreads(SingularityExecutorTaskProcessCallable taskProcess
      }
    } catch (DockerException e) {
      throw new ProcessFailedException(String.format("Could not get docker root pid due to error: %s", e));
    } catch (UncheckedTimeoutException te) {
      throw new ProcessFailedException("Timed out trying to reach docker daemon");

@tpetr
Member

tpetr commented Dec 28, 2015

could we include the timeout duration here? (i.e. Timed out trying to reach docker daemon after N seconds)

@ssalinas
Member

ssalinas commented Dec 28, 2015

Updated

@tpetr tpetr added this to the 0.4.8 milestone Dec 31, 2015

@@ -125,6 +126,8 @@ private boolean cleanDocker() {
    } catch (ContainerNotFoundException e) {

@wsorenson
Member

wsorenson commented Jan 4, 2016

You should use {} instead of String.format
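The suggestion above refers to SLF4J-style parameterized logging, where `{}` placeholders are substituted only if the message is actually emitted, instead of eagerly building the string with `String.format` on every call. To keep the sketch dependency-free, here is a stand-in that mimics the `{}` substitution semantics (the `format` helper is illustrative, not SLF4J's actual implementation):

```java
public class PlaceholderDemo {

    // Stand-in for SLF4J's "{}" placeholder substitution. With a real logger
    // (e.g. org.slf4j.Logger#warn(String, Object...)), this formatting work
    // is deferred until the log statement is actually emitted.
    static String format(String template, Object... args) {
        StringBuilder sb = new StringBuilder();
        int from = 0;
        for (Object arg : args) {
            int at = template.indexOf("{}", from);
            if (at < 0) {
                break;                       // more args than placeholders
            }
            sb.append(template, from, at).append(arg);
            from = at + 2;                   // skip past the "{}" just consumed
        }
        return sb.append(template.substring(from)).toString();
    }

    public static void main(String[] args) {
        System.out.println(format("Could not get docker root pid due to error: {}", "boom"));
    }
}
```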

wsorenson added a commit that referenced this pull request Jan 4, 2016

Merge pull request #773 from HubSpot/time_limited_docker
time-box the docker client to avoid ever getting stuck

@wsorenson wsorenson merged commit 9339962 into master Jan 4, 2016

0 of 2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build is in progress
continuous-integration/travis-ci/push: The Travis CI build is in progress

@ssalinas ssalinas deleted the time_limited_docker branch Feb 9, 2016
