Big kubernetes job ends cryptically #2884
Comments
That looks like Kubernetes is refusing connections from the leader.
Maybe Toil can be made to retry its request, and/or maybe it needs to
back off on the rate at which it is sending in job deletion requests.
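One way to lower the request rate is to pace the deletions on the client side. The sketch below is only an illustration, not what Toil does: `delete_jobs_paced`, `delete_one`, and `min_interval` are hypothetical names, and the interval value is an assumption that would need tuning for a given cluster.

```python
import time


def delete_jobs_paced(job_names, delete_one, min_interval=0.1,
                      sleep=time.sleep, clock=time.monotonic):
    """Call delete_one(name) for each job, keeping calls at least
    min_interval seconds apart so the API server is not flooded.

    delete_one is a hypothetical callable supplied by the caller;
    sleep and clock are injectable so the pacing can be tested.
    """
    last_call = None
    for name in job_names:
        if last_call is not None:
            # Wait out whatever remains of the minimum interval.
            remaining = min_interval - (clock() - last_call)
            if remaining > 0:
                sleep(remaining)
        last_call = clock()
        delete_one(name)
```

With the real client, `delete_one` could wrap the call seen in the traceback, for example `lambda name: batch_api.delete_namespaced_job(name, namespace, propagation_policy='Background')`, where `batch_api` and `namespace` are whatever the caller already has.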
…On 12/9/19, Glenn Hickey ***@***.***> wrote:
After swamping the cluster with a bunch of small jobs, which seem to run
fine, my cactus run aborts with the following connection error.
```
Traceback (most recent call last):
File "/venv2/bin/cactus", line 8, in <module>
sys.exit(main())
File
"/venv2/local/lib/python2.7/site-packages/cactus/progressive/cactus_progressive.py",
line 496, in main
runCactusProgressive(options)
File
"/venv2/local/lib/python2.7/site-packages/cactus/progressive/cactus_progressive.py",
line 546, in runCactusProgressive
halID = toil.start(RunCactusPreprocessorThenProgressiveDown(options,
project, memory=configWrapper.getDefaultMemory()))
File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 795, in
start
self._shutdownBatchSystem()
File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 1073,
in _shutdownBatchSystem
self._batchSystem.shutdown()
File
"/usr/local/lib/python2.7/dist-packages/toil/batchSystems/kubernetes.py",
line 666, in shutdown
propagation_policy='Background')
File
"/usr/local/lib/python2.7/dist-packages/kubernetes/client/apis/batch_v1_api.py",
line 306, in delete_namespaced_job
(data) = self.delete_namespaced_job_with_http_info(name, namespace,
**kwargs)
File
"/usr/local/lib/python2.7/dist-packages/kubernetes/client/apis/batch_v1_api.py",
line 406, in delete_namespaced_job_with_http_info
collection_formats=collection_formats)
File
"/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py",
line 334, in call_api
_return_http_data_only, collection_formats, _preload_content,
_request_timeout)
File
"/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py",
line 168, in __call_api
_request_timeout=_request_timeout)
File
"/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py",
line 400, in request
body=body)
File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/rest.py",
line 256, in DELETE
body=body)
File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/rest.py",
line 166, in request
headers=headers)
File "/usr/local/lib/python2.7/dist-packages/urllib3/request.py", line 76,
in request
method, url, fields=fields, headers=headers, **urlopen_kw
File "/usr/local/lib/python2.7/dist-packages/urllib3/request.py", line 97,
in request_encode_url
return self.urlopen(method, url, **extra_kw)
File "/usr/local/lib/python2.7/dist-packages/urllib3/poolmanager.py", line
330, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py",
line 760, in urlopen
**response_kw
File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py",
line 760, in urlopen
**response_kw
File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py",
line 760, in urlopen
**response_kw
File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py",
line 720, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python2.7/dist-packages/urllib3/util/retry.py", line
436, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='10.96.0.1',
port=443): Max retries exceeded with url:
/apis/batch/v1/namespaces/vg/jobs/root-toil-8731999c-ed32-4aad-8093-bb325fc30f17-4803?propagationPolicy=Background
(Caused by
NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at
0x7f27d050aad0>: Failed to establish a new connection: [Errno 111]
Connection refused',))
```
|
And when I retry it, I get a different (but similar?) error after a couple hours
|
Ooh I'm not sure what we can do about that one. Maybe we need to set a
longer timeout and give the API more time to list all 7000 jobs we've
put in the namespace?
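Since the 504 body reports that the server itself gave up after 1m0s, a longer client-side timeout alone may not be enough. Another option might be to list the jobs in pages so that each request stays small. The sketch below assumes the list call accepts `limit`, `_continue`, and `_request_timeout` keyword arguments and returns an object with `.items` and `.metadata._continue`, which is my understanding of the Kubernetes Python client; `list_all_jobs`, `list_page`, and the default values are hypothetical.

```python
def list_all_jobs(list_page, page_size=500, request_timeout=120):
    """Collect every item by requesting pages of at most page_size.

    list_page is a hypothetical callable taking keyword arguments
    (limit, _continue, _request_timeout) and returning an object with
    .items and .metadata._continue, as the client's list calls are
    assumed to do here.
    """
    items = []
    token = None
    while True:
        kwargs = {"limit": page_size, "_request_timeout": request_timeout}
        if token:
            # Resume from where the previous page stopped.
            kwargs["_continue"] = token
        page = list_page(**kwargs)
        items.extend(page.items)
        token = page.metadata._continue
        if not token:
            # No continue token means this was the last page.
            return items
```

With the real client, `list_page` could be `functools.partial(batch_api.list_namespaced_job, namespace)`, reusing the call shown in the traceback.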
…On 12/9/19, Glenn Hickey ***@***.***> wrote:
And when I retry it, I get a different (but similar?) error after a couple
hours
```
Stopping real-time logging server.
Joining real-time logging server thread.
Traceback (most recent call last):
File "/venv2/bin/cactus", line 8, in <module>
sys.exit(main())
File
"/venv2/local/lib/python2.7/site-packages/cactus/progressive/cactus_progressive.py",
line 496, in main
runCactusProgressive(options)
File
"/venv2/local/lib/python2.7/site-packages/cactus/progressive/cactus_progressive.py",
line 506, in runCactusProgressive
halID = toil.restart()
File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 827, in
restart
self._shutdownBatchSystem()
File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 1073,
in _shutdownBatchSystem
self._batchSystem.shutdown()
File
"/usr/local/lib/python2.7/dist-packages/toil/batchSystems/kubernetes.py",
line 649, in shutdown
for job in self._ourJobObjects():
File
"/usr/local/lib/python2.7/dist-packages/toil/batchSystems/kubernetes.py",
line 303, in _ourJobObjects
results = self.batchApi.list_namespaced_job(self.namespace, **kwargs)
File
"/usr/local/lib/python2.7/dist-packages/kubernetes/client/apis/batch_v1_api.py",
line 643, in list_namespaced_job
(data) = self.list_namespaced_job_with_http_info(namespace, **kwargs)
File
"/usr/local/lib/python2.7/dist-packages/kubernetes/client/apis/batch_v1_api.py",
line 743, in list_namespaced_job_with_http_info
collection_formats=collection_formats)
File
"/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py",
line 334, in call_api
_return_http_data_only, collection_formats, _preload_content,
_request_timeout)
File
"/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py",
line 168, in __call_api
_request_timeout=_request_timeout)
File
"/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py",
line 355, in request
headers=headers)
File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/rest.py",
line 231, in GET
query_params=query_params)
File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/rest.py",
line 222, in request
raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (504)
Reason: Gateway Timeout
HTTP response headers: HTTPHeaderDict({'Date': 'Mon, 09 Dec 2019 18:42:22
GMT', 'Content-Length': '136', 'Content-Type': 'text/plain;
charset=utf-8'})
HTTP response body: {"metadata":{},"status":"Failure","message":"Timeout:
request did not complete within
1m0s","reason":"Timeout","details":{},"code":504}
```
|
If it is not already doing it, a dynamic backoff retry might be
a good approach. It is really hard to define timeouts that
always work.
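A minimal sketch of what such a retry could look like, using exponential backoff with jitter. Everything here is an assumption for illustration: `retry_with_backoff`, `is_retryable`, and the default delays are hypothetical, and deciding which errors are worth retrying (for example the `MaxRetryError` or the 504 `ApiException` from the tracebacks) is left to the caller.

```python
import random
import time


def retry_with_backoff(operation, is_retryable, max_attempts=5,
                       base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep, rng=random.random):
    """Run operation(), retrying on errors that is_retryable accepts.

    The delay ceiling doubles after each failed attempt, capped at
    max_delay, and the actual sleep is a random fraction of that
    ceiling (jitter) so many clients do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            # Give up on the last attempt or on non-retryable errors.
            if attempt == max_attempts - 1 or not is_retryable(exc):
                raise
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(ceiling * rng())
```

The upside over a fixed timeout is that a briefly overloaded API server gets progressively more room to recover, while a healthy one costs no extra waiting.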
Adam Novak <notifications@github.com> writes:
… Ooh I'm not sure what we can do about that one. Maybe we need to set a
longer timeout and give the API more time to list all 7000 jobs we've
put in the namespace?
|
After swamping the cluster with a bunch of small jobs, which seem to run fine, my cactus run aborts with the following connection error.
┆Issue is synchronized with this Jira Task
┆Issue Number: TOIL-468