Big kubernetes job ends cryptically #2884

Closed
glennhickey opened this issue Dec 9, 2019 · 4 comments
@glennhickey
Contributor

glennhickey commented Dec 9, 2019

After swamping the cluster with a bunch of small jobs, which seem to run fine, my cactus run aborts with the following connection error.

Traceback (most recent call last):
  File "/venv2/bin/cactus", line 8, in <module>
    sys.exit(main())
  File "/venv2/local/lib/python2.7/site-packages/cactus/progressive/cactus_progressive.py", line 496, in main
    runCactusProgressive(options)
  File "/venv2/local/lib/python2.7/site-packages/cactus/progressive/cactus_progressive.py", line 546, in runCactusProgressive
    halID = toil.start(RunCactusPreprocessorThenProgressiveDown(options, project, memory=configWrapper.getDefaultMemory()))
  File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 795, in start
    self._shutdownBatchSystem()
  File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 1073, in _shutdownBatchSystem
    self._batchSystem.shutdown()
  File "/usr/local/lib/python2.7/dist-packages/toil/batchSystems/kubernetes.py", line 666, in shutdown
    propagation_policy='Background')
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/apis/batch_v1_api.py", line 306, in delete_namespaced_job
    (data) = self.delete_namespaced_job_with_http_info(name, namespace, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/apis/batch_v1_api.py", line 406, in delete_namespaced_job_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py", line 334, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py", line 168, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py", line 400, in request
    body=body)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/rest.py", line 256, in DELETE
    body=body)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/rest.py", line 166, in request
    headers=headers)
  File "/usr/local/lib/python2.7/dist-packages/urllib3/request.py", line 76, in request
    method, url, fields=fields, headers=headers, **urlopen_kw
  File "/usr/local/lib/python2.7/dist-packages/urllib3/request.py", line 97, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/lib/python2.7/dist-packages/urllib3/poolmanager.py", line 330, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 760, in urlopen
    **response_kw
  File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 760, in urlopen
    **response_kw
  File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 760, in urlopen
    **response_kw
  File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 720, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python2.7/dist-packages/urllib3/util/retry.py", line 436, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='10.96.0.1', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/vg/jobs/root-toil-8731999c-ed32-4aad-8093-bb325fc30f17-4803?propagationPolicy=Background (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f27d050aad0>: Failed to establish a new connection: [Errno 111] Connection refused',))
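
In the traceback above, the job deletion fails once urllib3's own immediate retries are exhausted while the API server is refusing connections. One possible mitigation, sketched here purely as an illustration and not as what Toil does, is to wrap the call in a retry loop with exponential backoff so a briefly overloaded API server has time to recover. The helper name `call_with_backoff` and all of its parameters are hypothetical.

```python
import time


def call_with_backoff(func, retry_on=(Exception,), attempts=5,
                      base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying with exponential backoff on the given exceptions.

    Hypothetical helper: the name, defaults, and signature are assumptions
    made for this sketch, not part of Toil or the Kubernetes client.
    """
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                # Out of attempts: let the caller see the last error.
                raise
            # Wait base_delay, then 2x, 4x, ... before trying again.
            sleep(base_delay * (2 ** attempt))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_delete():
        # Stand-in for an API call that fails twice, then succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("Connection refused")
        return "deleted"

    # base_delay is shortened so the demonstration finishes quickly.
    print(call_with_backoff(flaky_delete, retry_on=(ConnectionError,),
                            base_delay=0.01))
```

In the situation reported here, `func` could be a small closure around the `delete_namespaced_job` call and `retry_on` could name the `MaxRetryError` and `ApiException` classes seen in the tracebacks; that wiring is untested and is only a suggestion.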

┆Issue is synchronized with this Jira Task
┆Issue Number: TOIL-468

@adamnovak
Member

adamnovak commented Dec 9, 2019 via email

@glennhickey
Contributor Author

And when I retry it, I get a different (but similar?) error after a couple of hours:

Stopping real-time logging server.
Joining real-time logging server thread.
Traceback (most recent call last):
  File "/venv2/bin/cactus", line 8, in <module>
    sys.exit(main())
  File "/venv2/local/lib/python2.7/site-packages/cactus/progressive/cactus_progressive.py", line 496, in main
    runCactusProgressive(options)
  File "/venv2/local/lib/python2.7/site-packages/cactus/progressive/cactus_progressive.py", line 506, in runCactusProgressive
    halID = toil.restart()
  File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 827, in restart
    self._shutdownBatchSystem()
  File "/usr/local/lib/python2.7/dist-packages/toil/common.py", line 1073, in _shutdownBatchSystem
    self._batchSystem.shutdown()
  File "/usr/local/lib/python2.7/dist-packages/toil/batchSystems/kubernetes.py", line 649, in shutdown
    for job in self._ourJobObjects():
  File "/usr/local/lib/python2.7/dist-packages/toil/batchSystems/kubernetes.py", line 303, in _ourJobObjects
    results = self.batchApi.list_namespaced_job(self.namespace, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/apis/batch_v1_api.py", line 643, in list_namespaced_job
    (data) = self.list_namespaced_job_with_http_info(namespace, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/apis/batch_v1_api.py", line 743, in list_namespaced_job_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py", line 334, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py", line 168, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/api_client.py", line 355, in request
    headers=headers)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/rest.py", line 231, in GET
    query_params=query_params)
  File "/usr/local/lib/python2.7/dist-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (504)
Reason: Gateway Timeout
HTTP response headers: HTTPHeaderDict({'Date': 'Mon, 09 Dec 2019 18:42:22 GMT', 'Content-Length': '136', 'Content-Type': 'text/plain; charset=utf-8'})
HTTP response body: {"metadata":{},"status":"Failure","message":"Timeout: request did not complete within 1m0s","reason":"Timeout","details":{},"code":504}
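
Both tracebacks come from the batch system's `shutdown()` step, so an error during cleanup ends up being the error the whole run reports. A minimal sketch of one alternative, assuming it is acceptable to leave some jobs behind, is to treat cleanup as best effort: log each failure and carry on. The function `best_effort_cleanup` and its two callable arguments are hypothetical stand-ins for the listing and deletion calls, not Toil's actual code.

```python
import logging

logger = logging.getLogger(__name__)


def best_effort_cleanup(list_jobs, delete_job):
    """Try to delete every job, logging failures instead of raising.

    list_jobs and delete_job are hypothetical callables standing in for
    the real API calls. Returns the names of jobs that could not be
    deleted, or None if the job list itself could not be fetched.
    """
    try:
        names = list(list_jobs())
    except Exception as e:
        logger.warning("Could not list jobs during shutdown: %s", e)
        return None

    failed = []
    for name in names:
        try:
            delete_job(name)
        except Exception as e:
            # Keep going so one bad request does not stop the cleanup.
            logger.warning("Could not delete job %s: %s", name, e)
            failed.append(name)
    return failed
```

The trade-off is that jobs can be leaked on the cluster without the run failing, so the returned list (or the log) would need to be checked afterwards.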

@adamnovak
Member

adamnovak commented Dec 9, 2019 via email

@diekhans
Collaborator

diekhans commented Dec 9, 2019 via email
