K8S: add retries for job and pod status fetching #756

Merged: 1 commit, Oct 13, 2021

oavdeev (Collaborator) commented Oct 13, 2021

When running our integration test suite, I'd regularly run into failures in the longer-running tests: the K8S API sometimes returns intermittent 500 errors, like the one below, or a variation of it with a gRPC "Transport closed" error, and that would cause tasks to fail. This PR adds retries to the K8S API calls that are made most frequently (job and pod status fetching), and that fixes the issue for the integration tests.

We could extend this to the other K8S API calls we make too (but that could be a separate PR, if this one looks good).

2021-10-12 20:51:09.614 [10748/foreach_join_x/197086 (pid 52067)] Internal error
--
  | 2021-10-12 20:51:09.623 [10748/foreach_join_x/197086 (pid 52067)] Traceback (most recent call last):
  | 2021-10-12 20:51:09.623 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/builds/buildkite-i-0455616781cbc5b41-1/outerbounds/integration-test-suite-eks/metaflow/cli.py", line 1008, in main
  | 2021-10-12 20:51:09.623 [10748/foreach_join_x/197086 (pid 52067)] start(auto_envvar_prefix='METAFLOW', obj=state)
  | 2021-10-12 20:51:09.624 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/.local/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
  | 2021-10-12 20:51:09.624 [10748/foreach_join_x/197086 (pid 52067)] return self.main(*args, **kwargs)
  | 2021-10-12 20:51:10.201 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/.local/lib/python3.7/site-packages/click/core.py", line 1053, in main
  | 2021-10-12 20:51:10.201 [10748/foreach_join_x/197086 (pid 52067)] rv = self.invoke(ctx)
  | 2021-10-12 20:51:10.201 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/.local/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
  | 2021-10-12 20:51:10.201 [10748/foreach_join_x/197086 (pid 52067)] return _process_result(sub_ctx.command.invoke(sub_ctx))
  | 2021-10-12 20:51:10.201 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/.local/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
  | 2021-10-12 20:51:10.202 [10748/foreach_join_x/197086 (pid 52067)] return _process_result(sub_ctx.command.invoke(sub_ctx))
  | 2021-10-12 20:51:10.202 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/.local/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
  | 2021-10-12 20:51:10.202 [10748/foreach_join_x/197086 (pid 52067)] return ctx.invoke(self.callback, **ctx.params)
  | 2021-10-12 20:51:10.202 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/.local/lib/python3.7/site-packages/click/core.py", line 754, in invoke
  | 2021-10-12 20:51:10.202 [10748/foreach_join_x/197086 (pid 52067)] return __callback(*args, **kwargs)
  | 2021-10-12 20:51:10.202 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/.local/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
  | 2021-10-12 20:51:10.203 [10748/foreach_join_x/197086 (pid 52067)] return f(get_current_context(), *args, **kwargs)
  | 2021-10-12 20:51:10.203 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/builds/buildkite-i-0455616781cbc5b41-1/outerbounds/integration-test-suite-eks/metaflow/plugins/aws/eks/kubernetes_cli.py", line 238, in step
  | 2021-10-12 20:51:10.203 [10748/foreach_join_x/197086 (pid 52067)] kubernetes.wait(echo=echo)
  | 2021-10-12 20:51:10.203 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/builds/buildkite-i-0455616781cbc5b41-1/outerbounds/integration-test-suite-eks/metaflow/plugins/aws/eks/kubernetes.py", line 355, in wait
  | 2021-10-12 20:51:10.203 [10748/foreach_join_x/197086 (pid 52067)] wait_for_launch(self._job)
  | 2021-10-12 20:51:10.204 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/builds/buildkite-i-0455616781cbc5b41-1/outerbounds/integration-test-suite-eks/metaflow/plugins/aws/eks/kubernetes.py", line 321, in wait_for_launch
  | 2021-10-12 20:51:10.204 [10748/foreach_join_x/197086 (pid 52067)] new_status = job.status
  | 2021-10-12 20:51:10.204 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/builds/buildkite-i-0455616781cbc5b41-1/outerbounds/integration-test-suite-eks/metaflow/plugins/aws/eks/kubernetes_client.py", line 682, in status
  | 2021-10-12 20:51:10.204 [10748/foreach_join_x/197086 (pid 52067)] return self._get_status()
  | 2021-10-12 20:51:10.204 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/builds/buildkite-i-0455616781cbc5b41-1/outerbounds/integration-test-suite-eks/metaflow/plugins/aws/eks/kubernetes_client.py", line 546, in _get_status
  | 2021-10-12 20:51:10.204 [10748/foreach_join_x/197086 (pid 52067)] self._pod = self._fetch_pod()
  | 2021-10-12 20:51:10.205 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/builds/buildkite-i-0455616781cbc5b41-1/outerbounds/integration-test-suite-eks/metaflow/plugins/aws/eks/kubernetes_client.py", line 423, in _fetch_pod
  | 2021-10-12 20:51:10.205 [10748/foreach_join_x/197086 (pid 52067)] label_selector="job-name={}".format(self._name),
  | 2021-10-12 20:51:10.205 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/.local/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 15302, in list_namespaced_pod
  | 2021-10-12 20:51:10.205 [10748/foreach_join_x/197086 (pid 52067)] return self.list_namespaced_pod_with_http_info(namespace, **kwargs)  # noqa: E501
  | 2021-10-12 20:51:10.205 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/.local/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 15427, in list_namespaced_pod_with_http_info
  | 2021-10-12 20:51:10.206 [10748/foreach_join_x/197086 (pid 52067)] collection_formats=collection_formats)
  | 2021-10-12 20:51:10.206 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/.local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
  | 2021-10-12 20:51:10.206 [10748/foreach_join_x/197086 (pid 52067)] _preload_content, _request_timeout, _host)
  | 2021-10-12 20:51:10.206 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/.local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
  | 2021-10-12 20:51:10.206 [10748/foreach_join_x/197086 (pid 52067)] _request_timeout=_request_timeout)
  | 2021-10-12 20:51:10.206 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/.local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
  | 2021-10-12 20:51:10.207 [10748/foreach_join_x/197086 (pid 52067)] headers=headers)
  | 2021-10-12 20:51:10.207 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/.local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 243, in GET
  | 2021-10-12 20:51:10.207 [10748/foreach_join_x/197086 (pid 52067)] query_params=query_params)
  | 2021-10-12 20:51:10.207 [10748/foreach_join_x/197086 (pid 52067)] File "/var/lib/buildkite-agent/.local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 233, in request
  | 2021-10-12 20:51:10.207 [10748/foreach_join_x/197086 (pid 52067)] raise ApiException(http_resp=r)
  | 2021-10-12 20:51:10.207 [10748/foreach_join_x/197086 (pid 52067)] kubernetes.client.exceptions.ApiException: (500)
  | 2021-10-12 20:51:10.208 [10748/foreach_join_x/197086 (pid 52067)] Reason: Internal Server Error
  | 2021-10-12 20:51:10.208 [10748/foreach_join_x/197086 (pid 52067)] HTTP response headers: HTTPHeaderDict({'Audit-Id': '5e724cff-c10d-4ceb-895d-cfd8628552e9', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'cd0ea542-87b8-4eff-a88e-08f6aa6e3a68', 'X-Kubernetes-Pf-Prioritylevel-Uid': '8d7e6e31-ea70-4ae2-b89f-1ea3e1738e07', 'Date': 'Tue, 12 Oct 2021 20:51:09 GMT', 'Content-Length': '122'})
  | 2021-10-12 20:51:10.208 [10748/foreach_join_x/197086 (pid 52067)] HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"etcdserver: request timed out","code":500}
  | 2021-10-12 20:51:10.208 [10748/foreach_join_x/197086 (pid 52067)]
  | 2021-10-12 20:51:10.208 [10748/foreach_join_x/197086 (pid 52067)]
  | 2021-10-12 20:51:10.209 [10748/foreach_join_x/197086 (pid 52067)]
  | 2021-10-12 20:51:10.277 [10748/foreach_join_x/197086 (pid 52067)] Task failed.
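
For illustration only, the kind of retry described above could be sketched roughly as follows. This is not the code from this PR: the decorator name `retry_on_server_error`, its parameters, the backoff policy, and the `FakeApiError` stand-in class are all hypothetical. The only thing assumed about the failing call is that its exception exposes an integer `status` attribute, as the `ApiException` in the traceback does.

```python
import functools
import random
import time


class FakeApiError(Exception):
    """Hypothetical stand-in for an API client error carrying an HTTP status."""

    def __init__(self, status):
        super().__init__("({})".format(status))
        self.status = status


def retry_on_server_error(max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry the wrapped call when it raises an error whose status is 5xx.

    All names and defaults here are assumptions made for this sketch.
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    status = getattr(exc, "status", None)
                    # Treat only 5xx responses as transient; re-raise anything
                    # else immediately, and give up after the last attempt.
                    if status is None or status < 500 or attempt == max_attempts:
                        raise
                    # Exponential backoff with a little random jitter.
                    delay = base_delay * (2 ** (attempt - 1))
                    sleep(delay + random.uniform(0, delay / 2))

        return wrapper

    return decorator


# Usage: a status fetch that fails twice with a 500 before succeeding.
calls = {"count": 0}


@retry_on_server_error(max_attempts=3, sleep=lambda seconds: None)
def fetch_pod_status():
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeApiError(500)
    return "Running"


result = fetch_pod_status()
```

Restricting retries to 5xx responses keeps real client errors (a 404 for a pod that does not exist, for example) failing fast, and status reads are safe to repeat because they do not change cluster state.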


savingoyal merged commit f58ae58 into Netflix:kubernetes-pr Oct 13, 2021
oavdeev deleted the k8s-api-retry branch October 13, 2021 17:44
savingoyal added a commit that referenced this pull request Oct 15, 2021
* Refactor @resources decorator

The @resources decorator is shared by all compute-related decorators:
@batch, @lambda, @k8s, @titus. This patch moves it out of
batch_decorator.py so that other decorators can cleanly reference
it.

* Update __init__.py

* Refactor @batch decorator

* more change

* more changes

* more changes

* @kubernetes

* Kubernetes

* More changes

* More changes

* more changes

* some more changes

* more changes

* add disk space

* Add todos

* some fixes

* add k8s testing context

* more changes

* some more changes

* minor fixups

* better error handling for evicted pods (#711)

* fixes for pod/job metadata race conditions (#704)

* K8S: label value sanitizer (#719)

* rename name_space to namespace for k8s plugin (#750)

* fix k8s attribute handling bug (#753)

* tweak k8s test resources (to run on kind) (#754)

* add k8s api retries (#756)

* update done marker

* Use linux binaries in @conda when run in k8s (#758)

The Conda environment should pack the Linux Python binary when run on macOS,
to avoid this error:

metaflow_PlayListFlow_osx-64_179c56284704ca8e53622f848a3df27cdd1f4327/bin/python: cannot execute binary file: Exec format error

* fix comment

* fix merge conflict

* update char

Co-authored-by: Oleg Avdeev <oleg.v.avdeev@gmail.com>
Co-authored-by: Roman Kindruk <36699371+sappier@users.noreply.github.com>