
K8S: better error handling for evicted pods #711

Merged
merged 1 commit into Netflix:kubernetes-pr on Sep 22, 2021

Conversation

oavdeev
Collaborator

@oavdeev oavdeev commented Sep 21, 2021

From testing on our EKS cluster, it turns out container_statuses can sometimes be None for a failed pod. This appears to happen when the pod was assigned to a node but hadn't had a chance to start its containers yet before being evicted. In that case the current version fails with a non-descriptive error ("NoneType not subscriptable" and a stack trace). This PR adds a handler for this case and prints a nicer error message, derived from V1PodStatus.reason:

...
2021-09-21 20:27:10.659 [6617/start/117133 (pid 14120)] Kubernetes error:
2021-09-21 20:27:10.660 [6617/start/117133 (pid 14120)] Evicted: Pod The node had condition: [DiskPressure]. . This could be a transient error. Use @retry to retry.
2021-09-21 20:27:10.760 [6617/start/117133 (pid 14120)]

This condition is somewhat tricky to repro without running the full test suite, so for reference, here's what V1PodStatus looks like in those cases, as returned by the K8S API:

    ...
    "status": {
        "conditions": None,
        "container_statuses": None,
        "ephemeral_container_statuses": None,
        "host_ip": None,
        "init_container_statuses": None,
        "message": "Pod The node had condition: [DiskPressure]. ",
        "nominated_node_name": None,
        "phase": "Failed",
        "pod_ip": None,
        "pod_i_ps": None,
        "qos_class": None,
        "reason": "Evicted",
        "start_time": datetime.datetime(2021, 9, 17, 2, 26, 4, tzinfo=tzlocal()),
    },

@oavdeev oavdeev changed the title better error handling for evicted pods K8S: better error handling for evicted pods Sep 21, 2021
@oavdeev oavdeev marked this pull request as ready for review September 21, 2021 22:48
@savingoyal savingoyal merged commit c784b16 into Netflix:kubernetes-pr Sep 22, 2021
savingoyal added a commit that referenced this pull request Oct 15, 2021
* Refactor @resources decorator

The @resources decorator is shared by all compute-related decorators -
@batch, @lambda, @k8s, @titus. This patch moves it out of
batch_decorator.py so that other decorators can cleanly reference
it.

* Update __init__.py

* Refactor @Batch decorator

* more change

* more changes

* more changes

* @kubernetes

* Kubernetes

* More changes

* More changes

* more changes

* some more changes

* more changes

* add disk space

* Add todos

* some fixes

* add k8s testing context

* more changes

* some more changes

* minor fixups

* better error handling for evicted pods (#711)

* fixes for pod/job metadata race conditions (#704)

* K8S: label value sanitizer (#719)

* rename name_space to namespace for k8s plugin (#750)

* fix k8s attribute handling bug (#753)

* tweak k8s test resources (to run on kind) (#754)

* add k8s api retries (#756)

* update done marker

* Use linux binaries in @conda when run in k8s (#758)

The Conda environment should pack the Linux Python binary when run on macOS
to avoid an error:

metaflow_PlayListFlow_osx-64_179c56284704ca8e53622f848a3df27cdd1f4327/bin/python: cannot execute binary file: Exec format error

* fix comment

* fix merge conflict

* update char

Co-authored-by: Oleg Avdeev <oleg.v.avdeev@gmail.com>
Co-authored-by: Roman Kindruk <36699371+sappier@users.noreply.github.com>