
Add exceptions for dealing with image errors on pods #9

Open. Wants to merge 17 commits into base: main.
Conversation

@adamrdrew (Collaborator):
This patch adds exceptions for image pull backoff and image errors. But, at the time of this writing, I haven't adequately confirmed that it works.

@adamrdrew (Collaborator, Author):

ocviapy.StatusError: image pull error for resource pod/advisor-backend-run-django-migration-job-ic906yi-xwm7q/advisor-backend-run-django-migration-job-ic906yi-xwm7q
ERROR: deploy failed: image pull error for resource pod/advisor-backend-run-django-migration-job-ic906yi-xwm7q/advisor-backend-run-django-migration-job-ic906yi-xwm7q

@adamrdrew marked this pull request as ready for review on September 1, 2023 at 12:41.
@bsquizz (Collaborator) left a comment:

Nice way of doing this, I forgot we copy the watcher's resources. A couple of minor code tweaks I spotted.

Also, I think right now you're missing the case where ResourceWaiter was told to watch a pod. You could add a check right here:

https://github.com/RedHatInsights/ocviapy/pull/9/files#diff-028ea95c402cb62a8d41c1f73660fbfbe0fc44359b8a960ffacb4d54a0abaf2fR648

You'll want to check if resource.image_pull_error is True right before that if self.watch_owned

You could also move the error check out of the _check_owned_resources function and just put it here in the for loop: https://github.com/RedHatInsights/ocviapy/pull/9/files#diff-028ea95c402cb62a8d41c1f73660fbfbe0fc44359b8a960ffacb4d54a0abaf2fR652

So I think _observe could look like this:

        # update our records for this resource
        self.observed_resources[key] = resource
        
        if resource.image_pull_error:
            raise StatusError(f"image pull error for resource {resource.key}/{resource.name}")
        
        if self.watch_owned:
            # use .copy() in case dict changes during iteration
            for _, r in self.watcher.resources.copy().items():
                if r.image_pull_error:
                    raise StatusError(f"image pull error for resource {r.key}/{r.name}")
                self._check_owned_resources(r)

@adamrdrew (Collaborator, Author):
Those are all great suggestions; I incorporated them all. I'm going to mess about in the debugger and see if I can make the exceptions nicer, as you mentioned.

@bsquizz (Collaborator) commented Sep 1, 2023:

Looks like message will give more info:

    state:
      waiting:
        message: Back-off pulling image "quay.io/cloudservices/automation-analytics-frontend:c8bf506"
        reason: ImagePullBackOff

So it might be good to print the 'message' and 'reason', and to indicate the container name as well?

Something like:

raise StatusError(f"{reason} error on {resource.key}/{resource.name} (container {container_name}): {message}")
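
A minimal sketch of where those fields live in the pod data, assuming the standard Kubernetes pod status layout; get_image_pull_details and IMAGE_ERROR_REASONS are hypothetical names for illustration, not part of ocviapy:

# Hypothetical helper: pulls (container_name, reason, message) out of the pod
# data for any container stuck in a waiting state caused by an image error.
IMAGE_ERROR_REASONS = ("ImagePullBackOff", "ErrImagePull", "InvalidImageName")

def get_image_pull_details(pod_data):
    for status in pod_data.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in IMAGE_ERROR_REASONS:
            yield status.get("name"), waiting["reason"], waiting.get("message", "")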

@bsquizz (Collaborator) commented Sep 1, 2023:

You already fetch all this info when you loop through the container statuses in .image_pull_error -- you could maybe just log.error it there as a start.
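
For illustration, that could look roughly like the sketch below, reusing the hypothetical get_image_pull_details helper from above; the property body is an assumption, not the actual ocviapy code:

import logging

log = logging.getLogger(__name__)

# on the resource class:
@property
def image_pull_error(self):
    # log the reason/message for each failing container as we find it,
    # then report whether any image error was observed
    found = False
    for name, reason, message in get_image_pull_details(self.data):
        log.error("%s on %s/%s (container %s): %s",
                  reason, self.key, self.name, name, message)
        found = True
    return found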

@adamrdrew (Collaborator, Author):
I feel pretty good about that last push. Simplified and DRYed things up a bit, with better messaging:

2023-09-01 13:17:38 [   ERROR] [          MainThread] hit status error: image pull error for resource pod/advisor-backend-run-django-migration-job-pe398z2-zt4lv/advisor-backend-run-django-migration-job-pe398z2-zt4lv: advisor-backend-run-django-migration-job-pe398z2: ImagePullBackOff quay.io/cloudservices/advisor-backend:324324

@adamrdrew (Collaborator, Author):
Saw your feedback after I pushed. Even better error message:

2023-09-01 13:26:26 [   ERROR] [          MainThread] hit status error: Image Pull Failed: advisor-backend-run-django-migration-job-pe398z2-zt4lv ImagePullBackOff Back-off pulling image "quay.io/cloudservices/advisor-backend:324324" 

for owner_ref in resource.data["metadata"].get("ownerReferences", []):
    restype_matches = owner_ref["kind"].lower() == self.restype
    owner_uid_matches = owner_ref["uid"] == self.resource.uid
    if restype_matches and owner_uid_matches:
        # this resource is owned by "self"
        previously_observed = False
Review comment (Collaborator):
This code block is unchanged, just moved into the new _check_status_if_owned func below for better readability

if self.watch_owned:
    # use .copy() in case dict changes during iteration
Review comment (Collaborator):
This code block was moved into the _check_owned_resources func. The only new addition to this block of code is https://github.com/RedHatInsights/ocviapy/pull/9/files#diff-028ea95c402cb62a8d41c1f73660fbfbe0fc44359b8a960ffacb4d54a0abaf2fR666
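
Piecing the thread together, the refactored function might look roughly like this; the body below is an assumption reconstructed from the _observe snippet suggested earlier, not the actual diff:

def _check_owned_resources(self):
    # use .copy() in case dict changes during iteration
    for _, r in self.watcher.resources.copy().items():
        if r.image_pull_error:
            raise StatusError(f"image pull error for resource {r.key}/{r.name}")
        self._check_status_if_owned(r)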

@Victoremepunto (Collaborator) left a comment:
I had a hard time trying to review this code, in part because of a lack of familiarity with it. I wish it had tests; I was about to request tests along with this feature, but the repo doesn't have any...

The way I tested this was thanks to bonfire: installing it with this PR's version of ocviapy seems to show that it is able to detect an image pull error. Unfortunately, if there's a pod in the namespace already in this state, it makes a bonfire deploy crash immediately after creating all the resources (thanks for pointing this out, @bsquizz).

I don't feel confident enough to give this a 👍, so I'll just comment rather than explicitly approve. I'm not requesting changes to add tests simply because the project itself doesn't have any tests whatsoever; that doesn't mean, though, that it couldn't use some, as they would help with future reviews and feature work.

@bsquizz (Collaborator) commented Sep 12, 2023:

It is a good point -- if you deploy a bad config and you want to re-deploy over that bad config -- this feature will cause the "wait for ready" steps to immediately raise an exception on that second re-deploy. Need to think through this a bit further ...
