
Add exceptions for dealing with image errors on pods #9

Open. Wants to merge 17 commits into base: main.
Conversation

@adamrdrew (Collaborator):
This patch adds exceptions for image pull backoff and image errors. But, at the time of this writing, I haven't adequately confirmed that it works.

@adamrdrew (Collaborator, Author):

ocviapy.StatusError: image pull error for resource pod/advisor-backend-run-django-migration-job-ic906yi-xwm7q/advisor-backend-run-django-migration-job-ic906yi-xwm7q
ERROR: deploy failed: image pull error for resource pod/advisor-backend-run-django-migration-job-ic906yi-xwm7q/advisor-backend-run-django-migration-job-ic906yi-xwm7q

@adamrdrew marked this pull request as ready for review on September 1, 2023 at 12:41.
@bsquizz (Collaborator) left a comment:

Nice way of doing this, I forgot we copy the watcher's resources. A couple of minor code tweaks I spotted.

Also, I think right now you're missing the case where ResourceWaiter was told to watch a pod. You could add a check right here:

https://github.com/RedHatInsights/ocviapy/pull/9/files#diff-028ea95c402cb62a8d41c1f73660fbfbe0fc44359b8a960ffacb4d54a0abaf2fR648

You'll want to check if resource.image_pull_error is True right before that if self.watch_owned

You could also move the error check out of the _check_owned_resources function and just put it here in the for loop: https://github.com/RedHatInsights/ocviapy/pull/9/files#diff-028ea95c402cb62a8d41c1f73660fbfbe0fc44359b8a960ffacb4d54a0abaf2fR652

So I think _observe could look like this:

        # update our records for this resource
        self.observed_resources[key] = resource
        
        if resource.image_pull_error:
            raise StatusError(f"image pull error for resource {resource.key}/{resource.name}")
        
        if self.watch_owned:
            # use .copy() in case dict changes during iteration
            for _, r in self.watcher.resources.copy().items():
                if r.image_pull_error:
                    raise StatusError(f"image pull error for resource {r.key}/{r.name}")
                self._check_owned_resources(r)

@adamrdrew (Collaborator, Author):
Those are all great suggestions; I incorporated them all. I'm going to mess about in the debugger and see if I can make the exceptions nicer, as you mentioned.

@bsquizz (Collaborator) commented Sep 1, 2023:

Looks like message will give more info:

    state:
      waiting:
        message: Back-off pulling image "quay.io/cloudservices/automation-analytics-frontend:c8bf506"
        reason: ImagePullBackOff

So it might be good to print the 'message' and 'reason', and to indicate the container name as well?

Something like:

raise StatusError(f"{reason} error on {resource.key}/{resource.name} (container {container_name}): {message}")
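
A minimal sketch of where those fields live in the pod data, assuming the standard Kubernetes pod status layout; get_image_pull_details and IMAGE_ERROR_REASONS are hypothetical names for illustration, not part of ocviapy:

# Hypothetical helper: pulls (container_name, reason, message) out of the pod
# data for any container stuck in a waiting state caused by an image error.
IMAGE_ERROR_REASONS = ("ImagePullBackOff", "ErrImagePull", "InvalidImageName")

def get_image_pull_details(pod_data):
    for status in pod_data.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting") or {}
        if waiting.get("reason") in IMAGE_ERROR_REASONS:
            yield status.get("name"), waiting["reason"], waiting.get("message", "")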

@bsquizz (Collaborator) commented Sep 1, 2023:

You already fetch all this info when you loop through the container statuses in .image_pull_error -- you could maybe just log.error it there as a start.
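
For illustration, that could look roughly like the sketch below, reusing the hypothetical get_image_pull_details helper from above; the property body is an assumption, not the actual ocviapy code:

import logging

log = logging.getLogger(__name__)

# on the resource class:
@property
def image_pull_error(self):
    # log the reason/message for each failing container as we find it,
    # then report whether any image error was observed
    found = False
    for name, reason, message in get_image_pull_details(self.data):
        log.error("%s on %s/%s (container %s): %s",
                  reason, self.key, self.name, name, message)
        found = True
    return found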

@adamrdrew (Collaborator, Author):
I feel pretty good about that last push. Simplified and DRYed things up a bit, with better messaging:

2023-09-01 13:17:38 [   ERROR] [          MainThread] hit status error: image pull error for resource pod/advisor-backend-run-django-migration-job-pe398z2-zt4lv/advisor-backend-run-django-migration-job-pe398z2-zt4lv: advisor-backend-run-django-migration-job-pe398z2: ImagePullBackOff quay.io/cloudservices/advisor-backend:324324

@adamrdrew (Collaborator, Author):
Saw your feedback after I pushed. Even better error message:

2023-09-01 13:26:26 [   ERROR] [          MainThread] hit status error: Image Pull Failed: advisor-backend-run-django-migration-job-pe398z2-zt4lv ImagePullBackOff Back-off pulling image "quay.io/cloudservices/advisor-backend:324324" 

for owner_ref in resource.data["metadata"].get("ownerReferences", []):
    restype_matches = owner_ref["kind"].lower() == self.restype
    owner_uid_matches = owner_ref["uid"] == self.resource.uid
    if restype_matches and owner_uid_matches:
        # this resource is owned by "self"
        previously_observed = False
Review comment (Collaborator):
This code block is unchanged, just moved into the new _check_status_if_owned func below for better readability

if self.watch_owned:
    # use .copy() in case dict changes during iteration
Review comment (Collaborator):
This code block was moved into the _check_owned_resources func. The only new addition to this block of code is https://github.com/RedHatInsights/ocviapy/pull/9/files#diff-028ea95c402cb62a8d41c1f73660fbfbe0fc44359b8a960ffacb4d54a0abaf2fR666
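
Piecing the thread together, the refactored function might look roughly like this; the body below is an assumption reconstructed from the _observe snippet suggested earlier, not the actual diff:

def _check_owned_resources(self):
    # use .copy() in case dict changes during iteration
    for _, r in self.watcher.resources.copy().items():
        if r.image_pull_error:
            raise StatusError(f"image pull error for resource {r.key}/{r.name}")
        self._check_status_if_owned(r)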

@Victoremepunto (Collaborator) left a comment:
I had a hard time trying to review this code, in part because of a lack of familiarity with it. I wish it had tests; I was about to request tests along with this feature, but the repo doesn't have any...

The way I tested this was thanks to bonfire: installing it with this PR's version of ocviapy seems to show that it is able to detect an image pull error. Unfortunately, if there's a pod in the namespace already in this state, it makes a bonfire deploy crash immediately after creating all the resources (thanks for pointing this out, @bsquizz).

I don't feel confident enough to give this a 👍, so I'll just comment rather than explicitly approve. I'm not requesting changes to add tests simply because the project itself doesn't have any tests whatsoever; that doesn't mean, though, that it couldn't use some, as they would help with future reviews and feature work.

@bsquizz (Collaborator) commented Sep 12, 2023:

It is a good point -- if you deploy a bad config and you want to re-deploy over that bad config -- this feature will cause the "wait for ready" steps to immediately raise an exception on that second re-deploy. Need to think through this a bit further ...
