Readiness Probe #355

Closed
OperationalDev opened this issue Oct 17, 2022 · 8 comments · Fixed by #365
Labels
kind/enhancement New feature or request

Comments

@OperationalDev
Contributor

Challenge

When restarting Authorino deployments (with 3 replicas and a PDB), requests are denied with a 403 for a few seconds.

Solution

Have a healthz endpoint that can be used as a readiness probe and that returns true only when all of the auth configs have been loaded.

@guicassolato
Collaborator

@OperationalDev, are you using the Status block of the AuthConfig CRs to check for readiness? The Available and Ready conditions tell you, respectively, when at least one host name listed in the AuthConfig has been indexed and when all of them have. The Summary field provides more details about which host names have been indexed (amongst other things).

(Not saying we couldn't use your suggestion to enhance the health check endpoint, BTW!)

@OperationalDev
Contributor Author

@guicassolato I am not using them. Can you give me some guidance as to how I might use them in this scenario to ensure 1 pod is always available?

@guicassolato
Collaborator

@OperationalDev, it's a per-CR check, so it won't give you the readiness state of the entire deployment, nor of any arbitrary pod for that matter. Rather, it gives you the state of a particular CR in the index from the perspective of the leader replica, i.e. the one pod that won the election amongst all 3 replicas to become the leader and is therefore responsible for updating the status sub-resource of the AuthConfigs. Unfortunately, we haven't yet implemented any synchronization between pods for the status update; that's why the value in the status sub-resource reflects only what the leader knows. On the other hand, it is relatively safe to assume that what the leader knows is either identical to what the other replicas know or no more than a couple of milliseconds ahead of or behind it.

Checking the status sub-resource of a particular AuthConfig is straightforward. For example, given the AuthConfig from this example applied to the default namespace of the cluster, .status.summary.ready tells you if Authorino is ready to receive authz requests for the AuthConfig:

kubectl get authconfig/talker-api-protection -o jsonpath='{.status.summary.ready}'

Output:

true

In the example above, the AuthConfig lists only one host name, i.e. talker-api-authorino.127.0.0.1.nip.io. In this case, being available (i.e. "at least one host name of the AuthConfig linked in the index") equals being ready (i.e. "all host names of the AuthConfig linked in the index"). In other cases, where the AuthConfig lists more than one host name, you can run into situations where the AuthConfig is available but not ready, because Authorino avoids host name collisions. Because of that, you may want to check the conditions as well:

kubectl get authconfig/talker-api-protection -o jsonpath='{.status.conditions}'

Output:

[{"lastTransitionTime":"2022-10-17T09:37:36Z","reason":"HostsLinked","status":"True","type":"Available"},{"lastTransitionTime":"2022-10-17T09:37:36Z","reason":"Reconciled","status":"True","type":"Ready"}]

...and/or

kubectl get authconfig/talker-api-protection -o jsonpath='{.status.summary.hostsReady}'

Output:

["talker-api-authorino.127.0.0.1.nip.io"]
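If you want to evaluate those conditions in a script rather than by eye, here is a minimal sketch in Python. It only parses the JSON printed by the `.status.conditions` query above; the function names `authconfig_state` and `is_fully_ready` are made up for illustration and are not part of Authorino or kubectl.

```python
import json


def authconfig_state(conditions_json: str) -> dict:
    """Map each condition type to True/False, given the JSON array
    printed by `-o jsonpath='{.status.conditions}'`."""
    conditions = json.loads(conditions_json)
    return {c["type"]: c.get("status") == "True" for c in conditions}


def is_fully_ready(conditions_json: str) -> bool:
    """True only when the AuthConfig is both Available and Ready,
    i.e. all of its host names are linked in the leader's index."""
    state = authconfig_state(conditions_json)
    return state.get("Available", False) and state.get("Ready", False)


# The output shown above, pasted verbatim as sample input.
sample = (
    '[{"lastTransitionTime":"2022-10-17T09:37:36Z","reason":"HostsLinked",'
    '"status":"True","type":"Available"},'
    '{"lastTransitionTime":"2022-10-17T09:37:36Z","reason":"Reconciled",'
    '"status":"True","type":"Ready"}]'
)

print(authconfig_state(sample))  # {'Available': True, 'Ready': True}
print(is_fully_ready(sample))    # True
```

Keep in mind that this still reflects only what the leader replica knows, so it does not tell you anything about a specific pod.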

@OperationalDev
Contributor Author

@guicassolato It's not clear to me how I would use the status of the authconfigs to know which of the authorino pods are ready to serve requests.

@guicassolato
Collaborator

@OperationalDev, sorry if I wasn't clear before. The status block in the authconfigs won't help you with which authorino pods are ready. Instead, it can only tell you whether the authconfig is ready in a particular authorino pod, specifically the leader one. My point from before is that the difference between being ready in the leader pod and being ready in any other pod should be no more than a couple of milliseconds.

This is not ideal. I know! But hopefully it's enough to mitigate those 403s a little bit.

@OperationalDev
Contributor Author

Ok ok, sorry I misunderstood, thank you for clarifying.

alexsnaps added the kind/enhancement (New feature or request) and target/current labels on Oct 27, 2022

@alexsnaps
Member

While not completely overlapping, it might be nice to consider this in the context of this issue: Kuadrant/kuadrant-operator#96

guicassolato added a commit that referenced this issue Nov 18, 2022
Implements health and readiness probe endpoints for the controllers, reporting particularly the aggregated state of the AuthConfigs.

New endpoints:
- `/healthz`: Health probe (ping)
- `/readyz`: Aggregated readiness probe (only AuthConfig reconciler currently reporting)
- `/readyz/authconfigs`: Aggregated status of the AuthConfigs

The default binding network address is `:8081`. It can be changed using the newly introduced flag (command-line arg) `--health-probe-addr`.

The endpoints return `200` ("ok") when all probes pass, or `500` when one or more probes fail.

The query string parameters `verbose=true` and `exclude=authconfigs` are supported respectively to provide more verbose responses and exclude a particular probe ("authconfigs" in the example provided).

Closes #355
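
For anyone who wants to poke at the new endpoints by hand, here is a rough sketch in Python of building the probe URL and interpreting the response. The port `8081`, the `/readyz` path, and the `verbose` and `exclude` parameters come from the commit message above; the helper names `probe_url` and `check`, the `localhost` host, and the plain `http` scheme are assumptions for illustration only.

```python
import urllib.error
import urllib.request
from urllib.parse import urlencode


def probe_url(host, path="/readyz", verbose=False, exclude=None, port=8081):
    """Build a probe URL, e.g. http://localhost:8081/readyz?verbose=true."""
    params = {}
    if verbose:
        params["verbose"] = "true"
    if exclude:
        params["exclude"] = exclude  # e.g. "authconfigs"
    query = "?" + urlencode(params) if params else ""
    return f"http://{host}:{port}{path}{query}"


def check(url, timeout=2.0):
    """Return True on HTTP 200, False on 500 or any connection problem."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False  # a failing probe is reported as a non-2xx status
    except OSError:
        return False  # connection refused, timeout, DNS failure, ...


# Only builds the URL; call check(...) against a running pod to query it.
print(probe_url("localhost", verbose=True, exclude="authconfigs"))
# http://localhost:8081/readyz?verbose=true&exclude=authconfigs
```

In a real deployment you would more likely point the pod's readiness probe at `/readyz` directly and let the kubelet do the checking; the sketch is only handy for debugging, e.g. after port-forwarding `8081` from one of the pods.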
@OperationalDev
Contributor Author

Just wanted to say thanks for the quick turnaround time on this. It's working as expected.
