container ssa: use readiness prob #14578
Conversation
@simon3z @moolitayer @ilackarms please review
Force-pushed 5f1a606 to a7f6f34
Sweet 👍 After the pod is ready, we'll still use the proxy URL to analyze the image, right? In one of the BZs the conclusion was a misconfigured proxy (?) - would this help there?
case response
when Net::HTTPOK
  begin
    ready = kubernetes_client.get_pod(options[:pod_name], options[:pod_namespace])[:status][:containerStatuses][0][:ready]
how are containerStatuses sorted by default? will [0] always be the most recent status? will [0] always exist (array len >= 1)?
It's by container: each container has one status, so if a pod had two containers in it this list would be of length 2. Because we only have one container we can safely use [0].
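For illustration, a hedged sketch of the shape being discussed, as kubeclient returns it (field values here are made up; only the structure matters):

    # containerStatuses holds one entry per container in the pod spec,
    # so a single-container pod always yields an array of length 1.
    pod = {
      :status => {
        :phase             => "Running",
        :containerStatuses => [
          { :name => "image-inspector", :ready => true, :restartCount => 0 }
        ]
      }
    }

    pod[:status][:containerStatuses][0][:ready] # => true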
Sounds good. I wasn't sure if it was a historical list of statuses; this makes more sense.
We should think about improving this using the Kubernetes watch API (on the pod).
@simon3z As far as I understand, the watch API hangs on the HTTP response waiting for new updates. This means the worker executing the job will hang until the image-inspector scan is complete, so parallel scans will be less "parallel" and scaling will suffer.
If there is a better way of using the watch API we should consider it, but from what I have seen using curl and Kubeclient [1], we have to hang and wait on the connection.
[1] https://github.com/abonas/kubeclient/blob/master/lib/kubeclient/watch_stream.rb#L14
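For reference, a minimal sketch of the blocking watch loop being described (endpoint, namespace, and pod name are placeholders, and exact option names vary across kubeclient versions):

    require 'kubeclient'

    client = Kubeclient::Client.new('https://openshift.example.com/api/', 'v1')

    # watch_pods holds the HTTP connection open and yields a notice for every
    # pod change; this loop does not return until we break out of it, which is
    # exactly the "worker hangs until the scan completes" concern above.
    client.watch_pods(namespace: 'management-infra').each do |notice|
      pod = notice.object
      next unless pod.metadata.name == 'image-inspector-pod'
      statuses = pod.status.containerStatuses
      break if statuses && statuses[0] && statuses[0].ready
    end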
@enoodle shouldn't we protect against possible nils here: [:status][:containerStatuses][0][:ready]?
@simon3z discussed this with Mooli: #14578 (comment)
The ready field is always there. I will add a check for containerStatuses and will log this before trying again (assuming it is a temporary glitch from OpenShift) instead of failing with a cryptic "method missing" error. I think this state is not possible for a running pod, but it might not be documented, because I couldn't find it.
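A minimal sketch of the guard described here (variable names follow the quoted diff; pod_full_name and the :pod_wait retry signal are assumed from the surrounding job code):

    # Guard against containerStatuses being momentarily absent: log and
    # retry instead of blowing up with a cryptic NoMethodError on nil.
    pod = kubernetes_client.get_pod(options[:pod_name], options[:pod_namespace])
    statuses = pod[:status][:containerStatuses]
    if statuses.nil? || statuses.empty?
      _log.info("pod #{pod_full_name} has no containerStatuses yet, will try again")
      queue_signal(:pod_wait)
    else
      ready = statuses[0][:ready]
    end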
Force-pushed bc7dd35 to c32a94e
@miq-bot add_label bug
rescue SocketError, KubeException => e
  msg = "unknown access error to pod #{pod_full_name}: #{e.message}"
  _log.info(msg)
  queue_signal(:abort_job, msg, "error")
Won't this keep queuing pod_wait if there is an error? (line 99)
Maybe you need:
return queue_signal(:abort_job, msg, "error")
Also do you mind adding brackets around e.message:
msg = "unknown access error to pod #{pod_full_name}: [#{e.message}]"
It makes nils really obvious.
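Putting both suggestions together, the rescue clause would look roughly like this (the method name and the trailing re-queue are assumptions based on the discussion of line 99):

    def pod_wait
      begin
        ready = kubernetes_client.get_pod(options[:pod_name], options[:pod_namespace])[:status][:containerStatuses][0][:ready]
      rescue SocketError, KubeException => e
        msg = "unknown access error to pod #{pod_full_name}: [#{e.message}]"
        _log.info(msg)
        # early return: without it, execution falls through to the
        # re-queue below and the job keeps polling even after aborting
        return queue_signal(:abort_job, msg, "error")
      end
      queue_signal(:pod_wait) unless ready
    end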
👍 Good catch, thanks!
@moolitayer If there was an error then ready won't be set, but I will add the "return" for clarity.
Yes, ready will not be set and you would schedule another check...
Force-pushed c32a94e to 6bc2d2d
Yes, I asked, and it seems to help with at least one of the mentioned BZs. The problem was that a "transparent" proxy was changing error messages, but now our happy flow does not depend on error messages. If an error message from Kubernetes/OpenShift is being switched, we will catch the exception and quit with the proxy's message, which is the best outcome IMO when we don't get the real message.
@enoodle should we expect the usual ephemeral
@moolitayer Generally we should always have a [1]
[1] https://github.com/kubernetes/kubernetes/blob/master/pkg/api/v1/types.go#L2504
@enoodle BTW I think there is some code in openshift-ansible we would be able to throw away in versions that have this change. I'm referring to the code adding the pod/proxy.
@ilackarms can you give this a try and review?
@ilackarms This is an unrelated error due to some changes done by the UI team (IIUC). Make sure you have #14563.
@enoodle problem solved; LGTM
Instead of accessing the /healthz endpoint ourselves, we can use Kubernetes's readinessProbe to do this for us and use the standard API instead of the pod proxy, which is less documented. This also simplifies the code.
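For context, here is roughly what the relevant fragment of the inspector pod definition looks like with a readinessProbe (path, port, and names are illustrative, not necessarily the PR's exact values):

    # With a readinessProbe, the kubelet polls /healthz itself and flips
    # containerStatuses[0].ready to true once the container answers, so
    # the job can poll the standard pods API instead of the pod proxy.
    pod_definition = {
      :metadata => { :name => 'image-inspector', :namespace => 'management-infra' },
      :spec     => {
        :containers => [
          {
            :name           => 'image-inspector',
            :image          => 'openshift/image-inspector',
            :readinessProbe => {
              :httpGet             => { :path => '/healthz', :port => 8080 },
              :initialDelaySeconds => 5,
              :periodSeconds       => 5
            }
          }
        ]
      }
    }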
Force-pushed 6bc2d2d to 4f5edf9
Checked commit enoodle@4f5edf9 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
LGTM 👍
@miq-bot assign roliveri
Sorry, not sure. @simon3z should this be fine/yes?
@simon3z I'm backporting PRs for 5.8.1 now, please confirm this should now be
container ssa: use readiness prob
(cherry picked from commit 08c3090)
https://bugzilla.redhat.com/show_bug.cgi?id=1461558
Fine backport details:
Instead of accessing the /healthz endpoint we can use Kubernetes's
readinessProbe to do this for us and use the standard API instead of the
pod proxy, which is less documented. This also simplifies the code.
This will hopefully solve two long outstanding BZs:
https://bugzilla.redhat.com/show_bug.cgi?id=1384629
https://bugzilla.redhat.com/show_bug.cgi?id=1371803