
container ssa: use readiness prob #14578

Merged
merged 1 commit into from Apr 17, 2017

Conversation

enoodle

@enoodle enoodle commented Mar 30, 2017

Instead of accessing the /healthz endpoint, we can use Kubernetes's
readinessProbe to do this for us, and use the standard API instead of the
pod proxy, which is less documented. This also simplifies the code.

This will hopefully solve two long outstanding BZs:
https://bugzilla.redhat.com/show_bug.cgi?id=1384629
https://bugzilla.redhat.com/show_bug.cgi?id=1371803

@enoodle
Author

enoodle commented Mar 30, 2017

@simon3z @moolitayer @ilackarms please review

@enoodle enoodle force-pushed the container_ssa_use_readiness_probe branch 2 times, most recently from 5f1a606 to a7f6f34 Compare March 30, 2017 14:19
@cben
Contributor

cben commented Mar 30, 2017

Sweet 👍

After the pod is ready, we'll still use the proxy URL to analyze the image, right? In one of the BZs the conclusion was a misconfigured proxy (?); would this help there?

case response
when Net::HTTPOK
  begin
    ready = kubernetes_client.get_pod(options[:pod_name], options[:pod_namespace])[:status][:containerStatuses][0][:ready]


How are containerStatuses sorted by default? Will [0] always be the most recent status? Will [0] always exist (array length >= 1)?

Author


It's per container. Each container has one status, so if a pod had two containers this list would have length 2. Because we only have one container, we can safely use [0].
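To make the shape concrete, here is an illustrative pod status hash: containerStatuses holds one entry per container in the pod spec, not a history of statuses. The field names mirror the Kubernetes v1 Pod API; the container names and values are made up for this sketch.

```ruby
# Illustrative only: a pod status shaped like the v1 Pod API response,
# with one containerStatuses entry per container in the spec.
two_container_pod = {
  :status => {
    :containerStatuses => [
      {:name => "image-inspector", :ready => true},
      {:name => "sidecar",         :ready => false}
    ]
  }
}

statuses = two_container_pod[:status][:containerStatuses]
statuses.length     # one entry per container, so 2 here
statuses[0][:ready] # in a single-container pod, [0] is the only entry
```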


Sounds good. I wasn't sure if it was a historical list of statuses; this makes more sense.

Contributor


We should think to improve this using the kubernetes watch API (on the pod).

Author


@simon3z As far as I understand, the watch API will hang on the HTTP response waiting for new updates. This means that the worker executing the job will hang until the image-inspector scan is complete, so parallel scans will be less "parallel" and scaling will suffer.
If there is a better way of using the watch API we should consider it, but from what I have seen using curl and Kubeclient [1], we have to hang and wait on the connection.

[1]https://github.com/abonas/kubeclient/blob/master/lib/kubeclient/watch_stream.rb#L14
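The blocking behavior described above can be sketched like this. A real implementation would call Kubeclient's watch API (e.g. watch_pods), which holds the HTTP connection open; here a plain Enumerator stands in for the event stream so the sketch is self-contained, and the event shape is an assumption modeled on watch responses.

```ruby
# The worker stays inside this loop until a "ready" event arrives on the
# stream, which is exactly the hang being discussed above.
def wait_until_ready(events)
  events.each do |event|
    ready = event.dig(:object, :status, :containerStatuses, 0, :ready)
    return true if ready # blocked until this event shows up
  end
  false # stream ended without the pod becoming ready
end

# Stand-in for a watch stream: two MODIFIED events, ready on the second.
stream = [
  {:type => "MODIFIED", :object => {:status => {:containerStatuses => [{:ready => false}]}}},
  {:type => "MODIFIED", :object => {:status => {:containerStatuses => [{:ready => true}]}}}
].each
```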

Contributor


@enoodle shouldn't we protect against possible nils here: [:status][:containerStatuses][0][:ready] ?

Author


@simon3z discussed this with Mooli: #14578 (comment)
The ready field is always there. I will add a check for containerStatuses and log it before trying again (assuming it is a temporary glitch from OpenShift) instead of failing with a cryptic "method missing" error. I think this state is not possible for a running pod, but it might not be documented; I couldn't find it.
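The guard described above could look roughly like this. The method names (check_pod_ready, log_info) are made up for this sketch, not the merged code; the point is to log and retry on a missing containerStatuses array instead of crashing with NoMethodError on nil.

```ruby
# Stand-in for the real logger.
def log_info(msg)
  warn(msg)
end

# Hypothetical guard: returns :retry when containerStatuses is absent,
# otherwise reports the single container's readiness.
def check_pod_ready(pod)
  statuses = pod.dig(:status, :containerStatuses)
  if statuses.nil? || statuses.empty?
    # Unexpected for a running pod: log and let the caller schedule a
    # retry instead of blowing up on nil.
    log_info("containerStatuses missing for pod, will retry")
    return :retry
  end
  statuses[0][:ready] ? :ready : :not_ready
end
```

Hash#dig keeps the lookup nil-safe at every level, so a half-populated status hash falls into the retry branch rather than raising.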

@enoodle
Author

enoodle commented Apr 2, 2017

@miq-bot add_label bug

@miq-bot miq-bot added the bug label Apr 2, 2017
rescue SocketError, KubeException => e
  msg = "unknown access error to pod #{pod_full_name}: #{e.message}"
  _log.info(msg)
  queue_signal(:abort_job, msg, "error")


Won't this keep queuing pod_wait if there is an error? (line 99)
Maybe you need:

return queue_signal(:abort_job, msg, "error")

Also do you mind adding brackets around e.message:

msg = "unknown access error to pod #{pod_full_name}: [#{e.message}]"

It makes nils really obvious
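Both suggestions together could be sketched as below. The class, pod name, and the stubbed KubeException are stand-ins so the sketch runs without kubeclient or the surrounding job code; only the rescue clause's shape is the point.

```ruby
require "socket" # SocketError

# Stub so this sketch runs without the kubeclient gem; the real constant
# comes from Kubeclient.
KubeException = Class.new(StandardError) unless defined?(KubeException)

class PodWaitSketch
  attr_reader :signals

  def initialize
    @signals = []
  end

  def pod_full_name
    "management-infra/manageiq-img-scan-1234" # made-up name
  end

  # Records the signal instead of queuing it, so we can inspect it.
  def queue_signal(*args)
    @signals << args
    args
  end

  def pod_wait
    raise KubeException, "connection refused" # simulate the API call failing
  rescue SocketError, KubeException => e
    # Brackets around e.message make a nil or empty message obvious in logs.
    msg = "unknown access error to pod #{pod_full_name}: [#{e.message}]"
    # Returning here stops the method from falling through and queuing
    # another pod_wait after the abort.
    return queue_signal(:abort_job, msg, "error")
  end
end
```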

Author


👍 Good catch, thanks!

Author


@moolitayer If there was an error then ready won't be set, but I will add the "return" for clarity


Yes, ready will not be set, and you would schedule another check...

@enoodle enoodle force-pushed the container_ssa_use_readiness_probe branch from c32a94e to 6bc2d2d Compare April 2, 2017 13:12
@enoodle
Author

enoodle commented Apr 2, 2017

@cben

After the pod is ready, we'll still use the proxy URL to analyze the image, right? In one of the BZs the conclusion was a misconfigured proxy (?); would this help there?

Yes, I asked, and it seems to help with at least one of the mentioned BZs. The problem was that a "transparent" proxy was changing error messages, but now we do not rely on error messages in the happy flow. If an error message from Kubernetes/OpenShift is altered, we will catch the exception and quit with the proxy's message, which is the best outcome IMO when we can't get the real message.

@moolitayer

@enoodle should we expect the usual ephemeral "missing method X for nil" error that we get when k8s returns hashes that are missing elements? (What happens when you GET a pod right after creation; will it have containerStatuses?)

@enoodle
Author

enoodle commented Apr 3, 2017

@moolitayer Generally we should always have a containerStatus per container [1]; if not, there is a problem (even if the pod is still being created). The ready flag is always present [2], so that too is reasonably safe.

[1]https://github.com/kubernetes/kubernetes/blob/master/pkg/api/v1/types.go#L2504
[2]https://github.com/kubernetes/kubernetes/blob/master/pkg/api/v1/types.go#L1848

@moolitayer

@enoodle BTW I think there is some code in openshift-ansible we would be able to throw away in versions that have this change. I'm referring to the code adding the pod/proxy * resource to the management-infra-admin SA.

@simon3z
Contributor

simon3z commented Apr 11, 2017

@ilackarms can you give this a try and review?
There is a pending comment, but overall LGTM.

@ilackarms

I tested this locally and could not load the Compute > Containers > Providers tab of the MIQ dashboard. Not sure if this is only my problem; maybe someone else can reproduce? Screenshot attached.
[screenshot from 2017-04-11 16-49-13]

@enoodle
Author

enoodle commented Apr 12, 2017

@ilackarms This is an unrelated error due to some changes made by the UI team (IIUC). Make sure you have #14563.

@ilackarms

@enoodle problem solved; LGTM

@enoodle enoodle force-pushed the container_ssa_use_readiness_probe branch from 6bc2d2d to 4f5edf9 Compare April 12, 2017 14:40
@miq-bot
Member

miq-bot commented Apr 12, 2017

Checked commit enoodle@4f5edf9 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
2 files checked, 0 offenses detected
Everything looks good. 👍


@simon3z simon3z left a comment


LGTM 👍
@miq-bot assign roliveri

@roliveri roliveri self-assigned this Apr 17, 2017
@roliveri roliveri merged commit 08c3090 into ManageIQ:master Apr 17, 2017
@simaishi
Contributor

@enoodle @roliveri should this be fine/yes?

@roliveri
Member

Sorry, not sure.

@simon3z should this be fine/yes?

@simon3z
Contributor

simon3z commented Apr 24, 2017

@simaishi @roliveri I would like to wait some more time to make sure everything is OK before backporting this to fine. I targeted the BZ to 5.8.1.

@chessbyte chessbyte added this to the Sprint 59 Ending Apr 24, 2017 milestone Jun 7, 2017
@simaishi
Contributor

@simon3z I'm backporting PRs for 5.8.1 now, please confirm this should now be fine/yes

@simon3z
Contributor

simon3z commented Jun 14, 2017

@miq-bot add_label fine/yes
cc @enoodle @simaishi

simaishi pushed a commit that referenced this pull request Jun 14, 2017
@simaishi
Contributor

Fine backport details:

$ git log -1
commit 4769a2390063743d227386fc963cc9fd1b5e77fa
Author: Richard Oliveri <oliveri.richard.github@gmail.com>
Date:   Mon Apr 17 11:12:11 2017 -0400

    Merge pull request #14578 from enoodle/container_ssa_use_readiness_probe
    
    container ssa: use readiness prob
    (cherry picked from commit 08c3090302d4c7fa747781dc3699e502194adb62)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1461558
