Scheduler ignores nodes that are in a bad state #7668
Conversation
/sub
// We currently only use a conditionType of "Ready". If the kubelet doesn't
// periodically report the status of a node, the nodecontroller sets its
// ConditionStatus to "Unknown". If the kubelet thinks a node is unhealthy
// it can (in theory) set it its ConditionStatus to "False".
s/it its/its/
// We currently only use a conditionType of "Ready". If the kubelet doesn't
// periodically report the status of a node, the nodecontroller sets its
// ConditionStatus to "Unknown". If the kubelet thinks a node is unhealthy
s/kubelet/node controller/
The nodecontroller sets it to "Unknown" when the kubelet hasn't updated it in a long time; the kubelet usually sets it to something sensible.
I see, thanks.
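For readers skimming the thread: the change under review filters nodes on their "Ready" condition. Below is a minimal, self-contained sketch of that idea; the types here are simplified stand-ins invented for illustration, not the actual Kubernetes API definitions.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the API types under discussion.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

type NodeCondition struct {
	Kind   string          // only "Ready" is consulted here
	Status ConditionStatus // "True", "False", or "Unknown"
}

type Node struct {
	Name       string
	Conditions []NodeCondition
}

// schedulable reports whether a node's "Ready" condition is "True".
// "Unknown" (set by the nodecontroller when the kubelet stops reporting)
// and "False" (set by the kubelet itself) both exclude the node.
func schedulable(n Node) bool {
	for _, c := range n.Conditions {
		if c.Kind == "Ready" {
			return c.Status == ConditionTrue
		}
	}
	return false // no Ready condition recorded; this sketch skips such nodes
}

func main() {
	nodes := []Node{
		{Name: "healthy", Conditions: []NodeCondition{{Kind: "Ready", Status: ConditionTrue}}},
		{Name: "silent", Conditions: []NodeCondition{{Kind: "Ready", Status: ConditionUnknown}}},
		{Name: "sick", Conditions: []NodeCondition{{Kind: "Ready", Status: ConditionFalse}}},
	}
	for _, n := range nodes {
		fmt.Printf("%s: schedulable=%v\n", n.Name, schedulable(n))
	}
}
```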
LGTM sans nits -- can you add a test in test/integration/ somewhere?
Doing, wanted to check if this was ok.
Force-pushed from e06a218 to 76b54ef
Sorry for the delay; surfacing failures due to watch timeouts in the integration tests was a slight pain, but I didn't want to check in something that flaked without reason on Shippable, given the weird network latencies we've observed in the past. PTAL, running e2e.
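As an aside on what "surfacing failures due to watch timeout" can look like: one common pattern is a select against a deadline, so a missing event becomes an explicit error rather than a hang. A generic sketch of that pattern, with a hypothetical Event type standing in for the client's watch machinery, not the PR's actual test code:

```go
package main

import (
	"fmt"
	"time"
)

// Event is a hypothetical stand-in for one update from a watch.
type Event struct{ Type, Object string }

// waitForEvent turns a missing event into an explicit error so a flaky
// environment yields a clear failure message instead of a silent hang.
func waitForEvent(ch <-chan Event, timeout time.Duration) (Event, error) {
	select {
	case ev, ok := <-ch:
		if !ok {
			return Event{}, fmt.Errorf("watch channel closed unexpectedly")
		}
		return ev, nil
	case <-time.After(timeout):
		return Event{}, fmt.Errorf("timed out after %v waiting for watch event", timeout)
	}
}

func main() {
	ch := make(chan Event, 1)
	ch <- Event{Type: "ADDED", Object: "pod-x"}
	fmt.Println(waitForEvent(ch, 50*time.Millisecond))
}
```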
E2e passed.
// Wait till the passFunc confirms that the object it expects to see is in the store.
// Used to observe reflected events.
func waitForReflection(s cache.Store, key string, passFunc func(n interface{}) bool) error {
	return wait.Poll(time.Second, time.Second*20, func() (bool, error) {
An entire second between checks in a test? Is it really that slow?
I can bring it down; I copied the polling interval from the scheduler test, but it probably isn't really that slow. In most cases on localhost it's done on the first poll.
Changed to 10ms
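For concreteness, here is what the adjusted helper might look like with the 10ms interval. The loop body (look the key up in the store, then run passFunc) is an assumption filled in around the truncated snippet above, not the PR's exact code, and the import paths shown are the modern ones; the 2015 tree used github.com/GoogleCloudPlatform/kubernetes/pkg/... instead.

```go
package integration // hypothetical package name for this sketch

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/tools/cache"
)

// waitForReflection polls every 10ms (down from 1s) until passFunc accepts
// the object reflected into the store, or the 20s budget runs out.
func waitForReflection(s cache.Store, key string, passFunc func(n interface{}) bool) error {
	return wait.Poll(10*time.Millisecond, 20*time.Second, func() (bool, error) {
		obj, exists, err := s.GetByKey(key)
		if err != nil {
			return false, err
		}
		if !exists {
			return false, nil // not reflected yet; keep polling
		}
		return passFunc(obj), nil
	})
}
```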
Small suggestions. Basically fine.
Force-pushed from 9c224e6 to d93f74d
Addressed comments, PTAL.
LGTM
Force-pushed from d93f74d to 51365a4
Though it's unlikely, I just realized the node controller can hit the pod eviction timeout and delete the pod we're expecting to see scheduled before we see it, because we poll. Hrm. The right way to solve this would be to watch for events. @lavalamp wdyt about doing that in this PR vs a follow-up? Edit: the comment above was about the integration test, not the actual code change.
Filed #7874 for Shippable.
You're worried about the test accidentally passing because of that? You could disable the node controller for the test, or switch to events like you say.
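To make the poll-versus-watch concern concrete: a watch delivers every intermediate state, so a pod that is scheduled and then evicted between polls is still observed. A generic sketch of that idea, with a hypothetical PodEvent type standing in for the real watch stream:

```go
package main

import (
	"fmt"
	"time"
)

// PodEvent is a hypothetical stand-in for one update from a pod watch.
type PodEvent struct {
	Name     string
	NodeName string // non-empty once the pod has been scheduled
}

// sawScheduled consumes watch events until it sees podName bound to a node
// or the deadline passes. Unlike polling, it cannot miss a pod that is
// scheduled and then deleted between observations.
func sawScheduled(events <-chan PodEvent, podName string, timeout time.Duration) bool {
	deadline := time.After(timeout)
	for {
		select {
		case ev := <-events:
			if ev.Name == podName && ev.NodeName != "" {
				return true // scheduling observed, even if the pod is evicted later
			}
		case <-deadline:
			return false
		}
	}
}

func main() {
	events := make(chan PodEvent, 2)
	events <- PodEvent{Name: "test-pod", NodeName: "node-1"} // scheduled...
	// ...a later eviction/DELETE event would not matter to the observer.
	fmt.Println(sawScheduled(events, "test-pod", 100*time.Millisecond))
}
```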
Force-pushed from 51365a4 to 4b0607c
False alarm, I got confused with the other integration suite; this one doesn't start a nodecontroller to begin with. So I guess this is good to go when things pass; I only made the name of the node/pod more unique to help with debugging in my latest upload.
@lavalamp can this be merged?
Scheduler ignores nodes that are in a bad state
Yeah I was just waiting for it to turn green.
The scheduler should ignore unhealthy nodes.
ref #7222, #7561