
Scheduler ignores nodes that are in a bad state #7668

Merged
merged 1 commit into kubernetes:master on May 7, 2015

Conversation

bprashanth
Contributor

The scheduler should ignore unhealthy nodes.

ref #7222, #7561

@wojtek-t
Member

wojtek-t commented May 4, 2015

/sub

// We currently only use a conditionType of "Ready". If the kubelet doesn't
// periodically report the status of a node, the nodecontroller sets its
// ConditionStatus to "Unknown". If the kubelet thinks a node is unhealthy
// it can (in theory) set it its ConditionStatus to "False".
Member

s/it its/its/

@lavalamp self-assigned this on May 4, 2015

// We currently only use a conditionType of "Ready". If the kubelet doesn't
// periodically report the status of a node, the nodecontroller sets its
// ConditionStatus to "Unknown". If the kubelet thinks a node is unhealthy
Member

s/kubelet/node controller/

Contributor Author

The node controller sets it to Unknown when the kubelet hasn't updated it in a while; the kubelet itself usually sets it to something sensible.

Member

I see, thanks.
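
As a minimal sketch of the filtering being discussed, assuming simplified stand-in types rather than the real API objects (this is not the PR's actual code): the scheduler should only consider nodes whose "Ready" condition is True, skipping nodes reported as Unknown (stale status, set by the node controller) or False (the kubelet reports the node unhealthy).

package main

import "fmt"

// Simplified stand-ins for the real node API types; the actual PR works
// against the Kubernetes API objects and listers.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

type NodeCondition struct {
	Type   string
	Status ConditionStatus
}

type Node struct {
	Name       string
	Conditions []NodeCondition
}

// readyNodes keeps only nodes whose "Ready" condition is True; nodes with
// Ready=Unknown (stale status) or Ready=False (reported unhealthy) are
// dropped and never offered to the scheduler.
func readyNodes(nodes []Node) []Node {
	var out []Node
	for _, n := range nodes {
		for _, c := range n.Conditions {
			if c.Type == "Ready" && c.Status == ConditionTrue {
				out = append(out, n)
				break
			}
		}
	}
	return out
}

func main() {
	nodes := []Node{
		{Name: "healthy", Conditions: []NodeCondition{{Type: "Ready", Status: ConditionTrue}}},
		{Name: "stale", Conditions: []NodeCondition{{Type: "Ready", Status: ConditionUnknown}}},
		{Name: "unhealthy", Conditions: []NodeCondition{{Type: "Ready", Status: ConditionFalse}}},
	}
	for _, n := range readyNodes(nodes) {
		fmt.Println(n.Name) // prints only "healthy"
	}
}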

@lavalamp
Member

lavalamp commented May 4, 2015

LGTM sans nits-- can you add a test in test/integration/ somewhere?

@bprashanth
Contributor Author

Doing, wanted to check if this was ok

@bprashanth
Contributor Author

Sorry for the delay; surfacing failures due to watch timeouts in the integration tests was a slight pain, but I didn't want to check in something that flaked without reason on Shippable, given the weird network latencies we've observed in the past. PTAL, running e2e.

@bprashanth changed the title from "WIP: Scheduler ignores nodes that are in a bad state" to "Scheduler ignores nodes that are in a bad state" on May 6, 2015
@bprashanth
Contributor Author

E2e passed

// Wait till the passFunc confirms that the object it expects to see is in the store.
// Used to observe reflected events.
func waitForReflection(s cache.Store, key string, passFunc func(n interface{}) bool) error {
	return wait.Poll(time.Second, time.Second*20, func() (bool, error) {
Member

An entire second between checks in a test? Is it really that slow?

Contributor Author

I can bring it down; I copied the polling interval from the scheduler test, but it probably isn't that slow. In most cases on localhost it's done on the first poll.

Contributor Author

Changed to 10ms
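
For reference, a sketch of what the poll-based helper above might look like after the interval change, assuming the present-day wait and cache package paths (the 2015 tree used in-repo packages, and the actual body in the PR may differ):

package integration

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/tools/cache"
)

// waitForReflection polls the store every 10ms, for up to 20s, until passFunc
// confirms that the object stored under key looks the way the test expects.
func waitForReflection(s cache.Store, key string, passFunc func(n interface{}) bool) error {
	return wait.Poll(10*time.Millisecond, 20*time.Second, func() (bool, error) {
		obj, exists, err := s.GetByKey(key)
		if err != nil {
			return false, err
		}
		if !exists {
			// Not reflected into the store yet; keep polling.
			return false, nil
		}
		return passFunc(obj), nil
	})
}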

@lavalamp
Member

lavalamp commented May 6, 2015

Small suggestions. Basically fine.

@bprashanth
Contributor Author

Addressed comments, PTAL

@lavalamp
Member

lavalamp commented May 6, 2015

LGTM

@lavalamp added the lgtm ("Looks good to me", indicates that a PR is ready to be merged) label on May 6, 2015
@bprashanth
Contributor Author

Though it's unlikely, I just realized the node controller can hit the pod eviction timeout and delete the pod we're expecting to see scheduled before we see it, because we poll. Hrm. The right way to solve this would be to watch for events. @lavalamp wdyt about doing that in this PR vs. a follow-up?

Edit: the comment above was about the integration test, not the actual code change.
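
The watch-based alternative floated here was not adopted in this PR; a rough sketch of it, assuming present-day client-go and placeholder namespace/pod names, could look like this:

package integration

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
)

// waitForPodScheduled watches a single pod and returns once the scheduler has
// bound it to a node, instead of polling a store for the binding to show up.
func waitForPodScheduled(cs kubernetes.Interface, namespace, podName string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	w, err := cs.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{
		FieldSelector: fields.OneTermEqualSelector("metadata.name", podName).String(),
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	for {
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for %s/%s to be scheduled", namespace, podName)
		case ev, ok := <-w.ResultChan():
			if !ok {
				return fmt.Errorf("watch closed before %s/%s was scheduled", namespace, podName)
			}
			if pod, ok := ev.Object.(*corev1.Pod); ok && pod.Spec.NodeName != "" {
				return nil // the scheduler bound the pod to a node
			}
		}
	}
}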

@bprashanth
Contributor Author

Filed #7874 for Shippable

@lavalamp
Member

lavalamp commented May 7, 2015

You're worried about the test accidentally passing because of that? You could disable the node controller for the test, or switch to events like you say.

@bprashanth
Contributor Author

False alarm; I got confused with the other integration suite, and this one doesn't start a node controller to begin with. So I guess this is good to go when things pass. I only made the name of the node/pod more unique to help with debugging in my latest upload.

@ArtfulCoder
Contributor

@lavalamp can this be merged?

lavalamp added a commit that referenced this pull request May 7, 2015
Scheduler ignores nodes that are in a bad state
@lavalamp merged commit 6ab51f3 into kubernetes:master on May 7, 2015
@lavalamp
Member

lavalamp commented May 7, 2015

Yeah I was just waiting for it to turn green.
