Adding system oom events from kubelet #6718
@@ -1776,6 +1777,51 @@ func (kl *Kubelet) recordNodeOnlineEvent() {
	kl.recorder.Eventf(kl.nodeRef, "online", "Node %s is now online", kl.hostname)
}

// Returns the most recent sys oom event from cadvisor.
// Returns nil if no sys oom events were observed.
func (kl *Kubelet) getRecentSysOOMEvent() (*cadvisorApi.OomEventData, util.Time, error) {
WDYT of putting this in the cAdvisor subpackage and exposing a "RecentSysOOMEvent()"? Trying to minimize code in kubelet.go :)
Is something going to actuate off this NodeCondition, or is it just to aid debugging?
The OOM is a transient event. We will have monitoring systems alerting on On Tue, Apr 14, 2015 at 10:08 AM, Eric Tune notifications@github.com
Turning an event into a state seems like the wrong way to model this. Better for monitoring to monitor the count of OOM events and alert on an excessive rate of events over a period. There may be multiple alerts for different rates and periods, and different groupings (single-node, cluster-wide). Faking this as a state would just confuse that.
I agree with @erictune. I think what we really need is to filter those OOM events to figure out why a running container was killed and provide a proper ContainerStatus. NodeCondition is changed at the moment when OOM is detected.
We should have both node and pod events. If a pod is killed due to OOM, On Tue, Apr 14, 2015 at 10:38 AM, Dawn Chen notifications@github.com
@dchen1107: I have updated the PR to generate events. Testing is still missing. If the general outline of the PR works, I will add the tests.
Only downside to the watch approach we do today is that we'll miss OOM events from when we are down. This is probably okay initially. We should be able to fill in the gap when we come back up.
Yup. I would argue that we should not let the kubelet go down for too long On Wed, Apr 22, 2015 at 9:41 AM, Victor Marmol notifications@github.com
@vmarmol: I added a unit test. The cadvisor API needs to be enhanced a bit to make the unit test useful. PTAL!
@@ -558,10 +561,25 @@ func (kl *Kubelet) Run(updates <-chan PodUpdate) {
	}

	go kl.syncNodeStatus()
	// Run the system oom watcher forever.
	util.Until(kl.runOOMWatcher, time.Microsecond, util.NeverStop)
nit: use util.Forever() (which uses Until under the covers, but makes it clear)
util.Forever is marked as being deprecated.
Whaaaaa? First I hear of it, thanks for the heads up :)
Huh, the integration test is failing with the error we were seeing yesterday. Were you able to find the cause?
@vmarmol: The integration tests are failing because the kubelet is not able to get node information. I will fix that and update the PR. Meanwhile, let's iron out the rest of the code.
Not yet. On Thu, Apr 23, 2015 at 8:52 AM, Victor Marmol notifications@github.com
Thanks @vish!
@vmarmol: This PR is safe to be merged now. I am tackling the node events issue via a separate PR.
@@ -543,7 +546,9 @@ func (kl *Kubelet) GetNode() (*api.Node, error) {
		return nil, errors.New("cannot list nodes")
	}
	host := kl.GetHostname()
	glog.V(2).Infof("hostname: %q", host)
Remove these two debug lines? If you want to keep the hostname one can we make it V(4)?
Done
PTAL @vmarmol
@@ -52,6 +52,6 @@ func (c *Fake) DockerImagesFsInfo() (cadvisorApiV2.FsInfo, error) {
	return cadvisorApiV2.FsInfo{}, nil
}

func (c *Fake) GetPastEvents(request *events.Request) ([]*cadvisorApi.Event, error) {
	return []*cadvisorApi.Event{}, nil
func (self *Fake) WatchEvents(request *events.Request) (*events.EventChannel, error) {
s/self/c/
Done
@@ -543,6 +546,7 @@ func (kl *Kubelet) GetNode() (*api.Node, error) {
	}
	host := kl.GetHostname()
	for _, n := range l.Items {
		glog.V(2).Info(n)
remove?
LGTM, thanks @vishh! Will wait for green.
I apologize for slipping in useless log lines. Thanks for the thorough review @vmarmol
No worries :D thanks for the quick fixes!
@vishh the integration test failed (again). Can you take a look? I don't think it's a flake since it's failed twice now.
useful for recording the timestamp of events that happened in the past.
Kubelet will continuously watch for system OOMs and generate events whenever it encounters a system OOM.
@vmarmol: The integration tests did catch a bug :) I updated the PR and it should pass this time around.
Adding system oom events from kubelet
		cadvisorApi.EventOom: true,
	},
	ContainerName:        "/",
	IncludeSubcontainers: false,
It seems this option, IncludeSubcontainers, should be enabled; otherwise I am afraid we cannot query cAdvisor and get the correct result. All containers are under the root container, so we should include all descendants.
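Editor's sketch of the distinction this comment raises: matching only the named container versus matching it together with all of its descendants. The `matches` function is hypothetical and is not cAdvisor's implementation. It only illustrates what a flag like `IncludeSubcontainers` would change for a request on `"/"`; it does not settle how cAdvisor attributes system OOMs.

```go
package main

import (
	"fmt"
	"strings"
)

// matches is a hypothetical illustration, not cAdvisor code. It reports
// whether an event's container is selected by a request for `requested`.
func matches(eventContainer, requested string, includeSubcontainers bool) bool {
	if eventContainer == requested {
		return true
	}
	if !includeSubcontainers {
		return false
	}
	// Treat container names as slash-separated paths and match descendants.
	prefix := requested
	if !strings.HasSuffix(prefix, "/") {
		prefix += "/"
	}
	return strings.HasPrefix(eventContainer, prefix)
}

func main() {
	for _, name := range []string{"/", "/docker/abc"} {
		fmt.Println(name, matches(name, "/", false), matches(name, "/", true))
	}
}
```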
Kubelet will surface the most recent system OOM as events.
Status: Unit tests pending; System test pending
for #2853