Pods cycling endlessly through Pending and Running in Guestbook example #4414
Comments
I'm having all sorts of strange behavior with 0.10.1 as well in #4415. Can you try with …
@pires The versions I've already tried with are: 0.9.3, 0.9.2, and 0.10.1.
Karl, can you run … and see what it prints? Join us on IRC #google-containers for more interactive debugging. @pires, can you file issues on the problems you're seeing with 0.10.1? Thanks!
@brendandburns you just saw it, but here it is: #4415.
I am also seeing this. There are many entries like this: …
So let's look at just that last one...
It looks like …
Looks like #2252 might be related.
For me this only happens with 0.10.1; 0.9.3 is perfectly fine.
I'm investigating the issue and can confirm that the bug is reproducible with the guestbook example.
I've added some logging to the code that computes the hash of a container (https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/kubelet/kubelet.go#L1049) and I get weird results: the printed content of the container is the same, but the hash is different.
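For reference, the pattern under discussion, reduced to a self-contained sketch (`Container` here is a stand-in for the real api.Container; the real code feeds fmt's rendering of the struct to an adler32 checksum, as the later comments confirm):

```go
package main

import (
	"fmt"
	"hash/adler32"
)

// Container is a stand-in for the real api.Container type; the relevant
// detail is the map-valued field (resource limits in the real struct).
type Container struct {
	Name   string
	Image  string
	Limits map[string]string
}

// hashContainer mirrors the pattern used in the kubelet at the time:
// render the struct with fmt and checksum the resulting bytes.
func hashContainer(c *Container) uint64 {
	h := adler32.New()
	fmt.Fprintf(h, "%#v", *c)
	return uint64(h.Sum32())
}

func main() {
	c := &Container{
		Name:   "php-redis",
		Image:  "kubernetes/example-guestbook-php-redis",
		Limits: map[string]string{"cpu": "0.100", "memory": "50000000"},
	}
	// On the Go releases of that era (pre-1.12), fmt printed map entries
	// in iteration order, which is randomized, so these two hashes could
	// differ even though the value is unchanged. Go 1.12+ sorts map keys
	// in fmt output, which hides the problem.
	fmt.Println(hashContainer(c), hashContainer(c))
}
```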
I wonder whether it might be related to #4462, reported by @smarterclayton, since the container in question comes from the Kubelet.pods member.
I think the key is: …
@dchen1107 should we gracefully degrade here?
@brendandburns I assume this is reported by docker, not by the kubelet or cAdvisor. The memory.oom_control file is for OOM notification and other controls. docker/libcontainer registers a listener for OOM events from the kernel and processes them. Without that information, when a docker container is OOM-killed by the kernel, the container ends up in a terminated state without the proper OOM-killed information. The kubelet picks up that information, if it exists, and reports it as part of ContainerStatus. In this case I don't think that is the root cause, but I have seen many people report such failures to docker since the 1.4 release; one example is docker/issues/9902.

Now back to the initial issue: what is the root cause of the container dying, a system OOM kill or hitting the memory limit? I checked the example: only the frontend container has a memory limit, which means both the master and slave pods run unlimited (bounded by machine capacity, of course). The question is why this only happens with the latest release. Do the daemons (kubelet, docker, proxy, etc.) in the latest release use more memory, triggering a system OOM? Or is the problem limited to the frontend pods, in which case we are hitting the container's own memory limit?
By the way, the issue of docker not supporting OOM notifications should be fixed in the 1.5 release. Even if it is not the root cause, it creates confusion about the failure and hurts the debuggability of the ecosystem.
@brendandburns You asked for log output. I chose one of the frontend-controllers.
That's all there was. Other pods returned the same kind of log output.
I've synced my workspace to HEAD and applied @brendandburns's patch #4494, and it still doesn't seem to fix the problem. I've added quite a lot of debug printing around the HashContainer function, and exactly when the issue happens (death of a container due to a hash mismatch) I see two calls to that function with the same container and a different output hash:

I0218 13:55:55.782802 19728 docker.go:573] Hashing container: &{php-redis kubernetes/example-guestbook-php-redis [] [{ 8000 80 TCP }] [] {map[cpu:{0.100 DecimalSI} memory:{50000000.000 DecimalSI}]} [] …

I've been staring at the hashing code for a while and can't figure out where the issue is coming from.
OK, I think I can finally confirm that the issue is due to the non-deterministic printing behavior of the following construct: I wrapped the hasher with an io.Writer that prints the input to the hash before hashing it, and the result is not deterministic with respect to map elements (this is the []byte-to-string conversion of the input to the adler32 Write() method). I will work on the fix.
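The exact debugging wrapper wasn't posted, but a tee-style io.Writer of the kind described is simple to sketch: it logs every chunk fmt writes to the hash, so the inputs of two calls can be compared byte for byte.

```go
package main

import (
	"fmt"
	"hash/adler32"
	"io"
	"log"
)

// loggingWriter tees every chunk that fmt writes to the hash out to the
// log, so the exact bytes being hashed can be diffed between two calls.
type loggingWriter struct{ dst io.Writer }

func (w loggingWriter) Write(p []byte) (int, error) {
	log.Printf("hash input: %s", p)
	return w.dst.Write(p)
}

func main() {
	limits := map[string]string{"cpu": "0.100", "memory": "50000000"}
	for i := 0; i < 2; i++ {
		h := adler32.New()
		// Interpose the logger between fmt and the hasher.
		fmt.Fprintf(loggingWriter{dst: h}, "%#v", limits)
		log.Printf("hash: %d", h.Sum32())
	}
	// On pre-1.12 Go, the two logged inputs could differ in map entry
	// order, and the hashes would differ with them.
}
```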
Great findings!
Are we using the fuzzer to test the hasher? Would be a good way to verify the hasher.
Thanks for digging so deep on this! Yeah we should add hash validation to the fuzzer tests... I'll file an issue.
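A determinism check along those lines could look like the following sketch, using the github.com/google/gofuzz library that Kubernetes' fuzzer tests are built on, together with the Container/hashContainer stand-ins from the earlier sketch (hypothetical names, not the actual test that was filed):

```go
package main

import (
	"testing"

	fuzz "github.com/google/gofuzz"
)

// TestHashContainerDeterminism fills a Container with random data and
// asserts that hashing the same value twice yields the same result.
// A randomized map field makes iteration-order bugs show up quickly.
func TestHashContainerDeterminism(t *testing.T) {
	f := fuzz.New().NilChance(0).NumElements(1, 5)
	for i := 0; i < 1000; i++ {
		var c Container
		f.Fuzz(&c)
		if h1, h2 := hashContainer(&c), hashContainer(&c); h1 != h2 {
			t.Fatalf("non-deterministic hash for %#v: %d != %d", c, h1, h2)
		}
	}
}
```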
The right answer here might be a more targeted hash function that wipes out …
After making the hash function deterministic, containers are no longer cycling in my cluster: the frontend container has been running without a restart for 15 hours. Closing.
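The actual fix isn't shown in this thread, but one straightforward way to make such a hash deterministic is to stop relying on fmt's rendering of maps and write map entries in sorted-key order instead. A sketch against the Container stand-in from the first example:

```go
import (
	"fmt"
	"hash/adler32"
	"sort"
)

// deterministicHash writes the struct fields in a fixed order and the
// map entries in sorted-key order, so the checksum no longer depends
// on map iteration order. A sketch, not the fix that actually merged.
func deterministicHash(c *Container) uint64 {
	h := adler32.New()
	fmt.Fprintf(h, "%s/%s", c.Name, c.Image)
	keys := make([]string, 0, len(c.Limits))
	for k := range c.Limits {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Fprintf(h, "/%s=%s", k, c.Limits[k])
	}
	return uint64(h.Sum32())
}
```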
Thanks for fixing the issue here. |
I am trying to get the guestbook example (https://github.com/GoogleCloudPlatform/kubernetes/blob/master/examples/guestbook/README.md) working on a local Vagrant cluster using Kubernetes v0.10.1.
If you follow the guestbook tutorial, you initialise a cluster with one minion. In that case, when you create the two redis-slaves, only one gets assigned to the minion and the other is left unassigned. The same goes for the three frontend-controllers.
I tried with four minions and all pods were assigned. However, I noticed some strange behaviour: each frontend-controller kept periodically going into Pending and then back into Running, and every time it was restarted it was assigned the next IP address (i.e. last IP address + 1).
I also tried initialising my four-minion cluster with https://github.com/pires/kubernetes-vagrant-coreos-cluster, but got the same results.
Could this be a problem with the example, or with Kubernetes itself?