
Fix issue with 100% CPU usage in logs.go. #704

Merged

Conversation

@d11wtq (Contributor) commented Jun 19, 2018

Resolves: #531
See also: kubernetes/client-go#12

There is an issue in which the Pods watcher gets into an infinite tight
loop and begins consuming 100% of the CPU. This happens after `skaffold dev`
has been running for a while (~30 minutes) and, once it starts, it
doesn't stop.

The issue was narrowed down by @ajbouh to the event polling loop in
`logs.go`, which was not checking whether the `ResultChan()` had been closed.
Kubernetes actually closes the watch connection after a timeout (by default
somewhere in the range of 30-60 minutes, according to the related issue linked
above). When that happens, the intended solution is to start the watcher again.

This refactors the polling into two nested loops: an outer loop that starts
(and restarts) the Pods watcher itself, and an inner loop that receives and
processes the events from the watcher. If the `ResultChan()` is closed, the
entire watcher loop is restarted and log tailing continues.

There is a subtle difference in error handling as a result of this
change. Previously, any error returned from `client.Pods("").Watch()`
would be immediately returned from the `Start()` func in `logs.go`. This
is no longer possible, since the watcher is now initialized inside the
goroutine started by that func. As such, if the watcher cannot be
initialized, we simply log the error and stop tailing logs. Open to
suggestions for a better way to handle this error; retrying in a
tight loop seems potentially problematic in the error scenario.
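
For context, here is a minimal sketch of the nested-loop restart pattern described above. It is not the actual skaffold code: `watchPods`, `processEvent`, and the logging are illustrative placeholders, and the `Watch()` signature shown is the context-less one from client-go releases of that era.

```go
package logs

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// watchPods keeps a Pods watch alive: the outer loop (re)starts the watcher,
// the inner loop drains its ResultChan(). When the API server closes the
// watch (server-side timeout), the channel is closed and the outer loop
// starts a new watch instead of spinning on the closed channel.
func watchPods(ctx context.Context, client kubernetes.Interface, processEvent func(watch.Event)) {
	for {
		// Older client-go releases take only ListOptions here; newer ones
		// also take a context.Context as the first argument.
		watcher, err := client.CoreV1().Pods("").Watch(metav1.ListOptions{})
		if err != nil {
			// Inside the goroutine there is no caller to return the error to,
			// so log it and stop tailing (the trade-off discussed above).
			log.Printf("initializing pod watcher: %v", err)
			return
		}

	eventLoop:
		for {
			select {
			case <-ctx.Done():
				watcher.Stop()
				return
			case evt, ok := <-watcher.ResultChan():
				if !ok {
					// Channel closed by the server: restart the watcher.
					break eventLoop
				}
				processEvent(evt)
			}
		}
	}
}
```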

@googlebot commented

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.



@d11wtq (Contributor, Author) commented Jun 19, 2018

I signed it!

@d11wtq (Contributor, Author) commented Jun 19, 2018

Is that integration test failure expected? It looks like some kind of CI environment configuration issue: https://ci.appveyor.com/project/r2d4/skaffold/build/1.0.6

The build phase is set to "MSBuild" mode (default), but no Visual Studio project or solution files were found in the root directory. If you are not building Visual Studio project switch build mode to "Script" and provide your custom build command.

@dgageot (Contributor) commented Jun 19, 2018

@d11wtq Can you try to rebase? That should fix the AppVeyor bug. You should also sign the CLA with the email you've used for the commits.

@d11wtq (Contributor, Author) commented Jun 19, 2018

Thanks @dgageot. I'm pretty sure I logged into Google with my @w3style.co.uk address used in the commits. Will try again.

@d11wtq force-pushed the fix/cpu-spinning-log-watcher branch from b4b6c6f to f73f27b on June 19, 2018 at 21:12
@googlebot commented

CLAs look good, thanks!

@r2d4 added the kokoro:run label (runs the kokoro jobs on a PR) on Jun 19, 2018
@kokoro-team removed the kokoro:run label on Jun 19, 2018
@googlebot commented

So there's good news and bad news.

👍 The good news is that everyone who needs to sign a CLA (the pull request submitter and all commit authors) has done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of the commit author(s) and merge this pull request when appropriate.

@dgageot (Contributor) commented Jun 20, 2018

@d11wtq I had to update the branch. Can you just confirm that you are ok with that?

@d11wtq (Contributor, Author) commented Jun 20, 2018

I am cool with that 👍

@dgageot merged commit 5750bac into GoogleContainerTools:master on Jun 20, 2018
Successfully merging this pull request may close these issues: High CPU usage on Mac OS X