
Fix issue with 100% CPU usage in logs.go. #704

Merged

Conversation

@d11wtq (Contributor) commented Jun 19, 2018

Resolves: #531
See also: kubernetes/client-go#12

There is an issue in which the Pods watcher gets into an infinite tight
loop and begins consuming 100% of the CPU. This happens after `skaffold dev`
has been running for a while (~30 minutes) and, once it starts, it
doesn't stop.

The issue was narrowed down by @ajbouh to the event polling loop in
`logs.go`, which was not checking whether the `ResultChan()` had been closed.
Kubernetes actually closes the watch connection after a timeout (by default
somewhere in the range of 30-60 minutes, according to the related issue linked
above). When that happens, the intended solution is to start the watcher again.

This refactors the polling into two nested loops: an outer loop that starts
(and restarts) the Pods watcher itself, and an inner loop that receives and
processes the events from the watcher. If the `ResultChan()` is closed, the
entire watcher loop is restarted and log tailing continues.

There is a subtle difference in error handling as a result of this
change. Previously, any error returned from `client.Pods("").Watch()`
would be immediately returned from the `Start()` func in `logs.go`. This
is no longer possible, since the watcher is now initialized inside the
goroutine started by that func. As such, if the watcher cannot be
initialized, we simply log the error and stop tailing logs. Open to
suggestions for a better way to handle this error; retrying in a
tight loop seems potentially problematic in the error scenario.
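
For context, here is a minimal sketch of the nested-loop restart pattern described above. It is not the actual skaffold code: `watchPods`, `processEvent`, and the logging are illustrative placeholders, and the `Watch()` signature shown is the context-less one from client-go releases of that era.

```go
package logs

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// watchPods keeps a Pods watch alive: the outer loop (re)starts the watcher,
// the inner loop drains its ResultChan(). When the API server closes the
// watch (server-side timeout), the channel is closed and the outer loop
// starts a new watch instead of spinning on the closed channel.
func watchPods(ctx context.Context, client kubernetes.Interface, processEvent func(watch.Event)) {
	for {
		// Older client-go releases take only ListOptions here; newer ones
		// also take a context.Context as the first argument.
		watcher, err := client.CoreV1().Pods("").Watch(metav1.ListOptions{})
		if err != nil {
			// Inside the goroutine there is no caller to return the error to,
			// so log it and stop tailing (the trade-off discussed above).
			log.Printf("initializing pod watcher: %v", err)
			return
		}

	eventLoop:
		for {
			select {
			case <-ctx.Done():
				watcher.Stop()
				return
			case evt, ok := <-watcher.ResultChan():
				if !ok {
					// Channel closed by the server: restart the watcher.
					break eventLoop
				}
				processEvent(evt)
			}
		}
	}
}
```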

@googlebot commented

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.



@d11wtq (Contributor, Author) commented Jun 19, 2018

I signed it!

@d11wtq (Contributor, Author) commented Jun 19, 2018

Is that integration test failure expected? It looks like some kind of CI environment configuration issue: https://ci.appveyor.com/project/r2d4/skaffold/build/1.0.6

The build phase is set to "MSBuild" mode (default), but no Visual Studio project or solution files were found in the root directory. If you are not building Visual Studio project switch build mode to "Script" and provide your custom build command.

@dgageot (Contributor) commented Jun 19, 2018

@d11wtq Can you try to rebase? That should fix the AppVeyor bug. You should also sign the CLA with the email you've used for the commits.

@d11wtq (Contributor, Author) commented Jun 19, 2018

Thanks @dgageot. I'm pretty sure I logged into Google with my @w3style.co.uk address used in the commits. Will try again.

@d11wtq force-pushed the fix/cpu-spinning-log-watcher branch from b4b6c6f to f73f27b on June 19, 2018 at 21:12
@googlebot commented

CLAs look good, thanks!

@r2d4 added the kokoro:run label (runs the kokoro jobs on a PR) on Jun 19, 2018
@kokoro-team removed the kokoro:run label on Jun 19, 2018
@googlebot commented

So there's good news and bad news.

👍 The good news is that everyone who needs to sign a CLA (the pull request submitter and all commit authors) has done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of the commit author(s) and merge this pull request when appropriate.

@dgageot (Contributor) commented Jun 20, 2018

@d11wtq I had to update the branch. Can you just confirm that you are ok with that?

@d11wtq (Contributor, Author) commented Jun 20, 2018

I am cool with that 👍

@dgageot merged commit 5750bac into GoogleContainerTools:master on Jun 20, 2018
Successfully merging this pull request may close these issues: High CPU usage on Mac OS X