
High CPU usage on Mac OS X #531

Closed
aplex opened this issue May 8, 2018 · 28 comments · Fixed by #704
Labels
kind/bug Something isn't working

Comments

@aplex commented May 8, 2018

Actual behavior

The skaffold process generates about 100% sustained CPU usage when it is started with the "skaffold dev" command. The project itself is quite small (4 deployments of third-party services and 1 locally built Docker image), and the high CPU usage is present even without any local file changes.

The VBoxHeadless process also shows sustained 150-200% CPU usage in Activity Monitor.

Information

Skaffold version: v0.5.0
Operating system: Mac OS X 10.12.6
Content of skaffold.yaml:

apiVersion: skaffold/v1alpha2
kind: Config
build:
  artifacts:
  - imageName: my-tools-image
    workspace: ./tools
    docker: {}
  local: {}
deploy:
  kubectl:
    manifests:
    - ./kubernetes/*

@ajbouh (Contributor) commented May 8, 2018

Also seeing this behavior on 10.11.6. I've been having issues with running out of file descriptors in other apps, though I'm not sure if that's relevant to what's going on here or not...

@dgageot added the kind/bug and cmd/dev labels on May 9, 2018
@slaxor commented May 11, 2018

I have also seen it on Linux (Linux turtle 4.16.7-1-ARCH #1 SMP PREEMPT Wed May 2 21:12:36 UTC 2018 x86_64 GNU/Linux).

@dlorenc (Contributor) commented May 15, 2018

Ref #549

@ajbouh (Contributor) commented May 16, 2018

I've instrumented my local build of skaffold and found that this line:

case evt := <-watcher.ResultChan():

seems to suddenly start returning a nil/empty Event repeatedly. I have been able to trigger it by switching from one internet connection to another. I think the watch channel is closed and we're stuck in an infinite loop because of it.

I'm not certain this is the source of the problem, but I'm experimenting locally with:

	case evt, ok := <-watcher.ResultChan():
		if !ok {
			// Try reopening
			logrus.Infof("Log aggregator watch has stopped, respawning it.")
			a.Start(ctx, client)
			break
		}

@ajbouh (Contributor) commented May 23, 2018

I've been running this for a few days now and while the CPU usage issue is gone, I've seen the "Log aggregator watch has stopped, respawning it." message spewed to my logs over and over again.

I guess something goes wrong elsewhere in the log-watching state machine, and just respawning it is not sufficient. This change is better than nothing, as it gives a landmark in the logs to launch deeper investigations from. Does anyone have ideas about what might be going on?

@dlorenc (Contributor) commented Jun 4, 2018

We have an mtime watcher now that should be better.

@dlorenc closed this as completed on Jun 4, 2018
@d11wtq (Contributor) commented Jun 18, 2018

@dlorenc can you elaborate, please? I looked at the linked PR but there is not much detail on it. Does the mtime watcher basically poll large directories for modification of files? I'm experiencing this issue when switching between my home network and my office network. If the mtime solution uses polling, this will probably be bad for my use case, since I'm using Node.js and the node_modules tree can be extremely large (gigabytes of deeply nested directories).

I'm very curious about the underlying cause of the issue with the inotify/fsevents solution. I've used those tools a lot on other projects and switched networks without issues, as I routinely work from home and from the office on the same laptop.

@d11wtq (Contributor) commented Jun 18, 2018

Is the mtime watcher the default in 0.7.0, or did it miss the cut? If it is the default, then I think I'm still hitting 100% CPU usage with the mtime watcher after suspending the laptop for more than an hour while skaffold dev is running.

@ajbouh (Contributor) commented Jun 18, 2018 via email

@d11wtq (Contributor) commented Jun 18, 2018

Oh yeah, I've had the 100% CPU issue using kail for log tailing before. stern didn't have that issue, so I switched to that. Not sure how skaffold is doing it, but you might be onto something there.

@d11wtq (Contributor) commented Jun 18, 2018

I seem to have run into the 100% CPU usage again after around 45 minutes of inactivity.

@d11wtq (Contributor) commented Jun 18, 2018

Interestingly I'm running multiple skaffold processes (microservices architecture) and they all collectively started using 100% CPU at the same time. So something environmental causes it to start happening.

@d11wtq (Contributor) commented Jun 18, 2018

Can we re-open this issue?

[screenshot omitted]

@r2d4 reopened this on Jun 18, 2018
@d11wtq (Contributor) commented Jun 19, 2018

Thanks @r2d4 - I'll see if I can pinpoint where the issue is.

@d11wtq (Contributor) commented Jun 19, 2018

I've done some rudimentary debugging (fmt.Println in spots I suspect to be problematic) and it seems that when this issue starts happening, it's in log.go's Start() func.

for {
    select {
    case <-ctx.Done():
        return
    case evt := <-watcher.ResultChan():
        fmt.Println("Got watcher evt")
        if evt.Type != watch.Added && evt.Type != watch.Modified {
            fmt.Println("Wrong evt.Type")
            continue
        }

It seems to start thrashing there, receiving events in what looks like a tight loop. This matches up with what @ajbouh observed too.

@d11wtq (Contributor) commented Jun 19, 2018

kubernetes/client-go#12

watcher error after 20m

When I create a watch on pods, e.g. like this:
...
I have the problem that after exactly 20 minutes, the ResultChan() receives an endless stream of messages, but they are all empty.

It is expected. The server will time out the watch connection (see the code). Though I'm not sure why it's 20 minutes; it times out after 30-60 minutes if you didn't change your server's default configs, because the default MinRequestTimeout is 30 minutes.

Example code that uses Watcher correctly: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/cache/reflector.go#L358

@d11wtq (Contributor) commented Jun 19, 2018

This probably matches what I see. Every time this issue starts happening for me, it's when log output from the containers has been idle for a while (e.g. when I go for lunch).

d11wtq added a commit to d11wtq/skaffold that referenced this issue Jun 19, 2018
Resolves: GoogleContainerTools#531
See also: kubernetes/client-go#12

There is an issue in which the Pods watcher gets into an infinite tight loop and begins consuming 100% of the CPU. This happens after skaffold dev has been running for a while (~30 mins) and, once it starts, it doesn't stop.

The issue was narrowed down by @ajbouh to the event polling loop in `logs.go`, which was not checking whether the `ResultChan()` had been closed.
Kubernetes actually closes the connection after a timeout (default is in
the range of 30-60 mins according to the related issue linked to above).
In this case, the intended solution is to start the watcher again.

This refactors the polling into two nested loops. One to start (and
restart) the Pods watcher itself and another to receive and process the
events from the watcher. If the `ResultChan()` is closed, the entire
watcher loop is restarted and log tailing continues.

There is a subtle difference in error handling as a result of this
change. Previously any error returned from `client.Pods("").Watch()`
would be immediately returned from the `Watch()` func in `logs.go`. This
is no longer possible since the watcher is initialized in the goroutine
started by that func. As such, in the case where the watcher cannot be initialized, we simply log the error and stop tailing logs. Open to suggestions as to a better way to handle this error; retrying in a tight loop seems potentially problematic in the error scenario.
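
For readers following along, here is a compact sketch of the pattern the commit describes: an outer loop that (re)opens the Pods watch and an inner loop that drains events until the server closes the stream. It is not the actual PR code; the package name, the watchPods/handle names, and the old non-context Watch signature are assumptions for illustration.

package logtail

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// watchPods keeps a Pods watch alive: the outer loop (re)creates the watcher,
// the inner loop consumes events until the API server closes the stream.
func watchPods(ctx context.Context, client kubernetes.Interface, handle func(watch.Event)) {
	for {
		// Outer loop: (re)open the watch. Newer client-go versions also take a context here.
		w, err := client.CoreV1().Pods("").Watch(metav1.ListOptions{})
		if err != nil {
			// Opening the watch failed; log and give up rather than retry in a tight loop.
			log.Printf("cannot open pods watch: %v", err)
			return
		}
	inner:
		for {
			select {
			case <-ctx.Done():
				w.Stop()
				return
			case evt, ok := <-w.ResultChan():
				if !ok {
					// The server timed out the watch; reopen it instead of spinning.
					log.Println("pods watch closed, reopening")
					break inner
				}
				handle(evt)
			}
		}
	}
}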
@d11wtq (Contributor) commented Jun 19, 2018

Posted a fix here: #704

@davidfernandezm commented Jun 1, 2020

I think we may need to reopen this.

I am observing this with low activity, also with skaffold dev. I am running v1.9.1 and it's just tailing logs from one pod. It's also watching for changes in a folder to sync (manual FileSync). Any pointers on how to reduce CPU usage?

[screenshot: Screenshot 2020-06-01 at 16 55 37]

The full command is:

$ ps -ef | grep skaffold
  501 18378 18350   0  3:19pm ??       113:05.89 skaffold dev --filename skaffold.yml --profile qc --trigger=manual --cleanup=false --auto-build=false --auto-deploy=false --default-repo eu.gcr.io/repo --port-forward=true --status-check=true

@davidfernandezm

Can anyone reopen this issue or point to a more recent one? On 2018 Macs, the fans run at full speed when skaffold dev is left monitoring logs.

@nkubala (Contributor) commented Jun 15, 2020

@davidfernandezm wow, that's quite high. Have you tried running with a different watcher? skaffold dev --trigger=polling may fix your issue temporarily. I'd love to know more about your project, though: is the folder that you're watching for sync really big?

@davidfernandezm commented Jun 16, 2020

Thanks a lot for your response @nkubala. There's one folder being watched that has around 5000 files; my suspicion is that it's the culprit. I will try the polling trigger to see if that has any effect.

This is what my skaffold.yml profile looks like:

...
- name: qc
  build:
    artifacts:
    - image: "skaffold-qc"
      context: '..'
      sync:
        manual:
        - src: 'KVP/**/*.class'
          dest: /usr/local/tomcat/webapps/ROOT
          strip: 'KVP'
        - src: 'KVP/client/**/*.*'
          dest: /usr/local/tomcat/webapps/ROOT
          strip: 'KVP'
        - src: 'deploy/external-configuration/app/external-configuration/*.*'
          dest: /app/external-configuration
          strip: 'deploy/external-configuration/app/external-configuration'
      jib:
        args:
        - "-PskipBuildClient=true"
    local: {}
  deploy:
    helm:
      releases:
      - name: "{{.USER}}-skaffold-qc"
        chartPath: "qumucloud"
        values:
          image: "skaffold-qc"
        setValueTemplates:
          initDb: "{{.INIT_DB}}"
        valuesFiles:
        - "qumucloud/values-local-{{.USER}}.yaml"
        recreatePods: true
      flags:
        upgrade:
        - --install
    # This is the time that skaffold waits before assuming a deployment has timed out.
    # It needed to be increased, as QC can sometimes take 3 or more minutes to be built & deployed the first time round.
    statusCheckDeadlineSeconds: 240

and on the Jib side:

...
jib {
    from.image = 'eu.gcr.io/qumu-dev/qumu-cloud-base-local:1.0.0'

    extraDirectories.paths = ['deploy/external-configuration',
                              'build/libs', 
                              'deploy/database', 
                              'scripts/job-processor', 
                              'database/liquibase', 
                              'src/main/jib'] 

    extraDirectories.permissions = [
            '/app/entrypoint.sh': '755',
            '/app/restart-tomcat.sh': '755',
            '/liquibase': '755'
    ]

    container.workingDirectory = "/app"
    container.entrypoint = ["/app/entrypoint.sh"]
    container.appRoot = '/usr/local/tomcat/webapps/ROOT'

    skaffold {
        watch {
            includes = [ "KVP/" ]
        }
    }
}

@davidfernandezm

@nkubala thanks a lot for your suggestion, it fixed the CPU issue... it's now sitting at around 20% (it has been running for 3h). I still want to confirm, but could you point me to some doc or explanation of the differences between manual, polling, and notify, and why manual goes crazy with CPU?

Thanks a lot again!

@nkubala (Contributor) commented Jun 24, 2020

@davidfernandezm glad to hear that fixed your issue! We have some brief documentation on the trigger types at https://skaffold.dev/docs/workflows/dev/ under "File Watcher and Watch Modes", but this could definitely be improved. I don't have time to do this right now, but if you'd like to become a contributor, feel free to throw some improvements together :)

FYI, it's not manual that goes crazy, it's fsnotify. Especially on OS X, when a project has lots and lots of files, the CPU usage skyrockets. The polling trigger simply computes changes in your workspace on a time interval, so it doesn't actually have to "monitor" anything. It's not as quick, but it has much less overhead.
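
To make the difference concrete, here is a minimal sketch of the polling idea; it is not skaffold's actual watcher (the interval, function names, and log output are made up), just an illustration of walking the workspace on an interval and comparing mtimes instead of holding fsnotify/fsevents handles:

package main

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

// changedSince walks root and returns files modified after the previous scan.
func changedSince(root string, last time.Time) ([]string, error) {
	var changed []string
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() && info.ModTime().After(last) {
			changed = append(changed, path)
		}
		return nil
	})
	return changed, err
}

func main() {
	last := time.Now()
	for range time.Tick(2 * time.Second) { // the interval stands in for skaffold's poll period
		now := time.Now()
		if files, err := changedSince(".", last); err == nil && len(files) > 0 {
			log.Printf("%d files changed, would trigger a rebuild/sync", len(files))
		}
		last = now
	}
}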

@davidfernandezm

I'd love to become a contributor, but I am still learning Go basics. I am actually reading the code and learning a lot. Congrats on such a great product for speeding up K8s and cloud adoption in dev environments.

@tejal29 (Member) commented Jul 13, 2020

Thanks @davidfernandezm!

@davidfernandezm

I don't know if this is the right place for this, but it is related to File Sync of thousands of files, which doesn't work at the moment. A block like this:

artifacts:
- image: "skaffold-qc"
  context: '..'
  sync:
    manual:
    - src: 'KVP/**/*.class'
      dest: /usr/local/tomcat/webapps/ROOT
      strip: 'KVP'
    - src: 'KVP/client/**/*.*'
      dest: /usr/local/tomcat/webapps/ROOT
      strip: 'KVP'
    - src: 'deploy/external-configuration/app/external-configuration/*.*'
      dest: /app/external-configuration
      strip: 'deploy/external-configuration/app/external-configuration'

in our scenario tries to sync 4000+ files and gets:

time="2020-08-03T15:54:49+01:00" level=warning msg="Skipping deploy due to sync error: deleting files: starting command /usr/local/bin/kubectl --context gke_qumu-dev_europe-west1-b_qumu-dev-dev1 exec qumucloud-dfernandez-85dc9c9d85-xn9xk --namespace default -c qumucloud-dfernandez -- rm -rf -- /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/api/cache/Delegate.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/api/content/chapter/ChapterApi.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/integration/stoneshot/GetSessionsInvitedForContactResponse.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/security/SsoUserMapper.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/security/jwt/JwtSigningException.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/api/dto/AsyncJobDto.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/integration/stone...
-- MANY MORE FILE PATHS --
INF/classes/com/kuluvalley/guru/server/presentation/controller/admin/MetadataTypeEnumController.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/security/KeyStoreConfig.class: fork/exec /usr/local/bin/kubectl: argument list too long"

The key part: fork/exec /usr/local/bin/kubectl: argument list too long

Is there any workaround for when such a large number of files needs to be deleted/synced?
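
For illustration of why that command fails: all 4000+ paths are passed as arguments to a single kubectl exec ... rm invocation, which exceeds the OS argument-length limit (ARG_MAX) at fork/exec time. A hypothetical mitigation, sketched below with made-up names and not something skaffold does today, would be to split the deletions into batches so each command line stays small:

package filesync

import "os/exec"

// deleteInChunks runs "kubectl exec <pod> -c <container> -- rm -rf -- <paths...>"
// in batches so that no single invocation exceeds the argument-length limit.
func deleteInChunks(pod, container string, paths []string, chunkSize int) error {
	for start := 0; start < len(paths); start += chunkSize {
		end := start + chunkSize
		if end > len(paths) {
			end = len(paths)
		}
		args := append([]string{"exec", pod, "-c", container, "--", "rm", "-rf", "--"}, paths[start:end]...)
		if err := exec.Command("kubectl", args...).Run(); err != nil {
			return err
		}
	}
	return nil
}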

@nkubala (Contributor) commented Sep 2, 2020

@davidfernandezm interesting... never seen that one before, haha. Would you mind opening a new issue to track that?
