
High CPU usage on Mac OS X #531

Closed
aplex opened this issue May 8, 2018 · 28 comments · Fixed by #704
Labels
kind/bug Something isn't working

Comments

@aplex commented May 8, 2018

Actual behavior

The skaffold process generates about 100% sustained CPU usage when it is started with the "skaffold dev" command. The project itself is quite small (4 deployments of third-party services and 1 locally built Docker image), and the high CPU usage is present even without any local file changes.

The VBoxHeadless process also shows sustained 150-200% CPU usage in Activity Monitor.

Information

Skaffold version: v0.5.0
Operating system: Mac OS X 10.12.6
Content of skaffold.yaml:

apiVersion: skaffold/v1alpha2
kind: Config
build:
  artifacts:
  - imageName: my-tools-image
    workspace: ./tools
    docker: {}
  local: {}
deploy:
  kubectl:
    manifests:
    - ./kubernetes/*

@ajbouh (Contributor) commented May 8, 2018

Also seeing this behavior on 10.11.6. I've been having issues with running out of file descriptors in other apps, though I'm not sure if that's relevant to what's going on here or not...

@dgageot added the kind/bug and cmd/dev labels on May 9, 2018
@slaxor commented May 11, 2018

I have also seen it on Linux (Linux turtle 4.16.7-1-ARCH #1 SMP PREEMPT Wed May 2 21:12:36 UTC 2018 x86_64 GNU/Linux).

@dlorenc (Contributor) commented May 15, 2018

Ref #549

@ajbouh (Contributor) commented May 16, 2018

I've instrumented my local build of skaffold and found that this line:

case evt := <-watcher.ResultChan():

seems to suddenly start returning a nil/empty Event repeatedly. I have been able to trigger it by switching from one internet connection to another. I think the watch channel is closed and we're stuck in an infinite loop because of it.

I'm not certain this is the source of the problem, but I'm experimenting locally with:

	case evt, ok := <-watcher.ResultChan():
		if !ok {
			// Try reopening
			logrus.Infof("Log aggregator watch has stopped, respawning it.")
			a.Start(ctx, client)
			break
		}

@ajbouh (Contributor) commented May 23, 2018

I've been running this for a few days now and while the CPU usage issue is gone, I've seen the "Log aggregator watch has stopped, respawning it." message spewed to my logs over and over again.

I guess something goes wrong elsewhere in the log-watching state machine, and just respawning it is not sufficient. This change is better than nothing, as it gives a landmark in the logs to launch deeper investigations from. Does anyone have ideas about what might be going on?

@dlorenc (Contributor) commented Jun 4, 2018

We have an mtime watcher now that should be better.

@dlorenc closed this as completed on Jun 4, 2018
@d11wtq (Contributor) commented Jun 18, 2018

@dlorenc can you elaborate, please? I looked at the linked PR but there is not much detail on it. Does the mtime watcher basically poll large directories for modification of files? I'm experiencing this issue when switching between my home network and my office network. If the mtime solution uses polling, this will probably be bad for my use case, since I'm using Node.js and the node_modules tree can be extremely large (gigabytes of deeply nested directories).

I'm very curious about the underlying cause of the issue with the inotify/fsevents solution. I've used those tools a lot on other projects and switched networks without issues, as I routinely work from home and from the office on the same laptop.

@d11wtq (Contributor) commented Jun 18, 2018

Is the mtime watcher the default in 0.7.0, or did it miss the cut? If it is the default, then I think I'm still hitting 100% CPU usage with the mtime watcher after suspending the laptop for more than an hour while skaffold dev is running.

@ajbouh (Contributor) commented Jun 18, 2018 via email

@d11wtq (Contributor) commented Jun 18, 2018

Oh yeah, I've had the 100% CPU issue using kail for log tailing before. stern didn't have that issue, so I switched to that. Not sure how skaffold is doing it, but you might be onto something there.

@d11wtq (Contributor) commented Jun 18, 2018

I seem to have run into the 100% CPU usage again after around 45 minutes of inactivity.

@d11wtq (Contributor) commented Jun 18, 2018

Interestingly I'm running multiple skaffold processes (microservices architecture) and they all collectively started using 100% CPU at the same time. So something environmental causes it to start happening.

@d11wtq (Contributor) commented Jun 18, 2018

Can we re-open this issue?

[screenshot omitted]

@r2d4 reopened this on Jun 18, 2018
@d11wtq (Contributor) commented Jun 19, 2018

Thanks @r2d4 - I'll see if I can pinpoint where the issue is.

@d11wtq (Contributor) commented Jun 19, 2018

I've done some rudimentary debugging (fmt.Println in spots I suspect to be problematic) and it seems that when this issue starts happening, it's in log.go's Start() func.

for {
    select {
    case <-ctx.Done():
        return
    case evt := <-watcher.ResultChan():
        fmt.Println("Got watcher evt")
        if evt.Type != watch.Added && evt.Type != watch.Modified {
            fmt.Println("Wrong evt.Type")
            continue
        }

It seems to start thrashing there, receiving events in what looks like a tight loop. This matches up with what @ajbouh observed too.

@d11wtq (Contributor) commented Jun 19, 2018

kubernetes/client-go#12

watcher error after 20m

When I create a watch on pods, e.g. like this:
...
I have the problem that after exactly 20 minutes, the ResultChan() receives an endless stream of messages, but they are all empty.

It is expected. The server will time out the watch connection (see the code). Though I'm not sure why it's 20 minutes; it times out after 30-60 minutes if you didn't change your server's default configs, because the default MinRequestTimeout is 30 minutes.

Example code that uses Watcher correctly: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/cache/reflector.go#L358

@d11wtq (Contributor) commented Jun 19, 2018

This probably matches what I see. Every time this issue starts happening for me, it's when log output from the containers has been idle for a while (e.g. when I go for lunch).

d11wtq added a commit to d11wtq/skaffold that referenced this issue Jun 19, 2018
Resolves: GoogleContainerTools#531
See also: kubernetes/client-go#12

There is an issue in which the Pods watcher gets into an infinite tight loop and begins consuming 100% of the CPU. This happens after skaffold dev has been running for a while (~30 mins) and, once it starts, it doesn't stop.

The issue was narrowed down by @ajbouh to the event polling loop in `logs.go`, which was not checking whether the `ResultChan()` had been closed.
Kubernetes actually closes the connection after a timeout (default is in
the range of 30-60 mins according to the related issue linked to above).
In this case, the intended solution is to start the watcher again.

This refactors the polling into two nested loops. One to start (and
restart) the Pods watcher itself and another to receive and process the
events from the watcher. If the `ResultChan()` is closed, the entire
watcher loop is restarted and log tailing continues.

There is a subtle difference in error handling as a result of this
change. Previously any error returned from `client.Pods("").Watch()`
would be immediately returned from the `Watch()` func in `logs.go`. This
is no longer possible since the watcher is initialized in the goroutine
started by that func. As such, in the case where the watcher cannot be initialized, we simply log the error and stop tailing logs. Open to suggestions as to a better way to handle this error; retrying in a tight loop seems potentially problematic in the error scenario.
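
For readers following along, here is a compact sketch of the pattern the commit describes: an outer loop that (re)opens the Pods watch and an inner loop that drains events until the server closes the stream. It is not the actual PR code; the package name, the watchPods/handle names, and the old non-context Watch signature are assumptions for illustration.

package logtail

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// watchPods keeps a Pods watch alive: the outer loop (re)creates the watcher,
// the inner loop consumes events until the API server closes the stream.
func watchPods(ctx context.Context, client kubernetes.Interface, handle func(watch.Event)) {
	for {
		// Outer loop: (re)open the watch. Newer client-go versions also take a context here.
		w, err := client.CoreV1().Pods("").Watch(metav1.ListOptions{})
		if err != nil {
			// Opening the watch failed; log and give up rather than retry in a tight loop.
			log.Printf("cannot open pods watch: %v", err)
			return
		}
	inner:
		for {
			select {
			case <-ctx.Done():
				w.Stop()
				return
			case evt, ok := <-w.ResultChan():
				if !ok {
					// The server timed out the watch; reopen it instead of spinning.
					log.Println("pods watch closed, reopening")
					break inner
				}
				handle(evt)
			}
		}
	}
}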
@d11wtq (Contributor) commented Jun 19, 2018

Posted a fix here: #704

@davidfernandezm commented Jun 1, 2020

I think we may need to reopen this.

I am observing this with low activity, also with skaffold dev. I am running v1.9.1 and it's just tailing logs from one pod. It's also watching for changes in a folder to sync (manual FileSync). Any pointers on how to reduce CPU usage?

[screenshot: Screenshot 2020-06-01 at 16 55 37]

The full command is:

$ ps -ef | grep skaffold
  501 18378 18350   0  3:19pm ??       113:05.89 skaffold dev --filename skaffold.yml --profile qc --trigger=manual --cleanup=false --auto-build=false --auto-deploy=false --default-repo eu.gcr.io/repo --port-forward=true --status-check=true

@davidfernandezm

Can anyone reopen this issue or point to a more recent one? On 2018 Macs, the fans run at full speed when skaffold dev is left monitoring logs.

@nkubala (Contributor) commented Jun 15, 2020

@davidfernandezm wow, that's quite high. Have you tried running with a different watcher? skaffold dev --trigger=polling may fix your issue temporarily. I'd love to know more about your project, though: is the folder that you're watching for sync really big?

@davidfernandezm commented Jun 16, 2020

Thanks a lot for your response @nkubala. There's one folder being watched that has around 5000 files; my suspicion is that it's the culprit. I will try the polling trigger to see if that has any effect.

This is what my skaffold.yml profile looks like:

...
- name: qc
  build:
    artifacts:
    - image: "skaffold-qc"
      context: '..'
      sync:
        manual:
        - src: 'KVP/**/*.class'
          dest: /usr/local/tomcat/webapps/ROOT
          strip: 'KVP'
        - src: 'KVP/client/**/*.*'
          dest: /usr/local/tomcat/webapps/ROOT
          strip: 'KVP'
        - src: 'deploy/external-configuration/app/external-configuration/*.*'
          dest: /app/external-configuration
          strip: 'deploy/external-configuration/app/external-configuration'
      jib:
        args:
        - "-PskipBuildClient=true"
    local: {}
  deploy:
    helm:
      releases:
      - name: "{{.USER}}-skaffold-qc"
        chartPath: "qumucloud"
        values:
          image: "skaffold-qc"
        setValueTemplates:
          initDb: "{{.INIT_DB}}"
        valuesFiles:
        - "qumucloud/values-local-{{.USER}}.yaml"
        recreatePods: true
      flags:
        upgrade:
        - --install
    # This is the time that skaffold waits before assuming a deployment has timed out.
    # It needed to be increased, as QC can sometimes take 3 or more minutes to be built & deployed the first time round.
    statusCheckDeadlineSeconds: 240

and on the Jib side:

...
jib {
    from.image = 'eu.gcr.io/qumu-dev/qumu-cloud-base-local:1.0.0'

    extraDirectories.paths = ['deploy/external-configuration',
                              'build/libs', 
                              'deploy/database', 
                              'scripts/job-processor', 
                              'database/liquibase', 
                              'src/main/jib'] 

    extraDirectories.permissions = [
            '/app/entrypoint.sh': '755',
            '/app/restart-tomcat.sh': '755',
            '/liquibase': '755'
    ]

    container.workingDirectory = "/app"
    container.entrypoint = ["/app/entrypoint.sh"]
    container.appRoot = '/usr/local/tomcat/webapps/ROOT'

    skaffold {
        watch {
            includes = [ "KVP/" ]
        }
    }
}

@davidfernandezm

@nkubala thanks a lot for your suggestion, it fixed the CPU issue... it's now sitting at around 20% (it has been running for 3h). I still want to confirm, but could you point me to some doc or explanation of the differences between manual, polling, and notify, and why manual goes crazy with CPU?

Thanks a lot again!

@nkubala (Contributor) commented Jun 24, 2020

@davidfernandezm glad to hear that fixed your issue! We have some brief documentation on the trigger types at https://skaffold.dev/docs/workflows/dev/ under "File Watcher and Watch Modes", but this could definitely be improved. I don't have time to do this right now, but if you'd like to become a contributor, feel free to throw some improvements together :)

FYI, it's not manual that goes crazy, it's fsnotify. Especially on OS X, when a project has lots and lots of files, the CPU usage skyrockets. The polling trigger simply computes changes in your workspace on a time interval, so it doesn't actually have to "monitor" anything. It's not as quick, but it has much less overhead.
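
To make the difference concrete, here is a minimal sketch of the polling idea; it is not skaffold's actual watcher (the interval, function names, and log output are made up), just an illustration of walking the workspace on an interval and comparing mtimes instead of holding fsnotify/fsevents handles:

package main

import (
	"log"
	"os"
	"path/filepath"
	"time"
)

// changedSince walks root and returns files modified after the previous scan.
func changedSince(root string, last time.Time) ([]string, error) {
	var changed []string
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() && info.ModTime().After(last) {
			changed = append(changed, path)
		}
		return nil
	})
	return changed, err
}

func main() {
	last := time.Now()
	for range time.Tick(2 * time.Second) { // the interval stands in for skaffold's poll period
		now := time.Now()
		if files, err := changedSince(".", last); err == nil && len(files) > 0 {
			log.Printf("%d files changed, would trigger a rebuild/sync", len(files))
		}
		last = now
	}
}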

@davidfernandezm

I'd love to become a contributor, but I am still learning Go basics. I am actually reading the code and learning a lot. Congrats on such a great product for speeding up K8s and cloud adoption in dev environments.

@tejal29 (Member) commented Jul 13, 2020

Thanks @davidfernandezm!

@davidfernandezm

I don't know if this is the right place for this, but it is related to File Sync of thousands of files, which doesn't work at the moment. A block like this:

artifacts:
- image: "skaffold-qc"
  context: '..'
  sync:
    manual:
    - src: 'KVP/**/*.class'
      dest: /usr/local/tomcat/webapps/ROOT
      strip: 'KVP'
    - src: 'KVP/client/**/*.*'
      dest: /usr/local/tomcat/webapps/ROOT
      strip: 'KVP'
    - src: 'deploy/external-configuration/app/external-configuration/*.*'
      dest: /app/external-configuration
      strip: 'deploy/external-configuration/app/external-configuration'

in our scenario tries to sync 4000+ files and gets:

time="2020-08-03T15:54:49+01:00" level=warning msg="Skipping deploy due to sync error: deleting files: starting command /usr/local/bin/kubectl --context gke_qumu-dev_europe-west1-b_qumu-dev-dev1 exec qumucloud-dfernandez-85dc9c9d85-xn9xk --namespace default -c qumucloud-dfernandez -- rm -rf -- /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/api/cache/Delegate.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/api/content/chapter/ChapterApi.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/integration/stoneshot/GetSessionsInvitedForContactResponse.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/security/SsoUserMapper.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/security/jwt/JwtSigningException.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/api/dto/AsyncJobDto.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/integration/stone...
-- MANY MORE FILE PATHS --
INF/classes/com/kuluvalley/guru/server/presentation/controller/admin/MetadataTypeEnumController.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/security/KeyStoreConfig.class: fork/exec /usr/local/bin/kubectl: argument list too long"

The key part: fork/exec /usr/local/bin/kubectl: argument list too long

Is there any workaround for when such a large number of files needs to be deleted/synced?
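
For illustration of why that command fails: all 4000+ paths are passed as arguments to a single kubectl exec ... rm invocation, which exceeds the OS argument-length limit (ARG_MAX) at fork/exec time. A hypothetical mitigation, sketched below with made-up names and not something skaffold does today, would be to split the deletions into batches so each command line stays small:

package filesync

import "os/exec"

// deleteInChunks runs "kubectl exec <pod> -c <container> -- rm -rf -- <paths...>"
// in batches so that no single invocation exceeds the argument-length limit.
func deleteInChunks(pod, container string, paths []string, chunkSize int) error {
	for start := 0; start < len(paths); start += chunkSize {
		end := start + chunkSize
		if end > len(paths) {
			end = len(paths)
		}
		args := append([]string{"exec", pod, "-c", container, "--", "rm", "-rf", "--"}, paths[start:end]...)
		if err := exec.Command("kubectl", args...).Run(); err != nil {
			return err
		}
	}
	return nil
}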

@nkubala (Contributor) commented Sep 2, 2020

@davidfernandezm interesting... never seen that one before, haha. Would you mind opening a new issue to track that?
