Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High CPU usage and syncing stops with native OSX when files are being created and removed #497

Closed
ricmatsui opened this issue Nov 12, 2017 · 31 comments

Comments

@ricmatsui
Copy link
Contributor

ricmatsui commented Nov 12, 2017

Error

  • When creating and removing a file such as when checking out a branch in git or other scenarios, the sync container will start using very high CPU usage and syncing will stop happening
  • CPU usage stays at 100%, and memory usage also is very high
  • The only way is to restart the sync container or the unison process inside the container
  • As far as I can tell, there seems to be a bug in unison which causes this issue, but I'm not sure if docker-sync or docker is related. I am reporting this issue to see if there are others that can help with this as I continue investigating as well
  • I have created a repro which should be able to reproduce this issue. It reproduces on my machine pretty reliably: https://github.com/ricmatsui/docker-sync-cpu-repro
  • Steps to use the repro
    • Install gems
    • Run ./repro.sh
    • It will run a series of tests with a larger and larger test file to get the issue to happen
    • It creates a file, waits 1 second, then removes the file and waits 1 second, and does this 15 times
    • Then it checks the CPU usage 5 times and reports if it sees high CPU usage
    • If it does not see high CPU usage, it increases the size of the file 10x and repeats until it reaches a 100M length file
    • The issue reproduces regularly with 100K lines in the file
    • Let me know if this does not reproduce the issue

Recording: Repro at 2:20

asciicast

Docker Driver

d4m

Version 17.09.0-ce-mac35 (19611)
Channel: stable
a98b7c1b7c

Sync strategy

native_osx
docker-sync (0.4.6)
eugenmayer/unison:hostsync_0.2

your docker-sync.yml

https://github.com/ricmatsui/docker-sync-cpu-repro/blob/master/docker-sync.yml

version: '2'

syncs:
  reprocpu_sync:
    image: 'eugenmayer/unison:hostsync_0.2'
    src: '.'
    sync_strategy: native_osx
    sync_prefer: src

OS

OSX 10.12.6

MacBook Pro 15-inch Mid 2015
2.5 GHz Intel Core i7
16 GB 1600 MHz DDR3

@EugenMayer
Copy link
Owner

EugenMayer commented Nov 12, 2017

That is quiet a huge effort and even a reproduction, thank you a lot @ricmatsui !

One interesting fact is, those are very few files. Just so we can actually understand, what resources did you assign docker for mac?

You might want to remove the .ruby-version file, since it is basically hindering people using your repro directly not having ruby 2.3.0 - works with any ruby anyway ( its docker-sync only )


Beside that, i cannot reproduct the issue and i would really love to know why that is the case in general. It is obvious that you reduced the case on a very very limit so this is actually very interesting to understand. Please anybody testing this, write down your OSX version, your d4m version, you d4m resources and also probably if your home-folder is encrypted.



Result: NOT reproduceable
MacOS: Sierra
d4m: stable Version 17.09.0-ce-mac35 (19611)
d4m resources:: Laptop, 9.2GB (DDR3), 2/4CPUs of a 2,81 GHz Intel Core i7
FileVault: Off

Average CPU .03600000000000000000 %
  No repro with 10000000
REPRO FAILURE

@ricmatsuitile
Copy link

@EugenMayer No problem, I'm getting close but I think I'm going to need help.

Just so we can actually understand, what resources did you assign docker for mac?
What CPU / resources did you apply, could you past your table below?

CPUs: 4 out of 8
Memory: 4 GB out 16

screen shot 2017-11-11 at 10 38 50 pm

You might want to remove the .ruby-version file

Yeah that make sense, I'll remove it. It is probably safe since the problem is likely in the container and the environment and not in the Ruby code.

Beside that, i cannot reproduce the issue and i would really love to know why that is the case in general

Interesting, it repros pretty consistently between the 1K lines case and the 1M lines case for me, so I'll continue testing to see more if resource changes the behavior.

probably if your home-folder is encrypted

My home folder is encrypted:
FileVault: On

I'll try out those changes and script and see what the results are, thanks.

Result: NOT reproduceable

So just to confirm, the repro.sh script finishes with REPRO FAILURE and FAIL is written to the line-repros file, right?

@EugenMayer
Copy link
Owner

EugenMayer commented Nov 12, 2017

@ricmatsui just post your table using my syntax so we can easily compare :) and yes, see my last code box, thats my result

@ricmatsuitile
Copy link

A couple iterations, going to now do reboots and flush script:

Result: Reproduceable
MacOS: Sierra
d4m: stable Version 17.09.0-ce-mac35 (19611)
d4m resources:: Laptop, 4GB (DDR3), 4/8CPUs of a 2,5 GHz Intel Core i7
FileVault: On
Flushing: Default
Fresh reboot: No

REPRO SUCCESS with line count 100000 and sleep 1 second

Result: Reproduceable
MacOS: Sierra
d4m: stable Version 17.09.0-ce-mac35 (19611)
d4m resources:: Laptop, 4GB (DDR3), 2/8CPUs of a 2,5 GHz Intel Core i7
FileVault: On
Flushing: Default
Fresh reboot: No

    Average CPU 93.72600000000000000000 %
REPRO SUCCESS with line count 100000 and sleep 1 second

Result: Reproduceable
MacOS: Sierra
d4m: stable Version 17.09.0-ce-mac35 (19611)
d4m resources:: Laptop, 9.2GB (DDR3), 2/8CPUs of a 2,5 GHz Intel Core i7
FileVault: On
Flushing: Default
Fresh reboot: No

    Average CPU 85.73400000000000000000 %
REPRO SUCCESS with line count 100000 and sleep 1 second

@ricmatsui
Copy link
Contributor Author

Result: Reproduceable
MacOS: Sierra
d4m: stable Version 17.09.0-ce-mac35 (19611)
d4m resources:: Laptop, 9.2GB (DDR3), 2/8CPUs of a 2,5 GHz Intel Core i7
FileVault: On
Flushing: Default
Fresh reboot: Yes

    Average CPU 89.58400000000000000000 %
REPRO SUCCESS with line count 100000 and sleep 1 second

@ricmatsui
Copy link
Contributor Author

Result: SOMETIMES reproduceable - First time container did not reproduce and had I/O R/W at 0B / 0B, two other times it did reproduce and I/O R/W was non-zero.
MacOS: Sierra
d4m: stable Version 17.09.0-ce-mac35 (19611)
d4m resources:: Laptop, 9.2GB (DDR3), 2/8CPUs of a 2,5 GHz Intel Core i7
FileVault: On
Flushing: Using d4m_fix.sh
Fresh reboot: No

    Average CPU .03000000000000000000 %
  No repro with 10000000
REPRO FAILURE
    Average CPU 100.57600000000000000000 %
REPRO SUCCESS with line count 100000 and sleep 1 second
docker-sync-cpu-repro$ cat ~/Library/Containers/com.docker.docker/Data/database/com.docker.driver.amd64-linux/disk/full-sync-on-flush
falsedocker-sync-cpu-repro$

@ricmatsui
Copy link
Contributor Author

Result: Reproduceable
MacOS: Sierra
d4m: stable Version 17.09.0-ce-mac35 (19611)
d4m resources:: Laptop, 9.2GB (DDR3), 2/8CPUs of a 2,5 GHz Intel Core i7
FileVault: On
Flushing: Using d4m_fix.sh
Fresh reboot: Yes

    Average CPU 91.30200000000000000000 %
REPRO SUCCESS with line count 100000 and sleep 1 second
docker-sync-cpu-repro$ cat ~/Library/Containers/com.docker.docker/Data/database/com.docker.driver.amd64-linux/disk/full-sync-on-flush
falsedocker-sync-cpu-repro$ 

Those are my tables, only had one case where it did not repro which seemed strange since I/O was zero which may be another bug. All other cases did repro.

@EugenMayer
Copy link
Owner

EugenMayer commented Nov 12, 2017

Result: NOT Reproduceable
MacOS: Sierra
d4m: stable Version 17.09.0-ce-mac35 (19611)
d4m resources:: Laptop, 9.2GB (DDR3), 2/4CPUs of a 2,81 GHz Intel Core i7
FileVault: Off
Flushing: Using d4m_fix.sh
Fresh reboot: No

I am using a hight limit this time even

./repro.sh 1000000000
    Inspect #3
    Inspect #4
    Inspect #5
    Average CPU .05000000000000000000 %
  No repro with 100000000
REPRO FAILURE

@michaelbaudino
Copy link
Contributor

Hi there, huge thanks for the effort @ricmatsui 🎩 ✨

Just a note about flushing tweak: I don't know where d4m_fix.sh comes from, but iiuc it changes the value of full-sync-on-flush. If I'm right, and iiuc again, be aware that full-sync-on-flush has been deprecated and replaced by on-sync (as mentioned in this discussion, probably taking advantage of this Mirage feature) starting from Docker For Mac 17.03 (I'm not very sure about the version, though: I can't find a reliable source 😕).

I added values for both full-sync-on-flush and on-flush in my results below, just in case 🤷‍♂️ (since both are set on my system, but maybe it's just legacy config hanging around 🤔)

Here are my results from multiple successive runs (noteworthy changes are in bold):

Metric Value
MacOS Sierra 10.12.6
Hardware MacBook Pro 2015 (2.2GHz Intel Core i7)
d4m stable Version 17.09.0-ce-mac35 (19611)
d4m CPUs 4
d4m RAM 6.0GB (out of 16GB)
FileVault On
Flushing full-sync-on-flush = false
on-sync = os
Fresh reboot No
Result REPRO FAILURE

Metric Value
MacOS Sierra 10.12.6
Hardware MacBook Pro 2015 (2.2GHz Intel Core i7)
d4m stable Version 17.09.0-ce-mac35 (19611)
d4m CPUs 4
d4m RAM 6.0GB (out of 16GB)
FileVault On
Flushing full-sync-on-flush = false
on-sync = os
Fresh reboot No
Result REPRO FAILURE

♻️ SYSTEM REBOOT ♻️


Metric Value
MacOS Sierra 10.12.6
Hardware MacBook Pro 2015 (2.2GHz Intel Core i7)
d4m stable Version 17.09.0-ce-mac35 (19611)
d4m CPUs 4
d4m RAM 6.0GB (out of 16GB)
FileVault On
Flushing full-sync-on-flush = false
on-sync = os
Fresh reboot Yes
Result REPRO FAILURE

Metric Value
MacOS Sierra 10.12.6
Hardware MacBook Pro 2015 (2.2GHz Intel Core i7)
d4m stable Version 17.09.0-ce-mac35 (19611)
d4m CPUs 4
d4m RAM 6.0GB (out of 16GB)
FileVault On
Flushing full-sync-on-flush = false
on-sync = os
Fresh reboot No
Result REPRO FAILURE

🐳 DOCKER RESTART 🐳


Metric Value
MacOS Sierra 10.12.6
Hardware MacBook Pro 2015 (2.2GHz Intel Core i7)
d4m stable Version 17.09.0-ce-mac35 (19611)
d4m CPUs 1
d4m RAM 4GB (out of 16GB)
FileVault On
Flushing full-sync-on-flush = false
on-sync = os
Fresh reboot No
Result REPRO FAILURE

Metric Value
MacOS Sierra 10.12.6
Hardware MacBook Pro 2015 (2.2GHz Intel Core i7)
d4m stable Version 17.09.0-ce-mac35 (19611)
d4m CPUs 1
d4m RAM 6.0GB (out of 16GB
FileVault On
Flushing full-sync-on-flush = false
on-sync = os
Fresh reboot No
Result REPRO FAILURE

🐳 DOCKER RESTART 🐳


Metric Value
MacOS Sierra 10.12.6
Hardware MacBook Pro 2015 (2.2GHz Intel Core i7)
d4m stable Version 17.09.0-ce-mac35 (19611)
d4m CPUs 1
d4m RAM 6.0GB (out of 16GB
FileVault On
Flushing full-sync-on-flush = false
on-sync = drive
Fresh reboot No
Result REPRO FAILURE

🐳 DOCKER RESTART 🐳


Metric Value
MacOS Sierra 10.12.6
Hardware MacBook Pro 2015 (2.2GHz Intel Core i7)
d4m stable Version 17.09.0-ce-mac35 (19611)
d4m CPUs 1
d4m RAM 6.0GB (out of 16GB
FileVault On
Flushing full-sync-on-flush = false
on-sync = none
Fresh reboot No
Result REPRO FAILURE

@ricmatsui
Copy link
Contributor Author

Thanks @michaelbaudino for helping out, it seems to imply that FileVault may not be the issue, which was one of our hypotheses.

@EugenMayer In the meantime, I have a workaround which is to automatically restart unison if it fails a CPU usage health check using monit https://mmonit.com/monit/ . Right now I have it as 2 measurements of over 50% CPU usage in 10 seconds. Let me know if there is interest in getting this in a PR, and I can set that up and make it more configurable:

This is it in verbose mode:
asciicast

ricmatsui/docker-image-unison@c22e63e

And I published an image as: https://hub.docker.com/r/ricmatsui/unison/tags/

version: '2'

syncs:
  reprocpu_sync:
    src: '.'
    image: ricmatsui/unison:hostsync_0.2
    sync_strategy: native_osx
    sync_prefer: src

@EugenMayer
Copy link
Owner

hehe, thats a nice one. I am fairly familiar with monit so i fully aware about that being the right tool - since we have supervisor, this even makes more sense, since we have an easy restart.
I am very interested in this, could you create a pull request?

This is awesome work!

@EugenMayer
Copy link
Owner

If you create that pull request, i would have 2 suggestions:

  1. let it be possible to disable the check, we leave it one by default - but if we run into false positives and people are anoyed, they can turn it of

  2. let us tune the time and amount of cycles using ENV vars. Not sure monit supports ENV vars in the configuration, but if not, we could just write the values into the config on bootstrap, either simple heredoc or something else?

Overall i really much like the idea, we could later add more metrics too it

@EugenMayer
Copy link
Owner

@ricmatsui any news on your work - i would really love to incorporate your work!

@ricmatsui
Copy link
Contributor Author

@EugenMayer Thanks! Haven't had time lately, will try to get more time for this possibly this weekend hopefully! Definitely I agree with the dynamic options, will need to figure out a good way to do the monit config since like you said I'm not sure if ENV vars are supported in the config.

@EugenMayer
Copy link
Owner

Awesome news @ricmatsui thanks for keeping the drive!

@Krzysiaczek-at-theFoundry

So far I was working with version 0.45 and shortly 0.46 as there was some permission problem to update occasionally. I've noticed lack of sync with 0.46 but it was enough to reset the OS to go back to normal. Yesterday docker-sync was updated to 0.52 and things got much worse.

It took 3 times longer to "find changes" and after this was finished the very high CPU usage had remained. I had to restart the Docker but even after this docker-sync wasn't updating the content in the target containers.

Reset of OS did help only for one sync container but not for the other. I was planning to find a way to roll back to version 0.45 this morning but today the other sync seems to work.

So, in general, it seems to be very inconsistent and not reliable anymore.

@EugenMayer
Copy link
Owner

@Krzysiaczek-at-theFoundry do you consider this a d4m or a docker-sync issue? i cannot see who we could potentially make this a docker-sync issue.

Docker-sync, to most parts, is stateless. If you clean and run again, there is litterally no state left. So if you problem gets fixed magically after some reboots or days, it just cannot be related to a docker-sync run from clean.

@Krzysiaczek-at-theFoundry
Copy link

I think this is related to change of docker-sync from 0.45 to 0.46 and later to 0.52. I run the same version of d4m for a while and docker-sync is the only thing which has been updated/changed recently. The problems have been reported from other developers in my team as well.

As far as I remember the previous version didn't make this "scan for changes" which happens now.

There was only initial build which took 2-6 minutes and everything was good and quick. Now, this additional "look for changes" took 27 minutes and another time 12.

Your library seems to be the only cure for the slowness of d4m. Now when you library suffers those problems my managers start to question usability of Docker in general :(

@EugenMayer
Copy link
Owner

Well docker under OSX is PITA - not docker in general.

Be aware, that a lot people using 0.5, us included, daily, ever single second, on huge projects, without issues. So its not as easy as going its generally not working - right now, its working for far more people then its actually not.

I know that this does not help you - but we really try to find out this "very little" difference causing d4m to be entirely useless and unreliable. And right now all pointers are pointing to one layer deeper then docker-sync, and that starting from d4m, to xhyve, to macOS filesystem or encryption - we simply do not know.

Maybe its even a 3rd party tool or something else.

@Krzysiaczek-at-theFoundry
Copy link

Actually, this does work for me right now as well. But I just try to avoid stopping and restarting anything (which is really awkward for the Docker environment).

I've started this last week and put my Mac to sleep only and observe if there are entries in the log like this one ocassionally:

Sync Log: UNISON 2.48.4 started propagating changes at 14:55:20.92 on 11 Dec 2017

as I've started with docker-stack-sync start this time.

Could you please add to the wiki clear instruction how to downgrade to 0.45 just in case?

@EugenMayer
Copy link
Owner

Downgrading to 0.4.5 will change literally nothing, you are, to extend of 99%, hunting a ghost here.

@Krzysiaczek-at-theFoundry

Surprise, surprise. Today I had to update Mac and restart the OS. After this operation and running docker-sync it asked me to update to 0.5.2 (again). I thought that it has happened already. It showed up that I am still on 0.4.6.

This time message was also saying that installation was successful and I need to restart docker-sync. I was not sure how to do so effectively I restarted the whole OS one more time but guess what, I am still on 0.4.6.

Other than this I've updated d4m and now it runs
Version 17.09.1-ce-mac42 (21090)
Channel: stable 3176a6af01

As service-sync containers have been built before the start took no time, there was no Looking for Changes/Update message like previously and first Sync Log message came up instantly.

@vincentpazeller
Copy link

Just coming for news... could @ricmatsui workaround be merged?

Thank you for your work!

@ricmatsui
Copy link
Contributor Author

Thanks @vincentpazeller I'm still working on it. I'm seeing some race conditions on startup which can trigger some errors intermittently so I want to fix those first.

@agate
Copy link

agate commented Jan 12, 2018

@ricmatsui I suggest that we can have an option that allow the other user to set the heavy cpu usage detect delay time. Your original idea was 10s. But it's not a good number for everyone. For example, our repo is a very big repo. I already ignored all the big dirs .git and tmp and of course node_modules... But a fully sync still need more than 1minute. So I feel that option will be useful 🍷

@salvamomo
Copy link

Thanks for your work on this @ricmatsui, and the rest, you're amazing. I'll give the repro.sh script a go as soon as possible, since I think I'm hitting the exact same issue.

@ricmatsui
Copy link
Contributor Author

ricmatsui commented Jan 22, 2018

@EugenMayer Alright, finally got a couple PRs open which adds monit and three options (disable, interval, and cycles). Let me know, thanks:

#521
EugenMayer/docker-image-unison#11

(This doesn't help with the issue with Docker 17.12, which I am seeing events not propagating as has been pointed out)

@ricmatsui
Copy link
Contributor Author

@agate Yes, I have that option now, using a combination of interval and cycles which is how monit works: interval * cycles

This check also does not run on the initial sync that happens when the volume is created. That can take as long as it needs to. So it would trigger a false positive for example if you did a checkout of a branch which takes more than 10 seconds to sync the difference.

@grayside
Copy link

grayside commented Mar 6, 2018

Sorry, wrong thread :)

@mschering
Copy link

Why is this one closed? I encounter it a lot too. Are you not looking into a fix for the real problem?
Installing monit adds a another complexity for the dev environment that shouldn't be necessary. I hope this can be solved.

@EugenMayer
Copy link
Owner

@mschering feel free to contribute a fix - anytime. There are good reasons nobody is working on the root cause - because it is outside docker-sync.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants