High CPU usage and syncing stops with native OSX when files are being created and removed #497
That is quite a huge effort, and even a reproduction, thank you a lot @ricmatsui! One interesting fact is that those are very few files. Just so we can actually understand: what resources did you assign to Docker for Mac? You might want to remove the .ruby-version file, since it keeps people without Ruby 2.3.0 from using your repro directly; it works with any Ruby anyway (it's docker-sync only). Besides that, I cannot reproduce the issue, and I would really love to know why that is the case in general. It is obvious that you reduced the case to a very small limit, so this is actually very interesting to understand. Anybody testing this, please write down your OSX version, your d4m version, your d4m resources, and also whether your home folder is encrypted.
Result: NOT reproducible
|
@EugenMayer No problem, I'm getting close but I think I'm going to need help.
CPUs: 4 out of 8
Yeah that makes sense, I'll remove it. It is probably safe since the problem is likely in the container and the environment and not in the Ruby code.
Interesting, it repros pretty consistently between the 1K lines case and the 1M lines case for me, so I'll continue testing to see whether changing resources affects the behavior.
My home folder is encrypted: I'll try out those changes and script and see what the results are, thanks.
So just to confirm, the |
@ricmatsui just post your table using my syntax so we can easily compare :) And yes, see my last code box, that's my result |
A couple of iterations, now going to do reboots and the flush script: Result: Reproducible
Result: Reproducible
Result: Reproducible
|
Result: Reproducible
|
Result: SOMETIMES reproducible. The first time the container did not reproduce and had I/O R/W at 0B / 0B; the two other times it did reproduce and I/O R/W was non-zero.
|
Result: Reproducible
Those are my tables. I only had one case where it did not repro, which seemed strange since I/O was zero, which may be another bug. All other cases did repro. |
Result: NOT reproducible. I am even using a high limit this time.
|
Hi there, huge thanks for the effort @ricmatsui 🎩 ✨ Just a note about flushing tweak: I don't know where I added values for both Here are my results from multiple successive runs (noteworthy changes are in bold):
♻️ SYSTEM REBOOT ♻️
🐳 DOCKER RESTART 🐳
🐳 DOCKER RESTART 🐳
🐳 DOCKER RESTART 🐳
|
Thanks @michaelbaudino for helping out, it seems to imply that FileVault may not be the issue, which was one of our hypotheses. @EugenMayer In the meantime, I have a workaround which is to automatically restart unison if it fails a CPU usage health check using ricmatsui/docker-image-unison@c22e63e And I published an image as: https://hub.docker.com/r/ricmatsui/unison/tags/
|
Hehe, that's a nice one. I am fairly familiar with monit, so I am fully aware that it is the right tool. Since we have supervisor, this makes even more sense, since we have an easy restart. This is awesome work! |
If you create that pull request, I would have 2 suggestions:
Overall I really like the idea; we could later add more metrics to it |
@ricmatsui any news on your work - i would really love to incorporate your work! |
@EugenMayer Thanks! Haven't had time lately, will try to get more time for this possibly this weekend hopefully! Definitely I agree with the dynamic options, will need to figure out a good way to do the monit config since like you said I'm not sure if ENV vars are supported in the config. |
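One possible way around the open question of ENV var support in the monit config is to render the config file from the environment when the container starts. The sketch below is only an illustration, not the config used in ricmatsui's image: the variable names, the defaults, the `supervisorctl` paths, and the `matching "unison"` pattern are all assumptions.

```python
from string import Template
from typing import Mapping

# Hypothetical variable names and defaults; none of them come from the actual image.
DEFAULTS = {
    "UNISON_CHECK_INTERVAL": "5",  # seconds between monit checks
    "UNISON_CHECK_CYCLES": "2",    # consecutive failing checks before a restart
    "UNISON_MAX_CPU": "80",        # CPU usage limit in percent
}

# Assumed monit stanza: restart unison through supervisor when CPU stays too high.
MONIT_TEMPLATE = Template(
    "set daemon $UNISON_CHECK_INTERVAL\n"
    'check process unison matching "unison"\n'
    '  start program = "/usr/bin/supervisorctl start unison"\n'
    '  stop program = "/usr/bin/supervisorctl stop unison"\n'
    "  if cpu > $UNISON_MAX_CPU% for $UNISON_CHECK_CYCLES cycles then restart\n"
)


def render_monit_config(env: Mapping[str, str]) -> str:
    """Fill the template from env, falling back to the defaults above."""
    values = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    return MONIT_TEMPLATE.substitute(values)
```

An entrypoint script could call `render_monit_config(os.environ)` and write the result to the monit config path before starting monit, so the config itself never needs to read ENV vars.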
Awesome news @ricmatsui thanks for keeping the drive! |
So far I was working with version 0.4.5 and briefly 0.4.6, as there was occasionally some permission problem when updating. I noticed a lack of sync with 0.4.6, but restarting the OS was enough to get back to normal. Yesterday docker-sync was updated to 0.5.2 and things got much worse. It took 3 times longer to "find changes", and after this finished the very high CPU usage remained. I had to restart Docker, but even after this docker-sync wasn't updating the content in the target containers. Restarting the OS helped for only one sync container but not for the other. I was planning to find a way to roll back to version 0.4.5 this morning, but today the other sync seems to work. So, in general, it seems to be very inconsistent and not reliable anymore. |
@Krzysiaczek-at-theFoundry do you consider this a d4m or a docker-sync issue? I cannot see how we could potentially make this a docker-sync issue. Docker-sync is, for the most part, stateless. If you clean and run again, there is literally no state left. So if your problem gets fixed magically after some reboots or days, it just cannot be related to a docker-sync run from clean. |
I think this is related to the change of docker-sync from 0.4.5 to 0.4.6 and later to 0.5.2. I have run the same version of d4m for a while, and docker-sync is the only thing which has been updated/changed recently. The problems have been reported by other developers in my team as well. As far as I remember, the previous version didn't do this "scan for changes" which happens now. There was only the initial build, which took 2-6 minutes, and everything was good and quick. Now this additional "look for changes" took 27 minutes, and another time 12. Your library seems to be the only cure for the slowness of d4m. Now that your library suffers from these problems, my managers are starting to question the usability of Docker in general :( |
Well, Docker under OSX is a PITA, not Docker in general. Be aware that a lot of people use 0.5, us included, daily, every single second, on huge projects, without issues. So it's not as simple as saying it's generally not working; right now it's working for far more people than it's not. I know that this does not help you, but we really are trying to find the "very little" difference that causes d4m to be entirely useless and unreliable. And right now all pointers point one layer deeper than docker-sync, starting from d4m, to xhyve, to the macOS filesystem or encryption; we simply do not know. Maybe it's even a third-party tool or something else. |
Actually, this does work for me right now as well. But I just try to avoid stopping and restarting anything (which is really awkward for the Docker environment). I started this last week, only putting my Mac to sleep, and I watch whether entries like this one occasionally appear in the log:
as I've started with docker-sync-stack start this time. Could you please add clear instructions to the wiki on how to downgrade to 0.4.5, just in case? |
Downgrading to 0.4.5 will change literally nothing; you are, to an extent of 99%, hunting a ghost here. |
Surprise, surprise. Today I had to update my Mac and restart the OS. After this operation, when I ran docker-sync, it asked me to update to 0.5.2 (again). I thought that had happened already. It turned out that I am still on 0.4.6. This time the message also said that the installation was successful and that I need to restart docker-sync. I was not sure how to do so, so effectively I restarted the whole OS one more time, but guess what, I am still on 0.4.6. Other than this I've updated d4m and now it runs As the service-sync containers had been built before, the start took no time, there was no Looking for Changes/Update message like previously, and the first Sync Log message came up instantly. |
Just coming for news... could @ricmatsui's workaround be merged? Thank you for your work! |
Thanks @vincentpazeller I'm still working on it. I'm seeing some race conditions on startup which can trigger some errors intermittently so I want to fix those first. |
@ricmatsui I suggest that we have an option that allows users to set the delay before heavy CPU usage is detected. Your original idea was 10s, but that is not a good number for everyone. For example, our repo is a very big repo. I already ignored all the big dirs |
Thanks for your work on this @ricmatsui, and the rest, you're amazing. I'll give the repro.sh script a go as soon as possible, since I think I'm hitting the exact same issue. |
@EugenMayer Alright, finally got a couple PRs open which adds #521 (This doesn't help with the issue with Docker 17.12, which I am seeing events not propagating as has been pointed out) |
@agate Yes, I have that option now, using a combination of interval and cycles which is how This check also does not run on the initial sync that happens when the volume is created. That can take as long as it needs to. So it would trigger a false positive for example if you did a checkout of a branch which takes more than 10 seconds to sync the difference. |
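To make the interval and cycles idea concrete, here is a minimal sketch of such a check, not the actual implementation from the PRs. The function names and the sample values are made up for illustration. The check fires only when CPU usage stays above the limit for a given number of consecutive samples, so the effective detection delay is roughly interval times cycles.

```python
from typing import Iterable


def should_restart(cpu_samples: Iterable[float], threshold: float, cycles: int) -> bool:
    """Return True once CPU stays above threshold for `cycles` consecutive samples."""
    consecutive = 0
    for sample in cpu_samples:
        if sample > threshold:
            consecutive += 1
            if consecutive >= cycles:
                return True
        else:
            # Any sample at or below the limit resets the streak.
            consecutive = 0
    return False


def detection_delay_seconds(interval_seconds: float, cycles: int) -> float:
    """Approximate time of sustained high CPU needed before a restart triggers."""
    return interval_seconds * cycles
```

With a 5 second interval and 2 cycles, the delay comes out at 10 seconds. A large branch checkout that keeps unison busy for longer than that would look the same as a stuck process to this check, which is the false positive described above; raising either value trades detection speed for fewer false restarts.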
Sorry, wrong thread :) |
Why is this one closed? I encounter it a lot too. Are you not looking into a fix for the real problem? |
@mschering feel free to contribute a fix - anytime. There are good reasons nobody is working on the root cause - because it is outside docker-sync. |
Error
unison
which causes this issue, but I'm not sure if docker-sync or docker is related. I am reporting this issue to see if there are others that can help with this as I continue investigating as well.
./repro.sh
Recording: Repro at 2:20
Docker Driver
d4m
Version 17.09.0-ce-mac35 (19611)
Channel: stable
a98b7c1b7c
Sync strategy
native_osx
docker-sync (0.4.6)
eugenmayer/unison:hostsync_0.2
your docker-sync.yml
https://github.com/ricmatsui/docker-sync-cpu-repro/blob/master/docker-sync.yml
OS
OSX 10.12.6
MacBook Pro 15-inch Mid 2015
2.5 GHz Intel Core i7
16 GB 1600 MHz DDR3