Critical: Using an external scheduler (cron) on AWS EC2 instances causes MASSIVE disk and cpu use. #5513

Closed
larssn opened this issue Nov 4, 2016 · 32 comments
Labels
[Package] Sync, [Pri] High, [Status] Needs Author Reply, [Type] Bug
Comments


larssn commented Nov 4, 2016

Last Thursday, our entire cluster was brought to its knees, including all the WordPress sites we host.

We identified Jetpack as being indirectly responsible.

Before I get into that, a little background on the relevant part of our setup.

So we're hosted on Amazon, and we use their Elastic File System (EFS) to host our WordPress files. This allows multiple instances to share the same WordPress files, and we don't have to worry about instances having different files. The EFS has burstable performance, so if you need a sudden throughput of several GB/sec of data, it can do that. When it does that, you use what are called Burst Credits. For us, it's pretty important that these don't run out.

We don't use WP Cron, as we require a reliable job scheduler, that runs at the right times. For this we use Cavalcade, which makes sure that the same job doesn't run on multiple instances. A job it does really well.

October 24, we had a credit balance of 32.4TB, and then something happened - not sure if Jetpack was updated, or the stars had the wrong alignment, but 3 days later our burst credits hit zero.
Image: EFS dying when credits hit 0

We've since determined that Jetpack apparently has sync jobs running nonstop, which use a lot of disk IO and also a lot of CPU (at zero load, our instance CPUs averaged 21%). We're seeing the jobs jetpack_sync_cron and jetpack_sync_full_cron running, a lot. Like every minute... And they take more than a minute to finish, so we ended up seeing 3 of each of these jobs, per site (in our multisite setup), running constantly! This is crazy stuff!

Quickfix

If you're reading this, and are in a similar situation, modify sync/class.jetpack-sync-actions.php:

static function sync_allowed() {
    // Quickfix: short-circuit so that Jetpack never considers sync allowed.
    return false;
    /* Original check, disabled:
    return ( ! Jetpack_Sync_Settings::get_setting( 'disable' ) && Jetpack::is_active() && ! ( Jetpack::is_development_mode() || Jetpack::is_staging_site() ) )
        || defined( 'PHPUNIT_JETPACK_TESTSUITE' );
    */
}

We also changed the job that runs every minute to run once per hour. Whether that actually does anything with the above change in place, we don't know. We don't really care at this point. Our main concern is our credits staying up. With the above change, the credits stay at a nice even balance.
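A possibly less invasive variant of the quickfix, sketched under assumptions: the original check reads `Jetpack_Sync_Settings::get_setting( 'disable' )`, and later in this thread `Jetpack_Sync_Settings::update_settings()` is shown accepting an array of settings. Assuming the `'disable'` key can be written the same way (not confirmed in this thread), sync could be switched off without editing the plugin file. The function name `example_5513_disable_jetpack_sync` is hypothetical.

```php
<?php
/**
 * Sketch: turn Jetpack sync off through its settings class instead of
 * patching sync_allowed(). ASSUMPTION: the 'disable' key that
 * sync_allowed() reads can be written via update_settings().
 *
 * @return bool True if the setting was written, false if Jetpack is not loaded.
 */
function example_5513_disable_jetpack_sync() {
    if ( ! class_exists( 'Jetpack_Sync_Settings' ) ) {
        return false; // Jetpack (or WordPress) is not loaded; do nothing.
    }
    Jetpack_Sync_Settings::update_settings( array( 'disable' => 1 ) );
    return true;
}
```

This would be called once per site, for example from `wp shell`. Unlike the file edit, it should survive plugin updates, but whether it also stops the cron events from being scheduled is untested here.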

We're very dependent on Jetpack, and we've always considered it a good plugin. We still do; this is IT, these things happen.

We have to ask though, what can be so important, that you must run these jobs pretty much back to back?
If you absolutely must run these sync jobs, then they either require a redesign, or a higher running interval, or a filter, so we, at least, can change the interval ourselves.

Thanks for reading.

TLDR; Jetpack sync jobs + Amazon EFS + external job runner = bad.

jeherve added the [Type] Bug, [Package] Sync, and [Pri] High labels Nov 7, 2016
Member

jeherve commented Nov 7, 2016

Sorry for all the trouble!

> We're seeing the jobs jetpack_sync_cron and jetpack_sync_full_cron running, a lot. Like every minute

They do indeed run every minute by default.

> what can be so important, that you must run these jobs pretty much back to back?

This synchronization allows your data to be synchronized back with WordPress.com, and consequently gives us more reliable / up to date data to be used with modules relying on WordPress.com, like Publicize, Subscriptions, Related Posts, Stats, and the site management tools on WordPress.com.

> If you absolutely must run these sync jobs, then they either require a redesign, or a higher running interval, or a filter, so we, at least, can change the interval ourselves.

We've made a lot of changes to Sync in the past few weeks to address issues like those you've experienced, so it might be worth giving our current Alpha version a try. It's available here:
https://github.com/Automattic/jetpack/archive/master-stable.zip

If you'd rather not use a development version of Jetpack on your production sites, you can use the jetpack_sync_incremental_sync_interval and jetpack_sync_full_sync_interval filters to change the frequency of the synchronization. The 2 filters are available in the current version of Jetpack (4.3.2). Here is how you could change the overall frequency to 10 minutes, for example:

function jp_support_5513_sync_schedule( $schedules ) {
    if ( ! isset( $schedules['10min'] ) ) {
        $schedules['10min'] = array(
            'interval' => 10 * MINUTE_IN_SECONDS,
            'display' => __( 'Every 10 minutes' ),
        );
    }
    return $schedules;
}
add_filter( 'cron_schedules', 'jp_support_5513_sync_schedule' );

function __return_jp_support_5513_10_min() {
    return '10min';
}
add_filter( 'jetpack_sync_incremental_sync_interval', '__return_jp_support_5513_10_min' );
add_filter( 'jetpack_sync_full_sync_interval', '__return_jp_support_5513_10_min' );

If you get the chance to test the development version of Jetpack, let us know how it goes!

Author

larssn commented Nov 7, 2016

Thanks for replying, and don't worry about it. We realise these things happen.

We normally don't run alphas in production, but thought we'd give it a go.

Results
Our nodes immediately saw a big spike in CPU (>50%), and our cluster started scaling.

So we had to revert back to our previous solution.

Neither we, nor our customers use wordpress.com for anything atm. So having a job this heavy, doing a task we don't need, is just unwanted. And tbh, I think we'll just turn it off entirely.

Does this functionality benefit your customers, or you? Because if the answer is the latter, then the right course of action is a redesign. Maybe it just runs unfortunately on our setup, hard for us to say.

Still think nothing could be that urgent, that you'd need to run this job back to back. If it truly is, then a non-php solution might be preferred.

Anyway, thanks for the quick response!

Member

jeherve commented Nov 7, 2016

Thanks for giving that a try!

> Neither we, nor our customers use wordpress.com for anything atm.
> I think we'll just turn it off entirely.

> Does this functionality benefit your customers, or you?

If your customers do not use any of the modules that rely on WordPress.com, you could activate Jetpack's Development mode.

If, however, some of your customers use Jetpack features like Subscriptions or Publicize, completely disabling sync will be problematic, as posts will stop being sent to their subscribers, or posted to their connected Social Networks.
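For reference, a minimal sketch of switching on Development mode. It assumes the usual approach of defining Jetpack's `JETPACK_DEV_DEBUG` constant in `wp-config.php`; the thread itself does not spell out the mechanism, so check it against the Jetpack documentation for your version.

```php
<?php
// wp-config.php fragment (sketch): put Jetpack into Development mode.
// ASSUMPTION: JETPACK_DEV_DEBUG is the constant Jetpack checks for this.
if ( ! defined( 'JETPACK_DEV_DEBUG' ) ) {
    define( 'JETPACK_DEV_DEBUG', true );
}
```

Features that need a WordPress.com connection are unavailable in this mode, so this only fits sites that use none of them.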

> Maybe it just runs unfortunately on our setup, hard for us to say.

That's most likely the case here, but since you gave us specific details about your setup we should be able to look into this, understand what happens, and find a way to fix things. However, in order to be able to debug the problem, we would need a few examples of the site URLs affected by the problem so we can check our logs and try to understand why sync is so slow. Could you post a few examples here, or send them to us via this contact form?

Thanks!

Author

larssn commented Nov 7, 2016

If you need more intimate details of our cluster, let me know.

Our customers are mainly small businesses and restaurants, and none of those use Subscriptions/Publicize. We'll check out development mode.

Here are a few sites running on this setup:
https://www.cmiile.com
https://www.yesushi.dk
https://www.chinawokhouse.dk

What exactly can you gauge from having these?

EDIT: Dev mode is a no-go, as it disables Photon.

EDIT2: Tried the filters from your first reply. Monitored the CPU and the running cron jobs (via WP Crontrol), it seems to have zero effect: The jobs seem to still chain, and actually ignore the fact that they should now only run every 30 min (we set it to 30). So the effect is the same - high CPU.

We also saw this last week when we were scrambling to fix the issue. Initially we hardcoded your 1min job to 3600 sec instead, and it had zero effect. Only the quickfix above had an effect.

And thus we're back at the quickfix.
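To put numbers on the chaining described above, a small helper can count pending events per hook. This is a sketch: `example_5513_count_cron_events` is a hypothetical name, and the input is assumed to have the shape WordPress's `_get_cron_array()` returns (timestamp => hook => key => event details). With an external runner such as Cavalcade the jobs may live in the runner's own storage, so whether this array reflects them is something to verify.

```php
<?php
/**
 * Count pending cron events per hook, for hooks starting with a prefix.
 *
 * @param array  $crons  Cron array: timestamp => hook => key => event details.
 * @param string $prefix Hook-name prefix to match.
 * @return array Map of hook name => number of pending events.
 */
function example_5513_count_cron_events( array $crons, $prefix = 'jetpack_sync_' ) {
    $counts = array();
    foreach ( $crons as $hooks ) {
        if ( ! is_array( $hooks ) ) {
            continue; // Skip non-array entries such as a stored 'version' marker.
        }
        foreach ( $hooks as $hook => $events ) {
            if ( strpos( (string) $hook, $prefix ) !== 0 ) {
                continue; // Not one of the hooks we are interested in.
            }
            if ( ! isset( $counts[ $hook ] ) ) {
                $counts[ $hook ] = 0;
            }
            $counts[ $hook ] += is_array( $events ) ? count( $events ) : 0;
        }
    }
    return $counts;
}

// Inside WordPress (for example in `wp shell`), inspect the live schedule.
if ( function_exists( '_get_cron_array' ) ) {
    print_r( example_5513_count_cron_events( (array) _get_cron_array() ) );
}
```

Running it a few minutes apart would show whether the count per hook keeps growing after an interval change.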

Contributor

lezama commented Nov 7, 2016

@larssn thanks for the detailed report.

We are working on a PR (#5528) that adds the possibility to completely opt out from using cron for sync purposes.

It still needs some testing so don't try it on production yet 😅

Author

larssn commented Nov 7, 2016

Exciting stuff. Thanks for the attention, means a lot to us. :)

jeherve added this to the 4.4 milestone Nov 7, 2016

tillkruss commented Nov 8, 2016

Same issue here using Cavalcade as the cronjob runner. WordPress is hosted on Heroku with DISABLE_WP_CRON set to true.

Running a cronjob every minute would be okay, but the jetpack_sync_cron and jetpack_sync_full_cron cronjobs are multiplying over time and Cavalcade is ending up running dozens of cronjobs simultaneously.

Image: screen shot 2016-11-07 at 1 33 38 pm

We disabled the jetpack_sync_* cronjobs using the filter below until this is resolved.

add_filter('schedule_event', function ($event) {
    return strpos($event->hook, 'jetpack_sync_') === 0 ? false : $event;
});

Author

larssn commented Nov 8, 2016

Here's Amazon's analysis of our 3 day crisis:

> We looked into your file system, and found that during the 3-day window you mentioned, your file system was driving ~600 NFS open operations per second, ~600 NFS close operations per second, ~600 NFS access operations per second, and ~1,700 NFS getattr operations per second. These operations collectively generated metadata throughput of more than ~10-11 MB/s, or ~35-40 GB/hour, which is the level we see on your CloudWatch chart.
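The figures in Amazon's analysis can be cross-checked with a little arithmetic. The sketch below only uses the numbers quoted above and assumes decimal units (1 GB = 1000 MB); the function name is hypothetical.

```php
<?php
// Operations per second, as quoted in Amazon's analysis.
$ops_per_second = array(
    'open'    => 600,
    'close'   => 600,
    'access'  => 600,
    'getattr' => 1700,
);
$total_ops = array_sum( $ops_per_second ); // 3500 metadata operations per second

/**
 * Convert a throughput in MB/s to GB/hour (ASSUMPTION: 1 GB = 1000 MB).
 */
function example_5513_mb_per_s_to_gb_per_hour( $mb_per_second ) {
    return $mb_per_second * 3600 / 1000;
}

printf( "Total NFS operations per second: %d\n", $total_ops );
printf(
    "Metadata throughput: %.1f to %.1f GB/hour\n",
    example_5513_mb_per_s_to_gb_per_hour( 10 ),
    example_5513_mb_per_s_to_gb_per_hour( 11 )
);
// Rough average metadata volume per operation, in KB.
printf( "About %.1f KB per operation\n", 10.5 * 1000 / $total_ops );
```

10 to 11 MB/s works out to 36 to 39.6 GB/hour, consistent with the quoted ~35-40 GB/hour, and the roughly 3,500 operations per second average about 3 KB of metadata traffic each.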

Contributor

lezama commented Nov 8, 2016

@larssn, @tillkruss, we just merged #5528, which turns off using cron for sync purposes by default. Cron-based sync was working great on standard setups, but it was causing nightmares on some particular configurations like yours.

We would love to see if it improves the situation for you all. If you could try https://github.com/Automattic/jetpack/archive/master-stable.zip (the built version of what's in master right now) or wait for the first 4.4-beta1 (it's going to be released very soon) that would be a big help to us.

> TLDR; Jetpack sync jobs + Amazon EFS + external job runner = bad.

@larssn I am still curious to know what was causing the misbehaviour on your setup. How could I replicate a similar stack with the same external job runner?

Thanks for all your help and patience.

Author

larssn commented Nov 8, 2016

@lezama Are you familiar with the AWS stack at all? It would make explaining a lot easier.

Contributor

lezama commented Nov 8, 2016

@larssn, basic knowledge, but I am pretty sure @gravityrail has the required knowledge to help me if I don't get something :)

Author

larssn commented Nov 8, 2016

@lezama OK, let's see. I think the bare minimum for a proof of concept would be one EC2 instance (with Ubuntu 16.04.1) with a mounted EFS (which has WP on it; mount instructions are included when it is provisioned). You also need a database, which can be pretty much anything. Basically a standard WP setup, with the only difference being that the WP files are on a network file system.

PHP needs to be compiled with pcntl. I'm sure it's possible to find a precompiled one with that (if they aren't all already).
Cavalcade-Runner comes with a systemd script, which needs to be placed in /lib/systemd/system and point to where the cavalcade-runner executable is.
Finally, WP-CLI.

I think that's the minimum.

Author

larssn commented Nov 10, 2016

@lezama We'll see if we can't squeeze in a test of your update, next week.

Contributor

lezama commented Nov 11, 2016

Great!

jeherve added the [Status] Needs Author Reply label Nov 16, 2016
jeherve modified the milestones: 4.5, 4.4 Nov 16, 2016
Author

larssn commented Nov 18, 2016

We have a busy Friday today, so might not be able to test it until next week.
We'll see how it goes.

Author

larssn commented Nov 22, 2016

@lezama Should we still use https://github.com/Automattic/jetpack/archive/master-stable.zip for our test?

Contributor

lezama commented Nov 22, 2016

@larssn, Yesterday we shipped a new version, you can just upgrade the plugin from the .org repo.

In the end, we do still use cron, but we reduced the number of jobs created and also improved the way we unschedule them.

Please, let me know how it goes.

Author

larssn commented Nov 24, 2016

It didn't go well. Our cluster immediately jumped to about 46% CPU utilization. We let the sync run for 20 min to see if the spike was temporary, but it stayed at around 46% throughout.

And so we've turned it off again.

This was with v4.4.1.

@lezama Did you get a test setup with an EFS up and running?

Contributor

lezama commented Nov 28, 2016

Not yet, it's on our priority list to figure out what's going on here.

@tillkruss

humanmade/Cavalcade#28
humanmade/Cavalcade#29

Contributor

lezama commented Nov 28, 2016

Thanks for the links @tillkruss

@larssn have you tried @dd32's patch?

It is possible to completely disable Jetpack cron usage, via wp shell doing:

Jetpack_Sync_Settings::update_settings( array( 'sync_via_cron' => 0 ) );

Member

dd32 commented Nov 29, 2016

Just to follow up on this, since I ran into it..

Honestly, this is 100% Cavalcade's problem and not a Jetpack issue. Although Jetpack triggered it, a bunch of other plugins can trigger it too (I don't have a list handy). dd32/Cavalcade@fbd23d2 is a good temporary patch, but it has shortcomings and Cavalcade needs fixing via humanmade/Cavalcade#29

Author

larssn commented Nov 29, 2016

Thanks for investigating.

EDIT:
So we've tried dd32's patch. Unfortunately it made no difference. A proper fix for Cavalcade might be required.

Member

enejb commented Jan 6, 2017

@larssn Does #5879 help with this issue?
It is not merged into master.

Contributor

ebinnion commented Jan 6, 2017

#5879 is now merged into master.

As a note, we also merged #5996, which aims to limit how long jobs run. This change has helped in a couple of cases where we were overloading servers when syncing.

Author

larssn commented Jan 7, 2017

@enejb We have sync turned off for now. When 4.5 is officially released, we'll try again.

Contributor

rmccue commented Jan 10, 2017

As the author of Cavalcade, sorry about this. I also agree that this is a Cavalcade issue, and I think you're fine to close this issue out on the Jetpack end.

Author

larssn commented Jan 10, 2017

@rmccue No worries mate, these things happen.

@lezama
I don't mind this being closed. However, if you are working on other related multisite optimisations, it might be relevant to keep it open?

Your call; we've definitely determined where the main problem is.

Contributor

lezama commented Jan 10, 2017

Thank you all, closing it for the moment.

lezama closed this as completed Jan 10, 2017
Author

larssn commented Feb 7, 2017

Just a follow up:

Using the latest version of Jetpack, the extra CPU is insignificant, maybe 5% extra per node, which is great!

Thank you.


earcos commented Aug 31, 2017

@lezama Is the disable sync option coming anytime soon? Thanks a lot and thank you for all your great work 😄

Contributor

lezama commented Aug 31, 2017

@earcos the option to disable sync via cron has been there for a while 😅

One can set the blog option jetpack_sync_settings_sync_via_cron to 0 and Jetpack should stop using wp-cron in order to sync.
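On a multisite network like the one this issue started with, the option would have to be set per site. A minimal sketch, assuming WordPress 4.6 or later (for `get_sites()`) and using the option name given above; `example_5513_disable_sync_via_cron_everywhere` is a hypothetical helper, not part of Jetpack.

```php
<?php
/**
 * Sketch: set jetpack_sync_settings_sync_via_cron to 0 on every site.
 *
 * @return int Number of sites updated (0 if WordPress is not loaded).
 */
function example_5513_disable_sync_via_cron_everywhere() {
    if ( ! function_exists( 'update_option' ) ) {
        return 0; // WordPress is not loaded; nothing to do.
    }
    if ( ! function_exists( 'is_multisite' ) || ! is_multisite() ) {
        update_option( 'jetpack_sync_settings_sync_via_cron', 0 );
        return 1;
    }
    $updated = 0;
    // get_sites() returns at most 100 sites by default; raise the 'number'
    // argument for larger networks.
    foreach ( get_sites( array( 'fields' => 'ids' ) ) as $site_id ) {
        switch_to_blog( $site_id );
        update_option( 'jetpack_sync_settings_sync_via_cron', 0 );
        restore_current_blog();
        $updated++;
    }
    return $updated;
}
```

It is meant to be run once, for example from `wp shell`, rather than on every request.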

10 participants