Performance #1

Open
mcshanea opened this issue Oct 8, 2019 · 5 comments

@mcshanea mcshanea commented Oct 8, 2019

One of the main expected problem areas we know we'll need to address is performance of the sitemaps at scale. This task is to research and document how current plugins are handling this problem.

Detailed description

A single XML sitemap has a limit of 50MB (uncompressed) and 50,000 URLs. That means sitemaps for bigger sites need to be split up into multiple smaller ones in order to not exceed the limit.

Also, WordPress can’t load information for 50,000 posts on a single page, as it would be far too slow, so the actual sitemap limit used by WordPress needs to be more reasonable. Yoast SEO and Jetpack currently use 1,000 entries per sitemap and split up posts essentially in ID order. Another popular plugin, Google XML Sitemaps, splits sitemaps up by month, which implies a much lower number of entries per sitemap but a higher number of individual sitemaps. Additionally, in this comment Matthew Boynes suggests looking at the approach taken by the msm-sitemaps plugin.
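
As a rough illustration of the arithmetic involved, here is a minimal sketch in Python (not WordPress code). The 50,000-URL cap comes from the description above; the 1,000-entry page size and the function names are assumptions chosen for the example.

```python
import math
from typing import Iterator, List

MAX_URLS_PER_SITEMAP = 50_000  # protocol limit cited above
PAGE_SIZE = 1_000              # assumed per-page limit, for illustration only


def sitemap_page_count(total_urls: int, page_size: int = PAGE_SIZE) -> int:
    """Return how many sitemap pages are needed for total_urls entries."""
    if not 1 <= page_size <= MAX_URLS_PER_SITEMAP:
        raise ValueError("page_size must be between 1 and 50,000")
    return math.ceil(total_urls / page_size)


def page_slices(sorted_ids: List[int], page_size: int = PAGE_SIZE) -> Iterator[List[int]]:
    """Yield consecutive pages from a list of IDs that is already sorted."""
    for start in range(0, len(sorted_ids), page_size):
        yield sorted_ids[start:start + page_size]
```

With these assumptions, a site with 120,000 URLs needs 120 pages of 1,000 entries rather than 3 pages at the protocol maximum, which trades a longer sitemap index for cheaper individual pages.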

Acceptance Criteria

Research and document the following attributes from, at minimum, the Yoast, Jetpack, MSM-Sitemaps, and Google XML Sitemaps plugin implementations:

  • What's the max number of URLs included in a sitemap page?
  • What's the pagination ordering strategy (e.g., Post by ID, by date, etc.)?
  • How are the queries generated for each sitemap page?
  • What caching is being done for each sitemap page?
  • What observable limitations exist that prevent the implementation from scaling?

Once we've reviewed the current approaches, we'll use that information to create a design document for how we would propose to solve this problem in core.

@mcshanea mcshanea added the Type: Spike label Oct 8, 2019
@mcshanea mcshanea added this to the Sprint 0 milestone Oct 8, 2019
@mcshanea mcshanea added this to To do in Execution via automation Oct 8, 2019
@joemcgill joemcgill self-assigned this Oct 11, 2019
@joemcgill joemcgill moved this from To do to In progress in Execution Oct 11, 2019

@joemcgill joemcgill commented Oct 15, 2019

Another aspect of performance that we need to consider is not just how pagination of each sitemap type is handled, but how the main index gets generated. As explained in https://markjaquith.wordpress.com/2018/01/22/how-i-fixed-yoast-seo-sitemaps-on-a-large-wordpress-site/, if the generation of the sitemap index is dynamic, then it doesn't matter how performant each sitemap page is. We need to make sure that both individual page generation and sitemap index generation are cached in some fashion, and ideally run via a background process.
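
To make the idea concrete, here is a minimal Python sketch of that separation: requests only read stored XML, and a background job is the only thing that builds pages and the index. The class name, the in-memory dict store, and the builder callbacks are all hypothetical; a real implementation would use a persistent store and a scheduler.

```python
from typing import Callable, Dict, Iterable, List, Optional


class SitemapCache:
    """Stores pre-built sitemap XML so web requests never run the builders."""

    def __init__(
        self,
        build_page: Callable[[int], str],
        build_index: Callable[[List[int]], str],
    ) -> None:
        self._build_page = build_page
        self._build_index = build_index
        self._store: Dict[str, str] = {}  # stand-in for a persistent cache

    def regenerate(self, page_numbers: Iterable[int]) -> None:
        """Rebuild every page and then the index; intended for a background job."""
        numbers = list(page_numbers)
        for number in numbers:
            self._store[f"page:{number}"] = self._build_page(number)
        self._store["index"] = self._build_index(sorted(numbers))

    def get_index(self) -> Optional[str]:
        """Return the cached index, or None if it has not been generated yet."""
        return self._store.get("index")

    def get_page(self, number: int) -> Optional[str]:
        """Return a cached page, or None if it has not been generated yet."""
        return self._store.get(f"page:{number}")
```

The request path here never falls back to building on a cache miss; what to serve in that case (an empty sitemap, a 404, a placeholder) is a separate design decision.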


@joemcgill joemcgill commented Oct 16, 2019

I've done some digging into several of the current plugins that provide sitemap functionality and here's a general set of details for what I've found.

===

Jetpack

Max URLs per sitemap: 2,000
Pagination strategy: sequentially by post ID, regardless of type (mixed type)
How are the queries generated: Custom SELECT in Jetpack_Sitemap_Librarian::query_posts_after_id() (non-cached)
How is caching handled: No query caching, sitemap XML is base64 encoded into a custom post.
Observations:

  • Doesn't seem to use WP routing at all (i.e. add_rewrite_rule()); instead it hooks in early, examines the URI, echoes the sitemap file contents, and dies.
  • When a post is trashed, all sitemaps have to be regenerated from that point on.
  • If a sitemap doesn't exist, they print out a temporary XML file.
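
The "after ID" pagination pattern, and why trashing a post shifts every later page, can be sketched as follows. This is a Python/sqlite3 illustration only, not Jetpack's actual query: the `posts` table, its columns, and `build_pages()` are assumptions made for the example, and the function name merely echoes the method mentioned above.

```python
import sqlite3
from typing import List, Tuple


def query_posts_after_id(
    conn: sqlite3.Connection, after_id: int, limit: int
) -> List[Tuple[int, str]]:
    """Fetch the next `limit` published posts with an ID greater than after_id."""
    cursor = conn.execute(
        "SELECT id, post_type FROM posts "
        "WHERE status = 'publish' AND id > ? "
        "ORDER BY id ASC LIMIT ?",
        (after_id, limit),
    )
    return cursor.fetchall()


def build_pages(conn: sqlite3.Connection, page_size: int) -> List[List[int]]:
    """Walk the table in ID order, collecting post IDs into fixed-size pages."""
    pages: List[List[int]] = []
    last_id = 0
    while True:
        batch = query_posts_after_id(conn, last_id, page_size)
        if not batch:
            break
        pages.append([post_id for post_id, _post_type in batch])
        last_id = batch[-1][0]  # resume after the last ID seen
    return pages
```

With seven published posts and a page size of three, the pages hold IDs 1-3, 4-6, and 7. Trashing post 2 pulls post 4 into the first page, so that page and every page after it changes.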

MSM Sitemap

Max URLs per sitemap page: 500 (filterable)
Pagination strategy: Posts by year/date
How are the queries generated: custom query
How is caching handled: Sitemap XML is generated and stored in post_meta
Observations:

  • When the number of posts published on a day exceeds the per-page maximum, no pagination occurs.
  • If there are no posts for a date, it's excluded from the sitemap index.
  • All supported post types are included in the same date indexes.
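
For comparison, here is a small Python sketch of date-based grouping, plus one possible way to split a day that exceeds the per-page maximum. The splitting step is an assumption about how the overflow case could be handled; it is not something MSM Sitemap is described as doing above, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple


def group_by_day(posts: Iterable[Tuple[date, str]]) -> Dict[date, List[str]]:
    """Group (publish_date, url) pairs by day; days without posts never appear."""
    by_day: Dict[date, List[str]] = defaultdict(list)
    for publish_date, url in posts:
        by_day[publish_date].append(url)
    return dict(by_day)


def day_pages(urls: List[str], max_per_page: int = 500) -> List[List[str]]:
    """Split one day's URLs into pages of at most max_per_page entries."""
    return [urls[i:i + max_per_page] for i in range(0, len(urls), max_per_page)]
```

Under this sketch, a day with 1,200 URLs and a 500-entry maximum produces three pages of 500, 500, and 200 entries, while a typical low-volume day stays on a single page.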

Google XML Sitemaps

Max URLs per sitemap: No observable max.
Pagination strategy: Paginated by month, separated into post types.
How are queries generated: custom query in GoogleSitemapGeneratorStandardBuilder
How is caching handled:
Observations:

  • The code is not structured very well generally, e.g., several classes per file, etc.
  • A sitemap page seemed to die when a month archive contained too many posts. I'm not sure whether that will happen in practice, but it is still a concern.

Yoast

Max URLs per sitemap: 1,000
Pagination strategy: sequentially by post ID, separated by type
How are the queries generated: custom query in sitemap provider classes, e.g., WPSEO_Post_Type_Sitemap_Provider::get_posts()
How is caching handled: No caching
Observations:


@joemcgill joemcgill commented Oct 17, 2019

Interestingly, Yoast did add caching functionality and then removed it here: Yoast/wordpress-seo#11279. Would be interesting to better understand what problems people were running into.

@mcshanea mcshanea mentioned this issue Oct 21, 2019
@mcshanea mcshanea moved this from In progress to Review in progress in Execution Oct 29, 2019

@joemcgill joemcgill commented Nov 14, 2019

We've been working on a detailed design document that describes our approach to pagination, along with detailed implementation ideas for a caching mechanism, which may turn out to be required for this to work at scale.

The next steps here are to agree on the approach and refine the document that we want to share as part of a public kickoff of the project in a Make/Core post.

@joemcgill joemcgill modified the milestones: Sprint 0, Sprint 2 Nov 14, 2019
@joemcgill joemcgill moved this from Review in progress to Blocked in Execution Nov 14, 2019

@sybrew sybrew commented Feb 1, 2020

Here are some details of the sitemap in The SEO Framework plugin, in case they're useful:

  • Sitemap bridge bootstrap
    • Notes:
      • Is decoupled from WP Rewrite to improve performance and mitigate instability of rewrite rules flushing.
      • Has filters to allow custom sitemaps to be registered dynamically.
      • Pinging of the sitemap is handled via cronjobs, or (via option) directly when a post is updated/published.
      • The robots.txt entries are added dynamically via the registration arguments.
  • Base sitemap generator
    • Notes:
      • Contains only singular post types, excludes the attachment post type.
      • Iterates over hierarchical and nonhierarchical post types separately.
      • Has caching via transients.
      • Uses PHP 5.5 generators.
      • Relies on another abstract class (see below).
  • Main abstract sitemap class
    • Notes:
      • Has timezone fixes for WP 5.3 and earlier.
      • Maintains options and indexability checks.

Max URLs per sitemap page: 50,000. The queries are split into batches of 3,000 (configurable via an option).
Pagination strategy: None, yet.
How are the queries generated: Two simple custom queries. Nonhierarchical types (posts) are sorted descending by updated time. Hierarchical types (pages) are sorted ascending by published time. Two separate queries are invoked to obtain the blog page and frontpage. Only IDs are fetched, whereafter a generator loops over the post IDs to test for indexation to keep memory consumption low.
How is caching handled: Sitemap is stored in transient caching. It's flushed on post publish/update/delete, permalink structure changes. There's no preemptive cache initiator.
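
The "fetch only IDs, then loop with a generator" idea can be sketched like this. It uses Python generators as an analogue of the PHP generators mentioned above and is not The SEO Framework's code; `fetch_ids`, `is_indexable`, and both function names are hypothetical, and the batch size of 3,000 is taken from the description.

```python
from typing import Callable, Iterator, List


def iter_id_batches(
    fetch_ids: Callable[[int, int], List[int]], batch_size: int = 3000
) -> Iterator[List[int]]:
    """Yield post IDs in batches so only one batch is held in memory at a time."""
    offset = 0
    while True:
        batch = fetch_ids(offset, batch_size)  # assumed to return [] when exhausted
        if not batch:
            return
        yield batch
        offset += len(batch)


def iter_indexable_ids(
    fetch_ids: Callable[[int, int], List[int]],
    is_indexable: Callable[[int], bool],
    batch_size: int = 3000,
) -> Iterator[int]:
    """Lazily yield only the IDs that pass the indexability check."""
    for batch in iter_id_batches(fetch_ids, batch_size):
        for post_id in batch:
            if is_indexable(post_id):
                yield post_id
```

Because both functions are lazy, a caller that stops after reaching the sitemap's URL limit never fetches the remaining batches.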

Here's an example of how we expanded the sitemap functionality:

Side note: Google does not need to see all your posts listed in the sitemap. Once Google knows about a URL, it only needs to be told again when that post is updated. Otherwise, you might as well remove the URL from the sitemap.
