One of the main expected problem areas we know we'll need to address is performance of the sitemaps at scale. This task is to research and document how current plugins are handling this problem.
A single XML sitemap has a limit of 50MB (uncompressed) and 50,000 URLs. That means sitemaps for bigger sites need to be split up into multiple smaller ones in order to not exceed the limit.
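As a rough illustration of the arithmetic, here is a minimal sketch that only accounts for the 50,000-URL limit (the 50MB size limit would need a separate check on the rendered XML). The function names are hypothetical and not taken from any of the plugins discussed.

```python
import math
from typing import List

# Protocol limit on URLs per sitemap file, as cited above.
MAX_URLS_PER_SITEMAP = 50_000


def sitemap_count(total_urls: int, per_sitemap: int = MAX_URLS_PER_SITEMAP) -> int:
    """Return how many sitemap files are needed to hold total_urls entries."""
    if not 1 <= per_sitemap <= MAX_URLS_PER_SITEMAP:
        raise ValueError("per_sitemap must be between 1 and 50,000")
    return math.ceil(total_urls / per_sitemap)


def split_urls(urls: List[str], per_sitemap: int = MAX_URLS_PER_SITEMAP) -> List[List[str]]:
    """Split a flat list of URLs into chunks that each fit in one sitemap."""
    if not 1 <= per_sitemap <= MAX_URLS_PER_SITEMAP:
        raise ValueError("per_sitemap must be between 1 and 50,000")
    return [urls[i:i + per_sitemap] for i in range(0, len(urls), per_sitemap)]
```

For example, a site with 120,000 URLs needs at least three sitemaps at the protocol limit, or 120 at 1,000 entries each.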
Also, WordPress can’t load information for 50,000 posts on a single page, as it would be far too slow, so the per-sitemap limit used in practice needs to be more reasonable. Yoast SEO and Jetpack currently use 1,000 entries per sitemap and split up posts essentially in ID order. Another popular plugin, Google XML Sitemaps, splits sitemaps up by month, which implies a much lower number of entries per sitemap but a higher number of individual sitemaps. Additionally, in this comment Matthew Boynes suggests looking at the approach taken by the msm-sitemaps plugin.
Research and document the following attributes from, at minimum, the Yoast, Jetpack, MSM-Sitemaps, and Google XML Sitemaps plugin implementations:
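To make the trade-off between the two splitting strategies concrete, here is a small sketch comparing them. It is not code from any of the plugins mentioned; the post structure (a dict with `id` and `published` keys) and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple

Post = dict  # assumed shape: {"id": int, "published": date}


def paginate_by_id(posts: List[Post], per_page: int = 1000) -> List[List[Post]]:
    """Fixed-size pages of posts ordered by ID."""
    ordered = sorted(posts, key=lambda p: p["id"])
    return [ordered[i:i + per_page] for i in range(0, len(ordered), per_page)]


def group_by_month(posts: List[Post]) -> Dict[Tuple[int, int], List[Post]]:
    """One group per (year, month) of publication; group sizes vary."""
    groups: Dict[Tuple[int, int], List[Post]] = defaultdict(list)
    for post in posts:
        published = post["published"]
        groups[(published.year, published.month)].append(post)
    return dict(groups)
```

With ID-based pagination the number of sitemaps depends only on the post count, while with month-based grouping it depends on how long the site has been publishing, and a single busy month can still produce a large sitemap.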
Once we've reviewed the current approaches, we'll use that information to create a design document for how we would propose to solve this problem in core.
Another aspect of performance that we need to consider is not just how pagination of each sitemap type is handled, but how the main index gets generated. As explained in https://markjaquith.wordpress.com/2018/01/22/how-i-fixed-yoast-seo-sitemaps-on-a-large-wordpress-site/, if the generation of the sitemap index is dynamic, then it doesn't matter how performant each sitemap page is. We need to make sure that both individual page generation and sitemap index generation are cached in some fashion, and ideally run via a background process.
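One possible shape for this, sketched under the assumption of a simple in-memory cache: requests only ever read pre-rendered XML, and a separate `rebuild()` step (which a background job would call) regenerates the pages and the index together. The class, the `load_urls` callback, and the `https://example.com/sitemap-N.xml` URL pattern are all hypothetical.

```python
from typing import Callable, Dict, List, Optional
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def render_urlset(urls: List[str]) -> str:
    """Render one sitemap page."""
    items = "".join(f"<url><loc>{escape(u)}</loc></url>" for u in urls)
    return f'<urlset xmlns="{SITEMAP_NS}">{items}</urlset>'


def render_index(page_numbers: List[int]) -> str:
    """Render the sitemap index; the page URL pattern is an assumption."""
    items = "".join(
        f"<sitemap><loc>https://example.com/sitemap-{n}.xml</loc></sitemap>"
        for n in page_numbers
    )
    return f'<sitemapindex xmlns="{SITEMAP_NS}">{items}</sitemapindex>'


class SitemapCache:
    def __init__(self, load_urls: Callable[[], List[str]], per_page: int = 1000):
        self._load_urls = load_urls
        self._per_page = per_page
        self._pages: Dict[int, str] = {}
        self._index: Optional[str] = None
        self.rebuild_count = 0

    def rebuild(self) -> None:
        """Regenerate all pages and the index; meant for a background job."""
        urls = self._load_urls()
        pages = {}
        for start in range(0, len(urls), self._per_page):
            page_no = start // self._per_page + 1
            pages[page_no] = render_urlset(urls[start:start + self._per_page])
        self._pages = pages
        self._index = render_index(sorted(pages))
        self.rebuild_count += 1

    def get_index(self) -> Optional[str]:
        """Serve the cached index; never triggers generation."""
        return self._index

    def get_page(self, page_no: int) -> Optional[str]:
        """Serve a cached page, or None if it does not exist."""
        return self._pages.get(page_no)
```

The point of the design is that neither `get_index()` nor `get_page()` ever runs a query, so request time stays flat regardless of site size. A real implementation would need persistent storage and an invalidation strategy, which this sketch leaves out.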
I've done some digging into several of the current plugins that provide sitemap functionality and here's a general set of details for what I've found.
Max URLs per sitemap: 2,000
Max URLs per sitemap page: 500 (filterable)
Google XML Sitemaps
Max URLs per sitemap: No observable max.
Max URLs per sitemap: 1,000
We've been working on a detailed design document that describes our approach to pagination, along with detailed implementation ideas for a caching mechanism, which may turn out to be required for this to work at scale.
The next steps here are to agree on the approach and refine the document that we want to share as part of a public kickoff of the project in a Make/Core post.
Here are some details of the sitemap in The SEO Framework plugin, in case that is useful:
Max URLs per sitemap page: 50,000. The queries are split into batches of 3,000 (configurable via an option).
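A minimal sketch of what filling one large sitemap page from smaller query batches could look like. This is not The SEO Framework's actual code; `fetch_batch(offset, limit)` is a hypothetical callback standing in for a database query.

```python
from typing import Callable, Iterator, List


def iter_page_urls(
    fetch_batch: Callable[[int, int], List[str]],
    page_size: int = 50_000,
    batch_size: int = 3_000,
) -> Iterator[str]:
    """Yield up to page_size URLs, querying at most batch_size at a time."""
    fetched = 0
    while fetched < page_size:
        # Never ask for more than is still needed to fill the page.
        limit = min(batch_size, page_size - fetched)
        batch = fetch_batch(fetched, limit)
        yield from batch
        fetched += len(batch)
        if len(batch) < limit:
            # A short batch means the source has no more rows.
            break
```

Batching this way keeps each query's result set bounded while still producing a single page at the protocol limit.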
Here's an example of how we expanded the sitemap functionality:
Sidenote: Google does not need to see all your posts listed in the sitemap. Once it knows about a URL, it only needs to be told about it again when the post is updated. Otherwise, you might as well remove the URL from the sitemap.