This repository has been archived by the owner on Sep 14, 2021. It is now read-only.

Implement a performant handling of sitemap pages #39

Closed
svandragt opened this issue Oct 31, 2019 · 5 comments
Labels: Type: Enhancement (enhancement to an existing feature)

Comments

@svandragt (Contributor) commented Oct 31, 2019

Description

A performant, scalable way of assigning posts of all registered content types to sitemap pages, and of processing updates and deletions.

A sitemap page is a sitemap linked from the index, containing a subset of posts. A post is a piece of content of any registered post type.
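
For illustration, a sitemap index referencing two pages might look like this (the URL scheme shown is only an example, not the plugin's final one):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Each <sitemap> entry is one "page": a sitemap holding a subset of posts. -->
  <sitemap>
    <loc>https://example.com/sitemap-posts-1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-posts-2.xml</loc>
  </sitemap>
</sitemapindex>
```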

Which feature is your enhancement request related to?
#31

Describe the solution you'd like
WIP

Acceptance Criteria

  • When a post is updated, it must be added to a page.
  • When a post is deleted, it must be removed from its page.
  • When the feature is activated, all existing posts must be assigned to pages.
  • Each page must only contain posts of a single post type.
  • Each page must contain at most X posts (currently 2,000; the sitemap protocol allows at most 50,000).
  • An index must contain no more than 50k pages.
  • Posts should not carry post meta about the page relationship (that would be slow and add many database rows).
  • Pages may carry post meta (there is only a limited number of pages).
  • The solution must scale to a large number of posts (5m?).
  • The solution must scale to a large number of post types (10?).
@svandragt added the "Type: Enhancement" label Oct 31, 2019
@svandragt self-assigned this Oct 31, 2019
@svandragt changed the title from "Distribution strategy for the post sitemap relationship" to "Distribution strategy for the post <> sitemap-page relationship" Oct 31, 2019
@svandragt (Contributor, Author)

v3

"Supporting 50k/(26*26) ≈ 74 post types. Scales up to 26*26*2000 ≈ 1.3 million posts per post type."

  • md5 of the post type and ID
  • $length = 1 (26 pages; max length is 2)
  • for each post:
    • the page ID is post_type + the first $length characters of $hash
    • assign the post to that page
    • if the page's post count >= 2000 - (26 * $length):
      • $length++
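
For illustration, here is a minimal runnable PHP sketch of this scheme, using an in-memory array as a stand-in for the page store (the function name and the driver loop are hypothetical, not part of the plugin):

```php
<?php
// Sketch of the v3 bucketing idea: derive a sitemap-page ID from the
// post type plus a prefix of the md5 hash of "type:id".
// (Hypothetical helper, not the plugin's actual API.)
function sitemap_page_id( string $post_type, int $post_id, int $length ): string {
	$hash = md5( $post_type . ':' . $post_id );
	return $post_type . '-' . substr( $hash, 0, $length );
}

$pages  = array(); // page ID => list of assigned post IDs (in-memory stand-in)
$length = 1;       // start with a 1-character hash prefix; max length is 2

foreach ( range( 1, 10000 ) as $post_id ) {
	$page_id             = sitemap_page_id( 'post', $post_id, $length );
	$pages[ $page_id ][] = $post_id;

	// Grow the prefix once a page approaches capacity, so that
	// subsequent assignments split into more, smaller buckets.
	if ( count( $pages[ $page_id ] ) >= 2000 - ( 26 * $length ) ) {
		$length++;
	}
}
```

One caveat with the arithmetic above: md5() emits hexadecimal characters, so a one-character prefix yields at most 16 distinct pages per post type rather than 26, and the stated capacities would shift accordingly.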

@svandragt mentioned this issue Oct 31, 2019
@svandragt (Contributor, Author)

Blocked until the technical doc is updated with @joemcgill's and @swissspidy's thoughts.

@joemcgill (Contributor)

Thanks for kicking off this discussion, @svandragt.

If I'm understanding your description above correctly, you're exploring the idea of a hash lookup table: we would automatically create sitemap buckets that evenly distribute a large number of URLs into groups, so that the location of each object can be looked up quickly with a deterministic algorithm (in this case, one based on object type and ID).

This is a really smart solution for fast lookups, but I'm concerned that on sites with many custom post types and/or custom taxonomies we'll end up with a large number of buckets containing artificially low numbers of objects, which could create performance issues when generating the sitemap index.

Ideally, I think we want a solution that optimizes the objects:buckets ratio, packing a large number of objects into the smallest possible number of buckets while still being able to quickly look up which bucket an object is in, so we can update/delete buckets whenever an object within them is updated/deleted.

The simplest solution for looking up an object's bucket would be to save the bucket ID as metadata on the object (e.g., post meta for a post), but as you pointed out in the requirements above, that would lead to a huge increase in meta rows in the database as we add a reference for each object.

If we're storing each bucket as a post of a custom post type, perhaps we can save the maximum and minimum post ID of each bucket in that bucket's post meta, and give each bucket a name that identifies the object type it contains. We could then look up all buckets for a particular object type using a LIKE query on wp_posts.post_name, which is indexed, and loop through the post meta values until we find the bucket whose min/max ID range contains the ID of the object we're modifying. For active buckets (i.e., the newest one being filled), each time we add an object we'll update the max ID in that bucket's post meta to match the ID of the post being published, assuming it's larger than the existing max ID.
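
As a rough sketch of that lookup, assuming WordPress is loaded and that buckets are stored under a hypothetical `sitemap_bucket` post type with illustrative `min_id`/`max_id` meta keys (none of these names are decided):

```php
<?php
// Illustrative lookup: find the sitemap bucket (stored as a post of a
// custom post type) whose min/max post ID range contains $object_id.
// The post type name and meta keys here are assumptions for the sketch.
function find_bucket_for_object( string $object_type, int $object_id ): ?int {
	global $wpdb;

	// Fetch all buckets for this object type via a prefix LIKE on post_name.
	$bucket_ids = $wpdb->get_col(
		$wpdb->prepare(
			"SELECT ID FROM {$wpdb->posts}
			 WHERE post_type = 'sitemap_bucket'
			   AND post_name LIKE %s",
			$wpdb->esc_like( $object_type ) . '%'
		)
	);

	// Loop through each bucket's min/max meta until we find the range
	// that contains the object we're modifying.
	foreach ( $bucket_ids as $bucket_id ) {
		$min = (int) get_post_meta( $bucket_id, 'min_id', true );
		$max = (int) get_post_meta( $bucket_id, 'max_id', true );
		if ( $object_id >= $min && $object_id <= $max ) {
			return (int) $bucket_id;
		}
	}

	return null; // No existing bucket covers this ID.
}
```

The LIKE query stays cheap because the pattern is anchored at the start of post_name, so MySQL can still use the index on that column.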

@joemcgill (Contributor)

I've started on a proof of concept in #64, based on a more fleshed-out description in the (still in progress) design document from #11.

@joemcgill changed the title from "Distribution strategy for the post <> sitemap-page relationship" to "Implement a performant handling of sitemap pages" Nov 14, 2019
@swissspidy (Contributor)

Closing this one, as this optimization is off the table for now.
