
Consider not storing full sitemap XML #110

Open

mjangda opened this issue Feb 14, 2017 · 3 comments

@mjangda (Member) commented Feb 14, 2017

Right now, sitemap XML is generated asynchronously and stored in the database so that sitemaps can be served very quickly. The downside is that any code change that modifies the XML output means all sitemaps need to be regenerated, which can be a very slow, time-consuming process on really large sites with thousands of sitemaps.

We should explore alternate ways to handle this (while maintaining backwards compat with existing actions/filters) and evaluate whether those approaches make sense.

@systemseven commented Feb 17, 2017

Sitemaps have to be regenerated when the template changes; there's no way around that. The trick here is to figure out how to do that efficiently.

Right now the flow is as follows (simplified version):

  • A cron job runs and creates a post for a particular day (sitemaps are technically custom post types)
  • As part of that cron job, SimpleXML is used to create the necessary XML nodes
  • Several queries run to gather all the data we need, and the result is saved as a string in postmeta
  • That saved string (from postmeta) is displayed when a user visits the page
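The flow above can be sketched in plain PHP. This is an illustrative reconstruction, not the plugin's actual code: the `$posts` array stands in for the data the plugin's queries would return, and the final string is what would be saved to postmeta.

```php
<?php
// Sketch of the current approach: build the full sitemap XML with
// SimpleXML, then keep the serialized string around (in the real
// plugin, stored as postmeta on the sitemap post).
$urlset = new SimpleXMLElement(
    '<?xml version="1.0" encoding="UTF-8"?>'
    . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"/>'
);

// Hypothetical post data; the plugin would query this from the database.
$posts = array(
    array(
        'loc'     => 'https://example.com/2017/02/14/hello-world/',
        'lastmod' => '2017-02-14',
    ),
);

foreach ( $posts as $post ) {
    // addChild() without an explicit namespace inherits the parent's
    // namespace, so these land in the sitemap namespace as expected.
    $url = $urlset->addChild( 'url' );
    $url->addChild( 'loc', $post['loc'] );
    $url->addChild( 'lastmod', $post['lastmod'] );
}

$stored_xml = $urlset->asXML(); // this full string is what gets stored today
```

The pain point follows directly: because the stored artifact is the finished XML, any change to how the nodes are built invalidates every stored string at once.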

Here's what I'm proposing:

  • The cron job stays 99% the same and still creates a custom post type. (We could look at moving away from the custom post type entirely and running straight queries in the template I propose below, but that may be beyond the scope of this issue.)
  • BUT we don't create the SimpleXML nodes. Instead, we create a list of the posts (basically just their URLs) that need to be included in that sitemap. We could cheat and make this list the post_content of the sitemap post, which removes the data from postmeta and saves one query versus storing the list as a postmeta value and querying for it.
  • On display, we grab that list, loop over it, and render each entry.
  • For the display, I say we move to a custom post template. We can still support the existing hooks and changes, and it could be wired up to let people create their own template for display if necessary.
  • For the XML rendering: a lot of the tags never change, so we could hard-code them in the template instead of using SimpleXML (e.g. loc, lastmod, etc.). In fact, most of the template could be hard-coded, with hooks for the parts that can change (and/or users can always fall back to the custom template idea).
  • Performance-wise I think we'll net out close to the same, though the method I'm proposing may be a bit more expensive.
  • That said, we could add a check where the sitemap caches itself for XXX amount of time; the cache is served until it expires, and once expired we serve and create a fresh cached version.
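The proposed render-on-request template could look roughly like this. A sketch only, under stated assumptions: the `$entries` shape (a list of URL + lastmod pairs pulled from post_content or postmeta) is hypothetical, and the time-based cache in front of it is omitted since it would just be the plugin's cache layer (e.g. the object cache or a transient) wrapping the call.

```php
<?php
// Sketch of the proposed flow: the sitemap post stores only a list of
// entries; a mostly hard-coded template renders the XML on request
// instead of storing the finished XML.
function render_sitemap( array $entries ) {
    $out  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $out .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";

    foreach ( $entries as $entry ) {
        // The stable tags (url, loc, lastmod) are hard-coded rather
        // than built via SimpleXML; hooks could fire here for the
        // parts that vary per site.
        $out .= "  <url>\n";
        $out .= '    <loc>' . htmlspecialchars( $entry['loc'], ENT_XML1 ) . "</loc>\n";
        if ( isset( $entry['lastmod'] ) ) {
            $out .= '    <lastmod>' . $entry['lastmod'] . "</lastmod>\n";
        }
        $out .= "  </url>\n";
    }

    $out .= '</urlset>';
    return $out;
}

// Hypothetical stored list, e.g. decoded from the sitemap post's content.
$entries = array(
    array( 'loc' => 'https://example.com/a/', 'lastmod' => '2017-02-17' ),
    array( 'loc' => 'https://example.com/b/' ),
);
$xml = render_sitemap( $entries );
```

The design trade-off is the one noted above: rendering per request costs a little more CPU than echoing a stored string, which is why the expiring-cache check in front of `render_sitemap()` matters on large sites.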
@mjangda (Member, Author) commented Feb 17, 2017

> we create a list of posts (basically just their urls) that need to be included in that sitemap

What happens if the URL structure changes? How do we get other related data like the post modified date from just the URL?

> A lot of the tags never change, so we could hard-code them in the template vs using SimpleXML (e.g. loc, lastmod, etc.)

Do we get any major benefits from switching to a hard-coded template? How will we maintain backwards compatibility (e.g. some filters pass in the simplexml object that sites use to add things like images)?

> Performance-wise I think we'll net out close to the same, though the method I'm proposing may be a bit more expensive

This is probably the biggest thing we'll need to watch. Some of the sites using this plugin have millions of posts dating back 5/10/20 years. If the newer method is significantly slower, it may not be worth it, so it would be good to gather and compare some data as we work on this.
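On the backwards-compatibility question about filters that receive the SimpleXML object: one possible (hypothetical) bridge is to render the hard-coded template first, then parse the result back into a `SimpleXMLElement` just long enough to run legacy filters, and re-serialize. The `$legacy_filter` callable below stands in for whatever the plugin currently passes through `apply_filters()`; the `priority` tag added in the example is made up.

```php
<?php
// Hypothetical back-compat shim: keep filters that expect a
// SimpleXMLElement working even though the template is now hard-coded.
function apply_legacy_simplexml_filter( $xml_string, callable $legacy_filter ) {
    $urlset = simplexml_load_string( $xml_string );
    $urlset = $legacy_filter( $urlset ); // legacy code mutates/returns the object
    return $urlset->asXML();
}

// Example legacy filter: append an extra child to each <url> node,
// the way existing filters add things like image tags.
$filtered = apply_legacy_simplexml_filter(
    '<?xml version="1.0"?><urlset><url><loc>https://example.com/</loc></url></urlset>',
    function ( SimpleXMLElement $urlset ) {
        foreach ( $urlset->url as $url ) {
            $url->addChild( 'priority', '0.5' );
        }
        return $urlset;
    }
);
```

The obvious cost is re-parsing the XML we just built, so a sketch like this would only make sense behind a "does this site actually hook those filters" check; otherwise it gives back much of the win from dropping SimpleXML.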

@systemseven commented Feb 20, 2017

Hey Mo,

A few others asked some of the same questions you did on the internal site.

For the URLs, that's just an example; we'll just need to make sure we store the right data, and in the post_content rather than the post_meta.

Good catch about the backwards compatibility on the SimpleXML objects, I'm going to take a look at that.

Thanks

@mjangda referenced this issue on Mar 23, 2017: Issue 110 #116 (Open)
