Generate sitemap.xml #4348
I'm super excited by this PR 👍 The event scaffold is something of a glimpse of the future I think :) Two questions I have are 1) what do we actually need to output - do we need to add a homepage URL, and what about tag/author pages etc and 2) is it more performant to keep the XML in memory as it is now, or to write a file... should this be a default only or configurable? To aid with review by others, here's what it outputs for my local test blog:
|
I'm quite excited about this. :-D My 0.5c
|
@ErisDS I'm just directly poking into the sitemap manager thingy right now without an overarching eventing system. Do you think it's worth doing a centralized eventing system as part of this PR? Didn't we already have something that did this before? I got the urlset structure from that Sitemaps.org site that I believe @halfdan posted in the other thread. I'm open to other fields; for now I just wanted to get the minimum I can in. It took a bit of digging to get the URL for each post, but now I think I can also figure out the user and tag links too. Eventually, we might want to just automatically do the Sitemap index (list of all sitemaps with < 50,000 links) file route even with only one sitemap, so that it will scale for both large and small sites with no extra overhead.
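
The sitemap index idea above could be sketched roughly as follows. This is only an illustration, not the code in this PR: `buildSitemapIndex`, `chunk` and the `sitemap-N.xml` file names are hypothetical, and the base URL is assumed to contain no characters that need XML escaping.

```javascript
// Assumption: the 50,000-link limit per sitemap file mentioned above.
var MAX_URLS_PER_SITEMAP = 50000;

// Split a list into consecutive slices of at most `size` items.
function chunk(items, size) {
    var chunks = [], i;
    for (i = 0; i < items.length; i += size) {
        chunks.push(items.slice(i, i + size));
    }
    return chunks;
}

// Hypothetical helper: always emit an index, even when there is one file.
function buildSitemapIndex(baseUrl, urls, size) {
    var root = baseUrl.replace(/\/+$/, ''),
        files = chunk(urls, size || MAX_URLS_PER_SITEMAP).map(function (part, i) {
            // File naming is an assumption made for this sketch.
            return {name: 'sitemap-' + (i + 1) + '.xml', urls: part};
        }),
        entries = files.map(function (file) {
            return '  <sitemap><loc>' + root + '/' + file.name + '</loc></sitemap>';
        });

    return {
        files: files,
        xml: [
            '<?xml version="1.0" encoding="UTF-8"?>',
            '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        ].concat(entries, ['</sitemapindex>']).join('\n')
    };
}
```

A small blog gets an index pointing at a single file, and a large one gets as many files as needed, so the route stays the same in both cases.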
|
I think this is a pretty impressive start for this feature 👍. The only thing that I think is missing are cache invalidation headers for the I would like to make two proposals for the future that are not necessary for this PR but further improvements:
|
I think for this PR:
I'm all for doing the minimal on this PR and then pushing optimisations later :) |
Ok so... I just went back and read the issue #623 and it contains really specific instructions for the output (and mentions specifically saving in the filesystem). I totally missed the instructions on the separate sitemaps. |
I think we can probably get away with serving from memory initially if that's easier - but it's definitely something that should be in the FS. The output structure is the important bit to get right from the get-go. |
Alright, so gathering up all the feedback it looks like what's left is
Only question I have is
|
Sweet! This is great! With regards to what to include, I found the snippet below on http://www.sitemaps.org/protocol.html. I'm not sure how one would go about calculating 'priority' or 'changefreq'. I'm guessing most posts are not updated regularly; I know mine aren't.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

I personally like the idea of including the time. I'm not sure if it has any additional relevance in terms of how a page is treated by a search engine, though. Here is an example, used in the WP Google XML sitemap plugin:

<url>
  <loc>http://example.com</loc>
  <lastmod>2014-02-28T23:49:47+00:00</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.2</priority>
</url>
|
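
For illustration, entries like the ones quoted above could be rendered with something along these lines. This is a minimal sketch, not the generator from this PR: `escapeXml`, `urlEntry` and `urlset` are hypothetical names, and the entry shape (`loc`, `lastmod`, `changefreq`, `priority`) is assumed from the snippets.

```javascript
// Hypothetical helper: escape the five XML special characters.
function escapeXml(value) {
    return String(value)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&apos;');
}

// Hypothetical helper: render one <url> element; only <loc> is required.
function urlEntry(entry) {
    var lines = ['  <url>', '    <loc>' + escapeXml(entry.loc) + '</loc>'];
    if (entry.lastmod) {
        // toISOString() always yields a full date-time in UTC.
        lines.push('    <lastmod>' + new Date(entry.lastmod).toISOString() + '</lastmod>');
    }
    if (entry.changefreq) {
        lines.push('    <changefreq>' + escapeXml(entry.changefreq) + '</changefreq>');
    }
    if (typeof entry.priority === 'number') {
        lines.push('    <priority>' + entry.priority.toFixed(1) + '</priority>');
    }
    lines.push('  </url>');
    return lines.join('\n');
}

// Wrap all entries in the <urlset> root element.
function urlset(entries) {
    return [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
        entries.map(urlEntry).join('\n'),
        '</urlset>'
    ].join('\n');
}
```

Escaping `<loc>` matters because post URLs with query strings would otherwise produce invalid XML.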
@jgable cache invalidation is missing from your list. Creating a new and modifying an existing post (page, tag, author) should invalidate the XML file. |
The example @JohnONolan has asked that we work from is the one from Yoast's SEO Plugin for WP. The expectation is, I believe, that we do our best to match as closely as possible. I've taken a thorough look through what that does, and we will need to include Here's a full example, with some notes on how to get each piece of data
Last mod
Needs to be a full ISO8601 date string, i.e.
Image
Image location is an absolute URL
Changefreq
Possible values are: As far as I can figure,
Priority
Possible values are 0.0 - 1.0, with 0.8 being the default. As far as I can figure, the homepage should be 1.0, individual posts and pages should be 0.8 and the archive/listing pages should be 0.6. So there are no hairy calculations to do; it's all just based on what type of URL you're adding. |
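
The priority rule described above boils down to a lookup by URL type. A minimal sketch, assuming hypothetical type names (`home`, `post`, `page`, `tag`, `author`) and treating tag and author pages as the archive/listing pages:

```javascript
// Values taken from the comment above; the type names are assumptions.
var PRIORITY_BY_TYPE = {
    home: 1.0,
    post: 0.8,
    page: 0.8,
    tag: 0.6,    // assumed to count as an archive/listing page
    author: 0.6  // assumed to count as an archive/listing page
};

// 0.8 is the default quoted in the comment above.
var DEFAULT_PRIORITY = 0.8;

// Hypothetical helper: return the priority formatted for a <priority> element.
function priorityFor(type) {
    var known = Object.prototype.hasOwnProperty.call(PRIORITY_BY_TYPE, type),
        value = known ? PRIORITY_BY_TYPE[type] : DEFAULT_PRIORITY;
    return value.toFixed(1);
}
```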
Well I just learnt something new. =) |
@jgable #4317 is now merged, which should make adding the images pretty straightforward. Here's an updated version of the list you posted, focusing on the important parts to get this shippable :)
|
Just made my move today to the left coast, should have some time tonight to dig back in and make progress on this. |
🎉 |
Sorry about the delay. Have had no internet this week and had a bunch of boxes to unpack so the progress has been sparse. |
Thanks for the pre-meeting update, much appreciated ;) |
I did some work on this tonight and made some progress:
Things left to do:
|
I added some basic integration tests (actually they look more like unit tests) and rebased on latest master. I'd like to get some end-to-end tests of generating the xml files and validating their contents after updates/deletes next. I'd like to punt on paging for over 50,000 urls since this is taking me so long to get done. @sebgie I'm struggling with the cache invalidation; it's my understanding that these sitemap xmls are not for general public consumption, but rather to be crawled by bots. I guess I'm just not really sure how cache invalidation works in general. |
@jgable This is all looking awesome, I've had a good read through the code and played with it a bit, and it's all seeming like it's nearly ready to go. More than happy to punt the 50,000 urls thing for now, especially as we're already splitting the possible URLs across multiple files. I've got a couple of small things of note:
With regard to cache invalidation, I think this is relatively straightforward. Although we're serving out of memory, it still makes sense to provide for these files being cached by any cache in front of Ghost - I'd suggest giving them the 1 hour cache rule. By updating the list of invalidation routes here to include All that aside I'm super excited about this feature as I think it's a great example of how we could refactor other bits of Ghost :) |
3c3cbcf to 4c996ed
Updated;
I can squash and rebase once I get a review of what I've got so far. |
@@ -81,7 +81,7 @@ cacheInvalidationHeader = function (req, result) {

     // Don't set x-cache-invalidate header for drafts
     if (hasStatusChanged || wasDeleted || wasPublishedUpdated) {
-        cacheInvalidate = '/, /page/*, /rss/, /rss/*, /tag/*, /author/*';
+        cacheInvalidate = '/, /page/*, /rss/, /rss/*, /tag/*, /author/*, /sitemap-*.xml';
This all seems to be working really, really well 😀 There is one small thing missing - the cache-control headers for the responses. There is a utils module that provides the different max-ages: For 301 redirects, we usually use utils.ONE_YEAR_S: https://github.com/TryGhost/Ghost/blob/master/core/server/routes/frontend.js#L15 Other than that, I think this is ready to ship? |
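
As a rough sketch of the headers being asked for here (not the actual route code): the handlers below use only the plain Node `http.ServerResponse` methods `setHeader` and `end`. `ONE_YEAR_S` is defined locally to stand in for the utils value mentioned above, `ONE_HOUR_S` is an assumed counterpart for the 1 hour rule suggested earlier in the thread, and the handler names and `text/xml` content type are hypothetical.

```javascript
// Local stand-ins for the max-age constants (values in seconds).
var ONE_HOUR_S = 60 * 60;
var ONE_YEAR_S = 365 * 24 * 60 * 60;

// Hypothetical handler: serve a generated sitemap with a 1 hour max-age.
function serveSitemap(res, xml) {
    res.setHeader('Cache-Control', 'public, max-age=' + ONE_HOUR_S);
    res.setHeader('Content-Type', 'text/xml');
    res.end(xml);
}

// Hypothetical handler: permanently redirect /sitemap.xml to the index.
function redirectToIndex(res) {
    res.statusCode = 301;
    res.setHeader('Cache-Control', 'public, max-age=' + ONE_YEAR_S);
    res.setHeader('Location', '/sitemap-index.xml');
    res.end();
}
```

The long max-age on the redirect is safe because the target never changes, while the short one on the XML keeps crawlers reasonably fresh even if an invalidation is missed.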
Closes TryGhost#623
- Add basic init and eventing scaffold
- Add sitemap-index.xml generation
- Broke out generators to individual files, added request handler
- Add page, author and tag xml files; add index mapping
- Add SiteMapManager unit tests
- Add Generators tests
- Cache invalidation headers for sitemap-*.xml
- Redirect sitemap.xml to index and rename to sitemap-index
- Handle page convert and publish/draft changes
- Add very basic functional test for route existence
- Add cache headers to sitemap routes
Alrighty, I've added the Cache-Control headers plus squashed and rebased on master. |
👯 🎈 🎊 💃 🎵 🎉 |
refs TryGhost#623, TryGhost#4348
- this fixes sitemaps to list all posts, pages, tags and users
- makes the API behave consistently across all paginated resources
fixes TryGhost#5104, refs TryGhost#4348, TryGhost#2263
- Create a centralised event module
- Hook it up for posts, pages, tags and users
- Use it in sitemaps instead of direct method calls
- Use it for xmlrpc calls
- Check events are fired in model tests
- Update sitemap tests to work with new code
- Fix a bug where invited users were appearing in sitemaps
- Move sitemaps and xmlrpc into a directory together
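
The move from direct method calls to a centralised event module could look roughly like this. It is a sketch built on Node's `events.EventEmitter`, not the module from that commit: the event names (`post.published`, `post.unpublished`, `post.deleted`), the payload shape and `createSitemapStore` are all hypothetical.

```javascript
var EventEmitter = require('events').EventEmitter;

// Hypothetical sitemap store that reacts to events instead of being
// called directly by the models.
function createSitemapStore(events) {
    var urls = {};

    function add(post) { urls[post.id] = post.url; }
    function remove(post) { delete urls[post.id]; }

    // Event names are assumptions made for this sketch.
    events.on('post.published', add);
    events.on('post.unpublished', remove);
    events.on('post.deleted', remove);

    return {
        list: function () {
            return Object.keys(urls).map(function (id) { return urls[id]; });
        }
    };
}
```

With this shape, the models only need to emit an event such as `events.emit('post.published', {id: 1, url: '/hello/'})`, and other listeners (xmlrpc, for example) can subscribe to the same emitter without the models knowing about them.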
Ref #623
Spent a couple hours on this tonight and wanted to get some feedback before getting too much more invested.