Generate sitemap.xml #4348

jgable · 2014-10-28T02:51:58Z

Keeps an in-memory url node lookup; has some basic in place updating but always regenerates the whole xml string
Ties into Bookshelf model created, updated, destroyed events for triggering re-generation
Has absolutely no unit tests.... yet.

Spent a couple hours on this tonight and wanted to get some feedback before getting too much more invested.

ErisDS · 2014-10-29T19:01:20Z

I'm super excited by this PR 👍 The event scaffold is something of a glimpse of the future I think :)

Two questions I have are 1) what do we actually need to output - do we need to add a homepage URL, and what about tag/author pages etc and 2) is it more performant to keep the XML in memory as it is now, or to write a file... should this be a default only or configurable?

To aid with review by others, here's what it outputs for my local test blog:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://localhost:2368/cupcakes/</loc>
<lastmod>2014-09-24</lastmod>
</url>
<url>
<loc>http://localhost:2368/bacon/</loc>
<lastmod>2014-09-23</lastmod>
</url>
<url>
<loc>http://localhost:2368/welcome-to-ghost/</loc>
<lastmod>2014-09-25</lastmod>
</url>
</urlset>

nuclearpengy · 2014-10-30T10:43:09Z

I'm quite excited about this. :-D

My 0.5c

All pages and posts sounds good to me. :-)
- It might be worthwhile getting the opinion of someone who understands the SEO impact. I'm not sure how this impacts content being marked as duplicate or whether that is even something to be concerned about at this stage.
With a small blog(like my excuse for one), I guess keeping it in memory is safe, I'm not sure what the effect or impact of this would be on a really large blog with thousands of posts. I think serving up a static file seems like it would only result in server load and memory usage to generate the file after which it just sits there waiting to be indexed. This might be such a negligible issue though, I don't know what the overhead would be.

jgable · 2014-10-30T19:53:44Z

@ErisDS I'm just directly poking into the sitemap manager thingy right now without an over arching eventing system. Do you think it's worth doing a centralized eventing system as part of this PR? Didn't we already have something that did this before?

I got the urlset structure from that Sitemaps.org site that I believe @halfdan posted in the other thread. I'm open to other fields, for now just wanted to get the minimum I can in. It was kind of a bit of digging to get the url for each post, but now I think I can also figure out the user and tags links too.

Eventually, we might want to just automatically do the Sitemap index (list of all sitemaps with < 50,000 links) file route even with only one sitemap so that it will scale for both large and small sites with no extra overhead.

@nuclearpengy

I think we'll need to do pages/posts, a user page for each user, and a tags page for each tag?
The other thread mentioned that even with 50,000 links the memory footprint for the object to hold that is about 1Mb. I think that's reasonable, but storing it to disk is definitely an option. I just didn't want to go down that route until I had to.

sebgie · 2014-10-31T15:05:24Z

I think this is a pretty impressive start for this feature 👍.

The only thing that I think is missing are cache invalidation headers for the sitemap.xml file?

I would like to make two proposals for the future that are not necessary for this PR but further improvements:

As @jgable already mentioned a centralized eventing system would be great. The post model is not the right place to do XML-RPC pings and sitemap generation. There will be more features like this in the future and an eventing system would help us to keep the already complex models clean.
I would be in favor of writing the sitemap.xml file to the disc and serving it as a static asset. I know that it isn't a big win, but every MB we could save is good.

ErisDS · 2014-10-31T15:12:36Z

I think for this PR:

the eventing system is great as-is, but it is a clear (and very exciting) example of where we want to get to- so as @sebgie says, we can do further work to centralise the eventing and solve some other problems.
in-memory is ok, I'd certainly prefer to serve from disk if possible, but it's an improvement we can do later
defaulting to an index so that we can handle more than 50k urls is probably a really good idea, again it can be an improvemnt
I think at the very least, the sitemap needs to include the homepage link, as well as a list of all posts

I'm all for doing the minimal on this PR and then pushing optimisations later :)

ErisDS · 2014-10-31T16:19:04Z

Ok so... I just went back and read the issue #623 and it contains really specific instructions for the output (and mentions specifically saving in the filesystem). I totally missed the instructions on the separate sitemaps.

JohnONolan · 2014-10-31T16:21:52Z

I think we can probably get away with serving from memory initially if that's easier - but it's definitely something that should be in the FS. The output structure is the important bit to get right from the get-go.

jgable · 2014-10-31T17:49:55Z

Alright, so gathering up all the feedback it looks like what's left is

Leave eventing system out for now
Use an index approach and generate individual xml files for posts, pages, authors and tags
Write sitemap.xml and each posts.xml, authors.xml to disk/storage and serve as static assets
Write some unit/integration tests

Only question I have is

What fields should definitely be included? Are we okay with loc and lastmod for now (I can also update the lastmod to be a full timestamp)?

nuclearpengy · 2014-10-31T18:47:50Z

Sweet! This is great!
@jgable thank you for this effort, it's really appreciated. :-)

With regards to what to include, I found the snippet below on: http://www.sitemaps.org/protocol.html

I'm not sure how one would go about calculating 'priority' or 'changefreq' I'm guessing most posts are not updated regularly, I know mine aren't.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
</urlset>

I personally like the idea of including the time. I'm not sure if it has any additional relevance in terms of how a page is treated by a search engine though.

Here is an example, used in the WP Google XML sitemap plugin

    <url>
        <loc>http://example.com</loc>
        <lastmod>2014-02-28T23:49:47+00:00</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.2</priority>
    </url>

sebgie · 2014-10-31T19:25:40Z

@jgable cache invalidation is missing from your list. Creating a new and modifying an existing post (page, tag, author) should invalidate the XML file.

ErisDS · 2014-11-01T15:34:47Z

The example @JohnONolan has asked that we work from is the one from Yoast's SEO Plugin for WP. The expectation is, I believe, that we do our best to match as closely as possible.

I've taken a thorough look through what that does, and we will need to include changefreq and priority as well as loc and lastmod, plus wherever we can, we should include a URL for the cover image (posts, pages, users and tags all have this & #4317 makes it easy to get an absolute URL).

Here's a full example, with some notes on how to get each piece of data

<url>
    <loc>http://localhost:2368/my-first-post</loc>
    <lastmod>2014-10-30T13:56:44+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
    <image:image>
        <image:loc>http://localhost:2368/content/images/2014/10/my-first-image.png</image:loc>
        <image:caption>my-first-image.png</image:caption>
    </image:image>
</url>

Last mod

Needs to be a full ISO8601 date string, i.e. moment(post.published_at).toISOString()

Image

Image location is an absolute URL
Image caption defaults to the filename. (can we do something better here @JohnONolan?)

Changefreq

Possible values are: always, hourly, daily, weekly, monthly, yearly, never.

As far as I can figure, weekly should be the default, but archive pages or pages that contain lists of posts (tag pages, author pages, and the home page) should be daily.

Priority

Possible values are 0.0 - 1.0, with 0.8 being the default.

As far as I can figure, the homepage should be 1.0, individual posts and pages should be 0.8 and the archive/listing pages should be 0.6.

So there is no hairy calculations to do, it's all just based on what type of URL you're adding.

nuclearpengy · 2014-11-01T16:04:45Z

Well I just learnt something new. =)

ErisDS · 2014-11-03T15:10:27Z

@jgable #4317 is now merged, which should make adding the images pretty straight forward. Here's an updated version of the list you posted, focusing on the important parts to get this shippable :)

Use an index approach and generate individual xml files for posts, pages, authors and tags
Provide loc, lastmod, priority, changefreq, and if available, image
Leave eventing system out for now
Leave this in memory for now
Make sure the sitemaps get served with sensible cache headers (1 day?), and wire up cache invalidation
Write some unit/integration tests

jgable · 2014-11-04T23:28:36Z

Just made my move today to the left coast, should have some time tonight to dig back in and make progress on this.

JohnONolan · 2014-11-05T07:44:51Z

🎉

jgable · 2014-11-11T17:01:00Z

Sorry about the delay. Have had no internet this week and had a bunch of boxes to unpack so the progress has been sparse.

ErisDS · 2014-11-11T23:09:28Z

Thanks for the pre-meeting update, much appreciated ;)

jgable · 2014-11-17T03:32:58Z

I did some work on this tonight and made some progress:

Now serving a /sitemap_index.xml file that lists 4 different sitemap files; /sitemap-pages.xml, /sitemap-posts.xml, /sitemap-authors.xml and /sitemap-tags.xml
- I didn't do the paging yet; I just powered through with them separated for now
Now showing all attributes properly loc, lastmod, priority, changefreq, image

Things left to do:

Cache invalidation
Paging for resources with more than 50,000 urls
Unit and Integration tests

jgable · 2014-11-23T19:27:48Z

I added some basic integration tests (actually they look more like unit tests) and rebased on latest master. I'd like to get some end to end tests of generating the xml files and validating there contents after updates/deletes next.

I'd like to punt on paging for over 50,000 urls since this is taking me so long to get done.

@sebgie I'm struggling with the cache invalidation; it's my understanding that these sitemap xmls are not for general public consumption, but rather to be crawled by bots. I guess I'm just not really sure how cache invalidation works in general.

ErisDS · 2014-11-25T14:32:44Z

@jgable This is all looking awesome, I've had a good read through the code and played with it a bit, and it's all seeming like it's nearly ready to go. More than happy to punt the 50,000 urls thing for now, especially as we're already splitting the possible URLs across multiple files.

I've got a couple of small things of note:

sitemap.xml needs to be a 301 for sitemap_index.xml
although the example is inconsistent, I think the filename format should use - everywhere - so sitemap-index.xml rather than having it different for the resources (I didn't find anything in the standards that said it has to be one way or another)
potential bug: I created a new static page, and it landed in sitemap-posts.xml not sitemap-pages.xml, after a restart it was in the right place

With regard to cache invalidation, I think this is relatively straightforward. Although we're serving out of memory, it still makes sense to provide for these files being cached by any cache in front of Ghost - I'd suggest giving them the 1 hour cache rule. By updating the list of invalidation routes here to include /*.xml/, we can make sure that caches using the X-Cache-Invalidate header will clear the cache on demand.

All that aside I'm super excited about this feature as I think it's a great example of how we could refactor other bits of Ghost :)

jgable · 2014-11-29T20:00:04Z

Updated;

Rebased on latest
Redirect sitemap.xml to sitemap-index.xml
Use consistent - instead of sitemap_index.xml
Fixes for page/unpage and draft/publish changes not updating properly
Couple more unit tests
Really basic existence functional checks for front end sitemap routes

I can squash and rebase once I get a review of what I've got so far.

core/server/api/index.js

@@ -81,7 +81,7 @@ cacheInvalidationHeader = function (req, result) {

            // Don't set x-cache-invalidate header for drafts
            if (hasStatusChanged || wasDeleted || wasPublishedUpdated) {
-                cacheInvalidate = '/, /page/*, /rss/, /rss/*, /tag/*, /author/*';
+                cacheInvalidate = '/, /page/*, /rss/, /rss/*, /tag/*, /author/*, /sitemap-*.xml';


ErisDS · 2014-11-30T12:33:35Z

This all seems to be working really, really well 😀

There is one small thing missing - the cache-control headers for the responses. There is a utils module that provides the different max-ages:

For 301 redirects, we usually use utils.ONE_YEAR_S: https://github.com/TryGhost/Ghost/blob/master/core/server/routes/frontend.js#L15
And for the sitemaps themselves, I'd probably set ONE_HOUR_S?

Other than that, I think this is ready to ship?

Closes TryGhost#623 - Add basic init and eventing scaffold - Add sitemap-index.xml generation - Broke out generators to individual files, added request handler - Add page, author and tag xml files; add index mapping - Add SiteMapManager unit tests - Add Generators tests - Cache invalidation headers for sitemap-*.xml - Redirect sitemap.xml to index and rename to sitemap-index - Handle page convert and publish/draft changes - Add very basic functional test for route existence - Add cache headers to sitemap routes

jgable · 2014-11-30T18:59:16Z

Alrighty, I've added the Cache-Control headers plus squashed and rebased on master.

ErisDS · 2014-11-30T19:02:32Z

👯 🎈 🎊 💃 🎵 🎉

Generate sitemap.xml

refs TryGhost#623, TryGhost#4348 - this fixes sitemaps to list all posts, pages, tags and users - makes the API behave consistently across all paginated resources

fixes TryGhost#5104, refs TryGhost#4348, TryGhost#2263 - Create a centralised event module - Hook it up for posts, pages, tags and users - Use it in sitemaps instead of direct method calls - Use it for xmlrpc calls - Check events are fired in model tests - Update sitemap tests to work with new code - Fix a bug where invited users were appearing in sitemaps - Move sitemaps and xmlrpc into a directory together

jgable force-pushed the sitemaps branch from acb2079 to f4aae9e Compare November 17, 2014 03:25

jgable force-pushed the sitemaps branch from f4aae9e to c29fb00 Compare November 23, 2014 19:20

jgable force-pushed the sitemaps branch 2 times, most recently from 3c3cbcf to 4c996ed Compare November 29, 2014 19:57

jgable changed the title ~~[WIP] Generate sitemap.xml~~ Generate sitemap.xml Nov 29, 2014

ErisDS reviewed Nov 30, 2014
View reviewed changes

jgable force-pushed the sitemaps branch from 4c996ed to 2cfa184 Compare November 30, 2014 18:53

ErisDS added a commit that referenced this pull request Dec 1, 2014

Merge pull request #4348 from jgable/sitemaps

d2d9e4a

Generate sitemap.xml

ErisDS merged commit d2d9e4a into TryGhost:master Dec 1, 2014

ErisDS mentioned this pull request Dec 1, 2014

Added basic sitemap functionality #3774

Closed

jgable deleted the sitemaps branch December 2, 2014 19:35

ErisDS mentioned this pull request Dec 3, 2014

Add limit=all consistently to users, posts & tags #4573

Merged

ErisDS mentioned this pull request Mar 24, 2015

Refactor sitemaps to use centralised events #5069

Merged

snyk-bot mentioned this pull request Apr 22, 2021

[Snyk] Upgrade knex from 0.9.0 to 0.95.4 fooiskandar/Ghost#1

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generate sitemap.xml #4348

Generate sitemap.xml #4348

jgable commented Oct 28, 2014

ErisDS commented Oct 29, 2014

nuclearpengy commented Oct 30, 2014

jgable commented Oct 30, 2014

sebgie commented Oct 31, 2014

ErisDS commented Oct 31, 2014

ErisDS commented Oct 31, 2014

JohnONolan commented Oct 31, 2014

jgable commented Oct 31, 2014

nuclearpengy commented Oct 31, 2014

sebgie commented Oct 31, 2014

ErisDS commented Nov 1, 2014

nuclearpengy commented Nov 1, 2014

ErisDS commented Nov 3, 2014

jgable commented Nov 4, 2014

JohnONolan commented Nov 5, 2014

jgable commented Nov 11, 2014

ErisDS commented Nov 11, 2014

jgable commented Nov 17, 2014

jgable commented Nov 23, 2014

ErisDS commented Nov 25, 2014

jgable commented Nov 29, 2014

This comment was marked as abuse.

ErisDS commented Nov 30, 2014

jgable commented Nov 30, 2014

ErisDS commented Nov 30, 2014

Generate sitemap.xml #4348

Generate sitemap.xml #4348

Conversation

jgable commented Oct 28, 2014

ErisDS commented Oct 29, 2014

nuclearpengy commented Oct 30, 2014

jgable commented Oct 30, 2014

sebgie commented Oct 31, 2014

ErisDS commented Oct 31, 2014

ErisDS commented Oct 31, 2014

JohnONolan commented Oct 31, 2014

jgable commented Oct 31, 2014

nuclearpengy commented Oct 31, 2014

sebgie commented Oct 31, 2014

ErisDS commented Nov 1, 2014

Last mod

Image

Changefreq

Priority

nuclearpengy commented Nov 1, 2014

ErisDS commented Nov 3, 2014

jgable commented Nov 4, 2014

JohnONolan commented Nov 5, 2014

jgable commented Nov 11, 2014

ErisDS commented Nov 11, 2014

jgable commented Nov 17, 2014

jgable commented Nov 23, 2014

ErisDS commented Nov 25, 2014

jgable commented Nov 29, 2014

This comment was marked as abuse.

ErisDS commented Nov 30, 2014

jgable commented Nov 30, 2014

ErisDS commented Nov 30, 2014