Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate sitemap.xml #4348

Merged
merged 1 commit into from Dec 1, 2014
Merged

Generate sitemap.xml #4348

merged 1 commit into from Dec 1, 2014

Conversation

jgable
Copy link
Contributor

@jgable jgable commented Oct 28, 2014

Ref #623

  • Keeps an in-memory url node lookup; has some basic in place updating but always regenerates the whole xml string
  • Ties into Bookshelf model created, updated, destroyed events for triggering re-generation
  • Has absolutely no unit tests.... yet.

Spent a couple hours on this tonight and wanted to get some feedback before getting too much more invested.

@ErisDS
Copy link
Member

ErisDS commented Oct 29, 2014

I'm super excited by this PR 👍 The event scaffold is something of a glimpse of the future I think :)

Two questions I have are 1) what do we actually need to output - do we need to add a homepage URL, and what about tag/author pages etc and 2) is it more performant to keep the XML in memory as it is now, or to write a file... should this be a default only or configurable?

To aid with review by others, here's what it outputs for my local test blog:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://localhost:2368/cupcakes/</loc>
<lastmod>2014-09-24</lastmod>
</url>
<url>
<loc>http://localhost:2368/bacon/</loc>
<lastmod>2014-09-23</lastmod>
</url>
<url>
<loc>http://localhost:2368/welcome-to-ghost/</loc>
<lastmod>2014-09-25</lastmod>
</url>
</urlset>

@nuclearpengy
Copy link

I'm quite excited about this. :-D

My 0.5c

  1. All pages and posts sounds good to me. :-)
    • It might be worthwhile getting the opinion of someone who understands the SEO impact. I'm not sure how this impacts content being marked as duplicate or whether that is even something to be concerned about at this stage.
  2. With a small blog(like my excuse for one), I guess keeping it in memory is safe, I'm not sure what the effect or impact of this would be on a really large blog with thousands of posts. I think serving up a static file seems like it would only result in server load and memory usage to generate the file after which it just sits there waiting to be indexed. This might be such a negligible issue though, I don't know what the overhead would be.

@jgable
Copy link
Contributor Author

jgable commented Oct 30, 2014

@ErisDS I'm just directly poking into the sitemap manager thingy right now without an over arching eventing system. Do you think it's worth doing a centralized eventing system as part of this PR? Didn't we already have something that did this before?

I got the urlset structure from that Sitemaps.org site that I believe @halfdan posted in the other thread. I'm open to other fields, for now just wanted to get the minimum I can in. It was kind of a bit of digging to get the url for each post, but now I think I can also figure out the user and tags links too.

Eventually, we might want to just automatically do the Sitemap index (list of all sitemaps with < 50,000 links) file route even with only one sitemap so that it will scale for both large and small sites with no extra overhead.

@nuclearpengy

  1. I think we'll need to do pages/posts, a user page for each user, and a tags page for each tag?
  2. The other thread mentioned that even with 50,000 links the memory footprint for the object to hold that is about 1Mb. I think that's reasonable, but storing it to disk is definitely an option. I just didn't want to go down that route until I had to.

@sebgie
Copy link
Contributor

sebgie commented Oct 31, 2014

I think this is a pretty impressive start for this feature 👍.

The only thing that I think is missing are cache invalidation headers for the sitemap.xml file?

I would like to make two proposals for the future that are not necessary for this PR but further improvements:

  • As @jgable already mentioned a centralized eventing system would be great. The post model is not the right place to do XML-RPC pings and sitemap generation. There will be more features like this in the future and an eventing system would help us to keep the already complex models clean.
  • I would be in favor of writing the sitemap.xml file to the disc and serving it as a static asset. I know that it isn't a big win, but every MB we could save is good.

@ErisDS
Copy link
Member

ErisDS commented Oct 31, 2014

I think for this PR:

  • the eventing system is great as-is, but it is a clear (and very exciting) example of where we want to get to- so as @sebgie says, we can do further work to centralise the eventing and solve some other problems.
  • in-memory is ok, I'd certainly prefer to serve from disk if possible, but it's an improvement we can do later
  • defaulting to an index so that we can handle more than 50k urls is probably a really good idea, again it can be an improvemnt
  • I think at the very least, the sitemap needs to include the homepage link, as well as a list of all posts

I'm all for doing the minimal on this PR and then pushing optimisations later :)

@ErisDS
Copy link
Member

ErisDS commented Oct 31, 2014

Ok so... I just went back and read the issue #623 and it contains really specific instructions for the output (and mentions specifically saving in the filesystem). I totally missed the instructions on the separate sitemaps.

@JohnONolan
Copy link
Member

I think we can probably get away with serving from memory initially if that's easier - but it's definitely something that should be in the FS. The output structure is the important bit to get right from the get-go.

@jgable
Copy link
Contributor Author

jgable commented Oct 31, 2014

Alright, so gathering up all the feedback it looks like what's left is

  • Leave eventing system out for now
  • Use an index approach and generate individual xml files for posts, pages, authors and tags
  • Write sitemap.xml and each posts.xml, authors.xml to disk/storage and serve as static assets
  • Write some unit/integration tests

Only question I have is

  • What fields should definitely be included? Are we okay with loc and lastmod for now (I can also update the lastmod to be a full timestamp)?

@nuclearpengy
Copy link

Sweet! This is great!
@jgable thank you for this effort, it's really appreciated. :-)

With regards to what to include, I found the snippet below on: http://www.sitemaps.org/protocol.html

I'm not sure how one would go about calculating 'priority' or 'changefreq' I'm guessing most posts are not updated regularly, I know mine aren't.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
</urlset> 

I personally like the idea of including the time. I'm not sure if it has any additional relevance in terms of how a page is treated by a search engine though.

Here is an example, used in the WP Google XML sitemap plugin

    <url>
        <loc>http://example.com</loc>
        <lastmod>2014-02-28T23:49:47+00:00</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.2</priority>
    </url>

@sebgie
Copy link
Contributor

sebgie commented Oct 31, 2014

@jgable cache invalidation is missing from your list. Creating a new and modifying an existing post (page, tag, author) should invalidate the XML file.

@ErisDS
Copy link
Member

ErisDS commented Nov 1, 2014

The example @JohnONolan has asked that we work from is the one from Yoast's SEO Plugin for WP. The expectation is, I believe, that we do our best to match as closely as possible.

I've taken a thorough look through what that does, and we will need to include changefreq and priority as well as loc and lastmod, plus wherever we can, we should include a URL for the cover image (posts, pages, users and tags all have this & #4317 makes it easy to get an absolute URL).

Here's a full example, with some notes on how to get each piece of data

<url>
    <loc>http://localhost:2368/my-first-post</loc>
    <lastmod>2014-10-30T13:56:44+00:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
    <image:image>
        <image:loc>http://localhost:2368/content/images/2014/10/my-first-image.png</image:loc>
        <image:caption>my-first-image.png</image:caption>
    </image:image>
</url>

Last mod

Needs to be a full ISO8601 date string, i.e. moment(post.published_at).toISOString()

Image

Image location is an absolute URL
Image caption defaults to the filename. (can we do something better here @JohnONolan?)

Changefreq

Possible values are: always, hourly, daily, weekly, monthly, yearly, never.

As far as I can figure, weekly should be the default, but archive pages or pages that contain lists of posts (tag pages, author pages, and the home page) should be daily.

Priority

Possible values are 0.0 - 1.0, with 0.8 being the default.

As far as I can figure, the homepage should be 1.0, individual posts and pages should be 0.8 and the archive/listing pages should be 0.6.

So there is no hairy calculations to do, it's all just based on what type of URL you're adding.

@nuclearpengy
Copy link

Well I just learnt something new. =)

@ErisDS
Copy link
Member

ErisDS commented Nov 3, 2014

@jgable #4317 is now merged, which should make adding the images pretty straight forward. Here's an updated version of the list you posted, focusing on the important parts to get this shippable :)

  • Use an index approach and generate individual xml files for posts, pages, authors and tags
  • Provide loc, lastmod, priority, changefreq, and if available, image
  • Leave eventing system out for now
  • Leave this in memory for now
  • Make sure the sitemaps get served with sensible cache headers (1 day?), and wire up cache invalidation
  • Write some unit/integration tests

@jgable
Copy link
Contributor Author

jgable commented Nov 4, 2014

Just made my move today to the left coast, should have some time tonight to dig back in and make progress on this.

@JohnONolan
Copy link
Member

🎉

@jgable
Copy link
Contributor Author

jgable commented Nov 11, 2014

Sorry about the delay. Have had no internet this week and had a bunch of boxes to unpack so the progress has been sparse.

@ErisDS
Copy link
Member

ErisDS commented Nov 11, 2014

Thanks for the pre-meeting update, much appreciated ;)

@jgable
Copy link
Contributor Author

jgable commented Nov 17, 2014

I did some work on this tonight and made some progress:

  • Now serving a /sitemap_index.xml file that lists 4 different sitemap files; /sitemap-pages.xml, /sitemap-posts.xml, /sitemap-authors.xml and /sitemap-tags.xml
    • I didn't do the paging yet; I just powered through with them separated for now
  • Now showing all attributes properly loc, lastmod, priority, changefreq, image

Things left to do:

  • Cache invalidation
  • Paging for resources with more than 50,000 urls
  • Unit and Integration tests

@jgable
Copy link
Contributor Author

jgable commented Nov 23, 2014

I added some basic integration tests (actually they look more like unit tests) and rebased on latest master. I'd like to get some end to end tests of generating the xml files and validating there contents after updates/deletes next.

I'd like to punt on paging for over 50,000 urls since this is taking me so long to get done.

@sebgie I'm struggling with the cache invalidation; it's my understanding that these sitemap xmls are not for general public consumption, but rather to be crawled by bots. I guess I'm just not really sure how cache invalidation works in general.

@ErisDS
Copy link
Member

ErisDS commented Nov 25, 2014

@jgable This is all looking awesome, I've had a good read through the code and played with it a bit, and it's all seeming like it's nearly ready to go. More than happy to punt the 50,000 urls thing for now, especially as we're already splitting the possible URLs across multiple files.

I've got a couple of small things of note:

  1. sitemap.xml needs to be a 301 for sitemap_index.xml
  2. although the example is inconsistent, I think the filename format should use - everywhere - so sitemap-index.xml rather than having it different for the resources (I didn't find anything in the standards that said it has to be one way or another)
  3. potential bug: I created a new static page, and it landed in sitemap-posts.xml not sitemap-pages.xml, after a restart it was in the right place

With regard to cache invalidation, I think this is relatively straightforward. Although we're serving out of memory, it still makes sense to provide for these files being cached by any cache in front of Ghost - I'd suggest giving them the 1 hour cache rule. By updating the list of invalidation routes here to include /*.xml/, we can make sure that caches using the X-Cache-Invalidate header will clear the cache on demand.

All that aside I'm super excited about this feature as I think it's a great example of how we could refactor other bits of Ghost :)

@jgable jgable force-pushed the sitemaps branch 2 times, most recently from 3c3cbcf to 4c996ed Compare November 29, 2014 19:57
@jgable jgable changed the title [WIP] Generate sitemap.xml Generate sitemap.xml Nov 29, 2014
@jgable
Copy link
Contributor Author

jgable commented Nov 29, 2014

Updated;

  • Rebased on latest
  • Redirect sitemap.xml to sitemap-index.xml
  • Use consistent - instead of sitemap_index.xml
  • Fixes for page/unpage and draft/publish changes not updating properly
  • Couple more unit tests
  • Really basic existence functional checks for front end sitemap routes

I can squash and rebase once I get a review of what I've got so far.

@@ -81,7 +81,7 @@ cacheInvalidationHeader = function (req, result) {

// Don't set x-cache-invalidate header for drafts
if (hasStatusChanged || wasDeleted || wasPublishedUpdated) {
cacheInvalidate = '/, /page/*, /rss/, /rss/*, /tag/*, /author/*';
cacheInvalidate = '/, /page/*, /rss/, /rss/*, /tag/*, /author/*, /sitemap-*.xml';

This comment was marked as abuse.

@ErisDS
Copy link
Member

ErisDS commented Nov 30, 2014

This all seems to be working really, really well 😀

There is one small thing missing - the cache-control headers for the responses. There is a utils module that provides the different max-ages:

For 301 redirects, we usually use utils.ONE_YEAR_S: https://github.com/TryGhost/Ghost/blob/master/core/server/routes/frontend.js#L15
And for the sitemaps themselves, I'd probably set ONE_HOUR_S?

Other than that, I think this is ready to ship?

Closes TryGhost#623

- Add basic init and eventing scaffold
- Add sitemap-index.xml generation
- Broke out generators to individual files, added request handler
- Add page, author and tag xml files; add index mapping
- Add SiteMapManager unit tests
- Add Generators tests
- Cache invalidation headers for sitemap-*.xml
- Redirect sitemap.xml to index and rename to sitemap-index
- Handle page convert and publish/draft changes
- Add very basic functional test for route existence
- Add cache headers to sitemap routes
@jgable
Copy link
Contributor Author

jgable commented Nov 30, 2014

Alrighty, I've added the Cache-Control headers plus squashed and rebased on master.

@ErisDS
Copy link
Member

ErisDS commented Nov 30, 2014

👯 🎈 🎊 💃 🎵 🎉

ErisDS added a commit that referenced this pull request Dec 1, 2014
@ErisDS ErisDS merged commit d2d9e4a into TryGhost:master Dec 1, 2014
@jgable jgable deleted the sitemaps branch December 2, 2014 19:35
ErisDS added a commit to ErisDS/Ghost that referenced this pull request Dec 3, 2014
refs TryGhost#623, TryGhost#4348

- this fixes sitemaps to list all posts, pages, tags and users
- makes the API behave consistently across all paginated resources
ErisDS added a commit to ErisDS/Ghost that referenced this pull request Apr 5, 2015
fixes TryGhost#5104, refs TryGhost#4348, TryGhost#2263

- Create a centralised event module
- Hook it up for posts, pages, tags and users
- Use it in sitemaps instead of direct method calls
- Use it for xmlrpc calls
- Check events are fired in model tests
- Update sitemap tests to work with new code
- Fix a bug where invited users were appearing in sitemaps
- Move sitemaps and xmlrpc into a directory together
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants