[Feature] Generate sitemap.xml #623

Closed
sebgie opened this Issue Sep 4, 2013 · 36 comments

Contributor

sebgie commented Sep 4, 2013

[Updated 26 Jul 2014 - John]

To support SEO, Ghost should generate a sitemap.xml.

The best format for this is likely to follow the work of Yoast and his SEO plugin for WordPress, which creates separate XML sitemaps for different content formats, brought together in a main XML file and served through an XSL template. This has been widely tested and updated over the years.

Demo: http://marketplace.ghost.org/sitemap.xml

For us the sitemap structure would be:

  • Index
    • Posts
    • Pages (contains home)
    • Tags
    • Authors

These should be generated once and saved in the filesystem. When a post is published or updated it should update the posts sitemap, when a page is published or updated it should update the pages sitemap, when a user is created or updated... and so on.
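
As a rough illustration of the structure listed above, the index could be produced along these lines. This is only a sketch in Python, not Ghost's implementation: the `sitemap-<type>.xml` file names, the `build_sitemap_index` helper and the example dates are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical child sitemap names, mirroring the structure listed above.
CHILD_SITEMAPS = ["posts", "pages", "tags", "authors"]


def build_sitemap_index(base_url, last_modified):
    """Return an index document that points at one child sitemap per content type.

    `last_modified` maps a child name to a date string; children without an
    entry simply get no <lastmod> element.
    """
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for name in CHILD_SITEMAPS:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = f"{base_url}/sitemap-{name}.xml"
        if name in last_modified:
            ET.SubElement(entry, "lastmod").text = last_modified[name]
    return ET.tostring(root, encoding="unicode")


index_xml = build_sitemap_index(
    "http://example.com", {"posts": "2014-07-26", "pages": "2014-07-01"}
)
print(index_xml)
```

Each child sitemap would then be a separate `urlset` document, so updating a post only requires rewriting the posts file and the matching `lastmod` entry in the index.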

Member

JohnONolan commented Sep 4, 2013

+1, but this is a little way off

Contributor

sebgie commented Sep 4, 2013

I know, just wanted to note it down. You can assign it to milestone 0.far_far_away.

Member

halfdan commented Oct 17, 2013

@JohnONolan: Mind if I work on this one?

matteocrippa Oct 19, 2013

Using sitemap.js (https://github.com/ekalinin/sitemap.js), this would probably be quick to do, much like the RSS handler but without pagination.

Member

halfdan commented Oct 19, 2013

I think it's not worth adding a new dependency for this. Sitemaps are easy enough to generate with just a few lines of code.

Member

gotdibbs commented Oct 19, 2013

@halfdan I think for now it's probably better to focus on the issues in milestone 0.4 or "critical" issues as they come up.

Member

ErisDS commented Oct 20, 2013

I don't think this feature can be implemented well until such time as we have a scheduler?

Contributor

sebgie commented Oct 20, 2013

@ErisDS I don't think that we need a scheduler for sitemap.xml. As stated above, it is roughly the same functionality as RSS: the URL /sitemap.xml should return an XML file.

Anyway, there are more than enough open issues for milestone 0.4 that need attention first.

Member

ErisDS commented Nov 4, 2013

Putting this here as well as IRC - I think that a sitemap is significantly different to RSS because RSS only ever creates one page of a small number of posts at a time, whereas a sitemap has to go away and generate for the entire site. The larger the site, the longer it takes; eventually, with large blogs, you have a potential DoS endpoint, especially until we have an API cache.

I'm also concerned about memory usage for generating the sitemap.

It would absolutely have to be heavily cached I think - and I would feel much better if it was generated either on a schedule or when something is updated, similar to using the X-Cache-Invalidation headers.

Contributor

sebgie commented Nov 5, 2013

Following up on the discussion on IRC: the limit of 50,000 links per sitemap was introduced by Google. For sitemaps containing more entries, a sitemap index file is recommended (see https://support.google.com/webmasters/answer/183668?hl=en).

Member

halfdan commented Nov 5, 2013

Member

ErisDS commented Nov 7, 2013

Further discussion on this happened during Tuesday's public IRC meeting (#ghost, freenode, Tuesdays 5:30-7pm London time).

Logs are available here: http://107.20.237.151:8081/logs/%23ghost/20131105 (from 6pm ish)

Summary: it was agreed that generating a sitemap on GET request is not the best solution, but rather that the file will be generated & stored, and regenerated when a post is created/updated/deleted.

Storing a file has an additional difficulty because we currently do not have a general place, although content/images should be fine. This will tie further to the work being done to abstract the filesystem so that we can allow plugins to store files elsewhere/how.
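
The approach agreed in the summary (generate and store the file, then regenerate when content changes) might look roughly like the following. This is a hedged sketch in Python rather than anything from the Ghost codebase: the event names, the `StoredSitemaps` class, the `sitemap-<type>.xml` file names and the `fetch_urls` callback (a stand-in for a database query) are hypothetical.

```python
import tempfile
from pathlib import Path
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical event names mapped to the sitemap file they invalidate.
EVENT_TO_SITEMAP = {
    "post.created": "posts",
    "post.updated": "posts",
    "post.deleted": "posts",
    "page.updated": "pages",
}


class StoredSitemaps:
    """Writes sitemap files to disk and rewrites one only when its content changes."""

    def __init__(self, directory, fetch_urls):
        self.directory = Path(directory)
        self.fetch_urls = fetch_urls  # assumed callback: name -> list of URLs
        self.rebuild_count = 0

    def path_for(self, name):
        return self.directory / f"sitemap-{name}.xml"

    def rebuild(self, name):
        entries = "".join(
            f"<url><loc>{escape(url)}</loc></url>" for url in self.fetch_urls(name)
        )
        document = f'<urlset xmlns="{SITEMAP_NS}">{entries}</urlset>'
        self.path_for(name).write_text(document, encoding="utf-8")
        self.rebuild_count += 1

    def handle_event(self, event):
        # Events that do not affect any sitemap are ignored.
        name = EVENT_TO_SITEMAP.get(event)
        if name is not None:
            self.rebuild(name)


site = {"posts": ["http://example.com/hello/"]}
store = StoredSitemaps(tempfile.mkdtemp(), lambda name: site.get(name, []))
store.handle_event("post.created")
site["posts"].append("http://example.com/second/")
store.handle_event("post.updated")
store.handle_event("comment.added")  # unrelated event, nothing is rewritten
print(store.path_for("posts").read_text(encoding="utf-8"))
```

With this shape, a GET request for the sitemap only has to read a file that already exists, so the cost of generation is paid on writes rather than on every request.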

Member

halfdan commented Nov 7, 2013

@ErisDS Can you please assign this issue to me?

JuanKRuiz Nov 13, 2013

I think there's no need to have a perfect "massive blog enabled sitemap".
Sitemaps improve the discoverability of our blog content in search engines, so this is a very important feature.

Things don't need to be "done" on the first approach. While full "massive sitemap" functionality is being created, I think a basic approach would help lots of users who at this time have no more than 100 blog entries.

Contributor

alicoding commented Nov 19, 2013

+1 on this. I think it is very important to at least have it first.

WingTangWong Nov 19, 2013

+1 for this as well.

iteles commented Nov 22, 2013

Not sure if anything ever came of this?

Member

ErisDS commented Nov 22, 2013

It's on the roadmap for 0.4, which is scheduled for mid December https://github.com/TryGhost/Ghost/wiki/Roadmap

ssx commented Mar 20, 2014

Hi @ErisDS, has this made its way into a version of Ghost? I can't see it anywhere in settings.

Member

halfdan commented Mar 20, 2014

@ssx We rescheduled it to 0.6 - if you want to use it for Webmaster Tools you can also use the RSS feed for now.

jsilton commented Mar 23, 2014

How many Ghost installs have over 50,000 URLs? A short-term solution that would work for the majority of websites could be to generate a sitemap on a GET request. This initial XML sitemap would only need to include the <loc> of each URL so that the site can be fully submitted to search engines.

In the future, running a job to create and store an XML sitemap would be beneficial. That job should also create a sitemap index and child sitemaps if the site has over 50,000 URLs.
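
To make the two stages concrete, here is a small Python sketch of a `<loc>`-only sitemap that falls back to an index plus child sitemaps once the URL count passes the limit. It is an illustration under assumptions, not a proposed patch: `render_site`, the `sitemap-<n>.xml` naming and the `limit` parameter (exposed only so the split can be tried on a short list) are invented for the example.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # the per-file limit discussed in this thread


def chunk(urls, size):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def render_urlset(urls):
    """Render a minimal sitemap that contains only a <loc> for each URL."""
    entries = "".join(f"<url><loc>{escape(url)}</loc></url>" for url in urls)
    return f'<urlset xmlns="{SITEMAP_NS}">{entries}</urlset>'


def render_site(base_url, urls, limit=MAX_URLS_PER_SITEMAP):
    """Return a dict of file name -> XML: one sitemap, or an index plus children."""
    if len(urls) <= limit:
        return {"sitemap.xml": render_urlset(urls)}
    files = {}
    index_entries = []
    for number, part in enumerate(chunk(urls, limit), start=1):
        filename = f"sitemap-{number}.xml"
        files[filename] = render_urlset(part)
        index_entries.append(
            f"<sitemap><loc>{escape(base_url)}/{filename}</loc></sitemap>"
        )
    joined = "".join(index_entries)
    files["sitemap.xml"] = f'<sitemapindex xmlns="{SITEMAP_NS}">{joined}</sitemapindex>'
    return files


urls = [f"http://example.com/post-{n}/" for n in range(5)]
small = render_site("http://example.com", urls)           # fits in one file
split = render_site("http://example.com", urls, limit=2)  # forces an index
print(sorted(split))
```

For most blogs only the first branch would ever run, which is the point being made above: the index only matters once a site grows past the limit.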

vohof commented May 24, 2014

👍

JuanKRuiz May 24, 2014

While this functionality is being developed...

I have created this library and command-line tool:
LinkSpider

I'm currently using it to generate sitemap.xml whenever I create a new post.

With it you can create a sitemap.xml for your Ghost site, or any other site, from the command prompt, like this:

LinkSpiderConsole.exe --u http://yourGhost.com --n /tag/ --m /tag/

The console app is compiled for Windows, but you can compile it to run on other platforms using Mono.

@ErisDS ErisDS modified the milestones: Future, 0.5 Multi-user Jun 17, 2014

Mathachew Jul 15, 2014

A friend of mine just published a blog post on how to achieve a dynamic sitemap until it arrives in 0.6. It's a simple and quick process. I've expanded on his tutorial with my own to increase the number of posts fetched and to include static pages as well. Once 0.6 is out, this method becomes unnecessary. Until then, it's just right.

Member

JohnONolan commented Jul 26, 2014

Updated main issue description. What do you reckon @tstrimple ?

@JohnONolan JohnONolan modified the milestones: Future, 0.5.x Feature Release Jul 31, 2014

tstrimple Aug 13, 2014

There are some problems with implementing this in the ideal way. Ghost needs a generic storage abstraction layer (one that also supports streaming) which can be used for more than just images. I feel like this is all premature optimization anyway. I really don't think the sitemap needs to be cached to the filesystem, especially in the first iteration, but that's listed above in the requirements. We could use a (temporary) in-memory cache as a way to get the feature implemented quickly until more work can be done around the storage system.

If we used an in-memory cache, we could store around 7k unique URLs in around 1MB of memory. The vast majority of blogs (all?) are going to have significantly fewer than that. I'm also happy to look into a more generic storage solution. I had already started that work when I implemented Azure Blob Storage as a method of storing images.

Member

ErisDS commented Aug 13, 2014

@tstrimple You may well be right about the premature optimisation thing, but I'm thinking about those people using Ghost who already have enormous blogs like codinghorror.com. Leaving an endpoint open which generates the sitemap on the fly provides too easy a target for someone to try and bring the blog down in my opinion.

I think an in-memory cache makes more sense than blocking this on the file system abstraction (#2852) improvements though. I'm keen to get that work done sooner rather than later, but would much prefer to do the two things independently. So yeah - let's go with in-memory?

tstrimple added a commit to tstrimple/Ghost that referenced this issue Aug 13, 2014

Added basic sitemap functionality
Closes #623
  * implemented in middleware to avoid trailing slash issue
  * used timer to invalidate cache instead of on post update

tstrimple added a commit to tstrimple/Ghost that referenced this issue Aug 13, 2014

Added basic sitemap functionality
closes #623
  * uses in-memory cache
  * timer to invalidate cache instead of on post update
  * implemented at middleware level to avoid trailing slash issues
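
For readers unfamiliar with the pattern named in this commit message, a time-based in-memory cache can be sketched as below. This is not the code from the commit: it is a Python illustration that uses a time-to-live check on each read instead of a background timer, and the `TimedSitemapCache` class, the `generate` callback and the injectable `clock` are assumptions introduced for the example.

```python
import time


class TimedSitemapCache:
    """Keeps one generated sitemap in memory and refreshes it after a time-to-live."""

    def __init__(self, generate, ttl_seconds, clock=time.monotonic):
        self._generate = generate  # assumed callback that builds the full XML
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be simulated
        self._value = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        # Regenerate only when nothing is cached yet or the entry has expired.
        if self._value is None or now >= self._expires_at:
            self._value = self._generate()
            self._expires_at = now + self._ttl
        return self._value


calls = []
fake_now = [0.0]


def generate():
    calls.append(1)
    return f"<urlset><!-- build {len(calls)} --></urlset>"


cache = TimedSitemapCache(generate, ttl_seconds=60, clock=lambda: fake_now[0])
first = cache.get()    # generated on first use
second = cache.get()   # served from memory, no regeneration
fake_now[0] = 61.0     # pretend more than a minute has passed
third = cache.get()    # entry expired, so it is regenerated
print(len(calls))
```

The trade-off compared with invalidating on post update is that a newly published post may not appear until the entry expires, but repeated requests cannot force repeated generation, which addresses the DoS concern raised earlier in the thread.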

@ErisDS ErisDS changed the title from Generate sitemap.xml to [Feature] Generate sitemap.xml Aug 19, 2014

@ErisDS ErisDS added feature labels Sep 2, 2014

@JohnONolan JohnONolan removed the themes label Sep 12, 2014

MaluNoPeleke Sep 17, 2014

I thought this should be included in the upcoming 0.5.2 release!?

Member

ErisDS commented Sep 17, 2014

@MaluNoPeleke We can't include things in a release if they're not ready yet!?

@ErisDS ErisDS modified the milestones: Current backlog, Next Backlog Oct 21, 2014

ghuntley Oct 24, 2014

Eagerly awaiting :shipit:

nodesocket Nov 3, 2014

Was just googling around for how to generate sitemaps on Ghost and found this PR. We are paying for Ghost hosted; when will this be released?

Member

javorszky commented Nov 3, 2014

@nodesocket when we release 0.5.4. That's at Hannah's discretion. Potentially soon though (as in: this week, or the next). Obviously treat my words as speculation, as Hannah's word is Law.

Contributor

joeldrapper commented Nov 3, 2014

@nodesocket in the meantime, you can submit your RSS feed (yoursite.com/rss) as a "sitemap" in Google Webmaster Tools. It's not ideal, but probably better than nothing for now. In fact, having just done some quick research, it looks like it's recommended to submit your RSS feed in addition to your standard sitemap anyway.

nodesocket Nov 25, 2014

The bummer is that ghost-sitemap requires changes to the Node source code to make the sitemap publicly accessible. Every time you update Ghost, you lose these changes.

@ErisDS ErisDS closed this in 2cfa184 Dec 1, 2014

@ErisDS ErisDS referenced this issue Dec 1, 2014

Closed

Add Sitemap XSL #4555

ErisDS added a commit to ErisDS/Ghost that referenced this issue Dec 3, 2014

Add limit=all consistently to users, posts & tags
refs #623, #4348

- this fixes sitemaps to list all posts, pages, tags and users
- makes the API behave consistently across all paginated resources

nodesocket Dec 3, 2014

@ErisDS Will sitemap support roll out in 0.5.6?

Thanks.
