Fix purging problem when providing sitemap url. #8

Merged
jone merged 1 commit into master from jone-fix-purging on Nov 3, 2017

@jone
Member

jone commented Nov 3, 2017

Purging is based on the site base URL (the first argument to Site): a document in Solr is attributed to a site when its URL starts with that site base URL.

When it became possible to pass a sitemap URL as the first argument to Site, purging broke, because the indexed document URLs do not start with the sitemap (!) URL.
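The failure can be illustrated with a minimal sketch of prefix-based purge selection. This is not the ftw.crawler implementation: the function name `select_purge_candidates` and the `event-1` / `event-2` document URLs are hypothetical, and the sketch works on plain lists instead of querying Solr.

```python
def select_purge_candidates(indexed_urls, site_base_url, crawled_urls):
    """Return indexed URLs that belong to the site (by URL prefix)
    but were not seen in the latest crawl.

    Hypothetical helper for illustration, not part of ftw.crawler.
    """
    crawled = set(crawled_urls)
    return [
        url for url in indexed_urls
        if url.startswith(site_base_url) and url not in crawled
    ]


# Hypothetical document URLs, for illustration only.
indexed = [
    'https://stadt.winterthur.ch/_vk/cal_1/event-1',
    'https://stadt.winterthur.ch/_vk/cal_1/event-2',
]
crawled = ['https://stadt.winterthur.ch/_vk/cal_1/event-1']

# Base URL as first argument: the stale document is selected.
stale_with_base_url = select_purge_candidates(
    indexed, 'https://stadt.winterthur.ch/_vk/cal_1/', crawled)

# Sitemap URL as first argument: no indexed URL has this prefix,
# so nothing is ever selected for purging.
stale_with_sitemap_url = select_purge_candidates(
    indexed, 'https://stadt.winterthur.ch/_vk/cal_1/sitemap/', crawled)
```

With the base URL, the sketch selects `event-2` as stale; with the sitemap URL, the result is empty and stale documents would stay in the index.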

This change removes the option of passing a sitemap URL directly as the first argument. Instead, the sitemap_urls keyword argument must be used; it takes a list of sitemap URLs.

All existing configurations that pass a sitemap URL explicitly must be updated when upgrading ftw.crawler. For the affected sites, the administrator should purge the Solr index manually and reindex.

Site configuration update

Sitemap URLs must now be listed explicitly, e.g.:

 CONFIG = Config(
     sites=[
-        Site('https://stadt.winterthur.ch/_vk/cal_1/sitemap/',
+        Site('https://stadt.winterthur.ch/_vk/cal_1/',
+             sitemap_urls=['https://stadt.winterthur.ch/_vk/cal_1/sitemap/'],
              attributes={'site_area': 'Stadt Winterthur'}),
     ],
 )
@mbaechtold

mbaechtold approved these changes Nov 3, 2017 edited

👍

Member

jone commented Nov 3, 2017

This will not solve all problems.
When using multiple site/crawler configurations for the same base url (e.g. for "pages", "news", "events"), the purger cannot know which config is responsible for which entry.
So we need to change that again, probably with another breaking change.

But this change is progress, and we need a solution now. Therefore I'm shipping this one.
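The remaining ambiguity can be sketched as follows. This is a hedged illustration, not ftw.crawler code: the helper `select_purge_candidates` and the `example.org` URLs are hypothetical, and the sketch assumes purge candidates are chosen purely by URL prefix, as described in the pull request.

```python
def select_purge_candidates(indexed_urls, site_base_url, crawled_urls):
    """Return indexed URLs under the base URL that the latest crawl
    did not see. Hypothetical helper, not part of ftw.crawler.
    """
    crawled = set(crawled_urls)
    return [
        url for url in indexed_urls
        if url.startswith(site_base_url) and url not in crawled
    ]


# Two configurations ("pages" and "news") share one base URL.
base_url = 'https://example.org/'

# Documents indexed by both configurations (hypothetical URLs).
indexed = [
    'https://example.org/pages/about',
    'https://example.org/news/item-1',
]

# The "news" configuration only crawls the news sitemap.
news_crawl = ['https://example.org/news/item-1']

# The "pages" document shares the prefix, so the "news" run treats
# it as stale even though another configuration owns it.
wrongly_selected = select_purge_candidates(indexed, base_url, news_crawl)
```

Here `wrongly_selected` contains `https://example.org/pages/about`: a prefix alone cannot tell which configuration is responsible for an entry.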

@jone jone merged commit f9e9b36 into master Nov 3, 2017

1 check passed

CI Governor: tests.cfg Task #186623 succeeded

@jone jone deleted the jone-fix-purging branch Nov 3, 2017
