Crawl behavior #243

OkkeKlein · 2016-04-20T19:07:16Z

Even with stayOnDomain="true" and maxDocuments=10, when the crawler finds a sitemap on an external site (https://www.flickr.com/sitemap/sitemap-photos-index-00000.xml), it goes through all the links. I had to cancel the crawl.

Am I wrong to think stayOnDomain="true" would prevent crawling an external site?

essiembre · 2016-04-20T20:41:02Z

Yes, it should stick to the domain(s) of the URL(s) you provide. If Flickr was not part of your start URLs, it should not be picked up. Unless your sitemap itself contains an external link to Flickr? Just guessing. Can you attach your config?

OkkeKlein · 2016-04-20T20:44:56Z

The start is a URL, not a sitemap, and there is only one domain. After adding referenceFilters that include this domain, it works as expected.

essiembre · 2016-04-20T20:53:39Z

Great, but it should work without you having to do this. It would be nice if you could share a config that reproduces the issue.

essiembre · 2016-04-21T16:02:43Z

Thanks to the offline material you sent me, I was able to find the cause and fix it. The latest snapshot has the fix.

Please confirm.

essiembre · 2016-06-01T15:08:10Z

@OkkeKlein: Have you had a chance to confirm this fix? Can we close?

essiembre · 2016-06-25T03:37:34Z

This fix was released in 2.5.0.

essiembre added a commit that referenced this issue Apr 21, 2016

Fixed "stayOnDomain" being true not being honored for extracted URLs with no scheme (//www.example.com). Github #243.

f219fb6
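
To illustrate the kind of bug the commit message describes, here is a minimal Python sketch of a same-domain check that handles scheme-less (protocol-relative) links such as //www.example.com. It is not the crawler's actual code: the function name `stays_on_domain` and the example URLs are assumptions made for illustration. The idea is that an extracted link has to be resolved against the page it was found on before its host is compared with the start URL's host; otherwise a link like //www.flickr.com/... could be mistaken for a path on the current site.

```python
from urllib.parse import urljoin, urlsplit


def stays_on_domain(start_url: str, page_url: str, href: str) -> bool:
    """Return True if href, as found on page_url, points to the start URL's host.

    Hypothetical helper for illustration only; not the crawler's API.
    """
    # Resolve the extracted link first. urljoin treats "//host/path" as a
    # network-path reference and borrows only the scheme from page_url.
    absolute = urljoin(page_url, href)

    start_host = urlsplit(start_url).hostname
    target_host = urlsplit(absolute).hostname

    # hostname is already lowercased by urlsplit; None means no host at all.
    return target_host is not None and target_host == start_host


if __name__ == "__main__":
    start = "https://www.example.com/"
    page = "https://www.example.com/gallery/index.html"
    for link in ("/about", "photo.html", "//www.flickr.com/sitemap/x.xml"):
        print(link, "->", urljoin(page, link), stays_on_domain(start, page, link))
```

In this sketch the comparison is an exact host match. Whether subdomains should also count as "on domain" is a separate design choice that the thread does not settle.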

essiembre added bug resolved labels Apr 21, 2016

essiembre added this to the 2.5.0 milestone Apr 21, 2016

essiembre closed this as completed Jun 25, 2016
