Crawl behavior #243

Closed
OkkeKlein opened this issue Apr 20, 2016 · 6 comments

OkkeKlein commented Apr 20, 2016

Even with stayOnDomain="true" and maxDocuments=10, when the crawler finds a sitemap on an external site (https://www.flickr.com/sitemap/sitemap-photos-index-00000.xml), it goes through all of its links. I had to cancel the crawl.

Am I wrong to think stayOnDomain="true" would prevent crawling an external site?

@essiembre
Contributor

Yes, it should stick to the domain(s) of the URL(s) you provide. If Flickr was not part of your start URLs, it should not be picked up. Unless your sitemap itself contains an external link to Flickr? Just guessing. Can you attach your config?
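
To make the expected behavior concrete, here is a minimal sketch of what a stay-on-domain check amounts to. It is only an illustration, not the crawler's actual code: the function names `allowed_hosts` and `stays_on_domain` and the `www.example.com` start URL are assumptions, and the comparison is an exact host match.

```python
from urllib.parse import urlsplit


def allowed_hosts(start_urls):
    """Collect the host names of the start URLs (hypothetical helper)."""
    hosts = set()
    for url in start_urls:
        host = urlsplit(url).hostname  # lowercased by urlsplit, or None
        if host:
            hosts.add(host)
    return hosts


def stays_on_domain(url, start_urls):
    """Return True only if the URL's host is one of the start-URL hosts."""
    host = urlsplit(url).hostname
    return host is not None and host in allowed_hosts(start_urls)


# Assumed start URL, for illustration only.
start_urls = ["https://www.example.com/"]
on_site = stays_on_domain("https://www.example.com/about", start_urls)
external = stays_on_domain(
    "https://www.flickr.com/sitemap/sitemap-photos-index-00000.xml", start_urls
)
```

Under this reading, the Flickr sitemap URL would be rejected before any of its links were followed.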

@OkkeKlein
Author

The start URL is a regular URL, not a sitemap, and there is only one domain. After adding referenceFilters that include this domain, it works as expected.
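
The workaround boils down to an include-only pattern on references. A rough sketch of that idea follows. It does not reproduce the collector's filter classes or configuration syntax; `make_include_filter` and the `www.example.com` pattern are hypothetical stand-ins for the real domain.

```python
import re


def make_include_filter(pattern):
    """Build a predicate that accepts only URLs fully matching the pattern."""
    compiled = re.compile(pattern)

    def accept(url):
        # fullmatch requires the whole URL to match, not just a prefix.
        return compiled.fullmatch(url) is not None

    return accept


# Assumed pattern for illustration; substitute the crawled site's domain.
accept = make_include_filter(r"https?://www\.example\.com/.*")
kept = accept("https://www.example.com/products/1")
dropped = accept("https://www.flickr.com/sitemap/sitemap-photos-index-00000.xml")
```

Any reference that does not match the include pattern is dropped, which is why the external sitemap is no longer followed.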

@essiembre
Contributor

Great, but it should work without you having to do this. It would be helpful if you could share a config that reproduces the issue.

@essiembre
Contributor

Thanks to the offline material you sent me, I was able to find the cause and fix it. The latest snapshot has the fix.

Please confirm.

@essiembre essiembre added this to the 2.5.0 milestone Apr 21, 2016
@essiembre
Contributor

@OkkeKlein: Have you had a chance to confirm this fix? Can we close?

@essiembre
Contributor

This fix was released in 2.5.0.
