Crawl behavior #243
Comments
Yes, it should stick to the domain(s) of the URL(s) you provide. If Flickr was not part of your start URLs, it should not be picked up, unless your sitemap itself contains an external link to Flickr? Just guessing. Can you attach your config?
The start is a URL, not a sitemap, and there is only one domain. After adding referenceFilters that include this domain, it works as expected.
Great, but it should work without you having to do this. It would be nice if you could share a config that reproduces the issue.
with no scheme (//www.example.com). Github #243.
Thanks to the offline material you sent me, I was able to find the cause and fix it. The latest snapshot has the fix. Please confirm.
@OkkeKlein: Have you had a chance to confirm this fix? Can we close?
This fix was released in 2.5.0.
Even with stayOnDomain="true" and maxDocuments=10, when the crawler finds a sitemap (https://www.flickr.com/sitemap/sitemap-photos-index-00000.xml) on an external site, it goes through all the links. I had to cancel the crawl.
Am I wrong to think stayOnDomain="true" would prevent crawling an external site?
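For readers unfamiliar with what a stay-on-domain check involves, here is a minimal sketch in Java. It is not the crawler's actual code: the class `StayOnDomainSketch` and the helper `sameHost` are hypothetical names, and the start URL is a placeholder. It assumes the check compares the host of each discovered link with the host of the start URL, and it resolves links against the start URL first so that scheme-less links such as `//www.example.com/page` are handled.

```java
import java.net.URI;

public class StayOnDomainSketch {

    // Hypothetical helper: returns true when the link, once resolved
    // against the start URL, points at the same host as the start URL.
    static boolean sameHost(String startUrl, String link) {
        URI base = URI.create(startUrl);
        // resolve() fills in the scheme for links like "//host/path"
        // and the scheme plus host for links like "/path".
        URI resolved = base.resolve(link);
        String host = resolved.getHost();
        return host != null && host.equalsIgnoreCase(base.getHost());
    }

    public static void main(String[] args) {
        String start = "https://www.example.com/index.html"; // placeholder start URL
        // Relative link on the same site.
        System.out.println(sameHost(start, "/about.html"));
        // Scheme-less link to the same host.
        System.out.println(sameHost(start, "//www.example.com/page"));
        // Scheme-less link to an external host; should be rejected.
        System.out.println(sameHost(start,
                "//www.flickr.com/sitemap/sitemap-photos-index-00000.xml"));
    }
}
```

Under these assumptions, the first two links are accepted and the Flickr sitemap link is rejected. A check that compares raw strings without resolving the link first could get scheme-less links wrong, which is one plausible way an external sitemap could slip through.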