Is there a way to configure collector behaviour based on IP address? #426

Open
ronjakoi opened this issue Nov 14, 2017 · 2 comments

Comments

@ronjakoi

I need to crawl a large set of hosts, some of which are hosted on a couple of "web hotel" servers. Is there a way to configure the collector so that a single delay value applies to all hosts that resolve to the same IP address?

Obviously one solution would be to group all the relevant sites into a single crawler as seeds and use ReferenceDelayResolver with scope=crawler, but it's unpredictable which sites will link to these virtual hosts. The list of virtually hosted sites is also not static; it changes almost daily, so maintaining it would be problematic.

Suggestions?

@essiembre
Contributor

Unfortunately, there is currently no out-of-the-box option to delay based on IP. The closest is "per site". You would have to write your own IDelayResolver or we can make this a feature request if you like.
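As a rough starting point for a custom resolver, here is a minimal sketch of the per-IP bookkeeping only. The class name `IpScopedDelay` and its methods are hypothetical and not part of the collector; wiring it into an actual IDelayResolver implementation is not shown, and the interface's exact method signature should be checked against the collector version in use. The sketch assumes a host that fails DNS lookup can fall back to being keyed by its host name.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical helper: enforces one shared delay per resolved IP address. */
public class IpScopedDelay {
    private final long delayMillis;
    // Maps a host name to the key the delay is shared on (normally its IP).
    private final Function<String, String> hostToIp;
    // Earliest time (epoch millis) at which each IP may be requested next.
    private final ConcurrentMap<String, Long> nextAllowed = new ConcurrentHashMap<>();

    public IpScopedDelay(long delayMillis, Function<String, String> hostToIp) {
        this.delayMillis = delayMillis;
        this.hostToIp = hostToIp;
    }

    /** Default lookup: resolve via DNS, falling back to the host name itself. */
    public static String dnsLookup(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return host; // unresolved hosts get their own per-host delay
        }
    }

    /** Reserves the next slot for the URL's IP and returns the wait in millis. */
    public long reserve(String url, long nowMillis) {
        String host = URI.create(url).getHost();
        String key = hostToIp.apply(host == null ? url : host);
        long[] wait = new long[1];
        // compute() is atomic per key, so concurrent threads queue up correctly.
        nextAllowed.compute(key, (k, next) -> {
            long start = (next == null) ? nowMillis : Math.max(next, nowMillis);
            wait[0] = start - nowMillis;
            return start + delayMillis;
        });
        return wait[0];
    }

    /** Blocks the calling thread until the URL's IP may be requested again. */
    public void delay(String url) throws InterruptedException {
        long wait = reserve(url, System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }
}
```

A custom resolver could hold one instance, for example `new IpScopedDelay(3000, IpScopedDelay::dnsLookup)`, and call `delay(url)` before each request. Note that DNS answers can change or rotate between lookups, so caching the host-to-IP mapping for a while may be worth considering.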

Do you think you have many sites using the same IP? If you suspect there are only a handful, setting the scope to "site" on the GenericDelayResolver could be sufficient.

Setting the scope to "crawler" is the "safest" but also the slowest when you have many sites.

FYI, with your start URLs, if you set "stayOnDomain" to "true", it will not go to sites beyond those you defined as start URLs.

@ronjakoi
Author

I have on the order of dozens of sites on a handful of IP addresses. I can probably get away with using the "site" scope with a conservative default delay for now; we are only trying to think of problems in advance, before I deploy the crawler to production.

Please do make a feature request, however. Much appreciated.

2 participants