Crawl-delay in robots.txt should not shrink delay configured by fetcher.server.delay #549

Merged
merged 1 commit into DigitalPebble:master on Mar 26, 2018

@sebastian-nagel
Collaborator

sebastian-nagel commented Mar 16, 2018

To guarantee that the crawler always behaves politely and inconspicuously, see commoncrawl/news-crawl#24.

@sebastian-nagel


Collaborator

sebastian-nagel commented Mar 16, 2018

The PR also avoids repeated messages "Crawl delay for queue: ... is set to 1000 as per robots.txt".

@jnioche

Thanks @sebastian-nagel, it is a good idea but I think this behaviour should be configurable. We should give users the option to choose how they want the crawler to behave.

@@ -402,6 +404,8 @@ public FetcherThread(Config conf, int num) {
this.maxCrawlDelay = ConfUtils.getInt(conf,
"fetcher.max.crawl.delay", 30) * 1000;
this.crawlDelay = (long) (ConfUtils.getFloat(conf,


@jnioche

jnioche Mar 19, 2018

Member

You probably don't need to read this property here; instead use fiq.crawlDelay as you do in line 534.

@sebastian-nagel


Collaborator

sebastian-nagel commented Mar 19, 2018

Agreed. What about a property fetcher.min.crawl.delay? The crawl delay will then be between this value and that of fetcher.max.crawl.delay. There is already a fetcher.server.min.delay but it's used as delay between "fetches without fetch" (eg. robots.txt denied).
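The clamping proposed here can be sketched as follows — a minimal illustration, not StormCrawler code; the class and method names are hypothetical:

```java
// Sketch of the proposed behaviour: the effective per-queue delay is the
// robots.txt Crawl-delay bounded below by fetcher.min.crawl.delay and
// above by fetcher.max.crawl.delay. All values in milliseconds.
public class DelayClamp {
    static long effectiveDelayMs(long robotsDelayMs, long minDelayMs, long maxDelayMs) {
        // clamp the robots.txt Crawl-delay into [min, max]
        return Math.max(minDelayMs, Math.min(robotsDelayMs, maxDelayMs));
    }
}
```

With this, a robots.txt "Crawl-delay: 1" (1000 ms) against a configured minimum of 5000 ms would yield 5000 ms, so a site cannot shrink the delay below what the operator considers polite.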

@jnioche


Member

jnioche commented Mar 19, 2018

There is already a fetcher.server.min.delay but it's used as delay between "fetches without fetch" (eg. robots.txt denied).

No, it is used for normal fetches (asap==false); robots-denied fetches have the opposite, asap==true.
It is used when multithreading on a single queue, to define an alternative value to the normal crawl delay. This is inherited from Nutch and I agree that it is confusing; the name of the parameter doesn't help either (server?).

Let me have a think about it. Do we need a special case for multithreading or would the fetcher.min.crawl.delay be used no matter what? It could always be set to 0 or a low value for users who want an aggressive setting. If so should we declare it in the same way and place as the max value?

Your thoughts and suggestions are welcome as usual

@sebastian-nagel


Collaborator

sebastian-nagel commented Mar 19, 2018

Ok, sorry, you're right.

Do we need a special case for multithreading

It could make sense if there is a custom per-queue number of threads (fetcher.maxThreads.queueId): then fetcher.server.min.delay is for queues where you explicitly want aggressive crawling, while the other guarantees a minimum delay for the polite queues.

@sebastian-nagel


Collaborator

sebastian-nagel commented Mar 21, 2018

Added a property fetcher.min.crawl.delay. Removed the change from SimpleFetcherBolt: it simply ignores overlong crawl delays, so it should be ok with shorter delays as well.

@jnioche


Member

jnioche commented Mar 21, 2018

If users are concerned with being too aggressive even if permitted by robots, would they set a min value different from the default delay they specify? What about having a boolean parameter instead indicating that if a delay from robots is found and is lower than the default, we should use the latter anyway? It would be set to false by default.

As for SimpleFetcherBolt, the logic is different from FetcherBolt when it comes to the max values (but that won't necessarily be always the case) and we could have the same logic as FetcherBolt when it comes to min values. In fact, we could have a new config indicating whether we should skip a URL if the max value set by a server is above our maxValue (as currently done in FetcherBolt) or enforce the max value we set (as in SimpleFetcherBolt)
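The two max-delay behaviours contrasted above can be sketched side by side — an illustrative simplification, not the actual bolt code:

```java
// FetcherBolt currently skips URLs whose robots.txt Crawl-delay exceeds the
// configured maximum, while SimpleFetcherBolt enforces the configured maximum
// instead. All values in milliseconds; -1 signals "skip".
public class MaxDelayModes {
    /** FetcherBolt-style: skip the queue's URLs when the delay is overlong. */
    static long skipIfOverMax(long robotsMs, long maxMs) {
        return robotsMs > maxMs ? -1 : robotsMs;
    }

    /** SimpleFetcherBolt-style: cap the delay at our configured maximum. */
    static long capAtMax(long robotsMs, long maxMs) {
        return Math.min(robotsMs, maxMs);
    }
}
```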

@sebastian-nagel


Collaborator

sebastian-nagel commented Mar 21, 2018

would they set a min value different from the default delay they specify?

Rarely. And it makes even more sense if there is also a boolean whether to enforce the max. configured delay.

How should we name the new boolean properties — fetcher.server.delay.enforce and fetcher.max.crawl.delay.enforce, respectively?

@jnioche


Member

jnioche commented Mar 21, 2018

@sebastian-nagel


Collaborator

sebastian-nagel commented Mar 21, 2018

ok, I'll update the PR tomorrow. Thanks!

Crawl-delay in robots.txt should optionally not shrink the configured delay

configurable behavior when crawl-delay exceeds max. crawl delay
- add config property fetcher.server.delay.force (default false)
  if true: the value of fetcher.server.delay is used even
  if a shorter crawl-delay is specified in robots.txt
- add config property fetcher.max.crawl.delay.force (default false)
  if true: the value of fetcher.max.crawl.delay is used even if
           the robots.txt requests a longer crawl-delay
  if false: URLs in queues with an overlong crawl-delay are skipped
- avoid repeated log messages by logging only if the robots.txt crawl-delay
  differs from the queue delay
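The commit message above can be summarized as a small decision function — a simplified illustration of the two flags, not the actual FetcherBolt code:

```java
// Simplified sketch of the new fetcher.server.delay.force and
// fetcher.max.crawl.delay.force flags. Returns the effective delay in ms,
// or -1 to signal that the queue's URLs should be skipped.
public class CrawlDelayPolicy {
    static long applyRobotsDelay(long robotsDelayMs, long serverDelayMs,
                                 long maxCrawlDelayMs,
                                 boolean serverDelayForce,
                                 boolean maxCrawlDelayForce) {
        if (robotsDelayMs > maxCrawlDelayMs) {
            // fetcher.max.crawl.delay.force: cap instead of skipping
            return maxCrawlDelayForce ? maxCrawlDelayMs : -1;
        }
        if (robotsDelayMs < serverDelayMs && serverDelayForce) {
            // fetcher.server.delay.force: never go below the configured delay
            return serverDelayMs;
        }
        return robotsDelayMs;
    }
}
```

Both flags default to false, so the crawler's existing behaviour is unchanged unless a user opts in.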
@sebastian-nagel


Collaborator

sebastian-nagel commented Mar 26, 2018

Done: the properties fetcher.server.delay.force and fetcher.max.crawl.delay.force are added in ad15f72, with a description for the wiki.

@jnioche jnioche merged commit 9956aad into DigitalPebble:master Mar 26, 2018

1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)

@jnioche


Member

jnioche commented Mar 26, 2018

Fab, thanks @sebastian-nagel

@jnioche jnioche added this to the 1.9 milestone Mar 26, 2018

@sebastian-nagel sebastian-nagel deleted the sebastian-nagel:guaranteed-fetcher-server-delay branch Mar 27, 2018
