Crawling our website takes a very long time, looking for configuration advice #570

Answered by kris-sigur
Glenruben asked this question in Q&A

The queues are configured to limit the load on each target host/domain. As some queues complete, the overall crawl rate drops because fewer hosts remain to be crawled. Heritrix is, by default, configured to be reasonably polite in its crawling to avoid overwhelming individual sites (or getting banned by angry sysadmins). Of course, this means that very large sites take a very long time, since you are only crawling one URL every few seconds.

The relevant settings are in this section:

 <bean id="disposition" class="org.archive.crawler.postprocessor.DispositionProcessor">
  <property name="delayFactor" value="1.0" />
  <property name="minDelayMs" value="500" />
  <property name="respectC…
