This document describes all configuration parameters that determine the behaviour of the crawler and all its components.
The file crawler-default.yaml lists the configuration elements presented below and provides a default value for them. This file is loaded automatically by the sub-classes of ConfigurableTopology and should not be modified. Instead we recommend that you provide a custom configuration file when launching a topology (see below).
The custom configuration file is expected to be in YAML format and can be passed as a command-line argument as
-conf <path_to_config_file> to the Java call of your Main class (which normally would be a sub-class of ConfigurableTopology).
The values in the custom configuration file override the ones provided in crawler-default.yaml; it does not need to contain all the values.
You can use
-conf <path_to_config_file> more than once on the command line, which allows you to split the configuration, for instance between a generic configuration file and the configuration of a specific resource.
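As an illustration, the sketch below shows what such a generic file could look like; the file name is arbitrary, the keys shown are taken from the tables below, and the nesting under a top-level `config` entry follows the usual layout of StormCrawler configuration files (check crawler-default.yaml for your version).

```yaml
# my-generic-conf.yaml (hypothetical name): settings shared by all topologies.
# StormCrawler configuration files typically nest key/value pairs under "config".
config:
  http.agent.name: "MyCrawler"
  http.agent.email: "crawler-admin@example.com"
```

A second, resource-specific file (for example one holding only the indexing settings) could then be passed with an additional argument, e.g. `-conf my-generic-conf.yaml -conf indexer-conf.yaml`.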
With Maven installed, you must first generate an uberjar:
mvn clean package
before submitting the topology using the storm command:
storm jar path/to/allmycode.jar org.me.MyCrawlTopology -conf my-crawler-conf.yaml -local
When deploying on a production Storm cluster, simply remove the -local argument from the command above.
Passing a configuration file is mandatory. A sample configuration file can be found here.
The following tables describe all available configuration options and their default values. If one of the keys is not present in your YAML file, the default value will be taken.
||A JSON configuration file that defines the URL filtering strategy. Here is the default implementation. Please also refer to URLFilters. Note: if you want to specify your own file you should give it a different name than
||The JSON configuration file that defines your ParseFilters. Here is the default one. This influences the behavior of JSoupParserBolt and SiteMapParserBolt. Note: if you want to specify your own file you should give it a different name than
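For instance, a custom topology configuration could point to renamed copies of these files. The key names used below are assumptions based on crawler-default.yaml and should be checked against the version you are running.

```yaml
# Sketch: pointing the topology at custom filter definitions.
# Key names are assumptions; verify them in crawler-default.yaml.
config:
  urlfilters.config.file: "my-urlfilters.json"      # renamed copy of the default URL filters file
  parsefilters.config.file: "my-parsefilters.json"  # renamed copy of the default parse filters file
```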
Fetching and partitioning
|http.agent.name||-||A name to be part of the User-Agent string|
|http.agent.version||-||A version to be part of the User-Agent string|
|http.agent.description||-||A description to be part of the User-Agent string|
|http.agent.url||-||A URL to be part of the User-Agent string|
|http.agent.email||-||An email address to be part of the User-Agent string|
|http.basicauth.user||-||The user name used for the Basic Authentication implemented in the HTTPClient protocol|
|http.basicauth.password||-||The password associated with the property http.basicauth.user|
||not yet implemented - whether or not to store the response time in the Metadata|
||Generally ignore all robots.txt rules (not recommended)|
|http.proxy.host||-||A SOCKS HTTP proxy server to be used for all requests made by the crawler|
|http.proxy.port||-||The port of your SOCKS proxy server|
||A connection timeout specified in milliseconds. Tuples that run into this timeout will be emitted with the status ERROR in the StatusStream|
||Comma-separated additional user-agent strings to be used for the interpretation of the robots.txt. If left empty (the default), the robots.txt is interpreted with the value of http.agent.name
||The maximum number of bytes for returned HTTP response bodies. By default no limit is applied. In the generated archetype a limit of
||Possible values are:
||Possible values are:
||The maximum number of seconds accepted from Crawl-delay directives in robots.txt files. If the crawl-delay exceeds this value, the behavior depends on the value of fetcher.max.crawl.delay.force|
|fetcher.max.crawl.delay.force||false||Configures the behavior of the fetcher if the robots.txt crawl-delay exceeds fetcher.max.crawl.delay|
||The default number of threads per queue. This can be overwritten for specific hosts/domains/IPs. See below|
||Overwrites the default value of
||Defines the delay between crawls in the same queue if no Crawl-delay is defined for this URL in the page's robots.txt. Note: for multi-threaded queues neither this value nor the one from the robots.txt will be honored. See
||Defines the delay between crawls in the same queue if a queue has > 1 thread. The Crawl-delay declared in the robots.txt is ignored in this case and this value is used instead.|
|fetcher.server.delay.force||false||Defines the behavior of the fetcher when the crawl-delay in the robots.txt is smaller than the value configured in fetcher.server.delay|
||The number of threads that fetch pages from all queues concurrently. These threads do the actual work of downloading the pages. Increase this to get more throughput at the cost of higher network, CPU and memory utilisation. Tune this value carefully while monitoring your system resources to find a value that works best for your hardware and network infrastructure.|
||Whether or not URL redirects are allowed. If set to true, the crawler will emit the target URL in the StatusStream with the status DISCOVERED|
||Defines what happens in the scenario where the request to the
||The protocols to support. Each of them has a corresponding
||The Protocol implementation for plain HTTP|
||The Protocol implementation for HTTP over SSL|
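A hedged sketch of fetch politeness settings is shown below. `http.agent.name`, `http.agent.email` and `fetcher.max.crawl.delay.force` appear in the table above; the other key names are assumptions based on crawler-default.yaml and should be verified against the version you are running.

```yaml
# Sketch of fetch politeness settings; keys marked "assumed" should be
# checked against crawler-default.yaml for your StormCrawler version.
config:
  http.agent.name: "MyCrawler"
  http.agent.email: "crawler-admin@example.com"
  fetcher.threads.number: 50       # assumed key: total fetch threads across all queues
  fetcher.threads.per.queue: 1     # assumed key: keep 1 to respect per-host delays
  fetcher.server.delay: 1.0        # assumed key: delay between requests in the same queue, in seconds
  fetcher.max.crawl.delay: 30      # assumed key: cap on robots.txt Crawl-delay, in seconds
  fetcher.max.crawl.delay.force: false
```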
The values below are used by sub-classes of
AbstractIndexerBolt. Examples: StdOut, ElasticSearch. These classes persist the outcome of your crawling process and receive tuples enriched with Metadata (containing all the information gathered by previous Bolts).
|indexer.md.filter||-||A YAML List of
|indexer.md.mapping||-||A YAML List of
|indexer.text.fieldname||-||The fieldname that should be used to index the content of the HTML body. How this is used is again the responsibility of the class that extends AbstractIndexerBolt|
|indexer.url.fieldname||-||Same as above -
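The snippet below sketches what such an indexer configuration might look like. The metadata keys used in the mapping and filter (e.g. `parse.title`, `isSitemap`) are only examples and depend on which ParseFilters populate the Metadata in your topology.

```yaml
# Sketch of indexer settings for a bolt extending AbstractIndexerBolt.
config:
  indexer.url.fieldname: "url"
  indexer.text.fieldname: "content"
  # Each entry maps a Metadata key to the field name used in the index;
  # the source keys below are examples only.
  indexer.md.mapping:
    - parse.title=title
    - parse.keywords=keywords
  # Only index documents whose Metadata contains this key/value pair (example value).
  indexer.md.filter: "isSitemap=false"
```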
This refers to persisting the status of a URL (e.g. ERROR, DISCOVERED etc.) along with something like a
nextFetchDate that is being calculated by a Scheduler
||Using this cache helps to prevent persisting the same URLs over and over again. The
||A cache specification string that defines the size and behavior of the above cache.|
||In minutes - how to schedule re-visits of pages. 1 day by default. This is used by the DefaultScheduler. If you need customized scheduling logic, just implement your own Scheduler. Note: the Scheduler class is not yet configurable. See or update [this issue](https://github.com/DigitalPebble/storm-crawler/issues/104) if you need this behavior. It should be quite easy to make the implementation class configurable.|
||In minutes - how often to re-visit pages with a fetch error. Every two hours by default. Identified by tuples in the [StatusStream](https://github.com/DigitalPebble/storm-crawler/wiki/statusStream) with the state of FETCH_ERROR
||In minutes - how often to re-visit pages with an error (HTTP 4XX or 5XX). Every month by default. Identified by tuples in the [StatusStream](https://github.com/DigitalPebble/storm-crawler/wiki/statusStream) with the state of ERROR
||Whether or not to emit outgoing links found in the parsed HTML document to the StatusStream as DISCOVERED|
||Whether or not to add the anchor text (can be > 1) of (filtered) outgoing links with the key
||Whether or not to track the URL path of outgoing links (all URLs that the crawler crawled to find this link) in the Metadata. The Metadata field name for this is
||Whether or not to track the depth of a crawled URL. This is a simple counter that is being tracked for outgoing links in the Metadata and incremented by
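To tie the last two tables together, here is a hedged sketch of the status and metadata part of a configuration. All key names below are assumptions (the key column has been lost from the table above) and the values simply restate the defaults described there; check crawler-default.yaml for the exact names.

```yaml
# Sketch only: key names are assumed, values restate the defaults described above.
config:
  status.updater.use.cache: true                                        # assumed key
  status.updater.cache.spec: "maximumSize=10000,expireAfterAccess=1h"   # assumed key; example cache specification string
  fetchInterval.default: 1440          # assumed key: re-visit pages after 1 day (in minutes)
  fetchInterval.fetch.error: 120       # assumed key: re-visit pages with a fetch error every 2 hours
  fetchInterval.error: 44640           # assumed key: re-visit pages with HTTP 4XX/5XX errors every month
  metadata.track.path: true            # assumed key: track the URL path of outgoing links
  metadata.track.depth: true           # assumed key: track the depth of crawled URLs
```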