This project was generated by the StormCrawler Maven Archetype as a starting point for building your own crawler. Have a look at the code and resources and modify them to your heart's content.

With Storm installed, you must first generate an uberjar:

mvn clean package

before submitting the topology using the storm command:

storm jar target/${artifactId}-${version}.jar ${package}.CrawlTopology -conf crawler-conf.yaml -local

This runs the topology in local mode. Remove the '-local' option to run it in distributed mode.

You can also use Flux to do the same:

storm jar target/${artifactId}-${version}.jar org.apache.storm.flux.Flux --local crawler.flux --sleep 86400000

Note that in local mode, Flux stops the topology after a default TTL of 60 seconds. The --sleep value is given in milliseconds, so the command above runs the topology for 24 hours (86400000 ms).
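
If you want a different local run time, the --sleep value can be derived from a number of hours. The snippet below is a minimal sketch; the variable names `hours` and `sleep_ms` are illustrative and not part of the generated project.

```sh
# Illustrative only: convert a run time in hours to the milliseconds
# that Flux expects for --sleep.
hours=24
sleep_ms=$((hours * 60 * 60 * 1000))

# Print the option so it can be pasted into the Flux command above.
echo "--sleep ${sleep_ms}"
```

With hours=24 this yields the value 86400000 used in the command above.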

It is best to run the topology with --remote instead of --local to benefit from the Storm UI and logging. In that case, the topology runs continuously, as intended, and --sleep is not needed.
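
As a rough sketch, the remote submission is the Flux command shown earlier with --remote in place of the local options. The jar path below is a hypothetical placeholder for the uberjar built by Maven; the snippet only assembles and prints the command so that you can check it before running it against your cluster.

```sh
# Hypothetical jar path: replace with the uberjar produced by "mvn clean package".
jar="target/mycrawler-1.0.jar"

# Same Flux entry point and topology file as above, submitted with --remote.
cmd="storm jar ${jar} org.apache.storm.flux.Flux --remote crawler.flux"

# Print the command for review; run it yourself once it looks right.
echo "${cmd}"
```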