A collection of resources for Elasticsearch:
- IndexerBolt for indexing documents fetched with StormCrawler
- Spout and StatusUpdaterBolt for persisting URL information in recursive crawls
- StatusMetricsBolt for sending the breakdown of URLs per status as metrics and displaying their evolution over time
as well as examples of crawl and injection topologies.
We also have resources for Kibana to build basic real-time monitoring dashboards for the crawls, such as the one below.
If you are running StormCrawler in distributed mode on a Storm 1.0.3 cluster or below, you'll need to upgrade the log4j and slf4j dependencies (see STORM-2326). This isn't necessary in the more recent releases of Apache Storm.
Also, with Elasticsearch 5.x we now have to specify the following in the Maven Shade configuration:

```xml
<manifestEntries>
  <Change></Change>
  <Build-Date></Build-Date>
</manifestEntries>
```
We'll assume that Elasticsearch and Kibana are installed and running on your machine. You'll also need Java, Maven and Storm installed.
With a basic project set up, such as the one generated from the archetype:

```shell
mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.10
```
Copy the es-conf.yaml and flux files to the project directory. You can then edit the pom.xml and add the dependency for the Elasticsearch module:

```xml
<dependency>
  <groupId>com.digitalpebble.stormcrawler</groupId>
  <artifactId>storm-crawler-elasticsearch</artifactId>
  <version>1.10</version>
</dependency>
```
Then run the script ES_IndexInit.sh, which creates three indices: one for persisting the status of URLs (status), a template mapping for persisting the Storm metrics (for any index whose name matches metrics*), and a third index (index) for searching the documents fetched by StormCrawler (you will probably want to tune its mapping later, e.g. if you want to store the content field). You will also need to edit the script if Elasticsearch is running on a different machine.
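If Elasticsearch is not running on localhost, the addresses in es-conf.yaml need updating as well. A sketch, assuming the module's per-component address keys (the host name is a placeholder; check the key names against the es-conf.yaml shipped with the module):

```yaml
# Point each component at a remote Elasticsearch host (example host name)
es.indexer.addresses: "es1.example.com"
es.status.addresses: "es1.example.com"
es.metrics.addresses: "es1.example.com"
```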
We can inject the seed URLs into the status index by putting them in a text file with one URL per line and any key/value pairs separated by tabs, e.g.

```shell
echo 'http://www.theguardian.com/newssitemap.xml isSitemap=true' > seeds.txt
```
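Since the separator between the URL and its metadata is a tab character, printf (with an explicit \t) is handier than echo when building a larger seed file. A sketch with placeholder URLs:

```shell
# Build a seed file with two entries; metadata key=value pairs
# must be separated from the URL by a tab character.
printf 'http://example.com/\n' > seeds.txt
printf 'http://www.theguardian.com/newssitemap.xml\tisSitemap=true\n' >> seeds.txt
cat seeds.txt
```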
Edit the es-conf.yaml file as you see fit. As a general good practice, you should also specify the _http.agent.*_ configurations so that the servers you fetch from can identify you.
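For instance, the user agent settings could look like the fragment below (all values are placeholders to replace with your own details):

```yaml
# Identify the crawler to the servers it fetches from (example values)
http.agent.name: "my-crawler"
http.agent.version: "1.0"
http.agent.description: "built with StormCrawler"
http.agent.url: "http://example.com/crawler"
http.agent.email: "crawler-admin@example.com"
```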
Then compile with mvn clean package and inject the seeds with:

```shell
storm jar target/*-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --sleep 30000 --local es-injector.flux
```
The topology should terminate after 30 seconds; you should then be able to see the seeds in the status index.
When it's done, run

```shell
storm jar target/*-1.0-SNAPSHOT.jar org.apache.storm.flux.Flux --local es-crawler.flux
```

to start the crawl. You can replace --local with --remote to run the topology on a Storm cluster.
- create the index patterns: go to Settings > Indices > Add New, enter status as the _Index name or pattern_, and press Create. Repeat these steps for the metrics indices.
- to upload the dashboard configurations, go to Settings > Objects > Import and select the file kibana/status.json. Then go to Dashboard, click on Load Saved Dashboard and select Crawl Status. You should see a table containing a single line: DISCOVERED 1.
- repeat the operation with the file for the Metrics dashboard, which can then be used in Kibana to monitor the progress of the crawl.
Per time period metric indices (optional)
Note: a second option for the metrics index is available, namely the use of per-time-period indices. This best practice is discussed on the Elastic website.
The crawler config YAML must be updated to use either the day or month Elasticsearch metrics consumer, as shown below with the per day indices consumer:
```yaml
# Metrics consumers:
topology.metrics.consumer.register:
  - class: "org.apache.storm.metric.LoggingMetricsConsumer"
    parallelism.hint: 1
  - class: "com.digitalpebble.stormcrawler.elasticsearch.metrics.IndexPerDayMetricsConsumer"
    parallelism.hint: 1
```