Running Apache Storm benchmark

Overview

Before you begin, make sure you have compiled the application and created the required dataset: [Create dataset for Apache Storm benchmark](Create dataset for Apache Storm benchmark)

The Apache Storm benchmark contains the following topologies:

  1. EnronTopology: Complete application benchmark
  2. BareboneTopology: Same as EnronTopology, but the filter, modify, and metrics bolts are unity (pass-through) bolts
  3. TrivialTopology1: Same as BareboneTopology, but the filter and modify bolts are removed
  4. TrivialTopology2: Same as TrivialTopology1, but the serialization and deserialization bolts are removed. This topology requires an unserialized but compressed dataset (create it using com.ibm.streamsx.storm.email.benchmark.testing.CreateCompressedDatasetSequential)
  5. RestrictedTopology: Same as TrivialTopology2, but without compression and decompression. This topology requires an uncompressed dataset (create it using com.ibm.streamsx.storm.email.benchmark.testing.CreateSerializedDatasetSequential)

Dataset Naming Convention

  • Each spout requires its own dedicated input file
  • The first spout gets name0.ext, the second name1.ext, and so on. The naming convention is therefore name<n>.ext, where n runs from 0 to m-1 and m is the parallelism of the spout (see the listing below)
  • The dataset should be present on NFS
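
For example, with a spout parallelism of four, the NFS dataset folder might look like this (the folder path and file names are illustrative):

```
$ ls /nfs/datasets/enron
name0.ext  name1.ext  name2.ext  name3.ext
```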

Configuration

Copy conf/storm.email.properties to your home folder (~/) and fill in the missing values. At a minimum, the following values need to be set:

  1. logspath: path to custom logs folder
  2. filepath: path to input file
  3. filename: name of the input file
  4. fileext: extension of the input file

For instance, if you have four spouts, then you should have name0.ext, name1.ext, name2.ext, and name3.ext on your NFS. filepath would point to the folder where these four files are located, filename would be name, and fileext would be .ext.
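
Putting it together, a minimal storm.email.properties sketch for that four-spout example (the paths and values here are illustrative, not shipped defaults):

```
# Illustrative values only; point these at your own locations
logspath=/nfs/home/user/storm-logs
filepath=/nfs/datasets/enron
filename=name
fileext=.ext
```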

The parallelism of the pipeline can be varied by changing the values of the *spout and *bolt keys in the configuration file
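
For example, setting a spout key to 4 starts four spout instances, each reading its own input file. The key names below are hypothetical; use the actual *spout and *bolt keys present in your copy of the file:

```
# Hypothetical key names for illustration; the real keys follow
# the *spout / *bolt pattern in storm.email.properties
emailspout=4
filterbolt=8
modifybolt=8
```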

Final Metrics Configuration

  1. The final metrics are emitted by the Global Metrics Bolt
  2. To this end, it needs to know the total number of emails. It gets this number from the configuration file (totalemails)
  3. This number needs to be updated each time the dataset changes: just uncomment the totalemails entry for the corresponding dataset in the configuration file, as in the sketch below
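
A sketch of that part of storm.email.properties (the email counts here are made up for illustration; use the ones shipped in the file):

```
# Keep exactly one totalemails line uncommented.
# Counts below are illustrative only.
# Small test dataset:
#totalemails=10000
# Full dataset:
totalemails=250000
```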

Version

The code works with Storm 0.8.2 and 0.9.0.1. The version is selected by setting a) storm.version in pom.xml and b) stormversion in storm.email.properties
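
For example, to select Storm 0.9.0.1 (this assumes storm.version is declared as a Maven property in the pom.xml <properties> block, which is the usual convention):

```
<!-- pom.xml (excerpt) -->
<properties>
  <storm.version>0.9.0.1</storm.version>
</properties>
```

and, in ~/storm.email.properties:

```
stormversion=0.9.0.1
```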

Running the Application Benchmark

storm jar target/storm-email-benchmark-1.0-jar-with-dependencies.jar com.ibm.streamsx.storm.email.benchmark.<topology_name> <local_or_remote> <job_id>

In local mode, Storm will be executed as a standalone local application

In remote mode, an existing Storm deployment will be used
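
For example, to run the complete EnronTopology on an existing cluster (this assumes <local_or_remote> takes the literal strings local or remote; the job id run1 is an arbitrary identifier):

```
storm jar target/storm-email-benchmark-1.0-jar-with-dependencies.jar \
  com.ibm.streamsx.storm.email.benchmark.EnronTopology remote run1
```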

These topologies use vanilla shuffle grouping. If you want localOrShuffle grouping instead, use com.ibm.streamsx.storm.email.benchmark.local.<topology_name>.

For some setups, especially single-process ones, shuffle seems to perform better than localOrShuffle.

Results Collection

The final counts of characters, words, and paragraphs, along with the throughput, elapsed time, and number of processed emails, can be retrieved from <logspath>/<job_id>/GlobalMetricsBolt_Final

See the Configuration section above for details of logspath
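
For example, with shell variables standing in for your configured logspath and the job id you passed on the command line:

```
# LOGSPATH and JOB_ID are stand-ins for your own values
cat "$LOGSPATH/$JOB_ID/GlobalMetricsBolt_Final"
```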

  1. Interval metrics can be obtained from <logspath>/<job_id>/GlobalMetricsBolt and <logspath>/<job_id>/GlobalMetricsBolt_Throughput
  2. To collect CPU time after the job has completed:
     • Run jps and note down the PIDs of all worker processes
     • For each worker PID, run ps -e -o pid,cputime | grep <pid>
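
A minimal shell sketch of that CPU-time collection, assuming jps lists the Storm worker JVMs under the short main-class name worker:

```
# Report CPU time for every Storm worker JVM after the job finishes.
# Assumes jps shows the worker main class simply as "worker".
for pid in $(jps | awk '$2 == "worker" {print $1}'); do
  ps -e -o pid,cputime | grep "$pid"
done
```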