Running Apache Storm benchmark

Overview

Before you begin, make sure you have compiled the application and created the required dataset: [Create dataset for Apache Storm benchmark](Create dataset for Apache Storm benchmark)

The Apache Storm benchmark contains the following topologies:

  1. EnronTopology: Complete application benchmark
  2. BareboneTopology: Same as EnronTopology, but the filter, modify, and metrics bolts are unity (pass-through) bolts
  3. TrivialTopology1: Same as BareboneTopology, but the filter and modify bolts are removed
  4. TrivialTopology2: Same as TrivialTopology1, but the serialization and deserialization bolts are removed. This topology requires an unserialized but compressed dataset (create it using com.ibm.streamsx.storm.email.benchmark.testing.CreateCompressedDatasetSequential)
  5. RestrictedTopology: Same as TrivialTopology2, but without compression and decompression. This topology requires an uncompressed dataset (create it using com.ibm.streamsx.storm.email.benchmark.testing.CreateSerializedDatasetSequential)

Dataset Naming Convention

  • Each spout requires its own dedicated input file
  • The first spout gets name0.ext, the second name1.ext, and so on. The naming convention is therefore name<n>.ext, where n runs from 0 to m-1 and m is the parallelism of the spout (see the listing below)
  • The dataset should be present on NFS
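
For example, with a spout parallelism of four, the NFS dataset folder might look like this (the folder path and file names are illustrative):

```
$ ls /nfs/datasets/enron
name0.ext  name1.ext  name2.ext  name3.ext
```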

Configuration

Copy conf/storm.email.properties to your home folder (~/) and fill in the missing values. At a minimum, the following values need to be set:

  1. logspath: path to custom logs folder
  2. filepath: path to input file
  3. filename: name of the input file
  4. fileext: extension of the input file

For instance, if you have four spouts, then you should have name0.ext, name1.ext, name2.ext, and name3.ext on your NFS. filepath would point to the folder where these four files are located, filename would be name, and fileext would be .ext.
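
Putting it together, a minimal storm.email.properties sketch for that four-spout example (the paths and values here are illustrative, not shipped defaults):

```
# Illustrative values only; point these at your own locations
logspath=/nfs/home/user/storm-logs
filepath=/nfs/datasets/enron
filename=name
fileext=.ext
```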

The parallelism of the pipeline can be varied by changing the values of the *spout and *bolt keys in the configuration file
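
For example, setting a spout key to 4 starts four spout instances, each reading its own input file. The key names below are hypothetical; use the actual *spout and *bolt keys present in your copy of the file:

```
# Hypothetical key names for illustration; the real keys follow
# the *spout / *bolt pattern in storm.email.properties
emailspout=4
filterbolt=8
modifybolt=8
```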

Final Metrics Configuration

  1. The final metrics are emitted by the Global Metrics Bolt
  2. To this end, it needs to know the total number of emails. It gets this number from the configuration file (totalemails)
  3. This number needs to be updated each time the dataset changes: just uncomment the totalemails entry for the corresponding dataset in the configuration file, as in the sketch below
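
A sketch of that part of storm.email.properties (the email counts here are made up for illustration; use the ones shipped in the file):

```
# Keep exactly one totalemails line uncommented.
# Counts below are illustrative only.
# Small test dataset:
#totalemails=10000
# Full dataset:
totalemails=250000
```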

Version

The code works with Storm 0.8.2 and 0.9.0.1. The version is selected by setting a) storm.version in pom.xml and b) stormversion in storm.email.properties
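
For example, to select Storm 0.9.0.1 (this assumes storm.version is declared as a Maven property in the pom.xml <properties> block, which is the usual convention):

```
<!-- pom.xml (excerpt) -->
<properties>
  <storm.version>0.9.0.1</storm.version>
</properties>
```

and, in ~/storm.email.properties:

```
stormversion=0.9.0.1
```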

Running the Application Benchmark

storm jar target/storm-email-benchmark-1.0-jar-with-dependencies.jar com.ibm.streamsx.storm.email.benchmark.<topology_name> <local_or_remote> <job_id>

In local mode, Storm will be executed as a standalone local application

In remote mode, an existing Storm deployment will be used
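
For example, to run the complete EnronTopology on an existing cluster (this assumes <local_or_remote> takes the literal strings local or remote; the job id run1 is an arbitrary identifier):

```
storm jar target/storm-email-benchmark-1.0-jar-with-dependencies.jar \
  com.ibm.streamsx.storm.email.benchmark.EnronTopology remote run1
```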

These topologies use vanilla shuffle grouping. If you want localOrShuffle grouping instead, use com.ibm.streamsx.storm.email.benchmark.local.<topology_name>.

For some setups, especially single-process ones, shuffle seems to perform better than localOrShuffle.

Results Collection

The final counts of characters, words, and paragraphs, along with the throughput, elapsed time, and number of processed emails, can be retrieved from <logspath>/<job_id>/GlobalMetricsBolt_Final

See the Configuration section above for details of logspath
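
For example, with shell variables standing in for your configured logspath and the job id you passed on the command line:

```
# LOGSPATH and JOB_ID are stand-ins for your own values
cat "$LOGSPATH/$JOB_ID/GlobalMetricsBolt_Final"
```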

  1. Interval metrics can be obtained from <logspath>/<job_id>/GlobalMetricsBolt and <logspath>/<job_id>/GlobalMetricsBolt_Throughput
  2. To collect CPU time after the job has completed:
     • Run jps and note down the PIDs of all worker processes
     • For each worker PID, run ps -e -o pid,cputime | grep <pid>
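
A minimal shell sketch of that CPU-time collection, assuming jps lists the Storm worker JVMs under the short main-class name worker:

```
# Report CPU time for every Storm worker JVM after the job finishes.
# Assumes jps shows the worker main class simply as "worker".
for pid in $(jps | awk '$2 == "worker" {print $1}'); do
  ps -e -o pid,cputime | grep "$pid"
done
```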