
Create dataset for InfoSphere Streams benchmark

Zubair Nabi edited this page Jul 28, 2014 · 21 revisions

Before you begin, make sure you have prepared the dataset following the steps from here: [Preprocess Enron Email Dataset](Preprocess Enron Email Dataset)

Overview

The StreamsPrepareDataset project creates the data set for the Streams benchmark. It reads the email dataset prepared in the previous step and creates a file that stores the emails in InfoSphere Streams binary format.

Prerequisites:

  1. Avro C++: 1.7.4

    Installation Guide: http://avro.apache.org/docs/1.7.4/api/cpp/html/index.html

    Make sure the include files are located at /usr/local/include and shared libraries at /usr/local/lib

  2. Boost 1.54.0 (required by Avro)

    Installation Guide: http://www.boost.org/doc/libs/1_54_0/doc/html/bbv2/installation.html

    Make sure the include files are located at /usr/local/include and shared libraries at /usr/local/lib

  3. Avro Email Schema File

    Create a folder emailavro under /usr/local/include

    Copy email.hh from StreamsAvroOperators/emailavro folder to /usr/local/include/emailavro

    Alternatively, you can regenerate email.hh with the Avro C++ compiler:

    1. Go to folder StreamsAvroOperators/
    2. Run the following command: avrogencpp -i email.avsc -o email.hh

    An email.hh file will be generated, which you need to copy to the /usr/local/include/emailavro directory.
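The header installation in step 3 can be sketched as a small shell function. This is only an illustrative sketch and not part of the project: the function name install_email_header and its optional prefix argument are hypothetical, and writing to the default prefix /usr/local normally requires root privileges.

```shell
# Hypothetical helper (not part of StreamsPrepareDataset):
# copy email.hh into <prefix>/include/emailavro.
# Usage: install_email_header <path-to-email.hh> [prefix]
install_email_header() {
    src="$1"
    prefix="${2:-/usr/local}"   # default matches the location used in this guide

    if [ ! -f "$src" ]; then
        echo "email.hh not found: $src" >&2
        return 1
    fi

    # Create the emailavro folder if it does not exist yet, then copy the header.
    mkdir -p "$prefix/include/emailavro" || return 1
    cp "$src" "$prefix/include/emailavro/email.hh"
}
```

For example, install_email_header StreamsAvroOperators/emailavro/email.hh installs the shipped header. If you regenerated the header with avrogencpp inside StreamsAvroOperators/, pass StreamsAvroOperators/email.hh instead.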

Compilation

To build the application:

  1. Go to the root directory of StreamsPrepareDataset
  2. Run make all at the command line

Set up

Before you can run the application, copy the dataset generated in the previous preprocessing step ([Preprocess Enron Email Dataset](Preprocess Enron Email Dataset)) to the StreamsPrepareDataset/data directory, and make sure the copied file has the .txt extension.
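The copy step above can be sketched as a shell function that also guarantees the .txt extension. This is a minimal sketch under stated assumptions: the function name stage_dataset is hypothetical, and the default data directory assumes you run it from the directory that contains StreamsPrepareDataset.

```shell
# Hypothetical helper (not part of StreamsPrepareDataset):
# copy a preprocessed dataset into the data directory with a .txt extension.
# Usage: stage_dataset <dataset-file> [data-dir]
stage_dataset() {
    src="$1"
    data_dir="${2:-StreamsPrepareDataset/data}"

    if [ ! -f "$src" ]; then
        echo "dataset not found: $src" >&2
        return 1
    fi
    if [ ! -d "$data_dir" ]; then
        echo "data directory not found: $data_dir" >&2
        return 1
    fi

    name=$(basename "$src")
    name="${name%.txt}"          # strip an existing .txt so it is not doubled

    cp "$src" "$data_dir/$name.txt" || return 1
    echo "$data_dir/$name.txt"   # print where the copy ended up
}
```

The function prints the path of the staged file, so the result can be checked before submitting the job.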

Execution

  1. Make sure a Streams instance is created and running.
  2. To submit the job to the Streams instance: streamtool submitjob -i <instanceName> output/Main/Distributed/Main.adl -P filename=<input filename in data dir>

Note that filename must not include the extension: if the file in the data folder is foobar.txt, then filename should be foobar.
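To avoid passing the extension by mistake, the filename parameter can be derived from the file in the data directory. The sketch below only prints the submit command rather than running it. Both function names (streams_filename and print_submit_command) are hypothetical, and the streamtool arguments are copied from the Execution step above.

```shell
# Hypothetical helpers (not part of StreamsPrepareDataset).

# Derive the value for -P filename= by dropping the directory and the .txt extension.
streams_filename() {
    basename "$1" .txt
}

# Print (do not run) the submit command for a given instance and data file.
# Usage: print_submit_command <instanceName> <file-in-data-dir>
print_submit_command() {
    echo "streamtool submitjob -i $1 output/Main/Distributed/Main.adl -P filename=$(streams_filename "$2")"
}
```

Review the printed command before running it against a live instance.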

Next Step:

[Running InfoSphere Streams benchmark ](Running InfoSphere Streams benchmark )