Benchmark

To show the efficiency of our column-based RDD, we measure performance with and without a GPU by running a simple logistic regression program that uses map() and reduce().

We achieved a 3.15x performance improvement for logistic regression (SparkGPULR in examples) on a 16-thread IvyBridge box with an NVIDIA K40 GPU card, compared with the same box using no GPU card. There is still room to improve performance (e.g. by eliminating the data copy between map() and reduce()).
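For readers unfamiliar with the workload, the following is a minimal pure-Python sketch of the map()/reduce() pattern that a logistic regression program of this kind typically follows. It is not the benchmark source: the helper names (`sigmoid`, `gradient`, `add_vectors`, `train`), the synthetic data generator, and the unit step size are assumptions made for illustration only.

```python
import random
from functools import reduce
from math import exp


def sigmoid(z):
    # Numerically stable logistic function (avoids overflow in exp).
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def generate_points(n, d, seed=42):
    # Hypothetical synthetic data: labels alternate between +1 and -1.
    rng = random.Random(seed)
    points = []
    for i in range(n):
        y = 1.0 if i % 2 == 0 else -1.0
        x = [rng.gauss(0.0, 1.0) + 0.7 * y for _ in range(d)]
        points.append((x, y))
    return points


def gradient(point, w):
    # Per-point contribution to the gradient: the map() step.
    x, y = point
    dot = sum(wi * xi for wi, xi in zip(w, x))
    scale = (sigmoid(y * dot) - 1.0) * y
    return [scale * xi for xi in x]


def add_vectors(a, b):
    # Element-wise sum of two gradients: the reduce() step.
    return [ai + bi for ai, bi in zip(a, b)]


def train(points, d, iterations, seed=0):
    rng = random.Random(seed)
    w = [2.0 * rng.random() - 1.0 for _ in range(d)]
    for _ in range(iterations):
        grad = reduce(add_vectors, map(lambda p: gradient(p, w), points))
        w = [wi - gi for wi, gi in zip(w, grad)]  # assumed step size of 1
    return w
```

In this shape, the map() step is independent per point and is the natural candidate for offloading to a GPU, while the reduce() step combines D-element vectors.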

Benchmark source files:

Spark code for non-GPU version

Spark code for GPU version, CUDA code

Program parameters:

N=1000000
D=400
ITERATIONS=5
Slices=128 (w/o GPU), 16 (with GPU)
MASTER=local[8] (w/o GPU), local[8] (with GPU)
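
As a rough guide to what these parameters mean for data volume, the sketch below computes points per slice and raw feature bytes. It assumes each feature value is stored as an 8-byte double and ignores labels and object overhead; both are assumptions, not statements about the benchmark's actual memory layout. The function name `workload` is hypothetical.

```python
def workload(n, d, slices, bytes_per_value=8):
    """Return (points per slice, feature bytes per slice, total feature bytes).

    bytes_per_value=8 assumes double-precision features (an assumption).
    """
    total_bytes = n * d * bytes_per_value
    return n / slices, total_bytes / slices, total_bytes


# Parameters from the list above.
no_gpu = workload(1000000, 400, 128)
with_gpu = workload(1000000, 400, 16)
print(no_gpu)    # 128 slices: 7812.5 points and 25,000,000 bytes per slice
print(with_gpu)  # 16 slices: 62500 points and 200,000,000 bytes per slice
```

Under these assumptions the feature data totals 3.2 GB in both runs; the GPU run simply uses fewer, larger slices.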

Configurations:

Machine: nx360 M4, 2 sockets 8-core Intel Xeon E5-2667 3.3GHz, 256GB memory, with one NVIDIA K40m card
OS: RedHat 6.6
CUDA: 7.0
Java: IBM Java8 pxa6480sr2-20151023_01(SR2)
Spark version: https://github.com/kiszk/spark-gpu/commit/34e9b75c0cab297ed7feb8aef7072164b6a5972c

spark-env.sh
JAVA_HOME=/u/ishizaki/ibm-java-x86_64-802
CUDA_DEVICE_MAX_CONNECTION=32
CUDA_VISIBLE_DEVICES=0

spark-defaults.conf
spark.driver.extraJavaOptions           -Xmn96g -Xgcthreads8 -Xdump:system:none -Xdump:heap:none -Xtrace:none -Xnoloa -Xdisableexplicitgc
spark.eventLog.enabled          true
spark.eventLog.dir              file:///tmp/eventlog-ishizaki
spark.history.fs.logDirectory   file:///tmp/eventlog-ishizaki
spark.driver.cores      16
spark.driver.memory     144g
spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max 512m
spark.akka.frameSize    1024
spark.history.ui.port   18080

How to run:

non-GPU version
$ MASTER='local[8]' bin/run-example SparkLR 128 1000000 400 5
GPU version
$ MASTER='local[8]' bin/run-example SparkGPULR 16 1000000 400 5
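
To repeat both runs and compare them, a small driver along the following lines could be used. This is only a sketch: the function names are hypothetical, it assumes the commands are launched from the Spark checkout directory, and it measures wall-clock time for the whole process (including JVM startup), which may not match how the 3.15x figure was obtained.

```python
import os
import subprocess
import time


def run_example_command(example, slices, n, d, iterations):
    # Argument order follows the commands above: slices, N, D, ITERATIONS.
    return ["bin/run-example", example,
            str(slices), str(n), str(d), str(iterations)]


def time_run(command, master="local[8]", spark_home="."):
    # Run one example with MASTER set and return elapsed seconds.
    env = dict(os.environ, MASTER=master)
    start = time.perf_counter()
    subprocess.run(command, cwd=spark_home, env=env, check=True)
    return time.perf_counter() - start


def speedup(baseline_seconds, gpu_seconds):
    # Ratio of non-GPU time to GPU time.
    return baseline_seconds / gpu_seconds
```

For example, `time_run(run_example_command("SparkLR", 128, 1000000, 400, 5))` followed by the same call for `"SparkGPULR"` with 16 slices gives the two timings to pass to `speedup`.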
