
Short Name

Explore Spark SQL and its performance using TPC-DS workload

Short Description

Learn how to set up and run the TPC-DS benchmark to evaluate and measure the performance of your Spark SQL system.

Offering Type

Cognitive

Introduction

This journey demonstrates how to evaluate and test your Apache Spark cluster by using TPC-DS benchmark workloads. Two modes of execution are described: 1) using an interactive command-line shell script, and 2) using a Jupyter notebook running in the IBM Data Science Experience (DSX).

Author

By Dilip Biswal and Rich Hagarty

Code

Demo

  • N/A

Video

  • N/A

Overview

Apache Spark is a popular distributed data processing engine that is built around speed, ease of use, and sophisticated analytics, with APIs in Java, Scala, Python, R, and SQL. Like other data processing engines, Spark has a unified optimization engine that computes the optimal way to execute a workload, with the main purpose of reducing disk I/O and CPU usage.

We can evaluate and measure the performance of Spark SQL using the TPC-DS benchmark. TPC-DS is a widely used industry-standard decision support benchmark for evaluating the performance of data processing engines. Given that TPC-DS exercises some key data warehouse features, running TPC-DS successfully reflects the readiness of Spark in terms of addressing the needs of a data warehouse application. Apache Spark v2.0 supports all ninety-nine decision support queries that are part of the TPC-DS benchmark.
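To make this concrete, here is a minimal sketch of issuing one of those queries against Spark SQL in local mode. The query file path is a placeholder; the toolkit described in this journey generates the actual query files, and the tables must already exist (see the Flow section).

```python
# Minimal sketch: run one TPC-DS query against Spark SQL in local mode.
# The query file path is hypothetical; the toolkit generates the real
# query files, and the Spark tables must already have been created.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("tpcds-sanity-check")
         .getOrCreate())

with open("queries/query96.sql") as f:  # placeholder path to a generated query
    query = f.read()

spark.sql(query).show()
```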

There are many other useful reasons for running TPC-DS on your Spark installation:

  • To sanity-check the installation and catch configuration issues.
  • To compare Spark against other potential competing engines.
  • To run before/after tests that verify performance gains when upgrading.

This journey is aimed at helping Spark developers quickly set up and run the TPC-DS benchmark in their own development environment.

When the reader has completed this journey, they will understand the following:

  • How to set up the TPC-DS toolkit
  • How to generate TPC-DS datasets at different scale factors
  • How to create Spark database artifacts (see the sketch after this list)
  • How to run TPC-DS benchmark queries on Spark in local mode and see the results
  • Things to consider when increasing the data scale and running against a Spark cluster
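As a preview of the "Spark database artifacts" step, the sketch below shows one way to register a TPC-DS table over the pipe-delimited .dat files that the toolkit's data generator produces. The database name, file path, and abbreviated schema are illustrative; the journey's scripts define and create the full set of TPC-DS tables for you.

```python
# Minimal sketch, assuming the toolkit has written pipe-delimited .dat
# files under /tmp/tpcds-data (an illustrative path). Only a few date_dim
# columns are shown; the journey's setup scripts supply the full schemas.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("tpcds-setup").getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS tpcds")
spark.sql("""
    CREATE TABLE IF NOT EXISTS tpcds.date_dim (
        d_date_sk INT,
        d_date_id STRING,
        d_date    DATE
        -- remaining date_dim columns elided for brevity
    )
    USING csv
    OPTIONS (path '/tmp/tpcds-data/date_dim.dat', sep '|')
""")
```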

Flow

  • Command line
    1. Compile the toolkit and use it to generate the TPC-DS dataset.
    2. Create the Spark tables and generate the TPC-DS queries.
    3. Run the entire query set or a subset of queries and monitor the results.
  • Notebook (sketched below)
    1. Create the Spark tables from a pre-generated dataset.
    2. Run the entire query set or an individual query.
    3. View the query results or performance summary.
    4. View the performance graph.
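The notebook steps of running queries and summarizing performance can be sketched as follows. This is an illustration under stated assumptions, not the journey's actual notebook code: it presumes the tpcds tables already exist and that the generated query files sit under queries/.

```python
# Minimal sketch: time a subset of the TPC-DS queries and print a summary,
# roughly as the notebook's performance-summary step does. The path and
# query subset are illustrative.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("tpcds-run").getOrCreate()
spark.sql("USE tpcds")

timings = []
for n in (3, 7, 96):  # any subset of the 99 queries
    with open("queries/query{}.sql".format(n)) as f:
        sql = f.read()
    start = time.time()
    spark.sql(sql).collect()  # force full execution of the query
    timings.append((n, time.time() - start))

for n, elapsed in timings:
    print("query{}: {:.1f}s".format(n, elapsed))
```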

Included components

  • Apache Spark: An open-source, fast, and general-purpose cluster computing system.
  • Jupyter Notebook: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Featured technologies

  • Data Science: Systems and scientific methods to analyze structured and unstructured data in order to extract knowledge and insights.
  • Artificial Intelligence: Artificial intelligence can be applied to disparate solution spaces to deliver disruptive technologies.
  • Python: Python is a programming language that lets you work more quickly and integrate your systems more effectively.

Blog

https://developer.ibm.com/code/?p=23793&preview=true

Links