In [1]:
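import pyspark

# Reuse an existing session if the notebook already has one; otherwise create it.
try:
    spark
except NameError:
    spark = (
        pyspark.sql.SparkSession.builder
        .master("local[*]")
        .appName("BD course")
        .config("spark.hadoop.validateOutputSpecs", "false")
        .getOrCreate()
    )
# Both names are used later, so always derive sc from the session.
sc = spark.sparkContext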

# Whole stage codegen

Whole-Stage Code Generation (aka Whole-Stage CodeGen) fuses multiple operators (a subtree of the physical plan whose operators all support code generation) into a single Java function in order to improve execution performance. It collapses such a subtree into a single optimized function, which eliminates virtual function calls and keeps intermediate data in CPU registers.

- Identify chains of operators (a “stage” is a chain of transformations that ends at a shuffle)
- Compile each stage into a single function (i.e. one can have map -> map -> flatMap -> groupBy -> sum() compiled into a single function)
- Functionality of a general-purpose execution engine; performance as if a system had been hand-built just to run your query


**Note: Spark uses Janino to compile the generated Java source code into Java classes.**
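
To make the idea of fusing operators concrete, here is a minimal pure-Python sketch. It is not Spark's generated code: the `Scan`, `Filter`, `Project` and `SumAgg` classes and both functions are hypothetical stand-ins. It only contrasts a chain of operators that pass rows to each other one call at a time with a single hand-written loop that computes the same result.

```python
# Illustration only: plain Python stand-ins, not Spark's actual operators
# or its generated Java code.

class Scan:
    """Leaf operator: produces the numbers 0..n-1 one row at a time."""
    def __init__(self, n):
        self.n = n

    def rows(self):
        for i in range(self.n):
            yield i


class Filter:
    """Keeps only the rows for which the predicate is true."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def rows(self):
        for row in self.child.rows():
            if self.predicate(row):
                yield row


class Project:
    """Applies a function to every row."""
    def __init__(self, child, fn):
        self.child = child
        self.fn = fn

    def rows(self):
        for row in self.child.rows():
            yield self.fn(row)


class SumAgg:
    """Consumes all rows of its child and returns their sum."""
    def __init__(self, child):
        self.child = child

    def result(self):
        total = 0
        for row in self.child.rows():
            total += row
        return total


def iterator_model(n):
    # Every row crosses three operator boundaries before it is summed.
    plan = SumAgg(Project(Filter(Scan(n), lambda x: x % 2 == 0), lambda x: x * 3))
    return plan.result()


def fused(n):
    # The same query written as one loop: no per-row calls between operators,
    # and the running total lives in a single local variable.
    total = 0
    for i in range(n):
        if i % 2 == 0:
            total += i * 3
    return total


print(iterator_model(1000) == fused(1000))
```

Both functions return the same value; the fused version is what a hand-written program for this one query would look like, which is the shape whole-stage code generation aims to produce automatically.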

In [2]:
import time

# Define a simple benchmark function for measuring time taken
def benchmark(name, f):
    startTime = time.time()
    f()
    endTime = time.time()
    print("Time taken in %s: %.4f seconds" % (name, endTime - startTime))

## How fast can Spark 1.6 sum up 1 billion numbers?

In [14]:
# This config turns off whole stage code generation, effectively changing the execution path to be similar to Spark 1.6.
spark.conf.set("spark.sql.codegen.wholeStage", "false")

In [15]:
f = lambda: spark.range(1000 * 1000 * 1000).selectExpr("sum(id)").show()
benchmark("Spark 1.6", f)

+------------------+
|           sum(id)|
+------------------+
|499999999500000000|
+------------------+

Time taken in Spark 1.6: 23.9768 seconds


## How fast can Spark 1.6 join 1 billion records?

In [16]:
# This config turns off whole stage code generation, effectively changing the execution path to be similar to Spark 1.6.
spark.conf.set("spark.sql.codegen.wholeStage", "false")

In [18]:
f = lambda: spark.range(1000 * 1000 * 1000).join(spark.range(1000), "id").count()
benchmark("Spark 1.6", f)

Time taken in Spark 1.6: 37.7517 seconds


## How fast can Spark 2.0 sum up 1 billion numbers?

In [3]:
# Now we turn on whole stage code generation to get the full Spark 2.0 experience
spark.conf.set("spark.sql.codegen.wholeStage", "true")

In [4]:
f = lambda: spark.range(1000 * 1000 * 1000).selectExpr("sum(id)").show()
benchmark("Spark 2.0", f)

+------------------+
|           sum(id)|
+------------------+
|499999999500000000|
+------------------+

Time taken in Spark 2.0: 5.2450 seconds


## How fast can Spark 2.0 join 1 billion records?

In [5]:
spark.conf.set("spark.sql.codegen.wholeStage", "true")

In [6]:
f = lambda: spark.range(1000 * 1000 * 1000).join(spark.range(1000), "id").count()
benchmark("Spark 2.0", f)

Time taken in Spark 2.0: 0.7305 seconds
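
The four timings printed above can be turned into speedup factors with a little arithmetic. The snippet below simply copies those numbers from this run; the `timings` dictionary and its key names are made up for illustration, and the figures will differ on other machines.

```python
# Wall-clock times in seconds, copied from the cell outputs above.
timings = {
    "sum":  {"codegen_off": 23.9768, "codegen_on": 5.2450},
    "join": {"codegen_off": 37.7517, "codegen_on": 0.7305},
}

# Speedup = time without whole-stage codegen / time with it.
speedups = {
    query: t["codegen_off"] / t["codegen_on"]
    for query, t in timings.items()
}

for query, factor in speedups.items():
    print("%s: %.1fx faster with whole-stage codegen" % (query, factor))
```

On this run that is roughly 4.6x for the sum and roughly 51.7x for the join.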


## How fast can NumPy sum up 1 billion numbers?

In [None]:
import numpy as np

f = lambda: np.sum(np.arange(1000 * 1000 * 1000))
benchmark("NumPy", f)

## How fast can Pandas join 100 million numbers?

In [None]:
import numpy as np
import pandas as pd

a = pd.DataFrame({'id': np.arange(1000 * 1000 * 100)})
b = pd.DataFrame({'id': np.arange(1000)})
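# DataFrame.join with on='id' matches a's 'id' column against b's index,
# which here happens to hold the same values as b's 'id' column.
f = lambda: a.join(b, on='id', how='inner', lsuffix="_left", rsuffix="_right")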
benchmark("Pandas", f)

**Note: for this example there is an obviously faster way if we can assume that the keys can be used directly as array positions. For arbitrary keys, however, this approach won't work.**

In [None]:
X = np.arange(1000 * 1000 * 1000)
keys = np.arange(1000)
f = lambda: np.shape(X[keys])[0]
benchmark("Numpy indexing", f)
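
For arbitrary keys, where the key cannot be used as an array position, a common alternative is a hash join: build a dictionary from the small side, then probe it once per row of the large side. The sketch below is a plain-Python illustration under that assumption; `hash_join` is a hypothetical helper, not a NumPy, pandas or Spark function.

```python
def hash_join(left_keys, right_keys):
    """Inner join on hashable keys.

    Returns a list of (left_position, right_position) pairs for every
    combination of rows whose keys are equal.
    """
    # Build phase: map each key of the (small) right side to its positions.
    build = {}
    for j, key in enumerate(right_keys):
        build.setdefault(key, []).append(j)

    # Probe phase: one dictionary lookup per row of the (large) left side.
    matches = []
    for i, key in enumerate(left_keys):
        for j in build.get(key, ()):
            matches.append((i, j))
    return matches


left = ["apple", "pear", "kiwi", "apple"]
right = ["kiwi", "apple"]
pairs = hash_join(left, right)
print(pairs)
```

Keys here are strings, so positional indexing as in the cell above would not apply, while the dictionary lookup still works.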