# Example Spark job revisited

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession
import random

Create a local Spark session with a single worker thread (`local[1]` runs Spark in local mode, inside one JVM on this machine)

In [3]:
spark = SparkSession.builder.appName("Pi").master("local[1]").getOrCreate()

Create a local data set of 10 million integers: [0, 1, ..., 9999999]

In [4]:
num_samples = 10000000
local_data = list(range(num_samples))
local_data[:10]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Convert the local data set into an RDD living in the JVM

In [5]:
rdd = spark.sparkContext.parallelize(local_data)

Add a filter (transformations are lazy, so nothing is executed yet...)

We are applying a Python function to filter an RDD that lives in the JVM - magic! (Behind the scenes, PySpark pickles the function and runs it in Python worker processes.)

In [6]:
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

filtered_rdd = rdd.filter(inside)
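
A rough plain-Python analogy for the lazy behaviour described above, not a picture of how Spark implements it: a generator expression also defers calling its predicate until something consumes it. The names `inside_counted` and `calls`, and the sample size of 1000, are illustrative assumptions and do not appear elsewhere in this notebook.

```python
import random

calls = 0  # how many times the predicate has actually run


def inside_counted(p):
    """Same test as inside(), plus a call counter (hypothetical helper)."""
    global calls
    calls += 1
    x, y = random.random(), random.random()
    return x*x + y*y < 1


# Building the generator does not call the predicate,
# loosely like rdd.filter(inside) returning immediately.
lazy = (p for p in range(1000) if inside_counted(p))
calls_before = calls

# Consuming the generator plays the role of an action such as count().
count = sum(1 for _ in lazy)
calls_after = calls

print(calls_before, calls_after)
```

The counter stays at zero until the generator is consumed, at which point the predicate runs once per element.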

Count the size of the filtered data set (this is an action, so it triggers execution) and use it to estimate pi

The result is returned from the JVM to Python via Py4J

In [7]:
%%time
count = filtered_rdd.count()
pi = 4 * count / num_samples
print(pi)

3.1415496
Wall time: 6.16 s
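
Why `4 * count / num_samples` estimates pi: a point drawn uniformly from the unit square falls inside the quarter circle of radius 1 with probability pi/4, so the fraction of hits multiplied by 4 approaches pi. Below is a minimal pure-Python sketch of the same calculation without Spark. The function name `estimate_pi`, the fixed seed, and the smaller sample size are assumptions made for illustration.

```python
import random


def estimate_pi(n, seed=42):
    """Monte Carlo estimate of pi from n random points in the unit square."""
    rng = random.Random(seed)  # seeded so that repeated runs agree
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x*x + y*y < 1:      # the same test as inside()
            hits += 1
    return 4 * hits / n


estimate = estimate_pi(100_000)
print(estimate)
```

The error shrinks roughly as 1/sqrt(n), which is why the Spark runs use millions of samples.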


Stop the Spark session

In [8]:
spark.stop()

Spark is distributed, so we should be able to speed things up by adding another worker thread

In [9]:
spark = SparkSession.builder.appName("Pi").master("local[2]").getOrCreate()
rdd = spark.sparkContext.parallelize(range(num_samples))
filtered_rdd = rdd.filter(inside)

Fingers crossed 🤞

In [10]:
%%time
count = filtered_rdd.count()
pi = 4 * count / num_samples
print(pi)

3.141424
Wall time: 3.46 s
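
To put a number on the improvement, here is a quick calculation using the two wall times printed above. Treat it as a rough indication only: each timing comes from a single run, and the second run builds its RDD from `range(num_samples)` rather than from the `local_data` list, so the thread count is not the only difference. The variable names are illustrative.

```python
t_one_thread = 6.16   # seconds, wall time of the local[1] run
t_two_threads = 3.46  # seconds, wall time of the local[2] run

speedup = t_one_thread / t_two_threads
efficiency = speedup / 2  # fraction of the ideal 2x speedup

print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
# prints: speedup: 1.78x, efficiency: 89%
```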


You can monitor progress using the web UI, which displays useful information about the application: http://localhost:4040

`parallelize` takes an optional second argument telling Spark how many partitions to split our data into

In [11]:
rdd = spark.sparkContext.parallelize(range(num_samples*10), 100)
filtered_rdd = rdd.filter(inside)
count = filtered_rdd.count()
pi = 4 * count / (num_samples*10)
print(pi)

3.14182448
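
To give a feel for what 100 partitions means for this input, the sketch below splits a range into contiguous, near-equal chunks in plain Python. `split_range` is a hypothetical helper written for illustration, and the exact boundaries Spark chooses may differ. On a live session, `rdd.getNumPartitions()` reports the real partition count.

```python
def split_range(n, num_partitions):
    """Split range(n) into num_partitions contiguous, near-equal ranges."""
    bounds = [i * n // num_partitions for i in range(num_partitions + 1)]
    return [range(bounds[i], bounds[i + 1]) for i in range(num_partitions)]


# Same sizes as the cell above: 100 million samples, 100 partitions
parts = split_range(100_000_000, 100)
print(len(parts), len(parts[0]))
```

Each partition becomes one task, so using more partitions than worker threads lets Spark hand out work in smaller pieces as threads become free.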


In [12]:
spark.stop()