## A hands-on demo of analyzing big data with Spark

Scan a novel, calculate pi, and run regression on 50 million rows

See https://towardsdatascience.com/a-hands-on-demo-of-analyzing-big-data-with-spark-68cb6600a295

### The analytics framework for big data

Spark is a framework for processing massive amounts of data. It works by partitioning your data into subsets, distributing the subsets to worker nodes (whether they’re logical CPU cores on your laptop [2] or entire machines in a cluster), and then coordinating the workers to analyze the data. In essence, Spark applies a “divide and conquer” strategy.

A simple analogy can help visualize the value of this approach. Let’s say we want to count the number of books in a library. The “expensive computer” approach would be to teach someone to count books as fast as possible, training them for years to count accurately while sprinting. While fun to watch, this approach isn’t that useful: even Olympic sprinters can only run so fast, and you’re out of luck if your book-counter gets injured or decides to change professions!

The Spark approach, meanwhile, would be to get 100 random people, assign each one a section of the library, have them count the books in their section, and then add their answers together. This approach is more scalable, fault-tolerant, and cheaper… and probably still fun to watch.
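The library analogy can be sketched in plain Python, without Spark, to make the divide-and-conquer pattern concrete. This is only an illustration under made-up assumptions: the toy `library`, the helper names `count_books` and `split_into_sections`, and the thread pool standing in for worker nodes are all hypothetical and are not how Spark is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def count_books(shelves):
    """Count the books in one section (a list of shelves)."""
    return sum(len(shelf) for shelf in shelves)


def split_into_sections(shelves, n_sections):
    """Split the library into roughly equal sections, one per counter."""
    size = max(1, -(-len(shelves) // n_sections))  # ceiling division
    return [shelves[i:i + size] for i in range(0, len(shelves), size)]


# A toy library: 1,000 shelves holding 20 books each (made-up numbers)
library = [["book"] * 20 for _ in range(1000)]

# Divide: give each of 100 "people" a section of the library
sections = split_into_sections(library, n_sections=100)

# Conquer: each worker counts its own section independently
with ThreadPoolExecutor(max_workers=8) as pool:
    partial_counts = list(pool.map(count_books, sections))

# Combine: add the partial answers together
total = sum(partial_counts)
print(total)  # 20000
```

If one counter fails, only that section needs recounting, which is roughly the intuition behind the fault tolerance mentioned above.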

Spark’s main data type is the resilient distributed dataset (RDD). An RDD is an abstraction of data distributed in many places, like how the entity “Walmart” is an abstraction of millions of people around the world. Working with RDDs feels like manipulating a simple array in memory, even though the underlying data may be spread across multiple machines.

### Getting started

Spark is mainly written in Scala but can be used from Java, Python, R, and SQL. We’ll use PySpark, the Python interface for Spark. To install PySpark, type `pip install pyspark` in the Terminal. You might also need to install or update Java. You’ll know everything is set up when typing `pyspark` in the Terminal launches an interactive PySpark shell.
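If the `pyspark` command does not start, a quick check from Python can help narrow down what is missing. The sketch below uses only the standard library; the function name `check_pyspark_setup` is hypothetical, and the checks are a rough heuristic (a Java install might be found through either the `PATH` or the `JAVA_HOME` environment variable), not an official diagnostic.

```python
import importlib.util
import os
import shutil


def check_pyspark_setup():
    """Report whether the pieces PySpark needs appear to be present."""
    return {
        # Can Python find the pyspark package?
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
        # Is a `java` executable reachable from the PATH?
        "java_on_path": shutil.which("java") is not None,
        # Is JAVA_HOME set to a non-empty value?
        "java_home_set": bool(os.environ.get("JAVA_HOME")),
    }


status = check_pyspark_setup()
for name, ok in status.items():
    print(f"{name}: {ok}")
```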

In [8]:
# !pyspark   # left commented out: it starts an interactive session that has to be stopped manually

Below is a tiny PySpark demo. We start by manually defining the SparkSession to open a connection to Spark. (If you’re in the PySpark shell, this is already done for you.) We then create an RDD from a range of numbers, look at the first two elements, and print the maximum. With `.getNumPartitions()`, we see that Spark split our data into eight partitions, one for each logical core on my machine.

In [9]:
from pyspark.sql import SparkSession

In [10]:
# Start Spark connection
spark = SparkSession.builder.getOrCreate()

In [11]:
# Allocate the numbers 0-999 to an RDD
numbers = range(1000)
rdd = spark.sparkContext.parallelize(numbers)

# Inspect the RDD
print(rdd.take(2))  # [0, 1]
print(rdd.max())    # 999
print(rdd.getNumPartitions())  # 8 here; depends on the number of logical cores

[0, 1]
999
8


In [12]:
print(rdd.take(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
