# Intro to Spark

We've just learned about some of the design patterns in Hadoop MapReduce. However, Hadoop is a legacy technology that requires significant overhead to set up and configure, even with a modern Docker container. Instead, we will look at the limitations of Hadoop MapReduce through the lens of a more modern distributed framework: Spark. By abstracting away many parallelization details, Spark provides a flexible interface for the programmer. A word of warning, however: don't let the ease of implementation lull you into complacency. Scalable solutions still require attention to the details of smart algorithm design.

Our goal is to get you up to speed and coding in Spark as quickly as possible; this is by no means a comprehensive tutorial. By the end of today's demo you should be able to:  
* ... __initialize__ a `SparkSession` in a local notebook and use it to run a Spark Job.
* ... __access__ the Spark Job Tracker UI.
* ... __describe__ and __create__ RDDs from files or local Python objects.
* ... __explain__ the difference between actions and transformations.
* ... __decide__ when to `cache` or `broadcast` part of your data.
* ... __implement__ Word Counting, Sorting and Naive Bayes in Spark. 

__`NOTE:`__ Although Spark DataFrames, the successor datatype to RDDs, are becoming more common in production settings, we've made a deliberate choice to teach you RDDs first because building homegrown algorithm implementations is crucial to developing a deep understanding of machine learning and parallelization concepts, which is the goal of this course. We'll still touch on DataFrames in Week 5 when talking about Spark efficiency considerations, and we'll do a deep dive into Spark DataFrames and streaming solutions in Week 12.

__`Additional Resources:`__ The official documentation pages offer a user-friendly overview of the material covered in this week's readings: [Spark RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-programming-guide).

### Notebook Set-Up

In [1]:
# imports
import re
import numpy as np
import pandas as pd

In [2]:
# store path to notebook
PWD = !pwd
PWD = PWD[0]

In [17]:
# make data directory if it doesn't already exist
!mkdir -p data

[]

### Load the data
Today we'll mostly be working with toy examples and data created on the fly in Python. However, at the end of this demo we'll revisit Word Count and Naive Bayes using some of the data we've seen before. Run the following cells to re-load the _Alice in Wonderland_ text and the 'Chinese' toy example.

In [3]:
# (Re)Download Alice Full text from Project Gutenberg - RUN THIS CELL AS IS (if Option 1 failed)
# NOTE: feel free to replace 'curl' with 'wget' or equivalent command of your choice.
!curl -L "http://www.gutenberg.org/files/11/11-0.txt" -o data/alice.txt
ALICE_TXT = PWD + "/data/alice.txt"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0


In [4]:
%%writefile data/chineseTrain.txt
D1	1		Chinese Beijing Chinese
D2	1		Chinese Chinese Shanghai
D3	1		Chinese Macao
D4	0		Tokyo Japan Chinese

Writing data/chineseTrain.txt


In [5]:
%%writefile data/chineseTest.txt
D5	1		Chinese Chinese Chinese Tokyo Japan
D6	1		Beijing Shanghai Trade
D7	0		Japan Macao Tokyo
D8	0		Tokyo Japan Trade

Writing data/chineseTest.txt


In [6]:
# naive bayes toy example data paths - ADJUST AS NEEDED
TRAIN_PATH = PWD + "/data/chineseTrain.txt"
TEST_PATH = PWD + "/data/chineseTest.txt"
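
Each line of the toy files above packs a document ID, a class label, and the document text into one record, so it has to be split into fields before any counting can happen. Here is a minimal sketch of that parsing step in plain Python. It assumes the fields are separated by one or more tab characters; the `parse_record` helper and its field names are hypothetical and not part of the course code.

```python
import re

def parse_record(line):
    """Split one toy-data line into (doc_id, label, words)."""
    # Assumption: fields are separated by runs of tabs, as in the cells above.
    doc_id, label, text = re.split(r"\t+", line.strip(), maxsplit=2)
    return doc_id, int(label), text.split()

record = parse_record("D1\t1\t\tChinese Beijing Chinese")
print(record)
```

Once a Spark context is available, a function like this could be applied to every line of a text-file RDD with `map`.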

# Exercise 1. Getting started with Spark.

For more information, please read Ch 3-4 from _Learning Spark: Lightning-Fast Big Data Analysis_ by Karau et al., as well as a few blog posts that set the stage for Spark. From these readings you should be familiar with each of the following terms:

* __Spark session__
* __Spark context__
* __driver program__
* __executor nodes__
* __resilient distributed datasets (RDDs)__
* __pair RDDs__
* __actions__ and __transformations__
* __lazy evaluation__

The first code block below shows you how to start a `SparkSession` in a Jupyter Notebook. Next we show a simple example of creating and transforming a Spark RDD. Let's use this as a quick vocab review before we dive into more interesting examples. 

In [8]:
from pyspark.sql import SparkSession
app_name = "pyspark_demo"
master = "local[*]"

spark = SparkSession\
        .builder\
        .appName(app_name)\
        .master(master)\
        .getOrCreate()

sc = spark.sparkContext

In [9]:
# a small example
myData = sc.parallelize(range(1,100))
squares = myData.map(lambda x: (x,x**2))
oddSquares = squares.filter(lambda x: x[1] % 2 == 1)

In [10]:
oddSquares.take(5)

[(1, 1), (3, 9), (5, 25), (7, 49), (9, 81)]
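
To build intuition for the example above, here is a rough plain-Python analogy for lazy evaluation. It uses no Spark at all. Python's built-in `map` and `filter` stand in for transformations and `itertools.islice` stands in for the `take` action. The `calls` list and the `square_pair` helper are hypothetical names added only to record when work actually happens. Spark schedules work over partitions, so this sketch does not show how Spark itself counts or orders evaluations.

```python
from itertools import islice

calls = []  # records which inputs were actually squared

def square_pair(x):
    calls.append(x)
    return (x, x ** 2)

# Like Spark transformations, map and filter only describe the work.
my_data = range(1, 100)
squares = map(square_pair, my_data)
odd_squares = filter(lambda pair: pair[1] % 2 == 1, squares)
evaluated_before = len(calls)  # nothing has been computed yet

# Like an action, islice pulls just enough elements through the pipeline.
first_five = list(islice(odd_squares, 5))
evaluated_after = len(calls)

print(evaluated_before, evaluated_after)
print(first_five)
```

In this sketch no squaring happens until the final step asks for results, and even then only as many inputs are processed as are needed to produce five odd squares.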

 > __DISCUSSION QUESTIONS:__ For each key term from the reading, briefly explain what it means in the context of this demo code. Specifically:
 * _What is the 'driver program' here?_
 * _What does the Spark context do? Do we have 'executors' per se?_
 * _List all RDDs and pair RDDs present in this example._
 * _List all transformations present in this example._
 * _List all actions present in this example._
 * _What does the concept of 'lazy evaluation' mean about the time it would take to run each cell in the example?_
 * _If we were working on a cluster, where would each transformation happen? Would the data get shuffled?_