<div style="color:red;font-weight:bold;background:yellow;text-align:center;padding:10px;border:solid">
    <h1>RUN IN EMR CLUSTER ONLY</h1>
    If the URL of the current page does not begin with "ec2", then do <em>NOT</em> proceed!
</div>

# PySpark

This lab first uses PySpark to parallelize a simple computation, then works through a concrete example (estimating pi), and finishes with accumulators and broadcast variables.

### Connecting to PySpark

In [2]:
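# ask the shell for this machine's hostname (IPython "!" syntax)
name = !hostname
if "dsa" in name[0]:
    raise RuntimeError("Only run this notebook in the EMR Cluster!")
# findspark locates the Spark installation and makes pyspark importable
import findspark
findspark.init()
from pyspark import SparkContext, SparkConf

# configure the application and start the connection to the cluster
conf = SparkConf().setAppName("pyspark-lab")
sc = SparkContext(conf=conf)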

### Creating a Resilient Distributed Dataset

`sc.parallelize()` turns a local Python collection into a Resilient Distributed Dataset (RDD) that lives on the cluster. This enables parallel computation.

In [2]:
# create the original data
data = range(0, 100)
# now that we have the data, we can distribute it as an RDD
parallel_data = sc.parallelize(data)
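
To see how the data ends up spread over the cluster, the partitions of an RDD can be inspected. The following is a minimal sketch: `partition_sizes` is a hypothetical helper name, the commented usage assumes the `sc` created earlier, and `numSlices=4` is an arbitrary choice.

```python
def partition_sizes(rdd):
    """Return how many elements each partition of an RDD holds."""
    # glom() groups the elements of every partition into one list,
    # so each collected item corresponds to one partition
    return [len(part) for part in rdd.glom().collect()]

# In the notebook, with the `sc` created above:
#   rdd = sc.parallelize(range(0, 100), numSlices=4)
#   rdd.getNumPartitions()
#   partition_sizes(rdd)
```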

### Using an RDD to Parallelize
Now that we have an RDD, we can perform operations such as `map` or `reduce`.

For more details and the full list of available operations, see the [Spark programming guide](https://spark.apache.org/docs/2.1.1/programming-guide.html#rdd-operations).

Let's double every odd number and halve every even number in the data. To do this, we must first write a function that takes a single element and returns the halved or doubled value.

In [3]:
def half_or_double(value):
    return value / 2 if value % 2 == 0 else value * 2

# hand it off
transformed_data = parallel_data.map(half_or_double)
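
Because the function passed to `map` is ordinary Python, it can be checked locally before it is handed to the cluster. The sketch below does that without Spark and assumes Python 3. `half_or_double_int` is a hypothetical variant, not part of the lab, that uses floor division so halved values stay integers.

```python
def half_or_double(value):
    # same logic as the lab's function: "/" is true division,
    # so even inputs come back as floats
    return value / 2 if value % 2 == 0 else value * 2

def half_or_double_int(value):
    # hypothetical variant: "//" keeps the halved value an int
    return value // 2 if value % 2 == 0 else value * 2

sample = [0, 1, 2, 3, 4]
local_floats = [half_or_double(v) for v in sample]
local_ints = [half_or_double_int(v) for v in sample]
print(local_floats)  # [0.0, 2, 1.0, 6, 2.0]
print(local_ints)    # [0, 2, 1, 6, 2]
```

The float results of the original function are the reason the printing cell wraps each value in `int(...)`.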

### Getting and Checking Results

Now we use the `collect` action to bring the results back as a list, so that we can compare them with the original data.

In [4]:
# get list
result_data = transformed_data.collect()
# print results
for original, transformed in zip(data, result_data):
    print("{} --> {}".format(original, int(transformed)))

0 --> 0
1 --> 2
2 --> 1
3 --> 6
4 --> 2
5 --> 10
6 --> 3
7 --> 14
8 --> 4
9 --> 18
10 --> 5
11 --> 22
12 --> 6
13 --> 26
14 --> 7
15 --> 30
16 --> 8
17 --> 34
18 --> 9
19 --> 38
20 --> 10
21 --> 42
22 --> 11
23 --> 46
24 --> 12
25 --> 50
26 --> 13
27 --> 54
28 --> 14
29 --> 58
30 --> 15
31 --> 62
32 --> 16
33 --> 66
34 --> 17
35 --> 70
36 --> 18
37 --> 74
38 --> 19
39 --> 78
40 --> 20
41 --> 82
42 --> 21
43 --> 86
44 --> 22
45 --> 90
46 --> 23
47 --> 94
48 --> 24
49 --> 98
50 --> 25
51 --> 102
52 --> 26
53 --> 106
54 --> 27
55 --> 110
56 --> 28
57 --> 114
58 --> 29
59 --> 118
60 --> 30
61 --> 122
62 --> 31
63 --> 126
64 --> 32
65 --> 130
66 --> 33
67 --> 134
68 --> 34
69 --> 138
70 --> 35
71 --> 142
72 --> 36
73 --> 146
74 --> 37
75 --> 150
76 --> 38
77 --> 154
78 --> 39
79 --> 158
80 --> 40
81 --> 162
82 --> 41
83 --> 166
84 --> 42
85 --> 170
86 --> 43
87 --> 174
88 --> 44
89 --> 178
90 --> 45
91 --> 182
92 --> 46
93 --> 186
94 --> 47
95 --> 190
96 --> 48
97 --> 194
98 --> 49
99 --> 198

# Estimating Pi
Now that we have seen a very simple example, let's use this new ability to estimate pi, the mathematical constant.

Pi can be estimated by throwing darts at a dart board. A more mathematical explanation is [here](http://www.thephysicsmill.com/2014/05/03/throwing-darts-pi/).

We will perform the same steps as before to do this, and then we will compare with the actual value of pi.
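
The reasoning behind the estimate can be written out briefly. It assumes each dart lands uniformly at random in the unit square, with $N$ darts thrown and "hits" counting those that land inside the quarter circle of radius 1:

$$
P(x^2 + y^2 < 1)
= \frac{\text{area of quarter circle}}{\text{area of unit square}}
= \frac{\pi / 4}{1}
= \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi \approx 4 \cdot \frac{\text{hits}}{N}
$$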

### Initialize Data and Create RDD

In [5]:
NUM_SAMPLES = 10000
data = range(0, NUM_SAMPLES)

parallel_data = sc.parallelize(data)

### Handoff For Computation

In [6]:
import random
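def throw_dart(value):
    # value is ignored; each element simply triggers one random throw
    x, y = random.random(), random.random()
    # keep the element only if the dart landed inside the quarter circle
    return x**2 + y**2 < 1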

# hand off
result = parallel_data.filter(throw_dart)

### Get Results & Compare to Actual Pi

In [7]:
import math
# get count
count = result.count()
# count / NUM_SAMPLES is about pi/4, so we multiply by 4 to get our approximation
pi_estimation = 4 * count / NUM_SAMPLES

print("{:.11f}\n{:.11f}".format(pi_estimation, math.pi))

3.03360000000
3.14159265359


As we can see, our estimate is off by about 0.11, which is not very accurate.
If we increase `NUM_SAMPLES`, our estimate should get closer.
**Feel free to experiment!**
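
One way to see the effect of the sample size is to run the same dart-throwing logic locally for a few sample counts. This is a plain-Python sketch, not the Spark job: `estimate_pi` and its `seed` parameter are assumptions introduced here so that a run is repeatable.

```python
import math
import random

def estimate_pi(num_samples, seed=0):
    """Estimate pi by sampling points in the unit square (local, no Spark)."""
    rng = random.Random(seed)  # private generator so the result is repeatable
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x**2 + y**2 < 1:  # the dart landed inside the quarter circle
            hits += 1
    return 4 * hits / num_samples

for n in (1000, 10000, 100000):
    estimate = estimate_pi(n)
    error = abs(estimate - math.pi)
    print("{:>7} samples: {:.5f} (error {:.5f})".format(n, estimate, error))
```

The typical error of this kind of estimate shrinks roughly in proportion to 1/sqrt(N), so ten times as many samples gives only about three times the accuracy.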

# Accumulators and Broadcast

Sometimes it is necessary to sum a list.
To do this within Spark, we use an `accumulator`, a shared variable that tasks running on the cluster can only add to. While elementary Python could accomplish the task, we want to leverage the power and parallelization of Spark.

We can then sum in parallel using the `foreach` method on an RDD.

In [8]:
# Create List
data = range(0,10)
# Create RDD
rdd = sc.parallelize(data)
# Create an accumulator that starts at 0
acc = sc.accumulator(0)
# add every element to it in parallel
rdd.foreach(lambda x: acc.add(x))

# get the value back
total = acc.value  # .value reads the accumulated total back on the driver
print(total)

45
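
For a plain sum, an accumulator is not the only option. As a hedged sketch, the same total can also be obtained with the RDD actions `reduce` and `sum`. The helper name `total_with_reduce` is hypothetical, and the commented lines assume the `rdd` from the cell above.

```python
from functools import reduce
from operator import add

def total_with_reduce(rdd):
    """Sum the elements of an RDD by combining them pairwise with `add`."""
    return rdd.reduce(add)

# Local check of the same idea with Python's own reduce
local_total = reduce(add, range(0, 10))
print(local_total)  # 45

# In the notebook, with `rdd = sc.parallelize(range(0, 10))`:
#   total_with_reduce(rdd)
#   rdd.sum()
```

Accumulators tend to be more useful when the total is a side effect of another pass over the data, for example counting records while processing them.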


In addition to accumulators, sometimes we need a broadcast variable, which is a read-only value that every node can read during a parallel operation.

This is necessary because we want all of our nodes to see the same value in a parallel context. For example, if we want to add the number 5 to every number in our RDD, we could do it like this:

In [9]:
# create broadcast
broadcast = sc.broadcast(5)
# add 5 to every number
temp_result = rdd.map(lambda x: x + broadcast.value)
# get the results
result = temp_result.collect()

print(result)

[5, 6, 7, 8, 9, 10, 11, 12, 13, 14]


In this example we broadcast the literal 5, but the broadcast value can just as well come from a variable.

One use case: if an earlier computation produces a number that a later parallel step needs, you can store that number in a broadcast variable and then use it inside the next RDD operation.
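
The use case above can be sketched as follows: compute a value from the data, broadcast it, then use it in a second parallel step. The mean is chosen here purely for illustration. `center_on_mean` is a hypothetical helper, and `sc` and `rdd` are assumed to be the objects created in the earlier cells.

```python
def center_on_mean(sc, rdd):
    """Subtract the RDD's mean from every element, sharing the mean via a broadcast."""
    mean_value = rdd.mean()                  # first computation, result returned to the driver
    shared_mean = sc.broadcast(mean_value)   # read-only value made available to the workers
    # second computation: every task reads the same broadcast value
    return rdd.map(lambda x: x - shared_mean.value)

# In the notebook:
#   centered = center_on_mean(sc, rdd)
#   centered.collect()
```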

### <span style="background:yellow">Your Turn - Task 1</span>

Create a sequence of the values 0 through 49. Then, use the Spark cluster to look at each value.
If it is **odd**, then return 500. 
If it is **even**, then return the value minus 1. 

In [3]:
## Add your code in the cell below
## ---------------------------------

data = range(0, 50)
# Create RDD
parallel_data = sc.parallelize(data)


### Handoff for Computation
def function500orminus1(value):
    # even values are reduced by 1, odd values are replaced by 500
    return value - 1 if value % 2 == 0 else 500

# hand it off
transformed_data = parallel_data.map(function500orminus1)


### Gather and View Results
# get list
result_data = transformed_data.collect()
# print results
for original, transformed in zip(data, result_data):
    print("{} --> {}".format(original, int(transformed)))

0 --> -1
1 --> 500
2 --> 1
3 --> 500
4 --> 3
5 --> 500
6 --> 5
7 --> 500
8 --> 7
9 --> 500
10 --> 9
11 --> 500
12 --> 11
13 --> 500
14 --> 13
15 --> 500
16 --> 15
17 --> 500
18 --> 17
19 --> 500
20 --> 19
21 --> 500
22 --> 21
23 --> 500
24 --> 23
25 --> 500
26 --> 25
27 --> 500
28 --> 27
29 --> 500
30 --> 29
31 --> 500
32 --> 31
33 --> 500
34 --> 33
35 --> 500
36 --> 35
37 --> 500
38 --> 37
39 --> 500
40 --> 39
41 --> 500
42 --> 41
43 --> 500
44 --> 43
45 --> 500
46 --> 45
47 --> 500
48 --> 47
49 --> 500


# Save your notebook, then `File > Close and Halt`

---