# Spark notebook basics

**SparkContext**: our way of communicating with the Spark system

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark import SparkContext
sc=SparkContext(master='local[4]')
print(sc)

<SparkContext master=local[4] appName=pyspark-shell>


`master='local[4]'` runs Spark locally in this notebook using 4 workers (the machine has 4 cores, so there is one worker per core).

Only one SparkContext can be active at a time, since it is designed for a single user. Before creating a new SparkContext, stop the current one.

In [3]:
# sc.stop() # stop current SparkContext

**RDDs** (Resilient Distributed Datasets): a list of elements stored across several computers

**Parallelize**: the simplest way of creating an RDD. The result is of type `PythonRDD`

In [4]:
A=sc.parallelize(range(3))
A

PythonRDD[1] at RDD at PythonRDD.scala:48

**Collect**: the content of an RDD is distributed among the executors. `collect()` is the inverse of `parallelize()`: it gathers the elements of the RDD and returns them as a list

In [5]:
L=A.collect()
print(type(L))
print(L)

<class 'list'>
[0, 1, 2]


Using `collect()` eliminates the benefits of parallelism!

It is often tempting to `.collect()` an RDD into a list and then process it with standard Python. However, this means only the head node performs the computation, so you get no benefit from Spark. Using RDD operations lets you use all the computers at your disposal.

**Map**: applies an operation to each element of the RDD. The parameter is the function defining the operation. Returns a new RDD. The operation is performed in parallel on all executors, and each executor operates on its local data

In [6]:
A.map(lambda x: x*x).collect()

[0, 1, 4]

**Reduce**: takes an RDD and returns a single value. The reduce operator takes two elements as input and returns one as output, and it is applied repeatedly. Each executor reduces the data local to it, then the results from all executors are combined

In [7]:
A.reduce(lambda x,y: x+y)

3

In [8]:
# finds the shortest string
words=['this','is','the','best','thinkpad','ever!']
wordsRDD=sc.parallelize(words)
wordsRDD.reduce(lambda w,v: w if len(w)<len(v) else v)

'is'

In [9]:
# bad reduce operation
B=sc.parallelize([1,3,5,2])
B.reduce(lambda x,y: x-y)

-9

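Subtraction is a bad reduce operator because it is neither commutative nor associative: $x-y$ is different from $y-x$, so the result depends on the order and grouping in which the elements are combined. Both groupings below are possible; the first gives $-9$ (the output above) and the second gives $-5$: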

$$((1-3)-5)-2$$

or

$$(1-3)-(5-2)$$
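
The ambiguity above can be reproduced without Spark. The sketch below is a plain-Python stand-in, not Spark's actual scheduling: `partitioned_reduce` is a hypothetical helper that splits the list into contiguous chunks, reduces each chunk locally, and then reduces the partial results. Under that assumption, subtraction gives different answers for different partition counts, while addition does not.

```python
from functools import reduce

def partitioned_reduce(data, num_partitions, op):
    """Reduce each contiguous chunk locally, then reduce the partial results."""
    size = -(-len(data) // num_partitions)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    partials = [reduce(op, chunk) for chunk in chunks]
    return reduce(op, partials)

data = [1, 3, 5, 2]
sub = lambda x, y: x - y
add = lambda x, y: x + y

# Same data, same operator, different number of partitions.
sub_results = {p: partitioned_reduce(data, p, sub) for p in (1, 2, 4)}
add_results = {p: partitioned_reduce(data, p, add) for p in (1, 2, 4)}

print(sub_results)  # {1: -9, 2: -5, 4: -9}
print(add_results)  # {1: 11, 2: 11, 4: 11}
```

With two partitions the toy helper computes $(1-3)-(5-2)$, while with one or four partitions it computes $((1-3)-5)-2$.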


Using regular functions instead of lambda functions: lambda functions are short and sweet, but sometimes the logic is hard to fit in one line. In that case we can use a full-fledged function instead.

Suppose we want to find the last word in lexicographical order among the longest words in the list

In [10]:
def largerThan(x,y):
    if len(x)>len(y):
        return x
    elif len(y)>len(x):
        return y
    else:
        if x>y:
            return x
        else:
            return y

In [11]:
wordsRDD.reduce(largerThan)

'thinkpad'
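
A quick way to gain confidence that a reduce function is safe is to check that its result does not depend on the order of the elements. The sketch below does this in plain Python with `functools.reduce` over every permutation of the word list. The name `larger_than` is a local copy of the function above, and the check covers ordering only, not every possible grouping, so treat it as a sanity test rather than a proof.

```python
from functools import reduce
from itertools import permutations

def larger_than(x, y):
    """Return the longer word; break ties by lexicographical order."""
    if len(x) != len(y):
        return x if len(x) > len(y) else y
    return x if x > y else y

words = ['this', 'is', 'the', 'best', 'thinkpad', 'ever!']

# Reduce every ordering of the list and collect the distinct answers.
outcomes = {reduce(larger_than, p) for p in permutations(words)}

# The same rule expressed as a maximum over the key (length, word).
by_key = max(words, key=lambda w: (len(w), w))

print(outcomes)  # {'thinkpad'}
print(by_key)    # thinkpad
```

Because the function picks the maximum under a single total order, every ordering yields the same word.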

# Changing number of workers

**Effect of changing the number of workers**: when you initialize a SparkContext, you can specify the number of workers. The usual choice is one worker per core, but the number can be smaller or larger than that

In [12]:
from time import time
from pyspark import SparkContext
sc.stop()
for j in range(1,10):
    sc = SparkContext(master='local[%d]'%(j))
    tic=time()
    for i in range(10):
        sc.parallelize([1,2]*100000).reduce(lambda x,y:x+y)
    print('%2d executors, time=%4.3f'%(j,time()-tic))
    sc.stop()

 1 executors, time=2.352
 2 executors, time=1.912
 3 executors, time=1.964
 4 executors, time=1.792
 5 executors, time=1.895
 6 executors, time=2.801
 7 executors, time=3.518
 8 executors, time=3.280
 9 executors, time=2.872


# Execution plans, lazy evaluation and caching

Suppose our task is to compute $\sum^n_{i=1}x^2_i$. The standard (busy) way to do this is: i) calculate the square of each element; ii) sum the squares. This requires storing all intermediate results!

**Lazy evaluation**: postpone computing the squares until the result is needed. There is no need to store intermediate results, and the data is scanned once rather than twice. We create an RDD with one million elements to amplify the effects of lazy evaluation and caching:

In [13]:
%%time
sc=SparkContext()
RDD=sc.parallelize(range(1000000))

CPU times: user 14.5 ms, sys: 206 µs, total: 14.7 ms
Wall time: 75.4 ms
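
The difference between the busy and the lazy approach to the sum of squares can be sketched in plain Python. This is only an analogy: a list comprehension stands in for materializing the intermediate results, and a generator expression stands in for lazy evaluation. The variable names are illustrative.

```python
xs = range(1, 1001)

# Busy approach: build the full list of squares, then sum it (two passes).
squares = [x * x for x in xs]
eager_total = sum(squares)

# Lazy approach: each square is produced only when sum() asks for it,
# so no intermediate list is stored and the data is scanned once.
lazy_total = sum(x * x for x in xs)

print(len(squares))  # 1000 intermediate values were stored
print(eager_total)   # 333833500
print(lazy_total)    # 333833500
```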


Define a computation: the role of the function `taketime` is to consume CPU cycles

In [14]:
from math import cos
def taketime(i):
    [cos(j) for j in range(100)]
    return cos(i)

In [15]:
%%time
taketime(1)

CPU times: user 55 µs, sys: 0 ns, total: 55 µs
Wall time: 61.3 µs


0.5403023058681398

Time units
* 1 second = 1000 milliseconds ($ms$)
* 1 millisecond = 1000 microseconds ($\mu s$)
* 1 microsecond = 1000 nanoseconds ($ns$)

Clock rate: one cycle of a 3GHz CPU takes $\frac{1}{3} ns$

`taketime(1000)` takes about $25\mu s=75,000$ clock cycles
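
The unit conversion can be checked with a few lines of arithmetic. The sketch assumes the figures quoted above (a 3 GHz clock and roughly 25 microseconds per call) and uses integers to avoid floating-point rounding.

```python
CYCLES_PER_NS = 3        # a 3 GHz CPU completes 3 cycles per nanosecond
NS_PER_US = 1000         # 1 microsecond = 1000 nanoseconds

call_us = 25                        # assumed cost of one taketime call
call_ns = call_us * NS_PER_US       # 25,000 ns
cycles_per_call = call_ns * CYCLES_PER_NS

print(cycles_per_call)  # 75000
```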

**Map operation**

In [16]:
%%time
Interm=RDD.map(lambda x: taketime(x))

CPU times: user 41 µs, sys: 0 ns, total: 41 µs
Wall time: 48.9 µs


How come so fast?

We expected this map operation to take $1000000\times 25\mu s=25$ seconds

Why did the previous cell take only ~50 $\mu s$?

Because no computation was done! The cell defined an execution plan, but did not execute it yet

**Execution plans**: at this point the variable `Interm` doesn't point to an actual data structure. Instead, it points to an execution plan expressed as a **dependence graph**. The dependence graph defines how the RDDs are computed from each other. The dependence graph associated with an RDD can be printed out using the method `toDebugString()`

In [17]:
print(Interm.toDebugString().decode())

(4) PythonRDD[1] at RDD at PythonRDD.scala:48 []
 |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:175 []


**Interm** `(4) PythonRDD[1] at RDD at PythonRDD.scala:48 []`, where the `(4)` is the number of partitions

**RDD** `ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:175`

So, at this point, only the first two blocks of the plan have been declared (the parallelize and the map), but not the reduce
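
The idea of an execution plan can be imitated with a toy class. `LazySeq` below is hypothetical and has nothing to do with Spark's internals: it merely records `map` and `filter` steps and runs them only when an action (`reduce` or `count`) is called. The `calls` list records every invocation of the mapped function, which makes it visible when work actually happens.

```python
from functools import reduce

class LazySeq:
    """Toy stand-in for an RDD: records steps, runs them only on an action."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = steps

    def map(self, f):
        return LazySeq(self.source, self.steps + (("map", f),))

    def filter(self, pred):
        return LazySeq(self.source, self.steps + (("filter", pred),))

    def _run(self):
        # Replay the recorded plan over the source, one element at a time.
        for item in self.source:
            keep = True
            for kind, f in self.steps:
                if kind == "map":
                    item = f(item)
                elif not f(item):
                    keep = False
                    break
            if keep:
                yield item

    def reduce(self, op):
        return reduce(op, self._run())

    def count(self):
        return sum(1 for _ in self._run())

calls = []

def square(x):
    calls.append(x)  # record that real work happened
    return x * x

plan = LazySeq(range(5)).map(square)
calls_after_map = len(calls)             # nothing has run yet

total = plan.reduce(lambda a, b: a + b)  # the action triggers the map
calls_after_reduce = len(calls)

big = plan.filter(lambda v: v > 3).count()  # second action reruns the map
calls_after_count = len(calls)

print(calls_after_map, total, calls_after_reduce, big, calls_after_count)
# 0 30 5 3 10
```

Declaring the map costs nothing, and each action replays the whole plan from the source, so the mapped function runs five times per action.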

**Actual execution**: the `reduce` command has to produce an actual output, so Spark has to execute both the `map` and the `reduce`. Some real computation needs to be done, which takes several seconds (about 10 seconds in the run below) depending on the machine used and on its load

In [19]:
%%time
print('out=',Interm.reduce(lambda x,y:x+y))

out= -0.2887054679684464
CPU times: user 9 ms, sys: 4.03 ms, total: 13 ms
Wall time: 10.4 s


**How come so fast?**

We expected the map operation alone to take 25 seconds, yet map+reduce took only ~10 seconds. Why?

Because we have 4 workers rather than one, and because the measurement of a single call to `taketime` is an overestimate
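
A back-of-the-envelope comparison of the numbers quoted above, as a rough sketch: the "ideal" figure assumes the work divides perfectly across 4 workers with no scheduling or serialization overhead, which is an assumption and not something the measurement shows.

```python
serial_estimate_s = 1_000_000 * 25 / 1_000_000  # 1e6 calls at 25 microseconds each
workers = 4
observed_s = 10.4                               # wall time reported above

ideal_parallel_s = serial_estimate_s / workers  # perfect 4-way split
speedup_vs_estimate = serial_estimate_s / observed_s

print(serial_estimate_s)              # 25.0
print(ideal_parallel_s)               # 6.25
print(round(speedup_vs_estimate, 2))  # 2.4
```

The observed time sits between the serial estimate and the ideal 4-way split.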

**Executing a different calculation based on the same plan**: the plan defined by `Interm` might need to be executed more than once. For example, compute the number of map outputs that are larger than zero

In [21]:
%%time
print('out=',Interm.filter(lambda x:x>0).count())

out= 500000
CPU times: user 15.5 ms, sys: 0 ns, total: 15.5 ms
Wall time: 14.8 s


**The price of not materializing**: the runtime is similar to that of the reduce. Because the intermediate results in `Interm` have not been saved in memory (materialized), they need to be recomputed

The middle block is executed twice, once for each final step

    RDD                            Interm
    parallelize(range(1000000)) -> map(taketime) -> reduce(lambda x,y: x+y)        -> number
                                                \
                                                 -> filter(lambda x: x>0).count()  -> number
                                                             
**Caching intermediate results**: we sometimes want to keep the intermediate results in memory so that we can reuse them later without recalculating. This will reduce the running time, at the cost of requiring more memory.

The method `cache()` indicates that the RDD generated in this plan should be stored in memory. Note that this is a **plan to cache**. The actual caching will be done only when the final result is needed

In [22]:
%%time
Interm=RDD.map(lambda x: taketime(x)).cache()

CPU times: user 9.12 ms, sys: 487 µs, total: 9.6 ms
Wall time: 24.6 ms


**Plan to cache**: the definition of `Interm` is almost the same as before. However, the plan corresponding to `Interm` is more elaborate and contains information about how the intermediate results will be cached and replicated. Note that `PythonRDD[7]` is now `Memory Serialized 1x Replicated`

In [23]:
print(Interm.toDebugString().decode())

(4) PythonRDD[7] at RDD at PythonRDD.scala:48 [Memory Serialized 1x Replicated]
 |  ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:175 [Memory Serialized 1x Replicated]


**Creating the cache**: the following command executes the first map-reduce command **and** caches the result of the `map` command in memory

In [26]:
%%time
print('out=',Interm.reduce(lambda x,y: x+y))

out= -0.2887054679684464
CPU times: user 12.7 ms, sys: 38 µs, total: 12.7 ms
Wall time: 361 ms


**Using the cache**: this time `Interm` is cached. Therefore the second use of `Interm` is much faster than when we didn't use the cache: 0.36 seconds instead of 14.8 seconds

In [27]:
%%time
print('out=',Interm.filter(lambda x: x>0).count())

out= 500000
CPU times: user 7.08 ms, sys: 8.08 ms, total: 15.2 ms
Wall time: 360 ms
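
The behaviour of `cache()` can be imitated in plain Python. The `Plan` class below is a hypothetical toy, not Spark's implementation: `cache()` only sets a flag (a plan to cache), the first action stores the mapped values, and later actions reuse them. The `calls` list counts how often the mapped function actually runs.

```python
class Plan:
    """Toy stand-in for a mapped RDD with an optional cache flag."""

    def __init__(self, source, func):
        self.source = source
        self.func = func
        self.cache_requested = False
        self._stored = None

    def cache(self):
        self.cache_requested = True  # only a plan to cache; nothing runs yet
        return self

    def _values(self):
        if self._stored is not None:
            return self._stored      # reuse the materialized results
        values = [self.func(x) for x in self.source]
        if self.cache_requested:
            self._stored = values    # materialize on the first action
        return values

    def sum(self):
        return sum(self._values())

    def count_positive(self):
        return sum(1 for v in self._values() if v > 0)

calls = []

def shift(x):
    calls.append(x)  # record that real work happened
    return x - 2

uncached = Plan(range(5), shift)
uncached.sum()
uncached.count_positive()
uncached_calls = len(calls)        # map ran once per action

calls.clear()
cached = Plan(range(5), shift).cache()
calls_after_cache = len(calls)     # cache() alone runs nothing
cached_sum = cached.sum()          # first action computes and stores
cached_positive = cached.count_positive()  # second action reuses the store
cached_calls = len(calls)

print(uncached_calls, calls_after_cache, cached_sum, cached_positive, cached_calls)
# 10 0 0 2 5
```

Without the cache the mapped function runs ten times for two actions. With the cache it runs five times, at the cost of keeping the five results in memory.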
