# Spark notebook basics

**SparkContext**: our way of communicating with the Spark system

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark import SparkContext
sc=SparkContext(master='local[4]')
print(sc)

<SparkContext master=local[4] appName=pyspark-shell>


The `master='local[4]'` argument runs Spark locally in my notebook using 4 workers (I have 4 cores, so that is one worker per core).

Only one SparkContext can be active at a time. It is designed for a single user. Before creating a new SparkContext, stop the current one.

In [5]:
# sc.stop() # stop current SparkContext

**RDD** (Resilient Distributed Dataset): a list of elements distributed across several computers

**Parallelize**: the simplest way of creating an RDD. Here the result is of type `PythonRDD`

In [6]:
A=sc.parallelize(range(3))
A

PythonRDD[1] at RDD at PythonRDD.scala:48

**Collect**: the content of an RDD is distributed among all executors. `collect()` is the inverse of `parallelize()`: it gathers the elements of the RDD and returns them as a list

In [7]:
L=A.collect()
print(type(L))
print(L)

<class 'list'>
[0, 1, 2]


Using `collect()` eliminates the benefits of parallelism!

It is often tempting to `.collect()` an RDD into a list and then process it using standard Python. However, this means only the head node performs the computation, so you get no benefit from Spark. Using RDD operations lets you use all the computers at your disposal.

**Map**: applies an operation to each element of the RDD. The parameter is the function defining the operation. Returns a new RDD. The operation is performed in parallel on all executors. Each executor operates on its local data

In [8]:
A.map(lambda x: x*x).collect()

[0, 1, 4]

**Reduce**: takes an RDD and returns a single value. The reduce operator takes two elements as input and returns one as output, and it is applied repeatedly. Each executor reduces the data local to it, then the results from all executors are combined

In [13]:
A.reduce(lambda x,y: x+y)

3

In [14]:
# finds the shortest string
words=['this','is','the','best','thinkpad','ever!']
wordsRDD=sc.parallelize(words)
wordsRDD.reduce(lambda w,v: w if len(w)<len(v) else v)

'is'

In [23]:
# bad reduce operation
B=sc.parallelize([1,3,5,2])
B.reduce(lambda x,y: x-y)

-9

The order matters for this operation, because $x-y$ is different from $y-x$, so the result depends on how the elements are grouped. Which of the following was computed:

$$((1-3)-5)-2$$

or

$$(1-3)-(5-2)$$
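
The difference between the two groupings can be checked without Spark. The sketch below is a plain-Python model of a two-stage reduce, not Spark's actual implementation: the helper `simulate_reduce` and the two hand-written partition layouts are illustrative assumptions, and the partial results are assumed to be combined in partition order.

```python
from functools import reduce

def simulate_reduce(partitions, op):
    """Reduce each partition locally, then combine the partial results in order.

    Hypothetical helper for illustration only; it models the idea of
    'each executor reduces its local data, then the results are combined'.
    """
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)

subtract = lambda x, y: x - y
add = lambda x, y: x + y

# The same data [1, 3, 5, 2], split in two different (assumed) ways.
one_per_partition = [[1], [3], [5], [2]]   # four partitions
two_per_partition = [[1, 3], [5, 2]]       # two partitions

sub_four = simulate_reduce(one_per_partition, subtract)  # ((1-3)-5)-2
sub_two = simulate_reduce(two_per_partition, subtract)   # (1-3)-(5-2)
add_four = simulate_reduce(one_per_partition, add)
add_two = simulate_reduce(two_per_partition, add)

print(sub_four, sub_two)   # subtraction: the answer changes with the split
print(add_four, add_two)   # addition: the answer does not change
```

In this model, subtraction gives $-9$ with one element per partition and $-5$ with two elements per partition, while addition gives $11$ either way. A reduce operator should give the same answer no matter how the elements are grouped or ordered.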


Using regular functions instead of lambda functions: lambda functions are short and sweet, but sometimes the logic is hard to fit in one line. In that case we can use a full-fledged function instead.

Suppose we want to find the last word in lexicographical order among the longest words in the list.

In [25]:
def largerThan(x,y):
    if len(x)>len(y):
        return x
    elif len(y)>len(x):
        return y
    else:
        if x>y:
            return x
        else:
            return y

In [27]:
wordsRDD.reduce(largerThan)

'thinkpad'
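
Unlike subtraction, this operator should not depend on the order in which the words are combined. A quick local check, without Spark, is to reduce every permutation of the list and confirm that only one answer comes out. The function `larger_than` below is a compact restatement of `largerThan` written for this sketch, and the three-word list `tied` is a made-up example that exercises the tie-break.

```python
from functools import reduce
from itertools import permutations

def larger_than(x, y):
    # Same rule as largerThan above: prefer the longer word,
    # and break ties by taking the lexicographically later word.
    if len(x) != len(y):
        return x if len(x) > len(y) else y
    return max(x, y)

words = ['this', 'is', 'the', 'best', 'thinkpad', 'ever!']

# Reduce all 720 orderings of the list and collect the distinct answers.
results = {reduce(larger_than, order) for order in permutations(words)}
print(results)

# Hypothetical list of equal-length words, to exercise the tie-break branch.
tied = ['this', 'best', 'that']
tied_results = {reduce(larger_than, order) for order in permutations(tied)}
print(tied_results)
```

Each set contains a single word, so for these inputs the result does not depend on the order of the elements.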