## Chaining

We can chain transformations and an action to create a computation pipeline. Suppose we want to compute the sum of the squares of the elements $x_i$ stored in an RDD:

$$ \sum_{i=1}^n x_i^2 $$
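As a point of reference, the same quantity can be computed in plain Python without Spark. This is only a sanity-check sketch; the list `xs` is hypothetical sample data, not the RDD used in this notebook.

```python
# Plain-Python reference for the sum of squares (no Spark involved).
xs = [1, 2, 3, 4]  # hypothetical sample data

# Square each element and add the results up.
sum_of_squares = sum(x * x for x in xs)
print(sum_of_squares)  # 1 + 4 + 9 + 16 = 30
```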

## Start the SparkContext

In [1]:
import numpy as np
from pyspark import SparkContext

In [2]:
sc = SparkContext(master="local[4]")

In [4]:
B = sc.parallelize(np.random.randint(0,10,size=1000))
lst = B.collect()
for i in lst:
    print(i,end=', ')

2, 1, 3, 4, 5, 3, 9, 9, 5, 2, 0, 1, 0, 2, 4, 8, 3, 4, 4, 0, 9, 3, 5, 7, 5, 1, 8, 1, 7, 4, 8, 0, 9, 4, 6, 4, 5, 0, 9, 7, 7, 5, 7, 8, 8, 7, 2, 8, 7, 2, 8, 1, 8, 8, 0, 0, 8, 8, 3, 2, 6, 8, 8, 4, 2, 0, 5, 1, 3, 8, 8, 6, 7, 2, 3, 1, 8, 0, 2, 0, 1, 5, 0, 6, 0, 7, 2, 9, 9, 8, 1, 8, 2, 6, 1, 7, 2, 7, 5, 1, 2, 4, 0, 0, 8, 9, 7, 1, 9, 9, 2, 7, 1, 2, 0, 9, 0, 2, 9, 2, 2, 3, 0, 9, 9, 7, 8, 9, 1, 0, 9, 4, 6, 4, 3, 0, 1, 3, 8, 0, 8, 1, 2, 8, 7, 5, 4, 8, 4, 3, 5, 4, 0, 6, 0, 6, 0, 5, 0, 0, 2, 7, 8, 8, 5, 2, 9, 4, 6, 2, 2, 9, 5, 6, 3, 1, 4, 0, 0, 6, 7, 2, 0, 4, 0, 1, 4, 9, 2, 6, 7, 6, 3, 6, 7, 9, 2, 0, 9, 8, 2, 7, 0, 2, 0, 0, 0, 7, 9, 5, 3, 8, 7, 9, 3, 1, 1, 7, 9, 2, 4, 7, 2, 5, 8, 3, 6, 4, 8, 1, 2, 2, 1, 2, 5, 9, 2, 2, 5, 5, 0, 3, 4, 5, 5, 0, 3, 6, 3, 9, 4, 0, 8, 1, 6, 1, 9, 9, 7, 5, 8, 7, 1, 9, 1, 3, 6, 9, 6, 8, 9, 4, 9, 7, 8, 1, 8, 3, 3, 9, 5, 3, 7, 4, 9, 5, 7, 5, 1, 5, 6, 4, 2, 6, 3, 8, 4, 9, 2, 9, 4, 7, 6, 6, 4, 8, 7, 7, 3, 3, 6, 9, 3, 1, 6, 6, 5, 2, 2, 3, 6, 3, 8, 6, 6, 7, 7, 0, 4, 4, 1, 7, 6, 7

## Sequential syntax for chaining

Perform assignment after each computation

In [6]:
%%time
Squares = B.map(lambda x:x*x)
summation = Squares.reduce(lambda x,y:x+y)

CPU times: user 7.72 ms, sys: 2.11 ms, total: 9.84 ms
Wall time: 454 ms


In [8]:
print(summation)

29022


## Cascaded syntax for chaining

Combine computations into a single cascaded command

In [9]:
%%time
B.map(lambda x:x*x).reduce(lambda x,y:x+y)

CPU times: user 3.76 ms, sys: 2.3 ms, total: 6.06 ms
Wall time: 141 ms


29022

### Both syntaxes mean exactly the same thing

The only difference:
1. In the sequential syntax the intermediate RDD has a name Squares
2. In the cascaded syntax the intermediate RDD is anonymous

The execution is identical

### Sequential execution

The standard way to execute a map followed by a reduce is:
1. perform the map
2. store the resulting RDD in memory
3. perform the reduce
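The three steps above can be sketched in plain Python. This is an illustration of the sequential strategy only, not how Spark runs the job; the function name `sequential_sum_of_squares` is made up for this example.

```python
from functools import reduce

def sequential_sum_of_squares(data):
    # Steps 1 and 2: perform the map over the whole input and
    # store the full intermediate result as a list in memory.
    squares = [x * x for x in data]
    # Step 3: perform the reduce in a second pass over the stored list.
    return reduce(lambda x, y: x + y, squares)

print(sequential_sum_of_squares([1, 2, 3, 4]))  # 30
```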

### Disadvantages of Sequential execution

1. Intermediate result (Squares) requires memory space.
2. Two scans of memory (first of B, then of Squares), which doubles the cache misses.

### Pipelined execution

Perform the whole computation in a single pass. For each element of B:
1. Compute the square.
2. Enter the square as input to the reduce operation.
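A minimal plain-Python sketch of this single-pass strategy follows. The name `pipelined_sum_of_squares` is hypothetical, and the sketch assumes the reduction is addition, so it starts from 0 rather than from the first element.

```python
def pipelined_sum_of_squares(data):
    total = 0  # running result of the reduce (assumes addition)
    for x in data:
        square = x * x   # 1. compute the square
        total += square  # 2. feed it straight into the reduce
    return total         # no list of squares was ever stored

print(pipelined_sum_of_squares([1, 2, 3, 4]))  # 30
```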

### Advantages of Pipelined execution

1. Less memory required - intermediate result is not stored.
2. Faster - only one pass through the input RDD.
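The memory argument can be illustrated outside Spark by comparing a stored list of squares with a generator that yields them one at a time. This is a rough analogy under stated assumptions: `sys.getsizeof` reports only the size of the container object itself, and the variable names are chosen for this example.

```python
import sys

data = range(1000)

stored = [x * x for x in data]  # intermediate result kept in memory
lazy = (x * x for x in data)    # squares produced on demand, one at a time

# The list container grows with the input; the generator object does not.
print(sys.getsizeof(stored) > sys.getsizeof(lazy))  # True

# Both give the same sum; the generator is consumed in a single pass.
print(sum(stored) == sum(lazy))  # True
```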

In [10]:
sc.stop()