# Reusing RDDs: when, how, why

Spark offers several options for RDD reuse, including `persisting`, `caching`, and `checkpointing`,
but it does not apply any of them automatically.

Spark does not do so by default because storing an RDD for reuse breaks some pipelining and is
wasteful if the RDD is used only once or the data is inexpensive to recompute. Persisting or caching can require a lot of memory or disk space and is unlikely to improve performance for operations that are performed only once.

See H. Karau and R. Warren, *High Performance Spark* (O'Reilly), for more details.

## Iterative Computations

For transformations that use the same parent RDD multiple times, persisting that RDD
means it is evaluated once and then kept, which avoids repeated computation. For
example, if you were performing a loop of joins to the same dataset, persisting that
dataset could lead to large performance improvements, since the partitions
of that RDD stay available in memory for each join instead of being recomputed.

In the following example we are computing the root mean squared error (RMSE) on
a number of different RDDs representing predictions from different models. To do
this we have to join each RDD of predictions to an RDD of the data in the validation
set.

In [68]:
#create a sample dataset of key-value pairs
import random
keyValuePairs = [(i,random.randint(1,100)) for i in range(100000)]

#Define root-mean square function
def RMSE(rdd):
    from math import sqrt
    from operator import add
    n = rdd.count()
    return sqrt(rdd.map(lambda x: (x[0] - x[1]) ** 2).reduce(add) / float(n))

In [69]:
import time

start = time.time()
# Remove the '#' before .cache() to keep validationSet in memory for both joins
validationSet = sc.parallelize(keyValuePairs) #.cache()
#validationSet.unpersist()
testSet = validationSet.mapValues(lambda x: x+1)
print "Error is: ", RMSE(testSet.join(validationSet).values())
end = time.time()
print "Elapsed time, step1: ", end-start

testSet = validationSet.mapValues(lambda x: x+2)
print "Error is: ", RMSE(testSet.join(validationSet).values())
end2 = time.time()
print "Elapsed time, step2: ", end2-end

Error is:  1.0
Elapsed time, step1:  1.09853816032
Error is:  2.0
Elapsed time, step2:  1.10933494568
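
To make the effect of reuse concrete without a cluster, here is a minimal pure-Python sketch. It is a toy model, not Spark code: `LazyDataset`, `make_validation_set`, and `offset_error` are hypothetical names invented for this illustration. The model only counts how often a parent dataset is rebuilt when it is used twice per error computation, with and without caching.

```python
class LazyDataset(object):
    """Toy stand-in for an RDD: it holds a recipe, not data (NOT a Spark class)."""

    def __init__(self, compute):
        self._compute = compute    # function that rebuilds the records
        self._cached = None        # filled only after cache() and a first use
        self._use_cache = False
        self.evaluations = 0       # how many times the recipe actually ran

    def cache(self):
        # Only marks the dataset for caching; nothing is computed yet.
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        records = self._compute()
        if self._use_cache:
            self._cached = records
        return records


def make_validation_set():
    # Hypothetical sample data: (key, value) pairs.
    return [(i, i % 7) for i in range(1000)]


def offset_error(validation, offset):
    """Join shifted predictions to the validation records by key; return the RMSE."""
    actual = dict(validation.collect())                              # first use
    predicted = [(k, v + offset) for k, v in validation.collect()]   # second use
    squared = [(p - actual[k]) ** 2 for k, p in predicted]
    return (sum(squared) / float(len(squared))) ** 0.5


uncached = LazyDataset(make_validation_set)
cached = LazyDataset(make_validation_set).cache()

errors_uncached = [offset_error(uncached, d) for d in (1, 2)]
errors_cached = [offset_error(cached, d) for d in (1, 2)]
```

In this model both variants return the same errors, but `uncached.evaluations` grows with every use while `cached.evaluations` stays at one. Real Spark caching additionally depends on available memory and the chosen storage level.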


RDDs can also be persisted in memory backed by disk, in serialized form, and with replication across workers. You can read more about RDD persistence at http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence.

## The primary factors to consider while choosing a storage level

### Persistence Level: 

Storing in memory only is the simplest option and has little overhead. If the entire RDD doesn't fit in memory, the partitions that are missing are recomputed from the lineage automatically when they are needed. But if an RDD that is produced by a wide transformation is going to be used heavily in an iterative or interactive fashion, it is better to store it in memory backed by disk so that its partitions are not recomputed.

### Serialization: 

The default serialization in Spark is Java serialization. However, for better performance, we recommend Kryo serialization, which is enabled by setting `spark.serializer` to `org.apache.spark.serializer.KryoSerializer`.

### Replication: 

Spark, by default, provides fault tolerance by recomputing any missing partitions on the fly. To optimize for performance, you can optionally specify a replication factor. Note, however, that this increases the initial cache time and storage usage significantly.

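To store an RDD in memory, backed by disk, with a replication factor of 2, pass a replicated storage level to `persist`, for example `rdd.persist(StorageLevel.MEMORY_AND_DISK_2)` after `from pyspark import StorageLevel`. In Python, stored objects are always serialized with pickle, so the choice between serialized and deserialized levels does not matter.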

## Caching DataFrames and tables in memory

Spark SQL tables and DataFrames can be cached too. The tables and DataFrames are cached in JVM memory and compressed using a simple algorithm in a columnar format.

Typically, however, the compression ratio will not be as good as that of a format like Parquet. To cache a DataFrame in memory:

In [6]:
N=100000
df = sqlContext.range(N)
df.cache()

DataFrame[id: bigint]

In [7]:
df.printSchema()
df.cache()

root
 |-- id: long (nullable = false)



DataFrame[id: bigint]

# Partitioning

A partition in Spark represents a unit of parallel execution that corresponds to one task. The number of tasks that can be executed concurrently is limited by the total number of executor cores in the Spark cluster.

A partitioner object defines a function from each record of a pair RDD to a partition number, based on the record's key. By assigning a partitioner to an RDD we can guarantee something about the data that lives in each partition, for example that its keys fall within a given range (range partitioner), or that all elements whose keys
have the same hash code end up in the same partition (hash partitioner).
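
The two kinds of partitioner can be sketched in plain Python. This is a simplified model under stated assumptions, not Spark's implementation: the function names `hash_partition`, `range_partition`, and `partition_records` are hypothetical, and the range bounds are chosen by hand rather than sampled from the data.

```python
from bisect import bisect_left


def hash_partition(key, num_partitions):
    """Partition index for a key under a simple hash partitioner."""
    return hash(key) % num_partitions


def range_partition(key, upper_bounds):
    """Partition index for a key, given sorted inclusive upper bounds.

    Keys above the last bound fall into one extra, final partition.
    """
    return bisect_left(upper_bounds, key)


def partition_records(records, partition_of):
    """Group (key, value) records by the partition index their key maps to."""
    partitions = {}
    for key, value in records:
        partitions.setdefault(partition_of(key), []).append((key, value))
    return partitions


records = [(k, k * k) for k in range(10)]
by_hash = partition_records(records, lambda k: hash_partition(k, 4))
by_range = partition_records(records, lambda k: range_partition(k, [2, 5]))
```

With four hash partitions, keys 0, 4, and 8 share a partition because they are equal modulo 4. With the bounds `[2, 5]`, the range version keeps keys 0 to 2, 3 to 5, and 6 to 9 together, which is the guarantee a sorted output relies on.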

## Changing the RDD partitioning

Three methods exist exclusively to change the way an RDD is partitioned. In the generic RDD class, `repartition` and `coalesce` can be used to simply change the number of partitions that the RDD uses, irrespective of the values of the records in the RDD.

The `repartition` transformation causes a shuffle, while `coalesce` is an optimized version of `repartition` that avoids a full shuffle if the desired number of partitions is less than the current number of partitions. When `coalesce` reduces the number of partitions, it does so by merely combining partitions, and thus it is not a wide transformation, since the
destination of each partition can be determined at design time.
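
The difference between combining partitions and redistributing records can be illustrated with a small pure-Python model. This is a sketch only: the functions below are hypothetical stand-ins that share names with the Spark methods, and the way Spark actually groups partitions or orders shuffled records differs in detail.

```python
def coalesce(partitions, n):
    """Merge whole partitions into n groups (assumes n <= len(partitions)).

    Each input partition is moved as a unit, so its destination is known
    from its index alone, without looking at any record.
    """
    merged = [[] for _ in range(n)]
    for index, part in enumerate(partitions):
        merged[index * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Redistribute every record individually (round-robin), like a full shuffle."""
    shuffled = [[] for _ in range(n)]
    position = 0
    for part in partitions:
        for record in part:
            shuffled[position % n].append(record)
            position += 1
    return shuffled


parts = [[0, 1], [2, 3], [4, 5], [6, 7]]
coalesced = coalesce(parts, 2)
repartitioned = repartition(parts, 2)
```

In the coalesced result, records that started together stay together. In the repartitioned result, every input partition contributes records to every output partition, which is what makes it a wide transformation.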

For RDDs of key/value pairs, we can use a function called `partitionBy`, which takes a partitioner object rather than a
number of partitions and shuffles the RDD with the new partitioner. In PySpark, `partitionBy` takes a number of partitions together with an optional partition function that plays the role of the partitioner.

`partitionBy` allows for much more control over the way the records are partitioned, since the
partitioner supports defining a function that assigns a partition to a record based on
the value of its key.

In [70]:
import time
N = 100000
rdd = sc.parallelize([x for x in range(N)])

In [73]:
print rdd.getNumPartitions()

4


In [74]:
print rdd.coalesce(2).getNumPartitions()
print rdd.repartition(2).getNumPartitions()

2
2


## Partitioning of key-value RDDs

The following diagrams show schematically how a word count is performed in a distributed way using the `groupByKey` and `reduceByKey` approaches. Aggregation is discussed in more detail later in the course.

<img src="./debug/groupbykey.png">

<img src="./debug/reducebykey.png">
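
As a rough companion to the two diagrams, the following pure-Python sketch models the general difference between the two approaches by counting how many records would cross the shuffle. It is a toy model under stated assumptions, not Spark code: the function names and the sample partitions are hypothetical.

```python
from collections import Counter, defaultdict


def word_count_group_by_key(partitions):
    """Shuffle every (word, 1) pair, then sum per word on the reduce side."""
    shuffled = [(word, 1) for part in partitions for word in part]
    grouped = defaultdict(list)
    for word, one in shuffled:
        grouped[word].append(one)
    counts = dict((word, sum(ones)) for word, ones in grouped.items())
    return counts, len(shuffled)


def word_count_reduce_by_key(partitions):
    """Sum per word inside each partition first, then shuffle only partial sums."""
    shuffled = []
    for part in partitions:
        shuffled.extend(Counter(part).items())   # map-side combine
    counts = defaultdict(int)
    for word, partial in shuffled:
        counts[word] += partial
    return dict(counts), len(shuffled)


partitions = [["a", "b", "a", "a"], ["b", "b", "c", "a"]]
grouped_counts, grouped_shuffled = word_count_group_by_key(partitions)
reduced_counts, reduced_shuffled = word_count_reduce_by_key(partitions)
```

Both functions return the same counts, but the second one shuffles one record per distinct word per partition instead of one record per word occurrence. The gap widens as words repeat more often within a partition.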

# Large scale caching and partitioning tests

Let us return to the Adroit cluster accounts and rerun these tests in a distributed environment.
I have prepared a relatively large (on the order of 1 GB) file here:

```bash
[alexeys@adroit3 4_Optimize] ls -l /scratch/network/alexeys/BigDataCourse/large/test.json 
-rw-r--r-- 1 alexeys cses 1090671230 Apr  4 23:00 /scratch/network/alexeys/BigDataCourse/large/test.json
```

Change into the exercise folder:

```bash
cd BigDataCourse/4_Optimize
```

Inspect the `cache_partition.py` source file, then submit it without changes to the cluster by running:

```bash
sbatch slurm.cmd
```

Once the slurm\*out file appears in your submission area, connect to the Spark web UI:

```bash
firefox --no-remote http://<your master URL>:8080
```

where the master URL is taken from the Slurm output file.

# Broadcast variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcast this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

## Creating broadcast variables

Broadcast variables are created from a variable `v` by calling `SparkContext.broadcast(v)`. The broadcast variable is a wrapper around `v`, and its value can be accessed by calling the `value` method. The code below shows this:

In [1]:
broadcastVar = sc.broadcast([1, 2, 3])

In [2]:
broadcastVar.value

[1, 2, 3]

After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later).

# Accumulators

Accumulators are variables that are only “added” to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they will be displayed in Spark’s UI. This can be useful for understanding the progress of running stages (NOTE: this is not yet supported in Python).

An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python). However, they cannot read its value. Only the driver program can read the accumulator’s value, using its value method.

The code below shows an accumulator being used to add up the elements of an array:

In [3]:
accum = sc.accumulator(0)

In [4]:
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))

In [5]:
accum.value

10
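
To illustrate why an add-only variable is cheap to support in parallel, here is a minimal pure-Python sketch of the idea. It is not how Spark implements accumulators: `run_task` and `merge_on_driver` are hypothetical names, and the partitions are made up for the example. Each task folds its own partition into a local partial result, and only those partial results are merged on the driver.

```python
from functools import reduce
from operator import add


def run_task(partition, zero=0):
    """One task: fold its partition into a task-local partial result."""
    local = zero
    for x in partition:
        local = add(local, x)
    return local


def merge_on_driver(partials, zero=0):
    """Driver: merge the partial results into the final value."""
    return reduce(add, partials, zero)


partitions = [[1, 2], [3], [4]]
partials = [run_task(p) for p in partitions]
total = merge_on_driver(partials)
# Merging in a different order gives the same result for addition.
total_reversed = merge_on_driver(list(reversed(partials)))
```

Because addition gives the same result regardless of how the partial sums are grouped or ordered, tasks never need to see each other's state, and the driver can merge results in whatever order tasks finish.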