# Caching

This notebook is a quick guide to the different memory and disk persistence options in Spark.

## Caching RDDs

To cache an RDD in memory in deserialized form:

In [1]:
import time
N = 10**5
rdd = sc.parallelize([x for x in range(N)])

In [2]:
#the first action is usually the slowest: nothing is cached yet, so the extra time most likely comes from one-time start-up costs

from math import sqrt
start = time.time()
rdd1 = rdd.map(lambda x: x*x)
rdd1.collect()
end = time.time()
print "Elapsed time square ", end-start
start = time.time()
rdd2 = rdd.map(lambda x: sqrt(x))
rdd2.collect()
end = time.time()
print "Elapsed time sqrt ", end-start

Elapsed time square  1.19509005547
Elapsed time sqrt  0.145571947098


In [3]:
#in case we are running the same cell multiple times
rdd.unpersist()
rdd.cache()

start = time.time()
rdd1 = rdd.map(lambda x: x*x)
rdd1.collect()
end = time.time()
print "Elapsed time square ", end-start
start = time.time()
rdd2 = rdd.map(lambda x: sqrt(x))
rdd2.collect()
end = time.time()
print "Elapsed time sqrt ", end-start

Elapsed time square  0.194844961166
Elapsed time sqrt  0.142702817917
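
The start/end bookkeeping repeated in the cells above could be folded into a small helper. The sketch below is one hypothetical way to do it: the name `timed` is not part of Spark or of the original notebook, and the workload is a plain Python stand-in so that the block runs without a cluster.

```python
import time


def timed(label, action):
    """Run a zero-argument callable, print the elapsed wall-clock time,
    and return (result, seconds). Hypothetical helper, not a Spark API."""
    start = time.time()
    result = action()
    elapsed = time.time() - start
    print("Elapsed time {0}: {1:.3f} s".format(label, elapsed))
    return result, elapsed


# Stand-in workload; with Spark one would pass, for example,
# `lambda: rdd.map(lambda x: x*x).collect()` instead.
squares, seconds = timed("square", lambda: [x * x for x in range(1000)])
```

Passing the action as a callable keeps the timing code in one place, so the cached and uncached runs are measured in exactly the same way.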


RDDs can also be persisted in memory backed by disk, in serialized form, and with replication across workers. You can read more about RDD persistence at http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence.

## The primary factors to consider while choosing a storage level

### Persistence Level

Storing the RDD in memory is the simplest option and carries little overhead. If the entire RDD does not fit in memory, the missing partitions are recomputed from the lineage automatically whenever they are needed. However, if an RDD produced by a wide transformation is going to be reused heavily in an iterative or interactive fashion, it is better to store it in memory backed by disk, so that its partitions are not recomputed.
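
The rule of thumb above can be written down as a tiny decision function. This is only a sketch: `pick_storage_level_name`, `persist_with`, and the two boolean parameters are hypothetical names introduced here. The returned strings are attribute names of `pyspark.StorageLevel`, which is imported only when an RDD is actually persisted.

```python
def pick_storage_level_name(fits_in_memory, expensive_to_recompute):
    """Return the name of a pyspark StorageLevel attribute (hypothetical helper).

    - Fits in memory, or cheap to recompute: keep it in memory only and let
      Spark rebuild any missing partitions from the lineage.
    - Does not fit and is expensive to recompute (for example, built by a
      wide transformation and reused many times): back it by disk.
    """
    if fits_in_memory or not expensive_to_recompute:
        return "MEMORY_ONLY"
    return "MEMORY_AND_DISK"


def persist_with(rdd, level_name):
    """Persist `rdd` with the named storage level; pyspark is imported lazily."""
    from pyspark import StorageLevel
    return rdd.persist(getattr(StorageLevel, level_name))


name = pick_storage_level_name(fits_in_memory=False, expensive_to_recompute=True)
print(name)
```

With a live SparkContext, `persist_with(rdd, name)` would apply the chosen level to an RDD.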

### Serialization

The default serialization in Spark is Java serialization. However, for better performance we recommend Kryo serialization, which is described in the data serialization section of the Spark tuning guide.
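
As a sketch of how Kryo could be switched on, the block below sets the documented `spark.serializer` property. The names `KRYO_SETTINGS` and `make_kryo_conf` are hypothetical, and the application name in the usage comment is a placeholder. One assumption about scope is worth stating: in PySpark the records of an RDD are pickled on the Python side, so this setting mainly affects data that Spark serializes inside the JVM.

```python
# Documented Spark property that selects the JVM-side serializer.
KRYO_SETTINGS = [
    ("spark.serializer", "org.apache.spark.serializer.KryoSerializer"),
]


def make_kryo_conf(app_name):
    """Build a SparkConf with Kryo enabled; pyspark is imported lazily
    so that this sketch can be loaded without Spark installed."""
    from pyspark import SparkConf
    return SparkConf().setAppName(app_name).setAll(KRYO_SETTINGS)


# Usage, assuming no SparkContext has been created yet:
#   from pyspark import SparkContext
#   sc = SparkContext(conf=make_kryo_conf("caching-demo"))
print(dict(KRYO_SETTINGS)["spark.serializer"])
```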

### Replication

Spark, by default, provides fault tolerance by recomputing any missing partitions on the fly. To avoid the cost of that recomputation, you can optionally provide a replication factor. Note, however, that this significantly increases the initial cache time and the storage usage.

To store an RDD in memory backed by disk, with a replication factor of 2:

In [4]:
rdd.unpersist()

from pyspark.storagelevel import StorageLevel
#the _2 suffix selects the variant of the storage level that keeps two replicas
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)

start = time.time()
rdd1 = rdd.map(lambda x: x*x)
rdd1.collect()
end = time.time()
print "Elapsed time square ", end-start
start = time.time()
rdd2 = rdd.map(lambda x: sqrt(x))
rdd2.collect()
end = time.time()
print "Elapsed time sqrt ", end-start

Elapsed time square  0.15674495697
Elapsed time sqrt  0.101278066635


The "Storage" tab in the Spark UI lists all cached RDDs and shows where each partition of an RDD resides in the cluster.

To uncache:

In [5]:
rdd.unpersist()

ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:423

## Caching DataFrames and tables in memory

Spark SQL tables and DataFrames can be cached too. They are cached in JVM memory in a columnar format and compressed using a simple algorithm.

Typically, however, the compression ratio will not be as good as that of a format like Parquet. To cache a DataFrame in memory:

In [6]:
df = sqlContext.range(N)
df.cache()

DataFrame[id: bigint]

In [7]:
df.printSchema()
df.cache()

root
 |-- id: long (nullable = false)



DataFrame[id: bigint]
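
The cells above cache a DataFrame directly. A table can be cached by name in a similar way. The sketch below assumes the Spark 1.x `SQLContext` API used in this notebook (`registerTempTable`, `cacheTable`, `uncacheTable`); newer Spark releases offer `createOrReplaceTempView` for the registration step. The helper names and the table name `numbers` are hypothetical.

```python
def cache_as_table(sql_context, df, table_name):
    """Register `df` under `table_name` and ask Spark SQL to cache that table."""
    df.registerTempTable(table_name)      # Spark 1.x DataFrame API
    sql_context.cacheTable(table_name)    # mark the table for in-memory caching
    return table_name


def release_table(sql_context, table_name):
    """Drop the cached data for `table_name` from memory."""
    sql_context.uncacheTable(table_name)


# Usage in the notebook, assuming `sqlContext` and `df` from the cells above:
#   cache_as_table(sqlContext, df, "numbers")
#   sqlContext.sql("SELECT COUNT(*) FROM numbers").collect()
#   release_table(sqlContext, "numbers")
```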

# Partitioning

In [8]:
import time
rdd = sc.parallelize([x for x in range(N)],5)

In [9]:
#the first action after cache() also has to materialize the cached partitions; later actions can reuse them
rdd.cache()

from math import sqrt
start = time.time()
rdd1 = rdd.map(lambda x: x*x)
rdd1.collect()
end = time.time()
print "Elapsed time square ", end-start
start = time.time()
rdd2 = rdd.map(lambda x: sqrt(x))
rdd2.collect()
end = time.time()
print "Elapsed time sqrt ", end-start

Elapsed time square  0.200296163559
Elapsed time sqrt  0.122011184692


# Large scale caching and partitioning tests

Let us return to the Adroit cluster accounts and rerun these tests in a distributed environment.
I have prepared a relatively large file (on the order of 1 GB) here:

```bash
[alexeys@adroit3 4_Optimize] ls -l /scratch/network/alexeys/BigDataCourse/large/test.json 
-rw-r--r-- 1 alexeys cses 1090671230 Apr  4 23:00 /scratch/network/alexeys/BigDataCourse/large/test.json
```

Change into the exercise folder:

```bash
cd BigDataCourse/4_Optimize
```

Inspect the cache_partition.py source file, and submit it to the cluster without changes by running:

```bash
sbatch slurm.cmd
```

Once the slurm\*out file appears in your submission area, connect to the Spark web UI:

```bash
firefox --no-remote http://<your master URL>:8080
```

where the master URL can be found in the Slurm output file.

<img src="/Users/alexeys/Desktop/BigDataCourse_2016/4_Optimize/debug/SparkUI_1.png">

# Broadcast variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcast this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

## Creating broadcast variables

Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The code below shows this:

In [1]:
broadcastVar = sc.broadcast([1, 2, 3])

In [2]:
broadcastVar.value

[1, 2, 3]

After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later).
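
To illustrate using the broadcast variable inside a function that runs on the cluster, here is a minimal sketch of a lookup-table pattern. The names `make_labeler` and `LocalBroadcast` are hypothetical: `LocalBroadcast` is only a stand-in that exposes the same `value` attribute as a Spark broadcast variable, so the block runs without a cluster.

```python
def make_labeler(broadcast_table, default="unknown"):
    """Return a function suitable for rdd.map that reads the lookup table
    through the broadcast wrapper instead of capturing the dict itself."""
    def label(key):
        # .value gives access to the wrapped table
        return broadcast_table.value.get(key, default)
    return label


class LocalBroadcast(object):
    """Stand-in with the same `.value` attribute as a Spark broadcast
    variable; used here only so the sketch runs locally."""
    def __init__(self, value):
        self.value = value


table = LocalBroadcast({1: "one", 2: "two", 3: "three"})
labeler = make_labeler(table)
labels = [labeler(k) for k in [1, 2, 4]]
print(labels)
```

With Spark, the stand-in would be replaced by `table = sc.broadcast({1: "one", 2: "two", 3: "three"})`, and the function would be applied with `sc.parallelize([1, 2, 4]).map(make_labeler(table)).collect()`.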

# Accumulators

Accumulators are variables that are only “added” to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they will be displayed in Spark’s UI. This can be useful for understanding the progress of running stages (NOTE: this is not yet supported in Python).

An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python). However, they cannot read its value. Only the driver program can read the accumulator’s value, using its value method.

The code below shows an accumulator being used to add up the elements of an array:

In [3]:
accum = sc.accumulator(0)

In [4]:
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))

In [5]:
accum.value

10
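
A common use of the counter pattern described above is to count records that fail to parse. The sketch below is hypothetical: `make_parser` and `LocalAccumulator` are names introduced here, and `LocalAccumulator` only mimics the `add` method and `value` attribute of a Spark accumulator so that the block runs without a cluster.

```python
def make_parser(bad_records):
    """Return a parse function that counts unparsable lines in `bad_records`."""
    def parse(line):
        try:
            return int(line)
        except ValueError:
            bad_records.add(1)  # tasks only add to the accumulator, never read it
            return None
    return parse


class LocalAccumulator(object):
    """Stand-in with the `add` method and `value` attribute of a Spark
    accumulator; used here only so the sketch runs locally."""
    def __init__(self, value):
        self.value = value

    def add(self, term):
        self.value += term


bad = LocalAccumulator(0)
parse = make_parser(bad)
parsed = [parse(s) for s in ["1", "x", "3", ""]]
print("bad records: {0}".format(bad.value))
```

With Spark, `bad = sc.accumulator(0)` would take the place of the stand-in, the function would be applied with `sc.parallelize(lines).map(make_parser(bad)).collect()`, and `bad.value` would be read on the driver afterwards. Note that Spark only guarantees exactly-once accumulator updates inside actions; updates made inside a transformation such as `map` may be applied more than once if a task is re-executed.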