# Caching

This notebook is a quick guide to the memory and disk persistence options in Spark.

## Caching RDDs

To simply cache an RDD in memory in deserialized form:

In [23]:
import time
# Distribute the integers 0..99999 across the cluster.
rdd = sc.parallelize(range(100000))

In [26]:
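from math import sqrt

# Time a full pass that squares every element (RDD not cached yet).
start = time.time()
rdd1 = rdd.map(lambda x: x * x)
rdd1.collect()
end = time.time()
print("Elapsed time square ", end - start)

# Time a full pass that takes the square root of every element.
start = time.time()
rdd2 = rdd.map(lambda x: sqrt(x))
rdd2.collect()
end = time.time()
print("Elapsed time sqrt ", end - start)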

Elapsed time square  0.122296094894
Elapsed time sqrt  0.127455949783


In [28]:
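# Mark the RDD for caching; the partitions are stored when the first action runs.
rdd.cache()

# First action after cache(): computes the partitions and populates the cache.
start = time.time()
rdd1 = rdd.map(lambda x: x * x)
rdd1.collect()
end = time.time()
print("Elapsed time square ", end - start)

# Second action: can read the partitions from the cache.
start = time.time()
rdd2 = rdd.map(lambda x: sqrt(x))
rdd2.collect()
end = time.time()
print("Elapsed time sqrt ", end - start)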

Elapsed time square  0.128567934036
Elapsed time sqrt  0.0969128608704
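
Note that `cache()` is lazy: it only marks the RDD, and the partitions are stored when the first action runs. That is a likely reason why the first timing after `rdd.cache()` shows no improvement while the second one does, although with a dataset this small the differences are close to measurement noise. The repeated start/end pattern can be wrapped in a small helper. This is a minimal sketch: `timed` is a hypothetical helper, not part of Spark, and the workload is a plain Python stand-in so the cell runs without a cluster.

```python
import time


def timed(label, action):
    """Run a zero-argument callable and report its wall-clock time (hypothetical helper)."""
    start = time.time()
    result = action()
    elapsed = time.time() - start
    print("Elapsed time %s: %.4f s" % (label, elapsed))
    return result, elapsed


# Stand-in workload. With Spark one would pass, for example,
#   lambda: rdd.map(lambda x: x * x).collect()
# and call it twice after rdd.cache() to separate the cache-filling run
# from the run that reads the cached partitions.
squares, first_run = timed("square, run 1", lambda: [x * x for x in range(1000)])
squares, second_run = timed("square, run 2", lambda: [x * x for x in range(1000)])
```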


RDDs can also be persisted in memory backed by disk, in serialized form, with replication across workers. You can read more in the RDD Persistence section of the Spark programming guide.

## The primary factors to consider while choosing a storage level

### Persistence Level: 

Storing the RDD in memory only is the simplest option and carries the least overhead. If the entire RDD doesn't fit in memory, the partitions that are missing from the cache are recomputed automatically from the lineage whenever they are needed. However, if an RDD produced by a wide transformation, which is expensive to recompute, is going to be used heavily in an iterative or interactive fashion, it is better to store it in memory backed by disk so that its partitions are not recomputed.
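
The trade-off can be illustrated with a toy model that counts how often a partition has to be computed from its lineage. This is only a sketch under simplifying assumptions: `count_partition_computations` is a hypothetical function, every partition is assumed to take one memory slot, and no eviction policy is modelled. It is not how Spark's block manager is implemented.

```python
def count_partition_computations(num_partitions, memory_slots, passes, spill_to_disk):
    """Toy model: count partition computations over repeated passes of an RDD."""
    in_memory = set()
    on_disk = set()
    computations = 0
    for _ in range(passes):
        for partition in range(num_partitions):
            if partition in in_memory or partition in on_disk:
                continue  # served from the cache, nothing to recompute
            computations += 1  # computed from the lineage
            if len(in_memory) < memory_slots:
                in_memory.add(partition)
            elif spill_to_disk:
                on_disk.add(partition)
            # Otherwise the partition is not cached and is recomputed on the next pass.
    return computations


# 10 partitions, room for 6 in memory, 5 passes over the data.
memory_only = count_partition_computations(10, 6, 5, spill_to_disk=False)
memory_and_disk = count_partition_computations(10, 6, 5, spill_to_disk=True)
print("memory only:", memory_only, "memory and disk:", memory_and_disk)
```

In this model the memory-only run recomputes the four uncached partitions on every later pass, while the disk-backed run computes each partition once.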

### Serialization: 

The default serialization in Spark is Java serialization. For better performance, however, we recommend Kryo serialization, which you can learn more about in the Data Serialization section of the Spark tuning guide.
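
As a sketch, Kryo is selected through the `spark.serializer` setting. The helper `to_submit_args` below is hypothetical and only renders settings as `--conf` flags for `spark-submit`. Keep in mind that this setting affects JVM-side serialization; Python objects in PySpark RDDs are pickled regardless.

```python
# Setting that switches JVM-side serialization to Kryo.
kryo_settings = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}


def to_submit_args(settings):
    """Render settings as `--conf key=value` arguments (hypothetical helper)."""
    args = []
    for key in sorted(settings):
        args.extend(["--conf", "%s=%s" % (key, settings[key])])
    return args


print(" ".join(to_submit_args(kryo_settings)))
```

Inside a notebook, the same key-value pairs could instead be passed to `SparkConf().setAll(...)` before the context is created.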

### Replication: 

By default, Spark provides fault tolerance by recomputing any missing partitions on the fly. To speed up recovery when a node is lost, you can optionally specify a replication factor so that each cached partition is stored on more than one node. Note, however, that this significantly increases the initial cache time and storage usage.

To store an RDD serialized in memory, backed by disk, with a replication factor of 2:

In [2]:
from pyspark.storagelevel import StorageLevel

rdd2 = sc.parallelize(range(1, 1001))
# Same flags as MEMORY_AND_DISK_SER_2, spelled out because some PySpark
# releases do not define the _SER constants.
level = StorageLevel(useDisk=True, useMemory=True, useOffHeap=False,
                     deserialized=False, replication=2)
rdd2.persist(level)
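
The names of the JVM storage levels follow a convention that can be decoded mechanically: where the data lives, an optional `_SER` for serialized in-memory storage, and an optional trailing replication factor. The sketch below illustrates that convention only. `StorageFlags` and `decode_storage_level` are hypothetical names, not part of PySpark, and `OFF_HEAP` is outside the scope of the sketch.

```python
from collections import namedtuple

StorageFlags = namedtuple(
    "StorageFlags", ["use_disk", "use_memory", "deserialized", "replication"]
)


def decode_storage_level(name):
    """Decode a level name such as MEMORY_AND_DISK_SER_2 (hypothetical helper)."""
    parts = name.split("_")
    replication = 1
    if parts[-1].isdigit():  # a trailing number is the replication factor
        replication = int(parts.pop())
    serialized = parts[-1] == "SER"
    if serialized:
        parts.pop()
    base = "_".join(parts)
    use_memory = "MEMORY" in base
    use_disk = "DISK" in base
    # In-memory data is kept deserialized unless the name carries _SER;
    # data on disk is always serialized.
    deserialized = use_memory and not serialized
    return StorageFlags(use_disk, use_memory, deserialized, replication)


for level_name in ["MEMORY_ONLY", "MEMORY_AND_DISK_SER_2", "DISK_ONLY"]:
    print(level_name, decode_storage_level(level_name))
```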

The "Storage" tab in the Spark UI lists all cached RDDs and shows where each partition of an RDD resides in the cluster.

To uncache:

In [None]:
rdd.unpersist()

## Caching DataFrames and tables in memory

Spark SQL tables and DataFrames can be cached too. They are cached in JVM memory in a columnar format and compressed using a simple algorithm. Typically, however, the compression ratio will not be as good as that of a format like Parquet. To cache a DataFrame in memory:

In [None]:
# Build a small DataFrame with a single column holding 0..99, then mark it for caching.
df = sqlContext.range(100)
df.cache()

To cache/uncache a table in memory:

In [None]:
tableName = "YOUR TABLE NAME HERE"
# Cache the table in memory as a best effort
sqlContext.cacheTable(tableName)
df = sqlContext.table(tableName)
# Your queries on df here...
# Uncache the table from memory
sqlContext.uncacheTable(tableName)

To uncache all tables from memory:

In [None]:
sqlContext.clearCache()

# Partitioning