## Persistence (Caching)

If we call action functions on an RDD multiple times, <br>
the RDD is recomputed from its lineage each time.<br>
To avoid that, we use caching.

In [1]:
sampleRDD = sc.parallelize([1,2,3,4,5,6])

In [2]:
sampleRDD.count()

6

In [5]:
# descending order
sampleRDD.takeOrdered(4, key=lambda x: -x)

[6, 5, 4, 3]

Note: Here we run two actions on sampleRDD, so sampleRDD is computed two times.
We can avoid that by calling persist or cache.
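To make the recomputation concrete, here is a minimal pure-Python sketch. It is a toy model, not how Spark is implemented: `LazyDataset`, `times_computed`, and `take_ordered` are hypothetical names made up for this illustration. It only shows the idea that each action rebuilds the data unless the dataset has been marked as cached.

```python
class LazyDataset:
    """Toy stand-in for an RDD: data is produced only when an action runs."""

    def __init__(self, compute):
        self._compute = compute      # function that builds the data
        self._cached = None          # materialized data, once cached
        self._use_cache = False
        self.times_computed = 0      # how often the data was rebuilt

    def cache(self):
        # Like RDD.cache(), this only marks the dataset; nothing is computed yet.
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached      # reuse the stored result
        self.times_computed += 1
        data = list(self._compute())
        if self._use_cache:
            self._cached = data
        return data

    # Two "actions", loosely mirroring count() and takeOrdered()
    def count(self):
        return len(self._materialize())

    def take_ordered(self, n, key=None):
        return sorted(self._materialize(), key=key)[:n]


uncached = LazyDataset(lambda: range(1, 7))
uncached.count()
uncached.take_ordered(4, key=lambda x: -x)
print(uncached.times_computed)   # 2: one computation per action

cached = LazyDataset(lambda: range(1, 7)).cache()
cached.count()
cached.take_ordered(4, key=lambda x: -x)
print(cached.times_computed)     # 1: the second action reuses the cache
```

Note that in this sketch, as in Spark, calling `cache()` is lazy: the data is stored the first time an action materializes it.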

##### Cache

cache() uses the default storage level.<br>

1. <b>MEMORY_ONLY for RDD</b><br>
2. <b>MEMORY_AND_DISK for Dataset</b>

In [6]:
sampleRDD.cache()

ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:423

##### persist(storage_level)

We use persist if we want to assign a storage level other than the default.

MEMORY_ONLY - Stored only in memory. <br><br>
MEMORY_ONLY_SER (only in Java and Scala) - Stored in memory but serialized; saves space but costs more CPU time to read.<br><br>
MEMORY_AND_DISK - Spills to disk if there is too much data to fit in memory. <br><br>
MEMORY_AND_DISK_SER (only in Java and Scala) - Same as MEMORY_AND_DISK but serialized.<br><br>
DISK_ONLY - Stored only on disk.<br><br>
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. - Same as the levels above, but each partition is replicated on two cluster nodes.

Please note that in Python, objects are always serialized with the pickle library and stored in the JVM when we persist, so it does not matter whether we choose a serialized level.

In [13]:
filteredRDD = sampleRDD.filter(lambda x:x%2==0)



In [14]:
import pyspark  # needed for pyspark.StorageLevel if not already imported
filteredRDD.persist(pyspark.StorageLevel.MEMORY_ONLY)

PythonRDD[5] at RDD at PythonRDD.scala:43

<b>To UNPERSIST the RDD </b>

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.

In [17]:
filteredRDD.unpersist()

PythonRDD[5] at RDD at PythonRDD.scala:43
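The least-recently-used eviction described above can be sketched in plain Python. This is only an illustration of the LRU idea under simplifying assumptions (a fixed number of slots rather than a memory budget); `LRUPartitionCache` and its methods are hypothetical names, not Spark APIs, and Spark's real block manager is considerably more involved.

```python
from collections import OrderedDict


class LRUPartitionCache:
    """Toy cache holding at most `capacity` partitions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()   # oldest use first, newest use last

    def get(self, partition_id):
        if partition_id not in self._store:
            return None               # a real system would recompute here
        self._store.move_to_end(partition_id)   # mark as recently used
        return self._store[partition_id]

    def put(self, partition_id, data):
        self._store[partition_id] = data
        self._store.move_to_end(partition_id)
        while len(self._store) > self.capacity:
            self._store.popitem(last=False)     # evict least recently used

    def unpersist(self, partition_id):
        # Manual removal instead of waiting for eviction
        self._store.pop(partition_id, None)

    def cached_ids(self):
        return list(self._store)


cache = LRUPartitionCache(capacity=2)
cache.put("p0", [2, 4])
cache.put("p1", [6])
cache.get("p0")            # p0 is now the most recently used
cache.put("p2", [8])       # over capacity: p1 is evicted
print(cache.cached_ids())  # ['p0', 'p2']
cache.unpersist("p0")
print(cache.cached_ids())  # ['p2']
```

One difference worth keeping in mind: in this sketch `unpersist` drops a single partition, whereas `RDD.unpersist()` removes all cached partitions of that RDD.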

<b>Which Storage Level to Choose?</b>

1. If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.

2. Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.

3. Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
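The three guidelines above can be condensed into a small decision helper. This is a rough sketch, not an official rule: `choose_storage_level` and its boolean parameters are hypothetical, and judging whether data "fits comfortably" or is "expensive to recompute" is left to you. The function returns the name of a storage level as a string.

```python
def choose_storage_level(fits_in_memory,
                         expensive_to_recompute,
                         needs_fast_recovery=False):
    """Return a storage level name following the guidelines above."""
    if fits_in_memory or not expensive_to_recompute:
        # Rules 1 and 2: stay in memory; cheap partitions that do not
        # fit are simply recomputed when needed.
        level = "MEMORY_ONLY"
    else:
        # Rule 2: spilling to disk pays off only for costly computations.
        level = "MEMORY_AND_DISK"
    if needs_fast_recovery:
        # Rule 3: replicate each partition on two nodes.
        level += "_2"
    return level


print(choose_storage_level(True, False))          # MEMORY_ONLY
print(choose_storage_level(False, True))          # MEMORY_AND_DISK
print(choose_storage_level(False, True, True))    # MEMORY_AND_DISK_2
```

In a PySpark session, the returned name could then be resolved with `getattr(pyspark.StorageLevel, name)` and passed to `persist`.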