In [0]:
# Spark Session
from pyspark.sql import SparkSession
spark = (
        SparkSession
        .builder
        .appName("Cache and Persist")
        .master("local[*]")
        .config("spark.executor.memory", '512M')
        .getOrCreate()
)
spark


In [0]:
# Read EMP CSV file with 10 million records
_schema = "first_name string, last_name string, job_title string, dob date, email string, phone string, salary double, department string, department_id integer"
emp = spark.read.schema(_schema).option("header",True).csv("/data/input/datasets/employee_records.csv")

In [0]:
emp.where("salary>1000").show(truncate=False)

+----------+---------+-----------------+----------+------------------------------+---------------+------------------+------------------+-------------+
|first_name|last_name|job_title        |dob       |email                         |phone          |salary            |department        |department_id|
+----------+---------+-----------------+----------+------------------------------+---------------+------------------+------------------+-------------+
|Jennifer  |Williams |HR Specialist    |1951-01-21|Jennifer.Williams.@example.com|+1-845-311-804 |42951.90537045701 |Finance           |6            |
|James     |Miller   |Sales Executive  |1939-09-25|James.Miller.@example.com     |+1-274-633-7306|50933.8591162336  |Data and Analytics|6            |
|Linda     |Jones    |Data Scientist   |2023-05-26|Linda.Jones.@example.com      |+1-149-733-8924|66274.49226944339 |Data and Analytics|2            |
|Srishti   |Smith    |Data Engineer    |2003-01-16|Srishti.Smith.@example.com    |+1-790-373-5

In [0]:
# CACHE dataframe (cache or persist)
emp.cache()

Out[7]: DataFrame[first_name: string, last_name: string, job_title: string, dob: date, email: string, phone: string, salary: double, department: string, department_id: int]

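- `cache()` is lazy: calling it only marks the DataFrame for caching, and nothing is stored until an action runs.
- That action can be `count()` or a write.
- `count()` and writes read the whole DataFrame, whereas `show()` reads only as many partitions as it needs and would therefore cache the DataFrame only partially.
- Let's see it below.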

In [0]:
emp.cache().count()  # Default storage level is MEMORY_AND_DISK; returns the row count (10 million records)


Out[8]: 10000000

The above action materializes the cache, which now appears in the '*Storage*' tab of the '*Spark UI*'.

- *'MEMORY_AND_DISK'* is the default storage level when you call `cache()` on a DataFrame or Dataset in PySpark.
- For RDDs, the default is *'MEMORY_ONLY'*.


In [0]:
emp.where("salary>1000").show(truncate=False)

+----------+---------+-----------------+----------+------------------------------+---------------+------------------+------------------+-------------+
|first_name|last_name|job_title        |dob       |email                         |phone          |salary            |department        |department_id|
+----------+---------+-----------------+----------+------------------------------+---------------+------------------+------------------+-------------+
|Jennifer  |Williams |HR Specialist    |1951-01-21|Jennifer.Williams.@example.com|+1-845-311-804 |42951.90537045701 |Finance           |6            |
|James     |Miller   |Sales Executive  |1939-09-25|James.Miller.@example.com     |+1-274-633-7306|50933.8591162336  |Data and Analytics|6            |
|Linda     |Jones    |Data Scientist   |2023-05-26|Linda.Jones.@example.com      |+1-149-733-8924|66274.49226944339 |Data and Analytics|2            |
|Srishti   |Smith    |Data Engineer    |2003-01-16|Srishti.Smith.@example.com    |+1-790-373-5

- Because the DataFrame was cached before running the above cell, the filtered query ran much faster (less than 1 second) than before (see cell 3 above, which took 9 seconds without the cache).
- Check the *SQL/DataFrame* tab of the Spark UI for more info.

In [0]:
# Remove CACHE
emp.unpersist()

Out[10]: DataFrame[first_name: string, last_name: string, job_title: string, dob: date, email: string, phone: string, salary: double, department: string, department_id: int]

The above call removes the cache created earlier, so it disappears from the 'Storage' tab of the 'Spark UI'.

In [0]:
emp_cache = emp.cache()
emp_cache.count()

Out[12]: 10000000

In [0]:
emp.where("salary>1000").show(truncate=False)

+----------+---------+-----------------+----------+------------------------------+---------------+------------------+------------------+-------------+
|first_name|last_name|job_title        |dob       |email                         |phone          |salary            |department        |department_id|
+----------+---------+-----------------+----------+------------------------------+---------------+------------------+------------------+-------------+
|Jennifer  |Williams |HR Specialist    |1951-01-21|Jennifer.Williams.@example.com|+1-845-311-804 |42951.90537045701 |Finance           |6            |
|James     |Miller   |Sales Executive  |1939-09-25|James.Miller.@example.com     |+1-274-633-7306|50933.8591162336  |Data and Analytics|6            |
|Linda     |Jones    |Data Scientist   |2023-05-26|Linda.Jones.@example.com      |+1-149-733-8924|66274.49226944339 |Data and Analytics|2            |
|Srishti   |Smith    |Data Engineer    |2003-01-16|Srishti.Smith.@example.com    |+1-790-373-5

From the above two cells we can understand that:
- The query took less than 1 second to run.
- Spark tracks the lineage (query plan) of every DataFrame. Once the data is cached, any query that refers to the original DataFrame ('emp' in this case) reads from the cache rather than from the original data source.

In [0]:
emp.unpersist()

Out[23]: DataFrame[first_name: string, last_name: string, job_title: string, dob: date, email: string, phone: string, salary: double, department: string, department_id: int]

In [0]:
emp_cache_fil = emp.where("salary<50000").cache()
emp_cache_fil.count()

Out[26]: 1428007

- Here only the filtered subset ('salary<50000') is cached, not the whole of 'emp'; this is a partial cache.
- The code below will not read from that cache: it goes back to the original DataFrame with a different filter, so its plan does not match the cached one and Spark reads the data from the original source again.
- This is why a partial cache is dangerous if it is not used carefully.

In [0]:
emp.where("salary<60000").cache()

Out[29]: DataFrame[first_name: string, last_name: string, job_title: string, dob: date, email: string, phone: string, salary: double, department: string, department_id: int]

In [0]:
# To REMOVE all the Cache
spark.catalog.clearCache()

**Different Storage levels involved in CACHE**

In [0]:
# Storage levels usable here: MEMORY_ONLY, MEMORY_AND_DISK
# (MEMORY_ONLY_SER and MEMORY_AND_DISK_SER are for Scala/Java jobs)
import pyspark
emp_persist = emp.persist(pyspark.StorageLevel.MEMORY_ONLY)


- This time the cache will show 'Storage Level: Memory Serialized 1x Replicated'.
- `cache()` stores the data deserialized, whereas `persist()` with 'MEMORY_ONLY' from PySpark stores it serialized, in memory only.
- Like `cache()`, `persist()` is lazy, so it needs an action.
Let's do that below

In [0]:
emp_persist.write.format("noop").mode("overwrite").save()

> NOTE: 
- If your data is larger than the memory you have configured, prefer *'MEMORY_AND_DISK'*. With *'MEMORY_ONLY'*, partitions that do not fit in memory are not cached and are recomputed each time they are needed, whereas *'MEMORY_AND_DISK'* spills them to disk.
- In PySpark, data persisted with *'MEMORY_ONLY'* or *'MEMORY_AND_DISK'* is already stored serialized, so the separate *'MEMORY_ONLY_SER'* and *'MEMORY_AND_DISK_SER'* levels are not available in PySpark.
- *'MEMORY_ONLY_SER'* and *'MEMORY_AND_DISK_SER'* are used for Scala and Java jobs.

In [0]:
spark.catalog.clearCache()

In [0]:
# DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2
emp_persist= emp.persist(pyspark.StorageLevel.MEMORY_ONLY_2)
emp_persist.write.format("noop").mode("overwrite").save()

- This time the cache will show 'Storage Level: Memory Serialized 2x Replicated'.
- The '_2' suffix means that each cached partition is replicated on two cluster nodes; it is not copied to every executor.

> IMPORTANT DIFFERENCE between CACHE and PERSIST method:
- PERSIST: Lets you choose the storage level by passing it as an argument.
- CACHE: Takes no argument and always uses the default storage level, '*MEMORY_AND_DISK*', with the data stored deserialized.
