
In [1]:
!apt-get update # Update apt-get repository.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null # Install Java.
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz # Download Apache Spark.
!tar xf spark-3.1.1-bin-hadoop3.2.tgz # Extract the tgz archive.
!pip install -q findspark # Install findspark, which adds PySpark to sys.path at runtime.

# Set environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

!ls

# Initialize findspark
import findspark
findspark.init()

# create a spark session

from pyspark.sql import SparkSession
Spark = SparkSession.builder\
        .appName("Studentfilter")\
        .getOrCreate()


**PySpark cache()**\
The PySpark cache() method stores the intermediate result of a transformation so that later transformations built on top of it run faster. Caching such results is a common optimization for long-running PySpark applications and jobs.

cache() is evaluated lazily: calling it only marks the DataFrame for caching, and nothing is stored until an action runs on it. In this article, I will explain what caching is, how it improves performance, and how to cache PySpark DataFrame results, with examples.

**Benefits of Caching**\
Caching a DataFrame that is reused across multiple operations can significantly improve the performance of a PySpark job. The benefits of cache() are listed below.

**Cost-efficient** – Spark computations are expensive, so reusing their results saves cost.\
**Time-efficient** – Reusing the results of repeated computations saves time.\
**Execution time** – The job finishes sooner, so more jobs can run on the same cluster.



**Why do we need Cache in PySpark?**\
First, let’s run some transformations without cache() and see where the performance issue comes from.



In [9]:
from pyspark.sql.functions import col

# Sample employee records: (emp_id, name, domain, location)
emp_data = [
    (894954, 'Shiva', 'AIA', 'Bengaluru'),
    (894941, 'Vaishnavi', 'AIA', 'Bengaluru'),
    (894950, 'Suraj', 'AIA', 'Pune'),
    (894921, 'Shubham', 'DEV', 'Chennai'),
    (894930, 'Gautam', 'DEV', 'Pune'),
    (894900, 'Abhijit', 'AIA', 'Bengaluru'),
]

schema = ["emp_id", "name", "domain", "location"]

emp_df = Spark.createDataFrame(emp_data, schema=schema)
emp_df.show(truncate=False)
emp_df.printSchema()

# Keep only the employees in the AIA domain
df2 = emp_df.where(col("domain") == 'AIA')
df2.show()

# Narrow the AIA employees down to Bengaluru
df3 = df2.where(col("location") == 'Bengaluru')
df3.show()

+------+---------+------+---------+
|emp_id|name     |domain|location |
+------+---------+------+---------+
|894954|Shiva    |AIA   |Bengaluru|
|894941|Vaishnavi|AIA   |Bengaluru|
|894950|Suraj    |AIA   |Pune     |
|894921|Shubham  |DEV   |Chennai  |
|894930|Gautam   |DEV   |Pune     |
|894900|Abhijit  |AIA   |Bengaluru|
+------+---------+------+---------+

root
 |-- emp_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- domain: string (nullable = true)
 |-- location: string (nullable = true)

+------+---------+------+---------+
|emp_id|     name|domain| location|
+------+---------+------+---------+
|894954|    Shiva|   AIA|Bengaluru|
|894941|Vaishnavi|   AIA|Bengaluru|
|894950|    Suraj|   AIA|     Pune|
|894900|  Abhijit|   AIA|Bengaluru|
+------+---------+------+---------+

+------+---------+------+---------+
|emp_id|     name|domain| location|
+------+---------+------+---------+
|894954|    Shiva|   AIA|Bengaluru|
|894941|Vaishnavi|   AIA|Bengaluru|
|894900|  Abhijit|   AIA|Bengaluru|
+------+---------+------+---------+

Let’s assume the company database holds billions of records like the ones in the above example. Transformations are lazy and only run when an action is called. In the above example, df2.show() is an action, so it triggers reading the data from the **emp_data** list and then emp_df.where().

The next action, df3.show(), triggers the whole chain again: reading the data from the **emp_data** list, emp_df.where(), and then df2.where().

So in the above example, the data is read and emp_df.where() is executed once for each of these two actions. With a large number of records this becomes a performance issue, and it can be easily avoided by caching the results of Spark.createDataFrame() and emp_df.where(), as shown in the **Using PySpark Cache** section.



In [None]:
# Syntax of cache() (call it on a DataFrame instance, not on the class itself)

# DataFrame.cache()

**Using PySpark Cache**\
From the above example, let’s add a cache() call to the Spark.createDataFrame() and emp_df.where() transformations. When emp_df.show() executes, the data is read from **emp_data** and the result is cached in memory. When df2.show() executes, emp_df.where(..) runs on top of the cached emp_df, and its result is cached in memory as well.

When df3.show() executes, it only applies df2.where() on top of the cached result of df2, without re-executing the previous transformations.

In [11]:
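# cache() marks emp_df for caching; the first action fills the cache
emp_df = Spark.createDataFrame(emp_data, schema=schema).cache()
emp_df.show(truncate=False)

# Cache the filtered result so that later steps can reuse it
df2 = emp_df.where(col("domain") == 'AIA').cache()
df2.show()

# Runs on top of the cached df2
df3 = df2.where(col("location") == 'Bengaluru')
df3.show()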

+------+---------+------+---------+
|emp_id|name     |domain|location |
+------+---------+------+---------+
|894954|Shiva    |AIA   |Bengaluru|
|894941|Vaishnavi|AIA   |Bengaluru|
|894950|Suraj    |AIA   |Pune     |
|894921|Shubham  |DEV   |Chennai  |
|894930|Gautam   |DEV   |Pune     |
|894900|Abhijit  |AIA   |Bengaluru|
+------+---------+------+---------+

+------+---------+------+---------+
|emp_id|     name|domain| location|
+------+---------+------+---------+
|894954|    Shiva|   AIA|Bengaluru|
|894941|Vaishnavi|   AIA|Bengaluru|
|894950|    Suraj|   AIA|     Pune|
|894900|  Abhijit|   AIA|Bengaluru|
+------+---------+------+---------+

+------+---------+------+---------+
|emp_id|     name|domain| location|
+------+---------+------+---------+
|894954|    Shiva|   AIA|Bengaluru|
|894941|Vaishnavi|   AIA|Bengaluru|
|894900|  Abhijit|   AIA|Bengaluru|
+------+---------+------+---------+



In the above code, we read the data into the DataFrame emp_df. Applying the where transformation on emp_df gives df2, which contains only the records where domain == "AIA", and we mark this DataFrame for caching. As discussed, cache() does not run the transformation by itself, because it is lazy. Only when df2.show() is executed is where(col("domain") == "AIA").cache() evaluated and the result of df2 cached.

When we apply the where transformation on df2 with location == "Bengaluru", df2 is already cached, so Spark looks up the cached data and uses it instead of recomputing df2. The last table above is the output of this transformation on df2, assigned to df3 and displayed with the action show().