# Caching Data

### Use _cache()_

Create a large dataset with a couple of columns.

In [1]:
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
spark = (
    SparkSession
    .builder
    .appName("07_chap")
    .config("spark.sql.catalogImplementation", "hive")
    .getOrCreate()
    )
sc = spark.sparkContext

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/14 23:30:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/05/14 23:30:49 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [2]:
%%time
from pyspark.sql.functions import col

df = spark.range(1 * 10000000).toDF("id").withColumn("square", col("id") * col("id"))
df.cache().count()

CPU times: user 3.05 ms, sys: 1.96 ms, total: 5.01 ms
Wall time: 8.57 s

10000000

In [3]:
%%time
df.count()

CPU times: user 823 μs, sys: 891 μs, total: 1.71 ms
Wall time: 546 ms


10000000
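
The `%%time` magic only works inside IPython. As a rough sketch, the same comparison between the first action (which fills the cache) and the second (which reads from it) can be made with a small helper. `time_action` and `speedup` are hypothetical names, not part of PySpark, and the numbers passed to `speedup` are simply the wall times reported by the two cells above.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def time_action(action: Callable[[], T]) -> Tuple[T, float]:
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def speedup(first_seconds: float, second_seconds: float) -> float:
    """Return how many times faster the second run was than the first."""
    return first_seconds / second_seconds


# Wall times from the cells above: 8.57 s (compute and cache), 546 ms (read from cache).
print(f"{speedup(8.57, 0.546):.1f}x")  # prints 15.7x
```

With a live session, `time_action(lambda: df.cache().count())` followed by `time_action(df.count)` would produce the two timings to compare.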

Check the Spark UI storage tab to see where the data is stored.
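
The cache state can also be inspected from code rather than the UI. The helper below is a minimal sketch: `cache_status` is a hypothetical name, and it only reads the `is_cached` and `storageLevel` attributes that a PySpark DataFrame exposes.

```python
def cache_status(df) -> str:
    """Summarize whether a DataFrame is marked as cached and at which storage level."""
    # is_cached is set as soon as cache() or persist() is called;
    # storageLevel reports the level registered for the DataFrame's plan.
    return f"cached={df.is_cached}, level={df.storageLevel}"
```

Calling `print(cache_status(df))` after the cell that runs `df.cache().count()` should report the DataFrame as cached. These attributes describe the requested level; the Storage tab remains the place to see how many partitions are actually materialized and where.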

In [4]:
# Unpersist df first: df2 below has the same query plan as df, so without this
# Spark would treat df2 as already cached and not cache it at the new storage level.
df.unpersist()

DataFrame[id: bigint, square: bigint]

### Use _persist(StorageLevel.Level)_

In [5]:
from pyspark import StorageLevel

df2 = spark.range(1 * 10000000).toDF("id").withColumn("square", col("id") * col("id"))
df2.persist(StorageLevel.DISK_ONLY).count()

10000000

In [6]:
df2.count()

10000000

Check the Spark UI storage tab to see where the data is stored.

In [7]:
df2.unpersist()

DataFrame[id: bigint, square: bigint]
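
Because a forgotten `unpersist()` leaves data cached and can interfere with later DataFrames that share the same plan, the persist / use / unpersist sequence can be wrapped in a context manager. This is one possible sketch, not a PySpark feature: `persisted` is a hypothetical helper that relies only on the `persist(storageLevel)` and `unpersist()` methods used above.

```python
from contextlib import contextmanager


@contextmanager
def persisted(df, storage_level):
    """Persist df at storage_level inside the with-block, then unpersist it."""
    df.persist(storage_level)
    try:
        yield df
    finally:
        # Always release the cached data, even if the body raises.
        df.unpersist()


# Usage with the objects from the cells above (requires a running SparkSession):
# with persisted(df2, StorageLevel.DISK_ONLY) as cached:
#     cached.count()   # first action materializes the cache
#     cached.count()   # later actions read from it
```

The `try`/`finally` ensures the cache is released on every exit path, which is easy to miss when cells are run out of order.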

### Use _CACHE TABLE_

In [8]:
df.createOrReplaceTempView("dfTable")
spark.sql("CACHE TABLE dfTable")

DataFrame[]

Check the Spark UI storage tab to see where the data is stored.

In [9]:
spark.sql("SELECT count(*) FROM dfTable").show()

+--------+
|count(1)|
+--------+
|10000000|
+--------+
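
Unlike `DataFrame.cache()`, which waits for the first action, the SQL `CACHE TABLE` statement is eager by default; Spark SQL also accepts a `LAZY` keyword and an `OPTIONS ('storageLevel' = ...)` clause. The sketch below only assembles those statements as strings. `cache_table_sql` and `uncache_table_sql` are hypothetical helper names, and the table name is assumed to be a trusted identifier.

```python
from typing import Optional


def cache_table_sql(table: str, lazy: bool = False,
                    storage_level: Optional[str] = None) -> str:
    """Build a Spark SQL CACHE TABLE statement for a trusted table name."""
    parts = ["CACHE"]
    if lazy:
        # LAZY defers caching until the table is first used.
        parts.append("LAZY")
    parts.append(f"TABLE {table}")
    if storage_level is not None:
        parts.append(f"OPTIONS ('storageLevel' = '{storage_level}')")
    return " ".join(parts)


def uncache_table_sql(table: str, if_exists: bool = True) -> str:
    """Build the matching UNCACHE TABLE statement."""
    clause = "IF EXISTS " if if_exists else ""
    return f"UNCACHE TABLE {clause}{table}"


print(cache_table_sql("dfTable", lazy=True, storage_level="DISK_ONLY"))
print(uncache_table_sql("dfTable"))
```

Each string can be passed to `spark.sql(...)`, and `spark.catalog.isCached("dfTable")` offers a way to confirm the result from Python.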



In [12]:
spark.sql("DROP TABLE dfTable")

DataFrame[]