# Understanding Caching and Persist in PySpark

## Learning Objectives

By the end of this notebook, you will understand:

1. **What caching and persist are** and why they're essential for performance
2. **The difference** between `cache()` and `persist()`
3. **Storage levels** available in Spark and when to use each
4. **When to cache** vs when not to cache
5. **Best practices** for caching in production
6. **Common mistakes** and how to avoid them

## Prerequisites

- Understanding of Spark architecture (executors, cores, tasks) - see `08_a_Spark_Architecture.ipynb`
- Understanding of Spark transformations and actions - see `03_basic_dataframe_operations.ipynb`
- Basic familiarity with Spark DataFrame operations

---

> **Note:** This notebook builds on the concepts from `08_a_Spark_Architecture.ipynb` and `08_b_Partitions_Concepts.ipynb`. Make sure you understand how Spark executes transformations and actions before proceeding.


## Introduction: The Problem Caching Solves

### Common Scenario

You're working with a Spark DataFrame and you need to use it multiple times:

```python
# Read data
df = spark.read.parquet("large_dataset.parquet")

# First use: Calculate total sales
total_sales = df.agg({"sales": "sum"}).collect()

# Second use: Calculate average sales
avg_sales = df.agg({"sales": "avg"}).collect()

# Third use: Count records
record_count = df.count()
```

### ⚠️ Important: Understanding Spark's Lazy Evaluation

**Common Misconception:**
> "When I do `df = spark.read.parquet(...)`, Spark reads the data and stores it in memory, so subsequent actions just use that stored data."

**Reality:**
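> **Spark uses lazy evaluation!** The `spark.read.parquet()` call **does NOT load the dataset**. At most it inspects file metadata to work out the schema; the rows themselves are read only when an action runs. What you get back is a **plan** for how to read the data when needed.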

**What Actually Happens:**

```python
# Step 1: This does NOT read data - just creates a plan
df = spark.read.parquet("large_dataset.parquet")
# df is just a "recipe" - no data has been read yet!

# Step 2: First action triggers execution
total_sales = df.agg({"sales": "sum"}).collect()
# NOW Spark reads from disk → computes sum → returns result
# But the data is NOT stored anywhere!

# Step 3: Second action triggers execution AGAIN
avg_sales = df.agg({"sales": "avg"}).collect()
# Spark reads from disk AGAIN → computes avg → returns result
# The previous read is gone - no data was stored!

# Step 4: Third action triggers execution AGAIN
record_count = df.count()
# Spark reads from disk AGAIN → counts → returns result
```

**Visual Representation:**

```
df = spark.read.parquet(...)
  ↓
  [Just a plan - no data read]

Action 1: df.agg({"sales": "sum"}).collect()
  ↓
  Read from disk → Compute sum → Return result
  [Data is NOT stored - it's discarded after use]

Action 2: df.agg({"sales": "avg"}).collect()
  ↓
  Read from disk AGAIN → Compute avg → Return result
  [Previous read is gone - must read again!]

Action 3: df.count()
  ↓
  Read from disk AGAIN → Count → Return result
  [Must read a third time!]
```

**Why This Happens:**
- Spark doesn't store intermediate results by default
- Each action executes the **entire plan from scratch**
- Without caching, there's no memory of previous computations
- The DataFrame `df` is just a plan, not stored data

**What happens?**
- Each action (sum, avg, count) triggers a **full recomputation**
- Spark reads the data from disk **three times** (once per action)
- This is slow and wasteful!
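The recomputation described above can be made visible with a small pure-Python analogy. This is only a toy model, not Spark: `LazyPlan` and its `disk_reads` counter are hypothetical names invented for illustration, and the "file" is just a list.

```python
class LazyPlan:
    """Toy stand-in for an uncached DataFrame: it holds a recipe, not data."""

    def __init__(self, source):
        self._source = source   # pretend this list lives on disk
        self.disk_reads = 0     # how many times the "file" was scanned

    def _scan(self):
        # Every action re-runs the plan from the source.
        self.disk_reads += 1
        return list(self._source)

    def sum(self):
        return sum(self._scan())

    def avg(self):
        rows = self._scan()
        return sum(rows) / len(rows)

    def count(self):
        return len(self._scan())


sales = LazyPlan([100.0, 250.0, 400.0])   # building the plan reads nothing
reads_after_definition = sales.disk_reads

total_sales = sales.sum()      # scan 1
avg_sales = sales.avg()        # scan 2
record_count = sales.count()   # scan 3

print(f"reads after definition: {reads_after_definition}")
print(f"reads after three actions: {sales.disk_reads}")
```

Defining the plan costs no reads, while each of the three actions triggers its own scan, which mirrors the three disk reads in the example above.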

### The Solution: Caching

**With caching:**
```python
# Read and cache data
df = spark.read.parquet("large_dataset.parquet").cache()

# First use: Reads from disk, stores in memory
total_sales = df.agg({"sales": "sum"}).collect()

# Second use: Reads from memory (fast!)
avg_sales = df.agg({"sales": "avg"}).collect()

# Third use: Reads from memory (fast!)
record_count = df.count()
```

**Benefits:**
- Data is read from disk **only once**
- Subsequent operations read from **memory** (much faster!)
- Significant performance improvement for iterative algorithms

### Key Insight

> **Caching stores intermediate results in memory (or disk) so they don't need to be recomputed. This is essential when you reuse the same DataFrame multiple times.**


## Understanding Spark's Lazy Evaluation

### How Spark Executes Code

**Spark uses lazy evaluation:**

```python
# This is a TRANSFORMATION - nothing happens yet!
df_filtered = df.filter(df.amount > 1000)

# This is an ACTION - now Spark executes everything!
result = df_filtered.count()
```

**What happens:**
1. **Transformations** (filter, select, join) are **lazy** - they build a plan
2. **Actions** (count, collect, show) are **eager** - they trigger execution
3. When an action is called, Spark executes the **entire plan** from scratch

### 🔍 Key Point: DataFrames Don't Store Data!

**Important Understanding:**

```python
# This line does NOT read or store any data!
df = spark.read.parquet("data.parquet")
```

**What `df` actually is:**
- `df` is **NOT** a container with data in memory
- `df` is a **plan** (or "recipe") for how to read and process data
- No data is read until an **action** is called
- After an action completes, the data is **discarded** (not stored)

**Think of it like this:**
- **Traditional approach (Pandas):** `df = pd.read_csv()` → Data is loaded into memory and stored
- **Spark approach:** `df = spark.read.parquet()` → Only a plan is created, no data loaded

**This is why each action reads from disk again:**
- Action 1: Executes plan → Reads disk → Computes → Returns result → **Data discarded**
- Action 2: Executes plan → Reads disk → Computes → Returns result → **Data discarded**
- Action 3: Executes plan → Reads disk → Computes → Returns result → **Data discarded**

**Without caching, Spark has no memory of previous reads!**

### The Problem: Recomputation

**Without caching:**

```python
df = spark.read.parquet("data.parquet")
df_filtered = df.filter(df.amount > 1000)

# Action 1: Triggers full computation
count1 = df_filtered.count()  # Reads from disk, filters, counts

# Action 2: Triggers full computation AGAIN!
count2 = df_filtered.count()  # Reads from disk AGAIN, filters AGAIN, counts AGAIN
```

**Visual Representation:**

```
Action 1: Read → Filter → Count
         ↓
Action 2: Read → Filter → Count  (recomputes everything!)
```

**This is inefficient!** We're doing the same work twice.

### The Solution: Cache Intermediate Results

**With caching:**

```python
df = spark.read.parquet("data.parquet")
df_filtered = df.filter(df.amount > 1000).cache()

# Action 1: Computes and stores in memory
count1 = df_filtered.count()  # Read → Filter → Count → Store in memory

# Action 2: Uses cached data
count2 = df_filtered.count()  # Read from memory → Count (fast!)
```

**Visual Representation:**

```
Action 1: Read → Filter → Count → Store in Memory
         ↓
Action 2: Read from Memory → Count  (no recomputation!)
```

**Key Insight:**
> **Caching breaks the recomputation cycle by storing intermediate results. Once cached, subsequent actions use the cached data instead of recomputing.**
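One detail worth seeing in isolation is that marking something for caching does no work by itself; the first action after that both computes and stores the result. The following pure-Python sketch models that behaviour. It is an analogy only: `CachablePlan` and its `computations` counter are hypothetical and are not how Spark is implemented.

```python
class CachablePlan:
    """Toy model of cache(): mark now, materialize on the first action."""

    def __init__(self, compute):
        self._compute = compute   # function that rebuilds the rows
        self._marked = False      # set by cache()
        self._stored = None       # filled by the first action after cache()
        self.computations = 0

    def cache(self):
        self._marked = True       # nothing is computed here
        return self

    def _rows(self):
        if self._stored is not None:
            return self._stored   # served from the stored copy
        self.computations += 1
        rows = self._compute()
        if self._marked:
            self._stored = rows
        return rows

    def count(self):
        return len(self._rows())

    def unpersist(self):
        self._marked = False
        self._stored = None
        return self


amounts = [500, 1500, 2500, 800]
filtered = CachablePlan(lambda: [a for a in amounts if a > 1000]).cache()
after_cache_call = filtered.computations   # cache() alone computed nothing

count1 = filtered.count()   # computes and stores
count2 = filtered.count()   # reuses the stored rows
after_two_counts = filtered.computations

filtered.unpersist()
count3 = filtered.count()   # recomputed, because the stored copy was dropped

print(after_cache_call, after_two_counts, filtered.computations)
```

The counter stays at zero after `cache()`, rises to one across the two counts, and rises again only after `unpersist()`.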


## What is Cache vs Persist?

### Cache: The Convenience Method

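**`cache()`** is a convenience method that calls `persist()` with the default storage level. For DataFrames that default keeps data in memory and spills to disk; `RDD.cache()` defaults to `MEMORY_ONLY` instead.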

```python
# These are equivalent:
df.cache()
df.persist()  # Uses default storage level: MEMORY_AND_DISK
```

**Default behavior:**
- Stores data in **memory** if possible
- If memory is full, **spills to disk**
- This is the safest default for most use cases

### Persist: The Flexible Method

**`persist()`** allows you to specify **exactly how** data should be stored.

```python
from pyspark import StorageLevel

# Store only in memory
df.persist(StorageLevel.MEMORY_ONLY)

# Store in memory and disk
df.persist(StorageLevel.MEMORY_AND_DISK)

# Store only on disk
df.persist(StorageLevel.DISK_ONLY)
```

### Key Difference

| Method | Flexibility | Default Behavior |
|--------|------------|------------------|
| `cache()` | ❌ Fixed (uses default) | MEMORY_AND_DISK |
| `persist()` | ✅ Customizable | You choose the storage level |

### When to Use Each

**Use `cache()` when:**
- Default storage level (MEMORY_AND_DISK) is fine
- You want simplicity
- You're prototyping or learning

**Use `persist()` when:**
- You need specific storage behavior
- You want to optimize for memory usage
- You're in production and need fine-grained control


## Storage Levels: Understanding Your Options

### Available Storage Levels

Spark provides several storage levels to balance performance and resource usage:

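| Storage Level | Memory | Disk | Deserialized | Replication |
|--------------|--------|------|--------------|-------------|
| `MEMORY_ONLY` | ✅ | ❌ | ✅ | 1× |
| `MEMORY_ONLY_SER` (Scala/Java only) | ✅ | ❌ | ❌ | 1× |
| `MEMORY_AND_DISK` | ✅ | ✅ | ✅ | 1× |
| `MEMORY_AND_DISK_SER` (Scala/Java only) | ✅ | ✅ | ❌ | 1× |
| `DISK_ONLY` | ❌ | ✅ | ❌ | 1× |
| `MEMORY_ONLY_2` | ✅ | ❌ | ✅ | 2× |
| `MEMORY_AND_DISK_2` | ✅ | ✅ | ✅ | 2× |

> **Note:** The Deserialized column follows the Scala/Java definitions. PySpark's `StorageLevel` does not define the `_SER` constants, because data coming from Python is always serialized; its own `MEMORY_ONLY` and `MEMORY_AND_DISK` constants are defined as serialized, and `DataFrame.cache()` uses `MEMORY_AND_DISK_DESER`. Data written to disk is always serialized.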

### Understanding the Options

**Memory vs Disk:**
- **Memory**: Fast but limited capacity
- **Disk**: Slower, but usually far more capacity than memory (still bounded by each executor's local disk)
- **MEMORY_AND_DISK**: Tries memory first and spills partitions that don't fit to disk

**Deserialized vs Serialized:**
- **Deserialized** (MEMORY_ONLY): Objects stored as-is (fast access, more memory)
- **Serialized** (MEMORY_ONLY_SER, Scala/Java API only): Objects stored as bytes (slower access, less memory)

**Replication:**
- **1×**: One copy (default)
- **2×**: Two copies on different nodes (fault tolerance, but uses 2× the storage)

### Choosing the Right Storage Level

**MEMORY_ONLY** (fastest, but risky):
```python
df.persist(StorageLevel.MEMORY_ONLY)
```
- ✅ Fastest access
- ❌ If memory is full, partitions that don't fit are not cached and are recomputed when needed
- Use when: Data fits in memory, you want maximum speed

**MEMORY_AND_DISK** (default, safest):
```python
df.persist(StorageLevel.MEMORY_AND_DISK)  # Memory first, spill to disk (as df.cache() does)
```
- ✅ Fast access from memory
- ✅ Falls back to disk if memory is full
- ✅ No recomputation needed
- Use when: Default choice, production-safe

**DISK_ONLY** (slowest, but reliable):
```python
df.persist(StorageLevel.DISK_ONLY)
```
- ✅ Uses no storage memory
- ✅ Always available (no recomputation)
- ❌ Slower than memory
- Use when: Memory is constrained, data is too large for memory

**MEMORY_ONLY_SER** (memory-efficient, Scala/Java API only):
```python
# PySpark's StorageLevel has no MEMORY_ONLY_SER attribute, so referencing it
# raises AttributeError. The closest PySpark constant is MEMORY_ONLY, which
# PySpark already defines as serialized.
df.persist(StorageLevel.MEMORY_ONLY)
```
- ✅ Uses less memory (serialized format)
- ❌ Slower access (needs deserialization)
- Use when: Data is large, memory is limited


## When Should You Cache?

### ✅ Good Use Cases for Caching

**1. Iterative Algorithms (Machine Learning)**

```python
# Training a model requires multiple passes over data
df = spark.read.parquet("training_data.parquet").cache()

for epoch in range(10):
    # Each epoch needs to read the data
    model.train(df)  # `model` is a placeholder trainer; each pass uses cached data
```

**2. Multiple Actions on Same DataFrame**

```python
df = spark.read.parquet("sales.parquet").cache()

# Multiple aggregations
total = df.agg({"sales": "sum"}).collect()
avg = df.agg({"sales": "avg"}).collect()
count = df.count()
# All use cached data - no recomputation!
```

**3. Reusing Data After Expensive Transformations**

```python
# Expensive transformation (join, filter, etc.)
df_processed = (
    df1.join(df2, on="key")
    .filter(df1.amount > 1000)
    .cache()  # Cache after expensive operations
)

# Use multiple times
result1 = df_processed.groupBy("region").agg({"sales": "sum"})
result2 = df_processed.groupBy("product").agg({"sales": "sum"})
```

**4. Loops with Same Data**

```python
df = spark.read.parquet("data.parquet").cache()

for region in ["North", "South", "East", "West"]:
    result = df.filter(df.region == region).count()
    # Each iteration uses cached data
```

### ❌ When NOT to Cache

**1. Data Used Only Once**

```python
# ❌ BAD: Cache is unnecessary
df = spark.read.parquet("data.parquet").cache()
result = df.count()  # Used only once - cache is wasted!
```

**2. Data Too Large for Memory**

```python
# ❌ BAD: Far more data than cluster memory can hold
huge_df = spark.read.parquet("1TB_data.parquet").cache()  # Won't fit!
```

**3. Streaming Data**

```python
# ❌ BAD: Streaming data is continuous, can't cache effectively
stream_df = spark.readStream.parquet("streaming_path/")
# Don't cache streaming DataFrames
```

**4. Data That Changes Frequently**

```python
# ❌ BAD: If data changes, cache becomes stale
df = spark.read.parquet("frequently_updated_data.parquet").cache()
# Cache may contain outdated data
```

### Decision Tree

```
Will you use this DataFrame multiple times?
│
├─ NO → Don't cache ❌
│
└─ YES
   │
   ├─ Does data fit in memory?
   │  │
   │  ├─ NO → Use DISK_ONLY or don't cache ❌
   │  │
   │  └─ YES
   │     │
   │     └─ Is it streaming or frequently changing?
   │        │
   │        ├─ YES → Don't cache ❌
   │        │
   │        └─ NO → Cache ✅
```
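The same decision tree can be written as a small function, which makes the order of the questions explicit. This is a sketch only: `caching_advice` and its parameter names are hypothetical, and the returned strings simply echo the leaves of the tree.

```python
def caching_advice(times_used, fits_in_memory, streaming_or_volatile):
    """Walk the decision tree and return its recommendation as a string."""
    if times_used < 2:
        return "don't cache"                 # used once: nothing to reuse
    if not fits_in_memory:
        return "DISK_ONLY or don't cache"    # too large for memory
    if streaming_or_volatile:
        return "don't cache"                 # cache would go stale
    return "cache"


# A few representative situations (values are illustrative).
print(caching_advice(times_used=1, fits_in_memory=True, streaming_or_volatile=False))
print(caching_advice(times_used=3, fits_in_memory=False, streaming_or_volatile=False))
print(caching_advice(times_used=3, fits_in_memory=True, streaming_or_volatile=True))
print(caching_advice(times_used=3, fits_in_memory=True, streaming_or_volatile=False))
```

Only the last case, reused data that fits in memory and is neither streaming nor frequently changing, ends in a plain "cache".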


## Practical Example: Demonstrating Caching

Let's see caching in action with a practical example.


In [1]:
# Initialize Spark Session
from pyspark.sql import SparkSession
import time

# Create Spark session
spark = SparkSession.builder \
    .appName("CachingDemo") \
    .master("local[*]") \
    .getOrCreate()

print("=" * 70)
print("SPARK SESSION INITIALIZED")
print("=" * 70)
print(f"Spark Version: {spark.version}")
print(f"Default Parallelism: {spark.sparkContext.defaultParallelism}")
print("=" * 70)



SPARK SESSION INITIALIZED
Spark Version: 3.5.1
Default Parallelism: 11


### Step 1: Create Sample Data


In [2]:
# Create a DataFrame with sample data
# Simulating a scenario where we'll use the data multiple times

data = [(i, f"Product_{i % 100}", 100.0 + i, f"Region_{i % 5}") 
        for i in range(100000)]
columns = ["id", "product_name", "price", "region"]

df = spark.createDataFrame(data, columns)

print("=" * 70)
print("SAMPLE DATA CREATED")
print("=" * 70)
print(f"Total records: {df.count():,}")
print(f"Partitions: {df.rdd.getNumPartitions()}")
print("=" * 70)


SAMPLE DATA CREATED


Total records: 100,000
Partitions: 11


                                                                                

### Step 2: Without Caching (The Problem)


In [3]:
# Apply a transformation (a simple filter standing in for an expensive step)
df_filtered = df.filter(df.price > 5000)

print("=" * 70)
print("WITHOUT CACHING: Multiple Actions")
print("=" * 70)

# Action 1: Count
print("\n1️⃣  First action (count)...")
start = time.time()
count1 = df_filtered.count()
time1 = time.time() - start
print(f"   Result: {count1:,} records")
print(f"   Time: {time1:.3f} seconds")
print(f"   Note: Full computation from scratch")

# Action 2: Sum
print("\n2️⃣  Second action (sum)...")
start = time.time()
total = df_filtered.agg({"price": "sum"}).collect()[0][0]
time2 = time.time() - start
print(f"   Result: ${total:,.2f}")
print(f"   Time: {time2:.3f} seconds")
print(f"   Note: Full recomputation from scratch (wasteful!)")

# Action 3: Average
print("\n3️⃣  Third action (average)...")
start = time.time()
avg = df_filtered.agg({"price": "avg"}).collect()[0][0]
time3 = time.time() - start
print(f"   Result: ${avg:,.2f}")
print(f"   Time: {time3:.3f} seconds")
print(f"   Note: Full recomputation from scratch again!")

total_time = time1 + time2 + time3
print(f"\n⏱️  Total time: {total_time:.3f} seconds")
print(f"⚠️  Problem: Each action recomputes everything!")
print("=" * 70)


WITHOUT CACHING: Multiple Actions

1️⃣  First action (count)...
   Result: 95,099 records
   Time: 0.323 seconds
   Note: Full computation from scratch

2️⃣  Second action (sum)...
   Result: $4,997,452,450.00
   Time: 0.224 seconds
   Note: Full recomputation from scratch (wasteful!)

3️⃣  Third action (average)...
   Result: $52,550.00
   Time: 0.173 seconds
   Note: Full recomputation from scratch again!

⏱️  Total time: 0.720 seconds
⚠️  Problem: Each action recomputes everything!


### Step 3: With Caching (The Solution)


In [4]:
# Apply the same transformation and CACHE it
df_filtered_cached = df.filter(df.price > 5000).cache()

print("=" * 70)
print("WITH CACHING: Multiple Actions")
print("=" * 70)

# Action 1: Count (triggers computation and caching)
print("\n1️⃣  First action (count) - triggers caching...")
start = time.time()
count1 = df_filtered_cached.count()
time1 = time.time() - start
print(f"   Result: {count1:,} records")
print(f"   Time: {time1:.3f} seconds")
print(f"   Note: Computes and stores in memory")

# Action 2: Sum (uses cached data)
print("\n2️⃣  Second action (sum) - uses cache...")
start = time.time()
total = df_filtered_cached.agg({"price": "sum"}).collect()[0][0]
time2 = time.time() - start
print(f"   Result: ${total:,.2f}")
print(f"   Time: {time2:.3f} seconds")
print(f"   Note: Reads from memory (much faster!)")

# Action 3: Average (uses cached data)
print("\n3️⃣  Third action (average) - uses cache...")
start = time.time()
avg = df_filtered_cached.agg({"price": "avg"}).collect()[0][0]
time3 = time.time() - start
print(f"   Result: ${avg:,.2f}")
print(f"   Time: {time3:.3f} seconds")
print(f"   Note: Reads from memory (much faster!)")

total_time_cached = time1 + time2 + time3
print(f"\n⏱️  Total time: {total_time_cached:.3f} seconds")

# Compare (guard the divisor, which is the cached total)
if total_time_cached > 0:
    speedup = total_time / total_time_cached
    print(f"\n🚀 Caching is {speedup:.2f}× faster!")
    print(f"   • Without cache: {total_time:.3f}s")
    print(f"   • With cache: {total_time_cached:.3f}s")
    print(f"   • Time saved: {total_time - total_time_cached:.3f}s")
print("=" * 70)


WITH CACHING: Multiple Actions

1️⃣  First action (count) - triggers caching...
   Result: 95,099 records
   Time: 0.643 seconds
   Note: Computes and stores in memory

2️⃣  Second action (sum) - uses cache...
   Result: $4,997,452,450.00
   Time: 0.082 seconds
   Note: Reads from memory (much faster!)

3️⃣  Third action (average) - uses cache...
   Result: $52,550.00
   Time: 0.072 seconds
   Note: Reads from memory (much faster!)

⏱️  Total time: 0.797 seconds

🚀 Caching is 0.90× faster!
   • Without cache: 0.720s
   • With cache: 0.797s
   • Time saved: -0.078s
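A ratio of 0.90× means the cached run was slower overall in this measurement. The second and third actions were faster with the cache (0.082s vs 0.224s and 0.072s vs 0.173s), but the first cached action took 0.643s instead of 0.323s because it also had to store the data, and on a dataset this small that one-time cost outweighs the savings. The helper below is a minimal sketch of a comparison that reports either outcome honestly; `describe_speedup` is a hypothetical name, and the numbers are the ones printed above.

```python
def describe_speedup(baseline_s, cached_s):
    """Summarize a timing comparison without assuming the cache won."""
    if baseline_s <= 0 or cached_s <= 0:
        raise ValueError("timings must be positive")
    ratio = baseline_s / cached_s
    diff = baseline_s - cached_s
    if ratio >= 1:
        return f"cached run is {ratio:.2f}x faster (saved {diff:.3f}s)"
    return f"cached run is {1 / ratio:.2f}x slower (lost {-diff:.3f}s)"


# Totals across all three actions, as reported above.
overall = describe_speedup(0.720, 0.797)

# Second action only (sum), where the cache is already populated.
second_action = describe_speedup(0.224, 0.082)

print(overall)
print(second_action)
```

With more reuse or a more expensive plan, the one-time cost of filling the cache is amortized over more actions and the overall ratio tends to move above 1.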


### Step 4: Understanding Storage Levels


In [5]:
from pyspark import StorageLevel

print("=" * 70)
print("UNDERSTANDING STORAGE LEVELS")
print("=" * 70)

# Create filtered DataFrame
df_filtered = df.filter(df.price > 5000)

# Test different storage levels
storage_levels = {
    "MEMORY_ONLY": StorageLevel.MEMORY_ONLY,
    "MEMORY_AND_DISK": StorageLevel.MEMORY_AND_DISK,
    "DISK_ONLY": StorageLevel.DISK_ONLY,
}

print("\nTesting different storage levels...\n")

for name, level in storage_levels.items():
    print(f"📦 {name}:")
    df_test = df_filtered.persist(level)
    
    # First action (triggers caching)
    start = time.time()
    count = df_test.count()
    time1 = time.time() - start
    
    # Second action (uses cache)
    start = time.time()
    _ = df_test.agg({"price": "sum"}).collect()
    time2 = time.time() - start
    
    print(f"   • First action: {time1:.3f}s (computes and caches)")
    print(f"   • Second action: {time2:.3f}s (uses cache)")
    print(f"   • Speedup: {time1/time2:.2f}× faster on second action")
    
    # Unpersist to free memory
    df_test.unpersist()
    print()

print("=" * 70)
print("💡 Key Insight: All storage levels provide caching benefits!")
print("   The difference is WHERE data is stored (memory vs disk).")
print("=" * 70)


UNDERSTANDING STORAGE LEVELS

Testing different storage levels...

📦 MEMORY_ONLY:

26/01/03 06:09:35 WARN CacheManager: Asked to cache already cached data.

   • First action: 0.114s (computes and caches)
   • Second action: 0.075s (uses cache)
   • Speedup: 1.51× faster on second action

📦 MEMORY_AND_DISK:
   • First action: 0.381s (computes and caches)
   • Second action: 0.074s (uses cache)
   • Speedup: 5.17× faster on second action

📦 DISK_ONLY:
   • First action: 0.285s (computes and caches)
   • Second action: 0.058s (uses cache)
   • Speedup: 4.91× faster on second action

💡 Key Insight: All storage levels provide caching benefits!
   The difference is WHERE data is stored (memory vs disk).
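Single wall-clock measurements like these are noisy. In addition, the `Asked to cache already cached data` warning suggests that the `MEMORY_ONLY` run may have reused the cache left over from Step 3 (Spark matches cached data by query plan), which would explain its unusually fast first action. One way to get steadier numbers is to repeat each action and separate the first run from the later ones. The sketch below assumes nothing about Spark: `time_action` is a hypothetical helper that times any zero-argument callable.

```python
import statistics
import time


def time_action(action, repeats=5):
    """Time a callable several times; return (first run, median of later runs) in seconds."""
    if repeats < 2:
        raise ValueError("need at least 2 repeats")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    # The first run includes any one-time cost (for a cache: filling it).
    return durations[0], statistics.median(durations[1:])


# Stand-in workload; with Spark this could be `lambda: df_test.count()`.
first, steady = time_action(lambda: sum(range(100_000)), repeats=5)
print(f"first run: {first:.4f}s, median of later runs: {steady:.4f}s")
```

Calling `unpersist()` on any previously cached DataFrame with the same plan before starting such a comparison keeps one measurement from leaking into the next.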


### Step 5: Checking Cache Status


In [None]:
# Check if a DataFrame is cached
print("=" * 70)
print("CHECKING CACHE STATUS")
print("=" * 70)

# Create and cache a DataFrame
df_cached = df.filter(df.price > 5000).cache()

# is_cached turns True as soon as cache() is called, before any data is stored
print("\n1️⃣  Before first action (marked for caching, not yet materialized):")
print(f"   Is cached: {df_cached.is_cached}")
print(f"   Storage level: {df_cached.storageLevel}")

# Trigger caching with an action
print("\n2️⃣  Triggering cache with action...")
_ = df_cached.count()

print(f"   Is cached: {df_cached.is_cached}")
print(f"   Storage level: {df_cached.storageLevel}")
print(f"   Note: Storage level shows where data is stored (Memory/Disk)")

# Unpersist (remove from cache)
print("\n3️⃣  Unpersisting (removing from cache)...")
df_cached.unpersist()

print(f"   Is cached: {df_cached.is_cached}")
print(f"   Storage level: {df_cached.storageLevel}")

print("\n" + "=" * 70)
print("💡 Key Points:")
print("   • is_cached: Check if DataFrame is marked for caching")
print("   • storageLevel: See how data is stored (prints readable format)")
print("   • unpersist(): Remove from cache to free memory")
print("=" * 70)


CHECKING CACHE STATUS

1️⃣  Before first action (marked for caching, not yet materialized):
   Is cached: True
   Storage level: Disk Memory Deserialized 1x Replicated

2️⃣  Triggering cache with action...
   Is cached: True
   Storage level: Disk Memory Deserialized 1x Replicated
   Note: Storage level shows where data is stored (Memory/Disk)

3️⃣  Unpersisting (removing from cache)...
   Is cached: False
   Storage level: Serialized 1x Replicated

💡 Key Points:
   • is_cached: Check if DataFrame is marked for caching
   • storageLevel: See how data is stored (prints readable format)
   • unpersist(): Remove from cache to free memory
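The storage-level strings in the output above are easier to read once you see that each word corresponds to one flag. The sketch below rebuilds the two strings from a set of flags in plain Python. `Flags` and `describe` are hypothetical names that mirror the five `StorageLevel` fields (disk, memory, off-heap, deserialized, replication); this is an illustration of how the text appears to be composed, not PySpark's own code.

```python
from typing import NamedTuple


class Flags(NamedTuple):
    """Hypothetical mirror of the five StorageLevel fields."""
    use_disk: bool
    use_memory: bool
    use_off_heap: bool
    deserialized: bool
    replication: int = 1


def describe(flags):
    """Build a description in the same word order as the output above."""
    parts = []
    if flags.use_disk:
        parts.append("Disk")
    if flags.use_memory:
        parts.append("Memory")
    if flags.use_off_heap:
        parts.append("OffHeap")
    parts.append("Deserialized" if flags.deserialized else "Serialized")
    parts.append(f"{flags.replication}x Replicated")
    return " ".join(parts)


# Flags matching what the notebook printed after cache() and after unpersist().
after_cache = Flags(use_disk=True, use_memory=True, use_off_heap=False, deserialized=True)
after_unpersist = Flags(use_disk=False, use_memory=False, use_off_heap=False, deserialized=False)

print(describe(after_cache))       # Disk Memory Deserialized 1x Replicated
print(describe(after_unpersist))   # Serialized 1x Replicated
```

Read this way, `Serialized 1x Replicated` after `unpersist()` simply means that neither the disk nor the memory flag is set, so nothing is stored.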



## Real-World Use Cases

### Use Case 1: Machine Learning - Iterative Training

**Scenario:** Training a model requires multiple passes over the same data.


In [7]:
# Use Case 1: Machine Learning - Iterative Training

print("=" * 70)
print("USE CASE 1: Machine Learning - Iterative Training")
print("=" * 70)

# Simulate training data that will be used multiple times
training_data = df.filter(df.price > 5000).cache()

print("\n📊 Training a model requires multiple passes over data...")
print("   Without caching, each epoch would recompute everything!\n")

# Simulate multiple training epochs
epochs = 5
print(f"Training for {epochs} epochs:")

start_total = time.time()
for epoch in range(epochs):
    start = time.time()
    # Simulate training step (using the data)
    count = training_data.count()
    time_epoch = time.time() - start
    print(f"   Epoch {epoch + 1}: {time_epoch:.3f}s ({count:,} records)")

total_time = time.time() - start_total
print(f"\n✅ Total training time: {total_time:.3f} seconds")
print(f"💡 First epoch computes and caches, subsequent epochs use cache!")
print("=" * 70)

# Clean up
training_data.unpersist()


USE CASE 1: Machine Learning - Iterative Training

📊 Training a model requires multiple passes over data...
   Without caching, each epoch would recompute everything!

Training for 5 epochs:
   Epoch 1: 0.140s (95,099 records)

26/01/03 06:10:41 WARN CacheManager: Asked to cache already cached data.

   Epoch 2: 0.074s (95,099 records)
   Epoch 3: 0.063s (95,099 records)
   Epoch 4: 0.056s (95,099 records)
   Epoch 5: 0.072s (95,099 records)

✅ Total training time: 0.406 seconds
💡 First epoch computes and caches, subsequent epochs use cache!


DataFrame[id: bigint, product_name: string, price: double, region: string]

### Use Case 2: Multiple Aggregations

**Scenario:** You need to compute multiple statistics on the same filtered dataset.


In [8]:
# Use Case 2: Multiple Aggregations

print("=" * 70)
print("USE CASE 2: Multiple Aggregations")
print("=" * 70)

# Filter data (expensive operation)
df_filtered = df.filter(df.price > 5000).cache()

print("\n📊 Computing multiple statistics on filtered data...\n")

# Multiple aggregations
stats = {}

# Count
start = time.time()
stats['count'] = df_filtered.count()
stats['count_time'] = time.time() - start

# Sum
start = time.time()
stats['total'] = df_filtered.agg({"price": "sum"}).collect()[0][0]
stats['sum_time'] = time.time() - start

# Average
start = time.time()
stats['avg'] = df_filtered.agg({"price": "avg"}).collect()[0][0]
stats['avg_time'] = time.time() - start

# Min
start = time.time()
stats['min'] = df_filtered.agg({"price": "min"}).collect()[0][0]
stats['min_time'] = time.time() - start

# Max
start = time.time()
stats['max'] = df_filtered.agg({"price": "max"}).collect()[0][0]
stats['max_time'] = time.time() - start

print("Results:")
print(f"   Count: {stats['count']:,} records")
print(f"   Total: ${stats['total']:,.2f}")
print(f"   Average: ${stats['avg']:,.2f}")
print(f"   Min: ${stats['min']:,.2f}")
print(f"   Max: ${stats['max']:,.2f}")

print(f"\n⏱️  Timing:")
print(f"   First action (count): {stats['count_time']:.3f}s (computes and caches)")
print(f"   Subsequent actions: {stats['sum_time']:.3f}s, {stats['avg_time']:.3f}s, etc. (use cache)")

print(f"\n💡 Insight: First action does the work, rest are fast!")
print("=" * 70)

# Clean up
df_filtered.unpersist()


USE CASE 2: Multiple Aggregations

📊 Computing multiple statistics on filtered data...

Results:
   Count: 95,099 records
   Total: $4,997,452,450.00
   Average: $52,550.00
   Min: $5,001.00
   Max: $100,099.00

⏱️  Timing:
   First action (count): 0.309s (computes and caches)
   Subsequent actions: 0.055s, 0.060s, etc. (use cache)

💡 Insight: First action does the work, rest are fast!


DataFrame[id: bigint, product_name: string, price: double, region: string]

### Use Case 3: Reusing After Expensive Joins

**Scenario:** After an expensive join, you need to use the result multiple times.


In [9]:
# Use Case 3: Reusing After Expensive Joins

print("=" * 70)
print("USE CASE 3: Reusing After Expensive Joins")
print("=" * 70)

# Create a second DataFrame for joining
data2 = [(i % 100, f"Category_{i % 10}") for i in range(100)]
columns2 = ["id", "category"]
df2 = spark.createDataFrame(data2, columns2)

# Expensive join operation
print("\n🔗 Performing expensive join operation...")
df_joined = df.join(df2, on="id", how="inner").cache()

# Trigger the join and cache
start = time.time()
count = df_joined.count()
join_time = time.time() - start
print(f"   Join completed: {count:,} records in {join_time:.3f}s")
print(f"   Data is now cached!")

# Use the joined data multiple times
print("\n📊 Using joined data for multiple analyses...\n")

# Analysis 1: Group by category
start = time.time()
by_category = df_joined.groupBy("category").agg({"price": "avg"}).collect()
time1 = time.time() - start
print(f"   Analysis 1 (by category): {time1:.3f}s (uses cache)")

# Analysis 2: Group by region
start = time.time()
by_region = df_joined.groupBy("region").agg({"price": "sum"}).collect()
time2 = time.time() - start
print(f"   Analysis 2 (by region): {time2:.3f}s (uses cache)")

# Analysis 3: Filter and count
start = time.time()
filtered_count = df_joined.filter(df_joined.price > 10000).count()
time3 = time.time() - start
print(f"   Analysis 3 (filtered count): {time3:.3f}s (uses cache)")

print(f"\n💡 Insight: Join happens once, all analyses use cached result!")
print("=" * 70)

# Clean up
df_joined.unpersist()


USE CASE 3: Reusing After Expensive Joins

🔗 Performing expensive join operation...
   Join completed: 100 records in 1.481s
   Data is now cached!

📊 Using joined data for multiple analyses...

   Analysis 1 (by category): 0.800s (uses cache)
   Analysis 2 (by region): 0.341s (uses cache)
   Analysis 3 (filtered count): 0.347s (uses cache)

💡 Insight: Join happens once, all analyses use cached result!


DataFrame[id: bigint, product_name: string, price: double, region: string, category: string]

## Best Practices

### ✅ DO

1. **Cache after expensive transformations**
   ```python
   # Cache after expensive operations
   df = df.join(other_df).filter(...).cache()
   ```

2. **Cache when reusing data multiple times**
   ```python
   # If you'll use it 2+ times, cache it
   df_cached = df.filter(...).cache()
   ```

3. **Use appropriate storage levels**
   ```python
   # For large data, use MEMORY_AND_DISK
   df.persist(StorageLevel.MEMORY_AND_DISK)
   ```

4. **Unpersist when done**
   ```python
   # Free memory when you're done
   df.unpersist()
   ```

5. **Cache before iterative operations**
   ```python
   # Machine learning, loops, etc.
   training_data = df.cache()
   for epoch in range(10):
       model.train(training_data)
   ```

6. **Check cache status**
   ```python
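   # is_cached is True as soon as cache()/persist() has been called;
   # it does not prove the data has been materialized yet
   if df.is_cached:
       print("DataFrame is marked for caching")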
   ```

### ❌ DON'T

1. **Don't cache data used only once**
   ```python
   # ❌ BAD: Unnecessary cache
   df = spark.read.parquet("data.parquet").cache()
   result = df.count()  # Used only once!
   ```

2. **Don't cache data too large for memory**
   ```python
   # ❌ BAD: Far more data than cluster memory can hold
   huge_df = spark.read.parquet("1TB_data.parquet").cache()
   ```

3. **Don't forget to unpersist**
   ```python
   # ❌ BAD: Cache is never released
   df = df.cache()
   # ... use df ...
   # Forgot to unpersist - memory not freed!
   ```

4. **Don't cache streaming data**
   ```python
   # ❌ BAD: Streaming can't be cached effectively
   stream_df = spark.readStream.parquet("path/").cache()
   ```

5. **Don't cache unnecessarily**
   ```python
   # ❌ BAD: Simple operations don't need caching
   df = spark.read.parquet("data.parquet").select("col1").cache()
   result = df.count()  # Simple operation, cache not needed
   ```

6. **Don't cache after every transformation**
   ```python
   # ❌ BAD: Too many caches
   df1 = df.filter(...).cache()
   df2 = df1.select(...).cache()
   df3 = df2.groupBy(...).cache()
   # Only cache when you'll reuse multiple times!
   ```


## Common Mistakes and How to Avoid Them

### Mistake 1: Caching Data Used Only Once

**Wrong:**
```python
df = spark.read.parquet("data.parquet").cache()
result = df.count()  # ❌ Used only once - cache is wasted!
```

**Correct:**
```python
df = spark.read.parquet("data.parquet")
result = df.count()  # ✅ No cache needed for single use
```

### Mistake 2: Not Unpersisting

**Wrong:**
```python
df = df.cache()
# ... use df multiple times ...
# ❌ Forgot to unpersist - memory leak!
```

**Correct:**
```python
df = df.cache()
# ... use df multiple times ...
df.unpersist()  # ✅ Free memory when done
```

### Mistake 3: Caching Before Expensive Operations

**Wrong:**
```python
df = spark.read.parquet("data.parquet").cache()  # ❌ Cache too early
df_filtered = df.filter(...)  # Expensive operation not cached!
result = df_filtered.count()
```

**Correct:**
```python
df = spark.read.parquet("data.parquet")
df_filtered = df.filter(...).cache()  # ✅ Cache after expensive operation
result = df_filtered.count()
```

### Mistake 4: Assuming Cache is Immediate

**Wrong:**
```python
df = df.cache()
print("Data is cached!")  # ❌ Not yet! Cache happens on first action
```

**Correct:**
```python
df = df.cache()
_ = df.count()  # ✅ Trigger cache with an action
print("Data is now cached!")
```

### Mistake 5: Caching Too Much Data

**Wrong:**
```python
# ❌ Caching everything - memory issues!
df1 = spark.read.parquet("data1.parquet").cache()
df2 = spark.read.parquet("data2.parquet").cache()
df3 = spark.read.parquet("data3.parquet").cache()
# All cached simultaneously - may not fit in memory!
```

**Correct:**
```python
# ✅ Cache only what you need, when you need it
df1 = spark.read.parquet("data1.parquet").cache()
# ... use df1 ...
df1.unpersist()

df2 = spark.read.parquet("data2.parquet").cache()
# ... use df2 ...
df2.unpersist()
```

### Mistake 6: Not Understanding Storage Levels

**Wrong:**
```python
# ❌ Using MEMORY_ONLY for large data
from pyspark import StorageLevel

huge_df = spark.read.parquet("large_data.parquet")
huge_df.persist(StorageLevel.MEMORY_ONLY)  # May not fit!
```

**Correct:**
```python
# ✅ Use MEMORY_AND_DISK for safety
from pyspark import StorageLevel

huge_df = spark.read.parquet("large_data.parquet")
huge_df.persist(StorageLevel.MEMORY_AND_DISK)  # Falls back to disk
```
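
Choosing a level can be reduced to a rough size comparison. The helper below is a hypothetical heuristic: the function name, both parameters, and the 0.5x and 4x thresholds are assumptions made for illustration, not Spark recommendations. It returns the name of a storage level rather than the `StorageLevel` object so the sketch runs without a Spark installation.

```python
def pick_storage_level_name(estimated_bytes, cache_budget_bytes):
    """Suggest a StorageLevel name for data of a given estimated size.

    Hypothetical heuristic: the thresholds below are illustrative
    assumptions and should be tuned for a real cluster.
    """
    if estimated_bytes <= 0.5 * cache_budget_bytes:
        return "MEMORY_ONLY"       # comfortably fits in memory
    if estimated_bytes <= 4 * cache_budget_bytes:
        return "MEMORY_AND_DISK"   # some partitions may spill to disk
    return "DISK_ONLY"             # far larger than the memory budget
```

In a Spark session the result could be turned into a real level with `getattr(StorageLevel, name)` after `from pyspark import StorageLevel`, and then passed to `persist()`.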


## Understanding When Cache Actually Happens

### Important: Cache is Lazy Too!

**Key Point:** Calling `cache()` or `persist()` doesn't actually cache the data immediately!

```python
df = df.cache()  # This just marks it for caching - nothing happens yet!
```

**Cache happens on the first action:**

```python
df = df.cache()  # Marked for caching
_ = df.count()   # NOW it actually caches (first action triggers it)
```

### Visual Timeline

```
Step 1: df.cache()
        ↓
        Marks DataFrame for caching (no computation yet)

Step 2: df.count()  (first action)
        ↓
        Computes the DataFrame AND stores it in cache

Step 3: df.agg(...).collect()  (second action)
        ↓
        Reads from cache (fast!)
```
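
One way to observe this timeline is to time the same action twice. The following is a minimal sketch, assuming `df` is a DataFrame that has already been marked with `cache()`. `timed` and `compare_first_and_second_run` are hypothetical helpers, not Spark APIs.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, seconds elapsed)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def compare_first_and_second_run(df):
    """Time df.count() twice and return both durations in seconds."""
    _, first = timed(df.count)   # first action: computes the plan and fills the cache
    _, second = timed(df.count)  # second action: can read from the cache
    return first, second
```

On a cached DataFrame the second duration is usually much smaller than the first, although the gap depends on data size, storage level, and cluster load.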

### Why This Matters

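**Common mistake:**
```python
df = df.cache()
print(f"Cached: {df.is_cached}")  # True, but only because df is marked; no data is stored yet!
```

`is_cached` reports whether `cache()` or `persist()` has been called on the DataFrame, not whether the data has been materialized. The data is stored only once an action runs.

**Correct:**
```python
df = df.cache()
_ = df.count()  # Trigger cache
print(f"Cached: {df.is_cached}")  # True, and the data is now actually in the cache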
```


## Cache vs Unpersist: Memory Management

### When to Unpersist

**Always unpersist when you're done:**

```python
df = df.cache()
# ... use df multiple times ...
df.unpersist()  # Free memory
```

### Why Unpersist Matters

**Memory is limited:**
- Cached data takes up memory
- If you cache too much, you'll run out of memory
- Unpersisting frees memory for other operations

### Best Practice Pattern

```python
# Pattern: Cache, use, unpersist
df = expensive_operation().cache()

try:
    # Use cached data (run actions here; transformations alone are lazy)
    result1 = df.agg(...).collect()
    result2 = df.groupBy(...).count().collect()
    # ... more operations ...
finally:
    # Always unpersist, even if errors occur
    df.unpersist()
```
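
The same cache, use, unpersist pattern can be wrapped in a context manager so the cleanup cannot be forgotten. This is a sketch, not a PySpark feature: `cached` is a hypothetical helper that relies only on the `cache()`, `count()`, and `unpersist()` methods of a DataFrame.

```python
from contextlib import contextmanager


@contextmanager
def cached(df, materialize=True):
    """Cache `df` for the duration of a `with` block, then unpersist it."""
    df.cache()
    try:
        if materialize:
            df.count()  # an action, so the cache is actually filled
        yield df
    finally:
        # Runs on normal exit and when the block raises
        df.unpersist()
```

Used as `with cached(expensive_df) as df:`, every action on `df` should run inside the block. A lazy result that is only collected after the block exits would be recomputed without the cache.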

### Unpersist Options

```python
# Unpersist (removes from cache)
df.unpersist()

# Unpersist blocking (waits for completion)
df.unpersist(blocking=True)

# Unpersist non-blocking (returns immediately)
df.unpersist(blocking=False)  # Default
```


## Key Takeaways

### The Core Concept

**Caching:**
- ✅ Stores intermediate results in memory/disk
- ✅ Prevents recomputation of expensive operations
- ✅ Essential for iterative algorithms
- ✅ Speeds up multiple actions on same DataFrame

**When to cache:**
- Data used multiple times
- After expensive transformations
- Before iterative operations (ML, loops)
- When memory allows

**When NOT to cache:**
- Data used only once
- Data too large for memory
- Streaming data
- Frequently changing data

### The Golden Rules

1. **Cache after expensive operations, not before**
2. **Cache when you'll reuse data 2+ times**
3. **Always unpersist when done**
4. **Use appropriate storage levels**
5. **Remember: cache() is lazy - first action triggers it**

### Remember

1. **cache() = persist() with default storage level**
2. **persist() = customizable storage level**
3. **Cache is lazy - happens on first action**
4. **Unpersist to free memory**
5. **Check `is_cached` to see whether a DataFrame is marked for caching, and use the Spark UI to confirm the data is stored**

### Next Steps

- Practice caching in your own Spark jobs
- Monitor Spark UI to see cache utilization
- Experiment with different storage levels
- Review `08_performance_optimization.ipynb` for more optimization techniques


## Summary

### What We Learned

1. **What caching and persist are**
   - Methods to store intermediate results
   - Prevent recomputation of expensive operations
   - Essential for performance optimization

2. **The difference between cache() and persist()**
   - `cache()`: Convenience method with default storage level
   - `persist()`: Flexible method with customizable storage levels
   - Both achieve the same goal with different flexibility

3. **Storage levels available**
   - MEMORY_ONLY: Fastest, but risky
   - MEMORY_AND_DISK: Default for DataFrames, safest
   - DISK_ONLY: Slowest, but reliable
   - Serialized versions for memory efficiency

4. **When to cache**
   - Multiple actions on same DataFrame
   - Iterative algorithms (ML)
   - After expensive transformations
   - When memory allows

5. **Best practices**
   - Cache after expensive operations
   - Always unpersist when done
   - Use appropriate storage levels
   - Check cache status

### The Bottom Line

> **Caching is a powerful optimization technique that stores intermediate results to avoid recomputation. Use it when you'll reuse data multiple times, but remember to unpersist when done to free memory. Cache after expensive operations, not before, and choose the right storage level for your use case.**

---

**Related Notebooks:**
- `08_a_Spark_Architecture.ipynb` - Understanding executors, cores, and tasks
- `08_b_Partitions_Concepts.ipynb` - Understanding partitions and optimization
- `08_c_Coalesce.ipynb` - Efficiently reducing partitions
- `08_performance_optimization.ipynb` - Comprehensive performance optimization guide


In [None]:
# Clean up
spark.stop()
print("Spark session stopped.")
