# Module 8 - Performance & Optimization

## Introduction

Performance optimization is crucial for production Spark applications. This notebook covers key optimization techniques including partitioning, bucketing, caching, and serialization.

## What You'll Learn

- **Types of Spark Optimization**: Application code level vs Resource level optimization
- **Resource Level Optimization**: Executor strategies (Thin, Fat, Balanced)
- **Repartitioning and coalescing**: When and how to change partition counts
- **Bucketing**: Data organization for optimized joins
- **Caching and persistence**: Storing DataFrames to avoid recomputation
- **Serialization**: Kryo vs Java serialization
- **Best practices for performance**: Key optimization strategies


## Types of Spark Optimization

Spark optimization can be broadly categorized into two main types:

### 1. Application Code Level Optimization

**Application Code Level Optimization** focuses on improving the efficiency of your Spark code and transformations. This includes:

- **Use of Cache/Persist**: Storing intermediate results in memory to avoid recomputation
- **Using `reduceByKey()` instead of `groupByKey()`**: `reduceByKey()` performs local aggregation before shuffling, reducing network traffic
- **Broadcast Joins**: Broadcasting small tables to avoid shuffling large tables
- **Filter Early**: Applying filters as early as possible to reduce data size
- **Partitioning Strategies**: Using appropriate partitioning (repartition, coalesce) for optimal parallelism
- **Avoiding Unnecessary Shuffles**: Minimizing data movement across the network
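
The `reduceByKey()` vs `groupByKey()` point can be made concrete with a minimal pure-Python sketch (no Spark required) that counts how many records would cross the network in each case. The partition contents and the two helper function names are illustrative assumptions, not Spark APIs.

```python
from collections import defaultdict

# Hypothetical input: two map-side partitions of (department, salary) pairs.
partitions = [
    [("Sales", 50000), ("IT", 60000), ("Sales", 70000), ("Sales", 52000)],
    [("IT", 55000), ("HR", 65000), ("IT", 58000), ("HR", 61000)],
]

def shuffled_records_group_by_key(parts):
    """groupByKey-style: every (key, value) pair is sent across the network."""
    return sum(len(part) for part in parts)

def shuffled_records_reduce_by_key(parts):
    """reduceByKey-style: each partition pre-aggregates, then sends one record per key."""
    total = 0
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value  # local (map-side) combine before the shuffle
        total += len(local)
    return total

print(shuffled_records_group_by_key(partitions))   # 8
print(shuffled_records_reduce_by_key(partitions))  # 4
```

The saving grows with the number of repeated keys per partition; with mostly unique keys the two approaches shuffle a similar volume.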

**Example - Cache Usage:**
```python
# Without cache - recomputes expensive operation multiple times
df_filtered = df.filter(df.Salary > 50000)
result1 = df_filtered.groupBy("Department").count()  # Computes filter
result2 = df_filtered.agg({"Salary": "avg"})         # Recomputes filter again

# With cache - computes once, reuses result
df_filtered = df.filter(df.Salary > 50000).cache()
result1 = df_filtered.groupBy("Department").count()  # Computes and caches
result2 = df_filtered.agg({"Salary": "avg"})         # Uses cached data
```

**Example - Efficient Aggregations:**
```python
from pyspark.sql.functions import sum as spark_sum  # alias avoids shadowing Python's built-in sum

# Inefficient - RDD groupByKey() shuffles every (Department, Salary) pair before summing
pairs = df.rdd.map(lambda row: (row["Department"], row["Salary"]))
totals_rdd = pairs.groupByKey().mapValues(sum)

# Efficient - DataFrame aggregates are partially computed on each partition before the shuffle
totals_df = df.groupBy("Department").agg(spark_sum("Salary").alias("TotalSalary"))
```

### 2. Cluster Level Optimization / Resource Level Optimization

**Resource Level Optimization** focuses on allocating the right amount of cluster resources (memory, CPU cores, executors) to ensure efficient job execution.

**Key Resources:**
- **Memory (RAM)**: Determines how much data can be held in memory
- **CPU Cores**: Determines parallelism and concurrent task execution
- **Executors**: Containers that hold resources (CPU & RAM)

**Why Resource Optimization Matters:**
- **Under-allocation**: Jobs run slowly (or fail with out-of-memory errors) while spare cluster capacity sits idle
- **Over-allocation**: Wastes resources, can cause memory pressure and garbage collection issues
- **Inefficient allocation**: Can lead to poor parallelism, network bottlenecks, or executor failures

---

## Resource Level Optimization - Executor Strategies

### Scenario Setup

Consider a **10-node cluster** (10 worker nodes) with the following specifications:
- Each machine has: **16 CPU cores** and **64 GB RAM**
- Total cluster resources: **160 CPU cores** and **640 GB RAM**

### Understanding Executors

An **executor** is a JVM process that acts as a container of resources (CPU & RAM). A single node can host more than one executor.

**Key Question:** What is the ideal number of executors that a node can have for efficient processing?

---

### Strategy 1: Thin Executors

**Idea:** Create more executors, each holding minimal resources.

**Configuration Example:**
- Total executors per node: **16 executors**
- Each executor: **1 CPU core** and **4 GB RAM**
- Total across cluster: **160 executors** (16 per node × 10 nodes)

**Characteristics:**
- Maximum parallelism (one task per executor)
- Fine-grained resource allocation

**Drawbacks:**
1. **No Multithreading**: With only 1 core per executor, you cannot run multiple tasks concurrently on the same executor
2. **Shared Variables Lead to Many Copies**: Broadcast variables and shared data structures are replicated across many executors, increasing memory overhead
3. **High Overhead**: More executors mean more JVM overhead, more network connections, and more coordination overhead
4. **Poor Resource Utilization**: Small executors may not efficiently utilize available memory

**Example Scenario:**
```python
from pyspark.sql import SparkSession

# Thin executor configuration (inefficient)
spark = SparkSession.builder \
    .appName("Thin Executors") \
    .config("spark.executor.instances", "160") \
    .config("spark.executor.cores", "1") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()
# Problem: 160 executors with 1 core each = no concurrent tasks per executor
```

---

### Strategy 2: Fat Executors

**Idea:** Give maximum resources to each executor.

**Configuration Example:**
- Total executors per node: **1 executor**
- Each executor: **16 CPU cores** and **64 GB RAM**
- Total across cluster: **10 executors** (1 per node × 10 nodes)

**Characteristics:**
- Maximum resources per executor
- Can run multiple concurrent tasks (up to 16 per executor)

**Drawbacks:**
1. **HDFS Throughput Suffers**: 
   - The HDFS (Hadoop Distributed File System) client handles many concurrent threads inside a single JVM poorly
   - With one 16-core executor per node, up to 16 tasks share the same HDFS client
   - A common rule of thumb is that throughput degrades beyond about 5 concurrent tasks per executor
   - This creates a bottleneck when reading or writing large datasets
   
2. **Takes a Lot of Time for Garbage Collection (GC)**:
   - Large heap size (64 GB) means more objects in memory
   - GC pauses become longer as heap size increases
   - Long GC pauses can cause task failures and timeouts
   - Can lead to executor heartbeat timeouts and job failures

**Example Scenario:**
```python
from pyspark.sql import SparkSession

# Fat executor configuration (inefficient)
spark = SparkSession.builder \
    .appName("Fat Executors") \
    .config("spark.executor.instances", "10") \
    .config("spark.executor.cores", "16") \
    .config("spark.executor.memory", "64g") \
    .getOrCreate()
# Problem: HDFS bottleneck + long GC pauses
```

---

### Strategy 3: Right / Balanced Strategy for Creating Containers

**Idea:** Balance between parallelism, HDFS throughput, and GC efficiency.

**Recommended Configuration:**
- **5 executors per node**
- Each executor: **3 CPU cores** and **~12 GB RAM** (leaving some for OS and overhead)
- Total across cluster: **50 executors** (5 per node × 10 nodes)

**Why This Works:**

1. **Optimal HDFS Throughput**:
   - 5 executors per node with 3 cores each means at most 3 tasks share one HDFS client
   - This stays under the common rule-of-thumb ceiling of about 5 concurrent tasks per executor
   - Keeps data read/write throughput high

2. **Efficient Garbage Collection**:
   - ~12 GB heap per executor is manageable for GC
   - GC pauses are shorter and more predictable
   - Reduces risk of executor timeouts

3. **Good Parallelism**:
   - 3 cores per executor allows 3 concurrent tasks
   - 50 executors × 3 cores = 150 concurrent tasks across cluster
   - Good utilization of available CPU cores

4. **Resource Allocation Formula**:
   ```
   Per Node:
   - Total cores: 16
   - Total RAM: 64 GB
   - Executors per node: 5
   - Cores per executor: 3 ((16 - 1 reserved for OS) ÷ 5 = 3)
   - RAM per executor: ~12 GB (64 GB ÷ 5 ≈ 12.8 GB, rounded down to leave room for the OS and executor memory overhead)
   ```

**Example Configuration:**
```python
from pyspark.sql import SparkSession

# Balanced executor configuration (recommended)
# spark.memory.fraction is the share of (heap - 300 MB) used for execution and storage; the default is 0.6
spark = SparkSession.builder \
    .appName("Balanced Executors") \
    .config("spark.executor.instances", "50") \
    .config("spark.executor.cores", "3") \
    .config("spark.executor.memory", "12g") \
    .config("spark.memory.fraction", "0.8") \
    .getOrCreate()
# Benefits: Good HDFS throughput + efficient GC + good parallelism
```

**Key Considerations:**
- Leave 1 core per node for OS and system processes
- Leave ~10-20% of RAM per node for OS and system processes
- Aim for 5-10 executors per node for optimal HDFS throughput
- Keep executor memory between 8-16 GB for efficient GC
- Ensure executor cores allow for good parallelism (typically 3-5 cores per executor)
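
The sizing arithmetic above can be written as a small helper. This is a sketch under stated assumptions: the function name `size_executors`, the 10% RAM reserve for the OS, and the 10% off-heap overhead factor (mirroring the default of `spark.executor.memoryOverhead`) are choices made for illustration, not values taken from a specific cluster.

```python
def size_executors(cores_per_node, ram_gb_per_node, nodes, executors_per_node,
                   reserved_cores=1, reserved_ram_fraction=0.10,
                   overhead_fraction=0.10):
    """Return (total_executors, cores_per_executor, heap_gb_per_executor)."""
    # Leave some cores for the OS and node daemons.
    usable_cores = cores_per_node - reserved_cores
    cores_per_executor = usable_cores // executors_per_node

    # Leave a share of RAM for the OS, then split the rest between executors.
    usable_ram_gb = ram_gb_per_node * (1 - reserved_ram_fraction)
    container_gb = usable_ram_gb / executors_per_node

    # Each container must hold the JVM heap plus off-heap overhead.
    heap_gb = container_gb / (1 + overhead_fraction)
    return nodes * executors_per_node, cores_per_executor, round(heap_gb, 1)

total, cores, heap = size_executors(16, 64, 10, 5)
print(total, cores, heap)  # 50 3 10.5
```

Under these assumptions the heap comes out nearer 10-11 GB than 12 GB, because five 12 GB heaps plus overhead would not quite fit in 64 GB. Treat both figures as starting points to validate against monitoring.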

---

### Summary: Executor Strategy Comparison

| Strategy | Executors/Node | Cores/Executor | RAM/Executor | HDFS Throughput | GC Efficiency | Parallelism |
|----------|---------------|----------------|--------------|-----------------|---------------|-------------|
| **Thin** | 16 | 1 | 4 GB | Good | Good | Poor (no multithreading) |
| **Fat** | 1 | 16 | 64 GB | Poor | Poor (long GC) | Good |
| **Balanced** | 5 | 3 | 12 GB | Optimal | Good | Good |

**Best Practice:** Use the **Balanced Strategy** for most production workloads. Adjust based on:
- Your specific workload characteristics
- Data size and access patterns
- Available cluster resources
- Performance monitoring results
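
To compare the three strategies numerically, the following plain-Python sketch derives executor counts, concurrent task slots, and the total memory consumed by one broadcast variable (one copy per executor). The 200 MB broadcast size and the `summarize` helper are hypothetical values chosen for illustration.

```python
NODES = 10
BROADCAST_MB = 200  # hypothetical size of a single broadcast variable

# Executor layouts taken from the comparison table above.
strategies = {
    "thin":     {"executors_per_node": 16, "cores": 1},
    "fat":      {"executors_per_node": 1,  "cores": 16},
    "balanced": {"executors_per_node": 5,  "cores": 3},
}

def summarize(name):
    """Cluster-wide totals for one strategy."""
    layout = strategies[name]
    executors = NODES * layout["executors_per_node"]
    return {
        "executors": executors,
        "task_slots": executors * layout["cores"],
        # Each executor JVM keeps its own copy of a broadcast variable.
        "broadcast_total_mb": executors * BROADCAST_MB,
    }

for name in strategies:
    print(name, summarize(name))
```

The thin layout holds 160 copies of the broadcast variable against 10 for the fat layout, which is the "many copies" drawback described earlier.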


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.storagelevel import StorageLevel

# Create SparkSession
spark = SparkSession.builder \
    .appName("Optimization Techniques") \
    .master("local[*]") \
    .getOrCreate()

# Helper function to get partition count (accesses underlying partition metadata)
def get_partition_count(df):
    """Get the number of partitions in a DataFrame."""
    return df.rdd.getNumPartitions()

# Create sample DataFrame
data = [
    ("Alice", "Sales", 50000, "2024-01"),
    ("Bob", "IT", 60000, "2024-01"),
    ("Charlie", "Sales", 70000, "2024-01"),
    ("Diana", "IT", 55000, "2024-01"),
    ("Eve", "HR", 65000, "2024-01")
] * 100  # Multiply to have more data

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Department", StringType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("Month", StringType(), True)
])

df = spark.createDataFrame(data, schema)
print(f"DataFrame created with {df.count()} rows")
print(f"Number of partitions: {get_partition_count(df)}")


DataFrame created with 500 rows
Number of partitions: 11


                                                                                


## Repartition vs Coalesce: Key Differences

### Overview

Both `repartition()` and `coalesce()` are DataFrame API methods used to change the number of partitions, but they work differently and have different use cases. Understanding the theoretical differences is crucial for writing efficient Spark applications.

### Understanding Partitions and Shuffles

**What are Partitions?**
- Partitions are logical divisions of data in Spark
- Each partition is processed by a single task on an executor
- Partitions determine parallelism - more partitions = more parallel tasks
- Optimal partition size is typically 128MB-200MB

**What is a Shuffle?**
- A shuffle is the process of redistributing data across partitions
- Requires data to be:
  1. Serialized on the source executor
  2. Written to shuffle files on the source executor's local disk
  3. Fetched over the network by the target executors
  4. Deserialized on the target executor
- Shuffles are **expensive operations** because they involve:
  - Network I/O (data transfer between nodes)
  - Disk I/O (shuffle files are written to local disk, plus spills when memory is full)
  - CPU overhead (serialization/deserialization)
  - Coordination overhead (tracking data movement)

### Repartition

**`repartition()` can both INCREASE and DECREASE the number of partitions.**

**Theoretical Background - How Repartition Works:**

1. **Full Shuffle Operation:**
   - Repartition **always** triggers a full shuffle, regardless of whether you're increasing or decreasing partitions
   - All data is redistributed across the network
   - `repartition(n)` distributes rows round-robin; `repartition(col)` uses hash partitioning on the column, and `repartitionByRange()` uses range partitioning
   - Each record is assigned to a target partition by that scheme

2. **Internal Mechanism:**
   - Spark computes a target partition ID for each record (its round-robin position, or a hash of the partition key)
   - Records are grouped by target partition ID
   - Data is serialized and sent over the network to appropriate executors
   - Target executors receive, deserialize, and write data to new partitions
   - This happens for **every record** in the DataFrame

3. **Why Shuffle is Necessary for Repartition:**
   - To increase partitions: Need to redistribute data to create new partitions
   - To decrease partitions: Need to redistribute data to merge into fewer partitions
   - To partition by column: Need to ensure records with same key are on same partition

4. **Performance Characteristics:**
   - **Expensive**: Network transfer, serialization overhead
   - **Time Complexity**: O(n) where n is number of records
   - **Network Bandwidth**: All data crosses the network
   - **Memory**: May cause spills to disk if memory is insufficient

**When to INCREASE partitions (use `repartition()`):**

1. **Too few large partitions**: When you have a small number of very large partitions that are causing:
   - **Memory pressure on executors**: Large partitions may exceed executor memory, causing OOM errors or excessive disk spills
   - **Long-running tasks that can't be parallelized**: Few partitions mean fewer tasks, leading to poor resource utilization
   - **Underutilization of cluster resources**: If you have 100 executors but only 2 partitions, 98 executors sit idle
   - **Example**: After reading a single large file that creates only 1-2 partitions, you want to parallelize processing across all available cores

2. **After a filter leaves skewed partitions**: `filter()` never changes the partition count, but a selective filter can leave most of the surviving rows in a handful of partitions while the rest are nearly empty. Repartitioning helps:
   - **Better utilize available executors**: Rows are spread evenly, so every task has comparable work
   - **Improve parallelism for subsequent operations**: Downstream operations benefit from better distribution
   - **Example**: Filtering a 1TB dataset down to 10GB might leave nearly all of that 10GB in 10 partitions; repartitioning to 100 spreads it across 100 parallel tasks

3. **Before expensive operations**: When you're about to perform expensive transformations, increasing partitions can:
   - **Distribute work more evenly**: Prevents some executors from being overloaded while others are idle
   - **Prevent straggler tasks**: Even distribution reduces the impact of slow tasks
   - **Example**: Before complex aggregations or joins that benefit from more parallel tasks

4. **Partitioning by column for joins**: When you need to partition by specific columns to optimize joins:
   - `df.repartition("join_column")` ensures data with same join key is on same partition
   - **Co-locates related data**: Records that will be joined together are on the same partition
   - **Reduces shuffle during join**: If both DataFrames are hash-partitioned on the join key with the same number of partitions, Spark can reuse that partitioning instead of shuffling again for the join
   - **Essential for efficient join operations**: Especially for large-scale joins
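
The co-location idea behind partitioning by a join column can be illustrated without Spark. In this sketch `zlib.crc32` is a deterministic stand-in for Spark's hash partitioner (Spark itself uses Murmur3), and the sample rows, `partition_for`, and `NUM_PARTITIONS` are hypothetical.

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(key, num_partitions):
    """Map a key to a partition ID: hash(key) mod num_partitions."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

employees = [("Alice", "Sales"), ("Bob", "IT"), ("Charlie", "Sales"), ("Diana", "IT")]
budgets = [("Sales", 1_000_000), ("IT", 2_000_000)]

# Partition both datasets by the join key (department).
employee_partition = {name: partition_for(dept, NUM_PARTITIONS)
                      for name, dept in employees}
budget_partition = {dept: partition_for(dept, NUM_PARTITIONS)
                    for dept, _ in budgets}

# Every employee row lands in the same partition as its matching budget row.
co_located = all(employee_partition[name] == budget_partition[dept]
                 for name, dept in employees)
print(co_located)  # True
```

Co-location only holds when both sides use the same hash function and the same partition count, which is why matching partition counts matter for avoiding a second shuffle.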

**When to DECREASE partitions (use `coalesce()` instead):**
- When you have too many small partitions (see Coalesce section below)
- **Important**: While `repartition()` can decrease partitions, it's **inefficient** for this purpose because it performs a full shuffle
- **Theoretical reason**: When decreasing partitions, you don't need to redistribute data - you just need to merge existing partitions. A shuffle is unnecessary overhead.

### Coalesce

**`coalesce()` can ONLY DECREASE the number of partitions (cannot increase).**

**Theoretical Background - How Coalesce Works:**

1. **No Shuffle Operation:**
   - Coalesce **does not** add a shuffle stage to the plan
   - It is a narrow transformation: each new partition is defined as a group of existing partitions
   - Spark tries to group partitions that live on the same executor, so most data stays where it is

2. **Internal Mechanism:**
   - Spark assigns each existing partition to one of the target partitions
   - Where possible, partitions on the same executor are grouped together
   - One task then processes all the partitions in its group
   - No shuffle files are written and no shuffle fetch takes place
   - Partition boundaries are reorganized without redistributing individual records

3. **Why Coalesce Cannot Increase Partitions:**
   - To increase partitions, you need to split existing partitions
   - Splitting requires redistributing data to create new partitions
   - This redistribution requires a shuffle operation
   - Coalesce is designed to avoid shuffles, so it cannot increase partitions
   - If you try to increase partitions with coalesce, the call has no effect and the DataFrame keeps its current partition count

4. **Performance Characteristics:**
   - **Efficient**: No shuffle stage, no shuffle serialization overhead
   - **Time Complexity**: O(1) per partition merge for planning (just metadata changes); the rows themselves are still processed by the merged tasks
   - **Network Bandwidth**: Little or none when the grouped partitions are already co-located
   - **Memory**: Minimal overhead - just reorganizing partition metadata

**When to use `coalesce()`:**

1. **Too many small partitions**: After operations that create many small partitions:
   - **After filtering or transformations**: Operations like `filter()`, `map()`, or `flatMap()` keep the partition count but can leave many partitions nearly empty
   - **Partition size below optimal**: When partition size is much smaller than optimal (128MB-200MB)
   - **Performance impact**: Too many small partitions cause:
     - High task scheduling overhead (each partition = one task)
     - Poor resource utilization (tasks finish too quickly)
     - Excessive metadata overhead
   - **Example**: After a filter that leaves 1000 partitions with only a few rows each, coalesce to 10-20 partitions

2. **Before writing to storage**: When writing files, too many partitions create:
   - **Many small files**: Each partition writes to a separate file, creating thousands of tiny files
   - **Storage system inefficiency**: HDFS, S3, and other systems perform poorly with many small files
   - **Metadata overhead**: File systems struggle with large numbers of small files
   - **Query performance**: Reading many small files is slower than reading fewer large files
   - **Example**: Reducing from 1000 partitions to 10 before writing to HDFS/S3 creates 10 files instead of 1000

3. **After operations that increase partitions**: When previous operations created more partitions than needed:
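   - **After joins**: Shuffle joins produce `spark.sql.shuffle.partitions` output partitions (200 by default), which is often too many for small results
   - **After certain transformations**: Some operations add partitions; for example, `union()` yields the sum of its inputs' partition counts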
   - **Resource optimization**: Reducing partitions frees up resources for other operations
   - **Example**: After a join that creates 500 partitions, coalesce to 50 for better performance

### Why Coalesce Exists if Repartition Can Do Both?

**Theoretical Foundation: Efficiency and Resource Optimization**

This is a fundamental question in Spark optimization. If `repartition()` can both increase and decrease partitions, why does `coalesce()` exist?

**Key Difference: Efficiency and Shuffle Avoidance**

When **decreasing** partitions, the theoretical difference is:

**`coalesce()` - No Shuffle:**
- **Narrow operation**: Groups existing partitions, preferring ones already on the same executor
- **No shuffle transfer**: No shuffle files are written or fetched
- **No shuffle serialization**: Rows are not re-serialized for redistribution
- **Minimal overhead**: Only the partition-to-task assignment changes
- **Time complexity**: Planning cost is O(1) per partition merge; rows are still processed once by the merged tasks
- **Resource usage**: Minimal extra CPU, little or no network, minimal memory

**`repartition()` - Full Shuffle:**
- **Distributed operation**: Redistributes ALL data across the network
- **Network transfer**: Every record crosses the network
- **Serialization overhead**: Data must be serialized before transfer, deserialized after
- **Disk I/O**: May spill to disk if memory is insufficient
- **Time complexity**: O(n) where n is number of records - linear time
- **Resource usage**: High CPU (serialization), high network bandwidth, high memory/disk

**Detailed Example Scenario:**

Consider you have 1000 partitions with 1MB each (total 1GB of data) and want to reduce to 10 partitions. The timings below are illustrative orders of magnitude, not measurements:

**Using `coalesce(10)`:**
1. Spark groups about 100 source partitions into each target partition (1000/10 = 100)
2. On each executor, local partitions are merged together
3. No data leaves the executor
4. **Time**: ~1-2 seconds (just metadata reorganization)
5. **Network**: 0 bytes transferred
6. **CPU**: Minimal (just merging partition boundaries)
7. **Memory**: Minimal overhead

**Using `repartition(10)`:**
1. Spark assigns each record in the 1GB dataset to one of the 10 target partitions
2. All 1GB of data is serialized
3. All 1GB of data is transferred over the network
4. All 1GB of data is deserialized on target executors
5. Data is written to new partitions
6. **Time**: ~30-60 seconds (depending on network speed)
7. **Network**: 1GB transferred (plus overhead)
8. **CPU**: High (serialization/deserialization)
9. **Memory**: High (may cause spills to disk)

**Performance Impact:**

The performance difference can be **10-100x** depending on:
- Data size
- Network bandwidth
- Number of partitions
- Cluster size

**Why This Matters:**

1. **Cost**: Network transfer costs money in cloud environments
2. **Time**: Shuffles are often the bottleneck in Spark jobs
3. **Resource utilization**: Unnecessary shuffles waste cluster resources
4. **Scalability**: As data grows, shuffle cost grows at least linearly with data volume, while `coalesce()` never adds a shuffle stage

**Rule of Thumb:**
- **To INCREASE partitions**: Use `repartition()` (shuffle is necessary to redistribute data)
- **To DECREASE partitions**: Use `coalesce()` (no shuffle needed, much more efficient)
- **To partition by column**: Use `repartition(column)` (shuffle is necessary for proper key-based distribution)

**When Repartition is Acceptable for Decreasing:**

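There's one main scenario where `repartition()` might be preferred even when decreasing:
- When you need **even distribution** of data across partitions
- `coalesce()` may create uneven partitions (some larger, some smaller)
- `repartition()` distributes rows round-robin, which gives roughly equal partition sizes
- A drastic `coalesce()` (for example `coalesce(1)`) also reduces the parallelism of the upstream computation, which `repartition()` avoids
- **Trade-off**: Accept shuffle overhead for better data distribution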
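
The distribution trade-off can be sketched in plain Python. The simulation below is deliberately simplified: it merges adjacent partitions for `coalesce()` (real Spark groups by locality) and deals rows out evenly for `repartition()`. The row counts and both helper functions are hypothetical, and the coalesce helper assumes the source count is divisible by the target.

```python
# Hypothetical row counts per partition after a selective filter.
partition_sizes = [900, 5, 5, 5, 800, 5, 5, 5]

def coalesce_sizes(sizes, target):
    """Merge adjacent partitions into `target` groups; rows never change group."""
    group = len(sizes) // target
    return [sum(sizes[i * group:(i + 1) * group]) for i in range(target)]

def repartition_sizes(sizes, target):
    """Round-robin redistribution: rows are dealt out evenly (requires a shuffle)."""
    base, extra = divmod(sum(sizes), target)
    return [base + (1 if i < extra else 0) for i in range(target)]

print(coalesce_sizes(partition_sizes, 4))     # [905, 10, 805, 10]
print(repartition_sizes(partition_sizes, 4))  # [433, 433, 432, 432]
```

With skewed input, the coalesced result keeps the skew, so two tasks do almost all of the work; the repartitioned result is balanced at the price of moving every row.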


In [2]:
# Example 1: Repartition to INCREASE partitions
print("=== Example 1: Increasing Partitions ===")
print(f"Original partitions: {get_partition_count(df)}")
df_increased = df.repartition(20)  # Increase from 11 to 20
print(f"After repartition(20): {get_partition_count(df_increased)} partitions")
print("Use case: Better parallelism when you have too few large partitions\n")

# Example 2: Repartition by column (for joins)
print("=== Example 2: Repartition by Column ===")
df_repartitioned_by_col = df.repartition("Department")
print(f"Repartitioned by Department: {get_partition_count(df_repartitioned_by_col)} partitions")
print("Use case: Optimize joins on Department column\n")

# Example 3: Repartition to DECREASE (inefficient - shown for comparison)
print("=== Example 3: Decreasing with Repartition (INEFFICIENT) ===")
print(f"Original partitions: {get_partition_count(df)}")
df_decreased_repartition = df.repartition(2)  # Can decrease, but inefficient
print(f"After repartition(2): {get_partition_count(df_decreased_repartition)} partitions")
print("⚠️  WARNING: This causes a full shuffle even though we're reducing partitions!")
print("   Use coalesce() instead for better performance when decreasing partitions")


=== Example 1: Increasing Partitions ===
Original partitions: 11
After repartition(20): 20 partitions
Use case: Better parallelism when you have too few large partitions

=== Example 2: Repartition by Column ===
Repartitioned by Department: 1 partitions
Use case: Optimize joins on Department column

=== Example 3: Decreasing with Repartition (INEFFICIENT) ===
Original partitions: 11
After repartition(2): 2 partitions
⚠️  WARNING: This causes a full shuffle even though we're reducing partitions!
   Use coalesce() instead for better performance when decreasing partitions


## Coalesce - Efficient Partition Reduction

**`coalesce()` can ONLY DECREASE partitions (cannot increase).**

**Key Characteristics:**
- **No shuffle** - merges partitions locally on the same executor
- More efficient than `repartition()` when reducing partition count
- Cannot increase partitions (will return original partition count if you try to increase)

**When to use `coalesce()`:**
1. **Too many small partitions**: After operations that create many small partitions
2. **Before writing files**: To avoid creating too many small output files
3. **After filtering**: When filter operations leave you with many empty/small partitions
4. **Memory optimization**: When you want to reduce partition overhead

**Remember**: Use `coalesce()` instead of `repartition()` when decreasing partitions for better performance!


In [3]:
# Example 1: Coalesce to reduce partitions (EFFICIENT - no shuffle)
print("=== Example 1: Decreasing Partitions with Coalesce ===")
print(f"Original partitions: {get_partition_count(df)}")
df_coalesced = df.coalesce(2)
print(f"After coalesce(2): {get_partition_count(df_coalesced)} partitions")
print("✓ Efficient: No shuffle, just merges partitions locally\n")

# Example 2: Attempting to increase with coalesce (won't work)
print("=== Example 2: Attempting to Increase with Coalesce ===")
print(f"Original partitions: {get_partition_count(df)}")
df_coalesce_increase = df.coalesce(20)  # Trying to increase
print(f"After coalesce(20): {get_partition_count(df_coalesce_increase)} partitions")
print("⚠️  Note: Coalesce cannot increase partitions!")
print("   It returns the original partition count if you try to increase\n")

# Example 3: Comparison - Coalesce vs Repartition for decreasing
print("=== Example 3: Performance Comparison ===")
print("When decreasing partitions:")
print("  • coalesce():  No shuffle, fast, efficient")
print("  • repartition(): Full shuffle, slow, wasteful")
print("\nBest Practice:")
print("  • Use repartition() to INCREASE partitions")
print("  • Use coalesce() to DECREASE partitions")


=== Example 1: Decreasing Partitions with Coalesce ===
Original partitions: 11
After coalesce(2): 2 partitions
✓ Efficient: No shuffle, just merges partitions locally

=== Example 2: Attempting to Increase with Coalesce ===
Original partitions: 11
After coalesce(20): 11 partitions
⚠️  Note: Coalesce cannot increase partitions!
   It returns the original partition count if you try to increase

=== Example 3: Performance Comparison ===
When decreasing partitions:
  • coalesce():  No shuffle, fast, efficient
  • repartition(): Full shuffle, slow, wasteful

Best Practice:
  • Use repartition() to INCREASE partitions
  • Use coalesce() to DECREASE partitions


## Caching and Persistence

### Why Do We Need Caching?

**Question**: What is the need for Cache when we are working with DataFrames that are already in-memory?

DataFrames are evaluated lazily: without caching, every action re-runs the full lineage, starting with the read from the source. Caching is therefore still essential for the following reasons:

1. **Avoid Disk I/O for Subsequent Operations**: Caching keeps frequently used data in memory, so later actions do not have to re-read the source from disk to rebuild the DataFrame.

2. **Single Initial Load**: With caching, data needs to be loaded only once initially and can be reused for further transformations without having to read from the disk and create DataFrames multiple times.

3. **Save Processing Time**: The results of transformations that could potentially be used multiple times for further processing can also be cached to save processing time.

### Key Points About Caching

- **DataFrames which are reused multiple times** need to be cached for performance benefits.
- **Never cache large DataFrames** that could consume the majority of available memory. Cache medium-sized DataFrames that will be reused.
- **Cache is Lazy**: Caching itself is a lazy operation - the data is only cached when an action is triggered.

### Caching RDDs, DataFrames, and Spark Tables

**RDDs**:
- By default caches to **memory only**

**DataFrames & High-level Constructs**:
- By default caches to **memory**, if there is not enough memory available, then caches to **disk**
- Default storage level: `MEMORY_AND_DISK`

**Note on DataFrames Caching to Disk**:

If the data is already on Disk, then why cache to disk again?

There are two parts to the Disk space of every Worker Node:
- **HDFS**: Distributed file system where data is initially stored
- **Local Disk**: Local storage on each worker node

Initially, data is present in HDFS. Caching brings this data to the **local disk storage**, with which faster data access can be achieved. Local disk access is much faster than accessing data from HDFS across the network.

### Cache vs Persist

**`cache()`**:
- Shorthand for `persist()` with the default storage level
- Takes no arguments, so the storage level cannot be changed
- Default: `MEMORY_AND_DISK`

**`persist()`**:
- More flexible than `cache()` in terms of storage level options
- Defaults to `MEMORY_AND_DISK`, but accepts an optional `StorageLevel` argument
- Allows fine-grained control over storage levels

### Storage Levels

Common storage levels available:

- **`MEMORY_ONLY`**: Store in memory only (fastest; partitions that do not fit are not cached and are recomputed when needed)
- **`MEMORY_AND_DISK`**: Store in memory, spill to disk if needed (default for DataFrames)
- **`DISK_ONLY`**: Store on disk only (slower, but uses no executor memory for storage)
- **`MEMORY_ONLY_SER`**: Serialized in memory (more space efficient, requires deserialization; Scala/Java API only, since PySpark always stores data serialized)
- **`MEMORY_AND_DISK_SER`**: Serialized in memory, spill to disk if needed (Scala/Java API only)

### When to Use Caching

**Best suited for caching**:
- DataFrames that are **not too large** and are **reused frequently**
- Intermediate results of expensive transformations that will be used multiple times
- DataFrames used in iterative algorithms or loops

**Avoid caching**:
- Very large DataFrames that consume most of available memory
- DataFrames that are used only once
- DataFrames that are cheap to recompute

### Benefits of Caching

- **Improves performance** by avoiding redoing already performed computations
- **Saves computational cost** by reducing redundant processing
- **Reduces network I/O** when data is cached locally on worker nodes

### Viewing Cache Information in Spark UI

**Spark UI** is where you can view the currently executed job details of a Spark application. It also displays other details like:
- Number of executors that are allocated
- Job execution timeline
- Task details and performance metrics

**For cached data**:
- Caching details are available under the **Storage tab** on the Spark UI
- You can see which DataFrames/RDDs are cached
- Storage level, size in memory, and size on disk
- Number of cached partitions and the fraction of the data that is cached

**Access Spark UI**:
- Local mode: `http://localhost:4040`
- Cluster mode: Check your cluster's Spark UI URL


In [4]:
# Cache a DataFrame (default: MEMORY_AND_DISK)
df_cached = df.filter(df.Salary > 55000).cache()

print("DataFrame cached")
print("First access (computes and caches):")
df_cached.count()

print("\nSecond access (uses cache - much faster):")
df_cached.count()


DataFrame cached
First access (computes and caches):

Second access (uses cache - much faster):


300

In [5]:
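# Use different storage levels
from pyspark import StorageLevel  # harmless if already imported earlier

# Note: both DataFrames below have the same plan as df_cached, which is
# already cached. Spark therefore logs "Asked to cache already cached data"
# and keeps the existing cache entry instead of applying the new level.
# Call df_cached.unpersist() first (or use a different query) to see each level.
df_memory_only = df.filter(df.Salary > 55000).persist(StorageLevel.MEMORY_ONLY)
df_disk_only = df.filter(df.Salary > 55000).persist(StorageLevel.DISK_ONLY)

print("DataFrames persisted with different storage levels")
print("Use MEMORY_ONLY when data fits in memory")
print("Use DISK_ONLY when memory is limited")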


DataFrames persisted with different storage levels
Use MEMORY_ONLY when data fits in memory
Use DISK_ONLY when memory is limited


25/12/28 21:42:55 WARN CacheManager: Asked to cache already cached data.
25/12/28 21:42:55 WARN CacheManager: Asked to cache already cached data.


In [6]:
# Unpersist when done (free up memory)
df_cached.unpersist()
print("DataFrame unpersisted - memory freed")


DataFrame unpersisted - memory freed


## Partitioning with partitionBy

**Partitioning** organizes data in the underlying file system so that a query scans only the relevant subset of the data instead of the entire dataset, which improves performance.

### Why Partitioning?

Consider an example scenario: Working with an orders dataset that is randomly stored.

**Without Partitioning:**
- When you run a query, the query execution time will be noticeably high as **all the files have to be scanned** for processing the query.
- Spark needs to read through all data files to find the relevant records.

**With Partitioning:**
- In order to optimize the query performance, the data needs to be partitioned.
- Only a **subset of data will be scanned** for processing, thereby improving the performance significantly.
- This is called **partition pruning** - scanning only the necessary partitions and skipping the other partitions.

### When to Use partitionBy

**Key Guidelines:**

1. **Low Cardinality**: Apply `partitionBy` to a column with **low cardinality** (a small number of distinct values). High-cardinality columns produce a very large number of small folders and files.
   - Example: `Department` (IT, Sales, HR) - good for partitioning
   - Example: `CustomerID` (millions of unique values) - NOT good for partitioning

2. **Frequent Filtering**: Apply `partitionBy` to a column that queries **filter on frequently**.
   - If you frequently filter by `country`, partition by `country`
   - If you frequently filter by `date`, partition by `date`

3. **Partition Pruning**: Pruning only happens when the filter is on a partition column. Filtering on a **non-partitioned column** still scans every partition, so partitioning gives no performance gain for that query.
   - Example: If data is partitioned by `country` but you filter by `customer_id`, you won't get performance benefits.
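
The effect of guideline 3 can be illustrated without Spark. The following is a simplified plain-Python simulation, not Spark's actual implementation: the rows, the `folders` dictionary, and the `scan` helper are all hypothetical names made up for this sketch.

```python
from collections import defaultdict

# Toy rows: (invoice_no, country, customer_id); values are made up
rows = [
    ("INV001", "USA", 1),
    ("INV002", "USA", 2),
    ("INV003", "UK", 3),
    ("INV004", "IN", 4),
]

# "Write" the rows into one folder per distinct country value
folders = defaultdict(list)
for invoice_no, country, customer_id in rows:
    folders[f"country={country}"].append((invoice_no, customer_id))

def scan(partition_value=None):
    """Return (folders_scanned, rows_read) for a query.

    A filter on the partition column opens only the matching folder;
    any other filter has to read every folder.
    """
    if partition_value is not None:
        selected = [name for name in folders if name == f"country={partition_value}"]
    else:
        selected = list(folders)
    rows_read = sum(len(folders[name]) for name in selected)
    return len(selected), rows_read

pruned = scan(partition_value="USA")  # filter on the partition column
full = scan()                         # filter on customer_id: no pruning
print("filter on country:    ", pruned)
print("filter on customer_id:", full)
```

In this toy layout the `country` filter touches one folder out of three, while the `customer_id` filter reads all of them.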

### Multiple Column Partitioning

Partitioning can be applied simultaneously on **more than one column** by specifying the comma-separated list of columns in the `partitionBy` clause.

```python
df.write.partitionBy("State", "City").parquet("path/to/output")
```

This creates **2 levels of folders (nested)**:
- **Top-level folder**: `State` (if there are X states, X top-level folders will be created)
- **Sub-folder**: `City` (if there are Y cities in a State, then Y subfolders are created for that particular state)

**Example Structure:**
```
output/
  ├── State=CA/
  │   ├── City=LosAngeles/
  │   └── City=SanFrancisco/
  └── State=NY/
      ├── City=NewYork/
      └── City=Buffalo/
```
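
The nesting follows the order of the columns passed to `partitionBy`. As a rough illustration in plain Python (the `partition_path` helper is hypothetical and only mimics the `column=value` folder naming shown above):

```python
def partition_path(row, partition_cols):
    """Build the nested folder path for one row, in partitionBy column order."""
    return "/".join(f"{col}={row[col]}" for col in partition_cols)

# Toy rows matching the example structure above
rows = [
    {"State": "CA", "City": "LosAngeles"},
    {"State": "CA", "City": "SanFrancisco"},
    {"State": "NY", "City": "NewYork"},
    {"State": "NY", "City": "Buffalo"},
]

# One distinct path per (State, City) combination
paths = sorted({partition_path(r, ["State", "City"]) for r in rows})
for p in paths:
    print(p)

# Count the distinct top-level folders
top_level = {p.split("/")[0] for p in paths}
print(len(top_level), "top-level folders,", len(paths), "leaf folders")
```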

### Example: Partitioning Orders Data

```python
# Partition by country for better query performance
orders_df.write \
    .format("parquet") \
    .mode("overwrite") \
    .partitionBy("country") \
    .save("path/to/partitioned_orders")

# Query with partition pruning (only scans relevant partition)
spark.read.parquet("path/to/partitioned_orders") \
    .filter("country = 'USA'") \
    .show()  # Only scans country=USA/ folder
```


In [7]:
# Example: Partitioning Data

# Create sample orders data
orders_data = [
    ("INV001", "USA", "CA", 100.0),
    ("INV002", "USA", "NY", 200.0),
    ("INV003", "UK", "London", 150.0),
    ("INV004", "USA", "CA", 300.0),
    ("INV005", "UK", "Manchester", 250.0),
]

orders_df = spark.createDataFrame(orders_data, ["invoice_no", "country", "city", "amount"])

print("Original DataFrame:")
orders_df.show()

print("\n" + "="*70)
print("Writing with partitionBy('country'):")
print("="*70)

# Write with partitioning by country
orders_df.write \
    .format("parquet") \
    .mode("overwrite") \
    .partitionBy("country") \
    .save("data/partitioned_orders")

print("\nData written with partitioning by 'country'")
print("Folder structure:")
print("  data/partitioned_orders/")
print("    ├── country=USA/")
print("    └── country=UK/")
print("\nWhen querying with filter on 'country', only relevant partition is scanned!")


Original DataFrame:
+----------+-------+----------+------+
|invoice_no|country|      city|amount|
+----------+-------+----------+------+
|    INV001|    USA|        CA| 100.0|
|    INV002|    USA|        NY| 200.0|
|    INV003|     UK|    London| 150.0|
|    INV004|    USA|        CA| 300.0|
|    INV005|     UK|Manchester| 250.0|
+----------+-------+----------+------+


Writing with partitionBy('country'):





Data written with partitioning by 'country'
Folder structure:
  data/partitioned_orders/
    ├── country=USA/
    └── country=UK/

When querying with filter on 'country', only relevant partition is scanned!


                                                                                

In [8]:
# Example: Multiple Column Partitioning

print("="*70)
print("Writing with partitionBy('country', 'city'):")
print("="*70)

# Write with partitioning by country and city (nested folders)
orders_df.write \
    .format("parquet") \
    .mode("overwrite") \
    .partitionBy("country", "city") \
    .save("data/partitioned_orders_nested")

print("\nData written with partitioning by 'country' and 'city'")
print("Nested folder structure:")
print("  data/partitioned_orders_nested/")
print("    ├── country=USA/")
print("    │   ├── city=CA/")
print("    │   └── city=NY/")
print("    └── country=UK/")
print("        ├── city=London/")
print("        └── city=Manchester/")
print("\nThis creates 2 levels of folders for efficient querying!")


Writing with partitionBy('country', 'city'):

Data written with partitioning by 'country' and 'city'
Nested folder structure:
  data/partitioned_orders_nested/
    ├── country=USA/
    │   ├── city=CA/
    │   └── city=NY/
    └── country=UK/
        ├── city=London/
        └── city=Manchester/

This creates 2 levels of folders for efficient querying!


## Bucketing with bucketBy

**Bucketing** is a data organization technique used when there are a **large number of distinct values (High Cardinality)**, making it a better choice over Partitioning in such cases.

### When to Use Bucketing vs Partitioning

- **Partitioning**: Best for **low cardinality** columns (few distinct values)
- **Bucketing**: Best for **high cardinality** columns (many distinct values)

### How Bucketing Works

In case of bucketing:
- The **number of buckets** and the **bucketing column** have to be defined upfront
- Passed as parameters to the `bucketBy` clause
- Based on a **hash function**, the records will be moved to different buckets/files

### Benefits of Bucketing

Bucketing helps in 2 ways:

1. **Skipping the irrelevant data**: When querying on the bucketed column, only the relevant bucket(s) need to be scanned
2. **Join Optimizations**: Pre-bucketed tables can join faster without shuffling data
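
The join benefit comes from the fact that equal keys always land in the same bucket number on both sides. The sketch below illustrates the idea in plain Python with a simplified modulo assignment; the tables, `NUM_BUCKETS`, and `bucketize` are hypothetical and this is not how Spark executes the join internally.

```python
NUM_BUCKETS = 4

def bucketize(rows, key_index):
    """Group rows into NUM_BUCKETS lists by key modulo NUM_BUCKETS (simplified hash)."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[row[key_index] % NUM_BUCKETS].append(row)
    return buckets

# Toy tables keyed by customer_id; values are made up
customers = [(1, "Ann"), (2, "Bob"), (5, "Cid"), (8, "Dee")]
orders = [(1, 100.0), (5, 250.0), (5, 75.0), (8, 40.0), (3, 10.0)]

customer_buckets = bucketize(customers, 0)
order_buckets = bucketize(orders, 0)

# Matching keys share a bucket number, so bucket i only joins with bucket i
joined = []
for i in range(NUM_BUCKETS):
    names = dict(customer_buckets[i])
    for customer_id, amount in order_buckets[i]:
        if customer_id in names:
            joined.append((customer_id, names[customer_id], amount))

print(sorted(joined))
```

Because each bucket pair can be joined independently, no row has to be moved to another bucket at join time, which is the shuffle that pre-bucketing avoids.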

### Bucketing Syntax

```python
customer_final_df.write \
    .format("parquet") \
    .mode("overwrite") \
    .bucketBy(4, "customer_id") \
    .saveAsTable("database.customersnew")
```

**Important**: Bucketed data has to be written as a **Spark table** using `saveAsTable()`, because the bucket metadata is stored in the catalog. Calling `save()` together with `bucketBy` raises an error.

### Bucket Distribution

Each write task (one per DataFrame partition) writes up to one file per bucket, where the number of buckets is the value passed to `bucketBy`.

**Example**: If you specify `bucketBy(4, "customer_id")`, each DataFrame partition writes up to 4 bucket files.

### Query Performance with Bucketing

**Query on non-bucketed column:**
```python
spark.sql("SELECT * FROM customersnew WHERE customer_state='TX'")
# Scans all buckets - no performance gain
```

**Query on bucketed column:**
```python
spark.sql("SELECT * FROM customersnew WHERE customer_id=10")
# Only scans the relevant bucket - significant performance gain!
```

This results in **significant performance gains** because only the files belonging to the matching bucket are scanned to get the desired results.

### Storing and Retrieving Data with Bucketing

**Consider an example where the number of fixed buckets = 4**

**Storing the data:**
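- The hash function is shown here as a plain **modulo** for simplicity (since there are 4 buckets, modulo 4 is applied); Spark itself hashes the column value with Murmur3 and then takes the result modulo the number of buckets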
- Example: `1 % 4 = 1` → Bucket 1, `2 % 4 = 2` → Bucket 2, `3 % 4 = 3` → Bucket 3, `4 % 4 = 0` → Bucket 0

**Retrieving the data:**
- Hash function used is **modulo**
- Say you are required to retrieve record 9, then `9 % 4 = 1`
- This implies data is present in **Bucket 1**
- Only Bucket 1 is scanned and the rest are skipped
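
The store and retrieve steps above can be sketched in a few lines of plain Python. This uses the same simplified modulo assignment as the text; `bucket_for` and the record IDs 1 to 12 are assumptions made for illustration.

```python
NUM_BUCKETS = 4

def bucket_for(record_id, num_buckets=NUM_BUCKETS):
    """Simplified bucket assignment: plain modulo on an integer key."""
    return record_id % num_buckets

# Storing: place each record in its bucket
buckets = {b: [] for b in range(NUM_BUCKETS)}
for record_id in range(1, 13):
    buckets[bucket_for(record_id)].append(record_id)

# Retrieving record 9: compute its bucket and search only that bucket
target = 9
target_bucket = bucket_for(target)
found = target in buckets[target_bucket]

print("bucket for record 9:", target_bucket)
print("contents of that bucket:", buckets[target_bucket])
print("found:", found)
```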

### Combining Partitioning and Bucketing

**Important**: A combination of **Partitioning followed by Bucketing** is possible, where there will be two-level filtering:
1. First level: Partition pruning (folder level)
2. Second level: Bucket pruning (file level)

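**However**: **Bucketing followed by Partitioning is NOT possible** as a storage layout (this is about how the data is organized, not about the order of the `partitionBy` and `bucketBy` calls). Because: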
- Partitioning results in **folders**
- Bucketing results in **files**
- Having files inside a folder is possible, but we cannot have a folder inside a file

**Example of Partitioning + Bucketing:**
```python
df.write \
    .format("parquet") \
    .mode("overwrite") \
    .partitionBy("country") \
    .bucketBy(10, "customer_id") \
    .saveAsTable("database.customers")
```

This creates:
- Folders by `country` (partitioning)
- Within each country folder, bucket files for 10 buckets by `customer_id` (bucketing)


In [9]:
# Example: Bucketing (requires Hive/Spark catalog support)
# Note: Bucketing requires saveAsTable() - cannot use save() with bucketBy

print("="*70)
print("Bucketing Example:")
print("="*70)
print("""
# Bucketing syntax (requires Hive/Spark catalog):
customer_final_df.write \\
    .format("parquet") \\
    .mode("overwrite") \\
    .bucketBy(4, "customer_id") \\
    .saveAsTable("database.customersnew")

# Query on bucketed column (only scans relevant bucket):
spark.sql("SELECT * FROM database.customersnew WHERE customer_id=10")
# This scans only 1 bucket out of 4 - significant performance gain!

# Query on non-bucketed column (scans all buckets):
spark.sql("SELECT * FROM database.customersnew WHERE customer_state='TX'")
# This scans all 4 buckets - no performance benefit
""")

print("\n" + "="*70)
print("Key Points:")
print("="*70)
print("1. Bucketing uses hash function (modulo) to distribute data")
print("2. Each partition contains files equal to number of buckets")
print("3. Querying on bucketed column = only relevant bucket scanned")
print("4. Bucketing requires saveAsTable() - cannot use save()")
print("5. Partitioning + Bucketing is possible (folders + files)")
print("6. Bucketing + Partitioning is NOT possible (cannot have folders in files)")


Bucketing Example:

# Bucketing syntax (requires Hive/Spark catalog):
customer_final_df.write \
    .format("parquet") \
    .mode("overwrite") \
    .bucketBy(4, "customer_id") \
    .saveAsTable("database.customersnew")

# Query on bucketed column (only scans relevant bucket):
spark.sql("SELECT * FROM database.customersnew WHERE customer_id=10")
# This scans only 1 bucket out of 4 - significant performance gain!

# Query on non-bucketed column (scans all buckets):
spark.sql("SELECT * FROM database.customersnew WHERE customer_state='TX'")
# This scans all 4 buckets - no performance benefit


Key Points:
1. Bucketing uses hash function (modulo) to distribute data
2. Each partition contains files equal to number of buckets
3. Querying on bucketed column = only relevant bucket scanned
4. Bucketing requires saveAsTable() - cannot use save()
5. Partitioning + Bucketing is possible (folders + files)
6. Bucketing + Partitioning is NOT possible (cannot have folders in files)


## Serialization

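**Serialization** converts objects to bytes so that Spark can send them between nodes, shuffle them, or cache them in serialized form. Two main options:

1. **Java Serialization** (default): Works with any `Serializable` class, but is slower and produces larger output
2. **Kryo Serialization**: Faster and more compact; registering classes is optional but recommended for the best results

**Best Practice**: Consider Kryo serialization for better performance, especially in shuffle-heavy RDD workloads.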


In [10]:
# Example: Configure Kryo serialization (typically done in SparkSession config)
print("Kryo serialization configuration (example):")
print("""
spark = SparkSession.builder \\
    .appName("Optimized App") \\
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \\
    .config("spark.kryo.registrationRequired", "false") \\
    .getOrCreate()
""")
print("\nKryo is usually faster than Java serialization; class registration is optional but recommended")
print("For most use cases, default serialization is fine")


Kryo serialization configuration (example):

spark = SparkSession.builder \
    .appName("Optimized App") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryo.registrationRequired", "false") \
    .getOrCreate()


Kryo is usually faster than Java serialization; class registration is optional but recommended
For most use cases, default serialization is fine



## Performance Best Practices Summary

1. **Minimize Shuffles**: 
   - Filter early
   - Use broadcast joins for small tables
   - Avoid unnecessary repartitions

2. **Optimize Partitions**:
   - Aim for 128MB-200MB per partition
   - Use coalesce to reduce partitions (no shuffle)
   - Use repartition when increasing partitions

3. **Cache Strategically**:
   - Cache DataFrames used multiple times
   - Unpersist when done
   - Choose appropriate storage level

4. **Use Columnar Formats**:
   - Prefer Parquet over CSV
   - Use partitioning for large datasets

5. **Broadcast Small Tables**:
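   - Use `broadcast()` for tables small enough to fit comfortably in each executor's memory (Spark broadcasts automatically below `spark.sql.autoBroadcastJoinThreshold`, 10 MB by default)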
   - Avoids shuffling large tables

6. **Filter Early**:
   - Apply filters as early as possible
   - Reduces data size for subsequent operations
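
The partition-size guideline in item 2 translates into simple arithmetic. The helper below is a rough sketch in plain Python; `target_partitions` is a hypothetical name, and the 10 GiB dataset size is an assumed example value.

```python
import math

def target_partitions(total_bytes, target_partition_bytes=128 * 1024**2):
    """Number of partitions needed so each holds about target_partition_bytes."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))

ten_gib = 10 * 1024**3                             # assumed dataset size
n_128 = target_partitions(ten_gib)                 # 128 MiB per partition
n_200 = target_partitions(ten_gib, 200 * 1024**2)  # 200 MiB per partition

print("partitions at 128 MiB each:", n_128)
print("partitions at 200 MiB each:", n_200)
```

The resulting number can then be passed to `df.repartition(n)` when increasing the partition count, or to `df.coalesce(n)` when reducing it.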


## Summary

In this notebook, you learned:

1. **Repartitioning**: Changing number of partitions (causes shuffle)
2. **Coalesce**: Reducing partitions efficiently (no shuffle)
3. **Caching**: Storing DataFrames in memory/disk to avoid recomputation
4. **Storage Levels**: Different caching strategies (MEMORY_ONLY, DISK_ONLY, etc.)
5. **Partitioning and Bucketing**: Organizing data on disk for partition pruning and optimized joins
6. **Serialization**: Kryo vs Java serialization
7. **Best Practices**: Key optimization strategies

**Key Takeaway**: Optimization is about minimizing shuffles, using appropriate partitions, caching strategically, and choosing the right data formats. Always measure performance before and after optimizations.

**Next Steps**: In Module 9, we'll learn how to use Spark UI to debug performance bottlenecks.
