# Understanding Partitions: ADLS File Partitions vs Spark Executor Partitions

## Learning Objectives

By the end of this notebook, you will understand:

1. **The difference** between file partitions in storage (ADLS) and partitions in Spark executors
2. **Why idle partitions occur** and how they waste resources
3. **How to diagnose** partition-related performance issues
4. **Best practices** for optimizing partition counts
5. **Common mistakes** and how to avoid them

## Prerequisites

- Basic understanding of Spark architecture (executors, cores, tasks)
- Familiarity with reading data from cloud storage (ADLS, S3, etc.)
- Understanding of distributed computing concepts


## Step 1: Understanding Your Configuration

Before we dive into partitions, let's understand the setup we're working with.

### Cluster Configuration

```python
# Your Spark cluster setup
num_executors = 4
cores_per_executor = 4
total_cores = num_executors * cores_per_executor  # 16 cores total
```

**What this means:**
- You have **4 executors** (JVM processes running on worker nodes)
- Each executor has **4 CPU cores** available
- Total available parallelism: **16 cores**

**Common Misconception:**
> "I have 16 cores, so I have 16-way parallelism"

**Reality:** This is only true if you have **at least 16 partitions**. Having 16 cores doesn't automatically give you 16-way parallelism - you need 16 tasks (one per partition) to utilize all cores.
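The cap can be written as one line of arithmetic: the number of tasks running at once is limited by both the core count and the partition count. A minimal sketch, assuming Spark's default of one core per task; the helper name `effective_parallelism` is illustrative and not a Spark API:

```python
def effective_parallelism(num_partitions: int, total_cores: int) -> int:
    """Upper bound on tasks running at the same time within one stage.

    Assumes each task occupies exactly one core (spark.task.cpus = 1).
    Illustrative helper, not part of Spark.
    """
    return min(num_partitions, total_cores)


total_cores = 4 * 4  # 4 executors x 4 cores, as in the setup above
for partitions in (4, 16, 32):
    running = effective_parallelism(partitions, total_cores)
    print(f"{partitions:>2} partitions -> {running} tasks running at once")
```

With 4 partitions only 4 tasks can run, no matter how many cores are available; beyond 16 partitions the cores become the limit and the extra tasks queue.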

### Data Storage Configuration (ADLS)

```python
# Your data layout in ADLS
data_path = "abfss://container@storageaccount.dfs.core.windows.net/sales/"

# Files in the sales directory:
# sales/
#   ├── part-00000.parquet (256 GB)
#   ├── part-00001.parquet (256 GB)
#   ├── part-00002.parquet (256 GB)
#   └── part-00003.parquet (256 GB)
```

**What this means:**
- You have **4 large Parquet files** in your storage
- Each file is **256 GB** (1 TB total)
- These are **file partitions** - a way to organize data in storage

**Important Distinction:**
- **File partitions** (storage level) ≠ **Spark partitions** (compute level)
- The number of files does NOT automatically equal the number of Spark partitions
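To make the distinction concrete, the sketch below approximates how Spark 3.x sizes input splits for splittable file sources. It is a simplified pure-Python model written for illustration, not Spark code: the function name `estimate_input_partitions` is hypothetical, the defaults mirror `spark.sql.files.maxPartitionBytes` (128 MB) and `spark.sql.files.openCostInBytes` (4 MB), and it ignores Parquet row-group boundaries, which decide whether a planned split actually receives any rows.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def estimate_input_partitions(
    file_sizes,
    default_parallelism,
    max_partition_bytes=128 * MB,  # spark.sql.files.maxPartitionBytes default
    open_cost_bytes=4 * MB,        # spark.sql.files.openCostInBytes default
):
    """Approximate the number of scan partitions planned for splittable files."""
    # Split size: capped by max_partition_bytes, floored by the open cost.
    padded_total = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = padded_total // default_parallelism
    max_split = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))

    # Cut every file into chunks of at most max_split bytes.
    chunks = []
    for size in file_sizes:
        full, remainder = divmod(size, max_split)
        chunks.extend([max_split] * full)
        if remainder:
            chunks.append(remainder)

    # Pack chunks, largest first, closing a partition once it would overflow.
    partitions, current, open_files = 0, 0, 0
    for chunk in sorted(chunks, reverse=True):
        if open_files and current + chunk > max_split:
            partitions += 1
            current, open_files = 0, 0
        current += chunk + open_cost_bytes
        open_files += 1
    return partitions + (1 if open_files else 0)


# Four 256 GB files on a 16-core cluster (the storage layout above)
print(estimate_input_partitions([256 * GB] * 4, default_parallelism=16))
# Four files of roughly 340 KB on 11 cores (an assumed small-file case)
print(estimate_input_partitions([340_000] * 4, default_parallelism=11))
```

Under these assumptions the first call returns 8192 and the second returns 4, so file count alone predicts the partition count only when files are small relative to the split size or cannot be split.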


## Step 2: What We're Trying to Achieve

### Goal: Maximum Resource Utilization

**What we want:**
- All 16 cores working simultaneously
- No idle executors or cores
- Efficient data processing with minimal waste

**Why this matters:**
- **Cost efficiency**: You're paying for all 16 cores - use them!
- **Performance**: More parallelism = faster job completion
- **Scalability**: Understanding this helps you scale properly

### The Ideal Scenario

```
Executor 1: [Task 1] [Task 2] [Task 3] [Task 4] → All 4 cores busy
Executor 2: [Task 5] [Task 6] [Task 7] [Task 8] → All 4 cores busy
Executor 3: [Task 9] [Task 10] [Task 11] [Task 12] → All 4 cores busy
Executor 4: [Task 13] [Task 14] [Task 15] [Task 16] → All 4 cores busy
```

**Key Principle:**
> **Number of partitions should be ≥ number of cores (ideally 2-4× for better load balancing)**
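As a quick worked example of this principle, the helper below turns a core count into the suggested partition range. The name `recommended_partition_range` and its parameters are illustrative; the 2 and 4 multipliers come from the rule of thumb above, not from any Spark setting.

```python
def recommended_partition_range(total_cores: int, low: int = 2, high: int = 4):
    """Return (minimum, maximum) partition counts for the 2-4x rule of thumb."""
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return total_cores * low, total_cores * high


lo, hi = recommended_partition_range(16)
print(f"16 cores -> aim for {lo} to {hi} partitions")
```

For the 16-core cluster in this notebook that gives a range of 32 to 64 partitions.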


## Step 3: Common Mistake - What Happens When You Read Data Naively

### The Problematic Code

```python
# This is what many beginners do (and it's WRONG for our scenario)
sales_df = spark.read.parquet("abfss://container@storageaccount.dfs.core.windows.net/sales/")
```

### What Spark Actually Does

When you read data, Spark follows this rule:

> **One input partition = one task = one core**

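**What Spark sees:**
- 4 Parquet files in the directory
- Parquet is splittable at row-group boundaries, and by default Spark plans input splits of up to `spark.sql.files.maxPartitionBytes` (128 MB), so files this large would normally produce far more than 4 partitions
- The four-way parallelism shown here occurs when each file effectively cannot be split (for example, a non-splittable format such as gzip-compressed CSV, or a Parquet file written as one giant row group, where only one task per file receives data), when the split size is configured at or above the file size, or when the files are small, as in the demo later in this notebook
- In that situation Spark creates: **4 partitions** → **4 tasks**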

**Visual Representation:**

```
Spark's View:
├── Partition 0 (from part-00000.parquet) → Task 0
├── Partition 1 (from part-00001.parquet) → Task 1
├── Partition 2 (from part-00002.parquet) → Task 2
└── Partition 3 (from part-00003.parquet) → Task 3
```

### The Painful Reality: Executor-Level Task Assignment

**How Spark distributes tasks:**

```
Executor 1:
  ├── Task 0 (processing part-00000.parquet, 256 GB)
  └── Cores: [●] [○] [○] [○]  → 1 core used, 3 cores IDLE

Executor 2:
  ├── Task 1 (processing part-00001.parquet, 256 GB)
  └── Cores: [●] [○] [○] [○]  → 1 core used, 3 cores IDLE

Executor 3:
  ├── Task 2 (processing part-00002.parquet, 256 GB)
  └── Cores: [●] [○] [○] [○]  → 1 core used, 3 cores IDLE

Executor 4:
  ├── Task 3 (processing part-00003.parquet, 256 GB)
  └── Cores: [●] [○] [○] [○]  → 1 core used, 3 cores IDLE
```

**Resource Utilization:**
- **Used cores:** 4 out of 16
- **Idle cores:** 12 out of 16
- **Utilization:** 25% (75% waste!)
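The same numbers can be reproduced with a small model of task placement. This is a sketch under a stated assumption: tasks are dealt out round-robin across executors and each task uses one core. Real Spark scheduling also considers data locality and timing, so several tasks may land on one executor; `spread_tasks` is a hypothetical helper, not a Spark function.

```python
def spread_tasks(num_tasks, num_executors, cores_per_executor):
    """Return the number of busy cores on each executor.

    Assumes round-robin placement and one core per task.
    Tasks beyond the total slot count wait in the queue.
    """
    busy = [0] * num_executors
    slots = num_executors * cores_per_executor
    for task in range(min(num_tasks, slots)):
        busy[task % num_executors] += 1
    return busy


busy = spread_tasks(num_tasks=4, num_executors=4, cores_per_executor=4)
used = sum(busy)
total = 4 * 4
print(f"Busy cores per executor: {busy}")
print(f"Used {used}/{total} cores, utilization {used / total:.0%}")
```

Calling it with 32 tasks instead of 4 fills every core on every executor, which is the situation the later steps aim for.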

**This is not "suboptimal" - this is embarrassingly bad for production!**


## Step 4: Why Spark Behaves This Way (Understanding the Fundamentals)

### Key Spark Rules (No Myths)

**Spark does NOT:**
- ❌ Split one task across multiple cores
- ❌ Parallelize processing within a single partition
- ❌ Allow threads to cooperate on the same partition

**Why?**
- A task is Spark's smallest unit of scheduling, and each task processes exactly **one partition** on **one thread** (one core, with the default `spark.task.cpus=1`)
- Threads don't cooperate on the same partition, so no locking or coordination is needed inside a task
- Keeping tasks independent makes retries and results consistent and predictable

### Common Misconception

**Wrong belief:**
> "My executor has 4 cores, so it will process the 256 GB partition 4 times faster"

**Reality:**
> "An executor with 4 cores can process **4 partitions in parallel**, not one partition faster"

**Think of it this way:**
- Each core is like a worker
- Each partition is like a job
- One worker can only do one job at a time
- To use 4 workers, you need 4 jobs (partitions)
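The worker analogy can be simulated directly. The sketch below models one stage in which each free core picks up the next pending partition, and it assumes processing time is proportional to partition size with no I/O or shuffle overhead. `stage_duration` is a hypothetical helper, and the costs are abstract time units rather than measurements.

```python
import heapq


def stage_duration(partition_costs, num_cores):
    """Simulate one stage: each core takes the next pending partition when free.

    partition_costs are abstract time units (for example, GB to scan).
    Returns the time at which the last partition finishes.
    """
    cores = [0.0] * num_cores  # time at which each core becomes free
    heapq.heapify(cores)
    for cost in partition_costs:
        free_at = heapq.heappop(cores)         # earliest available core
        heapq.heappush(cores, free_at + cost)  # that core is busy until then
    return max(cores)


one_big = stage_duration([256], num_cores=4)                # 1 partition
four_small = stage_duration([64, 64, 64, 64], num_cores=4)  # 4 partitions
print(f"1 x 256 units on 4 cores: {one_big:.0f} time units")
print(f"4 x 64 units on 4 cores:  {four_small:.0f} time units")
```

In this model the single large partition takes as long on 4 cores as it would on 1, while the same data split four ways finishes in a quarter of the time.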

### Why Spark Doesn't Auto-Fix This

**You might ask:** "Why doesn't Spark just split the data better automatically?"

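**Answer:** Spark plans the scan from the input layout and its split settings, and goes no further on its own:

1. **File layout matters**: the scan is planned from the files as written; splittable files are cut at most every `spark.sql.files.maxPartitionBytes` (128 MB by default), and a file that cannot be split is read whole by one task
2. **Existing partitioning**: Spark does not second-guess how the writer laid out the data
3. **Avoids accidental shuffles**: Spark won't insert a `repartition` for you, because a shuffle is expensive
4. **Assumption**: "If you wrote 4 huge files, you probably meant it"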

**That assumption is often wrong - but Spark won't guess. You need to tell it explicitly.**


## Practical Demonstration: Creating Sample Data Files

Before we demonstrate the partition concept, let's create 4 Parquet files in the data folder to simulate the real-world scenario.


In [9]:
# Step 1: Create 4 Parquet Files to Simulate Real-World Scenario
# This mimics having 4 large files in ADLS (like part-00000.parquet, part-00001.parquet, etc.)

import os
import glob
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType
from pyspark.sql.functions import col, lit
from datetime import date, timedelta

print("=" * 70)
print("CREATING 4 PARQUET FILES TO DEMONSTRATE PARTITION CONCEPT")
print("=" * 70)

# Define the output directory
output_dir = "data/sales_demo"

# Clean up existing directory if it exists
import shutil
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)
    print(f"Cleaned up existing directory: {output_dir}")

# Create sample sales data
# Each file will represent a different region with substantial data
regions = ["North", "South", "East", "West"]
records_per_file = 50000  # Enough data to see the effect, but manageable

# Collect all data first
all_dataframes = []

for i, region in enumerate(regions):
    print(f"\nCreating data for {region} region (File {i+1}/4)...")
    
    # Generate sample sales data
    data = []
    base_date = date(2023, 1, 1)
    
    for j in range(records_per_file):
        data.append(Row(
            sale_id=i * records_per_file + j,
            region=region,
            product_id=f"PROD_{j % 1000:04d}",
            customer_id=f"CUST_{j % 5000:05d}",
            sale_amount=round(100.0 + (j % 1000) * 0.5, 2),
            sale_date=base_date + timedelta(days=j % 365),
            quantity=(j % 10) + 1
        ))
    
    # Create DataFrame
    df = spark.createDataFrame(data)
    all_dataframes.append(df)
    
    print(f"  ✓ Created DataFrame for {region} region")
    print(f"  ✓ Records: {records_per_file:,}")

# Combine all dataframes
print(f"\nCombining all regions into a single dataset...")
combined_df = all_dataframes[0]
for df in all_dataframes[1:]:
    combined_df = combined_df.union(df)

# Write to parquet with exactly 4 partitions (4 files)
# This simulates having exactly 4 files in storage
print(f"\nWriting to parquet with 4 partitions (4 files)...")
combined_df.coalesce(4).write.mode("overwrite").parquet(output_dir)

# Verify the files were created
parquet_files = glob.glob(f"{output_dir}/*.parquet")
if not parquet_files:
    # Sometimes files are in subdirectories
    parquet_files = glob.glob(f"{output_dir}/**/*.parquet", recursive=True)

total_size = sum(os.path.getsize(f) for f in parquet_files) if parquet_files else 0
file_count = len(parquet_files)

print("\n" + "=" * 70)
print("✅ All 4 Parquet files created successfully!")
print(f"📁 Location: {output_dir}/")
print(f"📊 Number of parquet files: {file_count}")
if total_size > 0:
    print(f"💾 Total size: {total_size / (1024*1024):.2f} MB")
print("=" * 70)


CREATING 4 PARQUET FILES TO DEMONSTRATE PARTITION CONCEPT
Cleaned up existing directory: data/sales_demo

Creating data for North region (File 1/4)...
  ✓ Created DataFrame for North region
  ✓ Records: 50,000

Creating data for South region (File 2/4)...
  ✓ Created DataFrame for South region
  ✓ Records: 50,000

Creating data for East region (File 3/4)...
  ✓ Created DataFrame for East region
  ✓ Records: 50,000

Creating data for West region (File 4/4)...
  ✓ Created DataFrame for West region
  ✓ Records: 50,000

Combining all regions into a single dataset...

Writing to parquet with 4 partitions (4 files)...


26/01/02 20:42:18 WARN TaskSetManager: Stage 0 contains a task of very large size (2304 KiB). The maximum recommended task size is 1000 KiB.
[Stage 0:>                                                          (0 + 4) / 4]


✅ All 4 Parquet files created successfully!
📁 Location: data/sales_demo/
📊 Number of parquet files: 4
💾 Total size: 1.30 MB


                                                                                


In [10]:
# Practical Demonstration: Reading 4 Parquet Files and Diagnosing Partition Issues

import os
import multiprocessing
import glob
import time

print("=" * 70)
print("STEP 1: Understanding Your System Configuration")
print("=" * 70)

# Get actual CPU cores on your system
physical_cores = multiprocessing.cpu_count()
print(f"\n🖥️  Physical CPU Cores on Your System: {physical_cores}")

# Get Spark's default parallelism (usually equals total cores available to Spark)
default_parallelism = spark.sparkContext.defaultParallelism
print(f"⚙️  Spark Default Parallelism: {default_parallelism}")

# Try to get executor info
num_executors = 1
cores_per_executor = default_parallelism
try:
    # Try different methods to get executor info depending on Spark version
    status_tracker = spark.sparkContext.statusTracker()
    if hasattr(status_tracker, 'getExecutorInfos'):
        executors = status_tracker.getExecutorInfos()
        num_executors = len(executors)
        print(f"📦 Number of Executors: {num_executors}")
        if executors:
            cores_per_executor = executors[0].totalCores
            print(f"🔧 Cores per Executor: {cores_per_executor}")
            total_spark_cores = num_executors * cores_per_executor
            print(f"📊 Total Spark Cores: {total_spark_cores}")
    else:
        raise AttributeError("getExecutorInfos not available")
except Exception as e:
    print(f"ℹ️  Executor info not available: {e}")
    print(f"ℹ️  Using default parallelism ({default_parallelism}) as total available cores")
    print(f"ℹ️  Assuming 1 executor with {default_parallelism} cores (local mode)")

print("\n" + "=" * 70)
print("STEP 2: Reading 4 Parquet Files (The Problem)")
print("=" * 70)

# Read the 4 parquet files we created
# Spark reads every Parquet file under this directory
# This simulates reading from ADLS: spark.read.parquet("abfss://.../sales/")
sales_df = spark.read.parquet("data/sales_demo/")

# Check how many partitions Spark created
num_partitions = sales_df.rdd.getNumPartitions()
total_records = sales_df.count()

print(f"\n📁 Files Read: 4 parquet files from data/sales_demo/")
print(f"📊 Total Records: {total_records:,}")
print(f"🔢 Spark Partitions Created: {num_partitions}")
print(f"⚙️  Available Cores: {default_parallelism}")

# Diagnosis
print("\n" + "-" * 70)
print("DIAGNOSIS:")
print("-" * 70)

if num_partitions < default_parallelism:
    waste_percentage = (1 - num_partitions / default_parallelism) * 100
    idle_cores = default_parallelism - num_partitions
    utilization = (num_partitions / default_parallelism) * 100
    
    print(f"⚠️  PROBLEM DETECTED!")
    print(f"   • You have {num_partitions} partitions but {default_parallelism} cores")
    print(f"   • {idle_cores} cores will be IDLE (doing nothing)")
    print(f"   • Resource utilization: {utilization:.1f}%")
    print(f"   • Waste: {waste_percentage:.1f}% of your compute resources")
    print(f"\n   This means:")
    print(f"   • Only {num_partitions} tasks will run in parallel")
    print(f"   • {idle_cores} cores will sit idle, wasting resources")
    print(f"   • Your job will run much slower than it could")
else:
    print("✅ Partition count looks good!")

print("\n" + "=" * 70)
print("STEP 3: Visualizing the Problem")
print("=" * 70)

print(f"\nCurrent Situation:")
print(f"  Files in storage: 4")
print(f"  Spark partitions: {num_partitions}")
print(f"  Available cores: {default_parallelism}")
print(f"  Number of executors: {num_executors}")
print(f"  Cores per executor: {cores_per_executor}")

# Visualize task distribution based on actual configuration
if num_executors == 1:
    # Local mode - single executor
    print(f"\n  Task Distribution (Local Mode - 1 executor with {cores_per_executor} cores):")
    task_visual = " ".join([f"[Task {i+1}]" for i in range(min(num_partitions, 4))])
    idle_visual = " ".join(["[○]"] * max(0, min(cores_per_executor - num_partitions, 4)))
    print(f"    Executor 1: {task_visual} {idle_visual}")
    if cores_per_executor > 4:
        print(f"                ... ({num_partitions} tasks total, {default_parallelism - num_partitions} cores idle)")
    else:
        print(f"                ↑ Only {num_partitions} tasks, {default_parallelism - num_partitions} cores idle")
else:
    # Cluster mode - multiple executors
    tasks_per_executor_visual = max(1, num_partitions // num_executors)
    print(f"\n  Task Distribution ({num_executors} executors with {cores_per_executor} cores each):")
    for i in range(min(num_executors, num_partitions)):
        idle_cores_vis = max(0, cores_per_executor - 1)
        idle_dots = "[○] " * min(idle_cores_vis, 3)
        if idle_cores_vis > 3:
            idle_dots += f"... ({idle_cores_vis} total idle)"
        print(f"    Executor {i+1}: [Task {i+1}] {idle_dots}→ 1 core used, {idle_cores_vis} idle")
    if num_partitions < num_executors:
        print(f"    (Only {num_partitions} tasks for {num_executors} executors)")
print(f"    Total: {num_partitions} cores busy, {default_parallelism - num_partitions} cores idle")

print("\n" + "=" * 70)
print("STEP 4: The Solution - Optimizing with Repartition")
print("=" * 70)

# Calculate optimal partition count (2-4× core count)
optimal_partitions = default_parallelism * 2
print(f"\n🎯 Target Partitions: {optimal_partitions} (2× your {default_parallelism} cores)")

# Repartition the data
print(f"\nRepartitioning from {num_partitions} to {optimal_partitions} partitions...")
sales_df_optimized = sales_df.repartition(optimal_partitions)

# Verify
optimized_partitions = sales_df_optimized.rdd.getNumPartitions()
optimized_records = sales_df_optimized.count()

print(f"\n✅ Repartitioning Complete!")
print(f"   • New partition count: {optimized_partitions}")
print(f"   • Records preserved: {optimized_records:,} (same as before)")
print(f"   • Available cores: {default_parallelism}")

# Show improvement (utilization cannot exceed 100%; extra partitions queue as tasks)
utilization_after = min(100.0, (optimized_partitions / default_parallelism) * 100)
print(f"\n📈 Improvement:")
print(f"   • Resource utilization: {utilization_after:.1f}%")
print(f"   • All {default_parallelism} cores can now be utilized")
print(f"   • Tasks will be queued for smooth execution")

print(f"\n  Optimized Task Distribution:")
if num_executors == 1:
    # Local mode
    tasks_per_executor = optimized_partitions
    print(f"    Executor 1: [{tasks_per_executor} tasks] → All {cores_per_executor} cores busy + queued tasks")
    print(f"    Total: {optimized_partitions} tasks → All {default_parallelism} cores utilized!")
else:
    # Cluster mode
    tasks_per_executor = optimized_partitions // num_executors
    for i in range(num_executors):
        print(f"    Executor {i+1}: [{tasks_per_executor} tasks] → All {cores_per_executor} cores busy + queued tasks")
    print(f"    Total: {optimized_partitions} tasks → All {default_parallelism} cores utilized!")

print("\n" + "=" * 70)
print("STEP 5: Performance Comparison")
print("=" * 70)

# Show a simple operation to demonstrate the difference
print("\nRunning a simple aggregation to show the difference...")

# Without optimization
print(f"\n⏱️  Without optimization ({num_partitions} partitions):")
start = time.time()
result_bad = sales_df.groupBy("region").agg({"sale_amount": "sum"}).collect()
time_bad = time.time() - start
print(f"   Time taken: {time_bad:.2f} seconds")

# With optimization
print(f"\n⏱️  With optimization ({optimal_partitions} partitions):")
start = time.time()
result_good = sales_df_optimized.groupBy("region").agg({"sale_amount": "sum"}).collect()
time_good = time.time() - start
print(f"   Time taken: {time_good:.2f} seconds")

if time_bad > 0 and time_good > 0:
    speedup = time_bad / time_good
    print(f"\n📊 Speedup: {speedup:.2f}× faster with optimized partitions")
    print(f"   (Note: Speedup may vary based on data size and cluster configuration)")
    if speedup < 1:
        print(f"   ℹ️  For small datasets, overhead of repartitioning may outweigh benefits")
        print(f"   ℹ️  Benefits are more pronounced with larger datasets and more cores")

print("\n" + "=" * 70)
print("✅ DEMONSTRATION COMPLETE!")
print("=" * 70)
print("\nKey Takeaway:")
print(f"  • Started with {num_partitions} partitions (from 4 files)")
print(f"  • Optimized to {optimized_partitions} partitions")
print(f"  • Now utilizing all {default_parallelism} cores efficiently!")
print("=" * 70)



STEP 1: Understanding Your System Configuration

🖥️  Physical CPU Cores on Your System: 11
⚙️  Spark Default Parallelism: 11
ℹ️  Executor info not available: getExecutorInfos not available
ℹ️  Using default parallelism (11) as total available cores
ℹ️  Assuming 1 executor with 11 cores (local mode)

STEP 2: Reading 4 Parquet Files (The Problem)

📁 Files Read: 4 parquet files from data/sales_demo/
📊 Total Records: 200,000
🔢 Spark Partitions Created: 4
⚙️  Available Cores: 11

----------------------------------------------------------------------
DIAGNOSIS:
----------------------------------------------------------------------
⚠️  PROBLEM DETECTED!
   • You have 4 partitions but 11 cores
   • 7 cores will be IDLE (doing nothing)
   • Resource utilization: 36.4%
   • Waste: 63.6% of your compute resources

   This means:
   • Only 4 tasks will run in parallel
   • 7 cores will sit idle, wasting resources
   • Your job will run much slowe

## Step 5: How to Diagnose Partition Issues

### Check Your Partition Count

**Always check this first:**

```python
# Read your data
sales_df = spark.read.parquet("abfss://container@storageaccount.dfs.core.windows.net/sales/")

# Check partition count
num_partitions = sales_df.rdd.getNumPartitions()
print(f"Number of partitions: {num_partitions}")

# Compare with available cores
total_cores = spark.sparkContext.defaultParallelism
print(f"Total available cores: {total_cores}")

# Diagnosis
if num_partitions < total_cores:
    print(f"⚠️  WARNING: You have {num_partitions} partitions but {total_cores} cores!")
    print(f"⚠️  You are wasting {total_cores - num_partitions} cores!")
else:
    print("✅ Partition count looks good!")
```

### Understanding the Output

For our scenario:
```
Number of partitions: 4
Total available cores: 16
⚠️  WARNING: You have 4 partitions but 16 cores!
⚠️  You are wasting 12 cores!
```

**Rule of thumb:**
- If `num_partitions < total_cores` → You're wasting resources
- Ideal: `num_partitions = 2-4 × total_cores` (for better load balancing and hiding I/O wait times)


## Step 6: The Right Way - Optimized Partition Strategy

### Option 1: Repartition After Read (Simple & Explicit)

**Recommended for beginners:**

```python
# Step 1: Read the data (creates 4 partitions from 4 files)
sales_df = spark.read.parquet("abfss://container@storageaccount.dfs.core.windows.net/sales/")

# Step 2: Explicitly repartition to match your cluster
# Target: 2-4× your core count for optimal performance
sales_df_optimized = sales_df.repartition(32)  # 2× your 16 cores

# Verify
print(f"Partitions after repartition: {sales_df_optimized.rdd.getNumPartitions()}")
```

**What happens:**
1. Spark reads 4 files → creates 4 initial partitions
2. `repartition(32)` triggers a **shuffle** to redistribute data into 32 partitions
3. Each partition is now ~32 GB instead of 256 GB
4. Spark creates 32 tasks → can utilize all 16 cores (with tasks queued)
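The arithmetic behind these steps can be checked with a short helper. It assumes the 1 TB dataset (1024 GB) is spread evenly across partitions, which `repartition` only approximates, and the name `repartition_summary` is illustrative. A "wave" here means one round in which every core runs a single task.

```python
import math


def repartition_summary(total_gb, num_partitions, total_cores):
    """Return (GB per partition, number of scheduling waves)."""
    gb_per_partition = total_gb / num_partitions
    waves = math.ceil(num_partitions / total_cores)
    return gb_per_partition, waves


for partitions in (4, 32, 64):
    size_gb, waves = repartition_summary(1024, partitions, 16)
    print(f"{partitions:>2} partitions: {size_gb:.0f} GB each, {waves} wave(s)")
```

With 32 partitions on 16 cores the stage runs in 2 waves of 32 GB tasks; 64 partitions give 4 waves of 16 GB tasks, which leaves more room to rebalance when some tasks run slowly.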

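### Option 2: Chain the Repartition onto the Read (More Concise)

**Same plan as Option 1, written as one expression.** Spark evaluates lazily, so chaining `.repartition(32)` does not skip the initial scan partitions; the data is still read with the input partitioning and then shuffled: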

```python
# Read and repartition in one step
sales_df = (
    spark.read.parquet("abfss://container@storageaccount.dfs.core.windows.net/sales/")
    .repartition(32)
)
```

### Optimized Task Distribution

**After repartitioning to 32 partitions:**

```
Executor 1:
  ├── Tasks: [1] [2] [3] [4] [5] [6] [7] [8]
  └── Cores: [●] [●] [●] [●]  → All 4 cores busy, 4 tasks queued

Executor 2:
  ├── Tasks: [9] [10] [11] [12] [13] [14] [15] [16]
  └── Cores: [●] [●] [●] [●]  → All 4 cores busy, 4 tasks queued

Executor 3:
  ├── Tasks: [17] [18] [19] [20] [21] [22] [23] [24]
  └── Cores: [●] [●] [●] [●]  → All 4 cores busy, 4 tasks queued

Executor 4:
  ├── Tasks: [25] [26] [27] [28] [29] [30] [31] [32]
  └── Cores: [●] [●] [●] [●]  → All 4 cores busy, 4 tasks queued
```

**Resource Utilization:**
- **Used cores:** 16 out of 16
- **Idle cores:** 0
- **Utilization:** 100% ✅
- **Task queue:** As soon as one task finishes, the next starts immediately

**This is what a healthy Spark job looks like!**


## Step 7: Understanding the Impact on Operations

### Example: Broadcast Join with Optimized Partitions

**Scenario:** Joining sales data (large) with product catalog (small)

```python
from pyspark.sql.functions import broadcast

# Small table (will be broadcast)
products_df = spark.read.parquet("abfss://.../products/")  # 100 MB

# Large table (our sales data)
sales_df = (
    spark.read.parquet("abfss://.../sales/")
    .repartition(32)  # Optimized partitions
)

# Broadcast join
result = sales_df.join(
    broadcast(products_df),
    on="product_id",
    how="inner"
)
```

### What Happens in the Optimized Setup

**Broadcast side (small table):**
- Still broadcast once per executor (no change)
- Memory impact unchanged

**Fact side (sales data):**
- Each executor processes **8 smaller partitions** (32 partitions ÷ 4 executors)
- Each partition is **~32 GB** instead of 256 GB
- **Benefits:**
  - ✅ Faster task completion (smaller chunks)
  - ✅ Better garbage collection behavior (less memory pressure)
  - ✅ Reduced spill risk (less data per task)
  - ✅ Better load balancing (if some partitions are slower, others compensate)
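The claim that the broadcast side is unaffected can be stated as arithmetic: the broadcast cost scales with the number of executors, not with the number of sales partitions. A rough sketch, assuming one copy of the 100 MB table per executor and ignoring the driver's copy and any in-memory expansion; `broadcast_footprint_mb` is a hypothetical helper.

```python
def broadcast_footprint_mb(table_mb, num_executors):
    """Total memory used by broadcast copies, one copy per executor."""
    return table_mb * num_executors


for partitions in (4, 32):
    # The partition count does not appear in the calculation at all.
    total = broadcast_footprint_mb(table_mb=100, num_executors=4)
    print(f"{partitions:>2} sales partitions -> {total} MB of broadcast copies")
```

Both layouts hold the same 400 MB of broadcast data across the cluster; only the per-task share of the sales data changes.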

### Visual Comparison

**❌ Bad Layout (4 partitions):**
```
4 partitions → 4 tasks → 16 cores → 12 idle
Each task processes 256 GB
Slow, inefficient, wasteful
```

**✅ Optimized Layout (32 partitions):**
```
32 partitions → 32 tasks → 16 cores → always busy
Each task processes ~32 GB
Fast, efficient, all resources utilized
```


---

# Practical Demonstration: Hands-On with Real Data

This section provides a hands-on demonstration using actual Parquet files. You'll:
1. Create 4 Parquet files to simulate real-world storage
2. Read them and observe partition behavior
3. **View Spark UI to see parallelism before repartitioning**
4. Repartition based on your machine's actual cores
5. **View Spark UI again to see improved parallelism**
6. Compare performance

## Initializing Spark Session

Before we start, let's create a Spark session with UI enabled.


In [11]:
spark.stop()

In [None]:
# Initialize Spark Session with UI enabled
from pyspark.sql import SparkSession
import multiprocessing

# Stop any existing Spark session
try:
    spark.stop()
except Exception:
    # No session to stop (for example, `spark` is not defined yet)
    pass

# Get actual CPU cores on your system
physical_cores = multiprocessing.cpu_count()
print(f"🖥️  Detected {physical_cores} CPU cores on your system")

# Create Spark session with UI enabled
spark = SparkSession.builder \
    .appName("PartitionOptimizationDemo") \
    .master("local[*]") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()

# Display Spark version and configuration
print("=" * 70)
print("SPARK SESSION INITIALIZED")
print("=" * 70)
print(f"Spark Version: {spark.version}")
print(f"Spark App Name: {spark.sparkContext.appName}")
print(f"Master: {spark.sparkContext.master}")
print(f"Default Parallelism: {spark.sparkContext.defaultParallelism}")

# Get Spark UI URL
ui_url = spark.sparkContext.uiWebUrl
print(f"\n🌐 Spark UI URL: {ui_url}")
print("\n💡 TIP: Open this URL in your browser to monitor job execution!")
print("   You'll see task distribution, parallelism, and resource utilization")
print("=" * 70)


🖥️  Detected 11 CPU cores on your system
SPARK SESSION INITIALIZED
Spark Version: 3.5.1
Spark App Name: PartitionOptimizationDemo
Master: local[*]
Default Parallelism: 11

🌐 Spark UI URL: http://192.168.1.4:4040

💡 TIP: Open this URL in your browser to monitor job execution!
   You'll see task distribution, parallelism, and resource utilization


## Step 1: Creating Sample Data Files

Let's create 4 Parquet files to simulate the real-world scenario of having 4 large files in storage.


In [13]:
# Step 1: Create 4 Parquet Files to Simulate Real-World Scenario
# This mimics having 4 large files in ADLS (like part-00000.parquet, part-00001.parquet, etc.)

import os
import glob
import shutil
from pyspark.sql import Row
from datetime import date, timedelta

print("=" * 70)
print("CREATING 4 PARQUET FILES TO DEMONSTRATE PARTITION CONCEPT")
print("=" * 70)

# Define the output directory
output_dir = "data/sales_demo"

# Clean up existing directory if it exists
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)
    print(f"Cleaned up existing directory: {output_dir}")

# Create sample sales data
# Each file will represent a different region with substantial data
regions = ["North", "South", "East", "West"]
records_per_file = 50000  # Enough data to see the effect, but manageable

# Collect all data first
all_dataframes = []

for i, region in enumerate(regions):
    print(f"\nCreating data for {region} region (File {i+1}/4)...")
    
    # Generate sample sales data
    data = []
    base_date = date(2023, 1, 1)
    
    for j in range(records_per_file):
        data.append(Row(
            sale_id=i * records_per_file + j,
            region=region,
            product_id=f"PROD_{j % 1000:04d}",
            customer_id=f"CUST_{j % 5000:05d}",
            sale_amount=round(100.0 + (j % 1000) * 0.5, 2),
            sale_date=base_date + timedelta(days=j % 365),
            quantity=(j % 10) + 1
        ))
    
    # Create DataFrame
    df = spark.createDataFrame(data)
    all_dataframes.append(df)
    
    print(f"  ✓ Created DataFrame for {region} region")
    print(f"  ✓ Records: {records_per_file:,}")

# Combine all dataframes
print(f"\nCombining all regions into a single dataset...")
combined_df = all_dataframes[0]
for df in all_dataframes[1:]:
    combined_df = combined_df.union(df)

# Write to parquet with exactly 4 partitions (4 files)
# This simulates having exactly 4 files in storage
print(f"\nWriting to parquet with 4 partitions (4 files)...")
combined_df.coalesce(4).write.mode("overwrite").parquet(output_dir)

# Verify the files were created
parquet_files = glob.glob(f"{output_dir}/*.parquet")
if not parquet_files:
    # Sometimes files are in subdirectories
    parquet_files = glob.glob(f"{output_dir}/**/*.parquet", recursive=True)

total_size = sum(os.path.getsize(f) for f in parquet_files) if parquet_files else 0
file_count = len(parquet_files)

print("\n" + "=" * 70)
print("✅ All 4 Parquet files created successfully!")
print(f"📁 Location: {output_dir}/")
print(f"📊 Number of parquet files: {file_count}")
if total_size > 0:
    print(f"💾 Total size: {total_size / (1024*1024):.2f} MB")
print("=" * 70)


CREATING 4 PARQUET FILES TO DEMONSTRATE PARTITION CONCEPT
Cleaned up existing directory: data/sales_demo

Creating data for North region (File 1/4)...
  ✓ Created DataFrame for North region
  ✓ Records: 50,000

Creating data for South region (File 2/4)...
  ✓ Created DataFrame for South region
  ✓ Records: 50,000

Creating data for East region (File 3/4)...
  ✓ Created DataFrame for East region
  ✓ Records: 50,000

Creating data for West region (File 4/4)...
  ✓ Created DataFrame for West region
  ✓ Records: 50,000

Combining all regions into a single dataset...

Writing to parquet with 4 partitions (4 files)...

26/01/02 20:49:56 WARN TaskSetManager: Stage 0 contains a task of very large size (2304 KiB). The maximum recommended task size is 1000 KiB.

✅ All 4 Parquet files created successfully!
📁 Location: data/sales_demo/
📊 Number of parquet files: 4
💾 Total size: 1.30 MB

## Step 2: Reading Data and Observing the Problem

Now let's read the data and see how Spark creates partitions. **This is where you should check Spark UI!**


In [14]:
# Step 2: Read the 4 Parquet Files and Diagnose the Problem

import multiprocessing
import time

print("=" * 70)
print("STEP 1: Understanding Your System Configuration")
print("=" * 70)

# Get the CPU count reported by the OS (logical cores, which may exceed physical cores)
physical_cores = multiprocessing.cpu_count()
print(f"\n🖥️  Physical CPU Cores on Your System: {physical_cores}")

# Get Spark's default parallelism (usually equals total cores available to Spark)
default_parallelism = spark.sparkContext.defaultParallelism
print(f"⚙️  Spark Default Parallelism: {default_parallelism}")

# Get Spark UI URL
ui_url = spark.sparkContext.uiWebUrl
print(f"\n🌐 Spark UI URL: {ui_url}")
print("   👉 OPEN THIS URL NOW to monitor the job execution!")

# Try to get executor info
num_executors = 1
cores_per_executor = default_parallelism
try:
    status_tracker = spark.sparkContext.statusTracker()
    if hasattr(status_tracker, 'getExecutorInfos'):
        executors = status_tracker.getExecutorInfos()
        num_executors = len(executors)
        print(f"\n📦 Number of Executors: {num_executors}")
        if executors:
            cores_per_executor = executors[0].totalCores
            print(f"🔧 Cores per Executor: {cores_per_executor}")
            total_spark_cores = num_executors * cores_per_executor
            print(f"📊 Total Spark Cores: {total_spark_cores}")
    else:
        raise AttributeError("getExecutorInfos not available")
except Exception as e:
    print(f"\nℹ️  Executor info not available: {e}")
    print(f"ℹ️  Using default parallelism ({default_parallelism}) as total available cores")
    print(f"ℹ️  Assuming 1 executor with {default_parallelism} cores (local mode)")

print("\n" + "=" * 70)
print("STEP 2: Reading 4 Parquet Files (The Problem)")
print("=" * 70)
print("\n⚠️  IMPORTANT: Check Spark UI now!")
print(f"   Go to: {ui_url}")
print("   Navigate to 'Jobs' or 'Stages' tab")
print("   You'll see only 4 tasks running (one per partition)")
print("   Most of your cores will be idle!\n")

# Read the 4 parquet files we created
# This simulates reading from ADLS: spark.read.parquet("abfss://.../sales/")
sales_df = spark.read.parquet("data/sales_demo/")

# Check how many partitions Spark created
num_partitions = sales_df.rdd.getNumPartitions()
total_records = sales_df.count()

print(f"📁 Files Read: 4 parquet files from data/sales_demo/")
print(f"📊 Total Records: {total_records:,}")
print(f"🔢 Spark Partitions Created: {num_partitions}")
print(f"⚙️  Available Cores: {default_parallelism}")

# Diagnosis
print("\n" + "-" * 70)
print("DIAGNOSIS:")
print("-" * 70)

if num_partitions < default_parallelism:
    waste_percentage = (1 - num_partitions / default_parallelism) * 100
    idle_cores = default_parallelism - num_partitions
    utilization = (num_partitions / default_parallelism) * 100

    print(f"⚠️  PROBLEM DETECTED!")
    print(f"   • You have {num_partitions} partitions but {default_parallelism} cores")
    print(f"   • {idle_cores} cores will be IDLE (doing nothing)")
    print(f"   • Resource utilization: {utilization:.1f}%")
    print(f"   • Waste: {waste_percentage:.1f}% of your compute resources")
    print(f"\n   This means:")
    print(f"   • Only {num_partitions} tasks will run in parallel")
    print(f"   • {idle_cores} cores will sit idle, wasting resources")
    print(f"   • Your job will run much slower than it could")

    print(f"\n📊 Check Spark UI:")
    print(f"   • Go to: {ui_url}")
    print(f"   • Look at the 'Stages' tab")
    print(f"   • You should see only {num_partitions} tasks")
    print(f"   • Notice how many cores are idle!")
else:
    print("✅ Partition count looks good!")

print("\n" + "=" * 70)
print("STEP 3: Visualizing the Problem")
print("=" * 70)

print(f"\nCurrent Situation:")
print(f"  Files in storage: 4")
print(f"  Spark partitions: {num_partitions}")
print(f"  Available cores: {default_parallelism}")
print(f"  Number of executors: {num_executors}")
print(f"  Cores per executor: {cores_per_executor}")

# Visualize task distribution based on actual configuration
if num_executors == 1:
    # Local mode - single executor
    print(f"\n  Task Distribution (Local Mode - 1 executor with {cores_per_executor} cores):")
    task_visual = " ".join([f"[Task {i+1}]" for i in range(min(num_partitions, 4))])
    idle_visual = " ".join(["[○]"] * max(0, min(cores_per_executor - num_partitions, 4)))
    print(f"    Executor 1: {task_visual} {idle_visual}")
    if cores_per_executor > 4:
        print(f"                ... ({num_partitions} tasks total, {default_parallelism - num_partitions} cores idle)")
    else:
        print(f"                ↑ Only {num_partitions} tasks, {default_parallelism - num_partitions} cores idle")
else:
    # Cluster mode - multiple executors
    print(f"\n  Task Distribution ({num_executors} executors with {cores_per_executor} cores each):")
    for i in range(min(num_executors, num_partitions)):
        idle_cores_vis = max(0, cores_per_executor - 1)
        idle_dots = "[○] " * min(idle_cores_vis, 3)
        if idle_cores_vis > 3:
            idle_dots += f"... ({idle_cores_vis} total idle)"
        print(f"    Executor {i+1}: [Task {i+1}] {idle_dots}→ 1 core used, {idle_cores_vis} idle")
    if num_partitions < num_executors:
        print(f"    (Only {num_partitions} tasks for {num_executors} executors)")
print(f"    Total: {num_partitions} cores busy, {default_parallelism - num_partitions} cores idle")


STEP 1: Understanding Your System Configuration

🖥️  Physical CPU Cores on Your System: 11
⚙️  Spark Default Parallelism: 11

🌐 Spark UI URL: http://192.168.1.4:4040
   👉 OPEN THIS URL NOW to monitor the job execution!

ℹ️  Executor info not available: getExecutorInfos not available
ℹ️  Using default parallelism (11) as total available cores
ℹ️  Assuming 1 executor with 11 cores (local mode)

STEP 2: Reading 4 Parquet Files (The Problem)

⚠️  IMPORTANT: Check Spark UI now!
   Go to: http://192.168.1.4:4040
   Navigate to 'Jobs' or 'Stages' tab
   You'll see only 4 tasks running (one per partition)
   Most of your cores will be idle!

📁 Files Read: 4 parquet files from data/sales_demo/
📊 Total Records: 200,000
🔢 Spark Partitions Created: 4
⚙️  Available Cores: 11

----------------------------------------------------------------------
DIAGNOSIS:
----------------------------------------------------------------------
⚠️  PROBLEM DETECTED!
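
**Why exactly 4 partitions here?** The sketch below is a simplified, pure-Python model of how recent Spark versions size input splits for file sources. It may help explain why four tiny files each became one partition. It assumes the default values of `spark.sql.files.maxPartitionBytes` (128 MB) and `spark.sql.files.openCostInBytes` (4 MB). The helper name `estimate_read_partitions` and the file sizes passed to it are illustrative, and real planning also depends on the Spark version, file format, and other settings, so treat the result as an estimate rather than a guarantee.

```python
def estimate_read_partitions(file_sizes, default_parallelism,
                             max_partition_bytes=128 * 1024 * 1024,
                             open_cost_bytes=4 * 1024 * 1024):
    """Rough estimate of input partitions for splittable files (hypothetical helper)."""
    # Each file is padded with an "open cost" when the target split size is chosen.
    total_bytes = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    max_split = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))

    # Cut every file into chunks of at most max_split bytes.
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(max_split, size - offset))
            offset += max_split

    # Pack chunks (largest first) into partitions until a partition would overflow.
    partitions = 0
    current_bytes = 0
    has_chunk = False
    for chunk in sorted(chunks, reverse=True):
        if has_chunk and current_bytes + chunk > max_split:
            partitions += 1
            current_bytes = 0
            has_chunk = False
        current_bytes += chunk + open_cost_bytes
        has_chunk = True
    if has_chunk:
        partitions += 1
    return partitions


# Demo-sized input: four files of roughly 0.33 MB each, 11 cores
small_files = estimate_read_partitions([340_000] * 4, default_parallelism=11)

# Scenario-sized input: four 256 GiB files, 16 cores
large_files = estimate_read_partitions([256 * 1024**3] * 4, default_parallelism=16)

print(small_files, large_files)
```

Under these assumptions the model gives 4 partitions for the demo files, which matches the output above, and 8192 partitions for four 256 GiB files. So the one-partition-per-file result comes from the files being tiny. Very large splittable files are normally cut into many input partitions, which is one more reason to check `getNumPartitions()` rather than count files.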
   

## Step 3: Optimizing with Repartition Based on Your Machine's Cores

Now we'll repartition the data based on the actual number of cores detected on your machine. **Check Spark UI again after this!**


In [15]:
# Step 3: The Solution - Optimizing with Repartition Based on Your Machine's Cores

print("=" * 70)
print("STEP 4: The Solution - Optimizing with Repartition")
print("=" * 70)

# Calculate a target partition count based on YOUR machine's cores
# Using 2× core count so that cores finishing early can pick up queued tasks
optimal_partitions = default_parallelism * 2
print(f"\n🎯 Target Partitions: {optimal_partitions} (2× your {default_parallelism} cores)")
print(f"   This is calculated based on YOUR machine's actual core count: {physical_cores}")

# Repartition the data (lazy: the shuffle runs when an action such as count() executes)
print(f"\nRepartitioning from {num_partitions} to {optimal_partitions} partitions...")
print("   This will trigger a shuffle operation...")
sales_df_optimized = sales_df.repartition(optimal_partitions)

# Verify
optimized_partitions = sales_df_optimized.rdd.getNumPartitions()
optimized_records = sales_df_optimized.count()

print(f"\n✅ Repartitioning Complete!")
print(f"   • New partition count: {optimized_partitions}")
print(f"   • Records preserved: {optimized_records:,} (same as before)")
print(f"   • Available cores: {default_parallelism}")

# Show improvement
# Utilization is capped at 100%: extra partitions queue up, they do not add cores
utilization_after = min(optimized_partitions, default_parallelism) / default_parallelism * 100
partitions_per_core = optimized_partitions / default_parallelism
print(f"\n📈 Improvement:")
print(f"   • Resource utilization: {utilization_after:.1f}%")
print(f"   • Partitions per core: {partitions_per_core:.1f}")
print(f"   • All {default_parallelism} cores can now be utilized")
print(f"   • Tasks will be queued for smooth execution")

print(f"\n  Optimized Task Distribution:")
if num_executors == 1:
    # Local mode
    tasks_per_executor = optimized_partitions
    print(f"    Executor 1: [{tasks_per_executor} tasks] → All {cores_per_executor} cores busy + queued tasks")
    print(f"    Total: {optimized_partitions} tasks → All {default_parallelism} cores utilized!")
else:
    # Cluster mode
    tasks_per_executor = optimized_partitions // num_executors
    for i in range(num_executors):
        print(f"    Executor {i+1}: [{tasks_per_executor} tasks] → All {cores_per_executor} cores busy + queued tasks")
    print(f"    Total: {optimized_partitions} tasks → All {default_parallelism} cores utilized!")

print(f"\n📊 Check Spark UI NOW to see the difference!")
print(f"   • Go to: {ui_url}")
print(f"   • Navigate to 'Stages' tab")
print(f"   • You should now see {optimized_partitions} tasks instead of {num_partitions}")
print(f"   • Notice how all cores are now being utilized!")
print(f"   • Compare this with what you saw before repartitioning!")


STEP 4: The Solution - Optimizing with Repartition

🎯 Target Partitions: 22 (2× your 11 cores)
   This is calculated based on YOUR machine's actual core count: 11

Repartitioning from 4 to 22 partitions...
   This will trigger a shuffle operation...

✅ Repartitioning Complete!
   • New partition count: 22
   • Records preserved: 200,000 (same as before)
   • Available cores: 11

📈 Improvement:
   • Resource utilization: 100.0%
   • Partitions per core: 2.0
   • All 11 cores can now be utilized
   • Tasks will be queued for smooth execution

  Optimized Task Distribution:
    Executor 1: [22 tasks] → All 11 cores busy + queued tasks
    Total: 22 tasks → All 11 cores utilized!

📊 Check Spark UI NOW to see the difference!
   • Go to: http://192.168.1.4:4040
   • Navigate to 'Stages' tab
   • You should now see 22 tasks instead of 4
   • Notice how all cores are now being utilized!
   • Compare this with what you saw before repartitioning!


## Step 4: Performance Comparison

Let's run a simple operation to see the performance difference. **Watch Spark UI during execution!**


In [None]:
# Step 4: Performance Comparison

print("=" * 70)
print("STEP 5: Performance Comparison")
print("=" * 70)

print(f"\n🌐 Keep Spark UI open: {ui_url}")
print("   Watch the 'Stages' tab to see task distribution in real-time!")

# Show a simple operation to demonstrate the difference
print("\nRunning a simple aggregation to show the difference...")
print("   👀 Watch Spark UI to see the difference in parallelism!")

# Without optimization
# Note: the first run also pays one-time warm-up costs, so treat single timings as rough
print(f"\n⏱️  Without optimization ({num_partitions} partitions):")
print("   👉 Check Spark UI - you should see only 4 tasks running")
start = time.time()
result_bad = sales_df.groupBy("region").agg({"sale_amount": "sum"}).collect()
time_bad = time.time() - start
print(f"   Time taken: {time_bad:.2f} seconds")
print(f"   Tasks: {num_partitions} (only {num_partitions} cores utilized)")

# With optimization
# Note: sales_df_optimized is not cached, so this timing includes the repartition shuffle
print(f"\n⏱️  With optimization ({optimal_partitions} partitions):")
print(f"   👉 Check Spark UI - you should see {optimal_partitions} tasks running")
print(f"   👉 Notice how all {default_parallelism} cores are now busy!")
start = time.time()
result_good = sales_df_optimized.groupBy("region").agg({"sale_amount": "sum"}).collect()
time_good = time.time() - start
print(f"   Time taken: {time_good:.2f} seconds")
print(f"   Tasks: {optimized_partitions} (all {default_parallelism} cores utilized)")

if time_bad > 0 and time_good > 0:
    speedup = time_bad / time_good
    print(f"\n📊 Speedup: {speedup:.2f}× faster with optimized partitions")
    print(f"   (Note: Speedup may vary based on data size and cluster configuration)")
    if speedup < 1:
        print(f"   ℹ️  For small datasets, overhead of repartitioning may outweigh benefits")
        print(f"   ℹ️  Benefits are more pronounced with larger datasets and more cores")
        print(f"   ℹ️  The key benefit is better resource utilization, not always speed")

print("\n" + "=" * 70)
print("✅ DEMONSTRATION COMPLETE!")
print("=" * 70)
print("\nKey Takeaways:")
print(f"  • Started with {num_partitions} partitions (from 4 files)")
print(f"  • Optimized to {optimized_partitions} partitions (based on {default_parallelism} cores)")
print(f"  • Now utilizing all {default_parallelism} cores efficiently!")
print(f"\n📊 Spark UI Observations:")
print(f"  • Before: {num_partitions} tasks → {num_partitions} cores busy, {default_parallelism - num_partitions} idle")
print(f"  • After: {optimized_partitions} tasks → All {default_parallelism} cores busy")
print(f"  • Check Spark UI at: {ui_url}")
print("=" * 70)


STEP 5: Performance Comparison

🌐 Keep Spark UI open: http://192.168.1.4:4040
   Watch the 'Stages' tab to see task distribution in real-time!

Running a simple aggregation to show the difference...
   👀 Watch Spark UI to see the difference in parallelism!

⏱️  Without optimization (4 partitions):
   👉 Check Spark UI - you should see only 4 tasks running
   Time taken: 0.40 seconds
   Tasks: 4 (only 4 cores utilized)

⏱️  With optimization (22 partitions):
   👉 Check Spark UI - you should see 22 tasks running
   👉 Notice how all 11 cores are now busy!
   Time taken: 0.31 seconds
   Tasks: 22 (all 11 cores utilized)

📊 Speedup: 1.29× faster with optimized partitions
   (Note: Speedup may vary based on data size and cluster configuration)

✅ DEMONSTRATION COMPLETE!

Key Takeaways:
  • Started with 4 partitions (from 4 files)
  • Optimized to 22 partitions (based on 11 cores)
  • Now utilizing all 11 cores efficiently!

📊 Spark UI Observations:
  • 

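
**A note on timing.** The 0.40 s and 0.31 s figures above each come from a single run, so warm-up effects and noise can easily account for the gap. One way to get steadier numbers is to repeat each action and take the median. The sketch below is a minimal version of that idea. The helper name `time_action` and its defaults are assumptions for illustration, and the stand-in workload is plain Python so the snippet runs without Spark.

```python
import statistics
import time


def time_action(action, repeats=5, warmup=1):
    """Run `action` several times and return the median wall-clock seconds."""
    for _ in range(warmup):
        action()  # warm-up runs are executed but not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload (hypothetical); in the notebook you would pass a Spark action instead
median_seconds = time_action(lambda: sum(range(100_000)))
print(f"median: {median_seconds:.4f} s")
```

In this notebook the callable could be, for example, `lambda: sales_df.groupBy("region").agg({"sale_amount": "sum"}).collect()`, timed once for `sales_df` and once for `sales_df_optimized`.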

## Step 8: Production-Grade Rules to Follow

### Rule 1: Files Are NOT Partitions

**Critical distinction:**
- **Storage layout** (file partitions) ≠ **Compute layout** (Spark partitions)
- Stop conflating them!

**Example:**
- You might have 4 files in ADLS (storage partitioning)
- But you need 32+ Spark partitions for optimal compute

### Rule 2: Always Check Partition Count

**Before running expensive operations:**

```python
# Always verify
num_partitions = df.rdd.getNumPartitions()
total_cores = spark.sparkContext.defaultParallelism

if num_partitions < total_cores:
    print(f"⚠️  WARNING: Only {num_partitions} partitions for {total_cores} cores!")
    print("Consider repartitioning!")
```

**If `num_partitions < total_cores`:**
- You are wasting money and time
- Your cluster is underutilized
- Your job will run slower than necessary
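
The check above can also be turned into numbers. The following is a small sketch, assuming the default of one core per task (`spark.task.cpus=1`). The helper name `partition_report` is hypothetical. Utilization is capped at 100% because extra partitions wait in the queue and do not create extra cores.

```python
def partition_report(num_partitions, total_cores):
    """Summarise how many cores a stage with `num_partitions` tasks can keep busy."""
    busy = min(num_partitions, total_cores)  # one task occupies one core at a time
    return {
        "busy_cores": busy,
        "idle_cores": total_cores - busy,
        "utilization_pct": 100.0 * busy / total_cores,
    }


under_partitioned = partition_report(4, 16)
well_partitioned = partition_report(32, 16)
print(under_partitioned)
print(well_partitioned)
```

With 4 partitions on 16 cores the report shows 12 idle cores and 25% utilization. With 32 partitions it shows no idle cores and 100%.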

### Rule 3: Broadcast Join ≠ Performance Fix

**Common mistake:**
> "I'll use broadcast join to fix my performance issues"

**Reality:**
- A broadcast join avoids shuffling the large table (good!)
- But if the large table has too few partitions, broadcasting won't save you
- You still need proper partitioning for the large table

**Both matter:**
- ✅ Broadcast small tables (avoid shuffle)
- ✅ Properly partition large tables (utilize cores)

### Rule 4: Choose Partition Count Wisely

**Guidelines:**
- **Minimum:** `num_partitions ≥ total_cores`
- **Ideal:** `num_partitions = 2-4 × total_cores`
- **Why 2-4×?**
  - Smooths out uneven task durations (a core that finishes early picks up the next queued task instead of idling)
  - Better load balancing (handles data skew)
  - Allows for task queuing (smooth execution)

**Too many partitions:**
- Overhead from task scheduling
- Small tasks are inefficient
- Generally avoid: `num_partitions > 10 × total_cores`
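
The thresholds above can be expressed as a small classifier. This is only a sketch of the guideline, not a Spark API. The function name `assess_partition_count` and the verdict strings are made up for illustration. It also reports the number of scheduling "waves", meaning how many rounds of tasks the cores have to work through.

```python
import math


def assess_partition_count(num_partitions, total_cores):
    """Classify a partition count against the guideline thresholds (hypothetical helper)."""
    ratio = num_partitions / total_cores
    waves = math.ceil(num_partitions / total_cores)  # rounds of tasks per stage
    if ratio < 1:
        verdict = "too few: some cores stay idle"
    elif ratio < 2:
        verdict = "acceptable: every core has work, little slack for skew"
    elif ratio <= 4:
        verdict = "in the suggested 2-4x range"
    elif ratio <= 10:
        verdict = "high: watch scheduling overhead"
    else:
        verdict = "likely too many: tasks may be too small"
    return ratio, waves, verdict


for partitions, cores in [(4, 16), (32, 16), (200, 16)]:
    print(partitions, cores, assess_partition_count(partitions, cores))
```

For the running example, 4 partitions on 16 cores lands in "too few" and 32 partitions lands in the suggested range with 2 waves.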

### Rule 5: Understand When Repartitioning Happens

**`repartition()`:**
- Triggers a **full shuffle**
- Redistributes data across all partitions
- Use when you need to change partition count significantly

**`coalesce()`:**
- Reduces partition count **without a full shuffle**
- Merges existing partitions; it cannot increase the partition count
- Use when reducing partitions (more efficient than repartition)

**Example:**
```python
# DataFrames are immutable, so keep the returned DataFrame

# If you have 100 partitions and want 32
df_fewer = df.coalesce(32)  # More efficient (no shuffle)

# If you have 4 partitions and want 32
df_more = df.repartition(32)  # Necessary (requires shuffle)
```
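
To make the difference concrete without a cluster, here is a toy model that uses plain Python lists as "partitions". It is not how Spark implements either operation. The names `toy_coalesce` and `toy_repartition` are hypothetical, and the sketch only mirrors two behaviours: coalescing merges whole partitions and never creates new ones, while repartitioning moves individual records into a fresh set of partitions.

```python
def toy_coalesce(partitions, n):
    """Merge existing partitions into at most n groups; never creates new ones."""
    n = min(n, len(partitions))
    merged = [[] for _ in range(n)]
    for index, part in enumerate(partitions):
        merged[index % n].extend(part)  # whole partitions move together
    return merged


def toy_repartition(partitions, n):
    """Redistribute every record round-robin across n new partitions."""
    out = [[] for _ in range(n)]
    position = 0
    for part in partitions:
        for record in part:
            out[position % n].append(record)
            position += 1
    return out


# Four partitions holding eight records each (32 records in total)
four_parts = [list(range(i * 8, (i + 1) * 8)) for i in range(4)]

print(len(toy_coalesce(four_parts, 32)))     # still 4: coalesce cannot add partitions
print(len(toy_repartition(four_parts, 32)))  # 32 partitions after the "shuffle"
print(len(toy_coalesce(four_parts, 2)))      # 2: merging down works
```

This is why `coalesce(32)` on a 4-partition DataFrame leaves you with 4 partitions, and why increasing the count needs `repartition()`.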


## Step 9: Practical Example - Before and After

### Scenario Setup

Let's see a complete example with code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Initialize Spark
spark = SparkSession.builder \
    .appName("PartitionOptimization") \
    .config("spark.executor.instances", "4") \
    .config("spark.executor.cores", "4") \
    .getOrCreate()

# Check your cluster configuration
print("=== Cluster Configuration ===")
print(f"Default parallelism: {spark.sparkContext.defaultParallelism}")
print(f"Executor instances: {spark.conf.get('spark.executor.instances')}")
print(f"Cores per executor: {spark.conf.get('spark.executor.cores')}")
```

### Before Optimization (The Problem)

```python
# ❌ BAD: Naive read
sales_df_bad = spark.read.parquet("abfss://container@storageaccount.dfs.core.windows.net/sales/")

print("=== Before Optimization ===")
print(f"Partitions: {sales_df_bad.rdd.getNumPartitions()}")
print(f"Available cores: {spark.sparkContext.defaultParallelism}")
print(f"Utilization: {sales_df_bad.rdd.getNumPartitions() / spark.sparkContext.defaultParallelism * 100:.1f}%")

# This will be slow and wasteful!
# result_bad = sales_df_bad.groupBy("region").agg({"sale_amount": "sum"})
```

**Illustrative output** (assuming the read yields one partition per file; large splittable Parquet files are normally split into many more input partitions, governed by `spark.sql.files.maxPartitionBytes`, so check the real number on your data):
```
=== Before Optimization ===
Partitions: 4
Available cores: 16
Utilization: 25.0%
```

### After Optimization (The Solution)

```python
# ✅ GOOD: Optimized read with repartition
sales_df_good = (
    spark.read.parquet("abfss://container@storageaccount.dfs.core.windows.net/sales/")
    .repartition(32)  # 2× your core count
)

partitions = sales_df_good.rdd.getNumPartitions()
cores = spark.sparkContext.defaultParallelism

print("=== After Optimization ===")
print(f"Partitions: {partitions}")
print(f"Available cores: {cores}")
# Utilization cannot exceed 100%; extra partitions queue up behind busy cores
print(f"Utilization: {min(partitions, cores) / cores * 100:.1f}%")
print(f"Partitions per core: {partitions / cores:.1f}")

# This can keep all cores busy, at the cost of one shuffle
# result_good = sales_df_good.groupBy("region").agg({"sale_amount": "sum"})
```

**Output:**
```
=== After Optimization ===
Partitions: 32
Available cores: 16
Utilization: 100.0%
Partitions per core: 2.0  (2× for better load balancing)
```

### Performance Comparison

**Expected improvements:**
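- **Execution time:** roughly 3-4× faster is plausible (16 busy cores instead of 4), less whatever the repartition shuffle costs
- **Resource utilization:** 25% → up to 100%
- **Cost efficiency:** Same cluster cost, roughly 3-4× more work done in the same time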
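
Where does a figure like 3-4× come from? A back-of-the-envelope simulation can show the upper bound. The sketch below assigns each task to whichever core frees up first and reports the finish time of the last task. It assumes equal-sized tasks and ignores shuffle cost, I/O, and skew, so real speedups will be lower. The function name `simulate_makespan` and the work units are illustrative.

```python
import heapq


def simulate_makespan(task_durations, total_cores):
    """Greedy scheduling: each task goes to the core that becomes free first."""
    core_free_at = [0.0] * total_cores
    heapq.heapify(core_free_at)
    for duration in task_durations:
        start = heapq.heappop(core_free_at)        # earliest available core
        heapq.heappush(core_free_at, start + duration)
    return max(core_free_at)                       # when the last task finishes


total_work = 32.0  # arbitrary units of CPU work (assumed, not measured)
cores = 16

before = simulate_makespan([total_work / 4] * 4, cores)    # 4 large tasks
after = simulate_makespan([total_work / 32] * 32, cores)   # 32 small tasks
print(before, after, before / after)
```

In this idealised model the job takes 8.0 units with 4 tasks and 2.0 units with 32 tasks, a 4× improvement. Treat that as a ceiling: the repartition itself adds a shuffle that the model leaves out.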


## Step 10: Key Takeaways and Mental Model

### The Core Concept (Lock This In)

**Visual Mental Model:**

```
❌ BAD LAYOUT:
Storage:  [File1] [File2] [File3] [File4]
Spark:    [Part1] [Part2] [Part3] [Part4]
Tasks:    [Task1] [Task2] [Task3] [Task4]
Cores:    [●] [○] [○] [○]  [●] [○] [○] [○]  [●] [○] [○] [○]  [●] [○] [○] [○]
          ↑ Only 4 cores used, 12 idle

✅ OPTIMIZED LAYOUT:
Storage:  [File1] [File2] [File3] [File4]
Spark:    [P1][P2][P3]...[P32]  (repartitioned)
Tasks:    [T1][T2][T3]...[T32]
Cores:    [●][●][●][●] [●][●][●][●] [●][●][●][●] [●][●][●][●] ... (all busy)
          ↑ All 16 cores utilized, tasks queued for smooth execution
```

### Key Takeaways

1. **Storage partitions ≠ Compute partitions**
   - Files in ADLS are storage-level organization
   - Spark partitions are compute-level organization
   - They can (and often should) be different!

2. **One partition = one task = one core**
   - With the default `spark.task.cpus=1`, each task processes one partition on one core
   - More partitions = more parallelism potential
   - But only up to the number of cores; extra tasks wait in the queue

3. **Always check partition count**
   - Use `df.rdd.getNumPartitions()`
   - Compare with `spark.sparkContext.defaultParallelism`
   - If partitions < cores, you're wasting resources

4. **Optimal partition count**
   - Minimum: Equal to number of cores
   - Ideal: 2-4× number of cores
   - Too many: Overhead and inefficiency

5. **Repartition when needed**
   - Use `repartition()` to increase partitions (triggers shuffle)
   - Use `coalesce()` to decrease partitions (no shuffle)
   - Always verify the result

### Common Mistakes to Avoid

1. ❌ Assuming file count = optimal partition count
2. ❌ Not checking partition count before expensive operations
3. ❌ Thinking broadcast join fixes all performance issues
4. ❌ Creating too many or too few partitions
5. ❌ Not understanding the difference between storage and compute partitions

### Next Steps

- Practice: Try reading your own data and checking partition counts
- Experiment: Compare performance with different partition counts
- Monitor: Use Spark UI to visualize task distribution
- Learn: Understand how partitioning affects joins, aggregations, and shuffles


## Summary

### What We Learned

1. **Configuration Understanding**
   - Cluster: 4 executors × 4 cores = 16 total cores
   - Storage: 4 files × 256 GB = 1 TB total data

2. **The Problem**
   - A read that yields only 4 partitions runs only 4 tasks
   - Only 4 cores used → 12 cores idle (75% waste)

3. **The Solution**
   - Repartition to 32 partitions (2× core count)
   - Creates 32 tasks → all 16 cores utilized
   - Better load balancing across cores

4. **Best Practices**
   - Always check partition count
   - Aim for 2-4√ó core count
   - Understand storage vs compute partitions
   - Use repartition/coalesce appropriately

### Remember

> **"Files are not partitions. Storage layout ≠ Compute layout. Always verify your partition count matches your cluster capacity."**

This understanding is crucial for writing efficient, production-grade Spark applications!
