# Practical Demonstration: Executors and Partitions

## Overview

This notebook provides a **hands-on demonstration** of the partition concepts covered in `08_a_Partitions_Concepts.ipynb`.

## What You'll Do

1. **Create 4 Parquet files** to simulate real-world storage (like ADLS)
2. **Read them and observe** partition behavior
3. **View Spark UI** to see parallelism **before repartitioning**
4. **Repartition** based on your machine's actual cores
5. **View Spark UI again** to see improved parallelism **after repartitioning**
6. **Compare performance** before and after optimization

## Prerequisites

- Complete `08_a_Partitions_Concepts.ipynb` first to understand the concepts
- Spark installed and configured
- Jupyter notebook environment ready

---

> **Note:** This is the **practice notebook**. For concepts and theory, see `08_a_Partitions_Concepts.ipynb`


## Step 1: Initializing Spark Session

Before we start, let's create a Spark session with UI enabled so we can monitor job execution.


In [1]:
# Initialize Spark Session with UI enabled
from pyspark.sql import SparkSession
import multiprocessing

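# Stop any existing Spark session (on a fresh kernel `spark` is not defined yet)
try:
    spark.stop()
except NameError:
    pass

# Get the CPU core count (cpu_count() reports logical cores, which can exceed the physical count)
physical_cores = multiprocessing.cpu_count()
print(f"🖥️  Detected {physical_cores} CPU cores on your system")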

# Create Spark session with UI enabled
spark = SparkSession.builder \
    .appName("PartitionOptimizationDemo") \
    .master("local[*]") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .getOrCreate()

# Display Spark version and configuration
print("=" * 70)
print("SPARK SESSION INITIALIZED")
print("=" * 70)
print(f"Spark Version: {spark.version}")
print(f"Spark App Name: {spark.sparkContext.appName}")
print(f"Master: {spark.sparkContext.master}")
print(f"Default Parallelism: {spark.sparkContext.defaultParallelism}")

# Get Spark UI URL
ui_url = spark.sparkContext.uiWebUrl
print(f"\n🌐 Spark UI URL: {ui_url}")
print("\n💡 TIP: Open this URL in your browser to monitor job execution!")
print("   You'll see task distribution, parallelism, and resource utilization")
print("=" * 70)


🖥️  Detected 11 CPU cores on your system


26/01/03 12:03:21 WARN Utils: Your hostname, N-MacBookPro-37.local resolves to a loopback address: 127.0.0.1; using 192.168.1.20 instead (on interface en0)
26/01/03 12:03:21 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/03 12:03:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
26/01/03 12:03:22 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


SPARK SESSION INITIALIZED
Spark Version: 3.5.1
Spark App Name: PartitionOptimizationDemo
Master: local[*]
Default Parallelism: 11

🌐 Spark UI URL: http://192.168.1.20:4041

💡 TIP: Open this URL in your browser to monitor job execution!
   You'll see task distribution, parallelism, and resource utilization
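
The `Default Parallelism: 11` value above follows from the `local[*]` master string, which asks Spark to run one worker thread per logical core. As a rough illustration only (this is not Spark's own parser), the hypothetical helper `local_threads` below mimics how the common `local`, `local[N]`, and `local[*]` forms map to a thread count. The function name and its `logical_cores` parameter are assumptions introduced for this sketch.

```python
import os
import re
from typing import Optional


def local_threads(master: str, logical_cores: Optional[int] = None) -> int:
    """Approximate the worker-thread count implied by a local master string.

    Hypothetical helper for illustration; it only covers `local`,
    `local[N]`, and `local[*]`, not retry or cluster forms.
    """
    if logical_cores is None:
        # Fall back to the machine's logical core count
        logical_cores = os.cpu_count() or 1
    if master == "local":
        return 1  # plain `local` runs a single worker thread
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"unsupported master string: {master!r}")
    value = match.group(1)
    # `*` means "use every logical core"; a number is taken literally
    return logical_cores if value == "*" else int(value)
```

For the machine in this run, `local_threads("local[*]", logical_cores=11)` gives 11, matching the reported default parallelism, while `local_threads("local[4]", logical_cores=11)` would cap Spark at 4 threads.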


## Step 2: Creating Sample Data Files

Let's create 4 Parquet files to simulate the real-world scenario of having 4 large files in storage (like ADLS).


In [2]:
# Step 2: Create 4 Parquet files to simulate a real-world scenario
# This mimics having 4 large files in ADLS (like part-00000.parquet, part-00001.parquet, etc.)

import os
import glob
import shutil
from pyspark.sql import Row
from datetime import date, timedelta

print("=" * 70)
print("CREATING 4 PARQUET FILES TO DEMONSTRATE PARTITION CONCEPT")
print("=" * 70)

# Define the output directory
output_dir = "data/sales_demo"

# Clean up existing directory if it exists
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)
    print(f"Cleaned up existing directory: {output_dir}")

# Create sample sales data
# Each file will represent a different region with substantial data
regions = ["North", "South", "East", "West"]
records_per_file = 50000  # Enough data to see the effect, but manageable

# Collect all data first
all_dataframes = []

for i, region in enumerate(regions):
    print(f"\nCreating data for {region} region (File {i+1}/4)...")
    
    # Generate sample sales data
    data = []
    base_date = date(2023, 1, 1)
    
    for j in range(records_per_file):
        data.append(Row(
            sale_id=i * records_per_file + j,
            region=region,
            product_id=f"PROD_{j % 1000:04d}",
            customer_id=f"CUST_{j % 5000:05d}",
            sale_amount=round(100.0 + (j % 1000) * 0.5, 2),
            sale_date=base_date + timedelta(days=j % 365),
            quantity=(j % 10) + 1
        ))
    
    # Create DataFrame
    df = spark.createDataFrame(data)
    all_dataframes.append(df)
    
    print(f"  ✓ Created DataFrame for {region} region")
    print(f"  ✓ Records: {records_per_file:,}")

# Combine all dataframes
print(f"\nCombining all regions into a single dataset...")
combined_df = all_dataframes[0]
for df in all_dataframes[1:]:
    combined_df = combined_df.union(df)

# Write to parquet with exactly 4 partitions (4 files)
# This simulates having exactly 4 files in storage
print(f"\nWriting to parquet with 4 partitions (4 files)...")
combined_df.coalesce(4).write.mode("overwrite").parquet(output_dir)

# Verify the files were created
parquet_files = glob.glob(f"{output_dir}/*.parquet")
if not parquet_files:
    # Sometimes files are in subdirectories
    parquet_files = glob.glob(f"{output_dir}/**/*.parquet", recursive=True)

total_size = sum(os.path.getsize(f) for f in parquet_files) if parquet_files else 0
file_count = len(parquet_files)

print("\n" + "=" * 70)
print("✅ All 4 Parquet files created successfully!")
print(f"📁 Location: {output_dir}/")
print(f"📊 Number of parquet files: {file_count}")
if total_size > 0:
    print(f"💾 Total size: {total_size / (1024*1024):.2f} MB")
print("=" * 70)


CREATING 4 PARQUET FILES TO DEMONSTRATE PARTITION CONCEPT
Cleaned up existing directory: data/sales_demo

Creating data for North region (File 1/4)...
  ✓ Created DataFrame for North region
  ✓ Records: 50,000

Creating data for South region (File 2/4)...
  ✓ Created DataFrame for South region
  ✓ Records: 50,000

Creating data for East region (File 3/4)...
  ✓ Created DataFrame for East region
  ✓ Records: 50,000

Creating data for West region (File 4/4)...
  ✓ Created DataFrame for West region
  ✓ Records: 50,000

Combining all regions into a single dataset...

Writing to parquet with 4 partitions (4 files)...


26/01/03 12:03:33 WARN TaskSetManager: Stage 0 contains a task of very large size (2304 KiB). The maximum recommended task size is 1000 KiB.


✅ All 4 Parquet files created successfully!
📁 Location: data/sales_demo/
📊 Number of parquet files: 4
💾 Total size: 1.30 MB


                                                                                

## Step 3: Reading Data and Observing the Problem

Now let's read the data and see how Spark creates partitions. **⚠️ IMPORTANT: Check Spark UI now!**

You should see only 4 tasks running (one per partition), and most of your cores will be idle.
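
Why do 4 small files end up as exactly 4 partitions? The sketch below is a simplified model of how Spark appears to pack files into read partitions, assuming the default values of `spark.sql.files.maxPartitionBytes` (128 MB) and `spark.sql.files.openCostInBytes` (4 MB). The function `estimate_read_partitions` is a hypothetical helper written for this notebook, not part of any Spark API, and the per-file size used in the example (roughly 1.30 MB divided by 4) is an assumption.

```python
def estimate_read_partitions(
    file_sizes,
    default_parallelism,
    max_partition_bytes=128 * 1024 * 1024,  # assumed default: spark.sql.files.maxPartitionBytes
    open_cost_bytes=4 * 1024 * 1024,        # assumed default: spark.sql.files.openCostInBytes
):
    """Estimate the partition count for a file-based read (simplified model)."""
    # Every file is charged its size plus a fixed "open cost"
    total = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total // default_parallelism
    max_split = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))

    # Files larger than max_split are cut into chunks of at most max_split bytes
    chunks = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            chunks.append(min(max_split, size - offset))
            offset += max_split

    # Pack chunks largest-first; close a partition when the next chunk would overflow it
    partitions = 0
    current = 0
    has_open = False
    for chunk in sorted(chunks, reverse=True):
        if has_open and current + chunk > max_split:
            partitions += 1
            current = 0
            has_open = False
        current += chunk + open_cost_bytes
        has_open = True
    if has_open:
        partitions += 1
    return partitions


# Four files of roughly 340 KB each on an 11-core machine (sizes are assumed)
print(estimate_read_partitions([340_000] * 4, default_parallelism=11))  # 4
```

Under this model, the 4 MB open cost charged per file is what keeps two small files from sharing a partition, so "one partition per file" holds here but is not guaranteed for other file sizes or settings.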


In [3]:
# Step 3: Read the 4 Parquet files and diagnose the problem

import multiprocessing
import time

print("=" * 70)
print("STEP 1: Understanding Your System Configuration")
print("=" * 70)

# Get actual CPU cores on your system
physical_cores = multiprocessing.cpu_count()
print(f"\n🖥️  Physical CPU Cores on Your System: {physical_cores}")

# Get Spark's default parallelism (usually equals total cores available to Spark)
default_parallelism = spark.sparkContext.defaultParallelism
print(f"⚙️  Spark Default Parallelism: {default_parallelism}")

# Get Spark UI URL
ui_url = spark.sparkContext.uiWebUrl
print(f"\n🌐 Spark UI URL: {ui_url}")
print("   👉 OPEN THIS URL NOW to monitor the job execution!")

# Try to get executor info
# Note: in the recorded run the PySpark status tracker did not expose
# getExecutorInfos, so the fallback in the except branch supplied these values.
num_executors = 1
cores_per_executor = default_parallelism
try:
    status_tracker = spark.sparkContext.statusTracker()
    if hasattr(status_tracker, 'getExecutorInfos'):
        executors = status_tracker.getExecutorInfos()
        num_executors = len(executors)
        print(f"\n📦 Number of Executors: {num_executors}")
        if executors:
            cores_per_executor = executors[0].totalCores
            print(f"🔧 Cores per Executor: {cores_per_executor}")
            total_spark_cores = num_executors * cores_per_executor
            print(f"📊 Total Spark Cores: {total_spark_cores}")
    else:
        raise AttributeError("getExecutorInfos not available")
except Exception as e:
    print(f"\nℹ️  Executor info not available: {e}")
    print(f"ℹ️  Using default parallelism ({default_parallelism}) as total available cores")
    print(f"ℹ️  Assuming 1 executor with {default_parallelism} cores (local mode)")

print("\n" + "=" * 70)
print("STEP 2: Reading 4 Parquet Files (The Problem)")
print("=" * 70)
print("\n⚠️  IMPORTANT: Check Spark UI now!")
print(f"   Go to: {ui_url}")
print("   Navigate to 'Jobs' or 'Stages' tab")
print("   You'll see only 4 tasks running (one per partition)")
print("   Most of your cores will be idle!\n")

# Read the 4 parquet files we created
# This simulates reading from ADLS: spark.read.parquet("abfss://.../sales/")
sales_df = spark.read.parquet("data/sales_demo/")

# Check how many partitions Spark created
num_partitions = sales_df.rdd.getNumPartitions()
total_records = sales_df.count()

print(f"📁 Files Read: 4 parquet files from data/sales_demo/")
print(f"📊 Total Records: {total_records:,}")
print(f"🔢 Spark Partitions Created: {num_partitions}")
print(f"⚙️  Available Cores: {default_parallelism}")

# Diagnosis
print("\n" + "-" * 70)
print("DIAGNOSIS:")
print("-" * 70)

if num_partitions < default_parallelism:
    waste_percentage = (1 - num_partitions / default_parallelism) * 100
    idle_cores = default_parallelism - num_partitions
    utilization = (num_partitions / default_parallelism) * 100
    
    print(f"⚠️  PROBLEM DETECTED!")
    print(f"   • You have {num_partitions} partitions but {default_parallelism} cores")
    print(f"   • {idle_cores} cores will be IDLE (doing nothing)")
    print(f"   • Resource utilization: {utilization:.1f}%")
    print(f"   • Waste: {waste_percentage:.1f}% of your compute resources")
    print(f"\n   This means:")
    print(f"   • Only {num_partitions} tasks will run in parallel")
    print(f"   • {idle_cores} cores will sit idle, wasting resources")
    print(f"   • Your job will run much slower than it could")
    
    print(f"\n📊 Check Spark UI:")
    print(f"   • Go to: {ui_url}")
    print(f"   • Look at the 'Stages' tab")
    print(f"   • You should see only {num_partitions} tasks")
    print(f"   • Notice how many cores are idle!")
else:
    print("‚úÖ Partition count looks good!")

print("\n" + "=" * 70)
print("STEP 3: Visualizing the Problem")
print("=" * 70)

print(f"\nCurrent Situation:")
print(f"  Files in storage: 4")
print(f"  Spark partitions: {num_partitions}")
print(f"  Available cores: {default_parallelism}")
print(f"  Number of executors: {num_executors}")
print(f"  Cores per executor: {cores_per_executor}")

# Visualize task distribution based on actual configuration
if num_executors == 1:
    # Local mode - single executor
    print(f"\n  Task Distribution (Local Mode - 1 executor with {cores_per_executor} cores):")
    task_visual = " ".join([f"[Task {i+1}]" for i in range(min(num_partitions, 4))])
    idle_visual = " ".join(["[○]"] * max(0, min(cores_per_executor - num_partitions, 4)))
    print(f"    Executor 1: {task_visual} {idle_visual}")
    if cores_per_executor > 4:
        print(f"                ... ({num_partitions} tasks total, {default_parallelism - num_partitions} cores idle)")
    else:
        print(f"                ↑ Only {num_partitions} tasks, {default_parallelism - num_partitions} cores idle")
else:
    # Cluster mode - multiple executors
    tasks_per_executor_visual = max(1, num_partitions // num_executors)
    print(f"\n  Task Distribution ({num_executors} executors with {cores_per_executor} cores each):")
    for i in range(min(num_executors, num_partitions)):
        idle_cores_vis = max(0, cores_per_executor - 1)
        idle_dots = "[○] " * min(idle_cores_vis, 3)
        if idle_cores_vis > 3:
            idle_dots += f"... ({idle_cores_vis} total idle)"
        print(f"    Executor {i+1}: [Task {i+1}] {idle_dots}→ 1 core used, {idle_cores_vis} idle")
    if num_partitions < num_executors:
        print(f"    (Only {num_partitions} tasks for {num_executors} executors)")
print(f"    Total: {num_partitions} cores busy, {default_parallelism - num_partitions} cores idle")


STEP 1: Understanding Your System Configuration

🖥️  Physical CPU Cores on Your System: 11
⚙️  Spark Default Parallelism: 11

🌐 Spark UI URL: http://192.168.1.20:4041
   👉 OPEN THIS URL NOW to monitor the job execution!

ℹ️  Executor info not available: getExecutorInfos not available
ℹ️  Using default parallelism (11) as total available cores
ℹ️  Assuming 1 executor with 11 cores (local mode)

STEP 2: Reading 4 Parquet Files (The Problem)

⚠️  IMPORTANT: Check Spark UI now!
   Go to: http://192.168.1.20:4041
   Navigate to 'Jobs' or 'Stages' tab
   You'll see only 4 tasks running (one per partition)
   Most of your cores will be idle!

📁 Files Read: 4 parquet files from data/sales_demo/
📊 Total Records: 200,000
🔢 Spark Partitions Created: 4
⚙️  Available Cores: 11

----------------------------------------------------------------------
DIAGNOSIS:
----------------------------------------------------------------------
⚠️  PROBLEM DETECTED!
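
The arithmetic behind this diagnosis is easy to reproduce without Spark. The sketch below uses a hypothetical helper, `diagnose`, that takes a partition count and a core count and reports how many cores can be busy at once. Capping the busy count at the number of cores is a choice made for this sketch so the utilization figure never exceeds 100%.

```python
def diagnose(num_partitions: int, cores: int) -> dict:
    """Report how many cores a stage with `num_partitions` tasks can keep busy."""
    # At most one task runs per core at any moment
    busy = min(num_partitions, cores)
    idle = cores - busy
    utilization = busy / cores * 100
    return {
        "busy_cores": busy,
        "idle_cores": idle,
        "utilization_pct": round(utilization, 1),
    }


# The situation in this notebook: 4 partitions on 11 cores
print(diagnose(4, 11))
# {'busy_cores': 4, 'idle_cores': 7, 'utilization_pct': 36.4}
```

With 4 partitions and 11 cores, 7 cores have nothing to do during the stage, which is the problem the next step addresses.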
 

## Step 4: Optimizing with Repartition Based on Your Machine's Cores

Now we'll repartition the data based on the actual number of cores detected on your machine. **⚠️ Check Spark UI again after this!**

You should now see many more tasks (equal to 2× your core count), and all cores should be busy.
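
To make the 2× rule concrete, the sketch below works out the target partition count, the number of scheduling "waves" (rounds in which every core picks up one task), and the approximate rows per partition. `plan_repartition` is a hypothetical helper for illustration; the even split of rows is an assumption that holds approximately when `repartition(n)` is called without column arguments.

```python
import math


def plan_repartition(total_rows: int, cores: int, factor: int = 2):
    """Return (partitions, waves, approx_rows_per_partition) for a cores-based target."""
    partitions = cores * factor
    # Each wave runs at most `cores` tasks in parallel
    waves = math.ceil(partitions / cores)
    # Assumes rows are spread roughly evenly across partitions
    rows_per_partition = round(total_rows / partitions)
    return partitions, waves, rows_per_partition


# 200,000 rows on an 11-core machine
print(plan_repartition(200_000, 11))
# (22, 2, 9091)
```

Two waves of small tasks give the scheduler room to rebalance if some tasks finish early, at the price of a shuffle to redistribute the rows.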


In [4]:
# Step 4: The solution - optimizing with repartition based on your machine's cores

print("=" * 70)
print("STEP 4: The Solution - Optimizing with Repartition")
print("=" * 70)

# Calculate optimal partition count based on YOUR machine's cores
# Using 2× core count for optimal load balancing
optimal_partitions = default_parallelism * 2
print(f"\n🎯 Target Partitions: {optimal_partitions} (2× your {default_parallelism} cores)")
print(f"   This is calculated based on YOUR machine's actual core count: {physical_cores}")

# Repartition the data
print(f"\nRepartitioning from {num_partitions} to {optimal_partitions} partitions...")
print("   This will trigger a shuffle operation...")
sales_df_optimized = sales_df.repartition(optimal_partitions)

# Verify
optimized_partitions = sales_df_optimized.rdd.getNumPartitions()
optimized_records = sales_df_optimized.count()

print(f"\n✅ Repartitioning Complete!")
print(f"   • New partition count: {optimized_partitions}")
print(f"   • Records preserved: {optimized_records:,} (same as before)")
print(f"   • Available cores: {default_parallelism}")

# Show improvement
# Note: this ratio is partitions per core, so it exceeds 100% when partitions
# outnumber cores (22 / 11 = 200%); at most 100% of the cores can be busy at once.
utilization_after = (optimized_partitions / default_parallelism) * 100
print(f"\n📈 Improvement:")
print(f"   • Resource utilization: {utilization_after:.1f}%")
print(f"   • All {default_parallelism} cores can now be utilized")
print(f"   • Tasks will be queued for smooth execution")

print(f"\n  Optimized Task Distribution:")
if num_executors == 1:
    # Local mode
    tasks_per_executor = optimized_partitions
    print(f"    Executor 1: [{tasks_per_executor} tasks] → All {cores_per_executor} cores busy + queued tasks")
    print(f"    Total: {optimized_partitions} tasks → All {default_parallelism} cores utilized!")
else:
    # Cluster mode
    tasks_per_executor = optimized_partitions // num_executors
    for i in range(num_executors):
        print(f"    Executor {i+1}: [{tasks_per_executor} tasks] → All {cores_per_executor} cores busy + queued tasks")
    print(f"    Total: {optimized_partitions} tasks → All {default_parallelism} cores utilized!")

print(f"\n📊 Check Spark UI NOW to see the difference!")
print(f"   • Go to: {ui_url}")
print(f"   • Navigate to 'Stages' tab")
print(f"   • You should now see {optimized_partitions} tasks instead of {num_partitions}")
print(f"   • Notice how all cores are now being utilized!")
print(f"   • Compare this with what you saw before repartitioning!")


STEP 4: The Solution - Optimizing with Repartition

🎯 Target Partitions: 22 (2× your 11 cores)
   This is calculated based on YOUR machine's actual core count: 11

Repartitioning from 4 to 22 partitions...
   This will trigger a shuffle operation...

✅ Repartitioning Complete!
   • New partition count: 22
   • Records preserved: 200,000 (same as before)
   • Available cores: 11

📈 Improvement:
   • Resource utilization: 200.0%
   • All 11 cores can now be utilized
   • Tasks will be queued for smooth execution

  Optimized Task Distribution:
    Executor 1: [22 tasks] → All 11 cores busy + queued tasks
    Total: 22 tasks → All 11 cores utilized!

📊 Check Spark UI NOW to see the difference!
   • Go to: http://192.168.1.20:4041
   • Navigate to 'Stages' tab
   • You should now see 22 tasks instead of 4
   • Notice how all cores are now being utilized!
   • Compare this with what you saw before repartitioning!


## Step 5: Performance Comparison

Let's run a simple operation to see the performance difference. **Watch Spark UI during execution!**

You'll see the difference in parallelism - before: only 4 tasks, after: many more tasks utilizing all cores.


In [5]:
# Step 5: Performance comparison

print("=" * 70)
print("STEP 5: Performance Comparison")
print("=" * 70)

print(f"\n🌐 Keep Spark UI open: {ui_url}")
print("   Watch the 'Stages' tab to see task distribution in real-time!")

# Show a simple operation to demonstrate the difference
print("\nRunning a simple aggregation to show the difference...")
print("   👀 Watch Spark UI to see the difference in parallelism!")

# Without optimization
print(f"\n⏱️  Without optimization ({num_partitions} partitions):")
print("   👉 Check Spark UI - you should see only 4 tasks running")
start = time.time()
result_bad = sales_df.groupBy("region").agg({"sale_amount": "sum"}).collect()
time_bad = time.time() - start
print(f"   Time taken: {time_bad:.2f} seconds")
print(f"   Tasks: {num_partitions} (only {num_partitions} cores utilized)")

# With optimization
print(f"\n⏱️  With optimization ({optimal_partitions} partitions):")
print(f"   👉 Check Spark UI - you should see {optimal_partitions} tasks running")
print(f"   👉 Notice how all {default_parallelism} cores are now busy!")
start = time.time()
result_good = sales_df_optimized.groupBy("region").agg({"sale_amount": "sum"}).collect()
time_good = time.time() - start
print(f"   Time taken: {time_good:.2f} seconds")
print(f"   Tasks: {optimized_partitions} (all {default_parallelism} cores utilized)")

if time_bad > 0 and time_good > 0:
    # Ratio of the two timings; a value below 1 means the repartitioned run was slower.
    # Each query is timed only once, and the second one also pays for the repartition
    # shuffle (sales_df_optimized is not cached), so treat the number as indicative only.
    speedup = time_bad / time_good
    print(f"\n📊 Speedup: {speedup:.2f}× faster with optimized partitions")
    print(f"   (Note: Speedup may vary based on data size and cluster configuration)")
    if speedup < 1:
        print(f"   ℹ️  For small datasets, overhead of repartitioning may outweigh benefits")
        print(f"   ℹ️  Benefits are more pronounced with larger datasets and more cores")
        print(f"   ℹ️  The key benefit is better resource utilization, not always speed")

print("\n" + "=" * 70)
print("✅ DEMONSTRATION COMPLETE!")
print("=" * 70)
print("\nKey Takeaways:")
print(f"  • Started with {num_partitions} partitions (from 4 files)")
print(f"  • Optimized to {optimized_partitions} partitions (based on {default_parallelism} cores)")
print(f"  • Now utilizing all {default_parallelism} cores efficiently!")
print(f"\n📊 Spark UI Observations:")
print(f"  • Before: {num_partitions} tasks → {num_partitions} cores busy, {default_parallelism - num_partitions} idle")
print(f"  • After: {optimized_partitions} tasks → All {default_parallelism} cores busy")
print(f"  • Check Spark UI at: {ui_url}")
print("=" * 70)


STEP 5: Performance Comparison

🌐 Keep Spark UI open: http://192.168.1.20:4041
   Watch the 'Stages' tab to see task distribution in real-time!

Running a simple aggregation to show the difference...
   👀 Watch Spark UI to see the difference in parallelism!

⏱️  Without optimization (4 partitions):
   👉 Check Spark UI - you should see only 4 tasks running
   Time taken: 0.47 seconds
   Tasks: 4 (only 4 cores utilized)

⏱️  With optimization (22 partitions):
   👉 Check Spark UI - you should see 22 tasks running
   👉 Notice how all 11 cores are now busy!
   Time taken: 0.56 seconds
   Tasks: 22 (all 11 cores utilized)

📊 Speedup: 0.83× faster with optimized partitions
   (Note: Speedup may vary based on data size and cluster configuration)
   ℹ️  For small datasets, overhead of repartitioning may outweigh benefits
   ℹ️  Benefits are more pronounced with larger datasets and more cores
   ℹ️  The key benefit is better resource utilization, not always speed
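
A single timed run per query is sensitive to warm-up effects and scheduling noise, which matters when the two timings differ by less than a tenth of a second. One way to get a steadier comparison, sketched here under the assumption that the action can safely be repeated, is to run it a few times after a warm-up and take the median. `time_action` is a hypothetical helper, not something the notebook above uses.

```python
import statistics
import time


def time_action(action, repeats: int = 5, warmup: int = 1) -> float:
    """Return the median wall-clock seconds of `action()` over `repeats` runs."""
    # Warm-up runs are executed but not measured
    for _ in range(warmup):
        action()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

With the DataFrames from this notebook it could be called as `time_action(lambda: sales_df.groupBy("region").agg({"sale_amount": "sum"}).collect())`, and likewise for `sales_df_optimized`, to compare medians rather than single measurements.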

## Summary

### What You Demonstrated

1. **Created 4 Parquet files** - Simulated real-world storage scenario
2. **Read data naively** - Observed only 4 partitions created
3. **Checked Spark UI** - Saw only 4 tasks, most cores idle
4. **Repartitioned** - Based on your machine's actual cores (2× core count)
5. **Checked Spark UI again** - Saw many more tasks, all cores busy
6. **Compared performance** - Observed the difference in parallelism

### Key Observations from Spark UI

**Before Repartitioning:**
- Only 4 tasks running
- Most cores idle
- Poor resource utilization

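**After Repartitioning:**
- Many more tasks (2× your core count)
- All cores can be kept busy
- Better resource utilization, though not necessarily a faster job: in the recorded run the repartitioned aggregation took 0.56 s versus 0.47 s, likely because the shuffle overhead outweighs the gain on a 1.30 MB dataset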

### Next Steps

- Apply this knowledge to your own data
- Always check partition count before expensive operations
- Use Spark UI to monitor and optimize your jobs
- Review the concepts notebook (`08_a_Partitions_Concepts.ipynb`) for deeper understanding
