# Understanding Partitions: ADLS File Partitions vs Spark Executor Partitions

## Learning Objectives

By the end of this notebook, you will understand:

1. **The difference** between file partitions in storage (ADLS) and partitions in Spark executors
2. **Why idle partitions occur** and how they waste resources
3. **How to diagnose** partition-related performance issues
4. **Best practices** for optimizing partition counts
5. **Common mistakes** and how to avoid them

## Prerequisites

- Basic understanding of Spark architecture (executors, cores, tasks)
- Familiarity with reading data from cloud storage (ADLS, S3, etc.)
- Understanding of distributed computing concepts

---

> **Note:** This is the **concepts notebook**. For hands-on practical demonstration with real data and Spark UI monitoring, see `08_a_Partitions_Practice.ipynb`


## Step 1: Understanding Your Configuration

Before we dive into partitions, let's understand the setup we're working with.

### Cluster Configuration

```python
# Your Spark cluster setup
num_executors = 4
cores_per_executor = 4
total_cores = num_executors * cores_per_executor  # 16 cores total
```

**What this means:**
- You have **4 executors** (JVM processes running on worker nodes)
- Each executor has **4 CPU cores** available
- Total available parallelism: **16 cores**

**Common Misconception:**
> "I have 16 cores, so I have 16-way parallelism"

**Reality:** This is only true if you have **at least 16 partitions**. Having 16 cores doesn't automatically give you 16-way parallelism - you need 16 tasks (one per partition) to utilize all cores.
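
The relationship between cores and partitions can be written as a one-line rule. The following is a minimal sketch; the helper name `effective_parallelism` is made up for illustration and is not a Spark API.

```python
def effective_parallelism(num_partitions: int, total_cores: int) -> int:
    """Tasks that can run at once: one task per partition, one core per task."""
    return min(num_partitions, total_cores)


total_cores = 4 * 4  # 4 executors x 4 cores each, as configured above

for partitions in (4, 16, 32):
    busy = effective_parallelism(partitions, total_cores)
    print(f"{partitions} partitions -> {busy} busy cores, {total_cores - busy} idle")
```

With fewer partitions than cores, the partition count is the limit; with more, the core count is.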

### Data Storage Configuration (ADLS)

```python
# Your data layout in ADLS
data_path = "abfss://container@storageaccount.dfs.core.windows.net/sales/"

# Files in the sales directory:
# sales/
#   ├── part-00000.parquet (256 GB)
#   ├── part-00001.parquet (256 GB)
#   ├── part-00002.parquet (256 GB)
#   └── part-00003.parquet (256 GB)
```

**What this means:**
- You have **4 large Parquet files** in your storage
- Each file is **256 GB** (1 TB total)
- These are **file partitions** - a way to organize data in storage

**Important Distinction:**
- **File partitions** (storage level) ≠ **Spark partitions** (compute level)
- The number of files does NOT automatically equal the number of Spark partitions
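
To make the distinction concrete, here is a simplified model of how Spark's file-based readers size input splits. It loosely follows the documented settings `spark.sql.files.maxPartitionBytes` (128 MB by default) and `spark.sql.files.openCostInBytes` (4 MB by default). The function name is hypothetical, and real planning also packs small splits together, so treat the result as an estimate rather than the exact number Spark will report.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def estimate_input_partitions(file_sizes, total_cores, splittable=True,
                              max_partition_bytes=128 * MB,
                              open_cost_bytes=4 * MB):
    """Rough estimate of input splits for a file-based read (simplified model)."""
    if not splittable:
        # A non-splittable file (e.g. gzip-compressed CSV) is read by one task.
        return len(file_sizes)
    total_bytes = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total_bytes / total_cores
    max_split_bytes = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))
    return sum(math.ceil(size / max_split_bytes) for size in file_sizes)


files = [256 * GB] * 4
print(estimate_input_partitions(files, total_cores=16))                    # splittable files
print(estimate_input_partitions(files, total_cores=16, splittable=False))  # one task per file
```

Under this model the four 256 GB files yield 8192 splits when they are splittable and 4 when they are not, which is why the file count alone tells you little about the partition count.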


## Step 2: What We're Trying to Achieve

### Goal: Maximum Resource Utilization

**What we want:**
- All 16 cores working simultaneously
- No idle executors or cores
- Efficient data processing with minimal waste

**Why this matters:**
- **Cost efficiency**: You're paying for all 16 cores - use them!
- **Performance**: More parallelism = faster job completion
- **Scalability**: Understanding this helps you scale properly

### The Ideal Scenario

```
Executor 1: [Task 1] [Task 2] [Task 3] [Task 4] → All 4 cores busy
Executor 2: [Task 5] [Task 6] [Task 7] [Task 8] → All 4 cores busy
Executor 3: [Task 9] [Task 10] [Task 11] [Task 12] → All 4 cores busy
Executor 4: [Task 13] [Task 14] [Task 15] [Task 16] → All 4 cores busy
```

**Key Principle:**
> **Number of partitions should be ≥ number of cores (ideally 2-4× for better load balancing)**


## Step 3: Common Mistake - What Happens When You Read Data Naively

### The Problematic Code

```python
# This is what many beginners do (and it's WRONG for our scenario)
sales_df = spark.read.parquet("abfss://container@storageaccount.dfs.core.windows.net/sales/")
```

### What Spark Actually Does

When you read data, Spark follows this rule:

> **One input partition = one task = one core**

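**What Spark sees:**
- 4 Parquet files in the directory
- This walkthrough assumes each file becomes exactly one partition, which is what happens for non-splittable inputs (for example gzip-compressed CSV). Parquet is splittable, so with default settings Spark would normally cut large files into chunks of at most `spark.sql.files.maxPartitionBytes` (128 MB by default); treat the numbers below as the worst case and verify the real count on your own data
- Under that assumption Spark creates: **4 partitions** → **4 tasks**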

**Visual Representation:**

```
Spark's View:
├── Partition 0 (from part-00000.parquet) → Task 0
├── Partition 1 (from part-00001.parquet) → Task 1
├── Partition 2 (from part-00002.parquet) → Task 2
└── Partition 3 (from part-00003.parquet) → Task 3
```

### The Painful Reality: Executor-Level Task Assignment

**How Spark distributes tasks:**

```
Executor 1: 
  ├── Task 0 (processing part-00000.parquet, 256 GB)
  └── Cores: [●] [○] [○] [○]  → 1 core used, 3 cores IDLE

Executor 2:
  ├── Task 1 (processing part-00001.parquet, 256 GB)
  └── Cores: [●] [○] [○] [○]  → 1 core used, 3 cores IDLE

Executor 3:
  ├── Task 2 (processing part-00002.parquet, 256 GB)
  └── Cores: [●] [○] [○] [○]  → 1 core used, 3 cores IDLE

Executor 4:
  ├── Task 3 (processing part-00003.parquet, 256 GB)
  └── Cores: [●] [○] [○] [○]  → 1 core used, 3 cores IDLE
```

**Resource Utilization:**
- **Used cores:** 4 out of 16
- **Idle cores:** 12 out of 16
- **Utilization:** 25% (75% waste!)

**This is not "suboptimal" - this is embarrassingly bad for production!**


## Step 4: Why Spark Behaves This Way (Understanding the Fundamentals)

### Key Spark Rules (No Myths)

**Spark does NOT:**
- ❌ Split one task across multiple cores
- ❌ Parallelize processing within a single partition
- ❌ Allow threads to cooperate on the same partition

**Why?**
- A partition is processed by **one thread** (one core)
- The task for a partition runs single-threaded, so there is no shared mutable state or locking inside it
- That keeps results consistent and predictable

### Common Misconception

**Wrong belief:**
> "My executor has 4 cores, so it will process the 256 GB partition 4 times faster"

**Reality:**
> "An executor with 4 cores can process **4 partitions in parallel**, not one partition faster"

**Think of it this way:**
- Each core is like a worker
- Each partition is like a job
- One worker can only do one job at a time
- To use 4 workers, you need 4 jobs (partitions)
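
The worker/job analogy can be checked with a small pure-Python simulation. This is a sketch under stated assumptions: processing time is taken to be proportional to partition size (one time unit per GB), and shuffle and I/O costs are ignored. `simulate_makespan` is an illustrative helper, not part of Spark.

```python
import heapq


def simulate_makespan(task_durations, total_cores):
    """Greedy scheduler: each task runs on exactly one core from start to finish."""
    core_free_at = [0.0] * total_cores  # time at which each core becomes free
    heapq.heapify(core_free_at)
    for duration in task_durations:
        start = heapq.heappop(core_free_at)  # earliest available core
        heapq.heappush(core_free_at, start + duration)
    return max(core_free_at)


four_big = simulate_makespan([256.0] * 4, total_cores=16)
thirty_two_small = simulate_makespan([32.0] * 32, total_cores=16)

print(f"4 x 256 GB tasks: {four_big} time units")
print(f"32 x 32 GB tasks: {thirty_two_small} time units")
print(f"Speedup: {four_big / thirty_two_small}x")
```

The four large tasks finish no sooner on 16 cores than they would on 4, because extra cores cannot help a task that is already running.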

### Why Spark Doesn't Auto-Fix This

**You might ask:** "Why doesn't Spark just split the data better automatically?"

**Answer:** Spark respects input layout and avoids assumptions:

1. **File boundaries matter**: A file that cannot be split (for example gzip-compressed text) has to be read by a single task, however large it is
2. **Existing partitioning**: If data was partitioned a certain way, Spark assumes it was intentional
3. **Avoids accidental shuffles**: Spark won't automatically repartition to avoid expensive operations
4. **Assumption**: "If you wrote 4 huge files, you probably meant it"

**That assumption is often wrong - but Spark won't guess. You need to tell it explicitly.**


## Step 5: How to Diagnose Partition Issues

### Check Your Partition Count

**Always check this first:**

```python
# Read your data
sales_df = spark.read.parquet("abfss://container@storageaccount.dfs.core.windows.net/sales/")

# Check partition count
num_partitions = sales_df.rdd.getNumPartitions()
print(f"Number of partitions: {num_partitions}")

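# Compare with available cores
# (defaultParallelism is usually the total executor cores, unless spark.default.parallelism overrides it)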
total_cores = spark.sparkContext.defaultParallelism
print(f"Total available cores: {total_cores}")

# Diagnosis
if num_partitions < total_cores:
    print(f"⚠️  WARNING: You have {num_partitions} partitions but {total_cores} cores!")
    print(f"⚠️  You are wasting {total_cores - num_partitions} cores!")
else:
    print("✅ Partition count looks good!")
```

### Understanding the Output

For our scenario:
```
Number of partitions: 4
Total available cores: 16
⚠️  WARNING: You have 4 partitions but 16 cores!
⚠️  You are wasting 12 cores!
```

**Rule of thumb:**
- If `num_partitions < total_cores` → You're wasting resources
- Ideal: `num_partitions = 2-4 × total_cores` (for better load balancing and hiding I/O wait times)


## Step 6: The Right Way - Optimized Partition Strategy

### Option 1: Repartition After Read (Simple & Explicit)

**Recommended for beginners:**

```python
# Step 1: Read the data (creates 4 partitions from 4 files)
sales_df = spark.read.parquet("abfss://container@storageaccount.dfs.core.windows.net/sales/")

# Step 2: Explicitly repartition to match your cluster
# Target: 2-4× your core count for optimal performance
sales_df_optimized = sales_df.repartition(32)  # 2× your 16 cores

# Verify
print(f"Partitions after repartition: {sales_df_optimized.rdd.getNumPartitions()}")
```

**What happens:**
1. Spark reads 4 files → creates 4 initial partitions
2. `repartition(32)` triggers a **shuffle** to redistribute data into 32 partitions
3. Each partition is now ~32 GB instead of 256 GB
4. Spark creates 32 tasks → can utilize all 16 cores (with tasks queued)
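
The partition sizes quoted in the steps above come from simple division. A minimal sketch, assuming `repartition()` spreads rows evenly (it uses round-robin distribution when no columns are given); `partition_size_gb` is a hypothetical helper:

```python
def partition_size_gb(total_gb: float, num_partitions: int) -> float:
    """Average partition size, assuming data is spread evenly."""
    return total_gb / num_partitions


total_gb = 4 * 256  # four 256 GB files

for n in (4, 32, 64):
    print(f"{n} partitions -> ~{partition_size_gb(total_gb, n):.0f} GB each")
```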

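### Option 2: Chain the Repartition onto the Read (Same Plan, Less Code)

**Equivalent to Option 1.** Spark evaluates lazily, so chaining `.repartition(32)` produces the same plan: the scan still runs as 4 input tasks, and the shuffle then produces 32 partitions.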

```python
# Read and repartition in one step
sales_df = (
    spark.read.parquet("abfss://container@storageaccount.dfs.core.windows.net/sales/")
    .repartition(32)
)
```
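
If the goal is to avoid the shuffle altogether, one alternative for splittable formats is to ask for smaller input splits at read time through `spark.sql.files.maxPartitionBytes`. The wrapper below is a sketch; the function name and the 64 MB value in the usage comment are illustrative choices, and the setting has no effect on files that cannot be split.

```python
def read_parquet_with_split_size(spark, path, max_partition_bytes):
    """Read Parquet after lowering the maximum bytes packed into one input partition."""
    # spark.sql.files.maxPartitionBytes is a runtime SQL setting (default 128 MB).
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(max_partition_bytes))
    return spark.read.parquet(path)


# Usage (requires an active SparkSession named `spark`):
# sales_df = read_parquet_with_split_size(spark, data_path, 64 * 1024 ** 2)
# print(sales_df.rdd.getNumPartitions())
```

Because the split happens during the scan, no data is moved between executors, unlike with `repartition()`.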

### Optimized Task Distribution

**After repartitioning to 32 partitions:**

```
Executor 1:
  ├── Tasks: [1] [2] [3] [4] [5] [6] [7] [8]
  └── Cores: [●] [●] [●] [●]  → All 4 cores busy, 4 tasks queued

Executor 2:
  ├── Tasks: [9] [10] [11] [12] [13] [14] [15] [16]
  └── Cores: [●] [●] [●] [●]  → All 4 cores busy, 4 tasks queued

Executor 3:
  ├── Tasks: [17] [18] [19] [20] [21] [22] [23] [24]
  └── Cores: [●] [●] [●] [●]  → All 4 cores busy, 4 tasks queued

Executor 4:
  ├── Tasks: [25] [26] [27] [28] [29] [30] [31] [32]
  └── Cores: [●] [●] [●] [●]  → All 4 cores busy, 4 tasks queued
```

**Resource Utilization:**
- **Used cores:** 16 out of 16
- **Idle cores:** 0
- **Utilization:** 100% ✅
- **Task queue:** As soon as one task finishes, the next starts immediately
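
Queued tasks run in "waves". The sketch below counts the waves and how many cores the final wave keeps busy, assuming every task takes about the same time (real tasks vary, so this is only an approximation). `task_waves` is an illustrative helper.

```python
def task_waves(num_partitions: int, total_cores: int):
    """Return (number of waves, cores busy in the last wave) for equal-length tasks."""
    full_waves, remainder = divmod(num_partitions, total_cores)
    waves = full_waves + (1 if remainder else 0)
    last_wave_busy = remainder if remainder else min(num_partitions, total_cores)
    return waves, last_wave_busy


for partitions in (4, 32, 40):
    waves, busy = task_waves(partitions, total_cores=16)
    print(f"{partitions} partitions: {waves} wave(s), {busy} cores busy in the last wave")
```

A partition count that is a multiple of the core count, such as 32 on 16 cores, keeps every core busy through the last wave under this assumption.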

**This is what a healthy Spark job looks like!**


## Step 7: Understanding the Impact on Operations

### Example: Broadcast Join with Optimized Partitions

**Scenario:** Joining sales data (large) with product catalog (small)

```python
from pyspark.sql.functions import broadcast

# Small table (will be broadcast)
products_df = spark.read.parquet("abfss://.../products/")  # 100 MB

# Large table (our sales data)
sales_df = (
    spark.read.parquet("abfss://.../sales/")
    .repartition(32)  # Optimized partitions
)

# Broadcast join
result = sales_df.join(
    broadcast(products_df),
    on="product_id",
    how="inner"
)
```

### What Happens in the Optimized Setup

**Broadcast side (small table):**
- Still broadcast once per executor (no change)
- Memory impact unchanged

**Fact side (sales data):**
- Each executor processes **8 smaller partitions** (32 partitions ÷ 4 executors)
- Each partition is **~32 GB** instead of 256 GB
- **Benefits:**
  - ✅ Faster task completion (smaller chunks)
  - ✅ Better garbage collection behavior (less memory pressure)
  - ✅ Reduced spill risk (less data per task)
  - ✅ Better load balancing (if some partitions are slower, others compensate)

### Visual Comparison

**❌ Bad Layout (4 partitions):**
```
4 partitions → 4 tasks → 16 cores → 12 idle
Each task processes 256 GB
Slow, inefficient, wasteful
```

**✅ Optimized Layout (32 partitions):**
```
32 partitions → 32 tasks → 16 cores → always busy
Each task processes ~32 GB
Fast, efficient, all resources utilized
```


## Step 8: Production-Grade Rules to Follow

### Rule 1: Files Are NOT Partitions

**Critical distinction:**
- **Storage layout** (file partitions) ≠ **Compute layout** (Spark partitions)
- Stop conflating them!

**Example:**
- You might have 4 files in ADLS (storage partitioning)
- But you need 32+ Spark partitions for optimal compute

### Rule 2: Always Check Partition Count

**Before running expensive operations:**

```python
# Always verify
num_partitions = df.rdd.getNumPartitions()
total_cores = spark.sparkContext.defaultParallelism

if num_partitions < total_cores:
    print(f"⚠️  WARNING: Only {num_partitions} partitions for {total_cores} cores!")
    print("Consider repartitioning!")
```

**If `num_partitions < total_cores`:**
- You are wasting money and time
- Your cluster is underutilized
- Your job will run slower than necessary

### Rule 3: Broadcast Join ≠ Performance Fix

**Common mistake:**
> "I'll use broadcast join to fix my performance issues"

**Reality:**
- Broadcast join removes shuffle (good!)
- But if your partitions are wrong, broadcast won't save you
- You still need proper partitioning for the large table

**Both matter:**
- ✅ Broadcast small tables (avoid shuffle)
- ✅ Properly partition large tables (utilize cores)

### Rule 4: Choose Partition Count Wisely

**Guidelines:**
- **Minimum:** `num_partitions ≥ total_cores`
- **Ideal:** `num_partitions = 2-4 × total_cores`
- **Why 2-4×?**
  - Hides I/O wait times (while one task waits for I/O, others run)
  - Better load balancing (handles data skew)
  - Allows for task queuing (smooth execution)

**Too many partitions:**
- Overhead from task scheduling
- Small tasks are inefficient
- Generally avoid: `num_partitions > 10 × total_cores`
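
These guidelines can be folded into a quick check. The thresholds below are the rules of thumb from this section rather than hard limits, and the function name and labels are made up for illustration.

```python
def classify_partition_count(num_partitions: int, total_cores: int) -> str:
    """Label a partition count using the 1x / 2-4x / 10x rules of thumb."""
    ratio = num_partitions / total_cores
    if ratio < 1:
        return "too_few"    # some cores will sit idle
    if ratio < 2:
        return "minimum"    # all cores usable, little slack for skew
    if ratio <= 4:
        return "ideal"      # 2-4x cores
    if ratio <= 10:
        return "high"       # workable, but scheduling overhead grows
    return "too_many"       # many tiny tasks


for n in (4, 16, 32, 200):
    print(n, classify_partition_count(n, total_cores=16))
```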

### Rule 5: Understand When Repartitioning Happens

**`repartition()`:**
- Triggers a **full shuffle**
- Redistributes data across all partitions
- Use when you need to change partition count significantly

**`coalesce()`:**
- Reduces partition count **without shuffle**
- Combines adjacent partitions
- Use when reducing partitions (more efficient than repartition)

**Example:**
```python
# If you have 100 partitions and want 32
df.coalesce(32)  # More efficient (no shuffle)

# If you have 4 partitions and want 32
df.repartition(32)  # Necessary (requires shuffle)
```
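
The choice between the two calls can be wrapped in a small helper. This is a sketch, not a Spark API: `resize_partitions` is a hypothetical name, and it relies only on the `rdd.getNumPartitions()`, `coalesce()`, and `repartition()` methods already used in this notebook.

```python
def resize_partitions(df, target: int):
    """Use coalesce() when shrinking (no shuffle) and repartition() when growing."""
    current = df.rdd.getNumPartitions()
    if target < current:
        return df.coalesce(target)     # merges existing partitions
    if target > current:
        return df.repartition(target)  # full shuffle
    return df                          # already at the target


# Usage (requires a Spark DataFrame named `sales_df`):
# sales_df = resize_partitions(sales_df, 32)
```

Note that `coalesce()` can leave partitions uneven, so `repartition()` may still be the better choice when shrinking skewed data.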


## Step 9: Practical Example - Before and After

### Scenario Setup

Let's see a complete example with code:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Initialize Spark
spark = SparkSession.builder \
    .appName("PartitionOptimization") \
    .config("spark.executor.instances", "4") \
    .config("spark.executor.cores", "4") \
    .getOrCreate()

# Check your cluster configuration
print("=== Cluster Configuration ===")
print(f"Default parallelism: {spark.sparkContext.defaultParallelism}")
print(f"Executor instances: {spark.conf.get('spark.executor.instances')}")
print(f"Cores per executor: {spark.conf.get('spark.executor.cores')}")
```

### Before Optimization (The Problem)

```python
# ❌ BAD: Naive read
sales_df_bad = spark.read.parquet("abfss://container@storageaccount.dfs.core.windows.net/sales/")

print("=== Before Optimization ===")
print(f"Partitions: {sales_df_bad.rdd.getNumPartitions()}")
print(f"Available cores: {spark.sparkContext.defaultParallelism}")
print(f"Utilization: {sales_df_bad.rdd.getNumPartitions() / spark.sparkContext.defaultParallelism * 100:.1f}%")

# This will be slow and wasteful!
# result_bad = sales_df_bad.groupBy("region").agg({"sales": "sum"})
```

**Output:**
```
=== Before Optimization ===
Partitions: 4
Available cores: 16
Utilization: 25.0%
```

### After Optimization (The Solution)

```python
# ✅ GOOD: Optimized read with repartition
sales_df_good = (
    spark.read.parquet("abfss://container@storageaccount.dfs.core.windows.net/sales/")
    .repartition(32)  # 2× your core count
)

print("=== After Optimization ===")
print(f"Partitions: {sales_df_good.rdd.getNumPartitions()}")
print(f"Available cores: {spark.sparkContext.defaultParallelism}")
print(f"Partitions per core: {sales_df_good.rdd.getNumPartitions() / spark.sparkContext.defaultParallelism:.1f}")

# This will be fast and efficient!
# result_good = sales_df_good.groupBy("region").agg({"sales": "sum"})
```

**Output:**
```
=== After Optimization ===
Partitions: 32
Available cores: 16
Partitions per core: 2.0
```

### Performance Comparison

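**Expected improvements (best case, for CPU-bound stages):**
- **Execution time:** up to ~4× faster, since 16 cores work instead of 4; the repartition shuffle itself takes time, so measure the net gain
- **Resource utilization:** 25% → 100%
- **Cost efficiency:** Same cluster cost, with far less idle capacity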


## Step 10: Key Takeaways and Mental Model

### The Core Concept (Lock This In)

**Visual Mental Model:**

```
❌ BAD LAYOUT:
Storage:  [File1] [File2] [File3] [File4]
Spark:    [Part1] [Part2] [Part3] [Part4]
Tasks:    [Task1] [Task2] [Task3] [Task4]
Cores:    [●] [○] [○] [○]  [●] [○] [○] [○]  [●] [○] [○] [○]  [●] [○] [○] [○]
          ↑ Only 4 cores used, 12 idle

✅ OPTIMIZED LAYOUT:
Storage:  [File1] [File2] [File3] [File4]
Spark:    [P1][P2][P3]...[P32]  (repartitioned)
Tasks:    [T1][T2][T3]...[T32]
Cores:    [●][●][●][●] [●][●][●][●] [●][●][●][●] [●][●][●][●] ... (all busy)
          ↑ All 16 cores utilized, tasks queued for smooth execution
```

### Key Takeaways

1. **Storage partitions ≠ Compute partitions**
   - Files in ADLS are storage-level organization
   - Spark partitions are compute-level organization
   - They can (and often should) be different!

2. **One partition = one task = one core**
   - This is a fundamental Spark rule
   - More partitions = more parallelism potential
   - But only if you have enough cores

3. **Always check partition count**
   - Use `df.rdd.getNumPartitions()`
   - Compare with `spark.sparkContext.defaultParallelism`
   - If partitions < cores, you're wasting resources

4. **Optimal partition count**
   - Minimum: Equal to number of cores
   - Ideal: 2-4× number of cores
   - Too many: Overhead and inefficiency

5. **Repartition when needed**
   - Use `repartition()` to increase partitions (triggers shuffle)
   - Use `coalesce()` to decrease partitions (no shuffle)
   - Always verify the result

### Common Mistakes to Avoid

1. ❌ Assuming file count = optimal partition count
2. ❌ Not checking partition count before expensive operations
3. ❌ Thinking broadcast join fixes all performance issues
4. ❌ Creating too many or too few partitions
5. ❌ Not understanding the difference between storage and compute partitions

### Next Steps

- **Practice:** See `08_a_Partitions_Practice.ipynb` for hands-on demonstration
- **Experiment:** Try reading your own data and checking partition counts
- **Monitor:** Use Spark UI to visualize task distribution
- **Learn:** Understand how partitioning affects joins, aggregations, and shuffles


## Summary

### What We Learned

1. **Configuration Understanding**
   - Cluster: 4 executors × 4 cores = 16 total cores
   - Storage: 4 files × 256 GB = 1 TB total data

2. **The Problem**
   - Naive read creates 4 partitions → 4 tasks
   - Only 4 cores used → 12 cores idle (75% waste)

3. **The Solution**
   - Repartition to 32 partitions (2× core count)
   - Creates 32 tasks → all 16 cores utilized
   - Better load balancing and I/O hiding

4. **Best Practices**
   - Always check partition count
   - Aim for 2-4× core count
   - Understand storage vs compute partitions
   - Use repartition/coalesce appropriately

### Remember

> **"Files are not partitions. Storage layout ≠ Compute layout. Always verify your partition count matches your cluster capacity."**

This understanding is crucial for writing efficient, production-grade Spark applications!

---

## Next: Hands-On Practice

Now that you understand the concepts, proceed to **`08_a_Partitions_Practice.ipynb`** for a hands-on demonstration where you'll:
- Create actual Parquet files
- Read them and observe partition behavior
- Monitor Spark UI to see parallelism
- Repartition based on your machine's cores
- Compare performance before and after optimization
