# Understanding Partitions in PySpark

Partitions are the fundamental unit of parallelism in PySpark, so understanding them is crucial for performance optimization.

## What are Partitions?

**Partitions** are chunks of data that are distributed across the cluster. Each partition:
- Is processed by one task
- Can be on a different node
- Enables parallel processing
- Affects performance and memory usage
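The "one task per partition" idea can be illustrated without Spark at all. The sketch below is only an analogy in plain Python: `split_into_partitions` and `process_partition` are hypothetical helpers, and a thread pool stands in for Spark's executors.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous chunks (hypothetical helper)."""
    n = len(data)
    return [data[i * n // num_partitions:(i + 1) * n // num_partitions]
            for i in range(num_partitions)]

def process_partition(partition):
    """One 'task': handles exactly one partition and returns a partial result."""
    return sum(x * 2 for x in partition)

data = list(range(1, 11))
partitions = split_into_partitions(data, 3)

# One worker per partition here; Spark would schedule one task per partition
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partial_sums = list(pool.map(process_partition, partitions))

# Combine the per-partition results, as a reduce step would
total = sum(partial_sums)
print(f"Partitions: {partitions}")
print(f"Partial sums: {partial_sums}, total: {total}")
```

Each chunk is processed independently and only the small partial results are combined at the end, which is the pattern Spark applies across a cluster.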

In [None]:
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("UnderstandingPartitions") \
    .getOrCreate()

sc = spark.sparkContext
print(f"Default parallelism: {sc.defaultParallelism}")

## Creating RDDs with Different Partitions

In [None]:
# Create RDD with default partitions
numbers_default = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(f"Default partitions: {numbers_default.getNumPartitions()}")
print(f"Data: {numbers_default.collect()}")

# Create RDD with specific number of partitions
numbers_3_parts = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3)
print(f"3 partitions: {numbers_3_parts.getNumPartitions()}")
print(f"Data: {numbers_3_parts.collect()}")

## Partition Distribution

Let's see how data is distributed across partitions:

In [None]:
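# Function to see partition contents (for use with mapPartitionsWithIndex)
def print_partition_data(partition_id, data):
    items = list(data)  # materialize once; an iterator can only be consumed a single time
    print(f"Partition {partition_id}: {items}")
    return iter(items)

# Map partitions to see distribution: each partition becomes one list
# (glom() is the built-in equivalent of this mapPartitions call)
numbers_3_parts.mapPartitions(lambda x: [list(x)]).collect()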

In [None]:
# Better way to inspect partitions
def inspect_partition(partition_index, data):
    data_list = list(data)
    print(f"Partition {partition_index}: {data_list} (size: {len(data_list)})")
    return data_list

# foreachPartition runs on the executors, so on a cluster this output
# goes to the executor logs and may not appear in the notebook
print("Partition inspection:")
numbers_3_parts.foreachPartition(lambda x: print(f"Partition content: {list(x)}"))

# To get results back on the driver, use mapPartitionsWithIndex and collect()
partitioned_data = numbers_3_parts.mapPartitionsWithIndex(inspect_partition).collect()
print("\nCollected results are returned in partition order")
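Once partition contents have been collected as a list of lists (the shape `rdd.glom().collect()` returns), it can help to summarize how evenly the data is spread. The following is a minimal sketch in plain Python; `partition_stats` is a hypothetical helper, and the input is hard-coded rather than read from a live RDD.

```python
def partition_stats(partitions):
    """Summarize partition sizes from a list of lists (hypothetical helper)."""
    sizes = [len(p) for p in partitions]
    mean_size = sum(sizes) / len(sizes)
    return {
        "sizes": sizes,
        "min": min(sizes),
        "max": max(sizes),
        # Skew: largest partition relative to the average; 1.0 means perfectly even
        "skew": max(sizes) / mean_size if mean_size else 0.0,
    }

# Example input, assumed to have the same shape as rdd.glom().collect()
collected = [[1, 2, 3], [4, 5, 6], [7, 8, 9, 10]]
stats = partition_stats(collected)
print(stats)
```

A skew close to 1.0 suggests the work is balanced; a much larger value means one task will likely run far longer than the others.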

## Impact of Partition Count

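The partition count determines how many tasks can run at once and how much scheduling overhead each job carries. On a dataset this small, the timings mostly reflect task overhead and warm-up rather than real speedup, so treat them as illustrative: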

In [None]:
import time

# Create larger dataset
large_data = list(range(1, 10001))  # 1 to 10000

# Test with different partition counts
for num_partitions in [1, 4, 8, 16]:
    rdd = sc.parallelize(large_data, num_partitions)
    start_time = time.perf_counter()  # monotonic clock, better suited to timing
    total = rdd.map(lambda x: x * 2).reduce(lambda a, b: a + b)
    end_time = time.perf_counter()

    print(f"Partitions: {num_partitions}, Time: {end_time - start_time:.3f}s, Result: {total}")
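Why both too few and too many partitions hurt can be seen in a toy cost model. This is a sketch under stated assumptions, not a description of Spark's scheduler: work divides evenly, tasks run in "waves" of one task per core, and every task pays a fixed overhead. The function `estimate_job_time` and the 0.05 s overhead are made up for illustration.

```python
import math

def estimate_job_time(total_work_s, num_partitions, cores, per_task_overhead_s=0.05):
    """Toy estimate: tasks run in waves of `cores`, each task pays a fixed overhead."""
    waves = math.ceil(num_partitions / cores)
    time_per_task = total_work_s / num_partitions + per_task_overhead_s
    return waves * time_per_task

# 8 seconds of total work on 4 cores (both numbers are assumptions)
for p in [1, 4, 8, 16, 1000]:
    print(f"Partitions: {p:>4}, estimated time: {estimate_job_time(8.0, p, cores=4):.2f}s")
```

Under these assumptions a single partition leaves three cores idle, a handful of partitions per core costs little, and a thousand tiny partitions are dominated by per-task overhead.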

## Repartitioning Operations

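`repartition()` can increase or decrease the partition count and always performs a full shuffle. `coalesce()` merges existing partitions and, by default, avoids a shuffle, so it can only decrease the count: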

In [None]:
# Start with 2 partitions
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)
print(f"Original partitions: {rdd.getNumPartitions()}")

# repartition() - full shuffle
repartitioned = rdd.repartition(4)
print(f"After repartition(4): {repartitioned.getNumPartitions()}")

# coalesce() - reduce partitions without full shuffle
coalesced = repartitioned.coalesce(2)
print(f"After coalesce(2): {coalesced.getNumPartitions()}")
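The difference between the two operations can be sketched in plain Python. This is a conceptual model only, not how Spark implements either call: `coalesce_model` and `repartition_model` are hypothetical names, and the round-robin redistribution is a simplification of what a shuffle does.

```python
def coalesce_model(partitions, target):
    """Merge whole existing partitions into `target` groups; elements are never split up."""
    if target >= len(partitions):
        # Without a shuffle, the partition count cannot grow
        return [list(p) for p in partitions]
    merged = [[] for _ in range(target)]
    for i, part in enumerate(partitions):
        merged[i * target // len(partitions)].extend(part)
    return merged

def repartition_model(partitions, target):
    """Redistribute every element round-robin, as a stand-in for a full shuffle."""
    out = [[] for _ in range(target)]
    position = 0
    for part in partitions:
        for item in part:
            out[position % target].append(item)
            position += 1
    return out

uneven = [[1, 2], [3, 4], [5, 6], [7, 8, 9, 10]]
print(f"coalesce to 2:    {coalesce_model(uneven, 2)}")
print(f"repartition to 2: {repartition_model(uneven, 2)}")
```

In this model, merging keeps whatever imbalance the input already had, while redistributing every element evens out the sizes at the cost of moving all the data.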

## Partitioning Best Practices

### When to increase partitions:
- Large datasets
- CPU-intensive operations
- Available cluster resources

### When to decrease partitions:
- Small datasets
- I/O intensive operations
- Memory constraints

### Optimal partition size:
- Roughly 100-200 MB per partition (uncompressed)
- Roughly 2-4 partitions per available CPU core
- Balance between parallelism and overhead

In [None]:
# Example: Optimal partitioning
large_dataset = list(range(1, 100001))  # 100k elements

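# Estimate a partition count from the data size
# Rough estimate: assumes ~8 bytes per element; Python int objects are larger in practice
data_size_mb = len(large_dataset) * 8 / (1024 * 1024)
# Target ~128MB per partition, but never fewer than 2 partitions
optimal_partitions = max(2, int(data_size_mb / 128))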

print(f"Data size estimate: {data_size_mb:.1f}MB")
print(f"Suggested partitions: {optimal_partitions}")

# Create RDD with optimal partitioning
optimal_rdd = sc.parallelize(large_dataset, optimal_partitions)
print(f"Actual partitions: {optimal_rdd.getNumPartitions()}")
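The estimate above considers only data size. A slightly fuller heuristic, sketched below in plain Python, also tries to keep every core busy without creating tiny partitions. The function `suggest_partitions`, the 16 MB minimum, and the default of 2 tasks per core are assumptions chosen for illustration, not Spark settings.

```python
import math

MB = 1024 * 1024

def suggest_partitions(data_size_bytes, total_cores,
                       target_partition_mb=128, min_partition_mb=16, tasks_per_core=2):
    """Heuristic partition count (hypothetical helper, not part of PySpark)."""
    # Enough partitions that none exceeds the target size
    by_size = max(1, math.ceil(data_size_bytes / (target_partition_mb * MB)))
    # Enough partitions to give every core a few tasks
    by_cores = total_cores * tasks_per_core
    # Upper limit before partitions drop below the minimum useful size
    max_by_min_size = max(1, data_size_bytes // (min_partition_mb * MB))
    return max(by_size, min(by_cores, max_by_min_size))

print(suggest_partitions(10 * 1024 * MB, total_cores=8))   # 10 GiB on 8 cores
print(suggest_partitions(1024 * MB, total_cores=16))       # 1 GiB on 16 cores
print(suggest_partitions(1 * MB, total_cores=8))           # 1 MiB on 8 cores
```

Large inputs are sized by the per-partition target, mid-sized inputs are spread across the cores, and very small inputs stay in a single partition.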

## Key Takeaways

1. **Partitions = Parallelism**: More partitions = more parallel tasks, up to the number of available cores
2. **Size Matters**: Too small = scheduling overhead, too large = memory issues
3. **Shuffle Operations**: repartition() causes a full shuffle; coalesce() avoids one when reducing the partition count
4. **Monitor Performance**: Use Spark UI to understand partition distribution
5. **Tune for Workload**: Different workloads need different partition strategies

**Remember**: Understanding partitions is key to optimizing PySpark performance!