# Demonstrating Distributed Computing in PySpark

This notebook shows how PySpark performs distributed computing. Each example highlights how Spark splits data and runs operations in parallel across partitions and executors.

## 1. Parallelizing a Collection

Even simple lists can be split and processed in parallel using Spark's RDD API.

In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Create a list of numbers
numbers = list(range(1, 100001))

# Parallelize into 4 partitions
rdd = sc.parallelize(numbers, numSlices=4)

# Perform a distributed sum
total = rdd.sum()
print("Sum using distributed computing:", total)

## 2. Show Partition Distribution

To see how the data is split, count the elements that land in each partition.

In [0]:
def show_partition(index, iterator):
    yield f"Partition: {index}, Count: {sum(1 for _ in iterator)}"

partition_info = rdd.mapPartitionsWithIndex(show_partition).collect()
for info in partition_info:
    print(info)

## 3. DataFrame Example: Rows per Partition

DataFrames are split across partitions too. Let's see the distribution.

In [0]:
df = spark.range(0, 100000).repartition(4)

def partition_counter(iterator):
    yield sum(1 for _ in iterator)

counts = df.rdd.mapPartitions(partition_counter).collect()
for i, c in enumerate(counts):
    print(f"Partition {i}: {c} rows")





## 4. Parallel Processing: Heavy Computation Example

Let's simulate a slow operation and see how Spark speeds it up by running in parallel.

In [0]:
import time

def slow_square(x):
    time.sleep(0.01)  # Simulate a heavy computation
    return x*x

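# map() is lazy: nothing runs until an action such as take() or collect() is called.
# take(10) evaluates only as many elements as it needs, starting with the first
# partition, so it returns quickly but does not exercise all partitions.
# A full action such as collect() would spread the work across all 4 partitions.
squared_rdd = rdd.map(slow_square)
sample = squared_rdd.take(10)
print(sample)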

## 5. Explaining the Distributed Nature

- Spark breaks up your job into tasks and runs them in parallel across executors/cores.
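- The number of partitions caps how many tasks can run at once; set it with `numSlices` when parallelizing, or change it later with `repartition()` or `coalesce()`.
- In local mode with the master set to `local[*]`, Spark uses all available CPU cores on a laptop; on a cluster, it spreads tasks across executors on multiple machines.
- Check the Spark UI (served on port 4040 by default) to see how each job is divided into stages and tasks.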