### Default Partitions for RDD/DataFrame

`sc.defaultParallelism`

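- Read-only value giving the default number of partitions for RDD operations (for example `sc.parallelize`) when no partition count is passed.
- Usually equals the total number of executor cores in the Spark cluster.
- On the cluster used here it is 8, so the examples below start with 8 partitions.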

`spark.sql.files.maxPartitionBytes`

- Sets the max data size (in bytes) per partition when reading files (default: 128 MB).
- Lowering increases parallelism (more partitions); raising reduces parallelism (fewer partitions).

### Check Default Parameters

In [0]:
sc.defaultParallelism

Out[1]: 8

In [0]:
spark.conf.get("spark.sql.files.maxPartitionBytes")

Out[2]: '134217728b'

### Generate data with Spark Environment

In [0]:
from pyspark.sql.types import IntegerType
df = spark.createDataFrame(range(10), IntegerType())
df.rdd.getNumPartitions()

Out[4]: 8

### Inspect the Data in Each Partition

In [0]:
df.rdd.glom().collect()

Out[6]: [[Row(value=0)],
 [Row(value=1)],
 [Row(value=2)],
 [Row(value=3), Row(value=4)],
 [Row(value=5)],
 [Row(value=6)],
 [Row(value=7)],
 [Row(value=8), Row(value=9)]]
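
The uneven split above (two partitions hold two rows, six hold one) is consistent with slicing a local collection by position: partition `i` of `k` receives the elements from index `i*n // k` up to `(i+1)*n // k`. The following is a minimal pure-Python sketch of that rule, not Spark's own code; `slice_positions` is a hypothetical helper name used only for illustration.

```python
def slice_positions(items, num_slices):
    """Split items into num_slices contiguous chunks using integer index math."""
    items = list(items)
    n = len(items)
    return [
        # Chunk i covers indices [i*n // k, (i+1)*n // k)
        items[i * n // num_slices:(i + 1) * n // num_slices]
        for i in range(num_slices)
    ]


partitions = slice_positions(range(10), 8)
print(partitions)
# [[0], [1], [2], [3, 4], [5], [6], [7], [8, 9]]
```

Because 10 does not divide evenly by 8, the integer division rounds some boundaries down, which is why partitions 3 and 7 end up with two values each.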

### Read External File

In [0]:
dbutils.fs.ls("/FileStore/dbutils_test/")

Out[7]: [FileInfo(path='dbfs:/FileStore/dbutils_test/superstore.csv', name='superstore.csv', size=229150, modificationTime=1759061573000),
 FileInfo(path='dbfs:/FileStore/dbutils_test/superstore_sample.csv', name='superstore_sample.csv', size=210298, modificationTime=1759061578000),
 FileInfo(path='dbfs:/FileStore/dbutils_test/test.text', name='test.text', size=33, modificationTime=1759061711000)]

In [0]:
df = spark.read.option("inferSchema", True).option("header", True).option("sep", ";").csv("/FileStore/dbutils_test/")
df.rdd.getNumPartitions()

Out[8]: 3

Lowering `spark.sql.files.maxPartitionBytes` splits the files into smaller chunks, which changes the number of partitions:

In [0]:
spark.conf.set("spark.sql.files.maxPartitionBytes", "200000")
spark.conf.get("spark.sql.files.maxPartitionBytes") # verification


Out[29]: '200000'

In [0]:
df = spark.read.option("inferSchema", True).option("header", True).option("sep", ";").csv("/FileStore/dbutils_test/")
df.rdd.getNumPartitions()

Out[30]: 5
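
The partition counts 3 and 5 can be reproduced by hand. The sketch below is a simplified pure-Python model of how Spark appears to plan file splits, not its actual implementation. It assumes the default `spark.sql.files.openCostInBytes` of 4 MB, uncompressed (splittable) files, and a default parallelism of 8; `plan_partitions` is a hypothetical name.

```python
def plan_partitions(file_sizes, max_partition_bytes, default_parallelism,
                    open_cost_bytes=4 * 1024 * 1024):
    """Return a list of partitions, each a list of chunk sizes in bytes."""
    # Every file is padded with an "open cost" when estimating work per core.
    total = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total // default_parallelism
    max_split = min(max_partition_bytes, max(open_cost_bytes, bytes_per_core))

    # Cut each file into chunks of at most max_split bytes.
    splits = []
    for size in file_sizes:
        offset = 0
        while offset < size:
            splits.append(min(max_split, size - offset))
            offset += max_split
    splits.sort(reverse=True)

    # Pack chunks into partitions, largest first, closing a partition
    # as soon as the next chunk would not fit.
    partitions, current, current_size = [], [], 0
    for length in splits:
        if current and current_size + length > max_split:
            partitions.append(current)
            current, current_size = [], 0
        current.append(length)
        current_size += length + open_cost_bytes
    if current:
        partitions.append(current)
    return partitions


files = [229150, 210298, 33]  # sizes from the dbutils.fs.ls listing
print(len(plan_partitions(files, 134217728, 8)))  # 3
print(len(plan_partitions(files, 200000, 8)))     # 5
```

With the 128 MB default, each file fits in one chunk, and the 4 MB open cost keeps the files from being packed together, giving 3 partitions. At 200000 bytes, the two larger CSV files are each cut into two chunks, giving 5.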

### Single Partition

Creating a single partition that holds all the data hurts performance: one core processes the entire dataset while all other cores sit idle.

In [0]:
rdd2 = sc.parallelize(range(100), 1)
rdd2.getNumPartitions()

Out[31]: 1

### Repartition

In [0]:
from pyspark.sql.types import IntegerType
df = spark.createDataFrame(range(10), IntegerType())
df.rdd.glom().collect()

Out[32]: [[Row(value=0)],
 [Row(value=1)],
 [Row(value=2)],
 [Row(value=3), Row(value=4)],
 [Row(value=5)],
 [Row(value=6)],
 [Row(value=7)],
 [Row(value=8), Row(value=9)]]

In [0]:
df1 = df.repartition(20)
df1.rdd.getNumPartitions()

Out[33]: 20

In [0]:
df1.rdd.glom().collect()

Out[35]: [[],
 [Row(value=8)],
 [Row(value=9)],
 [],
 [Row(value=1)],
 [],
 [Row(value=6)],
 [],
 [Row(value=3)],
 [Row(value=0), Row(value=2), Row(value=4)],
 [],
 [Row(value=7)],
 [],
 [],
 [Row(value=5)],
 [],
 [],
 [],
 [],
 []]

In [0]:
df1 = df.repartition(2)
df1.rdd.getNumPartitions()

Out[38]: 2

In [0]:
df1.rdd.glom().collect()

Out[39]: [[Row(value=2),
  Row(value=3),
  Row(value=5),
  Row(value=6),
  Row(value=7),
  Row(value=9)],
 [Row(value=0), Row(value=1), Row(value=4), Row(value=8)]]

### Coalesce

In [0]:
df2 = df.coalesce(2)
df2.rdd.getNumPartitions()

Out[40]: 2

In [0]:
df2.rdd.glom().collect()

Out[41]: [[Row(value=0), Row(value=1), Row(value=2), Row(value=3), Row(value=4)],
 [Row(value=5), Row(value=6), Row(value=7), Row(value=8), Row(value=9)]]

`coalesce`

- Reduces the number of partitions by merging existing ones without a full shuffle (faster, less expensive).
- Often used to decrease partitions before writing data (e.g., to create a single output file).
- Does not guarantee even data distribution; some partitions may be much larger than others.
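
The merging behavior described above can be pictured as grouping neighboring partitions. This is a minimal pure-Python sketch, assuming no data-locality preferences and a target count no larger than the current count; it models the idea rather than Spark's code, and `coalesce_partitions` is a hypothetical name.

```python
def coalesce_partitions(partitions, num_target):
    """Merge contiguous groups of partitions; rows are never redistributed."""
    n = len(partitions)
    merged = []
    for i in range(num_target):
        start = i * n // num_target
        end = (i + 1) * n // num_target
        # Concatenate the rows of partitions[start:end] as they are.
        merged.append([row for part in partitions[start:end] for row in part])
    return merged


original = [[0], [1], [2], [3, 4], [5], [6], [7], [8, 9]]
print(coalesce_partitions(original, 2))
# [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

# Skewed input stays skewed, because existing partitions are only glued together.
skewed = [list(range(100)), [100], [101], [102]]
print([len(p) for p in coalesce_partitions(skewed, 2)])
# [101, 2]
```

The first result matches the `coalesce(2)` output shown earlier; the second illustrates why the resulting partition sizes can differ widely.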

`repartition`

- Changes the number of partitions by performing a full shuffle of the data (more expensive).
- Can either increase or decrease the number of partitions and spreads rows roughly evenly, although small datasets can still end up uneven, as in the outputs above.
- Useful for optimizing parallelism before expensive operations or balancing data across partitions.
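
To see why `repartition(n)` without a column tends toward balanced but not perfectly equal partitions, here is a rough pure-Python model of round-robin distribution in which each input partition starts at its own pseudo-random slot. This is an assumption-laden sketch: the random generator and seeding differ from Spark's, so it will not reproduce the exact placements shown in the notebook outputs. `round_robin_repartition` is a hypothetical name.

```python
import random


def round_robin_repartition(partitions, num_target):
    """Deal rows from every input partition across num_target output partitions."""
    out = [[] for _ in range(num_target)]
    for part_id, part in enumerate(partitions):
        # Each input partition picks its own starting slot (seeded by its id),
        # then hands out rows to consecutive slots, wrapping around.
        position = random.Random(part_id).randrange(num_target)
        for row in part:
            out[position % num_target].append(row)
            position += 1
    return out


original = [[0], [1], [2], [3, 4], [5], [6], [7], [8, 9]]
result = round_robin_repartition(original, 2)
print([len(p) for p in result])
```

Every row moves to a slot chosen independently of where it sat before, which is what makes this a full shuffle. With many rows per input partition the output sizes even out; with only one or two rows per input partition, as here, the starting slots dominate and some imbalance remains.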