In [1]:
!ls /opt/spark/data

Fire_Department_Calls_For_Service__2016__20240816.csv


In [2]:
from pyspark.sql import SparkSession

spark = (SparkSession
             .builder
             #.enableHiveSupport()
             .master("spark://spark-master:7077")
             .config("spark.sql.warehouse.dir", "/opt/spark/spark-warehouse")
             .getOrCreate()
             )

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/04/28 18:47:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
spark.conf.set("spark.sql.adaptive.enabled", "false")

In [6]:
spark.conf.get("spark.sql.shuffle.partitions")

'200'

# Constructs of repartitioning in Spark:

1. `def repartition(numPartitions: Int): Dataset[T]`: 
- Returns a new Dataset that has exactly numPartitions partitions. 
- The rows are distributed roughly evenly across the numPartitions partitions using a round-robin algorithm. No hash partitioning is involved.

2. `def repartition(partitionExprs: Column*): Dataset[T]`: 
- Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions. 
- The resulting Dataset is hash partitioned.
- This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL), where the framework ensures that all rows with the same key land in the same partition/reducer.
- It does not ensure that each key gets a partition/reducer of its own.
- So rows of more than one distinct key can land in the same partition/reducer.

3. `def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]`: 
- Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. 
- The resulting Dataset is hash partitioned. This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
- The only difference from the previous variant is that the number of partitions is the numPartitions argument rather than the default value (spark.sql.shuffle.partitions).

Let’s use the dataset below to test the different scenarios.
About Data: 



In [3]:
raw_df=(
    spark
    .read
    .format("csv")
    .option("header", True)
    .option("inferSchema", False)
    .load("/opt/spark/data/Fire_Department_Calls_For_Service__2016__20240816.csv")
)

                                                                                

In [5]:
raw_df.rdd.getNumPartitions()

10

Scenario 1: Writing data without physical partitioning

1. Using repartition(numPartitions: Int) without partitionBy:

In [None]:
# df: any DataFrame (for example, raw_df loaded above)
(
    df
    .repartition(5)
    .write
    .mode("overwrite")
    .save("dummy_location_1")
)

Spark uses a round-robin algorithm to distribute the rows evenly across 5 partitions, so the write produces 5 output files of roughly equal size. No hash partitioning is involved.
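As a rough illustration of round-robin distribution, the pure-Python sketch below deals made-up rows out to 5 buckets in turn. It is a simplified model under stated assumptions, not Spark's implementation (real partition sizes are only approximately equal), and `round_robin` is a hypothetical helper name.

```python
from collections import Counter

def round_robin(rows, num_partitions):
    """Assign each row to a partition index in turn (simplified model)."""
    return [i % num_partitions for i, _ in enumerate(rows)]

rows = [f"row-{i}" for i in range(23)]   # 23 made-up rows
assignment = round_robin(rows, 5)
sizes = Counter(assignment)              # number of rows per partition index

# Partition sizes differ by at most one row, regardless of the row contents.
print(sorted(sizes.values()))
```

The point of the sketch is that the target partition depends only on the row's position, never on any column value.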

2. Using repartition(partitionExprs: Column*) without partitionBy:

In [None]:
# Assumes df is a DataFrame with a "dept" column.
(
    df
    .repartition("dept")
    .write
    .mode("overwrite")
    .save("dummy_location_2")
)

In this case, Spark hash-partitions the rows into spark.sql.shuffle.partitions partitions (200 by default) and writes one file per non-empty partition, so the output has between 1 and 200 files. Let’s assume that the number of partitions of df is 200 (the default value of spark.sql.shuffle.partitions).

file_number = hash(KEY) % (number of partitions)

Here, the dept column has 12 distinct values, and the hash values of distinct strings are generally different. Because the file number is the hash of the key modulo the number of partitions, we get at most 12 output files (when every key yields a different remainder). It is quite possible that rows of more than one key land in the same file, giving fewer than 12 files, because different hash values can have the same remainder. For example:

210 % 200 = 410 % 200 = 610 % 200 = 10
So even if the hash value of each string is different, the result of the modulus can be the same.
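The modulus argument can be tried out with a small pure-Python sketch. Note the assumptions: `zlib.crc32` is only a deterministic stand-in for the hash function (it is not the hash Spark uses), the 12 dept values are made up, and `bucket` is a hypothetical helper.

```python
import zlib

def bucket(key: str, num_partitions: int) -> int:
    """Map a key to a partition index: hash(key) % num_partitions.
    zlib.crc32 is a stand-in hash, NOT the hash Spark uses."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Different hash values can share a remainder:
assert 210 % 200 == 410 % 200 == 610 % 200 == 10

depts = [f"dept_{i}" for i in range(12)]   # 12 hypothetical dept values
for n in (200, 4):
    used = {bucket(d, n) for d in depts}
    # Non-empty partitions (and hence files) never exceed min(n, distinct keys).
    print(n, len(used))
```

Whatever hash is used, the number of non-empty partitions is bounded by both the partition count and the number of distinct keys.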

3. Using repartition(numPartitions: Int, partitionExprs: Column*) without partitionBy:

In [None]:
# Assumes df is a DataFrame with a "dept" column.
(
    df
    .repartition(4, "dept")
    .write
    .mode("overwrite")
    .save("dummy_location_3")
)

This case is the same as the previous one, except that the number of partitions is passed as the numPartitions argument instead of being taken from spark.sql.shuffle.partitions.

file_number = hash(KEY) % 4

So we can get at most 4 files.

Scenario 2: Writing data with physical partitioning

1. Using partitionBy without repartition:


In [None]:
# Assumes df is a DataFrame with a "dept" column.
(
    df
    .write
    .partitionBy("dept")
    .mode("overwrite")
    .save("dummy_location_4")
)


Here, the number of files in each partition directory is determined by the number of Spark partitions of df (df.rdd.getNumPartitions(), which typically equals spark.sql.shuffle.partitions when df comes out of a shuffle). Say the rows with dept=xyz are scattered over n Spark partitions: Spark then writes n files for that value into the dummy_location_4/dept=xyz directory. The same holds for every other dept: the number of files in its directory equals the number of Spark partitions that contain at least one row of that dept.

In this example, since df is assumed to have 200 partitions, Spark writes at most 200 files per directory, depending on how many Spark partitions contain rows of the given dept.
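The counting rule above (one file per directory for every Spark partition that holds at least one row of that dept) can be sketched in plain Python. Everything here is a toy assumption: 8 partitions instead of 200, three made-up dept values, and randomly scattered rows.

```python
import random
from collections import defaultdict

random.seed(0)                 # reproducible toy data
num_partitions = 8             # hypothetical; stands in for df.rdd.getNumPartitions()
depts = ["hr", "it", "ops"]    # hypothetical dept values

# Each row is (spark_partition_index, dept); rows are scattered arbitrarily.
rows = [(random.randrange(num_partitions), random.choice(depts)) for _ in range(100)]

# One file per (dept directory, Spark partition) pair that holds at least one row.
partitions_per_dept = defaultdict(set)
for part_idx, dept in rows:
    partitions_per_dept[dept].add(part_idx)

files_per_dir = {d: len(p) for d, p in partitions_per_dept.items()}
print(files_per_dir)
```

Each count is bounded above by the number of Spark partitions, which is the "at most 200 files per directory" bound in the text.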



2. Using partitionBy with repartition(numPartitions):



In [None]:
# Assumes df is a DataFrame with a "dept" column.
(
    df
    .repartition(2)
    .write
    .partitionBy("dept")
    .mode("overwrite")
    .save("dummy_location_5")
)


Here, repartition(2) shuffles the data from the original df.rdd.getNumPartitions() partitions into exactly 2 partitions before it is written to storage. The rows of any given key can then sit in either one of the 2 partitions or in both, so we get at most 2 files per partition directory.

3. Using partitionBy with repartition(partitionExprs: Column*):

In [None]:
# Assumes df is a DataFrame with a "dept" column.
(
    df
    .repartition("dept")
    .write
    .partitionBy("dept")
    .mode("overwrite")
    .save("dummy_location_6")
)

Because df is repartitioned by dept, the number of partitions changes to spark.sql.shuffle.partitions, but the number of partitions that actually contain data is at most the number of distinct values of dept. The remaining partitions are empty.

Repartitioning by dept also places all rows of a key in exactly one partition: rows of the same key never land in more than one partition. A single partition may, however, contain the rows of more than one key.

Since the physical partitioning column is the same column, dept, and all rows of a given dept sit in exactly one Spark partition, we get exactly 1 file per partition directory on the file system.
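A toy model of this case, as a sketch only: `zlib.crc32` again stands in for the real hash function (Spark does not use crc32), and the 6 dept values and 4 partitions are made up so that key collisions are guaranteed.

```python
import zlib
from collections import defaultdict

def bucket(key, num_partitions):
    # Stand-in for a hash partitioner; Spark does not use crc32.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

num_partitions = 4
rows = [f"dept_{i % 6}" for i in range(60)]   # 60 rows, 6 made-up dept values

# After repartitioning by dept, a row's Spark partition depends only on its dept.
spark_partitions_per_dept = defaultdict(set)
for dept in rows:
    spark_partitions_per_dept[dept].add(bucket(dept, num_partitions))

# partitionBy("dept") writes one file per (dept directory, Spark partition) pair.
files_per_dir = {d: len(p) for d, p in spark_partitions_per_dept.items()}
non_empty = set().union(*spark_partitions_per_dept.values())

print(files_per_dir)    # every directory gets exactly one file
print(len(non_empty))   # fewer non-empty partitions than keys: some are shared
```

With 6 keys and only 4 partitions, at least one Spark partition must hold more than one dept, yet each dept directory still receives a single file.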

Here, we have one more scenario:

In [None]:
# Assumes df is a DataFrame with "gender" and "dept" columns.
(
    df
    .repartition("gender")
    .write
    .partitionBy("dept")
    .mode("overwrite")
    .save("dummy_location_7")
)

Guess: how many files will be written to storage?

4. Using partitionBy with repartition(numPartitions: Int, partitionExprs: Column*):

In [None]:
# Assumes df is a DataFrame with a "dept" column.
(
    df
    .repartition(4, "dept")
    .write
    .partitionBy("dept")
    .mode("overwrite")
    .save("dummy_location_8")
)

This is the same as the previous case, except that the partition index is hash_value(dept) % 4, so the data as a whole is spread across at most 4 Spark partitions. All rows of the same key (dept) still sit in exactly one of those partitions.

Since all rows of each dept are in a single Spark partition and the physical partitioning column is also dept, we again get exactly 1 file per partition directory on the file system.
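To check the file counts claimed in these scenarios, one option is to count the data files in each output directory after a write. The helper below is a hedged sketch: `count_part_files` is a hypothetical name, it assumes data file names start with `part-`, and the demo runs on a made-up temporary directory tree rather than on real Spark output.

```python
import tempfile
from collections import Counter
from pathlib import Path

def count_part_files(output_dir):
    """Count data files (names starting with 'part-') per directory, relative to output_dir."""
    root = Path(output_dir)
    counts = Counter()
    for f in root.rglob("part-*"):
        if f.is_file():
            counts[f.parent.relative_to(root).as_posix()] += 1
    return dict(counts)

# Demo on a made-up directory tree that mimics a partitionBy("dept") output.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("dept=a/part-00000.parquet", "dept=b/part-00000.parquet",
                 "dept=b/part-00001.parquet", "_SUCCESS"):
        path = Path(tmp) / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    counts = count_part_files(tmp)

print(counts)
```

On a real run, something like `count_part_files("dummy_location_8")` could be compared against the expected one file per dept directory.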