**Repartition vs. Coalesce:**

| **Feature**                 | **repartition()**                                  | **coalesce()**                                      |
|-----------------------------|---------------------------------------------------|-----------------------------------------------------|
| **Functionality**            | Redistributes the data across a new set of partitions; can increase or decrease the partition count. | Reduces the partition count by merging existing partitions. |
| **Number of Partitions**     | Can increase or decrease the number of partitions. | Can only decrease the number of partitions. If a larger number is requested, the count stays unchanged. |
| **Shuffling**                | Always performs a full shuffle, which moves data across the network and is costly. | Avoids a shuffle by default: each new partition is built from whole existing partitions. |
| **Use Case**                 | Used when you need to increase the number of partitions or redistribute the data evenly. | Best for reducing the number of partitions (e.g., before writing to disk). |
| **Performance**              | Slower because of the full shuffle.              | Usually faster because no shuffle occurs, although the merged partitions may be uneven in size. |
| **Typical Use Case**         | Scaling up data processing (e.g., increasing parallelism). | Reducing per-partition overhead (e.g., writing fewer, larger output files when saving data). |
| **Partition Splitting**      | Can split a large partition into smaller ones.    | Cannot split partitions; only merges them. |
| **Internal Mechanism**       | Triggers a full shuffle across all partitions; on an RDD it is equivalent to `coalesce(numPartitions, shuffle=True)`. | Merges existing partitions, grouping them by data locality where possible, without a full shuffle. |
| **API**                      | `DataFrame.repartition(numPartitions)`, `RDD.repartition(numPartitions)` | `DataFrame.coalesce(numPartitions)`, `RDD.coalesce(numPartitions, shuffle=False)` |
| **Example**                  | `df.repartition(10)`                             | `df.coalesce(2)`                                    |
| **Performance Consideration**| Costly in either direction because of the full shuffle; prefer `coalesce()` when you only need fewer partitions. | Efficient for reducing partitions, but a drastic reduction (e.g., to 1) can limit the parallelism of the upstream computation. It has no effect when asked to increase partitions. |



**Key Differences:**

**Repartition:**
-   Involves a full shuffle of the data.
-   Useful for increasing the number of partitions.
-   More computationally expensive.

**Coalesce:**
-   Merges existing partitions without a full shuffle.
-   Can only decrease the number of partitions.
-   More efficient than repartition() when reducing the number of partitions (e.g., from hundreds to a few), although a drastic reduction can lower the parallelism of the upstream computation.
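The difference between merging and shuffling can be illustrated with a toy model in plain Python. This is only a sketch and not how Spark implements either operation: the function names `toy_coalesce` and `toy_repartition` are hypothetical, partitions are modeled as simple lists, and the real `coalesce()` groups partitions by data locality rather than by position.

```python
from itertools import chain


def toy_coalesce(partitions, n):
    """Merge whole input partitions into n output partitions."""
    n = min(n, len(partitions))  # like coalesce(), never increases the count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Each input partition is appended in full to exactly one output
        # partition, so no partition is ever split.
        merged[i * n // len(partitions)].extend(part)
    return merged


def toy_repartition(partitions, n):
    """Redistribute every record across n output partitions (full shuffle)."""
    out = [[] for _ in range(n)]
    for i, record in enumerate(chain.from_iterable(partitions)):
        out[i % n].append(record)  # round-robin keeps the sizes even
    return out


# Four uneven input partitions holding the records 0..9.
parts = [[0, 1, 2, 3, 4, 5], [6], [7, 8], [9]]

coalesced = toy_coalesce(parts, 2)
repartitioned = toy_repartition(parts, 5)

print([len(p) for p in coalesced])      # [7, 3]
print([len(p) for p in repartitioned])  # [2, 2, 2, 2, 2]
print(len(toy_coalesce(parts, 8)))      # 4
```

In this model the coalesced partitions stay uneven because whole partitions are glued together, while the repartitioned output is balanced at the cost of touching every record. Asking the toy coalesce for 8 partitions leaves the count at 4.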

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySparkExample") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/12/23 09:21:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


**Examples for repartition:**

In [2]:
# Read the CSV file as an RDD of text lines (one element per line)
orders_base = spark.sparkContext.textFile("/Users/sugumarsrinivasan/Documents/data/sample_orders_1GB.csv")

Number of partitions before repartition():

In [3]:
orders_base.getNumPartitions()

43

Number of partitions after increasing the partition count with repartition():

In [4]:
# Increase from 43 to 50 partitions (triggers a full shuffle)
repartitioned_orders_base = orders_base.repartition(50)

In [5]:
repartitioned_orders_base.getNumPartitions()

50

In [5]:
repartitioned_orders_base.saveAsTextFile("/Users/sugumarsrinivasan/Documents/data/repartition_result")

                                                                                

Number of partitions after decreasing the partition count with repartition():

In [10]:
# repartition() can also reduce the partition count, again with a full shuffle
new_orders_base = repartitioned_orders_base.repartition(10)

In [11]:
new_orders_base.getNumPartitions()

10

![Spark repartition job screenshot](./screenshots/spark-repartition-job.png)
![Spark repartition stage screenshot](./screenshots/spark-repartition-stage.png)

**Examples for coalesce:**

Number of partitions before reducing the partition count:

In [12]:
orders_base.getNumPartitions()

43

In [4]:
# Reduce from 43 to 5 partitions without a full shuffle
new_orders_rdd = orders_base.coalesce(5)

Number of partitions after decreasing the partition count with coalesce():

In [5]:
new_orders_rdd.getNumPartitions()

5

In [6]:
new_orders_rdd.saveAsTextFile("/Users/sugumarsrinivasan/Documents/data/coalesce_result")

                                                                                

![Spark coalesce job screenshot](./screenshots/spark-coalesce-job.png)
![Spark coalesce stage screenshot](./screenshots/spark-coalesce-stage.png)
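The two walkthroughs suggest a simple rule of thumb: shrink with `coalesce()`, grow with `repartition()`. The sketch below encodes that rule. The helper names `pick_resize_method` and `resize_partitions` are hypothetical and not part of PySpark. The second helper only assumes an object that exposes `getNumPartitions()`, `coalesce()` and `repartition()`, as a PySpark RDD does.

```python
def pick_resize_method(current, target):
    """Return the RDD method that fits a change from current to target partitions."""
    if target < 1:
        raise ValueError("target must be a positive partition count")
    if target == current:
        return "none"         # nothing to do
    if target < current:
        return "coalesce"     # shrink without a full shuffle
    return "repartition"      # growing requires a full shuffle


def resize_partitions(rdd, target):
    """Apply the chosen method to any object exposing the RDD partition API."""
    method = pick_resize_method(rdd.getNumPartitions(), target)
    if method == "none":
        return rdd
    return getattr(rdd, method)(target)


# Partition counts taken from the examples in this notebook.
print(pick_resize_method(43, 50))  # repartition
print(pick_resize_method(43, 5))   # coalesce
print(pick_resize_method(43, 43))  # none
```

With the RDD from the cells above, `resize_partitions(orders_base, 5)` would call `orders_base.coalesce(5)`. The rule is deliberately simple: when the target is very small or the merged partitions must be evenly sized, `repartition()` may still be the better choice despite the shuffle.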