# COALESCE() + REPARTITION()

The example below starts a local session with `master("local[5]")`, so `spark.range(0, 20)` is created with 5 partitions and its 20 rows are spread evenly across them:

- `Partition 1: 0 1 2 3`
- `Partition 2: 4 5 6 7`
- `Partition 3: 8 9 10 11`
- `Partition 4: 12 13 14 15`
- `Partition 5: 16 17 18 19`

In [7]:
import pyspark
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("example-coalesce-repartition")
    .master("local[5]")
    .getOrCreate()
)

In [2]:
# 20 rows (ids 0 to 19); with master("local[5]") the range is split into 5 partitions
df = spark.range(0, 20)
print(df.rdd.getNumPartitions())


5
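The contiguous layout listed at the top of this section can be reproduced with a small plain-Python sketch. It only illustrates an even, contiguous split and is not Spark's implementation; `split_range` is a hypothetical helper introduced here.

```python
def split_range(start, end, num_partitions):
    """Split range(start, end) into contiguous, near-equal chunks (toy model)."""
    n = end - start
    chunks = []
    for i in range(num_partitions):
        lo = start + (i * n) // num_partitions
        hi = start + ((i + 1) * n) // num_partitions
        chunks.append(list(range(lo, hi)))
    return chunks


partitions = split_range(0, 20, 5)
for number, rows in enumerate(partitions, start=1):
    print(f"Partition {number}: {' '.join(map(str, rows))}")
```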


In [3]:
# Spark writes a directory named partition.csv with one part file per non-empty partition
df.write.mode("overwrite").csv("../files/partition/partition.csv")

### DataFrame repartition()
- `repartition()` can increase or decrease the number of partitions, and it always performs a full shuffle.
- The example below increases the number of partitions from 5 to 6, moving rows out of every existing partition. One possible resulting layout:

>
- `Partition 1 : 14 1 5`
- `Partition 2 : 4 16 15`
- `Partition 3 : 8 3 18`
- `Partition 4 : 12 2 19`
- `Partition 5 : 6 17 7 0`
- `Partition 6 : 9 10 11 13`

In [4]:
df2 = df.repartition(6)
print(df2.rdd.getNumPartitions())

6
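To make "moving rows out of every partition" concrete, here is a toy round-robin redistribution in plain Python. Dealing rows out in turn is a simplifying assumption, not Spark's actual shuffle, so the resulting layout differs from the one listed above; `round_robin_repartition` is a hypothetical helper.

```python
def round_robin_repartition(partitions, num_partitions):
    """Deal every row, from every input partition, out to new partitions in turn."""
    new_partitions = [[] for _ in range(num_partitions)]
    position = 0
    for rows in partitions:
        for row in rows:
            new_partitions[position % num_partitions].append(row)
            position += 1
    return new_partitions


old = [list(range(i, i + 4)) for i in range(0, 20, 4)]  # the 5 original partitions
new = round_robin_repartition(old, 6)

# Rows whose partition index changed, i.e. rows that had to move.
old_index = {row: i for i, rows in enumerate(old) for row in rows}
moved = [row for i, rows in enumerate(new) for row in rows if old_index[row] != i]

print(len(new), [len(rows) for rows in new])
print(f"{len(moved)} of 20 rows moved")
```

In this toy run rows leave every one of the five input partitions, which is the cost that a full shuffle implies.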


### DataFrame coalesce()
- `coalesce()` can only `decrease` the number of partitions; if a larger number is requested, the partition count stays as it is.
- It avoids a full shuffle by merging existing partitions, so it moves less data across partitions than `repartition()`.
- The example below reduces 5 partitions to 2: only the rows of 3 partitions move, and they are merged into the remaining 2 partitions.

>
- `Partition 1 : 0 1 2 3 8 9 10 11`
- `Partition 2 : 4 5 6 7 12 13 14 15 16 17 18 19`

In [5]:
df3 = df.coalesce(2)
print(df3.rdd.getNumPartitions())

2
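The layout listed above can be reproduced by merging whole partitions. The plain-Python sketch below is a toy model, not Spark's implementation: `merge_partitions` is a hypothetical helper, and the grouping passed to it is simply read off the listing, since Spark decides the actual grouping itself.

```python
def merge_partitions(partitions, groups):
    """Build one new partition per group by concatenating whole input partitions."""
    return [
        [row for index in group for row in partitions[index]]
        for group in groups
    ]


old = [list(range(i, i + 4)) for i in range(0, 20, 4)]  # the 5 original partitions

# Assumed grouping (0-based indexes), taken from the listing above:
# partitions 1 and 3 form new partition 1; partitions 2, 4 and 5 form new partition 2.
new = merge_partitions(old, [[0, 2], [1, 3, 4]])

for number, rows in enumerate(new, start=1):
    print(f"Partition {number} : {' '.join(map(str, rows))}")
```

Rows 0 to 7 never leave their original partition; only the rows of the other three partitions are appended, which is why merging moves less data than a full shuffle.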


### Default Shuffle Partition

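- Calling `groupBy()`, `join()` and similar wide transformations on a DataFrame shuffles data between executors, possibly across machines, and repartitions the result into `200 partitions by default`.
- This default comes from the `spark.sql.shuffle.partitions` configuration, which can be changed with `spark.conf.set("spark.sql.shuffle.partitions", n)`. The count actually observed can differ from 200, as in the output below, for example when the setting has been changed or when Adaptive Query Execution coalesces small shuffle partitions.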



In [6]:
df4 = df.groupBy("id").count()
print(df4.rdd.getNumPartitions())

5


Which of the following code blocks `reduces` a DataFrame from 12 to 6 partitions and performs a full shuffle?
>
- `DataFrame.repartition(12)`
- `DataFrame.coalesce(6).shuffle()`
- `DataFrame.coalesce(6)`
- `DataFrame.coalesce(6, shuffle=True)`
- `DataFrame.repartition(6)`

The code block displayed below contains an error. When the code block below has executed, it should have divided DataFrame transactionsDf into 14 parts, based on columns storeId and transactionDate (in this order). Find the error.

>
Code block:
- `transactionsDf.coalesce(14, ("storeId", "transactionDate"))`

In [9]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

data = [(1, 3, 4, 25, 1, None, 1587915332),
         (2, 6, 7, 2, 2, None, 1586815312),
         (3, 3, None, 25, 3, None, 1585824821),
         (4, None, None, 3, 2, None, 1583244275),
         (5, None, None, None, 2, None, 1575285427),
         (6, 3, 2, 25, 2, None, 1572733275)]

schema = StructType([StructField('transactionId', IntegerType(), True),
                     StructField('predError', IntegerType(), True),
                     StructField('value', IntegerType(), True),
                     StructField('storeId', IntegerType(), True),
                     StructField('productId', IntegerType(), True),
                     StructField('f', IntegerType(), True),
                     StructField('transactionDate', LongType(), True)])

transactionsDf = spark.createDataFrame(data=data, schema=schema)

In [10]:
# coalesce() accepts only the target number of partitions, so passing columns raises:
# TypeError: coalesce() takes 2 positional arguments but 3 were given
transactionsDf.coalesce(14, ('storeId', 'transactionDate'))

TypeError: coalesce() takes 2 positional arguments but 3 were given

In [15]:
# Fix: replace coalesce() with repartition(), and pass the column names as a list
# (square brackets) instead of a tuple (parentheses).
transactionsDf.repartition(14, ['storeId', 'transactionDate'])

DataFrame[transactionId: int, predError: int, value: int, storeId: int, productId: int, f: int, transactionDate: bigint]
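Repartitioning by columns places rows with equal values in those columns in the same partition. The plain-Python sketch below illustrates that idea with the key columns of the rows defined above. It uses `zlib.crc32` as a stand-in hash, which is an assumption for illustration and will not reproduce Spark's own partition assignment; `partition_by_key` and the extra row 7 are hypothetical.

```python
import zlib


def partition_by_key(rows, key_indexes, num_partitions):
    """Assign each row to a partition from a hash of its key columns (toy model)."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        # Stand-in hash; Spark uses its own hash function internally.
        bucket = zlib.crc32(repr(key).encode("utf-8")) % num_partitions
        partitions[bucket].append(row)
    return partitions


# (transactionId, storeId, transactionDate) taken from the data above,
# plus a hypothetical row 7 that repeats the key of row 1.
rows = [
    (1, 25, 1587915332),
    (2, 2, 1586815312),
    (3, 25, 1585824821),
    (4, 3, 1583244275),
    (5, None, 1575285427),
    (6, 25, 1572733275),
    (7, 25, 1587915332),
]

parts = partition_by_key(rows, key_indexes=(1, 2), num_partitions=14)
bucket_of = {row[0]: i for i, part in enumerate(parts) for row in part}

print(len(parts))                      # number of partitions, some of them empty
print(bucket_of[1] == bucket_of[7])    # rows with the same key share a partition
```

With only a handful of distinct keys, most of the 14 partitions stay empty, so the number of partitions requested should be chosen with the number of distinct key values in mind.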