<p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/UNAL_Logosimbolo.svg/583px-UNAL_Logosimbolo.svg.png" alt="" width="1280" height="300" /></p>

# PARTITIONS

In PySpark, `repartition()` is used to change the number of partitions of a DataFrame. It reshuffles the data across the cluster to create exactly the number of partitions you specify.

## WHY IS REPARTITION IMPORTANT?

* Optimizing parallelism: More partitions can improve parallelism and speed up large jobs.
* Avoiding data skew: You can spread out uneven data more evenly across partitions.
* Preparing for joins: Matching partitioning before joins can reduce shuffling and improve performance.
* Efficient writing: You can control the number of output files when writing data (e.g., avoid too many small files).

## KEY CONCEPTS

### SHUFFLE

In Apache Spark, a shuffle is a costly operation in which data is redistributed across partitions and nodes in the cluster. It happens when wide transformations such as `groupBy()`, `join()`, or `distinct()` need rows with the same key to be brought together in the same partition.

Why is shuffle important?
Shuffle is slow and resource-heavy because data has to be serialized, written to local disk, and transferred over the network between nodes. This increases execution time and puts pressure on memory, disk, and network.

Common operations that cause shuffle:
  * `groupBy()` : Moves data to group by key.
  * `join()` : Combines data based on a key, across partitions.
  * `distinct()` : Removes duplicates, requiring data movement.
  * `sort()` / `orderBy()`: Requires total ordering, so data must be shuffled across partitions.
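
To make the data movement concrete, here is a toy model in plain Python (not Spark code). It assumes each record is routed to `hash(key) % n`, which is the general idea behind hash partitioning; the helper names `stable_hash` and `shuffle_by_key` are made up for this sketch.

```python
import zlib


def stable_hash(key):
    """Deterministic hash for the sketch (Python's built-in str hash changes per process)."""
    return zlib.crc32(key.encode("utf-8"))


def shuffle_by_key(partitions, num_output_partitions):
    """Route every (key, value) record to partition stable_hash(key) % n."""
    output = [[] for _ in range(num_output_partitions)]
    moved = 0
    for source_index, records in enumerate(partitions):
        for key, value in records:
            target_index = stable_hash(key) % num_output_partitions
            output[target_index].append((key, value))
            if target_index != source_index:
                moved += 1  # this record had to leave its original partition
    return output, moved


# Two input partitions with roles scattered across both of them
input_partitions = [
    [("admin", 1), ("developer", 1), ("tester", 1)],
    [("developer", 1), ("admin", 1), ("manager", 1)],
]
shuffled, moved = shuffle_by_key(input_partitions, 2)

# After the shuffle every role lives in exactly one partition, so counting is local
counts = {}
for records in shuffled:
    for role, one in records:
        counts[role] = counts.get(role, 0) + one

print(counts)
print(f"records moved between partitions: {moved}")
```

The number of moved records is what makes a real shuffle expensive: each one is serialized and possibly sent over the network.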

### PARTITION DATAFRAME

Each partition in Spark is a unit of work that can be processed in parallel.  
But it is not the same as a core.

### How it works:

```
Spark divides a DataFrame into partitions.
|_ Each partition is processed by a task.
  |_ A core can execute one task at a time.
```

- If you have 8 partitions and 4 cores → Spark will process 4 partitions at the same time, then the other 4.
- If you have 8 partitions and 8 cores → Spark can process all 8 in parallel.
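
The two cases above follow from a simple ceiling division. The sketch below assumes that all tasks take roughly the same time and that every core is available to the job; `task_waves` is a hypothetical helper, not a Spark API.

```python
import math


def task_waves(num_partitions: int, num_cores: int) -> int:
    """Number of rounds needed when each core runs one task at a time."""
    return math.ceil(num_partitions / num_cores)


print(task_waves(8, 4))   # 8 partitions on 4 cores
print(task_waves(8, 8))   # 8 partitions on 8 cores
print(task_waves(10, 4))  # the last round only keeps 2 of the 4 cores busy
```

When the partition count is not a multiple of the core count, the last round leaves some cores idle, which is one reason to pick a partition count that is a multiple of the available cores.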

### DIFFERENCE BETWEEN REPARTITION AND COALESCE

- **`repartition()`** reshuffles all the data and evenly distributes it across the specified number of partitions.  
  This ensures balanced partitions but involves a full shuffle, which can be expensive in terms of performance.

- **`coalesce()`** reduces the number of partitions by merging existing ones without moving much data.  
  It is faster and more efficient, especially when decreasing partitions.  
  However, it does not rebalance the data, so partition sizes may become uneven.

**Tip:**  
Use `repartition()` when you need evenly distributed data (e.g., before heavy joins or aggregations).  
Use `coalesce()` when you simply want to reduce partitions (e.g., before writing data) without the cost of a full shuffle.
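
The difference in balance can be illustrated with a toy model in plain Python that only tracks partition sizes. It is a simplification: real `coalesce()` also considers data locality when grouping partitions, and both function names are invented for this sketch.

```python
def coalesce_sizes(partition_sizes, n):
    """Merge neighbouring partitions into n groups without splitting any of them."""
    groups = [0] * n
    for index, size in enumerate(partition_sizes):
        groups[index * n // len(partition_sizes)] += size
    return groups


def repartition_sizes(partition_sizes, n):
    """Deal all rows out again one by one, which evens out the sizes."""
    total = sum(partition_sizes)
    return [total // n + (1 if i < total % n else 0) for i in range(n)]


skewed = [900, 50, 30, 20]  # row counts of four existing partitions

print(coalesce_sizes(skewed, 2))     # [950, 50]: cheap, but still skewed
print(repartition_sizes(skewed, 2))  # [500, 500]: balanced, at the cost of a full shuffle
```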

## DATASET

In [0]:
elements = [
    {"id": 1, "name": "July", "age": 34, "salary": 550, "role": "admin"},
    {"id": 1, "name": "July", "age": 34, "salary": 550, "role": "admin"},
    {"id": 2, "name": "Gabriel", "age": 29, "salary": 720, "role": "developer"},
    {"id": 3, "name": "Luis", "age": 42, "salary": 610, "role": "developer"},
    {"id": 4, "name": "John", "age": 51, "salary": 890, "role": "manager"},
    {"id": 5, "name": "Daniel", "age": 27, "salary": 480, "role": "developer"},
    {"id": 6, "name": "Mary", "age": 38, "salary": 700, "role": "admin"},
    {"id": 7, "name": "Monica", "age": 33, "salary": 460, "role": "tester"},
    {"id": 8, "name": "Andrea", "age": 45, "salary": 680, "role": "admin"},
    {"id": 9, "name": "Sebastian", "age": 31, "salary": 530, "role": "developer"},
    {"id": 10, "name": "Johana", "age": 26, "salary": 410, "role": "tester"},
    {"id": 11, "name": None, "age": 26, "salary": None, "role": "tester"},
    {"id": 12, "name": "Juan", "age": 45, "salary": 680, "role": None},
]
df = spark.createDataFrame(elements)
display(df)

## REPARTITION

#### GET

In [0]:
print(df.rdd.getNumPartitions())

#### SET


##### REPARTITION ONLY

This redistributes the rows across 10 partitions in round-robin fashion. It doesn’t consider any specific column, so a given row can end up in any partition. This is useful when you just want to balance the workload across more or fewer partitions, for example:

- When your DataFrame is highly unbalanced.
- When you need more parallelism (to speed up certain operations).

Disadvantage:

It doesn’t guarantee that related data will stay together. For instance, if you later perform a `groupBy` on a column, Spark might need to shuffle the data across the network because that column wasn’t considered during partitioning.

In [0]:
df_part_only = df.repartition(10)

In [0]:
print(df_part_only.rdd.getNumPartitions())

In [0]:
# with default partitions
df.groupBy("role").count().display()

In [0]:
# with repartitions
df_part_only.groupBy("role").count().display()

In [0]:
df_part_only = df.repartition(1)

##### REPARTITION WITH COLUMN

Here, Spark distributes the rows based on the specified column. It uses a hash of the column to decide which partition each row goes to. This is very useful when:

- You want to perform operations like `groupBy` or `join` using that column.
- You want to minimize shuffle during those steps since the data is already grouped by that key.

Advantage:

- Reduces data movement in subsequent operations on the key column.
- Improves performance if your workflow involves many aggregations or joins on that column.

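Recommendation:

If you’re going to run several aggregations or joins on the same key, for example `df.groupBy("key_column").agg(...)`, repartitioning by "key_column" beforehand can help. Keep in mind that `repartition()` is itself a shuffle, so it pays off mainly when the repartitioned DataFrame is reused.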

In [0]:
df_part_columns = df.repartition(2, "role")

In [0]:
print(df_part_columns.rdd.getNumPartitions())

In [0]:
df_part_columns.groupBy("role").count().display()

In [0]:
df_part_columns = df.repartition(1, "role")

## COALESCE

In PySpark, `coalesce()` is a function that reduces the number of partitions in a DataFrame to a specified number.  
It is a transformation operation that merges existing partitions, minimizing data movement.  
Unlike `repartition()`, `coalesce()` does not rebalance the partitions, which may result in partitions of uneven sizes.

### SET UP

In [0]:
df_coal = df.coalesce(1)
df_coal.groupBy("role").count().display()

### PROBLEM WITH SMALL FILES

The number of output files equals the number of partitions at write time, so a DataFrame with many small partitions produces many small files.

In [0]:
# original df
df.rdd.getNumPartitions()

In [0]:
# grouping and checking partitions
cls_partitions = df.groupBy("role").count()
print(cls_partitions.rdd.getNumPartitions())

In [0]:
df.write.format("csv").mode("ignore").save("file:///tmp/x.csv")

In [0]:
%sh
ls -l /tmp/x.csv

In [0]:
df.coalesce(1).write.format("csv").mode("ignore").save("file:///tmp/x.csv2")

In [0]:
%sh
ls -l /tmp/x.csv2

## PARTITION BY

When you use `partitionBy` while writing data in Spark, it organizes the output files into separate folders based on the distinct values of one or more columns.
This is known as storage partitioning and is different from in-memory partitions used during processing.

Why use partitionBy?

It significantly improves read and filter performance because Spark can skip entire folders during queries — a technique called partition pruning. 

**NOTE: Partition pruning is when Spark avoids reading entire partitions (folders) because it knows it doesn't need them based on your filter.**

Ideal when you frequently filter or query by specific columns (e.g., country, date, region).
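
The folder layout and the pruning step can be modelled in a few lines of plain Python. This is only an illustration of the idea, not how Spark implements it; the helper names are hypothetical. The one Spark-specific detail used here is that rows with a null partition value are written to a folder named `__HIVE_DEFAULT_PARTITION__`.

```python
from collections import defaultdict

NULL_FOLDER = "__HIVE_DEFAULT_PARTITION__"  # folder name Spark uses for null partition values


def partition_folder(column, value):
    """Return the folder name for one partition value, e.g. 'role=admin'."""
    return f"{column}={NULL_FOLDER if value is None else value}"


def layout(rows, column):
    """Group row ids by the folder they would be written to."""
    folders = defaultdict(list)
    for row in rows:
        folders[partition_folder(column, row[column])].append(row["id"])
    return dict(folders)


def prune(folders, column, wanted_value):
    """Keep only the folder that can contain rows where column == wanted_value."""
    target = partition_folder(column, wanted_value)
    return {name: ids for name, ids in folders.items() if name == target}


rows = [
    {"id": 1, "role": "admin"},
    {"id": 2, "role": "developer"},
    {"id": 6, "role": "admin"},
    {"id": 12, "role": None},
]
folders = layout(rows, "role")
admin_only = prune(folders, "role", "admin")

print(sorted(folders))
print(admin_only)
```

A filter on `role` only has to open one folder; the other folders are never read.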

![](https://miro.medium.com/v2/resize:fit:1072/1*n853xsKLRFCqxd3FyMyMdg.png)



### SIMPLE

In [0]:
df.write.format("csv").mode("ignore").partitionBy("role").save("file:///tmp/part.csv")

In [0]:
%sh
ls -R /tmp/part.csv

### MULTI COLUMN

In [0]:
df.write.format("csv").mode("ignore").partitionBy("role", "name").save("file:///tmp/part.csv2")

In [0]:
%sh
ls -R /tmp/part.csv2

### PARTITION PRUNING

In [0]:
# Read the whole partitioned dataset; Spark recovers the `role` column from the folder names
spark.read.csv("file:///tmp/part.csv", sep=",").display()

In [0]:
# Read a single partition folder directly; the `role` column is not part of the result
spark.read.csv("file:///tmp/part.csv/role=admin", sep=",").display()

### CHOOSING BAD PARTITION

In [0]:
df.write.format("csv").mode("ignore").partitionBy("id").save("file:///tmp/part.csv3")

In [0]:
%sh
ls -R /tmp/part.csv3
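
Partitioning by `id` creates one folder per row because almost every value is unique. A quick way to spot this before writing is to compare the number of distinct values per candidate column. The sketch below uses plain Python on a subset of the dataset; `distinct_count` is a hypothetical helper.

```python
def distinct_count(rows, column):
    """Number of distinct values, i.e. the number of folders partitionBy would create."""
    return len({row[column] for row in rows})


rows = [
    {"id": 1, "role": "admin"},
    {"id": 2, "role": "developer"},
    {"id": 3, "role": "developer"},
    {"id": 4, "role": "manager"},
    {"id": 5, "role": "developer"},
    {"id": 6, "role": "admin"},
]

for column in ("id", "role"):
    print(column, distinct_count(rows, column), "folders for", len(rows), "rows")
```

A column whose distinct count is close to the row count is usually a poor partition column, since it leads back to the small-files problem.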

## PARAMETERS

### maxPartitionBytes
Maximum number of bytes packed into a single partition when reading files (`spark.sql.files.maxPartitionBytes`, 128 MB by default).

In [0]:
size = spark.conf.get("spark.sql.files.maxPartitionBytes")
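# The value is usually a byte count with a trailing "b" (e.g. "134217728b"); strip it only if present
print(f"{int(size.rstrip('b')) / (1024 * 1024)} MB")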
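
As a rough rule of thumb, a large splittable file is read into about `file size / maxPartitionBytes` partitions. The sketch below is only that approximation: it ignores other settings that Spark also takes into account (such as the per-file open cost and the available parallelism), and `estimated_read_partitions` is a hypothetical helper.

```python
import math

DEFAULT_MAX_PARTITION_BYTES = 128 * 1024 * 1024  # 128 MB, the Spark default


def estimated_read_partitions(file_size_bytes, max_partition_bytes=DEFAULT_MAX_PARTITION_BYTES):
    """Approximate number of partitions for one large splittable file."""
    return max(1, math.ceil(file_size_bytes / max_partition_bytes))


one_gib = 1024 * 1024 * 1024
print(estimated_read_partitions(one_gib))                    # 1 GiB with the default setting
print(estimated_read_partitions(one_gib, 64 * 1024 * 1024))  # halving the limit doubles the partitions
print(estimated_read_partitions(10 * 1024 * 1024))           # a small file still needs one partition
```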