In [None]:
import datetime as dt

from src.config import settings
from src.spark_lakehouse import get_spark_session

spark = get_spark_session("Repartition Example")

df = spark.read.json(settings.SPARK_CLUSTER_DATA_DIR + "sparkify_log_small.json")
df.printSchema()

## Explore & do some transformations and actions
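Watch how Spark executes the work, especially on the Executors tab of the Spark UI. Transformations are lazy, so nothing runs until an action is called. For example, `write` is an action: fill in your desired output path in the write cell, run it, and look at the Executors tab.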
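
Transformations such as `withColumn` only describe work, while actions such as `write` or `show` trigger it. The toy sketch below illustrates that idea in plain Python. It is not Spark's implementation, and `LazyDataset`, `to_seconds`, and the sample values are hypothetical names and data invented for illustration.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not a Spark class)."""

    def __init__(self, rows, steps=()):
        self._rows = list(rows)
        self._steps = tuple(steps)

    def map(self, fn):
        # "Transformation": only records the step and returns a new dataset.
        return LazyDataset(self._rows, self._steps + (fn,))

    def collect(self):
        # "Action": runs every recorded step over every row.
        out = []
        for row in self._rows:
            for fn in self._steps:
                row = fn(row)
            out.append(row)
        return out


calls = []  # records each time the function actually runs


def to_seconds(ts_ms):
    calls.append(ts_ms)
    return ts_ms // 1000


ds = LazyDataset([1000, 2000, 3000]).map(to_seconds)
print(len(calls))    # 0: the transformation alone did no work
print(ds.collect())  # [1, 2, 3]
print(len(calls))    # 3: the action ran the function once per row
```

In the real notebook the same effect shows up in the Spark UI: jobs appear only once an action runs.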


In [None]:
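# `ts` is epoch time in milliseconds; extract the day of the month.
# With no returnType given, the registered UDF returns a string column.
get_day = spark.udf.register(
    "get_day", lambda ts: dt.datetime.fromtimestamp(ts / 1000).day
)
df_day = df.withColumn("day", get_day(df.ts))
# show() is an action, so this line triggers a job.
df_day.select("day").distinct().show()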

In [None]:
df_day.write.partitionBy("day").csv(
    settings.SPARK_CLUSTER_DATA_DIR + "repartitioned_by_day", mode="overwrite"
)

## Repartition
Now try `repartition`. Write to a different path and look at the Executors tab again. What changed?

**Answer**: `repartition` changes the number of partitions the data is divided into. This changes how rows are distributed across the executors, which can affect performance and resource utilization. To build the requested number of partitions, Spark shuffles the data across the cluster, which adds network I/O and CPU usage during the shuffle.

In this particular case, it appears that writing with `partitionBy("day")` splits the output into more pieces (one directory per day, 3 in total) than `repartition(2)`, which matches the number of workers (2). With `partitionBy`, some workers therefore have to handle more than one output partition.
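
To make the difference concrete, here is a toy pure-Python model of the two layouts. It is a simplified sketch, not Spark's implementation. It assumes `repartition(n)` deals rows out roughly round-robin and that `write.partitionBy(key)` has each task write one file per distinct key value it holds. The function names and the sample days (8, 9, 10) are made up for illustration.

```python
from collections import defaultdict


def repartition_round_robin(rows, num_partitions):
    """Deal rows into num_partitions buckets, one row at a time."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


def files_written_with_partition_by(partitions, key):
    """Count files per `key=value` directory: each task (partition)
    writes one file for every distinct key value among its rows."""
    files = defaultdict(int)
    for part in partitions:
        for value in sorted({row[key] for row in part}):
            files[f"{key}={value}"] += 1
    return dict(files)


# Hypothetical data: 8 rows spread over 3 distinct days.
days = [8, 9, 10, 8, 9, 10, 8, 9]
rows = [{"user": u, "day": d} for u, d in enumerate(days)]

two_parts = repartition_round_robin(rows, 2)
print([len(p) for p in two_parts])
# [4, 4] -> two evenly sized partitions, hence two output files

print(files_written_with_partition_by(two_parts, "day"))
# {'day=8': 2, 'day=9': 2, 'day=10': 2} -> 3 directories, a file from each task
```

In this model `repartition(2)` yields one file per partition, while `partitionBy("day")` yields one directory per day, with each task contributing a file to every day it holds rows for.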

In [None]:
df_day.repartition(2).write.csv(
    settings.SPARK_CLUSTER_DATA_DIR + "repartitioned_2_partitions", mode="overwrite"
)