# Dynamic Partition Pruning
## 1. First let's understand Static Partition Pruning
- Static partition pruning skips partitions that a query does not need, and the decision is made before the query executes
- Spark looks at the `WHERE` clause (or `filter`) that references the partition columns. For instance, with `df.filter(F.col("date") == '2025-01-01')`, if `df` is partitioned on `date`, only the `2025-01-01` partition is read
- It happens during the query planning (compilation) phase, because the filter value is a literal that is already known at that point
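
To make the idea concrete, here is a minimal pure-Python sketch (not Spark code) of static pruning over a Hive-style partitioned layout. The directory names and the helper `static_prune` are hypothetical, made up for illustration. The only point is that the partitions to read can be chosen from the literal in the filter, before any data file is opened.

```python
# Hypothetical Hive-style partition directories for a table partitioned on `date`.
PARTITION_DIRS = [
    "events/date=2024-12-31",
    "events/date=2025-01-01",
    "events/date=2025-01-02",
]


def static_prune(partition_dirs, column, literal):
    """Keep only the directories whose partition value equals the literal.

    The literal is known at planning time, so this selection needs no data:
    it only inspects directory names.
    """
    wanted = f"{column}={literal}"
    return [d for d in partition_dirs if d.rsplit("/", 1)[-1] == wanted]


to_read = static_prune(PARTITION_DIRS, "date", "2025-01-01")
print(to_read)
```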

## 2. Dynamic Pruning
- This type of pruning is done at run time
- It's useful for joins where the partitions needed from one dataset depend on values in the other. Say you want to analyse top-performing stores and you have two tables, `sales` and `popular_stores`. With dynamic partition pruning, the join does not need to read every partition of `sales` (presumably the larger table); it reads only the partitions for stores present in `popular_stores`
- For Dynamic Pruning to be effective, the other side of the join (the look-up table) should be small, ideally small enough to be broadcast
- **Why is it called Dynamic?**
    1. It occurs at run time
    2. The pruning condition is not static or hard-coded; the filter values are not known beforehand
- The dataset being pruned must be partitioned on the join column for Dynamic Pruning to work. In the example above, if `sales` were not partitioned on `store_id`, Spark would have to do a complete table scan
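
The store example can be modelled in a few lines of plain Python. This is a conceptual sketch, not how Spark implements DPP: the dictionaries standing in for `sales` and `popular_stores`, and the function `join_with_dpp`, are hypothetical. It shows the pruning set being computed from the small table at run time and then used to decide which `sales` partitions are scanned.

```python
# `sales` partitioned on store_id: partition key -> amounts stored in that partition.
sales_partitions = {
    101: [20.0, 35.5],
    102: [12.0],
    103: [99.9, 5.0, 7.5],
    104: [42.0],
}

# Small look-up table; its contents are only known once the query runs.
popular_stores = [
    {"store_id": 101, "name": "Downtown"},
    {"store_id": 103, "name": "Airport"},
]


def join_with_dpp(partitions, lookup_rows):
    """Inner-join sales with the look-up table, scanning only matching partitions."""
    # Step 1: evaluate the small side and collect its join keys (the "dynamic" filter).
    wanted = {row["store_id"]: row["name"] for row in lookup_rows}
    scanned = []  # partition keys that were actually read
    joined = []
    # Step 2: skip every partition whose key is not in the collected set.
    for store_id, amounts in partitions.items():
        if store_id not in wanted:
            continue
        scanned.append(store_id)
        joined.extend((store_id, wanted[store_id], amount) for amount in amounts)
    return scanned, joined


scanned, joined = join_with_dpp(sales_partitions, popular_stores)
print(scanned)      # partitions that were read
print(len(joined))  # number of joined rows
```

If `sales_partitions` were a flat list with no partition key, step 2 would have nothing to skip and every row would have to be inspected, which is the full-scan case described in the last bullet.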

In [1]:
from pyspark.storagelevel import StorageLevel
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .config("spark.driver.memory","10g")
         .master("local[*]")
         .appName("DPP")
         .getOrCreate()
        )

sc = spark.sparkContext
sc.setLogLevel("ERROR")

In [2]:
df_listening = spark.read.parquet("listening-activity-partitioned")
df_listening.printSchema()

root
 |-- activity_id: integer (nullable = true)
 |-- song_id: integer (nullable = true)
 |-- listen_datetime: timestamp (nullable = true)
 |-- listen_duration: integer (nullable = true)
 |-- listen_hour: integer (nullable = true)
 |-- listen_date: date (nullable = true)



In [3]:
df_listening.show(5)

+-----------+-------+--------------------+---------------+-----------+-----------+
|activity_id|song_id|     listen_datetime|listen_duration|listen_hour|listen_date|
+-----------+-------+--------------------+---------------+-----------+-----------+
|       4456|     16|2023-07-18 10:15:...|            151|         10| 2023-07-18|
|       4457|     65|2023-07-18 10:15:...|            181|         10| 2023-07-18|
|       4458|     60|2023-07-18 10:15:...|            280|         10| 2023-07-18|
|       4459|      3|2023-07-18 10:15:...|            249|         10| 2023-07-18|
|       4460|     45|2023-07-18 10:15:...|            130|         10| 2023-07-18|
+-----------+-------+--------------------+---------------+-----------+-----------+
only showing top 5 rows



In [4]:
df_songs = spark.read.csv("Spotify_Songs.csv", header=True, inferSchema=True)
df_songs.show(5)

+-------+------+---------+--------------------+
|song_id| title|artist_id|        release_date|
+-------+------+---------+--------------------+
|      1|Song_1|        2|2021-10-15 10:15:...|
|      2|Song_2|       45|2020-12-07 10:15:...|
|      3|Song_3|       25|2022-07-11 10:15:...|
|      4|Song_4|       25|2019-03-09 10:15:...|
|      5|Song_5|       26|2019-09-07 10:15:...|
+-------+------+---------+--------------------+
only showing top 5 rows



In [5]:
df_songs.printSchema()

root
 |-- song_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- artist_id: integer (nullable = true)
 |-- release_date: timestamp (nullable = true)



### **Problem Statement:** Measure the listening activity of songs on their release dates, for all songs released after 2019-12-31

In [7]:
# let's create a date column that can be used for DPP of listening activity
df_songs = df_songs.withColumnRenamed("release_date", "release_datetime")\
                    .withColumn("release_date", F.to_date(F.col("release_datetime"), "yyyy-MM-dd"))
df_songs.show(5)

+-------+------+---------+--------------------+------------+
|song_id| title|artist_id|    release_datetime|release_date|
+-------+------+---------+--------------------+------------+
|      1|Song_1|        2|2021-10-15 10:15:...|  2021-10-15|
|      2|Song_2|       45|2020-12-07 10:15:...|  2020-12-07|
|      3|Song_3|       25|2022-07-11 10:15:...|  2022-07-11|
|      4|Song_4|       25|2019-03-09 10:15:...|  2019-03-09|
|      5|Song_5|       26|2019-09-07 10:15:...|  2019-09-07|
+-------+------+---------+--------------------+------------+
only showing top 5 rows



In [11]:
# filter for only songs released after 2019-12-31
df_selected_songs = df_songs.filter(F.col("release_date") > F.lit('2019-12-31'))

# Join listening activity with selected songs
df_joined = df_listening.join(df_selected_songs,
                             on=(df_listening.song_id==df_selected_songs.song_id) & (df_listening.listen_date == df_selected_songs.release_date),
                             how="inner"
                            )

In [12]:
df_joined.explain(mode='formatted')

== Physical Plan ==
AdaptiveSparkPlan (8)
+- BroadcastHashJoin Inner BuildRight (7)
   :- Filter (2)
   :  +- Scan parquet  (1)
   +- BroadcastExchange (6)
      +- Project (5)
         +- Filter (4)
            +- Scan csv  (3)


(1) Scan parquet 
Output [6]: [activity_id#0, song_id#1, listen_datetime#2, listen_duration#3, listen_hour#4, listen_date#5]
Batched: true
Location: InMemoryFileIndex [file:/home/jovyan/listening-activity-partitioned]
PartitionFilters: [isnotnull(listen_date#5), dynamicpruningexpression(listen_date#5 IN dynamicpruning#148)]
PushedFilters: [IsNotNull(song_id)]
ReadSchema: struct<activity_id:int,song_id:int,listen_datetime:timestamp,listen_duration:int,listen_hour:int>

(2) Filter
Input [6]: [activity_id#0, song_id#1, listen_datetime#2, listen_duration#3, listen_hour#4, listen_date#5]
Condition : isnotnull(song_id#1)

(3) Scan csv 
Output [4]: [song_id#55, title#56, artist_id#57, release_date#58]
Batched: false
Location: InMemoryFileIndex [file:/home/jovyan/Spo

### Understanding the Query Plan with DPP


The plan has two main data sources:
- A partitioned parquet file containing listening activity (Scan 1)
- A CSV file containing song information (Scan 3)

The key indicator of dynamic partition pruning is in Scan (1):
```
PartitionFilters: [isnotnull(listen_date#5), dynamicpruningexpression(listen_date#5 IN dynamicpruning#148)]
```
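
When checking many queries, it can be handy to test for this marker programmatically. The helper below is a hypothetical convenience, not part of PySpark: it looks for `dynamicpruningexpression` inside `PartitionFilters` lines of plan text and returns the column names it finds. The regular expression is based only on the plan format shown in this notebook, so other Spark versions may need a different pattern.

```python
import re

# Matches e.g. "dynamicpruningexpression(listen_date#5 IN dynamicpruning#148)"
# and captures the column name ("listen_date").
_DPP_PATTERN = re.compile(r"dynamicpruningexpression\((\w+)#\d+ IN ")


def dpp_columns(plan_text):
    """Return the partition columns that carry a dynamic pruning filter."""
    columns = []
    for line in plan_text.splitlines():
        if line.strip().startswith("PartitionFilters:"):
            columns.extend(_DPP_PATTERN.findall(line))
    return columns


# Sample text copied from the plan output in this notebook.
sample_plan = (
    "PartitionFilters: [isnotnull(listen_date#5), "
    "dynamicpruningexpression(listen_date#5 IN dynamicpruning#148)]\n"
    "PushedFilters: [IsNotNull(song_id)]"
)
print(dpp_columns(sample_plan))
```

In a notebook, the plan text could be captured by wrapping `df_joined.explain(mode='formatted')` in `contextlib.redirect_stdout`, assuming `explain` prints from the Python side.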

Here's how the execution flows:

1. The songs table (Scan 3) is filtered first to keep songs released on or after 2020-01-01 (that is, after 2019-12-31):
```
Condition : (((gettimestamp(release_date#58, yyyy-MM-dd, TimestampType, Some(Etc/UTC), false) >= 2020-01-01 00:00:00)
```

2. This filtered songs dataset is broadcast (BroadcastExchange 6) since it's likely smaller than the listening activity data

3. The dynamic pruning happens through a subquery (listed as `Subquery:1` at the bottom of the full plan output):
   - It collects the release dates from the filtered songs
   - These dates are used to create a filter condition: `listen_date#5 IN dynamicpruning#148`
   - This filter is then pushed down to the parquet scan, meaning Spark will only read partitions of the listening activity data that match release dates from the songs table

The efficiency comes from three things:
1. The songs table is filtered first (removing pre-2020 songs) and, being the smaller side, is broadcast
2. The release dates that survive the filter are used to prune partitions of the listening activity data at run time, before they are read
3. Only partitions that could potentially match in the join are read from disk

Without dynamic pruning, Spark would need to read all partitions of the listening activity data and only then discard the non-matching rows in the join.
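
One way to see the difference is to rerun the same join with DPP switched off and compare the plans. `spark.sql.optimizer.dynamicPartitionPruning.enabled` is a real Spark SQL setting (enabled by default in Spark 3.x); the helper `set_dpp` is a hypothetical wrapper written for this notebook.

```python
# Spark SQL setting that controls dynamic partition pruning.
DPP_FLAG = "spark.sql.optimizer.dynamicPartitionPruning.enabled"


def set_dpp(spark, enabled):
    """Turn dynamic partition pruning on or off for the given session."""
    spark.conf.set(DPP_FLAG, "true" if enabled else "false")
```

With this notebook's session, calling `set_dpp(spark, False)`, rebuilding `df_joined`, and running `explain` again should show a parquet scan whose `PartitionFilters` no longer contains `dynamicpruningexpression`. Call `set_dpp(spark, True)` afterwards to restore the default.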