In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DAG").getOrCreate()
print("Let's learn DAGs")

Let's learn DAGs


In [2]:
transactions_file = "transactions.parquet"
df_transactions = spark.read.parquet(transactions_file)
#df_transactions.printSchema()

- The following job was created while reading the file.
- But why is there a job just for reading a file?
- At this point Spark reads only the file's metadata (the Parquet footer), so that it can infer the schema and plan its optimisations. No rows are loaded yet.
![image.png](attachment:5b1d88a9-2ff6-4487-bce6-b39daad669d6.png)

In [3]:
df_transactions.show(5, False)

+----------+----------+----------+---------------+----------+----+-----+---+-------------+------+-----------+
|cust_id   |start_date|end_date  |txn_id         |date      |year|month|day|expense_type |amt   |city       |
+----------+----------+----------+---------------+----------+----+-----+---+-------------+------+-----------+
|C0YDPQWPBJ|2010-07-01|2018-12-01|TZ5SMKZY9S03OQJ|2018-10-07|2018|10   |7  |Entertainment|10.42 |boston     |
|C0YDPQWPBJ|2010-07-01|2018-12-01|TYIAPPNU066CJ5R|2016-03-27|2016|3    |27 |Motor/Travel |44.34 |portland   |
|C0YDPQWPBJ|2010-07-01|2018-12-01|TETSXIK4BLXHJ6W|2011-04-11|2011|4    |11 |Entertainment|3.18  |chicago    |
|C0YDPQWPBJ|2010-07-01|2018-12-01|TQKL1QFJY3EM8LO|2018-02-22|2018|2    |22 |Groceries    |268.97|los_angeles|
|C0YDPQWPBJ|2010-07-01|2018-12-01|TYL6DFP09PPXMVB|2010-10-16|2010|10   |16 |Entertainment|2.66  |chicago    |
+----------+----------+----------+---------------+----------+----+-----+---+-------------+------+-----------+
only showing top 5 rows

- A **showString** job is created when the `.show()` method is called. Unlike the initial metadata-only job, it has an input, because it actually reads rows from the file.

![image.png](attachment:3ea845b1-19ca-4167-b358-c2b451068535.png)

#### DAG for the Show Job
Because the source is a Parquet file, Spark first converts the data from columnar format to row-based format.
  
##### ColumnarToRow

- **number of output rows**: 4,096. These are the rows that the ColumnarToRow step read from the Parquet files.
- **number of input batches**: 1. All 4,096 rows were processed together in memory as a single batch, which typically happens when the data is small enough to fit in one batch.
  
![image.png](attachment:0f2b284c-1d40-46fa-89c8-ffb2e47c311d.png)
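
To make the ColumnarToRow idea concrete, here is a minimal pure-Python sketch. It is only a toy model: Spark works on its own internal columnar batches, not on Python dictionaries, and the function name `columnar_to_rows` is made up for this illustration. The values are taken from the first rows shown above.

```python
# Toy model only: a "columnar batch" is represented as column name -> list of values.
columnar_batch = {
    "cust_id": ["C0YDPQWPBJ", "C0YDPQWPBJ", "C0YDPQWPBJ"],
    "city": ["boston", "portland", "chicago"],
    "amt": [10.42, 44.34, 3.18],
}


def columnar_to_rows(batch):
    """Turn one columnar batch into a list of row tuples (hypothetical helper)."""
    columns = list(batch.values())
    # zip(*columns) walks all columns in lockstep and yields one tuple per row.
    return list(zip(*columns))


rows = columnar_to_rows(columnar_batch)
for row in rows:
    print(row)
```

One batch of three columns goes in and three rows come out, which is the same relationship as "1 input batch, 4,096 output rows" in the metrics.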

In [4]:
customers_file = "customers.parquet"
df_customers = spark.read.parquet(customers_file)
#df_customers.printSchema()

## 1. DAG for Narrow Transforms


In [5]:
from pyspark.sql import functions as F

In [9]:
df_narrow_transform = df_customers.filter(F.col('city')=='boston')\
                                        .withColumn('first_name', F.split('name', ' ').getItem(0))\
                                        .withColumn('last_name', F.split('name', ' ').getItem(1))\
                                        .withColumn('age', F.col('age')+F.lit(5))\
                                        .select("cust_id", "first_name", "last_name", "age", "gender", "birthday")

# Let's write the DataFrame to trigger a Spark job - unlike .show(), this reads the whole dataset
# noop - "no operation": simulates a write but does not actually write anything. Good for testing

df_narrow_transform.write.format("noop").mode("overwrite").save('df_narrow_transform.parquet')

- All the `withColumn` and `select` calls are collapsed into a single Project step, and the filter is applied before it.

![image.png](attachment:eba32b6a-cd6a-41b8-bb83-3f94b0d6c34e.png)
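
The following pure-Python sketch shows why these transforms are called narrow: each partition can be filtered and projected on its own, without looking at any other partition. It is a toy model under stated assumptions, not Spark's implementation. The sample records and the helper name `filter_then_project` are invented for illustration.

```python
# Toy model: each inner list stands for one partition of df_customers.
# The records below are made-up sample data.
partitions = [
    [
        {"cust_id": "C1", "name": "Ada Lovelace", "age": 30, "city": "boston"},
        {"cust_id": "C2", "name": "Alan Turing", "age": 41, "city": "chicago"},
    ],
    [
        {"cust_id": "C3", "name": "Grace Hopper", "age": 52, "city": "boston"},
    ],
]


def filter_then_project(partition):
    """Apply the Filter step, then the single Project step, to one partition."""
    out = []
    for row in partition:
        # Filter: keep only customers from boston.
        if row["city"] != "boston":
            continue
        # Project: all derived columns are computed in one pass over the row.
        parts = row["name"].split(" ")
        out.append({
            "cust_id": row["cust_id"],
            "first_name": parts[0],
            "last_name": parts[1] if len(parts) > 1 else None,
            "age": row["age"] + 5,
        })
    return out


# Each partition is processed independently: no data moves between partitions.
result = [filter_then_project(p) for p in partitions]
print(result)
```

The number of partitions going in equals the number coming out, which is why no Exchange (shuffle) step appears in this DAG.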

## 2. DAG for Wide Transforms

### 2.1. SortMergeJoin

In [10]:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

In [12]:
df_joined = (df_transactions.join(df_customers, how="inner", on="cust_id"))

# Again, let's trigger a job using the noop write
df_joined.write.format("noop").mode("overwrite").save("df_joined.parquet")

- The join triggered 3 jobs
    - First: reading the df_transactions Parquet file. This involved a shuffle
    - Second: reading df_customers. This also involved a shuffle
    - Third: the join operation
    - Key note: the number of tasks in jobs 1 and 2 is the same as the number of partitions the files had
![image.png](attachment:0227d6ea-6bcd-41b0-9ef0-3244cb1e3514.png)

#### Reading the DAG
- Spark applied a not-null filter on cust_id, since a row with a null join key can never match in an inner join
![image.png](attachment:ac12ad83-493c-43ad-b051-715817ab1ce9.png)


- Then comes the **Shuffle**

![image.png](attachment:21b73323-34f7-49b6-ae22-2d3fd7ba6194.png)

- AQE - **Adaptive Query Execution**: AQE dynamically optimizes the join operation at runtime
    - Let's break it down for df_transactions:
    - number of coalesced partitions: 24 --> Spark merged the small shuffle partitions down to 24
    - number of skewed partitions: 1 --> It also found one partition with skew (a lot more data than the others)
    - number of skewed partition splits: 12 --> So it split the skewed partition into 12
    - number of partitions: 36 --> Final no. of partitions, consistent with 24 coalesced + 12 splits

![image.png](attachment:18fd0471-4d2a-4966-a2c2-472e390f46fd.png)
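
The coalesce-and-split arithmetic above can be sketched in plain Python. This is a simplified toy, not Spark's actual AQE algorithm: the function `plan_shuffle_read`, the `target` size and the `skew_factor` rule are assumptions chosen only to illustrate the idea that small partitions are merged and an oversized one is split.

```python
def plan_shuffle_read(sizes, target, skew_factor=5):
    """Toy planner: merge small partitions up to `target`, split skewed ones.

    A partition counts as skewed here when it is larger than `target` and
    larger than `skew_factor` times the median size (an assumed rule).
    """
    median = sorted(sizes)[len(sizes) // 2]
    plan = []
    pending = 0  # size accumulated in the coalesced partition being built
    for size in sizes:
        if size > skew_factor * median and size > target:
            # Close the coalesced partition in progress, then split the skewed one.
            if pending:
                plan.append(("coalesced", pending))
                pending = 0
            n_splits = -(-size // target)  # ceiling division
            plan.extend(("split", size / n_splits) for _ in range(n_splits))
        else:
            # Start a new coalesced partition if this one would exceed the target.
            if pending and pending + size > target:
                plan.append(("coalesced", pending))
                pending = 0
            pending += size
    if pending:
        plan.append(("coalesced", pending))
    return plan


# Eight small partitions of size 10 and one skewed partition of size 400.
sizes = [10] * 8 + [400]
plan = plan_shuffle_read(sizes, target=40)
n_coalesced = sum(1 for kind, _ in plan if kind == "coalesced")
n_splits = sum(1 for kind, _ in plan if kind == "split")
print(n_coalesced, n_splits, len(plan))  # 2 10 12
```

In this toy run the final partition count is the number of coalesced partitions plus the number of splits, the same relationship as 24 + 12 = 36 in the metrics above.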

- These are two parallel sort operations preparing the data for the join
- The SortMergeJoin is skew-aware (skew=true parameter) and outputs 39,790,092 rows

![image.png](attachment:76cc5f8d-4af2-4a7b-92b2-afc2ccb8fb8a.png)
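
As a rough illustration of what happens inside one partition after the shuffle, here is a minimal sort-merge join in plain Python. It is a sketch of the general technique, not Spark's code, and the sample rows are made up. Both sides are sorted on the key, then two cursors advance through them together.

```python
def sort_merge_join(left, right):
    """Inner join of two lists of (key, value) pairs using sort + merge."""
    # Sort step: corresponds to the two parallel Sort operations in the DAG.
    left = sorted(left, key=lambda kv: kv[0])
    right = sorted(right, key=lambda kv: kv[0])

    joined = []
    i = j = 0
    # Merge step: advance whichever side has the smaller key.
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lkey:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Emit every combination of the two runs.
            for _, lval in left[i:i_end]:
                for _, rval in right[j:j_end]:
                    joined.append((lkey, lval, rval))
            i, j = i_end, j_end
    return joined


# Made-up sample data: (cust_id, amt) and (cust_id, city).
transactions = [("C2", 44.34), ("C1", 10.42), ("C1", 3.18), ("C3", 2.66)]
customers = [("C1", "boston"), ("C2", "portland"), ("C4", "denver")]
joined = sort_merge_join(transactions, customers)
print(joined)
```

Keys that appear on only one side (C3 and C4 here) produce no output, as expected for an inner join.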

### 2.2. Broadcast Join

In [13]:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760)

In [14]:
df_broadcast_joined = (df_transactions.join(F.broadcast(df_customers), how="inner", on="cust_id"))

# Again, let's trigger a job using the noop write
df_broadcast_joined.write.format("noop").mode("overwrite").save("df_broadcast_joined.parquet")

- Here we have 2 jobs
    - First job: broadcasts the smaller dataset
    - Second job: performs the actual join operation

![image.png](attachment:2b22f294-247b-41b7-b573-cf1f7d64c125.png)

#### DAG for Broadcast

- **BroadcastExchange**
    - The smaller dataset is broadcast to all the nodes
    - Think of it like photocopying the smaller table and distributing the copies to all nodes
- **BroadcastHashJoin**
    - The join happens within each partition of the larger table. Since the smaller table has already been broadcast, there is no need for a shuffle
      
![image.png](attachment:cd7d6f10-d3dc-4e8d-907c-6b3b282120a5.png)
![image.png](attachment:056d7126-3844-4112-8c41-651ba7d7cc71.png)
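
The broadcast idea can be sketched in plain Python as follows. This is a toy model with made-up data, not Spark's implementation: the small table is turned into a hash map once, and every partition of the large table probes that map locally.

```python
def broadcast_hash_join(big_partitions, small_table):
    """Toy broadcast hash join over (key, value) pairs."""
    # BroadcastExchange: build one lookup table from the small side.
    # In Spark a copy of it would be shipped to every executor.
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)

    # BroadcastHashJoin: each partition of the big side is joined in place,
    # so rows of the big table never move between partitions.
    joined_partitions = []
    for partition in big_partitions:
        out = []
        for key, value in partition:
            for match in lookup.get(key, []):
                out.append((key, value, match))
        joined_partitions.append(out)
    return joined_partitions


# Made-up sample data: two partitions of transactions and a small customers table.
transaction_partitions = [
    [("C1", 10.42), ("C3", 2.66)],
    [("C2", 44.34), ("C1", 3.18)],
]
customers = [("C1", "boston"), ("C2", "portland")]
joined_partitions = broadcast_hash_join(transaction_partitions, customers)
print(joined_partitions)
```

The output has the same number of partitions as the large input, which mirrors the absence of an Exchange step on the large side of the DAG.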

## 3. DAG for Group By

In [15]:
df_city_counts = df_transactions.groupby('city').count()

df_city_counts.show()

+-------------+-------+
|         city|  count|
+-------------+-------+
|    san_diego|3977780|
|      chicago|3979023|
|       denver|3980274|
|       boston|3978268|
|      seattle|3980022|
|  los_angeles|3982028|
|     new_york|3977480|
|san_francisco|3977094|
| philadelphia|3978193|
|     portland|3979930|
+-------------+-------+



- The first stage basically reads in the data and shuffles because of the group by

![image.png](attachment:9fcadb7b-48dc-4093-9d44-b94987143b0e.png)

#### DAG for Group By
- **HashAggregate**
    - If you hover over it, you will see that this step is a partial aggregation within each partition.
    - So it counts the records per city within each partition

![image.png](attachment:e1351337-7bd8-4bbd-9561-97b06692c0ef.png)

- **Exchange**
    - The shuffle happens on city
    - All records with the same key go to the same partition. If City A is assigned to partition #1, then all the records with City A go to partition #1
    - The default no. of shuffle partitions here was 200
   
![image.png](attachment:45db2e6f-79bd-4832-9d4c-6edd0444deea.png)

- **AQEShuffleRead**
    - Reduces the no. of partitions from 200 to 1
    - The final count is then computed within that single remaining partition
  
![image.png](attachment:c5f0763e-3bce-4763-82c3-4a78fd60aed1.png)
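
The three steps above (partial aggregate, exchange, final aggregate) can be sketched in plain Python. This is a toy model under stated assumptions: `zlib.crc32` stands in for Spark's own hash function, 4 shuffle partitions are used instead of 200, and the input data is made up.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Pick a shuffle partition for a key (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def group_by_count(input_partitions, num_shuffle_partitions=4):
    # 1. HashAggregate (partial): count per city inside each input partition.
    partials = [Counter(partition) for partition in input_partitions]

    # 2. Exchange: send each (city, partial_count) to the partition chosen
    #    by the hash of the city, so equal cities always meet.
    shuffled = [[] for _ in range(num_shuffle_partitions)]
    for partial in partials:
        for city, count in partial.items():
            shuffled[partition_for(city, num_shuffle_partitions)].append((city, count))

    # 3. HashAggregate (final): add up the partial counts per city.
    result = {}
    for partition in shuffled:
        totals = Counter()
        for city, count in partition:
            totals[city] += count
        result.update(totals)
    return result


# Each inner list is one input partition holding the city of each transaction.
input_partitions = [
    ["boston", "chicago", "boston"],
    ["chicago", "denver"],
    ["boston"],
]
counts = group_by_count(input_partitions)
print(counts)
```

Only one small record per city and input partition crosses the shuffle, rather than every transaction, which is the point of doing the partial aggregation first.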


### 3.1. Group By Count Distinct

In [16]:
df_txn_per_city = df_transactions.groupby('cust_id').agg(F.countDistinct('city').alias('city_count'))

df_txn_per_city.show(5, False)

+----------+----------+
|cust_id   |city_count|
+----------+----------+
|CPP8BY8U93|10        |
|CYB8BX9LU1|10        |
|CFRT841CCD|10        |
|CA0TSNMYDK|10        |
|COZ8NONEVZ|10        |
+----------+----------+
only showing top 5 rows



- For this group by there are multiple HashAggregates involved
    - In the first one the functions list is empty --> this means that Spark is only finding the distinct pairs of (cust_id, city) within each partition

![image.png](attachment:4a8d0871-b02c-4b47-860e-31f1cfa869aa.png)

- The Exchange happens on the pair (cust_id, city).
    - For instance: (Joyan99, Pune) might go to p#1, so all the other transactions done by Joyan99 in Pune will also go to p#1
    - At the end we have 200 partitions

![image.png](attachment:72f1c30a-1ddb-4f68-910d-f253766b9188.png)

- AQE does its job and reduces the no. of partitions from 200 to 1
  
![image.png](attachment:b2753336-f082-486c-a8c9-8208d08ae88f.png)

- Here again we see two HashAggregates
    - The first one is again a distinct on (cust_id, city), needed because the same pair may have arrived from several input partitions
    - The second is a partial count of cities per cust_id

![image.png](attachment:63fba438-6a2b-467b-80b2-cfa425f3c9b5.png)

- This is again followed by an Exchange, this time on cust_id, and then by AQE

 ![image.png](attachment:3858e628-c802-4709-a446-dfee544e0cbb.png)
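
The whole count-distinct pipeline described in this section can be sketched in plain Python. As before, this is a toy model and not Spark's implementation: `zlib.crc32` is a stand-in hash, 4 shuffle partitions replace the 200 seen in the UI, and the helper names and sample data are invented for illustration.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Pick a shuffle partition for a key (crc32 is a stand-in hash)."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def shuffle(partitions, key_fn, num_partitions):
    """Exchange: route every record to the partition chosen by its key."""
    out = [[] for _ in range(num_partitions)]
    for partition in partitions:
        for record in partition:
            out[partition_for(key_fn(record), num_partitions)].append(record)
    return out


def count_distinct_cities(input_partitions, num_partitions=4):
    # HashAggregate with no functions: distinct (cust_id, city) per partition.
    local_distinct = [set(p) for p in input_partitions]
    # Exchange on the pair (cust_id, city).
    by_pair = shuffle(local_distinct, lambda pair: pair, num_partitions)
    # Distinct again: the same pair may have come from several input partitions.
    global_distinct = [set(p) for p in by_pair]
    # Partial count of cities per cust_id inside each partition.
    partial = [list(Counter(cust for cust, _ in p).items()) for p in global_distinct]
    # Exchange on cust_id.
    by_cust = shuffle(partial, lambda record: record[0], num_partitions)
    # Final count: add up the partial counts per cust_id.
    result = Counter()
    for partition in by_cust:
        for cust, count in partition:
            result[cust] += count
    return dict(result)


# Each inner list is one input partition of (cust_id, city) pairs.
input_partitions = [
    [("C1", "boston"), ("C1", "boston"), ("C2", "denver")],
    [("C1", "boston"), ("C1", "chicago"), ("C2", "denver")],
]
city_counts = count_distinct_cities(input_partitions)
print(city_counts)
```

The second distinct step matters here: ("C1", "boston") appears in both input partitions, and counting before removing that duplicate would give C1 three cities instead of two.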