## Optimized Queries

#### Tips to Optimize Performance

Partitioning: We define appropriate partitions to avoid unnecessary reads and improve query performance.
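
As a minimal sketch of a partitioned write, assuming the fact table is stored in Delta format and partitioned by the Year and Month columns described later in this notebook. The helper name `write_partitioned`, the overwrite mode, and the default partition columns are illustrative assumptions, not the pipeline's actual code.

```python
def write_partitioned(df, table_name, partition_cols=("Year", "Month")):
    """Write df as a table partitioned by the given columns.

    Hypothetical helper: the Delta format and the overwrite mode are
    assumptions made for illustration only.
    """
    (
        df.write.format("delta")
        .mode("overwrite")
        .partitionBy(*partition_cols)  # one directory per (Year, Month) pair
        .saveAsTable(table_name)
    )
```

It could be called as `write_partitioned(df_fact_sales, 'sales_case.gold_fact_sales_by_month')`, where the target table name is hypothetical.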

Partition Filters: If you know which specific partitions you want to read, applying filters on the partitions can significantly reduce the read time.

Repartitioning: If the data is unevenly distributed, you can use repartition() to redistribute the DataFrame based on a key column.
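
A small sketch of what this could look like. The helper name `repartition_by_key` and its parameters are hypothetical. Note that hash-partitioning by a key does not help if a single key value dominates the data.

```python
def repartition_by_key(df, key_col, num_partitions=None):
    """Redistribute rows so that rows with the same key share a partition."""
    if num_partitions is None:
        # Without an explicit count, Spark uses spark.sql.shuffle.partitions.
        return df.repartition(key_col)
    return df.repartition(num_partitions, key_col)
```

For example, `repartition_by_key(df_fact_sales, "sk_client", 64)` would shuffle the fact table into 64 partitions keyed by client; the value 64 is an arbitrary illustration.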

Compression Codec: We use Snappy, which favors fast compression and decompression over the highest possible compression ratio.

Shuffle Partitions: We set spark.sql.shuffle.partitions explicitly to control parallelism during shuffle-heavy operations such as joins and aggregations.

Caching and Broadcast Joins: Small dimension tables that are accessed frequently can be cached, and broadcast joins can speed up joins between the fact table and the dimension tables.
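
The two settings above can be sketched as follows. Both helper names are hypothetical, and the right partition count depends on the cluster and data volume, so no value is suggested here.

```python
def set_shuffle_partitions(spark, num_partitions):
    """Set the partition count used by shuffles in joins and aggregations."""
    spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))


def cache_dimension(df):
    """Cache a small dimension DataFrame and materialize it immediately."""
    cached = df.cache()  # cache() is lazy: nothing is stored yet
    cached.count()       # running an action populates the cache
    return cached
```

A cached dimension should be released with `unpersist()` once it is no longer needed.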

In [0]:
%run ./Utils

In [0]:
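from delta.tables import DeltaTable
# Import only the functions used in this notebook; a wildcard import would
# silently shadow Python built-ins such as min, max and round.
# Note: this sum is the Spark aggregate function, not Python's built-in sum.
from pyspark.sql.functions import year, sum, broadcast, desc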

In [0]:
df_fact_sales = spark.read.table('sales_case.gold_fact_sales')
display(df_fact_sales.take(10))

SalesDate,sk_product,sk_category,sk_segment,sk_manufacturer,sk_client,Units,UnitPrice,UnitCost,SalesTotal,Year,Month
2011-03-04,77,2,6,1,5534,1,124.42,90.83,90.83,2011,3
2011-03-08,77,2,6,1,9318,1,124.42,90.83,90.83,2011,3
2011-03-08,77,2,6,1,4263,1,124.42,90.83,90.83,2011,3
2011-03-11,77,2,6,1,60129548537,1,124.42,90.83,90.83,2011,3
2011-03-25,77,2,6,1,8589939250,1,124.42,90.83,90.83,2011,3
2011-03-17,77,2,6,1,8589942648,1,124.42,90.83,90.83,2011,3
2011-03-17,77,2,6,1,34359742238,1,124.42,90.83,90.83,2011,3
2011-03-17,77,2,6,1,42949673405,1,124.42,90.83,90.83,2011,3
2011-03-17,77,2,6,1,60129543047,1,124.42,90.83,90.83,2011,3
2011-03-17,77,2,6,1,42949675525,1,124.42,90.83,90.83,2011,3


In [0]:
df_dim_product = spark.read.table('sales_case.gold_dim_product')
display(df_dim_product.take(10))

ProductID,Product,Category,sk_product
585,Maximus UC-50,Urban,1
423,Maximus UM-28,Urban,2
533,Maximus UE-21,Urban,3
540,Maximus UC-05,Mix,4
681,Maximus UC-46,Urban,5
628,Maximus UC-93,Urban,6
547,Maximus UC-12,Mix,7
415,Maximus UM-20,Urban,8
653,Maximus UC-18,Urban,9
512,Maximus UR-01,Urban,10


In [0]:
df_dim_region = spark.read.table('sales_case.gold_dim_region')
display(df_dim_region.take(10))

City,State,Region,District,Country,PostalCode,sk_region
Marseilles,IL,Central,District #27,USA,61341,1
Rhome,TX,Central,District #22,USA,76078,2
Quakertown,PA,East,District #04,USA,18951,3
Troy,OH,East,District #16,USA,45373,4
Lima,OH,East,District #16,USA,45806,5
Wildwood,GA,East,District #19,USA,30757,6
Tulsa,OK,Central,District #21,USA,74105,7
Rolla,MO,Central,District #21,USA,65401,8
Painesville,OH,East,District #14,USA,44077,9
Summerville,GA,East,District #09,USA,30747,10


In [0]:
df_dim_category = spark.read.table('sales_case.gold_dim_category')
display(df_dim_category.take(10))

Category,sk_category
Mix,1
Urban,2
Youth,3
Accessory,4
Rural,5


In [0]:
df_dim_client = spark.read.table('sales_case.gold_dim_client')
display(df_dim_client.take(10))

ClientID,Name,Email,sk_region,sk_client
62235,Chava Mason,chava.mason@xyza.com,64,1
120680,Zoe Levy,zoe.levy@xyza.com,22,2
92819,Basia Combs,basia.combs@xyza.com,3,3
22111,Bernard Holcomb,bernard.holcomb@xyza.com,9,4
8221,Adrienne Blankenship,adrienne.blankenship@xyza.com,14,5
113970,Gage Cole,gage.cole@xyza.com,17,6
46548,Yoko Carver,yoko.carver@xyza.com,51,7
41122,Yuli Acosta,yuli.acosta@xyza.com,4,8
113981,Ryan Madden,ryan.madden@xyza.com,17,9
92814,Paula Mays,paula.mays@xyza.com,3,10


In [0]:
df_dim_manufacturer = spark.read.table('sales_case.gold_dim_manufacturer')
display(df_dim_manufacturer.take(10))

ManufacturerID,Manufacturer,sk_manufacturer
7,VanArsdel,1


In [0]:
df_dim_segment = spark.read.table('sales_case.gold_dim_segment')
display(df_dim_segment.take(10))

Segment,sk_segment
All Season,1
Extreme,2
Productivity,3
Regular,4
Convenience,5
Moderation,6
Youth,7
Accessory,8
Select,9


#### Optimization of Read with Predicate Pushdown

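Make sure that queries take advantage of predicate pushdown: filters are applied while the data is being read rather than after it has been loaded, so less data is scanned. Because the fact table is partitioned by Year and Month, a filter on these columns also lets Spark skip entire partitions.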


In [0]:
# Filter on the partition columns at read time so the predicate can be pushed down to the scan
df_fact_sales_filtered = spark.read.table('sales_case.gold_fact_sales').filter("Year = 2012 AND Month = 10")
display(df_fact_sales_filtered.take(10))

SalesDate,sk_product,sk_category,sk_segment,sk_manufacturer,sk_client,Units,UnitPrice,UnitCost,SalesTotal,Year,Month
2012-10-01,81,2,6,1,17179894748,1,102.37,74.73,74.73,2012,10
2012-10-08,81,2,6,1,25769831779,1,102.37,74.73,74.73,2012,10
2012-10-31,81,2,6,1,17179895037,1,102.37,74.73,74.73,2012,10
2012-10-10,81,2,6,1,24358,1,102.37,74.73,74.73,2012,10
2012-10-28,81,2,6,1,25068,1,102.37,74.73,74.73,2012,10
2012-10-19,81,2,6,1,25769830618,1,102.37,74.73,74.73,2012,10
2012-10-14,81,2,6,1,17179895723,1,102.37,74.73,74.73,2012,10
2012-10-22,81,2,6,1,34359764255,1,102.37,74.73,74.73,2012,10
2012-10-23,81,2,6,1,34359762699,1,102.37,74.73,74.73,2012,10
2012-10-19,81,2,6,1,51539631609,1,102.37,74.73,74.73,2012,10
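
One way to check whether a filter is actually pushed down is to inspect the physical plan. The sketch below assumes Spark 3.0 or later, where `DataFrame.explain` accepts a `mode` argument. The helper name `show_filtered_plan` is hypothetical.

```python
def show_filtered_plan(spark, table_name, condition):
    """Build the filtered DataFrame and print its formatted physical plan."""
    df = spark.read.table(table_name).filter(condition)
    # In the printed scan node, look for the condition under entries such as
    # PartitionFilters or PushedFilters; exact labels depend on the data source.
    df.explain(mode="formatted")
    return df
```

For example, `show_filtered_plan(spark, 'sales_case.gold_fact_sales', "Year = 2012 AND Month = 10")` prints the plan for the query above. If the condition only appears in a separate Filter step after the scan, it was not pushed down.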


#### Broadcast Join

1. Broadcast Join:

The broadcast() function is applied to the dimension tables (dim_produto_df and dim_cliente_df). Spark sends a full copy of each dimension table to every executor, so the join runs locally against each partition of the fact table and the fact table does not have to be shuffled across the network. This improves performance on distributed clusters.

2. Join with Broadcast:

The source data is joined to the dimension tables on the original key columns (IDProduto, IDCliente) to obtain the surrogate keys (SK_Produto, SK_Cliente).

3. Partitioning:

We add Year and Month columns to optimize the storage of the fact table and improve performance in temporal queries. The table is partitioned by these columns.

Advantages of Broadcast Join:

Reduces data movement during the join operation, as small dimensions are replicated to all nodes.
Increases performance when the dimension tables are significantly smaller than the fact table, which is common in data warehouse architectures.

Disadvantages of Broadcast Join:

Memory Limitation: The broadcast DataFrame must fit in memory on the driver and on every executor. If it is too large, the job can fail with out-of-memory errors.
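
The trade-off can be sketched as a size check before broadcasting. Spark itself broadcasts automatically below `spark.sql.autoBroadcastJoinThreshold`, which defaults to 10 MB. The helper names and the caller-supplied size estimate `dim_size_in_bytes` are hypothetical, and `hint("broadcast")` is used here with the same intent as `broadcast()`.

```python
DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # Spark's default threshold: 10 MB


def should_broadcast(size_in_bytes, threshold=DEFAULT_BROADCAST_THRESHOLD):
    """Return True when the estimated table size is within the threshold."""
    return 0 <= size_in_bytes <= threshold


def join_with_optional_broadcast(fact_df, dim_df, key, dim_size_in_bytes):
    """Join fact_df to dim_df, hinting a broadcast only for small dimensions."""
    if should_broadcast(dim_size_in_bytes):
        dim_df = dim_df.hint("broadcast")
    return fact_df.join(dim_df, on=key, how="inner")
```

A dimension that exceeds the threshold is joined without the hint, so Spark is free to choose another join strategy, such as a sort-merge join.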

In [0]:
df_fact_sales = spark.read.table('sales_case.gold_fact_sales')
df_dim_category = spark.read.table('sales_case.gold_dim_category')

# display(df_fact_sales.take(10))
# display(df_dim_category.take(10))

# Mark the small category dimension for broadcast
df_dim_category = broadcast(df_dim_category)

# Join on the shared surrogate key; joining by column name keeps a single sk_category column
joined_df = df_fact_sales.join(df_dim_category, on="sk_category")

# Sum SalesTotal per category and year, ordered by year and then by descending total
final_result_df = (
    joined_df.groupBy("Category", "year")
    .agg(sum("SalesTotal").alias("SalesTotal"))
    .orderBy("year", desc("SalesTotal"))
)


display(final_result_df.take(10))

Category,year,SalesTotal
Urban,2011,6462865.989999663
Accessory,2011,636777.2600000366
Mix,2011,577400.910000008
Youth,2011,56238.13000000067
Rural,2011,1556.88
Urban,2012,6947654.179999314
Accessory,2012,746667.1100000343
Mix,2012,486981.05000000895
Youth,2012,140185.2000000025
Rural,2012,119.76


#### Cleaning DataFrames from Memory

In [0]:
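import gc

# unpersist() only has an effect on DataFrames that were cached or persisted;
# for the DataFrames above, which were never cached, it is a harmless no-op.
df_fact_sales.unpersist()
df_dim_category.unpersist()
final_result_df.unpersist()

# Run Python's garbage collector on the driver after releasing the DataFrames
gc.collect()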
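
Because `unpersist()` only matters for DataFrames that were cached, one option is to pair the two calls so that cached data is always released. The context manager below is a hypothetical sketch of that idea, not part of the original notebook.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache df for the duration of a with-block, then release it."""
    df.cache()
    try:
        yield df
    finally:
        # Runs even if the block raises, so the cache is not leaked.
        df.unpersist()
```

It could be used as `with cached(df_dim_category) as dim: ...`, keeping the dimension in memory only while the joins inside the block run.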