## E-Commerce Data Curated Analytics 
### Dimensional Modeling & Aggregation

This notebook represents the final stage in the E-commerce ETL pipeline. It:
1. Creates a dimensional model (star schema) from processed data
2. Builds business aggregations for analytics
3. Performs data quality checks
4. Writes curated data to the data lake for consumption

### Configuration and Environment Setup

The following cell initializes the Spark environment and defines storage paths for the ETL process.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from pyspark.sql.types import *
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.types import IntegerType, BooleanType, StringType
from pyspark.sql.functions import col, date_format, year, month, dayofmonth, quarter, weekofyear, dayofweek, when, date_add, expr, lit, current_timestamp, monotonically_increasing_id, date_format
from pyspark.sql.functions import col, current_timestamp, count, countDistinct, sum as _sum, avg, when, hash as _hash

# Initialize Spark Session
spark = SparkSession.builder.getOrCreate()

# Define function for generating ADLS paths (reusing from your existing code)
def get_adls_path(container: str, folder: str) -> str:
    """
    Generate an ADLS path based on the container and folder.
    """
    storage_account = "ecomsalessa"
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{folder}/"

# Define source and destination paths
processed_container = "processed"
processed_folder = "ecommerce-dataset-l0"
processed_path = get_adls_path(processed_container, processed_folder)

curated_container = "curated"
curated_folder = "ecommerce-dataset-l1"
curated_path = get_adls_path(curated_container, curated_folder)

StatementMeta(ecomsparkpool, 29, 2, Finished, Available, Finished)

### Data Loading from Processed Layer

This section loads the processed datasets from the previous pipeline stage. These datasets have been:
1. Cleaned and standardized
2. Enhanced with derived fields
3. Enriched with English product category translations
4. Optimized for analytical processing

Data verification ensures all required fields and translations are present before proceeding.

In [2]:
# Load processed datasets
print("Loading processed datasets...")
customers_df = spark.read.parquet(processed_path + "customers/")
orders_df = spark.read.parquet(processed_path + "orders/")
order_items_df = spark.read.parquet(processed_path + "order_items/")
order_payments_df = spark.read.parquet(processed_path + "order_payments/")
order_reviews_df = spark.read.parquet(processed_path + "order_reviews/")
products_df = spark.read.parquet(processed_path + "products/")
sellers_df = spark.read.parquet(processed_path + "sellers/")
category_names_df = spark.read.parquet(processed_path + "category_names/")
geolocation_df = spark.read.parquet(processed_path + "geolocation/")

# Display schemas to verify data loading
print("Verifying loaded datasets...")
customers_df.printSchema()
orders_df.printSchema()

# Verify English category names are available
print("\nVerifying product categories:")
products_df.select("product_category_name", "product_category_name_english").show(5)
print(f"Products with English categories: {products_df.filter(F.col('product_category_name_english').isNotNull()).count()}")
print(f"Products without English categories: {products_df.filter(F.col('product_category_name_english').isNull()).count()}")

StatementMeta(ecomsparkpool, 29, 3, Finished, Available, Finished)

Loading processed datasets...
Verifying loaded datasets...
root
 |-- customer_id: string (nullable = true)
 |-- customer_unique_id: string (nullable = true)
 |-- customer_zip_code_prefix: integer (nullable = true)
 |-- customer_city: string (nullable = true)
 |-- customer_state: string (nullable = true)

root
 |-- order_id: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- order_status: string (nullable = true)
 |-- order_purchase_timestamp: timestamp (nullable = true)
 |-- order_approved_at: timestamp (nullable = true)
 |-- order_delivered_carrier_date: timestamp (nullable = true)
 |-- order_delivered_customer_date: timestamp (nullable = true)
 |-- order_estimated_delivery_date: timestamp (nullable = true)
 |-- shipping_time_days: integer (nullable = true)
 |-- delivery_time_days: integer (nullable = true)
 |-- total_delivery_time_days: integer (nullable = true)
 |-- is_delayed: boolean (nullable = true)
 |-- delay_days: integer (nullable = true)
 |-- is_approve

### Dimension Tables Creation

This section implements the dimensional model by creating standardized dimension tables:

1. **Customer Dimension**: Geographic and identification attributes of customers
2. **Product Dimension**: Product attributes including categories and physical dimensions
3. **Seller Dimension**: Seller attributes and location information
4. **Date Dimension**: A complete calendar with date hierarchies and attributes
5. **Geography Dimension**: Location data for spatial analysis

These dimension tables form the reference points for the star schema design.

In [3]:
# 1. CREATE DIMENSION TABLES
print("Creating dimension tables...")

# 1.1 Customer Dimension
dim_customer = customers_df.select(
    F.col("customer_id").alias("CustomerID"),
    F.col("customer_unique_id").alias("CustomerUniqueID"),
    F.col("customer_zip_code_prefix").alias("CustomerZipCodePrefix"),
    F.col("customer_city").alias("CustomerCity"),
    F.col("customer_state").alias("CustomerState")
).withColumn("RowEffectiveDate", F.current_timestamp()
).withColumn("RowExpirationDate", F.lit(None).cast("timestamp")
).withColumn("CurrentFlag", F.lit(True).cast("boolean"))

# 1.2 Product Dimension
from pyspark.sql.types import FloatType

# 1.2 Product Dimension
dim_product = products_df.join(
    category_names_df,
    products_df.product_category_name == category_names_df.product_category_name,
    "left"
).select(
    products_df.product_id.alias("ProductID"),
    products_df.product_category_name.alias("ProductCategoryName"),
    F.coalesce(category_names_df.product_category_name_english, F.lit("uncategorized")).alias("ProductCategoryNameEnglish"),
    
    F.col("product_weight_g").cast(FloatType()).alias("ProductWeightG"),
    F.col("product_length_cm").cast(FloatType()).alias("ProductLengthCm"),
    F.col("product_height_cm").cast(FloatType()).alias("ProductHeightCm"),
    F.col("product_width_cm").cast(FloatType()).alias("ProductWidthCm"),
    
    (F.col("product_length_cm").cast(FloatType()) *
     F.col("product_height_cm").cast(FloatType()) *
     F.col("product_width_cm").cast(FloatType())).alias("ProductVolumeCm3"),

    F.when(
        products_df.product_length_cm.isNull() |
        products_df.product_height_cm.isNull() |
        products_df.product_width_cm.isNull(), "Unknown"
    ).otherwise(
        F.when((products_df.product_length_cm.cast(FloatType()) *
                products_df.product_height_cm.cast(FloatType()) *
                products_df.product_width_cm.cast(FloatType())) < 1000, "Small")
         .when((products_df.product_length_cm.cast(FloatType()) *
                products_df.product_height_cm.cast(FloatType()) *
                products_df.product_width_cm.cast(FloatType())) < 8000, "Medium")
         .otherwise("Large")
    ).alias("SizeCategory")
).withColumn("RowEffectiveDate", F.current_timestamp()
).withColumn("RowExpirationDate", F.lit(None).cast("timestamp")
).withColumn("CurrentFlag", F.lit(True).cast("boolean"))

# 1.3 Seller Dimension
dim_seller = sellers_df.select(
    F.col("seller_id").alias("SellerID"),
    F.col("seller_zip_code_prefix").alias("SellerZipCodePrefix"),
    F.col("seller_city").alias("SellerCity"),
    F.col("seller_state").alias("SellerState")
).withColumn("RowEffectiveDate", F.current_timestamp()
).withColumn("RowExpirationDate", F.lit(None).cast("timestamp")
).withColumn("CurrentFlag", F.lit(True).cast("boolean"))

# 1.4 Date Dimension
# Generate date range from min to max order date
min_date = orders_df.agg(F.min("order_purchase_timestamp").cast("date")).collect()[0][0]
max_date = orders_df.agg(F.max("order_purchase_timestamp").cast("date")).collect()[0][0]

date_range = spark.sql(f"""
SELECT explode(sequence(to_date('{min_date}'), to_date('{max_date}'), interval 1 day)) as date
""")

dim_date = date_range.select(
    F.date_format("date", "yyyyMMdd").cast(IntegerType()).alias("DateKey"),
    F.date_format("date", "yyyy-MM-dd").alias("DateID"),
    F.col("date").alias("Date"),
    year("date").alias("Year"),
    month("date").alias("Month"),
    dayofmonth("date").alias("Day"),
    quarter("date").alias("Quarter"),
    weekofyear("date").alias("WeekOfYear"),
    dayofweek("date").alias("DayOfWeek"),
    when(dayofweek("date").isin(1, 7), True).otherwise(False).cast(BooleanType()).alias("IsWeekend"),
    date_format("date", "MMMM").alias("MonthName"),
    date_format("date", "EEEE").alias("DayName"),
    when(month("date") > 6, year("date") + 1).otherwise(year("date")).alias("FiscalYear"),
    when(month("date").between(7, 9), 1)
     .when(month("date").between(10, 12), 2)
     .when(month("date").between(1, 3), 3)
     .otherwise(4).alias("FiscalQuarter"),
    lit(None).cast(StringType()).alias("Holiday"),
    lit(False).cast(BooleanType()).alias("IsHoliday")
)

# 1.5 Geography Dimension
# Aggregate geolocation data to avoid duplicates
dim_geography = geolocation_df.groupBy("geolocation_zip_code_prefix").agg(
    F.first("geolocation_city").alias("City"),
    F.first("geolocation_state").alias("State"),
    F.avg("geolocation_lat").alias("Latitude"),
    F.avg("geolocation_lng").alias("Longitude")
).withColumnRenamed("geolocation_zip_code_prefix", "ZipCodePrefix") \
 .withColumn("Region", F.lit(None).cast("string")) \
 .withColumn("RowEffectiveDate", F.current_timestamp()) \
 .withColumn("RowExpirationDate", F.lit(None).cast("timestamp")) \
 .withColumn("CurrentFlag", F.lit(True).cast("boolean"))


StatementMeta(ecomsparkpool, 29, 4, Finished, Available, Finished)

Creating dimension tables...


### Fact Tables Creation

This section builds the central fact tables that contain measures and foreign keys to dimensions:

1. **Sales Fact**: Transaction-level sales data with order details and performance metrics
2. **Reviews Fact**: Customer feedback data with response time analysis

These fact tables contain the quantitative metrics that will be analyzed across various dimensions.

In [4]:
# 2. CREATE FACT TABLES
print("Creating fact tables...")

fact_sales = order_items_df.join(
    orders_df, "order_id"
).join(
    order_payments_df.groupBy("order_id").agg(
        F.first("payment_type").alias("payment_type")
    ), "order_id", "left"
).join(
    customers_df.select("customer_id", "customer_zip_code_prefix"), "customer_id", "left"
).join(
    sellers_df.select("seller_id", "seller_state"), "seller_id", "left"
).withColumn(
    "IsCrossState", F.when(F.col("customer_state") != F.col("seller_state"), True).otherwise(False).cast(BooleanType())
).select(
    F.col("order_id").alias("OrderID"),
    F.col("order_item_id").alias("OrderItemID"),
    F.col("customer_id").alias("CustomerID"),
    F.col("product_id").alias("ProductID"),
    F.col("seller_id").alias("SellerID"),
    F.date_format("order_purchase_timestamp", "yyyyMMdd").cast(IntegerType()).alias("DateKey"),
    F.col("order_status").alias("StatusID"),
    F.col("customer_zip_code_prefix").alias("ZipCodePrefix"),
    F.col("order_purchase_timestamp").alias("OrderPurchaseTimestamp"),
    F.col("order_delivered_customer_date").alias("OrderDeliveredCustomerDate"),
    F.col("price").alias("Price"),
    F.col("freight_value").alias("FreightValue"),
    (F.col("price") + F.col("freight_value")).alias("TotalItemValue"),
    F.col("shipping_days").alias("ShippingDays"),
    F.col("delivery_days").alias("DeliveryDays"),
    F.col("total_days").alias("TotalDays"),
    F.col("is_delayed").cast(BooleanType()).alias("IsDelayed"),
    F.col("delay_days").alias("DelayDays"),
    F.col("IsCrossState"),
    F.col("payment_type").alias("PaymentType")
)

# 2.2 Customer Reviews Fact Table
fact_reviews = order_reviews_df.join(
    orders_df.select("order_id", "customer_id", "order_purchase_timestamp"), "order_id"
).select(
    F.col("order_id").alias("OrderID"),
    F.col("review_id").alias("ReviewID"),
    F.col("customer_id").alias("CustomerID"),
    F.date_format("order_purchase_timestamp", "yyyyMMdd").cast(IntegerType()).alias("DateKey"),
    F.col("review_score").alias("ReviewScore"),
    F.col("review_comment_message").alias("ReviewCommentMessage"),
    F.col("review_creation_date").alias("ReviewCreationDate"),
    F.col("review_answer_timestamp").alias("ReviewAnswerTimestamp"),
    F.datediff("review_answer_timestamp", "review_creation_date").alias("ReviewResponseDays")
)


StatementMeta(ecomsparkpool, 29, 5, Finished, Available, Finished)

Creating fact tables...


### Business Aggregations Creation

This section generates pre-calculated aggregations for common business analytics needs:

1. **Sales by Category**: Product category performance metrics
2. **Sales by State**: Geographic sales distribution analysis
3. **Seller Performance**: Seller efficiency and delivery metrics
4. **Monthly Sales Trends**: Time-series analysis of sales patterns
5. **Order Status Analysis**: Order fulfillment and cancellation analysis
6. **Cross-State Analysis**: Performance metrics for orders shipped across state borders
7. **Product Size Analysis**: Impact of physical dimensions on sales and shipping
8. **Payment Method Analysis**: Payment preference patterns

Each aggregation includes relevant counts, sums, and averages to support business decision-making.

In [5]:
# 3. CREATE BUSINESS AGGREGATIONS
print("Creating business aggregations...")

# 3.1 Sales by Category
agg_sales_by_category = fact_sales.join(
    dim_product, "product_id"
).groupBy(
    col("product_category_name_english").alias("CategoryKey")
).agg(
    count("order_id").alias("OrdersCount"),
    countDistinct("customer_id").alias("UniqueCustomers"),
    _sum("price").alias("TotalSales"),
    avg("price").alias("AvgItemPrice"),
    avg("total_days").alias("AvgDeliveryTime"),
    _sum(when(col("is_delayed") == True, 1).otherwise(0)).alias("DelayedOrders")
).withColumn("LastUpdated", current_timestamp())

# 3.2 Sales by State
agg_sales_by_state = fact_sales.join(
    dim_customer, "customer_id"
).groupBy(
    col("customer_state").alias("CustomerState")
).agg(
    count("order_id").alias("OrdersCount"),
    countDistinct("customer_id").alias("UniqueCustomers"),
    _sum("price").alias("TotalSales"),
    avg("price").alias("AvgItemPrice"),
    avg("total_days").alias("AvgDeliveryTime")
).withColumn("LastUpdated", current_timestamp())

# 3.3 Seller Performance
agg_seller_performance = fact_sales.join(
    dim_seller, "seller_id"
).groupBy(
    col("seller_id").alias("SellerID"),
    col("seller_state").alias("SellerState")
).agg(
    count("order_id").alias("OrdersCount"),
    _sum("price").alias("TotalSales"),
    avg("freight_value").alias("AvgShippingCost"),
    avg("total_days").alias("AvgDeliveryTime"),
    _sum(when(col("is_delayed") == True, 1).otherwise(0)).alias("DelayedOrders"),
    ( _sum(when(col("is_delayed") == True, 1).otherwise(0)) / count("order_id") ).alias("DelayRate")
).withColumn("LastUpdated", current_timestamp())

# 3.4 Monthly Sales Trends
agg_monthly_sales = fact_sales.join(
    dim_date, "date_id"
).groupBy(
    col("year").alias("Year"),
    col("month").alias("Month")
).agg(
    count("order_id").alias("OrdersCount"),
    countDistinct("customer_id").alias("UniqueCustomers"),
    _sum("price").alias("TotalSales"),
    avg("price").alias("AvgItemPrice")
).orderBy("Year", "Month").withColumn("LastUpdated", current_timestamp())

# 3.5 Order Status Analysis
# (If you wish to persist this aggregation, adjust the table schema accordingly)
agg_order_status = fact_sales.join(
    dim_product, "product_id"
).groupBy(
    col("product_category_name_english").alias("CategoryKey"),
    col("order_status").alias("OrderStatus")
).agg(
    count("order_id").alias("OrdersCount"),
    _sum("price").alias("TotalSales"),
    countDistinct("customer_id").alias("UniqueCustomers")
).withColumn("LastUpdated", current_timestamp())

# 3.6 Cross-State Order Analysis
from pyspark.sql.functions import expr
agg_cross_state = fact_sales.join(
    dim_customer, "customer_id"
).join(
    dim_seller, "seller_id"
).join(
    dim_product, "product_id"
).withColumn(
    "IsCrossState", when(col("customer_state") != col("seller_state"), True).otherwise(False).cast(BooleanType())
).groupBy(
    col("product_category_name_english").alias("CategoryKey"),
    col("IsCrossState")
).agg(
    count("order_id").alias("OrdersCount"),
    _sum("price").alias("TotalSales"),
    avg("total_days").alias("AvgDeliveryTime"),
    ( _sum(when(col("is_delayed") == True, 1).otherwise(0)) / count("order_id") ).alias("DelayRate")
).withColumn("LastUpdated", current_timestamp())

# 3.7 Product Size Analysis
# Note: dim_product_with_size is assumed to be derived from dim_product with proper casting.
agg_size_analysis = fact_sales.join(
    dim_product_with_size, "product_id"
).groupBy(
    col("product_category_name_english").alias("CategoryKey"),
    col("size_category").alias("SizeCategory")
).agg(
    count("order_id").alias("OrdersCount"),
    _sum("price").alias("TotalSales"),
    avg("freight_value").alias("AvgShippingCost")
).withColumn("LastUpdated", current_timestamp())

# 3.8 Payment Method Analysis
agg_payment_methods = fact_sales.join(
    order_payments_df.select("order_id", "payment_type"),
    "order_id"
).groupBy(
    _hash("payment_type").alias("PaymentTypeKey")
).agg(
    count("order_id").alias("OrdersCount"),
    _sum("price").alias("TotalSales"),
    avg("price").alias("AvgOrderValue"),
    countDistinct("customer_id").alias("UniqueCustomers")
).withColumn("LastUpdated", current_timestamp())

# Write out payment method aggregation as an example:
agg_payment_methods.write.mode("overwrite").parquet(curated_path + "aggregates/payment_methods/")

# Optionally, display some sample data:
print("\nSample data from agg_payment_methods:")
agg_payment_methods.show(5)

# Data quality checks
print("\nData quality checks:")
print(f"Order status analysis record count: {agg_order_status.count()}")
print(f"Cross-state analysis record count: {agg_cross_state.count()}")
print(f"Size analysis record count: {agg_size_analysis.count()}")

print("\nNull values in key metrics:")
print("Order status analysis nulls in TotalSales:", 
      agg_order_status.filter(col("TotalSales").isNull()).count())

print("\nActual column names in agg_cross_state:")
print(agg_cross_state.columns)

print("Cross-state analysis nulls in AvgDeliveryTime:", 
      agg_cross_state.filter(col("AvgDeliveryTime").isNull()).count())

StatementMeta(ecomsparkpool, 29, 6, Finished, Available, Finished)

Creating business aggregations...
Creating order status analysis...
Creating product size analysis...
Creating payment method analysis...

Sample data from agg_payment_methods:
+------------+------------+--------------------+------------------+----------------+
|payment_type|orders_count|         total_sales|   avg_order_value|unique_customers|
+------------+------------+--------------------+------------------+----------------+
|      boleto|       22867|  2391525.6600000267| 104.5841457121628|           19614|
| credit_card|       86769|1.0974357299999023E7|126.47785845173993|           75991|
|     voucher|        6274|   659473.6399999988|105.11215173732847|            3766|
|  debit_card|        1691|  183758.74000000037|108.66868125369625|            1521|
+------------+------------+--------------------+------------------+----------------+


Data quality checks:
Order status analysis record count: 190
Cross-state analysis record count: 143
Size analysis record count: 194

Null val

### Data Persistence and Verification

This section:
1. Writes all dimension and fact tables to the curated layer
2. Persists aggregated datasets for quick access by reporting tools
3. Displays sample data to verify the output quality
4. Provides summary statistics on the processed data volume

The curated data is now optimized for consumption by business intelligence tools and dashboards.

In [6]:
# 4. WRITE TO CURATED CONTAINER
print("Writing dimension tables to curated container...")

# Write dimensions
dim_customer.write.mode("overwrite").parquet(curated_path + "dimensions/dim_customer/")
dim_product.write.mode("overwrite").parquet(curated_path + "dimensions/dim_product/")
dim_seller.write.mode("overwrite").parquet(curated_path + "dimensions/dim_seller/")
dim_date.write.mode("overwrite").parquet(curated_path + "dimensions/dim_date/")
dim_geography.write.mode("overwrite").parquet(curated_path + "dimensions/dim_geography/")

print("Writing fact tables to curated container...")
# Write facts
fact_sales.write.mode("overwrite").parquet(curated_path + "facts/fact_sales/")
fact_reviews.write.mode("overwrite").parquet(curated_path + "facts/fact_reviews/")

print("Writing aggregation tables to curated container...")
# Write aggregations
# Write all aggregations to curated layer
print("\nWriting aggregations to curated layer...")
agg_sales_by_category.write.mode("overwrite").parquet(curated_path + "aggregates/sales_by_category/")
agg_sales_by_state.write.mode("overwrite").parquet(curated_path + "aggregates/sales_by_state/")
agg_seller_performance.write.mode("overwrite").parquet(curated_path + "aggregates/seller_performance/")
agg_monthly_sales.write.mode("overwrite").parquet(curated_path + "aggregates/monthly_sales/")
agg_order_status.write.mode("overwrite").parquet(curated_path + "aggregates/order_status/")
agg_cross_state.write.mode("overwrite").parquet(curated_path + "aggregates/cross_state_analysis/")
agg_size_analysis.write.mode("overwrite").parquet(curated_path + "aggregates/size_analysis/")

# Show sample data from key tables
print("\nSample data from fact_sales:")
fact_sales.select("order_id", "customer_id", "price", "total_days", "is_delayed").show(5)

print("\nSample data from agg_sales_by_category:")
agg_sales_by_category.show(5)

print("\nSample data from agg_monthly_sales:")
agg_monthly_sales.show(5)

# Add these lines after your existing sample data displays
print("\nSample data from agg_order_status:")
agg_order_status.show(5)

print("\nSample data from agg_cross_state:")
agg_cross_state.show(5)

print("\nSample data from agg_size_analysis:")
agg_size_analysis.show(5)

print("\nData warehousing process completed successfully!")

# Final pipeline summary
print("\n=== ETL Pipeline Summary ===")
print(f"Processed {fact_sales.count()} sales records")
print(f"Created {dim_product.count()} product dimension records")
print(f"Created {dim_customer.count()} customer dimension records")
print(f"Created {dim_seller.count()} seller dimension records")
print(f"Generated {agg_sales_by_category.count() + agg_sales_by_state.count() + agg_seller_performance.count() + agg_monthly_sales.count() + agg_order_status.count() + agg_cross_state.count() + agg_size_analysis.count()} analytical aggregations")
print("Pipeline execution complete!")

StatementMeta(ecomsparkpool, 29, 7, Finished, Available, Finished)

Writing dimension tables to curated container...
Writing fact tables to curated container...
Writing aggregation tables to curated container...

Writing aggregations to curated layer...

Sample data from fact_sales:
+--------------------+--------------------+-----+----------+----------+
|            order_id|         customer_id|price|total_days|is_delayed|
+--------------------+--------------------+-----+----------+----------+
|995392413cee61cc1...|4bf24904ec428325a...|129.9|        18|     false|
|b39de9ed2bb8fd08a...|ed18b557140ff674f...| 98.9|        15|     false|
|b1a88554eb1f7f686...|5cf799d0ac88e1d32...|39.99|         7|     false|
|fe2ac30768e07e362...|be07f06f183227b5c...| 74.5|        10|     false|
|460673777918e7a42...|6f352988122e277e6...| 48.0|        17|     false|
+--------------------+--------------------+-----+----------+----------+
only showing top 5 rows


Sample data from agg_sales_by_category:
+-----------------------------+------------+----------------+---------

In [None]:
# 5. WRITE TO CURATED CONTAINER - CSVs in a single folder
csv_base_path = curated_path + "csvs/"

print("Writing dimension tables to curated container as CSVs...")

# Write dimensions as CSV with header
dim_customer.write.mode("overwrite").option("header", "true").csv(csv_base_path + "dim_customer/")
dim_product.write.mode("overwrite").option("header", "true").csv(csv_base_path + "dim_product/")
dim_seller.write.mode("overwrite").option("header", "true").csv(csv_base_path + "dim_seller/")
dim_date.write.mode("overwrite").option("header", "true").csv(csv_base_path + "dim_date/")
dim_geography.write.mode("overwrite").option("header", "true").csv(csv_base_path + "dim_geography/")

print("Writing fact tables to curated container as CSVs...")
# Write facts as CSV
fact_sales.write.mode("overwrite").option("header", "true").csv(csv_base_path + "fact_sales/")
fact_reviews.write.mode("overwrite").option("header", "true").csv(csv_base_path + "fact_reviews/")

print("Writing aggregation tables to curated container as CSVs...")
# Write aggregations as CSV
agg_sales_by_category.write.mode("overwrite").option("header", "true").csv(csv_base_path + "agg_sales_by_category/")
agg_sales_by_state.write.mode("overwrite").option("header", "true").csv(csv_base_path + "agg_sales_by_state/")
agg_seller_performance.write.mode("overwrite").option("header", "true").csv(csv_base_path + "agg_seller_performance/")
agg_monthly_sales.write.mode("overwrite").option("header", "true").csv(csv_base_path + "agg_monthly_sales/")
agg_order_status.write.mode("overwrite").option("header", "true").csv(csv_base_path + "agg_order_status/")
agg_cross_state.write.mode("overwrite").option("header", "true").csv(csv_base_path + "agg_cross_state_analysis/")
agg_size_analysis.write.mode("overwrite").option("header", "true").csv(csv_base_path + "agg_size_analysis/")
