## E-Commerce Data Processing Pipeline
### Transformation Layer

This notebook represents the second stage in the E-commerce ETL pipeline. It:
1. Loads raw Brazilian e-commerce data from the Olist dataset
2. Cleans and transforms the data
3. Creates derived fields for analysis
4. Writes processed data to the processed layer for analytics

### Setup and Configuration

The code below initializes a Spark session and defines storage paths for reading raw data and writing processed data.
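
As a quick illustration of what the path helper in the next cell produces, here is a minimal sketch in plain Python. `build_abfss_path` is a hypothetical variant that takes the storage account as a parameter instead of hard-coding it; the account, container, and folder names are the ones used in this notebook.

```python
def build_abfss_path(container: str, folder: str, storage_account: str = "ecomsalessa") -> str:
    """Return an abfss:// URI for a folder in an ADLS Gen2 container.

    Hypothetical variant of get_adls_path: the storage account is a
    parameter so the same helper can serve other environments.
    """
    # Strip stray slashes so the result always has exactly one trailing "/"
    folder = folder.strip("/")
    if not container or not folder:
        raise ValueError("container and folder must be non-empty")
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{folder}/"


raw_example = build_abfss_path("raw", "ecommerce-dataset")
processed_example = build_abfss_path("processed", "ecommerce-dataset-l0")
print(raw_example)
print(processed_example)
```

Passing the account as an argument is only one option; reading it from notebook parameters or a configuration file would work as well.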

In [5]:
# Import necessary libraries (only the names this notebook actually uses)
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, IntegerType

# Initialize Spark Session (if not already initialized)
spark = SparkSession.builder.getOrCreate()

# Define function for generating ADLS paths
def get_adls_path(container: str, folder: str) -> str:
    """
    Generate an ADLS path based on the container and folder.
    """
    storage_account = "ecomsalessa"
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{folder}/"

# Define source paths (raw data)
raw_container = "raw"
raw_folder = "ecommerce-dataset"
raw_path = get_adls_path(raw_container, raw_folder)

# Define destination paths (processed data)
processed_container = "processed"
processed_folder = "ecommerce-dataset-l0"
processed_path = get_adls_path(processed_container, processed_folder)

### Data Loading and Initial Standardization

This section performs:
1. Loading of raw CSV files from the raw container
2. Standardization of column names to lowercase snake_case
3. Application of the same naming rule to every dataset

The Olist dataset covers customers, orders, order items, payments, reviews, products, sellers, geolocation, and a category-name translation table.
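
The renaming rule in the next cell lowercases names and replaces spaces and hyphens with underscores, which is enough when the source headers are already close to snake_case. For sources with CamelCase headers or punctuation, a more defensive variant might look like the sketch below. `normalize_column_name` and `check_unique` are hypothetical helpers, not part of this pipeline.

```python
import re


def normalize_column_name(name: str) -> str:
    """Convert an arbitrary column label to lowercase snake_case."""
    # Insert "_" at lower-to-upper boundaries: "orderId" -> "order_Id"
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse every run of non-alphanumeric characters into a single "_"
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


def check_unique(names):
    """Return the normalized names, raising if two source columns collide."""
    seen = {}
    for original in names:
        normalized = normalize_column_name(original)
        if normalized in seen:
            raise ValueError(
                f"{original!r} and {seen[normalized]!r} both map to {normalized!r}"
            )
        seen[normalized] = original
    return list(seen)


print(check_unique(["Order ID", "orderPurchase-Timestamp", "customer_zip_code_prefix"]))
```

The collision check matters because two distinct headers can normalize to the same name, and a DataFrame with duplicate column names is awkward to query.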

In [6]:
# Load each dataset from ADLS
customers_df = spark.read.csv(raw_path + "olist_customers_dataset.csv", header=True, inferSchema=True)
orders_df = spark.read.csv(raw_path + "olist_orders_dataset.csv", header=True, inferSchema=True)
order_items_df = spark.read.csv(raw_path + "olist_order_items_dataset.csv", header=True, inferSchema=True)
order_payments_df = spark.read.csv(raw_path + "olist_order_payments_dataset.csv", header=True, inferSchema=True)
order_reviews_df = spark.read.csv(raw_path + "olist_order_reviews_dataset.csv", header=True, inferSchema=True)
products_df = spark.read.csv(raw_path + "olist_products_dataset.csv", header=True, inferSchema=True)
sellers_df = spark.read.csv(raw_path + "olist_sellers_dataset.csv", header=True, inferSchema=True)
category_names_df = spark.read.csv(raw_path + "product_category_name_translation.csv", header=True, inferSchema=True)
geolocation_df = spark.read.csv(raw_path + "olist_geolocation_dataset.csv", header=True, inferSchema=True)

# Standardize a column name: lowercase, with spaces and hyphens replaced by underscores
def to_snake_case(column_name):
    return column_name.lower().replace(' ', '_').replace('-', '_')

# Apply to all dataframes
customers_df = customers_df.select([F.col(c).alias(to_snake_case(c)) for c in customers_df.columns])
orders_df = orders_df.select([F.col(c).alias(to_snake_case(c)) for c in orders_df.columns])
order_items_df = order_items_df.select([F.col(c).alias(to_snake_case(c)) for c in order_items_df.columns])
order_payments_df = order_payments_df.select([F.col(c).alias(to_snake_case(c)) for c in order_payments_df.columns])
order_reviews_df = order_reviews_df.select([F.col(c).alias(to_snake_case(c)) for c in order_reviews_df.columns])
products_df = products_df.select([F.col(c).alias(to_snake_case(c)) for c in products_df.columns])
sellers_df = sellers_df.select([F.col(c).alias(to_snake_case(c)) for c in sellers_df.columns])
category_names_df = category_names_df.select([F.col(c).alias(to_snake_case(c)) for c in category_names_df.columns])
geolocation_df = geolocation_df.select([F.col(c).alias(to_snake_case(c)) for c in geolocation_df.columns])
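
The nine read calls and nine rename calls above follow one pattern, so they could be driven by a single mapping. The sketch below is one way to do that. `RAW_FILES` and `load_raw_tables` are hypothetical names, and the function only assumes that `spark` is the session created earlier; `DataFrame.toDF` renames all columns in one call.

```python
# Logical table name -> raw CSV file name (file names as used in this notebook)
RAW_FILES = {
    "customers": "olist_customers_dataset.csv",
    "orders": "olist_orders_dataset.csv",
    "order_items": "olist_order_items_dataset.csv",
    "order_payments": "olist_order_payments_dataset.csv",
    "order_reviews": "olist_order_reviews_dataset.csv",
    "products": "olist_products_dataset.csv",
    "sellers": "olist_sellers_dataset.csv",
    "category_names": "product_category_name_translation.csv",
    "geolocation": "olist_geolocation_dataset.csv",
}


def to_snake_case(column_name: str) -> str:
    """Same rule as the notebook: lowercase, spaces and hyphens to underscores."""
    return column_name.lower().replace(" ", "_").replace("-", "_")


def load_raw_tables(spark, raw_path: str) -> dict:
    """Read every raw CSV and return {table_name: DataFrame} with renamed columns."""
    tables = {}
    for table_name, file_name in RAW_FILES.items():
        df = spark.read.csv(raw_path + file_name, header=True, inferSchema=True)
        # toDF(*names) returns a new DataFrame with the columns renamed in order
        tables[table_name] = df.toDF(*[to_snake_case(c) for c in df.columns])
    return tables
```

With this in place, `tables = load_raw_tables(spark, raw_path)` followed by `orders_df = tables["orders"]` would give the same DataFrames as the explicit version.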


### Data Type Conversion and Null Handling

This section addresses data quality issues through:
1. Conversion of date/timestamp columns to proper formats
2. Conversion of numeric columns to appropriate types
3. Filling of null product categories with a default value; null timestamps are kept as null

Correct types matter because the delivery metrics derived later rely on timestamp arithmetic.
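
To make the typing step concrete, the sketch below mimics in plain Python what the conversions in the next cell are meant to do: parse timestamp strings, cast numeric strings, and yield a null (here `None`) when a value is missing or malformed, which is how these casts behave when Spark's ANSI mode is off. The `yyyy-MM-dd HH:mm:ss` layout is an assumption about the raw files, and `parse_timestamp` and `parse_float` are hypothetical helpers for illustration only.

```python
from datetime import datetime
from typing import Optional

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed layout, e.g. "2018-01-31 13:45:00"


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse a timestamp string; return None for missing or malformed input."""
    if value is None or not value.strip():
        return None
    try:
        return datetime.strptime(value.strip(), TIMESTAMP_FORMAT)
    except ValueError:
        return None


def parse_float(value: Optional[str]) -> Optional[float]:
    """Cast a numeric string to float; return None when it cannot be cast."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


print(parse_timestamp("2018-01-31 13:45:00"))  # a datetime object
print(parse_timestamp(""))                      # None: no value recorded
print(parse_float("58.90"), parse_float("n/a"))
```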

In [7]:
# Convert date/time columns to Spark timestamps
orders_df = orders_df.withColumn("order_purchase_timestamp", F.to_timestamp("order_purchase_timestamp"))
orders_df = orders_df.withColumn("order_approved_at", F.to_timestamp("order_approved_at"))
orders_df = orders_df.withColumn("order_delivered_carrier_date", F.to_timestamp("order_delivered_carrier_date"))
orders_df = orders_df.withColumn("order_delivered_customer_date", F.to_timestamp("order_delivered_customer_date"))
orders_df = orders_df.withColumn("order_estimated_delivery_date", F.to_timestamp("order_estimated_delivery_date"))

# Convert numeric columns to proper types
order_items_df = order_items_df.withColumn("price", F.col("price").cast(DoubleType()))
order_items_df = order_items_df.withColumn("freight_value", F.col("freight_value").cast(DoubleType()))
order_payments_df = order_payments_df.withColumn("payment_value", F.col("payment_value").cast(DoubleType()))
order_reviews_df = order_reviews_df.withColumn("review_score", F.col("review_score").cast(IntegerType()))

# Fill nulls only where a default is meaningful. fillna() needs a concrete value
# (not None), so the date columns keep their nulls and only the category is filled.
products_df = products_df.fillna({"product_category_name": "uncategorized"})


### Data Exploration and Quality Assessment

Before proceeding with transformations, this section:
1. Examines the schema of key tables
2. Displays sample data to understand structure and content
3. Identifies missing values that might impact analysis

This assessment surfaces data quality issues before the dimensional model is built. The cell also lowercases product category names ahead of the category translation join.
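
The missing-value check boils down to counting nulls per column. As a plain-Python sketch of that logic on a few made-up rows (the sample values are invented for illustration, and `count_missing` is a hypothetical helper):

```python
from collections import Counter


def count_missing(rows, columns):
    """Count None values per column in a single pass over the rows."""
    missing = Counter()
    for row in rows:
        for column in columns:
            if row.get(column) is None:
                missing[column] += 1
    # Report only the columns that actually have gaps
    return {column: missing[column] for column in columns if missing[column] > 0}


sample_orders = [  # invented rows, not taken from the dataset
    {"order_id": "a1", "order_approved_at": "2018-01-02 10:00:00", "order_delivered_customer_date": None},
    {"order_id": "b2", "order_approved_at": None, "order_delivered_customer_date": None},
    {"order_id": "c3", "order_approved_at": "2018-01-05 09:30:00", "order_delivered_customer_date": "2018-01-12 16:20:00"},
]
columns = ["order_id", "order_approved_at", "order_delivered_customer_date"]
print(count_missing(sample_orders, columns))
```

Counting all columns in one pass is the point of the sketch: on a distributed DataFrame, a separate `count()` per column triggers a separate job each time.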

In [8]:
# Null timestamps in orders_df are left as null on purpose: a null means the
# order has not reached that stage, and the status flags derived later rely on it.

# Standardize product categories to lowercase
products_df = products_df.withColumn("product_category_name", F.lower(F.col("product_category_name")))

# Display sample data and data info
print("Customers DataFrame Schema:")
customers_df.printSchema()
print("\nCustomers DataFrame Sample:")
customers_df.show(5)

print("\nOrders DataFrame Schema:")
orders_df.printSchema()
print("\nOrders DataFrame Sample:")
orders_df.show(5)

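# Check for missing values (one aggregation instead of one Spark job per column)
print("\nMissing Values in Orders DataFrame:")
null_counts = orders_df.select(
    [F.count(F.when(F.col(c).isNull(), 1)).alias(c) for c in orders_df.columns]
).first().asDict()
for column, missing_count in null_counts.items():
    if missing_count > 0:
        print(f"Column {column}: {missing_count} missing values")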




### Feature Engineering and Extended Attributes

This section enriches the data through:
1. Creation of status flags to track order progress
2. Calculation of delivery time metrics (shipping days, delivery days)
3. Addition of delay indicators and measurements
4. Conversion of product categories to English using translation mapping

These derived fields let downstream reports filter by order stage and measure delivery performance without repeating the date arithmetic.
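
The derivation rules can be checked against a small plain-Python reference before running them in Spark. The sketch below mirrors the logic of the next cell for a single order; `days_between` and `derive_order_features` are hypothetical helpers and the timestamps are invented. It also shows a subtlety: the delay flag compares full timestamps, while the day counts compare calendar dates (which is how `datediff` works), so an order delivered a few hours after the estimate is flagged as delayed with zero delay days.

```python
from datetime import datetime
from typing import Optional


def days_between(end: Optional[datetime], start: Optional[datetime]) -> Optional[int]:
    """Calendar-day difference (time of day ignored); None if either side is missing."""
    if end is None or start is None:
        return None
    return (end.date() - start.date()).days


def derive_order_features(purchased, approved, shipped, delivered, estimated) -> dict:
    """Mirror of the notebook's rules for one order (all args are datetime or None)."""
    if delivered is not None:
        status = "DELIVERED"
    elif shipped is not None:
        status = "SHIPPED"
    elif approved is not None:
        status = "APPROVED"
    else:
        status = "CREATED"

    # Flag is unknown (None) until both dates exist; it compares full timestamps
    is_delayed = None
    if delivered is not None and estimated is not None:
        is_delayed = delivered > estimated

    return {
        "order_status": status,
        "shipping_time_days": days_between(shipped, approved),
        "delivery_time_days": days_between(delivered, shipped),
        "total_delivery_time_days": days_between(delivered, purchased),
        "is_delayed": is_delayed,
        "delay_days": days_between(delivered, estimated) if is_delayed else 0,
    }


features = derive_order_features(
    purchased=datetime(2018, 3, 1, 9, 15),
    approved=datetime(2018, 3, 1, 9, 40),
    shipped=datetime(2018, 3, 3, 14, 0),
    delivered=datetime(2018, 3, 10, 18, 30),
    estimated=datetime(2018, 3, 10, 0, 0),
)
print(features)
```

If same-day lateness should not count as a delay, comparing dates instead of timestamps in the flag would make the two columns agree; which behavior is wanted is a reporting decision.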

In [10]:
# Derive a milestone-based status from the timestamp columns.
# Note: the raw orders file already has an `order_status` column. withColumn
# replaces it, so source statuses that do not map to a timestamp milestone are lost.
orders_df = orders_df.withColumn(
    "order_status",
    F.when(F.col("order_delivered_customer_date").isNotNull(), "DELIVERED")
     .when(F.col("order_delivered_carrier_date").isNotNull(), "SHIPPED")
     .when(F.col("order_approved_at").isNotNull(), "APPROVED")
     .otherwise("CREATED")
)

# Calculate delivery time metrics (only where data is available)
orders_df = orders_df.withColumn(
    "shipping_time_days",
    F.when(
        F.col("order_delivered_carrier_date").isNotNull() & F.col("order_approved_at").isNotNull(),
        F.datediff(F.col("order_delivered_carrier_date"), F.col("order_approved_at"))
    )
)

orders_df = orders_df.withColumn(
    "delivery_time_days",
    F.when(
        F.col("order_delivered_customer_date").isNotNull() & F.col("order_delivered_carrier_date").isNotNull(),
        F.datediff(F.col("order_delivered_customer_date"), F.col("order_delivered_carrier_date"))
    )
)

orders_df = orders_df.withColumn(
    "total_delivery_time_days",
    F.when(
        F.col("order_delivered_customer_date").isNotNull() & F.col("order_purchase_timestamp").isNotNull(),
        F.datediff(F.col("order_delivered_customer_date"), F.col("order_purchase_timestamp"))
    )
)

# Delivery delay indicator: null when the order has not been delivered yet
orders_df = orders_df.withColumn(
    "is_delayed",
    F.when(
        F.col("order_delivered_customer_date").isNotNull() & F.col("order_estimated_delivery_date").isNotNull(),
        F.col("order_delivered_customer_date") > F.col("order_estimated_delivery_date")
    )
)

# Delay in calendar days; 0 when the order was on time or is not yet delivered
orders_df = orders_df.withColumn(
    "delay_days",
    F.when(
        F.col("is_delayed"),
        F.datediff(F.col("order_delivered_customer_date"), F.col("order_estimated_delivery_date"))
    ).otherwise(0)
)

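# Re-check missing values now that the derived columns exist (single aggregation)
print("\nMissing Values in Orders DataFrame:")
null_counts = orders_df.select(
    [F.count(F.when(F.col(c).isNull(), 1)).alias(c) for c in orders_df.columns]
).first().asDict()
for column, missing_count in null_counts.items():
    if missing_count > 0:
        print(f"Column {column}: {missing_count} missing values")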


# Distribution of orders across the derived status values
status_counts = orders_df.groupBy("order_status").count().orderBy("order_status")
print("\nOrder Status Distribution:")
status_counts.show()

# Boolean flags that state each milestone explicitly
orders_df = orders_df.withColumn("is_approved", F.col("order_approved_at").isNotNull())
orders_df = orders_df.withColumn("is_shipped", F.col("order_delivered_carrier_date").isNotNull())
orders_df = orders_df.withColumn("is_delivered", F.col("order_delivered_customer_date").isNotNull())

# Reporting-friendly copies of the duration columns; -1 marks a duration that is not available
orders_df = orders_df.withColumn("shipping_days", F.coalesce(F.col("shipping_time_days"), F.lit(-1)))
orders_df = orders_df.withColumn("delivery_days", F.coalesce(F.col("delivery_time_days"), F.lit(-1)))
orders_df = orders_df.withColumn("total_days", F.coalesce(F.col("total_delivery_time_days"), F.lit(-1)))

# Inspect the schema with the new columns
print("\nUpdated Orders DataFrame Schema (with boolean flags):")
orders_df.printSchema()

# Inspect the category translation table before joining
print("\nCategory Names DataFrame:")
category_names_df.printSchema()
category_names_df.show(5)

# Make sure both dataframes have the columns needed for join
print("\nProducts DataFrame Category Column:")
products_df.select("product_category_name").show(5)

# Left join on category name. Both DataFrames have a product_category_name
# column, so explicit column references are used to avoid ambiguity.
products_with_english_categories = products_df.join(
    category_names_df,
    products_df["product_category_name"] == category_names_df["product_category_name"],
    "left"
).select(
    products_df["*"],  # All columns from products_df
    category_names_df["product_category_name_english"]  # Just the English name from category_names_df
)

# Check the joined data
print("\nJoined Products with Categories:")
products_with_english_categories.select(
    "product_id", 
    "product_category_name", 
    "product_category_name_english"
).show(5)

# Update the products_df to include the English category names
products_df = products_with_english_categories.withColumn(
    "product_category_name_english",
    F.coalesce(F.col("product_category_name_english"), F.lit("uncategorized"))
)
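
The left join plus `coalesce` above behaves like a dictionary lookup with a default. A plain-Python sketch of the same idea follows, using a tiny illustrative mapping; the two translation pairs and `translate_category` are for illustration and are not read from the translation file.

```python
TRANSLATIONS = {  # illustrative subset; the real mapping comes from the translation CSV
    "beleza_saude": "health_beauty",
    "informatica_acessorios": "computers_accessories",
}


def translate_category(category, translations=TRANSLATIONS, default="uncategorized"):
    """Return the English name, or the default when the category is missing or unmapped."""
    if category is None:
        return default
    # Lowercase first, mirroring the standardization applied to products_df
    return translations.get(category.lower(), default)


for raw in ["beleza_saude", "Informatica_Acessorios", "categoria_nova", None]:
    print(raw, "->", translate_category(raw))
```

Unmapped categories fall back to the default rather than being dropped, which is the reason for using a left join instead of an inner join.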

### Data Persistence to Processed Layer

After all transformations, this section:
1. Writes the processed datasets to the processed layer in Parquet format
2. Keeps one output folder per source entity, with the new attributes added to orders and products
3. Prepares the data for dimensional modeling in the next pipeline stage

The processed data now has consistent column names and types, carries the derived attributes, and is stored in a columnar format suited to analytics.
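
Since every table is written the same way, the write step could also be expressed as a loop over a name-to-DataFrame mapping. A minimal sketch, assuming `tables` is a dict such as `{"orders": orders_df, "products": products_df}`; `output_path` and `write_tables` are hypothetical helpers.

```python
def output_path(processed_path: str, table_name: str) -> str:
    """Folder for one table under the processed layer, always ending in '/'."""
    return f"{processed_path.rstrip('/')}/{table_name}/"


def write_tables(tables: dict, processed_path: str, mode: str = "overwrite") -> list:
    """Write each DataFrame as Parquet and return the paths that were written."""
    written = []
    for table_name, df in tables.items():
        path = output_path(processed_path, table_name)
        df.write.mode(mode).parquet(path)  # same call chain as the cell below
        written.append(path)
    return written


print(output_path("abfss://processed@ecomsalessa.dfs.core.windows.net/ecommerce-dataset-l0/", "orders"))
```

Returning the written paths makes it easy to log what the run produced or to hand the list to a later validation step.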

In [11]:
# Write cleaned data to processed container
customers_df.write.mode("overwrite").parquet(processed_path + "customers/")
orders_df.write.mode("overwrite").parquet(processed_path + "orders/")
order_items_df.write.mode("overwrite").parquet(processed_path + "order_items/")
order_payments_df.write.mode("overwrite").parquet(processed_path + "order_payments/")
order_reviews_df.write.mode("overwrite").parquet(processed_path + "order_reviews/")
products_df.write.mode("overwrite").parquet(processed_path + "products/")
sellers_df.write.mode("overwrite").parquet(processed_path + "sellers/")
category_names_df.write.mode("overwrite").parquet(processed_path + "category_names/")
geolocation_df.write.mode("overwrite").parquet(processed_path + "geolocation/")

print("Data cleaning and processing completed successfully!")