# SportsBar Pricing Dimension Processing Pipeline
## Medallion Architecture: Bronze → Silver → Gold

### 📌 NOTEBOOK OVERVIEW

This notebook processes SportsBar product pricing data through the medallion architecture pattern, handling multiple date formats, validating numerical values, and merging with parent company master pricing data for comprehensive product cost analysis.

**Pipeline Purpose:** Ingest raw gross price data, standardize formats, validate values, join with products, then merge into parent company pricing dimension for cost-based analytics.

**Output:** `fmcg.gold.dim_gross_price` - Unified pricing dimension with annual grain aggregation

### 🔄 PROCESSING FLOW

| Layer | Table Name | Grain | Purpose | Key Transformations |
|-------|-----------|-------|---------|---------------------|
| **Bronze** | `fmcg.bronze.gross_price` | Daily/Monthly | Raw data ingestion | Metadata tracking |
| **Silver** | `fmcg.silver.gross_price` | Monthly | Quality & enrichment | Date parsing, price validation, product join |
| **Gold (Staging)** | `fmcg.gold.sb_dim_gross_price` | Monthly | SportsBar pricing | Business column selection |
| **Gold (Parent)** | `fmcg.gold.dim_gross_price` | Annual | Master pricing data | Month → Year aggregation, MERGE with parent company |

### 🎯 KEY BUSINESS LOGIC

**Pricing Data Model:**
- Captures monthly gross prices for products
- Transforms to annual grain (one price per product per year)
- Handles multiple date formats and invalid price values
- Joins with products to ensure product_code consistency

**Data Quality Steps (4-Step Process):**
1. **Date Standardization** - Parse 4 different date formats consistently
2. **Price Validation** - Numeric validation with negative price handling
3. **Product Join** - Enrich with product_code from products dimension
4. **Annual Aggregation** - Select latest non-zero price per product per year

### 💼 BUSINESS CONTEXT

The pricing dimension provides cost information for products, enabling cost-based analysis, margin calculations, and trend analysis. By storing annual pricing points, the warehouse tracks how product prices evolved over time, critical for accurate COGS (Cost of Goods Sold) calculations and profitability analysis.

## STEP 1: Import Required Libraries

PySpark, Delta Lake, and Window function modules for pricing transformations

In [0]:
# Import PySpark SQL functions for data transformations
from pyspark.sql import functions as F
# Import DeltaTable for MERGE operations
from delta.tables import DeltaTable
# Import Window for over() clause - needed for ranking/row numbering operations
from pyspark.sql.window import Window

## STEP 2: Load Project Utilities & Initialize Notebook Widgets

Import centralized configuration and set up pipeline parameters

In [0]:
# Load utilities - defines bronze_schema, silver_schema, gold_schema constants
%run /Workspace/Project1/1_setup_catalog/utilities

In [0]:
# Verify schema constants loaded (should print: bronze silver gold)
print(bronze_schema, silver_schema, gold_schema)

bronze silver gold


In [0]:
# Configure notebook widgets for parameterized execution
dbutils.widgets.text("catalog", "fmcg", "Catalog")
dbutils.widgets.text("data_source", "gross_price", "Data Source")

# Get widget values
catalog = dbutils.widgets.get("catalog")
data_source = dbutils.widgets.get("data_source")

# Construct ADLS Gen2 path to gross_price CSV files
base_path = f"abfss://conatiner-de-practice@adlsgen2narayan.dfs.core.windows.net/{data_source}/*.csv"
# Alternative S3 path for AWS deployment (commented):
# base_path = f's3://sportsbar-final/{data_source}/*.csv'
print(base_path)

abfss://conatiner-de-practice@adlsgen2narayan.dfs.core.windows.net/gross_price/*.csv


## STEP 3: BRONZE LAYER - Raw Pricing Data Ingestion

**Purpose:** Load raw pricing CSV files with complete lineage tracking

**Medallion Pattern:** Capture all data exactly as-is, no transformations

**Metadata Tracked:**
- `read_timestamp` - Processing time
- `file_name` - Source file identifier
- `file_size` - Data volume

**Output Table:** `fmcg.bronze.gross_price`

In [0]:
df = (
    spark.read.format("csv")                     # Read CSV format
        .option("header", True)                   # First row is header
        .option("inferSchema", True)              # Auto-detect data types
        .load(base_path)                          # Load from ADLS
        .withColumn("read_timestamp", F.current_timestamp())  # Add processing timestamp
        .select("*", "_metadata.file_name", "_metadata.file_size")  # Include metadata
)

In [0]:
# Display schema to verify correct data type inference
df.printSchema()

root
 |-- product_id: integer (nullable = true)
 |-- month: string (nullable = true)
 |-- gross_price: string (nullable = true)
 |-- read_timestamp: timestamp (nullable = false)
 |-- file_name: string (nullable = false)
 |-- file_size: long (nullable = false)



In [0]:
# Preview first 10 rows to inspect data quality
display(df.limit(10))

product_id,month,gross_price,read_timestamp,file_name,file_size
25891101,2025/07/01,-84,2025-11-30T13:52:09.541Z,gross_price.csv,2741
25891101,01/08/2025,unknown,2025-11-30T13:52:09.541Z,gross_price.csv,2741
25891101,2025/09/01,84,2025-11-30T13:52:09.541Z,gross_price.csv,2741
25891101,2025-10-01,83,2025-11-30T13:52:09.541Z,gross_price.csv,2741
25891101,2025-11-01,83,2025-11-30T13:52:09.541Z,gross_price.csv,2741
88888888,2025-12-01,-83,2025-11-30T13:52:09.541Z,gross_price.csv,2741
25891102,2025-07-01,68,2025-11-30T13:52:09.541Z,gross_price.csv,2741
25891102,2025-08-01,68,2025-11-30T13:52:09.541Z,gross_price.csv,2741
25891102,2025-09-01,68,2025-11-30T13:52:09.541Z,gross_price.csv,2741
25891102,2025-10-01,69,2025-11-30T13:52:09.541Z,gross_price.csv,2741


In [0]:
# Write raw pricing data to bronze layer with Change Data Feed for audit trail
(
    df.write
      .format("delta")
      .option("delta.enableChangeDataFeed", "true")  # Enable CDF
      .mode("overwrite")                             # Replace previous bronze data
      .saveAsTable(f"{catalog}.{bronze_schema}.{data_source}")
)

## STEP 4: SILVER LAYER - Pricing Data Quality & Standardization

**Purpose:** Clean and standardize pricing data with business rule application

**Medallion Pattern:** Transform to conformed, validated schema

**Quality Checks (4-Step Process):**
1. ✅ **Date Parsing** - Handle 4 different date formats consistently
2. ✅ **Price Validation** - Numeric validation with negative value handling
3. ✅ **Product Enrichment** - Join with products dimension for product_code
4. ✅ **Annual Aggregation** - Grain transformation in Gold layer

**Output Table:** `fmcg.silver.gross_price`

In [0]:
# Read from bronze layer - starting point for transformations
df_bronze = spark.sql(f"SELECT * FROM {catalog}.{bronze_schema}.{data_source};")
df_bronze.show(10)

+----------+----------+-----------+--------------------+---------------+---------+
|product_id|     month|gross_price|      read_timestamp|      file_name|file_size|
+----------+----------+-----------+--------------------+---------------+---------+
|  25891101|2025/07/01|        -84|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891101|01/08/2025|    unknown|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891101|2025/09/01|         84|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891101|2025-10-01|         83|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891101|2025-11-01|         83|2025-11-30 13:52:...|gross_price.csv|     2741|
|  88888888|2025-12-01|        -83|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891102|2025-07-01|         68|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891102|2025-08-01|         68|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891102|2025-09-01|         68|2025-11-30 13:52:...|gross_price.csv|     2741|
|  2

### Silver Layer Transformations (4-Step Quality Framework)

#### QUALITY STEP 1️⃣: Normalize `month` Field - Parse Multiple Date Formats

**Objective:** Convert raw date strings to standardized DATE type

**Challenge:** Source system produces inconsistent date formats

**Supported Formats:**
- `yyyy/MM/dd` (e.g., 2024/01/15)
- `dd/MM/yyyy` (e.g., 15/01/2024)
- `yyyy-MM-dd` (e.g., 2024-01-15)
- `dd-MM-yyyy` (e.g., 15-01-2024)

**Technique:** Use `coalesce()` with `try_to_date()` to try formats sequentially

**Business Impact:** Unified date handling for time-based analytics
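
The first-match-wins idea behind `coalesce()` + `try_to_date()` can be sketched outside Spark in plain Python. This is only an illustration under stated assumptions: `parse_month` and `DATE_FORMATS` are hypothetical names, and the `strptime` patterns are hand-translated equivalents of the four Spark patterns listed above.

```python
from datetime import date, datetime
from typing import Optional

# strptime equivalents of the four Spark patterns, in the same priority order.
DATE_FORMATS = ["%Y/%m/%d", "%d/%m/%Y", "%Y-%m-%d", "%d-%m-%Y"]


def parse_month(raw: Optional[str]) -> Optional[date]:
    """Return the first successful parse, or None if no format matches."""
    if raw is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue  # this format did not match; try the next one
    return None


for raw in ["2025/07/01", "01/08/2025", "2025-10-01", "01-09-2025", "not a date"]:
    print(raw, "->", parse_month(raw))
```

Note that the order of the list is part of the business rule: `01/08/2025` is read as 1 August because the day-first pattern is the one listed. If a source ever sent month-first dates, they would be silently misread rather than rejected.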

In [0]:
# Inspect unique date values to see formatting inconsistencies
df_bronze.select('month').distinct().show()

+----------+
|     month|
+----------+
|2025/07/01|
|01/08/2025|
|2025/09/01|
|2025-10-01|
|2025-11-01|
|2025-12-01|
|2025-07-01|
|2025-08-01|
|2025-09-01|
|2025/11/01|
|2025/08/01|
|01-09-2025|
|2025/10/01|
|01/12/2025|
|01/09/2025|
|01-12-2025|
|01-08-2025|
|01/10/2025|
+----------+



In [0]:
# QUALITY STEP 1️⃣: Parse `month` from multiple possible formats
# Try each format in sequence until one succeeds; if all fail, result is NULL
date_formats = ["yyyy/MM/dd", "dd/MM/yyyy", "yyyy-MM-dd", "dd-MM-yyyy"]

df_silver = df_bronze.withColumn(
    "month",
    # Coalesce returns the first non-null result, so list order sets the priority
    F.coalesce(*[F.try_to_date(F.col("month"), fmt) for fmt in date_formats])
)

In [0]:
# Verify date parsing worked - all should be in yyyy-MM-dd format now
df_silver.select('month').distinct().show()

+----------+
|     month|
+----------+
|2025-07-01|
|2025-08-01|
|2025-09-01|
|2025-10-01|
|2025-11-01|
|2025-12-01|
+----------+



#### QUALITY STEP 2️⃣: Handling `gross_price` - Validation & Normalization

**Objective:** Validate and standardize price values

**Challenges:**
- Non-numeric values (text, symbols) mixed with valid prices
- Negative prices (data entry errors?)
- Null/missing values

**Logic:**
- Keep valid numeric prices as-is
- Convert negative prices to positive (absolute value)
- Replace all non-numeric with 0 (default/unknown)

**Business Impact:** Ensures price column is always numeric and workable
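
As a rough illustration of these three rules, the same decision logic can be written as a small plain-Python function. `normalize_price` is a hypothetical helper, not part of the notebook; it reuses the numeric pattern from the Spark cell and treats null input like any other non-numeric value.

```python
import re
from typing import Optional

# Same pattern as the Spark cell: optional sign, digits, optional decimal part.
NUMERIC = re.compile(r"^-?\d+(\.\d+)?$")


def normalize_price(raw: Optional[str]) -> float:
    """Apply the three rules: keep valid, flip negative, default the rest to 0."""
    if raw is None or not NUMERIC.match(raw):
        return 0.0              # non-numeric or missing -> 0 (default/unknown)
    return abs(float(raw))      # negative prices become positive


for raw in ["-84", "unknown", "83", "12.5", None]:
    print(repr(raw), "->", normalize_price(raw))
```

One consequence of this design is that 0 acts as a sentinel for "unknown", so a genuine zero price cannot be told apart from a bad value. The annual aggregation later in the notebook relies on this by treating zero prices as a fallback only.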

In [0]:
# Display to see raw price values and identify issues (negative, invalid, etc.)
df_silver.show(10)

+----------+----------+-----------+--------------------+---------------+---------+
|product_id|     month|gross_price|      read_timestamp|      file_name|file_size|
+----------+----------+-----------+--------------------+---------------+---------+
|  25891101|2025-07-01|        -84|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891101|2025-08-01|    unknown|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891101|2025-09-01|         84|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891101|2025-10-01|         83|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891101|2025-11-01|         83|2025-11-30 13:52:...|gross_price.csv|     2741|
|  88888888|2025-12-01|        -83|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891102|2025-07-01|         68|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891102|2025-08-01|         68|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891102|2025-09-01|         68|2025-11-30 13:52:...|gross_price.csv|     2741|
|  2

In [0]:
# QUALITY STEP 2️⃣: Validate gross_price column
# Validate numeric values, fix negative prices (abs value), replace invalid with 0

df_silver = df_silver.withColumn(
    "gross_price",
    F.when(
        # Check if value matches numeric pattern: optional sign, digits, optional decimal
        F.col("gross_price").rlike(r'^-?\d+(\.\d+)?$'),
        # If numeric, cast to double and take the absolute value (negatives become positive)
        F.abs(F.col("gross_price").cast("double"))
    )
    .otherwise(0.0)  # Non-numeric values → 0 (default/unknown)
)

In [0]:
# Verify price validation - all prices should now be valid doubles (no negatives, no text)
df_silver.show(10)

+----------+----------+-----------+--------------------+---------------+---------+
|product_id|     month|gross_price|      read_timestamp|      file_name|file_size|
+----------+----------+-----------+--------------------+---------------+---------+
|  25891101|2025-07-01|       84.0|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891101|2025-08-01|        0.0|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891101|2025-09-01|       84.0|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891101|2025-10-01|       83.0|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891101|2025-11-01|       83.0|2025-11-30 13:52:...|gross_price.csv|     2741|
|  88888888|2025-12-01|       83.0|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891102|2025-07-01|       68.0|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891102|2025-08-01|       68.0|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891102|2025-09-01|       68.0|2025-11-30 13:52:...|gross_price.csv|     2741|
|  2

In [0]:
# QUALITY STEP 3️⃣: Enrich with Product Information
# Join pricing data with products dimension to get standardized product_code
# This ensures pricing aligns with product dimension join keys

df_products = spark.table(f"{catalog}.{silver_schema}.products")  # Load products dimension
df_joined = df_silver.join(
    df_products.select("product_id", "product_code"),  # Select only needed columns
    on="product_id",                                    # Join key: product_id
    how="inner"                                         # Keep only matching prices
)
# Select columns in logical order
df_joined = df_joined.select(
    "product_id",
    "product_code",
    "month",
    "gross_price",
    "read_timestamp",
    "file_name",
    "file_size"
)

df_joined.show(5)

+----------+--------------------+----------+-----------+--------------------+---------------+---------+
|product_id|        product_code|     month|gross_price|      read_timestamp|      file_name|file_size|
+----------+--------------------+----------+-----------+--------------------+---------------+---------+
|  25891101|e91ba9d665f90254d...|2025-07-01|       84.0|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891101|e91ba9d665f90254d...|2025-08-01|        0.0|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891101|e91ba9d665f90254d...|2025-09-01|       84.0|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891101|e91ba9d665f90254d...|2025-10-01|       83.0|2025-11-30 13:52:...|gross_price.csv|     2741|
|  25891101|e91ba9d665f90254d...|2025-11-01|       83.0|2025-11-30 13:52:...|gross_price.csv|     2741|
+----------+--------------------+----------+-----------+--------------------+---------------+---------+
only showing top 5 rows
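
Because the join is `inner`, any price row whose `product_id` is missing from the products dimension disappears without an error. The plain-Python sketch below shows one way to make those rows visible. The helper name and the sample data are hypothetical, including the assumption that one of the ids is absent from products.

```python
def split_by_known_products(price_rows, known_product_ids):
    """Mimic an inner join on product_id and also report what it would drop."""
    matched, dropped = [], []
    for row in price_rows:
        (matched if row["product_id"] in known_product_ids else dropped).append(row)
    return matched, dropped


# Hypothetical sample: the second product_id is assumed to be absent from products.
price_rows = [
    {"product_id": 25891101, "gross_price": 84.0},
    {"product_id": 88888888, "gross_price": 83.0},
]
known_product_ids = {25891101, 25891102}

matched, dropped = split_by_known_products(price_rows, known_product_ids)
print(len(matched), "kept;", len(dropped), "dropped:", [r["product_id"] for r in dropped])
```

In Spark, a join with `how="left_anti"` against the products table is a common way to list the same unmatched rows before deciding whether dropping them is acceptable.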


In [0]:
# Write standardized pricing data to silver layer with CDF for audit trail
(
    df_joined.write
      .format("delta")
      .option("delta.enableChangeDataFeed", "true")
      .option("mergeSchema", "true")                 # Allow schema evolution
      .mode("overwrite")                             # Replace previous silver data
      .saveAsTable(f"{catalog}.{silver_schema}.{data_source}")
)

## STEP 5: GOLD LAYER - Business-Ready Pricing Dimension (Staging)

**Purpose:** Create the SportsBar pricing staging table with business columns only

**Medallion Pattern:** Select business columns and drop technical metadata

**Grain:** Monthly (unchanged from silver); the Monthly → Annual transformation is applied in Step 6 before the parent merge

**Key Columns:**
- `product_code` - Join key with product dimension
- `month` - Time dimension (`year` is derived from it in Step 6)
- `gross_price` - Product cost

**Output Table:** `fmcg.gold.sb_dim_gross_price` (staging for parent merge)

In [0]:
# Read from silver layer - start with validated pricing data
df_silver = spark.sql(f"SELECT * FROM {catalog}.{silver_schema}.{data_source};")

In [0]:
# Select only business-relevant columns (drop technical metadata)
df_gold = df_silver.select("product_code", "month", "gross_price")
df_gold.show(5)

+--------------------+----------+-----------+
|        product_code|     month|gross_price|
+--------------------+----------+-----------+
|e91ba9d665f90254d...|2025-07-01|       84.0|
|e91ba9d665f90254d...|2025-08-01|        0.0|
|e91ba9d665f90254d...|2025-09-01|       84.0|
|e91ba9d665f90254d...|2025-10-01|       83.0|
|e91ba9d665f90254d...|2025-11-01|       83.0|
+--------------------+----------+-----------+
only showing top 5 rows


In [0]:
# Write SportsBar pricing dimension (staging table before parent merge)
(
    df_gold.write
      .format("delta")
      .option("delta.enableChangeDataFeed", "true")
      .mode("overwrite")                             # Replace previous version
      .saveAsTable(f"{catalog}.{gold_schema}.sb_dim_{data_source}")
)

## STEP 6: QUALITY STEP 4️⃣ - Annual Price Aggregation & Parent Merge

**Purpose:** Transform monthly prices to annual grain and merge with parent pricing dimension

**Challenge:** Multiple prices per product per year (monthly data)

**Solution:** Select latest non-zero price per product per year using Window function

**Pattern:** Delta MERGE operation for incremental upserts

**Output Table:** `fmcg.gold.dim_gross_price` (unified master pricing data)
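
To make the selection rule concrete, here is a minimal plain-Python sketch of "latest non-zero price per product per year". The function name and sample rows are hypothetical. The ranking tuple mirrors the intended ordering: non-zero prices first, then the most recent month.

```python
from datetime import date


def latest_nonzero_per_year(rows):
    """rows: iterable of (product_code, month: date, gross_price: float)."""
    best = {}
    for product_code, month, price in rows:
        key = (product_code, month.year)
        # Lower rank wins: non-zero (False) sorts before zero (True),
        # and a later month gives a smaller negative ordinal.
        rank = (price == 0, -month.toordinal())
        if key not in best or rank < best[key][0]:
            best[key] = (rank, price)
    return {key: price for key, (rank, price) in best.items()}


# Hypothetical sample rows for two products.
rows = [
    ("P1", date(2025, 7, 1), 84.0),
    ("P1", date(2025, 8, 1), 0.0),
    ("P1", date(2025, 11, 1), 83.0),
    ("P1", date(2025, 12, 1), 0.0),   # latest month, but zero -> only a fallback
    ("P2", date(2025, 9, 1), 0.0),    # only zero prices exist -> zero is kept
]
result = latest_nonzero_per_year(rows)
print(result)
```

For `P1` the December row is skipped because its price is zero, so the November price is chosen. `P2` keeps its zero price because nothing better exists for that year.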

In [0]:
# Load SportsBar pricing data from gold staging table
df_gold_price = spark.table(f"{catalog}.{gold_schema}.sb_dim_{data_source}")
df_gold_price.show(5)

+--------------------+----------+-----------+
|        product_code|     month|gross_price|
+--------------------+----------+-----------+
|e91ba9d665f90254d...|2025-07-01|       84.0|
|e91ba9d665f90254d...|2025-08-01|        0.0|
|e91ba9d665f90254d...|2025-09-01|       84.0|
|e91ba9d665f90254d...|2025-10-01|       83.0|
|e91ba9d665f90254d...|2025-11-01|       83.0|
+--------------------+----------+-----------+
only showing top 5 rows


### Get Latest Price per Product per Year (Annual Aggregation)

In [0]:
df_gold_price = (
    df_gold_price
    # Extract year from month for grouping
    .withColumn("year", F.year("month"))
    # Create flag: 0=non-zero price (preferred), 1=zero price (fallback)
    # This ensures non-zero prices rank higher than zero prices
    .withColumn("is_zero", F.when(F.col("gross_price") == 0, 1).otherwise(0))
)

# Define window: partition by product/year, order by (non-zero first, then latest month)
# Ordering: is_zero ASC (0 before 1, so non-zero before zero)
#          month DESC (most recent month first)
w = (
    Window
    .partitionBy("product_code", "year")           # One price per product per year
    .orderBy(F.col("is_zero"), F.col("month").desc())  # Prefer non-zero, then latest
)

# Select only first row per window (latest non-zero price if available)
df_gold_latest_price = (
    df_gold_price
      .withColumn("rnk", F.row_number().over(w))   # Assign row numbers per window
      .filter(F.col("rnk") == 1)                   # Keep only rank 1 (latest/best)
)

In [0]:
# Display aggregated annual prices - verify one row per product per year
display(df_gold_latest_price)

product_code,month,gross_price,year,is_zero,rnk
062f5574bbdf4386b2c7c6075483b417b4a00b172fcba919dbba7dae1b774379,2025-12-01,281.0,2025,0,1
0cb7b2f42657b625f754e833aa1cf6a967be26f17415f5342302ebb0e90c8a28,2025-10-01,100.0,2025,0,1
102628255d24304d6bbe0438b1ac992054f262e0814d306d0a34d7356cef3268,2025-12-01,86.0,2025,0,1
2e387cef1424d6e7b162b45622d4b1a788d11776e33d05cc8552f4ecd2ea1896,2025-12-01,108.0,2025,0,1
3cab59f05924285270313afcfe40a08983bb03dd88f432e34fc6336914c14345,2025-12-01,493.0,2025,0,1
451f7167b28a25bde73995910e31c07dfa26411f1db47847f19e16747effbdaa,2025-12-01,187.0,2025,0,1
716fa4e54b7894c910180276e0535d49afb25cdcfac09533fb74ae00689e5742,2025-11-01,440.0,2025,0,1
778c2a7aa27bfdb211fd5ece048de80d00fbf3d6924bd908d91054796ba16ab6,2025-12-01,296.0,2025,0,1
77b6f538a9d0e0cf845db5c2cbecec46fdd30303b501e06f64baf1d4dc0e66f9,2025-12-01,50.0,2025,0,1
889c67757ece9c973791dfbc2d47b026a3342cc7255e47a3170329d158e897c2,2025-12-01,138.0,2025,0,1


In [0]:
# Prepare for parent merge - select required columns with standardized naming
df_gold_latest_price = df_gold_latest_price.select(
    "product_code",
    F.col("gross_price").alias("price_inr"),     # Rename to indicate currency (Indian Rupees)
    F.col("year").cast("string").alias("year")   # Cast to string for consistency with parent table
)

df_gold_latest_price.show(5)

+--------------------+---------+----+
|        product_code|price_inr|year|
+--------------------+---------+----+
|062f5574bbdf4386b...|    281.0|2025|
|0cb7b2f42657b625f...|    100.0|2025|
|102628255d24304d6...|     86.0|2025|
|2e387cef1424d6e7b...|    108.0|2025|
|3cab59f0592428527...|    493.0|2025|
+--------------------+---------+----+
only showing top 5 rows


In [0]:
# Display schema to verify data types before merge (year should be string)
df_gold_latest_price.printSchema()

root
 |-- product_code: string (nullable = true)
 |-- price_inr: double (nullable = true)
 |-- year: string (nullable = true)



In [0]:
# Perform MERGE operation to upsert SportsBar pricing into parent company dimension
delta_table = DeltaTable.forName(spark, f"{catalog}.{gold_schema}.dim_{data_source}")

delta_table.alias("target").merge(
    source=df_gold_latest_price.alias("source"),
    # Match on the table grain (product_code + year) so earlier years are preserved
    condition="target.product_code = source.product_code AND target.year = source.year"
).whenMatchedUpdate(
    # Update the price of an existing product/year row
    set={
        "price_inr": "source.price_inr"
    }
).whenNotMatchedInsert(
    # Insert product/year combinations not yet in parent pricing dimension
    values={
        "product_code": "source.product_code",
        "price_inr": "source.price_inr",
        "year": "source.year"
    }
).execute()  # Execute the MERGE operation

DataFrame[num_affected_rows: bigint, num_updated_rows: bigint, num_deleted_rows: bigint, num_inserted_rows: bigint]
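
A MERGE of this kind behaves like a keyed upsert, and the choice of match key decides whether a product's earlier years survive. The sketch below is a simplified in-memory model, not Delta's implementation. `upsert` and the sample rows are hypothetical, and the model does not reproduce the error Delta raises when several source rows match the same target row.

```python
def upsert(target, source, key_cols):
    """Return new rows: matched rows are updated, unmatched rows are appended."""
    result = [dict(row) for row in target]          # copy so target is untouched
    index = {tuple(row[c] for c in key_cols): i for i, row in enumerate(result)}
    for row in source:
        key = tuple(row[c] for c in key_cols)
        if key in index:
            result[index[key]].update(row)          # WHEN MATCHED -> update
        else:
            index[key] = len(result)
            result.append(dict(row))                # WHEN NOT MATCHED -> insert
    return result


# Hypothetical rows: the same product priced in two different years.
target = [{"product_code": "P1", "price_inr": 80.0, "year": "2024"}]
source = [{"product_code": "P1", "price_inr": 83.0, "year": "2025"}]

by_code = upsert(target, source, ["product_code"])
by_code_year = upsert(target, source, ["product_code", "year"])
print(by_code)        # one row: the 2024 price was overwritten by 2025
print(by_code_year)   # two rows: both years are kept
```

Keying on `product_code` alone keeps a single current price per product, while keying on `product_code` and `year` keeps the annual history that trend and margin analysis depend on.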

In [0]:
# Read and display final merged pricing dimension to verify MERGE success
gold_query = f"SELECT * FROM {catalog}.{gold_schema}.dim_{data_source};"
df_gold = spark.sql(gold_query)
# Display merged parent dimension - should now include all SportsBar product pricing
display(df_gold)

product_code,price_inr,year
ARCHDDE20D,2750,2024
ARCH158F41,5740,2024
ARCHAFF0E4,4610,2024
ARCH6B94F7,3910,2024
ARCH5D1FE7,1610,2024
ARCH7B49A9,1610,2024
ARCH497D34,1100,2024
ARCHE71D79,5830,2024
ARCHDD8749,3930,2024
BADMC045D4,4500,2024

