## Purpose of Bronze

Store raw data exactly as received, without business logic or transformations.
This layer acts as a source-of-truth backup and starting point for all analysis.

### Bronze Ingestion Code: Explanation
What this code does:

- Reads raw CSV files from the Lakehouse Files folder
- Loads the data into Spark DataFrames as-is
- Casts every column to string so the schema stays stable across runs
- Saves the data as Delta tables in the Bronze schema

What is intentionally NOT done:

- No data cleaning
- No null handling
- No business logic
- No timestamp conversion
- No joins

Why this is important:

- Preserves the original data for auditing and reprocessing
- Prevents accidental loss of information
- Allows downstream layers to be rebuilt safely
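To make the "accidental loss of information" point concrete, here is a minimal sketch in plain Python (no Spark needed). It assumes a hypothetical two-row extract. The column names `customer_id` and `signup_date` match the Bronze schema, but the values are invented for illustration. It shows that an early numeric cast cannot be undone, whereas keeping the raw string loses nothing.

```python
import csv
import io

# Hypothetical two-row extract; the values are illustrative, not from the real dataset.
raw_csv = "customer_id,signup_date\n00123,2024-01-05\n00456,05/01/2024\n"

# The stdlib csv reader, like Spark with inferSchema=false, returns every value as a string.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Kept as strings, each value is exactly what the file contained.
as_strings = [row["customer_id"] for row in rows]

# Casting to int early silently drops the leading zeros...
as_ints = [int(value) for value in as_strings]

# ...and converting back does not restore them.
round_tripped = [str(value) for value in as_ints]

print(as_strings)     # ['00123', '00456']
print(round_tripped)  # ['123', '456']
```

The same reasoning applies to the mixed date formats in `signup_date`: deciding how to parse them is left to a later layer, so Bronze never has to guess.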



In [2]:
from pyspark.sql import functions as F

# ----------------------------
# 0) Create Bronze database
# ----------------------------
spark.sql("CREATE DATABASE IF NOT EXISTS bronze")

# ----------------------------
# 1) Helper: read raw CSV exactly as raw + enforce schema consistency
#    - No cleaning
#    - No timestamp conversion
#    - No KPI logic
#    - Only cast all columns to STRING so schema remains stable across runs
# ----------------------------
def read_csv_as_raw_strings(file_path: str):
    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "false")   # IMPORTANT: do not infer types; every column is read as a string
        .option("multiLine", "true")      # safer for fields with embedded line breaks
        .option("escape", "\"")
        .option("quote", "\"")
        .csv(file_path)
    )

    # Schema consistency: explicit cast to STRING. With inferSchema off the columns
    # are already strings, so this is a safeguard and changes no values.
    df = df.select([F.col(c).cast("string").alias(c) for c in df.columns])
    return df

# ----------------------------
# 2) Configure files → Bronze table names
#    Update filenames if your Kaggle extract uses different names.
# ----------------------------
datasets = [
    ("Files/bronze_raw/customers.csv",   "bronze.customers_raw"),
    ("Files/bronze_raw/sessions.csv",    "bronze.sessions_raw"),
    ("Files/bronze_raw/events.csv",      "bronze.events_raw"),   # adjust the path if your events file has a different name
    ("Files/bronze_raw/products.csv",    "bronze.products_raw"),
    ("Files/bronze_raw/orders.csv",      "bronze.orders_raw"),
    ("Files/bronze_raw/order_items.csv", "bronze.order_items_raw"),
]

# ----------------------------
# 3) Ingest each CSV to Delta Bronze table
# ----------------------------
for file_path, table_name in datasets:
    df = read_csv_as_raw_strings(file_path)

    # Overwrite makes re-runs deterministic (safe for Bronze rebuilds)
    (df.write
        .mode("overwrite")
        .format("delta")
        .saveAsTable(table_name)
    )

    print(f"✅ Loaded {file_path} -> {table_name} | rows={df.count()} | cols={len(df.columns)}")

# ----------------------------
# 4) Quick verification (shows schemas, row counts)
# ----------------------------
for _, table_name in datasets:
    print("\n---", table_name, "---")
    spark.table(table_name).printSchema()
    print("rows =", spark.table(table_name).count())


✅ Loaded Files/bronze_raw/customers.csv -> bronze.customers_raw | rows=20000 | cols=7
✅ Loaded Files/bronze_raw/sessions.csv -> bronze.sessions_raw | rows=120000 | cols=6
✅ Loaded Files/bronze_raw/events.csv -> bronze.events_raw | rows=760958 | cols=10
✅ Loaded Files/bronze_raw/products.csv -> bronze.products_raw | rows=1197 | cols=6
✅ Loaded Files/bronze_raw/orders.csv -> bronze.orders_raw | rows=33580 | cols=10
✅ Loaded Files/bronze_raw/order_items.csv -> bronze.order_items_raw | rows=59163 | cols=5

--- bronze.customers_raw ---
root
 |-- customer_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- country: string (nullable = true)
 |-- age: string (nullable = true)
 |-- signup_date: string (nullable = true)
 |-- marketing_opt_in: string (nullable = true)

rows = 20000

--- bronze.sessions_raw ---
root
 |-- session_id: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- start_time: string (nullable = true)
 |
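
The row counts printed in the load log above can double as a regression check on later re-runs. The following is a rough sketch, not part of the original notebook: `find_count_mismatches` is a hypothetical helper, and the "actual" counts are simulated here so the example runs without Spark. In the notebook they would presumably come from `spark.table(t).count()`.

```python
def find_count_mismatches(expected, actual):
    """Return {table: (expected_rows, actual_rows)} for every table whose counts differ."""
    mismatches = {}
    for table, expected_rows in expected.items():
        actual_rows = actual.get(table)  # None if the table is missing entirely
        if actual_rows != expected_rows:
            mismatches[table] = (expected_rows, actual_rows)
    return mismatches


# Row counts copied from the load log above.
expected_counts = {
    "bronze.customers_raw": 20000,
    "bronze.sessions_raw": 120000,
    "bronze.events_raw": 760958,
    "bronze.products_raw": 1197,
    "bronze.orders_raw": 33580,
    "bronze.order_items_raw": 59163,
}

# Simulated re-run result with one deliberately wrong count (assumption for the demo).
# In the notebook: actual_counts = {t: spark.table(t).count() for t in expected_counts}
actual_counts = dict(expected_counts)
actual_counts["bronze.orders_raw"] = 33579

print(find_count_mismatches(expected_counts, actual_counts))
# {'bronze.orders_raw': (33580, 33579)}
```

An empty dictionary means every Bronze table still holds the number of rows that was originally loaded.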

In [3]:
# 1) Confirm tables exist
spark.sql("SHOW TABLES IN bronze").show(truncate=False)

# 2) Row counts (quick health check)
tables = [
    "bronze.customers_raw",
    "bronze.sessions_raw",
    "bronze.events_raw",
    "bronze.products_raw",
    "bronze.orders_raw",
    "bronze.order_items_raw",
]

for t in tables:
    df = spark.table(t)
    print(f"{t} -> rows={df.count()}, cols={len(df.columns)}")


+-------------------------------------------------------+---------------+-----------+
|namespace                                              |tableName      |isTemporary|
+-------------------------------------------------------+---------------+-----------+
|`ecommerce-funnel-analytics`.ecommerce_lakehouse.bronze|customers_raw  |false      |
|`ecommerce-funnel-analytics`.ecommerce_lakehouse.bronze|events_raw     |false      |
|`ecommerce-funnel-analytics`.ecommerce_lakehouse.bronze|order_items_raw|false      |
|`ecommerce-funnel-analytics`.ecommerce_lakehouse.bronze|orders_raw     |false      |
|`ecommerce-funnel-analytics`.ecommerce_lakehouse.bronze|products_raw   |false      |
|`ecommerce-funnel-analytics`.ecommerce_lakehouse.bronze|sessions_raw   |false      |
+-------------------------------------------------------+---------------+-----------+

bronze.customers_raw -> rows=20000, cols=7
bronze.sessions_raw -> rows=120000, cols=6
bronze.events_raw -> rows=760958, cols=10
bronze.prod