## Configure Environment and Paths
Define the variables that identify the catalog, schemas, volume, and target table. Centralizing configuration in one cell makes the notebook easier to maintain and to adapt to other environments. We also set the active catalog and schema so that subsequent commands can use unqualified names.

In [0]:
main_catalog = "main"

volume_schema = "lakehouse_sales"
volume_name = "raw"

bronze_schema = "lakehouse_sales_bronze"
bronze_table = "bronze_sales"

In [0]:
spark.sql(f"USE CATALOG {main_catalog}")
spark.sql(f"USE SCHEMA {volume_schema}")
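
Unqualified names depend on the session's current catalog and schema, which can be surprising when cells are run out of order. As a minimal sketch, the same variables can also produce fully qualified identifiers and the volume path. The helpers `qualified_name` and `volume_root` are hypothetical and do not appear in the original notebook.

```python
def qualified_name(catalog: str, schema: str, obj: str) -> str:
    """Return a backtick-quoted three-level Unity Catalog name."""
    parts = (catalog, schema, obj)
    # Escape embedded backticks by doubling them, then quote each part.
    return ".".join("`" + p.replace("`", "``") + "`" for p in parts)


def volume_root(catalog: str, schema: str, volume: str) -> str:
    """Return the /Volumes path for a Unity Catalog volume."""
    return f"/Volumes/{catalog}/{schema}/{volume}"


# Values mirror the configuration cell above.
main_catalog = "main"
volume_schema = "lakehouse_sales"
volume_name = "raw"
bronze_schema = "lakehouse_sales_bronze"
bronze_table = "bronze_sales"

bronze_fqn = qualified_name(main_catalog, bronze_schema, bronze_table)
volume_path = volume_root(main_catalog, volume_schema, volume_name)

print(bronze_fqn)   # `main`.`lakehouse_sales_bronze`.`bronze_sales`
print(volume_path)  # /Volumes/main/lakehouse_sales/raw
```

A fully qualified name such as `bronze_fqn` could then be interpolated into SQL statements in place of the bare table name.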

## Generate Synthetic Sales Data
To keep this notebook self-contained, we generate a synthetic dataset of sales orders. A Python function built on the **Faker** library produces mock records with order IDs, dates, country codes, SKUs, quantities, and unit prices. Seeding both Faker and `random` makes the output reproducible, and the approach simulates a typical raw data source without depending on external files.

In [0]:
import random
from datetime import datetime

import pandas as pd
from faker import Faker

def generate_orders(start_id=1, end_id=100, start_date='2024-01-01', end_date='2024-12-31', seed=42):
    """
    Generates a list of dictionaries in the format:
    {"order_id": int, "order_date": "YYYY-MM-DD", "country": "BR", "sku": "SKU-001", "qty": int, "unit_price": float}
    """
    Faker.seed(seed)
    random.seed(seed)
    fake = Faker()

    start = datetime.strptime(start_date, '%Y-%m-%d').date()
    end = datetime.strptime(end_date, '%Y-%m-%d').date()

    data = []
    for order_id in range(start_id, end_id + 1):
        order_date = fake.date_between(start_date=start, end_date=end).strftime('%Y-%m-%d')
        country = fake.country_code()
        sku = f"SKU-{random.randint(1, 50):03d}"
        qty = random.randint(1, 5)
        unit_price = round(random.uniform(9.0, 499.0), 2)

        data.append({
            "order_id": order_id,
            "order_date": order_date,
            "country": country,
            "sku": sku,
            "qty": qty,
            "unit_price": unit_price
        })
    return data

data = generate_orders(start_id=1000, end_id=2000, start_date='2024-01-01', end_date='2024-12-31', seed=7)
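
Before staging the records, it can help to confirm that they match the shape described in the docstring. The following is a minimal sketch of such a check. `validate_orders` is a hypothetical helper, the bounds mirror the ranges used in `generate_orders`, and the two hand-written sample records stand in for the generated data so that the block runs without Faker.

```python
from datetime import datetime

EXPECTED_KEYS = {"order_id", "order_date", "country", "sku", "qty", "unit_price"}


def validate_orders(records):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    seen_ids = set()
    for i, rec in enumerate(records):
        if set(rec) != EXPECTED_KEYS:
            problems.append(f"record {i}: unexpected keys {sorted(rec)}")
            continue
        if rec["order_id"] in seen_ids:
            problems.append(f"record {i}: duplicate order_id {rec['order_id']}")
        seen_ids.add(rec["order_id"])
        try:
            datetime.strptime(rec["order_date"], "%Y-%m-%d")
        except ValueError:
            problems.append(f"record {i}: bad order_date {rec['order_date']!r}")
        if not 1 <= rec["qty"] <= 5:
            problems.append(f"record {i}: qty {rec['qty']} outside 1..5")
        if not 9.0 <= rec["unit_price"] <= 499.0:
            problems.append(f"record {i}: unit_price {rec['unit_price']} outside 9.0..499.0")
    return problems


# The second record is deliberately broken: duplicate ID, invalid month, qty too large.
sample = [
    {"order_id": 1000, "order_date": "2024-03-15", "country": "BR",
     "sku": "SKU-001", "qty": 2, "unit_price": 19.99},
    {"order_id": 1000, "order_date": "2024-13-01", "country": "US",
     "sku": "SKU-002", "qty": 9, "unit_price": 25.0},
]

issues = validate_orders(sample)
for issue in issues:
    print(issue)
```

In the notebook, calling `validate_orders(data)` on the generated list should return an empty list.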

## Stage Raw Data in a Unity Catalog Volume
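The generated data is converted into a Spark DataFrame and written in CSV format to a pre-configured Unity Catalog volume. Note that Spark treats `raw_sales_2.csv` as a directory: `coalesce(1)` yields a single part file inside it rather than one file with that exact name. This pattern mimics a common real-world scenario where raw data files land in a staging location before being loaded into the Bronze layer of the lakehouse.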

In [0]:
df = pd.DataFrame(data)

volume_path = f"/Volumes/{main_catalog}/{volume_schema}/{volume_name}"
csv_path = f"{volume_path}/raw_sales_2.csv"

spark_df = spark.createDataFrame(df)

# Write one part file with a header row; Spark creates csv_path as a directory.
spark_df.coalesce(1).write \
  .mode("overwrite") \
  .option("header", "true") \
  .csv(csv_path)
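
If a single file with an exact name is preferred over a directory of part files, one alternative is to write the records with Python's standard `csv` module. This is a sketch under the assumption that the volume is reachable as a regular file path from the notebook's compute, which is how `/Volumes/...` paths normally behave on Databricks. `write_orders_csv` is a hypothetical helper, and a temporary directory stands in for the volume so the block runs anywhere.

```python
import csv
import os
import tempfile

FIELDS = ["order_id", "order_date", "country", "sku", "qty", "unit_price"]


def write_orders_csv(records, path):
    """Write records to exactly one CSV file at `path` and return the row count."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)
    return len(records)


# Stand-in for the volume path; in the notebook this would be volume_path.
target_dir = tempfile.mkdtemp()
target_file = os.path.join(target_dir, "raw_sales_2.csv")

rows = [
    {"order_id": 1000, "order_date": "2024-03-15", "country": "BR",
     "sku": "SKU-001", "qty": 2, "unit_price": 19.99},
]
written = write_orders_csv(rows, target_file)
print(written, os.path.isfile(target_file))
```

The trade-off is that this write runs on the driver only, which is fine for a small synthetic dataset but does not scale the way a Spark write does.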

## Ingest Raw Data into the Bronze Table
Use the `COPY INTO` command to load the raw sales data from the staged CSV output into the `bronze_sales` Delta table. `COPY INTO` is idempotent: it tracks which source files have already been loaded and skips them on later runs, which makes it well suited to incremental ingestion. The `mergeSchema` copy option enables schema evolution, so new columns in the source data are added to the table instead of failing the pipeline.
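
On Databricks, `COPY INTO` loads into an existing Delta table, and this notebook does not show where `bronze_sales` is created. One option, assuming a runtime that supports empty placeholder tables whose schema is filled in by the first load with `mergeSchema` enabled, is to create the table just before the load. The sketch below only assembles the SQL strings. `render_options` and `build_copy_into` are hypothetical helpers, not part of any Databricks API.

```python
def render_options(options):
    """Render a dict as the 'key' = 'value' list used by FORMAT_OPTIONS and COPY_OPTIONS."""
    return ", ".join(f"'{key}' = '{value}'" for key, value in options.items())


def build_copy_into(table, source_path, format_options=None, copy_options=None):
    """Assemble a COPY INTO statement for a CSV source (hypothetical helper)."""
    lines = [
        f"COPY INTO {table}",
        f"FROM '{source_path}'",
        "FILEFORMAT = CSV",
    ]
    if format_options:
        lines.append(f"FORMAT_OPTIONS ({render_options(format_options)})")
    if copy_options:
        lines.append(f"COPY_OPTIONS ({render_options(copy_options)})")
    return "\n".join(lines)


bronze_table = "bronze_sales"
source_path = "/Volumes/main/lakehouse_sales/raw/raw_sales_2.csv"

# An empty placeholder table; its columns would be inferred by the first load.
create_stmt = f"CREATE TABLE IF NOT EXISTS {bronze_table}"
copy_stmt = build_copy_into(
    bronze_table,
    source_path,
    format_options={"header": "true", "inferSchema": "true"},
    copy_options={"mergeSchema": "true"},
)
print(create_stmt)
print(copy_stmt)

# In the notebook, once USE SCHEMA has selected the bronze schema:
# spark.sql(create_stmt)
# spark.sql(copy_stmt).display()
```

Building the statement in a function keeps the options in one place if more files or tables are added later.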

In [0]:
spark.sql(f"USE SCHEMA {bronze_schema}")

spark.sql(f"""
    COPY INTO {bronze_table}
    FROM '{csv_path}'
    FILEFORMAT = CSV
    FORMAT_OPTIONS('header'='true', 'inferSchema'='true')
    COPY_OPTIONS('mergeSchema'='true')
""").display()

## Validate Bronze Table Ingestion
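Run two simple validation queries to confirm that ingestion succeeded.
- The first query counts the total rows in the `bronze_sales` table. With the generator settings used earlier (order IDs 1000 through 2000, inclusive), a first load into an empty table should yield 1,001 rows.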
- The second query checks the table's transaction history, providing metadata about the `COPY INTO` operation and confirming the versioning capabilities of Delta Lake.

In [0]:
%sql
SELECT COUNT(*) AS bronze_count FROM bronze_sales;

In [0]:
%sql
DESCRIBE HISTORY bronze_sales;
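
The count check can also be automated. The sketch below derives the expected number of rows from the generator's ID range and classifies an observed count. Both helpers are hypothetical. The "multiple loads" case rests on an assumption: rerunning the staging cell writes a new part file under a new name, which `COPY INTO` may treat as an unseen file and load again.

```python
def expected_order_count(start_id, end_id):
    """generate_orders iterates over range(start_id, end_id + 1), so both ends count."""
    return end_id - start_id + 1


def check_bronze_count(actual, per_load):
    """Classify the observed row count relative to one load's worth of rows."""
    if actual == per_load:
        return "ok: exactly one load"
    if actual > 0 and actual % per_load == 0:
        return f"warning: {actual // per_load} loads present"
    return "error: unexpected row count"


per_load = expected_order_count(1000, 2000)
print(per_load)                             # 1001
print(check_bronze_count(1001, per_load))   # ok: exactly one load
print(check_bronze_count(2002, per_load))   # warning: 2 loads present

# In the notebook, the observed value would come from the table itself:
# actual = spark.table("bronze_sales").count()
```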