### Setting Up the Unity Catalog Structure

Before we can generate or process any data, we must first establish the foundational structure for our project within Databricks. This initial setup is performed using SQL commands to create the necessary components in **Unity Catalog**, which is Databricks' unified governance solution for all data and AI assets.

This code block prepares the logical "scaffolding" for our lakehouse environment, ensuring our data is organized, secure, and easily discoverable.

### Code Explanation

The following SQL script executes several key setup operations:

1.  **`CREATE CATALOG IF NOT EXISTS main;`**
    * **What it does:** This command creates the top-level container in Unity Catalog's three-level namespace (`catalog.schema.table`). A catalog is the highest level of data organization. We are using a standard catalog named `main`.
    * **Why we do it:** It provides a primary grouping for all the schemas and data assets related to a specific environment or division within an organization. Using `IF NOT EXISTS` ensures the script can be re-run without causing errors.

2.  **`CREATE SCHEMA IF NOT EXISTS main.lakehouse_sales;`**
    * **What it does:** Inside the `main` catalog, this creates a **schema** (logically equivalent to a database) named `lakehouse_sales`.
    * **Why we do it:** This creates a dedicated, isolated namespace for our specific project. All tables, views, and volumes related to the "lakehouse sales" project will be neatly organized here, preventing conflicts with other projects.

3.  **`CREATE VOLUME IF NOT EXISTS main.lakehouse_sales.raw ...;`**
    * **What it does:** This is a crucial step that creates a **Volume** named `raw` within our project's schema.
    * **Why we do it:** Volumes are Unity Catalog objects for storing and accessing non-tabular files (CSV, JSON, images, and so on) in a governed way. We are creating this `raw` volume to serve as the **landing zone for our Bronze layer**: all raw, unprocessed files from source systems are stored here before any transformation begins.

With this foundation in place, we now have a well-defined and governed location (`/Volumes/main/lakehouse_sales/raw/`) to land our raw sales data in the next step.
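
The mapping from the three-level name to the Volume's file path can be sketched in plain Python. The helper names below (`qualified_name`, `volume_path`) are hypothetical and exist only for illustration; the `/Volumes/<catalog>/<schema>/<volume>` layout is the one used throughout this notebook.

```python
# Illustrative helpers (hypothetical names, not part of any Databricks API).

def qualified_name(catalog: str, schema: str, obj: str) -> str:
    """Build a three-level Unity Catalog name: catalog.schema.object."""
    return ".".join([catalog, schema, obj])


def volume_path(catalog: str, schema: str, volume: str) -> str:
    """Build the file path under which a Volume's contents are addressed."""
    return f"/Volumes/{catalog}/{schema}/{volume}"


print(qualified_name("main", "lakehouse_sales", "raw"))  # main.lakehouse_sales.raw
print(volume_path("main", "lakehouse_sales", "raw"))     # /Volumes/main/lakehouse_sales/raw
```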

In [0]:
%sql
CREATE CATALOG IF NOT EXISTS main;
CREATE SCHEMA IF NOT EXISTS main.lakehouse_sales;
CREATE VOLUME IF NOT EXISTS main.lakehouse_sales.raw COMMENT 'Raw files for Lakehouse Sales';

## Data Generation for Lakehouse Sales Analysis

This Python script is the foundational step for our data project within Databricks. Its primary purpose is to **generate a synthetic (or "mock") dataset** of sales orders. This is a common and crucial practice in data engineering, as it allows us to build and test our entire data pipeline without needing access to real, sensitive production data.

### How the Code Works

The script defines a function called `generate_orders` that programmatically creates a list of sales orders.

  * **Libraries Used:** It leverages powerful Python libraries:

      * **`Faker`**: To generate realistic-looking fake data like country codes and dates.
      * **`random`**: To create random numbers for quantities, SKUs, and prices.
      * **`datetime`**: To correctly handle and format date objects.

  * **Reproducibility:** We set a `seed` for both `Faker` and `random`. This is very important because it ensures that every time we run the script, we get the **exact same "random" data**. This makes our development process predictable and allows us to reliably test our transformations.

  * **Data Structure:** The function generates `n` order records (the default is 100; our call passes `n=1000`), where each order is a Python dictionary containing:

      * `order_id`: A unique identifier for the order.
      * `order_date`: A random date within the year 2024.
      * `country`: A random two-letter country code.
      * `sku`: A product identifier (Stock Keeping Unit), from a pool of 50 possible products.
      * `qty`: The number of units sold in the order.
      * `unit_price`: The price of a single unit.
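
The effect of seeding can be checked with the standard library alone. The sketch below is a simplified stand-in for `generate_orders`: it leaves out Faker (so there are no dates or country codes) and uses a hypothetical helper, `sample_lines`, with a private `random.Random` instance instead of the module-level seed.

```python
import random


def sample_lines(n: int, seed: int) -> list[dict]:
    """Generate n simplified order lines from a dedicated, seeded RNG."""
    rng = random.Random(seed)  # private generator; does not touch global state
    return [
        {
            "sku": f"SKU-{rng.randint(1, 50):03d}",
            "qty": rng.randint(1, 5),
            "unit_price": round(rng.uniform(9.0, 499.0), 2),
        }
        for _ in range(n)
    ]


first = sample_lines(5, seed=7)
second = sample_lines(5, seed=7)
print(first == second)  # True: the same seed reproduces the same records
```

Using a dedicated generator is a design choice for the sketch; seeding the global `random` module, as the notebook does, gives the same reproducibility as long as nothing else draws from it in between.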

### Purpose in the Databricks Project

The main goal here is to **simulate a raw data source**. In a real-world scenario, this data might come from a database, an API, or a stream of events. By generating it ourselves, we create a reliable starting point for our project.

This dataset will act as the **"Bronze" layer** in our Medallion architecture. We will use it to:

1.  Develop and test ETL (Extract, Transform, Load) logic.
2.  Build data cleaning and validation steps.
3.  Create transformations to prepare the data for analysis (the "Silver" and "Gold" layers).
4.  Perform exploratory data analysis and build visualizations.

In [0]:
import random
from datetime import datetime

import pandas as pd
from faker import Faker

def generate_orders(n=100, start_date='2024-01-01', end_date='2024-12-31', seed=42):
    """
    Generates a list of dictionaries in the format:
    {"order_id": int, "order_date": "YYYY-MM-DD", "country": "BR", "sku": "SKU-001", "qty": int, "unit_price": float}
    """
    Faker.seed(seed)
    random.seed(seed)
    fake = Faker()

    start = datetime.strptime(start_date, '%Y-%m-%d').date()
    end = datetime.strptime(end_date, '%Y-%m-%d').date()

    data = []
    for order_id in range(1, n + 1):
        order_date = fake.date_between(start_date=start, end_date=end).strftime('%Y-%m-%d')
        country = fake.country_code()
        sku = f"SKU-{random.randint(1, 50):03d}"
        qty = random.randint(1, 5)
        unit_price = round(random.uniform(9.0, 499.0), 2)

        data.append({
            "order_id": order_id,
            "order_date": order_date,
            "country": country,
            "sku": sku,
            "qty": qty,
            "unit_price": unit_price
        })
    return data

data = generate_orders(n=1000, start_date='2024-01-01', end_date='2024-12-31', seed=7)
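
A quick sanity check against the documented record structure might look like the following. `validate_order` is a hypothetical helper, and the sample records are hand-written rather than taken from the generated data; the ranges mirror the ones hard-coded in `generate_orders`, and the country check assumes two uppercase letters.

```python
import re
from datetime import datetime

SKU_PATTERN = re.compile(r"^SKU-(\d{3})$")


def validate_order(order: dict) -> list[str]:
    """Return a list of problems found in one order record (empty if valid)."""
    problems = []

    if not isinstance(order.get("order_id"), int) or order["order_id"] < 1:
        problems.append("order_id must be a positive integer")

    try:
        datetime.strptime(order.get("order_date"), "%Y-%m-%d")
    except (TypeError, ValueError):
        problems.append("order_date must be formatted as YYYY-MM-DD")

    country = order.get("country")
    if not (isinstance(country, str) and len(country) == 2
            and country.isalpha() and country.isupper()):
        problems.append("country must be a two-letter uppercase code")

    match = SKU_PATTERN.match(str(order.get("sku")))
    if not match or not 1 <= int(match.group(1)) <= 50:
        problems.append("sku must look like SKU-001 .. SKU-050")

    qty = order.get("qty")
    if not isinstance(qty, int) or not 1 <= qty <= 5:
        problems.append("qty must be an integer between 1 and 5")

    price = order.get("unit_price")
    if not isinstance(price, float) or not 9.0 <= price <= 499.0:
        problems.append("unit_price must be a float between 9.0 and 499.0")

    return problems


good = {"order_id": 1, "order_date": "2024-03-15", "country": "BR",
        "sku": "SKU-001", "qty": 2, "unit_price": 19.99}
bad = dict(good, sku="SKU-1", qty=0)

print(validate_order(good))  # []
print(validate_order(bad))   # two problems: sku and qty
```

In the notebook, the same function could be applied to the generated list, for example `[o for o in data if validate_order(o)]`, to collect any records that break the expected shape.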

### Ingesting and Persisting Raw Data to the Bronze Layer

Following the in-memory data generation, this code block performs the crucial first step of any ETL pipeline: **ingestion**. Here, we take the synthetic order data we created and persist it to a permanent storage location within our Databricks lakehouse. This simulates the landing of raw data from an external source system into our initial data layer, commonly known as the **Bronze layer**.

### Code Explanation

1.  **Pandas to Spark Conversion:**
    * `df = pd.DataFrame(data)`: First, the list of Python dictionaries is converted into a **Pandas DataFrame**. This is a common and convenient intermediate format.
    * `spark_df = spark.createDataFrame(df)`: The Pandas DataFrame is then converted into a **Spark DataFrame**. This hands the data to Spark, so that it can be partitioned across the cluster and processed in parallel in all subsequent transformations.

2.  **Defining the Storage Path:**
    * `volume_path = "/Volumes/main/lakehouse_sales/raw"`: We define a path within **Databricks Volumes**, the Unity Catalog mechanism for governed access to non-tabular files. The `raw` volume holds our untouched, original source files and acts as the landing zone that feeds the Bronze layer.

3.  **Writing the Data to a CSV File:**
    * The write is built as a chain of calls on the DataFrame. Let's break down each one:
    * `.coalesce(1)`: Called on the DataFrame itself (before `.write`), this reduces it to a single partition. Spark still treats the target path as a **directory**: `raw_sales_1.csv` is a folder containing one `part-*.csv` data file plus marker files such as `_SUCCESS`, instead of many part-files. A single data file is convenient for smaller datasets and simpler ingestion processes.
    * `.mode("overwrite")`: This specifies that if data already exists at the target path, it should be completely replaced. This is useful during development because it makes our script idempotent: we can re-run it multiple times and get the same clean result without manual cleanup.
    * `.option("header", "true")`: This ensures that the column names from our DataFrame are written as the first line in the CSV file, making the file self-describing and easier to read by other tools.
    * `.csv(csv_path)`: Finally, this command executes the write operation, saving the data in CSV format to the specified path within our Volume.

In [0]:
df = pd.DataFrame(data)

volume_path = "/Volumes/main/lakehouse_sales/raw"
csv_path = f"{volume_path}/raw_sales_1.csv"

spark_df = spark.createDataFrame(df)
spark_df.coalesce(1).write \
  .mode("overwrite") \
  .option("header", "true") \
  .csv(csv_path)
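
Because Spark's CSV writer treats the target path as a directory, a consumer that needs the actual data file has to look inside it. The sketch below simulates that layout in a temporary folder (the part-file name is made up; Spark generates its own) and locates the single data file with `pathlib`. `find_part_file` is a hypothetical helper, not a Spark or Databricks function.

```python
import tempfile
from pathlib import Path


def find_part_file(output_dir: Path) -> Path:
    """Return the only part-*.csv file inside a Spark-style output directory."""
    parts = sorted(output_dir.glob("part-*.csv"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    return parts[0]


# Simulate what a coalesce(1) CSV write leaves behind (names are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "raw_sales_1.csv"  # a directory, despite the suffix
    out_dir.mkdir()
    (out_dir / "_SUCCESS").touch()           # marker file, not data
    (out_dir / "part-00000-example.csv").write_text("order_id,qty\n1,2\n")

    part = find_part_file(out_dir)
    print(part.name)                         # part-00000-example.csv
    header = part.read_text().splitlines()[0]
    print(header)                            # order_id,qty
```

On Databricks the same function could be pointed at `Path(csv_path)`, assuming the Volume path is reachable through standard file APIs on your compute.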

## Inspecting the Raw Data Schema

After writing our sample sales data to a CSV file, the next logical step is to read it back into a Spark DataFrame. We'll use the `.printSchema()` method to inspect the data types that Spark inferred from the file. This allows us to verify if the schema is correct before proceeding with any transformations.

> **Best Practice Note:** While we use the `inferSchema=True` option for convenience in this initial step, it is highly recommended to **manually define a robust schema** for production jobs. Inferring schemas can be slow and may lead to incorrect data types, especially with null values or mixed data.

Inspecting and correcting the schema early is critical. An improper schema can lead to several significant problems:

  * **Storage Inefficiency:** Storing a number as a `string` type, for example, consumes significantly more disk space than storing it as an `integer` or `long`.
  * **Poor Query Performance:** Performing calculations on numeric data stored as strings requires explicit casting, which adds computational overhead and slows down queries.
  * **Increased Costs:** Both inefficient storage and slower queries consume more cloud resources (storage and compute), which directly translates to higher costs.

In [0]:
raw_sales = spark.read.csv(csv_path, header=True, inferSchema=True)
raw_sales.printSchema()
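
One way to follow the best-practice note above is to pass an explicit schema instead of inferring it. The sketch below builds a DDL-style schema string in plain Python; the column types are assumptions chosen to match the generated data, not something this notebook prescribes. `spark.read.csv` accepts its `schema` argument as either a `StructType` or a DDL string, so the Spark call is shown as a comment to keep the block runnable outside a cluster.

```python
# Assumed column types for the synthetic orders (adjust to your own data).
ORDER_COLUMNS = {
    "order_id": "INT",
    "order_date": "DATE",
    "country": "STRING",
    "sku": "STRING",
    "qty": "INT",
    "unit_price": "DOUBLE",
}


def to_ddl(columns: dict[str, str]) -> str:
    """Join column names and types into a DDL schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns.items())


orders_ddl = to_ddl(ORDER_COLUMNS)
print(orders_ddl)
# order_id INT, order_date DATE, country STRING, sku STRING, qty INT, unit_price DOUBLE

# In the notebook (requires an active SparkSession and the csv_path defined earlier):
# raw_sales = spark.read.csv(csv_path, header=True, schema=orders_ddl)
```

With an explicit schema Spark does not need the extra pass over the file that inference requires, and a column cannot silently end up with an unexpected type.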

### Creating the Bronze Table in Delta Lake

With our raw data read and its schema verified, we now transition from a simple file in a Volume to a structured, permanent table. This command takes the `raw_sales` DataFrame and saves it as a **managed table** within Unity Catalog. This table represents our official **Bronze layer**: the single source of truth for our raw, unfiltered sales data.

### Code Explanation

`raw_sales.write.mode("overwrite").saveAsTable("main.lakehouse_sales.raw_sales")`

* **`raw_sales.write`**: This initiates the DataFrameWriter API, which is used to save the contents of a DataFrame to storage.
* **`.mode("overwrite")`**: This setting ensures that if the table `raw_sales` already exists, its contents are replaced with the new data. Because the table is stored as Delta, the overwrite is recorded as a new table version rather than erasing the table's history. This makes our data ingestion script **idempotent**, meaning we can re-run it multiple times during development and always get a clean, consistent result.
* **`.saveAsTable("main.lakehouse_sales.raw_sales")`**: This is the core action. It saves the DataFrame as a table governed by Unity Catalog.
    * **Format:** By default, tables in Databricks are created in the **Delta Lake** format. This is a massive upgrade from a standard CSV file, providing critical features like ACID transactions (ensuring data integrity), time travel (version history), and schema enforcement. 🚀
    * **Location:** The table is saved using its fully qualified three-level name, placing it exactly where it belongs in our `lakehouse_sales` schema.

### Why This Matters

At this point, we have successfully ingested our source data. It's no longer just a loose file in a landing zone; it's a queryable, robust, and versioned Delta table. This Bronze table will now serve as the reliable foundation for all downstream transformations as we move towards creating our Silver and Gold layers.



In [0]:
raw_sales.write.mode("overwrite").saveAsTable("main.lakehouse_sales.raw_sales")

### Verifying Delta Lake Features with Table History

Now that our data is stored in the Delta Lake format, we can immediately take advantage of its built-in features. One of the most important is the **table history**: a log of every transaction (or change) committed to the table. This log is also what makes **"Time Travel"** (querying or restoring earlier versions) possible.

We can inspect this log using a simple SQL command: `DESCRIBE HISTORY`.

### Code Explanation

`DESCRIBE HISTORY main.lakehouse_sales.raw_sales;`

* **`DESCRIBE HISTORY`**: This command retrieves detailed information about each operation that has modified the Delta table since it was created.
* **`main.lakehouse_sales.raw_sales`**: This specifies the full name of our Bronze layer table.

When you run this command, you will see a table that includes columns like:

* **`version`**: The version number of the table after the commit. Each change creates a new version.
* **`timestamp`**: The exact time the transaction occurred.
* **`operation`**: The type of change made (e.g., `WRITE`, `CREATE TABLE`, `MERGE`, `DELETE`).
* **`operationParameters`**: Details about the operation, such as the mode used (`Overwrite`, `Append`).
* **`notebook`**: Information about the notebook that executed the change (such as its ID), which helps trace lineage.
* **`clusterId`**: The ID of the cluster used for the transaction.

### Why This is a Game-Changer

The output of this command proves that our `raw_sales` table is far more than just a static collection of data. It is a **fully auditable and versioned asset**.

This capability is fundamental to building a reliable data lakehouse:

1.  **Auditing and Governance:** You have a log of who changed what and when, which is essential for compliance and governance. Note that history is kept only for the table's configured retention period, not indefinitely.
2.  **Reproducibility:** You can query a previous version of the table to reproduce reports, machine learning models, or analyses from a specific point in time, as long as that version's data files have not been removed by `VACUUM`.
3.  **Error Recovery:** If a recent data ingestion job introduced bad data, you can "time travel" back to the version before the error occurred, effectively rolling back the change without complex restore procedures.

By running this simple query, we have demonstrated a core pillar of the Delta Lake architecture and confirmed that our Bronze layer is robust, reliable, and ready for production-level workloads.

In [0]:
%sql
DESCRIBE HISTORY main.lakehouse_sales.raw_sales;
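
To make the error-recovery idea concrete, the sketch below works on a hand-written stand-in for `DESCRIBE HISTORY` output (the rows are invented for illustration) and builds the SQL statements for reading or restoring an earlier version. The helper names are hypothetical; `VERSION AS OF` and `RESTORE TABLE ... TO VERSION AS OF` are Delta Lake SQL syntax, shown here only as strings.

```python
TABLE = "main.lakehouse_sales.raw_sales"

# Invented history rows, newest first, mimicking a few DESCRIBE HISTORY columns.
history = [
    {"version": 2, "operation": "WRITE", "timestamp": "2024-06-03 10:00:00"},
    {"version": 1, "operation": "WRITE", "timestamp": "2024-06-02 10:00:00"},
    {"version": 0, "operation": "CREATE OR REPLACE TABLE AS SELECT",
     "timestamp": "2024-06-01 10:00:00"},
]


def version_before(rows: list[dict], bad_version: int) -> int:
    """Return the highest version number that precedes the bad one."""
    earlier = [r["version"] for r in rows if r["version"] < bad_version]
    if not earlier:
        raise ValueError("no earlier version to fall back to")
    return max(earlier)


def read_version_sql(table: str, version: int) -> str:
    """Build a query that reads the table as of a given version."""
    return f"SELECT * FROM {table} VERSION AS OF {version}"


def restore_sql(table: str, version: int) -> str:
    """Build a statement that restores the table to a given version."""
    return f"RESTORE TABLE {table} TO VERSION AS OF {version}"


target = version_before(history, bad_version=2)
print(read_version_sql(TABLE, target))
print(restore_sql(TABLE, target))
# In the notebook these strings could be passed to spark.sql(...).
```

Reading the earlier version first and inspecting it before issuing the restore is the more cautious order, since the restore itself is recorded as a new commit on the table.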