# Storage to Bronze Pipeline Documentation
### Overview
This notebook loads raw data from Azure Data Lake Storage (ADLS) into the Bronze layer using Delta Live Tables (DLT).

### Process
- Authenticate to access data in ADLS
- Configure the Bronze layer dictionary
- Define a function that creates a DLT table
- Loop through the dictionary and call the function to create each DLT table

## 1. Authenticate  
The following code authenticates access to Azure Data Lake Storage (ADLS) using a storage account key stored as a secret in Azure Key Vault.

- `storage_account_name`: Name of the Azure storage account.
- `secret_scope`: The secret scope backed by Azure Key Vault.
- `secret_key`: The name of the secret in that scope that holds the storage account key.

This configuration allows us to securely access the storage account and load data.

In [0]:
# Define variables for storage authentication
storage_account_name = "dornystorage"
secret_scope = "dorny-key-vault"
secret_key = "storage"

# Run this to set authentication
spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
    dbutils.secrets.get(scope=secret_scope, key=secret_key))
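
The account name appears both in the Spark configuration key set above and in the host part of every `abfss://` path read later, so the two must agree. The following is a minimal sketch of how both strings can be derived from the same account name. The helper names `account_key_conf` and `abfss_uri` are hypothetical and are not part of the notebook.

```python
def account_key_conf(storage_account_name: str) -> str:
    """Build the Spark config key under which the account key is set."""
    return f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net"


def abfss_uri(container: str, storage_account_name: str, path: str = "") -> str:
    """Build an abfss:// URI for a path inside a container of the same account."""
    host = f"{storage_account_name}.dfs.core.windows.net"
    return f"abfss://{container}@{host}/{path.lstrip('/')}"


# Both values come from the same account name, so they cannot drift apart.
conf_key = account_key_conf("dornystorage")
base_path = abfss_uri("bronze", "dornystorage", "SalesLT/")
print(conf_key)
print(base_path)
```

Deriving both strings from one variable avoids the situation where the key is registered for one account while the path points at another.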


## 2. Bronze Layer Configuration
The Bronze layer is used to store raw, untransformed data. This section of the code defines the base path to the Bronze layer and sets up a dictionary with table configurations that specify the folder structure for each table in the Bronze layer.

- `bronze_base_path`: The root directory in the Azure Data Lake Storage where all Bronze layer data resides.
- `bronze_tables_config`: A dictionary mapping each table name to its corresponding folder in the Bronze layer.

In [0]:
import dlt
# Import only the functions used below instead of a wildcard import
from pyspark.sql.functions import col, current_timestamp

# Base path for Bronze layer
bronze_base_path = r"abfss://bronze@dornystorage.dfs.core.windows.net/SalesLT/"

In [0]:
# Dictionary to store table configurations
bronze_tables_config = {
    "adventureworks.bronze.Address": {
        "folder": "Address"
    },
    "adventureworks.bronze.Customer": {
        "folder": "Customer"
    },
    "adventureworks.bronze.CustomerAddress": {
        "folder": "CustomerAddress"
    },
    "adventureworks.bronze.Product": {
        "folder": "Product"
    },
    "adventureworks.bronze.ProductCategory": {
        "folder": "ProductCategory"
    },
    "adventureworks.bronze.ProductDescription": {
        "folder": "ProductDescription"
    },
    "adventureworks.bronze.ProductModel": {
        "folder": "ProductModel"
    },
    "adventureworks.bronze.ProductModelProductDescription": {
        "folder": "ProductModelProductDescription"
    },
    "adventureworks.bronze.SalesOrderDetail": {
        "folder": "SalesOrderDetail"
    },
    "adventureworks.bronze.SalesOrderHeader": {
        "folder": "SalesOrderHeader"
    }
}
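
Every entry above follows the same pattern: the folder name equals the last part of the table name. Assuming that convention holds for all tables, the dictionary could be generated from a list of names. The sketch below is one possible way to do that; `build_bronze_tables_config` and `generated_config` are hypothetical names that do not appear in the notebook.

```python
def build_bronze_tables_config(table_names, prefix="adventureworks.bronze"):
    """Map '<prefix>.<name>' to {'folder': <name>} for each table name."""
    return {f"{prefix}.{name}": {"folder": name} for name in table_names}


# Table names taken from the dictionary in this notebook.
saleslt_tables = [
    "Address",
    "Customer",
    "CustomerAddress",
    "Product",
    "ProductCategory",
    "ProductDescription",
    "ProductModel",
    "ProductModelProductDescription",
    "SalesOrderDetail",
    "SalesOrderHeader",
]

generated_config = build_bronze_tables_config(saleslt_tables)
print(generated_config["adventureworks.bronze.Address"])
```

The explicit dictionary remains the better choice if some tables later need a folder that differs from the table name or extra per-table options.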

## 3. Function to Create DLT Tables
The `create_bronze_table` function dynamically creates a Delta Live Tables (DLT) table for each entry in the Bronze configuration. It reads data incrementally as a stream from the specified folder using Auto Loader (`cloudFiles`), adds an ingestion timestamp, and records the source file path for each row.

### Function Details:
- `@dlt.table`: This decorator defines the DLT table and sets its name, comment, and table properties. The `quality` property is set to "bronze", indicating that the data is raw and untransformed. The `pipelines.reset.allowed` property is set to "false", which is intended to prevent a full refresh from clearing the table.
- `load_data()`: This function reads the Parquet files from the specified folder in Azure Data Lake Storage (ADLS) as a stream, adds an ingestion timestamp, and captures the path of the file each row was ingested from.

In [0]:
# Function to dynamically create DLT tables
def create_bronze_table(table_name, folder):
    @dlt.table(
        name=table_name,
        comment="Raw data from Bronze layer",
        table_properties={
            "quality": "bronze",
            "pipelines.reset.allowed": "false"}
    )
    def load_data():
        table_path = f"{bronze_base_path}{folder}/"
        return (
            spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "parquet")
            .load(table_path)
            .withColumn("ingestion_time", current_timestamp())  # Add ingestion timestamp
            .withColumn("source_file", col("_metadata.file_path"))  # Use _metadata.file_path instead of input_file_name
        )
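
One likely reason for wrapping the decorated function in `create_bronze_table` is Python's late-binding closure behaviour: a function defined directly inside a loop looks up the loop variable when it is called, not when it is defined. The sketch below demonstrates this with plain Python and no Spark or DLT. The names `build_loaders`, `load_naive` and `make_loader` are hypothetical.

```python
def build_loaders(base, folders):
    """Return (naive, factory) lists of functions that each build a path."""
    # Naive version: all functions share the single loop variable `folder`.
    naive = []
    for folder in folders:
        def load_naive():
            return f"{base}{folder}/"
        naive.append(load_naive)

    # Factory version: each call gets its own `folder_name` binding.
    def make_loader(folder_name):
        def load():
            return f"{base}{folder_name}/"
        return load

    factory = [make_loader(name) for name in folders]
    return naive, factory


base = "abfss://bronze@dornystorage.dfs.core.windows.net/SalesLT/"
naive, factory = build_loaders(base, ["Address", "Customer"])

naive_paths = [fn() for fn in naive]      # both end in "Customer/"
factory_paths = [fn() for fn in factory]  # "Address/" then "Customer/"
print(naive_paths)
print(factory_paths)
```

Passing `table_name` and `folder` as function arguments, as the notebook does, gives each table definition its own values in the same way as `make_loader`.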


## 4. Creating Tables for All Configured Data  
The loop below iterates over the `bronze_tables_config` dictionary and calls the `create_bronze_table` function for each entry, passing the table name and folder as parameters.
This creates one DLT table per entry in the configuration, each loaded from its corresponding folder in the Bronze layer.

In [0]:
# Loop through the dictionary to create tables
for table_name, config in bronze_tables_config.items():
    create_bronze_table(table_name, config["folder"])
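
Before running the pipeline, it can help to check which source path each table will read from. The sketch below is a dry run that needs neither Spark nor DLT. It assumes the same dictionary shape and path format as the notebook, and the function name `plan_bronze_loads` is hypothetical.

```python
def plan_bronze_loads(tables_config, base_path):
    """Return (table_name, source_path) pairs, failing early on bad entries."""
    plan = []
    for table_name, config in tables_config.items():
        folder = config.get("folder")
        if not folder:
            raise ValueError(f"No folder configured for {table_name}")
        # Same path format as table_path in create_bronze_table.
        plan.append((table_name, f"{base_path}{folder}/"))
    return plan


sample_config = {
    "adventureworks.bronze.Address": {"folder": "Address"},
    "adventureworks.bronze.Customer": {"folder": "Customer"},
}
plan = plan_bronze_loads(
    sample_config,
    "abfss://bronze@dornystorage.dfs.core.windows.net/SalesLT/",
)
for table_name, path in plan:
    print(f"{table_name} <- {path}")
```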

| Name | Type |
| --- | --- |
| SalesOrderID | int |
| RevisionNumber | int |
| OrderDate | timestamp |
| DueDate | timestamp |
| ShipDate | timestamp |
| Status | int |
| OnlineOrderFlag | boolean |
| SalesOrderNumber | string |
| PurchaseOrderNumber | string |
| AccountNumber | string |

## Additional Notes
- Ingestion Timestamp: The ingestion timestamp (`ingestion_time`) is added to each record to keep track of when the data was ingested into the pipeline.
- File Path: The source file path (`source_file`) is extracted from the metadata of the file being ingested. This can be useful for traceability and debugging purposes.
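
As an illustration of how `source_file` can support traceability, the sketch below splits a path of the form used in this notebook into its container, account host, folder and file name using only the Python standard library. The function name `describe_source_file` and the file name `part-00000.parquet` are hypothetical, and real values of `source_file` may be URL-encoded, which this sketch does not handle.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def describe_source_file(source_file: str) -> dict:
    """Split an abfss:// file path into its main components."""
    parts = urlsplit(source_file)
    path = PurePosixPath(parts.path)
    return {
        "container": parts.username,     # text before '@' in the URI
        "account_host": parts.hostname,  # storage account endpoint
        "folder": path.parent.name,      # table folder, e.g. "Address"
        "file_name": path.name,
    }


# Hypothetical file name, used only for illustration.
example = describe_source_file(
    "abfss://bronze@dornystorage.dfs.core.windows.net/SalesLT/Address/part-00000.parquet"
)
print(example)
```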