# Bronze Layer Ingestion

Goal: Ingest raw JSON files into Bronze Delta tables.

Source Files:

  - customers.json
  - products.json
  - orders.json
  - sales.json
  - countries.json

Bronze Tables:

  - bronze.customers_raw
  - bronze.products_raw
  - bronze.orders_raw
  - bronze.sales_raw
  - bronze.countries_raw

Design Notes:
  - Data is ingested in raw form: values are loaded unmodified and JSON datatypes are retained.
  - Column names are cleaned so that they are valid Spark column names.
  - Load-level metadata columns are added:
    - _ingest_timestamp: time the record was loaded into the Bronze layer.
    - _ingest_file: source file name or path.
    - _source_system: originating system identifier.
  - The following helper functions are defined:
    - load_variable_json: parses well-formed or malformed JSON so it can be loaded into a DataFrame.
    - is_json_line: tests whether a line starts a JSON object or array; empty lines are rejected.
    - clean_json_column_names: converts JSON column names to a format Spark can process.
    - load_bronze_table: loads a source JSON file into its Bronze Delta table.
  - Tables are written in overwrite mode, so each run rebuilds them; this makes reruns safe and avoids duplicate rows.
    

## Imports and Context Setting

### Imports

In [0]:

from pyspark.sql.functions import current_timestamp, lit
import re
import json
import pandas as pd



### Context Setting



In [0]:
pd.set_option("display.max_rows", 10)
pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)

SOURCE_SYSTEM = "sales test homework"
RAW_PATH = '../data/'

spark.sql("USE CATALOG sales")
spark.sql("USE SCHEMA bronze")

## Helper Functions






### load_variable_json

In [0]:
def load_variable_json(raw_data_path:str, verbose:bool = True) -> pd.DataFrame:
    '''Process variably formed JSON files'''
    with open(raw_data_path, 'r') as f:
        raw = f.read().strip()

    # Try properly formed JSON
    try:
        parsed = json.loads(raw)
        # Wrap a single object or scalar in a list so the DataFrame gets a list of records
        data = parsed if isinstance(parsed, list) else [parsed]
        if verbose:
            print(f"[load_variable_json] parsed as properly formed JSON from: {raw_data_path}")
        return pd.DataFrame(data)
    except json.JSONDecodeError:
        pass

    # Try JSON lines format
    lines = [ln.strip() for ln in raw.splitlines() if is_json_line(ln)]
    json_line_objects = []
    json_line_objects_ok = True
    for line in lines:
        try:
            obj = json.loads(line)
            json_line_objects.append(obj)
        except json.JSONDecodeError:
            json_line_objects_ok = False
            break
    if json_line_objects_ok and json_line_objects:
        if verbose:
            print(f"[load_variable_json] parsed as JSONL (one JSON object per line) from: {raw_data_path}")
        return pd.DataFrame(json_line_objects)
    
    # Improperly formed JSON (end of line commas)
    cleaned = raw

    # Strip trailing commas at end of file
    while cleaned.endswith(","):
        cleaned = cleaned[:-1].rstrip()

    cleaned = "[" + cleaned + "]"
    data = json.loads(cleaned)

    if verbose:
        print(f"[load_variable_json] Parsed as 'concatenated JSON objects with commas' from: {raw_data_path}")

    return pd.DataFrame(data)
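
To make the three fallbacks concrete, the sketch below runs one small sample per input shape through the same sequence of attempts, using only the standard library. The sample strings and the name `parse_shape` are illustrative assumptions and are not taken from the source files.

```python
import json

# Hypothetical samples, one per input shape the loader accepts.
well_formed = '[{"id": 1}, {"id": 2}]'
json_lines = '{"id": 1}\n{"id": 2}\n'
comma_separated = '{"id": 1},\n{"id": 2},\n'

def parse_shape(raw: str) -> tuple:
    """Return (shape_name, records), trying the shapes in the same order as load_variable_json."""
    raw = raw.strip()

    # 1. The whole text is one valid JSON document.
    try:
        parsed = json.loads(raw)
        return "json", parsed if isinstance(parsed, list) else [parsed]
    except json.JSONDecodeError:
        pass

    # 2. One JSON value per line (JSON Lines).
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip().startswith(("{", "["))]
    try:
        records = [json.loads(ln) for ln in lines]
        if records:
            return "jsonl", records
    except json.JSONDecodeError:
        pass

    # 3. Objects separated by commas: drop trailing commas and wrap in [].
    return "comma_separated", json.loads("[" + raw.rstrip(", \n\t") + "]")

for sample in (well_formed, json_lines, comma_separated):
    shape, records = parse_shape(sample)
    print(shape, records)
```

All three samples yield the same two records; only the detected shape differs. If the third attempt also fails, `json.loads` raises `json.JSONDecodeError`, which is the behavior of the notebook function as well.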

### is_json_line

In [0]:
def is_json_line(line: str) -> bool:
    '''Return True if the line is non-empty and starts like a JSON object or array'''
    line = line.strip()
    if not line:
        return False
    return line.startswith("{") or line.startswith("[")
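
A short illustration of what the filter keeps and drops may help here. The sample lines are made up for the example, and the function is repeated so the snippet runs on its own.

```python
def is_json_line(line: str) -> bool:
    """Copy of the notebook helper, repeated so this snippet is self-contained."""
    line = line.strip()
    if not line:
        return False
    return line.startswith("{") or line.startswith("[")

# Hypothetical lines: valid JSON, blanks, a comment, and a broken object.
sample_lines = ['{"id": 1}', "", "   ", "[1, 2]", "# comment", "{broken"]

kept = [ln for ln in sample_lines if is_json_line(ln)]
print(kept)
```

Note that this is a prefix check, not a validation: `"{broken"` passes the filter and only fails later, when `json.loads` is called on it inside `load_variable_json`.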

### clean_json_column_names

In [0]:
def clean_json_column_names(col:str) -> str:
    '''Convert a JSON column name into a valid Spark column name'''
    col = re.sub(r"[^\w]+", "_", col)   # replace each run of non-word characters with _
    col = re.sub(r"_+", "_", col)       # collapse repeated _ into a single _
    col = col.strip("_")                # remove leading and trailing _
    return col
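
As an illustration, the snippet below applies the cleaning rules to a few made-up column names; none of them come from the source files. The function is repeated so the snippet runs on its own. It also shows one thing worth watching for: distinct raw names can collapse to the same cleaned name.

```python
import re
from collections import Counter

def clean_json_column_names(col: str) -> str:
    """Copy of the notebook helper, repeated so this snippet is self-contained."""
    col = re.sub(r"[^\w]+", "_", col)
    col = re.sub(r"_+", "_", col)
    return col.strip("_")

# Hypothetical raw column names.
raw_columns = ["Customer Id", "order.date", " unit-price ($) ", "total__amount", "first name", "first-name"]

cleaned = [clean_json_column_names(c) for c in raw_columns]
print(cleaned)

# Names that now appear more than once would clash in a DataFrame.
duplicates = [name for name, count in Counter(cleaned).items() if count > 1]
print(duplicates)
```

A check like the `duplicates` list could be run before creating the Spark DataFrame if the source files are not known to have unambiguous names; the notebook itself does not perform such a check.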

### load_bronze_table

In [0]:
def load_bronze_table(filename: str, table_name: str):
    """
    Load a JSON file from ../data into a Bronze Delta table.

    - Reads JSON with load_variable_json (Pandas)
    - Converts to Spark DataFrame
    - Adds standard Bronze metadata
    - Writes as Delta table using 'table_name' in the current schema
    """
    full_path = RAW_PATH + filename
    pdf = load_variable_json(full_path)
    # Clean column names
    pdf.columns = [clean_json_column_names(c) for c in pdf.columns]

    sdf = spark.createDataFrame(pdf)

    # Add bronze metadata
    bronze_df = (
        sdf
        .withColumn("_ingest_timestamp", current_timestamp())
        .withColumn("_ingest_file", lit(filename))
        .withColumn("_source_system", lit(SOURCE_SYSTEM))
    )

    # Write Delta table
    (
    bronze_df
        .write
        .format("delta")
        .mode("overwrite")
        .saveAsTable(table_name)
    )



## Load Bronze Delta Tables 


### Customers

#### Load bronze.customers_raw

In [0]:
load_bronze_table("customers.json", "customers_raw")

#### Check bronze.customers_raw load

In [0]:
display(spark.table("customers_raw").limit(5))

### Products

#### Load bronze.products_raw

In [0]:
load_bronze_table("products.json", "products_raw")

#### Check bronze.products_raw load

In [0]:
display(spark.table("products_raw").limit(5))

### Countries

#### Load bronze.countries_raw

In [0]:
load_bronze_table("countries.json", "countries_raw")

#### Check bronze.countries_raw load

In [0]:
display(spark.table("countries_raw").limit(5))

### Sales

#### Load bronze.sales_raw

In [0]:
load_bronze_table("sales.json", "sales_raw")


#### Check bronze.sales_raw load

In [0]:
display(spark.table("sales_raw").limit(5))


### Orders

#### Load bronze.orders_raw

In [0]:
load_bronze_table("orders.json", "orders_raw")

#### Check bronze.orders_raw load

In [0]:
display(spark.table("orders_raw").limit(5))
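
The five load cells above follow the same pattern, so they could also be driven from a single mapping. The sketch below is one possible arrangement, not part of the notebook: `BRONZE_SOURCES` and `load_all` are hypothetical names, and the loader is passed in as a parameter so the snippet can run without a Spark session.

```python
from typing import Callable, Dict, List

# Source file -> Bronze table, as listed at the top of this notebook.
BRONZE_SOURCES: Dict[str, str] = {
    "customers.json": "customers_raw",
    "products.json": "products_raw",
    "countries.json": "countries_raw",
    "sales.json": "sales_raw",
    "orders.json": "orders_raw",
}

def load_all(loader: Callable[[str, str], None],
             sources: Dict[str, str] = BRONZE_SOURCES) -> List[str]:
    """Call loader(filename, table_name) for every entry and return the table names loaded."""
    loaded = []
    for filename, table_name in sources.items():
        loader(filename, table_name)
        loaded.append(table_name)
    return loaded

# Stand-in loader that only records its arguments (assumption: used here instead of Spark).
calls = []
tables = load_all(lambda filename, table_name: calls.append((filename, table_name)))
print(tables)
```

In the notebook, the call would be `load_all(load_bronze_table)`. Keeping separate cells per table, as done above, has the advantage that each load can be rerun and inspected on its own.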