# Bronze Layer – Data Ingestion Pipeline

**Purpose:**
This notebook orchestrates the Bronze layer ingestion for all UK and US economic datasets used in the Oil Market Analytics project.
It executes source-specific ingestion functions to extract and store raw data from APIs and files into Delta tables, preserving the original schema and structure for full traceability.

**Scope:**

- Bank of England datasets (Interest Rate, GDP, CPI)
- UK Unemployment data
- Energy price data (WTI, Brent, Natural Gas)
- Market index data (S&P 500, FTSE 100, Dollar Index)
- Federal Reserve datasets (Interest Rate, GDP, CPI, Unemployment)

**Process Overview:**

- Extract: Pulls data from public APIs and local Excel files.
- Load: Writes the raw datasets into Bronze Delta tables under the `oil_analytics` schema.
- Validate: Performs basic checks: table existence, ingestion timestamps, record counts, duplicate counts, and null or empty column names.

**Outputs:**
All raw data stored as Delta tables in the Bronze layer, e.g.:

- `oil_analytics.bronze_energy_prices`
- `oil_analytics.bronze_ftse100`
- `oil_analytics.bronze_uk_unemployment`
- `oil_analytics.bronze_fed_cpi`
- `oil_analytics.bronze_fed_interest_rate`

### Set up

In [0]:
%pip install -r ../requirements.txt

In [0]:
dbutils.library.restartPython()

Imports

In [0]:
from src.transforms.bronze_layer.bronze_energy_prices import generate_bronze_energy_price_table
from src.transforms.bronze_layer.bronze_index import generate_bronze_index_tables
from src.transforms.bronze_layer.bronze_macro import generate_bronze_macro_tables
from src.transforms.bronze_layer.bronze_macro_local import generate_bronze_macro_local_tables

In [0]:
from pyspark.sql.functions import max  # note: shadows Python's built-in max in this notebook

## Generate Bronze Tables

In [0]:
generate_bronze_energy_price_table(spark)
generate_bronze_index_tables(spark)
generate_bronze_macro_tables(spark)
generate_bronze_macro_local_tables(spark)
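
The four calls above run one after another, so an exception in one source stops the remaining ingestions. A minimal sketch of one way to isolate failures, assuming each ingestion function takes `spark` as its only argument (as in the cell above). `run_ingestion_steps` is a hypothetical helper and not part of the project's `src` package:

```python
from typing import Callable, Dict, Optional


def run_ingestion_steps(steps: Dict[str, Callable], spark) -> Dict[str, Optional[Exception]]:
    """Run each ingestion step; map its name to the raised exception, or None on success."""
    results: Dict[str, Optional[Exception]] = {}
    for name, step in steps.items():
        try:
            step(spark)
            results[name] = None
        except Exception as exc:  # broad on purpose: record the failure, then continue
            results[name] = exc
    return results


# Usage in this notebook (hypothetical):
# results = run_ingestion_steps(
#     {
#         "energy_prices": generate_bronze_energy_price_table,
#         "index": generate_bronze_index_tables,
#         "macro": generate_bronze_macro_tables,
#         "macro_local": generate_bronze_macro_local_tables,
#     },
#     spark,
# )
# failed = {name: exc for name, exc in results.items() if exc is not None}
```

Whether a failed source should block the downstream layers is a project decision; the sketch only makes the failures visible in one place.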

### Validation checks

Load tables for validation

In [0]:
energy_price_df = spark.table("oil_analytics.bronze_energy_prices")

sp500_df = spark.table("oil_analytics.bronze_sp500")
ftse100_df = spark.table("oil_analytics.bronze_ftse100")
dollar_index_df = spark.table("oil_analytics.bronze_dollar_index")

uk_unemployment_df = spark.table("oil_analytics.bronze_uk_unemployment")
uk_cpi_df = spark.table("oil_analytics.bronze_uk_cpi")
uk_gdp_df = spark.table("oil_analytics.bronze_uk_gdp")
uk_interest_rate_df = spark.table("oil_analytics.bronze_uk_interest_rate")
fed_unemployment_df = spark.table("oil_analytics.bronze_fed_unemployment")
fed_cpi_df = spark.table("oil_analytics.bronze_fed_cpi")
fed_gdp_df = spark.table("oil_analytics.bronze_fed_gdp")
fed_interest_rate_df = spark.table("oil_analytics.bronze_fed_interest_rate")


In [0]:
all_df = {
    "energy_price_df": energy_price_df, 
    "sp500_df": sp500_df, 
    "ftse100_df": ftse100_df, 
    "dollar_index_df": dollar_index_df, 
    "uk_unemployment_df": uk_unemployment_df, 
    "uk_cpi_df": uk_cpi_df, 
    "uk_gdp_df": uk_gdp_df, 
    "uk_interest_rate_df": uk_interest_rate_df, 
    "fed_unemployment_df": fed_unemployment_df, 
    "fed_cpi_df": fed_cpi_df, 
    "fed_gdp_df": fed_gdp_df, 
    "fed_interest_rate_df": fed_interest_rate_df
    }
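
The dictionary above could also be built from a list of table names, which doubles as the table-existence check. This is a sketch only: `load_bronze_tables` is a hypothetical helper, the `bronze_` prefix is inferred from the table names used in this notebook, and `spark.catalog.tableExists` is assumed to be available (PySpark 3.3 or later).

```python
from typing import Dict, Iterable, List, Tuple


def load_bronze_tables(
    spark, names: Iterable[str], schema: str = "oil_analytics"
) -> Tuple[Dict[str, object], List[str]]:
    """Return ({short_name: DataFrame}, [fully qualified names of missing tables])."""
    tables: Dict[str, object] = {}
    missing: List[str] = []
    for name in names:
        full_name = f"{schema}.bronze_{name}"
        # Assumption: Catalog.tableExists is available in this runtime.
        if spark.catalog.tableExists(full_name):
            tables[name] = spark.table(full_name)
        else:
            missing.append(full_name)
    return tables, missing


# Usage in this notebook (hypothetical):
# tables, missing = load_bronze_tables(spark, ["energy_prices", "sp500", "ftse100"])
# print("Missing tables:", missing)
```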


Most recent ingestion timestamp

In [0]:
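for k, v in all_df.items():
    # Compute the value first: nesting double quotes inside a double-quoted
    # f-string is a syntax error before Python 3.12.
    latest = v.select(max("ingestion_timestamp")).collect()[0][0]
    print(f"Latest ingestion timestamp for {k}: {latest}")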
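
As an alternative sketch, the same lookup can use the dictionary form of `DataFrame.agg`, which needs no import from `pyspark.sql.functions`. `latest_ingestion` is a hypothetical helper; it assumes every table has an `ingestion_timestamp` column, as the loop above does.

```python
from typing import Dict


def latest_ingestion(tables: Dict[str, object], column: str = "ingestion_timestamp") -> Dict[str, object]:
    """Map each table name to its most recent value of `column` (None for an empty table)."""
    latest = {}
    for name, df in tables.items():
        # agg({column: "max"}) yields a single row holding max(column);
        # for an empty table that single value is None.
        latest[name] = df.agg({column: "max"}).first()[0]
    return latest


# Usage in this notebook (hypothetical):
# for name, ts in latest_ingestion(all_df).items():
#     print(f"Latest ingestion timestamp for {name}: {ts}")
```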

Total row count for each bronze table

In [0]:
for k, v in all_df.items():
    print(f"Total row count for {k}: {v.count()}")

Check for duplicate rows

In [0]:
for k, v in all_df.items():
    print(f"Total duplicate rows for {k}: {v.count() - v.dropDuplicates().count()}")
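
Because `dropDuplicates()` compares every column, two copies of the same source record ingested at different times would not be counted if their `ingestion_timestamp` values differ. A hedged sketch that ignores metadata columns before comparing; `duplicate_row_count` is a hypothetical helper, and treating `ingestion_timestamp` as the only metadata column is an assumption about these tables:

```python
def duplicate_row_count(df, ignore_columns=("ingestion_timestamp",)) -> int:
    """Count rows that repeat another row once `ignore_columns` are removed."""
    # DataFrame.drop is a no-op for names that are not in the schema.
    data = df.drop(*ignore_columns) if ignore_columns else df
    return data.count() - data.dropDuplicates().count()


# Usage in this notebook (hypothetical):
# for name, df in all_df.items():
#     print(f"Duplicate source rows for {name}: {duplicate_row_count(df)}")
```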

Check for null/empty column names

In [0]:
for k,v in all_df.items():
    cols = v.columns
    null_cols = [c for c in cols if c is None or c.strip() == ""]
    print(f"Total NULL column names for {k}: {len(null_cols)}")
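
The checks above print their results, so a scheduled run succeeds even when a table is empty or malformed. A minimal sketch of turning two of them into a list of failures that can stop the run; `blank_column_names` and `collect_failures` are hypothetical helpers, and treating an empty table as a failure is an assumption:

```python
from typing import Dict, List, Optional


def blank_column_names(columns: List[Optional[str]]) -> List[int]:
    """Return the positions of column names that are None, empty, or whitespace."""
    return [i for i, c in enumerate(columns) if c is None or c.strip() == ""]


def collect_failures(row_counts: Dict[str, int], columns: Dict[str, List[str]]) -> List[str]:
    """Describe every empty table and every table with a blank column name."""
    failures = []
    for name, count in row_counts.items():
        if count == 0:
            failures.append(f"{name}: table is empty")
    for name, cols in columns.items():
        blanks = blank_column_names(cols)
        if blanks:
            failures.append(f"{name}: blank column name at position(s) {blanks}")
    return failures


# Usage in this notebook (hypothetical):
# failures = collect_failures(
#     {k: v.count() for k, v in all_df.items()},
#     {k: v.columns for k, v in all_df.items()},
# )
# assert not failures, "\n".join(failures)
```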