---

## Bronze Layer – Companies House ETL Pipeline

### Objective

The purpose of the Bronze layer in this pipeline is to **ingest raw Companies House API data and store it in a structured, queryable Delta format** while preserving the original source fidelity. This layer focuses on **data capture, traceability, and scalability**, not business transformations.

---

### Data Source

* Source System: **UK Companies House API**
* Data Type: **JSON (multiline, nested)**
* Datasets Ingested:

  * Company Overview
  * Officers
  * Filing History

Each company’s data arrives as nested JSON files organised in a folder hierarchy based on **year / month / day / company_number**.

---

### Key Design Decisions

#### 1. Raw to Delta Conversion

Raw JSON files stored in Databricks Volumes are converted into **Delta Tables**.
Delta format provides:

* ACID transactions
* Schema enforcement
* Time travel
* Performance optimisation

---

#### 2. Explicit Schema Definition

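Instead of relying on schema inference, **StructType schemas** were defined manually for each dataset.
This avoids:

* `_corrupt_record` columns appearing in an inferred schema
* Schema drift between runs (fields outside the declared schema are ignored rather than added)
* Unexpected data type mismatches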

---

#### 3. Multiline JSON Handling

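The Companies House API returns nested JSON in which a single document spans multiple lines.
By default Spark expects one JSON object per line, so the reader was configured using:

```
.option("multiline", "true")
```

so that each file is parsed as one complete JSON document.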
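
To see why the option matters, the plain-Python sketch below (no Spark involved, and the sample payload is invented) parses the same pretty-printed document first line by line and then as a whole.

```python
import json

# Hypothetical pretty-printed payload, similar in shape to one saved API response.
raw = """{
  "company_number": "01234567",
  "company_name": "EXAMPLE LIMITED",
  "registered_office_address": {
    "locality": "London"
  }
}"""


def parse_line_by_line(text):
    """Mimic a line-delimited reader: treat every line as its own JSON record."""
    parsed, failed = [], 0
    for line in text.splitlines():
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            failed += 1
    return parsed, failed


def parse_whole_document(text):
    """Mimic multiline mode: the whole file is one JSON document."""
    return json.loads(text)


records, failures = parse_line_by_line(raw)
document = parse_whole_document(raw)

print(len(records), failures)
print(document["registered_office_address"]["locality"])
```

No single line is valid JSON on its own, so the line-by-line pass recovers nothing, while the whole-document pass returns the nested structure intact.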

---

#### 4. Metadata Extraction from File Path

Because `input_file_name()` is not supported under Unity Catalog on the compute used here, the `_metadata.file_path` column was used instead to extract:

* `company_number`

This ties every record back to the company folder it was read from.
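
The extraction pattern can be checked outside Spark. The sketch below applies the same regular expression the notebook passes to `regexp_extract`, using Python's `re` module; the example paths are hypothetical and only follow the year / month / day / company_number layout described earlier.

```python
import re

# Same pattern the pipeline passes to regexp_extract: the last folder name
# (digits and upper-case letters) immediately before the file name.
COMPANY_NUMBER_PATTERN = re.compile(r"/([0-9A-Z]+)/[^/]+$")


def company_number_from_path(file_path: str) -> str:
    """Return the company number folder, or "" when the path does not match.

    Spark's regexp_extract also yields an empty string when nothing matches.
    """
    match = COMPANY_NUMBER_PATTERN.search(file_path)
    return match.group(1) if match else ""


# Hypothetical paths for illustration.
paths = [
    "/Volumes/example/raw/2024/05/17/01234567/overview.json",
    "/Volumes/example/raw/2024/05/17/SC123456/officers.json",
    "/Volumes/example/raw/2024/05/17/sc123456/officers.json",  # lower-case folder
    "/Volumes/example/raw/2024/05/17/notes.txt",  # file directly under the day folder
]
for path in paths:
    print(repr(company_number_from_path(path)))
```

Two caveats show up in the last two paths: a lower-case folder name yields an empty string, and a file sitting directly under the day folder yields the day (`17`) as if it were a company number. The read pattern in the notebook fixes the folder depth, which avoids the second case in practice.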

---

#### 5. Flattening Nested Structures

Some datasets contained arrays (e.g., officers, filing items).
These were flattened using:

```
explode()
```

This makes the tables query-friendly while still remaining close to the source structure.
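
As a rough illustration of what this step does to a record, the plain-Python sketch below mimics `explode()` followed by selecting the fields of each array element. The sample records and the `items` array name are assumptions, not the real payload. Like Spark's `explode()`, it produces no rows for a record whose array is empty or missing.

```python
def explode_records(records, array_field):
    """Yield one flat row per element of `array_field`, keeping company_number."""
    for record in records:
        # A missing, null, or empty array contributes no rows, as with explode().
        for item in record.get(array_field) or []:
            yield {"company_number": record["company_number"], **item}


# Hypothetical input: one company with two officers, one with none.
records = [
    {
        "company_number": "01234567",
        "items": [
            {"name": "DOE, Jane", "officer_role": "director"},
            {"name": "ROE, Richard", "officer_role": "secretary"},
        ],
    },
    {"company_number": "SC123456", "items": []},
]

rows = list(explode_records(records, "items"))
for row in rows:
    print(row)
```

If companies with empty arrays must still appear in the table, Spark's `explode_outer()` keeps them with null values instead of dropping them.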

---

#### 6. Audit Column

A technical audit column was added:

* `last_updated_ts` → captures table write time

This enables:

* Refresh tracking
* Incremental logic in future layers
* Operational monitoring

---

### Metadata-Driven Architecture

To improve scalability and avoid hardcoding:

* A **config.json** file stores catalog, schema, base path, and dataset metadata.
* The notebook reads the config file path from a **Databricks widget** (`config_path`) and loads the file at run time.
* This allows easy environment switching (Dev / Test / Prod) by pointing the widget at a different config file, without code changes.
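
The sketch below shows one possible shape for such a config file, plus a small loader that fails fast on obvious mistakes. Only the key names (`catalog`, `schema`, `base_path`, `tables`, `name`, `file`, `explode`, `explode_column`) come from the notebook code; every value, including the file names and the `items` column, is a placeholder, and `load_config` is a hypothetical helper.

```python
import json

# Hypothetical config content; only the key names come from the notebook code.
EXAMPLE_CONFIG = """
{
  "catalog": "companies-data",
  "schema": "bronze",
  "base_path": "/Volumes/example/raw/companies_house",
  "tables": [
    {"name": "overview", "file": "overview.json"},
    {"name": "officers", "file": "officers.json",
     "explode": true, "explode_column": "items"},
    {"name": "filing_history", "file": "filing_history.json",
     "explode": true, "explode_column": "items"}
  ]
}
"""

REQUIRED_TOP_LEVEL = ("catalog", "schema", "base_path", "tables")


def load_config(text):
    """Parse the config and fail fast on missing keys or inconsistent explode settings."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in config]
    if missing:
        raise KeyError(f"Config is missing keys: {missing}")
    for table in config["tables"]:
        if table.get("explode", False) and not table.get("explode_column"):
            raise ValueError(
                f"Table {table['name']!r} sets explode without explode_column"
            )
    return config


config = load_config(EXAMPLE_CONFIG)
for table in config["tables"]:
    print(table["name"], table.get("explode", False))
```

Keeping one such file per environment means a new dataset or a new target catalog is a config change rather than a notebook change.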

---

### Output Tables (Unity Catalog)

Catalog: `companies-data`
Schema: `bronze`

Tables Created:

* `overview`
* `officers`
* `filing_history`

Each table:

* Is stored in Delta format
* Uses `company_number` as the primary identifier
* Includes `last_updated_ts`
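
Because the catalog name contains a hyphen, it has to be wrapped in backticks wherever the table is referenced, which is what the notebook does in its `saveAsTable` call. A small helper along these lines (hypothetical, not part of the notebook) can build the fully qualified name consistently.

```python
def quote_identifier(name: str) -> str:
    """Wrap an identifier in backticks, doubling any backtick it contains."""
    return "`" + name.replace("`", "``") + "`"


def full_table_name(catalog: str, schema: str, table: str) -> str:
    """Build a three-level name with every part quoted."""
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))


print(full_table_name("companies-data", "bronze", "filing_history"))
```

Quoting all three parts is slightly more than the notebook does, but it keeps the helper safe if a schema or table name ever needs quoting too.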

---

### What the Bronze Layer Does NOT Do

To maintain proper medallion architecture separation, the Bronze layer intentionally avoids:

* Business transformations
* Deduplication
* Joins
* Aggregations
* Data cleansing

These responsibilities are deferred to the **Silver layer**.

---

### Result

The Bronze layer provides a **reliable, scalable, and audit-ready raw data foundation** that preserves source integrity while enabling downstream analytical processing.


```python
dbutils.widgets.text(
    name="config_path",
    defaultValue="/Workspace/Users/ud3041@gmail.com/end-to-end-ETL-pipeline/medallion/bronze/config_company_house.json",
    label="Config File Path"
)
```


```python
import json

from pyspark.sql.functions import (
    col,
    regexp_extract,
    explode,
    current_timestamp
)

# =========================
# IMPORT SHARED UTILITIES
# =========================
from utils.logger import get_logger
from utils.sparksession import create_spark_session
from utils.schema import SCHEMA_MAP

# =========================
# INITIALISE LOGGER & SPARK
# =========================
logger = get_logger("ds2b_company_house_bronze")
spark = create_spark_session("DS2B | Company House Bronze")

logger.info("Spark session created successfully")

# =========================
# LOAD CONFIG
# =========================
config_path = dbutils.widgets.get("config_path")
logger.info(f"Loading config from: {config_path}")

with open(config_path, "r") as f:
    config = json.load(f)

CATALOG = config["catalog"]
SCHEMA = config["schema"]
BASE_PATH = config["base_path"]

logger.info(
    f"Config loaded | Catalog={CATALOG}, Schema={SCHEMA}, BasePath={BASE_PATH}"
)

# =========================
# PROCESS TABLES
# =========================
for table in config["tables"]:

    table_name = table["name"]
    file_name = table["file"]
    explode_flag = table.get("explode", False)
    explode_column = table.get("explode_column")

    logger.info(f"Starting processing for table: {table_name}")

    # The four wildcards match the year / month / day / company_number folders.
    df = (
        spark.read
        .schema(SCHEMA_MAP[table_name])
        .option("multiline", "true")
        .json(f"{BASE_PATH}/*/*/*/*/{file_name}")
        .withColumn("file_path", col("_metadata.file_path"))
        # company_number is the folder directly above the file.
        .withColumn(
            "company_number",
            regexp_extract("file_path", r'/([0-9A-Z]+)/[^/]+$', 1)
        )
    )

    logger.info(f"Read completed for {table_name}")

    if explode_flag:
        # Fail early if the config asks for an explode without naming the column.
        if not explode_column:
            raise ValueError(
                f"Table {table_name} sets 'explode' but has no 'explode_column'"
            )
        logger.info(
            f"Exploding column '{explode_column}' for table {table_name}"
        )
        # One row per array element; only company_number and the element's
        # fields are kept (file_path is not carried over).
        df = (
            df.withColumn("exploded", explode(explode_column))
              .select("company_number", "exploded.*")
        )

    df = df.withColumn("last_updated_ts", current_timestamp())

    logger.info(f"Writing Delta table: {CATALOG}.{SCHEMA}.{table_name}")

    # Full refresh: replace the table contents and schema on every run.
    # The catalog is backtick-quoted because its name may contain a hyphen.
    (
        df.write
        .format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .saveAsTable(f"`{CATALOG}`.{SCHEMA}.{table_name}")
    )

    logger.info(f"Completed table: {table_name}")

logger.info("Metadata-Driven Bronze Pipeline Completed Successfully")
```