# RetailX Lakehouse Project
## End-to-End Data Engineering Assignment (Bronze -> Silver -> Gold)

## Project Overview
This project demonstrates the design and implementation of a modern Lakehouse architecture using Databricks, Delta Lake, and Unity Catalog, following industry best practices for data ingestion, transformation, and analytics.

The goal is to:
- Ingest raw data reliably (Bronze)
- Clean and standardize it for analytics (Silver)
- Produce business-ready metrics (Gold)

This repository was built as part of a Data Engineer technical assessment, with strong emphasis on:
- Scalability
- Data quality
- Governance
- Readability



## Architecture Overview
The project follows the Medallion Architecture:
```text
Source Files
   ↓
Bronze (Raw, Incremental Ingestion)
   ↓
Silver (Cleaned, Typed, Business-Ready)
   ↓
Gold (Aggregated KPIs & Business Metrics)
```
Each layer has a clear responsibility, ensuring separation of concerns and long-term maintainability.

## Repository Structure

```text
retailx-lakehouse/
│
├── databricks/
│   ├── notebooks/
│   └── screenshot/
│
├── docs/
│   ├── 01_unity_catalog_setup.md
│   ├── 02_orders_bronze_ingestion.md
│   ├── 03_customers_bronze_autoloader.md
│   ├── 04_silver_customers.md
│   ├── 04_silver_orders.md
│   ├── 05_gold_metrics.md
│   ├── dlt_pipeline.md
│   ├── lakebridge_analyser.md
│   ├── lakebridge_transpiler_output.md
│   └── lakebridge_final_documentation.md
│
├── Oracle Legacy Script/
│   ├── oracle_schema.sql
│   └── sample_plsql.sql
│
└── README.md
```

## Unity Catalog documentation:
[Unity Catalog Setup Notebook](/databricks/notebooks/01_unity_catalog_setup.sql.ipynb)

[Unity Catalog Setup Doc](/docs/01_unity_catalog_setup.md)

# Bronze Layer (Raw Ingestion)
## Objective
Ingest raw customer and order data incrementally and reliably into Databricks using Auto Loader, while preserving the raw structure and ensuring auditability.

### Key Concepts Implemented
- Databricks Auto Loader (`cloudFiles` source)
- Incremental file ingestion
- Explicit file ingestion
- Exactly-once processing
- Unity Catalog-managed tables

### Bronze Customers
- Source: CSV files landing in DBFS
- Ingestion Mode: Streaming (Auto Loader)
- Target Table:
```sql
migration_project_db_ws.bronze.customers
```
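
A minimal sketch of how a stream like this could be wired up with Auto Loader; it is not the notebook's actual code. The source path, schema location, checkpoint location, and the `_ingested_at` / `_source_file` audit columns are assumptions chosen for illustration. Only the target table name comes from this README.

```python
# Sketch only: paths and audit column names are illustrative assumptions.
TARGET_TABLE = "migration_project_db_ws.bronze.customers"


def autoloader_options(schema_location: str) -> dict:
    """Build Auto Loader reader options for CSV files."""
    return {
        "cloudFiles.format": "csv",
        # Auto Loader persists the inferred schema here between runs.
        "cloudFiles.schemaLocation": schema_location,
        "header": "true",
    }


def ingest_bronze_customers(spark, source_path, schema_location, checkpoint_location):
    """Incrementally load new CSV files into the Bronze customers table."""
    raw = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(schema_location))
        .load(source_path)
        # Hypothetical audit columns: load time and originating file.
        .selectExpr(
            "*",
            "current_timestamp() AS _ingested_at",
            "_metadata.file_path AS _source_file",
        )
    )
    return (
        raw.writeStream
        # The checkpoint records which files were already processed.
        .option("checkpointLocation", checkpoint_location)
        .trigger(availableNow=True)  # process the current backlog, then stop
        .toTable(TARGET_TABLE)
    )
```

On a Databricks cluster, `ingest_bronze_customers(spark, ...)` would return a streaming query handle. The checkpoint is what lets a rerun skip files that were already loaded.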


---

### Bronze Orders
- Same ingestion pattern as Customers
- Raw transactional data preserved
- No business logic applied

### Documentation:
[Bronze Order ingestion Notebook](/databricks/notebooks/01_bronze_orders_ingestion.py)

[Bronze Customer ingestion Notebook](/databricks/notebooks/03_bronze_customers_autoloader.py)

[Bronze Order doc](/docs/02_orders_bronze_ingestion.md)

[Bronze Customer doc](/docs/03_customers_bronze_autoloader.md)

## Why Bronze Matters
- Immutable raw data
- Supports replay and debugging
- Prevents early data loss
- Aligns with enterprise ingestion patterns


# Silver Layer (Clean & Standardized)
## Objective

Transform Bronze data into clean, typed and analytics-ready datasets, applying business rules while maintaining lineage.

### Key Concepts Implemented
- Batch-based transformations
- Data cleansing & validation
- Type enforcement (DATE, TIMESTAMP, DOUBLE)
- Column standardization
- Deterministic overwrite strategy


## Silver Customers
- Cleans customer master data
- Normalizes emails and country fields
- Converts date strings into proper DATE types
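
The three rules above can be illustrated row by row in plain Python. This is a sketch of the logic only; the notebook presumably expresses the same rules as Spark column expressions. The column names (`email`, `country`, `signup_date`) and the `%Y-%m-%d` date format are assumptions.

```python
from datetime import date, datetime
from typing import Optional


def normalize_email(raw: Optional[str]) -> Optional[str]:
    """Trim and lowercase an email; empty values become None."""
    if raw is None:
        return None
    cleaned = raw.strip().lower()
    return cleaned or None


def normalize_country(raw: Optional[str]) -> Optional[str]:
    """Trim and uppercase a country value; empty values become None."""
    if raw is None:
        return None
    cleaned = raw.strip().upper()
    return cleaned or None


def parse_date(raw: Optional[str], fmt: str = "%Y-%m-%d") -> Optional[date]:
    """Convert a date string to a date; unparseable values become None."""
    if raw is None:
        return None
    try:
        return datetime.strptime(raw.strip(), fmt).date()
    except ValueError:
        return None


def clean_customer(row: dict) -> dict:
    """Apply the three rules to one raw customer record (assumed column names)."""
    return {
        **row,
        "email": normalize_email(row.get("email")),
        "country": normalize_country(row.get("country")),
        "signup_date": parse_date(row.get("signup_date")),
    }
```

Returning `None` for unparseable values, rather than raising, mirrors how a failed Spark cast yields NULL and leaves the decision about bad rows to a later validation step.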

## Documentation:
[Order Silver Notebook](/databricks/notebooks/04_orders_transformation.py)

[Customer Silver Notebook](/databricks/notebooks/04_customers_transformation.py)

[Order Silver doc](/docs/04_silver_orders.md)

[Customer Silver doc](/docs/04_silver_customers.md)

---

## Silver Orders
- Parses ISO timestamps
- Casts monetary values
- Prepares data for analytical joins
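
As a plain-Python illustration of the parsing and casting steps (not the notebook's code), the sketch below assumes hypothetical `order_ts` and `amount` columns that arrive as strings from Bronze.

```python
from datetime import datetime
from typing import Optional


def parse_iso_timestamp(raw: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp string; return None if it is invalid."""
    if not raw:
        return None
    try:
        # Older Python versions reject a trailing "Z", so map it to +00:00.
        return datetime.fromisoformat(raw.strip().replace("Z", "+00:00"))
    except ValueError:
        return None


def cast_amount(raw: Optional[str]) -> Optional[float]:
    """Cast a monetary string to float (DOUBLE in the table); None if invalid."""
    if raw is None:
        return None
    try:
        return float(raw.strip())
    except ValueError:
        return None


def clean_order(row: dict) -> dict:
    """Type one raw order record (assumed column names)."""
    return {
        **row,
        "order_ts": parse_iso_timestamp(row.get("order_ts")),
        "amount": cast_amount(row.get("amount")),
    }
```

With both columns typed, orders can be joined and aggregated downstream without per-query casts.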

## Why Silver Matters
- Single source of truth
- Removes ingestion noise
- Enforces business semantics
- Enables trusted analytics



# Gold Layer (Business Metrics)
## Objective
Expose business-level KPIs derived from Silver data, optimized for reporting and decision-making.


### Key Concepts Implemented
- Business-driven aggregation
- Fact-to-dimension joins
- KPI-focused table design
- Read-optimized datasets

# Gold Metrics Implemented
## 1. Daily Revenue
### Business Question:
> How much revenue do we generate per day?

Table:
```sql
migration_project_db_ws.gold.daily_sales
```
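
The aggregation behind a table like this can be sketched in plain Python as a group-by on the order date. The `order_ts` and `amount` field names and the sample orders are assumptions, and the real table is presumably built with a Spark aggregation rather than this loop.

```python
from collections import defaultdict
from datetime import datetime


def daily_revenue(orders: list) -> dict:
    """Sum order amounts per calendar day."""
    totals = defaultdict(float)
    for order in orders:
        # Truncate the timestamp to its date to form the grouping key.
        totals[order["order_ts"].date()] += order["amount"]
    return dict(totals)


# Illustrative input only; these orders are made up.
example = daily_revenue([
    {"order_ts": datetime(2024, 1, 15, 9, 0), "amount": 10.0},
    {"order_ts": datetime(2024, 1, 15, 18, 30), "amount": 5.5},
    {"order_ts": datetime(2024, 1, 16, 8, 0), "amount": 20.0},
])
```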
---

## 2. Revenue per Customer
### Business Question:
> Which customers generate the most revenue?

Table:
```sql
migration_project_db_ws.gold.revenue_per_customer
```

---

## 3. Daily Revenue by Country (Enriched)
### Business Question:
> How does daily revenue vary by country?

Table:
```sql
migration_project_db_ws.gold.daily_sales_country
```
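
The enrichment is a fact-to-dimension join followed by a two-key aggregation. The plain-Python sketch below assumes hypothetical `customer_id`, `country`, `order_ts`, and `amount` fields, and inner-join semantics (orders without a matching customer are skipped); the actual notebook may handle unmatched orders differently.

```python
from collections import defaultdict
from datetime import datetime


def daily_revenue_by_country(orders: list, customers: list) -> dict:
    """Join orders to customers, then sum amounts per (day, country)."""
    # Dimension lookup: customer_id -> country.
    country_of = {c["customer_id"]: c["country"] for c in customers}
    totals = defaultdict(float)
    for order in orders:
        country = country_of.get(order["customer_id"])
        if country is None:
            continue  # assumed inner join: drop orders with no known customer
        totals[(order["order_ts"].date(), country)] += order["amount"]
    return dict(totals)


# Illustrative input only; these records are made up.
example_by_country = daily_revenue_by_country(
    orders=[
        {"customer_id": "c1", "order_ts": datetime(2024, 1, 15, 9, 0), "amount": 10.0},
        {"customer_id": "c2", "order_ts": datetime(2024, 1, 15, 11, 0), "amount": 4.0},
        {"customer_id": "c1", "order_ts": datetime(2024, 1, 15, 20, 0), "amount": 2.5},
        {"customer_id": "unknown", "order_ts": datetime(2024, 1, 15, 21, 0), "amount": 99.0},
    ],
    customers=[
        {"customer_id": "c1", "country": "BR"},
        {"customer_id": "c2", "country": "US"},
    ],
)
```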
### Documentation:
[Gold Notebook](/databricks/notebooks/05_gold_aggregations.py)

[Gold doc](/docs/05_gold_metrics.md)


## Why Gold Matters
- Directly consumed by BI tools
- Business-friendly schema
- No technical columns
- KPI-focused design


# Delta Live Tables (DLT) Pipeline
## Objective

Demonstrate managed, declarative pipelines using Delta Live Tables, including:

- Built-in data quality
- Automated dependency management
- End-to-end lineage tracking

---

## Why DLT?

DLT simplifies pipeline development by allowing engineers to:

- Declare transformations
- Enforce expectations
- Automatically manage execution order

---

## DLT Pipeline Design

The DLT pipeline recreates the Bronze → Silver flow using DLT decorators.
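```python
import dlt  # available inside a Delta Live Tables pipeline


@dlt.table(
    name="silver_customers",
    comment="Cleaned customer data"
)
def silver_customers():
    # Read the upstream Bronze dataset defined in the same pipeline,
    # keeping only rows that carry a customer identifier.
    return (
        dlt.read("bronze_customers")
        .filter("customer_id IS NOT NULL")
    )
```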

## Data Quality Expectations
```python
# Stacked on a @dlt.table function; failing rows are kept and counted in pipeline metrics.
@dlt.expect("valid_customer_id", "customer_id IS NOT NULL")
```
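
DLT offers three expectation actions: `expect` (keep failing rows and record them), `expect_or_drop` (remove failing rows), and `expect_or_fail` (stop the update). The plain-Python model below is only an illustration of those semantics on a list of dicts. The `apply_expectation` helper is hypothetical and is not part of the `dlt` API.

```python
def apply_expectation(rows, predicate, action="warn"):
    """Model the three DLT expectation actions on a list of dict rows.

    Returns a tuple of (kept_rows, violation_count).
    """
    violations = [r for r in rows if not predicate(r)]
    if action == "warn":  # like @dlt.expect: keep every row, record the count
        return list(rows), len(violations)
    if action == "drop":  # like @dlt.expect_or_drop: remove failing rows
        return [r for r in rows if predicate(r)], len(violations)
    if action == "fail":  # like @dlt.expect_or_fail: abort on any failure
        if violations:
            raise ValueError(f"{len(violations)} row(s) violated the expectation")
        return list(rows), 0
    raise ValueError(f"unknown action: {action}")


def has_customer_id(row):
    """Plain-Python equivalent of the 'customer_id IS NOT NULL' constraint."""
    return row.get("customer_id") is not None
```

Which action fits depends on the layer: warning suits exploratory checks, dropping suits Silver cleansing, and failing suits constraints that downstream tables cannot tolerate being broken.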

---

### Documentation:
[DLT Notebook](/databricks/notebooks/06_dlt_gold_pipeline.py)

[DLT Doc](/docs/dlt_pipeline.md)


## Benefits Demonstrated

- Automatic retries
- Built-in observability
- Schema evolution support
- Enterprise-grade reliability

# Oracle to Databricks Migration (Lakebridge)
## Objective

Simulate an enterprise database migration from Oracle to the Databricks Lakehouse using Databricks Lakebridge.

---

### Why Lakebridge?

Lakebridge accelerates large-scale RDBMS migrations by:

- Extracting Oracle schemas
- Migrating data into Delta format
- Preserving table relationships
- Reducing manual migration effort

#### Migration
```text
Oracle Tables
   ↓
Lakebridge Extraction
   ↓
Bronze Delta Tables
   ↓
Silver / Gold Processing
```

#### Migration Strategy (Conceptual)

- Oracle tables treated as authoritative sources
- Initial full load into Bronze
- Downstream logic remains unchanged


### Documentation:
[Lakebridge Analyser Simulated](/docs/lakebridge_analyser.md)

[Lakebridge Transpiler Output Simulated](/docs/lakebridge_transpiler_output.md)

[Mapping Final Document](/docs/lakebridge_final_documentation.md)

## Governance & Best Practices
- Unity Catalog-managed tables
- Clear schema ownership
- Layer-specific responsibilities
- Production-aligned design decisions

## Validation Strategy
Each layer includes:
- Row count checks
- Null validations
- Type consistency checks
- Incremental ingestion verification
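
A minimal sketch of what such checks might look like, written in plain Python over rows represented as dicts. The function names, parameters, and failure messages are hypothetical; the notebooks presumably run equivalent checks against the Delta tables with Spark.

```python
def validate_layer(rows, required_columns, expected_types, min_rows=1):
    """Return a list of human-readable validation failures (empty if clean)."""
    failures = []
    # Row count check.
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} is below minimum {min_rows}")
    # Null validation on required columns.
    for col in required_columns:
        nulls = sum(1 for r in rows if r.get(col) is None)
        if nulls:
            failures.append(f"{col}: {nulls} null value(s)")
    # Type consistency check on non-null values.
    for col, expected in expected_types.items():
        bad = sum(
            1 for r in rows
            if r.get(col) is not None and not isinstance(r[col], expected)
        )
        if bad:
            failures.append(f"{col}: {bad} value(s) not of type {expected.__name__}")
    return failures


def rerun_is_idempotent(count_after_first_run, count_after_rerun):
    """With no new source files, a second ingestion run should add no rows."""
    return count_after_first_run == count_after_rerun
```

Returning a list of failures instead of raising on the first one lets a single run report every problem in a layer at once.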