# Simulated Data Pipeline Playbook

This notebook demonstrates how to simulate end-to-end pipelines that move data from an on-premises environment into a data lake and onward to Snowflake.
We will focus on showing the **techniques** that practitioners would use, including how to reconcile schema differences before merging data into a final analytics database.

## 0. Imports & Configuration

We rely on pandas to stand in for the compute engines (Spark, Snowflake) and use Python dictionaries to emulate configuration artifacts.

In [None]:
import pandas as pd
from pathlib import Path

BASE_PATH = Path('simulated_storage')
BASE_PATH.mkdir(exist_ok=True)
print('Working directory prepared:', BASE_PATH)

## 1. Scenario Setup

To keep a consistent business story, we follow a fictional retail company consolidating online orders. Data moves through the following systems:

* **AVS (On-Prem / VM)**: Legacy ERP extracts that use snake_case headers.
* **Data Lake Landing**: Semi-structured JSON exports with their own header convention. Real exports often carry nested attributes; the mock below is flat and uses PascalCase-style headers such as `CustomerCode`.
* **Partner FTP**: A partner spreadsheet feed whose upper-case headers contain spaces and symbols.
* **Snowflake**: Target warehouse expecting business-friendly Pascal Case headers.

In real life we would connect to each platform with vendor-specific connectors. Here, we simulate those reads with local files while emphasizing best practices such as configuration-driven ingestion and schema validation.

### 1.1 Sample Source Files

We create three mock datasets with intentionally mismatched headers to illustrate transformation needs:

In [None]:
avs_orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': ['C001', 'C002', 'C003'],
    'order_total': [250.0, 190.5, 330.25],
    'order_ts': ['2024-03-01 10:15:00', '2024-03-01 12:30:00', '2024-03-02 09:45:00']
})

raw_lake_orders = pd.DataFrame({
    'OrderID': ['A-900', 'A-901'],
    'CustomerCode': ['C002', 'C004'],
    'GrossAmount': [210.0, 480.0],
    'UpdatedAt': ['2024-03-02T14:00:00Z', '2024-03-02T16:20:00Z']
})

partner_feed = pd.DataFrame({
    'ORDER NUMBER': [5001, 5002],
    'CUSTOMER': ['C001', 'C005'],
    'TOTAL $': [125.5, 275.75],
    'MODIFIED': ['2024/03/03 08:30:00', '2024/03/03 09:10:00']
})

avs_orders, raw_lake_orders, partner_feed

## 2. Ingestion Techniques

### 2.1 Configuration-Driven Connectors

Real pipelines rely on configuration maps to determine connection parameters and mapping logic. Below we define metadata that describes each source, its storage layer, and how headers should be normalized.

In [None]:
source_config = {
    'avs_orders': {
        'layer': 'AVS',
        'read_format': 'csv',
        'primary_key': 'order_id',
        'expected_headers': ['order_id', 'customer_id', 'order_total', 'order_ts'],
        'rename_map': {
            'order_id': 'OrderID',
            'customer_id': 'CustomerID',
            'order_total': 'Amount',
            'order_ts': 'UpdatedAt'
        }
    },
    'raw_lake_orders': {
        'layer': 'DataLake',
        'read_format': 'json',
        'primary_key': 'OrderID',
        'expected_headers': ['OrderID', 'CustomerCode', 'GrossAmount', 'UpdatedAt'],
        'rename_map': {
            'OrderID': 'OrderID',
            'CustomerCode': 'CustomerID',
            'GrossAmount': 'Amount',
            'UpdatedAt': 'UpdatedAt'
        }
    },
    'partner_feed': {
        'layer': 'PartnerFTP',
        'read_format': 'xlsx',
        'primary_key': 'ORDER NUMBER',
        'expected_headers': ['ORDER NUMBER', 'CUSTOMER', 'TOTAL $', 'MODIFIED'],
        'rename_map': {
            'ORDER NUMBER': 'OrderID',
            'CUSTOMER': 'CustomerID',
            'TOTAL $': 'Amount',
            'MODIFIED': 'UpdatedAt'
        }
    }
}
source_config
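
As a rough illustration of how a `read_format` field could drive ingestion, the sketch below dispatches to a pandas reader by format and round-trips a small frame through local files. The `READERS` table, the `read_source` helper, and the file names are assumptions made for this example and are not part of the notebook. The `xlsx` format is left out because `pd.read_excel` needs an optional engine to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical dispatch table: maps a config's read_format to a pandas reader.
READERS = {
    'csv': pd.read_csv,
    'json': lambda path: pd.read_json(path, orient='records'),
}

def read_source(path: Path, read_format: str) -> pd.DataFrame:
    """Read a file with the reader registered for its configured format."""
    try:
        reader = READERS[read_format]
    except KeyError:
        raise ValueError(f'No reader registered for format: {read_format}') from None
    return reader(path)

# Round-trip a small frame through local files to simulate two source systems.
sample = pd.DataFrame({'order_id': [101, 102], 'order_total': [250.0, 190.5]})
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'avs_orders.csv'
    json_path = Path(tmp) / 'raw_lake_orders.json'
    sample.to_csv(csv_path, index=False)
    sample.to_json(json_path, orient='records')
    from_csv = read_source(csv_path, 'csv')
    from_json = read_source(json_path, 'json')
```

Keeping the reader choice in a lookup table means a new format only needs a new entry, not a new branch in the ingestion code.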

### 2.2 Schema Validation Utility

A lightweight validation function checks whether the incoming headers match expectations. In production, this step would raise alerts or move files to quarantine when mismatches occur.

In [None]:
def validate_headers(df: pd.DataFrame, config: dict, *, source: str) -> None:
    actual = list(df.columns)
    expected = config[source]['expected_headers']
    missing = set(expected) - set(actual)
    extra = set(actual) - set(expected)
    if missing or extra:
        print(f"[WARN] {source}: header mismatch detected")
        if missing:
            print('  Missing columns:', ', '.join(sorted(missing)))
        if extra:
            print('  Unexpected columns:', ', '.join(sorted(extra)))
    else:
        print(f"[OK] {source}: headers aligned")

validate_headers(avs_orders, source_config, source='avs_orders')
validate_headers(raw_lake_orders, source_config, source='raw_lake_orders')
validate_headers(partner_feed, source_config, source='partner_feed')
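
The quarantine idea can be sketched with nothing more than the standard library. The example below reads only the header row of a CSV file and moves the file aside when the headers do not match. The helper names `header_diff` and `land_or_quarantine` and the `landing` and `quarantine` folder names are hypothetical, chosen only for this illustration.

```python
import shutil
import tempfile
from pathlib import Path

import pandas as pd

def header_diff(actual: list[str], expected: list[str]) -> tuple[set[str], set[str]]:
    """Return (missing, unexpected) column names."""
    return set(expected) - set(actual), set(actual) - set(expected)

def land_or_quarantine(path: Path, expected: list[str], quarantine_dir: Path) -> Path:
    """Leave a valid file in place; move a mismatched file to quarantine."""
    actual = list(pd.read_csv(path, nrows=0).columns)  # header row only
    missing, unexpected = header_diff(actual, expected)
    if not missing and not unexpected:
        return path
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    target = quarantine_dir / path.name
    shutil.move(str(path), str(target))
    return target

with tempfile.TemporaryDirectory() as tmp:
    landing = Path(tmp) / 'landing'
    landing.mkdir()
    good = landing / 'good.csv'
    bad = landing / 'bad.csv'
    pd.DataFrame({'order_id': [1], 'customer_id': ['C001']}).to_csv(good, index=False)
    pd.DataFrame({'order_id': [2], 'cust': ['C002']}).to_csv(bad, index=False)
    expected = ['order_id', 'customer_id']
    good_parent = land_or_quarantine(good, expected, Path(tmp) / 'quarantine').parent.name
    bad_parent = land_or_quarantine(bad, expected, Path(tmp) / 'quarantine').parent.name
    bad_still_in_landing = bad.exists()
```

Returning the final path lets the caller log where each file ended up, which is usually the first thing an on-call engineer needs to know.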

## 3. Data Lake Landing (Bronze/Silver)

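We mimic a bronze-to-silver process. Bronze keeps the data as-is. Silver standardizes header names and tags each record with its source layer (`IngestedFrom`); datatypes are harmonized in section 3.1. A production pipeline would typically also track ingestion metadata in bronze and attach a unified surrogate key in silver, which the cell below does not do.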

In [None]:
def normalize_headers(df: pd.DataFrame, rename_map: dict) -> pd.DataFrame:
    return df.rename(columns=rename_map)

bronze_tables = {
    'avs_orders_bronze': avs_orders.copy(),
    'raw_lake_orders_bronze': raw_lake_orders.copy(),
    'partner_feed_bronze': partner_feed.copy()
}

silver_tables = {}
for source, bronze_df in bronze_tables.items():
    cfg_key = source.replace('_bronze', '')
    renamed = normalize_headers(bronze_df, source_config[cfg_key]['rename_map'])
    renamed['IngestedFrom'] = source_config[cfg_key]['layer']
    silver_tables[source.replace('_bronze', '_silver')] = renamed

silver_tables
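
The silver layer is a natural place to attach a surrogate key that stays unique even when two sources reuse the same order number. The sketch below is one possible approach, assuming a hash of the source layer plus the source's own ID is acceptable as a key. The `surrogate_key` helper, the `OrderKey` column, and the colliding sample IDs are all hypothetical.

```python
import hashlib

import pandas as pd

def surrogate_key(source_layer: str, order_id: object) -> str:
    """Build a deterministic key from the source layer and the source's own ID."""
    raw = f'{source_layer}|{order_id}'.encode('utf-8')
    return hashlib.sha256(raw).hexdigest()[:16]

# Two sources deliberately share the order number '101' to show the collision case.
silver = pd.DataFrame({
    'OrderID': ['101', '101', 'A-900'],
    'IngestedFrom': ['AVS', 'PartnerFTP', 'DataLake'],
})
silver['OrderKey'] = [
    surrogate_key(layer, order_id)
    for layer, order_id in zip(silver['IngestedFrom'], silver['OrderID'])
]
```

Because the key is derived from the data instead of a counter, re-running the load produces the same key for the same record.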

### 3.1 Type Harmonization

Notice that timestamps and numeric fields follow different patterns. We centralize casting rules to avoid repeated transformation logic.

In [None]:
dtype_rules = {
    'OrderID': 'string',
    'CustomerID': 'string',
    'Amount': 'float',
    'UpdatedAt': 'datetime64[ns]'
}

for table_name, df in silver_tables.items():
    typed_df = df.astype({k: v for k, v in dtype_rules.items() if k in df.columns}, errors='ignore')
    typed_df['UpdatedAt'] = pd.to_datetime(typed_df['UpdatedAt'], errors='coerce')
    silver_tables[table_name] = typed_df

silver_tables['avs_orders_silver'].dtypes
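
The three sources write timestamps in three different shapes, and only the lake values carry a UTC designator. The sketch below parses one sample per source with `utc=True` so that every result shares one timezone-aware dtype. It assumes the naive AVS and partner values are already in UTC; if they are local times, they would need `tz_localize` with the real zone first.

```python
import pandas as pd

# One sample value per source format used in this notebook.
samples = {
    'avs': '2024-03-01 10:15:00',      # naive, dash-separated
    'lake': '2024-03-02T14:00:00Z',    # ISO 8601 with UTC designator
    'partner': '2024/03/03 08:30:00',  # naive, slash-separated
}

# Parse each source separately so every call sees a single consistent format.
# utc=True converts tz-aware input to UTC and treats naive input as UTC (assumption).
parsed = {
    name: pd.to_datetime(pd.Series([value]), utc=True, errors='coerce')
    for name, value in samples.items()
}

combined_ts = pd.concat(list(parsed.values()), ignore_index=True)
print(combined_ts.dt.tz)
```

Once every source shares the same timezone-aware dtype, the values can be sorted and compared directly.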

## 4. Snowflake Loading & Transformation

In production we would use Snowflake stages and `COPY INTO` commands. Here, we consolidate the silver tables into a single **gold** table while applying business rules:

* Deduplicate on `OrderID` preferring the most recent `UpdatedAt`.
* Create human-readable headers expected by analytics teams.
* Split the output into two presentation formats: a wide executive report and a transactional view.

In [None]:
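combined = pd.concat(silver_tables.values(), ignore_index=True)
# Coerce every timestamp to tz-aware UTC before sorting: the 'Z'-suffixed lake values
# parse as tz-aware while the other sources are naive, and the two cannot be compared.
# utc=True converts aware values and treats naive ones as already being in UTC.
combined['UpdatedAt'] = pd.to_datetime(combined['UpdatedAt'], utc=True, errors='coerce')
combined = combined.sort_values('UpdatedAt').drop_duplicates('OrderID', keep='last')
combined['Amount'] = combined['Amount'].round(2)
combined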
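
The mock sources contain no repeated `OrderID`, so the deduplication rule has nothing to act on. The small example below uses a made-up duplicate to show the "most recent `UpdatedAt` wins" behavior of sorting and then keeping the last row per key.

```python
import pandas as pd

# Hypothetical data: order '101' arrives twice, with a later correction.
orders = pd.DataFrame({
    'OrderID': ['101', '101', '102'],
    'Amount': [250.0, 275.0, 190.5],
    'UpdatedAt': pd.to_datetime(
        ['2024-03-01 10:15:00', '2024-03-02 08:00:00', '2024-03-01 12:30:00'],
        utc=True,
    ),
})

latest = (
    orders.sort_values('UpdatedAt')                 # oldest first
          .drop_duplicates('OrderID', keep='last')  # newest row per order survives
          .sort_values('OrderID')
          .reset_index(drop=True)
)
print(latest[['OrderID', 'Amount']].to_dict('records'))
# [{'OrderID': '101', 'Amount': 275.0}, {'OrderID': '102', 'Amount': 190.5}]
```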

### 4.1 Final Warehouse Schema

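The target schema uses Pascal Case column names. Snowflake folds unquoted identifiers to upper case, so mixed-case names like these survive only if they are double-quoted when the table is created and queried. We also demonstrate how to produce variant header layouts.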

In [None]:
warehouse_view = combined.rename(columns={
    'OrderID': 'OrderId',
    'CustomerID': 'CustomerId',
    'Amount': 'NetAmount',
    'UpdatedAt': 'UpdatedAtUtc',
    'IngestedFrom': 'SourceSystem'
})
warehouse_view
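
The rename map above is written by hand. For wider tables, a small helper can derive Pascal Case names instead. The sketch below is one way to do that; `to_pascal_case` and `quote_identifier` are hypothetical helpers. It only handles headers whose words are separated by underscores, spaces, or symbols, so an input that is already camelCase would lose its inner capitals.

```python
import re

def to_pascal_case(header: str) -> str:
    """Convert snake_case or space-separated headers to PascalCase."""
    words = re.split(r'[^0-9A-Za-z]+', header)
    return ''.join(word[:1].upper() + word[1:].lower() for word in words if word)

def quote_identifier(name: str) -> str:
    """Double-quote an identifier so a case-sensitive name is preserved in SQL."""
    return '"' + name.replace('"', '""') + '"'

for raw in ['order_total', 'ORDER NUMBER', 'TOTAL $']:
    print(raw, '->', quote_identifier(to_pascal_case(raw)))
```

Derived names still deserve a review: `TOTAL $` collapses to `Total`, which may be less descriptive than a hand-picked name such as `NetAmount`.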

### 4.2 Executive Summary Layout

Business stakeholders sometimes request multi-level headers that distinguish metrics from metadata. A pandas `MultiIndex` on the columns lets us mimic the resulting presentation layer.

In [None]:
executive_summary = warehouse_view.copy()
executive_summary.columns = pd.MultiIndex.from_tuples([
    ('Identity', 'OrderId'),
    ('Identity', 'CustomerId'),
    ('Financials', 'NetAmount'),
    ('Operational', 'UpdatedAtUtc'),
    ('Operational', 'SourceSystem')
])
executive_summary
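
Two-level headers are convenient on screen but awkward for tools that expect a single header row. The sketch below, built on a one-row made-up report, shows how a header group can be selected and how the two levels can be flattened into dotted names. The dotted naming scheme is an assumption for this example.

```python
import pandas as pd

report = pd.DataFrame(
    [['101', 250.0, 'AVS']],
    columns=pd.MultiIndex.from_tuples([
        ('Identity', 'OrderId'),
        ('Financials', 'NetAmount'),
        ('Operational', 'SourceSystem'),
    ]),
)

# Select one header group; the result keeps only the second-level names.
financials = report['Financials']

# Flatten to single-level headers for tools that cannot read two header rows.
flat = report.copy()
flat.columns = [f'{group}.{name}' for group, name in report.columns]
print(list(flat.columns))
# ['Identity.OrderId', 'Financials.NetAmount', 'Operational.SourceSystem']
```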

## 5. Orchestration & Monitoring Patterns

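Below is a pseudo-code Airflow DAG highlighting how these transformations would be orchestrated in a production setting. It is held in a string and printed rather than executed, and the four `python_callable` targets are placeholders that this notebook does not define. Note that `schedule_interval` is the older parameter name; Airflow 2.4 and later prefer `schedule`.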

In [None]:
from textwrap import dedent

airflow_dag = dedent('''
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

with DAG(
    dag_id="retail_orders_pipeline",
    start_date=datetime(2024, 3, 1),
    schedule_interval="0 * * * *",
    catchup=False,
) as dag:

    extract_avs = PythonOperator(
        task_id="extract_avs",
        python_callable=extract_from_avs,
    )

    load_bronze = PythonOperator(
        task_id="load_bronze",
        python_callable=write_to_bronze,
    )

    transform_silver = PythonOperator(
        task_id="transform_silver",
        python_callable=standardize_headers,
    )

    publish_gold = PythonOperator(
        task_id="publish_gold",
        python_callable=publish_to_snowflake,
    )

    extract_avs >> load_bronze >> transform_silver >> publish_gold
''')
print(airflow_dag)
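
On the monitoring side, one common pattern is to reconcile what entered a stage against what left it. The sketch below is a minimal version of that idea in plain pandas. The `reconcile_stage` helper, the logger name, and the report fields are assumptions for illustration, not an Airflow feature.

```python
import logging

import pandas as pd

logger = logging.getLogger('pipeline_monitor')

def reconcile_stage(stage_in: pd.DataFrame, stage_out: pd.DataFrame, *, key: str) -> dict:
    """Compare row counts and distinct keys entering and leaving a stage."""
    keys_in = set(stage_in[key])
    keys_out = set(stage_out[key])
    report = {
        'rows_in': len(stage_in),
        'rows_out': len(stage_out),
        'dropped_keys': sorted(keys_in - keys_out),
        'new_keys': sorted(keys_out - keys_in),
    }
    if report['dropped_keys'] or report['new_keys']:
        logger.warning('Key mismatch between stages: %s', report)
    return report

# Deduplication may reduce the row count, but no key should disappear.
silver = pd.DataFrame({'OrderID': ['101', '101', '102'], 'Amount': [250.0, 275.0, 190.5]})
gold = silver.drop_duplicates('OrderID', keep='last')
report = reconcile_stage(silver, gold, key='OrderID')
print(report)
# {'rows_in': 3, 'rows_out': 2, 'dropped_keys': [], 'new_keys': []}
```

A task at the end of the DAG could run a check like this and fail the run when keys go missing, so the problem surfaces before analysts see the table.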

## 6. End-to-End Demonstration

Finally, we package the transformations into reusable functions to demonstrate how a single orchestration call could drive the pipeline.

> **Tip:** In a real deployment, each function would live in its own module with logging, error handling, retry logic, and parameterization for environment-specific values.
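
To make the retry and logging part of that tip concrete, here is a minimal sketch of a retry wrapper. `run_with_retry`, its parameters, and the `flaky_extract` stand-in are hypothetical; a scheduler such as Airflow also offers task-level retries, which would normally be preferred over hand-rolled loops.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar('T')
logger = logging.getLogger('pipeline')

def run_with_retry(step: Callable[[], T], *, attempts: int = 3, delay_seconds: float = 0.0) -> T:
    """Call step() until it succeeds or the attempts are exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception:
            logger.exception('Attempt %d of %d failed', attempt, attempts)
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            time.sleep(delay_seconds)
    raise ValueError('attempts must be at least 1')

# Stand-in for a source read that fails twice before succeeding.
calls = {'count': 0}

def flaky_extract() -> str:
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('simulated transient failure')
    return 'extracted'

result = run_with_retry(flaky_extract, attempts=3)
```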

In [None]:
def standardize_source(df: pd.DataFrame, cfg_key: str) -> pd.DataFrame:
    cfg = source_config[cfg_key]
    validate_headers(df, source_config, source=cfg_key)
    normalized = normalize_headers(df, cfg['rename_map'])
    normalized['IngestedFrom'] = cfg['layer']
    # Parse timestamps per source, where the format is uniform. Parsing after the
    # concat would infer one format and coerce the other sources' values to NaT.
    # utc=True treats naive values as UTC so every source shares one tz-aware dtype.
    normalized['UpdatedAt'] = pd.to_datetime(normalized['UpdatedAt'], utc=True, errors='coerce')
    return normalized.astype({
        'OrderID': 'string',
        'CustomerID': 'string',
        'Amount': 'float'
    }, errors='ignore')

def run_pipeline(sources: dict[str, pd.DataFrame]) -> tuple[pd.DataFrame, pd.DataFrame]:
    silver = [standardize_source(df, key) for key, df in sources.items()]
    combined_df = pd.concat(silver, ignore_index=True)
    combined_df = combined_df.sort_values('UpdatedAt').drop_duplicates('OrderID', keep='last')
    warehouse_df = combined_df.rename(columns={
        'OrderID': 'OrderId',
        'CustomerID': 'CustomerId',
        'Amount': 'NetAmount',
        'UpdatedAt': 'UpdatedAtUtc',
        'IngestedFrom': 'SourceSystem'
    })
    exec_summary = warehouse_df.copy()
    exec_summary.columns = pd.MultiIndex.from_tuples([
        ('Identity', 'OrderId'),
        ('Identity', 'CustomerId'),
        ('Financials', 'NetAmount'),
        ('Operational', 'UpdatedAtUtc'),
        ('Operational', 'SourceSystem')
    ])
    return warehouse_df, exec_summary

warehouse_df, exec_summary = run_pipeline({
    'avs_orders': avs_orders,
    'raw_lake_orders': raw_lake_orders,
    'partner_feed': partner_feed
})
warehouse_df, exec_summary

## 7. Key Takeaways

* **Schema reconciliation** is the heart of multi-source consolidation. The configuration maps give a repeatable way to translate headers, enforce types, and annotate provenance.
* **Layered storage (Bronze → Silver → Gold)** enables auditable transformations and simplifies debugging when data drifts.
* **Presentation models** can diverge dramatically from raw data structures; designing reusable formatting logic avoids one-off reporting hacks.

This notebook can serve as a sandbox for experimenting with additional sources, validation rules, and monitoring hooks before wiring up production pipelines.