
AtlasStack

AtlasStack is a modular ingestion and validation stack for UK energy and infrastructure datasets.

It converts unstable external APIs into deterministic, schema-controlled, test-validated inputs for analytics, forecasting, and ML systems.



Why AtlasStack Exists

Public energy datasets (NESO, ESO, weather APIs, interconnector feeds) are predominantly:

  • poorly versioned
  • weakly typed
  • prone to silent schema drift
  • inconsistent in cadence
  • rarely testable

AtlasStack treats ingestion as engineering.

It enforces structure, typing, cadence, and validation before data is allowed into analytics or ML layers.

If the foundation is unreliable, every forecast built on top of it is suspect.


System Architecture

Core Data Lineage

```mermaid
flowchart LR
  EXT[External APIs] --> BR[Bronze<br/>Raw JSONL<br/>dt partitions]
  BR --> SL[Silver<br/>Typed Parquet<br/>schema enforced]
  SL --> STG[dbt staging models]
  STG --> DIM[dim_date]
  STG --> FCT[fct_observations]
  FCT --> MART[mart_daily_summary]
```

Bronze

  • Raw API responses
  • Append-only
  • Partitioned by dt=YYYY-MM-DD
  • Never mutated / no transformation logic
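The append-only, dt-partitioned bronze layout can be sketched as a small writer. This is a minimal illustration, not the repository's actual implementation; the `write_bronze` name, the `records.jsonl` filename, and the per-source subdirectory are assumptions.

```python
import json
from datetime import date
from pathlib import Path

def write_bronze(records: list[dict], source: str, dt: date,
                 root: Path = Path("data/bronze")) -> Path:
    """Append raw API records to a dt-partitioned JSONL file.

    Files are only ever appended to, never rewritten, so the bronze
    layer stays immutable; no transformation is applied here.
    """
    partition = root / source / f"dt={dt.isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / "records.jsonl"
    with path.open("a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path
```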

Silver

  • Strict typing
  • Normalized timestamps
  • Explicit schema enforcement
  • Ready for dbt consumption
  • Deterministic partitioning
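Schema enforcement at the silver boundary can be sketched with pandas. The column names (`settlement_ts`, `demand_mw`) and the `to_silver` helper are illustrative assumptions; the real contract lives in the repository.

```python
import pandas as pd

# Assumed column names, for illustration only.
REQUIRED = ("settlement_ts", "demand_mw")

def to_silver(raw: pd.DataFrame) -> pd.DataFrame:
    """Enforce the silver contract: required columns, strict types,
    UTC timestamps, deterministic ordering."""
    missing = set(REQUIRED) - set(raw.columns)
    if missing:
        # Fail loudly on schema drift instead of passing bad data downstream.
        raise ValueError(f"schema drift: missing columns {sorted(missing)}")
    out = raw.loc[:, list(REQUIRED)].copy()
    out["settlement_ts"] = pd.to_datetime(out["settlement_ts"], utc=True)
    out["demand_mw"] = out["demand_mw"].astype("float64")
    return out.sort_values("settlement_ts").reset_index(drop=True)
```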

Warehouse (Current: DuckDB)

  • Local analytical engine
  • dbt transformations
  • Fact and dimension modelling
  • Explicit data tests

Architecture Diagram

```mermaid
flowchart TB

  %% Sources
  subgraph Sources
    NESO[NESO Demand API]
    WEATHER[Open-Meteo Weather API]
  end

  %% Orchestration
  subgraph Orchestration
    PREFECT[Prefect Flow]
  end

  %% Storage
  subgraph Storage
    BRONZE[Bronze JSONL<br/>data/bronze]
    SILVER[Silver Parquet<br/>data/silver]
  end

  %% Transform
  subgraph Transform
    DUCK[(DuckDB Warehouse)]
    DBT[dbt Models]
    MARTS[Marts]
  end

  %% Quality
  subgraph Validation
    PYTEST[Unit Tests]
    DBTTEST[dbt Data Tests]
    CI[GitHub Actions]
  end

  %% Future
  subgraph Optional_Cloud_Extension
    S3[S3 Object Storage]
    SNOW[Snowflake Warehouse]
  end

  NESO --> PREFECT
  WEATHER --> PREFECT

  PREFECT --> BRONZE
  BRONZE --> SILVER

  SILVER --> DUCK
  DUCK --> DBT
  DBT --> MARTS

  PYTEST --> CI
  DBTTEST --> CI

  BRONZE -. storage swap .-> S3
  SILVER -. storage swap .-> S3
  DUCK -. warehouse swap .-> SNOW
```

CLI Usage

Run the pipeline for the last N days:

```bash
atlasstack run --days 3
```

This will:

  • Extract NESO demand data to the bronze layer
  • Extract Open-Meteo weather data to the bronze layer
  • Build silver layers
  • Execute dbt build
  • Produce validated data marts

Check CLI options:

```bash
atlasstack --help
atlasstack run --help
```
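The `run --days N` surface above could be wired up with stdlib `argparse`. This is a hypothetical sketch of the interface only; the package's actual CLI implementation may differ.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the CLI surface shown above (names are illustrative)."""
    parser = argparse.ArgumentParser(prog="atlasstack",
                                     description="Run the AtlasStack pipeline")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run", help="Extract, build silver, and run dbt")
    run.add_argument("--days", type=int, default=1,
                     help="Number of trailing days to ingest")
    return parser

args = build_parser().parse_args(["run", "--days", "3"])
```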

Design Principles

AtlasStack is governed by the following engineering constraints:

  1. Determinism over convenience: The same date range always produces identical outputs.

  2. Immutability: Bronze data is append-only and never mutated. Corrections happen in downstream layers.

  3. Explicit schema contracts: All external data is normalised and typed before consumption. Cadence and null thresholds are enforced.

  4. Loud failure: CI fails on schema drift, cadence breaks, or coverage degradation.

  5. Layered testing: Unit tests cover extractors, data tests cover marts, and CI validates the full stack.

  6. Infrastructure focus: Analytics are secondary to foundational reliability.
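The determinism constraint is checkable: hash every file two runs produce for the same date range and compare digests. The `tree_digest` helper below is an illustrative sketch, not part of the project's test suite.

```python
import hashlib
from pathlib import Path

def tree_digest(root: Path) -> str:
    """Digest every file under `root` in sorted order, so two pipeline
    runs over the same date range can be compared byte-for-byte."""
    h = hashlib.sha256()
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Include the relative path so renames change the digest too.
            h.update(path.relative_to(root).as_posix().encode())
            h.update(path.read_bytes())
    return h.hexdigest()
```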


Development Workflow

Install Locally:

```bash
pip install -e ".[dev]"
```

Run Lint and Tests

```bash
ruff check .
pytest
```

Run the CI bootstrap locally:

```bash
python scripts/ci_bootstrap.py
```

Run dbt Manually:

```bash
cd dbt/atlasstack_dbt
dbt build --no-partial-parse --profiles-dir .
```

What a Successful Run Looks Like

A successful pipeline run produces:

  • Partitioned bronze JSONL files
  • Partitioned silver Parquet files
  • Passing dbt tests
  • A valid fct_observations table with:
    • Half-hour cadence
    • Enforced weather coverage thresholds
    • Unique settlement timestamps

If dbt and pytest are green, the ingestion layer is behaving as expected for the tested range.
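The half-hour cadence and uniqueness checks on `fct_observations` can be expressed as a small validator. This sketch mirrors the dbt data tests in plain Python; the function name and error messages are illustrative.

```python
from datetime import datetime, timedelta

def check_half_hour_cadence(timestamps: list[datetime]) -> None:
    """Fail loudly unless timestamps are unique, half-hour-spaced
    settlement observations."""
    if len(set(timestamps)) != len(timestamps):
        raise ValueError("duplicate settlement timestamps")
    ordered = sorted(timestamps)
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier != timedelta(minutes=30):
            raise ValueError(f"cadence break between {earlier} and {later}")
```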


Storage Backends

AtlasStack abstracts its storage layer so backends can be swapped without touching pipeline logic.

Default (Local)

  • Bronze: data/bronze/
  • Silver: data/silver/
  • Warehouse: DuckDB

Optional (Cloud-ready)

  • Bronze/Silver → S3
  • Warehouse → Snowflake

The cloud infrastructure is scaffolded but not required to run the project.

The entire stack runs locally without cloud billing dependencies.
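One way to read the "storage swap" arrows in the diagram is as a small interface: local disk and S3 are interchangeable implementations of the same read/write contract. The `BlobStore` protocol and `LocalStore` class below are a hypothetical sketch, not the project's actual abstraction.

```python
from pathlib import Path
from typing import Protocol

class BlobStore(Protocol):
    """Minimal storage contract; an S3-backed class would implement
    the same two methods to enable the storage swap."""
    def write(self, key: str, data: bytes) -> None: ...
    def read(self, key: str) -> bytes: ...

class LocalStore:
    """Default backend: keys map to files under a local root."""
    def __init__(self, root: Path) -> None:
        self.root = root

    def write(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def read(self, key: str) -> bytes:
        return (self.root / key).read_bytes()
```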


Roadmap

Short-Term

  • Prefect deployment to managed orchestration (schedules and retries)
  • Run metadata: structured run reports, row counts, and freshness markers

Mid-Term

  • Move bronze/silver to S3 (partitioned object storage)
  • Switch warehouse target from DuckDB to Snowflake (raw, staging, and marts)
  • CI runs dbt against a temporary Snowflake schema (PR validation)

Long-Term

  • Minimal Terraform: S3 bucket and Snowflake roles/permissions
  • Dataset contracts and schema drift alerts (contract breaks fail CI)
