# 01 – Smart-Grid Data Overview

Initial exploratory notebook that summarizes feature distributions, correlations, and anomaly cues in the ingested smart-grid data.


## Goals
- Validate that ingestion outputs (`data/interim/*.parquet`) contain the expected feature count and timestamp density.
- Produce quick sanity plots (load vs generation, temperature vs demand) to guide dimensionality-reduction choices.
- Record notable quality issues (missing spans, sensor drift) for the data-prep backlog.
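
The first goal can be sketched as a small check that compares a table against an expected column list and sampling interval. This is only an illustration under assumptions: the helper `summarize_ingestion`, the column names `timestamp`, `load_kw`, and `temperature_c`, and the 15-minute interval are hypothetical and do not come from the project manifest.

```python
import pandas as pd


def summarize_ingestion(df: pd.DataFrame, expected_columns: list[str],
                        ts_col: str = "timestamp", freq: str = "15min") -> dict:
    """Compare a table against an expected schema and sampling interval."""
    ts = pd.to_datetime(df[ts_col])
    # Rows a gap-free series would hold between the first and last timestamp.
    expected_rows = len(pd.date_range(ts.min(), ts.max(), freq=freq))
    return {
        "missing_columns": sorted(set(expected_columns) - set(df.columns)),
        "unexpected_columns": sorted(set(df.columns) - set(expected_columns)),
        # Share of expected timestamps that are present (1.0 means no gaps).
        "timestamp_density": ts.nunique() / expected_rows,
    }


# Synthetic example: one hour at 15-minute resolution with the 00:30 sample missing.
demo = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:15",
                                 "2024-01-01 00:45", "2024-01-01 01:00"]),
    "load_kw": [10.0, 11.0, 12.0, 13.0],
})
report = summarize_ingestion(
    demo, expected_columns=["timestamp", "load_kw", "temperature_c"]
)
print(report)
```

A density well below 1.0 or a non-empty `missing_columns` list would be a candidate entry for the data-prep backlog.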


In [None]:
from __future__ import annotations

import sys
from pathlib import Path

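# Check the working directory and up to two parent levels for a README.md that marks the repo root.
candidates = [Path.cwd(), Path.cwd().parent, Path.cwd().parent.parent]
for candidate in candidates:
    if (candidate / "README.md").exists():
        PROJECT_ROOT = candidate.resolve()
        break
else:
    # for/else: this branch runs only when the loop finished without hitting `break`.
    raise RuntimeError("Unable to locate project root. Run the notebook from within the repo.")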

SRC_PATH = PROJECT_ROOT / "src"
if SRC_PATH.exists() and str(SRC_PATH) not in sys.path:
    sys.path.insert(0, str(SRC_PATH))

DATA_DIR = PROJECT_ROOT / "data"
CONFIG_DIR = PROJECT_ROOT / "configs"
print(f"Project root: {PROJECT_ROOT}")
print(f"Data dir: {DATA_DIR}")

In [None]:
from smart_grid_fault_detection.data_prep import load_manifest

manifest_path = CONFIG_DIR / "data_manifest.yaml"
if not manifest_path.exists():
    manifest_path = CONFIG_DIR / "data_manifest.example.yaml"
    print("Using example manifest – update configs/data_manifest.yaml for your environment.")

manifest = load_manifest(manifest_path)
manifest

In [None]:
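import pandas as pd
from IPython.display import display  # explicit import; Jupyter also provides display as a builtin

# Coerce to Path so .exists() works whether the manifest stores a str or a Path.
interim_path = Path(manifest.output.interim_table)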
if not interim_path.exists():
    print(f"Interim table {interim_path} not found. Run the ingestion step first.")
else:
    df = pd.read_parquet(interim_path)
    display(df.head())
    display(df.describe(include="all"))
    print(f"Rows: {len(df):,} | Columns: {len(df.columns)}")

## Next Analyses
- Plot demand/generation/environmental correlations (pairplot or heatmap).
- Inspect rolling z-score baselines per feature to spot anomalies visually.
- Flag missing intervals or duplicate timestamps for cleaning scripts.
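
As a starting point for the last two items, the following sketch computes a trailing rolling z-score and lists duplicated or missing timestamps. It makes several assumptions: the helpers `rolling_zscore` and `timestamp_issues` are hypothetical, and the window length, the 15-minute grid, and the threshold of 3 are placeholders rather than project settings. It runs on synthetic data only.

```python
import pandas as pd


def rolling_zscore(series: pd.Series, window: int = 24) -> pd.Series:
    """Z-score of each point against the trailing window that precedes it."""
    # shift(1) keeps the current point out of its own baseline.
    baseline = series.shift(1).rolling(window, min_periods=window)
    return (series - baseline.mean()) / baseline.std()


def timestamp_issues(ts: pd.Series, freq: str = "15min") -> dict:
    """List duplicated timestamps and timestamps missing from a regular grid."""
    ts = pd.to_datetime(ts)
    grid = pd.date_range(ts.min(), ts.max(), freq=freq)
    return {
        "duplicates": ts[ts.duplicated()].drop_duplicates().tolist(),
        "missing": list(grid.difference(pd.DatetimeIndex(ts))),
    }


# Synthetic load series with a single spike at the end.
load = pd.Series([10, 11, 10, 11, 10, 11, 10, 11, 30], dtype=float)
z = rolling_zscore(load, window=4)
flags = z.abs() > 3  # NaN z-scores (warm-up period) compare as False

# Synthetic timestamps: 00:15 appears twice and 00:30 is absent.
stamps = pd.Series(pd.to_datetime([
    "2024-01-01 00:00", "2024-01-01 00:15",
    "2024-01-01 00:15", "2024-01-01 00:45",
]))
issues = timestamp_issues(stamps)

print(flags.tolist())
print(issues)
```

Excluding the current point from its baseline keeps a large spike from inflating the window's standard deviation and masking itself. A constant window yields a zero standard deviation, so real data may need a small floor on the denominator.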
