# 01 — Load and Validate

This notebook establishes trust boundaries.

It does not analyze system behavior.  
It does not describe distributions.  
It does not explain outcomes.

Its only purpose is to determine what parts of the raw dataset are admissible for downstream reasoning, and under what constraints.

---

## Question

Can raw installation records be reduced to a year-level temporal representation without introducing silent failure modes?

---

## Inputs

- Raw Tracking the Sun dataset  
- Repo 0 schemas  
- Repo 0 column roles  
- Repo 0 violation definitions  

No assumptions are introduced locally.

---

## Outputs

- `installation_year` (derived, canonical)  
- violation classifications related to temporal validity  
- diagnostics describing coverage and exclusions  

These outputs are structural.  
They are not interpretive.

---

## What This Notebook Does

- loads the raw dataset as-is  
- asserts presence of required raw fields  
- attempts to derive `installation_year` from raw date strings  
- classifies each record as `valid`, `unparseable`, or `outside_reporting_window`  
- records violations without correcting them  
- writes canonical outputs and diagnostics for downstream use  

Nothing is modified to improve appearance or convenience.

---

## What This Notebook Does Not Do

- no exploratory analysis  
- no visualization  
- no modeling  
- no interpretation of trends or behavior  
- no global filtering or “cleaning”  

Rows are not removed because they are inconvenient.  
They are tagged when they are inadmissible for specific uses.
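
A minimal sketch of tagging rather than dropping, on a toy frame. The column name `admissible_for_yearly`, the sample dates, and the explicit `format` argument are illustrative assumptions, not part of this notebook's outputs.

```python
import pandas as pd

# Toy records; the dates are made up for illustration.
records = pd.DataFrame(
    {"installation_date": ["1999-10-01", "not-a-date", "2017-11-06"]}
)

# Derive the year; values that fail to parse become NaN instead of raising.
year = pd.to_datetime(
    records["installation_date"], format="%Y-%m-%d", errors="coerce"
).dt.year

# Tag admissibility for one specific use (hypothetical column name).
tagged = records.assign(
    installation_year=year,
    admissible_for_yearly=year.notna(),
)

# Every input row is still present; only the tag differs.
assert len(tagged) == len(records)
print(tagged)
```

The inadmissible row stays in the frame, so a later use with different requirements can still see it.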

---

## Method

The procedure followed here is mechanical:

1. Load raw data  
2. Assert schema presence  
3. Attempt minimal coercion  
4. Classify results  
5. Persist outcomes  

Each step is observable.  
No transformation is silent.
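
One way to make each step observable is to wrap it in a helper that reports row counts and refuses to continue if rows disappear. The sketch below is an assumption about how that could look; `run_step` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import pandas as pd


def run_step(name, func, frame):
    """Apply one step, print row counts, and reject any change in row count."""
    before = len(frame)
    result = func(frame)
    after = len(result)
    print(f"{name}: rows {before} -> {after}")
    if after != before:
        raise RuntimeError(f"Step '{name}' changed the row count")
    return result


toy = pd.DataFrame({"installation_date": ["1999-10-01", None, "2017-11-06"]})

# A step that only adds a column passes the check.
flagged = run_step(
    "flag_missing",
    lambda f: f.assign(is_missing=f["installation_date"].isna()),
    toy,
)

# A step that drops rows is rejected instead of passing unnoticed.
try:
    run_step("drop_rows", lambda f: f.dropna(), toy)
    dropped_silently = True
except RuntimeError:
    dropped_silently = False
```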

---

## Failure Is Data

Violations encountered here are retained.

They indicate:
- limits of reporting
- historical gaps
- structural constraints

They are not treated as errors to be erased.

Downstream notebooks may choose to exclude subsets of data based on these diagnostics, but this notebook does not make that decision.

---

## Scope Boundary

This notebook answers one question only:

> Is a year-level temporal representation defensible across the dataset?

Any question beyond that belongs elsewhere.

---

## Status

If `installation_year` survives this process with acceptable coverage and known limitations, it earns canonical status.

If it does not, downstream work must adapt accordingly.


In [10]:
import os
import json
import pandas as pd
from pathlib import Path


In [11]:
# External data
DATA_PATH = Path(os.environ["TRACKING_THE_SUN_DATA"])

# Notebook location
NOTEBOOK_DIR = Path.cwd()

# Shared root
PROJECT_ROOT = NOTEBOOK_DIR.parents[1]

# Repo 0 (program / architecture)
ARCH_ROOT = PROJECT_ROOT / "00_tracking_the_sun_program"

RAW_SCHEMA_PATH = ARCH_ROOT / "schemas" / "raw" / "tracking_the_sun.raw.schema.json"
VIOLATION_PATH = ARCH_ROOT / "validation" / "violation-types.json"

# Outputs
OUTPUT_CANONICAL = Path("../outputs/canonical/installation_year.csv")
OUTPUT_DIAGNOSTICS = Path("../outputs/diagnostics/installation_date_violations.csv")


In [12]:
df = pd.read_parquet(DATA_PATH, engine="pyarrow")

print("Data loaded")
print(f"Rows: {df.shape[0]} | Columns: {df.shape[1]}")

df.head()


Data loaded
Rows: 1921220 | Columns: 80


Unnamed: 0,data_provider_1,data_provider_2,system_id_1,system_id_2,installation_date,pv_system_size_dc,total_installed_price,rebate_or_grant,customer_segment,expansion_system,...,output_capacity_inverter_3,dc_optimizer,inverter_loading_ratio,battery_manufacturer,battery_model,battery_rated_capacity_kw,battery_rated_capacity_kwh,battery_price,technology_type,extensions_multiphase_id
0,City of Palo Alto Utilities,-1,PVD1,-1,1999-10-01,5.364819,43408.0,29513.0,GOV,-1,...,-1.0,0.0,-1.0,-1,-1,-1.0,-1.0,-1.0,pv-only,-1
1,California Public Utilities Commission,-1,PGE-INT-114109373,-1,2017-11-06,7.02,21060.0,0.0,COM,0,...,-1.0,1.0,0.78,-1,-1,-1.0,-1.0,-1.0,pv-only,-1
2,City of Palo Alto Utilities,-1,PVP1,-1,1999-10-13,3.274033,30335.0,11384.79,RES,-1,...,-1.0,0.0,-1.0,-1,-1,-1.0,-1.0,-1.0,pv-only,-1
3,California Public Utilities Commission,-1,PGE-INT-114149823,-1,2017-11-06,5.7,29184.0,0.0,RES,0,...,-1.0,0.0,1.14,-1,-1,-1.0,-1.0,-1.0,pv-only,-1
4,City of Palo Alto Utilities,-1,PVD2,-1,1999-12-01,6.272357,58960.0,19800.0,RES,-1,...,-1.0,0.0,-1.0,-1,-1,-1.0,-1.0,-1.0,pv-only,-1


In [13]:
# Diagnostic probe: report which Repo 0 artifact fails to load, if any.
try:
    with open(RAW_SCHEMA_PATH) as f:
        raw_schema = json.load(f)
    print("RAW_SCHEMA loaded successfully")
except Exception as e:
    print("RAW_SCHEMA failed:", type(e), e)

try:
    with open(VIOLATION_PATH) as f:
        violation_definitions = json.load(f)
    print("VIOLATION schema loaded successfully")
except Exception as e:
    print("VIOLATION schema failed:", type(e), e)


RAW_SCHEMA loaded successfully
VIOLATION schema loaded successfully


In [14]:
with open(RAW_SCHEMA_PATH) as f:
    raw_schema = json.load(f)

with open(VIOLATION_PATH) as f:
    violation_definitions = json.load(f)

print("Repo 0 artifacts loaded")


Repo 0 artifacts loaded


In [15]:
required_columns = raw_schema.get("required", [])

missing = [col for col in required_columns if col not in df.columns]

if missing:
    raise ValueError(f"Missing required raw columns: {missing}")

print("All required raw columns present.")

All required raw columns present.
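
The check above covers column presence only. A dtype check could be sketched as follows, assuming (hypothetically) that the raw schema follows JSON Schema conventions, with a `properties` mapping from column name to a `type` keyword. The actual layout of `tracking_the_sun.raw.schema.json` is not shown in this notebook, so the schema fragment, the `CHECKS` mapping, and the `type_mismatches` helper are all illustrative.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical schema fragment; the real Repo 0 schema may be structured differently.
schema = {
    "required": ["installation_date", "pv_system_size_dc"],
    "properties": {
        "installation_date": {"type": "string"},
        "pv_system_size_dc": {"type": "number"},
    },
}

# Map JSON Schema type names to pandas dtype predicates (illustrative mapping).
CHECKS = {
    "string": lambda s: ptypes.is_object_dtype(s) or ptypes.is_string_dtype(s),
    "number": ptypes.is_numeric_dtype,
}


def type_mismatches(frame, schema):
    """Return {column: expected_type} for required columns whose dtype does not match."""
    mismatches = {}
    for col in schema.get("required", []):
        expected = schema.get("properties", {}).get(col, {}).get("type")
        check = CHECKS.get(expected)
        if check is not None and not check(frame[col]):
            mismatches[col] = expected
    return mismatches


toy = pd.DataFrame({
    "installation_date": ["1999-10-01", "2017-11-06"],
    "pv_system_size_dc": ["5.36", "7.02"],  # strings where numbers are expected
})
mismatches = type_mismatches(toy, schema)
print(mismatches)
```

Returning the mismatches instead of raising keeps the result available as a diagnostic; the caller decides whether a mismatch is fatal.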


In [16]:
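# Parse raw date values; missing or unparseable values become NaT,
# so the derived year is NaN for those rows (float dtype).
df["installation_year"] = (
    pd.to_datetime(df["installation_date"], errors="coerce")
      .dt.year
)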
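
With `errors="coerce"`, a missing date and a malformed date both end up as `NaT`. The sketch below separates the two by comparing the parsed result against the raw null mask. The sample values and the explicit `format` argument are assumptions for illustration; the cell above lets pandas infer the format.

```python
import pandas as pd

# Toy values: one good date, one missing, one sentinel-like, one impossible date.
raw = pd.Series(["1999-10-01", None, "-1", "2017-13-45"], name="installation_date")

parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

was_missing = raw.isna()                     # no value reported at all
failed_parse = parsed.isna() & ~was_missing  # value present but not a valid date

summary = pd.DataFrame({
    "raw": raw,
    "year": parsed.dt.year,
    "was_missing": was_missing,
    "failed_parse": failed_parse,
})
print(summary)
```

Keeping the two masks apart makes it possible to report how many rows were never dated and how many carried a value that could not be read.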

In [17]:
# Row-aligned diagnostics: raw value next to the derived year.
diagnostics = pd.DataFrame({
    "installation_date": df["installation_date"],
    "installation_year": df["installation_year"]
})

# Reporting window bounds, inclusive.
MIN_REPORTING_YEAR = 1998
MAX_REPORTING_YEAR = 2024

# Start from "valid" and overwrite only where a violation applies.
diagnostics["violation"] = "valid"

# No year could be derived (missing or malformed date).
diagnostics.loc[
    diagnostics["installation_year"].isna(),
    "violation"
] = "unparseable"

# Year derived but outside the window. NaN compares False on both sides,
# so unparseable rows keep their label.
year = diagnostics["installation_year"]
diagnostics.loc[
    (year < MIN_REPORTING_YEAR) | (year > MAX_REPORTING_YEAR),
    "violation"
] = "outside_reporting_window"

diagnostics["violation"].value_counts(dropna=False)


violation
valid                       1920983
unparseable                     231
outside_reporting_window          6
Name: count, dtype: int64
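
The counts above can be turned into coverage shares with plain arithmetic. The sketch below re-enters the reported counts by hand and checks them against the row count printed at load time. The `ALLOWED_LABELS` set is an illustrative assumption, not the contents of `violation-types.json`.

```python
import pandas as pd

# Counts copied from the output above.
counts = pd.Series(
    {"valid": 1_920_983, "unparseable": 231, "outside_reporting_window": 6},
    name="count",
)
TOTAL_ROWS = 1_921_220  # row count reported when the data was loaded

# Every row must land in exactly one class.
assert counts.sum() == TOTAL_ROWS

# Share of rows per class, as percentages.
shares = (counts / TOTAL_ROWS * 100).round(4)
print(shares)

# Hypothetical allow-list; any label outside it would indicate an unregistered class.
ALLOWED_LABELS = {"valid", "unparseable", "outside_reporting_window"}
unexpected = set(counts.index) - ALLOWED_LABELS
assert not unexpected
```

The three classes sum to the full row count, and 237 rows in total carry a violation label.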

In [18]:
OUTPUT_CANONICAL.parent.mkdir(parents=True, exist_ok=True)
OUTPUT_DIAGNOSTICS.parent.mkdir(parents=True, exist_ok=True)

df[["installation_year"]].to_csv(OUTPUT_CANONICAL, index=False)
diagnostics.to_csv(OUTPUT_DIAGNOSTICS, index=False)

print("Canonical candidate and diagnostics written.")

Canonical candidate and diagnostics written.
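
A read-back check is one way to confirm that what was written matches what was in memory. The sketch uses a temporary directory and toy values. The nullable `Int64` cast is an assumption added here so that years round-trip as integers rather than as floats such as `1999.0`; the cell above writes the column as it is.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy column with one missing year, as produced by coercion.
years = pd.DataFrame({"installation_year": [1999.0, float("nan"), 2017.0]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "canonical" / "installation_year.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # Cast to nullable integers so missing years stay missing and valid years stay whole.
    to_write = years.astype({"installation_year": "Int64"})
    to_write.to_csv(out_path, index=False)

    read_back = pd.read_csv(out_path, dtype={"installation_year": "Int64"})

# Row count and null count must survive the round trip.
assert len(read_back) == len(years)
assert (
    read_back["installation_year"].isna().sum()
    == years["installation_year"].isna().sum()
)
```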


## Notebook Resolution

This notebook establishes the admissibility of records for downstream analysis.

Specifically, it:
- checks that the columns required by the declared raw schema are present
- derives the canonical temporal field (`installation_year`)
- classifies temporal violations and exclusions
- materializes diagnostics used as gating signals downstream

No records are filtered or removed in this notebook.
No descriptive statistics, distributions, or interpretations are performed.

Downstream notebooks are responsible for:
- applying validity filters explicitly
- defining analytical populations
- constructing descriptive or predictive baselines

All subsequent analysis assumes the contracts defined here unless explicitly stated otherwise.
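
As an illustration of the downstream contract, a consuming notebook might apply the validity filter explicitly, along these lines. The inline CSV stands in for `installation_date_violations.csv` with made-up values, and keeping only `valid` rows is one possible policy, not one prescribed here.

```python
import io

import pandas as pd

# Stand-in for the diagnostics file written by this notebook (toy values).
diagnostics_csv = io.StringIO(
    "installation_date,installation_year,violation\n"
    "1999-10-01,1999.0,valid\n"
    "-1,,unparseable\n"
    "1901-01-01,1901.0,outside_reporting_window\n"
    "2017-11-06,2017.0,valid\n"
)
diagnostics = pd.read_csv(diagnostics_csv)

# The filter is stated explicitly and its effect is reported, not hidden.
is_valid = diagnostics["violation"] == "valid"
population = diagnostics.loc[is_valid].copy()

print(f"Kept {len(population)} of {len(diagnostics)} rows")
print(diagnostics.loc[~is_valid, "violation"].value_counts())
```

Reporting the excluded rows by violation type keeps the analytical population traceable to the diagnostics produced here.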
