# Canonical System Identity and Analytical Grain

This notebook establishes the only admissible system-level representation
of residential solar installations used in downstream analysis.

It performs no exploratory analysis and introduces no analytical interpretation.

Its sole responsibilities are to:
- load raw Tracking the Sun data as reported,
- enforce identifier admissibility rules defined in Repo 1,
- formally declare the canonical system identifier,
- collapse multiple reported rows into one system-level record,
- enforce the one-row-per-system grain invariant,
- and produce auditable diagnostics for exclusions and violations.

No system-level artifact may be produced unless all identity and grain
constraints are satisfied.
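
As a minimal sketch of what the grain invariant means in practice, the example below uses toy data, a hypothetical identifier column `system_key`, and a naive "keep the first row" collapse rule. None of these is the notebook's actual rule; the canonical identifier and the reconciliation logic are declared in later phases. `assert_one_row_per_system` is likewise a hypothetical helper name.

```python
import pandas as pd


def assert_one_row_per_system(df: pd.DataFrame, key: str) -> None:
    """Raise if any value of `key` is null or appears on more than one row."""
    if df[key].isna().any():
        raise ValueError(f"Null values found in identifier column '{key}'.")
    n_dupes = int(df[key].duplicated(keep=False).sum())
    if n_dupes:
        raise ValueError(
            f"Grain violation: {n_dupes} rows share a '{key}' value with another row."
        )


# Toy data: system "A" is reported on two rows (hypothetical values).
reported = pd.DataFrame(
    {"system_key": ["A", "A", "B"], "size_kw": [5.0, 5.0, 7.2]}
)

# Illustrative collapse rule only: keep the first reported row per system.
systems = reported.groupby("system_key", as_index=False).first()

# Passes silently on the collapsed frame; would raise on `reported`.
assert_one_row_per_system(systems, "system_key")
```

Raising instead of silently dropping duplicates matches the rule stated above: no system-level artifact is written while a violation exists.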


## Phase 0 — Raw Data Load & Structural Inspection

This phase loads the raw Tracking the Sun dataset and performs basic structural inspection.
No assumptions are made about system identity, grain, or column roles.

The purpose of this phase is purely descriptive: to understand the shape, size, and
surface characteristics of the raw data before any admissibility or scope decisions
are applied.


In [17]:
from pathlib import Path
import os

import pandas as pd
import numpy as np

In [18]:
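# Resolve raw data path via environment configuration.
# Read the variable as a string first: Path("") normalizes to Path("."),
# so comparing Path objects would hide the "unset" case behind ".".

_raw_path_str = os.environ.get("TRACKING_THE_SUN_DATA", "").strip()

if not _raw_path_str:
    raise EnvironmentError(
        "TRACKING_THE_SUN_DATA environment variable is not set."
    )

RAW_DATA_PATH = Path(_raw_path_str)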

if not RAW_DATA_PATH.exists():
    raise FileNotFoundError(
        f"Raw Tracking the Sun dataset not found at: {RAW_DATA_PATH}"
    )

if RAW_DATA_PATH.suffix != ".parquet":
    raise ValueError(
        "TRACKING_THE_SUN_DATA must point to a .parquet file."
    )
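
The three checks above could also be wrapped in a small function so that other notebooks reuse the same gate. This is a sketch only: `resolve_raw_path` is a hypothetical helper name, not something defined in this repository.

```python
import os
from pathlib import Path


def resolve_raw_path(env_var: str = "TRACKING_THE_SUN_DATA") -> Path:
    """Return the validated dataset path named by `env_var`, or raise."""
    value = os.environ.get(env_var, "").strip()
    if not value:
        raise EnvironmentError(f"{env_var} environment variable is not set.")

    path = Path(value)
    if not path.exists():
        raise FileNotFoundError(f"Raw dataset not found at: {path}")
    if path.suffix != ".parquet":
        raise ValueError(f"{env_var} must point to a .parquet file.")
    return path
```

In a notebook cell this would be called as `RAW_DATA_PATH = resolve_raw_path()`.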


In [19]:
# Load raw dataset
df_raw = pd.read_parquet(RAW_DATA_PATH)

In [20]:
df_raw.shape

(1921220, 80)

The raw Tracking the Sun dataset contains 1,921,220 rows and 80 columns.
Each row represents an administrative record related to a solar installation.
At this stage, rows are not assumed to correspond one-to-one with systems.

In [21]:
# Inspect column names
df_raw.columns.tolist()


['data_provider_1',
 'data_provider_2',
 'system_id_1',
 'system_id_2',
 'installation_date',
 'pv_system_size_dc',
 'total_installed_price',
 'rebate_or_grant',
 'customer_segment',
 'expansion_system',
 'multiple_phase_system',
 'tts_link_id',
 'new_construction',
 'tracking',
 'ground_mounted',
 'zip_code',
 'city',
 'utility_service_territory',
 'third_party_owned',
 'installer_name',
 'self_installed',
 'azimuth_1',
 'azimuth_2',
 'azimuth_3',
 'tilt_1',
 'tilt_2',
 'tilt_3',
 'module_manufacturer_1',
 'module_model_1',
 'module_quantity_1',
 'module_manufacturer_2',
 'module_model_2',
 'module_quantity_2',
 'module_manufacturer_3',
 'module_model_3',
 'module_quantity_3',
 'additional_modules',
 'technology_module_1',
 'technology_module_2',
 'technology_module_3',
 'bipv_module_1',
 'bipv_module_2',
 'bipv_module_3',
 'bifacial_module_1',
 'bifacial_module_2',
 'bifacial_module_3',
 'nameplate_capacity_module_1',
 'nameplate_capacity_module_2',
 'nameplate_capacity_module_3',
 '

The dataset contains 80 columns spanning identifiers, temporal fields,
physical system measurements, categorical indicators, and repeated
component specifications (modules, inverters, batteries).

At this stage, columns are not assigned semantic roles and no assumptions
are made about how values should be reconciled or interpreted.
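
The repeated component fields can be located purely by name, without assigning any semantic role. The sketch below groups a handful of column names copied from the listing above by their numeric suffix; the grouping rule (a trailing `_<digits>`) is an assumption made for illustration.

```python
import re
from collections import defaultdict

# A few column names copied from the listing above.
columns = [
    "system_id_1", "system_id_2", "installation_date",
    "azimuth_1", "azimuth_2", "azimuth_3",
    "module_model_1", "module_model_2", "module_model_3",
    "zip_code",
]

# Group names that end in "_<digits>" by their shared stem.
families = defaultdict(list)
for name in columns:
    match = re.fullmatch(r"(.+)_(\d+)", name)
    if match:
        families[match.group(1)].append(int(match.group(2)))

# Keep only stems that occur in more than one numbered slot.
repeated = {
    stem: sorted(slots) for stem, slots in families.items() if len(slots) > 1
}
print(repeated)
```

Against the full dataset, `columns` would be `df_raw.columns.tolist()`. The result is a lexical grouping only and says nothing about how the slots should be reconciled.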


In [22]:
# Inspect column data types
df_raw.dtypes

data_provider_1                object
data_provider_2                object
system_id_1                    object
system_id_2                    object
installation_date              object
                               ...   
battery_rated_capacity_kw     float64
battery_rated_capacity_kwh    float64
battery_price                 float64
technology_type                object
extensions_multiphase_id       object
Length: 80, dtype: object

Column data types follow the pattern typical of administrative datasets.
Identifiers, categorical indicators, and dates are stored with the `object`
dtype, so dates have not yet been parsed into a datetime type, while
physical measurements and prices are stored as `float64`.

At this stage, no parsing, coercion, or semantic interpretation is applied.

In [23]:
# Missingness summary (top-level)
df_raw.isna().sum().sort_values(ascending=False).head(20)

micro_inverter_1               1830683
built_in_meter_inverter_1      1830683
bifacial_module_1              1778908
bipv_module_1                  1776393
output_capacity_inverter_1     1317603
micro_inverter_2                129121
built_in_meter_inverter_2       129121
bifacial_module_2                85966
bipv_module_2                    85878
output_capacity_inverter_2       78792
micro_inverter_3                 14075
built_in_meter_inverter_3        14075
bifacial_module_3                10889
bipv_module_3                    10883
output_capacity_inverter_3        8629
nameplate_capacity_module_1       1769
installation_date                  231
nameplate_capacity_module_2         55
nameplate_capacity_module_3         15
dc_optimizer                        12
dtype: int64

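High missingness is concentrated in repeated component fields, chiefly the
`micro_inverter_*`, `built_in_meter_inverter_*`, `bifacial_module_*`,
`bipv_module_*`, and `output_capacity_inverter_*` columns. For example,
`micro_inverter_1` is null in 1,830,683 of 1,921,220 rows (about 95%). This
pattern is consistent with structural optionality rather than data loss.
Core administrative and measurement fields exhibit substantially lower
missingness: `installation_date` is null in only 231 rows.

Because `isna()` counts only null values, these figures would not capture
missing data encoded in any other way.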
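
Raw null counts are easier to compare across columns when expressed as a share of rows. The sketch below shows one way to do that on a toy frame with hypothetical values; in the notebook, `toy` would be replaced by `df_raw`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_raw (hypothetical values).
toy = pd.DataFrame(
    {
        "micro_inverter_1": [np.nan, np.nan, np.nan, 1.0],
        "installation_date": ["2020-01-01", None, "2021-06-30", "2022-03-15"],
        "zip_code": ["94110", "10001", "60614", "73301"],
    }
)

# Fraction of null values per column, highest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

`isna().mean()` works because `True` counts as 1 and `False` as 0, so the column mean is the null fraction.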

Phase 0 concludes here. No assertions about system identity, scope, or
column roles have been made.


## Phase 1 — System Identity Existence & Scope Resolution

This phase determines whether canonical system identity can be constructed
from the raw dataset and enforces scope restrictions required for
system-level analysis.

All assertions in this phase are gatekeeping conditions. Failure to meet
them halts or restricts the pipeline.


In [24]:
# Assert presence of canonical system identifier
if "tts_link_id" not in df_raw.columns:
    raise KeyError(
        "Canonical system identifier `tts_link_id` is missing from the dataset."
    )

# Count rows where the identifier is null
df_raw["tts_link_id"].isna().sum()


0
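
A null count of zero shows that every row carries an identifier; it does not show that each identifier appears only once. A hedged sketch of a rows-per-identifier diagnostic follows, using toy data with hypothetical identifier values. The dictionary keys are illustrative names, not an established schema.

```python
import pandas as pd

# Toy identifiers: "a1" appears twice and "c3" three times (hypothetical).
toy = pd.DataFrame({"tts_link_id": ["a1", "a1", "b2", "c3", "c3", "c3"]})

# Number of reported rows for each identifier value.
rows_per_id = toy["tts_link_id"].value_counts()

diagnostics = {
    "n_rows": len(toy),
    "n_null_ids": int(toy["tts_link_id"].isna().sum()),
    "n_unique_ids": int(toy["tts_link_id"].nunique()),
    "n_ids_with_multiple_rows": int((rows_per_id > 1).sum()),
}
print(diagnostics)
```

A non-zero `n_ids_with_multiple_rows` would indicate that rows must be collapsed before the one-row-per-system invariant can hold.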