# Configuration Variable Coverage Diagnostics

## Purpose

This notebook documents which configuration-related variables have sufficient
empirical support to define **within-size structural variation** in residential
solar systems.

It satisfies the requirement to record how analytical constraints are learned
from the data rather than imposed a priori.

This notebook does not:
- infer structure
- define regimes
- evaluate deviation
- assess risk

It reports empirical coverage only.


In [21]:
import os
from pathlib import Path
import pandas as pd
import numpy as np


In [22]:
NOTEBOOK_DIR = Path.cwd()
REPO_ROOT = NOTEBOOK_DIR.parent

INPUTS_DIR = REPO_ROOT / "inputs"
OUTPUTS_DIR = REPO_ROOT / "outputs"

## Data Access Pattern

This repository does not materialize the full canonical dataset.

System-level configuration variables are accessed via the shared Parquet-backed
Tracking the Sun dataset, using the same environment-based access pattern
established upstream.

Expected-size context is provided by an explicit upstream artifact produced by
Repo 2 and passed unchanged into this repository.
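
A minimal sketch of the environment-based access pattern, assuming the variable holds a filesystem path. The helper name `resolve_data_path` is hypothetical and not part of the upstream repositories; it only illustrates failing early with a readable message when the variable is unset or points nowhere.

```python
import os
from pathlib import Path


def resolve_data_path(var_name: str = "TRACKING_THE_SUN_DATA") -> Path:
    """Return the path named by an environment variable, or fail loudly.

    Hypothetical helper: the variable name matches the one used in this
    notebook, but the function itself is an illustration only.
    """
    raw = os.environ.get(var_name)
    if not raw:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    path = Path(raw)
    if not path.exists():
        raise FileNotFoundError(f"{var_name} points to a missing path: {path}")
    return path
```

The same helper could be reused for `TTS_ARTIFACTS`, so both external locations are checked in the same way.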


In [23]:
# External data source (shared across repos)
DATA_PATH = Path(os.environ["TRACKING_THE_SUN_DATA"])

# Load Parquet dataset
df_systems = pd.read_parquet(DATA_PATH)

# Sanity check
df_systems.shape
df_systems.columns.tolist()

['data_provider_1',
 'data_provider_2',
 'system_id_1',
 'system_id_2',
 'installation_date',
 'pv_system_size_dc',
 'total_installed_price',
 'rebate_or_grant',
 'customer_segment',
 'expansion_system',
 'multiple_phase_system',
 'tts_link_id',
 'new_construction',
 'tracking',
 'ground_mounted',
 'zip_code',
 'city',
 'utility_service_territory',
 'third_party_owned',
 'installer_name',
 'self_installed',
 'azimuth_1',
 'azimuth_2',
 'azimuth_3',
 'tilt_1',
 'tilt_2',
 'tilt_3',
 'module_manufacturer_1',
 'module_model_1',
 'module_quantity_1',
 'module_manufacturer_2',
 'module_model_2',
 'module_quantity_2',
 'module_manufacturer_3',
 'module_model_3',
 'module_quantity_3',
 'additional_modules',
 'technology_module_1',
 'technology_module_2',
 'technology_module_3',
 'bipv_module_1',
 'bipv_module_2',
 'bipv_module_3',
 'bifacial_module_1',
 'bifacial_module_2',
 'bifacial_module_3',
 'nameplate_capacity_module_1',
 'nameplate_capacity_module_2',
 'nameplate_capacity_module_3',
 '

### System-Level Column Surface

The list above represents the full set of variables recorded at the
system level in the Tracking the Sun dataset.

No assumptions are made at this stage about:
- Variable relevance
- Variable quality
- Variable analytical role

All inclusion/exclusion decisions will be justified empirically
via coverage analysis in later cells.


In [24]:
# TEMPORARY explicit artifacts path
ARTIFACTS_DIR = Path(os.environ["TTS_ARTIFACTS"])
assert ARTIFACTS_DIR.exists(), f"Artifacts directory not found: {ARTIFACTS_DIR}"
df_expected = pd.read_csv(
    ARTIFACTS_DIR / "baseline_with_expected_size.csv"
)

## Verifying the Expected Size Artifact Contract

Before integrating expected size information with system configuration data, the canonical artifact must be validated.

Specifically, the artifact must satisfy the following conditions:

- one row per residential solar system
- a stable system identifier (`system_id`)
- an installation year consistent with the raw dataset
- expected system size and residual values present

If any of these conditions are violated, downstream joins and structural analysis are invalid, and the issue must be corrected upstream.

Only once this contract is confirmed does Repo 3 proceed.
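
The conditions above can be expressed as a small report, sketched here under the assumption that the artifact uses the column names `system_id`, `installation_year`, `expected_system_size_kw`, and `size_residual_kw`. The function name `check_expected_size_contract` is hypothetical. Consistency of the installation year with the raw dataset is not covered, since that check needs both tables.

```python
import pandas as pd

REQUIRED_COLUMNS = [
    "system_id",
    "installation_year",
    "expected_system_size_kw",
    "size_residual_kw",
]


def check_expected_size_contract(df: pd.DataFrame) -> dict:
    """Report, per contract condition, whether the artifact satisfies it.

    Hypothetical helper for illustration; it reports rather than raises so
    that every violated condition is visible in one pass.
    """
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise KeyError(f"Artifact is missing required columns: {missing}")
    value_cols = ["expected_system_size_kw", "size_residual_kw"]
    return {
        "one_row_per_system": bool(df["system_id"].is_unique),
        "system_id_present": bool(df["system_id"].notna().all()),
        "installation_year_present": bool(df["installation_year"].notna().all()),
        "expected_and_residual_present": bool(df[value_cols].notna().all().all()),
    }
```

Calling `check_expected_size_contract(df_expected)` would return a dictionary of booleans; any `False` entry points to a condition that has to be resolved before joining.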


In [25]:
df_expected.shape
df_expected.columns.tolist()

['system_id',
 'installation_year',
 'pv_system_size_dc',
 'expected_system_size_kw',
 'size_residual_kw']

## Preparing Join Keys for System-Level Integration

The expected size artifact produced upstream includes both a stable system identifier and an installation year.

To enable a controlled, auditable join between expected size context and system configuration data, the raw system dataset must expose compatible join keys.

This step derives the installation year at the system level and verifies the presence of stable identifiers prior to integration.
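
A small illustration of how year derivation behaves when some dates cannot be parsed, using made-up values. With `errors="coerce"`, unparseable or missing dates become missing years, which is the likely reason the derived year displays as a float (for example `1999.0`). The conversion to the nullable `Int64` dtype is an optional suggestion, not something the notebook does.

```python
import pandas as pd

# Made-up dates: one valid, one unparseable, one missing.
dates = pd.Series(["2017-03-14", "not a date", None])

# Unparseable and missing entries are coerced to NaT, so their year is missing.
years = pd.to_datetime(dates, errors="coerce").dt.year

# Optional: a nullable integer dtype keeps whole-number years alongside <NA>.
nullable_years = years.astype("Int64")
```

Counting `years.isna().sum()` on the real data would show how many systems lack a usable installation year.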


In [26]:
# Derive installation year for join consistency
df_systems["installation_year"] = (
    pd.to_datetime(df_systems["installation_date"], errors="coerce")
      .dt.year
)

df_systems[["system_id_1", "installation_year"]].head()


         system_id_1  installation_year
0               PVD1             1999.0
1  PGE-INT-114109373             2017.0
2               PVP1             1999.0
3  PGE-INT-114149823             2017.0
4               PVD2             1999.0


## Diagnosing Expected Size Artifact Grain

Before integrating expected size context with system configuration data, the
grain of the expected size artifact must be verified.

Although the artifact is intended to be system-level, expected size is defined
as a function of installation year, which may introduce multiple rows per
system.

This diagnostic checks whether the expected size artifact is unique at the
system level or whether additional consolidation is required prior to
integration.
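
The grain check can be sketched on a toy frame with made-up identifiers. Besides the rows-per-system distribution, the sketch counts distinct installation years for repeated systems, which helps separate repeats that differ by year from same-year duplicates. The second step is an added suggestion rather than part of the notebook.

```python
import pandas as pd

# Toy stand-in for the expected size artifact (values are invented).
toy_expected = pd.DataFrame({
    "system_id": ["A", "B", "B", "C", "C", "C"],
    "installation_year": [2015, 2016, 2017, 2018, 2018, 2018],
})

# How many systems have 1 row, 2 rows, and so on.
rows_per_system = toy_expected.groupby("system_id").size()
grain_distribution = rows_per_system.value_counts().sort_index()

# For repeated systems only: number of distinct installation years.
repeated_ids = rows_per_system[rows_per_system > 1].index
years_per_repeated = (
    toy_expected[toy_expected["system_id"].isin(repeated_ids)]
    .groupby("system_id")["installation_year"]
    .nunique()
)
```

A repeated system with a single distinct year cannot be disambiguated by year alone, so any collapse rule keyed on year needs a tie-breaking convention.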


In [27]:
df_expected.groupby("system_id").size().value_counts().head()


1    1918957
2        456
4        255
3         39
7         11
Name: count, dtype: int64

## Enforcing System-Grain Uniqueness for Expected Size

Downstream structural analysis in this repository requires a single expected
size context per system.

Although the expected size artifact is largely system-grain, a small number of
systems appear multiple times due to upstream construction details.

To restore a system-level reference frame, the expected size artifact is
explicitly collapsed to one row per system using a deterministic rule.
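
A sketch of one deterministic collapse rule: keep the row with the latest installation year per system. The function name `collapse_to_system_grain` is hypothetical. The stable sort (`kind="mergesort"`) is an assumption added here so that rows tied on year keep their original relative order, which makes the retained row reproducible.

```python
import pandas as pd


def collapse_to_system_grain(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the latest-year row per system_id.

    A stable sort preserves the input order among rows with equal years,
    so keep="last" always selects the same row for a given input.
    """
    return (
        df.sort_values("installation_year", kind="mergesort")
          .drop_duplicates(subset="system_id", keep="last")
    )
```

After the collapse, `result["system_id"].is_unique` should hold, which is the property the downstream join relies on.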


In [28]:
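df_expected_system = (
    df_expected
    # Stable sort so that rows tied on installation_year keep a reproducible order
    .sort_values("installation_year", kind="mergesort")
    # Keep the latest-year row for each system
    .drop_duplicates(subset="system_id", keep="last")
)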

df_expected_system.shape


(1919733, 5)

## Integrating Expected Size Context at the System Level

Structural analysis in this repository requires that each system be associated
with a single expected size reference.

After enforcing system-grain uniqueness in the expected size artifact, expected
size context can be safely integrated with system-level configuration records.

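The integration is performed using an inner join validated as many-to-one
(`validate="m:1"`). System records without a matching expected size are
dropped, and the validation raises an error if any `system_id` appears more
than once on the expected size side, which prevents duplication of expected
size values.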
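
One way to audit such a join, sketched on toy data with invented identifiers, is to run it first as a left join with `indicator=True`, count the system records that find no expected size match, and then keep the matched rows. This is an illustration under those assumptions and not the notebook's own procedure.

```python
import pandas as pd

# Toy stand-ins: "A" appears twice on the system side, "Z" has no match.
systems = pd.DataFrame({
    "system_id_1": ["A", "A", "B", "Z"],
    "tracking": [0, 0, 1, 0],
})
expected = pd.DataFrame({
    "system_id": ["A", "B", "C"],
    "expected_system_size_kw": [5.0, 6.5, 7.0],
})

# validate="m:1" raises pandas.errors.MergeError if the right-hand keys repeat.
audit = systems.merge(
    expected,
    left_on="system_id_1",
    right_on="system_id",
    how="left",
    validate="m:1",
    indicator=True,
)

# System records with no expected size context.
n_unmatched = int((audit["_merge"] == "left_only").sum())

# Matched rows only, equivalent to the result of an inner join.
joined = audit[audit["_merge"] == "both"].drop(columns="_merge")
```

Recording `n_unmatched` alongside the joined shape makes the size of any dropped population explicit.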


In [29]:
df_joined = df_systems.merge(
    df_expected_system,
    left_on="system_id_1",
    right_on="system_id",
    how="inner",
    validate="m:1"
)

df_joined.shape


(1921220, 86)

## Declaring Configuration Dimensions

The following variables represent system configuration attributes that may vary
meaningfully even when system size is held constant.

At this stage, no assumptions are made about their analytical usefulness.
Variables are evaluated solely on empirical coverage.


In [30]:
config_vars = [
    "tracking",
    "ground_mounted",
    "third_party_owned",
    "new_construction",
    "expansion_system",
    "multiple_phase_system",
    "technology_type",
    "micro_inverter_1",
    "micro_inverter_2",
    "micro_inverter_3",
    "dc_optimizer",
    "battery_manufacturer",
    "battery_rated_capacity_kwh",
]


## Configuration Variable Coverage Diagnostics

Coverage diagnostics quantify the proportion of observed system records for
which each configuration variable is populated.

These diagnostics are used to distinguish active configuration dimensions from
those that are structurally dormant.
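
The coverage computation reduces to the mean of a non-null mask per column. The toy example below uses invented values and also separates requested variables that are absent from the frame, since selecting a missing column would raise a `KeyError`. The `requested`, `present`, and `absent` names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with one fully populated and one sparsely populated variable.
toy = pd.DataFrame({
    "tracking": [0, 1, 0, 0],
    "battery_rated_capacity_kwh": [np.nan, 13.5, np.nan, np.nan],
})

requested = ["tracking", "battery_rated_capacity_kwh", "dc_optimizer"]
present = [c for c in requested if c in toy.columns]
absent = [c for c in requested if c not in toy.columns]

# Fraction of rows with a non-null value, per present column.
coverage_fraction = toy[present].notna().mean().sort_values(ascending=False)
```

Reporting `absent` separately keeps "column not in the dataset" distinct from "column present but empty".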


In [31]:
coverage = (
    df_joined[config_vars]
    .notna()
    .mean()
    .sort_values(ascending=False)
    .to_frame(name="coverage_fraction")
)

coverage


                            coverage_fraction
tracking                                  1.0
ground_mounted                            1.0
third_party_owned                         1.0
new_construction                          1.0
expansion_system                          1.0
multiple_phase_system                     1.0
technology_type                      0.999994
dc_optimizer                         0.999994
battery_manufacturer                 0.999994
battery_rated_capacity_kwh           0.999994


## Interpretation of Coverage Diagnostics

Coverage diagnostics indicate that most configuration variables are consistently
observed across system records and are admissible for structural analysis.

Component-indexed variables (e.g., micro_inverter_1–3) exhibit asymmetric
coverage patterns reflecting system design conventions rather than missingness.

No variables are excluded at this stage. Coverage results define which
configuration dimensions are active and how they should be treated in
downstream structural comparisons.
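
A non-null fraction counts any stored placeholder as populated. If the dataset encodes "not reported" with a sentinel value, a stricter measure may be informative. The sketch below assumes hypothetical sentinels (`-1` for numeric columns, `"-1"` and empty strings for text columns); these are not confirmed for this dataset and should be checked against its documentation. The function name `informative_fraction` is also hypothetical.

```python
import pandas as pd


def informative_fraction(
    df: pd.DataFrame,
    numeric_sentinels=(-1,),        # assumed placeholder, not confirmed
    string_sentinels=("-1", ""),    # assumed placeholders, not confirmed
) -> pd.Series:
    """Share of rows per column that are non-null and not a placeholder."""
    result = {}
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_numeric_dtype(s):
            placeholder = s.isin(list(numeric_sentinels))
        else:
            # Treat whitespace-only strings the same as empty strings.
            placeholder = s.map(
                lambda v: isinstance(v, str) and v.strip() in string_sentinels
            ).astype(bool)
        result[col] = float((s.notna() & ~placeholder).mean())
    return pd.Series(result, name="informative_fraction")
```

Comparing this measure with the non-null coverage for the same columns would show whether high coverage reflects recorded values or recorded placeholders.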


In [32]:
# Create the outputs directory (and any missing parents) if it does not exist
OUTPUTS_DIR.mkdir(parents=True, exist_ok=True)

coverage.to_csv(
    OUTPUTS_DIR / "configuration_variable_coverage.csv"
)
