## Notebook 2 — System Size Descriptives and Baselines

This notebook operates on the **system-level base artifact** produced in
Notebook 1, which preserves canonical system identity (`tts_link_id`) while
explicitly recording reporting multiplicity and instability.

Empirical inspection in Notebook 1 showed that the raw Tracking the Sun dataset
does not contain stable system-level attributes; nearly all columns vary
across administrative records for the same system. As a result, no physical
system characteristics (e.g. size, price, installation date) were reconciled
or inferred at the system level in the prior step.

The purpose of this notebook is to introduce **descriptive analytical
semantics** in a controlled, population-aware manner. In this context,
“baselines” refer to **descriptive reference distributions**, not inferential
expectations or normative system values.

### Responsibilities

This notebook is responsible for:

- defining an explicit system-size representation for descriptive analysis,
- establishing admissible system and temporal cohorts,
- characterizing empirical size distributions by context,
- and describing dispersion and temporal drift without inference.

### Outputs

This notebook produces the following artifacts for downstream analysis:

- `size_distributions.parquet`  
  Empirical size distributions partitioned by explicitly defined contexts
  (e.g. installation year, cohort), derived from admissible systems.

- `size_baselines.parquet`  
  Descriptive reference summaries (counts, quantiles, bounds) derived from
  size distributions and intended for comparative—not inferential—use.

All outputs are **descriptive** and preserve the distinction between observed
reporting behavior and inferred system characteristics.
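As an illustration of the kind of summary `size_baselines.parquet` is described as holding, the sketch below computes counts, bounds, and quartiles per context on toy data. The `install_year` column, the output column names, and the helper `summarize_sizes` are assumptions made for this example; they are not taken from the notebook's artifacts.

```python
import pandas as pd

# Toy stand-in for a system-size table. `install_year` is a hypothetical
# context column used only for illustration.
toy = pd.DataFrame({
    "install_year": [2010, 2010, 2010, 2011, 2011],
    "system_size_kw": [2.0, 4.0, 6.0, 3.0, 5.0],
})

def summarize_sizes(df, by, size_col="system_size_kw"):
    """Return per-context counts, bounds, and quartiles of reported size."""
    grouped = df.groupby(by)[size_col]
    # Counts and bounds per context
    summary = grouped.agg(n_systems="count", min_kw="min", max_kw="max")
    # Quartiles per context, reshaped to one column per quantile
    quartiles = grouped.quantile([0.25, 0.5, 0.75]).unstack()
    quartiles.columns = ["q25_kw", "q50_kw", "q75_kw"]
    return summary.join(quartiles).reset_index()

baselines = summarize_sizes(toy, by="install_year")
print(baselines)
```

The result is purely descriptive: each row summarizes what was reported within one context and carries no expectation about what a system's size should be.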

This notebook answers the question:

**What do reported system sizes look like across time and cohorts, and which
systems are admissible for downstream analytical modeling?**



## Phase 1 — Load System-Level Base & Diagnostic Inspection

This phase establishes the analytical substrate for this notebook by loading
the system-level base artifact produced in Notebook 1.

The purpose of this phase is **inspection only**:
- confirm dataset shape and integrity,
- review system-level diagnostic indicators,
- and establish baseline counts prior to any filtering or transformation.

No systems are excluded, no size representations are defined, and no temporal
semantics are introduced in this phase.

In [33]:
from pathlib import Path
import pandas as pd
import os

# Resolve system-level base path
SYSTEM_BASE_PATH = Path("../outputs/system_level_base.parquet")

if not SYSTEM_BASE_PATH.exists():
    raise FileNotFoundError(
        f"System-level base artifact not found at: {SYSTEM_BASE_PATH}"
    )

# Load system-level base
df_system = pd.read_parquet(SYSTEM_BASE_PATH)

df_system.shape


(123178, 7)
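The shape alone does not confirm that the artifact holds one row per system. A minimal integrity check, sketched here on toy data, could count missing and duplicated identifiers; the helper name `check_system_base` is hypothetical and is not part of Notebook 1.

```python
import pandas as pd

def check_system_base(df, id_col="tts_link_id"):
    """Return simple integrity indicators for a one-row-per-system table."""
    return {
        "n_rows": len(df),
        "n_missing_ids": int(df[id_col].isna().sum()),
        # duplicated() marks every repeat after the first occurrence
        "n_duplicate_ids": int(df[id_col].duplicated().sum()),
    }

# Toy table with one repeated and one missing identifier
toy_base = pd.DataFrame({
    "tts_link_id": ["a", "b", "b", None],
    "n_rows": [2, 2, 3, 2],
})

report = check_system_base(toy_base)
print(report)
```

Applied to `df_system`, both the missing and duplicate counts would be expected to be zero if the base artifact preserves one row per `tts_link_id`.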

### System-Level Diagnostic Overview

This section inspects the structure and diagnostic indicators contained in the
system-level base artifact. The goal is to understand what information is
available to support admissibility decisions in later phases.

No filtering or transformation is performed here.


In [34]:
df_system.dtypes

tts_link_id             object
n_rows                   int64
n_installation_dates     int64
n_system_sizes           int64
n_prices                 int64
has_expansion             bool
has_multiple_phases       bool
dtype: object

### Baseline System Counts and Diagnostic Distributions

This section records baseline counts and the distributions of diagnostic
indicators prior to any admissibility filtering. These summaries serve as
reference points for all subsequent exclusions and cohort definitions.


In [35]:
df_system.shape[0]


123178

In [36]:
df_system[[
    "n_rows",
    "n_installation_dates",
    "n_system_sizes",
    "n_prices",
    "has_expansion",
    "has_multiple_phases"
]].describe(include="all")


| statistic | n_rows | n_installation_dates | n_system_sizes | n_prices | has_expansion | has_multiple_phases |
|---|---|---|---|---|---|---|
| count | 123178.0 | 123178.0 | 123178.0 | 123178.0 | 123178 | 123178 |
| unique | | | | | 1 | 1 |
| top | | | | | True | True |
| freq | | | | | 123178 | 123178 |
| mean | 2.152462 | 2.064679 | 2.045333 | 2.058168 | | |
| std | 39.521513 | 10.239701 | 12.291587 | 20.413945 | | |
| min | 2.0 | 0.0 | 1.0 | 1.0 | | |
| 25% | 2.0 | 2.0 | 2.0 | 2.0 | | |
| 50% | 2.0 | 2.0 | 2.0 | 2.0 | | |
| 75% | 2.0 | 2.0 | 2.0 | 2.0 | | |


## Phase 2 — System Admissibility and Stability Filtering

This phase defines which system identities are admissible for descriptive
system-size analysis based on observed reporting behavior.

Using the diagnostic indicators constructed in Notebook 1, this phase
characterizes system-level instability and establishes **explicit,
distribution-aware admissibility criteria**. These criteria are used to
exclude systems whose reporting behavior is too unstable to support
meaningful descriptive summaries.

All exclusions in this phase are empirical and transparent. No physical
system attributes are inferred, reconciled, or modeled.


In [37]:
df_system[[
    "n_rows",
    "n_installation_dates",
    "n_system_sizes",
    "n_prices"
]].quantile([0.50, 0.75, 0.90, 0.95, 0.99])


| quantile | n_rows | n_installation_dates | n_system_sizes | n_prices |
|---|---|---|---|---|
| 0.50 | 2.0 | 2.0 | 2.0 | 2.0 |
| 0.75 | 2.0 | 2.0 | 2.0 | 2.0 |
| 0.90 | 2.0 | 2.0 | 2.0 | 2.0 |
| 0.95 | 2.0 | 2.0 | 2.0 | 2.0 |
| 0.99 | 3.0 | 3.0 | 3.0 | 3.0 |


### Admissibility Rule for System-Size Descriptives

Based on empirical instability distributions, systems are considered
admissible for descriptive system-size analysis if they report no more
than three distinct system sizes (`n_system_sizes ≤ 3`).

This threshold corresponds to the 99th percentile of observed reporting
behavior and excludes systems whose size instability is pathological
rather than representative.


In [38]:
# Define admissibility threshold
MAX_SYSTEM_SIZES = 3

# Apply filter
df_admissible = df_system[df_system["n_system_sizes"] <= MAX_SYSTEM_SIZES].copy()

# Record counts
total_systems = df_system.shape[0]
admissible_systems = df_admissible.shape[0]
excluded_systems = total_systems - admissible_systems

total_systems, admissible_systems, excluded_systems


(123178, 122998, 180)

Applying the admissibility rule (`n_system_sizes ≤ 3`) excludes 180 of 123,178
systems (~0.15% of the universe). Pathological size instability is therefore
rare, and the exclusion removes only a small tail of the overall population.
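To see how sensitive the exclusion count is to the chosen cutoff, one option is to tabulate exclusions across neighboring thresholds. The sketch below does this on toy counts; the helper `exclusion_by_threshold` and the toy values are assumptions for illustration only.

```python
import pandas as pd

def exclusion_by_threshold(n_sizes, thresholds):
    """Count systems excluded by the rule `n_system_sizes <= t` for each t."""
    total = len(n_sizes)
    rows = []
    for t in thresholds:
        excluded = int((n_sizes > t).sum())
        rows.append({
            "max_system_sizes": t,
            "excluded": excluded,
            "excluded_pct": 100 * excluded / total,
        })
    return pd.DataFrame(rows)

# Toy distribution of distinct-size counts per system
toy_counts = pd.Series([1, 2, 2, 2, 2, 2, 2, 3, 4, 9])

sensitivity = exclusion_by_threshold(toy_counts, thresholds=[2, 3, 4])
print(sensitivity)
```

On the real data this would be called with `df_system["n_system_sizes"]`, giving a small table that documents what each alternative threshold would have excluded.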

In [39]:
# Persist admissible system index for downstream notebooks
ADMISSIBLE_SYSTEM_INDEX_PATH = Path("../outputs/admissible_system_index.parquet")

df_admissible.to_parquet(ADMISSIBLE_SYSTEM_INDEX_PATH, index=False)

ADMISSIBLE_SYSTEM_INDEX_PATH


WindowsPath('../outputs/admissible_system_index.parquet')

## Phase 3 — System Size Representation Definition

At this stage, system identity has been established and admissible systems
have been selected based on observed reporting stability. However, system
size remains a non-invariant attribute: multiple reported sizes may exist
for a single system due to administrative corrections, phased reporting,
or programmatic updates.

This phase defines an explicit **system-size representation** to support
descriptive analysis. The representation chosen here is not asserted as the
true physical system size; it is a **constrained descriptive projection**
applied uniformly across admissible systems.

The choice of size representation is documented explicitly and is intended
to be:
- stable under admissible reporting variation,
- reversible in downstream analysis,
- and appropriate for non-inferential descriptive summaries.

Alternative representations may be evaluated in later notebooks where
stronger admissibility constraints are applied.


### Candidate System-Size Representations

System size is not invariant at the system level. For admissible systems
(`n_system_sizes ≤ 3`), multiple reported size values may still exist due to
administrative corrections or phased reporting.

Before choosing a representation, we enumerate and inspect **candidate
descriptive representations** that can be applied uniformly:

- **first_reported_size**: the earliest reported size for the system
- **last_reported_size**: the most recent reported size for the system
- **modal_size**: the most frequently reported size for the system

This step inspects how these candidates behave empirically without yet
asserting any one of them as the chosen representation.
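The three definitions can be made concrete on a toy table. This sketch assumes that a per-record ordering column is available; `report_date` is a hypothetical name and is not a column loaded anywhere in this notebook.

```python
import pandas as pd

# Toy administrative records; `report_date` is a hypothetical ordering column.
toy_reports = pd.DataFrame({
    "tts_link_id": ["a", "a", "a", "b", "b"],
    "report_date": pd.to_datetime([
        "2012-03-01", "2011-01-15", "2013-06-30", "2014-02-01", "2014-02-01",
    ]),
    "pv_system_size_dc": [5.0, 4.0, 5.0, 3.0, 3.0],
})

# Order records chronologically within each system so that "first" and
# "last" refer to report order rather than to size magnitude.
ordered = toy_reports.sort_values(["tts_link_id", "report_date"])

candidates = (
    ordered.groupby("tts_link_id")["pv_system_size_dc"]
    .agg(
        first_reported_size="first",
        last_reported_size="last",
        # mode() returns sorted values, so ties resolve to the smallest size
        modal_size=lambda s: s.mode().iloc[0],
    )
    .reset_index()
)
print(candidates)
```

For system `a`, the earliest report is 4.0 kW while both the latest and the most frequent value are 5.0 kW, which shows how the candidates can diverge for a single system.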


In [40]:
# Load raw Tracking the Sun data (size column only)

RAW_DATA_PATH = Path(os.environ["TRACKING_THE_SUN_DATA"])

df_raw = pd.read_parquet(
    RAW_DATA_PATH,
    columns=["tts_link_id", "pv_system_size_dc"]
)

# Restrict to admissible systems only
df_size_raw = df_raw.merge(
    df_admissible[["tts_link_id"]],
    on="tts_link_id",
    how="inner"
)

df_size_raw.shape


(250366, 2)

### Enumerating Candidate Size Representations

For each admissible system, we compute candidate descriptive representations
of system size to evaluate their empirical behavior and degree of agreement.

At this stage, no representation is selected. The goal is to observe how
first-reported, last-reported, and modal size values compare across systems
with limited size instability.


In [41]:
# Drop missing size values
df_size_clean = df_size_raw.dropna(subset=["pv_system_size_dc"]).copy()

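# Sort for deterministic first/last. Note: the ordering is by size value, not
# by report date, so "first" is the smallest and "last" the largest size per system.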
df_size_clean = df_size_clean.sort_values(
    ["tts_link_id", "pv_system_size_dc"]
)

# Compute candidate representations
size_candidates = (
    df_size_clean
    .groupby("tts_link_id")["pv_system_size_dc"]
    .agg(
        first_reported_size="first",
        last_reported_size="last",
        modal_size=lambda x: x.mode().iloc[0] if not x.mode().empty else None,
        n_size_reports="count"
    )
    .reset_index()
)

size_candidates.shape


(122998, 5)

### Comparing Candidate Size Representations

To select a system-size representation that minimizes epistemic distortion,
we compare first-reported, last-reported, and modal size values across
admissible systems.

This comparison focuses on:
- frequency of disagreement between representations,
- and magnitude of differences when disagreement occurs.

The representation with the highest agreement and lowest distortion will be
selected for descriptive analysis.

In [42]:
# Pairwise agreement indicators
size_candidates["first_equals_last"] = (
    size_candidates["first_reported_size"] ==
    size_candidates["last_reported_size"]
)

size_candidates["first_equals_modal"] = (
    size_candidates["first_reported_size"] ==
    size_candidates["modal_size"]
)

size_candidates["last_equals_modal"] = (
    size_candidates["last_reported_size"] ==
    size_candidates["modal_size"]
)

size_candidates[[
    "first_equals_last",
    "first_equals_modal",
    "last_equals_modal"
]].mean()


first_equals_last     0.024342
first_equals_modal    0.998911
last_equals_modal     0.025350
dtype: float64
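The agreement rates above cover the frequency of disagreement. The magnitude of differences could be summarized along the following lines; the helper `disagreement_magnitude` and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy candidate table with the same column names as `size_candidates`
toy_candidates = pd.DataFrame({
    "tts_link_id": ["a", "b", "c"],
    "first_reported_size": [4.0, 3.0, 2.0],
    "last_reported_size": [5.0, 3.0, 8.0],
})

def disagreement_magnitude(df, left, right):
    """Describe absolute differences for rows where two candidates disagree."""
    diff = (df[right] - df[left]).abs()
    return diff[diff > 0].describe()

magnitude = disagreement_magnitude(
    toy_candidates, "first_reported_size", "last_reported_size"
)
print(magnitude)
```

Restricting the summary to disagreeing rows keeps the large share of exact matches from masking how far apart the candidates are when they do differ.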

### Selected System-Size Representation

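Empirical comparison of candidate representations shows that the first
reported system size agrees with the modal (most frequently reported) size
for over 99.8% of admissible systems, while last reported size agrees with
either in only about 2.5% of systems.

This agreement should be read with care. Records were sorted by size value
before aggregation, so "first" is the smallest reported value per system, and
`mode().iloc[0]` also returns the smallest value when sizes tie in frequency.
For a system with two distinct sizes reported once each, the two candidates
therefore agree by construction.

Based on this evidence, **first reported size** is selected as the system-size
representation for descriptive analysis. This choice is intended to limit
distortion by using a single reported value per system without smoothing or
averaging across reports.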


In [43]:
# Construct canonical system size representation
df_size_representation = size_candidates[[
    "tts_link_id",
    "first_reported_size",
    "n_size_reports"
]].rename(columns={
    "first_reported_size": "system_size_kw"
})

df_size_representation.shape


(122998, 3)

In [44]:
# Missingness check
df_size_representation["system_size_kw"].isna().sum()


0

In [45]:
# Basic bounds inspection
df_size_representation["system_size_kw"].describe()


count    122998.000000
mean          4.112302
std          19.224189
min          -1.000000
25%           2.268000
50%           3.400000
75%           4.800000
max        2087.783673
Name: system_size_kw, dtype: float64

### Size Measurement Admissibility

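System size is a physical quantity and must be strictly positive. Any
non-positive values (the minimum observed above is -1.0) are treated as
non-admissible measurement encodings (e.g., sentinel values) and are excluded
from size-based analysis. Because records were sorted by size value before
aggregation, the selected value is the smallest one reported for each system,
so a system is flagged here if any of its records carries a non-positive size.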


In [47]:
# Identify non-admissible size values
invalid_size_mask = df_size_representation["system_size_kw"] <= 0

invalid_size_count = invalid_size_mask.sum()
invalid_size_count


19685

In [48]:
total_systems = df_size_representation.shape[0]
invalid_pct = invalid_size_count / total_systems * 100

invalid_size_count, total_systems, invalid_pct


(19685, 122998, 16.004325273581685)
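Before dropping these rows, it can help to check which non-positive values actually occur, since a single repeated value would be consistent with a sentinel encoding. A minimal sketch on toy data follows; the values are illustrative, not drawn from the dataset.

```python
import pandas as pd

# Toy size column containing two kinds of non-positive values
toy_sizes = pd.Series([-1.0, -1.0, 0.0, 3.2, 4.8], name="system_size_kw")

# Tabulate the distinct non-positive values and how often each occurs
non_positive = toy_sizes[toy_sizes <= 0]
non_positive_counts = non_positive.value_counts().sort_index()
print(non_positive_counts)
```

The same tabulation on `df_size_representation.loc[invalid_size_mask, "system_size_kw"]` would show whether the excluded values concentrate on one encoding or are spread across many.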

In [49]:
df_size_representation_clean = (
    df_size_representation
    [df_size_representation["system_size_kw"] > 0]
    .copy()
)

df_size_representation_clean.shape


(103313, 3)

In [50]:
df_size_representation_clean["system_size_kw"].describe()


count    103313.000000
mean          5.086387
std          20.834064
min           0.002177
25%           2.880000
50%           3.825489
75%           5.148387
max        2087.783673
Name: system_size_kw, dtype: float64

In [51]:
df_system_size = df_admissible.merge(
    df_size_representation_clean,
    on="tts_link_id",
    how="inner"
)

df_system_size.shape


(103313, 9)
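The row count matching the cleaned size table suggests the merge is one-to-one. pandas can enforce this directly through the `validate` argument of `merge`, as the toy sketch below shows; the frames and values are illustrative.

```python
import pandas as pd

left = pd.DataFrame({"tts_link_id": ["a", "b", "c"], "n_system_sizes": [1, 2, 3]})
right = pd.DataFrame({"tts_link_id": ["a", "c"], "system_size_kw": [4.0, 2.5]})

# validate="one_to_one" raises if the key is duplicated on either side
merged = left.merge(right, on="tts_link_id", how="inner", validate="one_to_one")

# A duplicated key on the right side triggers a MergeError
dup = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    left.merge(dup, on="tts_link_id", how="inner", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(merged.shape, raised)
```

Using the check at merge time turns the "one row per system" property into an enforced condition rather than something inferred from the output shape.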

In [52]:
SYSTEM_SIZE_INDEX_PATH = Path("../outputs/system_size_index.parquet")

df_system_size.to_parquet(SYSTEM_SIZE_INDEX_PATH, index=False)

SYSTEM_SIZE_INDEX_PATH


WindowsPath('../outputs/system_size_index.parquet')

### Phase 3 Complete — Size Representation and Admissibility Finalized

A canonical system-size representation has been defined using the first
reported size value. Measurement admissibility rules were applied to exclude
non-physical size encodings (≤ 0 kW), resulting in the exclusion of 16.0% of
otherwise admissible systems.

The resulting system-size index preserves one row per system with a
physically admissible size measurement and serves as the sole input for
downstream descriptive size analysis. No inferential claims are made at this
stage.
