## Notebook 2 — System Size Descriptives and Baselines

This notebook operates on the **system-level base artifact** produced in
Notebook 1, which preserves canonical system identity (`tts_link_id`) while
explicitly recording reporting multiplicity and instability.

Empirical inspection in Notebook 1 showed that the raw Tracking the Sun dataset
does not contain stable system-level attributes; nearly all columns vary
across administrative records for the same system. As a result, no physical
system characteristics (e.g. size, price, installation date) were reconciled
or inferred at the system level in the prior step.

The purpose of this notebook is to introduce **descriptive analytical
semantics** in a controlled, population-aware manner. In this context,
“baselines” refer to **descriptive reference distributions**, not inferential
expectations or normative system values.

### Responsibilities

This notebook is responsible for:

- defining an explicit system-size representation for descriptive analysis,
- establishing admissible system and temporal cohorts,
- characterizing empirical size distributions by context,
- and describing dispersion and temporal drift without inference.

### Outputs

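This notebook produces the following primary artifacts for downstream analysis:

- `size_distributions.parquet`  
  Empirical size distributions (counts, quantiles, bounds) partitioned by
  explicitly defined contexts (e.g. installation-year cohort), derived from
  admissible systems.

- `size_baselines.parquet`  
  Descriptive reference summaries (counts, median, quartiles) derived from
  size distributions and intended for comparative, not inferential, use.

It also persists four supporting artifacts: `admissible_system_index.parquet`,
`system_size_index.parquet`, `system_temporal_index.parquet`, and
`size_dispersion.parquet`.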

All outputs are **descriptive** and preserve the distinction between observed
reporting behavior and inferred system characteristics.

This notebook answers the question:

**What do reported system sizes look like across time and cohorts, and which
systems are admissible for downstream analytical modeling?**



## Phase 1 — Load System-Level Base & Diagnostic Inspection

This phase establishes the analytical substrate for this notebook by loading
the system-level base artifact produced in Notebook 1.

The purpose of this phase is **inspection only**:
- confirm dataset shape and integrity,
- review system-level diagnostic indicators,
- and establish baseline counts prior to any filtering or transformation.

No systems are excluded, no size representations are defined, and no temporal
semantics are introduced in this phase.
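
The integrity checks named in this phase could be made explicit along the following lines. This is a minimal sketch on a toy frame: the helper name `check_system_base` and the specific checks (duplicate or null `tts_link_id` values) are assumptions for illustration, not a contract stated by Notebook 1.

```python
import pandas as pd


def check_system_base(df: pd.DataFrame, key: str = "tts_link_id") -> dict:
    """Return simple integrity counts for a system-level table (hypothetical helper)."""
    return {
        "n_rows": len(df),
        # More than zero means the table is not one row per system.
        "n_duplicate_ids": int(df[key].duplicated().sum()),
        # Missing identifiers cannot be joined downstream.
        "n_null_ids": int(df[key].isna().sum()),
    }


# Toy data only; the real input would be the frame loaded from the parquet file.
demo = pd.DataFrame(
    {"tts_link_id": ["A", "B", "B", None], "n_rows": [2, 2, 3, 2]}
)
report = check_system_base(demo)
print(report)
```

Because the helper only counts and never filters, it stays consistent with the inspection-only scope of this phase.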

In [54]:
from pathlib import Path
import pandas as pd
import os

# Resolve system-level base path
SYSTEM_BASE_PATH = Path("../outputs/system_level_base.parquet")

if not SYSTEM_BASE_PATH.exists():
    raise FileNotFoundError(
        f"System-level base artifact not found at: {SYSTEM_BASE_PATH}"
    )

# Load system-level base
df_system = pd.read_parquet(SYSTEM_BASE_PATH)

df_system.shape


(123178, 7)

### System-Level Diagnostic Overview

This section inspects the structure and diagnostic indicators contained in the
system-level base artifact. The goal is to understand what information is
available to support admissibility decisions in later phases.

No filtering or transformation is performed here.


In [55]:
df_system.dtypes

tts_link_id             object
n_rows                   int64
n_installation_dates     int64
n_system_sizes           int64
n_prices                 int64
has_expansion             bool
has_multiple_phases       bool
dtype: object

### Baseline System Counts and Diagnostic Distributions

This section records baseline counts and the distributions of diagnostic
indicators prior to any admissibility filtering. These summaries serve as
reference points for all subsequent exclusions and cohort definitions.


In [56]:
df_system.shape[0]


123178

In [57]:
df_system[[
    "n_rows",
    "n_installation_dates",
    "n_system_sizes",
    "n_prices",
    "has_expansion",
    "has_multiple_phases"
]].describe(include="all")


| | n_rows | n_installation_dates | n_system_sizes | n_prices | has_expansion | has_multiple_phases |
|---|---|---|---|---|---|---|
| count | 123178.0 | 123178.0 | 123178.0 | 123178.0 | 123178 | 123178 |
| unique | | | | | 1 | 1 |
| top | | | | | True | True |
| freq | | | | | 123178 | 123178 |
| mean | 2.152462 | 2.064679 | 2.045333 | 2.058168 | | |
| std | 39.521513 | 10.239701 | 12.291587 | 20.413945 | | |
| min | 2.0 | 0.0 | 1.0 | 1.0 | | |
| 25% | 2.0 | 2.0 | 2.0 | 2.0 | | |
| 50% | 2.0 | 2.0 | 2.0 | 2.0 | | |
| 75% | 2.0 | 2.0 | 2.0 | 2.0 | | |


## Phase 2 — System Admissibility and Stability Filtering

This phase defines which system identities are admissible for descriptive
system-size analysis based on observed reporting behavior.

Using the diagnostic indicators constructed in Notebook 1, this phase
characterizes system-level instability and establishes **explicit,
distribution-aware admissibility criteria**. These criteria are used to
exclude systems whose reporting behavior is too unstable to support
meaningful descriptive summaries.

All exclusions in this phase are empirical and transparent. No physical
system attributes are inferred, reconciled, or modeled.


In [58]:
df_system[[
    "n_rows",
    "n_installation_dates",
    "n_system_sizes",
    "n_prices"
]].quantile([0.50, 0.75, 0.90, 0.95, 0.99])


| quantile | n_rows | n_installation_dates | n_system_sizes | n_prices |
|---|---|---|---|---|
| 0.5 | 2.0 | 2.0 | 2.0 | 2.0 |
| 0.75 | 2.0 | 2.0 | 2.0 | 2.0 |
| 0.9 | 2.0 | 2.0 | 2.0 | 2.0 |
| 0.95 | 2.0 | 2.0 | 2.0 | 2.0 |
| 0.99 | 3.0 | 3.0 | 3.0 | 3.0 |


### Admissibility Rule for System-Size Descriptives

Based on empirical instability distributions, systems are considered
admissible for descriptive system-size analysis if they report no more
than three distinct system sizes (`n_system_sizes ≤ 3`).

This threshold corresponds to the 99th percentile of observed reporting
behavior and excludes systems whose size instability is pathological
rather than representative.
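
One way to keep the threshold tied to the data is to derive it from the empirical quantile rather than hard-coding it. The sketch below uses a toy series whose values are illustrative only; the choice of `interpolation="lower"` (so the cutoff is always an observed count) is an assumption, not something this notebook specifies.

```python
import pandas as pd

# Toy stand-in for the n_system_sizes diagnostic column (illustrative values).
n_system_sizes = pd.Series([2] * 98 + [3, 40], name="n_system_sizes")

# Take the 99th percentile as an observed value rather than an interpolated one.
threshold = int(n_system_sizes.quantile(0.99, interpolation="lower"))

# Systems at or below the cutoff are admissible; the rest are excluded.
admissible_mask = n_system_sizes <= threshold
n_admissible = int(admissible_mask.sum())
n_excluded = int((~admissible_mask).sum())

print(threshold, n_admissible, n_excluded)
```

Recording the derived value next to the fixed constant makes it easy to notice if a refreshed dataset would move the cutoff.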


In [59]:
# Define admissibility threshold
MAX_SYSTEM_SIZES = 3

# Apply filter
df_admissible = df_system[df_system["n_system_sizes"] <= MAX_SYSTEM_SIZES].copy()

# Record counts
total_systems = df_system.shape[0]
admissible_systems = df_admissible.shape[0]
excluded_systems = total_systems - admissible_systems

total_systems, admissible_systems, excluded_systems


(123178, 122998, 180)

Applying the admissibility rule (`n_system_sizes ≤ 3`) excludes 180 systems
(~0.15% of the universe), indicating that pathological size instability is rare
and that the admissible set remains representative of the overall population.

In [60]:
# Persist admissible system index for downstream notebooks
ADMISSIBLE_SYSTEM_INDEX_PATH = Path("../outputs/admissible_system_index.parquet")

df_admissible.to_parquet(ADMISSIBLE_SYSTEM_INDEX_PATH, index=False)

ADMISSIBLE_SYSTEM_INDEX_PATH


WindowsPath('../outputs/admissible_system_index.parquet')

## Phase 3 — System Size Representation Definition

At this stage, system identity has been established and admissible systems
have been selected based on observed reporting stability. However, system
size remains a non-invariant attribute: multiple reported sizes may exist
for a single system due to administrative corrections, phased reporting,
or programmatic updates.

This phase defines an explicit **system-size representation** to support
descriptive analysis. The representation chosen here is not asserted as the
true physical system size; it is a **constrained descriptive projection**
applied uniformly across admissible systems.

The choice of size representation is documented explicitly and is intended
to be:
- stable under admissible reporting variation,
- reversible in downstream analysis,
- and appropriate for non-inferential descriptive summaries.

Alternative representations may be evaluated in later notebooks where
stronger admissibility constraints are applied.


### Candidate System-Size Representations

System size is not invariant at the system level. For admissible systems
(`n_system_sizes ≤ 3`), multiple reported size values may still exist due to
administrative corrections or phased reporting.

Before choosing a representation, we enumerate and inspect **candidate
descriptive representations** that can be applied uniformly:

- **first_reported_size**: the earliest reported size for the system
- **last_reported_size**: the most recent reported size for the system
- **modal_size**: the most frequently reported size for the system

This step inspects how these candidates behave empirically without yet
asserting any one of them as the chosen representation.
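
The three candidates can be sketched on toy records as follows. The `report_order` column is a hypothetical ordering key introduced only for this example; "earliest" and "most recent" are only meaningful relative to some such key, and which raw column should play that role is an assumption left open here.

```python
import pandas as pd

# Toy records; report_order is a hypothetical key defining reporting sequence.
records = pd.DataFrame(
    {
        "tts_link_id": ["A", "A", "A", "B", "B"],
        "report_order": [1, 2, 3, 1, 2],
        "pv_system_size_dc": [5.0, 3.0, 3.0, 4.0, 6.0],
    }
)

# Order within each system by the reporting key, not by the size value.
ordered = records.sort_values(["tts_link_id", "report_order"], kind="stable")

candidates = (
    ordered.groupby("tts_link_id")["pv_system_size_dc"]
    .agg(
        first_reported_size="first",
        last_reported_size="last",
        # Series.mode() returns modes in sorted order, so ties resolve
        # to the smallest value.
        modal_size=lambda s: s.mode().iloc[0],
        n_size_reports="count",
    )
    .reset_index()
)
print(candidates)
```

System B shows the tie behaviour: both sizes occur once, so the modal candidate falls back to the smaller of the two.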


In [61]:
# Load raw Tracking the Sun data (size column only)

RAW_DATA_PATH = Path(os.environ["TRACKING_THE_SUN_DATA"])

df_raw = pd.read_parquet(
    RAW_DATA_PATH,
    columns=["tts_link_id", "pv_system_size_dc"]
)

# Restrict to admissible systems only
df_size_raw = df_raw.merge(
    df_admissible[["tts_link_id"]],
    on="tts_link_id",
    how="inner"
)

df_size_raw.shape


(250366, 2)

### Enumerating Candidate Size Representations

For each admissible system, we compute candidate descriptive representations
of system size to evaluate their empirical behavior and degree of agreement.

At this stage, no representation is selected. The goal is to observe how
first-reported, last-reported, and modal size values compare across systems
with limited size instability.


In [62]:
# Drop missing size values
df_size_clean = df_size_raw.dropna(subset=["pv_system_size_dc"]).copy()

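# Sort for a deterministic first/last. NOTE: ordering is by size value, so
# "first" is each system's smallest reported size and "last" its largest,
# not the chronologically earliest or latest report.
df_size_clean = df_size_clean.sort_values(
    ["tts_link_id", "pv_system_size_dc"]
)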

# Compute candidate representations
size_candidates = (
    df_size_clean
    .groupby("tts_link_id")["pv_system_size_dc"]
    .agg(
        first_reported_size="first",
        last_reported_size="last",
        modal_size=lambda x: x.mode().iloc[0] if not x.mode().empty else None,
        n_size_reports="count"
    )
    .reset_index()
)

size_candidates.shape


(122998, 5)

### Comparing Candidate Size Representations

To select a system-size representation that minimizes epistemic distortion,
we compare first-reported, last-reported, and modal size values across
admissible systems.

This comparison focuses on:
- frequency of disagreement between representations,
- and magnitude of differences when disagreement occurs.

The representation with the highest agreement and lowest distortion will be
selected for descriptive analysis.
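
The second criterion, the magnitude of differences when representations disagree, could be summarized roughly as below. The frame and its values are toy data, and the particular summaries (median absolute and median relative difference) are assumptions chosen for illustration.

```python
import pandas as pd

# Toy candidate table (illustrative values, not notebook results).
demo = pd.DataFrame(
    {
        "tts_link_id": ["A", "B", "C"],
        "first_reported_size": [3.0, 4.0, 5.0],
        "last_reported_size": [3.0, 6.0, 5.5],
    }
)

abs_diff = (demo["last_reported_size"] - demo["first_reported_size"]).abs()
disagree = abs_diff > 0

summary = {
    # Frequency of disagreement between the two representations.
    "share_disagree": float(disagree.mean()),
    # Magnitude of disagreement, computed only where they differ.
    "median_abs_diff_kw": float(abs_diff[disagree].median()),
    "median_rel_diff": float(
        (abs_diff[disagree] / demo.loc[disagree, "first_reported_size"]).median()
    ),
}
print(summary)
```

Reporting magnitude alongside frequency helps distinguish many small corrections from a few large revisions.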

In [63]:
# Pairwise agreement indicators
size_candidates["first_equals_last"] = (
    size_candidates["first_reported_size"] ==
    size_candidates["last_reported_size"]
)

size_candidates["first_equals_modal"] = (
    size_candidates["first_reported_size"] ==
    size_candidates["modal_size"]
)

size_candidates["last_equals_modal"] = (
    size_candidates["last_reported_size"] ==
    size_candidates["modal_size"]
)

size_candidates[[
    "first_equals_last",
    "first_equals_modal",
    "last_equals_modal"
]].mean()


first_equals_last     0.024342
first_equals_modal    0.998911
last_equals_modal     0.025350
dtype: float64

### Selected System-Size Representation

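Empirical comparison of candidate representations shows that the first
reported system size agrees with the modal (most frequently reported) size
for 99.89% of admissible systems, whereas the last reported size matches the
first in only 2.4% of systems and the mode in only 2.5%.

Based on this evidence, **first reported size** is selected as the system-size
representation for descriptive analysis. This choice minimizes epistemic
distortion by preserving the administratively stable size value without
introducing additional assumptions or smoothing.

One caveat applies to how these figures are read. Records were sorted by size
value before aggregation, so "first" and "last" here are each system's
smallest and largest reported sizes rather than its chronologically earliest
and latest ones, and ties in the mode resolve to the smallest value. The high
first-versus-modal agreement partly reflects that ordering, and any system
with a non-positive sentinel among its records will have that sentinel
selected as its first reported size.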


In [64]:
# Construct canonical system size representation
df_size_representation = size_candidates[[
    "tts_link_id",
    "first_reported_size",
    "n_size_reports"
]].rename(columns={
    "first_reported_size": "system_size_kw"
})

df_size_representation.shape


(122998, 3)

In [65]:
# Missingness check
df_size_representation["system_size_kw"].isna().sum()


0

In [66]:
# Basic bounds inspection
df_size_representation["system_size_kw"].describe()


count    122998.000000
mean          4.112302
std          19.224189
min          -1.000000
25%           2.268000
50%           3.400000
75%           4.800000
max        2087.783673
Name: system_size_kw, dtype: float64

### Size Measurement Admissibility

System size is a physical quantity and must be strictly positive. Any
non-positive values are treated as non-admissible measurement encodings
(e.g., sentinel values) and are excluded from size-based analysis.
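
The positivity rule can also be applied at the record level, before a representative size is chosen, so that a single sentinel record does not determine a system's value. The sketch below shows that alternative ordering on toy data; it is an assumption-laden illustration, not the procedure used in the cells that follow, and the column names `size_valid_kw`, `n_valid`, and `smallest_valid` are hypothetical.

```python
import pandas as pd

# Toy records; -1.0 plays the role of a sentinel encoding.
records = pd.DataFrame(
    {
        "tts_link_id": ["A", "A", "B", "C"],
        "pv_system_size_dc": [-1.0, 4.2, 3.0, -1.0],
    }
)

# Mask non-positive sizes as missing instead of dropping whole systems.
records["size_valid_kw"] = records["pv_system_size_dc"].where(
    records["pv_system_size_dc"] > 0
)

per_system = records.groupby("tts_link_id")["size_valid_kw"].agg(
    n_valid="count",        # number of physically admissible reports
    smallest_valid="min",   # NaN when a system has no admissible report
)
print(per_system)
```

Here system A keeps its valid 4.2 kW report, while system C, which has only a sentinel, ends up with no admissible size.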


In [67]:
# Identify non-admissible size values
invalid_size_mask = df_size_representation["system_size_kw"] <= 0

invalid_size_count = invalid_size_mask.sum()
invalid_size_count


19685

In [68]:
total_systems = df_size_representation.shape[0]
invalid_pct = invalid_size_count / total_systems * 100

invalid_size_count, total_systems, invalid_pct


(19685, 122998, 16.004325273581685)

In [69]:
df_size_representation_clean = (
    df_size_representation
    [df_size_representation["system_size_kw"] > 0]
    .copy()
)

df_size_representation_clean.shape


(103313, 3)

In [70]:
df_size_representation_clean["system_size_kw"].describe()


count    103313.000000
mean          5.086387
std          20.834064
min           0.002177
25%           2.880000
50%           3.825489
75%           5.148387
max        2087.783673
Name: system_size_kw, dtype: float64

In [71]:
df_system_size = df_admissible.merge(
    df_size_representation_clean,
    on="tts_link_id",
    how="inner"
)

df_system_size.shape


(103313, 9)

In [72]:
SYSTEM_SIZE_INDEX_PATH = Path("../outputs/system_size_index.parquet")

df_system_size.to_parquet(SYSTEM_SIZE_INDEX_PATH, index=False)

SYSTEM_SIZE_INDEX_PATH


WindowsPath('../outputs/system_size_index.parquet')

### Phase 3 Complete — Size Representation and Admissibility Finalized

A canonical system-size representation has been defined using the first
reported size value. Measurement admissibility rules were applied to exclude
non-physical size encodings (≤ 0 kW), resulting in the exclusion of 16.0% of
otherwise admissible systems.

The resulting system-size index preserves one row per system with a
physically admissible size measurement and serves as the sole input for
downstream descriptive size analysis. No inferential claims are made at this
stage.


## Phase 4 — Temporal Semantics and Cohort Construction

This phase defines how time is represented, bounded, and permitted to enter
downstream analysis. Time is treated strictly as a descriptive label used to
partition observations, not as a directional or causal variable.

No temporal comparisons, trends, or inferences are made in this phase.
All temporal constructs introduced here are explicitly defined and sealed
prior to descriptive analysis.


### Admissible Temporal Variable

The sole admissible temporal variable for system-level analysis is
`installation_date`, representing the reported installation completion date
of the system.

Other temporal fields (e.g., reporting updates or correction timestamps)
are explicitly excluded, as they reflect administrative processes rather
than system state.


In [74]:
# Reload raw data with only identity + installation date
df_temporal_raw = pd.read_parquet(
    RAW_DATA_PATH,
    columns=["tts_link_id", "installation_date"]
)

# Restrict to size-admissible systems only
df_temporal_raw = df_temporal_raw.merge(
    df_system_size[["tts_link_id"]],
    on="tts_link_id",
    how="inner"
)

df_temporal_raw.shape


(208956, 2)

In [76]:
# Parse installation_date to datetime
df_temporal_raw["installation_date_parsed"] = pd.to_datetime(
    df_temporal_raw["installation_date"],
    errors="coerce"
)

df_temporal_raw[["installation_date", "installation_date_parsed"]].head()


| | installation_date | installation_date_parsed |
|---|---|---|
| 0 | 2017-11-06 | 2017-11-06 |
| 1 | 2017-11-06 | 2017-11-06 |
| 2 | 2017-11-06 | 2017-11-06 |
| 3 | 2017-11-06 | 2017-11-06 |
| 4 | 2017-11-06 | 2017-11-06 |


In [77]:
# Count missing or unparsable dates
missing_dates = df_temporal_raw["installation_date_parsed"].isna().sum()

total_rows = df_temporal_raw.shape[0]
missing_pct = missing_dates / total_rows * 100

missing_dates, total_rows, missing_pct


(14, 208956, 0.006699975114378146)

In [78]:
# Inspect temporal range
df_temporal_raw["installation_date_parsed"].agg(
    min_date="min",
    max_date="max"
)

min_date   1997-03-21
max_date   2024-02-09
Name: installation_date_parsed, dtype: datetime64[ns]

In [79]:
# Derive installation year
df_temporal_raw["installation_year"] = (
    df_temporal_raw["installation_date_parsed"]
    .dt.year
)

df_temporal_raw[["installation_date_parsed", "installation_year"]].head()


| | installation_date_parsed | installation_year |
|---|---|---|
| 0 | 2017-11-06 | 2017.0 |
| 1 | 2017-11-06 | 2017.0 |
| 2 | 2017-11-06 | 2017.0 |
| 3 | 2017-11-06 | 2017.0 |
| 4 | 2017-11-06 | 2017.0 |


In [80]:
# Year coverage diagnostics
df_temporal_raw["installation_year"].value_counts().sort_index()


installation_year
1997.0        1
1998.0        4
1999.0       14
2000.0       22
2001.0      199
2002.0      330
2003.0      432
2004.0      552
2005.0      514
2006.0      790
2007.0     1209
2008.0     1252
2009.0     1868
2010.0     2051
2011.0     3010
2012.0     4343
2013.0     8368
2014.0    11451
2015.0    15189
2016.0    16352
2017.0    11138
2018.0    12416
2019.0    14360
2020.0    16603
2021.0    21642
2022.0    30518
2023.0    34290
2024.0       24
Name: count, dtype: int64

### Temporal Scope Restriction

Although parsed installation dates span 1997-03-21 through 2024-02-09, the
analytical scope of this project is explicitly restricted to installation
years 2021 through 2023, which this project treats as the stable reporting
regime.

Earlier installation years are excluded from analysis to avoid conflating
changes in reporting coverage, administrative practices, and market maturity
with structural characteristics of system size. The 24 records dated 2024
also fall outside this window and are dropped by the same filter. This
restriction is a scope definition, not a data quality judgment.

All downstream analyses operate exclusively within this temporally bounded
universe.
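
The scope restriction and the later collapse to one row per system can be sketched end to end on toy records. All values below are illustrative; the extra `n_years_in_scope` column is a hypothetical diagnostic, added here only to show how systems with records in more than one in-scope year could be surfaced.

```python
import pandas as pd

# Toy records (illustrative only).
records = pd.DataFrame(
    {
        "tts_link_id": ["A", "A", "B", "C", "D"],
        "installation_date": [
            "2021-05-01", "2022-01-15", "2019-07-04", "not a date", "2023-12-31",
        ],
    }
)

# Unparsable dates become NaT, so their year is NaN and fails the range test.
parsed = pd.to_datetime(records["installation_date"], errors="coerce")
records["installation_year"] = parsed.dt.year

in_scope = records[records["installation_year"].between(2021, 2023)].copy()

per_system = (
    in_scope.groupby("tts_link_id")["installation_year"]
    .agg(installation_year_cohort="min", n_years_in_scope="nunique")
    .reset_index()
)
per_system["installation_year_cohort"] = (
    per_system["installation_year_cohort"].astype(int)
)
print(per_system)
```

System A has records in both 2021 and 2022 and is assigned to the 2021 cohort by the `min` rule; the diagnostic column makes that assignment visible.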


In [81]:
# Restrict to stable reporting regime (2021–2023)
df_temporal_clean = df_temporal_raw[
    df_temporal_raw["installation_year"].between(2021, 2023)
].copy()

df_temporal_clean.shape


(86450, 4)

In [82]:
# Define sealed installation-year cohorts
df_temporal_clean["installation_year_cohort"] = (
    df_temporal_clean["installation_year"].astype(int)
)


In [84]:
# Collapse to system-level temporal index. A system with records in several
# in-scope years is assigned its earliest in-scope year (min).
df_temporal_system = (
    df_temporal_clean
    .groupby("tts_link_id")
    .agg(
        installation_year=("installation_year", "min"),
        installation_year_cohort=("installation_year_cohort", "min")
    )
    .reset_index()
)

df_temporal_system.shape


(76046, 3)

In [85]:
TEMPORAL_INDEX_PATH = Path("../outputs/system_temporal_index.parquet")

df_temporal_system.to_parquet(TEMPORAL_INDEX_PATH, index=False)

TEMPORAL_INDEX_PATH


WindowsPath('../outputs/system_temporal_index.parquet')

### Phase 4 Complete — Temporal Scope and Cohorts Finalized

Temporal scope for analysis has been explicitly restricted to the stable
reporting regime spanning installation years 2021 through 2023. Installation
year is represented as a sealed cohort label used solely for partitioning
observations.

The resulting system temporal index preserves one row per size-admissible
system that has an in-scope installation year. No temporal trends,
comparisons, or inferences are asserted at this stage.
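
The exclusions applied through Phase 4 can be tallied in one place. The counts below are copied from the cell outputs recorded earlier in this notebook; the stage labels and the `funnel` structure are illustrative assumptions rather than an existing artifact.

```python
# Counts copied from the recorded cell outputs in this notebook.
stages = [
    ("system-level base", 123178),
    ("n_system_sizes <= 3", 122998),
    ("system_size_kw > 0", 103313),
    ("installation year 2021-2023", 76046),
]

base = stages[0][1]
previous = base
funnel = []
for label, count in stages:
    funnel.append(
        {
            "stage": label,
            "n_systems": count,
            # Systems removed by this stage relative to the previous one.
            "dropped_at_stage": previous - count,
            # Share of the original universe still present.
            "share_of_base_pct": round(100 * count / base, 2),
        }
    )
    previous = count

for row in funnel:
    print(row)
```

Read this way, the size-positivity and temporal-scope steps account for nearly all attrition, while the stability filter removes only 180 systems.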

## Phase 5 — Size Distributions by Context

This phase characterizes the empirical distribution of system sizes within
explicitly defined, non-overlapping contexts. A distribution is treated as
a descriptive object that captures spread and density without implying
normativity, expectation, or deviation.

All distributions are computed within sealed contexts and are stored as
data artifacts. No cross-context comparisons or temporal interpretations
are made in this phase.


In [86]:
# Load infrastructure artifacts
df_size = pd.read_parquet("../outputs/system_size_index.parquet")
df_time = pd.read_parquet("../outputs/system_temporal_index.parquet")

# Assemble analysis table
df_analysis = df_size.merge(
    df_time,
    on="tts_link_id",
    how="inner"
)

df_analysis.shape


(76046, 11)

### Context Definition

Distributions are computed within explicitly defined contexts that partition
systems without altering entity identity or analytical grain. In this phase,
the sole context is `installation_year_cohort`, representing sealed
installation-year cohorts defined in Phase 4.

All descriptive statistics are computed independently within each cohort.
No cross-cohort comparisons are made.
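
Per-cohort quantiles can also be computed in a single call rather than one lambda per quantile. The following is a sketch on toy data, assuming the same column names as the analysis table; the `QUANTILES` constant is a name introduced for this example.

```python
import pandas as pd

QUANTILES = [0.10, 0.25, 0.50, 0.75, 0.90]

# Toy analysis table (illustrative values).
df = pd.DataFrame(
    {
        "installation_year_cohort": [2021] * 5 + [2022] * 5,
        "system_size_kw": [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 4.0, 6.0, 8.0, 10.0],
    }
)

grouped = df.groupby("installation_year_cohort")["system_size_kw"]

# One row per cohort, one column per requested quantile.
q = grouped.quantile(QUANTILES).unstack()
q.columns = [f"p{int(round(c * 100))}" for c in q.columns]

# Attach counts and bounds, each computed independently within a cohort.
summary = q.assign(
    n_systems=grouped.count(),
    min_size=grouped.min(),
    max_size=grouped.max(),
).reset_index()
print(summary)
```

Every statistic is still computed within a single cohort, so the sealed-context constraint is unchanged; only the bookkeeping is more compact.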


In [87]:
# Compute size distributions by installation-year cohort
size_distributions = (
    df_analysis
    .groupby("installation_year_cohort")["system_size_kw"]
    .agg(
        n_systems="count",
        p10=lambda x: x.quantile(0.10),
        p25=lambda x: x.quantile(0.25),
        p50=lambda x: x.quantile(0.50),
        p75=lambda x: x.quantile(0.75),
        p90=lambda x: x.quantile(0.90),
        min_size="min",
        max_size="max"
    )
    .reset_index()
)

size_distributions


| | installation_year_cohort | n_systems | p10 | p25 | p50 | p75 | p90 | min_size | max_size |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021 | 21606 | 2.44 | 2.97 | 3.9 | 5.13 | 7.139135 | 0.047297 | 1159.2 |
| 1 | 2022 | 28144 | 2.45 | 3.12 | 4.0 | 5.3 | 7.142857 | 0.011828 | 945.72 |
| 2 | 2023 | 26296 | 2.380645 | 3.120857 | 4.02 | 5.53 | 7.494583 | 0.020147 | 1108.56 |


In [88]:
SIZE_DISTRIBUTIONS_PATH = Path("../outputs/size_distributions.parquet")

size_distributions.to_parquet(SIZE_DISTRIBUTIONS_PATH, index=False)

SIZE_DISTRIBUTIONS_PATH


WindowsPath('../outputs/size_distributions.parquet')

### Phase 5 Complete — Size Distributions by Context

Empirical size distributions have been computed within sealed
installation-year cohorts. Each distribution captures the spread and bounds
of observed system sizes without asserting norms, expectations, or
comparisons across contexts.

The resulting distribution artifact serves as a descriptive foundation for
subsequent baseline and deviation analysis.


## Phase 6 — Baselines (Expected Size by Context)

This phase defines context-conditional baseline system sizes derived from
empirical size distributions. A baseline represents a typical observed value
within a sealed context and is used solely as a reference point for subsequent
descriptive deviation analysis.

Baselines do not imply normative expectations, optimality, or correctness.
They are descriptive constructs anchored in observed distributions.


In [89]:
size_distributions = pd.read_parquet(
    "../outputs/size_distributions.parquet"
)

size_distributions


| | installation_year_cohort | n_systems | p10 | p25 | p50 | p75 | p90 | min_size | max_size |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021 | 21606 | 2.44 | 2.97 | 3.9 | 5.13 | 7.139135 | 0.047297 | 1159.2 |
| 1 | 2022 | 28144 | 2.45 | 3.12 | 4.0 | 5.3 | 7.142857 | 0.011828 | 945.72 |
| 2 | 2023 | 26296 | 2.380645 | 3.120857 | 4.02 | 5.53 | 7.494583 | 0.020147 | 1108.56 |


### Baseline Definition

The baseline system size within each context is defined as the median
(p50) of the observed size distribution. The median is selected because it
is robust to skewness and extreme values and reflects the central tendency
of observed system sizes without assuming symmetry or normality. In the
baseline table it is stored as `expected_system_size_kw`; the name denotes
the cohort median, not a statistical expectation.

Supporting quantiles are retained for transparency but are not treated as
baseline values.
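
The robustness argument can be seen in a small worked example. The sizes below are made up for illustration: four residential-scale values and one very large system, loosely mirroring the long right tail seen in the recorded maxima.

```python
from statistics import mean, median

# Illustrative sizes in kW; the last value stands in for an extreme system.
sizes_kw = [3.0, 3.5, 4.0, 4.5, 1000.0]

mean_kw = mean(sizes_kw)      # pulled far upward by the single large value
median_kw = median(sizes_kw)  # unaffected by how large the extreme value is

print(mean_kw, median_kw)
```

Replacing the largest value with any other number above 4.5 leaves the median unchanged, which is the property the baseline definition relies on.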


In [90]:
# Construct baseline table
size_baselines = (
    size_distributions[[
        "installation_year_cohort",
        "n_systems",
        "p25",
        "p50",
        "p75"
    ]]
    .rename(columns={
        "p50": "expected_system_size_kw"
    })
)

size_baselines


| | installation_year_cohort | n_systems | p25 | expected_system_size_kw | p75 |
|---|---|---|---|---|---|
| 0 | 2021 | 21606 | 2.97 | 3.9 | 5.13 |
| 1 | 2022 | 28144 | 3.12 | 4.0 | 5.3 |
| 2 | 2023 | 26296 | 3.120857 | 4.02 | 5.53 |


In [91]:
SIZE_BASELINES_PATH = Path("../outputs/size_baselines.parquet")

size_baselines.to_parquet(SIZE_BASELINES_PATH, index=False)

SIZE_BASELINES_PATH


WindowsPath('../outputs/size_baselines.parquet')

### Phase 6 Complete — Contextual Baselines Defined

Context-conditional baseline system sizes have been defined using the median
of observed size distributions. These baselines serve as descriptive reference
points for subsequent deviation and likelihood analysis and do not represent
normative expectations or targets.


## Phase 7 — Dispersion and Temporal Drift Characterization

This phase characterizes the spread of system sizes within each context and
describes how dispersion varies across contexts. Dispersion is treated as a
descriptive property of observed distributions and is not interpreted as
volatility, abnormality, or deviation from expectation.

Temporal drift refers solely to changes in distributional shape or spread
across sealed cohorts and does not imply directionality, causation, or trend.


In [92]:
size_distributions = pd.read_parquet(
    "../outputs/size_distributions.parquet"
)

size_distributions


| | installation_year_cohort | n_systems | p10 | p25 | p50 | p75 | p90 | min_size | max_size |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021 | 21606 | 2.44 | 2.97 | 3.9 | 5.13 | 7.139135 | 0.047297 | 1159.2 |
| 1 | 2022 | 28144 | 2.45 | 3.12 | 4.0 | 5.3 | 7.142857 | 0.011828 | 945.72 |
| 2 | 2023 | 26296 | 2.380645 | 3.120857 | 4.02 | 5.53 | 7.494583 | 0.020147 | 1108.56 |


### Dispersion Measures

Dispersion is quantified using robust, order-based statistics derived from
distribution quantiles. The following measures are computed:

- Interquartile Range (IQR): p75 − p25  
- Central Span: p90 − p10  

Minimum and maximum values are retained only as distributional bounds and are
not used as measures of dispersion.


In [93]:
# Compute dispersion metrics by context
size_dispersion = size_distributions.assign(
    iqr=lambda df: df["p75"] - df["p25"],
    p90_p10_span=lambda df: df["p90"] - df["p10"]
)[[
    "installation_year_cohort",
    "n_systems",
    "iqr",
    "p90_p10_span",
    "min_size",
    "max_size"
]]

size_dispersion


| | installation_year_cohort | n_systems | iqr | p90_p10_span | min_size | max_size |
|---|---|---|---|---|---|---|
| 0 | 2021 | 21606 | 2.16 | 4.699135 | 0.047297 | 1159.2 |
| 1 | 2022 | 28144 | 2.18 | 4.692857 | 0.011828 | 945.72 |
| 2 | 2023 | 26296 | 2.409143 | 5.113938 | 0.020147 | 1108.56 |


In [94]:
SIZE_DISPERSION_PATH = Path("../outputs/size_dispersion.parquet")

size_dispersion.to_parquet(SIZE_DISPERSION_PATH, index=False)

SIZE_DISPERSION_PATH


WindowsPath('../outputs/size_dispersion.parquet')

### Phase 7 Complete — Dispersion Characterized

Distributional dispersion of system sizes has been characterized within
sealed installation-year cohorts using robust quantile-based measures.
Observed changes in dispersion across cohorts are recorded descriptively
without implying abnormality, volatility, or trend.

The resulting dispersion artifact provides geometric context for subsequent
deviation and likelihood analysis.


## Notebook 2 Complete — System Size Descriptives and Baselines

This notebook completed the construction of system-size descriptive geometry
under explicit epistemic constraints. System sizes were represented using a
uniform, admissible definition; temporal scope was restricted to sealed
installation-year cohorts; and empirical size distributions, baselines, and
dispersion measures were computed without invoking inference, deviation, or
normative interpretation.

The resulting artifacts establish a controlled descriptive foundation for
subsequent structural, scaling, and regime analysis. No claims regarding
abnormality, risk, or causality are made at this stage. All downstream analyses
must treat these outputs as descriptive reference structures only.
