## Notebook 1 — Structural Degrees of Freedom at Fixed System Size

This notebook initiates Repo 3 by assembling a context-aware, size-conditioned
system-level dataset and formally defining the structural degrees of freedom
that remain once system size is held fixed.

All inputs consumed here are descriptive artifacts produced upstream in Repo 2.
This notebook performs **no inference, no deviation assessment, and no regime or
risk assignment**. System size is treated as a conditioning variable rather than
an object of analysis.

The sole purpose of this notebook is to:
- Assemble system-level data with explicit temporal context
- Declare and operationalize what it means to “hold size fixed”
- Define the structural dimensions that may vary orthogonally to size

All interpretive, comparative, or normative analysis is explicitly deferred to
subsequent repositories.


## Phase 1 — Input Artifact Audit and Role Assertion

This phase audits the upstream artifacts consumed by Repo 3 and asserts their
semantic role, grain, and admissible usage within this repository.

Repo 3 consumes **descriptive artifacts only**, produced and sealed in Repo 2.
Artifacts are classified into system-level anchors, cohort-level context tables,
and reference distributions. Only system-level and cohort-level artifacts may be
joined in this repository, and only at declared grains.

This phase performs no data loading or transformation. Its sole purpose is to
establish the analytical contract under which all subsequent operations occur.
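A contract of this kind can be made machine-checkable. The following is a minimal sketch, not the notebook's actual mechanism: the `Role`, `ArtifactContract`, `CONTRACTS`, and `assert_joinable` names are hypothetical, the role assigned to each file is an assumption made for illustration, and `example_reference` is a made-up artifact name standing in for a reference distribution.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    SYSTEM_ANCHOR = "system-level anchor"
    COHORT_CONTEXT = "cohort-level context table"
    REFERENCE_DISTRIBUTION = "reference distribution"


@dataclass(frozen=True)
class ArtifactContract:
    name: str
    role: Role
    grain: str  # key column at which the artifact has exactly one row


# Illustrative entries only: the role assignments are assumptions.
CONTRACTS = {
    c.name: c
    for c in [
        ArtifactContract("system_size_index", Role.SYSTEM_ANCHOR, "tts_link_id"),
        ArtifactContract("size_baselines", Role.COHORT_CONTEXT, "installation_year_cohort"),
        ArtifactContract("example_reference", Role.REFERENCE_DISTRIBUTION, "installation_year_cohort"),
    ]
}

# Only these roles may take part in joins.
JOINABLE_ROLES = {Role.SYSTEM_ANCHOR, Role.COHORT_CONTEXT}


def assert_joinable(name: str, on: str) -> None:
    """Raise if an artifact may not be joined, or is joined off its declared grain."""
    contract = CONTRACTS[name]
    if contract.role not in JOINABLE_ROLES:
        raise ValueError(f"{name} is a {contract.role.value} and may not be joined.")
    if on != contract.grain:
        raise ValueError(f"{name} must be joined on {contract.grain!r}, not {on!r}.")


# Passes silently: a cohort table joined at its declared grain.
assert_joinable("size_baselines", on="installation_year_cohort")
```

Calling `assert_joinable` immediately before each merge would turn the contract from a written convention into a failing check.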


## Phase 2 — Input Files Structural Audit

This phase inspects all input files provided to Repo 3 and reports their
row counts and column structures.

The purpose of this phase is to make the available data explicit before any
grain enforcement or invariant assertions are applied.


In [45]:
import pandas as pd
from pathlib import Path
import os

INPUT_DIR = Path("../inputs")

input_files = sorted(
    list(INPUT_DIR.glob("*.parquet")) + list(INPUT_DIR.glob("*.csv"))
)

for path in input_files:
    if path.suffix == ".parquet":
        df = pd.read_parquet(path)
    else:
        df = pd.read_csv(path)

    print(f"\nFILE: {path.name}")
    print(f"SHAPE: {df.shape}")
    print("COLUMNS:")
    for col in df.columns:
        print(f"  - {col}")



FILE: admissible_system_index.parquet
SHAPE: (122998, 7)
COLUMNS:
  - tts_link_id
  - n_rows
  - n_installation_dates
  - n_system_sizes
  - n_prices
  - has_expansion
  - has_multiple_phases

FILE: size_baselines.parquet
SHAPE: (23, 5)
COLUMNS:
  - installation_year_cohort
  - n_systems
  - p25
  - expected_system_size_kw
  - p75

FILE: size_dispersion.parquet
SHAPE: (23, 6)
COLUMNS:
  - installation_year_cohort
  - n_systems
  - iqr
  - p90_p10_span
  - min_size
  - max_size

FILE: size_distributions.parquet
SHAPE: (23, 9)
COLUMNS:
  - installation_year_cohort
  - n_systems
  - p10
  - p25
  - p50
  - p75
  - p90
  - min_size
  - max_size

FILE: system_level_base.parquet
SHAPE: (123178, 7)
COLUMNS:
  - tts_link_id
  - n_rows
  - n_installation_dates
  - n_system_sizes
  - n_prices
  - has_expansion
  - has_multiple_phases

FILE: system_size_index.parquet
SHAPE: (103313, 9)
COLUMNS:
  - tts_link_id
  - n_rows
  - n_installation_dates
  - n_system_sizes
  - n_prices
  - has_expansion


## Phase 3 — System-Level Assembly (Size × Temporal)

This phase assembles the canonical system-level table for Repo 3 by combining
system size information with system-level temporal context.

The result of this phase is a one-row-per-system dataset that includes both
system size and installation timing. All downstream operations in this
repository must operate exclusively on this assembled table.
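Because the assembly relies on an inner join, systems present in only one index are silently dropped. The toy sketch below (made-up IDs and values, not the notebook's data) shows one way to list such systems with `indicator=True`, and how `validate="one_to_one"` rejects duplicate keys.

```python
import pandas as pd

# Toy stand-ins for the size and temporal indexes.
size = pd.DataFrame(
    {"tts_link_id": ["a", "b", "c"], "system_size_kw": [2.4, 6.25, 4.41]}
)
time = pd.DataFrame(
    {"tts_link_id": ["a", "b", "d"], "installation_year": [2014, 2015, 2016]}
)

# An outer join with an indicator column exposes what an inner join would drop.
audit = size.merge(time, on="tts_link_id", how="outer", indicator=True)
dropped_by_inner = audit.loc[audit["_merge"] != "both", ["tts_link_id", "_merge"]]
print(dropped_by_inner)

# validate="one_to_one" raises if either side repeats a key.
dup_time = pd.concat([time, time.iloc[[0]]], ignore_index=True)
try:
    size.merge(dup_time, on="tts_link_id", how="inner", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print("duplicate key rejected:", raised)
```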



In [None]:
INPUT_DIR = Path("../inputs")

# Load system-level size index
df_size = pd.read_parquet(INPUT_DIR / "system_size_index.parquet")

# Load system-level temporal index
df_time = pd.read_parquet(INPUT_DIR / "system_temporal_index.parquet")

# Assemble system-level base table
df_system = df_size.merge(
    df_time,
    on="tts_link_id",
    how="inner",
    validate="one_to_one"
)

# --- Invariants ---
assert df_system["tts_link_id"].is_unique, (
    "System assembly must preserve one row per tts_link_id."
)

assert "system_size_kw" in df_system.columns, (
    "Assembled system table must include system_size_kw."
)

assert "installation_year" in df_system.columns, (
    "Assembled system table must include installation_year."
)

df_system.shape, df_system.head()




((103309, 11),
                tts_link_id  n_rows  n_installation_dates  n_system_sizes  \
 0      tts_extension_id_10       2                     2               2   
 1     tts_extension_id_100       2                     2               2   
 2    tts_extension_id_1000       2                     2               1   
 3   tts_extension_id_10000       2                     2               2   
 4  tts_extension_id_100000       2                     2               2   
 
    n_prices  has_expansion  has_multiple_phases  system_size_kw  \
 0         2           True                 True        2.400000   
 1         2           True                 True        6.250000   
 2         2           True                 True        4.410000   
 3         2           True                 True        5.565000   
 4         2           True                 True        3.263644   
 
    n_size_reports  installation_year  installation_year_cohort  
 0               2             2014.0        

## Phase 4 — Cohort-Level Context Attachment

This phase attaches cohort-level descriptive context to the system-level table
assembled in Phase 3.

Cohort artifacts provide descriptive baselines, dispersion, and distributional
summaries keyed by `installation_year_cohort`. These artifacts are joined strictly
as lookup tables.

System grain must be preserved. Each system remains represented by exactly one
row. No aggregation, collapsing, or interpretive analysis is performed in this
phase.
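The lookup-join pattern, and the column collisions it produces when cohort tables share column names, can be seen on a toy example. This is a sketch with made-up values; the `suffixes=("_baseline", "_distribution")` variant is an alternative shown for comparison, not what the notebook does.

```python
import pandas as pd

systems = pd.DataFrame(
    {"tts_link_id": ["a", "b", "c"], "installation_year_cohort": [2014, 2014, 2015]}
)
baselines = pd.DataFrame(
    {"installation_year_cohort": [2014, 2015], "p25": [3.0, 3.5]}
)
distributions = pd.DataFrame(
    {"installation_year_cohort": [2014, 2015], "p25": [3.0, 3.5], "p50": [4.0, 5.0]}
)

# Many systems map to one cohort row; the system grain is unchanged.
step1 = systems.merge(
    baselines, on="installation_year_cohort", how="left", validate="many_to_one"
)

# A second table that also has "p25" collides: pandas appends "_x" to the
# left (already merged) column and "_y" to the right (incoming) column.
step2 = step1.merge(
    distributions, on="installation_year_cohort", how="left", validate="many_to_one"
)
print(list(step2.columns))

# Alternative: name the provenance at merge time instead of renaming later.
step2_named = step1.merge(
    distributions,
    on="installation_year_cohort",
    how="left",
    validate="many_to_one",
    suffixes=("_baseline", "_distribution"),
)
print(list(step2_named.columns))
```

In both variants the row count stays at three, one per `tts_link_id`.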


In [47]:
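# Load cohort-level context tables (one row per installation_year_cohort)
df_size_baselines = pd.read_parquet(INPUT_DIR / "size_baselines.parquet")
df_size_dispersion = pd.read_parquet(INPUT_DIR / "size_dispersion.parquet")
df_size_distributions = pd.read_parquet(INPUT_DIR / "size_distributions.parquet")

# Attach baseline expectations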
df_system_context = df_system.merge(
    df_size_baselines,
    on="installation_year_cohort",
    how="left",
    validate="many_to_one"
)

# Attach dispersion metrics
df_system_context = df_system_context.merge(
    df_size_dispersion,
    on="installation_year_cohort",
    how="left",
    validate="many_to_one"
)

# Attach distribution summaries
df_system_context = df_system_context.merge(
    df_size_distributions,
    on="installation_year_cohort",
    how="left",
    validate="many_to_one"
)

# --- Invariant check ---
assert df_system_context["tts_link_id"].is_unique, (
    "Cohort context attachment must preserve one row per tts_link_id."
)

df_system_context.shape, df_system_context.head()


((103309, 28),
                tts_link_id  n_rows  n_installation_dates  n_system_sizes  \
 0      tts_extension_id_10       2                     2               2   
 1     tts_extension_id_100       2                     2               2   
 2    tts_extension_id_1000       2                     2               1   
 3   tts_extension_id_10000       2                     2               2   
 4  tts_extension_id_100000       2                     2               2   
 
    n_prices  has_expansion  has_multiple_phases  system_size_kw  \
 0         2           True                 True        2.400000   
 1         2           True                 True        6.250000   
 2         2           True                 True        4.410000   
 3         2           True                 True        5.565000   
 4         2           True                 True        3.263644   
 
    n_size_reports  installation_year  ...  min_size_x   max_size_x  n_systems  \
 0               2           

In [48]:
# --- Phase 4 cleanup: resolve column collisions and canonicalize cohort columns ---

df = df_system_context.copy()

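# Explicit renaming map (canonical names only).
# Suffix provenance: pandas gives "_x" to the left (already merged) frame and
# "_y" to the right (incoming) frame. In the distributions merge the left frame
# already carries the baseline p25/p75 and the dispersion min/max, so "_x" is
# the earlier table and "_y" is the distributions table.
# Note: the un-suffixed n_systems from the distributions table is left as-is.
rename_map = {
    # Baseline (expected size reference)
    "expected_system_size_kw": "baseline_expected_system_size_kw",
    "p25_x": "baseline_p25",
    "p75_x": "baseline_p75",
    "n_systems_x": "cohort_n_systems",  # canonical cohort population size

    # Dispersion
    "iqr": "dispersion_iqr",
    "p90_p10_span": "dispersion_p90_p10_span",
    "min_size_x": "dispersion_min_size",
    "max_size_x": "dispersion_max_size",

    # Distribution envelope
    "p10": "distribution_p10",
    "p25_y": "distribution_p25",
    "p50": "distribution_p50",
    "p75_y": "distribution_p75",
    "p90": "distribution_p90",
    "min_size_y": "distribution_min_size",
    "max_size_y": "distribution_max_size",
}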

df = df.rename(columns=rename_map)

# Drop duplicate cohort population column if still present
cols_to_drop = [c for c in df.columns if c in {"n_systems_y"}]
df = df.drop(columns=cols_to_drop, errors="ignore")

# Final guard: no ambiguous suffixes allowed past Phase 4
ambiguous_cols = [c for c in df.columns if c.endswith("_x") or c.endswith("_y")]
assert not ambiguous_cols, f"Unresolved ambiguous columns remain: {ambiguous_cols}"

df_system_context = df
df_system_context.shape, df_system_context.head()


((103309, 27),
                tts_link_id  n_rows  n_installation_dates  n_system_sizes  \
 0      tts_extension_id_10       2                     2               2   
 1     tts_extension_id_100       2                     2               2   
 2    tts_extension_id_1000       2                     2               1   
 3   tts_extension_id_10000       2                     2               2   
 4  tts_extension_id_100000       2                     2               2   
 
    n_prices  has_expansion  has_multiple_phases  system_size_kw  \
 0         2           True                 True        2.400000   
 1         2           True                 True        6.250000   
 2         2           True                 True        4.410000   
 3         2           True                 True        5.565000   
 4         2           True                 True        3.263644   
 
    n_size_reports  installation_year  ...  dispersion_min_size  \
 0               2             2014.0  ...  

### Artifact Freeze — Contextualized Descriptive System Table

The table produced at the end of Phase 4 represents the fully assembled,
context-enriched, system-grain dataset derived exclusively from descriptive
artifacts.

This table is frozen prior to any size conditioning, structural analysis, or
interpretation. It is emitted both as a Repo 3 output and as a reusable artifact
for downstream repositories via the tts_artifacts repository.



In [49]:
# Repo 3 output path
REPO3_OUTPUT_DIR = Path("../outputs")
REPO3_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

repo3_output_path = REPO3_OUTPUT_DIR / "system_context_descriptive.parquet"

# tts_artifacts repo path (adjust relative depth if needed)
ARTIFACTS_DIR = Path("../../tts_artifacts")
ARTIFACTS_DIR.mkdir(parents=True, exist_ok=True)

artifacts_output_path = ARTIFACTS_DIR / "system_context_descriptive.parquet"

# Emit artifact to both locations
df_system_context.to_parquet(repo3_output_path, index=False)
df_system_context.to_parquet(artifacts_output_path, index=False)

repo3_output_path, artifacts_output_path, df_system_context.shape


(WindowsPath('../outputs/system_context_descriptive.parquet'),
 WindowsPath('../../tts_artifacts/system_context_descriptive.parquet'),
 (103309, 27))

## Phase 5 — Fixed-Size Conditioning

This phase defines what it means to “hold system size fixed” for the purposes of
structural analysis in Repo 3.

System size is conditioned relative to its installation-year cohort in order to
account for historical scale shifts and changing market baselines. Conditioning
is purely descriptive and establishes comparability, not evaluation.

No judgments regarding abnormality, deviation, efficiency, or risk are made in
this phase. Size conditioning serves only to place systems of different absolute
sizes onto a common, dimensionless scale suitable for downstream structural
analysis.

All subsequent phases operate on size-conditioned representations rather than
raw system size.
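As a worked illustration of cohort-relative conditioning, the sketch below uses made-up sizes and assumes, purely for illustration, that the cohort expectation is the cohort median; how `expected_system_size_kw` was computed upstream is not stated here. Two bands are used instead of ten to keep the toy readable.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "tts_link_id": list("abcdefgh"),
        "installation_year_cohort": [2010] * 4 + [2020] * 4,
        "system_size_kw": [2.0, 3.0, 4.0, 6.0, 4.0, 6.0, 8.0, 12.0],
    }
)

# Assumption: the cohort expectation is the cohort median system size.
toy["baseline_expected_system_size_kw"] = toy.groupby("installation_year_cohort")[
    "system_size_kw"
].transform("median")

# Dimensionless ratio of each system to its own cohort's expectation.
toy["size_index"] = toy["system_size_kw"] / toy["baseline_expected_system_size_kw"]

# Quantile bands over the pooled size index (2 bands here for readability).
toy["size_band"] = pd.qcut(toy["size_index"], q=2, labels=False, duplicates="drop")

print(toy[["tts_link_id", "system_size_kw", "size_index", "size_band"]])
```

In this toy, a 6 kW system from the 2010 cohort and a 12 kW system from the 2020 cohort receive the same `size_index` (6 / 3.5 = 12 / 7) and fall in the same band, which is the sense in which size is "held fixed" across cohorts.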


In [50]:
#  Size conditioning variable
# Cohort-relative size index (dimensionless)

# Defensive checks
required_cols = [
    "system_size_kw",
    "baseline_expected_system_size_kw",
    "installation_year_cohort",
]

missing = [c for c in required_cols if c not in df_system_context.columns]
assert not missing, f"Missing required columns for size conditioning: {missing}"

# Construct size index: ratio to cohort expectation
df_system_context["size_index"] = (
    df_system_context["system_size_kw"]
    / df_system_context["baseline_expected_system_size_kw"]
)

# Sanity checks
assert df_system_context["size_index"].notna().all(), (
    "size_index contains NaN values."
)

assert (df_system_context["size_index"] > 0).all(), (
    "size_index must be strictly positive."
)

# Quick distribution check (no interpretation)
df_system_context["size_index"].describe()


count    103309.000000
mean          1.342510
std           5.587414
min           0.000560
25%           0.759184
50%           1.000000
75%           1.354959
max         562.744925
Name: size_index, dtype: float64

In [51]:
# Size band construction (descriptive)

# Number of bands (adjustable, but fixed for now)
N_BANDS = 10

df_system_context["size_band"] = pd.qcut(
    df_system_context["size_index"],
    q=N_BANDS,
    labels=False,
    duplicates="drop"
)

# Sanity checks
assert df_system_context["size_band"].notna().all(), (
    "size_band contains NaN values."
)

df_system_context["size_band"].value_counts().sort_index()


size_band
0    10344
1    10324
2    10326
3    10354
4    10685
5    10112
6    10193
7    10309
8    10334
9    10328
Name: count, dtype: int64

## Phase 6 — Configuration Family Materialization and Collapse Rules
This phase materializes physical configuration dimensions at system grain by
collapsing raw component-level data into coherent configuration families.

Raw Tracking the Sun data describe system configuration across multiple rows and
component fields, reflecting reporting multiplicity, phased installation, and
partial disclosure. These representations are not directly admissible for
structural analysis.

Accordingly, this phase:
- identifies coherent physical configuration families,
- enumerates the raw columns associated with each family, and
- defines explicit, family-specific collapse rules that yield one deterministic
  configuration representation per system.

This phase performs no conditioning on system size, no measurement of variation,
and no inference. Its sole purpose is to construct admissible system-level
configuration dimensions for subsequent structural analysis.


### 6.1 Configuration Families Overview

Physical configuration in the Tracking the Sun dataset is not represented as a
single, atomic system description. Instead, it is encoded across multiple raw
columns and, in many cases, multiple rows per system, reflecting phased
installation, component-level reporting, and partial disclosure.

As a result, configuration dimensions cannot be collapsed or analyzed in
isolation. Individual raw columns (e.g., `module_quantity_1`,
`inverter_model_2`) do not carry independent meaning outside the context of
related fields that jointly describe the same physical subsystem.

To preserve physical interpretability and avoid introducing artificial
variation, configuration dimensions are therefore organized into **configuration
families**.

A configuration family is defined as a coherent group of raw variables that:
- jointly describe a single physical subsystem of the solar installation,
- share common reporting and missingness patterns,
- must be collapsed together to preserve semantic meaning, and
- collectively contribute to how system size is physically instantiated.

Configuration families serve two purposes in this phase:
1. They define the scope within which collapse rules are specified.
2. They establish the units of admissible physical configuration for downstream
   structural analysis.

Collapse is performed at the family level, yielding one deterministic,
system-level representation per family and per system (`tts_link_id`). No
assessment of variation or degrees of freedom is conducted at this stage.

The following sections enumerate each configuration family, identify the raw
columns associated with that family, and define explicit collapse rules that
produce admissible system-level configuration dimensions.


### 6.2 Module Configuration Family

The module configuration family describes how a system’s direct current (DC)
capacity is instantiated at the photovoltaic module level. This family captures
the physical composition of the array in terms of module count, power rating,
and technological characteristics.

In the raw Tracking the Sun data, module configuration is reported across
multiple indexed fields (e.g. `_1`, `_2`, `_3`) and may reflect mixed module
types, partial reporting, or placeholder values. Individual module-related
columns are therefore not semantically meaningful in isolation and must be
interpreted jointly.

#### Raw Columns Included

The following raw columns are associated with the module configuration family:

- Module identity and quantity  
  - `module_manufacturer_1`, `module_manufacturer_2`, `module_manufacturer_3`  
  - `module_model_1`, `module_model_2`, `module_model_3`  
  - `module_quantity_1`, `module_quantity_2`, `module_quantity_3`  
  - `additional_modules`

- Electrical characteristics  
  - `nameplate_capacity_module_1`, `nameplate_capacity_module_2`,
    `nameplate_capacity_module_3`  
  - `efficiency_module_1`, `efficiency_module_2`, `efficiency_module_3`

- Module technology indicators  
  - `technology_module_1`, `technology_module_2`, `technology_module_3`  
  - `bipv_module_1`, `bipv_module_2`, `bipv_module_3`  
  - `bifacial_module_1`, `bifacial_module_2`, `bifacial_module_3`

These columns jointly describe the number, type, and electrical properties of
modules used in a system.
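Since every indexed field follows the same `_1`, `_2`, `_3` pattern, the family's column list can be generated rather than typed out, which reduces the risk of omitting a field. A small sketch follows; the `INDEXED_TEMPLATES`, `UNINDEXED_COLUMNS`, and `module_family_columns` names are hypothetical.

```python
# Templates for the indexed fields listed above; {i} is the entry index.
INDEXED_TEMPLATES = [
    "module_manufacturer_{i}",
    "module_model_{i}",
    "module_quantity_{i}",
    "nameplate_capacity_module_{i}",
    "efficiency_module_{i}",
    "technology_module_{i}",
    "bipv_module_{i}",
    "bifacial_module_{i}",
]
UNINDEXED_COLUMNS = ["additional_modules"]


def module_family_columns(indices=(1, 2, 3)):
    """Return every raw column belonging to the module configuration family."""
    indexed = [t.format(i=i) for t in INDEXED_TEMPLATES for i in indices]
    return indexed + UNINDEXED_COLUMNS


columns = module_family_columns()
print(len(columns))
```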

#### Rationale for Family-Level Collapse

Module configuration variables exhibit the following structural properties:

- Multiple module types may be reported for a single system.
- Indexed fields (`_1`, `_2`, `_3`) may represent parallel strings, phased
  additions, or reporting artifacts.
- Placeholder values (e.g. `-1`) are frequently used to denote missing or
  inapplicable data.
- Quantity, nameplate capacity, and model identity are interdependent and cannot
  be collapsed independently without distorting physical meaning.

For these reasons, module-related variables must be collapsed jointly using
explicit, family-specific rules. Any collapse operation that treats these fields
independently would risk conflating reporting noise with genuine configuration
variation.

#### Collapse Objective

The objective of collapsing the module configuration family is to produce a
deterministic, system-level representation that:

- preserves the dominant physical realization of DC capacity,
- is defined at one row per system (`tts_link_id`),
- is robust to partial or noisy reporting, and
- does not introduce artificial variation across systems of identical size.

The resulting system-level module configuration variables serve as admissible
inputs for subsequent analysis of structural degrees of freedom at fixed size.


#### 6.2.1 Diagnostic Structure of Raw Module Configuration Data

This subsection performs narrowly scoped diagnostic analysis of raw
module-related configuration columns for the sole purpose of defining admissible
system-level collapse rules.

The objective is not to explore relationships, trends, or associations, but to
characterize the internal structure, reporting patterns, and failure modes of
module configuration data as recorded in the raw dataset.

Specifically, this diagnostic analysis addresses the following questions:

1. How many distinct module configurations are reported per system?
2. Do multiple module types meaningfully coexist within systems, or does a
   dominant configuration typically exist?
3. How are placeholder and missing values (`-1`, `0`, null) distributed across
   module-related fields?
4. Are quantities, nameplate capacities, and efficiencies internally consistent
   when multiple module entries are present?
5. Is it empirically defensible to collapse module configuration to a single
   dominant representation per system?

The results of this subsection directly determine the collapse rules defined in
Section 6.2.2. No structural degrees of freedom are identified here, and no
configuration equivalence is assessed.


In [54]:
# Load raw module-related data

RAW_DATA_PATH = Path(os.environ["TRACKING_THE_SUN_DATA"])

module_raw = pd.read_parquet(
    RAW_DATA_PATH,
    columns=[
        "tts_link_id",
        "module_quantity_1",
        "module_quantity_2",
        "module_quantity_3",
        "nameplate_capacity_module_1",
        "nameplate_capacity_module_2",
        "nameplate_capacity_module_3",
        "module_model_1",
        "module_model_2",
        "module_model_3",
        "module_manufacturer_1",
        "module_manufacturer_2",
        "module_manufacturer_3",
        "efficiency_module_1",
        "efficiency_module_2",
        "efficiency_module_3",
        "additional_modules",
    ],
    engine="fastparquet",
)


# Define what counts as a valid module entry
def is_valid_quantity(x):
    return pd.notna(x) and x > 0

# Count valid module entries per system
module_counts = (
    module_raw.assign(
        mq1_valid=module_raw["module_quantity_1"].apply(is_valid_quantity),
        mq2_valid=module_raw["module_quantity_2"].apply(is_valid_quantity),
        mq3_valid=module_raw["module_quantity_3"].apply(is_valid_quantity),
    )
    .groupby("tts_link_id")[["mq1_valid", "mq2_valid", "mq3_valid"]]
    .any()
    .sum(axis=1)
)

# Summarize distribution
multiplicity_summary = (
    module_counts.value_counts()
    .sort_index()
    .rename("n_systems")
    .to_frame()
)

multiplicity_summary["pct_systems"] = (
    multiplicity_summary["n_systems"] / multiplicity_summary["n_systems"].sum()
)

multiplicity_summary


|   | n_systems | pct_systems |
|---|-----------|-------------|
| 0 | 152 | 0.001234 |
| 1 | 60863 | 0.494102 |
| 2 | 55911 | 0.4539 |
| 3 | 6253 | 0.050764 |


#### Summary: Multiplicity Structure of Module Configuration

Diagnostic analysis of module quantity reporting reveals that module
configuration is frequently composite rather than singular.

Across all systems with admissible system identifiers:

- Approximately **49%** of systems report exactly **one** valid module entry.
- Approximately **45%** of systems report **two** valid module entries.
- Approximately **5%** of systems report **three** valid module entries.
- Fewer than **0.2%** of systems report no valid module quantities and are
  considered pathological for module configuration analysis.

These results indicate that indexed module fields (`_1`, `_2`, `_3`) are **not
mutually exclusive alternatives**, but instead often represent concurrent module
configurations within the same system. As a consequence, naive collapse rules
(e.g., selecting the first non-null entry) would misrepresent the physical
configuration of a substantial fraction of systems.

At the same time, multiplicity alone does not imply that systems are physically
mixed in a meaningful sense. Multiple module entries may reflect minor
supplementary components, phased additions, or reporting artifacts rather than
balanced hybrid configurations.

Accordingly, the next diagnostic step evaluates **dominance versus mixture**:
specifically, whether one module configuration typically accounts for the
majority of a system’s DC capacity when multiple module entries are present.

The outcome of that analysis will determine whether module configuration can be
collapsed to a single dominant representation per system, or whether mixture
itself constitutes a structural degree of freedom.


In [55]:

# Helper to compute DC contribution safely
def dc_contribution(qty, cap):
    if pd.notna(qty) and pd.notna(cap) and qty > 0 and cap > 0:
        return qty * cap
    return 0.0

# Compute DC contribution per indexed module entry
module_raw["dc_1"] = module_raw.apply(
    lambda r: dc_contribution(r["module_quantity_1"], r["nameplate_capacity_module_1"]),
    axis=1,
)
module_raw["dc_2"] = module_raw.apply(
    lambda r: dc_contribution(r["module_quantity_2"], r["nameplate_capacity_module_2"]),
    axis=1,
)
module_raw["dc_3"] = module_raw.apply(
    lambda r: dc_contribution(r["module_quantity_3"], r["nameplate_capacity_module_3"]),
    axis=1,
)

# Aggregate DC contributions to system level
dc_by_system = (
    module_raw
    .groupby("tts_link_id")[["dc_1", "dc_2", "dc_3"]]
    .sum()
    .reset_index()
)

# Total DC per system
dc_by_system["dc_total"] = dc_by_system[["dc_1", "dc_2", "dc_3"]].sum(axis=1)

# Count number of contributing module entries per system
dc_by_system["n_contributors"] = (
    (dc_by_system[["dc_1", "dc_2", "dc_3"]] > 0).sum(axis=1)
)

# Compute dominance share (max contributor / total DC)
dc_by_system["dc_max_share"] = (
    dc_by_system[["dc_1", "dc_2", "dc_3"]].max(axis=1)
    / dc_by_system["dc_total"]
)

# Restrict to multi-module systems only
multi_module_systems = dc_by_system.loc[
    (dc_by_system["n_contributors"] >= 2) & (dc_by_system["dc_total"] > 0),
    "dc_max_share"
]

# Summarize dominance distribution
dominance_summary = multi_module_systems.describe(
    percentiles=[0.5, 0.75, 0.9, 0.95]
)

dominance_summary


count    55005.000000
mean         0.674973
std          0.092886
min          0.333333
50%          0.677342
75%          0.738243
90%          0.792008
95%          0.822539
max          0.996844
Name: dc_max_share, dtype: float64

#### Summary: Dominance vs Mixture in Module Configuration

Analysis of DC capacity contributions across multi-module systems indicates that
module configuration is typically **mixed rather than dominated by a single
module type**.

Among systems in which two or more module entries contribute positive DC
capacity:

- The median dominant module contributes approximately **68%** of total DC
  capacity.
- Only **5%** of systems exhibit dominance above approximately **82%**.
- A substantial lower tail exists, with some systems exhibiting near-balanced
  capacity splits across module entries.

These results indicate that while a largest module configuration is often
present, dominance is generally **insufficient to justify collapsing module
configuration to a single representative entry** without loss of structural
information.

Accordingly, module mixture constitutes a **structural characteristic** of
systems and must be preserved or explicitly summarized in any admissible
system-level representation.

This finding constrains admissible collapse rules for the module configuration
family and motivates the use of mixture-aware summaries rather than dominant-only
selection.


In [58]:

module_columns = [
    "module_quantity_1", "module_quantity_2", "module_quantity_3",
    "nameplate_capacity_module_1", "nameplate_capacity_module_2", "nameplate_capacity_module_3",
    "module_model_1", "module_model_2", "module_model_3",
    "module_manufacturer_1", "module_manufacturer_2", "module_manufacturer_3",
    "efficiency_module_1", "efficiency_module_2", "efficiency_module_3",
    "additional_modules",
]

placeholder_rows = []

for col in module_columns:
    s = module_raw[col]

    placeholder_rows.append({
        "column": col,
        "pct_null": s.isna().mean(),
        "pct_minus_one": (s == -1).mean() if s.dtype != "object" else (s == "-1").mean(),
        "pct_zero": (s == 0).mean() if s.dtype != "object" else 0.0,
    })

placeholder_summary = (
    pd.DataFrame(placeholder_rows)
    .sort_values(["pct_minus_one", "pct_null"], ascending=False)
    .reset_index(drop=True)
)

placeholder_summary



|   | column | pct_null | pct_minus_one | pct_zero |
|---|--------|----------|---------------|----------|
| 0 | efficiency_module_3 | 6e-06 | 0.994334 | 0.0 |
| 1 | nameplate_capacity_module_3 | 8e-06 | 0.994332 | 0.0 |
| 2 | module_model_3 | 6e-06 | 0.993429 | 0.0 |
| 3 | module_quantity_3 | 6e-06 | 0.993083 | 0.0 |
| 4 | efficiency_module_2 | 6e-06 | 0.955276 | 0.0 |
| 5 | nameplate_capacity_module_2 | 2.9e-05 | 0.955255 | 0.0 |
| 6 | module_model_2 | 6e-06 | 0.947191 | 0.0 |
| 7 | module_quantity_2 | 6e-06 | 0.934872 | 0.0 |
| 8 | efficiency_module_1 | 6e-06 | 0.074791 | 0.0 |
| 9 | nameplate_capacity_module_1 | 0.000921 | 0.074074 | 0.0 |


In [59]:
# Check internal consistency: quantity and capacity placeholders should co-occur
consistency_checks = []

for i in [1, 2, 3]:
    qty_col = f"module_quantity_{i}"
    cap_col = f"nameplate_capacity_module_{i}"

    inconsistent = module_raw[
        (module_raw[qty_col] > 0) & (module_raw[cap_col] <= 0)
    ]

    consistency_checks.append({
        "entry_index": i,
        "n_inconsistent_rows": len(inconsistent),
        "pct_inconsistent": len(inconsistent) / len(module_raw),
    })

pd.DataFrame(consistency_checks)


|   | entry_index | n_inconsistent_rows | pct_inconsistent |
|---|-------------|---------------------|------------------|
| 0 | 1 | 105400 | 0.054861 |
| 1 | 2 | 43896 | 0.022848 |
| 2 | 3 | 4808 | 0.002503 |


#### Summary: Placeholder and Missingness Structure in Module Configuration

Placeholder values in module-related columns exhibit a highly structured and
index-consistent pattern.

For indexed module entries:

- Fields associated with index `_1` are predominantly populated, with low rates
  of placeholder values.
- Fields associated with index `_2` are largely placeholder (`-1`), but are
  meaningfully populated for a non-trivial subset of systems.
- Fields associated with index `_3` are almost entirely placeholder, indicating
  that this index is rarely used in practice.

This pattern is consistent across module quantity, nameplate capacity, model,
manufacturer, and efficiency fields, indicating that placeholder values reflect
intentional padding rather than random missingness.

Cross-field consistency checks further show that rows reporting a positive
module quantity alongside a non-positive (placeholder or invalid) nameplate
capacity are uncommon: about 5.5% of raw rows for index `_1`, 2.3% for `_2`,
and 0.25% for `_3`.

Taken together, these results indicate that placeholder entries can be safely
excluded from collapse operations, and that remaining non-placeholder entries
provide a coherent basis for mixture-aware system-level summarization.


#### 6.2.2 Module Configuration Collapse Rules

This subsection defines the rules used to collapse raw module configuration data
to a deterministic, system-level representation. These rules are derived
directly from the diagnostic analyses in Section 6.2.1 and are constrained by
observed reporting structure, dominance behavior, and placeholder consistency.

The collapse rules are designed to preserve genuine structural variation while
eliminating reporting artifacts and padding.

---

##### Rule 1 — Placeholder Exclusion

Module entries with placeholder or non-admissible values are excluded from all
collapse operations.

Specifically:
- Any indexed entry (`_1`, `_2`, `_3`) with `module_quantity <= 0` is treated as
  non-existent.
- Entries with missing or non-positive `nameplate_capacity_module` are excluded,
  even if quantity is reported.

This rule is justified by the highly structured placeholder patterns observed in
Section 6.2.1, where `-1` values act as intentional padding rather than ambiguous
missingness.
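A minimal sketch of this exclusion rule on a single raw row, represented here as a plain dictionary. The helper names `_is_positive` and `admissible_entries` are hypothetical, and the example values are made up.

```python
import math


def _is_positive(value):
    """True only for a present, non-NaN, strictly positive number."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return value > 0


def admissible_entries(row):
    """Return the entry indices (1, 2, 3) that survive placeholder exclusion."""
    return [
        i
        for i in (1, 2, 3)
        if _is_positive(row.get(f"module_quantity_{i}"))
        and _is_positive(row.get(f"nameplate_capacity_module_{i}"))
    ]


row = {
    "module_quantity_1": 20, "nameplate_capacity_module_1": 300.0,
    "module_quantity_2": 5, "nameplate_capacity_module_2": -1,   # capacity is padding
    "module_quantity_3": -1, "nameplate_capacity_module_3": -1,  # entry is padding
}
print(admissible_entries(row))
```

Entry `_2` is dropped even though a quantity is reported, because its capacity is a placeholder; only entry `_1` remains.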

---

##### Rule 2 — Mixture Preservation

Module configuration is treated as **composite rather than singular**.

Diagnostic analysis shows that:
- Multiple module entries are common.
- No single entry exhibits sufficiently strong dominance to justify exclusive
  selection.
- Balanced or moderately skewed mixtures are structurally prevalent.

Accordingly, collapse rules **do not select a single representative module
configuration**. Instead, mixture information is preserved via capacity-weighted
summaries.

---

##### Rule 3 — Capacity-Weighted Aggregation

For each system, the DC contribution of each admissible module entry is computed
as:

\[
\text{DC}_i = \text{module\_quantity}_i \times
\text{nameplate\_capacity\_module}_i
\]

These contributions are used as weights in all aggregate summaries.

This ensures that module configurations are summarized according to their
physical contribution to system capacity rather than raw counts or reporting
order.
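
As a hedged illustration of the weighting step, the sketch below computes per-entry DC contributions and each entry's share of system DC. It assumes the admissible entries have already been reshaped to long format (one row per system and entry). The helper columns `dc_w` and `dc_share` are hypothetical names, and the sketch assumes nameplate capacity is reported in watts.

```python
import pandas as pd

# Toy long-format table of admissible entries (assumed layout).
entries = pd.DataFrame({
    "tts_link_id": ["A", "A", "B"],
    "module_quantity": [10, 5, 20],
    "nameplate_capacity_module": [300.0, 400.0, 250.0],
})

# DC_i = module_quantity_i * nameplate_capacity_module_i
entries["dc_w"] = entries["module_quantity"] * entries["nameplate_capacity_module"]

# Share of each entry in its system's total DC; these shares act as weights.
system_total = entries.groupby("tts_link_id")["dc_w"].transform("sum")
entries["dc_share"] = entries["dc_w"] / system_total
```

For system `A`, the two entries contribute 3000 and 2000, giving weights of 0.6 and 0.4 even though the second entry has the larger per-module wattage.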

---

##### Rule 4 — System-Level Module Summary Variables

The following system-level module configuration variables are constructed:

- **Total module DC capacity**  
  Sum of DC contributions across all admissible module entries.

- **Effective module wattage (capacity-weighted mean)**  
  Capacity-weighted average of `nameplate_capacity_module`.

- **Effective module efficiency (capacity-weighted mean)**  
  Capacity-weighted average of `efficiency_module`, where available.

- **Module mixture count**  
  Number of distinct admissible module entries contributing positive DC capacity.

These variables jointly describe how system size is instantiated at the module
level while preserving mixture structure.
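
The four summary variables could be computed along the following lines. This is a sketch under stated assumptions: the input is a long-format table of admissible entries, and the output column names (`module_dc_total_w`, `module_wattage_eff`, `module_efficiency_eff`, `module_mixture_count`) are hypothetical. For efficiency, weights are restricted to entries where `efficiency_module` is present, which is one reading of "where available".

```python
import numpy as np
import pandas as pd

# Toy long-format table of admissible entries (assumed layout and values).
entries = pd.DataFrame({
    "tts_link_id": ["A", "A", "B"],
    "module_quantity": [10, 5, 20],
    "nameplate_capacity_module": [300.0, 400.0, 250.0],
    "efficiency_module": [0.18, 0.20, np.nan],
})
entries["dc_w"] = entries["module_quantity"] * entries["nameplate_capacity_module"]

# Numerators for the capacity-weighted means.
entries["weighted_watt"] = entries["dc_w"] * entries["nameplate_capacity_module"]
entries["weighted_eff"] = entries["dc_w"] * entries["efficiency_module"]
# Weight only counts toward the efficiency mean where efficiency is reported.
entries["eff_weight"] = entries["dc_w"].where(entries["efficiency_module"].notna())

g = entries.groupby("tts_link_id")
summary = pd.DataFrame({
    "module_dc_total_w": g["dc_w"].sum(),
    "module_wattage_eff": g["weighted_watt"].sum() / g["dc_w"].sum(),
    # min_count=1 keeps the result NaN when no entry reports efficiency.
    "module_efficiency_eff": g["weighted_eff"].sum(min_count=1)
    / g["eff_weight"].sum(min_count=1),
    "module_mixture_count": (entries["dc_w"] > 0)
    .groupby(entries["tts_link_id"])
    .sum(),
})
```

The result has one row per `tts_link_id`. System `B` reports no efficiency, so its effective efficiency stays missing rather than defaulting to zero.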

---

##### Rule 5 — Categorical Attributes (Model and Manufacturer)

Categorical module attributes (`module_model`, `module_manufacturer`) are handled
as follows:

- The **dominant categorical value** is defined as the value associated with the
  largest DC contribution.
- If several categorical values contribute non-trivially, the dominant value is
  still recorded, and the module mixture count (Rule 4) signals the heterogeneity.

This approach avoids exploding configuration classes while still anchoring each
system to a physically meaningful reference module.
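
One way to pick the dominant categorical values is sketched below. It takes the single entry with the largest DC contribution in each system, which is the most literal reading of the rule. Pooling DC by categorical value before selecting would be an alternative that the text does not specify. The `dc_w` column and the model and manufacturer names are hypothetical.

```python
import pandas as pd

# Toy long-format table of admissible entries with precomputed DC (assumed).
entries = pd.DataFrame({
    "tts_link_id": ["A", "A", "A", "B"],
    "module_model": ["X-300", "Y-400", "X-310", "Y-250"],
    "module_manufacturer": ["Maker X", "Maker Y", "Maker X", "Maker Y"],
    "dc_w": [2000.0, 3000.0, 1500.0, 5000.0],
})

# Row label of the largest-DC entry per system. On exact ties, idxmax keeps
# the first occurrence; that tie-break is an artifact of this sketch only.
top_idx = entries.groupby("tts_link_id")["dc_w"].idxmax()

dominant = (
    entries.loc[top_idx, ["tts_link_id", "module_model", "module_manufacturer"]]
    .set_index("tts_link_id")
)
```

Under this per-entry reading, system `A` is anchored to the `Y-400` entry, although the two `Maker X` entries together contribute more DC. Which reading is intended is a design choice the collapse rules leave open.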

---

##### Rule 6 — Admissibility and Scope

The resulting system-level module configuration variables:
- are defined at one row per system (`tts_link_id`),
- preserve genuine mixture where it exists,
- exclude reporting artifacts and padding, and
- are admissible inputs for subsequent analysis of structural degrees of freedom
  at fixed size.

No inference, comparison, or equivalence is defined at this stage. The output of
this collapse step serves solely as input to later phases.
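
The one-row-per-system requirement can be enforced with a small guard before the collapsed table is passed on. The helper name `assert_system_grain` and the toy frame are hypothetical; the sketch only checks that the key is non-missing and unique.

```python
import pandas as pd

def assert_system_grain(frame, key="tts_link_id"):
    """Raise ValueError unless `frame` has exactly one row per non-null key."""
    if frame[key].isna().any():
        raise ValueError(f"{key} contains missing values")
    dupes = frame.loc[frame[key].duplicated(keep=False), key].unique()
    if len(dupes) > 0:
        raise ValueError(f"{len(dupes)} {key} value(s) appear more than once")

# Toy collapsed output (assumed column names).
summary = pd.DataFrame({
    "tts_link_id": ["A", "B"],
    "module_mixture_count": [2, 1],
})
assert_system_grain(summary)  # passes silently at system grain
```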


### 6.3 Inverter Configuration Family


### 6.4 Layout and Orientation Family


### 6.5 Mounting Context Family


### 6.6 Storage (Battery) Family


## Phase 6 — Structural Degrees of Freedom

This phase defines the structural degrees of freedom available to systems once
system size has been held fixed via cohort-relative conditioning.

A structural degree of freedom is any system-level attribute or configuration
dimension that may vary independently of size and therefore represents genuine
structural variation rather than scale effects.

This phase performs no aggregation, comparison, or inference. It exists solely
to enumerate and formally define the dimensions along which structural variation
will be examined in subsequent analysis.


### Defined Structural Degrees of Freedom

The following structural degrees of freedom are available in the assembled system-level dataset and are admissible for analysis once size is held fixed.

#### 1. Configuration Complexity
- **Representation:** `n_system_sizes`, `n_size_reports`
- **Description:** Number of reported size components or revisions associated with a system.
- **Justification:** Reflects reporting and configuration complexity rather than physical scale.

#### 2. Pricing Structure Presence
- **Representation:** `n_prices`
- **Description:** Number of distinct price records associated with a system.
- **Justification:** Indicates commercial or contractual complexity independent of installed capacity.

#### 3. Expansion Behavior
- **Representation:** `has_expansion`
- **Description:** Boolean indicator of whether a system underwent expansion.
- **Justification:** Captures temporal and structural modification independent of final size.

#### 4. Phasing Behavior
- **Representation:** `has_multiple_phases`
- **Description:** Boolean indicator of multi-phase installation or reporting.
- **Justification:** Reflects deployment strategy rather than scale.

#### 5. Reporting Density
- **Representation:** `n_rows`
- **Description:** Number of raw records associated with a system.
- **Justification:** Measures data and reporting richness, not physical size.

#### 6. Temporal Position
- **Representation:** `installation_year`, `installation_year_cohort`
- **Description:** Placement of the system in historical time.
- **Justification:** Controls for era-specific practices while remaining independent of relative size conditioning.
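
The enumeration above can be written down as an explicit registry so that later phases refer to the same column sets. The sketch below is one possible encoding: the dictionary keys and the helper `missing_dof_columns` are hypothetical names, while the column names are those listed above.

```python
import pandas as pd

# Degree of freedom -> representing columns, as enumerated in this phase.
STRUCTURAL_DOF = {
    "configuration_complexity": ["n_system_sizes", "n_size_reports"],
    "pricing_structure_presence": ["n_prices"],
    "expansion_behavior": ["has_expansion"],
    "phasing_behavior": ["has_multiple_phases"],
    "reporting_density": ["n_rows"],
    "temporal_position": ["installation_year", "installation_year_cohort"],
}

def missing_dof_columns(frame):
    """Return {degree_of_freedom: [absent columns]} for incomplete entries."""
    missing = {}
    for name, cols in STRUCTURAL_DOF.items():
        absent = [c for c in cols if c not in frame.columns]
        if absent:
            missing[name] = absent
    return missing

# Toy system-level frame that lacks `n_prices` (illustrative only).
toy = pd.DataFrame(columns=[
    "tts_link_id", "n_system_sizes", "n_size_reports", "has_expansion",
    "has_multiple_phases", "n_rows", "installation_year",
    "installation_year_cohort",
])
gaps = missing_dof_columns(toy)
```

An empty result would indicate that every declared degree of freedom is represented in the assembled dataset.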


### Configuration-Relevant Subset

While all degrees of freedom listed above vary independently of size, only a
subset is relevant for defining **configuration equivalence**.

Configuration equivalence is concerned exclusively with the **physical
realization of system size**, that is, how capacity is instantiated through
components and layout.

Accordingly, degrees of freedom related to reporting density, pricing structure,
temporal position, expansion, and phasing are retained for descriptive analysis
but **excluded from equivalence construction**.

Only physical configuration dimensions that are:
- defined at system grain,
- admissible once size is fixed, and
- empirically constrained at fixed size

are used to define configuration classes. The identification of such dimensions
proceeds in the following sections.

