# Notebook 1: Structural Configuration Assembly (Pre-DoF, Fixed Size)

This notebook initiates **Repo 3** by assembling a **system-level structural configuration dataset** that characterizes how photovoltaic systems are physically and electrically configured **conditional on a given system size**.

All inputs consumed here are **descriptive, frozen artifacts produced upstream in Repo 2**. This notebook operates strictly at the level of **representation and construction**: it audits inputs, assembles system-level context, materializes configuration families, and collapses each family into deterministic, one-row-per-system representations.

The primary objective is to define the **admissible configuration state space** that remains once system size is treated as given. This includes identifying which configuration dimensions meaningfully vary at fixed size and which do not.

No inference, deviation assessment, regime definition, or risk evaluation is performed in this notebook. Its sole output is a coherent structural configuration dataset suitable for downstream structural degrees-of-freedom analysis.


## Phase 1 — Input Artifact Audit and Role Assertion

This phase audits the upstream artifacts consumed by Repo 3 and asserts their
semantic role, grain, and admissible usage within this repository.

Repo 3 consumes **descriptive artifacts only**, produced and sealed in Repo 2.
Artifacts are classified into system-level anchors, cohort-level context tables,
and reference distributions. Only system-level and cohort-level artifacts may be
joined in this repository, and only at declared grains.

This phase performs no data loading or transformation. Its sole purpose is to
establish the analytical contract under which all subsequent operations occur.
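
A contract of this kind can be written down as plain data before anything is loaded. The sketch below is one possible encoding, not the notebook's own mechanism: the `ArtifactContract` class, the role labels, the `assert_joinable` helper, and the role assigned to each file are all assumptions made for illustration, and `hypothetical_reference.parquet` is an invented placeholder name.

```python
from dataclasses import dataclass

# Roles named in the text; only the first two may be joined in this repository.
JOINABLE_ROLES = {"system_anchor", "cohort_context"}


@dataclass(frozen=True)
class ArtifactContract:
    name: str   # file name of the upstream artifact
    role: str   # "system_anchor", "cohort_context", or "reference_distribution"
    grain: str  # column identifying one row; "" when no join key is declared


# Illustrative role assignments (assumed, not read from Repo 2 metadata).
CONTRACTS = {
    c.name: c
    for c in [
        ArtifactContract("system_size_index.parquet", "system_anchor", "tts_link_id"),
        ArtifactContract("size_baselines.parquet", "cohort_context", "installation_year_cohort"),
        ArtifactContract("hypothetical_reference.parquet", "reference_distribution", ""),
    ]
}


def assert_joinable(name: str, on: str) -> None:
    """Raise if the artifact may not be joined, or is joined off its declared grain."""
    contract = CONTRACTS[name]
    if contract.role not in JOINABLE_ROLES:
        raise ValueError(f"{name} has role {contract.role!r} and may not be joined.")
    if on != contract.grain:
        raise ValueError(f"{name} must be joined on {contract.grain!r}, not {on!r}.")


# Passes silently: a cohort-level table joined at its declared grain.
assert_joinable("size_baselines.parquet", on="installation_year_cohort")
```

Calling `assert_joinable` before each merge would turn the prose contract into a failing check when a join is attempted at the wrong grain.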


## Phase 2 — Input Files Structural Audit

This phase inspects all input files provided to Repo 3 and reports their
row counts and column structures.

The purpose of this phase is to make the available data explicit before any
grain enforcement or invariant assertions are applied.


In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import os

INPUT_DIR = Path("../inputs")

input_files = sorted(
    list(INPUT_DIR.glob("*.parquet")) + list(INPUT_DIR.glob("*.csv"))
)

for path in input_files:
    if path.suffix == ".parquet":
        df = pd.read_parquet(path)
    else:
        df = pd.read_csv(path)

    print(f"\nFILE: {path.name}")
    print(f"SHAPE: {df.shape}")
    print("COLUMNS:")
    for col in df.columns:
        print(f"  - {col}")



FILE: admissible_system_index.parquet
SHAPE: (122998, 7)
COLUMNS:
  - tts_link_id
  - n_rows
  - n_installation_dates
  - n_system_sizes
  - n_prices
  - has_expansion
  - has_multiple_phases

FILE: size_baselines.parquet
SHAPE: (23, 5)
COLUMNS:
  - installation_year_cohort
  - n_systems
  - p25
  - expected_system_size_kw
  - p75

FILE: size_dispersion.parquet
SHAPE: (23, 6)
COLUMNS:
  - installation_year_cohort
  - n_systems
  - iqr
  - p90_p10_span
  - min_size
  - max_size

FILE: size_distributions.parquet
SHAPE: (23, 9)
COLUMNS:
  - installation_year_cohort
  - n_systems
  - p10
  - p25
  - p50
  - p75
  - p90
  - min_size
  - max_size

FILE: system_level_base.parquet
SHAPE: (123178, 7)
COLUMNS:
  - tts_link_id
  - n_rows
  - n_installation_dates
  - n_system_sizes
  - n_prices
  - has_expansion
  - has_multiple_phases

FILE: system_size_index.parquet
SHAPE: (103313, 9)
COLUMNS:
  - tts_link_id
  - n_rows
  - n_installation_dates
  - n_system_sizes
  - n_prices
  - has_expansion


## Phase 3 — System-Level Assembly (Size × Temporal)

This phase assembles the canonical system-level table for Repo 3 by combining
system size information with system-level temporal context.

The result of this phase is a one-row-per-system dataset that includes both
system size and installation timing. All downstream operations in this
repository must operate exclusively on this assembled table.
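
The one-row-per-system guarantee can be enforced at merge time rather than only checked afterwards. The toy example below (invented identifiers and values) shows the two behaviours this phase relies on: `validate="one_to_one"` raises when either side repeats a key, and `how="inner"` drops systems that are absent from either table.

```python
import pandas as pd
from pandas.errors import MergeError

size = pd.DataFrame({"tts_link_id": ["a", "b"], "system_size_kw": [2.4, 6.25]})

# Toy temporal table with a duplicated key, which violates system grain.
time_dup = pd.DataFrame(
    {"tts_link_id": ["a", "a", "b"], "installation_year": [2014, 2015, 2016]}
)

try:
    size.merge(time_dup, on="tts_link_id", how="inner", validate="one_to_one")
    grain_violation_detected = False
except MergeError:
    grain_violation_detected = True

# Toy temporal table that is unique but covers only one of the two systems.
time_ok = pd.DataFrame({"tts_link_id": ["a"], "installation_year": [2014]})
assembled = size.merge(time_ok, on="tts_link_id", how="inner", validate="one_to_one")

print(grain_violation_detected, len(assembled))
```

Because the inner join discards unmatched systems without raising, comparing row counts before and after assembly is a cheap way to see how many systems were lost.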



In [2]:
INPUT_DIR = Path("../inputs")

# Load system-level size index
df_size = pd.read_parquet(INPUT_DIR / "system_size_index.parquet")

# Load system-level temporal index
df_time = pd.read_parquet(INPUT_DIR / "system_temporal_index.parquet")

# Assemble system-level base table
df_system = df_size.merge(
    df_time,
    on="tts_link_id",
    how="inner",
    validate="one_to_one"
)

# --- Invariants ---
assert df_system["tts_link_id"].is_unique, (
    "System assembly must preserve one row per tts_link_id."
)

assert "system_size_kw" in df_system.columns, (
    "Assembled system table must include system_size_kw."
)

assert "installation_year" in df_system.columns, (
    "Assembled system table must include installation_year."
)

df_system.shape, df_system.head()




((103309, 11),
                tts_link_id  n_rows  n_installation_dates  n_system_sizes  \
 0      tts_extension_id_10       2                     2               2   
 1     tts_extension_id_100       2                     2               2   
 2    tts_extension_id_1000       2                     2               1   
 3   tts_extension_id_10000       2                     2               2   
 4  tts_extension_id_100000       2                     2               2   
 
    n_prices  has_expansion  has_multiple_phases  system_size_kw  \
 0         2           True                 True        2.400000   
 1         2           True                 True        6.250000   
 2         2           True                 True        4.410000   
 3         2           True                 True        5.565000   
 4         2           True                 True        3.263644   
 
    n_size_reports  installation_year  installation_year_cohort  
 0               2             2014.0        

## Phase 4 — Cohort-Level Context Attachment

This phase attaches cohort-level descriptive context to the system-level table
assembled in Phase 3.

Cohort artifacts provide descriptive baselines, dispersion, and distributional
summaries keyed by `installation_year_cohort`. These artifacts are joined strictly
as lookup tables.

System grain must be preserved. Each system remains represented by exactly one
row. No aggregation, collapsing, or interpretive analysis is performed in this
phase.
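
A lookup-style join can be illustrated on toy data. The sketch below uses invented values and adds `indicator=True`, which is an optional extra and not something the notebook itself uses, to show how a system whose cohort has no row in the lookup table could be surfaced while system grain is preserved.

```python
import pandas as pd

systems = pd.DataFrame({
    "tts_link_id": ["a", "b", "c"],
    # 2030 has no matching row in the toy cohort table below.
    "installation_year_cohort": [2014, 2014, 2030],
})
cohort_lookup = pd.DataFrame({
    "installation_year_cohort": [2014, 2015],
    "expected_system_size_kw": [5.1, 5.6],
})

context = systems.merge(
    cohort_lookup,
    on="installation_year_cohort",
    how="left",               # keep every system, matched or not
    validate="many_to_one",   # the lookup must be unique per cohort
    indicator=True,           # adds a `_merge` column: "both" or "left_only"
)

# System grain is preserved: one output row per input system.
assert len(context) == len(systems)

unmatched = context.loc[context["_merge"] == "left_only", "tts_link_id"].tolist()
print(unmatched)
```

A non-null `installation_year_cohort` on the system side does not by itself show that the cohort exists in the lookup table, which is why a match indicator can be a useful complement.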


In [3]:
# Assemble system-level context table (single attachment point)

# Load size reference artifacts from Repo 2
df_size_baselines = pd.read_parquet(
    INPUT_DIR / "size_baselines.parquet"
)

df_size_dispersion = pd.read_parquet(
    INPUT_DIR / "size_dispersion.parquet"
)

df_size_distributions = pd.read_parquet(
    INPUT_DIR / "size_distributions.parquet"
)

# Assemble system context (attach cohort geometry exactly once)
df_system_context = df_system.merge(
    df_size_baselines,
    on="installation_year_cohort",
    how="left",
    validate="many_to_one"
)

df_system_context = df_system_context.merge(
    df_size_dispersion,
    on="installation_year_cohort",
    how="left",
    validate="many_to_one"
)

df_system_context = df_system_context.merge(
    df_size_distributions,
    on="installation_year_cohort",
    how="left",
    validate="many_to_one"
)


# Invariants

assert df_system_context["tts_link_id"].is_unique, (
    "System context assembly must preserve one row per tts_link_id."
)

assert df_system_context["installation_year_cohort"].notna().all(), (
    "All systems must map to a valid installation year cohort."
)

df_system_context.shape



(103309, 28)
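
The three cohort tables share several column names, so the chained merges produce `_x` and `_y` suffixes. pandas applies suffixes only to columns that collide within a single merge, which means the origin of each suffixed column depends on merge order. The toy frames below reuse a subset of the real column names with invented values to make that traceable.

```python
import pandas as pd

systems = pd.DataFrame(
    {"tts_link_id": ["a", "b"], "installation_year_cohort": [2014, 2015]}
)
baselines = pd.DataFrame(
    {"installation_year_cohort": [2014, 2015], "n_systems": [10, 20], "p25": [3.0, 4.0]}
)
dispersion = pd.DataFrame(
    {"installation_year_cohort": [2014, 2015], "n_systems": [11, 21], "min_size": [1.0, 1.5]}
)
distributions = pd.DataFrame(
    {
        "installation_year_cohort": [2014, 2015],
        "n_systems": [12, 22],
        "p25": [3.3, 4.4],
        "min_size": [1.1, 1.6],
    }
)

merged = systems
for lookup in (baselines, dispersion, distributions):
    merged = merged.merge(
        lookup, on="installation_year_cohort", how="left", validate="many_to_one"
    )

print(sorted(merged.columns))
```

In this toy run `p25_x` carries the baselines values and `p25_y` the distributions values, `min_size_x` comes from dispersion, and the third `n_systems` arrives unsuffixed because the two earlier copies had already been renamed to `n_systems_x` and `n_systems_y`.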

In [None]:
#  Phase 4 cleanup: resolve column collisions and canonicalize cohort columns 

df = df_system_context.copy()

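# Explicit renaming map (canonical names only).
# Suffix origin: pandas suffixes only columns that collide within a single merge.
#   n_systems collides in the dispersion merge:     _x = baselines,  _y = dispersion
#   p25, p75  collide in the distributions merge:   _x = baselines,  _y = distributions
#   min/max   collide in the distributions merge:   _x = dispersion, _y = distributions
rename_map = {
    # Baseline (expected size reference)
    "expected_system_size_kw": "baseline_expected_system_size_kw",
    "p25_x": "baseline_p25",
    "p75_x": "baseline_p75",
    "n_systems_x": "cohort_n_systems",  # canonical cohort population size

    # Dispersion
    "iqr": "dispersion_iqr",
    "p90_p10_span": "dispersion_p90_p10_span",
    "min_size_x": "dispersion_min_size",
    "max_size_x": "dispersion_max_size",

    # Distribution envelope
    "p10": "distribution_p10",
    "p25_y": "distribution_p25",
    "p50": "distribution_p50",
    "p75_y": "distribution_p75",
    "p90": "distribution_p90",
    "min_size_y": "distribution_min_size",
    "max_size_y": "distribution_max_size",
}

df = df.rename(columns=rename_map)

# Drop the duplicate cohort population column left by the dispersion merge.
# Note: the unsuffixed `n_systems` contributed by size_distributions is not
# touched here and remains in the table.
cols_to_drop = [c for c in df.columns if c in {"n_systems_y"}]
df = df.drop(columns=cols_to_drop, errors="ignore")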

# Final guard: no ambiguous suffixes allowed past Phase 4
ambiguous_cols = [c for c in df.columns if c.endswith("_x") or c.endswith("_y")]
assert not ambiguous_cols, f"Unresolved ambiguous columns remain: {ambiguous_cols}"

df_system_context = df
df_system_context.shape, df_system_context.head()


((103309, 27),
                tts_link_id  n_rows  n_installation_dates  n_system_sizes  \
 0      tts_extension_id_10       2                     2               2   
 1     tts_extension_id_100       2                     2               2   
 2    tts_extension_id_1000       2                     2               1   
 3   tts_extension_id_10000       2                     2               2   
 4  tts_extension_id_100000       2                     2               2   
 
    n_prices  has_expansion  has_multiple_phases  system_size_kw  \
 0         2           True                 True        2.400000   
 1         2           True                 True        6.250000   
 2         2           True                 True        4.410000   
 3         2           True                 True        5.565000   
 4         2           True                 True        3.263644   
 
    n_size_reports  installation_year  ...  dispersion_min_size  \
 0               2             2014.0  ...  

### Artifact Freeze — Contextualized Descriptive System Table

The table produced at the end of Phase 4 represents the fully assembled,
context-enriched, system-grain dataset derived exclusively from descriptive
artifacts.

This table is frozen prior to any size conditioning, structural analysis, or
interpretation. It is emitted both as a Repo 3 output and as a reusable artifact
for downstream repositories via the tts_artifacts repository.



In [5]:
# Repo 3 output path
REPO3_OUTPUT_DIR = Path("../outputs")
REPO3_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

repo3_output_path = REPO3_OUTPUT_DIR / "system_context_descriptive.parquet"

# tts_artifacts repo path (adjust relative depth if needed)
ARTIFACTS_DIR = Path("../../tts_artifacts")
ARTIFACTS_DIR.mkdir(parents=True, exist_ok=True)

artifacts_output_path = ARTIFACTS_DIR / "system_context_descriptive.parquet"

# Emit artifact to both locations
df_system_context.to_parquet(repo3_output_path, index=False)
df_system_context.to_parquet(artifacts_output_path, index=False)

repo3_output_path, artifacts_output_path, df_system_context.shape


(WindowsPath('../outputs/system_context_descriptive.parquet'),
 WindowsPath('../../tts_artifacts/system_context_descriptive.parquet'),
 (103309, 27))

## Phase 5 — Configuration Family Materialization and Collapse Rules

This phase materializes physical configuration dimensions at system grain by
collapsing raw component-level data into coherent configuration families.

Raw Tracking the Sun data describe system configuration across multiple rows and
component fields, reflecting reporting multiplicity, phased installation, and
partial disclosure. These representations are not directly admissible for
structural analysis.

Accordingly, this phase:
- identifies coherent physical configuration families,
- enumerates the raw columns associated with each family, and
- defines explicit, family-specific collapse rules that yield one deterministic
  configuration representation per system.

This phase performs no conditioning on system size, no measurement of variation,
and no inference. Its sole purpose is to construct admissible system-level
configuration dimensions for subsequent structural analysis.


### 5.1 Configuration Families Overview

Physical configuration in the Tracking the Sun dataset is not represented as a
single, atomic system description. Instead, it is encoded across multiple raw
columns and, in many cases, multiple rows per system, reflecting phased
installation, component-level reporting, and partial disclosure.

As a result, configuration dimensions cannot be collapsed or analyzed in
isolation. Individual raw columns (e.g., `module_quantity_1`,
`inverter_model_2`) do not carry independent meaning outside the context of
related fields that jointly describe the same physical subsystem.

To preserve physical interpretability and avoid introducing artificial
variation, configuration dimensions are therefore organized into **configuration
families**.

A configuration family is defined as a coherent group of raw variables that:
- jointly describe a single physical subsystem of the solar installation,
- share common reporting and missingness patterns,
- must be collapsed together to preserve semantic meaning, and
- collectively contribute to how system size is physically instantiated.

Configuration families serve two purposes in this phase:
1. They define the scope within which collapse rules are specified.
2. They establish the units of admissible physical configuration for downstream
   structural analysis.

Collapse is performed at the family level, yielding one deterministic,
system-level representation per family and per system (`tts_link_id`). No
assessment of variation or degrees of freedom is conducted at this stage.

The following sections enumerate each configuration family, identify the raw
columns associated with that family, and define explicit collapse rules that
produce admissible system-level configuration dimensions.


### 5.2 Module Configuration Family

The module configuration family describes how a system’s direct current (DC)
capacity is instantiated at the photovoltaic module level. This family captures
the physical composition of the array in terms of module count, power rating,
and technological characteristics.

This analysis is conducted under the constraint that system size
(`system_size_kw`) is treated as fixed and non-variable. All variation examined
here is therefore conditional on fixed size or narrow size bins, and is intended
to capture genuine structural variation rather than scale effects.

In the raw Tracking the Sun data, module configuration is reported across
multiple indexed fields (e.g. `_1`, `_2`, `_3`) and may reflect mixed module
types, partial reporting, phased additions, or reporting artifacts. Individual
module-related columns are therefore not semantically meaningful in isolation
and must be interpreted jointly.

---

#### Analytical Scope and Procedure

The module configuration analysis follows a disciplined, family-level procedure
designed to preserve physical meaning while remaining robust to reporting noise:

1. **Raw Data Materialization**  
   Load raw module-related fields at their native grain without aggregation or
   inference. Indexed fields are treated as parallel configuration slots rather
   than independent variables.

2. **Slot Enumeration and Diagnostics**  
   Identify the number of module slots populated per system and assess whether
   systems exhibit single-module or mixed-module configurations.

3. **Descriptive Diagnostics**  
   For each module-related field:
   - examine distributions and prevalence,
   - measure missingness and placeholder usage (nulls, zeros, sentinel values),
   - and assess internal consistency across related fields.

4. **Dominance vs. Mixture Assessment**  
   Determine whether module configurations are typically dominated by a single
   module type or represent meaningful mixtures, using DC-weighted diagnostics.

5. **Collapse Rule Definition**  
   Define deterministic, family-specific collapse rules that map raw module slot
   data to system-level structural quantities. Collapse rules are empirically
   justified and invariant to reporting multiplicity.

6. **System-Level Configuration Materialization**  
   Apply collapse rules to produce a system-level representation of module
   configuration, preserving one row per system (`tts_link_id`).

No artifacts are written until the full module configuration family collapse is
complete.

---

#### Raw Columns Included

The following raw columns are associated with the module configuration family:

- **Module identity and quantity**  
  - `module_manufacturer_1`, `module_manufacturer_2`, `module_manufacturer_3`  
  - `module_model_1`, `module_model_2`, `module_model_3`  
  - `module_quantity_1`, `module_quantity_2`, `module_quantity_3`  
  - `additional_modules`

- **Electrical characteristics**  
  - `nameplate_capacity_module_1`, `nameplate_capacity_module_2`,
    `nameplate_capacity_module_3`  
  - `efficiency_module_1`, `efficiency_module_2`, `efficiency_module_3`

- **Module technology indicators**  
  - `technology_module_1`, `technology_module_2`, `technology_module_3`  
  - `bipv_module_1`, `bipv_module_2`, `bipv_module_3`  
  - `bifacial_module_1`, `bifacial_module_2`, `bifacial_module_3`

These columns jointly describe the number, type, and electrical properties of
modules used in a system.
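
Because the family is defined by a small set of field stems repeated across three slots, the column list can be generated instead of typed out. The sketch below is one way to do that; the grouping keys and the `module_family_columns` helper are names introduced here for illustration.

```python
MODULE_SLOTS = (1, 2, 3)

# Per-slot field stems; each expands to <stem>_1, <stem>_2, <stem>_3.
SLOT_FIELD_STEMS = {
    "identity_and_quantity": ["module_manufacturer", "module_model", "module_quantity"],
    "electrical": ["nameplate_capacity_module", "efficiency_module"],
    "technology": ["technology_module", "bipv_module", "bifacial_module"],
}

# Fields that belong to the family but are not indexed by slot.
UNSLOTTED_FIELDS = ["additional_modules"]


def module_family_columns():
    """Return every raw column name in the module configuration family."""
    slotted = [
        f"{stem}_{slot}"
        for stems in SLOT_FIELD_STEMS.values()
        for stem in stems
        for slot in MODULE_SLOTS
    ]
    return slotted + UNSLOTTED_FIELDS


columns = module_family_columns()
print(len(columns))
```

Generating the names from stems keeps the family definition in one place, so a later loader or diagnostic cannot silently omit a slot.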

---

#### Rationale for Family-Level Collapse

Module configuration variables exhibit the following structural properties:

- Multiple module types may be reported for a single system.
- Indexed fields (`_1`, `_2`, `_3`) may represent parallel strings, phased
  additions, or reporting artifacts.
- Placeholder values (e.g. `-1`) are frequently used to denote missing or
  inapplicable data.
- Quantity, nameplate capacity, efficiency, and model identity are
  interdependent and cannot be collapsed independently without distorting
  physical meaning.

For these reasons, module-related variables must be collapsed jointly using
explicit, family-specific rules. Any collapse operation that treats these fields
independently would risk conflating reporting noise with genuine configuration
variation.

---

#### Collapse Objective

The objective of collapsing the module configuration family is to produce a
deterministic, system-level representation that:

- preserves the dominant physical realization of DC capacity,
- is defined at one row per system (`tts_link_id`),
- is robust to partial or noisy reporting, and
- does not introduce artificial variation across systems of identical size.

The resulting system-level module configuration variables serve as admissible
inputs for subsequent analysis of structural degrees of freedom at fixed size.



#### Diagnostic Structure of Raw Module Configuration Data

This subsection performs narrowly scoped diagnostic analysis of raw
module-related configuration columns for the sole purpose of defining admissible
system-level collapse rules.

The objective is not to explore relationships, trends, or associations, but to
characterize the internal structure, reporting patterns, and failure modes of
module configuration data as recorded in the raw dataset.

Specifically, this diagnostic analysis addresses the following questions:

1. How many distinct module configurations are reported per system?
2. Do multiple module types meaningfully coexist within systems, or does a
   dominant configuration typically exist?
3. How are placeholder and missing values (`-1`, `0`, null) distributed across
   module-related fields?
4. Are quantities, nameplate capacities, and efficiencies internally consistent
   when multiple module entries are present?
5. Is it empirically defensible to collapse module configuration to a single
   dominant representation per system?

The results of this subsection directly determine the collapse rules defined in
the Module Configuration Collapse Rules subsection below. No structural degrees
of freedom are identified here, and no configuration equivalence is assessed.
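
Question 1 can be sketched on toy data before it is run at scale. The example below uses invented systems, one of which spans two raw rows, and counts how many slots carry a valid quantity anywhere in a system. It uses vectorized comparisons, which is an assumption about style rather than a description of the notebook's own cell.

```python
import pandas as pd

toy = pd.DataFrame({
    "tts_link_id":       ["a", "a", "b", "c"],
    "module_quantity_1": [20, -1, 12, -1],
    "module_quantity_2": [-1, 8, -1, -1],
    "module_quantity_3": [-1, -1, -1, None],
})
qty_cols = ["module_quantity_1", "module_quantity_2", "module_quantity_3"]

# A slot is valid in a row when its quantity is present and positive.
# NaN > 0 evaluates to False, so nulls are excluded without a separate check.
slot_valid = toy[qty_cols].gt(0)

# A slot counts for a system if it is valid in any of that system's rows;
# summing across slots gives the number of populated slots per system.
n_valid_slots = slot_valid.groupby(toy["tts_link_id"]).any().sum(axis=1)

print(n_valid_slots.to_dict())
```

System `a` reports slot 1 in one row and slot 2 in another, so it counts as a two-entry system even though no single row has both populated.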


In [6]:
# Raw Data Materialization

RAW_DATA_PATH = Path(os.environ["TRACKING_THE_SUN_DATA"])

module_raw = pd.read_parquet(
    RAW_DATA_PATH,
    columns=[
        "tts_link_id",
        "module_quantity_1",
        "module_quantity_2",
        "module_quantity_3",
        "nameplate_capacity_module_1",
        "nameplate_capacity_module_2",
        "nameplate_capacity_module_3",
        "module_model_1",
        "module_model_2",
        "module_model_3",
        "module_manufacturer_1",
        "module_manufacturer_2",
        "module_manufacturer_3",
        "efficiency_module_1",
        "efficiency_module_2",
        "efficiency_module_3",
        "additional_modules",
    ],
    engine="fastparquet",
)


# Define what counts as a valid module entry
def is_valid_quantity(x):
    return pd.notna(x) and x > 0

# Count valid module entries per system
module_counts = (
    module_raw.assign(
        mq1_valid=module_raw["module_quantity_1"].apply(is_valid_quantity),
        mq2_valid=module_raw["module_quantity_2"].apply(is_valid_quantity),
        mq3_valid=module_raw["module_quantity_3"].apply(is_valid_quantity),
    )
    .groupby("tts_link_id")[["mq1_valid", "mq2_valid", "mq3_valid"]]
    .any()
    .sum(axis=1)
)

# Summarize distribution
multiplicity_summary = (
    module_counts.value_counts()
    .sort_index()
    .rename("n_systems")
    .to_frame()
)

multiplicity_summary["pct_systems"] = (
    multiplicity_summary["n_systems"] / multiplicity_summary["n_systems"].sum()
)

multiplicity_summary


| valid module entries | n_systems | pct_systems |
|---|---|---|
| 0 | 152 | 0.001234 |
| 1 | 60863 | 0.494102 |
| 2 | 55911 | 0.4539 |
| 3 | 6253 | 0.050764 |


#### Summary: Multiplicity Structure of Module Configuration

Diagnostic analysis of module quantity reporting reveals that module
configuration is frequently composite rather than singular.

Across all systems with admissible system identifiers:

- Approximately **49%** of systems report exactly **one** valid module entry.
- Approximately **45%** of systems report **two** valid module entries.
- Approximately **5%** of systems report **three** valid module entries.
- Fewer than **0.2%** of systems report no valid module quantities and are
  considered pathological for module configuration analysis.

These results indicate that indexed module fields (`_1`, `_2`, `_3`) are **not
mutually exclusive alternatives**, but instead often represent concurrent module
configurations within the same system. As a consequence, naive collapse rules
(e.g., selecting the first non-null entry) would misrepresent the physical
configuration of a substantial fraction of systems.

At the same time, multiplicity alone does not imply that systems are physically
mixed in a meaningful sense. Multiple module entries may reflect minor
supplementary components, phased additions, or reporting artifacts rather than
balanced hybrid configurations.

Accordingly, the next diagnostic step evaluates **dominance versus mixture**:
specifically, whether one module configuration typically accounts for the
majority of a system’s DC capacity when multiple module entries are present.

The outcome of that analysis will determine whether module configuration can be
collapsed to a single dominant representation per system, or whether mixture
itself constitutes a structural degree of freedom.
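
The dominance measure is the largest slot-level DC contribution divided by the system's total DC. A minimal worked example on invented values, assuming the per-slot contributions have already been aggregated to one row per system:

```python
import pandas as pd

toy = pd.DataFrame({
    "tts_link_id": ["a", "b"],
    "dc_1": [3000.0, 2500.0],  # quantity_1 * nameplate_capacity_1, summed per system
    "dc_2": [1000.0, 2500.0],
    "dc_3": [0.0, 0.0],
}).set_index("tts_link_id")

dc_total = toy.sum(axis=1)
dc_max_share = toy.max(axis=1) / dc_total

print(dc_max_share.to_dict())
```

System `a` has a share of 0.75 (one entry supplies three quarters of capacity), while system `b` has a share of 0.5 (an even two-way split). With three slots the share cannot fall below 1/3, which corresponds to an exactly balanced three-way mixture.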


In [7]:
# Dominance vs. Mixture Assessment

# Helper to compute DC contribution safely
def dc_contribution(qty, cap):
    if pd.notna(qty) and pd.notna(cap) and qty > 0 and cap > 0:
        return qty * cap
    return 0.0

# Compute DC contribution per indexed module entry
module_raw["dc_1"] = module_raw.apply(
    lambda r: dc_contribution(r["module_quantity_1"], r["nameplate_capacity_module_1"]),
    axis=1,
)
module_raw["dc_2"] = module_raw.apply(
    lambda r: dc_contribution(r["module_quantity_2"], r["nameplate_capacity_module_2"]),
    axis=1,
)
module_raw["dc_3"] = module_raw.apply(
    lambda r: dc_contribution(r["module_quantity_3"], r["nameplate_capacity_module_3"]),
    axis=1,
)

# Aggregate DC contributions to system level
dc_by_system = (
    module_raw
    .groupby("tts_link_id")[["dc_1", "dc_2", "dc_3"]]
    .sum()
    .reset_index()
)

# Total DC per system
dc_by_system["dc_total"] = dc_by_system[["dc_1", "dc_2", "dc_3"]].sum(axis=1)

# Count number of contributing module entries per system
dc_by_system["n_contributors"] = (
    (dc_by_system[["dc_1", "dc_2", "dc_3"]] > 0).sum(axis=1)
)

# Compute dominance share (max contributor / total DC)
dc_by_system["dc_max_share"] = (
    dc_by_system[["dc_1", "dc_2", "dc_3"]].max(axis=1)
    / dc_by_system["dc_total"]
)

# Restrict to multi-module systems only
multi_module_systems = dc_by_system.loc[
    (dc_by_system["n_contributors"] >= 2) & (dc_by_system["dc_total"] > 0),
    "dc_max_share"
]

# Summarize dominance distribution
dominance_summary = multi_module_systems.describe(
    percentiles=[0.5, 0.75, 0.9, 0.95]
)

dominance_summary


count    55005.000000
mean         0.674973
std          0.092886
min          0.333333
50%          0.677342
75%          0.738243
90%          0.792008
95%          0.822539
max          0.996844
Name: dc_max_share, dtype: float64

#### Summary: Dominance vs Mixture in Module Configuration

Analysis of DC capacity contributions across multi-module systems indicates that
module configuration is typically **mixed rather than dominated by a single
module type**.

Among systems reporting two or more valid module entries:

- The median dominant module contributes approximately **68%** of total DC
  capacity.
- Only **5%** of systems exhibit dominance above approximately **82%**.
- A substantial lower tail exists, with some systems exhibiting near-balanced
  capacity splits across module entries.

These results indicate that while a largest module configuration is often
present, dominance is generally **insufficient to justify collapsing module
configuration to a single representative entry** without loss of structural
information.

Accordingly, module mixture constitutes a **structural characteristic** of
systems and must be preserved or explicitly summarized in any admissible
system-level representation.

This finding constrains admissible collapse rules for the module configuration
family and motivates the use of mixture-aware summaries rather than dominant-only
selection.


In [8]:
# Placeholder and Missingness Structure in Module Configuration
module_columns = [
    "module_quantity_1", "module_quantity_2", "module_quantity_3",
    "nameplate_capacity_module_1", "nameplate_capacity_module_2", "nameplate_capacity_module_3",
    "module_model_1", "module_model_2", "module_model_3",
    "module_manufacturer_1", "module_manufacturer_2", "module_manufacturer_3",
    "efficiency_module_1", "efficiency_module_2", "efficiency_module_3",
    "additional_modules",
]

placeholder_rows = []

for col in module_columns:
    s = module_raw[col]

    placeholder_rows.append({
        "column": col,
        "pct_null": s.isna().mean(),
        "pct_minus_one": (s == -1).mean() if s.dtype != "object" else (s == "-1").mean(),
        "pct_zero": (s == 0).mean() if s.dtype != "object" else 0.0,
    })

placeholder_summary = (
    pd.DataFrame(placeholder_rows)
    .sort_values(["pct_minus_one", "pct_null"], ascending=False)
    .reset_index(drop=True)
)

placeholder_summary



|   | column | pct_null | pct_minus_one | pct_zero |
|---|---|---|---|---|
| 0 | efficiency_module_3 | 6e-06 | 0.994334 | 0.0 |
| 1 | nameplate_capacity_module_3 | 8e-06 | 0.994332 | 0.0 |
| 2 | module_model_3 | 6e-06 | 0.993429 | 0.0 |
| 3 | module_quantity_3 | 6e-06 | 0.993083 | 0.0 |
| 4 | efficiency_module_2 | 6e-06 | 0.955276 | 0.0 |
| 5 | nameplate_capacity_module_2 | 2.9e-05 | 0.955255 | 0.0 |
| 6 | module_model_2 | 6e-06 | 0.947191 | 0.0 |
| 7 | module_quantity_2 | 6e-06 | 0.934872 | 0.0 |
| 8 | efficiency_module_1 | 6e-06 | 0.074791 | 0.0 |
| 9 | nameplate_capacity_module_1 | 0.000921 | 0.074074 | 0.0 |


In [9]:
# Check internal consistency: quantity and capacity placeholders should co-occur
consistency_checks = []

for i in [1, 2, 3]:
    qty_col = f"module_quantity_{i}"
    cap_col = f"nameplate_capacity_module_{i}"

    inconsistent = module_raw[
        (module_raw[qty_col] > 0) & (module_raw[cap_col] <= 0)
    ]

    consistency_checks.append({
        "entry_index": i,
        "n_inconsistent_rows": len(inconsistent),
        "pct_inconsistent": len(inconsistent) / len(module_raw),
    })

pd.DataFrame(consistency_checks)


|   | entry_index | n_inconsistent_rows | pct_inconsistent |
|---|---|---|---|
| 0 | 1 | 105400 | 0.054861 |
| 1 | 2 | 43896 | 0.022848 |
| 2 | 3 | 4808 | 0.002503 |


#### Summary: Placeholder and Missingness Structure in Module Configuration

Placeholder values in module-related columns exhibit a highly structured and
index-consistent pattern.

For indexed module entries:

- Fields associated with index `_1` are predominantly populated, with low rates
  of placeholder values.
- Fields associated with index `_2` are largely placeholder (`-1`), but are
  meaningfully populated for a non-trivial subset of systems.
- Fields associated with index `_3` are almost entirely placeholder, indicating
  that this index is rarely used in practice.

This pattern is consistent across module quantity, nameplate capacity, model,
manufacturer, and efficiency fields, indicating that placeholder values reflect
intentional padding rather than random missingness.

Cross-field consistency checks further show that rows reporting a positive module
quantity alongside a placeholder or otherwise non-positive nameplate capacity are
uncommon: about 5.5% of all raw rows for index `_1`, 2.3% for `_2`, and 0.25%
for `_3`.

Taken together, these results indicate that placeholder entries can be safely
excluded from collapse operations, and that remaining non-placeholder entries
provide a coherent basis for mixture-aware system-level summarization.


#### Module Configuration Collapse Rules

This subsection defines the rules used to collapse raw module configuration data
to a deterministic, system-level representation. These rules are derived
directly from the diagnostic analyses and are constrained by
observed reporting structure, dominance behavior, and placeholder consistency.

The collapse rules are designed to preserve genuine structural variation while
eliminating reporting artifacts and padding.

---

##### Rule 1 — Placeholder Exclusion

Module entries with placeholder or non-admissible values are excluded from all
collapse operations.

Specifically:
- Any indexed entry (`_1`, `_2`, `_3`) with a missing or non-positive
  `module_quantity` is treated as non-existent.
- Entries with missing or non-positive `nameplate_capacity_module` are excluded,
  even if quantity is reported.

This rule is justified by the highly structured placeholder patterns observed in
the placeholder and missingness diagnostics above, where `-1` values act as
intentional padding rather than ambiguous missingness.
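
The exclusion rule reduces to a boolean mask per slot. A minimal sketch for slot `_1` on invented values, covering the placeholder, null, and fully reported cases:

```python
import pandas as pd

toy = pd.DataFrame({
    "module_quantity_1":           [20, 15, -1, 10],
    "nameplate_capacity_module_1": [300.0, -1.0, 250.0, None],
})

qty = toy["module_quantity_1"]
cap = toy["nameplate_capacity_module_1"]

# An entry is admissible only when both quantity and capacity are present and
# positive. Comparisons against NaN are False, so nulls drop out automatically.
admissible = qty.gt(0) & cap.gt(0)

print(admissible.tolist())
```

Only the first row survives: the second has a placeholder capacity, the third a placeholder quantity, and the fourth a null capacity.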

---

##### Rule 2 — Mixture Preservation

Module configuration is treated as **composite rather than singular**.

Diagnostic analysis shows that:
- Multiple module entries are common.
- No single entry exhibits sufficiently strong dominance to justify exclusive
  selection.
- Balanced or moderately skewed mixtures are structurally prevalent.

Accordingly, collapse rules **do not select a single representative module
configuration**. Instead, mixture information is preserved via capacity-weighted
summaries.

---

##### Rule 3 — Capacity-Weighted Aggregation

For each system, the DC contribution of each admissible module entry is computed
as:

$\mathrm{DC}_i = \mathrm{module\_quantity}_i \times \mathrm{nameplate\_capacity\_module}_i$

These contributions are used as weights in all aggregate summaries.

This ensures that module configurations are summarized according to their
physical contribution to system capacity rather than raw counts or reporting
order.
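
Written out, and assuming $x_i$ denotes any per-entry attribute and $A$ the set of admissible entries of a system (both symbols are introduced here for illustration), a capacity-weighted summary takes the form:

$\bar{x} = \dfrac{\sum_{i \in A} \mathrm{DC}_i \, x_i}{\sum_{i \in A} \mathrm{DC}_i}$

When a system has a single admissible entry, $\bar{x}$ reduces to that entry's own value, so single-module systems are unaffected by the weighting.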

---

##### Rule 4 — System-Level Module Summary Variables

The following system-level module configuration variables are constructed:

- **Total module DC capacity**  
  Sum of DC contributions across all admissible module entries.

- **Effective module wattage (capacity-weighted mean)**  
  Capacity-weighted average of `nameplate_capacity_module`.

- **Effective module efficiency (capacity-weighted mean)**  
  Capacity-weighted average of `efficiency_module`, where available.

- **Module mixture count**  
  Number of distinct admissible module entries contributing positive DC capacity.

These variables jointly describe how system size is instantiated at the module
level while preserving mixture structure.
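
The following worked example illustrates Rules 3 and 4 on a single hypothetical system with two admissible module entries. The numbers are invented for illustration, and the variable names are not the ones used in the collapse cell.

```python
import numpy as np
import pandas as pd

# Hypothetical system with two admissible module entries.
entries = pd.DataFrame({
    "quantity": [10, 5],
    "nameplate_capacity": [300.0, 400.0],  # W per module
    "efficiency": [0.18, np.nan],          # second entry does not report efficiency
})

# Rule 3: DC contribution per entry, used as the weight everywhere below.
entries["dc_capacity"] = entries["quantity"] * entries["nameplate_capacity"]

total_dc = entries["dc_capacity"].sum()  # 3000 + 2000

# Capacity-weighted mean wattage: (3000*300 + 2000*400) / 5000
eff_wattage = np.average(entries["nameplate_capacity"], weights=entries["dc_capacity"])

# Efficiency is averaged only over the entries that report it.
has_eff = entries["efficiency"].notna()
eff_efficiency = np.average(
    entries.loc[has_eff, "efficiency"],
    weights=entries.loc[has_eff, "dc_capacity"],
)

mixture_count = int((entries["dc_capacity"] > 0).sum())

print(total_dc, eff_wattage, eff_efficiency, mixture_count)
```

Note that the capacity-weighted wattage (340 W) differs from total DC capacity divided by module count (5000 / 15, about 333 W), because larger modules carry more weight under Rule 3.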

---

##### Rule 5 — Categorical Attributes (Model and Manufacturer)

Categorical module attributes (`module_model`, `module_manufacturer`) are handled
as follows:

- The **dominant categorical value** is defined as the value associated with the
  largest DC contribution.
- If multiple categorical values contribute non-trivially, dominance is recorded
  but mixture count captures the presence of heterogeneity.

This approach avoids exploding configuration classes while still anchoring each
system to a physically meaningful reference module.
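
A small sketch of the dominance rule, assuming dominance is decided per entry as in the collapse cell. The `dominant_share` quantity is a hypothetical addition, shown only as one way to judge how strongly the dominant entry actually dominates.

```python
import pandas as pd

# Hypothetical admissible entries for one system.
entries = pd.DataFrame({
    "module_model": ["A-300", "B-400", "C-250"],
    "module_manufacturer": ["Acme", "Birch", "Cedar"],
    "dc_capacity": [3000.0, 4000.0, 1000.0],
})

# idxmax returns the first label attaining the maximum, so ties resolve
# to the earliest reported entry.
dominant = entries.loc[entries["dc_capacity"].idxmax()]

# Hypothetical helper quantity: share of total DC capacity held by the dominant entry.
dominant_share = dominant["dc_capacity"] / entries["dc_capacity"].sum()

print(dominant["module_model"], dominant["module_manufacturer"], dominant_share)
```

Here the dominant entry holds only half of the capacity, which is the kind of case where the mixture count carries information that the dominant label alone does not.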

---

##### Rule 6 — Admissibility and Scope

The resulting system-level module configuration variables:
- are defined at one row per system (`tts_link_id`),
- preserve genuine mixture where it exists,
- exclude reporting artifacts and padding, and
- are admissible inputs for subsequent analysis of structural degrees of freedom
  at fixed size.

No inference, comparison, or equivalence is defined at this stage. The output of
this collapse step serves solely as input to later phases.


In [10]:

# Helper: extract admissible module entries per row

MODULE_SLOTS = [1, 2, 3]

records = []

for _, row in module_raw.iterrows():
    system_id = row["tts_link_id"]

    for i in MODULE_SLOTS:
        qty = row[f"module_quantity_{i}"]
        cap = row[f"nameplate_capacity_module_{i}"]
        eff = row[f"efficiency_module_{i}"]
        model = row[f"module_model_{i}"]
        mfr = row[f"module_manufacturer_{i}"]

        # Admissibility rule
        if pd.notna(qty) and qty > 0 and pd.notna(cap) and cap > 0:
            dc = qty * cap

            records.append({
                "tts_link_id": system_id,
                "slot": i,
                "quantity": qty,
                "nameplate_capacity": cap,
                "dc_capacity": dc,
                "efficiency": eff if pd.notna(eff) and eff > 0 else np.nan,
                "module_model": model if model not in (-1, "-1") else np.nan,
                "module_manufacturer": mfr if mfr not in (-1, "-1") else np.nan,
            })

modules_long = pd.DataFrame.from_records(records)


# Guard: every row contributes positive DC capacity

assert (modules_long["dc_capacity"] > 0).all()


# Collapse to system-level (capacity-weighted)


def collapse_system(df):
    """Collapse all admissible module entries of one system into a single row."""
    total_dc = df["dc_capacity"].sum()

    # Capacity-weighted summaries
    eff_wt = np.average(
        df["efficiency"].dropna(),
        weights=df.loc[df["efficiency"].notna(), "dc_capacity"]
    ) if df["efficiency"].notna().any() else np.nan

    cap_wt = np.average(
        df["nameplate_capacity"],
        weights=df["dc_capacity"]
    )

    # Dominant module (by DC contribution)
    dominant_idx = df["dc_capacity"].idxmax()
    dominant_model = df.loc[dominant_idx, "module_model"]
    dominant_mfr = df.loc[dominant_idx, "module_manufacturer"]

    return pd.Series({
        "module_total_dc_capacity": total_dc,
        "module_effective_nameplate_capacity": cap_wt,
        "module_effective_efficiency": eff_wt,
        "module_mixture_count": df.shape[0],
        "dominant_module_model": dominant_model,
        "dominant_module_manufacturer": dominant_mfr,
    })

module_system_level = (
    modules_long
    .groupby("tts_link_id")
    .apply(collapse_system, include_groups=False)
    .reset_index()
)



# Final invariants


assert module_system_level["tts_link_id"].is_unique
assert (module_system_level["module_total_dc_capacity"] > 0).all()

module_system_level.shape, module_system_level.head()


((122343, 7),
              tts_link_id  module_total_dc_capacity  \
 0                     -1              1.342083e+10   
 1     tts_extension_id_1              3.395553e+08   
 2    tts_extension_id_10              6.990000e+03   
 3   tts_extension_id_100              1.922000e+04   
 4  tts_extension_id_1000              4.410000e+03   
 
    module_effective_nameplate_capacity  module_effective_efficiency  \
 0                           328.106101                     0.187158   
 1                           325.695012                     0.177641   
 2                           304.785408                     0.175942   
 3                           260.489074                     0.160549   
 4                           315.000000                     0.196875   
 
    module_mixture_count dominant_module_model  \
 0               1573753         TP672P(H)-320   
 1                 12709       SPR-E20-435-COM   
 2                     2      REC255PE-US(BLK)   
 3                  

#### Module Configuration Collapse — Structural Materialization

##### Purpose

This section materializes the module configuration family by collapsing raw,
multi-row module disclosures into a single system-level representation per
`tts_link_id`.

The objective is structural rather than inferential. The collapse produces a
deterministic mapping from raw module disclosures to canonical physical
configuration attributes that can be evaluated once system size is held fixed.

This step is a prerequisite for determining which module-related dimensions are
structurally constrained versus which constitute genuine degrees of freedom at
fixed size.

---

##### Input Scope

The collapse operates exclusively on module-related raw columns, including module
quantities, nameplate capacities, efficiencies, models, manufacturers, and
explicit indicators of additional modules. Placeholder values (for example `-1`)
are treated as explicit missingness rather than valid physical values.

---

##### Collapse Strategy

Each system may report up to three distinct module groups. The collapse proceeds
by identifying valid module groups, aggregating their DC capacity contributions,
computing capacity-weighted effective characteristics, identifying the dominant
module group by DC contribution, and recording module mixture complexity.

All operations preserve system grain. Each `tts_link_id` is represented by
exactly one row after collapse.
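
The collapsed output earlier in this section contains a row keyed by `tts_link_id = -1` with a mixture count above 1.5 million, which suggests that `-1` may be acting as a placeholder identifier rather than as a single system. If that reading is correct, one option is to drop those records before grouping. The sketch below shows this on toy data; the frame, its values, and the decision to exclude are all assumptions, not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical long-format module entries; -1 and "-1" stand in for the
# placeholder identifier seen in the collapsed output.
modules_long_toy = pd.DataFrame({
    "tts_link_id": [-1, "sys_a", "sys_a", "-1", "sys_b"],
    "dc_capacity": [5000.0, 3000.0, 2000.0, 7000.0, 4000.0],
})

# Compare on the string form so that both encodings of the placeholder match.
is_placeholder_id = modules_long_toy["tts_link_id"].astype(str) == "-1"
identified = modules_long_toy[~is_placeholder_id]

collapsed = (
    identified
    .groupby("tts_link_id", as_index=False)
    .agg(
        module_total_dc_capacity=("dc_capacity", "sum"),
        module_mixture_count=("dc_capacity", "size"),
    )
)

# System grain: exactly one row per identifier.
assert collapsed["tts_link_id"].is_unique
print(collapsed)
```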


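The DC contribution of module type $i$ is defined as  
$ \text{dc}_i = \text{module\_quantity}_i \times \text{nameplate\_capacity\_module}_i $.  

Only module slots with $ \text{module\_quantity}_i > 0 $ and $ \text{nameplate\_capacity\_module}_i > 0 $ are considered valid contributors, consistent with Rule 1. All sums below run over valid contributors only.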

The total DC capacity is defined as  
$ \text{module\_total\_dc\_capacity}
= \sum_i (\text{module\_quantity}_i \times \text{nameplate\_capacity\_module}_i) $.  

This quantity represents the aggregate physical DC capacity instantiated through all reported module components.

The effective nameplate capacity is defined as the DC-weighted average module wattage:  
$ \text{module\_effective\_nameplate\_capacity}
= \frac{\sum_i (\text{dc}_i \times \text{nameplate\_capacity\_module}_i)}{\sum_i \text{dc}_i} $.  

This captures the representative module wattage actually contributing to system capacity.

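The effective module efficiency is defined as the DC-weighted average efficiency, taken over the set $E$ of valid entries that report a positive efficiency:  
$ \text{module\_effective\_efficiency}
= \frac{\sum_{i \in E} (\text{dc}_i \times \text{efficiency\_module}_i)}{\sum_{i \in E} \text{dc}_i} $.  

If no valid entry reports an efficiency, the value is left missing. This summarizes the reported efficiency of the installed module configuration.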

The module mixture count is defined as the number of valid module entries, that is, entries with strictly positive quantity and nameplate capacity:  
$ \text{module\_mixture\_count}
= \sum_i \mathbb{1}(\text{module\_quantity}_i > 0 \,\wedge\, \text{nameplate\_capacity\_module}_i > 0) $.  

This measures whether the system uses a single homogeneous module type or a mixture of multiple module types.

The dominant module is defined as the module type contributing the largest share of DC capacity:  
$ \text{dominant module} = \arg\max_i (\text{dc}_i) $.  

The dominant module's manufacturer and model are used as the representative physical realization of the system's module configuration. Efficiency is not taken from the dominant module; it is summarized by the capacity-weighted mean defined above.


##### Collapsed Outputs

The collapse produces a canonical system-level representation consisting of:

- total module DC capacity
- effective per-module nameplate capacity
- effective module efficiency
- module mixture count
- dominant module model and manufacturer

At this stage, the collapsed representation exists in memory only. No artifacts
are sealed until all configuration families have been materialized and validated.

---

##### Role in Structural Degrees of Freedom Analysis

This collapsed representation enables evaluation of which aspects of module
configuration are empirically constrained once size is fixed and which vary
freely.

This determination governs whether each dimension participates in configuration
equivalence or is treated as a structural degree of freedom in downstream
analysis.


### 5.3 Inverter Configuration Family

This section extends the structural analysis performed for the module configuration
family to inverter configurations. The objective is to determine how inverter
capacity and composition vary **once system size is held fixed**, and to materialize
a deterministic system-level representation of inverter configuration suitable for
structural comparison.

As in the module configuration family, this analysis proceeds under the constraint
that system size (`system_size_kw`) is treated as given and non-variable. All
variation examined here is conditional on fixed size or narrow size bins.

---

#### Analytical Steps

The inverter configuration analysis follows the same disciplined sequence used for
module configurations, adapted to the structure of inverter data:

1. **Raw Data Materialization**  
   Load raw inverter-related fields at their native grain, without aggregation or
   inference. This includes inverter quantities, capacities, models, manufacturers,
   and configuration flags.

2. **Slot-Level Enumeration**  
   Identify the number of inverter slots present per system and assess whether
   systems exhibit single-inverter or multi-inverter configurations.

3. **Descriptive Diagnostics**  
   For each inverter-related field:
   - examine distributions (count, mean, dispersion where applicable),
   - measure missingness and placeholder usage (nulls, zeros, sentinel values),
   - and assess internal consistency across related fields.

4. **Dominance vs. Mixture Assessment**  
   Determine whether inverter configurations are typically dominated by a single
   inverter type or represent meaningful mixtures, using capacity-weighted and
   count-based diagnostics as appropriate.

5. **Collapse Rule Definition**  
   Based on empirical diagnostics, define deterministic collapse rules that map raw
   inverter slot data to system-level structural quantities. These rules are designed
   to be:
   - invariant to reporting multiplicity,
   - independent of system size,
   - and empirically justified.

6. **System-Level Configuration Materialization**  
   Apply the collapse rules to produce a system-level inverter configuration
   representation. No artifacts are written until the full inverter family collapse
   is complete.

---

#### Outcome

The result of this section will be a coherent system-level representation of inverter
configuration that captures:
- total inverter capacity,
- effective inverter characteristics,
- configuration heterogeneity,
- and dominant inverter attributes,

while preserving one row per `tts_link_id` and maintaining strict separation between
structural description and downstream scaling or risk analysis.


In [11]:
# Step 1 — Raw Data Materialization (Inverter Configuration Family)
# Load raw inverter-related fields at native grain without aggregation or inference.

RAW_DATA_PATH = Path(os.environ["TRACKING_THE_SUN_DATA"])

inverter_raw = pd.read_parquet(
    RAW_DATA_PATH,
    columns=[
        "tts_link_id",

        # Inverter identity and quantity
        "inverter_manufacturer_1",
        "inverter_manufacturer_2",
        "inverter_manufacturer_3",
        "inverter_model_1",
        "inverter_model_2",
        "inverter_model_3",
        "inverter_quantity_1",
        "inverter_quantity_2",
        "inverter_quantity_3",
        "additional_inverters",

        # Electrical characteristics
        "output_capacity_inverter_1",
        "output_capacity_inverter_2",
        "output_capacity_inverter_3",
        "inverter_loading_ratio",

        # Inverter topology and features
        "micro_inverter_1",
        "micro_inverter_2",
        "micro_inverter_3",
        "built_in_meter_inverter_1",
        "built_in_meter_inverter_2",
        "built_in_meter_inverter_3",
        "dc_optimizer",
    ],
    engine="fastparquet",
)

inverter_raw.shape, inverter_raw.head()


((1921220, 22),
   tts_link_id      inverter_manufacturer_1 inverter_manufacturer_2  \
 0          -1                           -1                      -1   
 1          -1  SolarEdge Technologies Ltd.                      -1   
 2          -1                           -1                      -1   
 3          -1                          ABB                      -1   
 4          -1                           -1                      -1   
 
   inverter_manufacturer_3         inverter_model_1 inverter_model_2  \
 0                      -1                       -1               -1   
 1                      -1            SE9KUS [208V]               -1   
 2                      -1                       -1               -1   
 3                      -1  PVI-5000-OUTD-US [240V]               -1   
 4                      -1                       -1               -1   
 
   inverter_model_3  inverter_quantity_1  inverter_quantity_2  \
 0               -1                 -1.0                 

#### Observations from Raw Inverter Data Materialization

The raw inverter data materialization reveals several structural characteristics
that motivate the subsequent analytical steps.

First, inverter-related fields are heavily encoded using placeholder values,
most commonly `-1`, across manufacturers, models, quantities, and capacities.
This indicates that the absence of an inverter configuration is frequently
represented explicitly rather than via missing values (`NaN`).

Second, inverter configuration is reported across multiple indexed slots
(`_1`, `_2`, `_3`), but preliminary inspection shows that these slots are
sparsely populated beyond the first index. This suggests that while the data
schema allows for multiple inverter configurations per system, true
multi-inverter systems may be relatively uncommon or inconsistently reported.

Third, the presence of both slot-indexed quantities (e.g. `inverter_quantity_i`)
and system-level indicators (e.g. `additional_inverters`, `dc_optimizer`,
`inverter_loading_ratio`) indicates that inverter configuration cannot be
interpreted through any single column. Instead, inverter-related fields must be
evaluated jointly to distinguish genuine structural variation from reporting
artifacts.

Finally, inverter topology indicators (e.g. microinverter flags, built-in meter
flags) are interspersed with placeholder values and nulls, reinforcing the need
to explicitly diagnose structural presence and consistency before defining any
collapse or dominance logic.

Taken together, these observations confirm that individual inverter-related
columns are not semantically meaningful in isolation. A structured analysis
beginning with slot enumeration is therefore required to establish how inverter
configurations are instantiated across systems.
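
One way to make the placeholder encoding explicit before any diagnostics is to map the `-1` sentinel to a true missing value. The helper below is a hypothetical sketch, not part of the notebook's pipeline, and assumes the sentinel appears as the number `-1` in numeric columns and as the string `"-1"` elsewhere.

```python
import pandas as pd

def mask_placeholders(df):
    """Return a copy of df with the -1 / "-1" sentinel replaced by NaN."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        # Assumption: numeric columns use -1, text columns use "-1".
        sentinel = -1 if pd.api.types.is_numeric_dtype(s) else "-1"
        out[col] = s.mask(s == sentinel)
    return out

# Toy frame with hypothetical values.
raw = pd.DataFrame({
    "inverter_manufacturer_1": ["-1", "ABB"],
    "inverter_quantity_1": [-1.0, 2.0],
})

clean = mask_placeholders(raw)
print(clean.isna())
```

Working on a masked copy keeps the raw frame intact, so placeholder rates can still be measured separately from genuine nulls.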


In [12]:
# Step 2 — Slot Enumeration and Structural Presence (Inverter Configuration Family)
# Determine how many inverter slots are populated per system.

# Define what counts as a valid inverter quantity
def is_valid_inverter_quantity(x):
    return pd.notna(x) and x > 0

# Flag valid inverter quantities per slot.
# Vectorized equivalent of applying is_valid_inverter_quantity element-wise:
# comparisons against NaN evaluate to False, so `> 0` also rejects nulls.
for slot in (1, 2, 3):
    inverter_raw[f"invq_{slot}_valid"] = inverter_raw[f"inverter_quantity_{slot}"] > 0

# Count number of populated inverter slots per system
inverter_slot_counts = (
    inverter_raw
    .groupby("tts_link_id")[["invq_1_valid", "invq_2_valid", "invq_3_valid"]]
    .any()
    .sum(axis=1)
    .rename("n_inverter_slots")
)

# Summarize distribution of inverter slot counts
inverter_slot_summary = (
    inverter_slot_counts
    .value_counts()
    .sort_index()
    .rename("n_systems")
    .to_frame()
)

inverter_slot_summary["pct_systems"] = (
    inverter_slot_summary["n_systems"] / inverter_slot_summary["n_systems"].sum()
)

inverter_slot_summary


| n_inverter_slots | n_systems | pct_systems |
|---:|---:|---:|
| 0 | 108 | 0.000877 |
| 1 | 52160 | 0.423449 |
| 2 | 62596 | 0.508171 |
| 3 | 8315 | 0.067503 |


#### Observations from Inverter Slot Enumeration

The distribution of populated inverter slots per system reveals several important
structural properties of inverter configuration in the dataset.

The vast majority of systems report at least one populated inverter slot.
Systems with zero populated inverter slots are extremely rare, accounting for
approximately 0.09% of all systems. These cases likely reflect incomplete or
anomalous reporting rather than physically inverter-less systems.

Single-slot configurations account for roughly 42% of systems. These systems
appear structurally simple, with inverter capacity concentrated in a single
reported configuration slot, although that slot may still hold more than one
physical inverter.

Multi-slot configurations are common. Systems with two populated inverter
slots represent the largest group, comprising approximately 51% of systems.
An additional 6–7% of systems report three populated inverter slots. Together,
roughly 58% of systems populate more than one slot, so multi-slot reporting is
the norm rather than the exception.

However, the presence of multiple populated slots does not, by itself, imply
meaningful inverter mixtures. Indexed slots may reflect parallel inverters,
phased installations, or reporting artifacts. Capacity dominance and internal
consistency must therefore be assessed before drawing conclusions about
structural heterogeneity.

These results establish that inverter configuration frequently involves multiple
reported slots, justifying the need for subsequent diagnostics on placeholder
usage, dominance versus mixture behavior, and family-level collapse rules.


In [13]:
# Step 3 — Descriptive Diagnostics:
# Assess placeholder usage, missingness, and zero values in inverter-related columns

inverter_columns = [
    # Identity and quantity
    "inverter_manufacturer_1", "inverter_manufacturer_2", "inverter_manufacturer_3",
    "inverter_model_1", "inverter_model_2", "inverter_model_3",
    "inverter_quantity_1", "inverter_quantity_2", "inverter_quantity_3",

    # Electrical capacity
    "output_capacity_inverter_1", "output_capacity_inverter_2", "output_capacity_inverter_3",
    "inverter_loading_ratio",

    # Technology / features
    "micro_inverter_1", "micro_inverter_2", "micro_inverter_3",
    "built_in_meter_inverter_1", "built_in_meter_inverter_2", "built_in_meter_inverter_3",
    "dc_optimizer",
]

placeholder_rows = []

for col in inverter_columns:
    s = inverter_raw[col]

    placeholder_rows.append({
        "column": col,
        "pct_null": s.isna().mean(),
        "pct_minus_one": (
            (s == -1).mean() if s.dtype != "object"
            else (s == "-1").mean()
        ),
        "pct_zero": (
            (s == 0).mean() if s.dtype != "object"
            else 0.0
        ),
    })

inverter_placeholder_summary = (
    pd.DataFrame(placeholder_rows)
    .sort_values(["pct_minus_one", "pct_null"], ascending=False)
    .reset_index(drop=True)
)

inverter_placeholder_summary


|  | column | pct_null | pct_minus_one | pct_zero |
|---:|:---|---:|---:|---:|
| 0 | output_capacity_inverter_3 | 0.004491 | 0.992675 | 0.0 |
| 1 | micro_inverter_3 | 0.007326 | 0.992674 | 0.0 |
| 2 | built_in_meter_inverter_3 | 0.007326 | 0.992674 | 0.0 |
| 3 | inverter_manufacturer_3 | 6e-06 | 0.992594 | 0.0 |
| 4 | inverter_model_3 | 6e-06 | 0.992594 | 0.0 |
| 5 | inverter_quantity_3 | 6e-06 | 0.991702 | 0.0 |
| 6 | output_capacity_inverter_2 | 0.041011 | 0.932802 | 0.0 |
| 7 | micro_inverter_2 | 0.067208 | 0.932792 | 0.0 |
| 8 | built_in_meter_inverter_2 | 0.067208 | 0.932792 | 0.0 |
| 9 | inverter_manufacturer_2 | 6e-06 | 0.932409 | 0.0 |


#### Step 3 — Interpretation: Placeholder Usage & Reporting Structure (Inverter Family)

This step interprets placeholder usage, null patterns, and reporting structure
across inverter-related columns to determine which dimensions are structurally
admissible for collapse and which must be excluded.

---

##### 1. Indexed inverter slots (`_1`, `_2`, `_3`) are structurally ordered

The inverter configuration fields exhibit a strong reporting hierarchy:

- Slot `_1` represents the primary inverter configuration.
- Slot `_2` represents secondary configurations.
- Slot `_3` represents rare tail cases.

These slots are **not symmetric** and must not be treated as interchangeable.

Evidence:
- Slot 3 fields show >99% placeholder values (`-1`)
- Slot 2 fields show ~93% placeholder values
- Slot 1 fields show low placeholder rates for identity and quantity

This confirms that indexed inverter slots encode reporting priority rather than
parallel configuration categories.

---

##### 2. Inverter quantity is reliably reported; inverter capacity is not

- `inverter_quantity_1` has ~1% placeholder usage and is highly reliable.
- `output_capacity_inverter_1` has ~69% null values.
- Output capacity fields for slots 2 and 3 are almost entirely missing or placeholders.

Interpretation:
- The dataset reliably records **how many inverters** are installed.
- It does **not** reliably record per-inverter AC capacity.

Any collapse rule depending on inverter output capacity would therefore introduce
systematic bias and is excluded.

---

##### 3. Technology flags contain usable but asymmetric signal

- `dc_optimizer`:
  - ~69% zero
  - ~31% positive
  - negligible placeholder usage

This indicates genuine variation rather than reporting noise.

- `micro_inverter_*`:
  - Slot 1 is mostly null (not placeholder)
  - Slots 2–3 are mostly placeholders

Interpretation:
- Micro-inverter presence is selectively reported.
- Absence is inconsistently encoded (null vs `-1`).
- These fields are usable only as **system-level binary indicators**, not slot-level
  configuration variables.

---

##### 4. Inverter loading ratio is conditionally informative

- `inverter_loading_ratio` has:
  - ~11% placeholder usage
  - ~0% null values

This variable is well-reported when applicable, but represents a **derived
performance descriptor**, not a physical configuration identity.

It is retained for descriptive analysis but excluded from configuration
equivalence definitions.

---

##### Structural Implications

Based on placeholder usage and reporting structure:

**Admissible for collapse**
- Inverter count (from quantity fields)
- Presence indicators:
  - micro-inverters
  - DC optimizers
- Dominant inverter identity (manufacturer / model from slot 1)

**Not admissible for collapse**
- Per-inverter output capacity
- Any rule assuming symmetry across `_1`, `_2`, `_3`
- Derived ratios such as inverter loading ratio

---



#### Step 4 — Internal Consistency & Co-Occurrence Diagnostics (Inverter Family)

This step evaluates whether multi-slot inverter reports represent **genuine
physical configuration mixtures** or **reporting artifacts**. The objective is
to determine how inverter-related fields should be collapsed without introducing
spurious structural variation.

At this stage, **no collapse is performed**. This step is purely diagnostic.

---

##### Analytical Question

When multiple inverter slots (`_1`, `_2`, `_3`) are populated for a system:

- Do they represent distinct inverter types installed together?
- Or do they reflect reporting noise, phased updates, or partial records?

The answer determines whether inverter configuration should be collapsed as:
- a mixture-aware representation, or
- a dominant / primary configuration only.

---

##### Diagnostic Dimensions Examined

This step examines **co-occurrence patterns** across the following dimensions:

1. **Quantity consistency**
   - Whether multiple inverter slots report positive quantities simultaneously.
   - Whether reported quantities sum to plausible system-level counts.

2. **Identity consistency**
   - Whether multiple slots report distinct manufacturers or models.
   - Whether secondary slots repeat the same identity as slot 1.

3. **Technology alignment**
   - Co-occurrence of:
     - micro-inverter flags
     - DC optimizer presence
   - Whether these flags align logically with reported inverter quantities.

4. **Structural plausibility**
   - Whether observed combinations correspond to physically plausible inverter
     configurations (e.g. mixed string + micro-inverter systems vs reporting noise).

---

##### Interpretation Framework

Observed multi-slot inverter configurations are classified into one of three
structural regimes:

1. **Single-configuration systems**
   - Only slot 1 populated.
   - All secondary slots empty or placeholders.

2. **Dominant-plus-auxiliary systems**
   - Slot 1 carries the majority of inverter quantity.
   - Secondary slots contribute marginally or redundantly.

3. **True mixed-inverter systems**
   - Multiple slots populated with nontrivial quantities and distinct identities.
   - Represents genuine structural variation at fixed size.

Only regime (3) qualifies as a **structural degree of freedom** for configuration
equivalence.

---

##### Output of This Step

The output of Step 4 is a **diagnostic classification**, not a final variable:

- Identification of which inverter fields:
  - can be meaningfully collapsed jointly,
  - should be reduced to dominant representations, or
  - must be excluded from configuration equivalence entirely.

This classification directly informs the collapse rules defined in the next step.
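
The three regimes could be assigned along the following lines. This is a sketch under stated assumptions: it presumes one row per system, treats "nontrivial" as a slot-1 quantity share below a hypothetical threshold, and adds an `unreported` label for systems with no populated slot. None of these choices is fixed by the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), one row per system; -1 / "-1" are placeholders.
toy = pd.DataFrame({
    "tts_link_id": ["s1", "s2", "s3", "s4"],
    "inverter_quantity_1": [1.0, 10.0, 4.0, -1.0],
    "inverter_quantity_2": [-1.0, 1.0, 4.0, -1.0],
    "inverter_quantity_3": [-1.0, -1.0, -1.0, -1.0],
    "inverter_model_1": ["M-A", "M-A", "M-A", "-1"],
    "inverter_model_2": ["-1", "M-A", "M-B", "-1"],
    "inverter_model_3": ["-1", "-1", "-1", "-1"],
})

SLOTS = (1, 2, 3)
SLOT1_SHARE_THRESHOLD = 0.8  # hypothetical cut-off, not taken from the notebook

qty = toy[[f"inverter_quantity_{i}" for i in SLOTS]]
qty = qty.where(qty > 0, 0.0)  # placeholders and nulls contribute nothing

models = toy[[f"inverter_model_{i}" for i in SLOTS]]
models = models.where(qty.to_numpy() > 0)  # keep identities of populated slots only

n_slots = (qty > 0).sum(axis=1)
n_models = models.nunique(axis=1)  # NaN entries are ignored
slot1_share = qty["inverter_quantity_1"] / qty.sum(axis=1).where(n_slots > 0)

# Conditions are evaluated in order; the first match wins.
toy["regime"] = np.select(
    [
        n_slots == 0,
        n_slots == 1,  # the notebook's definition assumes this is slot 1
        (n_models > 1) & (slot1_share < SLOT1_SHARE_THRESHOLD),
    ],
    ["unreported", "single_configuration", "true_mixed"],
    default="dominant_plus_auxiliary",
)

print(toy[["tts_link_id", "regime"]])
```

In this toy frame, `s2` repeats the slot-1 model in slot 2 and is therefore treated as dominant-plus-auxiliary, while `s3` splits its quantity evenly across two distinct models and is classified as truly mixed.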

---



#### Step 5 — Inverter Configuration Collapse Definitions

This step formalizes how inverter-related raw fields are collapsed into a
system-level representation that captures the **physical realization of
inversion capacity**, while remaining robust to indexed reporting, placeholders,
and partial records.

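As with the module configuration family, individual inverter fields are not
interpretable in isolation. Indexed inverter entries (`_1`, `_2`, `_3`) may
represent parallel inverters, phased installations, or reporting artifacts.
Collapse therefore proceeds by aggregating **capacity-weighted contributions**
across inverter entries.

Because Step 3 found per-inverter output capacity to be sparsely reported, these
capacity-based quantities are only defined where `output_capacity_inverter_i` is
present and positive. Systems with no such entry receive a total AC capacity of
zero and no dominant inverter, which is best read as unreported capacity rather
than as a physical capacity of zero.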

##### Inverter Capacity Aggregation

The total inverter AC capacity of a system is defined as  

$ \text{inverter\_total\_ac\_capacity}
= \sum_i (\text{inverter\_quantity}_i \times \text{output\_capacity\_inverter}_i) $.

The AC contribution of inverter type $i$ is  

$ \text{ac}_i
= \text{inverter\_quantity}_i \times \text{output\_capacity\_inverter}_i $.

These quantities represent how much alternating-current capacity each inverter
entry contributes to the system.

##### Effective Inverter Characteristics

The effective inverter output capacity is defined as the AC-capacity-weighted
average of inverter ratings:  

$ \text{inverter\_effective\_output\_capacity}
= \frac{\sum_i (\text{ac}_i \times \text{output\_capacity\_inverter}_i)}
{\sum_i \text{ac}_i} $.

This quantity represents the characteristic inverter size governing system
behavior, even when multiple inverter types are present.
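
The collapse code shown for this step does not appear to materialize this quantity, so the sketch below indicates how it might be computed from wide-format slot columns. The toy values are hypothetical, and systems without any usable capacity are left as missing by assumption.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), two slots per system for brevity.
toy = pd.DataFrame({
    "inverter_quantity_1": [2.0, 1.0, 3.0],
    "inverter_quantity_2": [1.0, -1.0, 1.0],
    "output_capacity_inverter_1": [5.0, 7.6, np.nan],
    "output_capacity_inverter_2": [10.0, -1.0, -1.0],
})

qty = toy.filter(like="inverter_quantity_").to_numpy(dtype=float)
cap = toy.filter(like="output_capacity_inverter_").to_numpy(dtype=float)

# An entry contributes only when both quantity and capacity are positive.
valid = (qty > 0) & (cap > 0)
ac = np.where(valid, qty * cap, 0.0)

weighted = np.where(valid, ac * cap, 0.0).sum(axis=1)  # sum_i ac_i * cap_i
ac_total = ac.sum(axis=1)                              # sum_i ac_i

# Leave the result missing where no entry contributes any AC capacity.
effective = np.divide(
    weighted, ac_total,
    out=np.full_like(ac_total, np.nan),
    where=ac_total > 0,
)

print(effective)
```

For the first system, two 5-unit inverters and one 10-unit inverter each contribute 10 units of AC capacity, so the effective rating lands midway at 7.5.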

##### Inverter Mixture and Dominance

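The inverter mixture count captures how many distinct inverter entries contribute
nonzero capacity, that is, entries whose quantity and output capacity are both
reported and strictly positive:  

$ \text{inverter\_mixture\_count}
= \sum_i \mathbb{1}(\text{inverter\_quantity}_i > 0 \,\wedge\, \text{output\_capacity\_inverter}_i > 0) $.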

The dominant inverter is defined as the inverter entry contributing the largest
share of AC capacity:  

$ \text{dominant inverter}
= \arg\max_i (\text{ac}_i) $.

This identifies the inverter type that structurally defines the system’s
inversion behavior.
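
When a `tts_link_id` spans several raw rows, the arg max above can be taken over AC capacity summed per slot. The sketch below contrasts that with taking the most frequent row-level dominant slot. The toy values are hypothetical; they are chosen so that the two approaches disagree.

```python
import pandas as pd

# Toy data (hypothetical): system s1 spans three raw rows.
toy = pd.DataFrame({
    "tts_link_id": ["s1", "s1", "s1", "s2"],
    "ac_1": [5.0, 5.0, 0.0, 0.0],
    "ac_2": [0.0, 0.0, 30.0, 0.0],
    "ac_3": [0.0, 0.0, 0.0, 0.0],
})
ac_cols = ["ac_1", "ac_2", "ac_3"]

# Approach A: sum AC per slot within each system, then take the arg max.
ac_by_system = toy.groupby("tts_link_id")[ac_cols].sum()
dominant_slot = ac_by_system.idxmax(axis=1).where(ac_by_system.sum(axis=1) > 0)

# Approach B: dominant slot per raw row, then the most frequent value per system.
row_dominant = toy[ac_cols].idxmax(axis=1).where(toy[ac_cols].sum(axis=1) > 0)
row_mode = row_dominant.groupby(toy["tts_link_id"]).agg(
    lambda x: x.mode().iloc[0] if not x.mode().empty else None
)

print(dominant_slot["s1"], row_mode["s1"])
```

For `s1`, slot 2 carries 30 of the 40 AC units and wins under approach A, while slot 1 wins under approach B because it is dominant in two of the three rows. Systems with no AC contribution, such as `s2`, get no dominant slot under either approach.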

##### Technology Indicators

The presence of micro-inverters is defined as  

$ \text{has\_micro\_inverter}
= \mathbb{1}\!\left( \exists i \;\text{s.t.}\; \text{micro\_inverter}_i = 1 \right) $.

The presence of DC optimizers is defined as  

$ \text{has\_dc\_optimizer}
= \mathbb{1}(\text{dc\_optimizer} = 1) $.

These indicators capture discrete architectural choices that materially affect
system configuration independent of total system size.

##### Collapse Objective

The inverter configuration collapse produces a deterministic, system-level
representation that:

- preserves the dominant physical realization of inverter capacity,
- is defined at one row per system (`tts_link_id`),
- is robust to placeholder values and partial reporting, and
- does not introduce artificial variation across systems of identical size.

The resulting inverter configuration variables are admissible inputs for
subsequent identification of structural degrees of freedom at fixed system size.


In [14]:
# Step 5 — Inverter Configuration Collapse (System-Level)

# Slot column definitions


qty_cols = [
    "inverter_quantity_1",
    "inverter_quantity_2",
    "inverter_quantity_3",
]

cap_cols = [
    "output_capacity_inverter_1",
    "output_capacity_inverter_2",
    "output_capacity_inverter_3",
]

model_cols = [
    "inverter_model_1",
    "inverter_model_2",
    "inverter_model_3",
]

mfg_cols = [
    "inverter_manufacturer_1",
    "inverter_manufacturer_2",
    "inverter_manufacturer_3",
]

micro_cols = [
    "micro_inverter_1",
    "micro_inverter_2",
    "micro_inverter_3",
]



# Row-level AC contribution derivation

def ac_contribution(qty, cap):
    # Element-wise: qty * cap where both are strictly positive, else 0.0.
    # Comparisons against NaN are False, so nulls and -1 placeholders yield 0.0.
    return np.where((qty > 0) & (cap > 0), qty * cap, 0.0)

# Vectorized over whole columns instead of a row-wise apply
for slot in (1, 2, 3):
    inverter_raw[f"ac_{slot}"] = ac_contribution(
        inverter_raw[f"inverter_quantity_{slot}"],
        inverter_raw[f"output_capacity_inverter_{slot}"],
    )

ac_cols = ["ac_1", "ac_2", "ac_3"]

# Guard
missing = set(ac_cols) - set(inverter_raw.columns)
assert not missing, f"Missing AC contribution columns: {missing}"



# Row-level derived quantities


# Total AC per row
inverter_raw["ac_total_row"] = inverter_raw[ac_cols].sum(axis=1)

# Number of contributing inverter slots per row
inverter_raw["inverter_mixture_count_row"] = (inverter_raw[ac_cols] > 0).sum(axis=1)

# Dominant inverter slot per row
def dominant_slot(row):
    vals = row[ac_cols]
    if vals.sum() > 0:
        return vals.idxmax()
    return None

inverter_raw["dominant_inverter_slot"] = inverter_raw.apply(dominant_slot, axis=1)

# Presence flags per row
inverter_raw["has_micro_inverter_row"] = (inverter_raw[micro_cols] == 1).any(axis=1)
inverter_raw["has_dc_optimizer_row"] = inverter_raw["dc_optimizer"] == 1



# System-level aggregation

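inverter_system_level = (
    inverter_raw
    .groupby("tts_link_id", as_index=False)
    .agg(
        # Sum of AC contributions over all raw rows sharing the identifier
        inverter_total_ac_capacity=("ac_total_row", "sum"),
        # Largest number of contributing slots found in any single raw row
        inverter_mixture_count=("inverter_mixture_count_row", "max"),
        has_micro_inverter=("has_micro_inverter_row", "any"),
        has_dc_optimizer=("has_dc_optimizer_row", "any"),
        # Most frequent row-level dominant slot (mode over rows), which can
        # differ from the arg max of AC capacity summed per slot
        dominant_inverter_slot=(
            "dominant_inverter_slot",
            lambda x: x.mode().iloc[0] if not x.mode().empty else None
        ),
    )
)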



# Attach dominant inverter attributes

slot_map = {
    "ac_1": (model_cols[0], mfg_cols[0], cap_cols[0]),
    "ac_2": (model_cols[1], mfg_cols[1], cap_cols[1]),
    "ac_3": (model_cols[2], mfg_cols[2], cap_cols[2]),
}

lookup_cols = ["tts_link_id"] + model_cols + mfg_cols + cap_cols

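# NOTE: drop_duplicates keeps the first raw row per tts_link_id, so the
# attributes attached below come from that row, which is not necessarily
# the row that determined the dominant slot.
dominant_lookup = (
    inverter_raw[lookup_cols]
    .drop_duplicates("tts_link_id")
)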

inverter_system_level = inverter_system_level.merge(
    dominant_lookup,
    on="tts_link_id",
    how="left",
    validate="one_to_one",
)

def extract_dominant(row, idx):
    slot = row["dominant_inverter_slot"]
    # pd.isna covers both None and NaN, either of which can mark "no dominant slot"
    if pd.isna(slot):
        return None
    return row[slot_map[slot][idx]]

inverter_system_level["dominant_inverter_model"] = inverter_system_level.apply(
    lambda r: extract_dominant(r, 0), axis=1
)

inverter_system_level["dominant_inverter_manufacturer"] = inverter_system_level.apply(
    lambda r: extract_dominant(r, 1), axis=1
)

inverter_system_level["dominant_inverter_output_capacity"] = inverter_system_level.apply(
    lambda r: extract_dominant(r, 2), axis=1
)


# Cleanup & invariants

inverter_system_level = inverter_system_level.drop(
    columns=model_cols + mfg_cols + cap_cols,
    errors="ignore",
)

assert inverter_system_level["tts_link_id"].is_unique, (
    "Inverter collapse must yield one row per system."
)

assert (inverter_system_level["inverter_total_ac_capacity"] >= 0).all(), (
    "AC capacity must be non-negative."
)

inverter_system_level.shape, inverter_system_level.head()




((123179, 9),
              tts_link_id  inverter_total_ac_capacity  inverter_mixture_count  \
 0                     -1                   6113477.0                       3   
 1     tts_extension_id_1                    226292.0                       3   
 2    tts_extension_id_10                         0.0                       0   
 3   tts_extension_id_100                        12.0                       1   
 4  tts_extension_id_1000                         0.0                       0   
 
    has_micro_inverter  has_dc_optimizer dominant_inverter_slot  \
 0               False              True                   ac_1   
 1               False              True                   ac_1   
 2               False             False                   None   
 3               False             False                   ac_1   
 4               False             False                   None   
 
    dominant_inverter_model dominant_inverter_manufacturer  \
 0                       -1     
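
The first row of the output above carries `tts_link_id = -1`, which suggests that rows without a real link identifier may be grouped into a single pseudo-system during the collapse. The sketch below shows one way to separate such rows before aggregating. It assumes that `-1` is a placeholder rather than a genuine identifier; the toy frame and the names `SENTINEL_ID`, `toy`, `linked`, and `unlinked` are illustrative and not part of the notebook.

```python
import pandas as pd

SENTINEL_ID = -1  # assumption: -1 marks "no link id", not a real system

# Toy stand-in for inverter_raw (hypothetical values)
toy = pd.DataFrame({
    "tts_link_id": [-1, -1, "tts_extension_id_1", "tts_extension_id_1", "tts_extension_id_2"],
    "ac_total_row": [5.0, 7.0, 3.0, 3.0, 0.0],
})

# Compare as strings so the check works whether the id is stored as -1 or "-1"
is_sentinel = toy["tts_link_id"].astype(str) == str(SENTINEL_ID)
unlinked = toy[is_sentinel]   # rows with no usable system identifier
linked = toy[~is_sentinel]    # rows that can be collapsed per system

collapsed = linked.groupby("tts_link_id", as_index=False).agg(
    inverter_total_ac_capacity=("ac_total_row", "sum"),
)
```

Whether unlinked rows should be dropped, kept as individual systems, or reported separately is an analytical decision this sketch does not make.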

#### 5.4 Layout and Orientation Family

This section extends the structural analysis performed for the module and inverter
configuration families to **layout and orientation configurations**. The objective
is to determine how physical layout choices—such as tilt, azimuth, tracking, and
array orientation—vary **once system size is held fixed**, and to materialize a
deterministic system-level representation suitable for structural comparison.

As in prior configuration families, this analysis proceeds under the constraint
that system size (`system_size_kw`) is treated as given and non-variable. All
variation examined here is conditional on fixed size or narrow size bins, ensuring
that observed differences reflect genuine configuration choices rather than scale
effects.
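
As a minimal sketch of what conditioning on narrow size bins can look like, the example below bins a toy system-level frame by `system_size_kw` and counts distinct tilt values within each bin. The bin edges, the toy values, and the column `dominant_tilt` used here are assumptions for illustration only.

```python
import pandas as pd

# Toy system-level frame (hypothetical values)
systems = pd.DataFrame({
    "tts_link_id": ["a", "b", "c", "d", "e", "f"],
    "system_size_kw": [4.8, 5.1, 5.3, 9.7, 10.2, 10.4],
    "dominant_tilt": [18.0, 18.0, 30.0, 20.0, 20.0, 20.0],
})

# Assumed bin edges in kW; a real analysis would choose these empirically
size_edges = [0, 4.5, 5.5, 9.5, 10.5]
systems["size_bin"] = pd.cut(systems["system_size_kw"], bins=size_edges)

# Within-bin variation: distinct tilts among systems of near-equal size
within_bin = (
    systems.groupby("size_bin", observed=True)["dominant_tilt"]
    .nunique()
    .rename("n_distinct_tilts")
)
```

A bin with more than one distinct value indicates configuration variation that cannot be attributed to scale, since size is approximately constant inside the bin.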

---

##### Analytical Steps

The layout and orientation configuration analysis follows the same disciplined
sequence used for module and inverter configurations, adapted to the structure of
layout-related data:

1. **Raw Data Materialization**  
   Load raw layout and orientation fields at their native grain, without aggregation
   or inference. This includes tilt, azimuth, tracking indicators, array orientation,
   and any indexed or repeated layout fields.

2. **Configuration Slot Enumeration**  
   Identify the number of layout/orientation slots or reported configurations per
   system and assess whether systems exhibit single-orientation or multi-orientation
   layouts.

3. **Descriptive Diagnostics**  
   For each layout-related field:
   - examine distributions (count, mean, dispersion where applicable),
   - measure missingness and placeholder usage (nulls, zeros, sentinel values),
   - and assess internal consistency across related orientation fields.

4. **Dominance vs. Mixture Assessment**  
   Determine whether systems are typically characterized by a dominant layout
   orientation or represent meaningful mixtures (e.g., multiple azimuths or tilts),
   using count-based or capacity-weighted diagnostics as appropriate.

5. **Collapse Rule Definition**  
   Based on empirical diagnostics, define deterministic collapse rules that map raw
   layout and orientation data to system-level structural quantities. These rules are
   designed to be:
   - invariant to reporting multiplicity,
   - independent of system size,
   - and empirically justified.

6. **System-Level Configuration Materialization**  
   Apply the collapse rules to produce a system-level layout and orientation
   representation. No artifacts are written until the full layout/orientation family
   collapse is complete.
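
The requirement that collapse rules be invariant to reporting multiplicity can be checked mechanically. The sketch below repeats every raw row several times and confirms that a candidate collapse returns the same result. The function `collapse_mixture_count` and the two-slot toy frame are hypothetical stand-ins, not the notebook's final rule.

```python
import pandas as pd

def collapse_mixture_count(raw: pd.DataFrame) -> pd.Series:
    """Count slots that are valid in at least one row of each system."""
    valid = pd.DataFrame({
        f"slot_{i}": (raw[f"tilt_{i}"] > 0) & (raw[f"azimuth_{i}"] >= 0)
        for i in (1, 2)
    })
    return valid.groupby(raw["tts_link_id"]).any().sum(axis=1)

# Toy raw frame with two orientation slots (hypothetical values)
toy = pd.DataFrame({
    "tts_link_id": ["a", "b"],
    "tilt_1": [20.0, 15.0], "azimuth_1": [180.0, 90.0],
    "tilt_2": [-1.0, 25.0], "azimuth_2": [-1.0, 270.0],
})

# Simulate reporting multiplicity by repeating every row three times
repeated = pd.concat([toy] * 3, ignore_index=True)

once = collapse_mixture_count(toy)
thrice = collapse_mixture_count(repeated)
invariant = once.equals(thrice)
```

A sum-based collapse would fail this check, because repeated rows would inflate the total; `any`, `max`, and `nunique` style rules pass it.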

---

##### Outcome

The result of this section will be a coherent system-level representation of layout
and orientation configuration that captures:
- dominant and mixed orientation behavior,
- effective tilt and azimuth characteristics,
- tracking versus fixed configurations,
- and structural layout heterogeneity,

while preserving one row per `tts_link_id` and maintaining strict separation between
structural description and downstream scaling, regime, or risk analysis.


In [15]:
# Step 1 — Raw Data Materialization 

RAW_DATA_PATH = Path(os.environ["TRACKING_THE_SUN_DATA"])

layout_raw = pd.read_parquet(
    RAW_DATA_PATH,
    columns=[
        "tts_link_id",

        # Orientation & geometry (slot-based)
        "tilt_1", "tilt_2", "tilt_3",
        "azimuth_1", "azimuth_2", "azimuth_3",

        # Mounting / tracking configuration
        "tracking",
        "ground_mounted",

        # Optional contextual geometry flags
        "new_construction",
    ],
    engine="fastparquet",
)

layout_raw.shape, layout_raw.head()

((1921220, 10),
   tts_link_id  tilt_1  tilt_2  tilt_3  azimuth_1  azimuth_2  azimuth_3  \
 0          -1    -1.0    -1.0    -1.0       -1.0       -1.0       -1.0   
 1          -1     8.0    -1.0    -1.0        0.0       -1.0       -1.0   
 2          -1    -1.0    -1.0    -1.0       -1.0       -1.0       -1.0   
 3          -1    18.0    -1.0    -1.0      263.0       -1.0       -1.0   
 4          -1    -1.0    -1.0    -1.0       -1.0       -1.0       -1.0   
 
    tracking  ground_mounted  new_construction  
 0        -1              -1                -1  
 1         0               0                -1  
 2        -1              -1                -1  
 3         0               0                -1  
 4        -1              -1                -1  )

#### Step 1 — Observations from Raw Layout & Orientation Materialization

The raw layout and orientation dataset has been successfully materialized at
native grain, yielding **1,921,220 rows and 10 columns**. This confirms that
layout-related information is reported at the same raw multiplicity as other
configuration families and is subject to the same reporting artifacts.

##### Slot-Based Orientation Structure

Orientation is reported implicitly through **indexed slot fields**:

- Tilt: `tilt_1`, `tilt_2`, `tilt_3`
- Azimuth: `azimuth_1`, `azimuth_2`, `azimuth_3`

Key observations:

- Most systems populate **only the first slot** (`_1`), with `_2` and `_3`
  overwhelmingly set to placeholder values (`-1`).
- This mirrors the structural pattern observed in both module and inverter
  families, where multiple indexed fields exist but rarely represent distinct,
  equally weighted components.
- Slot multiplicity therefore reflects **potential heterogeneity**, not
  guaranteed complexity.

##### Placeholder Encoding & Missingness

The raw output shows heavy use of sentinel values:

- `-1` is consistently used to denote:
  - missing orientation values,
  - non-applicable slots,
  - or unreported configuration.
- This applies uniformly across:
  - tilt values,
  - azimuth values,
  - tracking flags,
  - ground-mounted indicators,
  - and construction context flags.

The rows shown suggest that missing data is encoded through sentinel values
rather than explicit nulls, so any analysis must treat `-1` as a semantic
placeholder rather than a valid measurement.
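
To illustrate why the sentinel must not be read as a measurement, the sketch below compares a naive mean with one computed after mapping `-1` to `NaN`. The toy values mirror the shape of the raw output but are invented, and the names `SENTINEL`, `toy`, and `cleaned` are illustrative.

```python
import numpy as np
import pandas as pd

SENTINEL = -1  # placeholder code seen in the raw output

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "tilt_1": [-1.0, 8.0, 18.0],
    "azimuth_1": [-1.0, 0.0, 263.0],
    "tracking": [-1, 0, 0],
})

# Replace the sentinel with NaN so it cannot leak into means or counts
cleaned = toy.replace(SENTINEL, np.nan)

naive_mean_tilt = toy["tilt_1"].mean()      # includes the -1 placeholder
clean_mean_tilt = cleaned["tilt_1"].mean()  # only the two reported tilts
```

Note that a reported azimuth of `0.0` is left untouched: only the exact sentinel value is replaced.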

##### Tracking and Mounting Indicators

The following binary / categorical fields are present:

- `tracking` — indicates fixed vs tracking systems (with `-1` for unknown)
- `ground_mounted` — indicates mounting context
- `new_construction` — contextual deployment information

Preliminary inspection suggests:

- These fields are frequently unreported (`-1`).
- When populated, they describe the system as a whole rather than individual
  orientation slots.

##### Implications for Subsequent Steps

From this raw materialization, several implications follow:

- Orientation and layout **cannot be collapsed independently per slot** without
  first assessing whether multiple slots represent genuine structural variation.
- Slot-level enumeration is required to determine:
  - how many orientation definitions meaningfully exist per system,
  - and whether multi-orientation systems are common or exceptional.
- Placeholder prevalence necessitates explicit handling before any collapse
  rules are defined.

Accordingly, the next step proceeds to **slot-level enumeration**, focusing on
the number of populated tilt/azimuth slots per system and the empirical
frequency of multi-orientation configurations.


In [16]:
# Step 2 — Slot-Level Enumeration (Layout & Orientation)

import pandas as pd

# Define slot columns
tilt_cols = ["tilt_1", "tilt_2", "tilt_3"]
azimuth_cols = ["azimuth_1", "azimuth_2", "azimuth_3"]

# A slot is valid only if BOTH tilt and azimuth are present and not the -1 sentinel.
# Vectorized comparisons avoid a slow row-wise .apply over ~1.9M rows.
for i in range(1, 4):
    tilt = layout_raw[f"tilt_{i}"]
    az = layout_raw[f"azimuth_{i}"]
    layout_raw[f"slot_{i}_valid"] = (
        tilt.notna() & az.notna() & tilt.ne(-1) & az.ne(-1)
    )

# Aggregate to system level: count how many slots are valid per system
slot_counts = (
    layout_raw
    .groupby("tts_link_id")[[f"slot_{i}_valid" for i in range(1, 4)]]
    .any()                     # slot exists anywhere in system history
    .sum(axis=1)               # count valid slots
    .rename("n_orientation_slots")
)

# Summarize distribution
orientation_slot_summary = (
    slot_counts
    .value_counts()
    .sort_index()
    .rename("n_systems")
    .to_frame()
)

orientation_slot_summary["pct_systems"] = (
    orientation_slot_summary["n_systems"]
    / orientation_slot_summary["n_systems"].sum()
)

orientation_slot_summary


| n_orientation_slots | n_systems | pct_systems |
| ---: | ---: | ---: |
| 0 | 4300 | 0.034909 |
| 1 | 113590 | 0.922154 |
| 2 | 4044 | 0.03283 |
| 3 | 1245 | 0.010107 |


#### Step 2 — Findings: Orientation Slot Enumeration

This step enumerated the number of distinct orientation slots per system, where an
orientation slot is defined as a valid `(tilt, azimuth)` pair reported in the raw
data.

##### Empirical Distribution

- **92.2% of systems** report exactly **one orientation slot**.
- **~4.3% of systems** report **two or more orientation slots**, indicating genuine
  multi-orientation layouts.
- **~3.5% of systems** report **zero valid orientation slots**, reflecting missing or
  placeholder-heavy reporting rather than physical absence.

##### Interpretation

1. **Orientation is predominantly constrained**  
   The overwhelming majority of systems exhibit a single orientation, suggesting
   that orientation is typically fixed once system size is given.

2. **Multi-orientation layouts are real but uncommon**  
   Systems with two or three orientation slots likely correspond to:
   - split roof planes,
   - complex residential rooftops,
   - or ground-mounted arrays with heterogeneous geometry.  
   These cases represent genuine structural variation rather than reporting noise.

3. **Zero-orientation cases indicate reporting gaps**  
   Systems with no valid orientation slots are best interpreted as cases of
   incomplete or inconsistent geometry reporting. These must be handled explicitly
   during collapse rather than implicitly excluded.

##### Structural Implication

Orientation behaves as a **mostly constrained but not universally fixed**
configuration dimension:

- It is not a free degree of freedom for the majority of systems.
- It cannot be assumed invariant across all systems.

This pattern is consistent with findings from the module and inverter configuration
families and supports a unified analytical framework.

##### Consequence for Next Steps

Before defining collapse rules, further diagnostics are required to assess:

- placeholder and sentinel value usage in tilt and azimuth fields,
- internal consistency between paired orientation slots,
- interaction with related configuration flags such as tracking and mounting context.

These diagnostics are addressed in **Step 3 — Descriptive Diagnostics (Layout &
Orientation Family)**.


In [17]:
# Step 3 — Descriptive Diagnostics (Layout & Orientation Family)


import pandas as pd

layout_cols_numeric = [
    "tilt_1", "tilt_2", "tilt_3",
    "azimuth_1", "azimuth_2", "azimuth_3",
]

layout_cols_flags = [
    "tracking",
    "ground_mounted",
    "new_construction",
]

#  3.1 Numeric distributions
numeric_distributions = {}

for col in layout_cols_numeric:
    s = layout_raw[col]
    numeric_distributions[col] = s.describe(
        percentiles=[0.5, 0.75, 0.95]
    )

numeric_distributions_df = (
    pd.DataFrame(numeric_distributions)
    .T
)

numeric_distributions_df


# 3.2 Placeholder & missingness diagnostics
placeholder_rows = []

for col in layout_cols_numeric + layout_cols_flags:
    s = layout_raw[col]

    placeholder_rows.append({
        "column": col,
        "pct_null": s.isna().mean(),
        "pct_minus_one": (s == -1).mean(),
        "pct_zero": (s == 0).mean(),
    })

layout_placeholder_summary = (
    pd.DataFrame(placeholder_rows)
    .sort_values(["pct_minus_one", "pct_null"], ascending=False)
    .reset_index(drop=True)
)

layout_placeholder_summary


# 3.3 Internal consistency diagnostics
# Tilt–azimuth pairing consistency per slot

consistency_rows = []

for i in [1, 2, 3]:
    tilt_col = f"tilt_{i}"
    az_col = f"azimuth_{i}"

    tilt_valid = (layout_raw[tilt_col] > 0)
    az_valid = (layout_raw[az_col] >= 0)

    inconsistent = tilt_valid ^ az_valid

    consistency_rows.append({
        "slot": i,
        "n_inconsistent_rows": inconsistent.sum(),
        "pct_inconsistent": inconsistent.mean(),
    })

orientation_consistency_summary = (
    pd.DataFrame(consistency_rows)
    .reset_index(drop=True)
)

orientation_consistency_summary


|   | slot | n_inconsistent_rows | pct_inconsistent |
| ---: | ---: | ---: | ---: |
| 0 | 1 | 29596 | 0.015405 |
| 1 | 2 | 489 | 0.000255 |
| 2 | 3 | 214 | 0.000111 |


#### Step 3 — Descriptive Diagnostics (Layout & Orientation Family): Consistency Findings

The table above summarizes **internal consistency between paired tilt and azimuth
fields** within each indexed orientation slot.

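Each row corresponds to a slot index `i`, and inconsistency is defined as cases
where exactly one of the pair (`tilt_i`, `azimuth_i`) passes its validity test
(`tilt_i > 0`, `azimuth_i >= 0`) while the other does not. Because the tilt test
is strict, a tilt of `0` paired with a valid azimuth is also counted as
inconsistent. Such cases generally indicate partial or incoherent reporting of
physical orientation.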
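
The pairing test can be traced on a handful of rows. The sketch below applies the same exclusive-or logic as the diagnostic cell to an invented single-slot frame; the values are hypothetical and chosen to cover each case.

```python
import pandas as pd

# Toy single-slot frame (hypothetical values)
toy = pd.DataFrame({
    "tilt_1":    [20.0, -1.0, 0.0, 15.0, -1.0],
    "azimuth_1": [180.0, -1.0, 90.0, -1.0, 270.0],
})

tilt_valid = toy["tilt_1"] > 0
az_valid = toy["azimuth_1"] >= 0

# XOR is True when exactly one side of the pair passes its validity test
toy["inconsistent"] = tilt_valid ^ az_valid
```

Rows one and two are consistent (both valid, both placeholder). Rows four and five are flagged because one field is a placeholder. Row three is flagged only because its tilt is `0`, which the strict `> 0` test treats as invalid.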

##### Observed Patterns

- **Slot 1 (primary orientation)**  
  - ~1.54% of rows exhibit inconsistency.  
  - This is materially higher than other slots and reflects the fact that
    slot 1 is the most frequently populated orientation field and therefore
    more exposed to partial reporting.

- **Slots 2 and 3 (secondary orientations)**  
  - Inconsistency rates are extremely low (<0.03%).  
  - When secondary orientation slots are reported, they are almost always
    reported coherently as tilt–azimuth pairs.

##### Interpretation

- Orientation reporting is **structurally coherent** when present.
- Inconsistencies are rare and concentrated in the primary slot, likely due to:
  - reporting truncation,
  - legacy data entry practices,
  - or placeholder usage rather than genuine physical ambiguity.

These findings support treating orientation slots as **paired geometric units**
during collapse, rather than independent scalar fields.

##### Implication for Next Step

Given the low inconsistency rates and strong pairing structure, the analysis
can proceed to **Step 4 — Dominance vs. Mixture Assessment**, where we determine
whether systems are typically characterized by a single dominant orientation
or represent meaningful multi-orientation layouts.


In [18]:
# Step 4 — Dominance vs. Mixture Assessment (Layout & Orientation Family)


# Identify valid orientation slots per row: tilt > 0 and azimuth >= 0.
# Comparisons against NaN evaluate to False, so missing values are excluded.
orientation_flags = pd.DataFrame({
    f"slot_{i}": (layout_raw[f"tilt_{i}"] > 0) & (layout_raw[f"azimuth_{i}"] >= 0)
    for i in range(1, 4)
})

orientation_flags["tts_link_id"] = layout_raw["tts_link_id"]

# Count number of valid orientation slots per system
orientation_counts = (
    orientation_flags
    .groupby("tts_link_id")[["slot_1", "slot_2", "slot_3"]]
    .any()
    .sum(axis=1)
    .rename("n_orientation_components")
)

# Distribution of orientation mixtures
orientation_mixture_summary = (
    orientation_counts
    .value_counts()
    .sort_index()
    .rename("n_systems")
    .to_frame()
)

orientation_mixture_summary["pct_systems"] = (
    orientation_mixture_summary["n_systems"]
    / orientation_mixture_summary["n_systems"].sum()
)

orientation_mixture_summary



| n_orientation_components | n_systems | pct_systems |
| ---: | ---: | ---: |
| 0 | 7361 | 0.059759 |
| 1 | 110558 | 0.897539 |
| 2 | 4039 | 0.03279 |
| 3 | 1221 | 0.009912 |


#### Step 4 — Dominance vs. Mixture Assessment (Layout & Orientation Family)

The table above summarizes the number of **distinct orientation components**
(tilt–azimuth pairs) reported per system.

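Each system is classified by how many valid orientation slots are present,
where a slot is considered valid only when **both tilt and azimuth are
coherently reported** (`tilt > 0` and `azimuth >= 0`). This rule is stricter
than the `!= -1` test used in Step 2, which is why the number of systems with
zero valid components rises from 4,300 to 7,361.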

##### Observed Distribution

- **Single-orientation systems dominate**  
  - ~89.8% of systems report exactly **one** orientation component.
  - This indicates that the overwhelming majority of installations are
    physically characterized by a single array orientation.

- **Multi-orientation systems exist but are uncommon**  
  - ~3.3% of systems report two orientations.
  - ~1.0% report three orientations.
  - These cases likely correspond to split arrays (e.g. east–west roofs,
    complex roof geometries, or mixed mounting contexts).

- **No-orientation systems (~6.0%)**  
  - Systems with zero valid orientation components reflect:
    - missing orientation reporting,
    - placeholder-only entries,
    - or non-standard data collection rather than a physical absence of layout.

##### Interpretation

- Orientation configuration is **structurally simple** for most systems.
- Mixtures are **real but rare**, and when present, they represent genuine
  physical complexity rather than reporting noise.
- The empirical dominance of single-orientation layouts supports a
  deterministic collapse strategy that:
  - preserves the dominant orientation when present,
  - captures mixture count explicitly,
  - and remains robust to partial or missing reporting.

##### Implication for Next Step

With dominance vs. mixture behavior established, the analysis can proceed to
**Step 5 — Collapse Rule Definition**, where orientation-related fields will be
collapsed into a system-level representation that:
- preserves physical meaning,
- respects paired geometry (tilt–azimuth),
- and maintains one row per `tts_link_id`.


#### Step 5 — Collapse Rule Definition (Layout & Orientation Family)

This step defines **deterministic collapse rules** that map raw orientation and
layout fields to a system-level representation, following the **same formal and
mathematical structure** used in the module and inverter configuration families.

The collapse is performed **after** empirical diagnostics and is justified by
observed dominance patterns.

---

##### Definitions

Let each system have up to three reported orientation slots indexed by  
$i \in \{1,2,3\}$.

A slot is considered **valid** if **both** tilt and azimuth are present and
non-placeholder:

- $\text{tilt}_i > 0$
- $\text{azimuth}_i \ge 0$ (azimuth is an angle in $[0,360]$ degrees; the upper
  bound is not checked by the indicator)

---

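##### Orientation Component Indicator

For each slot $i$, define an indicator function:

$\mathbb{1}_i = \mathbb{1}(\text{tilt}_i > 0 \;\wedge\; \text{azimuth}_i \ge 0)$

When a system has several raw rows, $\mathbb{1}_i = 1$ if the condition holds in
at least one of them, matching the `.any()` aggregation used in Step 4.

---

##### Orientation Mixture Count

The number of distinct orientation components per system is:

$\text{orientation\_mixture\_count} = \sum_{i=1}^{3} \mathbb{1}_i$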

This quantity captures whether the system is:
- single-orientation,
- multi-orientation,
- or unreported.

---

##### Dominant Orientation Selection

If $\text{orientation\_mixture\_count} \ge 1$, the **dominant orientation** is
defined as the **first valid orientation slot**, preserving deterministic
ordering and avoiding inference:

$\text{dominant\_orientation\_slot} = \min \{ i : \mathbb{1}_i = 1 \}$

From this slot we define:

- $\text{dominant\_tilt} = \text{tilt}_{i^*}$
- $\text{dominant\_azimuth} = \text{azimuth}_{i^*}$

where $i^*$ is the dominant orientation slot.

---

##### Tracking & Mounting Flags

Binary layout characteristics are collapsed using logical presence rules:

- $\text{has\_tracking} = \mathbb{1}(\text{tracking} = 1)$
- $\text{is\_ground\_mounted} = \mathbb{1}(\text{ground\_mounted} = 1)$
- $\text{is\_new\_construction} = \mathbb{1}(\text{new\_construction} = 1)$

These flags describe **deployment context**, not orientation geometry.
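
A minimal sketch of the flag rules above, applied to an invented multi-row frame. It assumes that a flag is set for a system when any of its raw rows reports the value `1`; the frame `toy` and the mapping `flag_cols` are illustrative names.

```python
import pandas as pd

# Toy raw frame with several rows per system (hypothetical values)
toy = pd.DataFrame({
    "tts_link_id":      ["a", "a", "b", "c"],
    "tracking":         [-1, 1, 0, -1],
    "ground_mounted":   [0, 0, 1, -1],
    "new_construction": [-1, -1, -1, -1],
})

flag_cols = {
    "tracking": "has_tracking",
    "ground_mounted": "is_ground_mounted",
    "new_construction": "is_new_construction",
}

# A flag is True for a system if any of its rows reports the value 1
flags_system_level = (
    toy[list(flag_cols)].eq(1)
    .groupby(toy["tts_link_id"])
    .any()
    .rename(columns=flag_cols)
)
```

Under this rule an unreported flag (`-1`) and a reported `0` both collapse to `False`, as system `c` shows. Whether that distinction needs to be preserved in a separate column is left open here.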

---

##### Resulting System-Level Variables

After collapse, each system is represented by:

- `orientation_mixture_count`
- `dominant_tilt`
- `dominant_azimuth`
- `has_tracking`
- `is_ground_mounted`
- `is_new_construction`

All variables:
- are defined at **one row per `tts_link_id`**,
- are invariant to reporting multiplicity,
- and are admissible once system size is held fixed.

---

The resulting layout and orientation representation is now ready to be
materialized and later merged with other configuration families before
computing structural degrees of freedom.
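
The orientation rules defined in this step can be sketched on a small invented frame. This is an illustration under stated assumptions, not the notebook's materialization cell: it uses two slots instead of three, and when the dominant slot is valid in several rows it takes the first such row, a tie-break the definitions above do not specify.

```python
import pandas as pd

# Toy raw frame (hypothetical values); system "a" has two rows
toy = pd.DataFrame({
    "tts_link_id": ["a", "a", "b", "c"],
    "tilt_1":    [-1.0, 20.0, -1.0, -1.0],
    "azimuth_1": [-1.0, 180.0, -1.0, -1.0],
    "tilt_2":    [-1.0, 25.0, 10.0, -1.0],
    "azimuth_2": [-1.0, 90.0, 270.0, -1.0],
})

# Reshape to one record per (raw row, slot)
parts = [
    pd.DataFrame({
        "tts_link_id": toy["tts_link_id"],
        "slot": i,
        "tilt": toy[f"tilt_{i}"],
        "azimuth": toy[f"azimuth_{i}"],
    })
    for i in (1, 2)
]
long_form = pd.concat(parts, ignore_index=True)
valid = long_form[(long_form["tilt"] > 0) & (long_form["azimuth"] >= 0)]

# Mixture count: number of distinct slots valid anywhere in the system
mixture = valid.groupby("tts_link_id")["slot"].nunique()

# Dominant orientation: lowest valid slot index, first occurrence on ties
dominant = (
    valid.sort_values("slot", kind="stable")
    .drop_duplicates("tts_link_id")
    .set_index("tts_link_id")
)

system_ids = pd.Index(toy["tts_link_id"].unique(), name="tts_link_id")
layout_system_level = pd.DataFrame(index=system_ids)
layout_system_level["orientation_mixture_count"] = mixture.reindex(system_ids, fill_value=0)
layout_system_level["dominant_tilt"] = dominant["tilt"].reindex(system_ids)
layout_system_level["dominant_azimuth"] = dominant["azimuth"].reindex(system_ids)
```

System `c` has no valid slot, so it keeps a mixture count of `0` and missing dominant values instead of being dropped, which preserves one row per `tts_link_id`.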


### 5.5 Mounting Context Family 

This section extends the structural analysis to the **mounting context** of solar
installations. The objective is to characterize how systems are physically sited
(e.g. rooftop vs. ground-mounted, construction context) **once system size is
held fixed**, and to materialize a deterministic, system-level representation
suitable for structural comparison.

As with all prior configuration families, system size (`system_size_kw`) is
treated as **given and non-variable**. All variation examined here is strictly
conditional on fixed size or narrow size bins.

The mounting context family differs from module, inverter, and layout families
in that it is **primarily categorical and boolean**, rather than continuous or
capacity-based. Nevertheless, the same disciplined analytical structure applies.

---

#### Analytical Steps

The mounting context analysis follows the same uniform procedure used in prior
families, adapted to contextual (rather than electrical or geometric) variables:

1. **Raw Data Materialization**  
   Load all mounting-related fields at their native grain, without aggregation or
   inference. These fields describe how and where the system is physically
   installed.

2. **Context Enumeration**  
   Identify which mounting contexts are reported per system and whether systems
   exhibit:
   - single, unambiguous mounting context, or
   - mixed, conflicting, or partially reported contexts.

3. **Descriptive Diagnostics**  
   For each mounting-related field:
   - compute basic distributions (counts and proportions),
   - measure missingness and placeholder usage (nulls, zeros, sentinel values),
   - and assess internal consistency across related context flags.

4. **Dominance vs. Ambiguity Assessment**  
   Determine whether mounting context is typically:
   - uniquely defined per system, or
   - structurally ambiguous due to mixed reporting or inconsistent flags.

5. **Collapse Rule Definition**  
   Based on empirical diagnostics, define deterministic rules that map raw
   mounting-context fields to a system-level representation. These rules are
   designed to be:
   - invariant to reporting multiplicity,
   - non-inferential,
   - and empirically justified.

6. **System-Level Context Materialization**  
   Apply the collapse rules to produce a system-level mounting context
   representation. As with prior families, **no artifacts are written** until the
   full family collapse is complete.

---

#### Outcome

The result of this section will be a coherent system-level representation of
mounting context that captures:
- how systems are physically sited,
- whether context is unambiguous or mixed,
- and how mounting relates to structural configuration once size is fixed.



In [19]:
# Step 1 — Raw Data Materialization (Mounting Context Family)

RAW_DATA_PATH = Path(os.environ["TRACKING_THE_SUN_DATA"])

mounting_raw = pd.read_parquet(
    RAW_DATA_PATH,
    columns=[
        "tts_link_id",

        # Mounting / siting context
        "ground_mounted",
        "new_construction",
        "tracking",

        # Ownership / install context (may interact with mounting)
        "third_party_owned",
        "self_installed",
    ],
    engine="fastparquet",
)

mounting_raw.shape, mounting_raw.head()


((1921220, 6),
   tts_link_id  ground_mounted  new_construction  tracking  third_party_owned  \
 0          -1              -1                -1        -1                 -1   
 1          -1               0                -1         0                  0   
 2          -1              -1                -1        -1                 -1   
 3          -1               0                -1         0                  1   
 4          -1              -1                -1        -1                 -1   
 
    self_installed  
 0               0  
 1               0  
 2               0  
 3               0  
 4               0  )

#### Step 1 — Raw Data Materialization (Mounting Context Family): Summary

In this step, we materialized the **raw mounting context fields** from the
Tracking the Sun dataset at their **native grain**, without aggregation,
collapse, or inference.

##### What was done

- Loaded mounting- and siting-related variables directly from the raw dataset.
- Preserved the original reporting structure, including:
  - placeholder values (`-1`),
  - missing values (`NaN`),
  - and multiple rows per `tts_link_id`.
- Ensured that **no system-level assumptions** were introduced at this stage.

##### Raw Columns Materialized

The following fields were loaded as part of the mounting context family:

- **Physical siting**
  - `ground_mounted`
  - `tracking`

- **Construction context**
  - `new_construction`

- **Ownership / installation context**
  - `third_party_owned`
  - `self_installed`

Each of these variables represents a **binary contextual signal** describing how
or where a system is installed, rather than a physical capacity or geometric
measurement.

##### Analytical Scope Clarification

At this stage:
- No variables were collapsed.
- No dominance, mixture, or ambiguity was assessed.
- No system-level representation was created.

This step exists solely to **expose the raw structure and reporting behavior**
of mounting context variables so that:
- empirical diagnostics can be performed in subsequent steps, and
- collapse rules can later be defined based on observed patterns rather than
  assumptions.




In [20]:
# Step 2 — Context Enumeration

context_cols = [
    "ground_mounted",
    "new_construction",
    "tracking",
    "third_party_owned",
    "self_installed",
]

# Helper: valid context flag
def is_valid_flag(x):
    return pd.notna(x) and x != -1

# Count how many context flags are meaningfully reported per row
mounting_raw["n_context_flags"] = (
    mounting_raw[context_cols]
    .map(is_valid_flag)
    .sum(axis=1)
)

# Collapse to system level (max flags observed per system)
context_count_by_system = (
    mounting_raw
    .groupby("tts_link_id")["n_context_flags"]
    .max()
    .value_counts()
    .sort_index()
    .rename("n_systems")
    .to_frame()
)

context_count_by_system["pct_systems"] = (
    context_count_by_system["n_systems"]
    / context_count_by_system["n_systems"].sum()
)

context_count_by_system


| n_context_flags | n_systems | pct_systems |
| ---: | ---: | ---: |
| 0 | 8 | 6.5e-05 |
| 1 | 233 | 0.001892 |
| 2 | 1452 | 0.011788 |
| 3 | 2570 | 0.020864 |
| 4 | 107790 | 0.875068 |
| 5 | 11126 | 0.090324 |


#### Step 2 — Context Enumeration (Mounting Context Family): Summary

This step enumerates **how many mounting context signals are meaningfully reported
per system**, without yet interpreting or collapsing them.

A *mounting context signal* is defined as a binary field that is **explicitly
reported** (i.e. not null and not a placeholder value such as `-1`). The purpose
is to understand whether mounting context is typically:
- absent,
- singular and coherent, or
- mixed / potentially ambiguous.

##### What was done

- Evaluated the following mounting context fields at the raw row level:
  - `ground_mounted`
  - `new_construction`
  - `tracking`
  - `third_party_owned`
  - `self_installed`
- Treated each field as a **binary contextual indicator**, not a measurement.
- Counted, per row, how many of these indicators were meaningfully reported.
- Collapsed to the **system level** by taking the **maximum number of reported
  context flags observed for each `tts_link_id`**.
- Tabulated the distribution of systems by number of context flags present.

##### What this step measures (and what it does not)

This step measures:
- the **reporting richness** of mounting context per system,
- whether systems tend to have **zero, one, or multiple** context signals.

This step does **not**:
- determine which context is dominant,
- assess physical consistency or contradiction,
- or define system-level mounting context variables.

##### Why this matters

Mounting context is categorical and contextual, not physical capacity. Before
defining collapse rules, we must establish empirically whether:
- systems usually report a **single, unambiguous context**, or
- multiple context flags commonly co-occur, indicating ambiguity or mixed
  reporting.

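The observed distribution from this step (about 87.5% of systems report four of
the five flags, about 9.0% report all five, and roughly 3.5% report three or
fewer) directly informs:
- whether dominance-based collapse is appropriate,
- whether ambiguity flags are required,
- and how conservative collapse rules must be.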
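
One way to decide whether ambiguity flags are required is to count, per system, how many distinct non-placeholder values a context field takes across its raw rows. The sketch below does this for a single field on invented data; the frame `toy` and the labels in `status` are hypothetical.

```python
import pandas as pd

# Toy raw frame (hypothetical values)
toy = pd.DataFrame({
    "tts_link_id":    ["a", "a", "b", "b", "c"],
    "ground_mounted": [0, 1, 1, -1, -1],
})

# Keep only rows where the flag is actually reported
reported = toy[toy["ground_mounted"] != -1]

# Distinct reported values per system: 0 = unreported, 1 = unambiguous, 2 = conflicting
n_values = (
    reported.groupby("tts_link_id")["ground_mounted"]
    .nunique()
    .reindex(toy["tts_link_id"].unique(), fill_value=0)
)

status = n_values.map({0: "unreported", 1: "unambiguous", 2: "conflicting"})
```

If conflicting systems turn out to be rare in the real data, a simple presence rule is likely sufficient; if they are common, an explicit ambiguity column may be the more conservative choice.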

##### Next Step

With context enumeration complete, the analysis proceeds to:

**Step 3 — Descriptive Diagnostics**,  
where we examine missingness, placeholder usage, and internal consistency across
mounting context fields before assessing dominance or defining collapse rules.


In [21]:
# Step 3 — Descriptive Diagnostics (Mounting Context Family)


context_cols = [
    "ground_mounted",
    "new_construction",
    "tracking",
    "third_party_owned",
    "self_installed",
]

# --- Missingness & placeholder summary ---
rows = []
for col in context_cols:
    s = mounting_raw[col]
    rows.append({
        "column": col,
        "pct_null": s.isna().mean(),
        "pct_minus_one": (s == -1).mean(),
        "pct_zero": (s == 0).mean(),
        "pct_one": (s == 1).mean(),
    })

mounting_placeholder_summary = (
    pd.DataFrame(rows)
    .sort_values(["pct_minus_one", "pct_null"], ascending=False)
    .reset_index(drop=True)
)

mounting_placeholder_summary


|   | column | pct_null | pct_minus_one | pct_zero | pct_one |
| ---: | :--- | ---: | ---: | ---: | ---: |
| 0 | new_construction | 0.0 | 0.870394 | 0.092878 | 0.036728 |
| 1 | ground_mounted | 0.0 | 0.157969 | 0.82412 | 0.017912 |
| 2 | tracking | 0.0 | 0.069499 | 0.925683 | 0.004818 |
| 3 | third_party_owned | 0.0 | 0.047712 | 0.642066 | 0.310222 |
| 4 | self_installed | 0.0 | 0.000176 | 0.977955 | 0.021869 |


In [22]:
# --- Internal consistency checks ---
# We look for contradictory or co-occurring signals that could imply ambiguity

# Helper: valid reported flag
def valid_flag(x):
    return pd.notna(x) and x != -1

flags = mounting_raw[context_cols].map(valid_flag)

# Count how many valid flags appear per row
mounting_raw["n_valid_context_flags"] = flags.sum(axis=1)

# Distribution at system level (max observed per system)
context_flag_distribution = (
    mounting_raw
    .groupby("tts_link_id")["n_valid_context_flags"]
    .max()
    .value_counts()
    .sort_index()
    .rename("n_systems")
    .to_frame()
)

context_flag_distribution["pct_systems"] = (
    context_flag_distribution["n_systems"]
    / context_flag_distribution["n_systems"].sum()
)

context_flag_distribution


| n_valid_context_flags | n_systems | pct_systems |
|---|---|---|
| 0 | 8 | 6.5e-05 |
| 1 | 233 | 0.001892 |
| 2 | 1452 | 0.011788 |
| 3 | 2570 | 0.020864 |
| 4 | 107790 | 0.875068 |
| 5 | 11126 | 0.090324 |


In [23]:
# --- Pairwise co-occurrence diagnostics ---
# Identify how often context flags are simultaneously true (== 1)

co_occurrence = []

for i, col_a in enumerate(context_cols):
    for col_b in context_cols[i+1:]:
        both_true = (
            (mounting_raw[col_a] == 1) &
            (mounting_raw[col_b] == 1)
        ).mean()
        co_occurrence.append({
            "pair": f"{col_a} & {col_b}",
            "pct_rows_both_true": both_true,
        })

co_occurrence_df = (
    pd.DataFrame(co_occurrence)
    .sort_values("pct_rows_both_true", ascending=False)
    .reset_index(drop=True)
)

co_occurrence_df


| | pair | pct_rows_both_true |
|---|---|---|
| 0 | new_construction & third_party_owned | 0.016562 |
| 1 | third_party_owned & self_installed | 0.004349 |
| 2 | ground_mounted & third_party_owned | 0.00225 |
| 3 | ground_mounted & self_installed | 0.001305 |
| 4 | ground_mounted & tracking | 0.000852 |
| 5 | tracking & self_installed | 0.000624 |
| 6 | tracking & third_party_owned | 0.000604 |
| 7 | ground_mounted & new_construction | 5.2e-05 |
| 8 | new_construction & tracking | 4.6e-05 |
| 9 | new_construction & self_installed | 2.4e-05 |


#### Step 3 — Descriptive Diagnostics (Mounting Context Family): Summary

This step evaluates **reporting quality, completeness, and internal consistency**
of mounting context variables prior to any dominance assessment or collapse.

---

##### 1. Placeholder Usage & Reporting Density

Mounting context fields use the **sentinel value `-1`** to indicate missing or
inapplicable information (no field contains nulls), but its prevalence differs
sharply by field:

- **`new_construction`**
  - ~87.0% reported as `-1`
  - Indicates sparse and selective reporting
  - When present, values are split between `0` and `1`, suggesting meaningful
    signal rather than noise

- **`ground_mounted`**
  - ~82.4% explicitly reported as `0`
  - Only ~15.8% are placeholders
  - Suggests this field is actively populated and informative

- **`tracking`**
  - ~92.6% reported as `0`
  - Very low placeholder rate (~7.0%)
  - Indicates strong negative reporting (most systems are non-tracking)

- **`third_party_owned`**
  - Mixed structure:
    - ~31.0% reported as `1`
    - ~64.2% reported as `0`
    - ~4.8% placeholders
  - High informational value with clear binary meaning

- **`self_installed`**
  - Almost universally reported as `0` (~97.8%)
  - Very low placeholder usage (~0.02%)
  - Indicates rarity rather than reporting absence

**Key observation:**  
Mounting context fields are **not uniformly sparse**. Several are actively and
consistently reported, supporting deterministic system-level collapse.

---

##### 2. System-Level Context Flag Count

The distribution of valid context flags per system reveals a clear structural
pattern:

- **~87.5% of systems report four of the five context flags**
- ~9.0% report all five flags
- Fewer than 4% of systems report three or fewer flags
- Near-zero mass at zero or one flag

**Interpretation:**
- Mounting context is typically **near-complete**, not fragmentary: more than
  96% of systems report at least four of the five flags
- The modal system lacks exactly one flag, which is consistent with the ~87%
  placeholder rate for `new_construction`
- Missingness is localized to specific fields, not systemic
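
To check which fields account for the flags that are not reported, the per-system count can be broken down by column. The following is a minimal sketch on invented toy rows, not the notebook's data. It assumes a field counts as reported for a system when any of that system's rows carries a value other than null or `-1`, which is slightly different from the row-wise maximum used in the diagnostic above.

```python
import pandas as pd

# Hypothetical toy rows: column names mirror the notebook, values are invented.
toy = pd.DataFrame({
    "tts_link_id": ["a", "a", "b", "c"],
    "ground_mounted": [0, 0, 0, -1],
    "new_construction": [-1, -1, 1, -1],
    "tracking": [0, 0, 0, 0],
})
cols = ["ground_mounted", "new_construction", "tracking"]

# A value is "reported" when it is neither null nor the -1 placeholder.
valid = toy[cols].notna() & toy[cols].ne(-1)

# Assumption: a system reports a field if any of its rows reports it.
reported_by_system = valid.groupby(toy["tts_link_id"]).any()

# Share of systems that never report each field, largest first.
pct_unreported = (~reported_by_system).mean().sort_values(ascending=False)
print(pct_unreported)
```

On the real table, a single field dominating this ranking would confirm that missingness is field-specific rather than spread evenly across systems.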

---

##### 3. Pairwise Co-Occurrence Patterns

Pairwise co-occurrence diagnostics show:

- Low absolute co-occurrence rates across all flag pairs (measured as the share
  of raw rows in which both flags equal `1`)
- Highest observed pair:
  - `new_construction & third_party_owned` at ~1.66%
- All other pairs are below 0.5%

**Interpretation:**
- Co-occurrence reflects **legitimate joint contexts**, not contradictions
- No evidence of mutually exclusive flags being simultaneously asserted
- No strong signal of structural ambiguity or reporting conflict
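
The rates above use all raw rows as the denominator, so a pair involving a rarely asserted flag is capped by that flag's base rate. For example, only ~3.7% of rows assert `new_construction`, so its ~1.66% joint rate with `third_party_owned` covers roughly 45% of those rows. A hedged sketch of a complementary view follows: it also reports the joint rate among rows where both flags carry a real `0`/`1` value. The helper `co_occurrence` and the toy data are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy rows: -1 is the placeholder, 0/1 are reported values.
toy = pd.DataFrame({
    "new_construction": [1, -1, 0, 1, -1],
    "third_party_owned": [1, 1, 0, 0, 1],
    "self_installed": [0, 0, 1, 0, -1],
})


def co_occurrence(df, col_a, col_b):
    """Return (share of all rows, share of jointly reported rows) with both flags == 1."""
    both_true = (df[col_a] == 1) & (df[col_b] == 1)
    both_reported = df[col_a].isin([0, 1]) & df[col_b].isin([0, 1])
    n_reported = int(both_reported.sum())
    # Guard against pairs that are never reported together.
    conditional = both_true.sum() / n_reported if n_reported else float("nan")
    return float(both_true.mean()), float(conditional)


results = {
    (a, b): co_occurrence(toy, a, b)
    for a, b in combinations(toy.columns, 2)
}
for pair, (pct_all, pct_reported) in results.items():
    print(pair, round(pct_all, 3), round(pct_reported, 3))
```

The conditional rate is always at least as large as the unconditional one, so it gives the more conservative reading when judging whether two flags conflict.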

---

##### 4. Implications for Collapse Strategy

From these diagnostics:

- Mounting context is **structurally coherent**, not ambiguous
- Multiple flags can co-exist because they describe **orthogonal aspects**
  of installation context (siting, ownership, construction phase), although
  joint positive assertions are rare in practice
- A dominance-based collapse is **not appropriate**
- Instead, mounting context should be collapsed as a **multi-flag system-level
  profile**, preserving each contextual dimension independently

---


#### Step 4 — Dominance vs. Ambiguity Assessment (Mounting Context Family)

This step determines whether **mounting context variables require dominance
resolution** (as in module or inverter configurations) or whether they can be
collapsed **directly and independently** without loss of structural meaning.

The assessment is grounded entirely in the empirical diagnostics from Step 3.

---

##### Assessment Question

> *Do mounting context variables exhibit mutually exclusive alternatives that
require selecting a dominant realization, or do they represent orthogonal
contextual dimensions that can co-exist without ambiguity?*

---

##### Evidence from Diagnostics

###### 1. High Completeness at the System Level

- ~87.5% of systems report **four of the five** mounting context flags, and a
  further ~9.0% report all five.
- >96% of systems report **four or more** flags.
- Near-zero mass at zero or one flag.

**Implication:**  
Mounting context is **not sparsely or selectively reported** at the system level.
Most systems carry a near-complete contextual profile rather than competing alternatives.

---

###### 2. Low Pairwise Conflict Risk

- Pairwise co-occurrence rates are uniformly low.
- Highest observed joint assertion:
  - `new_construction & third_party_owned` ≈ 1.66%
- No evidence of logically contradictory flags being simultaneously asserted.

**Implication:**  
Co-occurrence reflects **legitimate joint contexts**, not reporting conflict or
structural ambiguity.

---

###### 3. Orthogonality of Context Dimensions

Each mounting context variable describes a **distinct and non-substitutable
dimension**:

- `ground_mounted` → physical siting
- `tracking` → mechanical configuration
- `new_construction` → construction timing
- `third_party_owned` → ownership structure
- `self_installed` → installation responsibility

These dimensions do **not compete** with one another in the way module types,
inverter slots, or orientation components do.

---

##### Conclusion

**No dominance resolution is required** for the mounting context family.

- There is no concept of a “dominant” mounting context.
- Multiple context flags may be simultaneously true without ambiguity.
- Collapsing by dominance would incorrectly discard valid structural
  information.

---

##### Structural Decision

Mounting context variables will be:

- collapsed **independently**,
- preserved as **parallel binary system-level attributes**, and
- carried forward without mixture reduction or arbitration.

This distinguishes the mounting context family from:
- module configuration (dominant physical realization),
- inverter configuration (capacity-weighted dominance),
- and layout/orientation (dominant geometry with mixture tracking).

---



#### Step 5 — Collapse Rule Definition (Mounting Context Family)

This step defines **deterministic, system-level collapse rules** for the mounting
context family, based on the dominance assessment in Step 4.

Unlike module, inverter, or orientation families, **no arbitration or dominance
logic is required** here. Each variable represents an orthogonal contextual
dimension and can be collapsed independently.

---

##### Collapse Principle

For each mounting context variable:

- The system-level value reflects **whether the context is ever asserted**
  for that system.
- Placeholder values (`-1`) are treated as *non-informative*: they never
  override a positive assertion, but they are not distinguished from `0` in the
  collapsed output.
- The collapse is **existential**, not competitive.

Formally:

> A mounting context is considered present for a system if **any row**
> associated with that `tts_link_id` asserts it as present.

---

##### Variable-Specific Collapse Rules

Let $x_{i,r}$ denote the value of context variable $i$ on raw row $r$
belonging to system $s$.

###### Binary Context Flags

For binary flags taking values in `{1, 0, -1}`:

- `ground_mounted`
- `tracking`
- `new_construction`
- `third_party_owned`
- `self_installed`

The system-level collapsed value is:

$$
\text{context}_i(s) =
\begin{cases}
1 & \text{if } \exists r \in s \text{ such that } x_{i,r} = 1 \\
0 & \text{otherwise}
\end{cases}
$$

Equivalently:

- **1** if the context is ever observed as present,
- **0** if never observed as present (including all `0` or `-1`).
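
As a worked illustration of the existential rule, the sketch below collapses a few invented rows. The identifiers `s1` to `s3` and all values are hypothetical. It shows only the rule as stated: a system gets `1` when any of its rows equals `1`, and `0` otherwise.

```python
import pandas as pd

# Hypothetical raw rows: s1 has a placeholder plus a positive assertion,
# s2 has only negatives/placeholders, s3 has only placeholders.
raw = pd.DataFrame({
    "tts_link_id": ["s1", "s1", "s2", "s2", "s3"],
    "ground_mounted": [-1, 1, 0, 0, -1],
    "tracking": [0, 0, -1, 0, -1],
})
flag_cols = ["ground_mounted", "tracking"]

# Step 1: a row asserts the context only when the value is exactly 1.
asserted = raw[flag_cols].eq(1)

# Step 2: existential collapse, one row per system.
collapsed = asserted.groupby(raw["tts_link_id"]).any().astype(int)
print(collapsed)
```

Here `s1` collapses to `ground_mounted = 1` despite its `-1` row, while `s2` and `s3` both collapse to `0` on every flag even though only `s2` reported an explicit negative.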

---

##### Rationale for This Rule

This rule is empirically justified because:

- Context flags are **not mutually exclusive**.
- Co-occurrence rates are low and non-contradictory.
- Placeholder values dominate missingness and should not suppress valid signals.
- The analytical objective is **structural conditioning**, not behavioral
  inference.

Using stricter rules (e.g. majority voting, dominance arbitration) would
artificially erase valid contextual information.

---

##### Output Structure

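The collapsed mounting context family will produce **one row per `tts_link_id`**
with the following system-level fields (named with an `is_` prefix in the
collapse code):

- `is_ground_mounted`
- `is_tracking_system`
- `is_new_construction`
- `is_third_party_owned`
- `is_self_installed`

Each field is binary and interpretable independently. Two derived fields,
`is_roof_mounted` and `mounting_context_mixture_count`, are computed from them.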

No artifact is written until all remaining configuration families are collapsed
and merged into a single, unified system-level substrate.

---



In [24]:
# Step 5 — Mounting Context Collapse (System-Level Aggregation)
# Normalize raw flags to boolean (treat 1 as True, everything else as False)
mounting_flags = mounting_raw.copy()

flag_cols = [
    "ground_mounted",
    "tracking",
    "new_construction",
    "third_party_owned",
    "self_installed",
]

for col in flag_cols:
    mounting_flags[col] = mounting_flags[col] == 1


# System-level collapse
mounting_system_level = (
    mounting_flags
    .groupby("tts_link_id", as_index=False)
    .agg(
        # Mounting & siting
        is_ground_mounted=("ground_mounted", "any"),
        is_tracking_system=("tracking", "any"),

        # Construction & ownership
        is_new_construction=("new_construction", "any"),
        is_third_party_owned=("third_party_owned", "any"),
        is_self_installed=("self_installed", "any"),
    )
)


# Derived structural indicators
mounting_system_level["is_roof_mounted"] = (
    ~mounting_system_level["is_ground_mounted"]
)

mounting_system_level["mounting_context_mixture_count"] = (
    mounting_system_level[
        [
            "is_ground_mounted",
            "is_tracking_system",
            "is_new_construction",
            "is_third_party_owned",
            "is_self_installed",
        ]
    ]
    .sum(axis=1)
)


# Invariants
assert mounting_system_level["tts_link_id"].is_unique, (
    "Mounting context collapse must yield one row per tts_link_id."
)

mounting_system_level.shape, mounting_system_level.head()

((123179, 8),
              tts_link_id  is_ground_mounted  is_tracking_system  \
 0                     -1               True                True   
 1     tts_extension_id_1               True                True   
 2    tts_extension_id_10              False               False   
 3   tts_extension_id_100              False               False   
 4  tts_extension_id_1000              False               False   
 
    is_new_construction  is_third_party_owned  is_self_installed  \
 0                 True                  True               True   
 1                 True                  True               True   
 2                False                  True              False   
 3                False                  True              False   
 4                False                 False              False   
 
    is_roof_mounted  mounting_context_mixture_count  
 0            False                               5  
 1            False                               5  
 2  

In [25]:
# Step 6 — System-Level Configuration Materialization (Mounting Context Family)

mounting_system_config = mounting_system_level[
    [
        "tts_link_id",
        "is_ground_mounted",
        "is_roof_mounted",
        "is_tracking_system",
        "is_new_construction",
        "is_third_party_owned",
        "is_self_installed",
        "mounting_context_mixture_count",
    ]
].copy()


# Final invariants
assert mounting_system_config["tts_link_id"].is_unique, (
    "Mounting context configuration must have one row per tts_link_id."
)

mounting_system_config.shape, mounting_system_config.head()



((123179, 8),
              tts_link_id  is_ground_mounted  is_roof_mounted  \
 0                     -1               True            False   
 1     tts_extension_id_1               True            False   
 2    tts_extension_id_10              False             True   
 3   tts_extension_id_100              False             True   
 4  tts_extension_id_1000              False             True   
 
    is_tracking_system  is_new_construction  is_third_party_owned  \
 0                True                 True                  True   
 1                True                 True                  True   
 2               False                False                  True   
 3               False                False                  True   
 4               False                False                 False   
 
    is_self_installed  mounting_context_mixture_count  
 0               True                               5  
 1               True                               5  
 2        

#### Step 6 — System-Level Configuration Materialization (Mounting Context Family)

This step materializes a **deterministic, system-level representation** of the
mounting context family by collapsing raw, row-level mounting and ownership
indicators into a single record per system (`tts_link_id`).

The objective is **not** to infer causal relationships or scale effects, but to
produce a structurally admissible configuration description that can later be
used to assess degrees of freedom at fixed system size.

---

##### Inputs to Collapse

The collapse integrates the following mounting and contextual dimensions:

- **Physical siting**
  - `ground_mounted`
  - `tracking`

- **Construction context**
  - `new_construction`

- **Ownership / installation context**
  - `third_party_owned`
  - `self_installed`

These fields are reported at raw grain with extensive use of sentinel values
(`-1`) and sparse positive indicators.

---

##### Collapse Logic

For each `tts_link_id`, the following deterministic rules are applied:

- Binary indicators (`is_*`) are set to `True` if **any** raw record for the
  system reports the condition as present.
- Absence of positive evidence results in `False`, not missing.
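- `is_roof_mounted` is derived as the logical complement of
  `is_ground_mounted`, so the two are mutually exclusive by construction;
  systems whose `ground_mounted` values are all `0` or `-1` are classed as
  roof-mounted.
- A **mixture count** is computed as the number of mounting/context flags
  (excluding the derived `is_roof_mounted`) that evaluate to `True` for the
  system.

This yields a system-level vector of boolean configuration indicators plus a
scalar count of active context flags.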
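
Because `is_roof_mounted` is computed as the complement of `is_ground_mounted`, a system with only placeholder values for `ground_mounted` is labelled roof-mounted. The sketch below shows one way to count such cases. The columns `ground_mounted_reported` and `roof_by_default` are hypothetical helpers that do not exist in the notebook, and the rows are invented.

```python
import pandas as pd

# Hypothetical raw rows: s1 asserts ground mounting, s2 reports an explicit 0,
# s3 carries only the -1 placeholder.
raw = pd.DataFrame({
    "tts_link_id": ["s1", "s2", "s2", "s3"],
    "ground_mounted": [1, 0, -1, -1],
})

# Row-level booleans: positive assertion vs. any real (0/1) report.
raw["asserted"] = raw["ground_mounted"].eq(1)
raw["reported"] = raw["ground_mounted"].isin([0, 1])

per_system = (
    raw.groupby("tts_link_id")[["asserted", "reported"]]
    .any()
    .rename(columns={
        "asserted": "is_ground_mounted",
        "reported": "ground_mounted_reported",  # hypothetical helper column
    })
)

# Same derivation as the collapse cell: roof is the complement of ground.
per_system["is_roof_mounted"] = ~per_system["is_ground_mounted"]

# Roof-mounted only because nothing was reported (hypothetical diagnostic).
per_system["roof_by_default"] = (
    per_system["is_roof_mounted"] & ~per_system["ground_mounted_reported"]
)
print(per_system)
```

Given that ~15.8% of raw rows carry a placeholder for `ground_mounted`, the share of `roof_by_default` systems may be worth reporting alongside `is_roof_mounted` before it is used for stratification.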

---

##### Resulting System-Level Variables

The mounting context family is represented by the following system-level fields:

- `is_ground_mounted`
- `is_roof_mounted`
- `is_tracking_system`
- `is_new_construction`
- `is_third_party_owned`
- `is_self_installed`
- `mounting_context_mixture_count`

Each system appears **exactly once**, satisfying the one-row-per-system
invariant.

---

##### Empirical Observations

From the materialized output:

- The vast majority of systems have **low mixture counts**, with only a small
  number of context flags active.
- Most systems fall into clear, interpretable categories (e.g. roof-mounted,
  non-tracking, homeowner-owned).
- High mixture counts exist but are rare, indicating either complex reporting
  or genuinely mixed contexts.

---

##### Structural Interpretation

Mounting context variables:

- **Do vary at fixed size**, satisfying admissibility as structural degrees of
  freedom.
- **Do not directly define physical capacity realization**, unlike module or
  inverter configurations.
- Are therefore **retained as contextual structural descriptors**, not as
  equivalence keys for configuration classes.

These variables will be available for downstream stratification, conditioning,
or descriptive analysis, but will not define configuration equivalence unless
explicitly justified later.

---

##### Status

- Mounting context family collapse: **Complete**
- Artifact writing: **Deferred**



## Phase 6 — Unified Structural Configuration Assembly (Pre-DoF)

This phase assembles all previously collapsed configuration families into a **single, system-level structural configuration table**.

At this point in the notebook:

- System-level contextual information has already been frozen upstream (`system_context_descriptive.parquet`)
- Each configuration family (modules, inverters, layout/orientation, mounting context) has been:
  - Materialized from raw inputs
  - Diagnosed for multiplicity and dominance
  - Collapsed deterministically to **one row per system**
- Configuration dimensions that do not expand the admissible state space (e.g., battery storage) have been explicitly excluded

The objective of this phase is **pure assembly**, not analysis.

Specifically, this phase:
- Left-joins each collapsed configuration family onto the frozen system-level context
- Enforces strict **one-row-per-system invariants** after every merge
- Produces a complete, coherent structural configuration dataset at fixed system size
- Writes a single artifact that defines the **admissible configuration state space** for downstream analysis

No size conditioning, structural degrees-of-freedom computation, regime definition, deviation analysis, or risk evaluation is performed here.

The resulting artifact serves as the formal handoff from **structural configuration construction** to **analytic conditioning and degrees-of-freedom analysis** in the subsequent notebook.
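
A left join keeps every system in the base context and silently drops family rows whose key is absent from it. The mounting family above has 123,179 rows, which is more than the assembled table contains. The sketch below shows one way to make coverage visible during assembly, using the `indicator` option of `DataFrame.merge`. The frames and keys are invented, and `coverage` and `unmatched_family_keys` are hypothetical names.

```python
import pandas as pd

# Hypothetical base context and one collapsed family, both one row per system.
context = pd.DataFrame({
    "tts_link_id": ["s1", "s2", "s3"],
    "system_size_kw": [2.4, 6.25, 4.41],
})
mounting = pd.DataFrame({
    "tts_link_id": ["s1", "s2", "s9"],
    "is_ground_mounted": [False, True, False],
})

merged = context.merge(
    mounting,
    on="tts_link_id",
    how="left",
    validate="one_to_one",  # raises if either side has duplicate keys
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)

# Row-count invariance, as enforced in the assembly loop.
assert len(merged) == len(context)

# Share of base systems that received values from this family.
coverage = float((merged["_merge"] == "both").mean())

# Family keys that have no counterpart in the base context.
unmatched_family_keys = set(mounting["tts_link_id"]) - set(context["tts_link_id"])

merged = merged.drop(columns="_merge")
print(coverage, unmatched_family_keys)
```

Base systems without a match carry nulls in the family columns, so logging `coverage` per family helps separate "not reported" from "not joined" downstream.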


In [33]:
# Load base system-level context (frozen artifact)


# Frozen system-level context artifact produced earlier in Notebook 1
SYSTEM_CONTEXT_DESCRIPTIVE = Path(
    "../outputs/system_context_descriptive.parquet"
)

df_system_context = pd.read_parquet(
    SYSTEM_CONTEXT_DESCRIPTIVE
)

# Sanity check: one row per system
assert df_system_context["tts_link_id"].is_unique
base_row_count = len(df_system_context)


# Prepare collapsed configuration families (already in memory) 

collapsed_tables = [
    ("module", module_system_level),
    ("inverter", inverter_system_level),
    ("mounting", mounting_system_config),
]



# Iterative left-join with invariance checks 

df_structural_config = df_system_context.copy()

for name, df_cfg in collapsed_tables:
    # Enforce key integrity before merge
    assert df_cfg["tts_link_id"].is_unique, f"{name} table has duplicate tts_link_id"

    df_structural_config = df_structural_config.merge(
        df_cfg,
        on="tts_link_id",
        how="left",
        validate="one_to_one",
    )

    # Row-count invariance check
    assert len(df_structural_config) == base_row_count, (
        f"Row count changed after merging {name}"
    )


# Final integrity checks 

assert df_structural_config["tts_link_id"].is_unique
assert len(df_structural_config) == base_row_count


# Write final structural configuration artifact 

OUTPUT_STRUCTURAL_CONFIG = Path("../outputs/structural_configuration_pre_dof.parquet")

OUTPUT_STRUCTURAL_CONFIG.parent.mkdir(parents=True, exist_ok=True)

df_structural_config.to_parquet(OUTPUT_STRUCTURAL_CONFIG, index=False)

OUTPUT_STRUCTURAL_CONFIG


WindowsPath('../outputs/structural_configuration_pre_dof.parquet')

In [34]:
# Quick structural sanity check of merged artifact

# Shape and key integrity
print("Rows:", len(df_structural_config))
print("Unique systems:", df_structural_config["tts_link_id"].nunique())

# Column overview
print("\nNumber of columns:", df_structural_config.shape[1])
print("\nColumns:")
for col in df_structural_config.columns:
    print(" -", col)

# Peek at a few rows (no analysis)
df_structural_config.head(5)


Rows: 103309
Unique systems: 103309

Number of columns: 48

Columns:
 - tts_link_id
 - n_rows
 - n_installation_dates
 - n_system_sizes
 - n_prices
 - has_expansion
 - has_multiple_phases
 - system_size_kw
 - n_size_reports
 - installation_year
 - installation_year_cohort
 - cohort_n_systems
 - distribution_p25
 - baseline_expected_system_size_kw
 - distribution_p75
 - dispersion_iqr
 - dispersion_p90_p10_span
 - dispersion_min_size
 - dispersion_max_size
 - n_systems
 - distribution_p10
 - baseline_p25
 - distribution_p50
 - baseline_p75
 - distribution_p90
 - distribution_min_size
 - distribution_max_size
 - module_total_dc_capacity
 - module_effective_nameplate_capacity
 - module_effective_efficiency
 - module_mixture_count
 - dominant_module_model
 - dominant_module_manufacturer
 - inverter_total_ac_capacity
 - inverter_mixture_count
 - has_micro_inverter
 - has_dc_optimizer
 - dominant_inverter_slot
 - dominant_inverter_model
 - dominant_inverter_manufacturer
 - dominant_inverter_

| | tts_link_id | n_rows | n_installation_dates | n_system_sizes | n_prices | has_expansion | has_multiple_phases | system_size_kw | n_size_reports | installation_year | ... | dominant_inverter_model | dominant_inverter_manufacturer | dominant_inverter_output_capacity | is_ground_mounted | is_roof_mounted | is_tracking_system | is_new_construction | is_third_party_owned | is_self_installed | mounting_context_mixture_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tts_extension_id_10 | 2 | 2 | 2 | 2 | True | True | 2.4 | 2 | 2014.0 | ... | | | | False | True | False | False | True | False | 1 |
| 1 | tts_extension_id_100 | 2 | 2 | 2 | 2 | True | True | 6.25 | 2 | 2014.0 | ... | PVI-6000-OUTD-US [240V] | ABB | 6.0 | False | True | False | False | True | False | 1 |
| 2 | tts_extension_id_1000 | 2 | 2 | 1 | 2 | True | True | 4.41 | 2 | 2019.0 | ... | | | | False | True | False | False | False | False | 0 |
| 3 | tts_extension_id_10000 | 2 | 2 | 2 | 2 | True | True | 5.565 | 2 | 2015.0 | ... | PVI-5000-OUTD-S-US-Z-A [240V] | ABB | 5.0 | False | True | False | False | False | False | 0 |
| 4 | tts_extension_id_100000 | 2 | 2 | 2 | 2 | True | True | 3.263644 | 2 | 2016.0 | ... | | | | False | True | False | False | True | False | 1 |


## Notebook 1 Summary — Structural Configuration Assembly (Pre-DoF)

This notebook completes the construction of a **system-level structural configuration dataset** suitable for downstream degrees-of-freedom analysis.

Work completed in this notebook includes:

- Auditing and loading frozen, descriptive system-level context produced upstream in Repo 2
- Materializing configuration families from raw Tracking the Sun inputs
- Diagnosing multiplicity, dominance, and reporting artifacts within each family
- Collapsing each admissible configuration family into a **deterministic, one-row-per-system representation**
- Explicitly excluding configuration dimensions that do not expand the admissible state space (e.g., battery storage)
- Assembling all collapsed configuration families via left-joins onto the frozen system context
- Enforcing strict one-to-one system invariants throughout the assembly process
- Writing a single structural configuration artifact that defines the **admissible configuration state space at fixed system size**

No size conditioning, structural degrees-of-freedom computation, regime definition, deviation analysis, or risk assessment is performed here.

The resulting artifact represents the formal handoff from **structural configuration construction** to **analytic conditioning and degrees-of-freedom evaluation**, which begins in Notebook 2.
