# PopSim Data Preparation (Census Mode)

This notebook prepares all input files for PopulationSim using German Census grid data.

## Requirements

1. **GeoPackage** (`.gpkg`) - polygon defining your study area boundary
2. **Census data** (parquet or CSV) - 100m and 1km grid cells with population attributes
3. **MiD seed data** (CSV) - household and person survey data (`MiD2023_Haushalte.csv`, `MiD2023_Personen.csv`)

## Census Data Format

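Census files must contain cell IDs in the format below. Step 1 reads the 100m ID from the **first column** and the 1km ID from the **second column** (`columns[1]`, usually `GITTER_ID_1km`):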
- **100m**: `CRS3035RES100mN{northing}E{easting}` (e.g., `CRS3035RES100mN2689100E4337000`)
- **1km**: `CRS3035RES1000mN{northing}E{easting}` (e.g., `CRS3035RES1000mN2689000E4337000`)

Coordinates are in **EPSG:3035** (ETRS89-extended / LAEA Europe), in metres. Step 1 treats the northing and easting in the ID as the cell's lower-left corner. All other columns become available as control totals.
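
To check a file against this format before running the notebook, the ID layout can be expressed as a single pattern. The following is a minimal sketch, not the notebook's own code: `CELL_ID_RE`, `parse_cell_id` and `parent_1km_id` are illustrative names, and the snapping rule assumes that a 1km cell contains every 100m cell whose coordinates round down to it.

```python
import re

# One pattern for both resolutions; the group names are an assumption of this sketch.
CELL_ID_RE = re.compile(r"^CRS3035RES(?P<res>100|1000)mN(?P<n>\d+)E(?P<e>\d+)$")


def parse_cell_id(cell_id):
    """Return (resolution_m, northing, easting), or None if the ID is malformed."""
    m = CELL_ID_RE.match(str(cell_id))
    if m is None:
        return None
    return int(m["res"]), int(m["n"]), int(m["e"])


def parent_1km_id(cell_id_100m):
    """Snap a 100m cell ID to the ID of the 1km cell that contains it."""
    parsed = parse_cell_id(cell_id_100m)
    if parsed is None or parsed[0] != 100:
        return None
    _, n, e = parsed
    return f"CRS3035RES1000mN{n // 1000 * 1000}E{e // 1000 * 1000}"


print(parse_cell_id("CRS3035RES100mN2689100E4337000"))
# (100, 2689100, 4337000)
print(parent_1km_id("CRS3035RES100mN2689100E4337000"))
# CRS3035RES1000mN2689000E4337000
```

Anchoring the pattern with `^` and `$` rejects IDs with trailing characters, which a plain prefix match would accept.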

## What Gets Generated

### Single Mode (`regiostar_split=False`)
- `popsim/data/geo_cross_walk.csv` - geographic hierarchy
- `popsim/data/control_totals_*.csv` - census data formatted as control totals
- `popsim/data/seed_persons.csv`, `seed_households.csv` - filtered MiD data
- `popsim/configs/controls.csv` - control definitions for PopSim

### RegioStar Split Mode (`regiostar_split=True`)
Creates separate folders for each RegioStaR17 value found in the study area:
- `popsim_regiostar_121/` - census AND MiD filtered to RegioStaR17=121
- `popsim_regiostar_125/` - census AND MiD filtered to RegioStaR17=125
- etc.

Each folder contains complete PopSim inputs filtered to that regional type. The controls file is edited once in `popsim/configs/_prep3_controls.csv` and copied to all folders.
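
The split step itself is not shown in this section, so the following is only a sketch of the idea under stated assumptions: rows are plain dictionaries, the regional type sits in a `RegioStaR17` key, and folder names follow the `popsim_regiostar_<value>` pattern listed above. `split_by_regiostar` is a hypothetical helper, not a function from the notebook.

```python
from collections import defaultdict


def split_by_regiostar(rows, base_name="popsim_regiostar"):
    """Group rows by RegioStaR17 value and map each group to its folder name."""
    groups = defaultdict(list)
    for row in rows:
        value = row.get("RegioStaR17")
        if value is None:
            continue  # rows without a regional type are skipped in this sketch
        groups[int(value)].append(row)
    # Sort by value so the folder order is stable between runs.
    return {f"{base_name}_{value}": members for value, members in sorted(groups.items())}


cells = [
    {"GITTER_ID_100m": "cell_a", "RegioStaR17": 125.0},
    {"GITTER_ID_100m": "cell_b", "RegioStaR17": 121.0},
    {"GITTER_ID_100m": "cell_c", "RegioStaR17": 125.0},
]
for folder, members in split_by_regiostar(cells).items():
    print(folder, len(members))
# popsim_regiostar_121 1
# popsim_regiostar_125 2
```

The same grouping would be applied to the MiD seed rows so that each folder pairs census cells and survey households of one regional type.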

## Configuration

**Edit all paths and settings below before running.**

In [2]:
# =============================================================================
# USER CONFIGURATION
# =============================================================================

# --- Paths (relative to this notebook) ---
inputs_dir = "inputs"      # Shared input files (MiD, census, geopackages)
popsim_dir = "popsim"      # Base PopSim folder (for controls template, settings)

# --- Study Area ---
geopackage_path = f"{inputs_dir}/outlineNI.gpkg"
geopackage_crs = None  # Only needed if the file has no embedded CRS (None = use the embedded CRS; Step 1 raises an error if there is none)

# --- Census Data (parquet or CSV) ---
census_100m_path = f"{inputs_dir}/cells_100m_with_gender_backf_binneds_happyorphans_with_aggs_regiostar.parquet"
census_1km_path = f"{inputs_dir}/cells_1km_with_binneds.parquet"

# Column containing number of households (run Step 1 with None to see options)
household_column = "Insgesamt_Haushalte_Groesse_des_privaten_Haushalts_100m-Gitter_adj"  # e.g., "Insgesamt_Haushalte_100m-Gitter"

# --- MiD Seed Data (semicolon-separated CSVs) ---
mid_households_path = f"{inputs_dir}/MiD2023_Haushalte.csv"
mid_persons_path = f"{inputs_dir}/MiD2023_Personen.csv"

# --- MiD Filtering (set to None to skip that filter) ---
kernwo = [1, 2, 3]   # Day-of-week groups to keep, e.g. [2, 3]; None skips the filter. 1=Mon, 2=Tue-Thu, 3=Fri, 4=Sat-Sun
regiostar17 = None   # Regional types to keep, e.g. [121, 123, 124]; None skips the filter (only used when regiostar_split=False)

# --- RegioStar Split Mode ---
# When True: Creates separate popsim folders for each RegioStaR17 value in the study area
# Each folder gets census AND MiD data filtered to that RegioStaR17 value
regiostar_split = True

# --- CSV Separators ---
census_csv_sep = ";"   # For input CSVs (ignored for parquet)
intermediate_sep = ";"  # For intermediate files (use ";" for German Excel)
# Note: Final PopSim files are always comma-separated

# --- Advanced ---
output_everything = False  # True = output all PopSim intermediates
seed_geography = "STAAT"   # Geography level for seed data (usually unchanged)

# =============================================================================
# END CONFIGURATION
# =============================================================================
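
Because every later step depends on these settings, it can help to check them before running Step 1. The sketch below is one possible check, not part of the notebook: `check_config` and `VALID_KERNWO` are hypothetical names, the valid `kernwo` codes are taken from the comment in the configuration cell, and the warning about `regiostar17` assumes that the setting has no effect in split mode, as that comment suggests.

```python
from pathlib import Path

VALID_KERNWO = {1, 2, 3, 4}  # codes listed in the kernwo comment of the configuration cell


def check_config(paths, kernwo, regiostar17, regiostar_split):
    """Return a list of problems; an empty list means the settings look usable."""
    problems = []
    for label, path in paths.items():
        if not Path(path).is_file():
            problems.append(f"{label}: file not found: {path}")
    if kernwo is not None:
        bad = sorted(set(kernwo) - VALID_KERNWO)
        if bad:
            problems.append(f"kernwo: unknown codes {bad}")
    if regiostar_split and regiostar17 is not None:
        problems.append("regiostar17 is set but regiostar_split=True; the filter is likely ignored")
    return problems


# Example call with the values from the configuration cell.
for problem in check_config(
    {"geopackage_path": "inputs/outlineNI.gpkg"},
    kernwo=[1, 2, 3],
    regiostar17=None,
    regiostar_split=True,
):
    print("WARNING:", problem)
```

Returning a list instead of raising on the first problem lets all issues be fixed in one pass.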

## Step 1: Load Study Area and Filter Census

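Loads the GeoPackage, reprojects it to EPSG:3035, filters census cells first by the study area's bounding box and then by polygon intersection, and displays the first rows so you can see the available columns.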
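
The coarse filter in the cell below compares the coordinates parsed from each cell ID against the bounding box. If those coordinates are the cell's lower-left corner, as the notebook's `box(e, n, e + 100, n + 100)` assumes, a cell that straddles the southern or western edge can be dropped even though part of it lies inside. The sketch below contrasts that corner test with an extent-based test; both function names are illustrative and the bounds are rounded from the Step 1 output.

```python
def corner_in_bbox(n, e, bounds):
    """Corner test: only the lower-left corner of the cell is checked."""
    minx, miny, maxx, maxy = bounds
    return minx <= e <= maxx and miny <= n <= maxy


def cell_overlaps_bbox(n, e, bounds, size=100):
    """Extent test: the whole square cell [e, e+size] x [n, n+size] is checked."""
    minx, miny, maxx, maxy = bounds
    return e + size >= minx and e <= maxx and n + size >= miny and n <= maxy


# (minx, miny, maxx, maxy) in EPSG:3035, rounded for this example
bounds = (4095955.2, 3131621.2, 4428201.3, 3421366.2)

# A cell whose lower-left corner lies about 21 m south of the box but which reaches into it.
n, e = 3131600, 4299100
print(corner_in_bbox(n, e, bounds))      # False
print(cell_overlaps_bbox(n, e, bounds))  # True
```

The extent test only admits extra candidates along the edges; the polygon intersection that follows still decides which cells are kept.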

In [23]:
import os
import re
import pandas as pd
import geopandas as gpd
from shapely.geometry import box

print("[Step 1/4] Loading study area and filtering census...")
print("=" * 60)

# Ensure output directories exist
os.makedirs(f"{popsim_dir}/data", exist_ok=True)
os.makedirs(f"{popsim_dir}/configs", exist_ok=True)

# Load GeoPackage
print(f"Loading GeoPackage: {geopackage_path}")
study_area = gpd.read_file(geopackage_path)

# Handle CRS
if study_area.crs is None and geopackage_crs:
    study_area = study_area.set_crs(geopackage_crs)
    print(f"  Set CRS to: {geopackage_crs}")
elif study_area.crs is None:
    raise ValueError("GeoPackage has no CRS. Please set geopackage_crs in configuration.")

# Transform to EPSG:3035 (Census CRS)
study_area_3035 = study_area.to_crs("EPSG:3035")
bounds = study_area_3035.total_bounds  # minx, miny, maxx, maxy
print(f"  Study area bounds (EPSG:3035): {bounds}")

# Parse cell ID to extract coordinates
def parse_cell_id_100m(cell_id):
    """Extract N,E coordinates from 100m cell ID like CRS3035RES100mN2689100E4337000"""
    match = re.match(r'CRS3035RES100mN(\d+)E(\d+)', str(cell_id))
    if match:
        return int(match.group(1)), int(match.group(2))
    return None, None

def parse_cell_id_1km(cell_id):
    """Extract N,E coordinates from 1km cell ID like CRS3035RES1000mN2689000E4337000"""
    match = re.match(r'CRS3035RES1000mN(\d+)E(\d+)', str(cell_id))
    if match:
        return int(match.group(1)), int(match.group(2))
    return None, None

def get_1km_id_from_100m(cell_id_100m):
    """Convert 100m cell ID to corresponding 1km cell ID."""
    n, e = parse_cell_id_100m(cell_id_100m)
    if n is None:
        return None
    n_1km = (n // 1000) * 1000
    e_1km = (e // 1000) * 1000
    return f"CRS3035RES1000mN{n_1km}E{e_1km}"

# Load 100m census
print(f"\nLoading 100m census: {census_100m_path}")

if census_100m_path.endswith('.parquet'):
    import pyarrow.parquet as pq
    pf_100m = pq.ParquetFile(census_100m_path)
    print(f"  Total rows: {pf_100m.metadata.num_rows:,}")
    print(f"  Total columns: {pf_100m.metadata.num_columns}")
    
    print("  Filtering to study area (this may take a moment)...")
    filtered_chunks = []
    total_read = 0
    
    for batch in pf_100m.iter_batches(batch_size=100000):
        df_batch = batch.to_pandas()
        total_read += len(df_batch)
        
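        # Parse northing/easting from the cell ID in the first column;
        # IDs that do not match the expected format become NaN and are filtered out
        coords = df_batch.iloc[:, 0].apply(parse_cell_id_100m)
        df_batch['_N'] = pd.to_numeric(coords.apply(lambda x: x[0]), errors='coerce')
        df_batch['_E'] = pd.to_numeric(coords.apply(lambda x: x[1]), errors='coerce')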
        
        mask = (
            (df_batch['_N'] >= bounds[1]) & (df_batch['_N'] <= bounds[3]) &
            (df_batch['_E'] >= bounds[0]) & (df_batch['_E'] <= bounds[2])
        )
        df_filtered = df_batch[mask].drop(columns=['_N', '_E'])
        
        if len(df_filtered) > 0:
            filtered_chunks.append(df_filtered)
        
        if total_read % 500000 == 0:
            print(f"    Processed {total_read:,} rows...")
    
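    # pd.concat raises an unhelpful error on an empty list, so fail with a clear message
    if not filtered_chunks:
        raise ValueError("No 100m census cells fall inside the study area bounding box.")
    census_100m = pd.concat(filtered_chunks, ignore_index=True)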
else:
    print(f"  Loading CSV with separator: '{census_csv_sep}'")
    census_100m_full = pd.read_csv(census_100m_path, sep=census_csv_sep)
    print(f"  Total rows: {len(census_100m_full):,}")
    
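    # Parse northing/easting from the cell ID in the first column;
    # IDs that do not match the expected format become NaN and are filtered out
    coords = census_100m_full.iloc[:, 0].apply(parse_cell_id_100m)
    census_100m_full['_N'] = pd.to_numeric(coords.apply(lambda x: x[0]), errors='coerce')
    census_100m_full['_E'] = pd.to_numeric(coords.apply(lambda x: x[1]), errors='coerce')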
    
    mask = (
        (census_100m_full['_N'] >= bounds[1]) & (census_100m_full['_N'] <= bounds[3]) &
        (census_100m_full['_E'] >= bounds[0]) & (census_100m_full['_E'] <= bounds[2])
    )
    census_100m = census_100m_full[mask].drop(columns=['_N', '_E']).copy()

print(f"  Filtered to {len(census_100m):,} cells in bounding box")

# Fine filter: check actual intersection with study area polygon
print("  Performing precise polygon intersection...")
id_col_100m = census_100m.columns[0]

def cell_intersects_study_area(cell_id):
    n, e = parse_cell_id_100m(cell_id)
    if n is None:
        return False
    cell_geom = box(e, n, e + 100, n + 100)
    return study_area_3035.geometry.intersects(cell_geom).any()

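# Heuristic: test a random sample of up to 100 cells first. If more than 90% of them
# intersect the polygon, the bounding box is treated as a close enough approximation
# and the slow per-cell test is skipped, so some cells outside the polygon may remain.
# random_state fixes the sample so repeated runs take the same branch.
sample_mask = census_100m[id_col_100m].sample(min(100, len(census_100m)), random_state=0).apply(cell_intersects_study_area)
if sample_mask.mean() > 0.9:
    print("  Bounding box is tight, skipping detailed intersection.")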
else:
    mask = census_100m[id_col_100m].apply(cell_intersects_study_area)
    census_100m = census_100m[mask]
    print(f"  After polygon intersection: {len(census_100m):,} cells")

# Find likely household columns
# print(f"\n{'='*60}")
# print("SUGGESTED HOUSEHOLD COLUMNS (first 5 values):")
# print(f"{'='*60}")

# hh_keywords = ['haushalt', 'household', 'hh_', 'wohnung']
# suggested = []
# for col in census_100m.columns:
#     col_lower = col.lower()
#     if any(kw in col_lower for kw in hh_keywords):
#         suggested.append(col)

# if suggested:
#     # Show all columns without truncation
#     with pd.option_context('display.max_columns', None, 'display.width', None):
#         display(census_100m[suggested].head())
# else:
#     print("  No household-related columns found.")
# Show the first rows with every column visible so a household column can be picked
with pd.option_context('display.max_columns', None, 'display.width', None):
    display(census_100m.head())

# print(f"\nTotal columns available: {len(census_100m.columns)}")
# print("Use census_100m.columns to see all column names.")

# Load 1km census
print(f"\n{'='*60}")
print(f"Loading 1km census: {census_1km_path}")

if census_1km_path.endswith('.parquet'):
    census_1km_full = pd.read_parquet(census_1km_path)
else:
    census_1km_full = pd.read_csv(census_1km_path, sep=census_csv_sep)
print(f"  Total rows: {len(census_1km_full):,}")

# Filter 1km by deriving from 100m cells
km_ids_needed = set(census_100m[id_col_100m].apply(get_1km_id_from_100m).dropna())
id_col_1km = census_1km_full.columns[1]  # Usually GITTER_ID_1km
census_1km = census_1km_full[census_1km_full[id_col_1km].isin(km_ids_needed)].copy()
print(f"  Filtered to {len(census_1km):,} 1km cells")

# Save filtered data as parquet
census_100m.to_parquet(f'{popsim_dir}/data/_census_100m_filtered.parquet', index=False)
census_1km.to_parquet(f'{popsim_dir}/data/_census_1km_filtered.parquet', index=False)
print(f"\nSaved filtered census to {popsim_dir}/data/_census_*_filtered.parquet")

print(f"\n{'='*60}")
print("SUMMARY")
print(f"{'='*60}")
print(f"  100m cells in study area: {len(census_100m):,}")
print(f"  1km cells in study area: {len(census_1km):,}")
print(f"  MiD households: {mid_households_path}")
print(f"  MiD persons: {mid_persons_path}")
kernwo_list = kernwo if isinstance(kernwo, list) else ([kernwo] if kernwo else None)
regiostar17_list = regiostar17 if isinstance(regiostar17, list) else ([regiostar17] if regiostar17 else None)
print(f"  MiD filters: kernwo={kernwo_list}, regiostar17={regiostar17_list}")
print(f"  RegioStar split mode: {regiostar_split}")
print("\nSet 'household_column' in Configuration and re-run Step 1,")
print("or proceed to Step 2 if already set.")
print("\n[Step 1/4] Complete.")

[Step 1/4] Loading study area and filtering census...
Loading GeoPackage: inputs/outlineNI.gpkg
  Study area bounds (EPSG:3035): [4095955.24268242 3131621.21928417 4428201.34945498 3421366.23451482]

Loading 100m census: inputs/cells_100m_with_gender_backf_binneds_happyorphans_with_aggs_regiostar.parquet
  Total rows: 3,148,482
  Total columns: 570
  Filtering to study area (this may take a moment)...
    Processed 500,000 rows...
    Processed 1,000,000 rows...
    Processed 1,500,000 rows...
    Processed 2,000,000 rows...
    Processed 2,500,000 rows...
    Processed 3,000,000 rows...
  Filtered to 898,155 cells in bounding box
  Performing precise polygon intersection...
  After polygon intersection: 425,907 cells


Preview of `census_100m.head()` (570 columns in total; selected columns shown, `…` marks omitted columns):

| | GITTER_ID_100m | Insgesamt_Bevoelkerung_Alter_in_10er-Jahresgruppen_100m-Gitter | … | GITTER_ID_1km | GITTER_ID_10km | POP_TOTAL_100m | scale | is_orphan | POP_TOTAL_100m_adj | … |
|---|---|---|---|---|---|---|---|---|---|---|
| 5436 | CRS3035RES100mN3133200E4299100 | 4.0 | … | CRS3035RES1000mN3133000E4299000 | CRS3035RES10000mN3130000E4290000 | 5.25 | 1.3125 | False | 5.250331 | … |
| 6244 | CRS3035RES100mN3133400E4298500 | 13.0 | … | CRS3035RES1000mN3133000E4298000 | CRS3035RES10000mN3130000E4290000 | 12.912261 | 0.993251 | False | 12.913074 | … |
077,0.73138,0.594084,0.46701,1.22128,0.902188,1.458327,0.581385,0.323943,0.11446,0.678091,0.582743,0.375481,1.254681,0.911382,1.517279,0.585013,0.340802,0.273546,6.394055,6.519019,12.911801,10.967703,3.980464,4.229951,4.229951,10.905376,10.905376,12.911801,1.0,12.0,125.0,74.0,53.0,77.0,55.0
6616,CRS3035RES100mN3133500E4298500,8.0,0.0,0.0,3.0,0.0,0.0,3.0,4.0,0.0,0.0,8.0,0.0,3.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,8.0,8.0,49.46,92.75,2.0,0.0,0.0,0.0,45.63,0.0,3.0,0.255614,2.314787,0.122414,0.0,0.062349,0.190646,0.0,0.014122,0.031261,8.0,1.21024,4.182347,0.233909,2.301371,0.0,0.0,0.0,0.017857,3.0,0.0,0.034636,0.066121,0.127495,0.123141,0.196866,0.313609,0.219786,0.312067,0.182661,0.30929,0.153003,0.167766,0.141846,0.199331,0.110921,0.315655,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,7.045182,0.437175,0.150745,0.149946,0.142595,0.020081,3.0,1.002892,1.173278,0.560106,0.285712,0.121481,0.028994,3.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,8.0,3.0,3.0,7.0,3.0,0.0,0.0,3.0,8.0,8.0,0.0,8.0,8.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,3.0,1.33373,1.571849,0.006595,0.134837,0.08495,0.017772,0.02273,3.0,3.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,3.0,0.016099,0.201042,0.405345,0.6068,0.788708,0.347695,0.608506,8.0,6.0,0.0,0.0,0.0,CRS3035RES1000mN3133000E4298000,CRS3035RES10000mN3130000E4290000,7.946007,0.993251,False,7.946507,0.041479,0.043207,0.043271,0.043607,0.044498,0.045007,0.043822,0.04311,0.042063,0.041332,0.038215,0.038429,0.038263,0.03873,0.039221,0.038283,0.029961,0.030887,0.02283,0.028846,0.155991,0.164429,0.172036,0.175895,0.181943,0.21021,0.204715,0.20436,0.211704,0.217521,0.054222,0.058557,0.060067,0.060352,0.060735,0.059358,0.057665,0.05741,0.057021,0.057866,0.049843,0.049795,0.048398,0.046635,0.046424,0.045308,0.043992,0.04404,0.044267,0.045818,0.150343,0.157599,0.165025,0.175122,0.180966,0.182127,0.184093,0.18583,0.185947,0.180853,0.3107637,0.2992705,0.2899051,0.2744665,0.2636136,0.082314,0.079619,0.129961,0.124298,0.121497,0.033817,0.032656,0.032,0.029405,0.026236,0.0355,0.02504,0.032395,0.035943,0.03352,0.022305,0.021385,0.022887,0.019924,0.017339,0.0152
84,0.013335,0.01166,0.008255,0.006636,0.00579,0.005026,0.004083,0.003181,0.002373,0.001735,0.00125,0.000862,0.000551,0.000369,0.000513,0.431397,0.343665,1.898803,0.583253,0.464521,1.747904,1.975709,0.316511,0.184745,66330019019,6,6,33,0,19,19,0.023445,0.018034,0.025924,0.017283,0.013314,0.029957,0.016066,0.027541,0.031313,0.013185,0.026254,0.018753,0.024346,0.019476,0.025866,0.017244,0.020306,0.021757,0.017019,0.024313,0.024457,0.013757,0.015372,0.023057,0.023914,0.014349,0.018074,0.020656,0.020918,0.018303,0.019907,0.018376,0.009987,0.019974,0.014772,0.016115,0.012453,0.010377,0.011218,0.017628,0.095994,0.059996,0.077378,0.087051,0.172036,0.0,0.079952,0.095943,0.121295,0.060648,0.105105,0.105105,0.089563,0.115152,0.10218,0.10218,0.100281,0.111423,0.10876,0.10876,0.030647,0.023575,0.024399,0.034158,0.030034,0.030034,0.024141,0.036211,0.037116,0.023619,0.031241,0.028117,0.024182,0.033483,0.023274,0.034136,0.029868,0.027153,0.032791,0.025075,0.026306,0.023537,0.019918,0.029877,0.020476,0.027922,0.020574,0.026061,0.023212,0.023212,0.023987,0.021321,0.024865,0.019127,0.023714,0.020326,0.022134,0.022134,0.025897,0.019921,0.078178,0.072165,0.073875,0.083725,0.077928,0.087096,0.087561,0.087561,0.092494,0.088472,0.102738,0.079388,0.096891,0.087202,0.083134,0.102695,0.082162,0.103784,0.081675,0.099177,0.1553819,0.1553819,0.1260086,0.1732619,0.09996728,0.1899378,0.1130156,0.1614509,0.145442,0.1181716,0.037991,0.044323,0.057905,0.021714,0.060648,0.069312,0.074579,0.049719,0.066271,0.055226,0.018209,0.015608,0.016328,0.016328,0.016,0.016,0.009048,0.020357,0.005247,0.020989,0.023666,0.011833,0.01252,0.01252,0.019056,0.013339,0.012837,0.023106,0.021331,0.012189,0.006083,0.016222,0.013366,0.008019,0.011444,0.011444,0.007471,0.012452,0.010403,0.006936,0.005731,0.009552,0.0,0.013335,0.0,0.01166,0.0,0.008255,0.0,0.006636,0.0,0.00579,0.0,0.005026,0.0,0.004083,0.0,0.003181,0.0,0.002373,0.0,0.001735,0.0,0.00125,0.0,0.000862,0.0,0.000551,0.0,0.000369,0.0,0.000513,0.223854,0.171072,1.0525
45,0.287692,0.231083,0.856638,0.93721,0.154242,0.054499,0.207543,0.172593,0.846258,0.295561,0.233438,0.891266,1.038499,0.162269,0.130246,3.968834,3.977673,7.945724,2.991192,0.0,3.172463,3.172463,2.974194,2.974194,7.945724,1.0,12.0,125.0,74.0,53.0,77.0,55.0
7017,CRS3035RES100mN3133600E4298400,18.0,3.0,0.0,0.0,3.0,0.0,4.0,3.0,0.0,3.0,18.0,3.0,0.0,3.0,4.0,6.0,0.0,33.33,16.67,0.0,18.0,15.0,67.02,102.1,1.88,0.0,0.0,0.0,50.01,70.0,10.0,0.788048,5.441221,0.377398,0.0,0.192218,3.031842,0.0,0.043536,0.096376,18.0,3.965199,8.310729,2.26745,3.304715,0.0,0.0,0.0,0.029786,10.0,0.0,0.115455,0.220404,0.424984,0.41047,0.656222,1.045362,0.73262,1.040223,0.608871,1.030967,0.510011,0.559218,0.472818,0.664435,0.369737,1.052183,9.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,8.0,0.0,0.0,0.0,0.0,9.0,0.0,0.0,5.0,3.0,0.0,3.0,0.0,0.0,9.0,0.0,4.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,9.0,0.047979,0.02224,0.0,5.751912,3.086944,0.046969,18.0,14.644002,2.763247,0.153108,0.152296,0.14483,0.020395,8.0,3.042204,2.924683,0.597308,1.735236,0.12955,0.03092,8.0,0.0,0.0,0.0,0.0,0.0,10.0,0.0,0.0,0.0,5.0,4.0,0.0,0.0,0.0,18.0,0.0,11.0,7.0,8.0,0.0,4.0,3.0,18.0,18.0,0.0,18.0,18.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,4.0,0.0,3.0,0.0,3.0,8.0,3.373131,1.854225,0.020747,0.424162,0.267232,0.055905,2.464499,10.0,8.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,0.019916,0.248715,0.501466,2.588059,2.590248,1.790782,2.174793,18.0,16.0,3.0,0.0,0.0,CRS3035RES1000mN3133000E4298000,CRS3035RES10000mN3130000E4290000,17.878515,0.993251,False,17.879641,0.247122,0.257416,0.257799,0.2598,0.265107,0.268142,0.261079,0.256837,0.250599,0.246242,0.079046,0.07949,0.079147,0.080113,0.081129,0.079188,0.061974,0.06389,0.046933,0.059302,0.07769,0.081892,0.085681,0.087603,0.090615,0.104693,0.101956,0.10178,0.105437,0.108334,0.261892,0.282831,0.290124,0.291502,0.29335,0.286699,0.278524,0.277289,0.275409,0.279492,0.102466,0.102367,0.099494,0.095871,0.095437,0.093142,0.090438,0.090536,0.091003,0.094191,0.340988,0.357447,0.374288,0.39719,0.410445,0.413076,0.417536,0.421475,0.42174,0.410187,0.1648364,0.1587401,0.1537725,0.1455835,0.1398268,0.3552,0.343573,0.560805,0.536371,0.524284,0.090916,0.087795,0.086032,0.079055,0.070535,0.095441,0.06732,0.087095,0.096634,0.09
0118,0.23945,0.22957,0.245698,0.213884,0.186134,0.164074,0.143153,0.125174,0.088622,0.07124,0.062157,0.053952,0.043835,0.034151,0.025473,0.018624,0.013421,0.009253,0.005911,0.003964,0.005502,2.570142,0.710212,0.945682,2.817112,0.954944,3.964371,3.082992,0.850942,1.983244,66330019019,6,6,33,0,19,19,0.139678,0.107444,0.154449,0.102966,0.079323,0.178476,0.095716,0.164084,0.186557,0.07855,0.156416,0.111726,0.145044,0.116035,0.154102,0.102735,0.120979,0.12962,0.101394,0.144848,0.05059,0.028457,0.031796,0.047694,0.049467,0.02968,0.037386,0.042727,0.043269,0.03786,0.041178,0.03801,0.020658,0.041316,0.030556,0.033334,0.0256,0.021333,0.023062,0.03624,0.047809,0.029881,0.038538,0.043355,0.085681,0.0,0.03982,0.047784,0.06041,0.030205,0.052347,0.052347,0.044606,0.05735,0.05089,0.05089,0.049944,0.055493,0.054167,0.054167,0.148026,0.113866,0.117846,0.164985,0.145062,0.145062,0.116601,0.174901,0.17927,0.114081,0.150894,0.135805,0.1168,0.161723,0.112414,0.164874,0.144262,0.131147,0.158379,0.121113,0.054079,0.048386,0.040947,0.06142,0.042094,0.0574,0.042296,0.053575,0.047719,0.047719,0.049311,0.043832,0.051117,0.039321,0.04875,0.041786,0.045501,0.045501,0.053238,0.040953,0.177314,0.163674,0.167553,0.189894,0.176747,0.197541,0.198595,0.198595,0.209783,0.200662,0.233017,0.180059,0.219756,0.19778,0.188555,0.23292,0.18635,0.23539,0.185246,0.224941,0.0824182,0.0824182,0.06683794,0.09190217,0.053025,0.1007475,0.05994614,0.08563735,0.07714584,0.062681,0.163938,0.191261,0.249871,0.093702,0.261709,0.299096,0.321822,0.214548,0.285973,0.238311,0.048955,0.041961,0.043898,0.043898,0.043016,0.043016,0.024325,0.05473,0.014107,0.056428,0.063628,0.031814,0.03366,0.03366,0.051232,0.035863,0.034512,0.062122,0.057348,0.03277,0.065305,0.174146,0.143481,0.086089,0.122849,0.122849,0.080206,0.133677,0.11168,0.074454,0.061528,0.102546,0.0,0.143153,0.0,0.125174,0.0,0.088622,0.0,0.07124,0.0,0.062157,0.0,0.053952,0.0,0.043835,0.0,0.034151,0.0,0.025473,0.0,0.018624,0.0,0.013421,0.0,0.009253,0.0,0.005911,0.0,0.0
03964,0.0,0.005502,1.333657,0.353561,0.524211,1.389554,0.475051,1.942915,1.622687,0.41468,0.585049,1.236485,0.356651,0.421472,1.427558,0.479893,2.021456,1.460304,0.436262,1.398194,8.641367,9.238276,17.877878,9.970639,8.956043,8.459902,8.459902,9.913979,9.913979,17.877878,1.0,12.0,125.0,74.0,53.0,77.0,55.0
7018,CRS3035RES100mN3133600E4298500,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,56.23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.551923,2.116137,0.106673,0.19677,0.0,0.0,0.0,0.008144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,2.623896,0.172702,0.05955,0.059235,0.056331,0.007933,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,3.0,3.0,0.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,3.0,0.0,0.0,0.0,CRS3035RES1000mN3133000E4298000,CRS3035RES10000mN3130000E4290000,2.979753,0.993251,False,2.97994,0.016709,0.017405,0.017431,0.017566,0.017925,0.01813,0.017652,0.017366,0.016944,0.016649,0.015295,0.015381,0.015315,0.015502,0.015698,0.015323,0.011992,0.012363,0.009473,0.011969,0.015119,0.015937,0.016674,0.017048,0.017634,0.020374,0.019841,0.019807,0.020519,0.021082,0.02179,0.023532,0.024139,0.024253,0.024407,0.023854,0.023173,0.023071,0.022914,0.023254,0.02003,0.020011,0.019449,0.018741,0.018656,0.018207,0.017679,0.017698,0.017789,0.018412,0.078202,0.081977,0.085839,0.091092,0.094132,0.094735,0.095758,0.096661,0.096722,0.094072,0.1645544,0.1584685,0.1535094,0.1453344,0.1395876,3e-06,2e-06,4e-06,4e-06,4e-06,0.026453,0.025544,0.025031,0.023001,0.020523,0.027769,0.019587,0.025341,0.028116,0.02622,0.017448,0.016728,0.017903,0.015585,0.013563,0.011956,0.010431,0.009121,0.006458,0.005191,0.004529,0.003931,0.003194,0.002489,0.001856,0.001357,0.000978,0.000674,0.000431,0.000289,0.000401,0.173776,0.13831,0.184035,0.234386,0.186672,0.909191,0.761471,0.247586,0.144514,3
1590034034,3,1,59,0,34,34,0.006962,0.009747,0.009635,0.00777,0.009159,0.008272,0.010947,0.006619,0.007534,0.010391,0.009911,0.008219,0.011523,0.006129,0.008683,0.008683,0.009437,0.007507,0.008946,0.007703,0.008667,0.006628,0.006729,0.008652,0.009309,0.006006,0.006965,0.008537,0.009372,0.006326,0.008796,0.006526,0.005067,0.006925,0.006568,0.005795,0.00532,0.004152,0.005678,0.006291,0.009128,0.00599,0.006788,0.009149,0.010671,0.006003,0.00586,0.011188,0.008082,0.009552,0.012224,0.00815,0.009281,0.010561,0.007645,0.012162,0.012649,0.00787,0.009104,0.011979,0.011673,0.010117,0.011159,0.012372,0.009594,0.014545,0.013255,0.010999,0.014483,0.009924,0.011927,0.011927,0.010621,0.012552,0.012276,0.010795,0.009988,0.012926,0.010684,0.01257,0.008144,0.011886,0.010205,0.009805,0.010861,0.008588,0.00926,0.009481,0.009426,0.00923,0.009311,0.008897,0.006014,0.011664,0.009618,0.008079,0.007376,0.010413,0.010596,0.007817,0.031811,0.046391,0.03201,0.049967,0.043647,0.042192,0.042906,0.048186,0.051491,0.042641,0.049563,0.045172,0.040319,0.055439,0.045192,0.051469,0.041913,0.054809,0.045632,0.04844,0.0762274,0.088327,0.06704438,0.09142415,0.07549644,0.07801297,0.07789925,0.06743517,0.07353277,0.06605486,1e-06,1e-06,1e-06,1e-06,2e-06,2e-06,2e-06,2e-06,2e-06,2e-06,0.013226,0.013226,0.012648,0.012896,0.010763,0.014268,0.013801,0.009201,0.011703,0.00882,0.015785,0.011985,0.00901,0.010577,0.012003,0.013337,0.013121,0.014995,0.014234,0.011986,0.009237,0.008211,0.007088,0.00964,0.006962,0.010941,0.007793,0.007793,0.00667,0.006893,0.00515,0.006806,0.004392,0.006039,0.003274,0.005847,0.003229,0.003229,0.002942,0.002249,0.001698,0.002831,0.001966,0.001966,0.001996,0.001198,0.000747,0.001742,0.0,0.001856,0.0,0.001357,0.0,0.000978,0.0,0.000674,0.0,0.000431,0.0,0.000289,0.0,0.000401,0.092736,0.072471,0.091432,0.11566,0.090812,0.424484,0.370209,0.126295,0.063145,0.081041,0.065839,0.092603,0.118725,0.09586,0.484706,0.391262,0.121291,0.081369,1.447244,1.532696,2.979646,0.0,0.0,0.0,0.0,0.0,0.0,2.979646,
1.0,12.0,125.0,74.0,53.0,77.0,55.0



Loading 1km census: inputs/cells_1km_with_binneds.parquet
  Total rows: 212,758
  Filtered to 30,142 1km cells

Saved filtered census to popsim/data/_census_*_filtered.parquet

SUMMARY
  100m cells in study area: 425,907
  1km cells in study area: 30,142
  MiD households: inputs/MiD2023_Haushalte.csv
  MiD persons: inputs/MiD2023_Personen.csv
  MiD filters: kernwo=[1, 2, 3], regiostar17=None
  RegioStar split mode: True

Set 'household_column' in Configuration and re-run Step 1,
or proceed to Step 2 if already set.

[Step 1/4] Complete.


## Step 2: Generate Geo Crosswalk and Control Totals

Creates the geographic hierarchy and control totals from filtered census data:
- `geo_cross_walk.csv` - mapping ZENSUS100m → ZENSUS1km → STAAT → WELT
- `control_totals_*.csv` - one file per geography level
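
As a minimal sketch of how one crosswalk row is derived: the 1km parent of a 100m cell can be obtained by flooring the northing and easting in the cell ID to a multiple of 1000, mirroring the floor-division logic of the notebook's own helper. The function name `parent_1km_id` and the `crosswalk_rows` list are hypothetical and used only for illustration; the two cell IDs are taken from the Step 1 output.

```python
import re

def parent_1km_id(cell_id_100m):
    """Return the 1km cell ID containing a 100m cell, or None if the ID is malformed."""
    match = re.match(r"CRS3035RES100mN(\d+)E(\d+)$", str(cell_id_100m))
    if match is None:
        return None
    northing, easting = int(match.group(1)), int(match.group(2))
    # Floor both coordinates to the next lower multiple of 1000 m
    return f"CRS3035RES1000mN{northing // 1000 * 1000}E{easting // 1000 * 1000}"

cells = [
    "CRS3035RES100mN3133400E4298500",
    "CRS3035RES100mN3133600E4298400",
]

# One crosswalk row per 100m cell; STAAT and WELT are constant placeholder levels
crosswalk_rows = [
    {"ZENSUS100m": c, "ZENSUS1km": parent_1km_id(c), "STAAT": 1, "WELT": 1}
    for c in cells
]
for row in crosswalk_rows:
    print(row)
```

Both example cells fall into the same 1km parent, so they share one row in the 1km control totals.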

In [25]:
import pandas as pd
import numpy as np
import os
import re
import yaml
from unidecode import unidecode

print("[Step 2/4] Generating geo crosswalk and control totals...")
print("=" * 60)

if household_column is None:
    raise ValueError("household_column not set! Set it in Configuration and re-run Step 1.")

# Load filtered 100m census from Step 1 (1km is no longer used for controls)
census_100m = pd.read_parquet(f'{popsim_dir}/data/_census_100m_filtered.parquet')
print(f"Loaded {len(census_100m):,} 100m cells")

# Validate household column
if household_column not in census_100m.columns:
    raise ValueError(f"household_column '{household_column}' not found in census data.")

# Check household values
hh_values = census_100m[household_column]
if hh_values.isna().any():
    na_count = hh_values.isna().sum()
    print(f"WARNING: {na_count} cells have missing household values (will be set to 0)")
if (hh_values < 0).any():
    neg_count = (hh_values < 0).sum()
    print(f"WARNING: {neg_count} cells have negative household values")

# Helper to get 1km ID from 100m ID
def get_1km_from_100m(cell_id):
    """Convert 100m cell ID to corresponding 1km cell ID."""
    match = re.match(r'CRS3035RES100mN(\d+)E(\d+)', str(cell_id))
    if match:
        n, e = int(match.group(1)), int(match.group(2))
        n_1km = (n // 1000) * 1000
        e_1km = (e // 1000) * 1000
        return f"CRS3035RES1000mN{n_1km}E{e_1km}"
    return None

# Standardize column names
def clean_col_name(name):
    return unidecode(name).replace(" ", "").replace(".", "").replace(",", "").replace("-", "_")

# Rename columns
census_100m.columns = [clean_col_name(c) for c in census_100m.columns]
household_column_clean = clean_col_name(household_column)

# Find the ID column after cleaning
id_col_100m_clean = census_100m.columns[0]

# Create geo_cross_walk (hierarchy: ZENSUS100m -> ZENSUS1km -> STAAT -> WELT)
print("\nCreating geo_cross_walk...")
geo_cross = pd.DataFrame()
geo_cross['ZENSUS100m'] = census_100m[id_col_100m_clean]
geo_cross['ZENSUS1km'] = geo_cross['ZENSUS100m'].apply(get_1km_from_100m)
geo_cross['STAAT'] = 1
geo_cross['WELT'] = 1

geo_cross.to_csv(f'{popsim_dir}/data/geo_cross_walk.csv', index=False)
print(f"  Created {popsim_dir}/data/geo_cross_walk.csv ({len(geo_cross)} rows)")

# Create control_totals for 100m (lowest level)
print("\nCreating control totals...")

# Geography names (hierarchy from lowest to highest)
geo_names = ['ZENSUS100m', 'ZENSUS1km', 'STAAT', 'WELT']

# Add geography columns
census_100m = census_100m.rename(columns={id_col_100m_clean: 'ZENSUS100m'})
census_100m['ZENSUS1km'] = census_100m['ZENSUS100m'].apply(get_1km_from_100m)
census_100m['STAAT'] = 1
census_100m['WELT'] = 1

# Get base column names (before suffixing) for controls template
# Only include numeric columns (exclude string columns like cell IDs)
base_cols = [c for c in census_100m.columns 
             if c not in geo_names and pd.api.types.is_numeric_dtype(census_100m[c])]

print(f"  Found {len(base_cols)} numeric columns for controls")

# Suffix non-geo columns with _ZENSUS100m
census_100m = census_100m.rename(columns={col: f"{col}_ZENSUS100m" for col in base_cols})

# The household column with suffix
household_column_suffixed = f"{household_column_clean}_ZENSUS100m"

census_100m = census_100m.fillna(0)
census_100m.to_csv(f'{popsim_dir}/data/control_totals_ZENSUS100m.csv', index=False)
print(f"  Created {popsim_dir}/data/control_totals_ZENSUS100m.csv")

# Create control_totals for 1km by aggregating 100m data
print("  Aggregating 100m -> 1km control totals...")
# Only aggregate numeric columns (those with _ZENSUS100m suffix)
numeric_cols = [f"{c}_ZENSUS100m" for c in base_cols]
agg_dict = {col: 'sum' for col in numeric_cols}
census_1km = census_100m.groupby('ZENSUS1km').agg(agg_dict).reset_index()

# Rename columns from _ZENSUS100m to _ZENSUS1km suffix
census_1km = census_1km.rename(
    columns={col: col.replace('_ZENSUS100m', '_ZENSUS1km')
             for col in census_1km.columns if col.endswith('_ZENSUS100m')}
)

census_1km['STAAT'] = 1
census_1km['WELT'] = 1
census_1km.to_csv(f'{popsim_dir}/data/control_totals_ZENSUS1km.csv', index=False)
print(f"  Created {popsim_dir}/data/control_totals_ZENSUS1km.csv ({len(census_1km)} 1km cells)")

# Create control_totals for STAAT
staat_df = pd.DataFrame({'STAAT': [1], 'WELT': [1]})
staat_df.to_csv(f'{popsim_dir}/data/control_totals_STAAT.csv', index=False)
print(f"  Created {popsim_dir}/data/control_totals_STAAT.csv")

# Create control_totals for WELT (top level)
welt_df = pd.DataFrame({'WELT': [1]})
welt_df.to_csv(f'{popsim_dir}/data/control_totals_WELT.csv', index=False)
print(f"  Created {popsim_dir}/data/control_totals_WELT.csv")

# Create controls template - both 100m and 1km use same base columns
print("\nCreating controls template...")
controls_rows = []

total_hh_control = None

# For each base column, create both 100m and 1km control entries
for base_col in base_cols:
    col_100m = f"{base_col}_ZENSUS100m"
    col_1km = f"{base_col}_ZENSUS1km"
    
    # 100m control entry
    row_100m = {
        'target': f"{col_100m}_target",
        'geography': 'ZENSUS100m',
        'seed_table': '',
        'importance': '',
        'control_field': col_100m,
        'expression': ''
    }
    
    # 1km control entry  
    row_1km = {
        'target': f"{col_1km}_target",
        'geography': 'ZENSUS1km',
        'seed_table': '',
        'importance': '',
        'control_field': col_1km,
        'expression': ''
    }
    
    # Identify household control
    if col_100m == household_column_suffixed:
        total_hh_control = f"{col_100m}_target"
        row_100m['seed_table'] = 'households'
        row_100m['importance'] = 1000
        row_100m['expression'] = '(households.H_GEW > 0) & (households.H_GEW < np.inf)'
        row_1km['seed_table'] = 'households'
        row_1km['importance'] = 1000
        row_1km['expression'] = '(households.H_GEW > 0) & (households.H_GEW < np.inf)'
        # Insert household controls at beginning
        controls_rows.insert(0, row_1km)
        controls_rows.insert(0, row_100m)
    else:
        controls_rows.append(row_100m)
        controls_rows.append(row_1km)

controls_df = pd.DataFrame(controls_rows)
controls_df.to_csv(f'{popsim_dir}/configs/_prep3_controls.csv', index=False, sep=intermediate_sep)
print(f"  Created {popsim_dir}/configs/_prep3_controls.csv ({len(controls_df)} controls, sep='{intermediate_sep}')")
print(f"  Note: 100m and 1km controls use same base column names (derived from 100m data)")

if total_hh_control is None:
    raise ValueError(f"Could not find household control column '{household_column_suffixed}'!")

print(f"  Household control: {total_hh_control}")

# Update settings.yaml
print("\nUpdating PopSim configuration...")
with open(f'{popsim_dir}/configs/settings.yaml', 'r') as f:
    settings = yaml.safe_load(f)

# Geographies from top to bottom: WELT -> STAAT -> ZENSUS1km -> ZENSUS100m
settings['geographies'] = ['WELT', 'STAAT', 'ZENSUS1km', 'ZENSUS100m']
settings['seed_geography'] = seed_geography
settings['total_hh_control'] = total_hh_control

# Update input tables
idx = next((i for i, t in enumerate(settings['input_table_list']) if t['tablename'] == 'geo_cross_walk'), None)
if idx is not None:
    settings['input_table_list'] = settings['input_table_list'][:idx + 1]

for geo in ['ZENSUS100m', 'ZENSUS1km', 'STAAT', 'WELT']:
    settings['input_table_list'].append({
        'tablename': f'{geo}_control_data',
        'filename': f'control_totals_{geo}.csv'
    })

# Update output tables
if output_everything:
    settings['output_tables'] = {'action': 'skip', 'tables': ['geo_cross_walk']}
else:
    settings['output_tables'] = {
        'action': 'include',
        'tables': ['expanded_household_ids', 
                   'summary_ZENSUS100m', 'summary_ZENSUS1km', 'summary_STAAT', 'summary_WELT',
                   f'summary_ZENSUS100m_{seed_geography}']
    }

# Update models - add sub_balancing for each geography level below seed
settings['models'] = [m for m in settings['models'] if 'sub_balancing' not in m]
idx = settings['models'].index('integerize_final_seed_weights')
# Add sub_balancing for intermediate geographies (ZENSUS1km) then lowest (ZENSUS100m)
settings['models'].insert(idx + 1, 'sub_balancing.geography=ZENSUS1km')
settings['models'].insert(idx + 2, 'sub_balancing.geography=ZENSUS100m')

with open(f'{popsim_dir}/configs/settings.yaml', 'w') as f:
    yaml.dump(settings, f, default_flow_style=False)
print(f"  Updated {popsim_dir}/configs/settings.yaml")

# Update verification.yaml
with open(f'{popsim_dir}/scripts/verification.yaml', 'r') as f:
    verify = yaml.safe_load(f)

verify['group_geographies'] = ['WELT', 'STAAT', 'ZENSUS1km', 'ZENSUS100m']
verify['seed_cols']['geog'] = seed_geography
verify['summaries'] = [
    'output/final_summary_ZENSUS100m.csv',
    'output/final_summary_ZENSUS1km.csv',
    'output/final_summary_STAAT.csv',
    'output/final_summary_WELT.csv',
    f'output/final_summary_ZENSUS100m_{seed_geography}.csv'
]

with open(f'{popsim_dir}/scripts/verification.yaml', 'w') as f:
    yaml.dump(verify, f, default_flow_style=False)
print(f"  Updated {popsim_dir}/scripts/verification.yaml")

# Summary
print(f"\n{'='*60}")
print("SUMMARY")
print(f"{'='*60}")
total_hh = census_100m[household_column_suffixed].sum()
print(f"  Geographic hierarchy: WELT -> STAAT -> ZENSUS1km -> ZENSUS100m")
print(f"  Geographic units: {len(census_100m):,} (100m), {len(census_1km):,} (1km, aggregated)")
print(f"  Total households: {total_hh:,.0f}")
print(f"  Household column: {household_column_suffixed}")
print(f"  Controls defined: {len(controls_df)} ({len(base_cols)} numeric columns × 2 geographies)")
print(f"  Intermediate separator: '{intermediate_sep}'")
print(f"\nNext: Edit {popsim_dir}/configs/_prep3_controls.csv to add expressions")
print("for the controls you want, then run Step 3.")
print("\n[Step 2/4] Complete.")

[Step 2/4] Generating geo crosswalk and control totals...
Loaded 425,907 100m cells

Creating geo_cross_walk...
  Created popsim/data/geo_cross_walk.csv (425907 rows)

Creating control totals...
  Found 560 numeric columns for controls
  Created popsim/data/control_totals_ZENSUS100m.csv
  Aggregating 100m -> 1km control totals...
  Created popsim/data/control_totals_ZENSUS1km.csv (30146 1km cells)
  Created popsim/data/control_totals_STAAT.csv
  Created popsim/data/control_totals_WELT.csv

Creating controls template...
  Created popsim/configs/_prep3_controls.csv (1120 controls, sep=';')
  Note: 100m and 1km controls use same base column names (derived from 100m data)
  Household control: Insgesamt_Haushalte_Groesse_des_privaten_Haushalts_100m_Gitter_adj_ZENSUS100m_target

Updating PopSim configuration...
  Updated popsim/configs/settings.yaml
  Updated popsim/scripts/verification.yaml

SUMMARY
  Geographic hierarchy: WELT -> STAAT -> ZENSUS1km -> ZENSUS100m
  Geographic units: 425,907

## Step 3: Process Controls, Integerize, and Create PopSim Folders

1. Edit `popsim/configs/_prep3_controls.csv` to add expressions for the controls you want (done ONCE)
2. Run this cell to:
   - Filter 100m census data by RegioStaR17 (if split mode)
   - **Aggregate filtered 100m → 1km** (ensures hierarchical consistency within each RegioStaR region)
   - **Smart integerize** 100m control totals (preserves 1km sums using largest remainder method)
   - If `regiostar_split=False`: Create single popsim folder with seed files
   - If `regiostar_split=True`: Create multiple `popsim_regiostar_{value}/` folders, each with census AND MiD filtered by that RegioStaR17 value
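
To illustrate the largest remainder method mentioned above, here is a small pure-Python sketch. It is not the notebook's implementation (which works on NumPy arrays per 1km group); the function name `largest_remainder` and the sample values are made up for the example. Each value is floored, and the units still missing from the target sum go to the cells with the largest fractional parts.

```python
import math

def largest_remainder(values, target_sum):
    """Round non-negative values to integers that add up to target_sum."""
    total = sum(values)
    if total == 0 or target_sum == 0:
        return [0] * len(values)
    # Rescale so the fractional values sum exactly to the target
    scaled = [v * target_sum / total for v in values]
    floors = [math.floor(s) for s in scaled]
    shortfall = target_sum - sum(floors)
    # Hand out the missing units to the largest fractional remainders first
    by_remainder = sorted(
        range(len(values)), key=lambda i: scaled[i] - floors[i], reverse=True
    )
    for i in by_remainder[:shortfall]:
        floors[i] += 1
    return floors

# Hypothetical 100m household counts inside one 1km cell (sum is exactly 6.0)
fractional = [2.75, 1.125, 0.625, 1.5]
target = round(sum(fractional))

naive = [round(v) for v in fractional]
integerized = largest_remainder(fractional, target)

print("naive rounding sum:   ", sum(naive))
print("largest remainder sum:", sum(integerized), integerized)
```

In this example, rounding each cell independently changes the 1km total, while the largest remainder result still sums to the target.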

**Note**: 1km control totals are derived by aggregating the filtered 100m data, not from the original 1km census. This ensures that `sum(100m cells) = 1km total` within each RegioStaR region, even when 100m cells with different RegioStaR values share a 1km parent cell.
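
The following toy example sketches why the 1km totals are recomputed per region. All cell IDs, region codes, and household counts are hypothetical, and `aggregate_1km` is an illustrative helper, not a function from this notebook. Two 100m cells with RegioStaR17=121 and one with RegioStaR17=125 share the 1km parent `K1`, so the original 1km total for `K1` matches neither region's 100m sum.

```python
from collections import defaultdict

# Hypothetical 100m rows: (100m cell, 1km parent, RegioStaR17, households)
rows = [
    ("A1", "K1", 121, 4),
    ("A2", "K1", 121, 6),
    ("A3", "K1", 125, 5),  # same 1km parent, different regional type
    ("B1", "K2", 125, 3),
]

# What an unfiltered 1km census file would report for each 1km cell
original_1km_total = {"K1": 15, "K2": 3}

def aggregate_1km(rows, region):
    """Sum households per 1km parent, using only 100m cells of one region."""
    totals = defaultdict(int)
    for _cell, parent, regiostar, households in rows:
        if regiostar == region:
            totals[parent] += households
    return dict(totals)

totals_121 = aggregate_1km(rows, 121)
totals_125 = aggregate_1km(rows, 125)

print("region 121:", totals_121)
print("region 125:", totals_125)
print("original K1 total:", original_1km_total["K1"])
```

Using the per-region sums as 1km controls keeps each `popsim_regiostar_{value}/` folder internally consistent: the 100m controls under a 1km cell always add up to that cell's 1km control.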

In [29]:
import pandas as pd
import numpy as np
import os
import re
import shutil
import yaml
from unidecode import unidecode

print("[Step 3/4] Processing controls, integerizing, and creating PopSim folders...")
print("=" * 60)

# Normalize filter lists
kernwo_list = kernwo if isinstance(kernwo, list) else ([kernwo] if kernwo else None)
regiostar17_list = regiostar17 if isinstance(regiostar17, list) else ([regiostar17] if regiostar17 else None)

# Load controls (intermediate file - use configured separator)
print(f"Loading controls template (separator: '{intermediate_sep}')...")
controls_df_full = pd.read_csv(f'{popsim_dir}/configs/_prep3_controls.csv', sep=intermediate_sep)
print(f"  Loaded {len(controls_df_full)} total controls from _prep3_controls.csv")

# Filter to controls that have expressions (these are the ones actually used)
controls_df = controls_df_full[controls_df_full['expression'].notna() & (controls_df_full['expression'] != '')].copy()
print(f"  {len(controls_df)} controls have expressions (will be used)")

# Extract control_field values - these are the census columns we actually need
needed_control_fields = set(controls_df['control_field'].tolist())
needed_100m_cols = {c for c in needed_control_fields if c.endswith('_ZENSUS100m')}
needed_1km_cols = {c for c in needed_control_fields if c.endswith('_ZENSUS1km')}
print(f"  100m columns needed: {len(needed_100m_cols)}")
print(f"  1km columns needed: {len(needed_1km_cols)}")

# =============================================================================
# SMART INTEGERIZATION FUNCTIONS
# =============================================================================

def get_1km_parent(id_100m: str) -> str:
    """Convert 100m cell ID to its parent 1km cell ID."""
    match = re.match(r'CRS3035RES100mN(\d+)E(\d+)', str(id_100m))
    if match:
        n, e = int(match.group(1)), int(match.group(2))
        n_1km = (n // 1000) * 1000
        e_1km = (e // 1000) * 1000
        return f'CRS3035RES1000mN{n_1km}E{e_1km}'
    return None

def largest_remainder_round(values: np.ndarray, target_sum: int) -> np.ndarray:
    """Distribute integer values using largest remainder method (Hamilton apportionment)."""
    if target_sum == 0 or len(values) == 0:
        return np.zeros(len(values), dtype=int)
    
    total = values.sum()
    if total == 0:
        return np.zeros(len(values), dtype=int)
    
    # Scale values to sum to target
    scaled = values * (target_sum / total)
    
    # Floor all values
    floored = np.floor(scaled).astype(int)
    
    # Calculate remainders and distribute deficit
    remainders = scaled - floored
    deficit = target_sum - floored.sum()
    
    if deficit > 0:
        indices = np.argsort(-remainders)[:deficit]
        floored[indices] += 1
    elif deficit < 0:
        indices = np.argsort(remainders)[:-deficit]
        floored[indices] -= 1
    
    return floored

def smart_integerize_column(df: pd.DataFrame, col: str, group_col: str = 'ZENSUS1km') -> pd.Series:
    """Integerize a single column, preserving 1km sums."""
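    # Initialize with zeros so the result is a true integer Series
    # (an empty Series with dtype=int is silently created as float NaN)
    result = pd.Series(0, index=df.index, dtype=int)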
    
    for group_id, group_df in df.groupby(group_col):
        values = group_df[col].values.astype(float)
        target = int(round(values.sum()))
        if target < 0:
            target = 0
        int_values = largest_remainder_round(values, target)
        result.loc[group_df.index] = int_values
    
    return result

def smart_integerize_census(df_100m: pd.DataFrame, cols_to_integerize: set, id_col: str) -> pd.DataFrame:
    """Smart integerize specified columns in 100m census data."""
    df = df_100m.copy()
    
    # Add 1km parent mapping
    df['_ZENSUS1km'] = df[id_col].apply(get_1km_parent)
    
    # Find which columns to integerize (intersection of needed and available)
    available_cols = set(df.columns)
    cols_to_process = cols_to_integerize.intersection(available_cols)
    
    if not cols_to_process:
        print("    No columns to integerize")
        df = df.drop(columns=['_ZENSUS1km'])
        return df
    
    print(f"    Integerizing {len(cols_to_process)} columns...")
    
    for i, col in enumerate(cols_to_process):
        if (i + 1) % 10 == 0 or i == 0:
            print(f"      Processing {i+1}/{len(cols_to_process)}: {col[:50]}...")
        df[col] = smart_integerize_column(df, col, '_ZENSUS1km')
    
    df = df.drop(columns=['_ZENSUS1km'])
    return df

# =============================================================================
# LOAD SEED DATA
# =============================================================================

print(f"\nLoading MiD seed data...")
print(f"  Households: {mid_households_path}")
print(f"  Persons: {mid_persons_path}")
seed_households_full = pd.read_csv(mid_households_path, sep=',')
seed_persons_full = pd.read_csv(mid_persons_path, sep=',')
print(f"  Loaded {len(seed_persons_full):,} persons, {len(seed_households_full):,} households")

# Track person counts BEFORE kernwo filter (needed for complete household check)
persons_per_hh_before_kernwo = seed_persons_full.groupby('H_ID').size()

# Apply kernwo filter globally (only if not None) - this is done once
if kernwo_list:
    print(f"\nApplying kernwo filter:")
    persons_before = len(seed_persons_full)
    if 'kernwo' in seed_persons_full.columns:
        seed_persons_full = seed_persons_full[seed_persons_full['kernwo'].isin(kernwo_list)]
        print(f"  kernwo {kernwo_list}: {persons_before:,} -> {len(seed_persons_full):,} persons")
        
        # Identify complete households (same person count before and after kernwo filter)
        persons_per_hh_after_kernwo = seed_persons_full.groupby('H_ID').size()
        common_hh = persons_per_hh_before_kernwo.index.intersection(persons_per_hh_after_kernwo.index)
        complete_households = set(
            common_hh[persons_per_hh_before_kernwo.loc[common_hh] == persons_per_hh_after_kernwo.loc[common_hh]]
        )
        incomplete_count = len(common_hh) - len(complete_households)
        lost_all_count = len(persons_per_hh_before_kernwo) - len(common_hh)
        print(f"  Complete households (no persons lost): {len(complete_households):,}")
        print(f"  Households that lost some persons: {incomplete_count:,}")
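        print(f"  Households that lost all persons: {lost_all_count:,}")
    else:
        # Column missing: nothing was filtered, so treat all households as complete
        print("  WARNING: 'kernwo' column not found in persons data; filter skipped")
        complete_households = set(seed_households_full['H_ID'].unique())
else:
    # No kernwo filter - all households are complete
    complete_households = set(seed_households_full['H_ID'].unique())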

# Essential columns
essential_cols = {'H_ID', 'H_GEW', 'HP_ID', 'P_ID', 'P_GEW'}
needed_cols = essential_cols.copy()

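# Extract columns from expressions: every identifier that follows a dot.
# This also catches non-column attributes (e.g. 'inf'); those are harmless because
# needed_cols is later intersected with the actual seed columns.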
pattern = r'\.(?P<col>[A-Za-z_][A-Za-z0-9_]*)'
for expr in controls_df['expression'].dropna():
    for match in re.finditer(pattern, str(expr)):
        needed_cols.add(match.group('col'))

print(f"\nColumns needed from expressions: {needed_cols - essential_cols}")

# Standardize column names helper
def clean_col_name(name):
    return unidecode(name).replace(" ", "").replace(".", "").replace(",", "").replace("-", "_")

# Compute the suffixed household column name (same logic as Step 2)
household_column_clean = clean_col_name(household_column)
household_column_suffixed = f"{household_column_clean}_ZENSUS100m"

# Helper function to process and save seed data for a specific folder
def create_popsim_folder(output_dir, census_100m_filtered, 
                         seed_persons_filtered, seed_households_filtered,
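                         apply_integerization=True):
    """Create a complete popsim folder with all required files.
    
    1km control totals are derived by aggregating the 100m data before integerization.
    The 100m values are then integerized per 1km cell against the rounded 1km sum,
    so the hierarchy stays consistent: sum(integerized 100m) = rounded 1km total.
    
    Control totals files are filtered to only include needed columns (from controls.csv).
    
    Relies on notebook-level variables defined earlier: controls_df, needed_cols,
    needed_100m_cols, needed_1km_cols, complete_households, popsim_dir,
    household_column_suffixed.
    """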
    
    # Create directory structure
    os.makedirs(f"{output_dir}/data", exist_ok=True)
    os.makedirs(f"{output_dir}/configs", exist_ok=True)
    os.makedirs(f"{output_dir}/scripts", exist_ok=True)
    os.makedirs(f"{output_dir}/output", exist_ok=True)
    
    # Helper to get 1km ID from 100m ID
    def get_1km_from_100m(cell_id):
        match = re.match(r'CRS3035RES100mN(\d+)E(\d+)', str(cell_id))
        if match:
            n, e = int(match.group(1)), int(match.group(2))
            n_1km = (n // 1000) * 1000
            e_1km = (e // 1000) * 1000
            return f"CRS3035RES1000mN{n_1km}E{e_1km}"
        return None
    
    # Process census 100m
    census_100m_proc = census_100m_filtered.copy()
    census_100m_proc.columns = [clean_col_name(c) for c in census_100m_proc.columns]
    
    id_col_100m_clean = census_100m_proc.columns[0]
    
    # Geography names
    geo_names = ['ZENSUS100m', 'ZENSUS1km', 'STAAT', 'WELT']
    
    # Create geo_cross_walk
    geo_cross = pd.DataFrame()
    geo_cross['ZENSUS100m'] = census_100m_proc[id_col_100m_clean]
    geo_cross['ZENSUS1km'] = geo_cross['ZENSUS100m'].apply(get_1km_from_100m)
    geo_cross['STAAT'] = 1
    geo_cross['WELT'] = 1
    geo_cross.to_csv(f'{output_dir}/data/geo_cross_walk.csv', index=False)
    
    # Add geography columns
    census_100m_proc = census_100m_proc.rename(columns={id_col_100m_clean: 'ZENSUS100m'})
    census_100m_proc['ZENSUS1km'] = census_100m_proc['ZENSUS100m'].apply(get_1km_from_100m)
    census_100m_proc['STAAT'] = 1
    census_100m_proc['WELT'] = 1
    
    # Get numeric columns only (exclude string columns like cell IDs, RegioStaR17)
    numeric_base_cols = [c for c in census_100m_proc.columns 
                         if c not in geo_names and pd.api.types.is_numeric_dtype(census_100m_proc[c])]
    
    # Suffix numeric columns with _ZENSUS100m (single rename instead of one per column)
    census_100m_proc = census_100m_proc.rename(
        columns={col: f"{col}_ZENSUS100m" for col in numeric_base_cols}
    )
    
    census_100m_proc = census_100m_proc.fillna(0)
    
    # Aggregate 100m -> 1km BEFORE integerization
    # This ensures hierarchical consistency: sum(integerized_100m) = round(aggregated_1km)
    print(f"  Aggregating 100m -> 1km control totals...")
    # Only aggregate numeric columns (those with _ZENSUS100m suffix)
    numeric_cols = [f"{c}_ZENSUS100m" for c in numeric_base_cols]
    agg_dict = {col: 'sum' for col in numeric_cols}
    census_1km_proc = census_100m_proc.groupby('ZENSUS1km').agg(agg_dict).reset_index()
    
    # Rename columns from _ZENSUS100m to _ZENSUS1km suffix
    # (swap only the trailing suffix, in a single rename)
    suffix_100m = '_ZENSUS100m'
    census_1km_proc = census_1km_proc.rename(columns={
        col: col[:-len(suffix_100m)] + '_ZENSUS1km'
        for col in census_1km_proc.columns if col.endswith(suffix_100m)
    })
    
    census_1km_proc['STAAT'] = 1
    census_1km_proc['WELT'] = 1
    
    # Smart integerize 100m control columns (only the ones needed)
    # Uses the aggregated 1km sums as targets (via groupby sum -> round)
    if apply_integerization:
        print(f"  Smart integerizing 100m control totals...")
        census_100m_proc = smart_integerize_census(census_100m_proc, needed_100m_cols, 'ZENSUS100m')
        
        # Integerize 1km columns (simple rounding of aggregated sums)
        for col in needed_1km_cols:
            if col in census_1km_proc.columns:
                census_1km_proc[col] = census_1km_proc[col].round().astype(int)
    
    # Filter to only needed columns (geography + control fields)
    # This dramatically reduces file size by excluding unused census columns
    cols_100m_to_keep = ['ZENSUS100m'] + sorted(needed_100m_cols.intersection(set(census_100m_proc.columns)))
    cols_1km_to_keep = ['ZENSUS1km'] + sorted(needed_1km_cols.intersection(set(census_1km_proc.columns)))
    
    census_100m_out = census_100m_proc[cols_100m_to_keep]
    census_1km_out = census_1km_proc[cols_1km_to_keep]
    
    census_100m_out.to_csv(f'{output_dir}/data/control_totals_ZENSUS100m.csv', index=False)
    census_1km_out.to_csv(f'{output_dir}/data/control_totals_ZENSUS1km.csv', index=False)
    
    print(f"  Created control_totals: {len(census_100m_out)} 100m cells ({len(cols_100m_to_keep)} cols), {len(census_1km_out)} 1km cells ({len(cols_1km_to_keep)} cols)")
    
    # Create STAAT and WELT control totals
    staat_df = pd.DataFrame({'STAAT': [1], 'WELT': [1]})
    staat_df.to_csv(f'{output_dir}/data/control_totals_STAAT.csv', index=False)
    
    welt_df = pd.DataFrame({'WELT': [1]})
    welt_df.to_csv(f'{output_dir}/data/control_totals_WELT.csv', index=False)
    
    # Process seed data - keep only COMPLETE households
    # (households that didn't lose any persons due to kernwo filter)
    print(f"  Filtering to complete households only...")
    
    hh_ids_original = set(seed_households_filtered['H_ID'].unique())
    
    # Keep only complete households (intersection with complete_households from kernwo analysis)
    seed_households_complete = seed_households_filtered[
        seed_households_filtered['H_ID'].isin(complete_households)
    ].copy()
    
    # Keep only persons from complete households
    seed_persons_complete = seed_persons_filtered[
        seed_persons_filtered['H_ID'].isin(seed_households_complete['H_ID'])
    ].copy()
    
    # Report filtering
    hh_removed = len(hh_ids_original) - len(seed_households_complete)
    if hh_removed > 0:
        print(f"    Removed {hh_removed} incomplete/empty households ({len(seed_households_complete)} remain)")
    
    # Filter to needed columns
    p_cols = list(needed_cols.intersection(seed_persons_complete.columns))
    h_cols = list(needed_cols.intersection(seed_households_complete.columns))
    
    seed_persons_out = seed_persons_complete[p_cols].copy()
    seed_households_out = seed_households_complete[h_cols].copy()
    
    # Add STAAT geography
    seed_persons_out['STAAT'] = 1
    seed_households_out['STAAT'] = 1
    
    # Save seed files (comma-separated for PopSim)
    seed_persons_out.to_csv(f'{output_dir}/data/seed_persons.csv', index=False)
    seed_households_out.to_csv(f'{output_dir}/data/seed_households.csv', index=False)
    
    # Copy controls.csv (only rows with expressions)
    controls_df.to_csv(f'{output_dir}/configs/controls.csv', index=False)
    
    # Copy settings.yaml, optional configs, scripts, and the run script from the
    # base popsim folder. Skipped in single mode, where output_dir is popsim_dir:
    # shutil.copy raises SameFileError when source and destination are the same file.
    if os.path.abspath(output_dir) != os.path.abspath(popsim_dir):
        shutil.copy(f'{popsim_dir}/configs/settings.yaml', f'{output_dir}/configs/settings.yaml')
        
        # Optional files: copied only if they exist in the base folder
        optional_files = [
            'configs/logging.yaml',
            'scripts/verification.yaml',
            'run_populationsim.py',
        ]
        for rel_path in optional_files:
            src = f'{popsim_dir}/{rel_path}'
            if os.path.exists(src):
                shutil.copy(src, f'{output_dir}/{rel_path}')
    
    return {
        'cells_100m': len(census_100m_out),
        'cells_1km': len(census_1km_out),
        'households': census_100m_proc[household_column_suffixed].sum(),
        'seed_persons': len(seed_persons_out),
        'seed_households': len(seed_households_out)
    }

# =============================================================================
# MAIN LOGIC: Split by RegioStar or single folder
# =============================================================================

if regiostar_split:
    print(f"\n{'='*60}")
    print("REGIOSTAR SPLIT MODE")
    print(f"{'='*60}")
    
    # Load the 100m census data (already spatially filtered in Step 1)
    # Note: 1km data is now derived by aggregating filtered 100m, not from original file
    census_100m_base = pd.read_parquet(f'{popsim_dir}/data/_census_100m_filtered.parquet')
    
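    # Get RegioStaR17 column name from 100m census (may vary in case)
    regiostar_matches = [c for c in census_100m_base.columns if c.lower() == 'regiostar17']
    if not regiostar_matches:
        raise ValueError("Census 100m data does not have a 'RegioStaR17' column!")
    regiostar_col_100m = regiostar_matches[0]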
    
    # Get unique RegioStaR17 values from census (these are the ones in the study area)
    unique_regiostar = sorted(census_100m_base[regiostar_col_100m].dropna().unique())
    print(f"\nFound {len(unique_regiostar)} unique RegioStaR17 values in study area: {unique_regiostar}")
    
    # Check MiD has RegioStaR17 column
    if 'RegioStaR17' not in seed_persons_full.columns:
        raise ValueError("MiD persons data does not have 'RegioStaR17' column!")
    if 'RegioStaR17' not in seed_households_full.columns:
        raise ValueError("MiD households data does not have 'RegioStaR17' column!")
    
    created_folders = []
    
    for rs_value in unique_regiostar:
        rs_int = int(rs_value)  # Convert from float if needed
        folder_name = f"popsim_regiostar_{rs_int}"
        print(f"\n--- Creating {folder_name} ---")
        
        # Filter 100m census by RegioStaR17
        # (1km will be derived by aggregating this filtered 100m data)
        census_100m_rs = census_100m_base[census_100m_base[regiostar_col_100m] == rs_value].copy()
        
        # Filter MiD by RegioStaR17
        seed_persons_rs = seed_persons_full[seed_persons_full['RegioStaR17'] == rs_int].copy()
        seed_households_rs = seed_households_full[seed_households_full['RegioStaR17'] == rs_int].copy()
        
        print(f"  Census: {len(census_100m_rs)} 100m cells (1km will be aggregated)")
        print(f"  MiD: {len(seed_persons_rs):,} persons, {len(seed_households_rs):,} households")
        
        if len(census_100m_rs) == 0:
            print(f"  WARNING: No census cells for RegioStaR17={rs_int}, skipping!")
            continue
        
        if len(seed_persons_rs) == 0 or len(seed_households_rs) == 0:
            print(f"  WARNING: No MiD data for RegioStaR17={rs_int}, skipping!")
            continue
        
        # Create the folder (1km derived from filtered 100m, with smart integerization)
        stats = create_popsim_folder(
            folder_name, 
            census_100m_rs, 
            seed_persons_rs, 
            seed_households_rs,
            apply_integerization=True
        )
        
        created_folders.append({
            'folder': folder_name,
            'regiostar17': rs_int,
            **stats
        })
        print(f"  Created {folder_name}/ with {stats['cells_100m']} 100m cells, {stats['cells_1km']} 1km cells, {stats['households']:.0f} target HH")
    
    # Summary
    print(f"\n{'='*60}")
    print("SUMMARY - RegioStar Split Mode")
    print(f"{'='*60}")
    print(f"\nCreated {len(created_folders)} popsim folders:")
    for info in created_folders:
        print(f"  {info['folder']}: RS17={info['regiostar17']}, "
              f"{info['cells_100m']} 100m, {info['cells_1km']} 1km, {info['households']:.0f} HH, "
              f"{info['seed_persons']:,} seed persons")
    print(f"\nNote: 1km data derived from filtered 100m (hierarchically consistent)")
    print(f"Note: Only complete households (no persons lost to kernwo filter) are included")
    print(f"Note: Control totals filtered to only needed columns ({len(needed_100m_cols)} 100m, {len(needed_1km_cols)} 1km)")

else:
    # Original single-folder behavior
    print(f"\n{'='*60}")
    print("SINGLE FOLDER MODE")
    print(f"{'='*60}")
    
    # Apply regiostar17 filter if specified (only in single mode)
    seed_persons = seed_persons_full.copy()
    seed_households = seed_households_full.copy()
    
    if regiostar17_list:
        print(f"\nApplying regiostar17 filter:")
        persons_before = len(seed_persons)
        households_before = len(seed_households)
        if 'RegioStaR17' in seed_persons.columns:
            seed_persons = seed_persons[seed_persons['RegioStaR17'].isin(regiostar17_list)]
        if 'RegioStaR17' in seed_households.columns:
            seed_households = seed_households[seed_households['RegioStaR17'].isin(regiostar17_list)]
        print(f"  regiostar17 {regiostar17_list}: {len(seed_persons):,} persons, {len(seed_households):,} households")
    
    print(f"\nFinal counts:")
    print(f"  Persons: {len(seed_persons):,}")
    print(f"  Households: {len(seed_households):,}")
    
    # Load 100m census data (1km will be derived by aggregation)
    census_100m = pd.read_parquet(f'{popsim_dir}/data/_census_100m_filtered.parquet')
    
    # Create the folder (1km derived from 100m, with smart integerization)
    stats = create_popsim_folder(
        popsim_dir, 
        census_100m, 
        seed_persons, 
        seed_households,
        apply_integerization=True
    )
    
    print(f"\nCreated (all comma-separated for PopSim):")
    print(f"  {popsim_dir}/data/seed_persons.csv ({stats['seed_persons']} rows)")
    print(f"  {popsim_dir}/data/seed_households.csv ({stats['seed_households']} rows)")
    print(f"  {popsim_dir}/configs/controls.csv ({len(controls_df)} controls)")
    print(f"  Note: 1km data derived from 100m (hierarchically consistent)")
    print(f"  Note: Only complete households included")
    print(f"  Note: Control totals filtered to only needed columns")

print("\n[Step 3/4] Complete.")

[Step 3/4] Processing controls, integerizing, and creating PopSim folders...
Loading controls template (separator: ';')...
  Loaded 44 total controls from _prep3_controls.csv
  44 controls have expressions (will be used)
  100m columns needed: 22
  1km columns needed: 22

Loading MiD seed data...
  Households: inputs/MiD2023_Haushalte.csv
  Persons: inputs/MiD2023_Personen.csv
  Loaded 420,979 persons, 218,101 households

Applying kernwo filter:
  kernwo [1, 2, 3]: 420,979 -> 299,889 persons
  Complete households (no persons lost): 155,525
  Households that lost some persons: 1
  Households that lost all persons: 62,572

Columns needed from expressions: {'HP_ALTER', 'inf', 'HP_SEX'}

REGIOSTAR SPLIT MODE

Found 17 unique RegioStaR17 values in study area: [111.0, 112.0, 113.0, 114.0, 115.0, 121.0, 123.0, 124.0, 125.0, 211.0, 213.0, 214.0, 215.0, 221.0, 223.0, 224.0, 225.0]

--- Creating popsim_regiostar_111 ---
  Census: 7494 100m cells (1km will be aggregated)
  MiD: 48,456 persons, 37

## Step 4: Validate and Run

Validates the setup and provides instructions for running PopSim.
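
The checks in this step confirm that the required files exist and that every control has an expression. As a complementary check, the minimal sketch below compares the 100m control totals, summed per 1km parent, against the 1km control totals. It assumes the file layout produced in Step 3 (a `ZENSUS100m` key in the 100m totals, a `ZENSUS1km` key in the 1km totals, and both keys in the crosswalk). The function name `check_hierarchical_consistency`, its `col_base` argument, and the toy column `HH` are hypothetical and not part of the notebook.

```python
import pandas as pd

def check_hierarchical_consistency(totals_100m, totals_1km, crosswalk, col_base):
    """Return the 1km cells whose total differs from the sum of their 100m cells.

    col_base is the control column name without its geography suffix
    (hypothetical argument, introduced for this sketch only).
    """
    col_100m = f"{col_base}_ZENSUS100m"
    col_1km = f"{col_base}_ZENSUS1km"

    # Attach each 100m cell's 1km parent, then sum per parent
    merged = totals_100m.merge(crosswalk[["ZENSUS100m", "ZENSUS1km"]], on="ZENSUS100m")
    summed = merged.groupby("ZENSUS1km")[col_100m].sum()

    # Difference per 1km cell; cells missing on one side count as 0
    expected = totals_1km.set_index("ZENSUS1km")[col_1km]
    diff = summed.sub(expected, fill_value=0)
    return diff[diff != 0]

# Toy data: cell "A" is consistent, cell "B" is off by one
crosswalk = pd.DataFrame({"ZENSUS100m": ["a1", "a2", "b1"],
                          "ZENSUS1km": ["A", "A", "B"]})
totals_100m = pd.DataFrame({"ZENSUS100m": ["a1", "a2", "b1"],
                            "HH_ZENSUS100m": [2, 3, 4]})
totals_1km = pd.DataFrame({"ZENSUS1km": ["A", "B"],
                           "HH_ZENSUS1km": [5, 5]})

mismatches = check_hierarchical_consistency(totals_100m, totals_1km, crosswalk, "HH")
print(mismatches)  # only cell "B" is reported (100m sum 4 vs 1km total 5)
```

On real output, the three frames would be read from `data/control_totals_ZENSUS100m.csv`, `data/control_totals_ZENSUS1km.csv`, and `data/geo_cross_walk.csv` in the folder being validated. An empty result means the column is consistent across both levels.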

In [30]:
import os
import glob
import json
import yaml
import pandas as pd

print("[Step 4/4] Validating setup...")
print("=" * 60)

def validate_popsim_folder(folder_path, folder_name):
    """Validate a single popsim folder and return errors list."""
    errors = []
    
    required_files = [
        'data/geo_cross_walk.csv',
        'data/seed_persons.csv',
        'data/seed_households.csv',
        'data/control_totals_ZENSUS100m.csv',
        'data/control_totals_ZENSUS1km.csv',
        'data/control_totals_STAAT.csv',
        'data/control_totals_WELT.csv',
        'configs/settings.yaml',
        'configs/controls.csv',
    ]
    
    print(f"\nChecking {folder_name}...")
    for f in required_files:
        full_path = f"{folder_path}/{f}"
        if os.path.exists(full_path):
            size = os.path.getsize(full_path)
            print(f"  [OK] {f} ({size:,} bytes)")
        else:
            print(f"  [MISSING] {f}")
            errors.append(f"Missing: {f}")
    
    # Check controls
    controls_path = f"{folder_path}/configs/controls.csv"
    if os.path.exists(controls_path):
        try:
            controls = pd.read_csv(controls_path)
            empty = controls['expression'].isna().sum()
            if empty > 0:
                errors.append(f"{empty} controls missing expressions")
            else:
                print(f"  {len(controls)} controls, all have expressions")
        except Exception as e:
            errors.append(f"Error reading controls: {e}")
    
    # Check settings
    settings_path = f"{folder_path}/configs/settings.yaml"
    if os.path.exists(settings_path):
        try:
            with open(settings_path) as f:
                settings = yaml.safe_load(f)
            print(f"  Geographies: {settings.get('geographies')}")
            print(f"  Total HH control: {settings.get('total_hh_control')}")
        except Exception as e:
            errors.append(f"Error reading settings: {e}")
    
    return errors

# =============================================================================
# VALIDATION
# =============================================================================

all_errors = {}

if regiostar_split:
    # Find all popsim_regiostar_* folders
    regiostar_folders = sorted(glob.glob("popsim_regiostar_*"))
    
    if not regiostar_folders:
        print("\nWARNING: No popsim_regiostar_* folders found!")
        print("Run Step 3 first to create them.")
    else:
        print(f"\nFound {len(regiostar_folders)} RegioStar folders to validate")
        
        for folder in regiostar_folders:
            folder_errors = validate_popsim_folder(folder, folder)
            if folder_errors:
                all_errors[folder] = folder_errors
        
        # Summary
        print(f"\n{'='*60}")
        if all_errors:
            print("VALIDATION FAILED")
            for folder, errors in all_errors.items():
                print(f"\n  {folder}:")
                for e in errors:
                    print(f"    - {e}")
        else:
            print("VALIDATION PASSED")
            print(f"\n{len(regiostar_folders)} folders ready to run PopSim.")
            print("\nTo run all folders:")
            print("uv run batch_run_popsim.py")

else:
    # Single folder validation
    folder_errors = validate_popsim_folder(popsim_dir, popsim_dir)
    if folder_errors:
        all_errors[popsim_dir] = folder_errors
    
    # Summary
    print(f"\n{'='*60}")
    if all_errors:
        print("VALIDATION FAILED")
        for e in all_errors[popsim_dir]:
            print(f"  - {e}")
    else:
        print("VALIDATION PASSED")
        print(f"\nReady to run PopSim:")
        print(f"uv run populationsim -w {popsim_dir}")

print(f"{'='*60}")
print("\n[Step 4/4] Complete.")

[Step 4/4] Validating setup...

Found 17 RegioStar folders to validate

Checking popsim_regiostar_111...
  [OK] data/geo_cross_walk.csv (502,130 bytes)
  [OK] data/seed_persons.csv (2,214,125 bytes)
  [OK] data/seed_households.csv (794,681 bytes)
  [OK] data/control_totals_ZENSUS100m.csv (933,320 bytes)
  [OK] data/control_totals_ZENSUS1km.csv (27,177 bytes)
  [OK] data/control_totals_STAAT.csv (15 bytes)
  [OK] data/control_totals_WELT.csv (7 bytes)
  [OK] configs/settings.yaml (1,684 bytes)
  [OK] configs/controls.csv (6,426 bytes)
  44 controls, all have expressions
  Geographies: ['WELT', 'STAAT', 'ZENSUS1km', 'ZENSUS100m']
  Total HH control: Insgesamt_Haushalte_Groesse_des_privaten_Haushalts_100m_Gitter_adj_ZENSUS100m_target

Checking popsim_regiostar_112...
  [OK] data/geo_cross_walk.csv (142,407 bytes)
  [OK] data/seed_persons.csv (1,234,125 bytes)
  [OK] data/seed_households.csv (431,412 bytes)
  [OK] data/control_totals_ZENSUS100m.csv (261,648 bytes)
  [OK] data/control_total

## Utilities: Reset

Clean up generated files to start fresh.

In [None]:
import os
import glob
import shutil

def reset(confirm=False, include_regiostar_folders=False):
    """Delete all generated files.
    
    Args:
        confirm: Set to True to actually delete files
        include_regiostar_folders: Set to True to also delete popsim_regiostar_* folders
    """
    files = [
        f'{popsim_dir}/data/geo_cross_walk.csv',
        f'{popsim_dir}/data/seed_persons.csv',
        f'{popsim_dir}/data/seed_households.csv',
        f'{popsim_dir}/data/control_totals_ZENSUS100m.csv',
        f'{popsim_dir}/data/control_totals_ZENSUS1km.csv',
        f'{popsim_dir}/data/control_totals_STAAT.csv',
        f'{popsim_dir}/data/control_totals_WELT.csv',
        f'{popsim_dir}/data/_census_100m_filtered.parquet',
        f'{popsim_dir}/data/_census_1km_filtered.parquet',
        f'{popsim_dir}/configs/controls.csv',
        f'{popsim_dir}/configs/_prep3_controls.csv',
    ]
    
    existing_files = [f for f in files if os.path.exists(f)]
    
    # Find RegioStar folders (directories only, so shutil.rmtree never hits a stray file)
    regiostar_folders = sorted(
        p for p in glob.glob("popsim_regiostar_*") if os.path.isdir(p)
    ) if include_regiostar_folders else []
    
    if not existing_files and not regiostar_folders:
        print("No files to delete.")
        return
    
    if existing_files:
        print("Files to delete:")
        for f in existing_files:
            print(f"  {f}")
    
    if regiostar_folders:
        print(f"\nRegioStar folders to delete ({len(regiostar_folders)}):")
        for f in regiostar_folders:
            print(f"  {f}/")
    
    if not confirm:
        cmd = "reset(confirm=True"
        if regiostar_folders:
            cmd += ", include_regiostar_folders=True"
        cmd += ")"
        print(f"\nRun {cmd} to delete.")
        return
    
    for f in existing_files:
        os.remove(f)
        print(f"Deleted: {f}")
    
    for folder in regiostar_folders:
        shutil.rmtree(folder)
        print(f"Deleted: {folder}/")
    
    print("\nReset complete.")

# Show what would be deleted
reset(confirm=False, include_regiostar_folders=True)

## Step 5: Assign Households to Buildings

Merges the PopSim results with the buildings GeoPackage:
- Loads household IDs from `popsim_combined/final_expanded_household_ids_combined.csv`
- Loads buildings from `inputs/buildings_with_households.gpkg`
- Distributes households evenly among buildings with `has_home=True` in each 100m cell
- Falls back to any buildings in the cell if no `has_home=True` buildings exist
- Saves result to `buildings_with_assigned_households.gpkg`
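
The distribution rule in the list above can be sketched in isolation, without GeoPandas. This is a simplified illustration under stated assumptions, not the notebook's implementation: buildings are plain `(building_id, has_home)` tuples, and the helper names `pick_target_buildings` and `distribute_round_robin` are hypothetical.

```python
def pick_target_buildings(buildings):
    """Prefer buildings with has_home=True; fall back to all buildings in the cell."""
    homes = [b_id for b_id, has_home in buildings if has_home]
    return homes if homes else [b_id for b_id, _ in buildings]

def distribute_round_robin(hh_ids, building_ids):
    """Assign household i to building i % n, so counts differ by at most one."""
    if not building_ids:
        return {}
    assignment = {b_id: [] for b_id in building_ids}
    for i, hh_id in enumerate(hh_ids):
        assignment[building_ids[i % len(building_ids)]].append(hh_id)
    return assignment

# One cell with three buildings, two of which are residential
cell_buildings = [("b1", True), ("b2", True), ("b3", False)]
targets = pick_target_buildings(cell_buildings)  # ['b1', 'b2']

example = distribute_round_robin([101, 102, 103, 104, 105], targets)
print(example)  # {'b1': [101, 103, 105], 'b2': [102, 104]}
```

Round-robin keeps the household counts per building within one of each other, but it ignores building size or capacity. Weighting by floor area or dwelling count would need additional building attributes.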

In [3]:
import pandas as pd
import geopandas as gpd
import numpy as np

print("[Step 5] Assigning households to buildings...")
print("=" * 60)

# =============================================================================
# CONFIGURATION
# =============================================================================
combined_results_path = "popsim_combined/final_expanded_household_ids_combined.csv"
buildings_gpkg_path = f"{inputs_dir}/buildings_with_households.gpkg"
output_gpkg_path = "buildings_with_assigned_households.gpkg"

# =============================================================================
# LOAD DATA
# =============================================================================
print(f"\nLoading combined PopSim results: {combined_results_path}")
combined_df = pd.read_csv(combined_results_path)
print(f"  Loaded {len(combined_df):,} household records")
print(f"  Unique cells: {combined_df['ZENSUS100m'].nunique():,}")
print(f"  Unique HH IDs: {combined_df['H_ID'].nunique():,}")

# Group households by cell
hh_by_cell = combined_df.groupby('ZENSUS100m')['H_ID'].apply(list).to_dict()
popsim_cells = set(hh_by_cell.keys())
print(f"  Cells with households: {len(popsim_cells):,}")

print(f"\nLoading buildings GeoPackage: {buildings_gpkg_path}")
buildings_gdf = gpd.read_file(buildings_gpkg_path)
print(f"  Loaded {len(buildings_gdf):,} buildings")
print(f"  Unique cells: {buildings_gdf['cell_id'].nunique():,}")

gpkg_cells = set(buildings_gdf['cell_id'].dropna().unique())

# =============================================================================
# CHECK CELL MISMATCHES
# =============================================================================
print(f"\n{'='*60}")
print("CELL MISMATCH ANALYSIS")
print(f"{'='*60}")

cells_only_in_popsim = popsim_cells - gpkg_cells
cells_only_in_gpkg = gpkg_cells - popsim_cells
cells_in_both = popsim_cells & gpkg_cells

print(f"  Cells in both popsim and geopkg: {len(cells_in_both):,}")
print(f"  Cells only in popsim (no buildings): {len(cells_only_in_popsim):,}")
print(f"  Cells only in geopkg (no households): {len(cells_only_in_gpkg):,}")

if cells_only_in_popsim:
    # Count orphan households
    orphan_hh_count = sum(len(hh_by_cell[cell]) for cell in cells_only_in_popsim)
    print(f"\n  WARNING: {orphan_hh_count:,} households in {len(cells_only_in_popsim):,} cells have NO buildings!")
    print(f"  Sample cells without buildings (first 10):")
    for cell in sorted(cells_only_in_popsim)[:10]:
        print(f"    {cell}: {len(hh_by_cell[cell])} HHs")

# =============================================================================
# ASSIGN HOUSEHOLDS TO BUILDINGS
# =============================================================================
print(f"\n{'='*60}")
print("ASSIGNING HOUSEHOLDS TO BUILDINGS")
print(f"{'='*60}")

# Clear existing HH_IDs
buildings_gdf['HH_IDs'] = None

# Track statistics
cells_with_no_has_home = []
total_hh_assigned = 0
total_hh_to_fallback = 0

# Build index of buildings by cell for fast lookup
print("\nBuilding cell index...")
buildings_gdf['_idx'] = buildings_gdf.index
cell_to_buildings = buildings_gdf.groupby('cell_id')['_idx'].apply(list).to_dict()

# Pre-compute has_home mask
has_home_mask = buildings_gdf['has_home'].fillna(False).astype(bool)

print("Distributing households...")
processed_cells = 0
for cell_id, hh_ids in hh_by_cell.items():
    if cell_id not in cell_to_buildings:
        # Cell has no buildings at all - already logged above
        continue
    
    building_indices = cell_to_buildings[cell_id]
    
    # Get buildings with has_home=True
    home_building_indices = [idx for idx in building_indices if has_home_mask[idx]]
    
    if home_building_indices:
        # Normal case: distribute among has_home=True buildings
        target_indices = home_building_indices
    else:
        # Fallback: no has_home buildings, use all buildings in cell
        cells_with_no_has_home.append((cell_id, len(hh_ids), len(building_indices)))
        target_indices = building_indices
        total_hh_to_fallback += len(hh_ids)
    
    # Distribute households evenly (round-robin)
    n_buildings = len(target_indices)
    hh_per_building = [[] for _ in range(n_buildings)]
    
    for i, hh_id in enumerate(hh_ids):
        hh_per_building[i % n_buildings].append(str(hh_id))
    
    # Assign to buildings
    for i, idx in enumerate(target_indices):
        if hh_per_building[i]:
            buildings_gdf.at[idx, 'HH_IDs'] = ';'.join(hh_per_building[i])
    
    total_hh_assigned += len(hh_ids)
    processed_cells += 1
    
    if processed_cells % 50000 == 0:
        print(f"  Processed {processed_cells:,} / {len(hh_by_cell):,} cells...")

# Clean up temporary column
buildings_gdf = buildings_gdf.drop(columns=['_idx'])

# =============================================================================
# REPORT FALLBACK CASES
# =============================================================================
print(f"\n{'='*60}")
print("FALLBACK CASES (cells with no has_home=True buildings)")
print(f"{'='*60}")

if cells_with_no_has_home:
    print(f"  Total cells requiring fallback: {len(cells_with_no_has_home):,}")
    print(f"  Total households assigned to fallback buildings: {total_hh_to_fallback:,}")
    print(f"\n  Affected cells (showing first 20):")
    for cell_id, hh_count, building_count in sorted(cells_with_no_has_home, key=lambda x: -x[1])[:20]:
        print(f"    {cell_id}: {hh_count} HHs -> {building_count} buildings (no has_home)")
else:
    print("  No fallback cases - all cells had has_home=True buildings.")

# =============================================================================
# SAVE OUTPUT
# =============================================================================
print(f"\n{'='*60}")
print("SAVING OUTPUT")
print(f"{'='*60}")

print(f"\nSaving to: {output_gpkg_path}")
buildings_gdf.to_file(output_gpkg_path, driver="GPKG")
print(f"  Saved {len(buildings_gdf):,} buildings")

# Verify
buildings_with_hh = buildings_gdf['HH_IDs'].notna().sum()
print(f"  Buildings with assigned households: {buildings_with_hh:,}")

# =============================================================================
# SUMMARY
# =============================================================================
print(f"\n{'='*60}")
print("SUMMARY")
print(f"{'='*60}")
print(f"  Total households in popsim: {len(combined_df):,}")
print(f"  Households assigned to buildings: {total_hh_assigned:,}")
print(f"  Households in cells with no buildings: {len(combined_df) - total_hh_assigned:,}")
print(f"  Households assigned via fallback (no has_home): {total_hh_to_fallback:,}")
print(f"  Buildings with households: {buildings_with_hh:,}")
print(f"  Output file: {output_gpkg_path}")

print("\n[Step 5] Complete.")

[Step 5] Assigning households to buildings...

Loading combined PopSim results: popsim_combined/final_expanded_household_ids_combined.csv
  Loaded 3,803,578 household records
  Unique cells: 329,837
  Unique HH IDs: 130,068
  Cells with households: 329,837

Loading buildings GeoPackage: inputs/buildings_with_households.gpkg
  Loaded 7,582,736 buildings
  Unique cells: 755,668

CELL MISMATCH ANALYSIS
  Cells in both popsim and geopkg: 328,087
  Cells only in popsim (no buildings): 1,750
  Cells only in geopkg (no households): 427,581

  Sample cells without buildings (first 10):
    CRS3035RES100mN3133200E4299100: 5 HHs
    CRS3035RES100mN3133400E4298500: 4 HHs
    CRS3035RES100mN3133500E4298500: 3 HHs
    CRS3035RES100mN3133600E4298400: 9 HHs
    CRS3035RES100mN3133700E4298400: 13 HHs
    CRS3035RES100mN3133900E4297800: 14 HHs
    CRS3035RES100mN3134000E4297800: 6 HHs
    CRS3035RES100mN3134000E4297900: 16 HHs
    CRS3035RES100mN3137200E4290300: 3 HHs
    CRS3035RES100mN3138900E4289600