# 🏊‍♀️ Berlin Pools — End-to-End Pipeline (Refactored)

**Purpose:** run this notebook top-to-bottom to build a unified dataset of Berlin pools from legacy and OSM sources, **adding** 8-digit `district_id` codes plus `ortsteil` and `ortsteil_id` from the LOR polygons.

---

## What this notebook does

1) **Environment & Config**  
   - Auto-installs minimal deps (pandas, geopandas, shapely, etc.).  
   - Sets input/output paths & simple toggles.  
   - Declares the **LOR GeoJSON** path and the 8-digit **district_id** mapping.

2) **Run legacy extractor (if needed)**  
   - Executes `pool_data_processing.ipynb` to reproduce:  
     - `berlin_pools_final_dataset.csv`  
     - `pools_data_cleaned.csv`

3) **Load LOR polygons & enrich Legacy with districts/ortsteile**  
   - Loads `lor_ortsteile.geojson` (WGS84).  
   - Normalizes district labels; **maps district_id → 8 digits**.  
   - Derives **ortsteil_id** from `gml_id`.  
   - Spatial join (point-in-polygon) to the legacy CSV →  
     **`legacy_enriched_with_lor.csv`** (adds `district`, `district_id`, `ortsteil`, `ortsteil_id`).

4) **(Optional) OSM wide export + public named**  
   - Uses OSMnx to build **`osm_pools_wide.csv`** for Berlin.  
   - Writes **`osm_public_named.csv`** (named OSM features; private-access ones are excluded only when `STRICT_PUBLIC` is `True`).  
   - *Skip if you already have these files.*

5) **OSM enrichment with districts/ortsteile**  
   - Applies the same LOR spatial join to OSM points →  
     **`osm_enriched_with_lor.csv`** (adds `district`, `district_id`, `ortsteil`, `ortsteil_id`).

6) **Cross-check (from enriched files) → pairing & new pools**  
   - Filters OSM to pool-like features, de-dups by name+coords.  
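   - Matches **Legacy vs OSM** among candidates that share the first 10 characters of the normalized name: a pair qualifies on an exact normalized-name match or a distance ≤ **250 m**.  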
   - Outputs:  
     - **`legacy_enrichment_list.csv`** — candidate pairs to pull attributes from OSM  
       (both sides carry `district/ortsteil` from LOR).  
     - **`osm_public_named.csv`** — OSM pools not matched to legacy (potential new pools).

7) **Reverse geocode addresses (cache-aware)**  
   - Fills **`street`** & **`postal_code`** in `osm_public_named.csv`.  
   - Fills **`street_o`** & **`postal_code_o`** in `legacy_enrichment_list.csv` (OSM side only).  
   - Uses `reverse_geocode_cache.csv` to avoid re-querying.  
   - **Does not** modify district fields (they come from LOR).

8) **Build master** → `pools_master_minimal.csv`  
   - Combines **legacy_enriched_with_lor** + **unmatched OSM named**.  
   - Preserves Legacy `pool_id`; generates OSM `pool_id` (`OSM_<id>` or coordinate surrogate).  
   - Carries **`district`, `district_id` (8-digit), `ortsteil`, `ortsteil_id`**.  
   - Fills blank `open_all_year` → `False`.  
   - Ensures unique `pool_id`.

9) **Validation & Outputs**  
   - QC checks: duplicates, **district_id** format/membership, **ortsteil_id** format, postal codes, coords.  
   - Small samples of any problematic rows.

10) **Finalize schema & save**  
   - Coerces types (`latitude/longitude` → float).  
   - Pads `district_id` → **8 digits**, `ortsteil_id` → **4 digits** (when non-empty).  
   - Reorders columns and overwrites `pools_master_minimal.csv`.
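
The ID padding described in step 10 can be sketched in plain Python. This is only an illustration under stated assumptions, not the notebook's finalize cell: `pad_id` is a hypothetical helper name, and it assumes IDs may arrive as strings, ints, or floats such as `805.0` after a CSV round-trip.

```python
import math

def pad_id(value, width):
    """Zero-pad an ID to `width` digits; return "" for missing values.

    Hypothetical helper for illustration only.
    """
    if value is None:
        return ""
    if isinstance(value, float):
        if math.isnan(value):
            return ""
        value = int(value)  # 805.0 -> 805 (CSV round-trips often yield floats)
    text = str(value).strip()
    if text == "":
        return ""
    if text.endswith(".0"):
        text = text[:-2]  # "805.0" read back as text
    return text.zfill(width)

print(pad_id("805", 4))
print(pad_id(11001001.0, 8))
print(repr(pad_id(float("nan"), 8)))
```

Blank inputs stay blank, which matches the "when non-empty" condition in the step description.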

---

**Notes**
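- Re-running the notebook top to bottom regenerates outputs in place. Step 6 overwrites `osm_public_named.csv` with the unmatched subset, so a re-run that skips step 4 feeds step 5 that subset instead of the full named export.  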
- LOR join is the **single source of truth** for `district`, `district_id`, `ortsteil`, `ortsteil_id`.  
- Reverse-geocoding is cached; districts are **not** inferred from reverse geocode.
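
A cache-aware lookup of the kind these notes describe might look like the sketch below. The names `cache_key`, `cached_reverse_geocode`, and `fake_lookup`, the in-memory dict, and the 5-decimal rounding are all assumptions made for illustration; a real run would call a geocoder in place of the stub and persist the cache to `reverse_geocode_cache.csv`.

```python
def cache_key(lat, lon, ndigits=5):
    """Round coordinates so near-identical points share one cache entry (assumed scheme)."""
    return (round(float(lat), ndigits), round(float(lon), ndigits))

def cached_reverse_geocode(lat, lon, cache, lookup):
    """Return (street, postal_code), calling `lookup` only on a cache miss."""
    key = cache_key(lat, lon)
    if key not in cache:
        cache[key] = lookup(lat, lon)
    return cache[key]

# Stub standing in for a real geocoder call; it only records invocations.
calls = []
def fake_lookup(lat, lon):
    calls.append((lat, lon))
    return ("Example Street 1", "00000")  # placeholder values, not real data

cache = {}
first = cached_reverse_geocode(52.5, 13.4, cache, fake_lookup)
second = cached_reverse_geocode(52.5000001, 13.4000001, cache, fake_lookup)
print(first == second, len(calls))
```

The returned tuple carries only street and postal code, so district fields stay under the control of the LOR join.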

# How to use this notebook / file

1. **Pull the most recent data** from [baederleben.de](https://baederleben.de/abfragen/baeder-suche.php), save it as **`baederleben_berlin.xlsx`** (the file name the config cell expects), and place it **in the same folder as this notebook**.
2. **Place** **`lor_ortsteile.geojson`** **in the same folder** as this notebook.
3. **Run the notebook top to bottom.**
4. **Final CSV output:** **`pools_master_minimal.csv`** (written to this folder).
5. **Database publish:** The notebook uploads the result to the DB table **`berlin_source_data.pools_refactored`** (aka *pools_refactored*).



## **1) Environment & Config**

**Why:** 
- Make the notebook reproducible and self-contained.

- Install/import core geo stack used for our two tasks (8-digit district_id mapping + adding ortsteil & ortsteil_id from LOR).

- Centralize all file paths and toggles.


In [1]:
# --- Minimal deps auto-install (safe no-ops if already installed) ---
import sys, subprocess, importlib



def ensure(pkg, pip_name=None):
    try:
        importlib.import_module(pkg)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name or pkg])

# Core
for mod, pipn in [
    ("pandas", None),
    ("numpy", None),
    ("geopy", None),
    ("nbformat", None),
    ("nbconvert", None),
    # --- GEO stack for our 2 tasks ---
    ("shapely", None),        # geometry + point-in-polygon
    ("pyproj", None),         # CRS handling
    ("rtree", None),          # spatial index for fast spatial joins
    ("pyogrio", None),        # fast vector IO backend (optional but nice)
    ("geopandas", None),      # spatial joins + read geojson
]:
    ensure(mod, pipn)

# OSMnx only needed if MAKE_OSM_WIDE = True
try:
    import osmnx  # noqa
except Exception:
    pass

from pathlib import Path
import pandas as pd
import numpy as np

# Geo imports for later cells (used for ortsteil + district join)
import geopandas as gpd
from shapely.geometry import Point


print("versions →",
      "pandas", pd.__version__,
      "| geopandas", gpd.__version__)

versions → pandas 2.3.3 | geopandas 1.1.1


In [3]:
# --- Configuration knobs & paths (adjust as needed) ---
from pathlib import Path

def resolve_path(p):
    return Path(str(p)).expanduser().resolve()

# Input artifacts expected next to this notebook
LEGACY_NOTEBOOK = resolve_path("pool_data_processing.ipynb")
LEGACY_XLSX     = resolve_path("baederleben_berlin.xlsx")

# Outputs produced by legacy extractor
LEGACY_MAIN_CSV = resolve_path("berlin_pools_final_dataset.csv")
LEGACY_ALT_CSV  = resolve_path("pools_data_cleaned.csv")

# OSM artifacts
MAKE_OSM_WIDE   = True  # set True to build OSM via OSMnx (heavy + may need internet)
OSM_PLACE       = "Berlin, Germany"
OSM_WIDE_CSV    = resolve_path("osm_pools_wide.csv")   # optional
OSM_PUBLIC_NAMED_CSV = resolve_path("osm_public_named.csv")

# Matching & filtering knobs
DIST_M          = 100.0  # metres; note the cross-check cell in step 6 resets this to 250
STRICT_PUBLIC   = False  # keep False unless you want to only keep strictly public pools

# Enrichment cache for reverse geocoding
CACHE_PATH      = resolve_path("reverse_geocode_cache.csv")

print("Configured paths:")
print("  LEGACY_NOTEBOOK:", LEGACY_NOTEBOOK)
print("  LEGACY_XLSX    :", LEGACY_XLSX)
print("  LEGACY_MAIN_CSV:", LEGACY_MAIN_CSV)
print("  LEGACY_ALT_CSV :", LEGACY_ALT_CSV)
print("  OSM_WIDE_CSV   :", OSM_WIDE_CSV, "(build:", MAKE_OSM_WIDE, "place:", OSM_PLACE, ")")
print("  OSM_PUBLIC_NAMED_CSV:", OSM_PUBLIC_NAMED_CSV)
print("  DIST_M:", DIST_M, "| STRICT_PUBLIC:", STRICT_PUBLIC)
print("  CACHE_PATH:", CACHE_PATH)

# --- NEW: LOR (Ortsteil) source for our enrichment ---
# Put the GeoJSON file in the same folder as this notebook (or adjust the path).
LOR_GEOJSON = resolve_path("lor_ortsteile.geojson")
if not LOR_GEOJSON.exists():
    raise FileNotFoundError(
        f"LOR_GEOJSON not found at {LOR_GEOJSON}. "
        "Place the GeoJSON locally (e.g., 'lor_ortsteile.geojson') and update the path."
    )

# --- NEW: canonical district_id mapping (your exact IDs) ---
DISTRICT_ID_MAP = {
    "Mitte": "11001001",
    "Friedrichshain-Kreuzberg": "11002002",
    "Pankow": "11003003",
    "Charlottenburg-Wilmersdorf": "11004004",
    "Spandau": "11005005",
    "Steglitz-Zehlendorf": "11006006",
    "Tempelhof-Schöneberg": "11007007",
    "Neukölln": "11008008",
    "Treptow-Köpenick": "11009009",
    "Marzahn-Hellersdorf": "11010010",
    "Lichtenberg": "11011011",
    "Reinickendorf": "11012012",
}

# --- NEW: lat/lon column names we’ll use for the spatial join ---
LAT_COL_NAME = "latitude"
LON_COL_NAME = "longitude"

# --- NEW: output paths for enriched artifacts ---
LEGACY_ENRICHED_CSV = resolve_path("legacy_enriched_with_lor.csv")
OSM_ENRICHED_CSV    = resolve_path("osm_enriched_with_lor.csv")

# Quick sanity prints
print("\nSanity:")
print("  LOR_GEOJSON exists:", LOR_GEOJSON.exists())
print("  Districts in map:", len(DISTRICT_ID_MAP))


Configured paths:
  LEGACY_NOTEBOOK: C:\Users\micha\Projects VS\final pull 16.10.2025\pool_data_processing.ipynb
  LEGACY_XLSX    : C:\Users\micha\Projects VS\final pull 16.10.2025\baederleben_berlin.xlsx
  LEGACY_MAIN_CSV: C:\Users\micha\Projects VS\final pull 16.10.2025\berlin_pools_final_dataset.csv
  LEGACY_ALT_CSV : C:\Users\micha\Projects VS\final pull 16.10.2025\pools_data_cleaned.csv
  OSM_WIDE_CSV   : C:\Users\micha\Projects VS\final pull 16.10.2025\osm_pools_wide.csv (build: True place: Berlin, Germany )
  OSM_PUBLIC_NAMED_CSV: C:\Users\micha\Projects VS\final pull 16.10.2025\osm_public_named.csv
  DIST_M: 100.0 | STRICT_PUBLIC: False
  CACHE_PATH: C:\Users\micha\Projects VS\final pull 16.10.2025\reverse_geocode_cache.csv

Sanity:
  LOR_GEOJSON exists: True
  Districts in map: 12



## **2) Run legacy extractor (if needed)**

**Why:** Regenerate the legacy CSVs (only if neither of them exists yet), then pick the one we’ll use and autodetect its latitude/longitude column names for later spatial joins.
The extractor `pool_data_processing.ipynb` is executed with its own folder as the working directory to generate the CSVs.

**Outputs:**
- `berlin_pools_final_dataset.csv`
- `pools_data_cleaned.csv`
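
The coordinate-column detection this step performs can be tried in isolation. The sketch below mirrors the same case-insensitive lookup on a plain list of column names; the function name `detect_coord_columns` is hypothetical and the sample column names are made up.

```python
def detect_coord_columns(columns):
    """Return (lat_col, lon_col) using case-insensitive matching on common names."""
    lower = {c.lower(): c for c in columns}
    lat = lower.get("latitude") or lower.get("lat")
    lon = lower.get("longitude") or lower.get("lon") or lower.get("lng")
    if not lat or not lon:
        raise KeyError(f"No latitude/longitude columns in: {list(columns)}")
    return lat, lon

print(detect_coord_columns(["name", "Latitude", "LNG"]))
```

Returning the original spelling of each column means later cells can index the DataFrame without renaming anything.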


In [4]:
# ---  Ensure legacy CSV exists, pick the file to use, and detect coord columns ---

from nbconvert.preprocessors import ExecutePreprocessor
import nbformat

# 2.1) If BOTH legacy outputs are missing, run the extractor notebook
need_legacy = (not LEGACY_MAIN_CSV.exists()) and (not LEGACY_ALT_CSV.exists())
if need_legacy:
    if not LEGACY_NOTEBOOK.exists():
        raise FileNotFoundError(f"Legacy notebook missing: {LEGACY_NOTEBOOK}")
    if not LEGACY_XLSX.exists():
        raise FileNotFoundError(f"Legacy input Excel missing: {LEGACY_XLSX}")

    print("[info] Executing legacy extractor notebook…")
    with open(LEGACY_NOTEBOOK, "r", encoding="utf-8") as f:
        nb = nbformat.read(f, as_version=4)
    ep = ExecutePreprocessor(timeout=1800, kernel_name="python3")
    ep.preprocess(nb, resources={"metadata": {"path": str(LEGACY_NOTEBOOK.parent)}})

# 2.2) Choose which legacy CSV we'll use downstream
LEGACY_IN_USE = LEGACY_MAIN_CSV if LEGACY_MAIN_CSV.exists() else LEGACY_ALT_CSV
if not LEGACY_IN_USE.exists():
    raise FileNotFoundError("No legacy CSV found. Expected one of: "
                            f"{LEGACY_MAIN_CSV} or {LEGACY_ALT_CSV}")

print("[use] Legacy CSV ->", LEGACY_IN_USE)

# 2.3) Detect coordinate column names for LEGACY
_probe_leg = pd.read_csv(LEGACY_IN_USE, nrows=5)
leg_cols_lower = {c.lower(): c for c in _probe_leg.columns}
LEGACY_LAT_COL = leg_cols_lower.get("latitude") or leg_cols_lower.get("lat")
LEGACY_LON_COL = leg_cols_lower.get("longitude") or leg_cols_lower.get("lon") or leg_cols_lower.get("lng")
if not LEGACY_LAT_COL or not LEGACY_LON_COL:
    raise KeyError("Legacy CSV must have latitude/longitude (or lat/lon/lng). "
                   f"Found: {list(_probe_leg.columns)}")
print(f"[schema/legacy] latitude='{LEGACY_LAT_COL}', longitude='{LEGACY_LON_COL}'")

# 2.4) (Optional) Detect coordinate column names for OSM, if file exists
OSM_LAT_COL = OSM_LON_COL = None
if OSM_PUBLIC_NAMED_CSV.exists():
    _probe_osm = pd.read_csv(OSM_PUBLIC_NAMED_CSV, nrows=5)
    osm_cols_lower = {c.lower(): c for c in _probe_osm.columns}
    OSM_LAT_COL = osm_cols_lower.get("latitude") or osm_cols_lower.get("lat")
    OSM_LON_COL = osm_cols_lower.get("longitude") or osm_cols_lower.get("lon") or osm_cols_lower.get("lng")
    if not OSM_LAT_COL or not OSM_LON_COL:
        print("[warn] OSM CSV present but missing obvious lat/lon columns. "
              f"Found: {list(_probe_osm.columns)}")
    else:
        print(f"[schema/osm] latitude='{OSM_LAT_COL}', longitude='{OSM_LON_COL}'")
else:
    print("[info] OSM_PUBLIC_NAMED_CSV not found yet (we can enrich it later).")




[info] Executing legacy extractor notebook…
[use] Legacy CSV -> C:\Users\micha\Projects VS\final pull 16.10.2025\berlin_pools_final_dataset.csv
[schema/legacy] latitude='latitude', longitude='longitude'
[info] OSM_PUBLIC_NAMED_CSV not found yet (we can enrich it later).


## **3) Load LOR polygons & enrich LEGACY (after step 2)**

**Why:** Attach official 8-digit district IDs and Ortsteil fields (ortsteil, ortsteil_id) to the legacy dataset via a spatial join with the LOR (Ortsteil) polygons.
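
Before the spatial join, two small string transformations prepare the polygon attributes: district labels are normalized so they match the mapping keys, and `ortsteil_id` is taken from the tail of `gml_id`. A minimal standalone sketch, assuming the `re_ortsteil.0805` form of `gml_id` and using two entries copied from `DISTRICT_ID_MAP`:

```python
def normalize_bezirk_name(label):
    """Trim and unify dash variants so labels match the mapping keys."""
    if not isinstance(label, str):
        return ""
    return label.strip().replace("–", "-").replace("—", "-").replace(" - ", "-")

def ortsteil_id_from_gml(gml_id):
    """Take the part after the last '.' and left-pad it to 4 digits."""
    return str(gml_id).split(".")[-1].zfill(4)

# Two entries copied from DISTRICT_ID_MAP in the config cell.
district_id_map = {"Mitte": "11001001", "Treptow-Köpenick": "11009009"}

label = normalize_bezirk_name(" Treptow – Köpenick ")
print(label, district_id_map.get(label), ortsteil_id_from_gml("re_ortsteil.0805"))
```

A label that is not a key of the mapping yields `None` here, which corresponds to a missing `district_id` after the `.map(...)` call in the loader.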

In [5]:
# === Consolidated: LOR loader + suffix-safe enrichment (district_id + ortsteil + ortsteil_id) ===

import geopandas as gpd
import pandas as pd

def _normalize_bezirk_name(s: str) -> str:
    if not isinstance(s, str): return ""
    s = s.strip().replace("–", "-").replace("—", "-").replace(" - ", "-")
    return s

def load_lor_gdf(geojson_path: Path, district_id_map: dict) -> gpd.GeoDataFrame:
    """Load Ortsteil polygons, normalize names, derive ortsteil_id, and map district_id."""
    # Try fast engine first; fall back to Fiona if needed
    try:
        gdf = gpd.read_file(geojson_path, engine="pyogrio")
    except Exception:
        gdf = gpd.read_file(geojson_path, engine="fiona")

    # Ensure WGS84
    if gdf.crs is None:
        gdf.set_crs(epsg=4326, inplace=True)
    else:
        gdf = gdf.to_crs(epsg=4326)

    # GeoJSON props: BEZIRK (district), OTEIL (ortsteil), gml_id (e.g., re_ortsteil.0805)
    gdf = gdf.rename(columns={"BEZIRK": "district_label", "OTEIL": "ortsteil"})
    gdf["district_label"] = gdf["district_label"].astype(str).map(_normalize_bezirk_name)
    gdf["ortsteil"] = gdf["ortsteil"].astype(str)

    # Derive 4-digit ortsteil_id from the tail of gml_id
    gdf["ortsteil_id"] = gdf["gml_id"].astype(str).str.split(".").str[-1].str.zfill(4)

    # Map your exact district_id codes (TASK #1)
    gdf["district_id"] = gdf["district_label"].map(district_id_map).astype("string")

    keep = ["district_label", "district_id", "ortsteil", "ortsteil_id", "geometry"]
    return gdf[keep].copy()

def attach_lor_fields(df: pd.DataFrame,
                      lor_polys: gpd.GeoDataFrame,
                      lat_col: str,
                      lon_col: str) -> pd.DataFrame:
    """
    Adds: district_label, district_id, ortsteil, ortsteil_id via spatial join (within).
    Proactively DROPS any pre-existing enrichment columns to avoid _left/_right suffixes.
    """
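    if df is None or len(df) == 0:
        # Nothing to join: return a frame that still carries the enrichment columns.
        # (df.copy() would fail on None, so start from an empty frame in that case.)
        out = pd.DataFrame() if df is None else df.copy()
        for c in ["district_label","district_id","ortsteil","ortsteil_id"]:
            if c not in out.columns: out[c] = pd.Series(dtype="string")
        return out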

    missing = [c for c in (lat_col, lon_col) if c not in df.columns]
    if missing:
        raise KeyError(f"Missing required columns for spatial join: {missing}. "
                       f"Available: {list(df.columns)}")

    # Drop any existing enrichment cols to prevent suffixes
    base = df.drop(columns=["district_label","district_id","ortsteil","ortsteil_id"], errors="ignore").copy()

    # Points + spatial join
    points = gpd.GeoDataFrame(
        base, geometry=gpd.points_from_xy(base[lon_col], base[lat_col]), crs="EPSG:4326"
    )
    joined = gpd.sjoin(points, lor_polys, how="left", predicate="within")
    joined = joined.drop(columns=[c for c in joined.columns if c.startswith("index_")], errors="ignore")

    # Ensure canonical columns exist (no suffixes) and are string dtype
    for c in ["district_label","district_id","ortsteil","ortsteil_id"]:
        if c not in joined.columns:
            for alt in (f"{c}_right", f"{c}_left", f"{c}_y", f"{c}_x"):
                if alt in joined.columns:
                    joined[c] = joined[alt]
                    break
        if c in joined.columns:
            joined[c] = joined[c].astype("string")
        else:
            joined[c] = pd.Series(pd.NA, dtype="string")

    return pd.DataFrame(joined.drop(columns=["geometry"]))

# --- Load polygons once ---
lor_gdf = load_lor_gdf(LOR_GEOJSON, DISTRICT_ID_MAP)
print(f"[ok] LOR polygons loaded: {len(lor_gdf)} features")

# --- Enrich LEGACY using detected coord columns from Step 2 ---
legacy_df = pd.read_csv(LEGACY_IN_USE)
legacy_enriched = attach_lor_fields(legacy_df, lor_gdf,
                                    lat_col=LEGACY_LAT_COL, lon_col=LEGACY_LON_COL)
legacy_enriched.to_csv(LEGACY_ENRICHED_CSV, index=False)
print(f"[ok] Legacy enriched -> {LEGACY_ENRICHED_CSV.name} | rows={len(legacy_enriched)} | "
      f"missing district_id={legacy_enriched['district_id'].isna().sum()} | "
      f"missing ortsteil_id={legacy_enriched['ortsteil_id'].isna().sum()}")



[ok] LOR polygons loaded: 96 features
[ok] Legacy enriched -> legacy_enriched_with_lor.csv | rows=144 | missing district_id=0 | missing ortsteil_id=0


In [6]:
# --- Validation ---

import pandas as pd

df = pd.read_csv(LEGACY_ENRICHED_CSV)

need_cols = ["district_label","district_id","ortsteil","ortsteil_id"]
print("Missing columns:", [c for c in need_cols if c not in df.columns])

for c in need_cols:
    print(f"Nulls in {c}:", df[c].isna().sum())

mapped = df["district_label"].map(DISTRICT_ID_MAP).astype("string")
mism = (df["district_id"].astype("string") != mapped) & df["district_label"].notna()
print("district_id mismatches vs mapping:", int(mism.sum()))

uniq = df[["district_label","district_id"]].dropna().drop_duplicates().sort_values("district_label")
print("unique districts present:", len(uniq))
display(uniq.head(20))



Missing columns: []
Nulls in district_label: 0
Nulls in district_id: 0
Nulls in ortsteil: 0
Nulls in ortsteil_id: 0
district_id mismatches vs mapping: 0
unique districts present: 12


|    | district_label | district_id |
|---:|---|---|
| 11 | Charlottenburg-Wilmersdorf | 11004004 |
| 4 | Friedrichshain-Kreuzberg | 11002002 |
| 3 | Lichtenberg | 11011011 |
| 7 | Marzahn-Hellersdorf | 11010010 |
| 10 | Mitte | 11001001 |
| 8 | Neukölln | 11008008 |
| 5 | Pankow | 11003003 |
| 0 | Reinickendorf | 11012012 |
| 6 | Spandau | 11005005 |
| 9 | Steglitz-Zehlendorf | 11006006 |


## **4) (Optional) Build OSM wide export + osm public named**

**Why:** reproducible OSM pull to CSV (osm_pools_wide.csv). This step can be slow and needs internet.

**Skip it** if you already have osm_pools_wide.csv / osm_public_named.csv.
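
The build cell for this step does not consult `MAKE_OSM_WIDE` as written, so skipping is a manual decision. One way to automate it is a guard like the sketch below; `should_build_osm` is a hypothetical helper, and the rule "build only when the toggle is on and an output is missing" is an assumption, not the notebook's behavior.

```python
from pathlib import Path
import tempfile

def should_build_osm(make_flag, *outputs):
    """Build only when the toggle is on and at least one output file is missing."""
    return bool(make_flag) and not all(Path(p).exists() for p in outputs)

with tempfile.TemporaryDirectory() as tmp:
    wide = Path(tmp) / "osm_pools_wide.csv"
    named = Path(tmp) / "osm_public_named.csv"
    print(should_build_osm(True, wide, named))   # both files missing
    wide.write_text("name\n")
    named.write_text("name\n")
    print(should_build_osm(True, wide, named))   # both files present
    print(should_build_osm(False, wide, named))  # toggle off
```

In the notebook this would wrap the OSMnx calls, for example `if should_build_osm(MAKE_OSM_WIDE, OSM_WIDE_CSV, OSM_PUBLIC_NAMED_CSV): ...`.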


In [7]:
# --- (Optional) Build OSM wide export for Berlin ---

# Ensure osmnx is available (your first cell defined `ensure`)
try:
    import osmnx as ox
except Exception:
    ensure("osmnx", "osmnx")
    import osmnx as ox

place = OSM_PLACE
print(f"[info] Building OSM wide for: {place}")

# 4.1) Get boundary polygon (WGS84)
boundary = ox.geocode_to_gdf(place).to_crs(epsg=4326)

# 4.2) Tags we’ll query; this captures most pool-like places
tags = {
    "leisure": ["swimming_pool", "swimming_area", "water_park", "beach_resort", "sports_centre"],
    "amenity": ["public_bath"],  # rare but include
}

frames = []
poly = boundary.geometry.iloc[0]

for key, values in tags.items():
    try:
        g = ox.features_from_polygon(poly, {key: values})
        if not g.empty:
            g = g.to_crs(epsg=4326)
            g["src_tag_key"] = key
            frames.append(g)
            print(f"[ok] fetched {len(g)} features for {key}={values}")
        else:
            print(f"[info] no features for {key}={values}")
    except Exception as e:
        print(f"[warn] fetch failed for {key}={values}: {e}")

if frames:
    gdf = pd.concat(frames, ignore_index=True)
    # use centroids to get a single representative point for any geometry
    gdf_proj = gdf.to_crs(epsg=25833)              # ETRS89 / UTM zone 33N — great for Berlin
    centroids_proj = gdf_proj.geometry.centroid
    centroids_wgs84 = gpd.GeoSeries(centroids_proj, crs="EPSG:25833").to_crs(epsg=4326)

    gdf["latitude"]  = centroids_wgs84.y
    gdf["longitude"] = centroids_wgs84.x

    # Choose useful columns; create if missing so downstream is stable
    keep_cols = [
        "name", "latitude", "longitude",
        "addr:street", "addr:postcode",
        "phone", "website", "opening_hours", "wheelchair",
        "access", "leisure", "amenity", "sport", "src_tag_key"
    ]
    for c in keep_cols:
        if c not in gdf.columns:
            gdf[c] = pd.NA

    wide = gdf[keep_cols].copy().rename(columns={"addr:street":"street", "addr:postcode":"postal_code"})
    # Keep named rows for the “public_named” output
    wide["name_norm"] = wide["name"].fillna("").str.strip()
    public_mask = wide["name_norm"] != ""
    if STRICT_PUBLIC:
        public_mask &= ~wide["access"].fillna("").str.contains("private", case=False)

    public_named = wide.loc[public_mask].drop(columns=["name_norm"])

    # Save outputs
    wide.to_csv(OSM_WIDE_CSV, index=False)
    public_named.to_csv(OSM_PUBLIC_NAMED_CSV, index=False)

    print(f"[ok] Wrote {OSM_WIDE_CSV.name}: {len(wide)} rows")
    print(f"[ok] Wrote {OSM_PUBLIC_NAMED_CSV.name}: {len(public_named)} rows (named{' & non-private' if STRICT_PUBLIC else ''})")
else:
    print("[warn] No OSM features found with the configured tags.")



[info] Building OSM wide for: Berlin, Germany
[ok] fetched 1800 features for leisure=['swimming_pool', 'swimming_area', 'water_park', 'beach_resort', 'sports_centre']
[ok] fetched 3 features for amenity=['public_bath']
[ok] Wrote osm_pools_wide.csv: 1803 rows
[ok] Wrote osm_public_named.csv: 638 rows (named)


## **5) OSM enrichment with districts/ortsteile (after step 4)**

**Why:** attach Berlin’s official LOR attributes (district + Ortsteil) to the OSM points, so every record carries:

- district_label (human district name)

- district_id (your exact 8-digit codes)

- ortsteil and ortsteil_id (neighborhood + 4-digit id)
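
The expected shape of these fields can be checked per record with two regular expressions. The sketch below is an illustration only: `check_lor_ids` is a hypothetical helper, `known` holds just two values taken from `DISTRICT_ID_MAP`, and the `ortsteil_id` values are format examples rather than verified codes.

```python
import re

DISTRICT_ID_RE = re.compile(r"\d{8}")
ORTSTEIL_ID_RE = re.compile(r"\d{4}")

def check_lor_ids(row, known_district_ids):
    """Return a list of problems found in one record's LOR fields."""
    problems = []
    district_id = str(row.get("district_id", ""))
    ortsteil_id = str(row.get("ortsteil_id", ""))
    if not DISTRICT_ID_RE.fullmatch(district_id):
        problems.append("district_id is not 8 digits")
    elif district_id not in known_district_ids:
        problems.append("district_id not in mapping")
    if not ORTSTEIL_ID_RE.fullmatch(ortsteil_id):
        problems.append("ortsteil_id is not 4 digits")
    return problems

known = {"11001001", "11002002"}  # subset of DISTRICT_ID_MAP values
print(check_lor_ids({"district_id": "11001001", "ortsteil_id": "0101"}, known))
print(check_lor_ids({"district_id": "1100100", "ortsteil_id": "805"}, known))
```

Reading the CSV with string dtypes for both ID columns keeps leading zeros intact, so a check like this does not report false positives.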

In [8]:
# ---  Enrich OSM (district_id + ortsteil + ortsteil_id) ---

import pandas as pd

# make sure polygons are in memory (in case of kernel restarts)
try:
    lor_gdf
except NameError:
    lor_gdf = load_lor_gdf(LOR_GEOJSON, DISTRICT_ID_MAP)

if OSM_PUBLIC_NAMED_CSV.exists():
    osm_df = pd.read_csv(OSM_PUBLIC_NAMED_CSV)

    # normalize coordinate columns if needed (Step 4 already wrote latitude/longitude)
    rename_map = {}
    if "lat" in osm_df.columns and "latitude" not in osm_df.columns:
        rename_map["lat"] = "latitude"
    if "lon" in osm_df.columns and "longitude" not in osm_df.columns:
        rename_map["lon"] = "longitude"
    if "lng" in osm_df.columns and "longitude" not in rename_map and "longitude" not in osm_df.columns:
        rename_map["lng"] = "longitude"
    if rename_map:
        osm_df = osm_df.rename(columns=rename_map)

    # drop any pre-existing enrichment cols so we never get suffixes
    osm_df = osm_df.drop(columns=["district","district_label","district_id","invalid_district",
                                  "ortsteil","ortsteil_id"], errors="ignore")

    # enrich via our suffix-safe helper
    osm_enriched = attach_lor_fields(osm_df, lor_gdf, lat_col="latitude", lon_col="longitude")
    osm_enriched.to_csv(OSM_ENRICHED_CSV, index=False)

    print(f"[ok] OSM enriched -> {OSM_ENRICHED_CSV.name} | "
          f"rows={len(osm_enriched)} | missing ortsteil_id={osm_enriched['ortsteil_id'].isna().sum()}")
else:
    print("[info] OSM_PUBLIC_NAMED_CSV not found — run Step 4 first.")


[ok] OSM enriched -> osm_enriched_with_lor.csv | rows=638 | missing ortsteil_id=1


In [8]:
# --Inspect the missing row--

import pandas as pd

osm_enriched = pd.read_csv(OSM_ENRICHED_CSV)
miss = osm_enriched["ortsteil_id"].isna()

print("Missing rows:", miss.sum())
display(osm_enriched.loc[miss, ["name","latitude","longitude","street","postal_code","leisure","amenity","access"]])



Missing rows: 1


|     | name | latitude | longitude | street | postal_code | leisure | amenity | access |
|----:|---|---|---|---|---|---|---|---|
| 560 | Jahnsportstätte Ahrensfelde | 52.577805 | 13.566118 |  |  | sports_centre |  |  |


Ahrensfelde is outside Berlin (Brandenburg), so it won’t match any Berlin Ortsteil polygon. That’s why ortsteil_id is NaN — and that’s correct behavior.
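
Why a left join leaves such a row empty can be seen with a toy point-in-polygon test. The sketch uses plain ray casting on two made-up unit squares; it is not how GeoPandas performs `sjoin` (which relies on its geometry engine and a spatial index), and the polygons are not Berlin geometry.

```python
def point_in_polygon(x, y, ring):
    """Ray casting: toggle `inside` at each edge crossed by a ray going right from (x, y)."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the ray's y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def first_match(x, y, polygons):
    """Return the label of the first polygon containing the point, else None."""
    for label, ring in polygons.items():
        if point_in_polygon(x, y, ring):
            return label
    return None

toy_polygons = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
print(first_match(0.5, 0.5, toy_polygons))  # point inside A
print(first_match(1.5, 0.5, toy_polygons))  # point inside B
print(first_match(3.0, 0.5, toy_polygons))  # point outside both polygons
```

A point that falls in no polygon gets no label, which is the situation of the Ahrensfelde row: the join keeps the record but has nothing to attach.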

In [9]:
# Filter OSM_ENRICHED to points strictly inside the Berlin boundary
import geopandas as gpd, pandas as pd

# ensure boundary is available
try:
    boundary
except NameError:
    import osmnx as ox
    boundary = ox.geocode_to_gdf(OSM_PLACE).to_crs(epsg=4326)

osm = pd.read_csv(OSM_ENRICHED_CSV)
gdf_pts = gpd.GeoDataFrame(osm, geometry=gpd.points_from_xy(osm["longitude"], osm["latitude"]), crs="EPSG:4326")

berlin_poly = boundary.geometry.iloc[0]
inside = gdf_pts.within(berlin_poly)         # strict containment
dropped = len(gdf_pts) - int(inside.sum())

osm_berlin_only = pd.DataFrame(gdf_pts.loc[inside].drop(columns="geometry"))
osm_berlin_only.to_csv(OSM_ENRICHED_CSV, index=False)

print(f"[ok] Kept Berlin-only rows: {len(osm_berlin_only)} | dropped outside: {dropped}")


[ok] Kept Berlin-only rows: 637 | dropped outside: 1


In [10]:
# - Validate
df = pd.read_csv(OSM_ENRICHED_CSV)
for c in ["district_label","district_id","ortsteil","ortsteil_id"]:
    print(f"Nulls in {c}:", df[c].isna().sum())


Nulls in district_label: 0
Nulls in district_id: 0
Nulls in ortsteil: 0
Nulls in ortsteil_id: 0


## **6) Cross-check (from enriched files) → pairing & new pools**  
**Why:** find what’s missing in the legacy tool vs. OSM after adding LOR fields (district + Ortsteil).

You’ll get:
- `legacy_enrichment_list.csv` — candidates to pull extra attributes from OSM
- `osm_public_named.csv` — OSM pools not matched to legacy (potential new pools)
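
The pairing rule used in this step can be sketched on two toy records: candidates must share the first 10 characters of the normalized name, and a candidate pair is kept when the names are identical or the points lie within the distance threshold. The record layout, the helper names `is_candidate` and `is_match`, and the sample names and coordinates are made up for illustration.

```python
import math
import re
import unicodedata

def normalize_name(s):
    """Lowercase, strip accents, and collapse non-alphanumerics to single spaces."""
    if not isinstance(s, str):
        return ""
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres on a sphere of radius 6,371 km."""
    r = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def is_candidate(a, b, bucket_len=10):
    """Blocking: only compare records sharing the first `bucket_len` name characters."""
    return a["name_norm"][:bucket_len] == b["name_norm"][:bucket_len]

def is_match(a, b, max_dist_m=250.0):
    """Keep a candidate pair if names are equal and non-empty, or points are close."""
    same_name = a["name_norm"] != "" and a["name_norm"] == b["name_norm"]
    dist = haversine_m(a["lat"], a["lon"], b["lat"], b["lon"])
    return is_candidate(a, b) and (same_name or dist <= max_dist_m)

# Toy records with made-up names and coordinates (not real pool locations).
legacy_rec = {"name_norm": normalize_name("Beispielbad Süd"), "lat": 52.5000, "lon": 13.4000}
osm_rec = {"name_norm": normalize_name("Beispielbad Sued"), "lat": 52.5010, "lon": 13.4000}
print(legacy_rec["name_norm"], "|", osm_rec["name_norm"])
print(round(haversine_m(52.5, 13.4, 52.501, 13.4)), is_match(legacy_rec, osm_rec))
```

Here the normalized names differ, yet the pair still qualifies because both records fall in the same prefix bucket and sit within the threshold. Records whose names differ within the first 10 characters are never compared, however close they are.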

In [11]:
# prefer enriched; fall back to raw paths
LEGACY_PATH = (LEGACY_ENRICHED_CSV if LEGACY_ENRICHED_CSV.exists()
               else (LEGACY_MAIN_CSV if LEGACY_MAIN_CSV.exists() else LEGACY_ALT_CSV))

In [12]:
# === Cross-check using ENRICHED files but OSM filtered like before ===
from pathlib import Path
import pandas as pd, numpy as np, math, unicodedata, re

# Inputs (enriched)
LEGACY_PATH = LEGACY_ENRICHED_CSV
OSM_NAMED_ENRICHED_PATH = OSM_ENRICHED_CSV
DIST_M = 250
STRICT_PUBLIC = False  # keep consistent with your pipeline

print(f"[info] Using legacy (enriched): {LEGACY_PATH.name}")
print(f"[info] Using OSM (enriched):    {OSM_NAMED_ENRICHED_PATH.name}")

legacy = pd.read_csv(LEGACY_PATH)
osm    = pd.read_csv(OSM_NAMED_ENRICHED_PATH)

# --- helpers (same as your pipeline) ---
def normalize_ascii(s: str) -> str:
    if not isinstance(s, str): return ""
    s = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in s if not unicodedata.combining(ch))

def normalize_name(s: str) -> str:
    if not isinstance(s, str): return ""
    s = normalize_ascii(s).lower().strip()
    s = re.sub(r"[^a-z0-9]+", " ", s)
    return re.sub(r"\s+", " ", s).strip()

def is_public_access(v, strict=False) -> bool:
    if v is None or (isinstance(v, float) and math.isnan(v)):
        return not strict
    v = str(v).strip().lower()
    if v in {"private","customers","residents"}: return False
    # v is always a string here, so a plain membership test is enough
    allowed = {"yes","public"} if strict else {"","yes","public","permissive"}
    return v in allowed

def haversine_m(lat1, lon1, lat2, lon2) -> float:
    if any(pd.isna([lat1, lon1, lat2, lon2])): return np.nan
    R=6371000.0
    phi1=math.radians(lat1); phi2=math.radians(lat2)
    dphi=math.radians(lat2-lat1); dl=math.radians(lon2-lon1)
    a=math.sin(dphi/2)**2+math.cos(phi1)*math.cos(phi2)*math.sin(dl/2)**2
    return 2*R*math.asin(math.sqrt(a))

def colmap(df): return {c.lower(): c for c in df.columns}
def cget(cols, *choices):
    for ch in choices:
        if ch in cols: return cols[ch]
    return None

def has_swimming(val):
    if val is None or (isinstance(val, float) and np.isnan(val)): return False
    s = str(val).lower()
    return ("swimming" in s) or ("schwimm" in s)

def detect_lat_lon(df):
    c = colmap(df)
    lat = cget(c, "latitude","lat")
    lon = cget(c, "longitude","lon","lng")
    if not lat or not lon:
        raise KeyError(f"Could not detect lat/lon in columns: {list(df.columns)}")
    return lat, lon

# --- 1) Filter OSM (enriched) to pool-like + public + valid coords, then de-dup (as before) ---
c = colmap(osm)
leisure_c = cget(c, "leisure")
sport_c   = cget(c, "sport")
access_c  = cget(c, "access")

leisure = osm[leisure_c].astype(str).str.lower() if leisure_c else pd.Series([""]*len(osm), index=osm.index)
sport   = osm[sport_c].astype(str) if sport_c else pd.Series([""]*len(osm), index=osm.index)

is_pool = (
    leisure.eq("swimming_pool") |
    leisure.eq("swimming_area") |
    (leisure.eq("sports_centre") & sport.apply(has_swimming))
)

pub_mask = osm[access_c].apply(lambda v: is_public_access(v, STRICT_PUBLIC)) if access_c else pd.Series([True]*len(osm), index=osm.index)
lat_o, lon_o = detect_lat_lon(osm)
coord_mask = pd.to_numeric(osm[lat_o], errors="coerce").notna() & pd.to_numeric(osm[lon_o], errors="coerce").notna()

osm_f = osm[is_pool & pub_mask & coord_mask].copy()
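# DataFrame.get("name", "") returns a plain str when the column is missing, which has no .astype;
# fill missing names before str conversion so NaN does not become the literal "nan"
name_o = osm_f["name"] if "name" in osm_f.columns else pd.Series("", index=osm_f.index)
osm_f["name_norm"] = name_o.fillna("").astype(str).map(normalize_name)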
osm_f["lat_r"] = pd.to_numeric(osm_f[lat_o], errors="coerce").round(5)
osm_f["lon_r"] = pd.to_numeric(osm_f[lon_o], errors="coerce").round(5)
osm_f = osm_f.drop_duplicates(subset=["name_norm","lat_r","lon_r"], keep="first").drop(columns=["lat_r","lon_r"])

# --- 2) Legacy side: prepare names/coords ---
name_l = legacy["name"] if "name" in legacy.columns else pd.Series("", index=legacy.index)
legacy["name_norm"] = name_l.fillna("").astype(str).map(normalize_name)
lat_l, lon_l = detect_lat_lon(legacy)

print({
    "legacy_rows": len(legacy),
    "osm_enriched_rows": len(osm),
    "osm_filtered_like_old_pipeline": len(osm_f)
})

# --- 3) Candidate join and matching ---
L = legacy.assign(bucket=legacy["name_norm"].str[:10])
O = osm_f.assign(bucket=osm_f["name_norm"].str[:10])

cand = L.merge(O, on="bucket", suffixes=("_l","_o"), how="inner")
cand["name_match"] = (cand["name_norm_l"].ne("")) & cand["name_norm_l"].eq(cand["name_norm_o"])
cand["dist_m"] = cand.apply(lambda r: haversine_m(r[lat_l+"_l"], r[lon_l+"_l"], r[lat_o+"_o"], r[lon_o+"_o"]), axis=1)

legacy_enrichment_list = cand[(cand["name_match"]) | (cand["dist_m"] <= DIST_M)].copy()

keep_cols = [
    # legacy side + LOR
    "name_l", f"{lat_l}_l", f"{lon_l}_l",
    "district_label_l","district_id_l","ortsteil_l","ortsteil_id_l",
    # osm side + LOR + addr
    "name_o", f"{lat_o}_o", f"{lon_o}_o", "street_o","postal_code_o",
    "district_label_o","district_id_o","ortsteil_o","ortsteil_id_o",
    "website_o","opening_hours_o","wheelchair_o","phone_o",
    # matching info
    "dist_m","name_match"
]
legacy_enrichment_list = legacy_enrichment_list[[c for c in keep_cols if c in legacy_enrichment_list.columns]]

# --- 4) Unmatched OSM named (from filtered set), keep all enriched columns ---
def rounded_key(df, lat_col, lon_col):
    return df["name_norm"].astype(str) + "|" + pd.to_numeric(df[lat_col], errors="coerce").round(5).astype(str) + "|" + pd.to_numeric(df[lon_col], errors="coerce").round(5).astype(str)

cand["_key_o"] = (
    cand["name_norm_o"].astype(str) + "|" +
    pd.to_numeric(cand[lat_o+"_o"], errors="coerce").round(5).astype(str) + "|" +
    pd.to_numeric(cand[lon_o+"_o"], errors="coerce").round(5).astype(str)
)
matched_keys = set(
    cand.loc[(cand["name_match"]) | (cand["dist_m"] <= DIST_M), "_key_o"].tolist()
)

osm_f["_key"] = rounded_key(osm_f, lat_o, lon_o)
osm_public_named = osm_f[~osm_f["_key"].isin(matched_keys)].drop(columns=["_key"]).copy()

# --- 5) Save exactly the two files you want ---
legacy_enrichment_list.to_csv("legacy_enrichment_list.csv", index=False)
osm_public_named.to_csv("osm_public_named.csv", index=False)

print("[ok] Wrote: legacy_enrichment_list.csv, osm_public_named.csv")
print({
    "legacy_enrichment_list": len(legacy_enrichment_list),
    "osm_public_named": len(osm_public_named)
})






[info] Using legacy (enriched): legacy_enriched_with_lor.csv
[info] Using OSM (enriched):    osm_enriched_with_lor.csv
{'legacy_rows': 144, 'osm_enriched_rows': 637, 'osm_filtered_like_old_pipeline': 69}
[ok] Wrote: legacy_enrichment_list.csv, osm_public_named.csv
{'legacy_enrichment_list': 48, 'osm_public_named': 23}



## 6.1 Load inputs & prep candidate pairs

**Why:** standardize names on both sides so our later fuzzy name + distance matching is stable and reproducible.
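
One pitfall when standardizing names is converting a missing value to a string before blanking it: the result is the literal text `nan`, which then looks like a real name. A minimal plain-Python sketch of the safer order, with hypothetical helper names `clean_text` and `coalesce_text` and a made-up sample name:

```python
import math

def clean_text(value):
    """Map None/NaN to "" before any string conversion, then trim."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return str(value).strip()

def coalesce_text(*values):
    """Return the first non-blank cleaned value, or "" if all are blank."""
    for value in values:
        text = clean_text(value)
        if text:
            return text
    return ""

missing = float("nan")
print(repr(str(missing)))          # naive conversion keeps the text 'nan'
print(repr(clean_text(missing)))   # cleaned first, so it stays blank
print(coalesce_text(missing, "  ", "Beispielbad"))
```

The pandas equivalent is to call `.fillna("")` before `.astype(str)`, not after.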



In [13]:
# --- Normalized names on ENRICHED dataframes (legacy + OSM filtered) ---

import unicodedata, re
import numpy as np
import pandas as pd

def normalize_name(s: str) -> str:
    if s is None or (isinstance(s, float) and np.isnan(s)): 
        s = ""
    s = str(s)
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    s = re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
    return re.sub(r"\s+", " ", s).strip()

def ensure_text_col(df, *candidates, create="name"):
    for c in candidates:
        if c in df.columns:
            return c
    if create not in df.columns:
        df[create] = ""
    return create

def coalesce_cols(df, *cols):
    present = [c for c in cols if c in df.columns]
    if not present:
        return pd.Series([""] * len(df), index=df.index, dtype=object)
    # fillna before astype(str): otherwise NaN becomes the literal "nan" and is never treated as blank
    out = df[present[0]].fillna("").astype(str)
    for c in present[1:]:
        nxt = df[c].fillna("").astype(str)
        use_nxt = out.str.strip().eq("")
        out = out.where(~use_nxt, nxt)
    return out

# ---- LEGACY (enriched) ----
# Use 'legacy' DF if you still have it in memory; otherwise reload:
try:
    legacy
except NameError:
    legacy = pd.read_csv(LEGACY_ENRICHED_CSV)

legacy_name_col = ensure_text_col(legacy, "name", "pool_name", create="name")
legacy["name"] = legacy[legacy_name_col].fillna("").astype(str).str.strip()
legacy["name_norm"] = legacy["name"].map(normalize_name)

# ---- OSM (enriched + filtered as in your pipeline: osm_f) ----
# If you don't have osm_f in memory, rebuild the filtered view from osm_enriched:
try:
    osm_f
except NameError:
    osm = pd.read_csv(OSM_ENRICHED_CSV)
    # minimal re-filter (same as earlier): keep valid coords only
    lat_o = "latitude" if "latitude" in osm.columns else "lat"
    lon_o = "longitude" if "longitude" in osm.columns else "lon"
    coord_mask = pd.to_numeric(osm[lat_o], errors="coerce").notna() & pd.to_numeric(osm[lon_o], errors="coerce").notna()
    osm_f = osm[coord_mask].copy()

# Coalesce common OSM name fields; take first token before ';' (OSM often stores multiple names)
osm_f["name"] = coalesce_cols(
    osm_f,
    "name","official_name","short_name","alt_name","name:de","name:en","brand","operator"
).astype(str).fillna("").str.split(";").str[0].str.strip()

osm_f["name_norm"] = osm_f["name"].map(normalize_name)

# (Optional) quick peek
print("[legacy] sample:", legacy[["name","name_norm"]].head(3).to_dict(orient="records"))
print("[osm_f] sample:", osm_f[["name","name_norm"]].head(3).to_dict(orient="records"))




[legacy] sample: [{'name': 'Strandbad Lübars', 'name_norm': 'strandbad lubars'}, {'name': 'Kleine Schwimmhalle Wuhlheide', 'name_norm': 'kleine schwimmhalle wuhlheide'}, {'name': 'Kombibad Mariendorf', 'name_norm': 'kombibad mariendorf'}]
[osm_f] sample: [{'name': 'Schwimmschule Wassermeloni', 'name_norm': 'schwimmschule wassermeloni'}, {'name': 'Freibad Lübars', 'name_norm': 'freibad lubars'}, {'name': '1. Berliner Kinder-Schwimmschule', 'name_norm': '1 berliner kinder schwimmschule'}]



## **7) Reverse geocode addresses (cache-aware)**  

**Why:** some OSM points don’t carry addresses. We fill `street` and `postal_code` from Nominatim, **with a local cache** to avoid repeated lookups. Districts still come from LOR; this step does not touch district/district_id.
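
The cache-aware idea can be sketched independently of Nominatim: key each lookup by rounded coordinates and only call the geocoder on a cache miss. Everything named here (`coord_key`, `cached_reverse`, `fake_lookup`) is illustrative; `fake_lookup` is a stand-in that returns a hard-coded address so the example runs without network access.

```python
from typing import Callable, Dict, Optional, Tuple

Coord = Tuple[float, float]
Address = Tuple[str, str]  # (street, postal_code)

def coord_key(lat, lon, ndigits: int = 6) -> Optional[Coord]:
    """Round coordinates so near-identical inputs share one cache entry."""
    try:
        return round(float(lat), ndigits), round(float(lon), ndigits)
    except (TypeError, ValueError):
        return None

def cached_reverse(lat, lon, cache: Dict[Coord, Address],
                   lookup: Callable[[float, float], Address]) -> Address:
    """Return the cached address, calling `lookup` only on a cache miss."""
    k = coord_key(lat, lon)
    if k is None:
        return ("", "")
    if k not in cache:
        cache[k] = lookup(*k)
    return cache[k]

calls = []

def fake_lookup(lat: float, lon: float) -> Address:
    # Stand-in for a real reverse geocoder; records each call it receives.
    calls.append((lat, lon))
    return ("Am Freibad 9", "13469")

cache: Dict[Coord, Address] = {}
first = cached_reverse("52.61824", 13.33519, cache, fake_lookup)
second = cached_reverse(52.618240, "13.335190", cache, fake_lookup)
print(first, second, "lookups made:", len(calls))
```

The notebook's version differs in one respect worth noting: it re-queries when a cached street or postcode is blank, whereas this sketch treats any cached entry as final.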


### 7.1 Reverse-geocode osm_public_named.csv and legacy (enriched)

In [14]:
# --- Enrich street / postal_code using reverse geocode; district comes from LOR ---
from pathlib import Path
import pandas as pd, re, unicodedata
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

# --------------------------- CONFIG ---------------------------
OSM_PUBLIC_NAMED_CSV = Path("osm_public_named.csv")   # already enriched with LOR
CACHE_PATH = Path("reverse_geocode_cache.csv")

# Prefer enriched legacy so we keep district_label/district_id/ortsteil/ortsteil_id
try:
    LEGACY_CSV = LEGACY_ENRICHED_CSV if LEGACY_ENRICHED_CSV.exists() else (
        LEGACY_MAIN_CSV if LEGACY_MAIN_CSV.exists() else LEGACY_ALT_CSV
    )
except Exception:
    LEGACY_CSV = Path("legacy_enriched_with_lor.csv") if Path("legacy_enriched_with_lor.csv").exists() else Path("berlin_pools_final_dataset.csv")

WRITE_LEGACY_INPLACE = True  # write back to LEGACY_CSV

# -------------------------- HELPERS ---------------------------
def strip_accents(s: str) -> str:
    s = "" if s is None else str(s)
    s = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in s if not unicodedata.combining(ch))

def is_blank(x) -> bool:
    if pd.isna(x): return True
    return str(x).strip().lower() in {"", "nan", "none", "<na>", "null"}

def key(lat, lon):
    try: return round(float(lat), 6), round(float(lon), 6)
    except Exception: return None

def clean_postcode(p):
    m = re.search(r"\b(1[0-4]\d{3})\b", str(p))
    return m.group(1) if m else ""

def street_from_address(addr: str) -> str:
    if not addr: return ""
    # simple heuristics for "Word Word 123" or common street tokens
    m = re.search(r"([A-Za-zÄÖÜäöüß\-\s]+)\s(\d+[a-zA-Z]?)", str(addr))
    if m: return m.group(0).strip()
    tokens = ("straße","str.","strasse","allee","damm","weg","platz","ufer","chaussee","ring","steig","promenade","gasse","pfad")
    parts = [p.strip() for p in str(addr).split(",") if p.strip()]
    for p in parts:
        if any(t in p.lower() for t in tokens):
            return p.strip()
    return ""

# ------------------ shared cache + geocoder ------------------
cache_cols = ["lat","lon","street","postal_code"]
cache = pd.read_csv(CACHE_PATH) if CACHE_PATH.exists() else pd.DataFrame(columns=cache_cols)
for c in cache_cols:
    if c not in cache.columns: cache[c] = ""
cache["key"] = cache.apply(lambda r: key(r.get("lat"), r.get("lon")), axis=1)
cache_dict = {k: (s, p) for k, s, p in zip(cache["key"], cache["street"], cache["postal_code"]) if pd.notna(k)}

# Respect Nominatim usage policy (identify yourself + rate limit)
geolocator = Nominatim(user_agent="pools_enrichment/1.0 (contact: your_email@example.com)")
reverse = RateLimiter(geolocator.reverse, min_delay_seconds=1.0, max_retries=2, error_wait_seconds=2.0, swallow_exceptions=True)

def enrich_table_inplace(df, lat_candidates, lon_candidates, street_col, post_col, sanity_label=""):
    """
    Fills ONLY street/postal_code via reverse geocoding; DOES NOT touch district.
    District should come from LOR columns (district_label/district_id) outside this function.
    """
    # ensure cols
    for c in [street_col, post_col]:
        if c not in df.columns: df[c] = ""
        df[c] = df[c].astype("string")

    # lat/lon pick
    def pickcol(cands):
        for c in cands:
            if c in df.columns: return c
        return None
    lat_c = pickcol(lat_candidates)
    lon_c = pickcol(lon_candidates)
    if not lat_c or not lon_c:
        print(f"[skip] {sanity_label}: no lat/lon")
        return

    # who needs?
    need_mask = (
        df[street_col].apply(is_blank) |
        df[post_col].apply(is_blank)
    ) & df[lat_c].notna() & df[lon_c].notna()

    need = df[need_mask].copy()
    print(f"[info] {sanity_label}: rows needing street/postcode:", int(need_mask.sum()))

    filled_rows = []
    for idx, r in need.iterrows():
        k = key(r[lat_c], r[lon_c])
        if not k:
            continue

        s, p = cache_dict.get(k, ("",""))

        if is_blank(s) or is_blank(p):
            loc = reverse((r[lat_c], r[lon_c]), exactly_one=True, addressdetails=True, language="de", zoom=18)
            ad  = (loc.raw.get("address") if loc and hasattr(loc, "raw") else {}) or {}

            # street / house no
            if is_blank(s):
                road = ad.get("road") or ad.get("pedestrian") or ad.get("footway") or ad.get("path") or ""
                hn   = ad.get("house_number") or ""
                s = " ".join([x for x in [road, hn] if x]).strip()

            # postcode
            if is_blank(p):
                p = clean_postcode(ad.get("postcode", ""))

            if k:
                filled_rows.append({"lat": k[0], "lon": k[1], "street": s, "postal_code": p, "key": k})

        # assign back only if blank
        if not is_blank(s) and is_blank(df.at[idx, street_col]):
            df.at[idx, street_col] = str(s)
        if not is_blank(p) and is_blank(df.at[idx, post_col]):
            df.at[idx, post_col] = str(p)

    # persist cache (only if we actually added any rows)
    if filled_rows:
        new = pd.DataFrame(filled_rows)
        # ensure consistent columns/order
        for c in ["lat", "lon", "street", "postal_code", "key"]:
            if c not in new.columns:
                new[c] = pd.NA
        new = new[["lat", "lon", "street", "postal_code", "key"]]

        # concat with existing cache
        cc = pd.concat([cache, new], ignore_index=True)
        cc = cc.drop_duplicates(subset=["key"], keep="last")

        # write and refresh in-memory map
        cc[["lat", "lon", "street", "postal_code"]].to_csv(CACHE_PATH, index=False)
        cc["key"] = cc.apply(lambda r: key(r.get("lat"), r.get("lon")), axis=1)
        cache_dict.update({
            k: (s, p)
            for k, s, p in zip(cc["key"], cc["street"], cc["postal_code"])
            if pd.notna(k)
        })

def adopt_lor_district(df, district_text_col="district"):
    """
    Ensure the human-readable 'district' column is filled from LOR's district_label.
    Never overwrite non-blank 'district' values.
    """
    if "district_label" in df.columns:
        if district_text_col not in df.columns:
            df[district_text_col] = ""
        mask = df[district_text_col].astype(str).str.strip().eq("")
        df.loc[mask, district_text_col] = df.loc[mask, "district_label"].astype("string")

# ------------------ OSM FILE ------------------
osm = pd.read_csv(OSM_PUBLIC_NAMED_CSV)

# Pre-fill district text from LOR
adopt_lor_district(osm, district_text_col="district")

# Optional: prefill from OSM-style address columns
if "addr_street" in osm.columns:
    m = osm["street"].astype(str).str.strip().eq("") if "street" in osm.columns else pd.Series(True, index=osm.index)
    osm.loc[m, "street"] = osm.loc[m, "addr_street"].astype(str)
if "addr_postcode" in osm.columns:
    m = osm["postal_code"].astype(str).str.strip().eq("") if "postal_code" in osm.columns else pd.Series(True, index=osm.index)
    osm.loc[m, "postal_code"] = osm.loc[m, "addr_postcode"].astype(str)

# Reverse geocode only missing street/postal_code
enrich_table_inplace(
    osm,
    lat_candidates=("latitude","lat"),
    lon_candidates=("longitude","lon"),
    street_col="street",
    post_col="postal_code",
    sanity_label="OSM"
)

# normalize post code as text
osm["postal_code"] = osm["postal_code"].astype(str).str.replace(r"\.0$", "", regex=True)  # strip only a trailing ".0"
osm.to_csv(OSM_PUBLIC_NAMED_CSV, index=False)

# ------------------ LEGACY FILE ------------------
legacy = pd.read_csv(LEGACY_CSV)

# Ensure target columns exist
for c in ["street","postal_code","district"]:
    if c not in legacy.columns: legacy[c] = ""

# Fill district text from LOR first
adopt_lor_district(legacy, district_text_col="district")

# Try to prefill street from any "address" field
if "address" in legacy.columns:
    mask = legacy["street"].astype(str).str.strip().eq("")
    legacy.loc[mask, "street"] = legacy.loc[mask, "address"].astype(str).map(street_from_address)

# Reverse geocode only missing street/postal_code
enrich_table_inplace(
    legacy,
    lat_candidates=("latitude","lat"),
    lon_candidates=("longitude","lon"),
    street_col="street",
    post_col="postal_code",
    sanity_label="LEGACY"
)
legacy["postal_code"] = legacy["postal_code"].astype(str).str.replace(r"\.0$", "", regex=True)  # strip only a trailing ".0"

# write legacy back
if WRITE_LEGACY_INPLACE:
    legacy.to_csv(LEGACY_CSV, index=False)
    print("[ok] Updated legacy file →", LEGACY_CSV.name)
else:
    out_legacy = LEGACY_CSV.with_name(LEGACY_CSV.stem + "_enriched.csv")
    legacy.to_csv(out_legacy, index=False)
    print("[ok] Wrote legacy enriched →", out_legacy.name)

# ------------------ tiny checks ------------------
def peek_missing(df, label):
    miss_st = int(df["street"].astype(str).str.strip().eq("").sum()) if "street" in df.columns else 0
    miss_pc = int(df["postal_code"].astype(str).str.strip().eq("").sum()) if "postal_code" in df.columns else 0
    miss_di = int(df["district"].astype(str).str.strip().eq("").sum()) if "district" in df.columns else 0
    print(f"[report] {label} missing → street:{miss_st}  postal_code:{miss_pc}  district:{miss_di}")

peek_missing(osm, "OSM")
peek_missing(legacy, "LEGACY")




[info] OSM: rows needing street/postcode: 14




[info] LEGACY: rows needing street/postcode: 3
[ok] Updated legacy file → legacy_enriched_with_lor.csv
[report] OSM missing → street:0  postal_code:0  district:0
[report] LEGACY missing → street:0  postal_code:0  district:0




### 7.2 Fill OSM-side addresses in legacy_enrichment_list.csv

In [15]:
# --- Fill street_o / postal_code_o in legacy_enrichment_list.csv via reverse geocoding ---

import pandas as pd

# If you just ran the previous cell, enrich_table_inplace + cache + reverse are already defined.
# We only need to call it pointing at the OSM-side columns.

lxl_path = "legacy_enrichment_list.csv"
lxl = pd.read_csv(lxl_path)

# Count blanks before
def _blanks(s):
    return int(s.astype(str).str.strip().eq("").sum())

before_st = _blanks(lxl["street_o"]) if "street_o" in lxl.columns else None
before_pc = _blanks(lxl["postal_code_o"]) if "postal_code_o" in lxl.columns else None

# Enrich ONLY the OSM side (suffix _o). We pass multiple candidate names to be robust.
enrich_table_inplace(
    lxl,
    lat_candidates=("latitude_o","lat_o"),
    lon_candidates=("longitude_o","lon_o","lng_o"),
    street_col="street_o",
    post_col="postal_code_o",
    sanity_label="LEGACY_ENRICHMENT_LIST (OSM side)"
)

# Tidy postcode type
if "postal_code_o" in lxl.columns:
    lxl["postal_code_o"] = lxl["postal_code_o"].astype(str).str.replace(r"\.0$", "", regex=True)  # strip only a trailing ".0"

# Save back
lxl.to_csv(lxl_path, index=False)

# Report after
after_st = _blanks(lxl["street_o"]) if "street_o" in lxl.columns else None
after_pc = _blanks(lxl["postal_code_o"]) if "postal_code_o" in lxl.columns else None

print(f"[ok] legacy_enrichment_list updated → {lxl_path}")
if before_st is not None and after_st is not None:
    print(f"  street_o blanks:      {before_st} → {after_st}")
if before_pc is not None and after_pc is not None:
    print(f"  postal_code_o blanks: {before_pc} → {after_pc}")




[info] LEGACY_ENRICHMENT_LIST (OSM side): rows needing street/postcode: 13
[ok] legacy_enrichment_list updated → legacy_enrichment_list.csv
  street_o blanks:      0 → 0
  postal_code_o blanks: 0 → 0





## **8) Build master → pools_master_minimal.csv**

**Why:** merge legacy rows and OSM-only rows into one clean, minimal table that carries the two enrichment tasks end-to-end:
- ✅ District IDs fixed to your canonical codes.
- ✅ ortsteil (neighborhood) + ortsteil_id joined from LOR polygons.

**What we ensure:**
- **pool_id** present for every row (legacy preserved; OSM generated deterministically).
- **district_id** stays aligned with the 1100… codes.
- **ortsteil** + **ortsteil_id** present for all rows that spatially fall into an LOR polygon.
- `open_all_year` defaults to False when not clearly true-ish.
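
One thing that can undermine the ID guarantees above is numeric round-tripping: the QC output recorded later in this notebook shows values such as `11012012.0` and `909.0`, while the checks expect 8-digit and 4-digit strings. A minimal sketch of a normalizer follows. `normalize_id` is a hypothetical helper, not part of the notebook, and it assumes zero-padding is the intended form (so `909.0` becomes `0909`).

```python
import pandas as pd

def normalize_id(series: pd.Series, width: int) -> pd.Series:
    """Return digit-only, zero-padded ID strings; blanks and non-numerics become <NA>."""
    num = pd.to_numeric(series, errors="coerce")   # "11012012.0" -> 11012012.0, "" -> NaN
    ints = num.round().astype("Int64")             # nullable integers keep missing values
    return ints.astype("string").str.zfill(width)  # 909 -> "0909" when width=4

raw_district = pd.Series(["11012012.0", 11009009.0, None, ""])
raw_ortsteil = pd.Series(["1208.0", 909.0, "704"])

print(normalize_id(raw_district, 8).tolist())
print(normalize_id(raw_ortsteil, 4).tolist())
```

Applied to `district_id` and `neighborhood_id` just before the master table is written, a step like this would keep the CSV free of `.0` suffixes; whether 4-digit padding matches the LOR scheme should be confirmed against the source polygons.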


In [16]:
# — Build final master (legacy + legacy_enrichment_list + osm_public_named) — self-contained & neighborhood-ready
import pandas as pd
import numpy as np
from pathlib import Path
import unicodedata, re

# ---------- robust path setup (works even if Config cell wasn't run) ----------
def _p(x): return Path(str(x)).expanduser().resolve()

LEGACY_ENRICHED_CSV = globals().get("LEGACY_ENRICHED_CSV", _p("legacy_enriched_with_lor.csv"))
LEGACY_MAIN_CSV     = globals().get("LEGACY_MAIN_CSV",     _p("berlin_pools_final_dataset.csv"))
LEGACY_ALT_CSV      = globals().get("LEGACY_ALT_CSV",      _p("pools_data_cleaned.csv"))
OSM_PUBLIC_NAMED_CSV= globals().get("OSM_PUBLIC_NAMED_CSV",_p("osm_public_named.csv"))
ENRICH_LIST_PATH    = _p("legacy_enrichment_list.csv")

def _first_existing(*paths: Path) -> Path | None:
    for p in paths:
        if p and isinstance(p, Path) and p.exists():
            return p
    return None

LEGACY_IN = _first_existing(LEGACY_ENRICHED_CSV, LEGACY_MAIN_CSV, LEGACY_ALT_CSV)
if not LEGACY_IN:
    raise FileNotFoundError(
        "No legacy CSV found. Expected one of: "
        f"{LEGACY_ENRICHED_CSV.name}, {LEGACY_MAIN_CSV.name}, {LEGACY_ALT_CSV.name}"
    )

if not Path(OSM_PUBLIC_NAMED_CSV).exists():
    raise FileNotFoundError(f"OSM named CSV not found: {OSM_PUBLIC_NAMED_CSV}")

print(f"[info] Using legacy (enriched if available): {Path(LEGACY_IN).name}")
print(f"[info] Using OSM (public_named):            {Path(OSM_PUBLIC_NAMED_CSV).name}")
print(f"[info] Using enrichment list (if present):  {ENRICH_LIST_PATH.name} (exists={ENRICH_LIST_PATH.exists()})")

legacy    = pd.read_csv(LEGACY_IN)
osm_named = pd.read_csv(OSM_PUBLIC_NAMED_CSV)

# ---------- helper utilities ----------
def pickcol(df, *options):
    for c in options:
        if c in df.columns:
            return c
    return None

def normalize_name(s: str) -> str:
    s = "" if pd.isna(s) else str(s)
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    s = re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
    return re.sub(r"\s+", " ", s).strip()

# ---------- optional: enrichment from legacy_enrichment_list.csv ----------
if ENRICH_LIST_PATH.exists():
    enrich = pd.read_csv(ENRICH_LIST_PATH)
    legacy_name_col = pickcol(legacy, "name","pool_name") or "name"
    legacy["_name_norm"] = legacy[legacy_name_col].astype(str).map(normalize_name)

    # Be defensive: if name_l column is missing, skip enrichment safely
    if "name_l" in enrich.columns:
        enrich["_name_norm"] = enrich["name_l"].astype(str).map(normalize_name)
        osm_add_cols = [c for c in ["website_o","opening_hours_o","wheelchair_o","phone_o","street_o","postal_code_o"] if c in enrich.columns]
        enrich_small  = enrich[["_name_norm"] + osm_add_cols].drop_duplicates("_name_norm")
        legacy_enriched = legacy.merge(enrich_small, on="_name_norm", how="left")

        # Fill legacy columns from OSM-side columns only when legacy value is blank
        fill_map = {
            "website": ("website_o",),
            "opening_hours": ("opening_hours_o",),
            "wheelchair": ("wheelchair_o",),
            "street": ("street_o",),
            "postal_code": ("postal_code_o",),
        }
        for lcol, ocands in fill_map.items():
            if lcol in legacy_enriched.columns:
                for ocol in ocands:
                    if ocol in legacy_enriched.columns:
                        legacy_enriched[lcol] = legacy_enriched[lcol].where(
                            legacy_enriched[lcol].notna() & (legacy_enriched[lcol].astype(str).str.strip() != ""),
                            legacy_enriched[ocol]
                        )
        legacy_enriched.drop(columns=[c for c in ["_name_norm"] + osm_add_cols if c in legacy_enriched.columns],
                             inplace=True, errors="ignore")
    else:
        print("[info] enrichment list present but 'name_l' missing — skipping enrich merge.")
        legacy_enriched = legacy.copy()
else:
    legacy_enriched = legacy.copy()

# ---------- ID generation for OSM-only rows ----------
def make_osm_pool_id(df):
    id_col = pickcol(df, "source_id","osm_id","element_id","osmid","@id","id")
    if id_col:
        base = df[id_col].fillna("").astype(str).str.strip()
    else:
        base = pd.Series([""] * len(df), index=df.index)
    latc = pickcol(df, "lat","latitude")
    lonc = pickcol(df, "lon","longitude")
    if latc and lonc:
        coord_sur = df[latc].round(6).astype(str) + "_" + df[lonc].round(6).astype(str)
    else:
        # "sur_" prefix so the downstream OSM_sur_### patterns recognise surrogate IDs
        coord_sur = pd.Series([f"sur_{i:05d}" for i in range(len(df))], index=df.index)
    base = np.where(base == "", coord_sur, base)
    return "OSM_" + pd.Series(base, index=df.index).astype(str)

# Helpers to read neighborhood fields (support old/new names)
def get_neighborhood_cols(df):
    neighborhood    = df.get("neighborhood", df.get("ortsteil", ""))
    neighborhood_id = df.get("neighborhood_id", df.get("ortsteil_id", ""))
    return neighborhood, neighborhood_id

# ---------- legacy rows (already LOR-enriched) ----------
legacy_nei, legacy_nei_id = get_neighborhood_cols(legacy_enriched)
legacy_final = pd.DataFrame({
    "pool_id":         legacy_enriched.get(pickcol(legacy_enriched, "pool_id","id","legacy_id"), "L_" + legacy_enriched.index.astype(str)),
    "district":        legacy_enriched.get("district_label", legacy_enriched.get("district","")),
    "district_id":     legacy_enriched.get("district_id", ""),
    "neighborhood":    legacy_nei,
    "neighborhood_id": legacy_nei_id,
    "name":            legacy_enriched.get(pickcol(legacy_enriched, "name","pool_name"), ""),
    "pool_type":       legacy_enriched.get(pickcol(legacy_enriched, "pool_type","type"), ""),
    "street":          legacy_enriched.get(pickcol(legacy_enriched, "street","address"), ""),
    "postal_code":     legacy_enriched.get(pickcol(legacy_enriched, "postal_code","postcode","zip"), ""),
    "latitude":        pd.to_numeric(legacy_enriched.get(pickcol(legacy_enriched, "latitude","lat"), np.nan), errors="coerce"),
    "longitude":       pd.to_numeric(legacy_enriched.get(pickcol(legacy_enriched, "longitude","lon"), np.nan), errors="coerce"),
    "open_all_year":   legacy_enriched.get(pickcol(legacy_enriched, "open_all_year","open_all_year_round","open_year_round"), pd.NA),
})

# ---------- OSM-only (unmatched) rows — should already be LOR-enriched ----------
osm_nei, osm_nei_id = get_neighborhood_cols(osm_named)
osm_final = pd.DataFrame({
    "pool_id":         make_osm_pool_id(osm_named),
    "district":        osm_named.get("district_label", osm_named.get("district","")),
    "district_id":     osm_named.get("district_id", ""),
    "neighborhood":    osm_nei,
    "neighborhood_id": osm_nei_id,
    "name":            osm_named.get("name", ""),
    "pool_type":       osm_named.get("pool_type", osm_named.get("leisure", "")),
    "street":          osm_named.get("street", osm_named.get("addr_street", "")),
    "postal_code":     osm_named.get("postal_code", osm_named.get("addr_postcode", "")),
    "latitude":        pd.to_numeric(osm_named.get(pickcol(osm_named, "latitude","lat"), np.nan), errors="coerce"),
    "longitude":       pd.to_numeric(osm_named.get(pickcol(osm_named, "longitude","lon"), np.nan), errors="coerce"),
    "open_all_year":   pd.NA,
})

# ---------- combine & tidy ----------
final_master = pd.concat([legacy_final, osm_final], ignore_index=True)

for c in ["pool_id","district","district_id","neighborhood","neighborhood_id","name","pool_type","street","postal_code"]:
    if c in final_master.columns:
        final_master[c] = final_master[c].astype("string")
for c in ["latitude","longitude"]:
    if c in final_master.columns:
        final_master[c] = pd.to_numeric(final_master[c], errors="coerce")

if "open_all_year" in final_master.columns:
    s = final_master["open_all_year"].astype(str).str.strip().str.lower()
    trueish = {"true","1","yes","y","ja"}
    final_master["open_all_year"] = s.isin(trueish)

dup_ct = int(final_master["pool_id"].duplicated().sum())
if dup_ct:
    print(f"[warn] duplicate pool_id found: {dup_ct} → keeping first")
    final_master = final_master.drop_duplicates(subset=["pool_id"], keep="first")

final_master = final_master[[
    "pool_id","district","district_id","neighborhood","neighborhood_id",
    "name","pool_type","street","postal_code","latitude","longitude","open_all_year"
]]
final_master.to_csv("pools_master_minimal.csv", index=False)
print("[ok] Wrote pools_master_minimal.csv with", len(final_master), "rows")
print({
    "legacy_rows_in": len(legacy),
    "osm_public_named_in": len(osm_named),
    "final_rows_out": len(final_master)
})



[info] Using legacy (enriched if available): legacy_enriched_with_lor.csv
[info] Using OSM (public_named):            osm_public_named.csv
[info] Using enrichment list (if present):  legacy_enrichment_list.csv (exists=True)
[ok] Wrote pools_master_minimal.csv with 167 rows
{'legacy_rows_in': 144, 'osm_public_named_in': 23, 'final_rows_out': 167}


### 8.1) Final tweak — renumber OSM-generated pool_ids to 1..400

In [17]:
# Replace surrogate/coord OSM IDs with consecutive numbers 1..400 (no extra columns)
# Targets:
#   • OSM_sur_###              (surrogate)
#   • OSM_<lat>_<lon>          (coordinate-style)
import pandas as pd, re
from pathlib import Path

PATH = Path("pools_master_minimal.csv")
MIN_NUM, MAX_NUM = 1, 400   # adjust if you want a different range

df = pd.read_csv(PATH, dtype={"pool_id": "string"})
pool = df["pool_id"].astype("string")

# Patterns
coord_pat = re.compile(r"^OSM_-?\d+(?:\.\d+)?_-?\d+(?:\.\d+)?$")
sur_pat   = re.compile(r"^OSM_sur_\d+$")

# Build target list in order of first appearance
is_coord = pool.str.match(coord_pat, na=False)
is_sur   = pool.str.match(sur_pat, na=False)
targets  = pool[is_coord | is_sur].dropna().astype(str).unique().tolist()

if not targets:
    print("[info] No OSM surrogate/coord IDs found; nothing changed.")
else:
    # Collision-avoidance: keep any existing non-target IDs
    existing = set(pool[~pool.isin(targets)].astype(str).tolist())

    # Available numbers in the chosen range that aren’t already used
    available = [str(i) for i in range(MIN_NUM, MAX_NUM + 1) if str(i) not in existing]

    if len(available) < len(targets):
        raise RuntimeError(
            f"Not enough free numeric IDs in {MIN_NUM}..{MAX_NUM}. "
            f"Need {len(targets)}, have {len(available)}. Increase MAX_NUM or free up IDs."
        )

    # Deterministic mapping (first appearance wins)
    mapping = {old: new for old, new in zip(targets, available[:len(targets)])}

    # Apply mapping
    df["pool_id"] = df["pool_id"].astype("string").map(lambda x: mapping.get(str(x), str(x)))
    df.to_csv(PATH, index=False)

    # Summary
    coord_ct = int(is_coord.sum())
    sur_ct   = int(is_sur.sum())
    print(f"[ok] Renumbered {len(mapping)} OSM IDs → {MIN_NUM}..{MAX_NUM} with no collisions.")
    print(f"     (coord-style: {coord_ct}, surrogate: {sur_ct})")
    print("Examples:", dict(list(mapping.items())[:5]))




[ok] Renumbered 23 OSM IDs → 1..400 with no collisions.
     (coord-style: 23, surrogate: 0)
Examples: {'OSM_52.459526_13.31501': '1', 'OSM_52.617812_13.33578': '2', 'OSM_52.454536_13.329503': '3', 'OSM_52.546965_13.555637': '4', 'OSM_52.427549_13.470163': '5'}



## **9) Validation & Outputs**

**Why:** quick QC on the final pools_master_minimal.csv to confirm IDs, geography, and basic fields look sane before hand-off.
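
A caveat when reading the results of this check: `pd.read_csv` infers dtypes, so an ID column with any blank cell is loaded as float, and leading zeros are dropped from integer-like columns. The sketch below uses two made-up rows (not taken from the dataset) to show the difference between the default read and an explicit string dtype. It does not establish which of these effects, if any, explains the format counts in the recorded output.

```python
import io
import pandas as pd

# Toy CSV for illustration only: one blank district_id, one zero-padded neighborhood_id.
csv_text = (
    "pool_id,district_id,neighborhood_id\n"
    "472,11012012,1208\n"
    "473,,0909\n"
)

inferred = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"district_id": "string", "neighborhood_id": "string"},
)

# Default inference: float column, and the blank shows up as "nan" after astype(str).
print(inferred["district_id"].astype(str).tolist())
# Explicit string dtype: digits preserved, blank stays missing.
print(as_text["district_id"].tolist())
# The leading zero is lost under inference and kept under the string dtype.
print(inferred["neighborhood_id"].tolist(), as_text["neighborhood_id"].tolist())
```

Note also that after `astype(str)` a missing value reads as `"nan"`, not `""`, so an `.eq("")` test will not count it as missing.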


In [18]:
# --- Quick QC on pools_master_minimal.csv (8-digit district_id + neighborhood) ---
import pandas as pd, numpy as np, re
from pathlib import Path

PATH = Path("pools_master_minimal.csv")
df = pd.read_csv(PATH)

print("Rows:", len(df))
print("Missing pool_id:", int(df["pool_id"].astype(str).str.strip().eq("").sum()))

# --- district_id: 8-digit codes, membership + label consistency
# Prefer the mapping we defined earlier; otherwise fall back to a literal set
if 'DISTRICT_ID_MAP' in globals():
    allowed_ids = set(DISTRICT_ID_MAP.values())
    id_to_label = {v: k for k, v in DISTRICT_ID_MAP.items()}
else:
    allowed_ids = {
        "11001001","11002002","11003003","11004004","11005005","11006006",
        "11007007","11008008","11009009","11010010","11011011","11012012"
    }
    id_to_label = {
        "11001001":"Mitte","11002002":"Friedrichshain-Kreuzberg","11003003":"Pankow",
        "11004004":"Charlottenburg-Wilmersdorf","11005005":"Spandau","11006006":"Steglitz-Zehlendorf",
        "11007007":"Tempelhof-Schöneberg","11008008":"Neukölln","11009009":"Treptow-Köpenick",
        "11010010":"Marzahn-Hellersdorf","11011011":"Lichtenberg","11012012":"Reinickendorf",
    }

did = df["district_id"].astype(str).str.strip()
print("Missing district_id:", int(did.eq("").sum()))

bad_format_mask = ~did.str.fullmatch(r"\d{8}") & did.ne("")
print("Invalid district_id format (not 8 digits):", int(bad_format_mask.sum()))

bad_value_mask = ~did.isin(list(allowed_ids)) & did.ne("")
print("district_id not in allowed set:", int(bad_value_mask.sum()))

# district label vs id consistency (if present)
if "district" in df.columns:
    dlabel = df["district"].astype(str).str.strip()
    expected_from_id = did.map(id_to_label).fillna("")
    mism_label = (dlabel.ne("")) & (expected_from_id.ne("")) & (dlabel != expected_from_id)
    print("district vs district_id label mismatches:", int(mism_label.sum()))
else:
    mism_label = pd.Series(False, index=df.index)

# --- neighborhood checks (renamed from ortsteil)
neigh_missing = int(df.get("neighborhood", pd.Series([""]*len(df))).astype(str).str.strip().eq("").sum())
print("Missing neighborhood:", neigh_missing)

neigh_id = df.get("neighborhood_id", pd.Series([""]*len(df))).astype(str).str.strip()
bad_neigh_id = (neigh_id != "") & ~neigh_id.str.fullmatch(r"\d{4}")
print("Invalid neighborhood_id format (not 4 digits):", int(bad_neigh_id.sum()))

# --- basic uniqueness & formatting checks for pool_id
dupe_ids = int(df["pool_id"].duplicated().sum())
print("Duplicate pool_id:", dupe_ids)

# leftovers: surrogate or coordinate-style OSM ids
pool = df["pool_id"].astype(str)
left_sur = pool.str.fullmatch(r"OSM_sur_\d+").sum()
left_coord = pool.str.fullmatch(r"OSM_-?\d+(?:\.\d+)?_-?\d+(?:\.\d+)?").sum()
print("Leftover surrogate IDs (OSM_sur_###):", int(left_sur))
print("Leftover coord-style IDs (OSM_<lat>_<lon>):", int(left_coord))

# --- postal code: Berlin 10xxx–14xxx (allow empty)
pc = df["postal_code"].astype(str).str.strip()
bad_pc_mask = (pc != "") & ~pc.str.fullmatch(r"1[0-4]\d{3}")
print("Suspicious postal_code (non-empty but not 10xxx–14xxx):", int(bad_pc_mask.sum()))

# --- coordinates sanity
lat = pd.to_numeric(df["latitude"], errors="coerce")
lon = pd.to_numeric(df["longitude"], errors="coerce")
missing_lat = int(lat.isna().sum())
missing_lon = int(lon.isna().sum())
out_lat = int((~lat.between(-90, 90)).sum())
out_lon = int((~lon.between(-180, 180)).sum())
print("Missing latitude:", missing_lat, "| Missing longitude:", missing_lon)
print("Out-of-range latitude:", out_lat, "| Out-of-range longitude:", out_lon)

# --- open_all_year should be boolean/text. Show distribution.
print("\nopen_all_year dtype:", df["open_all_year"].dtype)
print(df["open_all_year"].value_counts(dropna=False))

# --- Peek at problems (up to 5 rows each) ---
def peek(mask, cols, title):
    m = df[mask]
    if not m.empty:
        print(f"\n{title} (showing up to 5):")
        display(m[cols].head(5))

core_cols = ["pool_id","district","district_id","neighborhood","neighborhood_id",
             "name","street","postal_code","latitude","longitude","open_all_year"]

peek(did.eq(""), core_cols, "Rows with missing district_id")
peek(bad_format_mask, core_cols, "Rows with invalid district_id format")
peek(bad_value_mask, core_cols, "Rows with district_id not in allowed set")
peek(mism_label, core_cols, "district label vs id mismatches")

peek(df.get("neighborhood","").astype(str).str.strip().eq(""), core_cols, "Rows with missing neighborhood")
peek(bad_neigh_id, core_cols, "Rows with bad neighborhood_id")

peek(pool.str.fullmatch(r"OSM_sur_\d+"), core_cols, "Leftover surrogate pool_id")
peek(pool.str.fullmatch(r"OSM_-?\d+(?:\.\d+)?_-?\d+(?:\.\d+)?"), core_cols, "Leftover coord-style pool_id")

peek(bad_pc_mask, core_cols, "Rows with suspicious postal_code")
peek(lat.isna() | lon.isna(), core_cols, "Rows with missing coordinates")
peek((lat.notna() & ~lat.between(-90, 90)) | (lon.notna() & ~lon.between(-180, 180)), core_cols, "Rows with out-of-range coordinates")

print("\nSample (top 10):")
display(df.head(10))





Rows: 167
Missing pool_id: 0
Missing district_id: 0
Invalid district_id format (not 8 digits): 167
district_id not in allowed set: 167
district vs district_id label mismatches: 0
Missing neighborhood: 0
Invalid neighborhood_id format (not 4 digits): 167
Duplicate pool_id: 0
Leftover surrogate IDs (OSM_sur_###): 0
Leftover coord-style IDs (OSM_<lat>_<lon>): 0
Suspicious postal_code (non-empty but not 10xxx–14xxx): 0
Missing latitude: 0 | Missing longitude: 0
Out-of-range latitude: 0 | Out-of-range longitude: 0

open_all_year dtype: bool
open_all_year
False    88
True     79
Name: count, dtype: int64

Rows with invalid district_id format (showing up to 5):


Unnamed: 0,pool_id,district,district_id,neighborhood,neighborhood_id,name,street,postal_code,latitude,longitude,open_all_year
0,472,Reinickendorf,11012012.0,Lübars,1208.0,Strandbad Lübars,Am Freibad 9,13469,52.61824,13.33519,False
1,473,Treptow-Köpenick,11009009.0,Oberschöneweide,909.0,Kleine Schwimmhalle Wuhlheide,An der Wuhlheide 161,12459,52.45993,13.53965,True
2,474,Tempelhof-Schöneberg,11007007.0,Mariendorf,704.0,Kombibad Mariendorf,Ankogelweg 95,12107,52.41972,13.40154,True
3,475,Lichtenberg,11011011.0,Fennpfuhl,1111.0,Schwimmhalle Anton-Saefkow-Platz,Anton-Saefkow-Platz 1,10369,52.53093,13.47184,True
4,476,Friedrichshain-Kreuzberg,11002002.0,Kreuzberg,202.0,Stadtbad Kreuzberg - Baerwaldbad,Baerwaldstraße 64-67,10961,52.49451,13.40432,True



Rows with district_id not in allowed set (showing up to 5):


Unnamed: 0,pool_id,district,district_id,neighborhood,neighborhood_id,name,street,postal_code,latitude,longitude,open_all_year
0,472,Reinickendorf,11012012.0,Lübars,1208.0,Strandbad Lübars,Am Freibad 9,13469,52.61824,13.33519,False
1,473,Treptow-Köpenick,11009009.0,Oberschöneweide,909.0,Kleine Schwimmhalle Wuhlheide,An der Wuhlheide 161,12459,52.45993,13.53965,True
2,474,Tempelhof-Schöneberg,11007007.0,Mariendorf,704.0,Kombibad Mariendorf,Ankogelweg 95,12107,52.41972,13.40154,True
3,475,Lichtenberg,11011011.0,Fennpfuhl,1111.0,Schwimmhalle Anton-Saefkow-Platz,Anton-Saefkow-Platz 1,10369,52.53093,13.47184,True
4,476,Friedrichshain-Kreuzberg,11002002.0,Kreuzberg,202.0,Stadtbad Kreuzberg - Baerwaldbad,Baerwaldstraße 64-67,10961,52.49451,13.40432,True



Rows with bad neighborhood_id (showing up to 5):


Unnamed: 0,pool_id,district,district_id,neighborhood,neighborhood_id,name,street,postal_code,latitude,longitude,open_all_year
0,472,Reinickendorf,11012012.0,Lübars,1208.0,Strandbad Lübars,Am Freibad 9,13469,52.61824,13.33519,False
1,473,Treptow-Köpenick,11009009.0,Oberschöneweide,909.0,Kleine Schwimmhalle Wuhlheide,An der Wuhlheide 161,12459,52.45993,13.53965,True
2,474,Tempelhof-Schöneberg,11007007.0,Mariendorf,704.0,Kombibad Mariendorf,Ankogelweg 95,12107,52.41972,13.40154,True
3,475,Lichtenberg,11011011.0,Fennpfuhl,1111.0,Schwimmhalle Anton-Saefkow-Platz,Anton-Saefkow-Platz 1,10369,52.53093,13.47184,True
4,476,Friedrichshain-Kreuzberg,11002002.0,Kreuzberg,202.0,Stadtbad Kreuzberg - Baerwaldbad,Baerwaldstraße 64-67,10961,52.49451,13.40432,True



Sample (top 10):


Unnamed: 0,pool_id,district,district_id,neighborhood,neighborhood_id,name,pool_type,street,postal_code,latitude,longitude,open_all_year
0,472,Reinickendorf,11012012.0,Lübars,1208.0,Strandbad Lübars,Naturbad,Am Freibad 9,13469,52.61824,13.33519,False
1,473,Treptow-Köpenick,11009009.0,Oberschöneweide,909.0,Kleine Schwimmhalle Wuhlheide,Hallenbad,An der Wuhlheide 161,12459,52.45993,13.53965,True
2,474,Tempelhof-Schöneberg,11007007.0,Mariendorf,704.0,Kombibad Mariendorf,Kombibad,Ankogelweg 95,12107,52.41972,13.40154,True
3,475,Lichtenberg,11011011.0,Fennpfuhl,1111.0,Schwimmhalle Anton-Saefkow-Platz,Hallenbad,Anton-Saefkow-Platz 1,10369,52.53093,13.47184,True
4,476,Friedrichshain-Kreuzberg,11002002.0,Kreuzberg,202.0,Stadtbad Kreuzberg - Baerwaldbad,Hallenbad,Baerwaldstraße 64-67,10961,52.49451,13.40432,True
5,477,Pankow,11003003.0,Weißensee,302.0,Strandbad am Weißen See,Naturbad,Berliner Allee 155,13086,52.55396,13.46583,False
6,478,Spandau,11005005.0,Staaken,504.0,Sommerbad Staaken-West,Freibad,Brunsbüttler Damm 443,13591,52.53386,13.13123,False
7,479,Marzahn-Hellersdorf,11010010.0,Hellersdorf,1005.0,Schwimmhalle Kaulsdorf,Schulbad,Clara-Zetkin-Weg 13,12619,52.5208,13.58541,True
8,480,Neukölln,11008008.0,Neukölln,801.0,Sommerbad Neukölln,Freibad,Columbiadamm 160-180,10965,52.48025,13.41595,False
9,481,Steglitz-Zehlendorf,11006006.0,Lichterfelde,602.0,Schwimmhalle Finckensteinallee,Hallenbad,Finckensteinallee 73,12205,52.43225,13.29791,False


# **10) Finalize schema & save (from CSV)**

**Why:** lock the final table into a clean, predictable schema (columns present, types coerced, IDs padded) before hand-off or DB import.
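
The validation output above shows IDs such as `11012012.0` and `909.0`, most likely because the CSV was read back without explicit dtypes and the ID columns were parsed as floats. The snippet below is a minimal sketch of the padding step in isolation. The helper name `normalize_id` is hypothetical and not part of the notebook.

```python
import pandas as pd

def normalize_id(series: pd.Series, width: int) -> pd.Series:
    """Turn float-like IDs such as '909.0' into zero-padded text ('0909')."""
    s = series.astype("string").fillna("").str.strip()
    # Remove only a trailing ".0"; the anchor keeps the rest of the value intact.
    s = s.str.replace(r"\.0$", "", regex=True)
    # Pad non-empty values to the target width; blanks stay blank.
    return s.where(s.eq(""), s.str.zfill(width)).astype("object")

# Sample values taken from the validation output, plus one blank.
district_ids = normalize_id(pd.Series(["11012012.0", "11009009.0", ""]), 8)
neighborhood_ids = normalize_id(pd.Series(["1208.0", "909.0", "0704"]), 4)
print(district_ids.tolist())
print(neighborhood_ids.tolist())
```

Applying the same helper to both ID columns keeps the 8-digit and 4-digit rules in one place, which may be easier to maintain than two copies of the replace-and-pad chain.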

In [19]:
import pandas as pd

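# Read ID-like columns as text so leading zeros survive and IDs are not parsed as floats
id_dtypes = {c: "string" for c in
             ["pool_id", "postal_code", "district_id", "neighborhood_id", "ortsteil_id"]}
df = pd.read_csv("pools_master_minimal.csv", dtype=id_dtypes)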

# --- Backward-compat: if old columns exist, rename to new ones ---
rename_map = {}
if "ortsteil" in df.columns and "neighborhood" not in df.columns:
    rename_map["ortsteil"] = "neighborhood"
if "ortsteil_id" in df.columns and "neighborhood_id" not in df.columns:
    rename_map["ortsteil_id"] = "neighborhood_id"
if rename_map:
    df = df.rename(columns=rename_map)

# --- required columns (now includes district, neighborhood, neighborhood_id) ---
need_cols = [
    "pool_id","name","pool_type","street","postal_code",
    "latitude","longitude","open_all_year",
    "district","district_id","neighborhood","neighborhood_id",
]

# ensure all columns exist
for c in need_cols:
    if c not in df.columns:
        df[c] = pd.NA if c in ["latitude","longitude"] else ""

# coords → float64
df["latitude"]  = pd.to_numeric(df["latitude"], errors="coerce").astype("float64")
df["longitude"] = pd.to_numeric(df["longitude"], errors="coerce").astype("float64")

# object/text columns (same style as your code, but null-safe)
obj_cols = [
    "pool_id","name","pool_type","street","postal_code",
    "open_all_year","district","district_id","neighborhood","neighborhood_id"
]
for c in obj_cols:
    df[c] = df[c].astype("string").fillna("").astype("object")

# default blanks for open_all_year -> "False" (keep as text, same as your original)
df.loc[df["open_all_year"].astype(str).str.strip().eq(""), "open_all_year"] = "False"

# pad IDs only when non-empty; strip an accidental trailing ".0"
# (anchored regex, so a ".0" elsewhere in the value is never touched)
mask_did = df["district_id"].astype(str).str.strip().ne("")
df.loc[mask_did, "district_id"] = (
    df.loc[mask_did, "district_id"].astype(str).str.strip()
      .str.replace(r"\.0$", "", regex=True).str.zfill(8)
)

mask_nei = df["neighborhood_id"].astype(str).str.strip().ne("")
df.loc[mask_nei, "neighborhood_id"] = (
    df.loc[mask_nei, "neighborhood_id"].astype(str).str.strip()
      .str.replace(r"\.0$", "", regex=True).str.zfill(4)
)

# reorder & save
df = df[need_cols]
df.to_csv("pools_master_minimal.csv", index=False)

print(df.dtypes)



pool_id             object
name                object
pool_type           object
street              object
postal_code         object
latitude           float64
longitude          float64
open_all_year       object
district            object
district_id         object
neighborhood        object
neighborhood_id     object
dtype: object


# **10.2 Populate pools_refactored in Postgres** 

In [25]:
# 10.2 — Create + populate berlin_source_data.pools_refactored

import pandas as pd
from sqlalchemy import create_engine, text

# --- Load your final CSV with stable dtypes ---
csv_path = "pools_master_minimal.csv"
df = pd.read_csv(
    csv_path,
    dtype={
        "pool_id": "string",
        "name": "string",
        "pool_type": "string",
        "street": "string",
        "postal_code": "string",
        "district": "string",
        "district_id": "string",
        "neighborhood": "string",      # <-- new
        "neighborhood_id": "string",   # <-- new
    }
)

# Backward-compat (if older CSV still had 'ortsteil*')
if "neighborhood" not in df.columns and "ortsteil" in df.columns:
    df["neighborhood"] = df["ortsteil"].astype("string")
if "neighborhood_id" not in df.columns and "ortsteil_id" in df.columns:
    df["neighborhood_id"] = df["ortsteil_id"].astype("string")

# Coerce coords + boolean (tolerate missing columns)
for col in ("latitude", "longitude"):
    df[col] = pd.to_numeric(df[col], errors="coerce") if col in df.columns else float("nan")

trueish = {"true","1","yes","y","ja","wahr"}
if "open_all_year" in df.columns:
    s = df["open_all_year"].astype(str).str.strip().str.lower()
    df["open_all_year"] = s.isin(trueish)
else:
    # a scalar default has no .astype(), so handle the missing column explicitly
    df["open_all_year"] = False

# Drop duplicate PKs to avoid insert errors
if "pool_id" in df.columns:
    before = len(df)
    df = df.drop_duplicates(subset=["pool_id"], keep="first")
    after = len(df)
    if before != after:
        print(f"[info] dropped {before-after} duplicate rows by pool_id")

# (Optional) peek max string lengths to spot columns that might need larger VARCHARs
print("\n[max string lengths]")
for col in ["pool_id","name","pool_type","street","postal_code",
            "district","district_id","neighborhood","neighborhood_id"]:
    if col in df.columns:
        m = int(df[col].fillna("").astype(str).str.len().max())
        print(f"  {col:16s} → {m}")

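# --- Connect to Postgres ---
# Credentials are read from the environment instead of being hard-coded in the notebook.
# PGUSER / PGPASSWORD / PGHOST / PGPORT / PGDATABASE follow the libpq naming convention;
# setting them before launching the notebook is an assumption of this cell.
import os
from sqlalchemy.engine import URL

user_name = os.environ["PGUSER"]
password = os.environ["PGPASSWORD"]
host = os.environ.get("PGHOST", "localhost")
port = int(os.environ.get("PGPORT", "5433"))
database = os.environ.get("PGDATABASE", "layereddb")
schema = 'berlin_source_data'

# Connect once the tunnel is open; URL.create escapes special characters in the password.
engine = create_engine(URL.create(
    "postgresql+psycopg2", username=user_name, password=password,
    host=host, port=port, database=database,
))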

# --- Create target table (if not exists) ---
create_table_query = f"""
CREATE SCHEMA IF NOT EXISTS {schema};

CREATE TABLE IF NOT EXISTS {schema}.pools_refactored (
    pool_id           VARCHAR(32) PRIMARY KEY,
    name              VARCHAR(200) NOT NULL,
    pool_type         VARCHAR(100),
    street            VARCHAR(200),
    postal_code       VARCHAR(10),
    latitude          DECIMAL(9,6),
    longitude         DECIMAL(9,6),
    open_all_year     BOOLEAN NOT NULL DEFAULT FALSE,
    district          VARCHAR(100),
    district_id       VARCHAR(8),
    neighborhood      VARCHAR(100),
    neighborhood_id   VARCHAR(4),
    CONSTRAINT pools_refactored_district_fk
        FOREIGN KEY (district_id)
        REFERENCES {schema}.districts(district_id)
        ON DELETE RESTRICT
        ON UPDATE CASCADE
);

CREATE INDEX IF NOT EXISTS pools_refactored_district_id_idx
  ON {schema}.pools_refactored(district_id);
CREATE INDEX IF NOT EXISTS pools_refactored_neighborhood_id_idx
  ON {schema}.pools_refactored(neighborhood_id);
"""

with engine.begin() as conn:
    # Run each DDL statement separately; none of them contains a literal ";"
    for stmt in create_table_query.split(";"):
        if stmt.strip():
            conn.execute(text(stmt))
print("Table 'pools_refactored' created or already exists.")

# --- Insert data ---
# If you want to fully refresh, uncomment:
# with engine.begin() as conn:
#     conn.execute(text(f"TRUNCATE TABLE {schema}.pools_refactored"))

df.to_sql(
    'pools_refactored',
    engine,
    schema=schema,
    if_exists='append',  # keep table, just insert rows
    index=False,
    method='multi',
    chunksize=2000,
)
print("DataFrame sent to PostgreSQL (.to_sql append).")




[max string lengths]
  pool_id          → 5
  name             → 49
  pool_type        → 21
  street           → 28
  postal_code      → 5
  district         → 26
  district_id      → 8
  neighborhood     → 20
  neighborhood_id  → 4
Table 'pools_refactored' created or already exists.
DataFrame sent to PostgreSQL (.to_sql append).
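
Because `pool_id` is the primary key and the load uses `if_exists='append'`, re-running the cell against an already populated table would raise a unique-constraint error. One possible way to make the load repeatable is a PostgreSQL `INSERT ... ON CONFLICT` upsert. The sketch below only builds the statement; the helper `build_upsert_sql` is hypothetical and not part of the notebook.

```python
def build_upsert_sql(table: str, columns: list[str], key: str) -> str:
    """Return an INSERT ... ON CONFLICT statement with named bind parameters."""
    col_list = ", ".join(columns)
    placeholders = ", ".join(f":{c}" for c in columns)
    # On a key collision, overwrite every non-key column with the incoming value.
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in columns if c != key)
    return (
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders}) "
        f"ON CONFLICT ({key}) DO UPDATE SET {updates}"
    )

upsert_sql = build_upsert_sql(
    "berlin_source_data.pools_refactored",
    ["pool_id", "name", "district_id"],
    key="pool_id",
)
print(upsert_sql)
```

With SQLAlchemy, the statement could then be run as `conn.execute(text(upsert_sql), df.to_dict(orient="records"))` inside `engine.begin()`. Missing values would need converting to `None` first, which this sketch does not handle.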


# **10.3 Quick verification query (top 10)**

In [26]:
with engine.connect() as conn:
    preview = pd.read_sql(
        text(f"""
            SELECT pool_id, name, district, district_id, neighborhood, neighborhood_id,
                   postal_code, latitude, longitude, open_all_year
            FROM {schema}.pools_refactored
            ORDER BY pool_id
            LIMIT 10
        """),
        conn
    )
preview

Unnamed: 0,pool_id,name,district,district_id,neighborhood,neighborhood_id,postal_code,latitude,longitude,open_all_year
0,1,Schwimmschule Wassermeloni,Steglitz-Zehlendorf,11006006,Steglitz,601,12165,52.459527,13.315011,False
1,10,Seebad Friedrichshagen,Treptow-Köpenick,11009009,Friedrichshagen,911,12587,52.44595,13.63062,False
2,10223,DRK Kliniken Berlin,Tempelhof-Schöneberg,11007007,Mariendorf,704,12109,52.439756,13.397093,False
3,10224,Wannseeschule,Steglitz-Zehlendorf,11006006,Wannsee,607,14109,52.430556,13.161794,False
4,1023,Strandbad Jungfernheide,Charlottenburg-Wilmersdorf,11004004,Charlottenburg-Nord,406,13629,52.54394,13.27431,False
5,10231,Evangelisches Hubertuskrankenhaus,Steglitz-Zehlendorf,11006006,Nikolassee,606,14129,52.431208,13.219199,True
6,10232,Schwimmbad Berlin Steglitz,Steglitz-Zehlendorf,11006006,Steglitz,601,12165,52.459351,13.314782,True
7,10238,Badestelle Reiherwerder am Forsthaus,Reinickendorf,11012012,Tegel,1202,13505,52.585398,13.25538,False
8,1024,Strandbad Halensee,Charlottenburg-Wilmersdorf,11004004,Grunewald,404,14193,52.49399,13.28308,False
9,1025,Stadtbad Charlottenburg - Alte Halle,Charlottenburg-Wilmersdorf,11004004,Charlottenburg,401,10585,52.51436,13.30956,True


# Berlin Pools — Refactoring Outcome (Legacy ↔︎ OSM)

**Final stance**  
- **OSM does not fully replace the legacy source today** (names too sparse; specs like length/depth largely missing).  
- **Best choice: Hybrid enrichment.** Keep **legacy** as the backbone; use **OSM** to **fill** contacts/websites (fallback), **opening_hours** (when present), **accessibility (*wheelchair*)**, and **address fixes**.  
- **Specs (length/depth/lanes)** stay **separate/parked** until we secure a reliable source.

**Coverage reality check (OSM, current snapshot)**  
- Most enrichment tags are empty or at most ~**1%** filled.  
- **wheelchair:** ~**14%** coverage → useful as an accessibility flag.  
- **website:** ~**9%** coverage, but **primary** website data comes from our original source ([baederleben.de](https://baederleben.de/abfragen/baeder-suche.php) → official Bäder pages); OSM is a **secondary fallback**.  
- Decision remains aligned with my colleague’s recommendation; **most relevant columns remain unchanged**.
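
Fill rates like the ones quoted above can be recomputed on each OSM refresh. The following is a minimal sketch, assuming the OSM export is loaded as a DataFrame with one column per tag. The toy data and the helper name `fill_rates` are hypothetical.

```python
import pandas as pd

def fill_rates(df: pd.DataFrame, cols: list[str]) -> pd.Series:
    """Percentage of rows with a non-blank value, per column."""
    # Columns absent from the export are skipped rather than raising.
    present = [c for c in cols if c in df.columns]
    filled = df[present].apply(
        lambda s: s.astype("string").fillna("").str.strip().ne("").astype(bool)
    )
    return (filled.mean() * 100).round(1)

# Hypothetical toy frame; real tag columns would come from the OSM export.
osm = pd.DataFrame({
    "wheelchair": ["yes", None, "", "no"],
    "website": [None, None, None, "https://example.org"],
})
rates = fill_rates(osm, ["wheelchair", "website", "opening_hours"])
print(rates)
```

Any non-blank value counts as filled here, so `wheelchair=no` still contributes to coverage.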

**What stays in the backbone (from legacy and lor_ortsteile.geojson)**  
- Canonical **name**, **pool_type**, **street**, **postal_code**, **district / neighborhood IDs**, and core **coordinates** (latitude/longitude).  
- Stable dtypes; normalized IDs; consistent Bezirk/Ortsteil mapping.

**What OSM adds (when present)**  
- **Website** (fallback only: used when legacy lacks a URL, or as confirmation when both sources agree).  
- **opening_hours** (store as provided; do not rely on completeness).  
- **wheelchair** (capture as accessibility flag).  
- **Address nits & alt names** (use to fix obvious errors or fill blanks).  
- **OSM identifier** saved for traceability.
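
The legacy-first rule for these add-ons can be sketched as a column-wise fill: keep the legacy value and take the OSM value only where legacy is blank. This is an illustration under assumed inputs (two aligned Series for the same pools), not the notebook's actual merge code. `fill_from_osm` is a hypothetical helper.

```python
import pandas as pd

def fill_from_osm(legacy: pd.Series, osm: pd.Series) -> pd.Series:
    """Keep legacy values; use the OSM value only where legacy is blank."""
    legacy_clean = legacy.astype("string").str.strip()
    osm_clean = osm.astype("string").str.strip()
    # Blank means missing or whitespace-only on the legacy side.
    legacy_blank = legacy_clean.fillna("").eq("").astype(bool)
    return legacy_clean.mask(legacy_blank, osm_clean)

# Hypothetical URLs for three matched pools.
legacy_site = pd.Series(["https://a.example", "", None])
osm_site = pd.Series(["https://x.example", "https://b.example", None])
merged_site = fill_from_osm(legacy_site, osm_site)
print(merged_site.tolist())
```

The first row keeps the legacy URL even though OSM disagrees, the second is filled from OSM, and the third stays missing because neither source has a value.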

**What we deliberately keep separate / defer**  
- **length, depth, lane counts, indoor/outdoor specs** → **not reliable in OSM now**; keep in a **separate specs table** (or backlog) until a trustworthy upstream is identified.

**Quality & rules applied**  
- Deduped by **normalized names** + **proximity**; excluded **access=private**.  
- Address sanity checks; fixed residual **district/neighborhood IDs**; enforced **string** dtypes for IDs.  
- Merge logic favors **legacy** when both exist; OSM used to **fill gaps or correct obvious issues**.  
- Stored **OSM_id** for auditing; scripted Overpass pull limited to Berlin; added caching to avoid churn.
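
The name-plus-proximity rule can be illustrated with a small, dependency-free sketch. It assumes a simple normalization (case folding, accent stripping, punctuation collapse) and the 250 m threshold from the cross-check step. The notebook's own normalization may differ, and all function names here are hypothetical.

```python
import math
import re
import unicodedata

def normalize_name(name: str) -> str:
    """Case-fold, strip accents, and collapse punctuation/whitespace."""
    text = unicodedata.normalize("NFKD", name.casefold())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text).strip()

def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres on a spherical Earth."""
    r = 6_371_000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def is_same_pool(a: tuple, b: tuple, max_m: float = 250.0) -> bool:
    """a and b are (name, lat, lon); match on normalized name and distance."""
    same_name = normalize_name(a[0]) == normalize_name(b[0])
    return same_name and haversine_m(a[1], a[2], b[1], b[2]) <= max_m

legacy_row = ("Strandbad Lübars", 52.61824, 13.33519)
osm_row = ("strandbad  lübars", 52.6183, 13.3353)  # hypothetical OSM record
print(is_same_pool(legacy_row, osm_row))
```

Exact equality on normalized names is deliberately strict. A fuzzy comparison would catch more variants but would also need a tuned threshold.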

**Risks & mitigations**  
- **OSM sparsity/volatility:** Keep OSM as enrichment only; log deltas on refresh.  
- **Website consistency:** Prefer **official Bäderbetrieb** URLs; use OSM only if missing in legacy.  
- **Specs reliability:** Parked until a curated or official specs source is identified.

**Next steps**  
1. Run the **hybrid end_to_end pipeline** to publish the enriched layer (legacy backbone + OSM add-ons).  
2. Open a **mini-backlog** to source **specs** (e.g., structured scrape/API from official Bäder pages).  
3. Schedule **periodic OSM refresh** (quarterly) with a coverage report (*wheelchair/website/hours*) to revisit the decision when fill rates improve.

**TL;DR:** Keep **legacy as truth**; add **OSM** where it’s strong (wheelchair, some websites/hours, address fixes). **Park specs.**












