# **🏊 Pools Refactor — Enrichment Research and Analysis**

## **Purpose**

Create a repeatable, minimal pipeline to compare legacy pools with OpenStreetMap (OSM) so you can:

1. Enrich existing legacy rows with missing details from OSM, and

2. Add genuinely new, named public pools from OSM.

## **Inputs & outputs**

**Inputs**:

- pools.csv (legacy)

- osm_pools_wide.csv (OSM wide export; optionally built in the notebook)

**Outputs (CSV in project root)**:

- legacy_enrichment_list.csv – pairs of legacy↔OSM likely referring to the same place (for enrichment)

- osm_new_public_named.csv – OSM pools not in legacy, with a clear name (for addition)

## **Parameters (defaults in the notebook)**

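`DIST_M = 100.0`: maximum distance (meters) at which a candidate pair is treated as the same place when the names differ. Only pairs that share a blocking key (see step 5) are compared.

`STRICT_PUBLIC = False`: lenient public filter for OSM; rows whose `access` is missing, empty, `yes`, `public`, or `permissive` are included.

`MAKE_OSM_WIDE = True`: (re)builds `osm_pools_wide.csv` via OSMnx; **set it to `False` to reuse an existing export.**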

## **What the pipeline does (section-by-section)**

### 1) Configure paths & parameters
- Define file locations (`pools.csv`, `osm_pools_wide.csv`).
- Set matching knobs:
  - `DIST_M` — maximum distance (meters) to consider two entries the same place.
  - `STRICT_PUBLIC` — OSM public-access strictness.

---

### 2) (Optional) Build OSM wide export
- Geocode the study area polygon with OSMnx.
- Query & merge three OSM feature sets:
  - `leisure=swimming_pool` (basins/complexes)
  - `leisure=sports_centre` **filtered** to `sport=swimming`
  - `leisure=swimming_area` (open-water/lidos)
- Keep useful attributes (names, address, access, contacts, etc.).
- Compute accurate centroids in a metric CRS (UTM33N) → convert back to WGS84.
- Write `osm_pools_wide.csv`.

---

### 3) Helper functions (normalization, mapping)
- Normalize names (accent fold, lowercase, remove punctuation).
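- **Coalesce OSM names** from multiple fields → one best `name`, taking the first non-empty value in this order:
  - `name`, `official_name`, `short_name`, `alt_name`, `name:<lang>` (the wide export keeps `name:de` and `name:en`), `brand`, `operator`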
- Map Legacy and OSM to a **unified schema**; apply the public filter on OSM.
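
The name coalescing described above can be sketched on a single OSM tag dictionary. This is a minimal illustration rather than the notebook's implementation: `coalesce_name`, `NAME_FIELDS`, and the sample tags are hypothetical, and the field order simply follows the list above.

```python
# Hypothetical helper: pick the first non-empty name-like tag, in priority order.
NAME_FIELDS = ["name", "official_name", "short_name", "alt_name",
               "name:de", "name:en", "brand", "operator"]

def coalesce_name(tags: dict) -> tuple:
    """Return (best_name, source_field); both are empty if no field is filled."""
    for field in NAME_FIELDS:
        value = tags.get(field)
        if isinstance(value, str) and value.strip():
            return value.strip(), field
    return "", ""

# Made-up sample rows, for illustration only.
print(coalesce_name({"name": "", "operator": "Example Operator"}))
print(coalesce_name({"leisure": "swimming_pool"}))
```

Returning the source field alongside the name makes it easy to audit how often a fallback such as `operator` was used.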

---

### 4) Load CSVs & map to unified schema
- Read inputs and produce:
  - `legacy_db` and `osm_db` with consistent columns and normalized names.

---

### 5) Build `legacy_enrichment_list` (candidates to enrich)
- Create a light **blocking key**: first 10 chars of `name_norm`.
- Join Legacy ↔ OSM on the block to propose candidate pairs.
- Compute Haversine distance (meters).
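- **Keep** a candidate pair if:
  - exact normalized name match **OR**
  - distance ≤ `DIST_M`
- Only pairs inside the same block are compared, so an OSM row whose normalized name begins differently from the legacy name is never proposed, no matter how close it is.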
- Output includes `dist_m`, `name_match`, and OSM contact fields to drive enrichment.
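
The blocking join and the keep rule can be sketched with plain dictionaries. This is an illustration under assumptions: `candidate_pairs`, `keep_pair`, the row layout, and the sample names are hypothetical, and distances are passed in precomputed.

```python
from collections import defaultdict

DIST_M = 100.0  # same default as the notebook

def candidate_pairs(legacy, osm, block_len=10):
    """Pair rows whose normalized names share the same first `block_len` characters."""
    buckets = defaultdict(list)
    for o in osm:
        buckets[o["name_norm"][:block_len]].append(o)
    for l in legacy:
        for o in buckets.get(l["name_norm"][:block_len], []):
            yield l, o

def keep_pair(l, o, dist_m):
    """Keep on an exact (non-empty) name match OR a distance within DIST_M."""
    name_match = l["name_norm"] != "" and l["name_norm"] == o["name_norm"]
    return name_match or dist_m <= DIST_M

# Made-up rows, for illustration only.
legacy = [{"id": "L1", "name_norm": "stadtbad mitte"}]
osm = [
    {"id": "O1", "name_norm": "stadtbad mitte"},
    {"id": "O2", "name_norm": "stadtbad mitte halle"},
    {"id": "O3", "name_norm": "sommerbad example"},
]
pairs = [(l["id"], o["id"]) for l, o in candidate_pairs(legacy, osm)]
print(pairs)  # O3 is in a different block, so it is never compared with L1
```

In this sketch a pair such as L1/O2 survives only if its distance is within `DIST_M`, while L1/O1 is kept on the name match alone.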

---

### 6) Build `osm_new_public_named` (new & named)
- From OSM public rows **not matched** above, keep only those with a **non-empty coalesced name**.
- These are high-quality additions for your master list.
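
The "not matched and named" selection is an anti-join followed by a name filter. A minimal sketch, assuming each OSM row is a dictionary with `source_id` and `name` keys (the helper name and sample rows are hypothetical):

```python
def new_named_rows(osm_rows, matched_ids):
    """OSM rows whose id was not matched and whose coalesced name is non-empty."""
    return [
        row for row in osm_rows
        if row["source_id"] not in matched_ids and row["name"].strip()
    ]

# Made-up rows, for illustration only.
osm_rows = [
    {"source_id": "a", "name": "Example Bad"},   # already matched to legacy
    {"source_id": "b", "name": ""},              # unmatched but unnamed
    {"source_id": "c", "name": "Example Lido"},  # unmatched and named
]
print(new_named_rows(osm_rows, matched_ids={"a"}))
```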

---

### 7) Save outputs + summary
- Write two CSVs to the project root:
  - `legacy_enrichment_list.csv`
  - `osm_new_public_named.csv`
- Print a compact run summary (counts + settings).


---



# 1) Configure paths & parameters

**What this does (bullet points):**
- Sets your input file paths.
- Lets you optionally build a fresh OSM export (set `MAKE_OSM_WIDE=True`).
- Defines matching knobs:
  - `DIST_M`: maximum distance (in meters) to consider two pools the same place.
  - `STRICT_PUBLIC`: if `True`, keep only OSM rows with `access` exactly `yes|public`.
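
To make the difference between the two modes concrete, here is a small restatement of the access rule. It is a sketch, not the notebook's function: `is_public` and the two value sets are illustrative names, with the values taken from the parameter description above.

```python
import math

LENIENT_VALUES = {"", "yes", "public", "permissive"}
STRICT_VALUES = {"yes", "public"}

def is_public(access, strict=False):
    """Illustrative access rule: a missing tag passes only in lenient mode."""
    missing = access is None or (isinstance(access, float) and math.isnan(access))
    if missing:
        return not strict
    value = str(access).strip().lower()
    return value in (STRICT_VALUES if strict else LENIENT_VALUES)

# Print each sample value with its lenient and strict result.
for raw in [None, "yes", "permissive", "private"]:
    print(repr(raw), is_public(raw), is_public(raw, strict=True))
```

Untagged rows and `permissive` are the cases where the two modes disagree, so switching `STRICT_PUBLIC` mainly affects features that carry no `access` tag at all.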



In [3]:

from pathlib import Path

def resolve_path(p):
    return Path(str(p)).expanduser()

# Inputs
LEGACY_CSV = resolve_path("pools.csv")            # legacy CSV
OSM_CSV    = resolve_path("osm_pools_wide.csv")   # will be created if MAKE_OSM_WIDE is True

# Optional OSM build
MAKE_OSM_WIDE = True              # set True to build OSM_CSV from OSMnx
OSM_PLACE     = "Berlin, Germany"  # used if MAKE_OSM_WIDE=True

# Parameters
DIST_M = 100.0
STRICT_PUBLIC = False

print("Configured:")
print("  LEGACY_CSV =", LEGACY_CSV)
print("  OSM_CSV    =", OSM_CSV)
print("  MAKE_OSM_WIDE:", MAKE_OSM_WIDE, "| PLACE:", OSM_PLACE)
print("  DIST_M     =", DIST_M, "m")
print("  STRICT_PUBLIC =", STRICT_PUBLIC)


Configured:
  LEGACY_CSV = pools.csv
  OSM_CSV    = osm_pools_wide.csv
  MAKE_OSM_WIDE: True | PLACE: Berlin, Germany
  DIST_M     = 100.0 m
  STRICT_PUBLIC = False



# 2) (Optional) Build OSM wide export

**What this does (bullet points):**
- Geocodes the study area polygon with OSMnx.
- Queries three feature sets and merges them:
  1. `leisure=swimming_pool` (actual basins / complexes)
  2. `leisure=sports_centre` filtered to `sport=swimming`
  3. `leisure=swimming_area` (open-water / lidos; include if useful)
- Keeps useful attributes (names, address, access, contact, dimensions).
- Computes accurate centroids in a **metric CRS (UTM33N)** and converts back to WGS84.
- Writes a **wide** CSV (`osm_pools_wide.csv`) for downstream steps.
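
OSM tags can hold several values separated by semicolons, so the `sport=swimming` filter has to cope with values such as `swimming;diving`. The sketch below shows one way to test for an exact token. It is an alternative shown for illustration, not the check used in the cell below, and `has_sport` is a hypothetical name.

```python
def has_sport(value, sport="swimming"):
    """True if `sport` is one of the semicolon-separated tokens in `value`."""
    if not isinstance(value, str):
        return False  # missing tag (None/NaN)
    tokens = {token.strip().lower() for token in value.split(";")}
    return sport in tokens

print(has_sport("swimming;diving"))        # exact token present
print(has_sport("tennis"))                 # different sport
print(has_sport("synchronized_swimming"))  # made-up value: a substring hit only
```

A plain substring test would accept the last value as well; whether that is desirable depends on how broadly "swimming" should be interpreted.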


In [4]:

if MAKE_OSM_WIDE:
    import osmnx as ox, geopandas as gpd, pandas as pd, numpy as np

    print("OSMnx version:", ox.__version__)
    ox.settings.use_cache = True
    ox.settings.log_console = False

    # Place polygon
    place_gdf = ox.geocode_to_gdf(OSM_PLACE)
    if place_gdf.empty:
        raise ValueError(f"Could not geocode place: {OSM_PLACE}")
    place_poly = place_gdf.geometry.iloc[0]

    # v1/v2 helper
    def fetch(poly, tags):
        if hasattr(ox, "features_from_polygon"):
            return ox.features_from_polygon(poly, tags=tags)      # OSMnx 2.x
        elif hasattr(ox, "geometries_from_polygon"):
            return ox.geometries_from_polygon(poly, tags)         # OSMnx 1.x
        else:
            raise RuntimeError("OSMnx lacks expected geometry functions.")

    # Queries
    g_pool = fetch(place_poly, {"leisure": "swimming_pool"})
    g_sc   = fetch(place_poly, {"leisure": "sports_centre"})
    g_area = fetch(place_poly, {"leisure": "swimming_area"})

    def has_swimming(val):
        if val is None or (isinstance(val, float) and np.isnan(val)): return False
        if isinstance(val, (list, tuple, set)): return any(str(x).lower()=="swimming" for x in val)
        return "swimming" in str(val).lower()

    if "sport" in g_sc.columns:
        g_sc = g_sc[g_sc["sport"].apply(has_swimming)].copy()
    else:
        g_sc = g_sc.iloc[0:0].copy()

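    # OSMnx returns features multi-indexed by element type and OSM id, which
    # ignore_index=True would discard; copy that index into "element_id" first.
    def with_element_id(df):
        df = df.copy()
        if isinstance(df.index, pd.MultiIndex) and "element_id" not in df.columns:
            df["element_id"] = ["/".join(map(str, ix)) for ix in df.index]
        return df

    g = pd.concat([with_element_id(d) for d in (g_pool, g_sc, g_area)],
                  ignore_index=True, sort=False)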
    if g.empty:
        raise ValueError("No pool-related features found for this area.")

    # Keep columns
    prefer_cols = [
        "element_id","osmid","name","official_name","short_name","alt_name",
        "name:de","name:en","brand","operator",
        "access",
        "addr:street","addr:housenumber","addr:postcode","addr:city",
        "website","contact:website","url",
        "phone","contact:phone",
        "opening_hours","wheelchair",
        "length","depth","width",
        "indoor","covered",
        "leisure","sport","swimming_pool",
        "geometry",
    ]
    have_cols = [c for c in prefer_cols if c in g.columns]
    g = g[have_cols].copy()

    # Centroids in metric CRS → back to WGS84
    if not isinstance(g, gpd.GeoDataFrame):
        g = gpd.GeoDataFrame(g, geometry="geometry", crs="EPSG:4326")
    g = g.set_crs("EPSG:4326", allow_override=True)
    g_proj = g.to_crs("EPSG:32633")  # UTM 33N ~ Berlin
    cent   = g_proj.geometry.centroid
    cent_w = cent.to_crs("EPSG:4326")
    g["lat"] = cent_w.y
    g["lon"] = cent_w.x

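    # Flatten website/phone: use the first of these columns that exists
    # (a column-level fallback, not a row-by-row merge of values)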
    def coalesce(df, *cols):
        for c in cols:
            if c in df.columns: return df[c]
        return pd.Series(index=df.index, dtype=object)
    g["website_flat"] = coalesce(g, "website","contact:website","url")
    g["phone_flat"]   = coalesce(g, "phone","contact:phone")

    # Stable ID: element_id -> osmid -> surrogate
    if "element_id" in g.columns and g["element_id"].notna().any():
        id_series = g["element_id"].astype("string")
    elif "osmid" in g.columns and g["osmid"].notna().any():
        id_series = g["osmid"].astype("string")
    else:
        id_series = pd.Series("", index=g.index, dtype="string")
    needs_sur = id_series.isna() | id_series.str.strip().eq("") | (id_series == "<NA>")
    id_series = id_series.astype("string")
    sur_idx = pd.Series(g.index.astype(str), index=g.index)
    id_series.loc[needs_sur] = "sur_" + sur_idx.loc[needs_sur]

    out = pd.DataFrame({
        "osm_id": id_series,
        "name": g.get("name", pd.Series(index=g.index, dtype=object)),
        "official_name": g.get("official_name", pd.Series(index=g.index, dtype=object)),
        "short_name": g.get("short_name", pd.Series(index=g.index, dtype=object)),
        "alt_name": g.get("alt_name", pd.Series(index=g.index, dtype=object)),
        "name:de": g.get("name:de", pd.Series(index=g.index, dtype=object)),
        "name:en": g.get("name:en", pd.Series(index=g.index, dtype=object)),
        "brand": g.get("brand", pd.Series(index=g.index, dtype=object)),
        "operator": g.get("operator", pd.Series(index=g.index, dtype=object)),
        "access": g.get("access", pd.Series(index=g.index, dtype=object)),
        "lat": g["lat"], "lon": g["lon"],
        "addr_street": g.get("addr:street", pd.Series(index=g.index, dtype=object)),
        "addr_housenumber": g.get("addr:housenumber", pd.Series(index=g.index, dtype=object)),
        "addr_postcode": g.get("addr:postcode", pd.Series(index=g.index, dtype=object)),
        "addr_city": g.get("addr:city", pd.Series(index=g.index, dtype=object)),
        "website": g["website_flat"],
        "phone": g["phone_flat"],
        "opening_hours": g.get("opening_hours", pd.Series(index=g.index, dtype=object)),
        "wheelchair": g.get("wheelchair", pd.Series(index=g.index, dtype=object)),
        "length": g.get("length", pd.Series(index=g.index, dtype=object)),
        "depth": g.get("depth", pd.Series(index=g.index, dtype=object)),
        "width": g.get("width", pd.Series(index=g.index, dtype=object)),
        "indoor": g.get("indoor", pd.Series(index=g.index, dtype=object)),
        "covered": g.get("covered", pd.Series(index=g.index, dtype=object)),
        "leisure": g.get("leisure", pd.Series(index=g.index, dtype=object)),
        "sport": g.get("sport", pd.Series(index=g.index, dtype=object)),
        "swimming_pool": g.get("swimming_pool", pd.Series(index=g.index, dtype=object)),
    }).drop_duplicates(subset=["osm_id"], keep="first")

    print(f"Counts — pools:{len(g_pool)}  sports_centres(swimming):{len(g_sc)}  swimming_areas:{len(g_area)}  merged:{len(out)}")
    out.to_csv(OSM_CSV, index=False, encoding="utf-8")
    print(f"[ok] Wrote OSM wide export → {OSM_CSV} (rows={len(out)})")
else:
    print("[skip] MAKE_OSM_WIDE is False — expecting OSM_CSV to already exist.")


OSMnx version: 2.0.6
Counts — pools:919  sports_centres(swimming):65  swimming_areas:33  merged:1017
[ok] Wrote OSM wide export → osm_pools_wide.csv (rows=1017)



# 3) Helper functions (normalization, mapping)

**What this does (bullet points):**
- Normalizes text and names for consistent matching.
- Maps **Legacy** and **OSM** columns into a **unified schema**.
- For OSM, coalesces the best available name, taking the first non-empty value from:  
  `name`, `official_name`, `short_name`, `alt_name`, `name:<lang>` (`de`, `en`, `pl`, `fr`, `it`, `es`), `brand`, `operator`.
- Applies public-access filtering (`STRICT_PUBLIC`) for OSM.
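
A short worked example shows what the normalization does to German names. The first function restates the accent-fold, lowercase, and punctuation steps. `normalize_name_de` is a hypothetical variant, not part of the notebook, added only to show how `ß` behaves: NFKD folding does not decompose it, so the `[^a-z0-9]` filter turns it into a space.

```python
import re
import unicodedata

def normalize_name(s):
    """Accent-fold, lowercase, and replace runs of non-alphanumerics with one space."""
    if not isinstance(s, str):
        return ""
    s = unicodedata.normalize("NFKD", s)
    s = "".join(ch for ch in s if not unicodedata.combining(ch)).lower()
    return re.sub(r"[^a-z0-9]+", " ", s).strip()

def normalize_name_de(s):
    """Hypothetical variant: expand the sharp s before folding."""
    if not isinstance(s, str):
        return ""
    return normalize_name(s.replace("ß", "ss").replace("ẞ", "ss"))

print(normalize_name("Stadtbad Neukölln"))       # stadtbad neukolln
print(normalize_name("Strandbad Weißensee"))     # strandbad wei ensee
print(normalize_name_de("Strandbad Weißensee"))  # strandbad weissensee
```

Because both sources pass through the same function, the split at `ß` only matters when one source spells the name with `ss` and the other with `ß`.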


In [6]:

import pandas as pd, numpy as np, math, unicodedata, re

def normalize_ascii(s: str) -> str:
    if not isinstance(s, str): return ""
    s = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in s if not unicodedata.combining(ch))

def normalize_name(s: str) -> str:
    if not isinstance(s, str): return ""
    s = normalize_ascii(s).lower().strip()
    s = re.sub(r"[^a-z0-9]+", " ", s)
    return re.sub(r"\s+", " ", s).strip()

def is_public_access(v, strict: bool=False) -> bool:
    # Missing access tag: included in lenient mode, excluded in strict mode
    if v is None or (isinstance(v, float) and math.isnan(v)): return not strict
    v = str(v).strip().lower()
    # Anything outside the allow-list (private, customers, residents, no, ...) is rejected
    allowed = {"yes","public"} if strict else {"","yes","public","permissive"}
    return v in allowed

def haversine_m(lat1, lon1, lat2, lon2) -> float:
    if any(pd.isna([lat1, lon1, lat2, lon2])): return np.nan
    R=6371000.0; phi1=math.radians(lat1); phi2=math.radians(lat2)
    dphi=math.radians(lat2-lat1); dl=math.radians(lon2-lon1)
    a=math.sin(dphi/2)**2+math.cos(phi1)*math.cos(phi2)*math.sin(dl/2)**2
    return 2*R*math.asin(math.sqrt(a))

def pick(df, *choices):
    for c in choices:
        if c in df.columns: return c
    return None

def map_osm_to_db(df: pd.DataFrame, strict_public: bool=False) -> pd.DataFrame:
    if df.empty: return df.copy()
    cols = {c.lower(): c for c in df.columns}
    def cget(*choices):
        for ch in choices:
            if ch in cols: return cols[ch]
        return None

    # robust name coalescer
    name_candidates = [
        "name", "official_name", "short_name", "alt_name",
        "name:de","name:en","name:pl","name:fr","name:it","name:es",
        "brand","operator"
    ]
    present = [cols[c] for c in name_candidates if c in cols]
    def coalesce_series(df, colnames):
        if not colnames:
            return pd.Series([""]*len(df), index=df.index, dtype=object)
        out = df[colnames[0]].astype(object).fillna("")
        for c in colnames[1:]:
            nxt = df[c].astype(object).fillna("")
            use_next = out.astype(str).str.strip().eq("")
            out = out.where(~use_next, nxt)
        return out.fillna("").astype(str)
    name_best = coalesce_series(df, present)

    access_col=cget("access"); lat_col=cget("lat","latitude"); lon_col=cget("lon","longitude")
    dist_col=cget("district","bezirk","district_id")
    length_col=cget("length","length_m","pool_length"); depth_col=cget("depth","depth_m","pool_depth")
    type_col=cget("pool_type","leisure","sport","type"); indoor_col=cget("indoor_outdoor","indoor","covered")
    phone_col=cget("phone","contact:phone"); web_col=cget("website","contact:website","url")
    oh_col=cget("opening_hours"); wh_col=cget("wheelchair")
    id_col=cget("osm_id","@id","id","element_id","osmid")

    df_f = df.copy()
    if access_col and access_col in df_f.columns:
        df_f = df_f[df_f[access_col].apply(lambda v: is_public_access(v, strict_public))].copy()

    out = pd.DataFrame({
        "source":"osm",
        "source_id": df_f.get(id_col, pd.Series(index=df_f.index, dtype=object)),
        "name": name_best.reindex(df_f.index).fillna(""),
        "name_norm": name_best.reindex(df_f.index).fillna("").astype(str).map(normalize_name),
        "lat": pd.to_numeric(df_f.get(lat_col, np.nan), errors="coerce"),
        "lon": pd.to_numeric(df_f.get(lon_col, np.nan), errors="coerce"),
        "address": df_f.get(cget("addr:full","address","addr_street"), ""),
        "district": df_f.get(dist_col, ""),
        "pool_type": df_f.get(type_col, ""),
        "indoor_outdoor": df_f.get(indoor_col, ""),
        "length_m": pd.to_numeric(df_f.get(length_col, np.nan), errors="coerce"),
        "depth_m": pd.to_numeric(df_f.get(depth_col, np.nan), errors="coerce"),
        "phone": df_f.get(phone_col, ""),
        "website": df_f.get(web_col, ""),
        "opening_hours": df_f.get(oh_col, ""),
        "wheelchair": df_f.get(wh_col, ""),
    })

    # keep source of name for QA
    def name_source_row(row):
        for c in present:
            val = row.get(c, "")
            if isinstance(val, str) and val.strip():
                return c
        return ""
    out["name_source"] = ""
    if present:
        out["name_source"] = df_f[present].apply(name_source_row, axis=1)

    return out

def map_legacy_to_db(df: pd.DataFrame) -> pd.DataFrame:
    if df.empty: return df.copy()
    cols={c.lower():c for c in df.columns}
    def cget(*choices):
        for ch in choices:
            if ch in cols: return cols[ch]
        return None
    return pd.DataFrame({
        "source":"legacy",
        "source_id": df.get(cget("legacy_id","id"), pd.Series(index=df.index, dtype=object)),
        "name": df.get(cget("name","pool_name"), ""),
        # Series default so .astype() still works when no name column exists
        "name_norm": df.get(cget("name","pool_name"), pd.Series("", index=df.index)).astype(str).map(normalize_name),
        "lat": pd.to_numeric(df.get(cget("lat","latitude"), np.nan), errors="coerce"),
        "lon": pd.to_numeric(df.get(cget("lon","longitude"), np.nan), errors="coerce"),
        "address": df.get(cget("address","addr"), ""),
        "district": df.get(cget("district","bezirk"), ""),
        "pool_type": df.get(cget("pool_type","type"), ""),
        "indoor_outdoor": df.get(cget("indoor_outdoor","indoor"), ""),
        "length_m": pd.to_numeric(df.get(cget("length_m","length"), np.nan), errors="coerce"),
        "depth_m": pd.to_numeric(df.get(cget("depth_m","depth"), np.nan), errors="coerce"),
        "phone": df.get(cget("phone"), ""),
        "website": df.get(cget("website","url"), ""),
        "opening_hours": df.get(cget("opening_hours","hours"), ""),
        "wheelchair": df.get(cget("wheelchair"), ""),
    })



# 4) Load CSVs & map to unified schema

**What this does (bullet points):**
- Reads `pools.csv` and `osm_pools_wide.csv`.
- Maps both to the unified schema (`legacy_db`, `osm_db`).
- Prints counts and a quick audit of how many OSM rows have a usable (coalesced) name.
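
Mapping to the unified schema relies on finding each field under one of several possible column names, ignoring case. A minimal sketch of that lookup, with `pick_column` and the sample header as hypothetical stand-ins:

```python
def pick_column(columns, *aliases):
    """Return the actual column name matching the first alias found, else None."""
    lowered = {c.lower(): c for c in columns}
    for alias in aliases:
        if alias in lowered:
            return lowered[alias]
    return None

# Made-up CSV header, for illustration only.
header = ["Name", "Latitude", "LON"]
print(pick_column(header, "lat", "latitude"))   # Latitude
print(pick_column(header, "lon", "longitude"))  # LON
print(pick_column(header, "phone"))             # None
```

A `None` result means the unified column is filled with a default, which later shows up as a 0% filled ratio for that field.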


In [7]:

legacy_raw = pd.read_csv(LEGACY_CSV)
osm_raw    = pd.read_csv(OSM_CSV)

legacy_db = map_legacy_to_db(legacy_raw)
osm_db    = map_osm_to_db(osm_raw, strict_public=STRICT_PUBLIC)

print("Rows — legacy:", len(legacy_db),
      "| OSM wide:", len(osm_raw),
      "| OSM public subset:", len(osm_db))

print("Non-empty names (coalesced):", (osm_db['name'].astype(str).str.strip() != "").sum())
if "name_source" in osm_db.columns:
    print("\nTop name sources:")
    print(osm_db['name_source'].value_counts(dropna=False).head(10))


Rows — legacy: 144 | OSM wide: 1017 | OSM public subset: 307
Non-empty names (coalesced): 69

Top name sources:
name_source
        238
name     69
Name: count, dtype: int64



# 5) Build legacy_enrichment_list (candidates to enrich)

**What this does (bullet points):**
- Creates lightweight **blocking keys** from normalized names (first 10 chars).
- Joins Legacy ↔ OSM on that block to get candidate pairs.
- Flags **exact name matches** (`name_match`) and computes **distance** (meters).
- Keeps candidates where **exact name matches** OR **distance ≤ `DIST_M`**.
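
As a sanity check on the distance threshold, the Haversine formula can be evaluated for two points that differ only in latitude. The coordinates below are arbitrary values near Berlin chosen for illustration. The function mirrors the notebook's helper without the missing-value handling.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters on a sphere of radius 6,371 km."""
    R = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

# 0.001 degrees of latitude is roughly 111 m, just outside DIST_M = 100.
d = haversine_m(52.5, 13.4, 52.501, 13.4)
print(round(d, 1))
```

So with the default `DIST_M`, two centroids must agree to within a little less than 0.001° of latitude to count as the same place by distance alone.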


In [8]:

L = legacy_db.assign(bucket=legacy_db["name_norm"].str[:10])
O = osm_db.assign(bucket=osm_db["name_norm"].str[:10])

cand = L.merge(O, on="bucket", suffixes=("_l","_o"))
cand["name_match"] = (cand["name_norm_l"].ne("")) & cand["name_norm_l"].eq(cand["name_norm_o"])
cand["dist_m"] = cand.apply(lambda r: haversine_m(r["lat_l"], r["lon_l"], r["lat_o"], r["lon_o"]), axis=1)

legacy_enrichment_list = cand[(cand["name_match"]) | (cand["dist_m"] <= DIST_M)].copy()

keep_cols = [
    "source_id_l","name_l","lat_l","lon_l","address_l","district_l",
    "source_id_o","name_o","lat_o","lon_o","address_o","district_o",
    "dist_m","name_match","website_o","opening_hours_o","wheelchair_o"
]
legacy_enrichment_list = legacy_enrichment_list[[c for c in keep_cols if c in legacy_enrichment_list.columns]]
print("legacy_enrichment_list:", len(legacy_enrichment_list))


legacy_enrichment_list: 47



# 6) Build osm_new_public_named (OSM not in legacy, named only)

**What this does (bullet points):**
- Takes OSM **public** rows **not matched** to legacy above.
- Requires a **non-empty coalesced name** for clear additions.
- Final result is the list of **new named** pools from OSM to add.


In [9]:

matched_osm_ids = set(legacy_enrichment_list.get("source_id_o", pd.Series(dtype=object)).dropna().tolist())
osm_unmatched = osm_db[~osm_db["source_id"].isin(matched_osm_ids)].copy()

# coalesced name already resides in 'name'
osm_new_public_named = osm_unmatched[osm_unmatched["name"].astype(str).str.strip().ne("")].copy()
print("osm_new_public_named:", len(osm_new_public_named))


osm_new_public_named: 24



# 7) Save outputs + Summary

**What this does (bullet points):**
- Writes **CSV only** to the **main folder** (no Excel, no subfolders):
  - `legacy_enrichment_list.csv`
  - `osm_new_public_named.csv`
- Prints a compact summary of key counts and settings.


In [10]:

# Write CSVs to project root
legacy_enrichment_list.to_csv(Path("legacy_enrichment_list.csv"), index=False)
osm_new_public_named.to_csv(Path("osm_new_public_named.csv"), index=False)

summary = {
    "legacy_rows": len(legacy_db),
    "osm_public_rows": len(osm_db),
    "legacy_enrichment_list": len(legacy_enrichment_list),
    "osm_new_public_named": len(osm_new_public_named),
    "dist_m": DIST_M,
    "strict_public": STRICT_PUBLIC,
    "outputs": ["legacy_enrichment_list.csv", "osm_new_public_named.csv"]
}
print(summary)


{'legacy_rows': 144, 'osm_public_rows': 307, 'legacy_enrichment_list': 47, 'osm_new_public_named': 24, 'dist_m': 100.0, 'strict_public': False, 'outputs': ['legacy_enrichment_list.csv', 'osm_new_public_named.csv']}


- Legacy rows: 144

- OSM public subset: 307

- Enrichment candidates (legacy_enrichment_list): 47

- New OSM (named) to add (osm_new_public_named): 24

# Attribute coverage & gaps report (Legacy vs OSM)

In [12]:
import pandas as pd

# 0) Pretty display options (optional)
pd.set_option("display.max_rows", 200)
pd.set_option("display.precision", 3)

# 1) Which columns exist only in OSM vs only in Legacy?
legacy_cols = set(legacy_db.columns)
osm_cols    = set(osm_db.columns)

only_in_osm    = sorted(osm_cols - legacy_cols)
only_in_legacy = sorted(legacy_cols - osm_cols)
common_cols    = sorted(osm_cols & legacy_cols)

print("Columns only in OSM:", only_in_osm)
print("Columns only in Legacy:", only_in_legacy)
print("Columns in both (common):", common_cols)

# 2) Coverage (row counts) – high-level
coverage_summary = pd.DataFrame(
    {"source": ["legacy", "osm_public_subset"], "rows": [len(legacy_db), len(osm_db)]}
)
print("\n=== Coverage Summary ===")
display(coverage_summary)

# 3) Completeness by column (share of filled values) for common fields
def column_filled_ratio(df: pd.DataFrame, col: str) -> float:
    s = df[col]
    # Numeric -> filled if non-null
    if pd.api.types.is_numeric_dtype(s):
        return float(s.notna().mean())
    # Datetime/boolean -> filled if non-null
    if pd.api.types.is_datetime64_any_dtype(s) or pd.api.types.is_bool_dtype(s):
        return float(s.notna().mean())
    # Everything else (object/string etc.) -> filled if non-null and non-empty after strip
    s_str = s.astype(str)
    return float((s.notna() & s_str.str.strip().ne("")).mean())

def completeness_for(df: pd.DataFrame, cols: list[str]) -> pd.Series:
    if not cols:
        return pd.Series(dtype=float)
    return pd.Series({c: column_filled_ratio(df, c) for c in cols})

legacy_comp = completeness_for(legacy_db, common_cols)
osm_comp    = completeness_for(osm_db, common_cols)

comp = pd.DataFrame({
    "column": common_cols,
    "legacy_filled_ratio": [legacy_comp.get(c, float("nan")) for c in common_cols],
    "osm_filled_ratio":    [osm_comp.get(c, float("nan")) for c in common_cols],
})
comp["gap_osm_minus_legacy"] = comp["osm_filled_ratio"] - comp["legacy_filled_ratio"]

print("\n=== Completeness Comparison (common columns) ===")
display(comp.sort_values("gap_osm_minus_legacy", ascending=False).reset_index(drop=True))

# 4) Focus on spec fields (pool type, length, depth) – explicit gap check
spec_cols = [c for c in ["pool_type", "length_m", "depth_m"] if c in common_cols]
spec = comp[comp["column"].isin(spec_cols)].sort_values("column")

print("\n=== Spec Fields Gap (pool_type, length_m, depth_m) ===")
if not spec.empty:
    display(spec.reset_index(drop=True))
else:
    print("None of the spec fields were found in the common schema.")

# 5) Quick peek of where OSM vs Legacy adds more value
print("\nTop 10 columns where OSM is more complete than Legacy:")
display(comp.sort_values("gap_osm_minus_legacy", ascending=False).head(10).reset_index(drop=True))

print("\nTop 10 columns where Legacy is more complete than OSM:")
display(comp.sort_values("gap_osm_minus_legacy", ascending=True).head(10).reset_index(drop=True))

# 6) Acceptance-Criteria view (printed summary)
print("\n=== Acceptance-Criteria View ===")
print(f"- Coverage: legacy={len(legacy_db)} | osm_public_subset={len(osm_db)}")
print("- Attributes compared: in-memory (no files written)")
if not spec.empty:
    print("- Gaps (spec fields): see table above for pool_type/length_m/depth_m")
else:
    print("- Gaps (spec fields): none detected in common columns")



Columns only in OSM: ['name_source']
Columns only in Legacy: []
Columns in both (common): ['address', 'depth_m', 'district', 'indoor_outdoor', 'lat', 'length_m', 'lon', 'name', 'name_norm', 'opening_hours', 'phone', 'pool_type', 'source', 'source_id', 'website', 'wheelchair']

=== Coverage Summary ===


| | source | rows |
|---|---|---|
| 0 | legacy | 144 |
| 1 | osm_public_subset | 307 |



=== Completeness Comparison (common columns) ===


| | column | legacy_filled_ratio | osm_filled_ratio | gap_osm_minus_legacy |
|---|---|---|---|---|
| 0 | source_id | 0.0 | 1.0 | 1.0 |
| 1 | wheelchair | 0.0 | 0.143 | 0.143 |
| 2 | address | 0.0 | 0.137 | 0.137 |
| 3 | website | 0.0 | 0.091 | 0.091 |
| 4 | opening_hours | 0.0 | 0.078 | 0.078 |
| 5 | phone | 0.0 | 0.042 | 0.042 |
| 6 | indoor_outdoor | 0.0 | 0.029 | 0.029 |
| 7 | length_m | 0.0 | 0.01 | 0.01 |
| 8 | depth_m | 0.0 | 0.0 | 0.0 |
| 9 | lon | 1.0 | 1.0 | 0.0 |



=== Spec Fields Gap (pool_type, length_m, depth_m) ===


| | column | legacy_filled_ratio | osm_filled_ratio | gap_osm_minus_legacy |
|---|---|---|---|---|
| 0 | depth_m | 0.0 | 0.0 | 0.0 |
| 1 | length_m | 0.0 | 0.01 | 0.01 |
| 2 | pool_type | 1.0 | 1.0 | 0.0 |



Top 10 columns where OSM is more complete than Legacy:


| | column | legacy_filled_ratio | osm_filled_ratio | gap_osm_minus_legacy |
|---|---|---|---|---|
| 0 | source_id | 0.0 | 1.0 | 1.0 |
| 1 | wheelchair | 0.0 | 0.143 | 0.143 |
| 2 | address | 0.0 | 0.137 | 0.137 |
| 3 | website | 0.0 | 0.091 | 0.091 |
| 4 | opening_hours | 0.0 | 0.078 | 0.078 |
| 5 | phone | 0.0 | 0.042 | 0.042 |
| 6 | indoor_outdoor | 0.0 | 0.029 | 0.029 |
| 7 | length_m | 0.0 | 0.01 | 0.01 |
| 8 | depth_m | 0.0 | 0.0 | 0.0 |
| 9 | lon | 1.0 | 1.0 | 0.0 |



Top 10 columns where Legacy is more complete than OSM:


| | column | legacy_filled_ratio | osm_filled_ratio | gap_osm_minus_legacy |
|---|---|---|---|---|
| 0 | name | 1.0 | 0.225 | -0.775 |
| 1 | name_norm | 1.0 | 0.225 | -0.775 |
| 2 | lat | 1.0 | 1.0 | 0.0 |
| 3 | depth_m | 0.0 | 0.0 | 0.0 |
| 4 | lon | 1.0 | 1.0 | 0.0 |
| 5 | source | 1.0 | 1.0 | 0.0 |
| 6 | pool_type | 1.0 | 1.0 | 0.0 |
| 7 | district | 0.0 | 0.0 | 0.0 |
| 8 | length_m | 0.0 | 0.01 | 0.01 |
| 9 | indoor_outdoor | 0.0 | 0.029 | 0.029 |



=== Acceptance-Criteria View ===
- Coverage: legacy=144 | osm_public_subset=307
- Attributes compared: in-memory (no files written)
- Gaps (spec fields): see table above for pool_type/length_m/depth_m


1. Coverage: legacy has 144 rows vs 307 in the OSM public subset, so OSM lists more features overall, although 238 of them have no usable name.

2. Schema overlap: every column is shared except name_source (an OSM-only helper column).

3. Where OSM is more complete:
wheelchair, address, website, opening_hours, phone, and a little indoor_outdoor. Legacy has none of these filled, but OSM coverage is itself low (roughly 3–14% of rows). source_id is 0% filled on the legacy side, so its gap reflects missing legacy IDs rather than an OSM advantage.
(This is what we want OSM for: contact and accessibility details.)

4. Where legacy is more complete:
name and name_norm (legacy 100% vs OSM ~22.5%).
(OSM has many unnamed features even after coalescing.)

5. Spec fields ("gaps"):
pool_type is fully populated in both; length_m and depth_m are essentially empty on both sides (≈0–1%), so there is little point in pulling length and depth from OSM.
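
The filled ratios quoted above are simple shares of non-empty values. As a worked check of the ~22.5% name figure, the sketch below recomputes it from the counts printed in step 4 (69 named and 238 unnamed OSM rows). `filled_ratio` is an illustrative helper, and the list holds placeholders rather than real names.

```python
def filled_ratio(values):
    """Share of values that are non-null and non-empty after stripping."""
    if not values:
        return float("nan")
    filled = sum(1 for v in values if v is not None and str(v).strip() != "")
    return filled / len(values)

# Placeholder values reproducing the counts from the run: 69 named, 238 unnamed.
names = ["x"] * 69 + [""] * 238
print(round(filled_ratio(names), 3))  # 0.225
```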



**OSM does not fully replace the old source today (naming too sparse, specs missing).**

**Best choice: hybrid enrichment — legacy for the backbone; OSM to fill contact, hours, accessibility, and some addresses.**

**Keep specs separate until we secure a reliable source for length/depth.**