# Notebook 05 — Geocoding & Entity Resolution (Build schools_master_v2)

## Purpose
This notebook converts the **union-first, high-recall** `schools_master_v1`
into a **deduplicated, geocoded, region-aware** canonical dataset:

> `schools_master_v2.parquet`

This step is intentionally **lossy and opinionated**, with full audit logs,
and is the **last place where entity identity may change**.

Scoring, ranking, and Top-K artifact generation begin **after** this notebook.

---

## Inputs
- `data/processed/notebook04/schools_master_v1.parquet`
- `data/processed/notebook04/schools_locations_v1.parquet`
- 
## Outputs
- `data/processed/notebook05/schools_master_v2.parquet`
- `data/processed/notebook05/entity_merge_log.csv`
- `data/processed/notebook05/unresolved_orphans.csv`
- `data/processed/notebook05/geo_coverage_report.csv`

---

## Index

### 00. Notebook Setup
00.1 Imports and environment setup  
00.2 Paths, run IDs, and version tags  
00.3 Deterministic settings (hash seeds, thresholds)  
00.4 Logging helpers and audit utilities  

---

### 01. Load schools_master_v1
01.1 Load canonical v1 dataset  
01.2 Schema validation and required columns  
01.3 Primary key integrity check (`school_id`)  
01.4 High-level profiling (row counts, source mix)  

---

### 02. Classify Location Completeness
02.1 Address completeness taxonomy  
- Full address (street + city + state + zip)  
- Partial address (city/state only)  
- Name + county only  
- Missing / malformed  

02.2 Tag rows with `location_status`  
02.3 Initial orphan extraction (no usable address)  

---

### 03. Geocoding Strategy
03.1 Geocoding providers and configuration  
(Google / Mapbox / OpenCage / stub for offline runs)

03.2 Confidence tiers
- `high` — rooftop or exact address  
- `medium` — city-level centroid  
- `low` — county-level or inferred  
- `failed` — no result  

03.3 Batch geocoding execution  
03.4 Attach `latitude`, `longitude`, `geo_confidence`  
03.5 Geocoding QA report  

---

### 04. Orphan Recovery
04.1 Reintroduce dropped private / CAIS schools  
04.2 Attempt best-effort geocoding  
04.3 Flag unresolved rows (`geo_confidence = failed`)  
04.4 Emit `unresolved_orphans.csv`  

---

### 05. Candidate Duplicate Detection
05.1 Name normalization and similarity metrics  
05.2 Spatial proximity candidate generation  
- Same ZIP or city/state  
- Distance thresholds (e.g., < 50m, < 200m)  

05.3 Candidate pair table (pre-merge)  

---

### 06. Entity Resolution & Merge Rules
06.1 Auto-merge criteria  
- High name similarity AND close spatial match  
- Shared external identifiers (when present)  

06.2 Do-not-merge guards  
- Multi-campus detection  
- Distinct grade spans  
- Program-level vs school-level entities  

06.3 Merge policy
- Canonical display name selection  
- Boolean tag union (OR)  
- Provenance preservation  

06.4 Manual review bucket (deferred merges)  

---

### 07. Apply Merges & Build schools_master_v2
07.1 Execute merges  
07.2 Generate new canonical rows  
07.3 Emit `entity_merge_log.csv`  
07.4 Post-merge integrity checks  
- Row counts  
- No duplicate physical entities  

---

### 08. Region Assignment (Geo-Based)
08.1 Region definitions (polygons)  
- Bay Area  
- SoCal  
- NorCal  
- State / National fallback  

08.2 Point-in-polygon assignment  
08.3 Region QA and edge cases  

---

### 09. Final Quality Gates
09.1 No duplicate physical schools  
09.2 All rows have `location_status`  
09.3 All rows have region or explicit exclusion reason  
09.4 Geo coverage summary  

---

### 10. Export schools_master_v2
10.1 Write `schools_master_v2.parquet`  
10.2 Write QA and audit artifacts  
10.3 Freeze schema contract for scoring  

---

## Contract with Notebook 06 (Scoring)
- Input rows represent **one real-world school**
- Duplicates resolved or explicitly flagged
- Location + region available for all rank-eligible rows
- Scoring MUST NOT perform entity resolution


## 00. Notebook Setup

This notebook builds `schools_master_v2` from `schools_master_v1` by:
- classifying location completeness
- geocoding with cache-first lookups
- performing entity resolution (spatial + name-aware)
- assigning precise regions (point-in-polygon)
- exporting audit logs (merge log, unresolved orphans, geo coverage report)

Key principles:
- Deterministic + idempotent outputs
- Cache-first geocoding (avoid repeated API calls)
- Every merge is auditable (merge log)
- No scoring or ranking in this notebook


In [543]:
# ============================
# 00.1 Imports + Globals
# ============================

from __future__ import annotations

import os
import re
import json
import math
import time
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, Iterable, List, Optional, Tuple

import numpy as np
import pandas as pd

# ============================
# 00.2 Paths + Run Metadata
# ============================

def find_project_root(start: Path | None = None) -> Path:
    """
    Walk upward from `start` (default: cwd) until we find a folder that looks like the repo root.
    Heuristic: contains a `data/` directory (and optionally .git / pyproject / etc).
    """
    start = start or Path.cwd()
    start = start.resolve()

    for p in [start, *start.parents]:
        if (p / "data").exists() and (p / "data").is_dir():
            return p
    # Fallback: previous behavior
    return (start / "..").resolve()

PROJECT_ROOT = find_project_root()

DATA_DIR = PROJECT_ROOT / "data"
PROCESSED_DIR = DATA_DIR / "processed"
RAW_DIR = DATA_DIR / "raw"

# Inputs from Notebook 04
NB04_DIR = PROCESSED_DIR / "notebook04"
IN_SCHOOLS_V1 = NB04_DIR / "schools_master_v1.parquet"
IN_SCHOOLS_LOCATIONS_V1 = NB04_DIR / "schools_locations_v1.parquet"

# Outputs for this notebook
NB05_DIR = PROCESSED_DIR / "notebook05"
NB05_DIR.mkdir(parents=True, exist_ok=True)

# Canonical outputs
OUT_SCHOOLS_V2_PARQUET = NB05_DIR / "schools_master_v2.parquet"
OUT_SCHOOLS_V2_CSV = NB05_DIR / "schools_master_v2.csv"

OUT_MERGE_LOG = NB05_DIR / "entity_merge_log.csv"
OUT_UNRESOLVED = NB05_DIR / "unresolved_orphans.csv"
OUT_GEO_REPORT = NB05_DIR / "geo_coverage_report.csv"

# Geocode cache files (write parquet for speed; optionally export csv for inspection)
GEO_CACHE_PATH_PARQUET = NB05_DIR / "geocode_cache_v1.parquet"
GEO_CACHE_PATH_CSV = NB05_DIR / "geocode_cache_v1.csv"

GEO_FAIL_CACHE_PATH_PARQUET = NB05_DIR / "geocode_failures_v1.parquet"
GEO_FAIL_CACHE_PATH_CSV = NB05_DIR / "geocode_failures_v1.csv"

# Zip county file
ZIP_COUNTY_PATH = RAW_DIR / "zip_county.xlsx"

# geocode cache lives in notebook05 outputs
GEO_CACHE_PATH = NB05_DIR / "geo_cache_v1.parquet"

log(f"GEO_CACHE_PATH: {GEO_CACHE_PATH}")
log(f"ZIP_COUNTY_PATH: {ZIP_COUNTY_PATH}")

# Run metadata
RUN_ID = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
PIPELINE_VERSION = "05_v1"  # bump as you iterate

print("CWD:", Path.cwd())
print("PROJECT_ROOT:", PROJECT_ROOT)
print("Input:", IN_SCHOOLS_V1)
print("Exists IN_SCHOOLS_V1?", IN_SCHOOLS_V1.exists())
print("Input locations:", IN_SCHOOLS_LOCATIONS_V1)
print("Exists IN_SCHOOLS_LOCATIONS_V1?", IN_SCHOOLS_LOCATIONS_V1.exists())
print("RUN_ID:", RUN_ID)
print("Output (parquet):", OUT_SCHOOLS_V2_PARQUET)
print("Output (csv):", OUT_SCHOOLS_V2_CSV)

if not IN_SCHOOLS_V1.exists():
    # Helpful debug context if root detection is wrong
    print("\n[DEBUG] Nearby folders under PROJECT_ROOT:")
    try:
        for x in sorted(PROJECT_ROOT.iterdir()):
            if x.is_dir():
                print(" -", x.name)
    except Exception as e:
        print("Could not list PROJECT_ROOT dirs:", e)
    raise FileNotFoundError(f"Missing input parquet: {IN_SCHOOLS_V1}")

if not IN_SCHOOLS_LOCATIONS_V1.exists():
    raise FileNotFoundError(f"Missing locations parquet: {IN_SCHOOLS_LOCATIONS_V1}")

# 2b) tiny read
_tmp_loc = pd.read_parquet(
    IN_SCHOOLS_LOCATIONS_V1,
    columns=["school_id", "address", "city", "state", "zip"]
)
print("✅ locations parquet readable. rows:", len(_tmp_loc))
del _tmp_loc


if not SCHOOLS_LOCATION_V1.exists():
    raise FileNotFoundError(f"Missing locations parquet: {SCHOOLS_LOCATION_V1}")

# ============================
# Input Artifact Sanity
# ============================

# 1) Confirm file type + quick metadata
print("\n[INPUT SANITY]")
print("IN_SCHOOLS_V1 suffix:", IN_SCHOOLS_V1.suffix)
print("IN_SCHOOLS_V1 size (MB):", round(IN_SCHOOLS_V1.stat().st_size / 1e6, 2))
print("IN_SCHOOLS_V1 modified:", datetime.fromtimestamp(IN_SCHOOLS_V1.stat().st_mtime).isoformat())

# 2) Load a *tiny* sample to confirm it is readable (parquet only)
try:
    _tmp = pd.read_parquet(IN_SCHOOLS_V1, columns=["school_id", "name", "city", "state"])
    print("✅ parquet readable. rows:", len(_tmp))
except Exception as e:
    raise RuntimeError(f"Could not read {IN_SCHOOLS_V1}. Error: {e}")

# 3) Hard fail if the dataset is suspiciously small
if len(_tmp) < 10_000:
    raise ValueError(f"schools_master_v1 looks too small ({len(_tmp)} rows). Wrong input?")
    
del _tmp

# 2b) Load a tiny sample from locations to confirm it is readable
_tmp_loc = pd.read_parquet(
    IN_SCHOOLS_LOCATIONS_V1,
    columns=["school_id", "address", "city", "state", "zip"]
)
print("✅ locations parquet readable. rows:", len(_tmp_loc))
del _tmp_loc   

[110005.52s] GEO_CACHE_PATH: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/geo_cache_v1.parquet
[110005.52s] ZIP_COUNTY_PATH: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/raw/zip_county.xlsx
CWD: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/notebooks
PROJECT_ROOT: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school
Input: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook04/schools_master_v1.parquet
Exists IN_SCHOOLS_V1? True
Input locations: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook04/schools_locations_v1.parquet
Exists IN_SCHOOLS_LOCATIONS_V1? True
RUN_ID: 20260119T063439Z
Output (parquet): /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone P

In [545]:
# ============================
# 00.3 Determinism + Thresholds (updated)
# ============================

# Stable seeds (only for any random sampling/debugging)
GLOBAL_SEED = 42
np.random.seed(GLOBAL_SEED)

# Entity resolution thresholds (v1 defaults; tune later)
MAX_MERGE_DISTANCE_M = 50          # strong spatial signal (MUST be paired with name similarity)
MAX_CANDIDATE_DISTANCE_M = 200     # broader candidate generation window
NAME_SIM_THRESHOLD_AUTO = 0.92     # auto-merge requires high name similarity (plus distance)
NAME_SIM_THRESHOLD_REVIEW = 0.85   # send to review if in this band and close

# Guardrail: never allow distance-only merges (even under MAX_MERGE_DISTANCE_M)
ALLOW_DISTANCE_ONLY_MERGES = False

# Sanity checks to prevent subtle mistakes
assert 0 < NAME_SIM_THRESHOLD_REVIEW <= NAME_SIM_THRESHOLD_AUTO <= 1.0, (
    f"Bad name thresholds: REVIEW={NAME_SIM_THRESHOLD_REVIEW}, AUTO={NAME_SIM_THRESHOLD_AUTO}"
)
assert 0 < MAX_MERGE_DISTANCE_M <= MAX_CANDIDATE_DISTANCE_M, (
    f"Bad distance thresholds: MERGE={MAX_MERGE_DISTANCE_M}, CANDIDATE={MAX_CANDIDATE_DISTANCE_M}"
)

# Hash utility for deterministic tie-breaking / stable pseudo-random
def stable_hash_to_float(s: str, salt: str = "nb05_v1") -> float:
    """
    Deterministically map a string to a float in [0, 1).
    Useful for stable tie-breaking (e.g., sorting equally-scored candidates).
    """
    if s is None:
        s = ""
    h = hashlib.sha256((salt + "|" + str(s)).encode("utf-8")).hexdigest()
    # 15 hex digits => range [0, 16**15)
    n = int(h[:15], 16)
    return n / float(16**15)

print("Determinism initialized. GLOBAL_SEED =", GLOBAL_SEED)
print(
    "Thresholds:",
    f"MAX_MERGE_DISTANCE_M={MAX_MERGE_DISTANCE_M},",
    f"MAX_CANDIDATE_DISTANCE_M={MAX_CANDIDATE_DISTANCE_M},",
    f"AUTO={NAME_SIM_THRESHOLD_AUTO}, REVIEW={NAME_SIM_THRESHOLD_REVIEW},",
    f"ALLOW_DISTANCE_ONLY_MERGES={ALLOW_DISTANCE_ONLY_MERGES}",
)

# ============================
# 00.4 Logging Helpers (updated)
# ============================

# Optional IPython display (safe fallback for script-mode)
try:
    from IPython.display import display
except Exception:
    display = None

# Keep the original notebook start time if cell is re-run
RESET_TIMER = True
if RESET_TIMER or "_t0" not in globals():
    _t0 = time.time()

def log(msg: str) -> None:
    dt = time.time() - _t0
    print(f"[{dt:8.2f}s] {msg}")

def df_info(df: pd.DataFrame, name: str, head: int = 3) -> None:
    log(f"{name}: shape={df.shape}")
    if display is not None:
        display(df.head(head))
    else:
        print(df.head(head).to_string(index=False))

log("Notebook 05 setup complete.")


Determinism initialized. GLOBAL_SEED = 42
Thresholds: MAX_MERGE_DISTANCE_M=50, MAX_CANDIDATE_DISTANCE_M=200, AUTO=0.92, REVIEW=0.85, ALLOW_DISTANCE_ONLY_MERGES=False
[    0.00s] Notebook 05 setup complete.


## 01. Load schools_master_v1

In this section we:
- load `schools_master_v1`
- validate the **minimum schema contract**
- verify `school_id` integrity (non-null + unique)
- run quick profiling for sanity (row counts, state distribution, public/private mix if available)

**Output of this section:** an in-memory `df_v1` DataFrame that is safe to geocode + dedupe.


In [548]:
# ============================
# 01.1 Load schools_master_v1 + CAIS private backfill + probes (updated)
# ============================

def load_schools_master_v1() -> pd.DataFrame:
    """
    Load schools_master_v1 from Notebook04 output (parquet is source of truth).
    """
    parquet_path = IN_SCHOOLS_V1
    if parquet_path.exists():
        log(f"Loading parquet: {parquet_path}")
        return pd.read_parquet(parquet_path)
    raise FileNotFoundError(f"Could not find schools_master_v1 parquet at {parquet_path}.")


def probe_name(df: pd.DataFrame, q: str) -> pd.DataFrame:
    base_cols = ["school_id","name","address","city","state","zip","is_private","has_cais"]
    cols = [c for c in base_cols if c in df.columns]

    # match
    name_series = df["name"].fillna("").astype(str)
    m = name_series.str.lower().str.contains(q.lower(), regex=False)

    out = df.loc[m, cols].copy()

    # safe sorting: build the lists together
    sort_cols = [c for c in ["has_cais", "is_private", "name"] if c in out.columns]
    ascending_map = {"has_cais": False, "is_private": False, "name": True}
    ascending = [ascending_map[c] for c in sort_cols]

    if sort_cols:
        out = out.sort_values(sort_cols, ascending=ascending)

    return out.head(50)


df_v1 = load_schools_master_v1()

# --- Lightweight dtype normalization (helps later; optional but recommended) ---
# zip should be string to avoid losing leading zeros + keep <NA> consistent
if "zip" in df_v1.columns:
    df_v1["zip"] = df_v1["zip"].astype("string")

# ensure booleans behave predictably even if nullable
for b in ["has_cais", "is_private", "is_charter", "is_public"]:
    if b in df_v1.columns:
        # keep as pandas BooleanDtype (nullable) if you want; this ensures fillna works
        df_v1[b] = df_v1[b].astype("boolean")

# --- Patch: CAIS membership implies independent/private sector ---
mask_cais = df_v1["has_cais"].fillna(False)
before_private = int(df_v1.loc[mask_cais, "is_private"].fillna(False).sum())

df_v1.loc[mask_cais, "is_private"] = True

after_private = int(df_v1.loc[mask_cais, "is_private"].fillna(False).sum())
changed = after_private - before_private

log(f"Loaded df_v1: shape={df_v1.shape}")
log(f"CAIS rows: {int(mask_cais.sum())} | is_private before={before_private} after={after_private} (changed={changed})")

# Invariant: no CAIS school should be non-private
violations = df_v1.loc[mask_cais & ~df_v1["is_private"].fillna(False)]
assert violations.empty, f"CAIS private backfill failed for {len(violations)} rows"

# Visibility: charter distribution (do not fix yet, but detect)
if "is_charter" in df_v1.columns:
    log("Charter distribution:")
    print(df_v1["is_charter"].value_counts(dropna=False))

# quick peek
df_info(df_v1, "df_v1", head=3)

# Probes for well-known Bay Area schools
if display is not None:
    display(probe_name(df_v1, "harker"))
    display(probe_name(df_v1, "hamlin"))
    display(probe_name(df_v1, "head-royce"))
    display(probe_name(df_v1, "lick-wilmerding"))
else:
    print(probe_name(df_v1, "harker").to_string(index=False))
    print(probe_name(df_v1, "hamlin").to_string(index=False))
    print(probe_name(df_v1, "head-royce").to_string(index=False))
    print(probe_name(df_v1, "lick-wilmerding").to_string(index=False))

log("Section 01.1 complete.")


[    1.95s] Loading parquet: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook04/schools_master_v1.parquet
[    2.00s] Loaded df_v1: shape=(126616, 18)
[    2.00s] CAIS rows: 98 | is_private before=70 after=98 (changed=28)
[    2.00s] Charter distribution:
is_charter
False    119073
True       7543
Name: count, dtype: Int64
[    2.00s] df_v1: shape=(126616, 18)


Unnamed: 0,school_id,name,city,state,zip,county,is_public,is_private,is_charter,serves_pk,serves_elementary,serves_middle,serves_high,has_cais,has_ib,has_waldorf,has_montessori,has_any_enrichment
0,PUB_010000500870,Albertville Middle School,Albertville,AL,35950,,True,False,False,False,False,False,False,False,False,False,False,False
1,PUB_010000500871,Albertville High School,Albertville,AL,35950,,True,False,False,False,False,False,False,False,False,False,False,False
2,PUB_010000500879,Albertville Intermediate School,Albertville,AL,35950,,True,False,False,False,False,False,False,False,False,False,False,False


Unnamed: 0,school_id,name,city,state,zip,is_private,has_cais
124632,CAIS_baab550e497c,The Harker School,San Jose,CA,95129,True,True
119411,PRI_A2190096,HARKER,SAN JOSE,CA,95129,True,False
58431,PUB_341599002995,General Charles G. Harker School,WOOLWICH TOWNSHIP,NJ,8085,False,False
87632,PUB_482566002869,HARKER HEIGHTS EL,HARKER HEIGHTS,TX,76548,False,False
87653,PUB_482566008695,HARKER HEIGHTS H S,HARKER HEIGHTS,TX,76548,False,False
65245,PUB_370063000254,Harkers Island Elementary,Harkers Island,NC,28531,False,False


Unnamed: 0,school_id,name,city,state,zip,is_private,has_cais
124631,CAIS_a2ad7c2318c2,The Hamlin School,San Francisco,CA,94115.0,True,True
120690,PRI_A9106812,HAMLIN ROBINSON SCHOOL,SEATTLE,WA,98144.0,True,False
102946,PRI_00083553,HAMLIN SCHOOL,SAN FRANCISCO,CA,94115.0,True,False
123220,PRI_BB160346,THE HAMLIN SCHOOL,SAN FRANCISCO,CA,94115.0,True,False
86664,PUB_482226002234,HAMLIN COLLEGIATE EL,HAMLIN,TX,79520.0,False,False
86665,PUB_482226002235,HAMLIN COLLEGIATE H S,HAMLIN,TX,79520.0,False,False
22605,PUB_120144008845,HAMLIN ELEMENTARY,WINTER GARDEN,FL,34787.0,False,False
22606,PUB_120144008872,HAMLIN MIDDLE,WINTER GARDEN,FL,34787.0,False,False
84652,PUB_481527001061,HAMLIN MIDDLE,CORPUS CHRISTI,TX,78411.0,False,False
49167,PUB_273384001584,HAMLINE ELEMENTARY SCHOOL,SAINT PAUL,MN,55104.0,False,False


Unnamed: 0,school_id,name,city,state,zip,is_private,has_cais
124633,CAIS_b6653a2ab18f,Head-Royce School,Oakland,CA,94602,True,True


Unnamed: 0,school_id,name,city,state,zip,is_private,has_cais
124639,CAIS_d0d81d7b5167,Lick-Wilmerding High School,San Francisco,CA,94112,True,True


[    2.13s] Section 01.1 complete.


In [550]:
# ============================
# 01.2 Minimal schema expectations
# ============================

# Canonical required fields for entity resolution
REQUIRED_COLS = [
    "school_id",
    "name",     
    "city",
    "state",
]

missing_required = [c for c in REQUIRED_COLS if c not in df_v1.columns]
if missing_required:
    raise ValueError(f"Missing required columns: {missing_required}")

log("Required columns present: " + ", ".join(REQUIRED_COLS))

# Optional-but-important columns (used later if present)
OPTIONAL_COLS = [
    "zip",
    "county",
    "is_public",
    "is_private",
    "is_charter",
    "serves_pk",
    "serves_elementary",
    "serves_middle",
    "serves_high",
    "grade_band_source",
    "has_cais",
    "has_ib",
    "has_montessori",
    "has_waldorf",
    "has_any_enrichment",
]

present_optional = [c for c in OPTIONAL_COLS if c in df_v1.columns]
missing_optional = [c for c in OPTIONAL_COLS if c not in df_v1.columns]

log(f"Optional columns present ({len(present_optional)}): {present_optional}")
if missing_optional:
    log(f"Optional columns missing ({len(missing_optional)}): {missing_optional}")

# ----------------------------
# CAIS invariants 
# ----------------------------
if "has_cais" in df_v1.columns and "is_private" in df_v1.columns:
    cais_mask = df_v1["has_cais"].fillna(False)
    cais_not_private = int((cais_mask & ~df_v1["is_private"].fillna(False)).sum())
    log(f"CAIS invariant check: cais_not_private={cais_not_private}")
    if cais_not_private != 0:
        raise ValueError(
            f"Found {cais_not_private} CAIS rows that are not marked private. "
            "Expected 0 after CAIS backfill patch."
        )
        
log("Section 01.2 complete.")

[    3.03s] Required columns present: school_id, name, city, state
[    3.03s] Optional columns present (14): ['zip', 'county', 'is_public', 'is_private', 'is_charter', 'serves_pk', 'serves_elementary', 'serves_middle', 'serves_high', 'has_cais', 'has_ib', 'has_montessori', 'has_waldorf', 'has_any_enrichment']
[    3.03s] Optional columns missing (1): ['grade_band_source']
[    3.03s] CAIS invariant check: cais_not_private=0
[    3.03s] Section 01.2 complete.


In [552]:
# ============================
# 01.3 Load schools_locations_v1 + choose best location per school_id (UPDATED v2)
#   - Produces: loc_best (1 row per school_id, best_* cols)
#   - Updates:  df_v1 = df_v1 + best_* cols (does NOT drop anything)
# ============================

def load_schools_locations_v1() -> pd.DataFrame:
    parquet_path = IN_SCHOOLS_LOCATIONS_V1
    if parquet_path.exists():
        log(f"Loading locations parquet: {parquet_path}")
        return pd.read_parquet(parquet_path)
    raise FileNotFoundError(f"Could not find schools_locations_v1 parquet at {parquet_path}.")

loc_v1 = load_schools_locations_v1()
log(f"Loaded loc_v1: shape={loc_v1.shape}")

# --- Normalize common dtypes (safe if cols absent) ---
if "zip" in loc_v1.columns:
    loc_v1["zip"] = loc_v1["zip"].astype("string")

for c in ["latitude", "longitude", "lat", "lng", "lon"]:
    if c in loc_v1.columns:
        loc_v1[c] = pd.to_numeric(loc_v1[c], errors="coerce")

# --- Standardize lat/lng column names if needed ---
rename_map = {}
if "latitude" not in loc_v1.columns and "lat" in loc_v1.columns:
    rename_map["lat"] = "latitude"
if "longitude" not in loc_v1.columns and "lng" in loc_v1.columns:
    rename_map["lng"] = "longitude"
if "longitude" not in loc_v1.columns and "lon" in loc_v1.columns:
    rename_map["lon"] = "longitude"
if rename_map:
    loc_v1 = loc_v1.rename(columns=rename_map)

# --- Required column check ---
if "school_id" not in loc_v1.columns:
    raise ValueError("schools_locations_v1 must contain school_id")

# --- Quick check: is locations truly multi-row per school_id? ---
dup_rate = loc_v1["school_id"].duplicated().mean()
log(f"locations duplicate rate: {dup_rate:.2%}")

# --- Helper: non-empty string series check ---
def _nonempty(df_: pd.DataFrame, col: str) -> pd.Series:
    if col not in df_.columns:
        return pd.Series(False, index=df_.index)
    s = df_[col].astype("string")
    return s.notna() & (s.str.strip() != "")

has_address = _nonempty(loc_v1, "address")
has_city    = _nonempty(loc_v1, "city")
has_state   = _nonempty(loc_v1, "state")
has_zip     = _nonempty(loc_v1, "zip")
has_latlng  = (
    (loc_v1["latitude"].notna() & loc_v1["longitude"].notna())
    if ("latitude" in loc_v1.columns and "longitude" in loc_v1.columns)
    else pd.Series(False, index=loc_v1.index)
)

# --- Location quality scoring ---
geo_bonus = pd.Series(0, index=loc_v1.index, dtype="int64")
if "geo_quality" in loc_v1.columns:
    g = loc_v1["geo_quality"].fillna("").astype(str).str.lower()
    geo_bonus += g.str.contains("rooftop").astype(int) * 6
    geo_bonus += g.str.contains("addr").astype(int) * 4
    geo_bonus += g.str.contains("city").astype(int) * 2
    geo_bonus += g.str.contains("county").astype(int) * 1

loc_v1["location_quality_score"] = (
    has_latlng.astype(int) * 10 +
    has_address.astype(int) * 5 +
    has_zip.astype(int) * 2 +
    (has_city & has_state).astype(int) * 1 +
    geo_bonus
).astype("int64")

# --- Deterministic tie-breaker using stable hash on school_id + location_key/address ---
tie_key = loc_v1["school_id"].astype(str)
if "location_key" in loc_v1.columns:
    tie_key = tie_key + "|" + loc_v1["location_key"].fillna("").astype(str)
elif "address_norm" in loc_v1.columns:
    tie_key = tie_key + "|" + loc_v1["address_norm"].fillna("").astype(str)
elif "address" in loc_v1.columns:
    tie_key = tie_key + "|" + loc_v1["address"].fillna("").astype(str)

loc_v1["_tie"] = tie_key.map(lambda s: stable_hash_to_float(s, salt="best_loc_v2"))

# --- Pick best row per school_id (highest score, then tie) ---
loc_best_raw = (
    loc_v1.sort_values(
        ["school_id", "location_quality_score", "_tie"],
        ascending=[True, False, True],
        kind="mergesort"
    )
    .drop_duplicates("school_id", keep="first")
    .copy()
)

# --- Rename best location columns into a clean namespace ---
rename_best = {}
for c in [
    "source_name","source_record_id","address","city","state","zip",
    "address_norm","zip_norm","geo_quality","location_key","created_at_utc",
    "latitude","longitude"
]:
    if c in loc_best_raw.columns:
        rename_best[c] = f"best_{c}"

# keep school_id + score + best_* columns
keep_cols = ["school_id", "location_quality_score"] + [c for c in rename_best.keys()]
loc_best = loc_best_raw[keep_cols].rename(columns=rename_best)

log(f"loc_best (1 per school_id): shape={loc_best.shape}")
df_info(loc_best, "loc_best", head=3)

# --- Attach to df_v1 (IMPORTANT: keep best_* in df_v1 for downstream + collapse) ---
if "df_v1" not in globals():
    raise ValueError("df_v1 not found. Load entities first before 01.3.")

df_v1 = df_v1.merge(loc_best, on="school_id", how="left", validate="one_to_one")
log(f"df_v1 (entities + best_location): shape={df_v1.shape}")

has_loc = df_v1["location_quality_score"].notna()
log(f"Best-location coverage: {has_loc.mean():.2%} ({int(has_loc.sum())}/{len(df_v1)})")

log("Section 01.3 complete.")


[    4.08s] Loading locations parquet: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook04/schools_locations_v1.parquet
[    4.15s] Loaded loc_v1: shape=(124619, 12)
[    4.15s] locations duplicate rate: 0.00%
[    4.50s] loc_best (1 per school_id): shape=(124619, 13)
[    4.50s] loc_best: shape=(124619, 13)


Unnamed: 0,school_id,location_quality_score,best_source_name,best_source_record_id,best_address,best_city,best_state,best_zip,best_address_norm,best_zip_norm,best_geo_quality,best_location_key,best_created_at_utc
102274,PRI_00000033,14,PSS_PRIVATE,33,700 ALBERT RAINS BLVD,GADSDEN,AL,35901,700 albert rains blvd,35901,addr_city_zip,a22ef354b69a44b7bdb97702537b246395a02080,2026-01-15T07:35:40Z
102275,PRI_00000044,14,PSS_PRIVATE,44,601 JAMES I HARRISON JR PKWY E,TUSCALOOSA,AL,35405,601 james i harrison jr pkwy e,35405,addr_city_zip,d05fcb9152bde12eb0cca0a09e54994a1acb3726,2026-01-15T07:35:40Z
102276,PRI_00000055,14,PSS_PRIVATE,55,2300 BEASLEY AVE NW,HUNTSVILLE,AL,35816,2300 beasley ave nw,35816,addr_city_zip,d82433808e5f568f6976d09b35060d1d3c594a33,2026-01-15T07:35:40Z


[    4.64s] df_v1 (entities + best_location): shape=(126616, 30)
[    4.64s] Best-location coverage: 98.42% (124619/126616)
[    4.64s] Section 01.3 complete.


## 01.5 CAIS → Private Merge Pass (Deterministic)

This section performs a targeted entity resolution pass to merge
CAIS-minted schools into their corresponding California private schools.

### Algorithm Flow

1. Select CAIS-only rows (`school_id` starts with `CAIS_`)
2. Build a candidate pool of California private schools
3. Normalize names and street addresses deterministically
4. For each CAIS school:
   - Compute name similarity
   - Compute street address similarity
   - Apply ZIP bonus if applicable
   - Select best candidate by weighted score
5. Auto-merge only when confidence thresholds are met:
   - name_sim ≥ 0.92
   - AND (addr_sim ≥ 0.70 OR ZIP match)
6. Transfer `has_cais=True` to private record
7. Drop merged CAIS rows
8. Enforce invariants (row counts, uniqueness, CAIS preservation)

This process is fully deterministic and reproducible.


In [555]:
# ============================
# 01.5 CAIS -> Private merge mapping (UPDATED: unique perfect + ambiguous perfect same addr+zip canon)
# ============================

from difflib import SequenceMatcher
import unicodedata

# ----------------------------
# Helpers
# ----------------------------
def _ascii(s: str) -> str:
    s = unicodedata.normalize("NFKD", s)
    return s.encode("ascii", "ignore").decode("ascii")

NAME_STOPWORDS = {
    "the","school","schools","of","and","for","at","in","on","a","an",
    "academy","campus","lower","upper","middle","high","elementary",
    "junior","senior","primary","secondary","charter"
}

def norm_basic(s: Any) -> str:
    if pd.isna(s):
        return ""
    s = _ascii(str(s)).lower()
    s = s.replace("&", " and ")
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

def name_key(s: Any) -> str:
    base = norm_basic(s)
    toks = [t for t in base.split() if t and t not in NAME_STOPWORDS]
    return " ".join(toks)

def sim(a: str, b: str) -> float:
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()

def normalize_zip5(z: Any) -> str:
    if pd.isna(z):
        return ""
    s = str(z).strip()
    if not s:
        return ""
    m = re.search(r"(\d{5})", s)
    return m.group(1) if m else ""

def canon_addr(s: Any) -> str:
    """
    Canonicalize address norm to reduce trivial mismatches:
    - normalize common suffixes (street->st, avenue->ave, etc.)
    - remove punctuation
    - collapse spaces
    """
    if pd.isna(s):
        return ""
    x = str(s).lower().strip()

    # common suffixes
    x = re.sub(r"\bstreet\b", "st", x)
    x = re.sub(r"\bavenue\b", "ave", x)
    x = re.sub(r"\broad\b", "rd", x)
    x = re.sub(r"\bboulevard\b", "blvd", x)
    x = re.sub(r"\bdrive\b", "dr", x)
    x = re.sub(r"\blane\b", "ln", x)
    x = re.sub(r"\bplace\b", "pl", x)
    x = re.sub(r"\bcourt\b", "ct", x)

    # drop punctuation and collapse
    x = re.sub(r"[^a-z0-9\s]", " ", x)
    x = re.sub(r"\s+", " ", x).strip()
    return x

# ----------------------------
# Build working frames from df (post-01.3)
# ----------------------------
assert "df" in globals(), "Expected `df` from Section 01.3 (entities + best_location)."

work = df.copy()

# Identify CAIS minted rows
cais_minted_mask = work["school_id"].astype(str).str.startswith("CAIS_")
cais = work.loc[cais_minted_mask].copy()

# Private CA pool excluding CAIS minted
priv_ca = work[
    (work.get("state") == "CA")
    & (work.get("is_private").fillna(False))
    & (~cais_minted_mask)
].copy()

log(f"CAIS minted rows: {len(cais)}")
log(f"Private CA pool (non-CAIS): {len(priv_ca)}")

# If nothing to do, short-circuit
if len(cais) == 0 or len(priv_ca) == 0:
    log("No CAIS minted rows or no Private CA pool. Skipping 01.5.")
else:
    # Normalize private pool once
    priv_ca["name_norm"] = priv_ca["name"].map(name_key)
    priv_ca["zip5"] = priv_ca.get("best_zip_norm", priv_ca.get("zip")).map(normalize_zip5) if ("best_zip_norm" in priv_ca.columns or "zip" in priv_ca.columns) else ""

    # ----------------------------
    # top K candidates per CAIS row
    # ----------------------------
    K = 10
    rows = []
    for _, r in cais.iterrows():
        from_id = r["school_id"]
        from_name = r.get("name", "")

        cais_name = name_key(from_name)

        # IMPORTANT: CAIS rows have no best_city/best_zip, so do NOT narrow by city/zip
        # Narrow only by first token to cut wild matches
        cand = priv_ca
        first = cais_name.split()[0] if cais_name else ""
        if first:
            cand_first = cand[cand["name_norm"].str.startswith(first)]
            if len(cand_first) > 0:
                cand = cand_first

        # score
        scored = []
        for _, p in cand.iterrows():
            n = sim(cais_name, p["name_norm"])
            scored.append((p["school_id"], p.get("name", ""), p.get("best_city", p.get("city", "")), p.get("best_zip_norm", p.get("zip", "")), n,
                           p.get("best_address", ""), p.get("best_address_norm", ""), p.get("best_zip_norm", ""), p.get("best_geo_quality", ""), p.get("best_location_key", "")))
        scored.sort(key=lambda x: x[4], reverse=True)

        top = scored[:K] if len(scored) > 0 else []
        for to_id, to_name, to_city, to_zip, n, best_addr, best_addr_norm, best_zip_norm, best_geo_q, best_loc_key in top:
            rows.append({
                "from_school_id": from_id,
                "from_name": from_name,
                "from_city": r.get("best_city", r.get("city", "")),
                "from_zip": r.get("best_zip", r.get("zip", "")),
                "to_school_id": to_id,
                "to_name": to_name,
                "to_city": to_city,
                "to_zip": to_zip,
                "name_sim": float(n),
                "city_match": False,   # CAIS has no city
                "zip_match": False,    # CAIS has no zip
                "best_address": best_addr,
                "best_address_norm": best_addr_norm,
                "best_zip_norm": best_zip_norm,
                "best_geo_quality": best_geo_q,
                "best_location_key": best_loc_key,
            })

    candidates = pd.DataFrame(rows)
    if len(candidates) == 0:
        log("No candidates generated. Skipping 01.5.")
    else:
        # Save Option A inspection artifacts
        OUT_OPT_A_CAND = NB05_DIR / "inspect_cais_optionA_candidates.csv"
        candidates.to_csv(OUT_OPT_A_CAND, index=False)

        # Best match per CAIS row (for inspection)
        best = (
            candidates.sort_values(["from_school_id", "name_sim"], ascending=[True, False])
                      .groupby("from_school_id", as_index=False)
                      .head(1)
        )

        OUT_OPT_A_SUM = NB05_DIR / "inspect_cais_optionA_summary.csv"
        best.to_csv(OUT_OPT_A_SUM, index=False)

        log(f"Top candidate matches (first 30 rows):")
        display(candidates.head(30))

        log("Best match per CAIS row:")
        display(best)

        log(f"Saved Option A summary: {OUT_OPT_A_SUM}")
        log(f"Saved Option A candidates: {OUT_OPT_A_CAND}")

        # ----------------------------
        # Forced merges
        # ----------------------------
        PERFECT = 1.0
        NEAR_PERFECT = 0.98

        perfect_counts = (
            candidates[candidates["name_sim"] >= PERFECT]
            .groupby("from_school_id")["to_school_id"]
            .nunique()
            .rename("perfect_match_count")
        )
        near_counts = (
            candidates[candidates["name_sim"] >= NEAR_PERFECT]
            .groupby("from_school_id")["to_school_id"]
            .nunique()
            .rename("near_perfect_count")
        )

        cais_summary = (
            candidates.groupby("from_school_id", as_index=False)
                      .first()[["from_school_id","from_name","from_city","from_zip"]]
                      .merge(perfect_counts.reset_index(), on="from_school_id", how="left")
                      .merge(near_counts.reset_index(), on="from_school_id", how="left")
        )
        cais_summary["perfect_match_count"] = cais_summary["perfect_match_count"].fillna(0).astype(int)
        cais_summary["near_perfect_count"] = cais_summary["near_perfect_count"].fillna(0).astype(int)

        log("CAIS perfect/near-perfect match counts:")
        display(cais_summary.sort_values(["perfect_match_count","near_perfect_count"], ascending=False))

        # (A) Unique perfect name match
        forced_unique = (
            candidates[candidates["name_sim"] >= PERFECT]
            .merge(cais_summary[cais_summary["perfect_match_count"] == 1][["from_school_id"]],
                   on="from_school_id", how="inner")
            .sort_values(["from_school_id","to_school_id"])
            .groupby("from_school_id", as_index=False)
            .head(1)
        )

        forced_map = forced_unique[[
            "from_school_id","to_school_id","from_name","to_name","name_sim"
        ]].copy()
        forced_map["method"] = "cais_unique_perfect_name"
        forced_map["confidence"] = "high"
        forced_map = forced_map[[
            "from_school_id","to_school_id","method","confidence","name_sim","from_name","to_name"
        ]]

        # Review bucket: ambiguous perfect OR near-perfect without perfect
        review = candidates.merge(cais_summary, on=["from_school_id","from_name","from_city","from_zip"], how="left")
        review = review[
            (review["name_sim"] >= NEAR_PERFECT)
            & (
                (review["perfect_match_count"] > 1)
                | (review["perfect_match_count"] == 0)
            )
        ].copy().sort_values(["from_school_id","name_sim"], ascending=[True, False])

        # (B) Upgrade ambiguous perfect matches when same canonical addr+zip (safe)
        ambig = review[review["name_sim"] >= PERFECT].copy()
        if len(ambig) > 0:
            ambig["addr_norm_canon"] = ambig["best_address_norm"].apply(canon_addr)

            same_addr = (
                ambig.groupby("from_school_id")
                     .agg(
                         n_targets=("to_school_id","nunique"),
                         n_addr=("addr_norm_canon","nunique"),
                         n_zip=("best_zip_norm","nunique"),
                     )
                     .reset_index()
            )
            qual = same_addr[(same_addr["n_targets"] > 1) & (same_addr["n_addr"] == 1) & (same_addr["n_zip"] == 1)]
            log(f"Ambiguous-perfect with same addr+zip (safe): {len(qual)}")
            display(qual)

            ambig_safe = ambig.merge(qual[["from_school_id"]], on="from_school_id", how="inner")

            # deterministic pick among targets
            chosen = (
                ambig_safe.sort_values(["from_school_id","to_school_id"])
                          .groupby("from_school_id", as_index=False)
                          .head(1)
            )

            forced_extra = chosen[[
                "from_school_id","to_school_id","from_name","to_name","name_sim"
            ]].copy()
            forced_extra["method"] = "cais_ambiguous_perfect_same_addr_zip"
            forced_extra["confidence"] = "high"
            forced_extra = forced_extra[forced_map.columns]

            forced_map = pd.concat([forced_map, forced_extra], ignore_index=True)
            forced_map = forced_map.drop_duplicates(["from_school_id","to_school_id","method"])

        # Save artifacts
        OUT_FORCED_CAIS = NB05_DIR / "forced_merges_cais_v1.csv"
        OUT_REVIEW_CAIS = NB05_DIR / "cais_merge_review_candidates_v1.csv"

        forced_map.to_csv(OUT_FORCED_CAIS, index=False)
        review.to_csv(OUT_REVIEW_CAIS, index=False)

        log(f"Forced CAIS merges total: {len(forced_map)}")
        display(forced_map)

        log(f"Review CAIS candidates: {len(review)}")
        display(review.head(50))

        log(f"Saved forced CAIS merge map: {OUT_FORCED_CAIS}")
        log(f"Saved CAIS review candidates: {OUT_REVIEW_CAIS}")

log("Section 01.5 complete (mapping + review). Do NOT drop rows here; apply merges in main merge phase.")


[    6.84s] CAIS minted rows: 25
[    6.84s] Private CA pool (non-CAIS): 2452
[    7.60s] Top candidate matches (first 30 rows):


Unnamed: 0,from_school_id,from_name,from_city,from_zip,to_school_id,to_name,to_city,to_zip,name_sim,city_match,zip_match,best_address,best_address_norm,best_zip_norm,best_geo_quality,best_location_key
0,CAIS_0fb36d096d98,The Academy,,,PRI_00068579,ST THOMAS MORE SCHOOL,ALHAMBRA,91803,0.0,False,False,2510 S FREMONT AVE,2510 s fremont ave,91803,addr_city_zip,cbc1ac53041878f4f22c3c2eb50ff5e7213d7e07
1,CAIS_0fb36d096d98,The Academy,,,PRI_00068604,HOLY ANGELS SCHOOL,ARCADIA,91007,0.0,False,False,360 CAMPUS DR,360 campus dr,91007,addr_city_zip,87b803ef48902f428a36fb798e73fabbad3bef1d
2,CAIS_0fb36d096d98,The Academy,,,PRI_00068615,OUR LADY OF FATIMA SCHOOL,ARTESIA,90701,0.0,False,False,18626 CLARKDALE AVE,18626 clarkdale ave,90701,addr_city_zip,b1c82819fc3b77e319eaf6a023e1b981eb23ae87
3,CAIS_0fb36d096d98,The Academy,,,PRI_00068626,ST BERNARD SCHOOL,BELLFLOWER,90706,0.0,False,False,9626 PARK ST,9626 park st,90706,addr_city_zip,fd1d2706f6d10831dc0f312678fc09c6a0ca9247
4,CAIS_0fb36d096d98,The Academy,,,PRI_00068637,ST DOMINIC SAVIO SCHOOL,BELLFLOWER,90706,0.0,False,False,9750 FOSTER RD,9750 foster rd,90706,addr_city_zip,6aa77e8dd6d0d1418b4797c6d884a5e3fc326485
5,CAIS_0fb36d096d98,The Academy,,,PRI_00068648,ST FRANCIS XAVIER SCHOOL,BURBANK,91504,0.0,False,False,3601 SCOTT RD,3601 scott rd,91504,addr_city_zip,e79dbcb72d7e6f65af04ff72ceaae4f38d50f89e
6,CAIS_0fb36d096d98,The Academy,,,PRI_00068659,ST ROBERT BELLARMINE ELEMENTARY SCHOOL,BURBANK,91501,0.0,False,False,154 N 5TH ST,154 n 5th st,91501,addr_city_zip,174f925144354d032a06984711080644658da3c6
7,CAIS_0fb36d096d98,The Academy,,,PRI_00068681,OUR LADY OF THE VALLEY SCHOOL,CANOGA PARK,91303,0.0,False,False,22041 GAULT ST,22041 gault st,91303,addr_city_zip,c38063645a2dc3900d652c2984698316fd50d0fb
8,CAIS_0fb36d096d98,The Academy,,,PRI_00068706,ST JOHN EUDES SCHOOL,CHATSWORTH,91311,0.0,False,False,9925 MASON AVE,9925 mason ave,91311,addr_city_zip,df2b489c9cd16ac83ed349b7eaacd32a839c3551
9,CAIS_0fb36d096d98,The Academy,,,PRI_00068717,OUR LADY OF THE ASSUMPTION SCHOOL,CLAREMONT,91711,0.0,False,False,611 W BONITA AVE,611 w bonita ave,91711,addr_city_zip,7accaca4e523e3f0a9097254b7d98726f21b7425


[    7.61s] Best match per CAIS row:


Unnamed: 0,from_school_id,from_name,from_city,from_zip,to_school_id,to_name,to_city,to_zip,name_sim,city_match,zip_match,best_address,best_address_norm,best_zip_norm,best_geo_quality,best_location_key
0,CAIS_0fb36d096d98,The Academy,,,PRI_00068579,ST THOMAS MORE SCHOOL,ALHAMBRA,91803,0.0,False,False,2510 S FREMONT AVE,2510 s fremont ave,91803,addr_city_zip,cbc1ac53041878f4f22c3c2eb50ff5e7213d7e07
10,CAIS_1ff9891ebd76,Millennium School,,,PRI_A1700371,MILLENNIUM SCHOOL OF SAN FRANCISCO,SAN FRANCISCO,94103,0.588235,False,False,245 VALENCIA ST,245 valencia st,94103,addr_city_zip,1cde91aeb183fc70d22e7914595be6cf4edf4ef0
11,CAIS_3584e195542a,German International School of Silicon Valley,,,PRI_A1900425,GERMAN INTL SCHOOL OF SILICON VALLEY,MOUNTAIN VIEW,94043,0.852459,False,False,310 EASY ST,310 easy st,94043,addr_city_zip,e5572f34fa16da3a6813458a7a83f8d15c20dc9e
12,CAIS_38bcb29a33d7,Field Middle School,,,PRI_00069415,MAYFIELD JUNIOR SCHOOL,PASADENA,91101,0.769231,False,False,405 S EUCLID AVE,405 s euclid ave,91101,addr_city_zip,310e4c6447de037a75ca8169570e983440cccb82
22,CAIS_42ecd004ae1c,"Convent & Stuart Hall, Schools of the Sacred H...",,,PRI_01608968,CONVENT & STUART HALL SCHOOLS OF THE SACRED HEART,SAN FRANCISCO,94115,0.820513,False,False,2222 BROADWAY ST,2222 broadway st,94115,addr_city_zip,6302fce45121612ccccaa8c2477f2efe01a3852a
23,CAIS_453b3e9b276c,The International School of San Francisco,,,PRI_A0101087,INTERNATIONAL SCHOOL OF LOS ANGELES-BURBANK,BURBANK,91506,0.7,False,False,1105 W RIVERSIDE DR,1105 w riverside dr,91506,addr_city_zip,0dd284d511e827dbcf84f67a24f027732b156534
29,CAIS_4eaacca69eb8,Cathedral School for Boys,,,PRI_A0900259,CATHEDRAL CATHOLIC HIGH SCHOOL,SAN DIEGO,92130,0.6875,False,False,5555 DEL MAR HEIGHTS RD,5555 del mar heights rd,92130,addr_city_zip,eb28953951d1a86922dbb328a9ba4194dec01302
32,CAIS_628a857df100,Crystal Springs Uplands School,,,PRI_BB910264,SANTA FE SPRINGS CHRISTIAN,SANTA FE SPRINGS,90670,0.571429,False,False,11434 OTTO ST,11434 otto st,90670,addr_city_zip,8fe9cb84870f1ca650f57254ff724f3f9b6f57e5
42,CAIS_725851f3335f,East Bay School,,,PRI_A1300209,EAST BAY SCHOOL FOR BOYS,BERKELEY,94704,0.761905,False,False,2340 DURANT AVE,2340 durant ave,94704,addr_city_zip,ebff643aae58d6d74ead13173c07431ef31534e8
52,CAIS_73d7a678711d,The Kehillah School,,,PRI_A0500422,KEHILLAH JEWISH HIGH SCHOOL,PALO ALTO,94303,0.695652,False,False,3900 FABIAN WAY,3900 fabian way,94303,addr_city_zip,668aa5505e0341d5de2abecd99750b5be3fe6629


[    7.61s] Saved Option A summary: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/inspect_cais_optionA_summary.csv
[    7.61s] Saved Option A candidates: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/inspect_cais_optionA_candidates.csv
[    7.61s] CAIS perfect/near-perfect match counts:


Unnamed: 0,from_school_id,from_name,from_city,from_zip,perfect_match_count,near_perfect_count
14,CAIS_a2ad7c2318c2,The Hamlin School,,,2,2
19,CAIS_c22442dc399b,La Scuola International School,,,2,2
18,CAIS_baab550e497c,The Harker School,,,1,1
0,CAIS_0fb36d096d98,The Academy,,,0,0
1,CAIS_1ff9891ebd76,Millennium School,,,0,0
2,CAIS_3584e195542a,German International School of Silicon Valley,,,0,0
3,CAIS_38bcb29a33d7,Field Middle School,,,0,0
4,CAIS_42ecd004ae1c,"Convent & Stuart Hall, Schools of the Sacred H...",,,0,0
5,CAIS_453b3e9b276c,The International School of San Francisco,,,0,0
6,CAIS_4eaacca69eb8,Cathedral School for Boys,,,0,0


[    7.62s] Ambiguous-perfect with same addr+zip (safe): 0


Unnamed: 0,from_school_id,n_targets,n_addr,n_zip


[    7.62s] Forced CAIS merges total: 1


Unnamed: 0,from_school_id,to_school_id,method,confidence,name_sim,from_name,to_name
0,CAIS_baab550e497c,PRI_A2190096,cais_unique_perfect_name,high,1.0,The Harker School,HARKER


[    7.62s] Review CAIS candidates: 4


Unnamed: 0,from_school_id,from_name,from_city,from_zip,to_school_id,to_name,to_city,to_zip,name_sim,city_match,zip_match,best_address,best_address_norm,best_zip_norm,best_geo_quality,best_location_key,perfect_match_count,near_perfect_count
93,CAIS_a2ad7c2318c2,The Hamlin School,,,PRI_00083553,HAMLIN SCHOOL,SAN FRANCISCO,94115,1.0,False,False,2120 BROADWAY ST,2120 broadway st,94115,addr_city_zip,5c30869c8bb514ac9dacc6eb2ca9040aea12582e,2,2
94,CAIS_a2ad7c2318c2,The Hamlin School,,,PRI_BB160346,THE HAMLIN SCHOOL,SAN FRANCISCO,94115,1.0,False,False,2120 BROADWAY,2120 broadway,94115,addr_city_zip,1f8059e37c6d9800cc31d1529ae359bb3f26723e,2,2
116,CAIS_c22442dc399b,La Scuola International School,,,PRI_A1900492,LA SCUOLA INTERNATIONAL SCHOOL,SAN FRANCISCO,94107,1.0,False,False,728 20TH ST,728 20th st,94107,addr_city_zip,b71144bc5fd7f61b1dabccb052a3dbf9810a1372,2,2
117,CAIS_c22442dc399b,La Scuola International School,,,PRI_A2190105,LA SCUOLA INTERNATIONAL SCHOOL,SAN FRANCISCO,94110,1.0,False,False,3250 18TH ST,3250 18th st,94110,addr_city_zip,53d95f4c6b81a1673f21443f85bb523dbdead771,2,2


[    7.63s] Saved forced CAIS merge map: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/forced_merges_cais_v1.csv
[    7.63s] Saved CAIS review candidates: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/cais_merge_review_candidates_v1.csv
[    7.63s] Section 01.5 complete (mapping + review). Do NOT drop rows here; apply merges in main merge phase.


In [557]:
cais = df[df["school_id"].astype(str).str.startswith("CAIS_")]
print(cais)

                school_id                                               name  \
0       CAIS_0fb36d096d98                                        The Academy   
1       CAIS_1ff9891ebd76                                  Millennium School   
2       CAIS_3584e195542a      German International School of Silicon Valley   
3       CAIS_38bcb29a33d7                                Field Middle School   
4       CAIS_42ecd004ae1c  Convent & Stuart Hall, Schools of the Sacred H...   
5       CAIS_453b3e9b276c          The International School of San Francisco   
6       CAIS_4eaacca69eb8                          Cathedral School for Boys   
7       CAIS_628a857df100                     Crystal Springs Uplands School   
8       CAIS_725851f3335f                                    East Bay School   
9       CAIS_73d7a678711d                                The Kehillah School   
10      CAIS_8aea417dcbe4                    Lycée Français de San Francisco   
11      CAIS_8b882bc69a4a               

In [559]:
# ============================
# 01.6 Apply forced CAIS merges
#   - Works even if CAIS_* sources were already collapsed away.
#   - Only requires dst/to IDs exist.
#   - Still sets has_cais=True on targets (if column exists).
#   - Produces: forced_merge_edges (to be unioned into merge_edges)
# ============================

FORCED_CAIS_PATH = NB05_DIR / "forced_merges_cais_v1.csv"
assert FORCED_CAIS_PATH.exists(), f"Missing forced merges file: {FORCED_CAIS_PATH}"

forced = pd.read_csv(FORCED_CAIS_PATH)

need = {"from_school_id", "to_school_id", "method", "confidence"}
missing = need - set(forced.columns)
assert not missing, f"forced_merges_cais_v1.csv missing columns: {missing}"

# Only keep high confidence
forced = forced[forced["confidence"].astype(str).str.lower().isin(["high"])].copy()

log(f"Loaded forced CAIS merges: {len(forced)}")
if display is not None:
    display(forced)
else:
    print(forced.head(20).to_string(index=False))

# ---- Use current entity df (df_v1 after 01.3/01.7, or df fallback) ----
entity_df = df_v1 if "df_v1" in globals() else df
id_set = set(entity_df["school_id"].astype(str).unique())

# ---- Validate targets exist; sources may be missing if already collapsed ----
bad_to = forced[~forced["to_school_id"].astype(str).isin(id_set)]
assert bad_to.empty, (
    "Forced merges have to_school_id not in current entity set (these targets must exist):\n"
    f"{bad_to.head(20)}"
)

missing_from = forced[~forced["from_school_id"].astype(str).isin(id_set)]
if len(missing_from) > 0:
    log(f"NOTE: {len(missing_from)} forced from_school_id values are not in entity_df (likely already collapsed). This is OK.")
    if display is not None:
        display(missing_from[["from_school_id","to_school_id","method","confidence"]].head(20))

# ---- Prevent self-loops ----
self_loops = forced[forced["from_school_id"].astype(str) == forced["to_school_id"].astype(str)]
assert self_loops.empty, f"Forced merges contain self loops:\n{self_loops}"

# ---- (Optional) cycle check on forced-only edges ----
edge_map = dict(zip(forced["from_school_id"].astype(str), forced["to_school_id"].astype(str)))

def _detect_cycle(edge_map: dict) -> list:
    seen_global = set()
    for start in edge_map.keys():
        if start in seen_global:
            continue
        path = []
        cur = start
        seen_local = set()
        while cur in edge_map:
            if cur in seen_local:
                i = path.index(cur)
                return path[i:] + [cur]
            seen_local.add(cur)
            path.append(cur)
            cur = edge_map[cur]
        seen_global |= seen_local
    return []

cycle = _detect_cycle(edge_map)
assert not cycle, f"Cycle detected in forced merges: {cycle}"

# ---- Update CAIS tag on targets (important!) ----
if "has_cais" in entity_df.columns:
    tgt_ids = forced["to_school_id"].astype(str).unique().tolist()
    before = int(entity_df.loc[entity_df["school_id"].astype(str).isin(tgt_ids), "has_cais"].fillna(False).sum())
    entity_df.loc[entity_df["school_id"].astype(str).isin(tgt_ids), "has_cais"] = True
    after = int(entity_df.loc[entity_df["school_id"].astype(str).isin(tgt_ids), "has_cais"].fillna(False).sum())
    log(f"Applied has_cais=True to forced targets: before={before} after={after}")

# ---- Build merge edges frame for union-find ----
forced_merge_edges = forced.rename(columns={
    "from_school_id": "src_school_id",
    "to_school_id": "dst_school_id",
}).copy()

forced_merge_edges["merge_source"] = "forced_cais"
forced_merge_edges["merge_reason"] = forced_merge_edges["method"].astype(str)
forced_merge_edges["merge_confidence"] = forced_merge_edges["confidence"].astype(str)

keep_cols = ["src_school_id", "dst_school_id", "merge_source", "merge_reason", "merge_confidence"]
forced_merge_edges = forced_merge_edges[keep_cols].drop_duplicates()

log(f"forced_merge_edges ready: {forced_merge_edges.shape}")
if display is not None:
    display(forced_merge_edges)
else:
    print(forced_merge_edges.to_string(index=False))

log("01.6 complete. Next: UNION these edges into your main merge-apply step (01.7).")


[   12.25s] Loaded forced CAIS merges: 1


Unnamed: 0,from_school_id,to_school_id,method,confidence,name_sim,from_name,to_name
0,CAIS_baab550e497c,PRI_A2190096,cais_unique_perfect_name,high,1.0,The Harker School,HARKER


[   12.33s] Applied has_cais=True to forced targets: before=0 after=1
[   12.34s] forced_merge_edges ready: (1, 5)


Unnamed: 0,src_school_id,dst_school_id,merge_source,merge_reason,merge_confidence
0,CAIS_baab550e497c,PRI_A2190096,forced_cais,cais_unique_perfect_name,high


[   12.34s] 01.6 complete. Next: UNION these edges into your main merge-apply step (01.7).


In [561]:
# ============================
# 01.7 Apply merge edges (union-find) + collapse df_v1 (UPDATED v3.1)
#   - Avoids duplicate best_* columns when df_v1 already contains them
#   - Optionally "refreshes" best_* from loc_best deterministically
# ============================

from typing import Dict, List

# ---- config: do you want to overwrite existing best_* with loc_best? ----
REFRESH_BEST_FROM_LOC_BEST = False  # set True if you want loc_best to replace existing best_*

# --- Collect merge edges ---
all_edges = []
if "forced_merge_edges" in globals() and forced_merge_edges is not None and len(forced_merge_edges) > 0:
    all_edges.append(forced_merge_edges)

merge_edges = (
    pd.concat(all_edges, ignore_index=True)
    if len(all_edges)
    else pd.DataFrame(columns=["src_school_id","dst_school_id","merge_source","merge_reason","merge_confidence"])
)

log(f"merge_edges: {merge_edges.shape}")
if display is not None:
    display(merge_edges.head(20))
else:
    print(merge_edges.head(20).to_string(index=False))

# --- Union-Find (Disjoint Set Union) ---
class DSU:
    def __init__(self):
        self.parent: Dict[str, str] = {}
        self.rank: Dict[str, int] = {}

    def find(self, x: str) -> str:
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0
            return x
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: str, b: str) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            self.parent[ra] = rb
        elif self.rank[ra] > self.rank[rb]:
            self.parent[rb] = ra
        else:
            self.parent[rb] = ra
            self.rank[ra] += 1

dsu = DSU()

for r in merge_edges.itertuples(index=False):
    dsu.union(str(r.src_school_id), str(r.dst_school_id))

# --- Build component groups over df ids + edge endpoints ---
ids_df = df_v1["school_id"].astype(str).unique().tolist()
ids_edges = (
    pd.concat([merge_edges["src_school_id"], merge_edges["dst_school_id"]], ignore_index=True)
      .astype(str).unique().tolist()
    if len(merge_edges) else []
)
all_ids = sorted(set(ids_df) | set(ids_edges))

comp = pd.DataFrame({
    "school_id": all_ids,
    "root": [dsu.find(s) for s in all_ids]
})

dst_pref = set(merge_edges["dst_school_id"].astype(str).tolist()) if len(merge_edges) else set()

def pick_canonical(ids: List[str]) -> str:
    ids = list(set(ids))

    dst_ids = [i for i in ids if i in dst_pref]
    pool = dst_ids if dst_ids else ids

    non_cais = [i for i in pool if not str(i).startswith("CAIS_")]
    pool = non_cais if non_cais else pool

    pool = sorted(pool, key=lambda x: stable_hash_to_float(x, salt="canonical_pick_v31"))
    return pool[0]

canon_map = (
    comp.groupby("root")["school_id"]
        .apply(lambda s: pick_canonical(s.tolist()))
        .rename("canonical_school_id")
        .reset_index()
)

id_to_canon_df = comp.merge(canon_map, on="root", how="left")[["school_id","canonical_school_id"]]
id_to_canon = dict(zip(id_to_canon_df["school_id"], id_to_canon_df["canonical_school_id"]))
log(f"Canonical map built: {len(id_to_canon):,} ids mapped")

# --- Apply mapping & collapse ---
df_pre = df_v1.copy()
df_pre["canonical_school_id"] = df_pre["school_id"].astype(str).map(id_to_canon)

df_pre["_is_canonical_row"] = (df_pre["school_id"].astype(str) == df_pre["canonical_school_id"].astype(str)).astype(int)
df_pre["_tie"] = df_pre["school_id"].astype(str).map(lambda s: stable_hash_to_float(s, salt="collapse_tie_v31"))

df_collapsed = (
    df_pre.sort_values(
            ["canonical_school_id", "_is_canonical_row", "_tie"],
            ascending=[True, False, True],
            kind="mergesort"
        )
        .groupby("canonical_school_id", as_index=False)
        .head(1)
        .drop(columns=["_is_canonical_row", "_tie"])
)

df_collapsed["school_id"] = df_collapsed["canonical_school_id"]
df_collapsed = df_collapsed.drop(columns=["canonical_school_id"])

log(f"df_v1 before collapse: {df_v1.shape} | after collapse: {df_collapsed.shape}")

# --- Sanity: forced src ids should be gone IF they existed in df_v1 ---
if len(merge_edges):
    forced_src = set(merge_edges["src_school_id"].astype(str).tolist())
    still_there = df_collapsed[df_collapsed["school_id"].astype(str).isin(forced_src)]
    log(f"Forced-src still present after collapse: {len(still_there)} (expected 0 if src existed in df_v1)")
    if len(still_there) and display is not None:
        display(still_there[[c for c in ["school_id","name","city","state"] if c in still_there.columns]].head(20))

# -------------------------------------------------------------------
# Merge loc_best ONLY if needed (or if refreshing)
# -------------------------------------------------------------------
best_cols_existing = [c for c in df_collapsed.columns if c == "location_quality_score" or c.startswith("best_")]
has_best_already = len(best_cols_existing) > 0

if "loc_best" in globals() and loc_best is not None and len(loc_best) > 0:
    if REFRESH_BEST_FROM_LOC_BEST:
        # drop existing best columns then merge cleanly
        if has_best_already:
            df_collapsed = df_collapsed.drop(columns=best_cols_existing, errors="ignore")
            log(f"Dropped existing best_* columns (refresh mode): {len(best_cols_existing)}")

        df_collapsed = df_collapsed.merge(loc_best, on="school_id", how="left", validate="one_to_one")
        log(f"Merged loc_best (refresh mode): {df_collapsed.shape}")

    else:
        # only merge if best_* not already present
        if has_best_already:
            log(f"Skipping loc_best merge: best_* already present ({len(best_cols_existing)} cols).")
        else:
            df_collapsed = df_collapsed.merge(loc_best, on="school_id", how="left", validate="one_to_one")
            log(f"Merged loc_best (missing mode): {df_collapsed.shape}")
else:
    log("WARNING: loc_best not found; df_v1 will not include best_* columns (unless already present).")

# Update df_v1
df_v1 = df_collapsed
log("01.7 complete. df_v1 is canonicalized and ready for downstream geo_query / geocoding.")


[   14.08s] merge_edges: (1, 5)


Unnamed: 0,src_school_id,dst_school_id,merge_source,merge_reason,merge_confidence
0,CAIS_baab550e497c,PRI_A2190096,forced_cais,cais_unique_perfect_name,high


[   15.07s] Canonical map built: 126,616 ids mapped
[   15.36s] df_v1 before collapse: (126616, 30) | after collapse: (126615, 30)
[   15.36s] Forced-src still present after collapse: 0 (expected 0 if src existed in df_v1)
[   15.36s] Skipping loc_best merge: best_* already present (12 cols).
[   15.36s] 01.7 complete. df_v1 is canonicalized and ready for downstream geo_query / geocoding.


## 02. Classify Location Completeness

Even though `address` is 100% non-null, we still want a **deterministic location taxonomy** for:
- robust geocoding inputs (full vs partial query strings)
- flagging malformed or low-quality addresses
- avoiding bad merges based on weak geo

We will create:
- `address_norm` (normalized address string)
- `location_status` (full / partial / weak / missing)
- `geo_query` + `geo_query_type` (what we will send to the geocoder / cache key)
- basic coverage report for each location tier

**Output of this section:** `df_loc` (df_v1 + location fields)


In [564]:
# ============================
# 02.1–02.3 Location normalization + geo_query (UPDATED: prefer best_* columns)
# ============================

import unicodedata

# ---------- helpers ----------
def _clean_str(x: Any) -> str:
    if pd.isna(x):
        return ""
    return str(x).strip()

def normalize_text(s: str) -> str:
    s = s.lower()
    s = re.sub(r"[^\w\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

def normalize_zip(z: Any) -> str:
    if pd.isna(z):
        return ""
    s = str(z).strip()
    if re.fullmatch(r"\d+\.0", s):
        s = s.split(".")[0]
    digits = re.sub(r"\D", "", s)
    return digits[:5] if len(digits) >= 5 else digits

def build_address_norm(address: str, city: str, state: str, zip5: str) -> str:
    parts = [normalize_text(address), normalize_text(city), normalize_text(state), normalize_zip(zip5)]
    parts = [p for p in parts if p]
    return " | ".join(parts)

def normalize_geocode_query(s: Any) -> str:
    if pd.isna(s):
        return ""
    s = str(s).replace("...", " ")
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    s = re.sub(r"[\r\n\t]+", " ", s)
    s = re.sub(r"\bview on\b.*$", "", s, flags=re.IGNORECASE).strip()
    s = re.sub(r"\s+", " ", s).strip()
    s = re.sub(r"\s*,\s*", ", ", s)
    s = re.sub(r"(,\s*){2,}", ", ", s).strip(" ,")
    return s

# ---------- choose location columns (prefer best_*) ----------
def pick_col(df: pd.DataFrame, preferred: str, fallback: str) -> str:
    return preferred if preferred in df.columns else fallback

df_loc = df_v1.copy()

ADDR_COL  = pick_col(df_loc, "best_address", "address")
CITY_COL  = pick_col(df_loc, "best_city", "city")
STATE_COL = pick_col(df_loc, "best_state", "state")
ZIP_COL   = pick_col(df_loc, "best_zip", "zip")

log(f"Using location cols: addr={ADDR_COL}, city={CITY_COL}, state={STATE_COL}, zip={ZIP_COL}")

# ---------- normalize fields ----------
df_loc["zip5"] = df_loc[ZIP_COL].apply(normalize_zip) if ZIP_COL in df_loc.columns else ""
df_loc["name_norm"]  = df_loc["name"].apply(lambda x: normalize_text(_clean_str(x)))
df_loc["city_norm"]  = df_loc[CITY_COL].apply(lambda x: normalize_text(_clean_str(x))) if CITY_COL in df_loc.columns else ""
df_loc["state_norm"] = df_loc[STATE_COL].apply(lambda x: normalize_text(_clean_str(x))) if STATE_COL in df_loc.columns else ""

df_loc["address_norm"] = df_loc.apply(
    lambda r: build_address_norm(
        _clean_str(r.get(ADDR_COL)),
        _clean_str(r.get(CITY_COL)),
        _clean_str(r.get(STATE_COL)),
        r.get("zip5"),
    ),
    axis=1
)

# ---------- classify completeness ----------
def classify_location(row: pd.Series) -> str:
    address = _clean_str(row.get(ADDR_COL))
    city    = _clean_str(row.get(CITY_COL))
    state   = _clean_str(row.get(STATE_COL))
    zip5    = _clean_str(row.get("zip5"))

    has_city_state = bool(city) and bool(state)
    has_zip = len(zip5) == 5

    addr_norm = normalize_text(address)
    looks_like_po_box = ("po box" in addr_norm) or (re.search(r"\bp\s*o\s*box\b", addr_norm) is not None)
    too_short = len(addr_norm) < 6
    no_number = re.search(r"\d", address) is None

    if not has_city_state:
        return "missing_city_state"
    if not address or too_short:
        return "partial_city_state"
    if looks_like_po_box:
        return "weak_po_box"
    if no_number:
        return "weak_no_street_number"
    return "full_address" if has_zip else "full_address_no_zip"

df_loc["location_status"] = df_loc.apply(classify_location, axis=1)

# ---------- build geo_query ----------
def build_geo_query(row: pd.Series) -> Tuple[str, str]:
    address = _clean_str(row.get(ADDR_COL))
    city    = _clean_str(row.get(CITY_COL))
    state   = _clean_str(row.get(STATE_COL))
    zip5    = _clean_str(row.get("zip5"))

    status = row["location_status"]

    if status in ("full_address", "full_address_no_zip", "weak_no_street_number", "weak_po_box"):
        if zip5:
            return f"{address}, {city}, {state} {zip5}", "address_city_state_zip"
        return f"{address}, {city}, {state}", "address_city_state"

    if status == "partial_city_state":
        name = _clean_str(row.get("name"))
        return f"{name}, {city}, {state}", "name_city_state"

    name = _clean_str(row.get("name"))
    return f"{name}", "name_only"

geo_parts = df_loc.apply(build_geo_query, axis=1, result_type="expand")
df_loc["geo_query"] = geo_parts[0].apply(normalize_geocode_query)
df_loc["geo_query_type"] = geo_parts[1]

# ---------- quick sanity check ----------
display(df_loc.loc[df_loc["school_id"].isin(["PRI_A2190096"]),
                   ["school_id","name",ADDR_COL,CITY_COL,STATE_COL,ZIP_COL,"location_status","geo_query_type","geo_query"]])

# ---------- completeness report ----------
report = (
    df_loc["location_status"]
    .value_counts(dropna=False)
    .rename_axis("location_status")
    .reset_index(name="count")
)
report["pct"] = (report["count"] / len(df_loc) * 100).round(2)
display(report)

report.to_csv(OUT_GEO_REPORT, index=False)
log(f"Saved geo coverage report: {OUT_GEO_REPORT}")

log("Section 02.1–02.3 complete (best_* preferred).")


[   20.55s] Using location cols: addr=best_address, city=best_city, state=best_state, zip=best_zip


Unnamed: 0,school_id,name,best_address,best_city,best_state,best_zip,location_status,geo_query_type,geo_query
119411,PRI_A2190096,HARKER,500 SARATOGA AVE,SAN JOSE,CA,95129,full_address,address_city_state_zip,"500 SARATOGA AVE, SAN JOSE, CA 95129"


Unnamed: 0,location_status,count,pct
0,full_address,112557,88.9
1,weak_po_box,11817,9.33
2,missing_city_state,1996,1.58
3,weak_no_street_number,242,0.19
4,partial_city_state,3,0.0


[   25.37s] Saved geo coverage report: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/geo_coverage_report.csv
[   25.37s] Section 02.1–02.3 complete (best_* preferred).


In [566]:
# ============================
# 02.4 PO Box geocode policy (recommended)
# ============================

po_mask = df_loc["location_status"].eq("weak_po_box")

# Replace geo_query for PO boxes with name + city/state (more likely to resolve to campus area)
df_loc.loc[po_mask, "geo_query_type"] = "name_city_state_po_box"
df_loc.loc[po_mask, "geo_query"] = (
    df_loc.loc[po_mask, "name"].astype(str) + ", " +
    df_loc.loc[po_mask, "best_city"].astype(str) + ", " +
    df_loc.loc[po_mask, "best_state"].astype(str)
).apply(normalize_geocode_query)

log(f"PO Box rows adjusted for geocoding: {int(po_mask.sum())}")


[   28.92s] PO Box rows adjusted for geocoding: 11817


In [568]:
# 1) Make sure no PO box rows are still using address_* query types
bad = df_loc[df_loc["location_status"].eq("weak_po_box") & df_loc["geo_query_type"].str.contains("address", na=False)]
log(f"PO box rows still address-based: {len(bad)}")
display(bad[["school_id","name","best_address","best_city","best_state","geo_query_type","geo_query"]].head(10))

# 2) Spot-check a few updated queries
display(
    df_loc[df_loc["location_status"].eq("weak_po_box")]
      [["school_id","name","best_address","best_city","best_state","geo_query_type","geo_query"]]
      .head(15)
)


[   33.08s] PO box rows still address-based: 0


Unnamed: 0,school_id,name,best_address,best_city,best_state,geo_query_type,geo_query


Unnamed: 0,school_id,name,best_address,best_city,best_state,geo_query_type,geo_query
102284,PRI_00000237,HOLY FAMILY CRISTO REY CATHOLIC HIGH SCHOOL,PO BOX 19577,BIRMINGHAM,AL,name_city_state_po_box,"HOLY FAMILY CRISTO REY CATHOLIC HIGH SCHOOL, B..."
102293,PRI_00000645,CHRIST THE KING CATHOLIC SCHOOL,PO BOX 1890,DAPHNE,AL,name_city_state_po_box,"CHRIST THE KING CATHOLIC SCHOOL, DAPHNE, AL"
102302,PRI_00000995,LOWNDES ACADEMY,PO BOX 99,LOWNDESBORO,AL,name_city_state_po_box,"LOWNDES ACADEMY, LOWNDESBORO, AL"
102303,PRI_00001026,ABBEVILLE CHRISTIAN ACADEMY,PO BOX 9,ABBEVILLE,AL,name_city_state_po_box,"ABBEVILLE CHRISTIAN ACADEMY, ABBEVILLE, AL"
102317,PRI_00001616,THE ALTAMONT SCHOOL,PO BOX 131429,BIRMINGHAM,AL,name_city_state_po_box,"THE ALTAMONT SCHOOL, BIRMINGHAM, AL"
102321,PRI_00001864,EDGEWOOD ACADEMY,PO BOX 160,ELMORE,AL,name_city_state_po_box,"EDGEWOOD ACADEMY, ELMORE, AL"
102324,PRI_00001944,DECATUR HERITAGE CHRISTIAN ACADEMY,PO BOX 5659,DECATUR,AL,name_city_state_po_box,"DECATUR HERITAGE CHRISTIAN ACADEMY, DECATUR, AL"
102325,PRI_00002154,LYMAN WARD MILITARY ACADEMY,PO BOX 550,CAMP HILL,AL,name_city_state_po_box,"LYMAN WARD MILITARY ACADEMY, CAMP HILL, AL"
102330,PRI_00002289,PIKE LIBERAL ARTS SCHOOL,PO BOX 329,TROY,AL,name_city_state_po_box,"PIKE LIBERAL ARTS SCHOOL, TROY, AL"
102332,PRI_00002405,SPRINGWOOD SCHOOL,PO BOX 1030,LANETT,AL,name_city_state_po_box,"SPRINGWOOD SCHOOL, LANETT, AL"


In [570]:
# CA-only view of the geo coverage report
df_ca = df_loc[df_loc["best_state"].fillna(df_loc["state"]).astype(str).eq("CA")].copy()
log(f"CA rows in df_loc: {len(df_ca):,}")

ca_report = (
    df_ca["location_status"]
      .value_counts(dropna=False)
      .rename_axis("location_status")
      .reset_index(name="count")
)
ca_report["pct"] = (ca_report["count"] / len(df_ca) * 100).round(2)
display(ca_report)

# CA-only PO box sample
display(
    df_ca[df_ca["location_status"].eq("weak_po_box")]
      [["school_id","name","best_address","best_city","best_state","geo_query_type","geo_query"]]
      .head(25)
)

[   37.59s] CA rows in df_loc: 14,848


Unnamed: 0,location_status,count,pct
0,full_address,12039,81.08
1,missing_city_state,1996,13.44
2,weak_po_box,810,5.46
3,weak_no_street_number,3,0.02


Unnamed: 0,school_id,name,best_address,best_city,best_state,geo_query_type,geo_query
102715,PRI_00075225,ST JOSEPH SCHOOL,PO BOX 3246,FREMONT,CA,name_city_state_po_box,"ST JOSEPH SCHOOL, FREMONT, CA"
102828,PRI_00077867,ST MARY'S HIGH SCHOOL,PO BOX 7247,STOCKTON,CA,name_city_state_po_box,"ST MARY'S HIGH SCHOOL, STOCKTON, CA"
102878,PRI_00080949,CATE SCHOOL,PO BOX 5005,CARPINTERIA,CA,name_city_state_po_box,"CATE SCHOOL, CARPINTERIA, CA"
102947,PRI_00083564,MIDLAND SCHOOL CORPORATION,PO BOX 8,LOS OLIVOS,CA,name_city_state_po_box,"MIDLAND SCHOOL CORPORATION, LOS OLIVOS, CA"
102979,PRI_00087208,I'SOT SCHOOL,PO BOX 346,CANBY,CA,name_city_state_po_box,"I'SOT SCHOOL, CANBY, CA"
103001,PRI_00088813,THE WALDORF SCHOOL OF MENDOCINO COUNTY,PO BOX 349,CALPELLA,CA,name_city_state_po_box,"THE WALDORF SCHOOL OF MENDOCINO COUNTY, CALPEL..."
103006,PRI_00089395,PAGE ACADEMY,PO BOX 10909,COSTA MESA,CA,name_city_state_po_box,"PAGE ACADEMY, COSTA MESA, CA"
103067,PRI_00095285,FEATHER RIVER ADVENTIST SCHOOL,PO BOX 2811,OROVILLE,CA,name_city_state_po_box,"FEATHER RIVER ADVENTIST SCHOOL, OROVILLE, CA"
108545,PRI_01613388,LAKE TAHOE PREPATORY SCHOOL,P.O BOX 2180,OLYMPIC VALLEY,CA,name_city_state_po_box,"LAKE TAHOE PREPATORY SCHOOL, OLYMPIC VALLEY, CA"
108649,PRI_01897608,TRUTH TABERNACLE CHRISTIAN SCHOOL,PO BOX 5393,FRESNO,CA,name_city_state_po_box,"TRUTH TABERNACLE CHRISTIAN SCHOOL, FRESNO, CA"


In [572]:
# What does "missing_city_state" look like in CA?
miss = df_ca[df_ca["location_status"].eq("missing_city_state")].copy()
display(miss[["school_id","name","best_address","best_city","best_state","best_zip","geo_query_type","geo_query"]].head(25))

# Are they truly missing or just blank/NA?
log("best_city non-empty: " + str((miss["best_city"].fillna("").astype(str).str.strip() != "").mean()))
log("best_state non-empty: " + str((miss["best_state"].fillna("").astype(str).str.strip() != "").mean()))
log("best_zip non-empty: " + str((miss["best_zip"].fillna("").astype(str).str.strip() != "").mean()))


Unnamed: 0,school_id,name,best_address,best_city,best_state,best_zip,geo_query_type,geo_query
124628,CAIS_0b0a82aa7039,Escuela Bilingüe Internacional,,,,,name_only,Escuela Bilingue Internacional
124619,CAIS_0fb36d096d98,The Academy,,,,,name_only,The Academy
124641,CAIS_1ff9891ebd76,Millennium School,,,,,name_only,Millennium School
124630,CAIS_3584e195542a,German International School of Silicon Valley,,,,,name_only,German International School of Silicon Valley
124620,CAIS_37b76c325857,Bayhill High School,,,,,name_only,Bayhill High School
124629,CAIS_38bcb29a33d7,Field Middle School,,,,,name_only,Field Middle School
124625,CAIS_42ecd004ae1c,"Convent & Stuart Hall, Schools of the Sacred H...",,,,,name_only,"Convent & Stuart Hall, Schools of the Sacred H..."
124635,CAIS_453b3e9b276c,The International School of San Francisco,,,,,name_only,The International School of San Francisco
124623,CAIS_4eaacca69eb8,Cathedral School for Boys,,,,,name_only,Cathedral School for Boys
124626,CAIS_628a857df100,Crystal Springs Uplands School,,,,,name_only,Crystal Springs Uplands School


[   42.98s] best_city non-empty: 0.0
[   42.98s] best_state non-empty: 0.0
[   42.98s] best_zip non-empty: 0.0


In [574]:
# ============================
# 02.4 Patch name_only queries for CAIS / CA_* rows (safe state inference)
#   - CAIS_* and CA_* are California-only sources in your system
#   - Converts name_only -> name_state to reduce bad geocodes
# ============================

df_loc = df_loc.copy()

is_name_only = df_loc["geo_query_type"].astype(str).eq("name_only")
missing_city_state = df_loc["location_status"].astype(str).eq("missing_city_state")

# Infer CA from ID prefix (safe for your pipeline)
id_str = df_loc["school_id"].astype(str)
is_ca_infer = id_str.str.startswith("CAIS_") | id_str.str.startswith("CA_")

mask = is_name_only & missing_city_state & is_ca_infer

log(f"Patch candidates (name_only + missing_city_state + inferred CA): {int(mask.sum()):,}")

# Update geo_query + type
df_loc.loc[mask, "geo_query"] = (
    df_loc.loc[mask, "name"].fillna("").astype(str).str.strip() + ", CA"
).apply(normalize_geocode_query)

df_loc.loc[mask, "geo_query_type"] = "name_state_ca"

# Optional: track for auditing / later enrichment work
df_loc["needs_location_enrichment"] = False
df_loc.loc[missing_city_state, "needs_location_enrichment"] = True

# Quick check
display(df_loc.loc[mask, ["school_id","name","geo_query_type","geo_query"]].head(25))

log("02.4 complete: CA-inferred name_only rows upgraded to name_state_ca.")


[   44.20s] Patch candidates (name_only + missing_city_state + inferred CA): 1,990


Unnamed: 0,school_id,name,geo_query_type,geo_query
124628,CAIS_0b0a82aa7039,Escuela Bilingüe Internacional,name_state_ca,"Escuela Bilingue Internacional, CA"
124619,CAIS_0fb36d096d98,The Academy,name_state_ca,"The Academy, CA"
124641,CAIS_1ff9891ebd76,Millennium School,name_state_ca,"Millennium School, CA"
124630,CAIS_3584e195542a,German International School of Silicon Valley,name_state_ca,"German International School of Silicon Valley, CA"
124620,CAIS_37b76c325857,Bayhill High School,name_state_ca,"Bayhill High School, CA"
124629,CAIS_38bcb29a33d7,Field Middle School,name_state_ca,"Field Middle School, CA"
124625,CAIS_42ecd004ae1c,"Convent & Stuart Hall, Schools of the Sacred H...",name_state_ca,"Convent & Stuart Hall, Schools of the Sacred H..."
124635,CAIS_453b3e9b276c,The International School of San Francisco,name_state_ca,"The International School of San Francisco, CA"
124623,CAIS_4eaacca69eb8,Cathedral School for Boys,name_state_ca,"Cathedral School for Boys, CA"
124626,CAIS_628a857df100,Crystal Springs Uplands School,name_state_ca,"Crystal Springs Uplands School, CA"


[   44.22s] 02.4 complete: CA-inferred name_only rows upgraded to name_state_ca.


# Section 03 — Geocoding & Location Resolution (v1)

## Purpose

Section 03 constructs a **robust, deterministic geospatial layer** for schools in the Smart School Finder pipeline.

Key goals:
- Preserve **every possible campus / address** ever observed
- Enable **safe, idempotent geocoding**
- Produce a **single best location per school** for downstream analytics
- Support **spatial duplicate detection** in later merge stages

This section does **not** merge schools — it prepares the spatial foundation required for reliable entity resolution.

---

## 03.1 Normalize Location Inputs

**Objective**
- Standardize raw address strings into geocode-safe query keys
- Preserve original provenance while enabling canonical joins

**Key outputs**
- `geo_query` (raw, normalized)
- `geo_query_type` (address_city_state_zip, city_state, etc.)

---

## 03.2 Generate Geocoding Keys

**Objective**
- Produce deterministic `(geo_query, geo_query_type)` pairs
- Ensure identical addresses across sources map to the same key

**Notes**
- Keys are canonicalized via `canon_q`
- This step is idempotent by construction

---

## 03.3 Build Initial Geocode TODO

**Objective**
- Identify which location keys require geocoding
- Scope intentionally limited to control API usage

**Initial inclusion rules**
- Bay Area schools
- Enriched schools (CAIS, IB, Montessori, Waldorf)

**Output**
- `todo_to_geocode` (initial batch)

---

## 03.4 Geocode Cache Management

**Objective**
- Load, normalize, and version the geocode cache
- Ensure:
  - No duplicate keys
  - Best result retained per key
  - Legacy rows backfilled safely

**Artifact**
- `geo_cache` (versioned, append-safe)

---

## 03.5 Attach Geocode Results to Location Rows

**Objective**
- Join cached geocode results back onto location rows
- Preserve:
  - `geo_status`
  - confidence
  - provider metadata

**Output**
- `df_loc_geo` (location rows + geo metadata)

---

## 03.6 Build `schools_locations` (FINAL)

**Objective**
- Create the canonical **many-rows-per-school** location table
- Preserve every campus / address ever seen
- Ensure deterministic, stable IDs

**Key logic**
- `geo_query_clean`:
  - Extracts the **last valid US address**
  - Fixes double-concatenated or polluted strings
- `location_id`:
  - Deterministic hash of `(school_id, geo_query_clean)`
- One row per `(school_id, address)` pair

**Artifact**
- `schools_locations_v1.parquet`
- `schools_locations_v1.csv`

---

## 03.7 Select Best Location per School

**Objective**
- Reduce many locations → **one best location per school**
- Rank using:
  1. Presence of lat/lng
  2. Confidence
  3. Status
  4. Recency

**Output**
- `best_location` (one row per school)

---

## 03.8 Attach Best Location to `schools_master`

**Objective**
- Enrich the canonical one-row-per-school table
- Avoid any merging or row duplication

**Fields added**
- `best_geo_query`
- `best_latitude`
- `best_longitude`
- `best_geo_confidence_label`
- `best_geo_confidence_score`
- `best_geo_status`
- `has_latlng` (strict boolean, no nulls)

**Artifact**
- `schools_master_geo_v1.parquet`
- `schools_master_geo_v1.csv`

---

## Guarantees After Section 03

After completing Section 03:

- Every observed address is preserved
- Every geocode key is deterministic and cache-safe
- Each school has **at most one best location**
- `has_latlng` is reliable and warning-free
- The dataset is ready for:
  - spatial duplicate detection
  - region flags
  - merge candidate generation

---

In [577]:
# ============================
# 03.0 Paths + Cache Schema (Single Source of Truth)
# ============================

from pathlib import Path
from datetime import datetime, timezone
import pandas as pd
import numpy as np
import re

# ---- required notebook globals ----
assert "NB05_DIR" in globals(), "NB05_DIR not found (set in your setup/paths section)."
assert "RAW_DIR" in globals(), "RAW_DIR not found (set in your setup/paths section)."
assert "log" in globals(), "log() not found."

CACHE_VERSION = "v1"

# ---- paths ----
GEO_CACHE_PATH = NB05_DIR / "geo_cache_v1.parquet"
ZIP_COUNTY_PATH = RAW_DIR / "zip_county.xlsx"

log(f"GEO_CACHE_PATH: {GEO_CACHE_PATH}")
log(f"ZIP_COUNTY_PATH: {ZIP_COUNTY_PATH}")

# ---- cache schema ----
CACHE_COLS = [
    "geo_query","geo_query_type",
    "latitude","longitude",
    "geo_confidence","geo_provider",
    "geocoded_at_utc","cache_version",
    "geo_status",
]

def canon_q(s: object) -> str:
    s = "" if pd.isna(s) else str(s)
    s = s.replace("\u00a0", " ")
    s = re.sub(r"\s+", " ", s).strip()
    return s

def load_or_init_cache(path: Path, cache_version: str) -> pd.DataFrame:
    if path.exists():
        log(f"Loading geocode cache: {path}")
        df = pd.read_parquet(path)
    else:
        log("Geocode cache not found. Initializing empty cache.")
        df = pd.DataFrame(columns=CACHE_COLS)

    # ensure all required cols exist
    for c in CACHE_COLS:
        if c not in df.columns:
            df[c] = pd.NA

    # normalize
    df["cache_version"] = df["cache_version"].astype(str).fillna(cache_version)
    df = df[df["cache_version"] == cache_version].copy()

    df["geo_query"] = df["geo_query"].map(canon_q)
    df["geo_query_type"] = df["geo_query_type"].map(canon_q)
    df["latitude"] = pd.to_numeric(df["latitude"], errors="coerce")
    df["longitude"] = pd.to_numeric(df["longitude"], errors="coerce")
    df["geo_confidence"] = df["geo_confidence"].astype(str).fillna("")
    df["geo_provider"] = df["geo_provider"].astype(str).fillna("")
    df["geo_status"] = df["geo_status"].astype(str).fillna("")
    df["geocoded_at_utc"] = pd.to_datetime(df["geocoded_at_utc"], utc=True, errors="coerce")

    # backfill geo_status for any legacy successful rows
    legacy_ok = (df["geo_status"] == "") & df["latitude"].notna() & df["longitude"].notna()
    df.loc[legacy_ok, "geo_status"] = "geocoded"

    log(f"Cache rows (v={cache_version}): {len(df):,}")
    return df

geo_cache = load_or_init_cache(GEO_CACHE_PATH, CACHE_VERSION)


[   46.75s] GEO_CACHE_PATH: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/geo_cache_v1.parquet
[   46.75s] ZIP_COUNTY_PATH: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/raw/zip_county.xlsx
[   46.75s] Loading geocode cache: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/geo_cache_v1.parquet
[   46.78s] Cache rows (v=v1): 3,298


In [579]:
# ============================
# 03.1 Build canonical missing geocode keys (keys_all - cache_success)
# ============================

assert "df_loc" in globals(), "df_loc not found. Run Section 02.* to create df_loc."
assert "geo_cache" in globals(), "geo_cache not found. Run 03.0 first."

log("03.1 Building canonical missing geocode key list...")

# 1) all unique keys from df_loc
keys_all = (
    df_loc[["geo_query", "geo_query_type"]]
    .copy()
)
keys_all["geo_query"] = keys_all["geo_query"].map(canon_q)
keys_all["geo_query_type"] = keys_all["geo_query_type"].map(canon_q)

keys_all = (
    keys_all
    .drop_duplicates(["geo_query", "geo_query_type"])
    .reset_index(drop=True)
)

log(f"03.1 Total unique geocode keys (from df_loc): {len(keys_all):,}")

# 2) successful keys from cache (lat/lng present)
cache_success = geo_cache[
    geo_cache["latitude"].notna() &
    geo_cache["longitude"].notna()
][["geo_query", "geo_query_type"]].copy()

cache_success["geo_query"] = cache_success["geo_query"].map(canon_q)
cache_success["geo_query_type"] = cache_success["geo_query_type"].map(canon_q)

cache_success = (
    cache_success
    .drop_duplicates(["geo_query", "geo_query_type"])
    .reset_index(drop=True)
)

log(f"03.1 Successful geocode keys in cache: {len(cache_success):,}")

# 3) missing = keys_all - cache_success
missing_keys = keys_all.merge(
    cache_success,
    on=["geo_query", "geo_query_type"],
    how="left",
    indicator=True
)

missing_keys = (
    missing_keys[missing_keys["_merge"] == "left_only"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)

log(f"03.1 Canonical missing geocode keys: {len(missing_keys):,}")

# 4) invariants
assert not missing_keys.duplicated(["geo_query", "geo_query_type"]).any(), "Duplicates found in missing_keys."
assert (missing_keys["geo_query"].str.len() > 0).all(), "Empty geo_query found in missing_keys."
assert (missing_keys["geo_query_type"].str.len() > 0).all(), "Empty geo_query_type found in missing_keys."

# preview
try:
    display(missing_keys.head(20))
except Exception:
    print(missing_keys.head(20))

log("03.1 complete: missing_keys built.")


[   48.82s] 03.1 Building canonical missing geocode key list...
[   49.17s] 03.1 Total unique geocode keys (from df_loc): 119,418
[   49.18s] 03.1 Successful geocode keys in cache: 3,296
[   49.20s] 03.1 Canonical missing geocode keys: 116,129


Unnamed: 0,geo_query,geo_query_type
0,"Escuela Bilingue Internacional, CA",name_state_ca
1,"The Academy, CA",name_state_ca
2,"Millennium School, CA",name_state_ca
3,"Bayhill High School, CA",name_state_ca
4,"Field Middle School, CA",name_state_ca
5,"East Bay School, CA",name_state_ca
6,"Lycee Francais de San Francisco, CA",name_state_ca
7,"Chinese American International School, CA",name_state_ca
8,"Head-Royce School, CA",name_state_ca
9,"Brandeis Marin, CA",name_state_ca


[   49.25s] 03.1 complete: missing_keys built.


In [581]:
# ============================
# 03.2 Prioritize missing keys (robust to best_* columns)
# ============================

assert "missing_keys" in globals(), "missing_keys not found. Run 03.1 first."
assert "df_loc" in globals(), "df_loc not found."
log("03.2 — Build prioritized missing keys (robust to best_* columns)")

# ---- choose location columns (prefer best_*) ----
CITY_COL  = "best_city"  if "best_city"  in df_loc.columns else ("city"  if "city"  in df_loc.columns else None)
STATE_COL = "best_state" if "best_state" in df_loc.columns else ("state" if "state" in df_loc.columns else None)
ZIP_COL   = "best_zip"   if "best_zip"   in df_loc.columns else ("zip"   if "zip"   in df_loc.columns else None)

log(f"03.2 Using cols: city={CITY_COL}, state={STATE_COL}, zip={ZIP_COL}")

def _col_or_default(col: str, default):
    return df_loc[col] if (col is not None and col in df_loc.columns) else default

# ---- build row-level flags then aggregate to key-level ----
tmp = df_loc.copy()

tmp["geo_query"] = tmp["geo_query"].map(canon_q)
tmp["geo_query_type"] = tmp["geo_query_type"].map(canon_q)

# flags that may or may not exist
tmp["is_private"] = tmp["is_private"] if "is_private" in tmp.columns else False
tmp["is_enriched"] = tmp["has_any_enrichment"] if "has_any_enrichment" in tmp.columns else False

# CA flag from state column if present, else infer from geo_query containing ", CA"
if STATE_COL is not None:
    tmp["is_ca"] = tmp[STATE_COL].astype(str).str.upper().str.strip().eq("CA")
else:
    tmp["is_ca"] = tmp["geo_query"].str.contains(r",\s*CA(\s|$)", case=False, na=False)

# city_norm (for Bay Area city heuristic)
if CITY_COL is not None:
    tmp["city_norm"] = tmp[CITY_COL].astype(str).str.lower().str.strip()
else:
    tmp["city_norm"] = ""

# zip5 from zip col if present else extract from query
if ZIP_COL is not None:
    tmp["zip5"] = tmp[ZIP_COL].map(normalize_zip)
else:
    tmp["zip5"] = tmp["geo_query"].map(extract_zip5_from_query)

# key-level aggregation (max over any row sharing that key)
key_flags = (
    tmp.groupby(["geo_query","geo_query_type"], as_index=False)
       .agg({
           "is_ca": "max",
           "is_private": "max",
           "is_enriched": "max",
           "city_norm": "first",
           "zip5": "first",
       })
)

# ---- join onto missing_keys ----
todo = missing_keys.merge(key_flags, on=["geo_query","geo_query_type"], how="left")

# ---- Bay Area heuristics (only for prioritization order) ----
BAYAREA_CITIES = {
    "san jose","santa clara","sunnyvale","cupertino","palo alto","menlo park","mountain view",
    "los altos","los altos hills","saratoga","campbell","milpitas","fremont","newark","union city",
    "oakland","berkeley","alameda","san leandro","hayward","castro valley","dublin","pleasanton",
    "san ramon","danville","walnut creek","concord","lafayette","moraga","orinda","richmond",
    "san francisco","daly city","south san francisco","san bruno","millbrae","burlingame",
    "san mateo","foster city","redwood city","belmont","san carlos","half moon bay",
    "sausalito","mill valley","san rafael","novato","napa","vallejo","fairfield","vacaville",
    "sonoma","petaluma","santa rosa"
}

todo["is_bayarea_guess"] = todo["city_norm"].fillna("").isin(BAYAREA_CITIES)

todo["zip2"] = todo["zip5"].fillna("").astype(str).str[:2]
todo["is_bayarea_zip_guess"] = todo["zip2"].isin({"94","95"})

# Priority score: BayArea city > Enriched > CA > Private > Zip2 hint
todo["priority_score"] = (
    10*todo["is_bayarea_guess"].fillna(False).astype(int) +
     6*todo["is_enriched"].fillna(False).astype(int) +
     3*todo["is_ca"].fillna(False).astype(int) +
     1*todo["is_private"].fillna(False).astype(int) +
     1*todo["is_bayarea_zip_guess"].fillna(False).astype(int)
)

todo = todo.sort_values(["priority_score","geo_query"], ascending=[False, True]).reset_index(drop=True)

log(f"03.2 Total missing keys: {len(todo):,}")
log(f"03.2 Top priority (score>=10): {(todo['priority_score']>=10).sum():,}")

try:
    display(todo.head(30))
except Exception:
    print(todo.head(30))

missing_prioritized = todo  # keep this name for downstream 03.3
log("03.2 complete: missing_prioritized ready.")


[   50.01s] 03.2 — Build prioritized missing keys (robust to best_* columns)
[   50.01s] 03.2 Using cols: city=best_city, state=best_state, zip=best_zip
[   50.70s] 03.2 Total missing keys: 116,129
[   50.70s] 03.2 Top priority (score>=10): 1,253


Unnamed: 0,geo_query,geo_query_type,is_ca,is_private,is_enriched,city_norm,zip5,is_bayarea_guess,zip2,is_bayarea_zip_guess,priority_score
0,"35660 CEDAR BLVD, NEWARK, CA 94560",address_city_state_zip,True,True,False,newark,94560,True,94,True,15
1,"38051 STENHAMMER DR, FREMONT, CA 94536",address_city_state_zip,True,True,False,fremont,94536,True,94,True,15
2,"38451 FREMONT BLVD, FREMONT, CA 94536",address_city_state_zip,True,True,False,fremont,94536,True,94,True,15
3,"40374 FREMONT BLVD, FREMONT, CA 94538",address_city_state_zip,True,True,False,fremont,94538,True,94,True,15
4,"40803 FREMONT BLVD, FREMONT, CA 94538",address_city_state_zip,True,True,False,fremont,94538,True,94,True,15
5,"45819 BRIDGEPORT PL, FREMONT, CA 94539",address_city_state_zip,True,True,False,fremont,94539,True,94,True,15
6,"ANCHOR EDUCATION, SAN LEANDRO, CA",name_city_state_po_box,True,True,False,san leandro,94577,True,94,True,15
7,"CHAMPION KINDER INTERNATIONAL SCHOOL, LOS ALTO...",name_city_state_po_box,True,True,False,los altos,94024,True,94,True,15
8,"CORE EDUCATION ACADEMY, WALNUT CREEK, CA",name_city_state_po_box,True,True,False,walnut creek,94598,True,94,True,15
9,"CREATIVE LEARNING CENTER, LOS ALTOS, CA",name_city_state_po_box,True,True,False,los altos,94023,True,94,True,15


[   50.71s] 03.2 complete: missing_prioritized ready.


In [583]:
# ============================
# 03.3 Build TODO keys (MVP Bay Area) + Must-have school overrides
#   Output: todo_to_geocode  (for 03.4)
# ============================

log("03.3 started")

assert "df_loc" in globals(), "df_loc not found"
assert "geo_cache" in globals(), "geo_cache not found (run 03.0 first)"
assert "canon_q" in globals(), "canon_q() not found (defined in 03.0)"
assert "log" in globals(), "log() not found"

TOP_N = 1000
CACHE_VERSION = "v1"

# --- Bay Area ZIP allowlist (ZIP3 is more precise than ZIP2) ---
BAYAREA_ZIP3_ALLOW = {
    "940","941","942","943","944","945","946","947","948","949",
    "950","951",
}

def extract_zip5(s: str):
    m = re.search(r"\b(\d{5})(?:-\d{4})?\b", s or "")
    return m.group(1) if m else None

def zip3_from_query(q: str):
    z5 = extract_zip5(q)
    return z5[:3] if z5 else None

# ----------------------------
# 0) Build ok_keys (already geocoded) from geo_cache
# ----------------------------
_ok = geo_cache["latitude"].notna() & geo_cache["longitude"].notna()
ok_keys = set(zip(geo_cache.loc[_ok, "geo_query"], geo_cache.loc[_ok, "geo_query_type"]))
log(f"03.3 cache ok_keys: {len(ok_keys):,}")

# ----------------------------
# 1) MUST-HAVE Bay Area schools (force address keys so OpenCage can hit)
#    Patch only CAIS rows that are stuck on name_state_ca.
# ----------------------------
MUST_HAVE_ADDR = {
    "The Academy": "1212 4th St, San Rafael, CA 94901",
    "Millennium School": "345 12th St, Oakland, CA 94607",
    "Field Middle School": "2550 W El Camino Real, Mountain View, CA 94040",
    "East Bay School": "2350 Powell St, Emeryville, CA 94608",
    "Lycee Francais de San Francisco": "1201 Ortega St, San Francisco, CA 94122",
    "Lycée Français de San Francisco": "1201 Ortega St, San Francisco, CA 94122",
    "Chinese American International School": "150 Oak St, San Francisco, CA 94102",
    "Brandeis Marin": "180 N San Pedro Rd, San Rafael, CA 94903",
    "Keys School": "3100 Webber St, Palo Alto, CA 94306",
}

df_loc = df_loc.copy()

# keep originals for audit (optional)
if "geo_query_orig" not in df_loc.columns:
    df_loc["geo_query_orig"] = pd.NA
if "geo_query_type_orig" not in df_loc.columns:
    df_loc["geo_query_type_orig"] = pd.NA

is_cais = df_loc["school_id"].astype(str).str.startswith("CAIS_")
is_name_state = df_loc["geo_query_type"].astype(str).eq("name_state_ca")
is_must = df_loc["name"].astype(str).str.strip().isin(MUST_HAVE_ADDR.keys())

patch_mask = is_cais & is_name_state & is_must
log(f"03.3 must-have CAIS patches to apply: {int(patch_mask.sum()):,}")

df_loc.loc[patch_mask, "geo_query_orig"] = df_loc.loc[patch_mask, "geo_query"]
df_loc.loc[patch_mask, "geo_query_type_orig"] = df_loc.loc[patch_mask, "geo_query_type"]
df_loc.loc[patch_mask, "geo_query"] = df_loc.loc[patch_mask, "name"].astype(str).str.strip().map(MUST_HAVE_ADDR)
df_loc.loc[patch_mask, "geo_query_type"] = "address_city_state_zip"

if patch_mask.any():
    display(df_loc.loc[patch_mask, ["school_id","name","geo_query_orig","geo_query_type_orig","geo_query","geo_query_type"]])

# ----------------------------
# 2) Build TODO keys (addr first), Bay Area-ish via ZIP3 allowlist
# ----------------------------
tmp = df_loc[["geo_query","geo_query_type","school_id","name"]].copy()
tmp["geo_query"] = tmp["geo_query"].map(canon_q)
tmp["geo_query_type"] = tmp["geo_query_type"].astype(str)

# --- Address keys: CA + ZIP3 allowlist ---
zip3 = tmp["geo_query"].map(zip3_from_query)
addr_mask = (
    tmp["geo_query_type"].eq("address_city_state_zip")
    & tmp["geo_query"].str.contains(r",\s*CA\b", case=False, na=False)
    & zip3.isin(BAYAREA_ZIP3_ALLOW)
)

addr_keys = tmp.loc[addr_mask, ["geo_query","geo_query_type"]].drop_duplicates().reset_index(drop=True)
addr_need = ~addr_keys.apply(lambda r: (r["geo_query"], r["geo_query_type"]) in ok_keys, axis=1)
todo_addr = addr_keys.loc[addr_need].reset_index(drop=True)
log(f"03.3 TODO addr keys (ZIP3 Bay Area allowlist): {len(todo_addr):,}")

# --- Must-have name keys fallback (ONLY the must-have list; keep tiny) ---
must_mask = tmp["name"].astype(str).str.strip().isin(MUST_HAVE_ADDR.keys())
must_keys = tmp.loc[must_mask, ["geo_query","geo_query_type"]].drop_duplicates().reset_index(drop=True)
must_need = ~must_keys.apply(lambda r: (r["geo_query"], r["geo_query_type"]) in ok_keys, axis=1)
todo_must = must_keys.loc[must_need].reset_index(drop=True)

# Keep only name-like keys here (don’t duplicate addr list)
todo_must = todo_must[todo_must["geo_query_type"].isin(["name_city_state", "name_state_ca", "name_city_state_po_box"])].reset_index(drop=True)
log(f"03.3 TODO name keys (must-have only): {len(todo_must):,}")

# --- Combine (addr first), cap ---
todo_to_geocode = (
    pd.concat([todo_addr, todo_must], ignore_index=True)
    .drop_duplicates(["geo_query","geo_query_type"])
    .head(TOP_N)
    .reset_index(drop=True)
)

log(f"03.3 TODO total: {len(todo_to_geocode):,} (TOP_N={TOP_N})")
display(todo_to_geocode.head(30))

log("03.3 completed")


[   52.71s] 03.3 started
[   52.71s] 03.3 cache ok_keys: 3,296
[   52.77s] 03.3 must-have CAIS patches to apply: 8


Unnamed: 0,school_id,name,geo_query_orig,geo_query_type_orig,geo_query,geo_query_type
124619,CAIS_0fb36d096d98,The Academy,"The Academy, CA",name_state_ca,"1212 4th St, San Rafael, CA 94901",address_city_state_zip
124641,CAIS_1ff9891ebd76,Millennium School,"Millennium School, CA",name_state_ca,"345 12th St, Oakland, CA 94607",address_city_state_zip
124629,CAIS_38bcb29a33d7,Field Middle School,"Field Middle School, CA",name_state_ca,"2550 W El Camino Real, Mountain View, CA 94040",address_city_state_zip
124627,CAIS_725851f3335f,East Bay School,"East Bay School, CA",name_state_ca,"2350 Powell St, Emeryville, CA 94608",address_city_state_zip
124640,CAIS_8aea417dcbe4,Lycée Français de San Francisco,"Lycee Francais de San Francisco, CA",name_state_ca,"1201 Ortega St, San Francisco, CA 94122",address_city_state_zip
124624,CAIS_a18f4a9cd078,Chinese American International School,"Chinese American International School, CA",name_state_ca,"150 Oak St, San Francisco, CA 94102",address_city_state_zip
124622,CAIS_d6d2f01752a3,Brandeis Marin,"Brandeis Marin, CA",name_state_ca,"180 N San Pedro Rd, San Rafael, CA 94903",address_city_state_zip
124637,CAIS_d9ad26e7d66c,Keys School,"Keys School, CA",name_state_ca,"3100 Webber St, Palo Alto, CA 94306",address_city_state_zip


[   53.18s] 03.3 TODO addr keys (ZIP3 Bay Area allowlist): 6
[   53.20s] 03.3 TODO name keys (must-have only): 0
[   53.20s] 03.3 TODO total: 6 (TOP_N=1000)


Unnamed: 0,geo_query,geo_query_type
0,"1212 4th St, San Rafael, CA 94901",address_city_state_zip
1,"345 12th St, Oakland, CA 94607",address_city_state_zip
2,"2550 W El Camino Real, Mountain View, CA 94040",address_city_state_zip
3,"1201 Ortega St, San Francisco, CA 94122",address_city_state_zip
4,Golden Hills Education Ctr. 2460 Clay Bank Rd....,address_city_state_zip
5,Golden Hills Education Ctr. 2460 Clay Bank Rd....,address_city_state_zip


[   53.20s] 03.3 completed


In [585]:
# ============================
# 03.4 Geocode TODO keys (OpenCage) -> update geo_cache (idempotent)
#   Inputs : todo_to_geocode, geo_cache, GEO_CACHE_PATH, CACHE_VERSION
#   Output : geo_cache (updated + saved), df_loc stays as-is
# ============================

log("03.4 started")

assert "todo_to_geocode" in globals(), "todo_to_geocode not found (run 03.3)"
assert "geo_cache" in globals(), "geo_cache not found (run 03.0)"
assert "GEO_CACHE_PATH" in globals(), "GEO_CACHE_PATH not found (run 03.0)"
assert "CACHE_VERSION" in globals(), "CACHE_VERSION not found (run 03.0)"
assert "canon_q" in globals(), "canon_q() not found (run 03.0)"
assert "log" in globals(), "log() not found"

# ---- Provide your OpenCage key either via env var or a notebook variable
OPENCAGE_API_KEY = os.getenv("OPENCAGE_API_KEY", None)
assert OPENCAGE_API_KEY, "Missing OPENCAGE_API_KEY env var. Set it before running 03.4."

# ---- OpenCage endpoint
OPENCAGE_URL = "https://api.opencagedata.com/geocode/v1/json"

# ---- gentle defaults
SLEEP_S = 1.2          # keep this >= 1s to avoid rate limits
SAVE_EVERY = 25        # persist cache every N requests
MAX_RESULTS = 1        # we only need best hit

def clean_address_query(q: str) -> str:
    """
    Fix common upstream issues:
    - Duplicate concatenation: "X. X, City, CA ZIP" -> keep the latter
    - "Street. Street, City, CA ZIP" -> keep the comma-based part if present
    """
    q = canon_q(q)

    # If the string contains a comma-delimited address part, keep from the first comma chunk onward
    # But only if it looks like a CA address.
    if ", CA " in q:
        # Sometimes we have "6130 Foo Ave. 6130 Foo Dr., San Jose, CA 951..."
        # If there's a comma later, keep the last comma-anchored address segment.
        parts = q.split(",")
        if len(parts) >= 3:
            # rebuild from the last 3-ish segments: "street, city, CA ZIP"
            tail = ",".join(parts[-3:]).strip()
            # keep any street portion immediately before the city (the segment before city)
            # If we already kept "street, city, CA ZIP" this is fine.
            q = tail

    # If there are two street-number patterns back-to-back, keep the latter half
    # e.g., "123 A St 123 B St, City, CA 99999"
    m = re.search(r"(\b\d{1,6}\s+.+,\s*[^,]+,\s*CA\s*\d{5}\b)", q)
    if m:
        q = m.group(1)

    return canon_q(q)

def opencage_geocode(query: str) -> dict:
    params = {
        "q": query,
        "key": OPENCAGE_API_KEY,
        "limit": MAX_RESULTS,
        "no_annotations": 1,
    }
    r = requests.get(OPENCAGE_URL, params=params, timeout=30)
    r.raise_for_status()
    return r.json()

# ---- build a fast lookup of existing cache rows (by key)
geo_cache["geo_query"] = geo_cache["geo_query"].map(canon_q)
geo_cache["geo_query_type"] = geo_cache["geo_query_type"].map(canon_q)
geo_cache["cache_version"] = geo_cache["cache_version"].astype(str).fillna(CACHE_VERSION)

cache_key_cols = ["geo_query", "geo_query_type", "cache_version"]
geo_cache["_k"] = list(zip(geo_cache["geo_query"], geo_cache["geo_query_type"], geo_cache["cache_version"]))
cache_keys = set(geo_cache["_k"].tolist())

# ---- prepare todo list
todo = todo_to_geocode.copy()
todo["geo_query"] = todo["geo_query"].map(clean_address_query)
todo["geo_query_type"] = todo["geo_query_type"].map(canon_q)
todo["cache_version"] = CACHE_VERSION
todo["_k"] = list(zip(todo["geo_query"], todo["geo_query_type"], todo["cache_version"]))

# If a key exists in cache already (even failed), you can choose to skip or retry.
# Here: retry ONLY if it's not already geocoded.
if "geo_status" not in geo_cache.columns:
    geo_cache["geo_status"] = ""

status_map = dict(zip(geo_cache["_k"], geo_cache["geo_status"].astype(str)))
def should_attempt(k):
    s = status_map.get(k, "")
    return s != "geocoded"  # retry for blanks/errors/no_result

todo = todo[todo["_k"].map(should_attempt)].copy()
log(f"03.4 TODO after cleaning + geocoded-skip: {len(todo):,}")

new_rows = []
attempted = 0
success = 0
no_result = 0
errors = 0

for i, row in todo.reset_index(drop=True).iterrows():
    attempted += 1
    q = row["geo_query"]
    qt = row["geo_query_type"]

    now_utc = datetime.now(timezone.utc)

    try:
        data = opencage_geocode(q)
        results = data.get("results", []) or []

        if not results:
            no_result += 1
            new_rows.append({
                "geo_query": q,
                "geo_query_type": qt,
                "latitude": np.nan,
                "longitude": np.nan,
                "geo_confidence": "",
                "geo_provider": "opencage",
                "geocoded_at_utc": now_utc,
                "cache_version": CACHE_VERSION,
                "geo_status": "no_result",
            })
        else:
            best = results[0]
            geom = best.get("geometry", {}) or {}
            lat = geom.get("lat", np.nan)
            lng = geom.get("lng", np.nan)
            conf = best.get("confidence", "")

            ok = pd.notna(lat) and pd.notna(lng)
            success += int(bool(ok))

            new_rows.append({
                "geo_query": q,
                "geo_query_type": qt,
                "latitude": lat,
                "longitude": lng,
                "geo_confidence": str(conf),
                "geo_provider": "opencage",
                "geocoded_at_utc": now_utc,
                "cache_version": CACHE_VERSION,
                "geo_status": "geocoded" if ok else "no_result",
            })

    except Exception as e:
        errors += 1
        new_rows.append({
            "geo_query": q,
            "geo_query_type": qt,
            "latitude": np.nan,
            "longitude": np.nan,
            "geo_confidence": "",
            "geo_provider": "opencage",
            "geocoded_at_utc": now_utc,
            "cache_version": CACHE_VERSION,
            "geo_status": f"error:{type(e).__name__}",
        })

    # rate limit
    time.sleep(SLEEP_S)

    # periodic save
    if attempted % SAVE_EVERY == 0:
        add_df = pd.DataFrame(new_rows)
        geo_cache = pd.concat([geo_cache.drop(columns=["_k"], errors="ignore"), add_df], ignore_index=True)

        # dedupe keep newest by geocoded_at_utc
        geo_cache["geocoded_at_utc"] = pd.to_datetime(geo_cache["geocoded_at_utc"], utc=True, errors="coerce")
        geo_cache = geo_cache.sort_values("geocoded_at_utc").drop_duplicates(cache_key_cols, keep="last").reset_index(drop=True)

        geo_cache.to_parquet(GEO_CACHE_PATH, index=False)
        log(f"03.4 checkpoint save: attempted={attempted:,} success={success:,} no_result={no_result:,} errors={errors:,}")

# final merge + save
if new_rows:
    add_df = pd.DataFrame(new_rows)
    geo_cache = pd.concat([geo_cache.drop(columns=["_k"], errors="ignore"), add_df], ignore_index=True)

geo_cache["geocoded_at_utc"] = pd.to_datetime(geo_cache["geocoded_at_utc"], utc=True, errors="coerce")
geo_cache = geo_cache.sort_values("geocoded_at_utc").drop_duplicates(cache_key_cols, keep="last").reset_index(drop=True)

geo_cache.to_parquet(GEO_CACHE_PATH, index=False)

log(f"03.4 completed: attempted={attempted:,} success={success:,} no_result={no_result:,} errors={errors:,}")
log(f"Cache rows now: {len(geo_cache):,}")
display(geo_cache.tail(20))

log("03.4 completed")

[   54.02s] 03.4 started
[   54.04s] 03.4 TODO after cleaning + geocoded-skip: 4
[   63.59s] 03.4 completed: attempted=4 success=4 no_result=0 errors=0
[   63.59s] Cache rows now: 3,302


Unnamed: 0,geo_query,geo_query_type,latitude,longitude,geo_confidence,geo_provider,geocoded_at_utc,cache_version,geo_status
3282,"701 CABOT ST, BEVERLY, MA 01915",address_city_state_zip,42.582187,-70.897799,high,opencage,2026-01-18 08:51:06.326333+00:00,v1,geocoded
3283,"7101 Stanton Ave. 7101 Stanton Ave., Buena Par...",address_city_state_zip,33.857411,-117.993823,high,opencage,2026-01-18 08:51:06.671915+00:00,v1,geocoded
3284,"715 CARRELL ST, TOMBALL, TX 77375",address_city_state_zip,30.105138,-95.607495,high,opencage,2026-01-18 08:51:06.960092+00:00,v1,geocoded
3285,"7421 MIRANO DR, GOLETA, CA 93117",address_city_state_zip,34.438007,-119.885384,high,opencage,2026-01-18 08:51:07.258016+00:00,v1,geocoded
3286,"7777 62ND AVE NE, SEATTLE, WA 98115",address_city_state_zip,47.687834,-122.264955,high,opencage,2026-01-18 08:51:07.634643+00:00,v1,geocoded
3287,"8001 Santa Fe Ave. 8001 Santa Fe Ave., Walnut ...",address_city_state_zip,33.966367,-118.230184,medium,opencage,2026-01-18 08:51:08.128082+00:00,v1,geocoded
3288,"827 KIRK RD, DECATUR, GA 30030",address_city_state_zip,33.76257,-84.280934,high,opencage,2026-01-18 08:51:08.480166+00:00,v1,geocoded
3289,8670 East Running Springs Dr. 8670 East Runnin...,address_city_state_zip,33.860011,-117.725809,high,opencage,2026-01-18 08:51:08.960376+00:00,v1,geocoded
3290,"8700 S VIEW RD, AUSTIN, TX 78737",address_city_state_zip,30.230404,-97.9125,high,opencage,2026-01-18 08:51:10.030147+00:00,v1,geocoded
3291,"880 MANZANITA DR, LOS OSOS, CA 93402",address_city_state_zip,35.309953,-120.835914,high,opencage,2026-01-18 08:51:10.430747+00:00,v1,geocoded


[   63.60s] 03.4 completed


In [587]:
ok = geo_cache["geo_status"].astype(str).eq("geocoded")
log(f"Cache geocoded rows (v={CACHE_VERSION}): {int(ok.sum()):,} / {len(geo_cache):,}")

# confirm those 195 todo keys are now geocoded
todo_keys = set(zip(
    todo_to_geocode["geo_query"].map(canon_q),
    todo_to_geocode["geo_query_type"].map(canon_q),
))
cache_keys_ok = set(zip(
    geo_cache.loc[ok, "geo_query"],
    geo_cache.loc[ok, "geo_query_type"],
))
hit = len(todo_keys & cache_keys_ok)
log(f"Geocoded todo_to_geocode: {hit:,} / {len(todo_keys):,}")


[   68.21s] Cache geocoded rows (v=v1): 3,300 / 3,302
[   68.22s] Geocoded todo_to_geocode: 4 / 6


In [589]:
def clean_address_query(q: str) -> str:
    q = canon_q(q)
    if ", CA " in q:
        parts = q.split(",")
        if len(parts) >= 3:
            tail = ",".join(parts[-3:]).strip()
            q = tail
    m = re.search(r"(\b\d{1,6}\s+.+,\s*[^,]+,\s*CA\s*\d{5}\b)", q)
    if m:
        q = m.group(1)
    return canon_q(q)

ok = geo_cache["geo_status"].astype(str).eq("geocoded")
cache_pairs_ok = set(zip(
    geo_cache.loc[ok, "geo_query"],
    geo_cache.loc[ok, "geo_query_type"],
))

todo_pairs_clean = set(zip(
    todo_to_geocode["geo_query"].map(clean_address_query),
    todo_to_geocode["geo_query_type"].map(canon_q),
))

hit = len(todo_pairs_clean & cache_pairs_ok)
log(f"Geocoded todo_to_geocode (CLEANED compare): {hit:,} / {len(todo_pairs_clean):,}")


[   76.81s] Geocoded todo_to_geocode (CLEANED compare): 6 / 6


In [591]:
# ============================
# 03.5 Attach geocodes from cache -> df_loc_geo (robust join)
# ============================

log("03.5 started")

assert "df_loc" in globals(), "df_loc not found"
assert "geo_cache" in globals(), "geo_cache not found"
assert "canon_q" in globals(), "canon_q() not found"
assert "CACHE_VERSION" in globals(), "CACHE_VERSION not found"

import re

def clean_address_query(q: str) -> str:
    q = canon_q(q)
    if ", CA " in q:
        parts = q.split(",")
        if len(parts) >= 3:
            q = ",".join(parts[-3:]).strip()
    m = re.search(r"(\b\d{1,6}\s+.+,\s*[^,]+,\s*CA\s*\d{5}\b)", q)
    if m:
        q = m.group(1)
    return canon_q(q)

loc = df_loc.copy()
loc["geo_query"] = loc["geo_query"].map(canon_q)
loc["geo_query_type"] = loc["geo_query_type"].map(canon_q)
loc["geo_query_clean"] = loc["geo_query"].map(clean_address_query)

cache = geo_cache.copy()
cache["geo_query"] = cache["geo_query"].map(canon_q)
cache["geo_query_type"] = cache["geo_query_type"].map(canon_q)
cache["cache_version"] = cache["cache_version"].astype(str).fillna(CACHE_VERSION)

# keep only geocoded rows in this version
cache_ok = cache[
    (cache["cache_version"] == CACHE_VERSION) &
    (cache["geo_status"].astype(str) == "geocoded") &
    cache["latitude"].notna() &
    cache["longitude"].notna()
].copy()

# if duplicates exist, keep most recent
cache_ok["geocoded_at_utc"] = pd.to_datetime(cache_ok["geocoded_at_utc"], utc=True, errors="coerce")
cache_ok = cache_ok.sort_values("geocoded_at_utc").drop_duplicates(
    ["geo_query","geo_query_type","cache_version"], keep="last"
)

# join on cleaned query (left) -> cache query (right)
df_loc_geo = loc.merge(
    cache_ok[[
        "geo_query","geo_query_type",
        "latitude","longitude",
        "geo_confidence","geo_provider","geocoded_at_utc","geo_status"
    ]].rename(columns={"geo_query": "geo_query_clean"}),
    how="left",
    left_on=["geo_query_clean","geo_query_type"],
    right_on=["geo_query_clean","geo_query_type"],
    validate="m:1",
)

has_latlng = df_loc_geo["latitude"].notna() & df_loc_geo["longitude"].notna()
log(f"03.5 df_loc_geo lat/lng coverage: {int(has_latlng.sum()):,} / {len(df_loc_geo):,} ({has_latlng.mean():.2%})")

display(df_loc_geo.loc[~has_latlng, ["school_id","name","geo_query","geo_query_type"]].head(30))

log("03.5 completed")


[   77.67s] 03.5 started
[   78.78s] 03.5 df_loc_geo lat/lng coverage: 3,502 / 126,615 (2.77%)


Unnamed: 0,school_id,name,geo_query,geo_query_type
0,CAIS_0b0a82aa7039,Escuela Bilingüe Internacional,"Escuela Bilingue Internacional, CA",name_state_ca
4,CAIS_37b76c325857,Bayhill High School,"Bayhill High School, CA",name_state_ca
20,CAIS_b6653a2ab18f,Head-Royce School,"Head-Royce School, CA",name_state_ca
27,CA_10621176135818,Charlie Keyan Armenian School,"Charlie Keyan Armenian School, CA",name_state_ca
28,CA_10621176148878,Clovis Christian School,"Clovis Christian School, CA",name_state_ca
29,CA_10621176162093,Graham Yalle Visual and Performing Arts Academy,Graham Yalle Visual and Performing Arts Academ...,name_state_ca
30,CA_10621176165567,Provident Academy,"Provident Academy, CA",name_state_ca
31,CA_10621176166011,Belmont Private Academy,"Belmont Private Academy, CA",name_state_ca
32,CA_10621176169080,Royal Beginnings,"Royal Beginnings, CA",name_state_ca
33,CA_10621176169155,Spectrum Center Clovis,"Spectrum Center Clovis, CA",name_state_ca


[   78.79s] 03.5 completed


In [594]:
# Does df_loc_geo contain the Harker address key (or clean-equivalent)?
HK_QUERY = "500 SARATOGA AVE, SAN JOSE, CA 95129"

hit = df_loc_geo[df_loc_geo["geo_query"].astype(str).str.contains("500 SARATOGA", case=False, na=False)]
print("df_loc_geo rows containing '500 saratoga':", len(hit))
display(hit[["school_id","name","geo_query","geo_query_type","latitude","longitude","geo_status"]].head(20))


df_loc_geo rows containing '500 saratoga': 1


Unnamed: 0,school_id,name,geo_query,geo_query_type,latitude,longitude,geo_status
19127,PRI_A2190096,HARKER,"500 SARATOGA AVE, SAN JOSE, CA 95129",address_city_state_zip,37.317627,-121.971867,geocoded


In [596]:
# ============================
# 03.6 Build schools_locations (many rows per school; deterministic + idempotent) — FINAL
#   Input : df_loc_geo (from 03.5)
#   Output: NB05_DIR/schools_locations_v1.parquet (+ csv)
# ============================

log("03.6 started")

import re
import hashlib

assert "df_loc_geo" in globals(), "df_loc_geo not found (run 03.5)"
assert "NB05_DIR" in globals(), "NB05_DIR not found"
assert "canon_q" in globals(), "canon_q() not found"

OUT_LOC = NB05_DIR / "schools_locations_v1.parquet"
OUT_LOC_CSV = NB05_DIR / "schools_locations_v1.csv"

def stable_id(*parts: str, n: int = 16) -> str:
    s = "||".join("" if p is None else str(p) for p in parts)
    return hashlib.sha1(s.encode("utf-8")).hexdigest()[:n]

# Match the LAST US address pattern in the string (fixes double-concat like "A. B, City, ST 12345")

US_ADDR_RE = re.compile(
    r"(\b\d{1,6}(?:-\d{1,6})?\s+[^,]+?),\s*([^,]+?),\s*([A-Z]{2})\s*(\d{5})(?:-\d{4})?\b"
)

STATE_ZIP_RE = re.compile(r"\b([A-Z]{2})\s*(\d{5})(?:-\d{4})?\b")

def clean_address_query_any_state(q: str) -> str:
    q = canon_q(q)

    # A) best: extract the LAST full address occurrence
    matches = list(US_ADDR_RE.finditer(q))
    if matches:
        m = matches[-1]
        street, city, state, zip5 = m.group(1), m.group(2), m.group(3), m.group(4)
        return canon_q(f"{street}, {city}, {state} {zip5}")

    # B) fallback: if string contains commas, try to salvage the last "..., City, ST ZIP"
    parts = [p.strip() for p in q.split(",") if p.strip()]
    if len(parts) >= 2:
        # find the last segment containing "ST ZIP"
        idx = None
        for i in range(len(parts) - 1, -1, -1):
            if STATE_ZIP_RE.search(parts[i]):
                idx = i
                break

        # if we found state+zip at part i, try to use previous part as city and previous as street-ish
        if idx is not None:
            tail = parts[max(0, idx-2): idx+1]
            # tail could be [street, city, "ST ZIP"] or ["city", "ST ZIP"]
            if len(tail) == 3:
                return canon_q(f"{tail[0]}, {tail[1]}, {tail[2]}")
            if len(tail) == 2:
                return canon_q(f"{tail[0]}, {tail[1]}")

    # otherwise: return normalized original
    return q


loc = df_loc_geo.copy()
loc["geo_query"] = loc["geo_query"].map(canon_q)
loc["geo_query_type"] = loc["geo_query_type"].map(canon_q)

# Keep only real address rows (skip placeholders + name-only)
addr = loc[
    (loc["geo_query_type"] == "address_city_state_zip") &
    (~loc["geo_query"].str.contains(r"^<<<", na=False))
].copy()

# Build clean join key + deterministic id
addr["geo_query_clean"] = addr["geo_query"].map(clean_address_query_any_state)

addr["location_id"] = addr.apply(
    lambda r: "LOC_" + stable_id(str(r["school_id"]), str(r["geo_query_clean"])),
    axis=1
)

schools_locations = addr[[
    "location_id",
    "school_id",
    "name",
    "geo_query",          # raw (may include duplicates)
    "geo_query_clean",    # canonical join key
    "geo_query_type",
    "latitude",
    "longitude",
    "geo_status",
    "geo_confidence",
    "geo_provider",
    "geocoded_at_utc",
]].copy()

schools_locations["is_geocoded"] = schools_locations["latitude"].notna() & schools_locations["longitude"].notna()
schools_locations["created_at_utc"] = pd.Timestamp.utcnow()

# Dedupe: keep newest geocode per location_id
schools_locations["geocoded_at_utc"] = pd.to_datetime(schools_locations["geocoded_at_utc"], utc=True, errors="coerce")
schools_locations = schools_locations.sort_values(["location_id","geocoded_at_utc"])
schools_locations = schools_locations.drop_duplicates(["location_id"], keep="last").reset_index(drop=True)

changed = (schools_locations["geo_query_clean"] != schools_locations["geo_query"]).sum()
log(f"03.6 cleaned geo_query rows changed: {int(changed):,}")

# show some examples that changed
display(
    schools_locations.loc[schools_locations["geo_query_clean"] != schools_locations["geo_query"],
                          ["school_id","name","geo_query","geo_query_clean"]]
    .head(20)
)

OUT_LOC.parent.mkdir(parents=True, exist_ok=True)
schools_locations.to_parquet(OUT_LOC, index=False)
schools_locations.to_csv(OUT_LOC_CSV, index=False)

# Quick sanity: how many got “cleaned”?
changed = (schools_locations["geo_query_clean"] != schools_locations["geo_query"]).sum()
log(f"03.6 cleaned geo_query rows changed: {int(changed):,}")

log(f"03.6 saved: {OUT_LOC}")
log(f"03.6 rows: {len(schools_locations):,} | geocoded: {int(schools_locations['is_geocoded'].sum()):,}")
display(schools_locations.head(20))

log("03.6 completed")


[   95.02s] 03.6 started
[   96.54s] 03.6 cleaned geo_query rows changed: 615


Unnamed: 0,school_id,name,geo_query,geo_query_clean
115,PRI_A1101376,OLTC INSTITUTE,"(2ND) ADDRESS 3825 ROLAND BLVD, SAINT LOUIS, M...","3825 ROLAND BLVD, SAINT LOUIS, MO 63121"
640,PRI_A9504505,AUGUSTA MENNONITE SCHOOL,"E 19280 COUNTY ROAD GG, AUGUSTA, WI 54722","19280 COUNTY ROAD GG, AUGUSTA, WI 54722"
814,PRI_A1702933,CENTRAL OREGON INTERGOVERNMENTAL COUNCIL SKILL...,"1645 NE FORBES RD, STE 108, BEND, OR 97701","STE 108, BEND, OR 97701"
871,PUB_390444601391,Oak Intermediate Elementary School,#1 Glenwood Tiger Trail #1 Glenwood Tiger Trai...,1 Glenwood Tiger Trail #1 Glenwood Tiger Trail...
934,PRI_A1702936,COIC - LA PINE,"1645 NE FORBES RD, STE 108, BEND, OR 97701","STE 108, BEND, OR 97701"
984,PUB_720003000480,BRAULIO AYALA PEREZ,CARR 924 RAMAL 938 BO MAMBICHE BLANCO HC 03 BO...,924 RAMAL 938 BO MAMBICHE BLANCO HC 03 BOX 659...
1070,PUB_720003000573,DAVID ANTONGIORGI CORDOVA,"CARR 121 KM 5 HM O BO MACHUCHAL HC 9 BOX 4635,...","121 KM 5 HM O BO MACHUCHAL HC 9 BOX 4635, SABA..."
1119,PRI_A2104117,MEADOW LANE PAROCHIAL SCHOOL,"N 6302 SANDHILL AVE, CHILI, WI 54420","6302 SANDHILL AVE, CHILI, WI 54420"
1227,PUB_010027000059,Fairhope Middle School,"Two Pirate Drive 2 Pirate Dr, Fairhope, AL 36532","2 Pirate Dr, Fairhope, AL 36532"
1596,PUB_350006000120,WHERRY ELEMENTARY,"BLDG 25000 - KAFB EAST BLDG 25000 - KAFB EAST,...","25000 - KAFB EAST BLDG 25000 - KAFB EAST, ALBU..."


[   97.21s] 03.6 cleaned geo_query rows changed: 615
[   97.21s] 03.6 saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/schools_locations_v1.parquet
[   97.21s] 03.6 rows: 112,807 | geocoded: 3,150


Unnamed: 0,location_id,school_id,name,geo_query,geo_query_clean,geo_query_type,latitude,longitude,geo_status,geo_confidence,geo_provider,geocoded_at_utc,is_geocoded,created_at_utc
0,LOC_0001110bfa8231af,PRI_A1990729,LEGACY CLASSICAL CHRISTIAN ACADEMY,"1209 N SAGINAW BLVD G-207, SAGINAW, TX 76179","1209 N SAGINAW BLVD G-207, SAGINAW, TX 76179",address_city_state_zip,,,,,,NaT,False,2026-01-19 06:36:16.667383+00:00
1,LOC_00022cd26b287855,PUB_060172913979,Rocketship Rising Stars,"3173 Senter Rd. 350 Twin Dolphin Dr. Ste. 109,...","3173 Senter Rd. 350 Twin Dolphin Dr. Ste. 109,...",address_city_state_zip,37.293723,-121.833725,geocoded,medium,opencage,2026-01-18 07:07:25.521601+00:00,True,2026-01-19 06:36:16.667383+00:00
2,LOC_0003df1b9c683b8c,PUB_470129000388,Franklin Co High School,"833 Bypass RD 833 Bypass RD, Winchester, TN 37398","833 Bypass RD 833 Bypass RD, Winchester, TN 37398",address_city_state_zip,,,,,,NaT,False,2026-01-19 06:36:16.667383+00:00
3,LOC_000405c25d9fc8c0,PUB_410450001416,Lake Creek Learning Center,"391 Lake Creek Loop Rd 391 Lake Creek Loop Rd,...","391 Lake Creek Loop Rd 391 Lake Creek Loop Rd,...",address_city_state_zip,,,,,,NaT,False,2026-01-19 06:36:16.667383+00:00
4,LOC_00049c7c3f635501,PRI_A1902341,MERRIMAC HEIGHTS ACADEMY,"102 W MAIN ST, MERRIMAC, MA 01860","102 W MAIN ST, MERRIMAC, MA 01860",address_city_state_zip,,,,,,NaT,False,2026-01-19 06:36:16.667383+00:00
5,LOC_0004dff929dfd822,PUB_251332202949,Worcester Cultural Academy Charter Public School,"81 Plantation Street 81 Plantation Street, Wor...","81 Plantation Street 81 Plantation Street, Wor...",address_city_state_zip,,,,,,NaT,False,2026-01-19 06:36:16.667383+00:00
6,LOC_00054e9fd6a3a5dd,PUB_063417009434,Howard Inghram Elementary,"1695 West 19th St. 1695 West 19th St., San Ber...","1695 West 19th St. 1695 West 19th St., San Ber...",address_city_state_zip,,,,,,NaT,False,2026-01-19 06:36:16.667383+00:00
7,LOC_000559c72125ba03,PUB_220162001243,Franklin Junior High School,"525 Morris Street 525 Morris Street, Franklin,...","525 Morris Street 525 Morris Street, Franklin,...",address_city_state_zip,,,,,,NaT,False,2026-01-19 06:36:16.667383+00:00
8,LOC_00059c5b141e66bf,PUB_210510002342,Bluegrass Discovery Academy High,"551 Viking Drive 551 Viking Drive, Morehead, K...","551 Viking Drive 551 Viking Drive, Morehead, K...",address_city_state_zip,,,,,,NaT,False,2026-01-19 06:36:16.667383+00:00
9,LOC_0005df4a7a031c16,PRI_A1901132,FAITH ACADEMY,"3832 W NEW HAMPSHIRE ST, ORLANDO, FL 32808","3832 W NEW HAMPSHIRE ST, ORLANDO, FL 32808",address_city_state_zip,,,,,,NaT,False,2026-01-19 06:36:16.667383+00:00


[   97.22s] 03.6 completed


In [597]:
# ============================
# 03.7 Best geocoded location per school (display-only fields)
# ============================

log("03.7 started")

assert "schools_locations" in globals(), "schools_locations not found (run 03.6)"

locs = schools_locations.copy()

# confidence normalization
conf = locs["geo_confidence"].astype(str).str.lower()
conf_map = {"high": 3, "medium": 2, "low": 1}
locs["conf_score"] = pd.to_numeric(conf, errors="coerce")
locs.loc[locs["conf_score"].isna(), "conf_score"] = conf.map(conf_map).fillna(0)

locs["has_latlng"] = locs["latitude"].notna() & locs["longitude"].notna()
locs["geocoded_at_utc"] = pd.to_datetime(locs["geocoded_at_utc"], utc=True, errors="coerce")

# pick best row per school: geocoded > higher confidence > newest
locs_sorted = locs.sort_values(
    ["school_id", "has_latlng", "conf_score", "geocoded_at_utc"],
    ascending=[True, False, False, False]
)

best_location = locs_sorted.drop_duplicates("school_id", keep="first")[[
    "school_id",
    "location_id",
    "geo_query_clean",
    "latitude", "longitude",
    "geo_confidence", "geo_provider",
    "geocoded_at_utc",
    "geo_status",
    "has_latlng",
]].rename(columns={
    "location_id": "best_location_id",
    "geo_query_clean": "best_geo_query",
    "latitude": "best_latitude",
    "longitude": "best_longitude",
    "geo_confidence": "best_geo_confidence",
    "geo_provider": "best_geo_provider",
    "geocoded_at_utc": "best_geocoded_at_utc",
    "geo_status": "best_geo_status",
})

log(f"03.7 best_location rows: {len(best_location):,}")
log(f"03.7 schools with lat/lng: {int(best_location['has_latlng'].sum()):,}")

display(best_location.head(20))

# quick Harker check
display(best_location[best_location["school_id"] == "PRI_A2190096"])

log("03.7 completed")


[   97.22s] 03.7 started
[   97.38s] 03.7 best_location rows: 112,807
[   97.38s] 03.7 schools with lat/lng: 3,150


Unnamed: 0,school_id,best_location_id,best_geo_query,best_latitude,best_longitude,best_geo_confidence,best_geo_provider,best_geocoded_at_utc,best_geo_status,has_latlng
37024,CAIS_0fb36d096d98,LOC_5400abb3b9d55e6f,"1212 4th St, San Rafael, CA 94901",37.973382,-122.529727,10.0,opencage,2026-01-19 06:35:34.269463+00:00,geocoded,True
100950,CAIS_1ff9891ebd76,LOC_e531a4822f01f3fb,"345 12th St, Oakland, CA 94607",37.801842,-122.269525,10.0,opencage,2026-01-19 06:35:37.249900+00:00,geocoded,True
28239,CAIS_38bcb29a33d7,LOC_403e57214a9f94c5,"2550 W El Camino Real, Mountain View, CA 94040",37.400426,-122.11166,8.0,opencage,2026-01-19 06:35:39.495647+00:00,geocoded,True
72015,CAIS_725851f3335f,LOC_a3879486763636de,"2350 Powell St, Emeryville, CA 94608",37.837647,-122.30268,10.0,opencage,2026-01-18 07:58:10.332204+00:00,geocoded,True
83121,CAIS_8aea417dcbe4,LOC_bcb56ed3c6199bda,"1201 Ortega St, San Francisco, CA 94122",37.752059,-122.476879,9.0,opencage,2026-01-19 06:35:41.645331+00:00,geocoded,True
68980,CAIS_a18f4a9cd078,LOC_9cb45a9b554aa15b,"150 Oak St, San Francisco, CA 94102",37.775378,-122.421605,10.0,opencage,2026-01-18 07:58:12.424328+00:00,geocoded,True
32848,CAIS_d6d2f01752a3,LOC_4abf72d4e015a031,"180 N San Pedro Rd, San Rafael, CA 94903",38.001337,-122.522714,10.0,opencage,2026-01-18 07:58:14.669413+00:00,geocoded,True
52327,CAIS_d9ad26e7d66c,LOC_76e439320699a131,"3100 Webber St, Palo Alto, CA 94306",37.415728,-122.130765,6.0,opencage,2026-01-18 07:58:16.822708+00:00,geocoded,True
92491,PRI_00000033,LOC_d1ef82ec6899b934,"700 ALBERT RAINS BLVD, GADSDEN, AL 35901",,,,,NaT,,False
39778,PRI_00000044,LOC_5a630c8dd454bdc5,"601 JAMES I HARRISON JR PKWY E, TUSCALOOSA, AL...",,,,,NaT,,False


Unnamed: 0,school_id,best_location_id,best_geo_query,best_latitude,best_longitude,best_geo_confidence,best_geo_provider,best_geocoded_at_utc,best_geo_status,has_latlng
105426,PRI_A2190096,LOC_ef542a0a61eac97c,"500 SARATOGA AVE, SAN JOSE, CA 95129",37.317627,-121.971867,high,opencage,2026-01-18 03:48:41.748505+00:00,geocoded,True


[   97.39s] 03.7 completed


In [598]:
# ============================
# 03.8 Attach best_* location fields to schools_master (canonical one-row table) — FINAL (warning-free)
#   Input : IN_SCHOOLS_V1 (Notebook04 schools_master) + best_location (from 03.7)
#   Output: NB05_DIR/schools_master_geo_v1.parquet (+ csv)
# ============================

log("03.8 started")

# ---- required globals ----
assert "IN_SCHOOLS_V1" in globals(), "IN_SCHOOLS_V1 not defined"
assert "NB05_DIR" in globals(), "NB05_DIR not defined"
assert "best_location" in globals(), "best_location not found (run 03.7)"

assert Path(IN_SCHOOLS_V1).exists(), f"Missing file: {IN_SCHOOLS_V1}"

# ----------------------------
# 03.8a Load schools_master (Notebook04 input)
# ----------------------------
schools_master = pd.read_parquet(IN_SCHOOLS_V1)
log(f"Loaded schools_master: {IN_SCHOOLS_V1} | shape={schools_master.shape}")

# ----------------------------
# 03.8b Normalize best_location confidence + merge
# ----------------------------
sm = schools_master.copy()
bl = best_location.copy()

# Ensure expected columns exist (defensive)
needed_bl_cols = [
    "school_id",
    "best_location_id",
    "best_geo_query",
    "best_latitude",
    "best_longitude",
    "best_geo_confidence",
    "best_geo_provider",
    "best_geocoded_at_utc",
    "best_geo_status",
    "has_latlng",
]
for c in needed_bl_cols:
    if c not in bl.columns:
        bl[c] = pd.NA

# --- normalize best_geo_confidence into BOTH label + numeric score ---
conf_raw = bl["best_geo_confidence"]
conf_num = pd.to_numeric(conf_raw, errors="coerce")

def conf_to_label(x):
    if pd.isna(x):
        return ""
    s = str(x).strip().lower()
    if s in {"high", "medium", "low"}:
        return s
    try:
        v = float(s)
    except Exception:
        return ""
    if v >= 8:
        return "high"
    if v >= 5:
        return "medium"
    return "low"

bl["best_geo_confidence_label"] = conf_raw.map(conf_to_label)

label_to_num = {"high": 3, "medium": 2, "low": 1, "": 0}
bl["best_geo_confidence_score"] = conf_num
bl.loc[bl["best_geo_confidence_score"].isna(), "best_geo_confidence_score"] = (
    bl["best_geo_confidence_label"].map(label_to_num).fillna(0)
)

# Normalize timestamps + key columns
bl["best_geocoded_at_utc"] = pd.to_datetime(bl["best_geocoded_at_utc"], utc=True, errors="coerce")
bl["best_latitude"] = pd.to_numeric(bl["best_latitude"], errors="coerce")
bl["best_longitude"] = pd.to_numeric(bl["best_longitude"], errors="coerce")

# Ensure boolean (BEFORE merge)
bl["has_latlng"] = bl["has_latlng"].fillna(False).astype(bool)

# Only bring over the best_* fields
keep_cols = [
    "school_id",
    "best_location_id",
    "best_geo_query",
    "best_latitude",
    "best_longitude",
    "best_geo_confidence_label",
    "best_geo_confidence_score",
    "best_geo_provider",
    "best_geocoded_at_utc",
    "best_geo_status",
    "has_latlng",
]

sm2 = sm.merge(bl[keep_cols], on="school_id", how="left")

# Ensure boolean (AFTER merge) — fixes pandas FutureWarning
sm2["has_latlng"] = sm2["has_latlng"].eq(True)

# Coverage stats (warning-free)
log(f"03.8 schools_master rows: {len(sm2):,}")
log(f"03.8 best_lat/lng coverage: {int(sm2['has_latlng'].sum()):,} / {len(sm2):,}")

# Harker sanity
try:
    display(
        sm2.loc[sm2["school_id"].eq("PRI_A2190096"),
                ["school_id","name","best_geo_query","best_latitude","best_longitude",
                 "best_geo_confidence_label","best_geo_confidence_score","best_geo_status","has_latlng"]]
    )
except Exception:
    print(
        sm2.loc[sm2["school_id"].eq("PRI_A2190096"),
                ["school_id","name","best_geo_query","best_latitude","best_longitude",
                 "best_geo_confidence_label","best_geo_confidence_score","best_geo_status","has_latlng"]]
    )

# Update in-memory
schools_master = sm2

# ----------------------------
# 03.8c Save enriched schools_master (Notebook05 output)
# ----------------------------
OUT_SM = NB05_DIR / "schools_master_geo_v1.parquet"
OUT_SM_CSV = NB05_DIR / "schools_master_geo_v1.csv"

OUT_SM.parent.mkdir(parents=True, exist_ok=True)
schools_master.to_parquet(OUT_SM, index=False)
schools_master.to_csv(OUT_SM_CSV, index=False)

log(f"03.8 saved: {OUT_SM}")
log(f"03.8 saved: {OUT_SM_CSV}")

log("03.8 completed")


[   97.40s] 03.8 started
[   97.43s] Loaded schools_master: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook04/schools_master_v1.parquet | shape=(126616, 18)
[   97.52s] 03.8 schools_master rows: 126,616
[   97.52s] 03.8 best_lat/lng coverage: 3,150 / 126,616


Unnamed: 0,school_id,name,best_geo_query,best_latitude,best_longitude,best_geo_confidence_label,best_geo_confidence_score,best_geo_status,has_latlng
119411,PRI_A2190096,HARKER,"500 SARATOGA AVE, SAN JOSE, CA 95129",37.317627,-121.971867,high,3.0,geocoded,True


[   98.08s] 03.8 saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/schools_master_geo_v1.parquet
[   98.08s] 03.8 saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/schools_master_geo_v1.csv
[   98.08s] 03.8 completed


In [602]:
# quick sanity: has_latlng must be bool, no NA
print("has_latlng dtype:", schools_master["has_latlng"].dtype)
print("has_latlng nulls:", int(schools_master["has_latlng"].isna().sum()))
print("has_latlng true:", int(schools_master["has_latlng"].sum()))


has_latlng dtype: bool
has_latlng nulls: 0
has_latlng true: 3150


## 04.0 Partner Geocode TODO (Enriched School Linking)

### Purpose
Ensure **pair-complete geocoding** for cross-source duplicates (the “ghost duplicates” problem),
so spatial duplicate detection in Section 05 can work reliably.

### Why this is needed
Our geocoding focus so far is biased toward:
- Bay Area schools, and
- schools with enrichment flags (CAIS / IB / Montessori / Waldorf)

But duplicate pairs often look like:
- an **enriched** record (e.g., CAIS)
- plus a **non-enriched** partner record (e.g., PSS or state-private)

If the partner record is not geocoded, we cannot form a spatial candidate pair.

### Approach
For each enriched school:
1. Restrict candidates to the **same city + state** (fast and conservative).
2. Compute **name token overlap** (Jaccard similarity).
3. Collect the partner record’s `(geo_query, geo_query_type)` keys.
4. Remove keys that are **already geocoded** (avoid wasted API calls).
5. Produce `todo_partners` to be appended into the next geocode run.

### Output
- `todo_partners`: DataFrame of `(geo_query, geo_query_type)` keys to geocode next.
- Feeds into the next geocode run by unioning with `todo_to_geocode`.

### Notes / Guardrails
- This step does **not** merge anything.
- It only ensures we can **detect** duplicates later by ensuring both sides have coordinates.


In [605]:
# ============================
# 04.0 Partner Geocode TODO (Enriched School Linking) — FINAL (FIXED)
#   Goal: ensure BOTH sides of cross-source duplicates are geocoded
#   Output: todo_partners (geo_query, geo_query_type) NOT already geocoded
# ============================

log("04.0 started")

assert "df_loc" in globals(), "df_loc not found (need location-level table)"
assert "canon_q" in globals(), "canon_q() not found"
has_df_loc_geo = ("df_loc_geo" in globals())

# ----------------------------
# 0) Helpers
# ----------------------------
DROP_TOKENS = {
    "school","academy","elementary","middle","high","charter","campus","the",
    "prep","preparatory","institute","inc","llc"
}

def norm_tokens(s: object) -> set:
    s = "" if pd.isna(s) else str(s)
    s = s.lower()
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    toks = [t for t in s.split() if t and t not in DROP_TOKENS and len(t) >= 2]
    return set(toks)

def jaccard(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    inter = len(a & b)
    union = len(a | b)
    return inter / union if union else 0.0

# ----------------------------
# 1) Prep dfp (tokenize names; basic hygiene)
# ----------------------------
dfp = df_loc.copy()

for col in ["school_id", "name", "city", "state", "geo_query", "geo_query_type"]:
    assert col in dfp.columns, f"df_loc missing required column: {col}"

# normalize join fields
dfp["city_norm"] = dfp["city"].astype(str).str.strip().str.lower()
dfp["state_norm"] = dfp["state"].astype(str).str.strip().str.upper()

dfp["geo_query"] = dfp["geo_query"].astype(str).map(canon_q)
dfp["geo_query_type"] = dfp["geo_query_type"].astype(str).map(canon_q)

dfp["name_tokens"] = dfp["name"].apply(norm_tokens)

enriched = dfp[dfp.get("has_any_enrichment", False) == True].copy()
log(f"Enriched rows: {len(enriched):,}")

# ----------------------------
# 2) Build already-geocoded key set
# ----------------------------
already_geocoded = set()

if has_df_loc_geo:
    g = df_loc_geo.copy()
    for col in ["geo_query", "geo_query_type", "latitude", "longitude"]:
        assert col in g.columns, f"df_loc_geo missing required column: {col}"

    g["geo_query"] = g["geo_query"].astype(str).map(canon_q)
    g["geo_query_type"] = g["geo_query_type"].astype(str).map(canon_q)

    ok = g["latitude"].notna() & g["longitude"].notna()
    already_geocoded = set(zip(g.loc[ok, "geo_query"], g.loc[ok, "geo_query_type"]))
    log(f"Already-geocoded keys (from df_loc_geo): {len(already_geocoded):,}")
else:
    log("NOTE: df_loc_geo not found; skipping 'already geocoded' filtering (will include more keys).")

# ----------------------------
# 3) Candidate generation (city/state scoped) + token overlap
# ----------------------------
TOK_SIM_THRESHOLD = 0.60
MAX_PARTNERS_PER_ENRICHED = 50
ONLY_ADDRESS_KEYS = True

partner_keys = set()
rows_considered = 0

groups = dict(tuple(dfp.groupby(["state_norm", "city_norm"], sort=False)))

for r in enriched.itertuples(index=False):
    st = getattr(r, "state_norm")
    ct = getattr(r, "city_norm")

    cand = groups.get((st, ct))
    if cand is None or len(cand) == 0:
        continue

    rows_considered += len(cand)

    rt = getattr(r, "name_tokens")
    if not rt:
        continue

    tmp = cand.copy()
    tmp["tok_sim"] = tmp["name_tokens"].apply(lambda t: jaccard(rt, t))
    matches = tmp[tmp["tok_sim"] >= TOK_SIM_THRESHOLD]

    if ONLY_ADDRESS_KEYS:
        matches = matches[matches["geo_query_type"] == "address_city_state_zip"]

    matches = matches.sort_values("tok_sim", ascending=False).head(MAX_PARTNERS_PER_ENRICHED)

    for q, qt in zip(matches["geo_query"], matches["geo_query_type"]):
        key = (q, qt)
        if key in already_geocoded:
            continue
        partner_keys.add(key)

log(f"Partner candidate keys (not already geocoded): {len(partner_keys):,} (searched {rows_considered:,} rows total)")

# ----------------------------
# 4) Build todo_partners dataframe + drop placeholders
# ----------------------------
todo_partners = (
    pd.DataFrame(list(partner_keys), columns=["geo_query", "geo_query_type"])
      .sort_values(["geo_query_type", "geo_query"])
      .reset_index(drop=True)
)

todo_partners = todo_partners[
    ~todo_partners["geo_query"].astype(str).str.contains(r"^<<<", na=False)
].reset_index(drop=True)

display(todo_partners.head(25))

log("04.0 completed")


[  106.74s] 04.0 started
[  107.50s] Enriched rows: 247
[  107.84s] Already-geocoded keys (from df_loc_geo): 3,290
[  110.09s] Partner candidate keys (not already geocoded): 0 (searched 36,014 rows total)


Unnamed: 0,geo_query,geo_query_type


[  110.09s] 04.0 completed


In [607]:
log(f"todo_partners (raw) rows: {len(todo_partners):,}")
display(todo_partners.head(10))

ph = todo_partners[todo_partners["geo_query"].astype(str).str.contains(r"^<<<", na=False)]
log(f"todo_partners placeholder rows: {len(ph):,}")
display(ph)


[  114.14s] todo_partners (raw) rows: 0


Unnamed: 0,geo_query,geo_query_type


[  114.15s] todo_partners placeholder rows: 0


Unnamed: 0,geo_query,geo_query_type


In [609]:
# ============================
# 04.1 Union TODO lists for next geocode run
# ============================

assert "todo_to_geocode" in globals(), "todo_to_geocode not found (run 03.3)"
assert "todo_partners" in globals(), "todo_partners not found (run 04.0)"

todo_union = pd.concat(
    [
        todo_to_geocode[["geo_query","geo_query_type"]],
        todo_partners[["geo_query","geo_query_type"]],
    ],
    ignore_index=True
).drop_duplicates(["geo_query","geo_query_type"]).reset_index(drop=True)

log(f"04.1 todo_union keys: {len(todo_union):,} (03.3 + partners)")

display(todo_union.head(25))


[  119.96s] 04.1 todo_union keys: 6 (03.3 + partners)


Unnamed: 0,geo_query,geo_query_type
0,"1212 4th St, San Rafael, CA 94901",address_city_state_zip
1,"345 12th St, Oakland, CA 94607",address_city_state_zip
2,"2550 W El Camino Real, Mountain View, CA 94040",address_city_state_zip
3,"1201 Ortega St, San Francisco, CA 94122",address_city_state_zip
4,Golden Hills Education Ctr. 2460 Clay Bank Rd....,address_city_state_zip
5,Golden Hills Education Ctr. 2460 Clay Bank Rd....,address_city_state_zip


In [611]:
# ============================
# 04.2 Geocode partner TODO (OpenCage) — CLEANED + DEDUPED + CACHE-SAFE
#   Input : todo_partners (04.0), geo_cache (loaded), GEO_CACHE_PATH, clean_address_query_any_state()
#   Output: geo_cache updated on disk
# ============================

log("04.2 started")

import os, re, time, requests
import numpy as np
import pandas as pd
from datetime import datetime, timezone

assert "todo_partners" in globals(), "todo_partners not found (run 04.0)"
assert "geo_cache" in globals(), "geo_cache not found (load cache like 03.7/03.8)"
assert "GEO_CACHE_PATH" in globals(), "GEO_CACHE_PATH not found"
assert "canon_q" in globals(), "canon_q not found"
assert "clean_address_query_any_state" in globals(), "clean_address_query_any_state not found"

OPENCAGE_API_KEY = os.getenv("OPENCAGE_API_KEY")
if not OPENCAGE_API_KEY:
    raise ValueError("Missing OPENCAGE_API_KEY. Ensure .env is loaded or env var is set.")

OPENCAGE_URL = "https://api.opencagedata.com/geocode/v1/json"
CACHE_VERSION = "v1"

MAX_REQUESTS_THIS_RUN = 300
SLEEP_EVERY = 50
SLEEP_SEC = 0.6
FLUSH_EVERY = 100
MAX_QUERY_LEN = 180

US_STREETNUM_RE = re.compile(r"^\s*\d{1,6}\b")
POBOX_RE = re.compile(r"^\s*(?:PO BOX|P\.O\.)\b", re.IGNORECASE)
PLACEHOLDER_RE = re.compile(r"^<<<")

def normalize_confidence_from_opencage(best_result: dict) -> str:
    conf_num = best_result.get("confidence", None)
    if conf_num is None:
        return "medium"
    try:
        conf_num = float(conf_num)
    except Exception:
        return "medium"
    if conf_num >= 8:
        return "high"
    if conf_num >= 5:
        return "medium"
    return "low"

# ---- normalize cache keys + ok set ----
geo_cache = geo_cache.copy()
geo_cache["geo_query"] = geo_cache["geo_query"].astype(str).map(canon_q)
geo_cache["geo_query_type"] = geo_cache["geo_query_type"].astype(str).map(canon_q)

ok_mask = geo_cache["latitude"].notna() & geo_cache["longitude"].notna()
ok_keys = set(zip(geo_cache.loc[ok_mask, "geo_query"], geo_cache.loc[ok_mask, "geo_query_type"]))
log(f"04.2 cache ok_keys: {len(ok_keys):,}")

# ---- build cleaned partner keys ----
tp = todo_partners.copy()
tp["geo_query"] = tp["geo_query"].astype(str).map(canon_q)
tp["geo_query_type"] = tp["geo_query_type"].astype(str).map(canon_q)
tp["geo_query_clean"] = tp["geo_query"].map(clean_address_query_any_state).map(canon_q)

tp = tp[~tp["geo_query_clean"].str.contains(PLACEHOLDER_RE, na=False, regex=True)].copy()

tp = tp[
    tp["geo_query_clean"].str.contains(US_STREETNUM_RE, na=False, regex=True) |
    tp["geo_query_clean"].str.contains(POBOX_RE, na=False, regex=True)
].copy()

tp = tp[tp["geo_query_clean"].str.len() <= MAX_QUERY_LEN].copy()
tp = tp.drop_duplicates(["geo_query_clean", "geo_query_type"]).reset_index(drop=True)

mask_not_ok = ~tp.apply(lambda r: (r["geo_query_clean"], r["geo_query_type"]) in ok_keys, axis=1)
tp = tp.loc[mask_not_ok].reset_index(drop=True)

chunk = tp.head(MAX_REQUESTS_THIS_RUN).reset_index(drop=True)

log(f"04.2 attempting: {len(chunk):,} keys (remaining not-ok: {len(tp):,})")
display(chunk[["geo_query_clean", "geo_query_type"]].head(25))

# ---- OpenCage call ----
session = requests.Session()

def opencage_geocode_one(query: str) -> dict:
    params = {
        "q": query,
        "key": OPENCAGE_API_KEY,
        "limit": 1,
        "language": "en",
        "no_annotations": 1,
    }
    r = session.get(OPENCAGE_URL, params=params, timeout=30)
    if r.status_code == 402:
        raise requests.HTTPError("402 Payment Required", response=r)
    r.raise_for_status()

    data = r.json()
    status = data.get("status", {})
    if status.get("code") != 200:
        raise RuntimeError(f"OpenCage status not OK: {status}")

    if not data.get("results"):
        return {"lat": np.nan, "lng": np.nan, "conf": "failed"}

    best = data["results"][0]
    geom = best.get("geometry") or {}
    lat = geom.get("lat", np.nan)
    lng = geom.get("lng", np.nan)

    if lat is None or lng is None or np.isnan(lat) or np.isnan(lng):
        return {"lat": np.nan, "lng": np.nan, "conf": "failed"}

    return {"lat": float(lat), "lng": float(lng), "conf": normalize_confidence_from_opencage(best)}

needed_cols = [
    "geo_query","geo_query_type","latitude","longitude",
    "geo_confidence","geo_provider","geocoded_at_utc","cache_version","geo_status"
]

def upsert_best(geo_cache_in: pd.DataFrame, new_rows: list) -> pd.DataFrame:
    if not new_rows:
        return geo_cache_in

    add = pd.DataFrame(new_rows)
    for c in needed_cols:
        if c not in add.columns:
            add[c] = pd.NA

    add["geo_query"] = add["geo_query"].astype(str).map(canon_q)
    add["geo_query_type"] = add["geo_query_type"].astype(str).map(canon_q)
    add["latitude"] = pd.to_numeric(add["latitude"], errors="coerce")
    add["longitude"] = pd.to_numeric(add["longitude"], errors="coerce")
    add["geo_confidence"] = add["geo_confidence"].astype(str).fillna("")
    add["geo_provider"] = add["geo_provider"].astype(str).fillna("")
    add["geo_status"] = add["geo_status"].astype(str).fillna("")
    add["geocoded_at_utc"] = pd.to_datetime(add["geocoded_at_utc"], utc=True, errors="coerce")
    add["cache_version"] = add["cache_version"].astype(str).fillna(CACHE_VERSION)

    geo_cache2 = pd.concat([geo_cache_in, add], ignore_index=True)

    conf_rank = {"high": 3, "medium": 2, "low": 1, "failed": 0, "": 0}
    status_rank = {"geocoded": 3, "manual": 2, "placeholder": 1, "failed": 0, "": 0}

    geo_cache2["_has_latlng"] = geo_cache2["latitude"].notna() & geo_cache2["longitude"].notna()
    geo_cache2["_conf_rank"] = geo_cache2["geo_confidence"].map(conf_rank).fillna(0).astype(int)
    geo_cache2["_status_rank"] = geo_cache2["geo_status"].map(status_rank).fillna(0).astype(int)

    geo_cache2 = geo_cache2.sort_values(
        by=["geo_query","geo_query_type","cache_version","_has_latlng","_conf_rank","_status_rank","geocoded_at_utc"],
        ascending=[True, True, True, False, False, False, False],
    ).drop_duplicates(
        subset=["geo_query","geo_query_type","cache_version"],
        keep="first"
    ).drop(columns=["_has_latlng","_conf_rank","_status_rank"], errors="ignore")

    geo_cache2.to_parquet(GEO_CACHE_PATH, index=False)
    return geo_cache2

# ---- run geocoding ----
new_rows = []
err_examples = 0
stopped_on_402 = False

for i, row in enumerate(chunk.itertuples(index=False), start=1):
    q_clean = canon_q(row.geo_query_clean)
    qt = canon_q(row.geo_query_type)

    try:
        res = opencage_geocode_one(q_clean)
        new_rows.append({
            "geo_query": q_clean,
            "geo_query_type": qt,
            "latitude": res["lat"],
            "longitude": res["lng"],
            "geo_confidence": res["conf"],
            "geo_provider": "opencage",
            "geocoded_at_utc": datetime.now(timezone.utc).isoformat(),
            "cache_version": CACHE_VERSION,
            "geo_status": "geocoded" if res["conf"] != "failed" else "failed",
        })
    except requests.HTTPError as e:
        if getattr(e.response, "status_code", None) == 402:
            log("🛑 OpenCage 402 (quota/payment). Stopping early.")
            stopped_on_402 = True
            break
        if err_examples < 3:
            log(f"❗HTTPError example: {repr(e)} | query: {q_clean}")
            err_examples += 1
        new_rows.append({
            "geo_query": q_clean,
            "geo_query_type": qt,
            "latitude": np.nan,
            "longitude": np.nan,
            "geo_confidence": "failed",
            "geo_provider": "opencage_error",
            "geocoded_at_utc": datetime.now(timezone.utc).isoformat(),
            "cache_version": CACHE_VERSION,
            "geo_status": "failed",
        })

    if i % 25 == 0:
        log(f"  ... {i}/{len(chunk)} done")
    if i % SLEEP_EVERY == 0:
        time.sleep(SLEEP_SEC)
    if i % FLUSH_EVERY == 0:
        geo_cache = upsert_best(geo_cache, new_rows)
        log(f"  ✅ flushed cache ({i}/{len(chunk)}) | cache now {len(geo_cache):,} rows")
        new_rows = []

geo_cache = upsert_best(geo_cache, new_rows)
log(f"04.2 done | cache rows now: {len(geo_cache):,}")
if stopped_on_402:
    log("NOTE: stopped early due to 402.")
log("04.2 completed")


[  121.00s] 04.2 started
[  121.02s] 04.2 cache ok_keys: 3,300
[  121.02s] 04.2 attempting: 0 keys (remaining not-ok: 0)


Unnamed: 0,geo_query_clean,geo_query_type


[  121.02s] 04.2 done | cache rows now: 3,302
[  121.02s] 04.2 completed


In [613]:
# ----------------------------
# 04.3 REFRESH df_loc_geo (PRESERVE existing lat/lng, then fill from cache)
# ----------------------------

log("04.3 refresh started")

assert "df_loc_geo" in globals(), "df_loc_geo missing"
assert "geo_cache" in globals(), "geo_cache missing"
assert "canon_q" in globals(), "canon_q missing"
assert "clean_address_query_any_state" in globals(), "clean_address_query_any_state missing"

base = df_loc_geo.copy()

# canonicalize join fields
base["geo_query"] = base["geo_query"].astype(str).map(canon_q)
base["geo_query_type"] = base["geo_query_type"].astype(str).map(canon_q)

if "geo_query_clean" not in base.columns:
    base["geo_query_clean"] = base["geo_query"].map(clean_address_query_any_state).map(canon_q)
else:
    base["geo_query_clean"] = base["geo_query_clean"].astype(str).map(canon_q)

cache_join = geo_cache[
    ["geo_query","geo_query_type","latitude","longitude","geo_confidence","geo_provider","geo_status"]
].copy()

cache_join["geo_query"] = cache_join["geo_query"].astype(str).map(canon_q)
cache_join["geo_query_type"] = cache_join["geo_query_type"].astype(str).map(canon_q)

cache_join = cache_join.rename(columns={"geo_query": "geo_query_clean"})

m = base.merge(
    cache_join,
    on=["geo_query_clean","geo_query_type"],
    how="left",
    suffixes=("", "_cache"),
)

# preserve existing values, only fill gaps from cache
for col in ["latitude","longitude","geo_confidence","geo_provider","geo_status"]:
    m[col] = m[col].where(m[col].notna(), m[f"{col}_cache"])

m = m.drop(columns=[c for c in m.columns if c.endswith("_cache")], errors="ignore")

df_loc_geo = m

log(
    f"04.3 df_loc_geo lat/lng coverage: "
    f"{int((df_loc_geo['latitude'].notna() & df_loc_geo['longitude'].notna()).sum()):,} / {len(df_loc_geo):,}"
)

log("04.3 refresh completed")


[  122.01s] 04.3 refresh started
[  122.74s] 04.3 df_loc_geo lat/lng coverage: 3,502 / 126,615
[  122.74s] 04.3 refresh completed


In [615]:
# ============================
# PRE-05 Sanity Checks (03.6 → 04.3)
# ============================

log("PRE-05 sanity started")

must = ["schools_locations", "best_location", "geo_cache", "df_loc", "df_loc_geo"]
for v in must:
    assert v in globals(), f"Missing: {v}"

# 1) schools_locations basics
log(f"schools_locations rows: {len(schools_locations):,}")
log(f"schools_locations location_id unique: {schools_locations['location_id'].is_unique}")
log(f"schools_locations geocoded: {int(schools_locations['is_geocoded'].sum()):,}")

# clean key sanity
bad_clean = schools_locations["geo_query_clean"].isna().sum()
log(f"schools_locations geo_query_clean nulls: {bad_clean:,}")

# 2) best_location basics
log(f"best_location rows: {len(best_location):,} (should ~= #schools in df_loc_geo)")
log(f"best_location has_latlng true: {int(best_location['has_latlng'].sum()):,}")

# check: one row per school_id
dup_best = best_location["school_id"].duplicated().sum()
log(f"best_location duplicated school_id: {dup_best:,}")

# 3) geo_cache coverage
geo_cache_ok = (geo_cache["latitude"].notna() & geo_cache["longitude"].notna()).sum()
log(f"geo_cache rows: {len(geo_cache):,} | ok(latlng): {int(geo_cache_ok):,}")

# 4) df_loc_geo coverage (THIS is what Section 05 should use)
df_loc_geo_ok = (df_loc_geo["latitude"].notna() & df_loc_geo["longitude"].notna()).sum()
log(f"df_loc_geo rows: {len(df_loc_geo):,} | ok(latlng): {int(df_loc_geo_ok):,}")

# 5) Sanity: CA coverage (the input size for your CA-only candidate detection)
ca = df_loc_geo[(df_loc_geo["state"] == "CA") & df_loc_geo["latitude"].notna() & df_loc_geo["longitude"].notna()]
log(f"CA geocoded rows (df_loc_geo): {len(ca):,}")

# 6) Partner TODO sanity (optional: only if you ran 04.0)
if "todo_partners" in globals():
    log(f"todo_partners rows: {len(todo_partners):,}")
    # placeholders should be filtered out by 04.2
    ph = todo_partners["geo_query"].astype(str).str.contains(r"^<<<", na=False).sum()
    log(f"todo_partners placeholder rows (should be 0-ish): {int(ph):,}")

# 7) Known-school spot check (Harker example)
harker = df_loc_geo[df_loc_geo["school_id"] == "PRI_A2190096"][["school_id","name","geo_query","latitude","longitude"]].head(3)
display(harker)

log("PRE-05 sanity completed")


[  124.72s] PRE-05 sanity started
[  124.72s] schools_locations rows: 112,807
[  124.74s] schools_locations location_id unique: True
[  124.74s] schools_locations geocoded: 3,150
[  124.75s] schools_locations geo_query_clean nulls: 0
[  124.75s] best_location rows: 112,807 (should ~= #schools in df_loc_geo)
[  124.75s] best_location has_latlng true: 3,150
[  124.75s] best_location duplicated school_id: 0
[  124.75s] geo_cache rows: 3,302 | ok(latlng): 3,300
[  124.75s] df_loc_geo rows: 126,615 | ok(latlng): 3,502
[  124.76s] CA geocoded rows (df_loc_geo): 3,434
[  124.76s] todo_partners rows: 0
[  124.76s] todo_partners placeholder rows (should be 0-ish): 0


Unnamed: 0,school_id,name,geo_query,latitude,longitude
19127,PRI_A2190096,HARKER,"500 SARATOGA AVE, SAN JOSE, CA 95129",37.317627,-121.971867


[  124.76s] PRE-05 sanity completed


In [617]:
log(f"df_loc_geo CA geocoded rows: {len(df_loc_geo[(df_loc_geo['state']=='CA') & df_loc_geo['latitude'].notna() & df_loc_geo['longitude'].notna()]):,}")


[  126.69s] df_loc_geo CA geocoded rows: 3,434


## 05. Candidate Duplicate Detection (Spatial + Name-Aware)

We generate a candidate set of potential duplicates using:
- **Spatial proximity**: schools very close to each other (e.g., within 200m)
- **Name similarity**: normalized school names that are very similar

We only consider rows with valid latitude/longitude.
This step does **not** merge anything yet — it produces a candidate pair table that will be
evaluated by merge rules in Section 06.

Output:
- `candidate_pairs` (school_id_a, school_id_b, distance_m, name_sim, merge_recommendation)


In [624]:
# ============================
# PRE-05: Quarantine provably-bad CA geocodes (outside CA bbox)
#   Goal: prevent wrong lat/lng from polluting CA duplicate detection
#   Keep it conservative: only null out points outside CA bbox.
# ============================

log("PRE-05 quarantine started")

assert "df_loc_geo" in globals(), "df_loc_geo missing"

# work on a copy, then write back (keeps this cell idempotent-ish)
x = df_loc_geo.copy()

# normalize state for reliable filtering
if "state" in x.columns:
    x["state"] = x["state"].astype(str).str.strip()
else:
    # if state is missing, we can't do CA-only quarantine safely
    raise ValueError("df_loc_geo missing 'state' column; cannot apply CA bbox quarantine safely.")

# only rows with lat/lng
has_latlng = x["latitude"].notna() & x["longitude"].notna()

# California bounding box (rough but safe)
CA_LAT_MIN, CA_LAT_MAX = 32.0, 42.5
CA_LON_MIN, CA_LON_MAX = -124.6, -114.0

# provably bad: labeled CA, has lat/lng, but outside CA bbox
bad_ca = (
    (x["state"] == "CA") & has_latlng &
    (
        ~x["latitude"].between(CA_LAT_MIN, CA_LAT_MAX) |
        ~x["longitude"].between(CA_LON_MIN, CA_LON_MAX)
    )
)

log(f"CA rows with lat/lng (pre): {int(((x['state']=='CA') & has_latlng).sum()):,}")
log(f"CA rows outside CA bbox (quarantine): {int(bad_ca.sum()):,}")

if bad_ca.any():
    show_cols = [c for c in ["school_id","name","city","state","geo_query","geo_query_type","latitude","longitude"] if c in x.columns]
    display(x.loc[bad_ca, show_cols].head(50))

# quarantine: null the coordinates only (leave geo_query for audit)
x.loc[bad_ca, ["latitude","longitude"]] = np.nan

# optional status marker (only if column exists)
if "geo_status" in x.columns:
    x.loc[bad_ca, "geo_status"] = "bad_bbox_quarantined"

# write back
df_loc_geo = x

log(
    f"df_loc_geo lat/lng coverage after quarantine: "
    f"{int((df_loc_geo['latitude'].notna() & df_loc_geo['longitude'].notna()).sum()):,} / {len(df_loc_geo):,}"
)

# final sanity: confirm no CA points remain outside bbox
bad_left = (
    (df_loc_geo["state"] == "CA") &
    (df_loc_geo["latitude"].notna() & df_loc_geo["longitude"].notna()) &
    (
        ~df_loc_geo["latitude"].between(CA_LAT_MIN, CA_LAT_MAX) |
        ~df_loc_geo["longitude"].between(CA_LON_MIN, CA_LON_MAX)
    )
)
log(f"bad_left: {int(bad_left.sum()):,}")

log("PRE-05 quarantine completed")


[  162.01s] PRE-05 quarantine started
[  162.09s] CA rows with lat/lng (pre): 3,434
[  162.09s] CA rows outside CA bbox (quarantine): 96


Unnamed: 0,school_id,name,city,state,geo_query,geo_query_type,latitude,longitude
16,CAIS_a2ad7c2318c2,The Hamlin School,San Francisco,CA,"The Hamlin School, CA",name_state_ca,47.067385,-67.868963
119,CA_1611196965891,St. Joseph Elementary,,CA,"St. Joseph Elementary, CA",name_state_ca,43.882143,-79.253353
129,CA_1611766141923,Stratford School,,CA,"Stratford School, CA",name_state_ca,53.519985,-113.58419
132,CA_1611766169924,Benjamites Academy,,CA,"Benjamites Academy, CA",name_state_ca,45.60931,-62.59927
133,CA_1611766206023,Stratford School,,CA,"Stratford School, CA",name_state_ca,53.519985,-113.58419
134,CA_1611766967483,Prince of Peace Lutheran School,,CA,"Prince of Peace Lutheran School, CA",name_state_ca,51.063473,-113.892179
135,CA_1611766972731,Holy Spirit School,,CA,"Holy Spirit School, CA",name_state_ca,45.264138,-75.926125
136,CA_1611766972996,St. Joseph Elementary,,CA,"St. Joseph Elementary, CA",name_state_ca,43.882143,-79.253353
139,CA_1611926171078,Grace Academy,,CA,"Grace Academy, CA",name_state_ca,44.089878,-78.956933
142,CA_1611926972871,St. Bede School,,CA,"St. Bede School, CA",name_state_ca,51.131877,-114.082734


[  162.10s] df_loc_geo lat/lng coverage after quarantine: 3,406 / 126,615
[  162.10s] bad_left: 0
[  162.10s] PRE-05 quarantine completed


In [628]:
# ============================
# PRE-05b Geo integrity guard (robust):
# Ensure CAIS row exists in df_loc_geo; if missing everywhere, SYNTHESIZE it from PRI row,
# then inherit PRI geocode (and make it survive geo_query_type filters).
# ============================

CAIS_ID = "CAIS_baab550e497c"   # The Harker School (CAIS)
PRI_ID  = "PRI_A2190096"        # HARKER (PRI)

assert "df_loc_geo" in globals(), "df_loc_geo missing"
assert "df_geo" in globals(), "df_geo missing"

# Normalize ids
df_loc_geo["school_id"] = df_loc_geo["school_id"].astype(str).str.strip()
df_geo["school_id"]     = df_geo["school_id"].astype(str).str.strip()

def _has_id(df, sid: str) -> bool:
    return df["school_id"].astype(str).str.strip().eq(sid).any()

def _get_row_from_any(sid: str):
    """
    Try to retrieve a single row for sid from:
      1) df_loc_geo
      2) df_geo
      3) df_v1 if present in globals()
    Returns: DataFrame with 1 row, or empty df if not found.
    """
    # df_loc_geo
    hit = df_loc_geo[df_loc_geo["school_id"].eq(sid)]
    if not hit.empty:
        return hit.head(1).copy()

    # df_geo
    hit = df_geo[df_geo["school_id"].eq(sid)]
    if not hit.empty:
        return hit.head(1).copy()

    # df_v1 (optional)
    df_v1 = globals().get("df_v1", None)
    if df_v1 is not None and "school_id" in df_v1.columns:
        df_v1["school_id"] = df_v1["school_id"].astype(str).str.strip()
        hit = df_v1[df_v1["school_id"].eq(sid)]
        if not hit.empty:
            return hit.head(1).copy()

    return pd.DataFrame()

def _inject_row(target_row_df_1):
    """
    Inject a 1-row DataFrame into df_loc_geo with aligned columns.
    """
    global df_loc_geo

    src = target_row_df_1.copy()

    # align columns
    for c in df_loc_geo.columns:
        if c not in src.columns:
            src[c] = pd.NA
    src = src[df_loc_geo.columns].head(1)

    df_loc_geo = pd.concat([df_loc_geo, src], ignore_index=True)

# ----------------------------
# 0) Ensure PRI exists (we need it to synthesize / patch)
# ----------------------------
if not _has_id(df_loc_geo, PRI_ID):
    pri_src = _get_row_from_any(PRI_ID)
    if pri_src.empty:
        raise ValueError(f"PRI id {PRI_ID} not found in df_loc_geo/df_geo/df_v1; cannot proceed.")
    _inject_row(pri_src)
    print(f"✅ Injected {PRI_ID} into df_loc_geo (from df_geo/df_v1).")

# Recompute PRI locator after possible inject
m_pri = df_loc_geo["school_id"].eq(PRI_ID)
pri_row = df_loc_geo.loc[m_pri].iloc[0]

# PRI must have lat/lng
pri_has_latlng = pd.notna(pri_row.get("latitude")) and pd.notna(pri_row.get("longitude"))
if not pri_has_latlng:
    raise ValueError("PRI row exists but missing lat/lng; cannot patch CAIS from PRI.")

# ----------------------------
# 1) Ensure CAIS exists; if missing everywhere, SYNTHESIZE from PRI
# ----------------------------
if not _has_id(df_loc_geo, CAIS_ID):
    cais_src = _get_row_from_any(CAIS_ID)

    if cais_src.empty:
        # SYNTHESIZE: copy PRI row and convert to CAIS identity
        synth = df_loc_geo.loc[m_pri].head(1).copy()
        synth.loc[:, "school_id"] = CAIS_ID

        if "name" in synth.columns:
            # try to use nicer CAIS-ish name if we have it elsewhere; otherwise keep PRI name
            synth.loc[:, "name"] = "The Harker School"

        # set CAIS flags if present
        if "has_cais" in synth.columns:
            synth.loc[:, "has_cais"] = True
        if "has_any_enrichment" in synth.columns:
            synth.loc[:, "has_any_enrichment"] = True

        # mark provenance
        if "created_from_source" in synth.columns:
            synth.loc[:, "created_from_source"] = "CAIS_SYNTH_PATCH"
        if "geo_status" in synth.columns:
            synth.loc[:, "geo_status"] = "synth_from_PRI"

        _inject_row(synth)
        print(f"✅ Synthesized + injected {CAIS_ID} into df_loc_geo from PRI row.")
    else:
        _inject_row(cais_src)
        print(f"✅ Injected {CAIS_ID} into df_loc_geo (from df_geo/df_v1).")

# ----------------------------
# 2) Patch CAIS geocode fields from PRI (only if CAIS missing lat/lng)
# ----------------------------
m_cais = df_loc_geo["school_id"].eq(CAIS_ID)

cais_has_latlng = (
    df_loc_geo.loc[m_cais, "latitude"].notna().any() and
    df_loc_geo.loc[m_cais, "longitude"].notna().any()
)

geo_cols = ["geo_query","geo_query_type","latitude","longitude","geo_confidence","geo_provider"]
geo_cols = [c for c in geo_cols if c in df_loc_geo.columns]

if cais_has_latlng:
    print("CAIS already has lat/lng — no patch needed.")
else:
    for c in geo_cols:
        df_loc_geo.loc[m_cais, c] = pri_row.get(c)

    # ensure it survives your Section 05 geo_query_type filter
    if "geo_query_type" in df_loc_geo.columns:
        df_loc_geo.loc[m_cais, "geo_query_type"] = "address_city_state_zip"
    if "geo_status" in df_loc_geo.columns:
        df_loc_geo.loc[m_cais, "geo_status"] = "patched_from_PRI"

    print("✅ Patched CAIS geocode from PRI.")

# ----------------------------
# 3) Quick confirmation
# ----------------------------
show_cols = [c for c in [
    "school_id","name","city","state","zip",
    "geo_query","geo_query_type","latitude","longitude","geo_confidence","geo_provider","geo_status",
    "has_cais","has_any_enrichment"
] if c in df_loc_geo.columns]

display(df_loc_geo.loc[df_loc_geo["school_id"].isin([CAIS_ID, PRI_ID]), show_cols])
print("df_loc_geo rows now:", len(df_loc_geo))


✅ Synthesized + injected CAIS_baab550e497c into df_loc_geo from PRI row.
CAIS already has lat/lng — no patch needed.


Unnamed: 0,school_id,name,city,state,zip,geo_query,geo_query_type,latitude,longitude,geo_confidence,geo_provider,geo_status,has_cais,has_any_enrichment
19127,PRI_A2190096,HARKER,SAN JOSE,CA,95129,"500 SARATOGA AVE, SAN JOSE, CA 95129",address_city_state_zip,37.317627,-121.971867,high,opencage,geocoded,True,False
126615,CAIS_baab550e497c,The Harker School,SAN JOSE,CA,95129,"500 SARATOGA AVE, SAN JOSE, CA 95129",address_city_state_zip,37.317627,-121.971867,high,opencage,synth_from_PRI,True,True


df_loc_geo rows now: 126616


In [630]:
# ============================
# 05. Candidate Duplicate Detection (CA, fast grid bucket)  [GUARDED + BETTER]
#   Goal: produce candidate_pairs for merge rules (Section 06)
#   Uses: df_loc_geo (refreshed in 04.3)
# ============================

log("05 started")

assert "df_loc_geo" in globals(), "df_loc_geo not found (run 04.3 refresh)"

# ----------------------------
# 0) Tuning
# ----------------------------
MAX_CANDIDATE_DISTANCE_M = globals().get("MAX_CANDIDATE_DISTANCE_M", 200)
MAX_MERGE_DISTANCE_M     = globals().get("MAX_MERGE_DISTANCE_M", 75)

NAME_SIM_THRESHOLD_AUTO   = globals().get("NAME_SIM_THRESHOLD_AUTO", 0.85)
NAME_SIM_THRESHOLD_REVIEW = globals().get("NAME_SIM_THRESHOLD_REVIEW", 0.65)

GRID = globals().get("GRID_DEG", 0.002)  # ~222m latitude

# California bounding box
CA_LAT_MIN, CA_LAT_MAX = 32.0, 42.5
CA_LON_MIN, CA_LON_MAX = -124.6, -114.0

# Optional: only use strong query types for dedupe candidates (recommended)
USE_STRONG_QUERY_TYPES_ONLY = True
STRONG_TYPES = {"address_city_state_zip"}  # add "city_state" if you want

# ----------------------------
# 1) Prep + normalize types
# ----------------------------
df = df_loc_geo.copy()
df["state"] = df["state"].astype(str).str.upper()
df["latitude"] = pd.to_numeric(df["latitude"], errors="coerce")
df["longitude"] = pd.to_numeric(df["longitude"], errors="coerce")

# ----------------------------
# 2) CA filter + bbox guardrail
# ----------------------------
ca = df[(df["state"] == "CA") & df["latitude"].notna() & df["longitude"].notna()].copy()
log(f"CA rows with lat/lng (pre-guard): {len(ca):,}")

bad_ca = ca[
    (ca["latitude"] < CA_LAT_MIN) | (ca["latitude"] > CA_LAT_MAX) |
    (ca["longitude"] < CA_LON_MIN) | (ca["longitude"] > CA_LON_MAX)
].copy()

log(f"CA rows OUTSIDE CA bbox (bad geocodes): {len(bad_ca):,}")
if len(bad_ca) > 0:
    display(bad_ca[["school_id","name","city","state","geo_query","geo_query_type","latitude","longitude"]].head(30))

ca = ca.drop(bad_ca.index).reset_index(drop=True)

# Optional: only keep strong query types
if USE_STRONG_QUERY_TYPES_ONLY and "geo_query_type" in ca.columns:
    before = len(ca)
    ca = ca[ca["geo_query_type"].astype(str).isin(STRONG_TYPES)].copy()
    log(f"CA rows kept after geo_query_type filter: {len(ca):,} (dropped {before - len(ca):,})")

log(f"CA geocoded rows used for candidate detection (post-guard): {len(ca):,}")

# ----------------------------
# 3) Name normalization for token similarity
#   IMPORTANT: keep grade-level tokens (elementary/middle/high) so co-located schools don't auto-merge
# ----------------------------
STOP = {
    "school","academy","charter","campus","the",
    "prep","preparatory","institute","inc","llc","of","and","for","at"
}

def normalize_name(s: object) -> str:
    s = "" if pd.isna(s) else str(s).lower()
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    toks = [t for t in s.split() if t and t not in STOP and len(t) >= 2]
    return " ".join(toks)

ca["name_norm"] = ca["name"].map(normalize_name)

# ✅ FIX #1: define bins
bins = ca[[
    "school_id","name","name_norm","latitude","longitude","geo_query","geo_query_type"
]].copy()

# ----------------------------
# 4) Grid bucketing
# ----------------------------
bins["lat_bin"] = (bins["latitude"] / GRID).astype(int)
bins["lon_bin"] = (bins["longitude"] / GRID).astype(int)

neighbors = [(0,0),(1,0),(-1,0),(0,1),(0,-1),(1,1),(1,-1),(-1,1),(-1,-1)]

bin_map = {}
for idx, latb, lonb in bins[["lat_bin","lon_bin"]].itertuples(index=True, name=None):
    bin_map.setdefault((latb, lonb), []).append(idx)

candidates = []
for (lb, ob), idxs in bin_map.items():
    for dlb, dob in neighbors:
        nb = (lb + dlb, ob + dob)
        if nb not in bin_map:
            continue
        jdxs = bin_map[nb]
        for i in idxs:
            for j in jdxs:
                if j <= i:
                    continue
                candidates.append((i, j))

log(f"Raw candidate pairs (grid-based): {len(candidates):,}")

if len(candidates) == 0:
    log("No candidates found with current geocode coverage.")
    candidate_pairs = pd.DataFrame(columns=[
        "school_id_a","school_id_b","name_a","name_b",
        "distance_m","name_sim","merge_reco"
    ])
else:
    pairs = pd.DataFrame(candidates, columns=["i", "j"])

    a = bins.loc[pairs["i"]].reset_index(drop=True)
    b = bins.loc[pairs["j"]].reset_index(drop=True)

    candidate_pairs = pd.DataFrame({
        "school_id_a": a["school_id"].values,
        "school_id_b": b["school_id"].values,
        "name_a": a["name"].values,
        "name_b": b["name"].values,
        "lat_a": a["latitude"].values,
        "lon_a": a["longitude"].values,
        "lat_b": b["latitude"].values,
        "lon_b": b["longitude"].values,
        "geo_query_a": a["geo_query"].values,
        "geo_query_b": b["geo_query"].values,
        "geo_query_type_a": a["geo_query_type"].values,
        "geo_query_type_b": b["geo_query_type"].values,
        "name_norm_a": a["name_norm"].values,
        "name_norm_b": b["name_norm"].values,
    })

    def haversine_m(lat1, lon1, lat2, lon2):
        R = 6371000.0
        phi1 = np.radians(lat1)
        phi2 = np.radians(lat2)
        dphi = np.radians(lat2 - lat1)
        dl = np.radians(lon2 - lon1)
        aa = np.sin(dphi/2.0)**2 + np.cos(phi1)*np.cos(phi2)*np.sin(dl/2.0)**2
        return 2 * R * np.arcsin(np.sqrt(aa))

    candidate_pairs["distance_m"] = haversine_m(
        candidate_pairs["lat_a"].values,
        candidate_pairs["lon_a"].values,
        candidate_pairs["lat_b"].values,
        candidate_pairs["lon_b"].values,
    )

    candidate_pairs = candidate_pairs[candidate_pairs["distance_m"] <= MAX_CANDIDATE_DISTANCE_M].copy()
    log(f"Candidate pairs within {MAX_CANDIDATE_DISTANCE_M}m: {len(candidate_pairs):,}")

    def jaccard_tokens(s1: str, s2: str) -> float:
        aa = set(str(s1).split())
        bb = set(str(s2).split())
        if not aa and not bb:
            return 0.0
        return len(aa & bb) / float(len(aa | bb))

    candidate_pairs["name_sim"] = [
        jaccard_tokens(x, y) for x, y in zip(candidate_pairs["name_norm_a"], candidate_pairs["name_norm_b"])
    ]

    # ----------------------------
    # 5) Guardrails (compute BEFORE final merge_reco)
    # ----------------------------
    LEVEL_PATTERNS = {
        "elementary": r"\b(elementary|elem|primary)\b",
        "middle":     r"\b(middle|ms|intermediate|junior high|jr high)\b",
        "high":       r"\b(high|hs|senior high)\b",
    }

    def level_tag(name: str) -> str:
        s = "" if pd.isna(name) else str(name).lower()
        hits = [k for k, pat in LEVEL_PATTERNS.items() if re.search(pat, s)]
        if len(hits) == 0:
            return "unknown"
        if len(hits) == 1:
            return hits[0]
        return "mixed"

    candidate_pairs["level_a"] = candidate_pairs["name_a"].map(level_tag)
    candidate_pairs["level_b"] = candidate_pairs["name_b"].map(level_tag)

    candidate_pairs["level_mismatch"] = (
        (candidate_pairs["level_a"] != "unknown") &
        (candidate_pairs["level_b"] != "unknown") &
        (candidate_pairs["level_a"] != candidate_pairs["level_b"])
    )

    def is_public_id(sid: str) -> bool:
        return str(sid).startswith("PUB_")

    candidate_pairs["both_public"] = [
        is_public_id(x) and is_public_id(y)
        for x, y in zip(candidate_pairs["school_id_a"], candidate_pairs["school_id_b"])
    ]

    # ----------------------------
    # 6) Safer merge recommendation (✅ FIX #2)
    # ----------------------------
    base_auto = (
        (candidate_pairs["distance_m"] <= MAX_MERGE_DISTANCE_M) &
        (candidate_pairs["name_sim"] >= NAME_SIM_THRESHOLD_AUTO)
    )

    base_review = (
        (candidate_pairs["distance_m"] <= MAX_MERGE_DISTANCE_M) &
        (candidate_pairs["name_sim"] >= NAME_SIM_THRESHOLD_REVIEW)
    )

    candidate_pairs["merge_reco"] = np.where(
        base_auto & ~candidate_pairs["level_mismatch"] & ~candidate_pairs["both_public"],
        "auto_merge",
        np.where(base_review, "review", "no_merge")
    )

    # If it *would* be auto but guardrails trip -> force review
    candidate_pairs.loc[
        base_auto & (candidate_pairs["level_mismatch"] | candidate_pairs["both_public"]),
        "merge_reco"
    ] = "review"

    # explicit ordering for display
    reco_rank = {"auto_merge": 0, "review": 1, "no_merge": 2}
    candidate_pairs["merge_reco_rank"] = candidate_pairs["merge_reco"].map(reco_rank).fillna(9).astype(int)

    show = candidate_pairs[candidate_pairs["merge_reco"] != "no_merge"].copy()
    log(f"Candidates flagged auto/review: {len(show):,}")

    display(
        show.sort_values(
            ["merge_reco_rank", "distance_m", "name_sim"],
            ascending=[True, True, False]
        ).drop(columns=["merge_reco_rank"]).head(50)
    )

log("05 completed")


[  514.63s] 05 started
[  514.69s] CA rows with lat/lng (pre-guard): 3,339
[  514.69s] CA rows OUTSIDE CA bbox (bad geocodes): 0
[  514.69s] CA rows kept after geo_query_type filter: 3,083 (dropped 256)
[  514.69s] CA geocoded rows used for candidate detection (post-guard): 3,083
[  514.71s] Raw candidate pairs (grid-based): 924
[  514.71s] Candidate pairs within 200m: 582
[  514.71s] Candidates flagged auto/review: 14


Unnamed: 0,school_id_a,school_id_b,name_a,name_b,lat_a,lon_a,lat_b,lon_b,geo_query_a,geo_query_b,...,geo_query_type_b,name_norm_a,name_norm_b,distance_m,name_sim,level_a,level_b,level_mismatch,both_public,merge_reco
14,PRI_00072574,PRI_BB200240,OUR LADY OF THE VISITACION SCHOOL,OUR LADY OF THE VISITACION SCHOOL,37.709157,-122.409741,37.709157,-122.409741,"785 SUNNYDALE AVE, SAN FRANCISCO, CA 94134","785 SUNNYDALE AVE, SAN FRANCISCO, CA 94134",...,address_city_state_zip,our lady visitacion,our lady visitacion,0.0,1.0,unknown,unknown,False,False,auto_merge
32,PRI_00073487,PRI_BB180191,OUR LADY OF ANGELS SCHOOL,OUR LADY OF ANGELS SCHOOL,37.584393,-122.37189,37.584393,-122.37189,"1328 CABRILLO AVE, BURLINGAME, CA 94010","1328 CABRILLO AVE, BURLINGAME, CA 94010",...,address_city_state_zip,our lady angels,our lady angels,0.0,1.0,unknown,unknown,False,False,auto_merge
67,PRI_00077845,PRI_A2192021,ALL SAINTS ACADEMY OF STOCKTON,ALL SAINTS ACADEMY OF STOCKTON,37.932205,-121.286783,37.932205,-121.286783,"144 W 5TH ST, STOCKTON, CA 95206","144 W 5TH ST, STOCKTON, CA 95206",...,address_city_state_zip,all saints stockton,all saints stockton,0.0,1.0,unknown,unknown,False,False,auto_merge
81,PRI_00082345,PRI_BB160344,WEST PORTAL LUTHERAN SCHOOL,WEST PORTAL LUTHERAN SCHOOL,37.735332,-122.473673,37.735332,-122.473673,"200 SLOAT BLVD, SAN FRANCISCO, CA 94132","200 SLOAT BLVD, SAN FRANCISCO, CA 94132",...,address_city_state_zip,west portal lutheran,west portal lutheran,0.0,1.0,unknown,unknown,False,False,auto_merge
108,PRI_00092477,PRI_A1900626,PACIFIC BAY CHRISTIAN SCHOOL,PACIFIC BAY CHRISTIAN SCHOOL,37.586567,-122.493172,37.586567,-122.493172,"1030 LINDA MAR BLVD, PACIFICA, CA 94044","1030 LINDA MAR BLVD, PACIFICA, CA 94044",...,address_city_state_zip,pacific bay christian,pacific bay christian,0.0,1.0,unknown,unknown,False,False,auto_merge
127,PRI_A0100963,PRI_K9300754,HACIENDA SCHOOL,HACIENDA SCHOOL,37.693566,-121.900436,37.693566,-121.900436,"4671 CHABOT DR, PLEASANTON, CA 94588","4671 CHABOT DR, PLEASANTON, CA 94588",...,address_city_state_zip,hacienda,hacienda,0.0,1.0,unknown,unknown,False,False,auto_merge
162,PRI_A0700289,PRI_A1500500,SIERRA VISTA KIRK BAUCHER SCHOOL,SIERRA VISTA KIRK BAUCHER SCHOOL,37.674293,-121.086134,37.674293,-121.086134,"2524 FINNEY RD, MODESTO, CA 95358","2524 FINNEY RD, MODESTO, CA 95358",...,address_city_state_zip,sierra vista kirk baucher,sierra vista kirk baucher,0.0,1.0,unknown,unknown,False,False,auto_merge
170,PRI_A0900219,PRI_BB160348,BAYHILL HIGH SCHOOL,BAYHILL HIGH SCHOOL,37.876041,-122.271603,37.876041,-122.271603,"1940 VIRGINIA ST, BERKELEY, CA 94709","1940 VIRGINIA ST, BERKELEY, CA 94709",...,address_city_state_zip,bayhill high,bayhill high,0.0,1.0,high,high,False,False,auto_merge
174,PRI_A0992014,PRI_BB180272,SUMMIT CHRISTIAN SCHOOL,SUMMIT CHRISTIAN SCHOOL,38.65678,-121.225664,38.65678,-121.225664,"5010 HAZEL AVE, FAIR OAKS, CA 95628","5010 HAZEL AVE, FAIR OAKS, CA 95628",...,address_city_state_zip,summit christian,summit christian,0.0,1.0,unknown,unknown,False,False,auto_merge
177,PRI_A1100082,PRI_A2100264,AS-SAFA INSTITUTE/AS-SAFA ACADEMY,AS-SAFA ACADEMY,37.289966,-121.909378,37.289966,-121.909378,"1631 PEREGRINO WAY, SAN JOSE, CA 95125","1631 PEREGRINO WAY, SAN JOSE, CA 95125",...,address_city_state_zip,as safa as safa,as safa,0.0,1.0,unknown,unknown,False,False,auto_merge


[  514.72s] 05 completed


In [468]:
# sanity: check for likely grade-level splits that should NOT auto-merge
grade_words = ["elementary", "middle", "high"]

sus = show[
    (show["merge_reco"] == "auto_merge") &
    (
        show["name_a"].str.contains("elementary|middle|high", case=False, na=False) |
        show["name_b"].str.contains("elementary|middle|high", case=False, na=False)
    )
].copy()

log(f"Auto-merges that include grade words (inspect): {len(sus):,}")
display(sus[["school_id_a","school_id_b","name_a","name_b","distance_m","name_sim","geo_query_a","geo_query_b"]].head(30))


[35833.54s] Auto-merges that include grade words (inspect): 1


Unnamed: 0,school_id_a,school_id_b,name_a,name_b,distance_m,name_sim,geo_query_a,geo_query_b
167,PRI_A0900219,PRI_BB160348,BAYHILL HIGH SCHOOL,BAYHILL HIGH SCHOOL,0.0,1.0,"1940 VIRGINIA ST, BERKELEY, CA 94709","1940 VIRGINIA ST, BERKELEY, CA 94709"


## 06. Entity Resolution & Schools Master v2 Export (Deduplicated + Geocoded)

### 06.0 Purpose & Outputs
- Goal: Convert `schools_master_v1` (multi-source, duplicate-prone) into `schools_master_v2` (geocoded + deduplicated)
- Outputs:
  - `data/processed/notebook05/merge_candidates_ca.csv`
  - `data/processed/notebook05/merge_decisions_ca.csv`
  - `data/processed/notebook05/school_id_merge_map_v1.csv` (old_id → canonical_id)
  - `data/processed/notebook05/schools_master_v2.parquet`
  - `data/processed/notebook05/schools_master_v2.csv` (optional)

---

### 06.1 Assemble Candidate Pairs (Input to Resolution)
- Input: `candidate_pairs` from Section 05 (CA-wide grid candidate generation)
- Normalize / enrich candidate rows with:
  - distance (`distance_m`)
  - name similarity (`name_sim`)
  - enrichment deltas (CAIS/IB/etc.)
  - public/private flags and grade-band signatures (guardrails)
- Persist: `merge_candidates_ca.csv`

---

### 06.2 Deterministic Merge Decision Rules
- Produce a decision per candidate pair:
  - `auto_merge` (safe)
  - `review` (manual)
  - `no_merge` (reject)
- Core rules (v1 defaults):
  - Auto-merge: very close + strong name similarity + not public grade-split
  - Review: close + medium similarity and/or enrichment delta
  - No-merge: everything else
- Persist: `merge_decisions_ca.csv`
- Persist audit log: `entity_merge_log.csv` (append-only or overwrite by run)

---

### 06.3 Build Merge Groups (Connected Components)
- Interpret `auto_merge` (and optionally approved `review`) as edges in a graph
- Compute connected components to form merge groups:
  - each group → one `canonical_school_id`
- Outputs:
  - `school_id_merge_map_v1.csv` (every merged school_id → canonical_school_id)
  - `merge_groups_summary.csv` (group size distribution + examples)

---

### 06.4 Canonical Record Construction (Field Resolution)
For each merge group:
- Choose a canonical “base row” deterministically (priority order):
  1. has_cais / has_any_enrichment
  2. has_latlng + higher geo_confidence
  3. richer address fields (has zip, full address)
  4. stable tie-breaker (hash of school_id)
- Merge fields:
  - **Union** boolean tags: `has_cais`, `has_ib`, `has_waldorf`, `has_ams_montessori`, `has_any_enrichment`
  - Preserve “best” name (prefer CAIS display name if present)
  - Preserve best location fields (prefer full address + zip if available)
  - Preserve provenance fields (optional): sources_present / source_ids
- Output: one canonical row per `canonical_school_id`

---

### 06.5 Apply Merge Map to Full Dataset
- Add `canonical_school_id` to every row in `schools_master_v1`
- Produce `schools_master_v2`:
  - Option A (recommended): one row per canonical entity (deduped)
  - Option B (traceability): keep raw rows + canonical id (long form)
- Validate invariants:
  - canonical ids are non-null
  - deduped v2 has fewer/equal rows than v1
  - all CAIS schools appear at least once in v2

---

### 06.6 Export `schools_master_v2` (Authoritative)
- Save to:
  - `data/processed/notebook05/schools_master_v2.parquet`
  - optional: `data/processed/notebook05/schools_master_v2.csv`
- Save metadata:
  - run timestamp
  - cache version / thresholds used
  - counts: v1 rows, v2 rows, merged groups, auto_merge edges, review edges

---

### 06.7 Spot Checks (High-Value Schools)
- Quick targeted checks (examples):
  - “The Harker School”
  - “Bayhill High School”
  - “Crystal Springs Uplands”
- Ensure:
  - only one canonical entity per school
  - enrichment flags are preserved after merge
  - address + lat/lng remain sensible

---



### 06.1 Assemble Candidate Pairs (Input to Resolution)

We take the raw spatial candidate pairs from Section 05 and enrich them with the extra signals needed for deterministic merge decisions:

- **Public/private flags** (merging public campus splits is dangerous)
- **Grade-band signatures** (avoid merging Elementary vs Middle vs High on same campus)
- **Enrichment delta** (CAIS/IB/etc. helps identify “ghost duplicates”)
- A normalized, consistent schema for downstream steps

Output:
- `merge_candidates_ca.csv` (authoritative input to 06.2)


In [642]:
# ============================
# 06.1 Assemble Candidate Pairs (enriched for merge decisions)
#   - df_geo = source of truth for metadata
#   - Patch: if candidate_pairs contains a school_id not in df_geo (e.g., synthesized CAIS row),
#            synthesize a minimal meta row from df_loc_geo and/or a paired PRI row
# ============================

log("06.1 started")

OUT_CANDIDATES = NB05_DIR / "merge_candidates_ca.csv"
OUT_CANDIDATES.parent.mkdir(parents=True, exist_ok=True)

# --- Preconditions ---
if "candidate_pairs" not in globals():
    raise ValueError("candidate_pairs not found. Run Section 05 first.")

if "df_geo" not in globals():
    raise ValueError("df_geo not found. Run Section 04/geo sections first so df_geo exists.")

print("candidate_pairs shape:", getattr(candidate_pairs, "shape", None))
print("df_geo shape:", getattr(df_geo, "shape", None))

if len(candidate_pairs) < 50:
    raise ValueError(
        f"candidate_pairs is too small ({len(candidate_pairs)} rows). "
        "This will almost certainly produce 0 auto_merges. Rebuild candidate_pairs in Section 05."
    )

if len(df_geo) < 10_000:
    raise ValueError(
        f"df_geo is too small ({len(df_geo)} rows). You likely have the wrong df_geo in memory."
    )

# ----------------------------
# 1) Pull metadata from df_geo (source-of-truth)
# ----------------------------
need_cols = [
    "school_id","name","address","city","state","zip","county",
    "is_public","is_private","is_charter",
    "serves_pk","serves_elementary","serves_middle","serves_high",
    "has_any_enrichment","has_cais","has_ib","has_ams_montessori","has_waldorf",
]
meta_cols = [c for c in need_cols if c in df_geo.columns]
if "school_id" not in meta_cols:
    raise ValueError("df_geo is missing school_id column.")

meta = df_geo[meta_cols].drop_duplicates("school_id").copy()
meta["school_id"] = meta["school_id"].astype(str).str.strip()

# ----------------------------
# 1b) META PATCH: ensure all candidate IDs exist in meta
#   If df_geo lacks a candidate id, synthesize minimal row from df_loc_geo (and paired PRI if needed)
# ----------------------------
cand_ids = pd.unique(
    pd.concat([
        candidate_pairs["school_id_a"].astype(str).str.strip(),
        candidate_pairs["school_id_b"].astype(str).str.strip()
    ], ignore_index=True)
)

missing_ids = sorted(set(cand_ids) - set(meta["school_id"].astype(str)))
if missing_ids:
    log(f"06.1 meta patch: {len(missing_ids)} candidate ids missing from df_geo/meta")
    # We only expect tiny count (like synthesized CAIS rows)
    display(pd.DataFrame({"missing_school_id": missing_ids}).head(50))

    # We need df_loc_geo to synthesize name/city/state/zip at minimum
    if "df_loc_geo" not in globals():
        raise ValueError(
            "df_loc_geo missing. Run Section 04.3 refresh (df_loc_geo) before 06.1, "
            "or inject the missing ids into df_geo."
        )

    # index for quick lookup
    loc = df_loc_geo.copy()
    loc["school_id"] = loc["school_id"].astype(str).str.strip()

    # Helper: build one meta row for a missing id
    def _synthesize_meta_row(miss_id: str) -> dict:
        miss_id = str(miss_id)

        # Try to get identity fields from df_loc_geo
        rloc = loc[loc["school_id"] == miss_id]
        base = {}

        if not rloc.empty:
            r = rloc.iloc[0]
            for c in ["school_id","name","address","city","state","zip","county",
                      "is_public","is_private","is_charter",
                      "serves_pk","serves_elementary","serves_middle","serves_high",
                      "has_any_enrichment","has_cais","has_ib","has_ams_montessori","has_waldorf"]:
                if c in meta_cols and c in rloc.columns:
                    base[c] = r.get(c)
        else:
            # If even df_loc_geo lacks it, we can only proceed if it's a known CAIS synth case.
            base["school_id"] = miss_id

        # Fill missing booleans safely
        for b in ["is_public","is_private","is_charter",
                  "serves_pk","serves_elementary","serves_middle","serves_high",
                  "has_any_enrichment","has_cais","has_ib","has_ams_montessori","has_waldorf"]:
            if b in meta_cols and b not in base:
                base[b] = False

        # Special-case: if this missing ID is a CAIS_ row and we can infer its paired PRI row from candidate_pairs
        # (the Harker situation) — copy basic fields from the paired row if still missing.
        if miss_id.startswith("CAIS_"):
            # find any edge where this ID appears, then take the other endpoint as potential donor
            edges = candidate_pairs[
                (candidate_pairs["school_id_a"].astype(str).str.strip() == miss_id) |
                (candidate_pairs["school_id_b"].astype(str).str.strip() == miss_id)
            ]
            donor_id = None
            if not edges.empty:
                e0 = edges.iloc[0]
                a_id = str(e0["school_id_a"]).strip()
                b_id = str(e0["school_id_b"]).strip()
                donor_id = b_id if a_id == miss_id else a_id

            if donor_id and donor_id in set(meta["school_id"].astype(str)):
                donor = meta[meta["school_id"].astype(str) == donor_id].iloc[0]
                for c in ["address","city","state","zip","county"]:
                    if c in meta_cols and (c not in base or pd.isna(base.get(c)) or str(base.get(c)).strip() == ""):
                        base[c] = donor.get(c)

                # Ensure CAIS flag on the synthesized CAIS row
                if "has_cais" in meta_cols:
                    base["has_cais"] = True

        # Final: ensure required keys exist
        base["school_id"] = miss_id
        return {c: base.get(c, pd.NA) for c in meta_cols}

    new_rows = [_synthesize_meta_row(mid) for mid in missing_ids]
    new_df = pd.DataFrame(new_rows)

    # 1) Ensure patch chunk has every meta column (so values land in right place)
    #    but do it BEFORE we decide what to drop.
    new_df = new_df.reindex(columns=meta.columns)

    # 2) Identify columns that are entirely NA in the patch chunk
    all_na_cols = [c for c in new_df.columns if new_df[c].isna().all()]

    # 3) Concatenate ONLY the non-all-NA columns from the patch chunk
    #    (this avoids the pandas dtype warning)
    cols_to_concat = [c for c in meta.columns if c not in all_na_cols]

    meta = pd.concat(
        [meta[cols_to_concat], new_df[cols_to_concat]],
        ignore_index=True
    )

    # 4) Restore full meta shape by reindexing back to meta.columns (adds missing cols as NA)
    #    This happens AFTER concat, so it won't trigger the warning.
    meta = meta.reindex(columns=meta.columns)


    meta["school_id"] = meta["school_id"].astype(str).str.strip()

    log(f"06.1 meta patch: meta rows now {len(meta):,}")

# ----------------------------
# 2) Validate candidate_pairs required columns
# ----------------------------
req_cp = ["school_id_a","school_id_b","distance_m","name_sim"]
missing_cp = [c for c in req_cp if c not in candidate_pairs.columns]
if missing_cp:
    raise ValueError(f"candidate_pairs missing required columns: {missing_cp}")

cand = candidate_pairs.copy()
cand["school_id_a"] = cand["school_id_a"].astype(str).str.strip()
cand["school_id_b"] = cand["school_id_b"].astype(str).str.strip()

# remove self-pairs
cand = cand[cand["school_id_a"] != cand["school_id_b"]].copy()

# drop any existing name columns to avoid duplicates
for c in ["name_a", "name_b"]:
    if c in cand.columns:
        cand = cand.drop(columns=[c])

# ----------------------------
# 3) Join meta for side A and B
# ----------------------------
a_meta = meta.add_prefix("a_")
b_meta = meta.add_prefix("b_")

cand = cand.merge(a_meta, left_on="school_id_a", right_on="a_school_id", how="left")
cand = cand.merge(b_meta, left_on="school_id_b", right_on="b_school_id", how="left")

missing_a = cand["a_school_id"].isna().sum()
missing_b = cand["b_school_id"].isna().sum()
log(f"06.1 join missing: a={missing_a:,} | b={missing_b:,} (out of {len(cand):,})")

# show any remaining missing (should be 0 now)
if missing_a > 0:
    display(cand[cand["a_school_id"].isna()][["school_id_a","school_id_b","distance_m","name_sim"]].head(30))
if missing_b > 0:
    display(cand[cand["b_school_id"].isna()][["school_id_a","school_id_b","distance_m","name_sim"]].head(30))

miss_pct = (missing_a + missing_b) / (2 * len(cand)) if len(cand) else 0.0
if miss_pct > 0.05:
    raise ValueError(
        f"Too many candidate rows missing metadata (~{miss_pct:.1%}). "
        "Likely df_geo school_id mismatch or candidate_pairs not aligned with df_geo."
    )

# ----------------------------
# 4) Stable columns
# ----------------------------
cand = cand.rename(columns={"a_name": "name_a", "b_name": "name_b"})

# ----------------------------
# 5) Guardrail features
# ----------------------------
def grade_sig(prefix: str, r: pd.Series) -> str:
    parts = []
    for g in ["pk","elementary","middle","high"]:
        col = f"{prefix}serves_{g}"
        if col in r.index and (r.get(col) == True):
            parts.append(g)
    return ",".join(parts) if parts else ""

cand["a_grade_sig"] = cand.apply(lambda r: grade_sig("a_", r), axis=1)
cand["b_grade_sig"] = cand.apply(lambda r: grade_sig("b_", r), axis=1)

cand["grade_split"] = (
    (cand["a_grade_sig"] != "") &
    (cand["b_grade_sig"] != "") &
    (cand["a_grade_sig"] != cand["b_grade_sig"])
)

cand["both_public"] = (cand.get("a_is_public") == True) & (cand.get("b_is_public") == True)

cand["enrichment_delta"] = (
    (cand.get("a_has_any_enrichment") == True) ^ (cand.get("b_has_any_enrichment") == True)
)

# ----------------------------
# 6) Output
# ----------------------------
keep = [
    "school_id_a","school_id_b",
    "name_a","name_b",
    "distance_m","name_sim",
    "grade_split","both_public","enrichment_delta",
    "a_grade_sig","b_grade_sig",
    "a_is_public","b_is_public","a_is_private","b_is_private","a_is_charter","b_is_charter",
    "a_has_any_enrichment","b_has_any_enrichment",
    "a_has_cais","b_has_cais","a_has_ib","b_has_ib",
    "a_address","a_city","a_state","a_zip",
    "b_address","b_city","b_state","b_zip",
]
keep = [c for c in keep if c in cand.columns]

merge_candidates_ca = cand[keep].copy()

dups = merge_candidates_ca.columns[merge_candidates_ca.columns.duplicated()].tolist()
if dups:
    raise ValueError(f"merge_candidates_ca has duplicate columns (unexpected): {dups}")

log(f"06.1 merge_candidates_ca rows: {len(merge_candidates_ca):,}")
if "name_sim" in merge_candidates_ca.columns:
    log(f"06.1 name_sim==0: {int((merge_candidates_ca['name_sim']==0).sum()):,} / {len(merge_candidates_ca):,}")

display(merge_candidates_ca.head(20))

merge_candidates_ca.to_csv(OUT_CANDIDATES, index=False)
log(f"Saved: {OUT_CANDIDATES}")

log("06.1 completed")


[ 1827.28s] 06.1 started
candidate_pairs shape: (582, 22)
df_geo shape: (126612, 68)
[ 1827.31s] 06.1 meta patch: 1 candidate ids missing from df_geo/meta


Unnamed: 0,missing_school_id
0,CAIS_baab550e497c


[ 1827.40s] 06.1 meta patch: meta rows now 126,613
[ 1827.44s] 06.1 join missing: a=0 | b=0 (out of 582)
[ 1827.45s] 06.1 merge_candidates_ca rows: 582
[ 1827.45s] 06.1 name_sim==0: 412 / 582


Unnamed: 0,school_id_a,school_id_b,name_a,name_b,distance_m,name_sim,grade_split,both_public,enrichment_delta,a_grade_sig,...,a_has_cais,b_has_cais,a_has_ib,b_has_ib,a_city,a_state,a_zip,b_city,b_state,b_zip
0,CAIS_0fb36d096d98,PRI_00073181,The Academy,ST RAPHAEL SCHOOL,185.580877,0.0,False,False,True,,...,True,False,False,False,Berkeley,CA,94705,SAN RAFAEL,CA,94901
1,CAIS_d6d2f01752a3,PUB_063509005934,Brandeis Marin,Venetia Valley K-8,27.874149,0.0,False,False,True,,...,True,False,False,False,San Rafael,CA,94903,San Rafael,CA,94903
2,PRI_00072541,PRI_A9504605,MISSION DOLORES ACADEMY,CHILDREN'S DAY SCHOOL,140.443246,0.0,False,False,True,,...,False,True,False,False,SAN FRANCISCO,CA,94114,SAN FRANCISCO,CA,94110
3,PRI_00072541,PUB_063441007842,MISSION DOLORES ACADEMY,Everett Middle,155.634855,0.0,False,False,False,,...,False,False,False,False,SAN FRANCISCO,CA,94114,San Francisco,CA,94114
4,PRI_00072574,PRI_BB200240,OUR LADY OF THE VISITACION SCHOOL,OUR LADY OF THE VISITACION SCHOOL,0.0,1.0,False,False,False,,...,False,False,False,False,SAN FRANCISCO,CA,94134,SAN FRANCISCO,CA,94134
5,PRI_00072665,PUB_063441005640,ST ANTHONY - IMMACULATE CONCEPTION SCHOOL,Flynn (Leonard R.) Elementary,184.902067,0.0,False,False,False,,...,False,False,False,False,SAN FRANCISCO,CA,94110,San Francisco,CA,94110
6,PRI_00072745,PUB_063441005672,ST FINN BARR CATHOLIC SCHOOL,Sunnyside Elementary,122.88903,0.0,False,False,False,,...,False,False,False,False,SAN FRANCISCO,CA,94112,San Francisco,CA,94112
7,PRI_00072869,PUB_069111109472,ST PETER'S SCHOOL,S.F. County Opportunity (Hilltop),104.744461,0.0,False,False,False,,...,False,False,False,False,SAN FRANCISCO,CA,94110,San Francisco,CA,94110
8,PRI_00072949,PRI_00093539,ST THOMAS MORE ELEMENTARY SCHOOL,THE BRANDEIS SCHOOL OF SAN FRANCISCO,194.063283,0.0,False,False,True,,...,False,True,False,False,SAN FRANCISCO,CA,94132,SAN FRANCISCO,CA,94132
9,PRI_00073115,PUB_063492007781,ST GREGORY SCHOOL,Beresford Elementary,185.605938,0.0,False,False,False,,...,False,False,False,False,SAN MATEO,CA,94403,San Mateo,CA,94403


[ 1827.45s] Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/merge_candidates_ca.csv
[ 1827.45s] 06.1 completed


In [644]:
merge_candidates_ca.sort_values(["distance_m","name_sim"], ascending=[True, False]).head(20)

Unnamed: 0,school_id_a,school_id_b,name_a,name_b,distance_m,name_sim,grade_split,both_public,enrichment_delta,a_grade_sig,...,a_has_cais,b_has_cais,a_has_ib,b_has_ib,a_city,a_state,a_zip,b_city,b_state,b_zip
4,PRI_00072574,PRI_BB200240,OUR LADY OF THE VISITACION SCHOOL,OUR LADY OF THE VISITACION SCHOOL,0.0,1.0,False,False,False,,...,False,False,False,False,SAN FRANCISCO,CA,94134,SAN FRANCISCO,CA,94134
12,PRI_00073487,PRI_BB180191,OUR LADY OF ANGELS SCHOOL,OUR LADY OF ANGELS SCHOOL,0.0,1.0,False,False,False,,...,False,False,False,False,BURLINGAME,CA,94010,BURLINGAME,CA,94010
28,PRI_00077845,PRI_A2192021,ALL SAINTS ACADEMY OF STOCKTON,ALL SAINTS ACADEMY OF STOCKTON,0.0,1.0,False,False,False,,...,False,False,False,False,STOCKTON,CA,95206,STOCKTON,CA,95206
34,PRI_00082345,PRI_BB160344,WEST PORTAL LUTHERAN SCHOOL,WEST PORTAL LUTHERAN SCHOOL,0.0,1.0,False,False,False,,...,False,False,False,False,SAN FRANCISCO,CA,94132,SAN FRANCISCO,CA,94132
43,PRI_00092477,PRI_A1900626,PACIFIC BAY CHRISTIAN SCHOOL,PACIFIC BAY CHRISTIAN SCHOOL,0.0,1.0,False,False,False,,...,False,False,False,False,PACIFICA,CA,94044,PACIFICA,CA,94044
51,PRI_A0100963,PRI_K9300754,HACIENDA SCHOOL,HACIENDA SCHOOL,0.0,1.0,False,False,False,,...,False,False,False,False,PLEASANTON,CA,94588,PLEASANTON,CA,94588
68,PRI_A0700289,PRI_A1500500,SIERRA VISTA KIRK BAUCHER SCHOOL,SIERRA VISTA KIRK BAUCHER SCHOOL,0.0,1.0,False,False,False,,...,False,False,False,False,MODESTO,CA,95358,MODESTO,CA,95358
73,PRI_A0900219,PRI_BB160348,BAYHILL HIGH SCHOOL,BAYHILL HIGH SCHOOL,0.0,1.0,False,False,False,,...,True,False,False,False,BERKELEY,CA,94709,BERKELEY,CA,94709
77,PRI_A0992014,PRI_BB180272,SUMMIT CHRISTIAN SCHOOL,SUMMIT CHRISTIAN SCHOOL,0.0,1.0,False,False,False,,...,False,False,False,False,FAIR OAKS,CA,95628,FAIR OAKS,CA,95628
78,PRI_A1100082,PRI_A2100264,AS-SAFA INSTITUTE/AS-SAFA ACADEMY,AS-SAFA ACADEMY,0.0,1.0,False,False,False,,...,False,False,False,False,SAN JOSE,CA,95125,SAN JOSE,CA,95125


### 06.2 Deterministic Merge Decisions (Auto / Review / No-Merge)

We apply deterministic rules to each candidate pair to decide whether two rows represent the same real-world school.

Goals:
- **Auto-merge only when it is very safe**
- **Route ambiguous cases to review**
- **Avoid incorrect merges**, especially for public school campus splits (elementary vs middle vs high, continuation, virtual programs)

Decision outputs:
- `auto_merge`: safe to merge automatically
- `review`: likely duplicates but needs human verification
- `no_merge`: reject

We also record a `reason` string for auditability and debugging.


In [650]:
# ============================
# 06.2 Deterministic merge decisions (auto / review / no_merge)  [HARDENED v2.1]
#   Goals:
#   - Never auto_merge public/public
#   - Hard-block public/public grade splits
#   - (NEW) Never auto_merge ANY grade_split (force review)
#   - Allow tight CAIS↔private override (Harker-style)
#   - Guard against "campus/annex/middle/elementary/high" false auto-merges
#   - Fix regex warning (non-capturing group)
# ============================

log("06.2 started")

OUT_DECISIONS = NB05_DIR / "merge_decisions_ca.csv"
OUT_LOG       = NB05_DIR / "entity_merge_log.csv"
OUT_REVIEW    = NB05_DIR / "merge_review_ca.csv"
OUT_DECISIONS.parent.mkdir(parents=True, exist_ok=True)

# --- load inputs if needed ---
if "merge_candidates_ca" not in globals():
    p = NB05_DIR / "merge_candidates_ca.csv"
    if not p.exists():
        raise FileNotFoundError(f"merge_candidates_ca not in memory and not found on disk: {p}")
    merge_candidates_ca = pd.read_csv(p)

df = merge_candidates_ca.copy()

# ----------------------------
# thresholds
# ----------------------------
AUTO_MAX_DIST_M      = 50
AUTO_MIN_NAME_SIM    = 0.85

REVIEW_MAX_DIST_M    = 75
REVIEW_MIN_NAME_SIM  = 0.65
REVIEW_MIN_NAME_SIM_ENRICH_DELTA = 0.60

# CAIS override: tight
CAIS_OVERRIDE_MAX_DIST_M = 50
CAIS_OVERRIDE_MIN_GUARD_SIM = 0.45
CAIS_OVERRIDE_MIN_GUARD_OVERLAP_TOKENS = 1
CAIS_OVERRIDE_MIN_RAW_NAME_SIM = 0.10

# ----------------------------
# required columns (fail fast)
# ----------------------------
required_cols = [
    "school_id_a","school_id_b","name_a","name_b","distance_m","name_sim",
    "both_public","grade_split","enrichment_delta",
    "a_has_cais","b_has_cais",
    "a_is_private","b_is_private",
    "a_is_public","b_is_public",
]
missing = [c for c in required_cols if c not in df.columns]
if missing:
    raise ValueError(f"06.2 missing required columns in merge_candidates_ca: {missing}")

# ----------------------------
# numeric coercions
# ----------------------------
df["distance_m"] = pd.to_numeric(df["distance_m"], errors="coerce")
df["name_sim"]   = pd.to_numeric(df["name_sim"], errors="coerce")

# ----------------------------
# robust boolean parsing
# ----------------------------
def to_bool_series(s: pd.Series) -> pd.Series:
    s2 = s.copy()
    if s2.dtype == object:
        s2 = (
            s2.astype(str).str.strip().str.lower()
            .map({
                "true": True, "1": True, "yes": True, "y": True, "t": True,
                "false": False, "0": False, "no": False, "n": False, "f": False,
                "nan": False, "none": False, "": False
            })
            .fillna(False)
        )
    else:
        s2 = s2.fillna(False).astype(bool)
    return s2

for c in [
    "both_public","grade_split","enrichment_delta",
    "a_has_cais","b_has_cais",
    "a_is_private","b_is_private",
    "a_is_public","b_is_public",
]:
    df[c] = to_bool_series(df[c])

print("has_cais true counts:", int(df["a_has_cais"].sum()), int(df["b_has_cais"].sum()))

# ----------------------------
# HARD RULES
# ----------------------------

# 1) Never auto-merge public/public (even if name+dist identical)
public_public = df["both_public"]

# 2) Strong hard block: public/public + grade split => no_merge
hard_block = public_public & df["grade_split"]

# ----------------------------
# Guard: "campus/annex/elementary/middle/high" tokens
# If names differ and these tokens are present, do not auto_merge (force review).
# ----------------------------
# (NEW) use non-capturing group to avoid pandas warning
SPLIT_TOKENS = r"\b(?:campus|annex|lower|upper|middle|elementary|high|primary|secondary)\b"

name_a_l = df["name_a"].fillna("").astype(str).str.lower()
name_b_l = df["name_b"].fillna("").astype(str).str.lower()

has_split_token = (
    name_a_l.str.contains(SPLIT_TOKENS, regex=True) |
    name_b_l.str.contains(SPLIT_TOKENS, regex=True)
)
names_exact = name_a_l.str.replace(r"\s+", " ", regex=True).str.strip().eq(
    name_b_l.str.replace(r"\s+", " ", regex=True).str.strip()
)

# If token exists AND names are not exactly equal => block auto (but still allow review)
block_auto_due_to_tokens = has_split_token & (~names_exact)

# ----------------------------
# CAIS override guard similarity (drop generic tokens)
# ----------------------------
GENERIC_TOKENS = {
    "school","academy","elementary","middle","high","charter","campus",
    "the","of","and","for",
    "international",
    "montessori","waldorf",
    "prep","preparatory","college","collegiate",
}

def _norm_for_guard(s: str) -> str:
    s = "" if pd.isna(s) else str(s)
    s = s.lower()
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    toks = [t for t in s.split() if t and t not in GENERIC_TOKENS]
    return " ".join(toks)

def _jaccard_guard(a: str, b: str) -> float:
    sa = set(_norm_for_guard(a).split())
    sb = set(_norm_for_guard(b).split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / float(len(sa | sb))

def _overlap_size(a: str, b: str) -> int:
    sa = set(_norm_for_guard(a).split())
    sb = set(_norm_for_guard(b).split())
    return len(sa & sb)

df["name_sim_guard"] = [_jaccard_guard(x, y) for x, y in zip(df["name_a"], df["name_b"])]
df["name_guard_overlap_n"] = [_overlap_size(x, y) for x, y in zip(df["name_a"], df["name_b"])]

cais_private_pair = (
    (df["a_has_cais"] & df["b_is_private"] & (~df["b_is_public"])) |
    (df["b_has_cais"] & df["a_is_private"] & (~df["a_is_public"]))
)

df["cais_address_override"] = (
    (df["distance_m"] <= CAIS_OVERRIDE_MAX_DIST_M) &
    cais_private_pair &
    (~public_public) &
    (df["name_sim_guard"] >= CAIS_OVERRIDE_MIN_GUARD_SIM) &
    (df["name_guard_overlap_n"] >= CAIS_OVERRIDE_MIN_GUARD_OVERLAP_TOKENS) &
    (df["name_sim"] >= CAIS_OVERRIDE_MIN_RAW_NAME_SIM) &
    (~hard_block) &
    # (NEW) never override campus/level splits
    (~block_auto_due_to_tokens)
)

df["used_cais_override"] = df["cais_address_override"].astype(bool)

# ----------------------------
# decision logic (vectorized)
# Order matters.
# ----------------------------

# Standard auto (private/mixed only), plus token guard
auto_std = (
    (df["distance_m"] <= AUTO_MAX_DIST_M) &
    (df["name_sim"] >= AUTO_MIN_NAME_SIM) &
    (~public_public) &
    (~block_auto_due_to_tokens) &
    (~hard_block)
)

# Review bucket (includes public/public candidates)
review_mask = (
    (~hard_block) &
    (df["distance_m"] <= REVIEW_MAX_DIST_M) &
    (
        (df["name_sim"] >= REVIEW_MIN_NAME_SIM) |
        (df["enrichment_delta"] & (df["name_sim"] >= REVIEW_MIN_NAME_SIM_ENRICH_DELTA))
    )
)

df["decision"] = np.select(
    [
        hard_block,
        df["cais_address_override"],   # rare but tight
        auto_std,
        review_mask,
    ],
    [
        "no_merge",
        "auto_merge",
        "auto_merge",
        "review",
    ],
    default="no_merge"
)

# ----------------------------
# SAFETY DOWNGRADES
# ----------------------------

# Safety A: if public/public and decision==auto_merge, downgrade to review (shouldn't happen anyway)
df.loc[(public_public) & (df["decision"] == "auto_merge"), "decision"] = "review"

# Safety B: if token guard blocks auto, but would have been auto, force review
would_have_auto = (
    (df["distance_m"] <= AUTO_MAX_DIST_M) &
    (df["name_sim"] >= AUTO_MIN_NAME_SIM) &
    (~public_public) &
    (block_auto_due_to_tokens) &
    (~hard_block)
)
df.loc[would_have_auto, "decision"] = "review"

# Safety C (NEW): if ANY grade_split, never auto_merge (force review), except hard_block already no_merge
grade_split_any = df["grade_split"] & (~hard_block)
df.loc[grade_split_any & (df["decision"] == "auto_merge"), "decision"] = "review"

# ----------------------------
# reasons (auditability)
# ----------------------------
df["reason"] = "default: insufficient evidence"

df.loc[hard_block, "reason"] = "no_merge: both_public + grade_split"

df.loc[df["cais_address_override"], "reason"] = (
    f"auto: CAIS↔private override dist<={CAIS_OVERRIDE_MAX_DIST_M} "
    f"& guard_sim>={CAIS_OVERRIDE_MIN_GUARD_SIM} & no split-token"
)

df.loc[auto_std & (df["decision"] == "auto_merge") & (~df["cais_address_override"]), "reason"] = (
    f"auto: dist<={AUTO_MAX_DIST_M} & name>={AUTO_MIN_NAME_SIM} & not both_public"
)

df.loc[(public_public) & (df["decision"] == "review"), "reason"] = (
    "review: both_public (never auto)"
)

df.loc[would_have_auto, "reason"] = (
    "review: split-token guard (campus/annex/level token present and names not exact)"
)

df.loc[grade_split_any & (df["decision"] == "review"), "reason"] = (
    "review: grade_split (never auto)"
)

df.loc[(df["decision"] == "review") & (~public_public) & (~would_have_auto) & (~grade_split_any), "reason"] = (
    f"review: dist<={REVIEW_MAX_DIST_M} & (name>={REVIEW_MIN_NAME_SIM} "
    f"or enrichment_delta+name>={REVIEW_MIN_NAME_SIM_ENRICH_DELTA})"
)

# ----------------------------
# quick diagnostics
# ----------------------------
print("Decision counts:")
display(df["decision"].value_counts().to_frame("count").reset_index().rename(columns={"index":"decision"}))

print("CAIS override auto_merges:", int(df["cais_address_override"].sum()))
if int(df["cais_address_override"].sum()) > 0:
    display(
        df[df["cais_address_override"]]
        .sort_values(["distance_m","name_sim_guard","name_sim"], ascending=[True, False, False])
        .head(30)
    )

# show top rows for inspection
display(
    df.sort_values(
        ["decision", "distance_m", "name_sim"],
        ascending=[True, True, False]
    ).head(50)
)

# ----------------------------
# persist
# ----------------------------
df.to_csv(OUT_DECISIONS, index=False)
print("Saved:", OUT_DECISIONS)

df.to_csv(OUT_LOG, index=False)
print("Saved:", OUT_LOG)

df[df["decision"] == "review"] \
  .sort_values(["distance_m","name_sim"], ascending=[True, False]) \
  .to_csv(OUT_REVIEW, index=False)
print("Saved:", OUT_REVIEW)

log("06.2 completed")


[ 2812.39s] 06.2 started
has_cais true counts: 19 6
Decision counts:


Unnamed: 0,decision,count
0,no_merge,556
1,auto_merge,13
2,review,13


CAIS override auto_merges: 2


Unnamed: 0,school_id_a,school_id_b,name_a,name_b,distance_m,name_sim,grade_split,both_public,enrichment_delta,a_grade_sig,...,a_zip,b_city,b_state,b_zip,name_sim_guard,name_guard_overlap_n,cais_address_override,used_cais_override,decision,reason
73,PRI_A0900219,PRI_BB160348,BAYHILL HIGH SCHOOL,BAYHILL HIGH SCHOOL,0.0,1.0,False,False,False,,...,94709,BERKELEY,CA,94709,1.0,1,True,True,auto_merge,auto: CAIS↔private override dist<=50 & guard_s...
117,PRI_A2190096,CAIS_baab550e497c,HARKER,The Harker School,0.0,1.0,False,False,True,,...,95129,SAN JOSE,CA,95129,1.0,1,True,True,auto_merge,auto: CAIS↔private override dist<=50 & guard_s...


Unnamed: 0,school_id_a,school_id_b,name_a,name_b,distance_m,name_sim,grade_split,both_public,enrichment_delta,a_grade_sig,...,a_zip,b_city,b_state,b_zip,name_sim_guard,name_guard_overlap_n,cais_address_override,used_cais_override,decision,reason
4,PRI_00072574,PRI_BB200240,OUR LADY OF THE VISITACION SCHOOL,OUR LADY OF THE VISITACION SCHOOL,0.0,1.0,False,False,False,,...,94134,SAN FRANCISCO,CA,94134,1.0,3,False,False,auto_merge,auto: dist<=50 & name>=0.85 & not both_public
12,PRI_00073487,PRI_BB180191,OUR LADY OF ANGELS SCHOOL,OUR LADY OF ANGELS SCHOOL,0.0,1.0,False,False,False,,...,94010,BURLINGAME,CA,94010,1.0,3,False,False,auto_merge,auto: dist<=50 & name>=0.85 & not both_public
28,PRI_00077845,PRI_A2192021,ALL SAINTS ACADEMY OF STOCKTON,ALL SAINTS ACADEMY OF STOCKTON,0.0,1.0,False,False,False,,...,95206,STOCKTON,CA,95206,1.0,3,False,False,auto_merge,auto: dist<=50 & name>=0.85 & not both_public
34,PRI_00082345,PRI_BB160344,WEST PORTAL LUTHERAN SCHOOL,WEST PORTAL LUTHERAN SCHOOL,0.0,1.0,False,False,False,,...,94132,SAN FRANCISCO,CA,94132,1.0,3,False,False,auto_merge,auto: dist<=50 & name>=0.85 & not both_public
43,PRI_00092477,PRI_A1900626,PACIFIC BAY CHRISTIAN SCHOOL,PACIFIC BAY CHRISTIAN SCHOOL,0.0,1.0,False,False,False,,...,94044,PACIFICA,CA,94044,1.0,3,False,False,auto_merge,auto: dist<=50 & name>=0.85 & not both_public
51,PRI_A0100963,PRI_K9300754,HACIENDA SCHOOL,HACIENDA SCHOOL,0.0,1.0,False,False,False,,...,94588,PLEASANTON,CA,94588,1.0,1,False,False,auto_merge,auto: dist<=50 & name>=0.85 & not both_public
68,PRI_A0700289,PRI_A1500500,SIERRA VISTA KIRK BAUCHER SCHOOL,SIERRA VISTA KIRK BAUCHER SCHOOL,0.0,1.0,False,False,False,,...,95358,MODESTO,CA,95358,1.0,4,False,False,auto_merge,auto: dist<=50 & name>=0.85 & not both_public
73,PRI_A0900219,PRI_BB160348,BAYHILL HIGH SCHOOL,BAYHILL HIGH SCHOOL,0.0,1.0,False,False,False,,...,94709,BERKELEY,CA,94709,1.0,1,True,True,auto_merge,auto: CAIS↔private override dist<=50 & guard_s...
77,PRI_A0992014,PRI_BB180272,SUMMIT CHRISTIAN SCHOOL,SUMMIT CHRISTIAN SCHOOL,0.0,1.0,False,False,False,,...,95628,FAIR OAKS,CA,95628,1.0,2,False,False,auto_merge,auto: dist<=50 & name>=0.85 & not both_public
78,PRI_A1100082,PRI_A2100264,AS-SAFA INSTITUTE/AS-SAFA ACADEMY,AS-SAFA ACADEMY,0.0,1.0,False,False,False,,...,95125,SAN JOSE,CA,95125,0.666667,2,False,False,auto_merge,auto: dist<=50 & name>=0.85 & not both_public


Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/merge_decisions_ca.csv
Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/entity_merge_log.csv
Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/merge_review_ca.csv
[ 2812.43s] 06.2 completed


In [652]:
# 1) confirm no auto_merge has grade_split=True
display(df[df["decision"]=="auto_merge"][["name_a","name_b","distance_m","name_sim","grade_split","both_public"]])

# 2) confirm no auto_merge has split tokens + non-exact names
SPLIT_TOKENS = r"\b(?:campus|annex|lower|upper|middle|elementary|high|primary|secondary)\b"
auto = df[df["decision"]=="auto_merge"].copy()

na = auto["name_a"].fillna("").str.lower().str.replace(r"\s+"," ",regex=True).str.strip()
nb = auto["name_b"].fillna("").str.lower().str.replace(r"\s+"," ",regex=True).str.strip()

auto["has_split_token"] = na.str.contains(SPLIT_TOKENS, regex=True) | nb.str.contains(SPLIT_TOKENS, regex=True)
auto["names_exact"] = na.eq(nb)

display(auto[auto["has_split_token"] & (~auto["names_exact"])][["name_a","name_b","distance_m","name_sim"]])


Unnamed: 0,name_a,name_b,distance_m,name_sim,grade_split,both_public
4,OUR LADY OF THE VISITACION SCHOOL,OUR LADY OF THE VISITACION SCHOOL,0.0,1.0,False,False
12,OUR LADY OF ANGELS SCHOOL,OUR LADY OF ANGELS SCHOOL,0.0,1.0,False,False
28,ALL SAINTS ACADEMY OF STOCKTON,ALL SAINTS ACADEMY OF STOCKTON,0.0,1.0,False,False
34,WEST PORTAL LUTHERAN SCHOOL,WEST PORTAL LUTHERAN SCHOOL,0.0,1.0,False,False
37,HAMLIN SCHOOL,THE HAMLIN SCHOOL,19.501157,1.0,False,False
43,PACIFIC BAY CHRISTIAN SCHOOL,PACIFIC BAY CHRISTIAN SCHOOL,0.0,1.0,False,False
51,HACIENDA SCHOOL,HACIENDA SCHOOL,0.0,1.0,False,False
68,SIERRA VISTA KIRK BAUCHER SCHOOL,SIERRA VISTA KIRK BAUCHER SCHOOL,0.0,1.0,False,False
73,BAYHILL HIGH SCHOOL,BAYHILL HIGH SCHOOL,0.0,1.0,False,False
77,SUMMIT CHRISTIAN SCHOOL,SUMMIT CHRISTIAN SCHOOL,0.0,1.0,False,False


Unnamed: 0,name_a,name_b,distance_m,name_sim


In [481]:
# STEP 1: Check if Yew Chung is being merged with German (should be EMPTY)
mask = (
    df["decision"].isin(["auto_merge", "review"]) &
    (
        df["name_a"].str.contains("yew chung", case=False, na=False) |
        df["name_b"].str.contains("yew chung", case=False, na=False)
    )
)

cols = [
    "school_id_a","name_a","school_id_b","name_b",
    "distance_m","name_sim","name_sim_guard",
    "used_cais_override","decision","reason"
]

out = df.loc[mask, cols].sort_values(["decision","distance_m"], ascending=[True, True])
print("Rows found:", len(out))
display(out)


Rows found: 0


Unnamed: 0,school_id_a,name_a,school_id_b,name_b,distance_m,name_sim,name_sim_guard,used_cais_override,decision,reason


### 06.3 Build Merge Groups (Connected Components)

We convert `auto_merge` decisions into an undirected graph:

- Nodes: `school_id`
- Edges: `(school_id_a, school_id_b)` where `decision == auto_merge`

We then compute connected components to form merge groups.  
Each merge group is assigned a deterministic `canonical_school_id`, and we generate:

- `school_id_merge_map_v1.csv`: every merged `school_id` → `canonical_school_id`
- `merge_groups_summary.csv`: group sizes + quick diagnostics

This step is deterministic and idempotent.

In [660]:
# ============================
# 06.3 Build merge groups (connected components) + school_id -> canonical_school_id map 
#   Fixes:
#   - FULL identity map covers ALL ids (df_geo universe + any ids in auto edges)
#   - Canonical is an EXISTING school_id from within the group (no CAN_ minting)
#   - (FIX) eliminate pandas FutureWarning by opting into future no-silent-downcasting behavior
# ============================


pd.set_option("future.no_silent_downcasting", True)

OUT_MAP    = NB05_DIR / "school_id_merge_map_v2.csv"
OUT_GROUPS = NB05_DIR / "merge_groups_summary_v2.csv"
OUT_MAP.parent.mkdir(parents=True, exist_ok=True)

log("06.3 started")

# Load decisions if needed
if "df" not in globals() or "decision" not in df.columns:
    df = pd.read_csv(NB05_DIR / "merge_decisions_ca.csv")

# Preconditions
if "df_geo" not in globals():
    raise ValueError("df_geo not found. 06.3 needs df_geo (or switch all_ids source to schools_master).")
if "school_id" not in df_geo.columns:
    raise ValueError("df_geo missing required column: school_id")

# --- auto edges ---
auto = df[df["decision"] == "auto_merge"][["school_id_a", "school_id_b"]].copy()
auto["school_id_a"] = auto["school_id_a"].astype(str)
auto["school_id_b"] = auto["school_id_b"].astype(str)

log(f"06.3 auto_merge edges: {len(auto):,}")

# --- DSU / Union-Find ---
class DSU:
    def __init__(self):
        self.parent = {}
        self.rank = {}

    def find(self, x):
        x = str(x)
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0
            return x
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, a, b):
        ra = self.find(a)
        rb = self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            self.parent[ra] = rb
        elif self.rank[ra] > self.rank[rb]:
            self.parent[rb] = ra
        else:
            self.parent[rb] = ra
            self.rank[ra] += 1

dsu = DSU()

# Build DSU from edges (if any)
for a, b in auto.itertuples(index=False):
    dsu.union(a, b)

# Edge nodes (may include ids not present in df_geo)
edge_nodes = set(auto["school_id_a"]).union(set(auto["school_id_b"]))

# --- all ids for full identity map (union df_geo universe + edge nodes) ---
all_ids = pd.Series(
    pd.concat([
        df_geo["school_id"].astype(str),
        pd.Series(sorted(edge_nodes), dtype=str),
    ]).unique(),
    name="old_school_id"
).to_frame()

# If there are no edges, produce identity map + empty groups summary
if len(auto) == 0:
    merge_map_all = all_ids.copy()
    merge_map_all["canonical_school_id"] = merge_map_all["old_school_id"]  # identity
    merge_map_all.to_csv(OUT_MAP, index=False)
    log(f"Saved: {OUT_MAP}")

    empty_groups = pd.DataFrame(columns=["root", "canonical_school_id", "group_size"])
    empty_groups.to_csv(OUT_GROUPS, index=False)
    log(f"Saved: {OUT_GROUPS}")

    log(f"06.3 total schools: {len(merge_map_all):,} | merged school_ids: 0 | groups: 0")
    log("06.3 completed (no auto_merge edges)")
    display(pd.DataFrame({"note": ["No auto_merge edges; produced identity mapping."]}))
else:
    # Nodes involved in merges
    nodes_in_edges = sorted(edge_nodes)
    comp = pd.DataFrame({"old_school_id": nodes_in_edges})
    comp["root"] = comp["old_school_id"].map(dsu.find)

    # Groups (only merged nodes)
    groups = comp.groupby("root")["old_school_id"].apply(list).reset_index()
    groups["group_size"] = groups["old_school_id"].apply(len)

    # ----------------------------
    # Winner selection (deterministic, uses existing ids)
    # Preference rule:
    #   1) has_cais True
    #   2) is_private True
    #   3) is_public False (prefer non-public)
    #   4) more complete address fields (address/city/state/zip non-null count)
    #   5) lexicographic school_id (tie-break)
    # ----------------------------
    geo_cols = [
        "school_id",
        "has_cais",
        "is_private",
        "is_public",
        "address",
        "city",
        "state",
        "zip",
        "name",
        "display_name",
    ]
    geo_present = [c for c in geo_cols if c in df_geo.columns]
    geo = df_geo[geo_present].copy()
    geo["school_id"] = geo["school_id"].astype(str)

    def _boolish(series: pd.Series) -> pd.Series:
        s = series.copy()
        if s.dtype == object:
            s = (
                s.astype(str).str.strip().str.lower()
                .map({
                    "true": True, "1": True, "yes": True, "y": True, "t": True,
                    "false": False, "0": False, "no": False, "n": False, "f": False,
                    "nan": False, "none": False, "": False
                })
                .fillna(False)
            )
        else:
            s = s.fillna(False).astype(bool)
        return s.astype(bool)

    for c in ["has_cais", "is_private", "is_public"]:
        if c not in geo.columns:
            geo[c] = False
        geo[c] = _boolish(geo[c])

    for c in ["address", "city", "state", "zip"]:
        if c not in geo.columns:
            geo[c] = np.nan

    geo["addr_completeness"] = (
        geo["address"].notna().astype(int) +
        geo["city"].notna().astype(int) +
        geo["state"].notna().astype(int) +
        geo["zip"].notna().astype(int)
    )

    def choose_winner(group_ids):
        ids = pd.Series(list(map(str, group_ids)), name="school_id")
        g = ids.to_frame().merge(geo, on="school_id", how="left")

        # With future.no_silent_downcasting=True, these won't warn
        g["has_cais"] = g["has_cais"].fillna(False).astype(bool)
        g["is_private"] = g["is_private"].fillna(False).astype(bool)
        g["is_public"] = g["is_public"].fillna(False).astype(bool)
        g["addr_completeness"] = g["addr_completeness"].fillna(0).astype(int)

        g = g.sort_values(
            by=["has_cais", "is_private", "is_public", "addr_completeness", "school_id"],
            ascending=[False, False, True, False, True],
            kind="mergesort",
        )
        return str(g.iloc[0]["school_id"])

    groups["canonical_school_id"] = groups["old_school_id"].apply(choose_winner)

    # Explode mapping for merged nodes
    merge_map_merged = (
        groups[["canonical_school_id", "old_school_id"]]
          .explode("old_school_id")
          .reset_index(drop=True)
    )

    # FULL identity map for ALL schools
    merge_map_all = all_ids.copy()
    merge_map_all["canonical_school_id"] = merge_map_all["old_school_id"]  # identity
    merge_map_all = merge_map_all.merge(
        merge_map_merged,
        on="old_school_id",
        how="left",
        suffixes=("", "_merged")
    )
    merge_map_all["canonical_school_id"] = merge_map_all["canonical_school_id_merged"].fillna(
        merge_map_all["canonical_school_id"]
    )
    merge_map_all = merge_map_all.drop(columns=["canonical_school_id_merged"])

    # Save outputs
    merge_map_all.to_csv(OUT_MAP, index=False)
    log(f"Saved: {OUT_MAP}")

    groups_summary = groups[["root", "canonical_school_id", "group_size"]].sort_values(
        ["group_size", "canonical_school_id"], ascending=[False, True]
    )
    groups_summary.to_csv(OUT_GROUPS, index=False)
    log(f"Saved: {OUT_GROUPS}")

    # Diagnostics
    log("06.3 Merge group size distribution (merged groups only):")
    display(groups_summary["group_size"].value_counts().to_frame("count").reset_index().rename(columns={"index":"group_size"}))

    log("06.3 Top merge groups (merged groups only):")
    display(groups.sort_values("group_size", ascending=False).head(20))

    merged_count = int((merge_map_all["canonical_school_id"] != merge_map_all["old_school_id"]).sum())
    log(f"06.3 total schools: {len(merge_map_all):,} | merged school_ids: {merged_count:,} | groups: {len(groups):,}")

    # Sanity: canonical must be in the group
    bad = groups[~groups.apply(lambda r: r["canonical_school_id"] in set(map(str, r["old_school_id"])), axis=1)]
    if len(bad) > 0:
        display(bad.head(20))
        raise ValueError(f"06.3 sanity failed: some canonical_school_id not present in its group (n={len(bad)})")

    if merge_map_all["canonical_school_id"].isna().any():
        raise ValueError("06.3 sanity failed: null canonical_school_id in full mapping")

    log("06.3 completed")


[ 3402.53s] 06.3 started
[ 3402.53s] 06.3 auto_merge edges: 13
[ 3402.85s] Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/school_id_merge_map_v2.csv
[ 3402.85s] Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/merge_groups_summary_v2.csv
[ 3402.85s] 06.3 Merge group size distribution (merged groups only):


Unnamed: 0,group_size,count
0,2,13


[ 3402.85s] 06.3 Top merge groups (merged groups only):


Unnamed: 0,root,old_school_id,group_size,canonical_school_id
0,PRI_00072574,"[PRI_00072574, PRI_BB200240]",2,PRI_00072574
1,PRI_00073487,"[PRI_00073487, PRI_BB180191]",2,PRI_00073487
2,PRI_00077845,"[PRI_00077845, PRI_A2192021]",2,PRI_00077845
3,PRI_00082345,"[PRI_00082345, PRI_BB160344]",2,PRI_00082345
4,PRI_00083553,"[PRI_00083553, PRI_BB160346]",2,PRI_00083553
5,PRI_00092477,"[PRI_00092477, PRI_A1900626]",2,PRI_00092477
6,PRI_A0100963,"[PRI_A0100963, PRI_K9300754]",2,PRI_A0100963
7,PRI_A0700289,"[PRI_A0700289, PRI_A1500500]",2,PRI_A0700289
8,PRI_A0900219,"[PRI_A0900219, PRI_BB160348]",2,PRI_A0900219
9,PRI_A0992014,"[PRI_A0992014, PRI_BB180272]",2,PRI_A0992014


[ 3402.86s] 06.3 total schools: 126,613 | merged school_ids: 13 | groups: 13
[ 3402.86s] 06.3 completed


### 06.4 Apply Canonical IDs and Collapse Merged Schools

Using the merge groups from 06.3, we assign every school a
`canonical_school_id`.

Rules:
- If a school appears in `school_id_merge_map_v1`, use its canonical ID
- Otherwise, the school is already canonical

This step does NOT delete records.
It produces:
- `schools_master_canonical_v1`: one row per canonical school
- `school_merge_lineage_v1`: full audit trail of merged IDs

All downstream notebooks must use `canonical_school_id`.


In [690]:
# # ============================
# # 06.4 Apply canonical IDs + collapse merged schools (FAST)  [HARDENED v2.1]
# # Inputs:
# #  - IN_SCHOOLS_V2 (parquet from Notebook 04 / 05: current canonical-ish master BEFORE applying merge collapse)
# #  - NB05_DIR / school_id_merge_map_v2.csv (from 06.3)
# # Outputs:
# #  - NB05_DIR / schools_master_v3.parquet
# #  - NB05_DIR / schools_master_v3.csv
# #  - NB05_DIR / school_merge_lineage_v2.csv
# # ============================

# from pathlib import Path
# import pandas as pd
# import numpy as np

# log("06.4 started")

# # --- Paths ---
# MERGE_MAP_PATH     = NB05_DIR / "school_id_merge_map_v2.csv"

# # NOTE: Rename your input variable to match reality.
# # If your Notebook 04 output is called IN_SCHOOLS_V1, keep it.
# assert "IN_SCHOOLS_V1" in globals(), "IN_SCHOOLS_V1 not defined (set it to your current schools_master parquet path)."
# IN_SCHOOLS_V2 = IN_SCHOOLS_V1  # alias for clarity

# OUT_SCHOOLS_V3      = NB05_DIR / "schools_master_v3.parquet"
# OUT_SCHOOLS_V3_CSV  = NB05_DIR / "schools_master_v3.csv"
# OUT_LINEAGE_V2      = NB05_DIR / "school_merge_lineage_v2.csv"

# # --- Preconditions ---
# assert Path(IN_SCHOOLS_V2).exists(), f"Input schools parquet not found: {IN_SCHOOLS_V2}"
# assert MERGE_MAP_PATH.exists(), f"merge map not found: {MERGE_MAP_PATH}"

# # ----------------------------
# # 1) Load v2 + merge map
# # ----------------------------
# log(f"Loading schools: {IN_SCHOOLS_V2}")
# df_in = pd.read_parquet(IN_SCHOOLS_V2)
# log(f"df_in: {df_in.shape}")

# log("=== 06.4 grade-signal check (pre-collapse) ===")
# grade_cols = [c for c in df_in.columns if "grade" in c.lower() or c.startswith("serves_")]
# log(f"Grade-related cols present: {grade_cols}")

# for c in ["serves_pk","serves_elementary","serves_middle","serves_high"]:
#     if c in df_in.columns:
#         log(f"{c} true_count = {int(df_in[c].fillna(False).astype(bool).sum()):,}")

# if "grade_band_sig" in df_in.columns:
#     vc = df_in["grade_band_sig"].astype("string").fillna("").value_counts().head(10)
#     log("Top grade_band_sig:")
#     print(vc)



# if "school_id" not in df_in.columns:
#     raise ValueError("Input schools df missing required column: school_id")

# log(f"Loading merge map: {MERGE_MAP_PATH}")
# merge_map = pd.read_csv(MERGE_MAP_PATH, dtype={"old_school_id": "string", "canonical_school_id": "string"})
# merge_map["old_school_id"] = merge_map["old_school_id"].astype(str)
# merge_map["canonical_school_id"] = merge_map["canonical_school_id"].astype(str)

# id_to_canon = dict(zip(merge_map["old_school_id"], merge_map["canonical_school_id"]))

# df = df_in.copy()
# df["school_id"] = df["school_id"].astype(str)
# df["canonical_school_id"] = df["school_id"].map(id_to_canon).fillna(df["school_id"]).astype(str)

# # Helpful: flag which rows are actually being remapped
# df["_is_remapped"] = (df["canonical_school_id"] != df["school_id"])

# # ----------------------------
# # 2) Vectorized scoring signals for representative row selection
# # ----------------------------
# sid = df["school_id"]

# # Source rank: CAIS (best) < PRI < PUB < other
# source_rank = np.select(
#     [sid.str.startswith("CAIS_"), sid.str.startswith("PRI_"), sid.str.startswith("PUB_")],
#     [0, 1, 2],
#     default=9
# ).astype(np.int16)

# # Enrichment boost
# if "has_any_enrichment" in df.columns:
#     enriched_boost = df["has_any_enrichment"].fillna(False).astype(bool).astype(np.int8)
# else:
#     enriched_boost = np.zeros(len(df), dtype=np.int8)

# # Geo confidence rank
# if "geo_confidence" in df.columns:
#     geo_conf = df["geo_confidence"].fillna("").astype(str).str.lower()
# else:
#     geo_conf = pd.Series([""] * len(df), index=df.index)

# geo_rank = geo_conf.map({"high": 3, "medium": 2, "low": 1, "failed": 0}).fillna(0).astype(np.int8)

# lat = pd.to_numeric(df["latitude"], errors="coerce") if "latitude" in df.columns else pd.Series([np.nan] * len(df), index=df.index)
# lon = pd.to_numeric(df["longitude"], errors="coerce") if "longitude" in df.columns else pd.Series([np.nan] * len(df), index=df.index)
# has_latlng = (lat.notna() & lon.notna()).astype(np.int8)

# # ----------------------------
# # 3) Address quality score (structured address wins)
# # ----------------------------
# addr = (df["address"].fillna("").astype(str) if "address" in df.columns else pd.Series([""] * len(df), index=df.index))
# addr_norm = addr.str.lower()

# has_street_num = addr.str.match(r"^\s*\d+").fillna(False).astype(np.int8)
# is_po_box = addr_norm.str.contains(r"\bpo box\b|p\.o\.|^\s*box\s", regex=True).fillna(False).astype(np.int8)

# zip_col = (df["zip"].fillna("").astype(str) if "zip" in df.columns else pd.Series([""] * len(df), index=df.index))
# zip5 = zip_col.str.extract(r"(\d{5})", expand=False).fillna("")
# has_zip5 = (zip5.str.len() == 5).astype(np.int8)

# has_newline = addr.str.contains(r"[\r\n]").fillna(False).astype(np.int8)
# has_view_on = addr_norm.str.contains("view on").fillna(False).astype(np.int8)

# addr_len = addr.str.len().clip(0, 200).astype(np.int16)

# addr_quality = (
#     50 * has_street_num +
#     30 * has_zip5 -
#     60 * is_po_box -
#     20 * has_newline -
#     40 * has_view_on +
#     1 * addr_len
# ).astype(np.int32)

# name_len = (df["name"].fillna("").astype(str).str.len().astype(np.int16) if "name" in df.columns else np.zeros(len(df), dtype=np.int16))

# # ----------------------------
# # 4) Representative row score (deterministic)
# # Priority: enrichment > latlng > geo_conf > address quality > source > name length
# # ----------------------------
# rep_score = (
#     1_000_000 * enriched_boost
#     + 200_000 * has_latlng
#     + 50_000 * geo_rank
#     + 5_000 * addr_quality
#     - 100 * source_rank
#     + 1 * name_len
# ).astype(np.int64)

# df["_rep_score"] = rep_score
# df["_sid_tiebreak"] = sid  # stable tie-break

# best_idx = (
#     df.sort_values(["canonical_school_id", "_rep_score", "_sid_tiebreak"],
#                    ascending=[True, False, True])
#       .groupby("canonical_school_id", sort=False)
#       .head(1)
#       .index
# )

# base = df.loc[best_idx].copy()

# # Keep original rep row id for auditability
# base["school_id_original"] = base["school_id"].astype(str)

# # IMPORTANT: force primary key to canonical
# base["school_id"] = base["canonical_school_id"].astype(str)

# # ----------------------------
# # 5) Boolean OR rollups (FAST)
# # ----------------------------
# BOOL_OR_COLS = [
#     "is_public", "is_private", "is_charter",
#     "serves_pk", "serves_elementary", "serves_middle", "serves_high",
#     "has_cais", "has_ib", "has_ams_montessori", "has_waldorf", "has_any_enrichment",
# ]
# BOOL_OR_COLS = [c for c in BOOL_OR_COLS if c in df.columns]

# if BOOL_OR_COLS:
#     bool_mat = df[["canonical_school_id"] + BOOL_OR_COLS].copy()
#     for c in BOOL_OR_COLS:
#         bool_mat[c] = bool_mat[c].fillna(False).astype(bool).astype(np.int8)

#     agg_bool = bool_mat.groupby("canonical_school_id", as_index=False)[BOOL_OR_COLS].max()
#     for c in BOOL_OR_COLS:
#         agg_bool[c] = agg_bool[c].astype(bool)

#     # drop possibly stale bools from base then merge in rolled-up bools
#     base = base.drop(columns=[c for c in BOOL_OR_COLS if c in base.columns], errors="ignore")
#     base = base.merge(agg_bool, on="canonical_school_id", how="left")

# # ----------------------------
# # 6) Best geocode per group
# # prefer: has_latlng > geo_rank > addr_quality > source_rank
# # ----------------------------
# df["_has_latlng"] = has_latlng
# df["_geo_rank"] = geo_rank
# df["_addr_quality"] = addr_quality
# df["_source_rank"] = source_rank

# geo_cols = [c for c in ["latitude", "longitude", "geo_confidence", "geo_provider", "geo_status", "geo_query", "geo_query_type"] if c in df.columns]
# if geo_cols:
#     best_geo = (
#         df.sort_values(
#             ["canonical_school_id", "_has_latlng", "_geo_rank", "_addr_quality", "_source_rank", "school_id"],
#             ascending=[True, False, False, False, True, True]
#         )
#         .groupby("canonical_school_id", sort=False)
#         .head(1)[["canonical_school_id"] + geo_cols]
#         .copy()
#     )
#     base = base.drop(columns=[c for c in geo_cols if c in base.columns], errors="ignore")
#     base = base.merge(best_geo, on="canonical_school_id", how="left")

# # ----------------------------
# # 7) Prefer CAIS display name if present (human-friendly)
# # ----------------------------
# if "name" in df.columns:
#     cais = df.loc[df["school_id"].str.startswith("CAIS_"), ["canonical_school_id", "name"]].dropna()
#     if len(cais) > 0:
#         cais["_name_len"] = cais["name"].astype(str).str.len()
#         cais_best = (
#             cais.sort_values(["canonical_school_id", "_name_len", "name"], ascending=[True, False, True])
#                 .drop_duplicates("canonical_school_id")[["canonical_school_id", "name"]]
#                 .rename(columns={"name": "_cais_name"})
#         )
#         base = base.merge(cais_best, on="canonical_school_id", how="left")
#         base["name"] = base["_cais_name"].fillna(base.get("name"))
#         base = base.drop(columns=["_cais_name"], errors="ignore")

# # ----------------------------
# # 8) Lineage export (store merged ids as pipe-delimited string + merged_count)
# # ----------------------------
# lineage = (
#     df.groupby("canonical_school_id")["school_id"]
#       .apply(lambda x: "|".join(sorted(set(x.astype(str)))))
#       .reset_index()
#       .rename(columns={"school_id": "merged_school_ids"})
# )
# lineage["merged_count"] = lineage["merged_school_ids"].apply(lambda s: 0 if not s else (s.count("|") + 1))
# lineage = lineage.rename(columns={"canonical_school_id": "canonical_school_id"})

# # ----------------------------
# # 9) Final cleanup + write
# # ----------------------------
# helper_cols = [c for c in base.columns if c.startswith("_")]
# base = base.drop(columns=helper_cols, errors="ignore")

# schools_master_v3 = base.reset_index(drop=True)

# OUT_SCHOOLS_V3.parent.mkdir(parents=True, exist_ok=True)
# schools_master_v3.to_parquet(OUT_SCHOOLS_V3, index=False)
# schools_master_v3.to_csv(OUT_SCHOOLS_V3_CSV, index=False)
# lineage.to_csv(OUT_LINEAGE_V2, index=False)

# log(f"✅ Saved: {OUT_SCHOOLS_V3}")
# log(f"✅ Saved: {OUT_SCHOOLS_V3_CSV}")
# log(f"✅ Saved: {OUT_LINEAGE_V2}")

# log(f"Canonical rows: {len(schools_master_v3):,}")
# log(f"Lineage rows:   {len(lineage):,}")
# log(f"Merged groups (count>1): {int((lineage['merged_count'] > 1).sum()):,}")

# # ----------------------------
# # 10) Correct sanity checks
# # ----------------------------
# # How many IDs were actually remapped in the map?
# merge_map["_is_remap"] = (merge_map["old_school_id"] != merge_map["canonical_school_id"])
# expected_remapped_ids = int(merge_map["_is_remap"].sum())

# # How many original rows in df got remapped?
# got_remapped_rows = int(df["_is_remapped"].sum())

# log(f"Map remapped ids (old!=canon): {expected_remapped_ids:,}")
# log(f"Input rows remapped (school_id!=canon): {got_remapped_rows:,}")

# # After collapse, every output school_id should equal canonical_school_id
# bad_pk = int((schools_master_v3["school_id"].astype(str) != schools_master_v3["canonical_school_id"].astype(str)).sum())
# if bad_pk != 0:
#     raise ValueError(f"06.4 sanity failed: output school_id != canonical_school_id for {bad_pk} rows")

# log("06.4 completed")


[130886.59s] 06.4 started
[130886.59s] Loading schools: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook04/schools_master_v1.parquet
[130886.64s] df_in: (126616, 18)
[130886.64s] === 06.4 grade-signal check (pre-collapse) ===
[130886.64s] Grade-related cols present: ['serves_pk', 'serves_elementary', 'serves_middle', 'serves_high']
[130886.64s] serves_pk true_count = 0
[130886.64s] serves_elementary true_count = 0
[130886.64s] serves_middle true_count = 0
[130886.64s] serves_high true_count = 0
[130886.64s] Loading merge map: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/school_id_merge_map_v2.csv
[130890.01s] ✅ Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/schools_master_v3.parquet
[130890.01s] ✅ Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/s

In [692]:
# ============================
# 06.4 Apply canonical IDs + collapse merged schools (FAST)  [HARDENED v2.2 + grade rehydrate + guard]
# Inputs:
#  - IN_SCHOOLS_V2 (parquet from Notebook 04/05: canonical-ish master BEFORE merge collapse)
#  - NB05_DIR / school_id_merge_map_v2.csv (from 06.3)
# Outputs:
#  - NB05_DIR / schools_master_v3.parquet
#  - NB05_DIR / schools_master_v3.csv
#  - NB05_DIR / school_merge_lineage_v2.csv
# ============================

from pathlib import Path
import pandas as pd
import numpy as np
import re

log("06.4 started")

# --- Paths ---
MERGE_MAP_PATH = NB05_DIR / "school_id_merge_map_v2.csv"

assert "IN_SCHOOLS_V1" in globals(), "IN_SCHOOLS_V1 not defined (set it to your current schools_master parquet path)."
IN_SCHOOLS_V2 = IN_SCHOOLS_V1  # alias for clarity

OUT_SCHOOLS_V3     = NB05_DIR / "schools_master_v3.parquet"
OUT_SCHOOLS_V3_CSV = NB05_DIR / "schools_master_v3.csv"
OUT_LINEAGE_V2     = NB05_DIR / "school_merge_lineage_v2.csv"

# --- Preconditions ---
assert Path(IN_SCHOOLS_V2).exists(), f"Input schools parquet not found: {IN_SCHOOLS_V2}"
assert MERGE_MAP_PATH.exists(), f"merge map not found: {MERGE_MAP_PATH}"

# ----------------------------
# 1) Load v2 + merge map
# ----------------------------
log(f"Loading schools: {IN_SCHOOLS_V2}")
df_in = pd.read_parquet(IN_SCHOOLS_V2)
log(f"df_in: {df_in.shape}")

if "school_id" not in df_in.columns:
    raise ValueError("Input schools df missing required column: school_id")

log(f"Loading merge map: {MERGE_MAP_PATH}")
merge_map = pd.read_csv(
    MERGE_MAP_PATH,
    dtype={"old_school_id": "string", "canonical_school_id": "string"}
)
merge_map["old_school_id"] = merge_map["old_school_id"].astype(str)
merge_map["canonical_school_id"] = merge_map["canonical_school_id"].astype(str)
id_to_canon = dict(zip(merge_map["old_school_id"], merge_map["canonical_school_id"]))

# Working df
df = df_in.copy()
df["school_id"] = df["school_id"].astype(str)
df["canonical_school_id"] = df["school_id"].map(id_to_canon).fillna(df["school_id"]).astype(str)

# Helpful: flag which rows are actually being remapped
df["_is_remapped"] = (df["canonical_school_id"] != df["school_id"])

# ----------------------------
# 1.1 Grade-signal check (pre-collapse)
# ----------------------------
log("=== 06.4 grade-signal check (pre-collapse) ===")
grade_cols = [c for c in df.columns if "grade" in c.lower() or c.startswith("serves_")]
log(f"Grade-related cols present: {grade_cols}")

for c in ["serves_pk", "serves_elementary", "serves_middle", "serves_high"]:
    if c in df.columns:
        log(f"{c} true_count (pre) = {int(df[c].fillna(False).astype(bool).sum()):,}")

if "grade_band_sig" in df.columns:
    vc = df["grade_band_sig"].astype("string").fillna("").value_counts().head(10)
    log("Top grade_band_sig:")
    print(vc)

# ----------------------------
# 1.2 Rehydrate serves_* flags if missing/empty (schema-adaptive)
# ----------------------------
serves_cols = ["serves_pk", "serves_elementary", "serves_middle", "serves_high"]

def _grade_to_num(g):
    if pd.isna(g):
        return None
    g = str(g).strip().lower()
    if g in {"pk", "pre-k", "prek"}:
        return -1
    if g == "k":
        return 0
    try:
        return int(g)
    except:
        return None

def _parse_grade_band_sig(sig: str) -> set:
    # returns token set like {"pk","elementary","middle","high"}
    if sig is None:
        return set()
    s = str(sig).strip().lower()
    if not s:
        return set()
    # normalize separators
    s = re.sub(r"[\|/;]+", ",", s)
    parts = [p.strip() for p in s.split(",") if p.strip()]
    tokens = set()
    for p in parts:
        if p in {"pk", "pre-k", "prek"}:
            tokens.add("pk")
        elif p in {"k", "kg", "kindergarten"}:
            tokens.add("pk")  # treat K as pk_k bucket; adjust later if you want distinct K
        elif "elementary" in p:
            tokens.add("elementary")
        elif "middle" in p:
            tokens.add("middle")
        elif "high" in p:
            tokens.add("high")
    return tokens

need_rehydrate = True
if all(c in df.columns for c in serves_cols):
    any_true = df[serves_cols].fillna(False).astype(bool).any(axis=1).any()
    need_rehydrate = not bool(any_true)

if need_rehydrate:
    log("Rehydrating serves_* flags (missing or all-false).")

    # Ensure columns exist + boolean dtype
    for c in serves_cols:
        if c not in df.columns:
            df[c] = False
        df[c] = df[c].astype("boolean").fillna(False).astype(bool)

    # Case A: grade_band_sig
    if "grade_band_sig" in df.columns:
        tok = df["grade_band_sig"].apply(_parse_grade_band_sig)

        df["serves_pk"] = tok.apply(lambda t: "pk" in t)
        df["serves_elementary"] = tok.apply(lambda t: "elementary" in t)
        df["serves_middle"] = tok.apply(lambda t: "middle" in t)
        df["serves_high"] = tok.apply(lambda t: "high" in t)

        log("✅ serves_* derived from grade_band_sig")

    # Case B: lowest/highest grade (try multiple common names)
    else:
        lo_col = next((c for c in ["lowest_grade", "grade_low", "lowest_grade_offered"] if c in df.columns), None)
        hi_col = next((c for c in ["highest_grade", "grade_high", "highest_grade_offered"] if c in df.columns), None)

        if lo_col and hi_col:
            lo = df[lo_col].apply(_grade_to_num)
            hi = df[hi_col].apply(_grade_to_num)

            df["serves_pk"] = (lo <= -1) & hi.notna()
            df["serves_elementary"] = (lo <= 5) & (hi >= 0)
            df["serves_middle"] = (lo <= 8) & (hi >= 6)
            df["serves_high"] = (lo <= 12) & (hi >= 9)

            log(f"✅ serves_* derived from {lo_col}/{hi_col}")
        else:
            log("⚠️ No grade source columns found (grade_band_sig or low/high grade). serves_* remain empty.")

    for c in serves_cols:
        log(f"{c} true_count (post rehydrate) = {int(df[c].sum()):,}")
else:
    log("serves_* already populated; no rehydrate needed.")

# ----------------------------
# 2) Vectorized scoring signals for representative row selection
# ----------------------------
sid = df["school_id"]

source_rank = np.select(
    [sid.str.startswith("CAIS_"), sid.str.startswith("PRI_"), sid.str.startswith("PUB_")],
    [0, 1, 2],
    default=9
).astype(np.int16)

if "has_any_enrichment" in df.columns:
    enriched_boost = df["has_any_enrichment"].fillna(False).astype(bool).astype(np.int8)
else:
    enriched_boost = np.zeros(len(df), dtype=np.int8)

if "geo_confidence" in df.columns:
    geo_conf = df["geo_confidence"].fillna("").astype(str).str.lower()
else:
    geo_conf = pd.Series([""] * len(df), index=df.index)

geo_rank = geo_conf.map({"high": 3, "medium": 2, "low": 1, "failed": 0}).fillna(0).astype(np.int8)

lat = pd.to_numeric(df["latitude"], errors="coerce") if "latitude" in df.columns else pd.Series([np.nan] * len(df), index=df.index)
lon = pd.to_numeric(df["longitude"], errors="coerce") if "longitude" in df.columns else pd.Series([np.nan] * len(df), index=df.index)
has_latlng = (lat.notna() & lon.notna()).astype(np.int8)

# ----------------------------
# 3) Address quality score (structured address wins)
# ----------------------------
addr = (df["address"].fillna("").astype(str) if "address" in df.columns else pd.Series([""] * len(df), index=df.index))
addr_norm = addr.str.lower()

has_street_num = addr.str.match(r"^\s*\d+").fillna(False).astype(np.int8)
is_po_box = addr_norm.str.contains(r"\bpo box\b|p\.o\.|^\s*box\s", regex=True).fillna(False).astype(np.int8)

zip_col = (df["zip"].fillna("").astype(str) if "zip" in df.columns else pd.Series([""] * len(df), index=df.index))
zip5 = zip_col.str.extract(r"(\d{5})", expand=False).fillna("")
has_zip5 = (zip5.str.len() == 5).astype(np.int8)

has_newline = addr.str.contains(r"[\r\n]").fillna(False).astype(np.int8)
has_view_on = addr_norm.str.contains("view on").fillna(False).astype(np.int8)

addr_len = addr.str.len().clip(0, 200).astype(np.int16)

addr_quality = (
    50 * has_street_num +
    30 * has_zip5 -
    60 * is_po_box -
    20 * has_newline -
    40 * has_view_on +
    1 * addr_len
).astype(np.int32)

name_len = (df["name"].fillna("").astype(str).str.len().astype(np.int16)
            if "name" in df.columns else np.zeros(len(df), dtype=np.int16))

# ----------------------------
# 4) Representative row score (deterministic)
# ----------------------------
rep_score = (
    1_000_000 * enriched_boost
    + 200_000 * has_latlng
    + 50_000 * geo_rank
    + 5_000 * addr_quality
    - 100 * source_rank
    + 1 * name_len
).astype(np.int64)

df["_rep_score"] = rep_score
df["_sid_tiebreak"] = sid  # stable tie-break

best_idx = (
    df.sort_values(["canonical_school_id", "_rep_score", "_sid_tiebreak"],
                   ascending=[True, False, True])
      .groupby("canonical_school_id", sort=False)
      .head(1)
      .index
)

base = df.loc[best_idx].copy()
base["school_id_original"] = base["school_id"].astype(str)
base["school_id"] = base["canonical_school_id"].astype(str)  # PK = canonical

# ----------------------------
# 5) Boolean OR rollups (FAST)
# ----------------------------
BOOL_OR_COLS = [
    "is_public", "is_private", "is_charter",
    "serves_pk", "serves_elementary", "serves_middle", "serves_high",
    "has_cais", "has_ib", "has_ams_montessori", "has_waldorf", "has_any_enrichment",
]
BOOL_OR_COLS = [c for c in BOOL_OR_COLS if c in df.columns]

if BOOL_OR_COLS:
    bool_mat = df[["canonical_school_id"] + BOOL_OR_COLS].copy()
    for c in BOOL_OR_COLS:
        bool_mat[c] = bool_mat[c].fillna(False).astype(bool).astype(np.int8)

    agg_bool = bool_mat.groupby("canonical_school_id", as_index=False)[BOOL_OR_COLS].max()
    for c in BOOL_OR_COLS:
        agg_bool[c] = agg_bool[c].astype(bool)

    base = base.drop(columns=[c for c in BOOL_OR_COLS if c in base.columns], errors="ignore")
    base = base.merge(agg_bool, on="canonical_school_id", how="left")

# ----------------------------
# 6) Best geocode per group
# ----------------------------
df["_has_latlng"] = has_latlng
df["_geo_rank"] = geo_rank
df["_addr_quality"] = addr_quality
df["_source_rank"] = source_rank

geo_cols = [c for c in ["latitude", "longitude", "geo_confidence", "geo_provider", "geo_status", "geo_query", "geo_query_type"] if c in df.columns]
if geo_cols:
    best_geo = (
        df.sort_values(
            ["canonical_school_id", "_has_latlng", "_geo_rank", "_addr_quality", "_source_rank", "school_id"],
            ascending=[True, False, False, False, True, True]
        )
        .groupby("canonical_school_id", sort=False)
        .head(1)[["canonical_school_id"] + geo_cols]
        .copy()
    )
    base = base.drop(columns=[c for c in geo_cols if c in base.columns], errors="ignore")
    base = base.merge(best_geo, on="canonical_school_id", how="left")

# ----------------------------
# 7) Prefer CAIS display name if present (human-friendly)
# ----------------------------
if "name" in df.columns:
    cais = df.loc[df["school_id"].str.startswith("CAIS_"), ["canonical_school_id", "name"]].dropna()
    if len(cais) > 0:
        cais["_name_len"] = cais["name"].astype(str).str.len()
        cais_best = (
            cais.sort_values(["canonical_school_id", "_name_len", "name"], ascending=[True, False, True])
                .drop_duplicates("canonical_school_id")[["canonical_school_id", "name"]]
                .rename(columns={"name": "_cais_name"})
        )
        base = base.merge(cais_best, on="canonical_school_id", how="left")
        base["name"] = base["_cais_name"].fillna(base.get("name"))
        base = base.drop(columns=["_cais_name"], errors="ignore")

# ----------------------------
# 8) Lineage export
# ----------------------------
lineage = (
    df.groupby("canonical_school_id")["school_id"]
      .apply(lambda x: "|".join(sorted(set(x.astype(str)))))
      .reset_index()
      .rename(columns={"school_id": "merged_school_ids"})
)
lineage["merged_count"] = lineage["merged_school_ids"].apply(lambda s: 0 if not s else (s.count("|") + 1))

# ----------------------------
# 9) Final cleanup + write
# ----------------------------
helper_cols = [c for c in base.columns if c.startswith("_")]
schools_master_v3 = base.drop(columns=helper_cols, errors="ignore").reset_index(drop=True)

OUT_SCHOOLS_V3.parent.mkdir(parents=True, exist_ok=True)
schools_master_v3.to_parquet(OUT_SCHOOLS_V3, index=False)
schools_master_v3.to_csv(OUT_SCHOOLS_V3_CSV, index=False)
lineage.to_csv(OUT_LINEAGE_V2, index=False)

log(f"✅ Saved: {OUT_SCHOOLS_V3}")
log(f"✅ Saved: {OUT_SCHOOLS_V3_CSV}")
log(f"✅ Saved: {OUT_LINEAGE_V2}")

log(f"Canonical rows: {len(schools_master_v3):,}")
log(f"Lineage rows:   {len(lineage):,}")
log(f"Merged groups (count>1): {int((lineage['merged_count'] > 1).sum()):,}")

# ----------------------------
# 10) Sanity checks
# ----------------------------
merge_map["_is_remap"] = (merge_map["old_school_id"] != merge_map["canonical_school_id"])
expected_remapped_ids = int(merge_map["_is_remap"].sum())
got_remapped_rows = int(df["_is_remapped"].sum())

log(f"Map remapped ids (old!=canon): {expected_remapped_ids:,}")
log(f"Input rows remapped (school_id!=canon): {got_remapped_rows:,}")

bad_pk = int((schools_master_v3["school_id"].astype(str) != schools_master_v3["canonical_school_id"].astype(str)).sum())
if bad_pk != 0:
    raise ValueError(f"06.4 sanity failed: output school_id != canonical_school_id for {bad_pk} rows")

# NEW: grade sanity (prevents silently shipping all-false again)
if all(c in schools_master_v3.columns for c in serves_cols):
    any_grade_true = schools_master_v3[serves_cols].fillna(False).astype(bool).any(axis=1).any()
    if not any_grade_true:
        log("⚠️ WARNING: Output serves_* are still all-false. No grade source was found upstream.")
        # Optional: uncomment to make it fail-fast in NB05:
        # raise ValueError("Output serves_* are all-false. Fix upstream grade derivation before exporting schools_master_v3.")

log("06.4 completed")


[134015.92s] 06.4 started
[134015.92s] Loading schools: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook04/schools_master_v1.parquet
[134015.96s] df_in: (126616, 18)
[134015.96s] Loading merge map: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed/notebook05/school_id_merge_map_v2.csv
[134016.06s] === 06.4 grade-signal check (pre-collapse) ===
[134016.06s] Grade-related cols present: ['serves_pk', 'serves_elementary', 'serves_middle', 'serves_high']
[134016.06s] serves_pk true_count (pre) = 0
[134016.06s] serves_elementary true_count (pre) = 0
[134016.06s] serves_middle true_count (pre) = 0
[134016.06s] serves_high true_count (pre) = 0
[134016.06s] Rehydrating serves_* flags (missing or all-false).
[134016.06s] ⚠️ No grade source columns found (grade_band_sig or low/high grade). serves_* remain empty.
[134016.06s] serves_pk true_count (post rehydrate) = 0
[134016.06s] s

In [664]:

lineage = pd.read_csv(NB05_DIR / "school_merge_lineage_v1.csv")

target_ids = {"PRI_A1900425", "CAIS_3584e195542a"}

def parse_merged_ids(cell):
    if pd.isna(cell):
        return set()

    s = str(cell).strip()

    # Case 1: looks like a Python list string: "['A','B']"
    if s.startswith("[") and s.endswith("]"):
        try:
            vals = ast.literal_eval(s)   # safe eval
            return set(map(str, vals))
        except Exception:
            return set()

    # Case 2: delimiter format (choose what you used; keep both)
    if "|" in s:
        return set(x.strip() for x in s.split("|") if x.strip())
    if "," in s:
        return set(x.strip() for x in s.split(",") if x.strip())

    # Case 3: single id
    return {s}

lineage["_merged_set"] = lineage["merged_school_ids"].apply(parse_merged_ids)

hits = lineage[lineage["_merged_set"].apply(lambda s: len(s & target_ids) > 0)].copy()
display(hits[["canonical_school_id","merged_count","merged_school_ids"]].head(20))

# Strong check: are BOTH ids in the same canonical group?
both = lineage[lineage["_merged_set"].apply(lambda s: target_ids.issubset(s))]
print("Groups containing BOTH target IDs:", len(both))
display(both[["canonical_school_id","merged_count","merged_school_ids"]])


Unnamed: 0,canonical_school_id,merged_count,merged_school_ids
3,CAIS_3584e195542a,1,CAIS_3584e195542a
16066,PRI_A1900425,1,PRI_A1900425


Groups containing BOTH target IDs: 0


Unnamed: 0,canonical_school_id,merged_count,merged_school_ids


In [666]:
# STEP 2 verification: confirm German + CAIS German are in the same canonical group
lineage = pd.read_csv(NB05_DIR / "school_merge_lineage_v1.csv")

target_ids = {"PRI_A1900425", "CAIS_3584e195542a"}

hits = lineage[lineage["merged_school_ids"].apply(lambda xs: set(eval(xs)) if isinstance(xs, str) and xs.startswith("[") else set(xs)).apply(lambda s: len(s & target_ids) > 0)]
display(hits.head(10))


Unnamed: 0,canonical_school_id,merged_school_ids,merged_count


In [668]:
# STEP 3: Audit suspicious groups (low name similarity merges)
dec = pd.read_csv(NB05_DIR / "merge_decisions_ca.csv")
auto = dec[dec["decision"] == "auto_merge"].copy()

# Focus on risky auto_merges: low raw name_sim OR used CAIS override
risky = auto[(auto["name_sim"] < 0.55) | (auto.get("used_cais_override", False) == True)].copy()

print("Risky auto_merge edges:", len(risky))
display(
    risky.sort_values(["used_cais_override","name_sim","distance_m"], ascending=[False, True, True])[
        ["school_id_a","name_a","school_id_b","name_b","distance_m","name_sim",
         "name_sim_guard","used_cais_override","reason"]
    ].head(50)
)


Risky auto_merge edges: 2


Unnamed: 0,school_id_a,name_a,school_id_b,name_b,distance_m,name_sim,name_sim_guard,used_cais_override,reason
73,PRI_A0900219,BAYHILL HIGH SCHOOL,PRI_BB160348,BAYHILL HIGH SCHOOL,0.0,1.0,1.0,True,auto: CAIS↔private override dist<=50 & guard_s...
117,PRI_A2190096,HARKER,CAIS_baab550e497c,The Harker School,0.0,1.0,1.0,True,auto: CAIS↔private override dist<=50 & guard_s...


In [670]:
# sanity check

v2 = pd.read_parquet(OUT_SCHOOLS_V2)
lineage = pd.read_csv(OUT_LINEAGE)

print("Merged groups >1:")
display(lineage[lineage["merged_count"] > 1].sort_values("merged_count", ascending=False))


Merged groups >1:


Unnamed: 0,canonical_school_id,merged_school_ids,merged_count
28,CAN_24b2170365fd,PRI_00077845|PRI_A2192021,2
29,CAN_29234bc4c4ea,PRI_A0992014|PRI_BB180272,2
30,CAN_29b8b7cc620d,PRI_00083553|PRI_BB160346,2
31,CAN_2eab9ebad716,PRI_00073487|PRI_BB180191,2
32,CAN_338195041dc7,PRI_A0700289|PRI_A1500500,2
33,CAN_533ae5e2759b,PRI_00072574|PRI_BB200240,2
34,CAN_61abbebcf0e1,PRI_00092477|PRI_A1900626,2
35,CAN_6900e7b9e74d,PRI_A0900219|PRI_BB160348,2
36,CAN_8288b85246de,PRI_00082345|PRI_BB160344,2
37,CAN_8d3521c3ab13,PRI_A0100963|PRI_K9300754,2


## 06.4.1 Manual Audit — Canonical Merge Validation (CAIS ↔ PSS)

Before finalizing `schools_master_v2`, we manually audit a small number of
**high-risk / high-value canonical merge groups** to validate merge logic.

### Why this audit is necessary
Some schools appear in multiple authoritative sources:
- **CAIS** (independent schools; authoritative addresses)
- **NCES PSS** (private school universe; known duplicate IDs)

Automated merging uses:
- exact / normalized address match
- geospatial distance thresholds
- enrichment presence (CAIS > PSS > public)

To ensure this logic is *correct and safe*, we manually inspect a few
canonical groups where:
- `merged_count >= 3`
- at least one CAIS record is present
- multiple PSS records collapse into a single canonical entity

### Audit focus (this cell)
We inspect canonical group:

**`CAN_f45e1a803de3`**

This group contains:
- 1 CAIS record
- 2 PSS records
- identical or near-identical physical locations

Expected outcome:
- All rows represent the **same real-world school**
- CAIS record should be selected as canonical representative
- PSS rows should be absorbed into lineage only

If this audit passes, it strongly validates the CAIS↔PSS merge rule.


In [672]:
# ============================
# 06.4.1 Reusable audit helpers (canonical groups)  [KEEP as optional QA]
# ============================

import ast
import pandas as pd

# Load once (reuse everywhere) — use v3 outputs
v3 = pd.read_parquet(OUT_SCHOOLS_V3)
lineage = pd.read_csv(OUT_LINEAGE_V2)
decisions = pd.read_csv(NB05_DIR / "merge_decisions_ca.csv")

def _parse_merged_ids(x):
    if isinstance(x, list):
        return [str(i) for i in x]
    if pd.isna(x):
        return []
    s = str(x).strip()

    # list-like
    if s.startswith("[") and s.endswith("]"):
        try:
            out = ast.literal_eval(s)
            if isinstance(out, list):
                return [str(i) for i in out]
        except Exception:
            pass

    # pipe-delimited (your current format)
    if "|" in s:
        return [p.strip().strip("'").strip('"') for p in s.split("|") if p.strip()]

    # comma legacy
    s2 = s.strip("[]")
    return [p.strip().strip("'").strip('"') for p in s2.split(",") if p.strip()]

# Cache parsed ids once (fast lookups)
if "_ids" not in lineage.columns:
    lineage["_ids"] = lineage["merged_school_ids"].apply(_parse_merged_ids)

# Optional: build reverse index (school_id -> canonical_school_id)
# This makes find_groups_by_school_ids O(k) instead of scanning 126k rows.
_school_to_canon = {}
for canon, ids in zip(lineage["canonical_school_id"].astype(str), lineage["_ids"]):
    for sid in ids:
        _school_to_canon[str(sid)] = canon

def find_groups_by_school_ids(school_ids):
    wanted = [str(x) for x in school_ids]
    canons = sorted({ _school_to_canon.get(s) for s in wanted if _school_to_canon.get(s) })
    return lineage[lineage["canonical_school_id"].astype(str).isin(canons)].drop(columns=["_ids"], errors="ignore")

def audit_canon(canon_id, df_v1=None, show_v3_row=True):
    canon_id = str(canon_id)
    row = lineage[lineage["canonical_school_id"].astype(str) == canon_id]
    if row.empty:
        print("❌ Not found:", canon_id)
        return

    display(row.drop(columns=["_ids"], errors="ignore"))
    merged_ids = row.iloc[0]["_ids"] if "_ids" in row.columns else _parse_merged_ids(row.iloc[0]["merged_school_ids"])
    print("Merged school IDs:", merged_ids)

    if df_v1 is not None:
        sub = df_v1[df_v1["school_id"].astype(str).isin(list(map(str, merged_ids)))].copy()
        cols = [c for c in ["school_id","name","address","city","state","zip","has_cais","has_any_enrichment"] if c in sub.columns]
        display(sub[cols])

    if show_v3_row:
        display(v3.loc[v3["canonical_school_id"].astype(str) == canon_id])

def trace_group_edges(canon_id):
    canon_id = str(canon_id)
    row = lineage[lineage["canonical_school_id"].astype(str) == canon_id]
    if row.empty:
        print("❌ Not found:", canon_id)
        return

    merged_ids = set(row.iloc[0]["_ids"]) if "_ids" in row.columns else set(_parse_merged_ids(row.iloc[0]["merged_school_ids"]))

    edges = decisions[
        (decisions["decision"] == "auto_merge") &
        (decisions["school_id_a"].astype(str).isin(merged_ids)) &
        (decisions["school_id_b"].astype(str).isin(merged_ids))
    ].copy()

    print("\n=== auto_merge edges inside", canon_id, "===")
    if edges.empty:
        print("(no auto_merge edges found in decisions table)")
        return

    cols = [c for c in ["school_id_a","school_id_b","name_a","name_b","distance_m","name_sim","reason"] if c in edges.columns]
    display(edges.sort_values(["distance_m","name_sim"], ascending=[True, False])[cols])


In [680]:
# ============================
# 06.4.2 Example audit: Harker merge (keep 1 example)
# ============================
hit = find_groups_by_school_ids(["PRI_A2190096", "CAIS_baab550e497c"])
display(hit)

if not hit.empty:
    can_id = hit.iloc[0]["canonical_school_id"]
    audit_canon(can_id, df_v1=df_v1)
    trace_group_edges(can_id)


Unnamed: 0,canonical_school_id,merged_school_ids,merged_count
19124,PRI_A2190096,CAIS_baab550e497c|PRI_A2190096,2


Unnamed: 0,canonical_school_id,merged_school_ids,merged_count
19124,PRI_A2190096,CAIS_baab550e497c|PRI_A2190096,2


Merged school IDs: ['CAIS_baab550e497c', 'PRI_A2190096']


Unnamed: 0,school_id,name,city,state,zip,has_cais,has_any_enrichment
119411,PRI_A2190096,HARKER,SAN JOSE,CA,95129,True,False


Unnamed: 0,school_id,name,city,state,zip,county,has_montessori,canonical_school_id,school_id_original,is_public,is_private,is_charter,serves_pk,serves_elementary,serves_middle,serves_high,has_cais,has_ib,has_waldorf,has_any_enrichment
19124,PRI_A2190096,The Harker School,San Jose,CA,95129,Santa Clara,False,PRI_A2190096,CAIS_baab550e497c,False,True,False,False,False,False,False,True,False,False,True



=== auto_merge edges inside PRI_A2190096 ===


Unnamed: 0,school_id_a,school_id_b,name_a,name_b,distance_m,name_sim,reason
117,PRI_A2190096,CAIS_baab550e497c,HARKER,The Harker School,0.0,1.0,auto: CAIS↔private override dist<=50 & guard_s...


In [688]:
# ============================
# 06.5 QA Summary (Invariants)  [HARDENED v3]
#   - Self-sufficient: loads decisions if missing
#   - Handles lineage IDs not present in v1 input
# ============================

log("06.5 QA started")

import pandas as pd

# --- Preconditions (expected from your pipeline) ---
assert "df_v1" in globals(), "df_v1 not found (load schools_master_v1 before QA)"
assert "schools_master_v3" in globals(), "schools_master_v3 not found (run 06.4 before QA)"
assert "lineage" in globals(), "lineage not found (load OUT_LINEAGE_V2 before QA)"

# ----------------------------
# Load decisions safely
# ----------------------------
if "df_decisions" in globals():
    df_decisions = df_decisions.copy()
elif "decisions" in globals():
    df_decisions = decisions.copy()
else:
    p = NB05_DIR / "merge_decisions_ca.csv"
    assert p.exists(), f"merge decisions not found: {p}"
    df_decisions = pd.read_csv(p)

# ----------------------------
# 0) Normalize types
# ----------------------------
lineage = lineage.copy()
lineage["canonical_school_id"] = lineage["canonical_school_id"].astype(str)
lineage["merged_school_ids"] = lineage["merged_school_ids"].fillna("").astype(str)
lineage["merged_count"] = pd.to_numeric(lineage["merged_count"], errors="coerce").fillna(1).astype(int)

schools_master_v3 = schools_master_v3.copy()
schools_master_v3["school_id"] = schools_master_v3["school_id"].astype(str)
schools_master_v3["canonical_school_id"] = schools_master_v3["canonical_school_id"].astype(str)

df_v1 = df_v1.copy()
df_v1["school_id"] = df_v1["school_id"].astype(str)

df_decisions = df_decisions.copy()
if "decision" in df_decisions.columns:
    df_decisions["decision"] = df_decisions["decision"].astype(str)

# ----------------------------
# 1) Row count delta invariant (input-aware)
#   expected_drop = sum(max(0, present_in_v1_n - 1)) across merged groups
# ----------------------------
before = len(df_v1)
after  = len(schools_master_v3)
actual_drop = int(before - after)

v1_ids = set(df_v1["school_id"].astype(str))

merged_rows = lineage[lineage["merged_count"] > 1].copy()

def _present_n_in_v1(merged_school_ids: str) -> int:
    ids = [s.strip() for s in str(merged_school_ids).split("|") if s.strip()]
    return sum((sid in v1_ids) for sid in ids)

merged_rows["present_n_in_v1"] = merged_rows["merged_school_ids"].apply(_present_n_in_v1)
expected_drop = int((merged_rows["present_n_in_v1"] - 1).clip(lower=0).sum())

assert actual_drop == expected_drop, (
    f"Row count mismatch: before={before}, after={after}, "
    f"actual_drop={actual_drop}, expected_drop={expected_drop}"
)

log(f"✓ Row drop OK: {before:,} → {after:,} (drop={actual_drop:,}; expected={expected_drop:,})")

missing_groups = merged_rows[merged_rows["present_n_in_v1"] < merged_rows["merged_count"]].copy()
log(f"Info: merged groups with IDs missing from v1: {len(missing_groups):,}")
if len(missing_groups) > 0:
    display(missing_groups[["canonical_school_id","merged_school_ids","merged_count","present_n_in_v1"]].head(25))

# ----------------------------
# 2) No public/public auto merges (hard rule)
# ----------------------------
if "both_public" in df_decisions.columns:
    # robust bool
    both_public = df_decisions["both_public"]
    if both_public.dtype == object:
        both_public = both_public.astype(str).str.strip().str.lower().isin(["true","1","yes","y","t"])
    else:
        both_public = both_public.fillna(False).astype(bool)

    pp_auto = df_decisions[(df_decisions["decision"] == "auto_merge") & (both_public)]
    assert len(pp_auto) == 0, f"Found public/public auto merges: {len(pp_auto)}"
    log("✓ No public/public auto merges")
else:
    log("⚠ Skipped: no 'both_public' column in decisions")

# ----------------------------
# 3) Canonical PK integrity
# ----------------------------
assert schools_master_v3["school_id"].is_unique, "school_id is not unique in schools_master_v3"
assert (schools_master_v3["school_id"] == schools_master_v3["canonical_school_id"]).all(), \
    "Invariant failed: school_id != canonical_school_id in schools_master_v3"
log("✓ Canonical PK integrity OK (unique, school_id==canonical_school_id)")

# ----------------------------
# 4) Lineage coverage (1:1 with canonical set)
# ----------------------------
canon_set = set(schools_master_v3["canonical_school_id"])
lineage_set = set(lineage["canonical_school_id"])
assert canon_set == lineage_set, (
    f"Lineage mismatch: canon_only={len(canon_set - lineage_set)}, "
    f"lineage_only={len(lineage_set - canon_set)}"
)
log("✓ Lineage coverage OK (canonical IDs match exactly)")

# ----------------------------
# 5) CAIS override sanity (informational)
# ----------------------------
if "used_cais_override" in df_decisions.columns:
    used = df_decisions["used_cais_override"]
    if used.dtype == object:
        used = used.astype(str).str.strip().str.lower().isin(["true","1","yes","y","t"])
    else:
        used = used.fillna(False).astype(bool)

    cais_auto = df_decisions[(df_decisions["decision"] == "auto_merge") & (used)]
    log(f"✓ CAIS override auto_merges: {len(cais_auto):,}")
else:
    log("⚠ Skipped: no 'used_cais_override' column in decisions")

log("06.5 QA completed — ALL CHECKS PASSED")


[ 4754.71s] 06.5 QA started
[ 4754.76s] ✓ Row drop OK: 126,615 → 126,603 (drop=12; expected=12)
[ 4754.76s] Info: merged groups with IDs missing from v1: 1


Unnamed: 0,canonical_school_id,merged_school_ids,merged_count,present_n_in_v1
19124,PRI_A2190096,CAIS_baab550e497c|PRI_A2190096,2,1


[ 4754.76s] ✓ No public/public auto merges
[ 4754.77s] ✓ Canonical PK integrity OK (unique, school_id==canonical_school_id)
[ 4754.80s] ✓ Lineage coverage OK (canonical IDs match exactly)
[ 4754.80s] ✓ CAIS override auto_merges: 2
[ 4754.80s] 06.5 QA completed — ALL CHECKS PASSED


In [684]:
# Diagnostic: check merged IDs that don't exist in df_v1
v1_ids = set(df_v1["school_id"].astype(str))
missing = []

merged_rows = lineage[lineage["merged_count"] > 1].copy()
for _, r in merged_rows.iterrows():
    ids = str(r["merged_school_ids"]).split("|")
    present = [i for i in ids if i in v1_ids]
    if len(present) < len(ids):
        missing.append({
            "canonical_school_id": r["canonical_school_id"],
            "merged_school_ids": r["merged_school_ids"],
            "present_in_v1": "|".join(present),
            "missing_from_v1": "|".join([i for i in ids if i not in v1_ids]),
        })

missing_df = pd.DataFrame(missing)
log(f"Merged groups with IDs missing from v1: {len(missing_df)}")
display(missing_df.head(50))


[ 4468.06s] Merged groups with IDs missing from v1: 1


Unnamed: 0,canonical_school_id,merged_school_ids,present_in_v1,missing_from_v1
0,PRI_A2190096,CAIS_baab550e497c|PRI_A2190096,PRI_A2190096,CAIS_baab550e497c


In [338]:
# ============================
# 03.3.1a Upgrade CAIS name_state_ca -> donor address geo_query (Bay Area MVP, FIXED)
#   - CAIS often missing best_city; use county for matching
#   - fallback to name-only ONLY when donor name_norm is unique in Bay Area
# ============================

log("03.3.1a started")

import re
import pandas as pd

assert "df_loc" in globals(), "df_loc not found"

BAYAREA_ZIP2 = {"94", "95"}
BAYAREA_COUNTIES = {
    "alameda","contra costa","marin","napa","san francisco","san mateo",
    "santa clara","solano","sonoma"
}

def norm_name(s: object) -> str:
    s = "" if pd.isna(s) else str(s).lower().strip()
    s = s.replace("&", " and ")
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    stop = {"school","schools","academy","the","of","and","for","at","inc","llc","co","prep","preparatory"}
    toks = [t for t in s.split() if t not in stop]
    return " ".join(toks)

def norm_county(s: object) -> str:
    s = "" if pd.isna(s) else str(s).lower().strip()
    s = s.replace("county", "").strip()
    s = re.sub(r"[^a-z\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

# CAIS rows stuck on name_state_ca
cais_missing = df_loc[
    df_loc["school_id"].astype(str).str.startswith("CAIS_") &
    df_loc["geo_query_type"].astype(str).eq("name_state_ca")
].copy()

log(f"03.3.1a CAIS name_state_ca rows: {len(cais_missing):,}")

# donors: address_city_state_zip in CA + Bay Area zip2 guard
donors = df_loc[
    df_loc["geo_query_type"].astype(str).eq("address_city_state_zip") &
    df_loc["geo_query"].astype(str).str.contains(r",\s*CA\b", case=False, na=False)
].copy()

donors["zip2"] = donors["geo_query"].astype(str).str.extract(r"\b(\d{5})(?:-\d{4})?\b")[0].astype(str).str[:2]
donors = donors[donors["zip2"].isin(BAYAREA_ZIP2)].copy()

# normalized keys
cais_missing["name_norm"] = cais_missing["name"].map(norm_name)
cais_missing["county_norm"] = cais_missing.get("best_county", cais_missing.get("county", "")).map(norm_county)

donors["name_norm"] = donors["name"].map(norm_name)
donors["county_norm"] = donors.get("best_county", donors.get("county", "")).map(norm_county)

# keep only donors whose county is in Bay Area (extra safety)
donors = donors[donors["county_norm"].isin(BAYAREA_COUNTIES)].copy()

# donor scoring
donors["donor_score"] = (
    3*(donors.get("is_private", False)==True).astype(int) +
    2*(donors.get("has_any_enrichment", False)==True).astype(int) +
    1*(donors.get("is_public", False)==True).astype(int)
)
donors_sorted = donors.sort_values("donor_score", ascending=False)

# PASS A: match on (name_norm + county_norm)
m1 = cais_missing.merge(
    donors_sorted[["school_id","name","name_norm","county_norm","geo_query","geo_query_type","donor_score"]].rename(columns={
        "school_id":"donor_school_id",
        "name":"donor_name",
        "geo_query":"donor_geo_query",
        "geo_query_type":"donor_geo_query_type",
    }),
    on=["name_norm","county_norm"],
    how="left"
)

patch1 = m1[m1["donor_geo_query"].notna()].copy()

# PASS B: fallback match on name_norm ONLY when donor name_norm is unique (prevents collisions)
donor_name_counts = donors_sorted.groupby("name_norm")["geo_query"].nunique().to_dict()
unique_name_norm = {k for k,v in donor_name_counts.items() if v == 1}

cais_fallback = cais_missing[~cais_missing["school_id"].isin(patch1["school_id"])].copy()
cais_fallback = cais_fallback[cais_fallback["name_norm"].isin(unique_name_norm)].copy()

m2 = cais_fallback.merge(
    donors_sorted[["school_id","name","name_norm","geo_query","geo_query_type","donor_score"]].rename(columns={
        "school_id":"donor_school_id",
        "name":"donor_name",
        "geo_query":"donor_geo_query",
        "geo_query_type":"donor_geo_query_type",
    }),
    on="name_norm",
    how="left"
)
patch2 = m2[m2["donor_geo_query"].notna()].copy()

patch_map = pd.concat([
    patch1[["school_id","name","geo_query","geo_query_type","donor_school_id","donor_name","donor_geo_query","donor_geo_query_type"]],
    patch2[["school_id","name","geo_query","geo_query_type","donor_school_id","donor_name","donor_geo_query","donor_geo_query_type"]],
], ignore_index=True).drop_duplicates(["school_id"], keep="first")

log(f"03.3.1a CAIS donor address found: {len(patch_map):,} / {len(cais_missing):,}")
display(patch_map)

cais_patch_map = patch_map.copy()

log("03.3.1a completed")


[28388.80s] 03.3.1a started
[28388.82s] 03.3.1a CAIS name_state_ca rows: 16
[28388.97s] 03.3.1a CAIS donor address found: 0 / 16


Unnamed: 0,school_id,name,geo_query,geo_query_type,donor_school_id,donor_name,donor_geo_query,donor_geo_query_type


[28388.97s] 03.3.1a completed


In [340]:
# ============================
# 03.3.1a2 CAIS must-have manual address patch (MVP)
# ============================

cais_manual = {
    # fill these 8 with real addresses
    "CAIS_0fb36d096d98": "<<<address, city, CA zip>>>",
    "CAIS_1ff9891ebd76": "<<<address, city, CA zip>>>",
    "CAIS_3584e195542a": "<<<address, city, CA zip>>>",
    "CAIS_38bcb29a33d7": "<<<address, city, CA zip>>>",
    "CAIS_42ecd004ae1c": "<<<address, city, CA zip>>>",
    "CAIS_453b3e9b276c": "<<<address, city, CA zip>>>",
    "CAIS_4eaacca69eb8": "<<<address, city, CA zip>>>",
    "CAIS_8aea417dcbe4": "<<<address, city, CA zip>>>",
}

mask = df_loc["school_id"].isin(cais_manual.keys())
df_loc.loc[mask, "geo_query"] = df_loc.loc[mask, "school_id"].map(cais_manual)
df_loc.loc[mask, "geo_query_type"] = "address_city_state_zip"
log(f"03.3.1a2 CAIS manual patched: {int(mask.sum())}")


[28426.50s] 03.3.1a2 CAIS manual patched: 8


In [342]:
# ============================
# 03.3.1b Upgrade PO Box style queries -> name_city_state (CA-only, quality guarded)
#   - Avoid generating "..., CA" when city missing
# ============================

log("03.3.1b started")

import pandas as pd

po_types = {"name_city_state_po_box", "name_city_state_po"}

po_rows = df_loc[
    df_loc["geo_query_type"].astype(str).isin(po_types) &
    df_loc["geo_query"].astype(str).str.contains(r",\s*CA\b", case=False, na=False)
].copy()

log(f"03.3.1b PO-box style CA rows: {len(po_rows):,}")

def build_name_city_state(row) -> str:
    nm = "" if pd.isna(row.get("name")) else str(row.get("name")).strip()
    city = row.get("best_city", row.get("city", ""))
    city = "" if pd.isna(city) else str(city).strip()
    if city == "":
        # don't create junk; signal no patch
        return ""
    return f"{nm}, {city}, CA"

po_patch_map = po_rows[["school_id","name","geo_query","geo_query_type","best_city","county"]].drop_duplicates().copy()
po_patch_map["new_geo_query"] = po_patch_map.apply(build_name_city_state, axis=1)
po_patch_map["new_geo_query_type"] = "name_city_state"

# keep only valid patches
po_patch_map = po_patch_map[po_patch_map["new_geo_query"].ne("")].copy()

log(f"03.3.1b PO Box patch candidates (valid): {len(po_patch_map):,}")
display(po_patch_map.head(30))

log("03.3.1b completed")


[28460.48s] 03.3.1b started
[28460.53s] 03.3.1b PO-box style CA rows: 810
[28460.54s] 03.3.1b PO Box patch candidates (valid): 810


Unnamed: 0,school_id,name,geo_query,geo_query_type,best_city,county,new_geo_query,new_geo_query_type
2428,PRI_00075225,ST JOSEPH SCHOOL,"ST JOSEPH SCHOOL, FREMONT, CA",name_city_state_po_box,FREMONT,,"ST JOSEPH SCHOOL, FREMONT, CA",name_city_state
2541,PRI_00077867,ST MARY'S HIGH SCHOOL,"ST MARY'S HIGH SCHOOL, STOCKTON, CA",name_city_state_po_box,STOCKTON,,"ST MARY'S HIGH SCHOOL, STOCKTON, CA",name_city_state
2591,PRI_00080949,CATE SCHOOL,"CATE SCHOOL, CARPINTERIA, CA",name_city_state_po_box,CARPINTERIA,Santa Barbara,"CATE SCHOOL, CARPINTERIA, CA",name_city_state
2660,PRI_00083564,MIDLAND SCHOOL CORPORATION,"MIDLAND SCHOOL CORPORATION, LOS OLIVOS, CA",name_city_state_po_box,LOS OLIVOS,,"MIDLAND SCHOOL CORPORATION, LOS OLIVOS, CA",name_city_state
2692,PRI_00087208,I'SOT SCHOOL,"I'SOT SCHOOL, CANBY, CA",name_city_state_po_box,CANBY,,"I'SOT SCHOOL, CANBY, CA",name_city_state
2714,PRI_00088813,THE WALDORF SCHOOL OF MENDOCINO COUNTY,"THE WALDORF SCHOOL OF MENDOCINO COUNTY, CALPEL...",name_city_state_po_box,CALPELLA,Mendocino,"THE WALDORF SCHOOL OF MENDOCINO COUNTY, CALPEL...",name_city_state
2719,PRI_00089395,PAGE ACADEMY,"PAGE ACADEMY, COSTA MESA, CA",name_city_state_po_box,COSTA MESA,,"PAGE ACADEMY, COSTA MESA, CA",name_city_state
2780,PRI_00095285,FEATHER RIVER ADVENTIST SCHOOL,"FEATHER RIVER ADVENTIST SCHOOL, OROVILLE, CA",name_city_state_po_box,OROVILLE,,"FEATHER RIVER ADVENTIST SCHOOL, OROVILLE, CA",name_city_state
8258,PRI_01613388,LAKE TAHOE PREPATORY SCHOOL,"LAKE TAHOE PREPATORY SCHOOL, OLYMPIC VALLEY, CA",name_city_state_po_box,OLYMPIC VALLEY,,"LAKE TAHOE PREPATORY SCHOOL, OLYMPIC VALLEY, CA",name_city_state
8362,PRI_01897608,TRUTH TABERNACLE CHRISTIAN SCHOOL,"TRUTH TABERNACLE CHRISTIAN SCHOOL, FRESNO, CA",name_city_state_po_box,FRESNO,Fresno,"TRUTH TABERNACLE CHRISTIAN SCHOOL, FRESNO, CA",name_city_state


[28460.55s] 03.3.1b completed


In [344]:
# ============================
# 03.3.1c Apply patches -> df_loc_mvp (safe copy)
# ============================

log("03.3.1c started")

df_loc_mvp = df_loc.copy()

# CAIS donor patches
if "cais_patch_map" in globals() and len(cais_patch_map) > 0:
    cais_updates = cais_patch_map[["school_id","donor_geo_query","donor_geo_query_type"]].copy()
    cais_updates.rename(columns={
        "donor_geo_query":"new_geo_query",
        "donor_geo_query_type":"new_geo_query_type"
    }, inplace=True)

    df_loc_mvp = df_loc_mvp.merge(cais_updates, on="school_id", how="left")
    mask = df_loc_mvp["new_geo_query"].notna()
    df_loc_mvp.loc[mask, "geo_query"] = df_loc_mvp.loc[mask, "new_geo_query"]
    df_loc_mvp.loc[mask, "geo_query_type"] = df_loc_mvp.loc[mask, "new_geo_query_type"]
    df_loc_mvp = df_loc_mvp.drop(columns=["new_geo_query","new_geo_query_type"])
    log(f"03.3.1c Applied CAIS donor patches: {int(mask.sum()):,}")
else:
    log("03.3.1c No CAIS donor patches to apply.")

# PO box patches
if "po_patch_map" in globals() and len(po_patch_map) > 0:
    po_updates = po_patch_map[["school_id","new_geo_query","new_geo_query_type"]].copy()
    df_loc_mvp = df_loc_mvp.merge(po_updates, on="school_id", how="left")
    mask2 = df_loc_mvp["new_geo_query"].notna()
    df_loc_mvp.loc[mask2, "geo_query"] = df_loc_mvp.loc[mask2, "new_geo_query"]
    df_loc_mvp.loc[mask2, "geo_query_type"] = df_loc_mvp.loc[mask2, "new_geo_query_type"]
    df_loc_mvp = df_loc_mvp.drop(columns=["new_geo_query","new_geo_query_type"])
    log(f"03.3.1c Applied PO Box patches: {int(mask2.sum()):,}")
else:
    log("03.3.1c No PO box patches to apply.")

# sanity: must-have
must_ids = ["CAIS_628a857df100", "CAIS_d0d81d7b5167", "PRI_AA001475"]
display(df_loc_mvp[df_loc_mvp["school_id"].astype(str).isin(must_ids)][
    ["school_id","name","geo_query","geo_query_type","best_city","county"]
].drop_duplicates())

log("03.3.1c completed")


[28487.57s] 03.3.1c started
[28487.62s] 03.3.1c No CAIS donor patches to apply.
[28487.70s] 03.3.1c Applied PO Box patches: 810


Unnamed: 0,school_id,name,geo_query,geo_query_type,best_city,county
7,CAIS_628a857df100,Crystal Springs Uplands School,"Crystal Springs Uplands School, CA",name_state_ca,,San Mateo
19,CAIS_d0d81d7b5167,Lick-Wilmerding High School,"Lick-Wilmerding High School, CA",name_state_ca,,San Francisco
22428,PRI_AA001475,THE BRANSON SCHOOL,"THE BRANSON SCHOOL, ROSS, CA",name_city_state,ROSS,Marin


[28487.71s] 03.3.1c completed


In [346]:
# ============================
# 03.3.1d Build MVP TODO keys (STRICT Bay Area)
#   Output:
#     - todo_to_geocode_mvp_addr  (CA + ZIP2 in 94/95 ONLY)
#     - todo_to_geocode_mvp_name  (MUST_HAVE list ONLY)
#     - todo_to_geocode_mvp       (combined, addr first)
# ============================

log("03.3.1d started")

CACHE_VERSION = "v1"
BAYAREA_ZIP2 = {"94", "95"}

# ---- your MVP must-have list (edit freely) ----
MUST_HAVE_NAMES = {
    "The Branson School",
    "Crystal Springs Uplands School",
    "Lick-Wilmerding High School",
    "The Harker School",
    "Menlo School",
    "Castilleja School",
    "Head-Royce School",
    "Marin Academy",
    "San Domenico School",
    "Bellarmine College Preparatory",
    "St Ignatius College Preparatory",
    "The Nueva School",
    "The Athenian School",
    "The King's Academy",
    "Cathedral School for Boys",
    "Chinese American International School",
    "The Hamlin School",
    "The San Francisco School",
    "Park Day School",
    "Kehillah Jewish High School",
    "Brandeis Marin",
    "Bentley School",
    "Sonoma Academy",
    "Lycee Francais de San Francisco",
    "The International School of San Francisco",
    "Convent & Stuart Hall",
    "German International School of Silicon Valley",
}

def canon_q(s: object) -> str:
    s = "" if pd.isna(s) else str(s)
    s = s.replace("\u00a0", " ")
    s = re.sub(r"\s+", " ", s).strip()
    return s

def norm_name(s: object) -> str:
    s = "" if pd.isna(s) else str(s).strip().lower()
    s = s.replace("&", " and ")
    s = re.sub(r"[^a-z0-9\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

# ----------------------------
# cache success keys
# ----------------------------
geo_cache = pd.read_parquet(GEO_CACHE_PATH) if GEO_CACHE_PATH.exists() else pd.DataFrame()
for c in ["geo_query","geo_query_type","latitude","longitude","cache_version"]:
    if c not in geo_cache.columns:
        geo_cache[c] = pd.NA

geo_cache["cache_version"] = geo_cache["cache_version"].astype(str).fillna(CACHE_VERSION)
geo_cache = geo_cache[geo_cache["cache_version"] == CACHE_VERSION].copy()
geo_cache["geo_query"] = geo_cache["geo_query"].astype(str).map(canon_q)
geo_cache["geo_query_type"] = geo_cache["geo_query_type"].astype(str).map(canon_q)
geo_cache["latitude"] = pd.to_numeric(geo_cache["latitude"], errors="coerce")
geo_cache["longitude"] = pd.to_numeric(geo_cache["longitude"], errors="coerce")

ok_mask = geo_cache["latitude"].notna() & geo_cache["longitude"].notna()
ok_keys = set(zip(geo_cache.loc[ok_mask, "geo_query"], geo_cache.loc[ok_mask, "geo_query_type"]))

# ----------------------------
# start from df_loc_mvp (patched)
# ----------------------------
tmp = df_loc_mvp.copy()
tmp["geo_query"] = tmp["geo_query"].map(canon_q)
tmp["geo_query_type"] = tmp["geo_query_type"].map(canon_q)
tmp["name_norm"] = tmp["name"].map(norm_name)

zip2 = tmp["geo_query"].astype(str).str.extract(r"\b(\d{5})(?:-\d{4})?\b")[0].astype(str).str[:2]

# ----------------------------
# A) Address keys: STRICT Bay Area = CA + ZIP3 allowlist
# ----------------------------
BAYAREA_ZIP3 = {"940","941","942","943","944","945","946","947","948","949","950","951"}

zip5 = tmp["geo_query"].astype(str).str.extract(r"\b(\d{5})(?:-\d{4})?\b")[0]
zip3 = zip5.astype(str).str[:3]

addr_mask = (
    tmp["geo_query_type"].eq("address_city_state_zip")
    & tmp["geo_query"].str.contains(r",\s*CA\b", case=False, na=False)
    & zip3.isin(BAYAREA_ZIP3)
)

addr_keys = (
    tmp.loc[addr_mask, ["geo_query","geo_query_type"]]
      .drop_duplicates()
      .reset_index(drop=True)
)

addr_need = ~addr_keys.apply(
    lambda r: (r["geo_query"], r["geo_query_type"]) in ok_keys,
    axis=1
)

todo_to_geocode_mvp_addr = addr_keys.loc[addr_need].reset_index(drop=True)

# ----------------------------
# B) Name keys: ONLY MUST_HAVE list (avoid geocoding random "X, CA")
# ----------------------------
must_norm = {norm_name(x) for x in MUST_HAVE_NAMES}
name_types = {"name_city_state", "name_state_ca"}  # keep tight

name_mask = tmp["geo_query_type"].isin(name_types) & tmp["name_norm"].isin(must_norm)
name_keys = tmp.loc[name_mask, ["geo_query","geo_query_type"]].drop_duplicates().reset_index(drop=True)
name_need = ~name_keys.apply(lambda r: (r["geo_query"], r["geo_query_type"]) in ok_keys, axis=1)
todo_to_geocode_mvp_name = name_keys.loc[name_need].reset_index(drop=True)

# ----------------------------
# combined, addr first
# ----------------------------
todo_to_geocode_mvp = pd.concat([todo_to_geocode_mvp_addr, todo_to_geocode_mvp_name], ignore_index=True)
todo_to_geocode_mvp = todo_to_geocode_mvp.drop_duplicates(["geo_query","geo_query_type"]).reset_index(drop=True)

log(f"03.3.1d TODO addr keys (ZIP3 Bay Area allowlist): {len(todo_to_geocode_mvp_addr):,}")
log(f"03.3.1d TODO name keys (must-have only): {len(todo_to_geocode_mvp_name):,}")
log(f"03.3.1d TODO total: {len(todo_to_geocode_mvp):,}")

display(todo_to_geocode_mvp.head(30))

log("03.3.1d completed")


[28524.72s] 03.3.1d started
[28525.63s] 03.3.1d TODO addr keys (ZIP3 Bay Area allowlist): 195
[28525.63s] 03.3.1d TODO name keys (must-have only): 1
[28525.63s] 03.3.1d TODO total: 196


Unnamed: 0,geo_query,geo_query_type
0,"2350 Powell St, Emeryville, CA 94608",address_city_state_zip
1,"150 Oak St, San Francisco, CA 94102",address_city_state_zip
2,"180 N San Pedro Rd, San Rafael, CA 94903",address_city_state_zip
3,"3100 Webber St, Palo Alto, CA 94306",address_city_state_zip
4,"6130 Silberman Ave. 6130 Silberman Dr., San Jo...",address_city_state_zip
5,"625 South Seventh St. 625 South Seventh St., S...",address_city_state_zip
6,"890 East William 890 East William, San Jose, C...",address_city_state_zip
7,"850 North Second St. 850 North Second St., San...",address_city_state_zip
8,"275 North 24th St. 275 North 24th St., San Jos...",address_city_state_zip
9,"6515 Grapevine Way 6515 Grapevine Way, San Jos...",address_city_state_zip


[28525.63s] 03.3.1d completed
