# Notebook 06 — Scoring System v2 (Dense Metrics + Weight Budget)

---

## 00. Setup & Scope

**Relationship to earlier notebooks**
- Notebook 04: Golden Record (`schools_master_v1`)
- Notebook 05: Scoring v1 (binary-only features)

**What v2 adds**
- Dense metrics (0.0–1.0)
- Normalization + clipping
- Weight-budget safeguards to prevent domination
- Better tie-breaking realism

**What v2 does NOT do**
- No ML training
- No learned weights
- Still fully deterministic and explainable

---

## 01. Load Inputs

- Load `schools_master_v1.csv`
- Load raw source tables for metrics:
  - NCES CCD (public)
  - NCES PSS (private)
  - CRDC (where available)
- Validate join keys:
  - `school_id`
  - `ncessch`
  - `ppin`
- Define output paths:
  - `/data/processed/`
  - `/artifacts/`

---

## 02. Metric Design Rules (v2 Contracts)

**Dense metric contract**
- Each metric outputs a value in **[0.0, 1.0]**
- Higher is always “better”
- Missing values handled explicitly (not silent)

**Handling rules**
- Missing-value policy per metric
- Clipping rules (winsorization or hard caps)
- Outlier strategy (log / percentile)

**Availability**
- Public-only metrics
- Private-only metrics
- Metrics available to both

---

## 03. Build Dense Metrics

### 03.1 School Size (“Vibe”)
- Compute from total enrollment
- Apply log transform
- Normalize → `score_size`

### 03.2 Student–Teacher Ratio (“Attention”)
- Compute ratio
- Clip extreme values
- Invert and normalize → `score_attention`

### 03.3 Grade Coverage (“Logistics”)
- Parse grade span
- Derive binary flags:
  - `serves_elementary`
  - `serves_middle`
  - `serves_high`

### 03.4 Diversity Index (“Environment”)
- Compute entropy-based diversity score (where available)
- Normalize → `score_diversity`

---

## 04. Assemble Golden Record v2

- Join dense metrics back to `schools_master_v1`
- Create:
  - `schools_master_v2.csv` (or parquet)

**Coverage & sanity checks**
- % non-null per metric
- Distribution summaries
- Basic range validation

---

## 05. Feature Config v2 (Source of Truth + Weight Budget)

- Create `feature_config_master_v2.json`

**Feature categories**
- Tags (binary, higher weights)
- Dense metrics (continuous, lower weights)

**Weight budget rules**
- Dense metrics total contribution capped (e.g. 2–4 points)
- Dealbreaker tags retain dominance

**Validation checks**
- Per-feature max contribution = `weight × max_value`
- Print top contributors
- Assert no dense metric can dominate ranking alone

---

## 06. Build School Matrix v2

- Build `school_matrix_v2.npy` using v2 feature contract

**Save artifacts**
- `school_index_v2.csv`
- `school_vector_explain_v2.json`

**Audit**
- Binary feature prevalence
- Dense metric mean / std
- Missingness rates

---

## 07. Scoring & Ranking v2 (Deterministic)

- Weighted linear scoring using v2 weights
- Stable tie-breaking rules

**Comparison**
- Run v1 vs v2 on identical demo child profiles
- Show reduction in score ties
- Highlight rank shifts driven by dense metrics

---

## 08. Explainability v2 (Dense + Tags)

- Per-feature contribution table includes dense metrics
- Human-readable explanation template, e.g.:

> “Both schools match IB, but School A ranks higher due to smaller size and higher attention score.”

---

## 09. Validation: Tie-Reduction & Reasonableness Tests

**Tie-reduction tests**
- “Sea of ties” scenario:
  - National vs CA vs city-only filters
- Count unique scores
- Measure tie group sizes

**Sanity checks**
- Ensure IB schools are not outranked solely by size
- Weight-budget enforcement in action

**Optional**
- Basic metric distribution plots

---

## 10. Summary & Next Steps

**What v2 improves**
- Realistic tie-breaking
- More nuanced ranking
- Preserves explainability

**Remaining gaps**
- Lat/Lon distance
- Academic outcomes
- Reviews / sentiment

**ML readiness**
- Clean dense signals
- Normalized feature space
- Safe foundation for learned weights later


## 00. Setup & Scope

### Purpose of Notebook 06

Notebook 06 introduces **Scoring System v2**, which extends the binary-only scoring model from Notebook 05 with **dense (continuous) metrics** while preserving:

- Determinism
- Explainability
- Human-auditable weights

The goal of v2 is **better tie-breaking and realism**, not machine learning.

---

### Relationship to Earlier Notebooks

**Notebook 04 — Golden Record**
- Built the canonical school dataset: `schools_master_v1`
- Unified public (NCES CCD / CRDC) and private (PSS + curated lists)
- Established stable school identifiers (`school_id`, `ncessch`, `ppin`)
- Created pedagogy and program *tags* (IB, CAIS, Montessori, Waldorf, etc.)

**Notebook 05 — Scoring v1 (Binary Only)**
- Built school vectors using **binary features only**
- Used weighted linear scoring
- Produced explainable rankings
- Limitation: large groups of tied schools due to coarse features

Notebook 06 **does not replace** Notebook 05 — it **extends it**.

---

### What Scoring v2 Adds

Scoring v2 introduces **dense metrics** that capture real-world variation between schools:

- School size (vibe)
- Student–teacher ratio (attention)
- Grade coverage (logistics)
- Diversity environment (where available)

Key properties:
- All dense metrics are normalized to **[0.0, 1.0]**
- Higher values always mean “better”
- Metrics act as **tie-breakers**, not primary decision drivers

---

### Weight Budget Philosophy

A core design constraint in v2 is the **weight budget**.

Binary tags (e.g., IB, CAIS) represent **intentional parent preferences**  
Dense metrics represent **secondary refinements**

Rules enforced:
- Dense metrics receive **lower weights**
- Total dense-metric contribution is capped (e.g. 2–4 points)
- No dense metric can outweigh a major program tag

This ensures:
> A “perfect size” school can never outrank a school that meets a parent’s core pedagogical requirement.

---

### What Scoring v2 Explicitly Does NOT Do

To maintain interpretability and project scope, v2 does **not**:

- Train machine-learning models
- Learn weights from data
- Use outcome labels (test scores, rankings)
- Optimize for prediction accuracy

All weights remain **hand-set, reviewable, and adjustable**.

---

### Design Principles

Scoring v2 adheres to the following principles:

1. **Deterministic**  
   Same inputs → same outputs

2. **Explainable**  
   Every score decomposes into feature-level contributions

3. **Auditable**  
   Feature values, weights, and budgets are inspectable

4. **ML-Ready**  
   Normalized dense signals are structured for future learning

---

### Output of This Notebook

By the end of Notebook 06, we will produce:

- `schools_master_v2.csv` (or parquet)
- `feature_config_master_v2.json`
- `school_matrix_v2.npy`
- Ranking outputs using v2 scoring
- Validation artifacts demonstrating:
  - Tie reduction
  - Weight-budget enforcement


## 01. Load Inputs

In this section, we load all datasets required for Scoring v2 and perform **early validation**.

The goal here is not transformation, but **safety**:
- Ensure the Golden Record exists and is complete
- Ensure join keys are present and unique
- Fail fast if anything critical is missing

No scoring or metric computation happens in this section.


In [123]:
import pandas as pd
from pathlib import Path

# Paths
DATA_DIR = Path("../data")
RAW_DIR = DATA_DIR / "raw"
PROCESSED_DIR = DATA_DIR / "processed"

# Load Golden Record (v1)
schools = pd.read_csv(PROCESSED_DIR / "schools_master_v1.csv")

print("Loaded schools_master_v1")
print("Shape:", schools.shape)

# Critical validation
assert "school_id" in schools.columns, "Missing school_id"
assert schools["school_id"].is_unique, "Duplicate school_id detected"

schools.head()


Loaded schools_master_v1
Shape: (124619, 16)


Unnamed: 0,school_id,school_name,city,state,zip,address,join_key,is_public,is_private,has_ams_montessori,has_cais,has_ccd,has_crdc,has_ib,has_pss,has_waldorf
0,PUB_10000500870,albertville middle school,albertville,AL,35950,600 e alabama ave,albertville middle school|albertville|AL,True,False,False,False,True,False,False,False,False
1,PUB_10000500871,albertville high school,albertville,AL,35950,402 e mccord ave,albertville high school|albertville|AL,True,False,False,False,True,False,False,False,False
2,PUB_10000500879,albertville intermediate school,albertville,AL,35950,901 w mckinney ave,albertville intermediate school|albertville|AL,True,False,False,False,True,False,False,False,False
3,PUB_10000500889,albertville elementary school,albertville,AL,35950,145 west end drive,albertville elementary school|albertville|AL,True,False,False,False,True,False,False,False,False
4,PUB_10000501616,albertville kindergarten and prek,albertville,AL,35951,257 country club rd,albertville kindergarten and prek|albertville|AL,True,False,False,False,True,False,False,False,False


### 01.4 Derive authoritative join keys from `school_id`

Our Golden Record v1 currently does not carry raw source IDs (`ncessch`, `ppin`) as separate columns.

To compute dense metrics from CCD / PSS / CRDC in v2, we need stable join keys.
We will derive:
- `ncessch` for public schools from `school_id` (pattern: `PUB_<ncessch>`)
- `ppin` for private schools from `school_id` if present (pattern: `PRI_<ppin>`)

Then we can validate uniqueness + coverage before joining raw tables.


In [125]:
import numpy as np

# Derive keys from school_id (safe even if some rows don't match)
schools["ncessch"] = np.where(
    schools["school_id"].str.startswith("PUB_"),
    schools["school_id"].str.replace("PUB_", "", regex=False),
    np.nan
)

schools["ppin"] = np.where(
    schools["school_id"].str.startswith("PRI_"),
    schools["school_id"].str.replace("PRI_", "", regex=False),
    np.nan
)

# Basic validation (fail-fast)
assert schools["school_id"].is_unique, "Duplicate school_id detected"

# These are not fatal, but you want to see the coverage
pub_coverage = schools.loc[schools["is_public"] == True, "ncessch"].notna().mean()
pri_coverage = schools.loc[schools["is_private"] == True, "ppin"].notna().mean()

print(f"Derived ncessch coverage for public schools: {pub_coverage:.3%}")
print(f"Derived ppin coverage for private schools: {pri_coverage:.3%}")

schools[["school_id", "is_public", "is_private", "ncessch", "ppin"]].head(10)


Derived ncessch coverage for public schools: 100.000%
Derived ppin coverage for private schools: 100.000%


Unnamed: 0,school_id,is_public,is_private,ncessch,ppin
0,PUB_10000500870,True,False,10000500870,
1,PUB_10000500871,True,False,10000500871,
2,PUB_10000500879,True,False,10000500879,
3,PUB_10000500889,True,False,10000500889,
4,PUB_10000501616,True,False,10000501616,
5,PUB_10000502150,True,False,10000502150,
6,PUB_10000600193,True,False,10000600193,
7,PUB_10000600872,True,False,10000600872,
8,PUB_10000600876,True,False,10000600876,
9,PUB_10000600877,True,False,10000600877,


### 01.5 Load CCD component tables (normalized raw data)

The NCES CCD dataset is stored in normalized form across multiple tables
(e.g., directory, staff, characteristics).

For Scoring v2, we will:
- Load only the CCD tables required for dense metrics
- Treat these tables as authoritative raw sources
- Defer any merging until metric-specific sections

At this stage, we only load and validate availability.


In [127]:
# CCD component tables (normalized structure)
ccd_dir = pd.read_csv(RAW_DIR / "ccd" / "ccd_directory.csv", low_memory=False)
ccd_staff = pd.read_csv(RAW_DIR / "ccd" / "ccd_staff.csv")
ccd_chars = pd.read_csv(RAW_DIR / "ccd" / "ccd_school_characteristics.csv")

print("CCD Directory:", ccd_dir.shape)
print("CCD Staff:", ccd_staff.shape)
print("CCD Characteristics:", ccd_chars.shape)

# --- Standardize join key name ---
def standardize_ncessch(df: pd.DataFrame) -> pd.DataFrame:
    if "NCESSCH" in df.columns:
        df = df.rename(columns={"NCESSCH": "ncessch"})
    return df

ccd_dir = standardize_ncessch(ccd_dir)
ccd_staff = standardize_ncessch(ccd_staff)
ccd_chars = standardize_ncessch(ccd_chars)

# --- Validate join key presence (now safe) ---
for name, df in {
    "ccd_directory": ccd_dir,
    "ccd_staff": ccd_staff,
    "ccd_school_characteristics": ccd_chars,
}.items():
    assert "ncessch" in df.columns, f"{name} missing ncessch"

print("CCD join keys standardized and validated")


CCD Directory: (102274, 65)
CCD Staff: (100458, 15)
CCD Characteristics: (100458, 17)
CCD join keys standardized and validated


## 02. Metric Design Rules (v2 Contracts)

Before computing any dense metrics, we define a **strict contract** that every
metric in Scoring v2 must follow.

These rules exist to:
- Prevent accidental dominance by noisy metrics
- Ensure comparability across schools
- Preserve explainability and auditability
- Support future ML without breaking determinism

Once defined here, these rules are treated as **non-negotiable constraints**
for the rest of the notebook.

---

### 02.1 Dense Metric Contract (Formal Definition)

Every dense metric must satisfy **all** of the following:

1. **Output range**
   - Metric score must be in **[0.0, 1.0]**
   - 0.0 = worst observed / least favorable
   - 1.0 = best observed / most favorable

2. **Directionality**
   - Higher is always better
   - If a raw signal is “lower is better” (e.g., student–teacher ratio),
     it must be inverted during normalization

3. **Explicit missing-value handling**
   - Missing values are never silently filled
   - Each metric must document one of:
     - neutral default (e.g., 0.5)
     - pessimistic default (e.g., 0.0)
     - exclusion from scoring
   - The choice is intentional and justified

4. **Clipping and outlier control**
   - Extreme values are clipped (hard cap or percentile)
   - Prevents a small number of outliers from collapsing the distribution

5. **Source transparency**
   - Each metric explicitly declares:
     - raw source table(s)
     - join key(s)
     - population coverage (public / private / both)

6. **Preserve raw values for explainability**
   - In addition to the normalized score, each metric must retain its
     underlying raw value
   - Raw values are **never used for scoring**
   - Raw values exist solely for:
     - UI explanations
     - auditability
     - debugging
   - Example:
     - `score_attention = 0.82`
     - `raw_student_teacher_ratio = 12.3`

---

### 02.2 Metric Roles in Scoring v2

Dense metrics act as **tie-breakers**, not primary decision drivers.

- Binary tags encode *explicit parent intent*
  (e.g., IB, CAIS, Montessori)
- Dense metrics encode *secondary refinement*
  (e.g., vibe, attention, environment)

This separation is enforced later through **weight budgets**.

---

### 02.3 Public vs Private Availability

Not all metrics apply to all schools.

- Some metrics are **public-only** (e.g., CCD-based)
- Some are **private-only**
- Some apply to **both**

Missingness due to data availability is:
- expected
- measured
- never silently penalized

---

### 02.4 Why We Normalize Early

Normalization happens **before** weighting and scoring.

Benefits:
- Comparable feature scales
- Interpretable weights
- Cleaner explainability
- ML-ready feature space

Once normalized, dense metrics behave like **soft tags**
rather than raw measurements.


In [129]:
# Contract validation helper for dense metrics
def validate_dense_metric(score_series: pd.Series, raw_series: pd.Series):
    """
    Enforces the Scoring v2 dense metric contract.
    """
    assert score_series.min() >= 0.0, "Score below 0.0"
    assert score_series.max() <= 1.0, "Score above 1.0"
    assert score_series.notna().any(), "Score has no valid values"
    assert raw_series.notna().any(), "Raw metric has no valid values"

# NOTE:
# - Scoring logic will use only score_*
# - raw_* is preserved for explainability and UI
# - This function will be reused after each metric is computed


### 03.1 School Size (“Vibe”) — Private Enrollment + Public Teachers Proxy

Our current CCD tables do not include student enrollment counts.  
To still create a meaningful “size/vibe” tie-breaker:

- **Private schools:** use PSS `NUMSTUDS` as `raw_enrollment_private`
- **Public schools:** use CCD Staff `TEACHERS` as a size proxy (`raw_teachers_public`)

We then compute a unified `raw_size_value` and normalize it into:
- `score_size` in [0,1], where **higher = smaller school**

We also preserve raw values for explainability:
- `raw_enrollment_private`, `raw_teachers_public`, `raw_size_value`, `raw_size_source`


In [131]:
# ============================================================
#  CLEANUP: drop any previously created raw columns
# ============================================================
cols_to_drop = [
    "raw_enrollment_private",
    "raw_teachers_public",
    "raw_enrollment",
    "raw_teachers",
    "raw_size_value",
    "raw_size_source",
    "score_size",
]

schools = schools.drop(columns=[c for c in cols_to_drop if c in schools.columns])
# ============================================================


# ---------- helpers ----------
def minmax_01(x: pd.Series) -> pd.Series:
    lo, hi = x.min(), x.max()
    if pd.isna(lo) or pd.isna(hi) or hi == lo:
        return pd.Series(0.5, index=x.index)
    return (x - lo) / (hi - lo)

def clip_percentiles(x: pd.Series, p_low=0.01, p_high=0.99) -> pd.Series:
    lo = x.quantile(p_low)
    hi = x.quantile(p_high)
    return x.clip(lower=lo, upper=hi)


# ---------- 0) Load PSS (private enrollment) ----------
pss_path = RAW_DIR / "enrichment" / "pss2122.csv"
pss = pd.read_csv(pss_path, usecols=["PPIN", "NUMSTUDS"], low_memory=False)

pss = pss.rename(columns={"PPIN": "ppin"})
pss["ppin"] = pss["ppin"].astype(str).str.strip()
pss["NUMSTUDS"] = pd.to_numeric(pss["NUMSTUDS"], errors="coerce")

pss_enr = pss.rename(columns={"NUMSTUDS": "raw_enrollment_private"})


# ---------- 1) Standardize join keys ----------
schools["ncessch"] = schools["ncessch"].astype(str).str.strip()
schools["ppin"] = schools["ppin"].astype(str).str.strip()
ccd_staff["ncessch"] = ccd_staff["ncessch"].astype(str).str.strip()


# ---------- 2) Public teachers proxy ----------
ccd_teach = ccd_staff[["ncessch", "TEACHERS"]].copy()
ccd_teach["TEACHERS"] = pd.to_numeric(ccd_teach["TEACHERS"], errors="coerce")
ccd_teach = ccd_teach.rename(columns={"TEACHERS": "raw_teachers_public"})


# ---------- 3) Merge raw values ----------
schools = schools.merge(pss_enr, on="ppin", how="left")
schools = schools.merge(ccd_teach, on="ncessch", how="left")


# ---------- 4) Choose size basis ----------
schools["raw_enrollment"] = schools["raw_enrollment_private"]
schools["raw_teachers"] = schools["raw_teachers_public"]

schools["raw_size_value"] = np.where(
    schools["raw_enrollment"].notna(),
    schools["raw_enrollment"],
    schools["raw_teachers"]
)

schools["raw_size_source"] = np.where(
    schools["raw_enrollment"].notna(),
    "enrollment",
    np.where(schools["raw_teachers"].notna(), "teachers_proxy", np.nan)
)


# ---------- 5) Normalize -> score_size ----------
raw = schools["raw_size_value"]
mask = raw.notna() & (raw >= 0)

log_x = pd.Series(np.nan, index=schools.index)
log_x.loc[mask] = np.log1p(raw.loc[mask])
log_x.loc[mask] = clip_percentiles(log_x.loc[mask], 0.01, 0.99)

norm = pd.Series(np.nan, index=schools.index)
norm.loc[mask] = minmax_01(log_x.loc[mask])

schools["score_size"] = np.nan
schools.loc[mask, "score_size"] = 1.0 - norm.loc[mask]


# ---------- 6) Validate ----------
validate_dense_metric(
    schools.loc[mask, "score_size"],
    schools.loc[mask, "raw_size_value"]
)

print("score_size built")
print("Coverage:", f"{mask.mean():.3%}")
print("Size source counts:\n", schools["raw_size_source"].value_counts(dropna=False).head(10))
schools[["school_id","raw_size_source","raw_size_value","score_size"]].head()


score_size built
Coverage: 96.430%
Size source counts:
 raw_size_source
teachers_proxy    97825
enrollment        22345
nan                4449
Name: count, dtype: int64


Unnamed: 0,school_id,raw_size_source,raw_size_value,score_size
0,PUB_10000500870,teachers_proxy,43.0,0.399235
1,PUB_10000500871,teachers_proxy,91.0,0.282136
2,PUB_10000500879,teachers_proxy,42.0,0.402885
3,PUB_10000500889,teachers_proxy,48.5,0.380536
4,PUB_10000501616,teachers_proxy,30.0,0.454832


### 03.2 Student–Teacher Ratio (“Attention”)

**Intent**
- Lower student–teacher ratio generally means more individual attention.
- We define attention so that **higher = better**.

**Raw value preserved for UI**
- `raw_student_teacher_ratio` (students per teacher)

**Normalized score**
- `score_attention` in [0.0, 1.0]
- Higher score = lower ratio (more attention)

**Data availability (current state)**
- Private schools: PSS provides both `NUMSTUDS` and `NUMTEACH` → compute ratio directly
- Public schools: CCD tables loaded so far provide `TEACHERS` but not enrollment → ratio unavailable for now

**Method**
1. Compute ratio where possible
2. Clip outliers (1st–99th percentile)
3. Normalize to [0,1]
4. Invert so lower ratio gets higher score
5. Validate against v2 contract


In [133]:
# ============================================================
# 03.1 Dense Metric: School Size ("Vibe") — UNIT-ALIGNED + DUAL SCORE
# ============================================================

# ---------- Re-run safety: drop prior outputs from this metric ----------
cols_to_drop = [
    "raw_enrollment_private",
    "raw_teachers_public",
    "raw_enrollment",
    "raw_teachers",
    "raw_students_est_public",
    "raw_size_value",
    "raw_size_source",
    "score_size_small",
    "score_size_large",
]
schools = schools.drop(columns=[c for c in cols_to_drop if c in schools.columns])

# ---------- CONSTANTS ----------
# Heuristic: if we only have teacher count, estimate students.
# National avg is ~16:1; HS can be ~20:1. We use 20 to avoid classifying mid-sized schools as "tiny".
TEACHER_TO_STUDENT_MULTIPLIER = 20.0

# ---------- Helpers (assumed defined earlier in notebook; kept here for clarity) ----------
# clip_percentiles(x, p_low=0.01, p_high=0.99)
# minmax_01(x)
# validate_dense_metric(score_series, raw_series)

# ---------- 0) Load minimal PSS columns (private enrollment) ----------
pss_path = RAW_DIR / "enrichment" / "pss2122.csv"
pss = pd.read_csv(pss_path, usecols=["PPIN", "NUMSTUDS"], low_memory=False)

pss = pss.rename(columns={"PPIN": "ppin"})
pss["ppin"] = pss["ppin"].astype(str).str.strip()
pss["NUMSTUDS"] = pd.to_numeric(pss["NUMSTUDS"], errors="coerce")

pss_enr = pss.rename(columns={"NUMSTUDS": "raw_enrollment_private"})

# ---------- 1) Standardize join keys ----------
schools["ncessch"] = schools["ncessch"].astype(str).str.strip()
schools["ppin"] = schools["ppin"].astype(str).str.strip()
ccd_staff["ncessch"] = ccd_staff["ncessch"].astype(str).str.strip()

# ---------- 2) Build public teachers proxy (CCD Staff) ----------
assert "TEACHERS" in ccd_staff.columns, "ccd_staff missing TEACHERS"
ccd_teach = ccd_staff[["ncessch", "TEACHERS"]].copy()
ccd_teach["TEACHERS"] = pd.to_numeric(ccd_teach["TEACHERS"], errors="coerce")
ccd_teach = ccd_teach.rename(columns={"TEACHERS": "raw_teachers_public"})

# ---------- 3) Merge raw inputs ----------
schools = schools.merge(pss_enr, on="ppin", how="left")
schools = schools.merge(ccd_teach, on="ncessch", how="left")

# Preserve raw fields for UI
schools["raw_enrollment"] = schools["raw_enrollment_private"]        # students (private)
schools["raw_teachers"] = schools["raw_teachers_public"]             # teachers (public)

# Unit alignment: convert teacher proxy into estimated student scale
schools["raw_students_est_public"] = schools["raw_teachers"] * TEACHER_TO_STUDENT_MULTIPLIER

# Choose size basis (students-scale comparable across populations)
schools["raw_size_value"] = np.where(
    schools["raw_enrollment"].notna(),
    schools["raw_enrollment"],            # private: real students
    schools["raw_students_est_public"]    # public: estimated students
)

schools["raw_size_source"] = np.where(
    schools["raw_enrollment"].notna(),
    "enrollment",
    np.where(schools["raw_teachers"].notna(), f"teachers_est_x{int(TEACHER_TO_STUDENT_MULTIPLIER)}", np.nan)
)

# ---------- 4) Normalize -> dual scores ----------
raw = pd.to_numeric(schools["raw_size_value"], errors="coerce")
mask = raw.notna() & (raw >= 0)

# Transform: log compress + clip outliers
log_x = pd.Series(np.nan, index=schools.index, dtype=float)
log_x.loc[mask] = np.log1p(raw.loc[mask])
log_x.loc[mask] = clip_percentiles(log_x.loc[mask], 0.01, 0.99)

# Normalize to [0,1]
norm = pd.Series(np.nan, index=schools.index, dtype=float)
norm.loc[mask] = minmax_01(log_x.loc[mask])

# Initialize outputs (prevents rerun artifacts)
schools["score_size_large"] = np.nan   # 1.0 = big
schools["score_size_small"] = np.nan   # 1.0 = small / intimate

# Dual scores
schools.loc[mask, "score_size_large"] = norm.loc[mask]
schools.loc[mask, "score_size_small"] = 1.0 - norm.loc[mask]

# ---------- 5) Contract validation ----------
validate_dense_metric(schools.loc[mask, "score_size_small"], schools.loc[mask, "raw_size_value"])
validate_dense_metric(schools.loc[mask, "score_size_large"], schools.loc[mask, "raw_size_value"])

# ---------- 6) Audit prints ----------
print("03.1 School Size metrics built (unit-aligned + dual scores)")
print("Coverage:", f"{mask.mean():.3%}")
print("Source Distribution:\n", schools["raw_size_source"].value_counts(dropna=False).head(10))

# Quick sanity: public vs private medians on the same (students-like) scale
print("\nSanity (students-scale) medians:")
print("Median public raw_size_value:", schools.loc[schools["is_public"] == True, "raw_size_value"].median())
print("Median private raw_size_value:", schools.loc[schools["is_private"] == True, "raw_size_value"].median())

cols = [
    "school_id", "is_public", "is_private",
    "raw_size_source", "raw_enrollment", "raw_teachers", "raw_students_est_public",
    "raw_size_value", "score_size_small", "score_size_large"
]
schools.loc[mask, cols].head(8)


03.1 School Size metrics built (unit-aligned + dual scores)
Coverage: 96.430%
Source Distribution:
 raw_size_source
teachers_est_x20    97825
enrollment          22345
nan                  4449
Name: count, dtype: int64

Sanity (students-scale) medians:
Median public raw_size_value: 560.0
Median private raw_size_value: 82.0


Unnamed: 0,school_id,is_public,is_private,raw_size_source,raw_enrollment,raw_teachers,raw_students_est_public,raw_size_value,score_size_small,score_size_large
0,PUB_10000500870,True,False,teachers_est_x20,,43.0,860.0,860.0,0.138033,0.861967
1,PUB_10000500871,True,False,teachers_est_x20,,91.0,1820.0,1820.0,0.042495,0.957505
2,PUB_10000500879,True,False,teachers_est_x20,,42.0,840.0,840.0,0.14103,0.85897
3,PUB_10000500889,True,False,teachers_est_x20,,48.5,970.0,970.0,0.122697,0.877303
4,PUB_10000501616,True,False,teachers_est_x20,,30.0,600.0,600.0,0.183885,0.816115
5,PUB_10000502150,True,False,teachers_est_x20,,60.0,1200.0,1200.0,0.095583,0.904417
6,PUB_10000600193,True,False,teachers_est_x20,,21.03,420.6,420.6,0.229105,0.770895
7,PUB_10000600872,True,False,teachers_est_x20,,36.27,725.4,725.4,0.159715,0.840285


### 03.2 Student–Teacher Ratio (“Attention”) — v2 (Private-only, honest missingness)

**Intent**
- Lower student–teacher ratio usually means more individual attention.
- We normalize into a dense metric:
  - `score_attention` in [0,1]
  - **Higher = better attention** (lower ratio)

**Data availability**
- Private schools (PSS): we have both students and teachers:
  - `NUMSTUDS` (students)
  - `NUMTEACH` (teachers)
- Public schools (CCD): we currently have `TEACHERS`, but not enrollment.
  - Therefore, we do **not** compute a public ratio in v2.
  - Public `score_attention` remains **missing**, explicitly.

**Raw values preserved for UI**
- `raw_students_private`, `raw_teachers_private`
- `raw_student_teacher_ratio` (students per teacher)

**Normalization method**
1. Compute ratio = students / teachers (ignore invalid: teachers <= 0)
2. Clip extreme ratios (1st–99th percentile)
3. Normalize to [0,1]
4. Invert so **lower ratio → higher score**


In [135]:
# ---------- Re-run safety: drop prior outputs from this metric ----------
cols_to_drop = [
    "raw_students_private",
    "raw_teachers_private",
    "raw_student_teacher_ratio",
    "raw_attention_source",
    "score_attention",
]
schools = schools.drop(columns=[c for c in cols_to_drop if c in schools.columns])

# ---------- Load minimal PSS columns needed for attention ----------
pss_path = RAW_DIR / "enrichment" / "pss2122.csv"
pss_attn = pd.read_csv(pss_path, usecols=["PPIN", "NUMSTUDS", "NUMTEACH"], low_memory=False)

# Standardize join key
pss_attn = pss_attn.rename(columns={"PPIN": "ppin"})
pss_attn["ppin"] = pss_attn["ppin"].astype(str).str.strip()

# Numeric coercion
pss_attn["NUMSTUDS"] = pd.to_numeric(pss_attn["NUMSTUDS"], errors="coerce")
pss_attn["NUMTEACH"] = pd.to_numeric(pss_attn["NUMTEACH"], errors="coerce")

# Rename to raw_* fields
pss_attn = pss_attn.rename(columns={
    "NUMSTUDS": "raw_students_private",
    "NUMTEACH": "raw_teachers_private",
})

# ---------- Join into schools ----------
schools["ppin"] = schools["ppin"].astype(str).str.strip()
schools = schools.merge(pss_attn, on="ppin", how="left")

# ---------- Compute raw ratio (private only) ----------
# ratio = students per teacher; invalid if teachers <= 0
ratio = schools["raw_students_private"] / schools["raw_teachers_private"]
ratio = ratio.where(schools["raw_teachers_private"] > 0)

schools["raw_student_teacher_ratio"] = ratio
schools["raw_attention_source"] = np.where(
    schools["raw_student_teacher_ratio"].notna(),
    "pss_private_ratio",
    np.nan
)

# ---------- Normalize -> score_attention (higher = better attention) ----------
mask = schools["raw_student_teacher_ratio"].notna() & (schools["raw_student_teacher_ratio"] > 0)

# Clip outliers on ratio (prevents a few crazy values dominating)
ratio_clip = schools.loc[mask, "raw_student_teacher_ratio"].copy()
ratio_clip = clip_percentiles(ratio_clip, 0.01, 0.99)

# Normalize ratio to [0,1] (0=lowest ratio, 1=highest ratio), then invert
norm = minmax_01(ratio_clip)

schools["score_attention"] = np.nan
schools.loc[mask, "score_attention"] = 1.0 - norm  # lower ratio -> higher score

# ---------- Contract validation + audit ----------
validate_dense_metric(schools.loc[mask, "score_attention"], schools.loc[mask, "raw_student_teacher_ratio"])

print("03.2 Attention metric built (private-only)")
print("Coverage (all schools):", f"{mask.mean():.3%}")
print("Coverage among private schools:", f"{mask[schools['is_private'] == True].mean():.3%}")

# A few examples
cols = [
    "school_id", "is_private",
    "raw_students_private", "raw_teachers_private",
    "raw_student_teacher_ratio", "score_attention"
]
schools.loc[mask, cols].head(8)


03.2 Attention metric built (private-only)
Coverage (all schools): 17.931%
Coverage among private schools: 100.000%


Unnamed: 0,school_id,is_private,raw_students_private,raw_teachers_private,raw_student_teacher_ratio,score_attention
102274,PRI_00000033,True,107.0,12.6,8.492063,0.750677
102275,PRI_00000044,True,392.0,42.3,9.267139,0.725301
102276,PRI_00000055,True,177.0,15.6,11.346154,0.657236
102277,PRI_00000077,True,376.0,33.6,11.190476,0.662333
102278,PRI_00000088,True,336.0,24.0,14.0,0.570352
102279,PRI_00000124,True,92.0,9.0,10.222222,0.694033
102280,PRI_00000135,True,104.0,12.3,8.455285,0.751881
102281,PRI_00000146,True,179.0,10.5,17.047619,0.470575


### 03.3 Grade Coverage (“Logistics”)

**Intent**
A school’s grade span affects family logistics:
- Fewer transitions (e.g., K–8) can be easier than switching schools at 6th grade.
- We convert grade coverage into simple, explainable flags.

**Data sources**
- Public: CCD Directory grade-offered flags (e.g., `G_KG_OFFERED`, `G_1_OFFERED`, … `G_12_OFFERED`)
- Private: PSS low/high grade fields (`LOGR2022`, `HIGR2022`) if available

**Outputs (explainable + ML-ready)**
- `serves_elementary` (K–5)
- `serves_middle` (6–8)
- `serves_high` (9–12)
- `grade_span_min` and `grade_span_max` (normalized grade codes, where available)
- `raw_grade_source` in {"ccd_offered_flags", "pss_span_2022", NaN}

Notes:
- These are “dense-adjacent” structural features (not continuous scores).
- Missingness is expected; we preserve it honestly.


In [137]:
# ---------- Re-run safety: drop prior outputs from this section ----------
cols_to_drop = [
    "serves_elementary",
    "serves_middle",
    "serves_high",
    "grade_span_min",
    "grade_span_max",
    "raw_grade_source",
]
schools = schools.drop(columns=[c for c in cols_to_drop if c in schools.columns])

# ---------- Helpers ----------
# Normalize CCD "grade offered" flags: treat 1/"Yes"/"Y"/True as offered
def flag_true(s: pd.Series) -> pd.Series:
    if s.dtype == bool:
        return s
    x = s.astype(str).str.strip().str.upper()
    return x.isin(["1", "Y", "YES", "TRUE", "T"])

# Convert PSS grade codes to a comparable numeric scale
# Common codes: "KG", "PK", "UG", "AE", or numeric strings ("01".."12").
# We map:
#   PK -> -1, KG -> 0, 01..12 -> 1..12
# Anything else -> NaN (we ignore for v2)
def normalize_grade_code(x) -> float:
    if pd.isna(x):
        return np.nan
    s = str(x).strip().upper()
    if s in ["PK", "PREK"]:
        return -1.0
    if s in ["KG", "K"]:
        return 0.0
    # handle "01".."12" or "1".."12"
    try:
        return float(int(s))
    except Exception:
        return np.nan

# ---------- 1) Public: derive from CCD directory grade-offered flags ----------
# Ensure join key dtype
schools["ncessch"] = schools["ncessch"].astype(str).str.strip()
ccd_dir["ncessch"] = ccd_dir["ncessch"].astype(str).str.strip()

# Confirm expected columns exist (based on your earlier column list)
needed = ["G_KG_OFFERED", "G_1_OFFERED", "G_5_OFFERED", "G_6_OFFERED", "G_8_OFFERED", "G_9_OFFERED", "G_12_OFFERED"]
missing = [c for c in needed if c not in ccd_dir.columns]
assert not missing, f"ccd_directory missing expected grade-offered columns: {missing}"

ccd_g = ccd_dir[["ncessch"] + needed].copy()

# Convert to boolean "offered"
for c in needed:
    ccd_g[c] = flag_true(ccd_g[c])

# Build public grade flags
ccd_g["serves_elementary"] = ccd_g[["G_KG_OFFERED","G_1_OFFERED","G_5_OFFERED"]].any(axis=1)
ccd_g["serves_middle"]     = ccd_g[["G_6_OFFERED","G_8_OFFERED"]].any(axis=1)
ccd_g["serves_high"]       = ccd_g[["G_9_OFFERED","G_12_OFFERED"]].any(axis=1)

# Optional span bounds from GSLO/GSHI if you want them (present in your ccd_dir)
span_cols = []
if "GSLO" in ccd_dir.columns and "GSHI" in ccd_dir.columns:
    span_cols = ["GSLO", "GSHI"]
    ccd_span = ccd_dir[["ncessch", "GSLO", "GSHI"]].copy()
    ccd_span["grade_span_min"] = ccd_span["GSLO"].apply(normalize_grade_code)
    ccd_span["grade_span_max"] = ccd_span["GSHI"].apply(normalize_grade_code)
    ccd_span = ccd_span[["ncessch", "grade_span_min", "grade_span_max"]]
else:
    ccd_span = pd.DataFrame({"ncessch": ccd_g["ncessch"]})

# Merge public outputs
ccd_out = ccd_g[["ncessch", "serves_elementary", "serves_middle", "serves_high"]].merge(
    ccd_span, on="ncessch", how="left"
)
ccd_out["raw_grade_source"] = "ccd_offered_flags"

schools = schools.merge(ccd_out, on="ncessch", how="left")

# ---------- 2) Private: derive from PSS LOGR/HIGR if available (fill only where missing) ----------
pss_path = RAW_DIR / "enrichment" / "pss2122.csv"
pss_cols = ["PPIN", "LOGR2022", "HIGR2022"]
pss_present = all(c in pd.read_csv(pss_path, nrows=0).columns for c in pss_cols)

if pss_present:
    pss_span = pd.read_csv(pss_path, usecols=pss_cols, low_memory=False).rename(columns={"PPIN": "ppin"})
    pss_span["ppin"] = pss_span["ppin"].astype(str).str.strip()
    pss_span["grade_span_min_pss"] = pss_span["LOGR2022"].apply(normalize_grade_code)
    pss_span["grade_span_max_pss"] = pss_span["HIGR2022"].apply(normalize_grade_code)

    # Derive serves_* from span (if min/max exist)
    def span_overlaps(min_g, max_g, lo, hi):
        if pd.isna(min_g) or pd.isna(max_g):
            return False
        return not (max_g < lo or min_g > hi)

    pss_span["serves_elementary_pss"] = pss_span.apply(lambda r: span_overlaps(r["grade_span_min_pss"], r["grade_span_max_pss"], 0, 5), axis=1)
    pss_span["serves_middle_pss"]     = pss_span.apply(lambda r: span_overlaps(r["grade_span_min_pss"], r["grade_span_max_pss"], 6, 8), axis=1)
    pss_span["serves_high_pss"]       = pss_span.apply(lambda r: span_overlaps(r["grade_span_min_pss"], r["grade_span_max_pss"], 9, 12), axis=1)

    pss_span = pss_span[[
        "ppin",
        "grade_span_min_pss", "grade_span_max_pss",
        "serves_elementary_pss", "serves_middle_pss", "serves_high_pss"
    ]]

    schools["ppin"] = schools["ppin"].astype(str).str.strip()
    schools = schools.merge(pss_span, on="ppin", how="left")

    # Fill only where CCD is missing (private schools won’t have CCD anyway)
    for base, pss_col in [
        ("grade_span_min", "grade_span_min_pss"),
        ("grade_span_max", "grade_span_max_pss"),
        ("serves_elementary", "serves_elementary_pss"),
        ("serves_middle", "serves_middle_pss"),
        ("serves_high", "serves_high_pss"),
    ]:
        schools[base] = schools[base].where(schools[base].notna(), schools[pss_col])

    # Set source for those filled via PSS (only where CCD source is null)
    schools["raw_grade_source"] = schools["raw_grade_source"].where(
        schools["raw_grade_source"].notna(),
        np.where(schools["grade_span_min_pss"].notna() | schools["grade_span_max_pss"].notna(), "pss_span_2022", np.nan)
    )

    # Cleanup helper columns
    drop_helpers = [c for c in schools.columns if c.endswith("_pss")]
    schools = schools.drop(columns=drop_helpers)

else:
    print("PSS LOGR2022/HIGR2022 not found; private grade span fill skipped.")

# ---------- 3) Quick audit ----------
print("03.3 Grade coverage built")
print("Coverage (serves_elementary):", f"{schools['serves_elementary'].notna().mean():.3%}")
print("Coverage (serves_middle):", f"{schools['serves_middle'].notna().mean():.3%}")
print("Coverage (serves_high):", f"{schools['serves_high'].notna().mean():.3%}")
print("Source distribution:\n", schools["raw_grade_source"].value_counts(dropna=False).head(10))

schools[[
    "school_id", "is_public", "is_private",
    "grade_span_min", "grade_span_max",
    "serves_elementary", "serves_middle", "serves_high",
    "raw_grade_source"
]].head(10)


03.3 Grade coverage built
Coverage (serves_elementary): 100.000%
Coverage (serves_middle): 100.000%
Coverage (serves_high): 100.000%
Source distribution:
 raw_grade_source
ccd_offered_flags    102274
pss_span_2022         22345
Name: count, dtype: int64


Unnamed: 0,school_id,is_public,is_private,grade_span_min,grade_span_max,serves_elementary,serves_middle,serves_high,raw_grade_source
0,PUB_10000500870,True,False,7.0,8.0,False,True,False,ccd_offered_flags
1,PUB_10000500871,True,False,9.0,12.0,False,False,True,ccd_offered_flags
2,PUB_10000500879,True,False,5.0,6.0,True,True,False,ccd_offered_flags
3,PUB_10000500889,True,False,3.0,4.0,False,False,False,ccd_offered_flags
4,PUB_10000501616,True,False,-1.0,0.0,True,False,False,ccd_offered_flags
5,PUB_10000502150,True,False,1.0,2.0,True,False,False,ccd_offered_flags
6,PUB_10000600193,True,False,5.0,8.0,True,True,False,ccd_offered_flags
7,PUB_10000600872,True,False,6.0,12.0,False,True,True,ccd_offered_flags
8,PUB_10000600876,True,False,-1.0,-1.0,False,False,False,ccd_offered_flags
9,PUB_10000600877,True,False,3.0,5.0,True,False,False,ccd_offered_flags


### 03.4 Diversity Index (“Environment”) — CRDC Enrollment-Based

**Intent**
Measure the *variety* of student backgrounds at a school using race/ethnicity
composition. This reflects the learning environment, not academic quality.

**Data source**
- Public schools: CRDC `Enrollment.csv`
- Private schools: not available → left missing (honest)

**Raw value preserved**
- `raw_diversity_entropy` (Shannon entropy of race proportions)

**Score**
- `score_diversity` in [0,1]
- Higher = more evenly mixed racial composition

**Method**
1. Use total students per race (Male + Female + X)
2. Convert counts → proportions
3. Compute Shannon entropy
4. Normalize by max entropy = log(number of race groups)

Notes:
- Entropy rewards *true variety*, not just “non-white %”
- Missing values are preserved for ranking-time handling


In [139]:
# ---------- Re-run safety ----------
cols_to_drop = [
    "raw_diversity_entropy",
    "score_diversity",
    "raw_diversity_source",
]
schools = schools.drop(columns=[c for c in cols_to_drop if c in schools.columns])

# ---------- Load CRDC Enrollment ----------
crdc_path = RAW_DIR / "crdc" / "Enrollment.csv"
crdc = pd.read_csv(crdc_path, low_memory=False)
print("Loaded CRDC Enrollment:", crdc.shape)

# Standardize join key
# CRDC uses COMBOKEY (12-digit NCES school ID)
crdc["COMBOKEY"] = crdc["COMBOKEY"].astype(str).str.strip()
schools["ncessch"] = schools["ncessch"].astype(str).str.strip()

# ---------- Define race groups ----------
race_bases = ["HI", "WH", "BL", "AS", "AM", "HP", "TR"]

# Build total enrollment per race = M + F + X
race_totals = {}
for r in race_bases:
    cols = [f"SCH_ENR_{r}_M", f"SCH_ENR_{r}_F", f"SCH_ENR_{r}_X"]
    missing = [c for c in cols if c not in crdc.columns]
    if missing:
        raise AssertionError(f"CRDC Enrollment missing columns for {r}: {missing}")

    crdc[cols] = crdc[cols].apply(pd.to_numeric, errors="coerce").fillna(0)
    race_totals[r] = crdc[cols].sum(axis=1)

race_df = pd.DataFrame(race_totals)
race_df["COMBOKEY"] = crdc["COMBOKEY"]

# ---------- Shannon entropy ----------
def shannon_entropy(values: np.ndarray) -> float:
    total = values.sum()
    if total <= 0:
        return np.nan
    p = values / total
    p = p[p > 0]
    if len(p) <= 1:
        return 0.0
    return -np.sum(p * np.log(p))

race_df["raw_diversity_entropy"] = race_df[race_bases].apply(
    lambda r: shannon_entropy(r.values.astype(float)), axis=1
)

# Normalize entropy to [0,1]
K = len(race_bases)
max_entropy = np.log(K)

race_df["score_diversity"] = race_df["raw_diversity_entropy"] / max_entropy
race_df["score_diversity"] = race_df["score_diversity"].clip(0.0, 1.0)
race_df["raw_diversity_source"] = "crdc_enrollment_entropy"

# ---------- Merge into schools ----------
schools = schools.merge(
    race_df[["COMBOKEY", "raw_diversity_entropy", "score_diversity", "raw_diversity_source"]],
    left_on="ncessch",
    right_on="COMBOKEY",
    how="left"
).drop(columns=["COMBOKEY"])

# ---------- Contract validation ----------
mask = schools["score_diversity"].notna()
validate_dense_metric(
    schools.loc[mask, "score_diversity"],
    schools.loc[mask, "raw_diversity_entropy"]
)

# ---------- Audit ----------
print("03.4 Diversity metric built (CRDC Enrollment)")
print("Coverage (all schools):", f"{mask.mean():.3%}")
print("Coverage (public schools):", f"{mask[schools['is_public'] == True].mean():.3%}")
print("Source distribution:\n", schools["raw_diversity_source"].value_counts(dropna=False).head(10))

schools.loc[mask, [
    "school_id", "is_public",
    "raw_diversity_entropy", "score_diversity"
]].head(8)


Loaded CRDC Enrollment: (98010, 233)
03.4 Diversity metric built (CRDC Enrollment)
Coverage (all schools): 56.752%
Coverage (public schools): 69.151%
Source distribution:
 raw_diversity_source
crdc_enrollment_entropy    76687
NaN                        47932
Name: count, dtype: int64


Unnamed: 0,school_id,is_public,raw_diversity_entropy,score_diversity
19090,PUB_100000400012,True,1.188992,0.611021
19091,PUB_100000500013,True,0.135609,0.069689
19092,PUB_100000600017,True,0.035478,0.018232
19093,PUB_100000700070,True,0.963381,0.49508
19094,PUB_100001000079,True,-0.001959,0.0
19095,PUB_100001100091,True,0.835274,0.429246
19096,PUB_100001400106,True,0.202568,0.1041
19097,PUB_100001500112,True,1.196977,0.615125


## 04. Assemble Golden Record v2

**Goal**
Create a single, stable dataset (`schools_master_v2`) that:
- Extends `schools_master_v1` with v2 dense metrics
- Preserves raw values for explainability
- Makes missingness explicit and auditable
- Is safe to use for downstream vectorization and scoring

**Inputs**
- `schools_master_v1` (from Notebook 04)
- Dense metrics built in Section 03:
  - Size: `score_size_small`, `score_size_large`
  - Attention: `score_attention`
  - Logistics: `serves_elementary`, `serves_middle`, `serves_high`, `grade_span_min/max`
  - Diversity: `score_diversity`

**Outputs**
- `schools_master_v2.csv`
- Coverage report (% non-null per metric)
- Quick distribution checks (sanity, not deep EDA)

Notes:
- No NaN filling happens here (that is deferred to matrix build in Section 06)
- This dataset is the **single source of truth** for v2 scoring


In [155]:
# ---------- 1) Define v2 columns to keep ----------
v2_cols = [
    # identifiers
    "school_id", "school_name", "city", "state", "zip",
    "is_public", "is_private",

    # ---- Tags (Notebook 04 / v1 backbone flags) ----
    "has_ib",
    "has_cais",
    "has_ams_montessori",
    "has_waldorf",
    "has_ccd",
    "has_crdc",
    "has_pss",

    # ---- Size (03.1) ----
    "raw_size_value",
    "raw_size_source",
    "score_size_small",
    "score_size_large",

    # ---- Attention (03.2) ----
    "raw_student_teacher_ratio",
    "raw_attention_source",
    "score_attention",

    # ---- Logistics (03.3) ----
    "grade_span_min",
    "grade_span_max",
    "serves_elementary",
    "serves_middle",
    "serves_high",
    "raw_grade_source",

    # ---- Diversity (03.4) ----
    "raw_diversity_entropy",
    "raw_diversity_source",
    "score_diversity",
]

# Keep only columns that actually exist (defensive)
v2_cols = [c for c in v2_cols if c in schools.columns]

# De-dupe while preserving order
v2_cols = list(dict.fromkeys(v2_cols))

schools_v2 = schools[v2_cols].copy()

print("schools_master_v2 shape:", schools_v2.shape)

# ---------- 2) Coverage report ----------
coverage = (
    schools_v2
    .isna()
    .mean()
    .rename("missing_rate")
    .to_frame()
)
coverage["coverage_pct"] = (1.0 - coverage["missing_rate"]) * 100
coverage = coverage.sort_values("coverage_pct", ascending=False)

print("\n=== Coverage Report (% non-missing) ===")
display(coverage[["coverage_pct"]].round(2))

# ---------- 3) Quick distribution sanity checks ----------
print("\n=== Dense Metric Summaries ===")

dense_metrics = [
    "score_size_small",
    "score_size_large",
    "score_attention",
    "score_diversity",
]

for c in dense_metrics:
    if c in schools_v2.columns:
        s = schools_v2[c]
        print(
            f"{c}: "
            f"count={s.notna().sum()}, "
            f"mean={s.mean():.3f}, "
            f"min={s.min():.3f}, "
            f"max={s.max():.3f}"
        )

# ---------- 4) Save artifact ----------
out_path = PROCESSED_DIR / "schools_master_v2.csv"
schools_v2.to_csv(out_path, index=False)

print(f"\nSaved: {out_path}")
print("Contains has_ib?", "has_ib" in schools_v2.columns)

schools_v2.head(10)


schools_master_v2 shape: (124619, 29)

=== Coverage Report (% non-missing) ===


Unnamed: 0,coverage_pct
school_id,100.0
has_waldorf,100.0
raw_grade_source,100.0
serves_high,100.0
serves_middle,100.0
serves_elementary,100.0
raw_attention_source,100.0
school_name,100.0
has_crdc,100.0
has_ccd,100.0



=== Dense Metric Summaries ===
score_size_small: count=120170, mean=0.265, min=0.000, max=1.000
score_size_large: count=120170, mean=0.735, min=0.000, max=1.000
score_attention: count=22345, mean=0.706, min=0.000, max=1.000
score_diversity: count=70724, mean=0.307, min=0.000, max=0.909

Saved: ../data/processed/schools_master_v2.csv
Contains has_ib? True


Unnamed: 0,school_id,school_name,city,state,zip,is_public,is_private,has_ib,has_cais,has_ams_montessori,...,score_attention,grade_span_min,grade_span_max,serves_elementary,serves_middle,serves_high,raw_grade_source,raw_diversity_entropy,raw_diversity_source,score_diversity
0,PUB_10000500870,albertville middle school,albertville,AL,35950,True,False,False,False,False,...,,7.0,8.0,False,True,False,ccd_offered_flags,,,
1,PUB_10000500871,albertville high school,albertville,AL,35950,True,False,False,False,False,...,,9.0,12.0,False,False,True,ccd_offered_flags,,,
2,PUB_10000500879,albertville intermediate school,albertville,AL,35950,True,False,False,False,False,...,,5.0,6.0,True,True,False,ccd_offered_flags,,,
3,PUB_10000500889,albertville elementary school,albertville,AL,35950,True,False,False,False,False,...,,3.0,4.0,False,False,False,ccd_offered_flags,,,
4,PUB_10000501616,albertville kindergarten and prek,albertville,AL,35951,True,False,False,False,False,...,,-1.0,0.0,True,False,False,ccd_offered_flags,,,
5,PUB_10000502150,albertville primary school,albertville,AL,35950,True,False,False,False,False,...,,1.0,2.0,True,False,False,ccd_offered_flags,,,
6,PUB_10000600193,kate duncan smith dar middle,grant,AL,35747,True,False,False,False,False,...,,5.0,8.0,True,True,False,ccd_offered_flags,,,
7,PUB_10000600872,asbury high school,albertville,AL,35951,True,False,False,False,False,...,,6.0,12.0,False,True,True,ccd_offered_flags,,,
8,PUB_10000600876,claysville school,guntersville,AL,35976,True,False,False,False,False,...,,-1.0,-1.0,False,False,False,ccd_offered_flags,,,
9,PUB_10000600877,douglas elementary school,douglas,AL,35964,True,False,False,False,False,...,,3.0,5.0,True,False,False,ccd_offered_flags,,,


## 05. Feature Config v2 (Source of Truth + Weight Budget)

**Goal**
Create a single “source of truth” config (`feature_config_master_v2.json`) that defines:
- which features are used in scoring
- how they map to columns in `schools_master_v2`
- their weights
- **weight budget rules** so dense metrics act as tie-breakers (not dominators)

**Core principle**
- **Tags / dealbreakers** (IB, CAIS, etc.) get higher weights (e.g., 2–5)
- **Dense metrics** (size, attention, diversity) get lower weights (e.g., 0.25–1)
- We enforce a **dense-metric max contribution budget**, e.g. ≤ 3.0 points total

**Important v2 nuance**
- Some dense metrics have missingness (attention, diversity). We keep them as-is here.
- NaN handling happens later in Section 06 (matrix build): NaNs are filled to 0.0 for math safety.


In [157]:
# ---------- Output path (defensive) ----------
try:
    OUT_DIR = ARTIFACTS_DIR
except NameError:
    # Fall back to reports or processed if you don't have ARTIFACTS_DIR
    try:
        OUT_DIR = REPORTS_DIR
    except NameError:
        OUT_DIR = PROCESSED_DIR

OUT_DIR = Path(OUT_DIR)
OUT_DIR.mkdir(parents=True, exist_ok=True)

# ---------- Weight budget targets ----------
DENSE_MAX_POINTS_BUDGET = 3.0  # dense metrics should not dominate tags
TAG_DEFAULT_WEIGHT = 2.0
DENSE_DEFAULT_WEIGHT = 0.75

# ---------- Feature config (v2) ----------
# Notes:
# - Dense metrics are continuous [0,1] by contract.
# - Binary tags are 0/1.
# - Use score_size_small as the primary “vibe” dimension for now.
#   (We keep score_size_large in the dataset, but don't include both unless needed.)
feature_config_v2 = {
    "scoring_method": "weighted_linear",
    "version": "v2",
    "rules": {
        "dense_metric_range": [0.0, 1.0],
        "binary_range": [0, 1],
        "dense_max_points_budget": DENSE_MAX_POINTS_BUDGET,
        "nan_policy": "Allowed in v2 dataset; filled to 0.0 at matrix build time (Section 06)."
    },
    "features": [
        # --------------------------
        # TAGS / DEALBREAKERS (high weight)
        # --------------------------
        {"name": "tag_ib", "type": "binary", "source_col": "has_ib", "weight": 5.0},
        {"name": "tag_cais", "type": "binary", "source_col": "has_cais", "weight": 5.0},
        {"name": "tag_ams_montessori", "type": "binary", "source_col": "has_ams_montessori", "weight": 2.0},
        {"name": "tag_waldorf", "type": "binary", "source_col": "has_waldorf", "weight": 2.0},

        # --------------------------
        # STRUCTURE (medium weight)
        # --------------------------
        {"name": "serves_elementary", "type": "binary", "source_col": "serves_elementary", "weight": 1.0},
        {"name": "serves_middle", "type": "binary", "source_col": "serves_middle", "weight": 1.0},
        {"name": "serves_high", "type": "binary", "source_col": "serves_high", "weight": 1.0},

        # --------------------------
        # DENSE METRICS (low weight, tie-breakers)
        # --------------------------
        {"name": "score_size_small", "type": "dense", "source_col": "score_size_small", "weight": 0.75},
        {"name": "score_attention", "type": "dense", "source_col": "score_attention", "weight": 0.75},
        {"name": "score_diversity", "type": "dense", "source_col": "score_diversity", "weight": 0.75},
    ]
}

# ---------- Save JSON ----------
config_path = OUT_DIR / "feature_config_master_v2.json"
with open(config_path, "w") as f:
    json.dump(feature_config_v2, f, indent=2)

print(f"Saved: {config_path}")

# ---------- Validation: per-feature max contribution + dense budget ----------
def max_contribution(feat: dict) -> float:
    # by contract: dense max=1.0, binary max=1
    max_val = 1.0
    return float(feat["weight"]) * max_val

rows = []
dense_total_max = 0.0
tag_total_max = 0.0

for feat in feature_config_v2["features"]:
    mc = max_contribution(feat)
    rows.append({
        "name": feat["name"],
        "type": feat["type"],
        "source_col": feat["source_col"],
        "weight": feat["weight"],
        "max_contribution": mc
    })
    if feat["type"] == "dense":
        dense_total_max += mc
    else:
        tag_total_max += mc

# Show top contributors (who can dominate if mis-weighted)
df_budget = pd.DataFrame(rows).sort_values("max_contribution", ascending=False)
print("\n=== Top potential contributors (weight × max_value) ===")
display(df_budget.head(15))

print("\n=== Budget summary ===")
print(f"Dense total max points: {dense_total_max:.2f} (budget <= {DENSE_MAX_POINTS_BUDGET:.2f})")
print(f"Non-dense total max points (tags/structure): {tag_total_max:.2f}")

assert dense_total_max <= DENSE_MAX_POINTS_BUDGET + 1e-9, (
    f"Dense metrics can contribute up to {dense_total_max:.2f}, exceeding budget {DENSE_MAX_POINTS_BUDGET:.2f}. "
    "Lower dense weights or reduce dense feature count."
)

print("Weight budget check passed (dense metrics are tie-breakers).")


Saved: ../data/processed/feature_config_master_v2.json

=== Top potential contributors (weight × max_value) ===


Unnamed: 0,name,type,source_col,weight,max_contribution
0,tag_ib,binary,has_ib,5.0,5.0
1,tag_cais,binary,has_cais,5.0,5.0
2,tag_ams_montessori,binary,has_ams_montessori,2.0,2.0
3,tag_waldorf,binary,has_waldorf,2.0,2.0
4,serves_elementary,binary,serves_elementary,1.0,1.0
5,serves_middle,binary,serves_middle,1.0,1.0
6,serves_high,binary,serves_high,1.0,1.0
7,score_size_small,dense,score_size_small,0.75,0.75
8,score_attention,dense,score_attention,0.75,0.75
9,score_diversity,dense,score_diversity,0.75,0.75



=== Budget summary ===
Dense total max points: 2.25 (budget <= 3.00)
Non-dense total max points (tags/structure): 17.00
Weight budget check passed (dense metrics are tie-breakers).


## 06. Build School Matrix v2 (Neutrality-Safe)

**Goal**
Convert `schools_master_v2` into a numeric matrix suitable for fast scoring,
while preserving the meaning of “missing data” correctly.

**Key Design Principle: Neutrality**
- Dense metrics are normalized to [0.0, 1.0]
- Missing data means **“no opinion”**, not “worst”
- Therefore:
  - Binary features: missing → 0.0 (absence of tag)
  - Dense features: missing → **0.5 (neutral / average)**

This avoids systematic penalties for schools where data is unavailable
(e.g., public schools lacking student–teacher ratio data).

**Why this matters**
- Filling dense NaNs with 0.0 would incorrectly mark missing data as “worst”
- Neutral fill ensures schools are neither boosted nor penalized
- This preserves fairness and ranking stability

**Outputs**
- `school_matrix_v2.npy` — numeric feature matrix
- `school_index_v2.csv` — row index ↔ school_id mapping
- `school_vector_explain_v2.json` — feature metadata
- `school_matrix_audit_v2.csv` — transparency + QA artifact

Notes:
- Raw NaNs are preserved in `schools_master_v2`
- Neutral filling happens **only here**, at matrix build time


In [159]:
# ---------- Load inputs ----------
schools_v2 = pd.read_csv(PROCESSED_DIR / "schools_master_v2.csv")
with open(PROCESSED_DIR / "feature_config_master_v2.json", "r") as f:
    cfg = json.load(f)

features = cfg["features"]

print("Loaded schools_master_v2:", schools_v2.shape)
print("Loaded feature_config_v2:", len(features), "features")

# ---------- Output directory ----------
OUT_DIR = PROCESSED_DIR
OUT_DIR.mkdir(parents=True, exist_ok=True)

# ---------- Matrix build ----------
X_cols = []
meta = []
audit_rows = []

for feat in features:
    name = feat["name"]
    ftype = feat["type"]
    col = feat["source_col"]
    weight = float(feat["weight"])

    assert col in schools_v2.columns, f"Missing column: {col}"

    raw = pd.to_numeric(schools_v2[col], errors="coerce")
    missing_rate = raw.isna().mean()

    if ftype == "binary":
        # Binary: missing means absence of tag
        x = raw.fillna(0)
        x = (x != 0).astype(float)

        fill_value = 0.0

        audit_rows.append({
            "feature": name,
            "type": ftype,
            "source_col": col,
            "weight": weight,
            "missing_rate_raw": missing_rate,
            "fill_value": fill_value,
            "mean_after_fill": x.mean(),
            "std_after_fill": x.std(ddof=0),
        })

    elif ftype == "dense":
        # Dense: missing means "no opinion"
        fill_value = feat.get("missing_fill", 0.5)

        x = raw.fillna(fill_value)
        x = x.replace([np.inf, -np.inf], fill_value)

        # Optional safety: clamp to [0,1]
        x = x.clip(0.0, 1.0)

        audit_rows.append({
            "feature": name,
            "type": ftype,
            "source_col": col,
            "weight": weight,
            "missing_rate_raw": missing_rate,
            "fill_value": fill_value,
            "mean_after_fill": x.mean(),
            "std_after_fill": x.std(ddof=0),
        })

    else:
        raise ValueError(f"Unknown feature type: {ftype}")

    X_cols.append(x.to_numpy())
    meta.append({
        "feature": name,
        "type": ftype,
        "source_col": col,
        "weight": weight,
        "missing_fill": fill_value,
    })

# ---------- Stack into matrix ----------
X = np.column_stack(X_cols)
assert np.isfinite(X).all(), "Matrix contains NaN or inf"

print("school_matrix_v2 built:", X.shape)

# ---------- Save artifacts ----------
np.save(OUT_DIR / "school_matrix_v2.npy", X)

school_index = schools_v2[["school_id"]].copy()
school_index["row_index"] = np.arange(len(school_index))
school_index.to_csv(OUT_DIR / "school_index_v2.csv", index=False)

with open(OUT_DIR / "school_vector_explain_v2.json", "w") as f:
    json.dump({"features": meta}, f, indent=2)

audit_df = pd.DataFrame(audit_rows).sort_values(["type", "feature"])
audit_df.to_csv(OUT_DIR / "school_matrix_audit_v2.csv", index=False)

print("Saved matrix + index + explain + audit artifacts")
display(audit_df)


Loaded schools_master_v2: (124619, 29)
Loaded feature_config_v2: 10 features
school_matrix_v2 built: (124619, 10)
Saved matrix + index + explain + audit artifacts


  schools_v2 = pd.read_csv(PROCESSED_DIR / "schools_master_v2.csv")


Unnamed: 0,feature,type,source_col,weight,missing_rate_raw,fill_value,mean_after_fill,std_after_fill
4,serves_elementary,binary,serves_elementary,1.0,0.0,0.0,0.622353,0.484799
6,serves_high,binary,serves_high,1.0,0.0,0.0,0.368002,0.482262
5,serves_middle,binary,serves_middle,1.0,0.0,0.0,0.479903,0.499596
2,tag_ams_montessori,binary,has_ams_montessori,2.0,0.0,0.0,4e-05,0.006334
1,tag_cais,binary,has_cais,5.0,0.0,0.0,0.000586,0.024196
0,tag_ib,binary,has_ib,5.0,0.0,0.0,0.000265,0.016271
3,tag_waldorf,binary,has_waldorf,2.0,0.0,0.0,0.00012,0.010971
8,score_attention,dense,score_attention,0.75,0.820693,0.5,0.537024,0.108968
9,score_diversity,dense,score_diversity,0.75,0.432478,0.5,0.390713,0.204763
7,score_size_small,dense,score_size_small,0.75,0.035701,0.5,0.27295,0.182561


## 07. Scoring & Ranking v2 (Deterministic: Tiers → Tie-Breakers)

**Goal**
Produce a deterministic ranking that behaves like a real search engine:

1) **Tiers / Hard Constraints (Filter stage)**  
If a family *requires* something (e.g., serves middle school, has IB), we filter to only schools that satisfy it.
This guarantees “tier behavior” without abusing weights.

2) **Tie-Breakers (Score stage)**  
Within the filtered candidate set, compute a weighted linear score using v2 features.
Dense metrics (size / attention / diversity) act as tie-breakers because their total max contribution is budgeted.

**Why this design is deliberate**
- We do **not** encode sector bias (`is_public/is_private`) into scoring.
- We do **not** double-count size (`score_size_small` vs `score_size_large`).
- We treat missing dense data as “no opinion” using **neutral fill** (handled in Section 06).

**Outputs**
- Ranked list (top N)
- Score distribution + tie stats
- Per-feature contribution breakdown for the top result (explainability preview)


In [166]:
# ============================================================
# 07. Scoring & Ranking v2 (Deterministic)
# ============================================================

# ---------- Load artifacts from Section 06 ----------
X = np.load(PROCESSED_DIR / "school_matrix_v2.npy")
school_index = pd.read_csv(PROCESSED_DIR / "school_index_v2.csv")

with open(PROCESSED_DIR / "school_vector_explain_v2.json", "r") as f:
    explain = json.load(f)

features = explain["features"]
feat_names = [f["feature"] for f in features]
feat_weights = np.array([float(f["weight"]) for f in features])

feat_to_idx = {name: i for i, name in enumerate(feat_names)}

print("Loaded matrix:", X.shape)
print("Loaded school_index:", school_index.shape)
print("Feature count:", len(feat_names))


# ---------- Helper: Hard filters (tier guarantee) ----------
def apply_hard_filters(X_mat, requires, feat_to_idx):
    if not requires:
        return np.ones(X_mat.shape[0], dtype=bool)

    mask = np.ones(X_mat.shape[0], dtype=bool)
    for fname, required_val in requires.items():
        assert fname in feat_to_idx, f"Hard filter feature not in matrix: {fname}"
        j = feat_to_idx[fname]

        if required_val == 1:
            mask &= (X_mat[:, j] >= 0.5)
        elif required_val == 0:
            mask &= (X_mat[:, j] < 0.5)
        else:
            raise ValueError("Hard filters support binary values only")

    return mask


# ---------- Demo child profile ----------
child_requires = {
    "serves_middle": 1,
}

child_pref = {
    "tag_ib": 1.0,
    "serves_middle": 1.0,
    "score_size_small": 1.0,
    "score_attention": 1.0,
    "score_diversity": 0.6,
}

child_vec = np.zeros(len(feat_names))
for i, fname in enumerate(feat_names):
    child_vec[i] = float(child_pref.get(fname, 0.0))

w_child = feat_weights * child_vec

print("\n=== Child profile ===")
display(
    pd.DataFrame({
        "feature": feat_names,
        "weight": feat_weights,
        "child_pref": child_vec,
        "effective_weight": w_child,
    }).sort_values("effective_weight", ascending=False)
)


# ---------- Stage 1: Filter (tiers) ----------
mask = apply_hard_filters(X, child_requires, feat_to_idx)
print(f"\nFilter stage: kept {mask.sum():,} / {X.shape[0]:,} schools ({mask.mean():.2%})")

X_cand = X[mask]
idx_cand = school_index.loc[mask].copy().reset_index(drop=True)


# ---------- Stage 2: Score (tie-breakers) ----------
scores = X_cand @ w_child

rank_df = idx_cand.copy()
rank_df["score_v2"] = scores

rank_df = rank_df.sort_values(
    ["score_v2", "school_id"],
    ascending=[False, True],
    kind="mergesort"
).reset_index(drop=True)

print("\n=== Score distribution (filtered set) ===")
print(rank_df["score_v2"].describe())

print("\n=== Top 10 schools (v2) ===")
display(rank_df.head(10))


# ---------- Improved Tie Report ----------
rank_df["score_rounded"] = rank_df["score_v2"].round(6)

vc = rank_df["score_rounded"].value_counts()
num_unique_scores = int(vc.shape[0])
largest_tie = int(vc.max())
pct_in_ties = float((vc[vc > 1].sum() / vc.sum()) * 100)

print("\n=== Tie report (rounded to 6 decimals) ===")
print("Unique scores:", f"{num_unique_scores:,}")
print("Largest tie group size:", f"{largest_tie:,}")
print("% of schools in tie groups (size > 1):", f"{pct_in_ties:.2f}%")


# ---------- Explainability preview (top school) ----------
top = rank_df.iloc[0]
row_idx = int(top["row_index"])
x_top = X[row_idx]

explain_df = pd.DataFrame({
    "feature": feat_names,
    "school_value": x_top,
    "child_pref": child_vec,
    "weight": feat_weights,
    "effective_weight": w_child,
    "contribution": x_top * w_child,
}).sort_values("contribution", ascending=False)

print(f"\n=== Explainability preview: {top['school_id']} (score={top['score_v2']:.4f}) ===")
display(explain_df)


Loaded matrix: (124619, 10)
Loaded school_index: (124619, 2)
Feature count: 10

=== Child profile ===


Unnamed: 0,feature,weight,child_pref,effective_weight
0,tag_ib,5.0,1.0,5.0
5,serves_middle,1.0,1.0,1.0
7,score_size_small,0.75,1.0,0.75
8,score_attention,0.75,1.0,0.75
9,score_diversity,0.75,0.6,0.45
1,tag_cais,5.0,0.0,0.0
2,tag_ams_montessori,2.0,0.0,0.0
3,tag_waldorf,2.0,0.0,0.0
4,serves_elementary,1.0,0.0,0.0
6,serves_high,1.0,0.0,0.0



Filter stage: kept 59,805 / 124,619 schools (47.99%)

=== Score distribution (filtered set) ===
count    59805.000000
mean         1.808661
std          0.248875
min          1.225000
25%          1.638530
50%          1.752291
75%          1.953831
max          7.234454
Name: score_v2, dtype: float64

=== Top 10 schools (v2) ===


Unnamed: 0,school_id,row_index,score_v2
0,PRI_A2100388,118078,7.234454
1,PRI_00093379,103044,7.134809
2,PRI_A9700331,121637,7.111198
3,PRI_A1792009,116246,7.109
4,PRI_A9100571,119794,7.064006
5,PRI_BB060167,122981,7.048074
6,PRI_A0770343,112054,7.040558
7,PRI_A0900353,112238,7.039431
8,PRI_BB180318,123458,6.997802
9,PUB_60194211934,6937,6.953974



=== Tie report (rounded to 6 decimals) ===
Unique scores: 38,120
Largest tie group size: 1,039
% of schools in tie groups (size > 1): 47.19%

=== Explainability preview: PRI_A2100388 (score=7.2345) ===


Unnamed: 0,feature,school_value,child_pref,weight,effective_weight,contribution
0,tag_ib,1.0,1.0,5.0,5.0,5.0
5,serves_middle,1.0,1.0,1.0,1.0,1.0
8,score_attention,0.857057,1.0,0.75,0.75,0.642793
7,score_size_small,0.488881,1.0,0.75,0.75,0.366661
9,score_diversity,0.5,0.6,0.75,0.45,0.225
1,tag_cais,0.0,0.0,5.0,0.0,0.0
2,tag_ams_montessori,0.0,0.0,2.0,0.0,0.0
3,tag_waldorf,0.0,0.0,2.0,0.0,0.0
4,serves_elementary,1.0,0.0,1.0,0.0,0.0
6,serves_high,1.0,0.0,1.0,0.0,0.0


## 08. Explainability v2 (Dense + Tags + Raw Values)

**Goal**  
Generate a parent-friendly explanation for *why* a school ranked where it did.

This step intentionally uses **two different artifacts**:

### 1. Slim Index (from Section 06)
- `school_matrix_v2.npy`
- `school_index_v2.csv`
- Used only to compute scores and feature contributions

### 2. Master Dataset (Truth Table)
- `schools_master_v2.csv`
- Loaded here to retrieve **raw values** (e.g., student–teacher ratio, estimated enrollment)
- Required for UI and human explanation

**Why this separation matters**
- Ranking must be fast, numeric, and minimal
- Explainability must be rich, honest, and human-readable
- Raw values are *never* baked into the scoring matrix

**Design Rules**
1. Show **feature contributions** (points)
2. Show **raw context** for dense metrics:
   - Attention → student–teacher ratio
   - Size → estimated enrollment + source
   - Diversity → entropy score + source
3. Handle missing data honestly:
   - “Not available (neutral)” instead of implying good or bad
4. Keep explanations concise and parent-friendly


In [172]:
import json
import numpy as np
import pandas as pd

# ============================================================
# 08. Explainability v2 (Dense + Tags + Raw Values)
# ============================================================

# ------------------------------------------------------------
# ARTIFACT ROLES (IMPORTANT)
#
# 1) school_matrix_v2.npy        -> numeric features only (scoring)
# 2) school_index_v2.csv         -> slim routing index (school_id ↔ row_index)
# 3) schools_master_v2.csv       -> source of truth for raw values (UI / explain)
# ------------------------------------------------------------

# ---------- Load scoring artifacts ----------
X = np.load(PROCESSED_DIR / "school_matrix_v2.npy")

school_index = pd.read_csv(
    PROCESSED_DIR / "school_index_v2.csv"
)

with open(PROCESSED_DIR / "school_vector_explain_v2.json", "r") as f:
    explain = json.load(f)

features = explain["features"]
feat_names = [f["feature"] for f in features]
feat_weights = np.array([float(f["weight"]) for f in features], dtype=float)

feat_to_idx = {name: i for i, name in enumerate(feat_names)}

print("Loaded matrix:", X.shape)
print("Loaded slim index:", school_index.shape)
print("Feature count:", len(feat_names))


# ---------- Load master dataset for raw values ----------
schools_v2 = pd.read_csv(
    PROCESSED_DIR / "schools_master_v2.csv",
    low_memory=False
)

# Fast lookup by school_id
schools_v2 = schools_v2.set_index("school_id", drop=False)

print("Loaded schools_master_v2:", schools_v2.shape)


# ---------- Child preferences (same as Section 07) ----------
child_pref = {
    "tag_ib": 1.0,
    "serves_middle": 1.0,
    "score_size_small": 1.0,
    "score_attention": 1.0,
    "score_diversity": 0.6,
}

child_vec = np.array(
    [float(child_pref.get(f, 0.0)) for f in feat_names],
    dtype=float
)

w_child = feat_weights * child_vec


# ---------- Raw-value context map ----------
RAW_CONTEXT = {
    "score_size_small": ["raw_size_value", "raw_size_source"],
    "score_attention": ["raw_student_teacher_ratio", "raw_attention_source"],
    "score_diversity": ["raw_diversity_entropy", "raw_diversity_source"],
}


def fmt(x, nd=2):
    if pd.isna(x):
        return None
    try:
        return round(float(x), nd)
    except Exception:
        return x


def explain_school(school_id: str, score: float, top_k: int = 6):
    """
    Build a parent-friendly explanation for one ranked school.
    """

    # --- map school_id -> matrix row ---
    r = school_index[school_index["school_id"] == school_id]
    assert len(r) == 1, f"school_id not found or duplicated: {school_id}"
    row_idx = int(r.iloc[0]["row_index"])

    # --- numeric values from matrix ---
    x = X[row_idx]
    contrib = x * w_child

    # --- raw context from master table ---
    base = schools_v2.loc[school_id]

    rows = []
    for i, fname in enumerate(feat_names):
        if w_child[i] == 0:
            continue

        payload = {}
        for rf in RAW_CONTEXT.get(fname, []):
            if rf in base.index:
                payload[rf] = base[rf]

        rows.append({
            "feature": fname,
            "school_value": x[i],
            "effective_weight": w_child[i],
            "contribution": contrib[i],
            "raw_context": payload,
        })

    df = (
        pd.DataFrame(rows)
        .sort_values("contribution", ascending=False)
        .head(top_k)
    )

    # --- headline summary ---
    bullets = []
    for _, r in df.iterrows():
        f = r["feature"]
        pts = r["contribution"]

        if f.startswith("tag_") and r["school_value"] >= 0.5:
            bullets.append(f"{f.replace('tag_', '').upper()} (+{pts:.2f})")

        elif f.startswith("serves_") and r["school_value"] >= 0.5:
            bullets.append(f"Serves {f.replace('serves_', '').title()} (+{pts:.2f})")

        elif f == "score_attention":
            ratio = r["raw_context"].get("raw_student_teacher_ratio")
            bullets.append(
                f"High attention (+{pts:.2f}, ~{fmt(ratio)} students/teacher)"
                if pd.notna(ratio)
                else f"Attention (+{pts:.2f}, data not available → neutral)"
            )

        elif f == "score_size_small":
            size = r["raw_context"].get("raw_size_value")
            bullets.append(
                f"Small-school vibe (+{pts:.2f}, ~{fmt(size,0)} students)"
                if pd.notna(size)
                else f"Small-school vibe (+{pts:.2f}, data not available → neutral)"
            )

        elif f == "score_diversity":
            bullets.append(
                f"Diverse environment (+{pts:.2f})"
                if pd.notna(r["raw_context"].get("raw_diversity_entropy"))
                else f"Diversity (+{pts:.2f}, data not available → neutral)"
            )

    headline = (
        f"**{base['school_name']}** ({base['city']}, {base['state']}) "
        f"ranked highly with score **{score:.3f}** because: "
        + "; ".join(bullets[:3])
        + "."
    )

    return headline, df


# ---------- Run explainability for top N results ----------
top_n = 3
top_rows = rank_df.head(top_n)[["school_id", "score_v2"]]

for _, r in top_rows.iterrows():
    print("\n" + "=" * 90)
    headline, details = explain_school(r["school_id"], float(r["score_v2"]))
    print(headline)
    display(details)


Loaded matrix: (124619, 10)
Loaded slim index: (124619, 2)
Feature count: 10
Loaded schools_master_v2: (124619, 29)

**granada preparatory school** (northridge, CA) ranked highly with score **7.234** because: IB (+5.00); Serves Middle (+1.00); High attention (+0.64, ~5.24 students/teacher).


Unnamed: 0,feature,school_value,effective_weight,contribution,raw_context
0,tag_ib,1.0,5.0,5.0,{}
1,serves_middle,1.0,1.0,1.0,{}
3,score_attention,0.857057,0.75,0.642793,{'raw_student_teacher_ratio': 5.24271844660194...
2,score_size_small,0.488881,0.75,0.366661,"{'raw_size_value': 54.0, 'raw_size_source': 'e..."
4,score_diversity,0.5,0.45,0.225,"{'raw_diversity_entropy': nan, 'raw_diversity_..."



**valley preparatory school** (redlands, CA) ranked highly with score **7.135** because: IB (+5.00); Serves Middle (+1.00); High attention (+0.64, ~5.44 students/teacher).


Unnamed: 0,feature,school_value,effective_weight,contribution,raw_context
0,tag_ib,1.0,5.0,5.0,{}
1,serves_middle,1.0,1.0,1.0,{}
3,score_attention,0.850453,0.75,0.63784,{'raw_student_teacher_ratio': 5.44444444444444...
2,score_size_small,0.362626,0.75,0.271969,"{'raw_size_value': 147.0, 'raw_size_source': '..."
4,score_diversity,0.5,0.45,0.225,"{'raw_diversity_entropy': nan, 'raw_diversity_..."



**bowman school** (palo alto, CA) ranked highly with score **7.111** because: IB (+5.00); Serves Middle (+1.00); High attention (+0.67, ~4.24 students/teacher).


Unnamed: 0,feature,school_value,effective_weight,contribution,raw_context
0,tag_ib,1.0,5.0,5.0,{}
1,serves_middle,1.0,1.0,1.0,{}
3,score_attention,0.889855,0.75,0.667392,{'raw_student_teacher_ratio': 4.24092409240924...
4,score_diversity,0.5,0.45,0.225,"{'raw_diversity_entropy': nan, 'raw_diversity_..."
2,score_size_small,0.291743,0.75,0.218807,"{'raw_size_value': 257.0, 'raw_size_source': '..."


## 09. Validation v2 (Tie-Reduction + Reasonableness Tests)

**Goal**
Prove v2 behaves better than v1 in two ways:

1) **Tie-reduction**
Dense metrics should reduce “sea of ties” within the same tag tier / grade tier.

2) **Reasonableness**
Tier logic must remain intact:
- Tag tiers (e.g., IB) should not be outranked purely by dense metrics.
- Missing dense data should not “destroy” a school (neutral fill = 0.5 keeps schools viable).

**What we validate in this section**
A. Tie report for the current run (unique scores, largest tie group, % in ties)  
B. Tier sanity checks:
   - if `tag_ib` is required, all results must have IB
   - if `serves_middle` is required, all results must serve middle
C. “Budget check in action” demo:
   - show the maximum possible dense contribution vs IB contribution
D. Distribution sanity:
   - confirm dense features are within [0, 1]
   - confirm fill behavior: missing dense metrics → 0.5 in the matrix


In [175]:
# ============================================================
# 09. Validation v2 (Tie-Reduction + Reasonableness)
# ============================================================

# ---------- Load artifacts ----------
X = np.load(PROCESSED_DIR / "school_matrix_v2.npy")
school_index = pd.read_csv(PROCESSED_DIR / "school_index_v2.csv")

with open(PROCESSED_DIR / "school_vector_explain_v2.json", "r") as f:
    explain = json.load(f)

features = explain["features"]
feat_names = [f["feature"] for f in features]
feat_weights = np.array([float(f["weight"]) for f in features], dtype=float)
feat_to_idx = {name: i for i, name in enumerate(feat_names)}

schools_v2 = pd.read_csv(PROCESSED_DIR / "schools_master_v2.csv", low_memory=False)

print("Matrix:", X.shape)
print("Index:", school_index.shape)
print("schools_master_v2:", schools_v2.shape)
print("Features:", feat_names)

# ---------- Child profile used for validation (same as Section 07) ----------
child_requires = {"serves_middle": 1}
child_pref = {
    "tag_ib": 1.0,
    "serves_middle": 1.0,
    "score_size_small": 1.0,
    "score_attention": 1.0,
    "score_diversity": 0.6,
}

child_vec = np.array([float(child_pref.get(n, 0.0)) for n in feat_names], dtype=float)
w_child = feat_weights * child_vec


# ---------- Helpers ----------
def apply_hard_filters(X_mat, requires):
    if not requires:
        return np.ones(X_mat.shape[0], dtype=bool)
    mask = np.ones(X_mat.shape[0], dtype=bool)
    for fname, required_val in requires.items():
        j = feat_to_idx[fname]
        if required_val == 1:
            mask &= (X_mat[:, j] >= 0.5)
        elif required_val == 0:
            mask &= (X_mat[:, j] < 0.5)
        else:
            raise ValueError("Hard filters support 0/1 only.")
    return mask


def rank_candidates(X_mat, idx_df):
    scores = X_mat @ w_child
    out = idx_df.copy()
    out["score_v2"] = scores
    out = out.sort_values(["score_v2", "school_id"], ascending=[False, True], kind="mergesort").reset_index(drop=True)
    out["score_rounded"] = out["score_v2"].round(6)
    return out


# ---------- A) Run ranking and tie report ----------
mask = apply_hard_filters(X, child_requires)
X_cand = X[mask]
idx_cand = school_index.loc[mask].copy().reset_index(drop=True)

rank_df = rank_candidates(X_cand, idx_cand)

vc = rank_df["score_rounded"].value_counts()
num_unique_scores = int(vc.shape[0])
largest_tie = int(vc.max())
pct_in_ties = float((vc[vc > 1].sum() / vc.sum()) * 100)

print("\n=== A) Tie report (rounded to 6 decimals) ===")
print("Candidates:", f"{len(rank_df):,}")
print("Unique scores:", f"{num_unique_scores:,}")
print("Largest tie group size:", f"{largest_tie:,}")
print("% of schools in tie groups (size > 1):", f"{pct_in_ties:.2f}%")


# ---------- B) Tier sanity checks ----------
print("\n=== B) Tier sanity checks ===")

# serves_middle required
if "serves_middle" in child_requires and child_requires["serves_middle"] == 1:
    j = feat_to_idx["serves_middle"]
    assert (X_cand[:, j] >= 0.5).all(), "Tier violation: found non-middle school in serves_middle-required results."
    print("All candidates serve middle school")

# tag_ib required (if you enable it)
if child_requires.get("tag_ib", 0) == 1:
    j = feat_to_idx["tag_ib"]
    assert (X_cand[:, j] >= 0.5).all(), "Tier violation: found non-IB school in IB-required results."
    print("All candidates have IB")


# ---------- C) Budget check in action (floor guarantee) ----------
print("\n=== C) Budget check in action ===")

dense_feats = [f["feature"] for f in features if f["type"] == "dense"]
tag_feats = [f["feature"] for f in features if f["type"] == "binary"]

dense_max = 0.0
for f in features:
    if f["type"] == "dense":
        # dense features in [0,1], max contribution when child_pref=1 is weight*1
        dense_max += float(f["weight"]) * float(child_pref.get(f["feature"], 0.0))

ib_weight = float([f for f in features if f["feature"] == "tag_ib"][0]["weight"]) * float(child_pref.get("tag_ib", 0.0))

print(f"Max possible dense contribution (this child): {dense_max:.2f}")
print(f"IB contribution (if present): {ib_weight:.2f}")

if ib_weight > dense_max:
    print("Tier guarantee holds: IB outranks perfect-dense non-IB (within this child profile).")
else:
    print("Tier guarantee does NOT hold under this child profile (consider budget/weights).")


# ---------- D) Distribution + fill sanity checks ----------
print("\n=== D) Distribution + fill sanity checks ===")

# Check all dense columns are within [0,1] in the MATRIX after fill
for f in dense_feats:
    j = feat_to_idx[f]
    col = X[:, j]
    if np.isnan(col).any():
        raise AssertionError(f"Matrix contains NaN in dense feature: {f} (should be filled).")
    if (col < 0).any() or (col > 1).any():
        raise AssertionError(f"Dense feature out of [0,1] range in matrix: {f}")
print("All dense features are in [0,1] and contain no NaNs in the matrix")

# Check the neutral fill behavior exists (0.5 should appear for missing-heavy dense features)
for f in dense_feats:
    j = feat_to_idx[f]
    col = X[:, j]
    pct_05 = float((np.isclose(col, 0.5)).mean() * 100)
    print(f"{f}: % exactly 0.5 (neutral fill) ≈ {pct_05:.2f}%")

print("\n=== Sample top 10 (for manual sanity) ===")
display(rank_df.head(10))


Matrix: (124619, 10)
Index: (124619, 2)
schools_master_v2: (124619, 29)
Features: ['tag_ib', 'tag_cais', 'tag_ams_montessori', 'tag_waldorf', 'serves_elementary', 'serves_middle', 'serves_high', 'score_size_small', 'score_attention', 'score_diversity']

=== A) Tie report (rounded to 6 decimals) ===
Candidates: 59,805
Unique scores: 38,120
Largest tie group size: 1,039
% of schools in tie groups (size > 1): 47.19%

=== B) Tier sanity checks ===
All candidates serve middle school

=== C) Budget check in action ===
Max possible dense contribution (this child): 1.95
IB contribution (if present): 5.00
Tier guarantee holds: IB outranks perfect-dense non-IB (within this child profile).

=== D) Distribution + fill sanity checks ===
All dense features are in [0,1] and contain no NaNs in the matrix
score_size_small: % exactly 0.5 (neutral fill) ≈ 3.57%
score_attention: % exactly 0.5 (neutral fill) ≈ 82.07%
score_diversity: % exactly 0.5 (neutral fill) ≈ 43.25%

=== Sample top 10 (for manual sani

Unnamed: 0,school_id,row_index,score_v2,score_rounded
0,PRI_A2100388,118078,7.234454,7.234454
1,PRI_00093379,103044,7.134809,7.134809
2,PRI_A9700331,121637,7.111198,7.111198
3,PRI_A1792009,116246,7.109,7.109
4,PRI_A9100571,119794,7.064006,7.064006
5,PRI_BB060167,122981,7.048074,7.048074
6,PRI_A0770343,112054,7.040558,7.040558
7,PRI_A0900353,112238,7.039431,7.039431
8,PRI_BB180318,123458,6.997802,6.997802
9,PUB_60194211934,6937,6.953974,6.953974


# 10. Summary & Next Steps

This notebook completes **Scoring v2** of the Smart School Finder system.

At this point, we have built a **fully deterministic, explainable, and validated recommendation engine** that ranks schools using transparent rules instead of opaque machine learning.

---

## 10.1 What We Built in v2

### A. Deterministic Scoring Engine
We implemented a scoring system based on:

- **Binary “Tier” features** (e.g., IB, CAIS, Montessori, grade coverage)
- **Dense “Tie-breaker” metrics** (school size, attention, diversity)
- A **strict weight budget** where:
  
  > Tag weights > Sum of dense metric weights

This guarantees that **structural requirements dominate rankings**, while dense metrics only refine ordering *within* tiers.

---

### B. Honest Missing-Data Handling
For dense metrics:
- Missing values are **not penalized**
- Missing values are filled with **neutral (0.5)** at matrix-build time

This preserves:
- Mathematical stability (no NaNs)
- Product fairness (“no opinion” instead of punishment)

---

### C. Explainability by Construction
Every ranked result can be explained as:

- Which features contributed
- How much each feature contributed
- What raw values caused the score

Example:
> “Ranked highly because: IB (+5.0), Serves Middle (+1.0), High Attention (~5.2 students per teacher)”

This enables **parent-facing explanations without post-hoc inference**.

---

### D. Validation & Safety Guarantees
We explicitly validated:

- Tier guarantees (tags dominate dense metrics)
- Dense metrics are bounded in [0,1]
- Neutral fill behaves as expected
- Tie behavior is understood and measurable

The system is **predictable, auditable, and safe for real-world use**.

---

## 10.2 What v2 Solves (Compared to v1)

| Problem | v1 | v2 |
|------|----|----|
| Hidden weighting | Implicit | Explicit, budgeted |
| Dense metrics | Binary | Continuous, normalized |
| Missing data | Silent failure | Neutral, intentional |
| Explainability | Approximate | Exact + traceable |
| Parent trust | Low | High |

---

## 10.3 What This System Is *Not*

This engine intentionally does **not**:
- Learn from user behavior
- Use black-box ML models
- Optimize engagement metrics
- Personalize via embeddings

This is a **calculus engine, not a guessing engine**.

---

## 10.4 Next Step: Preference Segments (v0)

With a validated scoring engine in place, the next step is to introduce **Preference Segments**:

- Deterministic presets (e.g., “Academic-First”, “Progressive”, “Small & Nurturing”)
- Each segment maps to:
  - Hard requirements
  - Feature weights
  - Preference intensities

These segments:
- Do **not** change the engine
- Do **not** introduce ML
- Act as a **product interface layer**

---

## 10.5 Looking Ahead

The next notebook will introduce:

> **Preference Segments v0** — a deterministic bridge between:
> - the scoring engine (this notebook)
> - and the AI parent-reporting layer

Once segments exist, AI can safely generate explanations **without influencing rankings**.

---

# 11. Preference Segments v0 (Deterministic)

This section introduces **Preference Segments v0** — a deterministic abstraction layer
that maps *human intent* into *scoring engine inputs*.

⚠️ Important:
- Preference Segments do **not** change the scoring engine
- They do **not** introduce machine learning
- They act purely as **configuration presets**

Think of segments as **named bundles of preferences**, not models.

---

## 11.1 Why Preference Segments Exist

Parents do not think in terms of:
- feature weights
- dot products
- normalized scores

They think in terms of:
- “Academic-focused”
- “Small & nurturing”
- “Balanced”
- “Progressive / Montessori-leaning”

Preference Segments translate these *human mental models* into:
- hard requirements
- feature importance
- relative trade-offs

---

## 11.2 Segment Design Principles

Each Preference Segment must satisfy:

1. **Deterministic**
   - Same segment → same ranking every time

2. **Transparent**
   - Fully inspectable configuration
   - No learned parameters

3. **Composable**
   - Segments can be extended or combined later

4. **Engine-agnostic**
   - Segments configure the engine
   - They do not alter engine logic

---

## 11.3 Segment Schema (v0)

Each segment defines:

- `hard_requires`
  - Binary filters applied *before scoring*
- `feature_prefs`
  - Feature preference strengths in `[0.0 – 1.0]`
- `description`
  - Human-readable explanation (used later by AI)

---

## 11.4 Example Preference Segments (v0)

The following segments are **hand-crafted baselines** designed for product exploration.

They are intentionally simple.



In [180]:
# ============================================================
# 11. Preference Segments v0 (Deterministic Config)
# ============================================================

PREFERENCE_SEGMENTS_V0 = {
    "academic_first": {
        "description": "Strong academics with structured rigor and credentials.",
        "hard_requires": {
            "serves_middle": 1,
        },
        "feature_prefs": {
            "tag_ib": 1.0,
            "tag_cais": 0.8,
            "score_attention": 0.6,
            "score_size_small": 0.3,
            "score_diversity": 0.4,
        },
    },

    "small_nurturing": {
        "description": "Intimate environment with high individual attention.",
        "hard_requires": {
            "serves_elementary": 1,
        },
        "feature_prefs": {
            "score_size_small": 1.0,
            "score_attention": 1.0,
            "score_diversity": 0.5,
        },
    },

    "progressive_balanced": {
        "description": "Balanced academics with progressive philosophy.",
        "hard_requires": {},
        "feature_prefs": {
            "tag_ams_montessori": 1.0,
            "tag_waldorf": 0.7,
            "score_attention": 0.6,
            "score_size_small": 0.5,
            "score_diversity": 0.6,
        },
    },

    "balanced_general": {
        "description": "Well-rounded schools with no extreme trade-offs.",
        "hard_requires": {},
        "feature_prefs": {
            "score_size_small": 0.6,
            "score_attention": 0.6,
            "score_diversity": 0.6,
        },
    },
}

print(f"Loaded {len(PREFERENCE_SEGMENTS_V0)} preference segments:")
for k, v in PREFERENCE_SEGMENTS_V0.items():
    print(f"- {k}: {v['description']}")


Loaded 4 preference segments:
- academic_first: Strong academics with structured rigor and credentials.
- small_nurturing: Intimate environment with high individual attention.
- progressive_balanced: Balanced academics with progressive philosophy.
- balanced_general: Well-rounded schools with no extreme trade-offs.
