# Notebook 07 — Preference Segments (v0 → v1 Validation)

## Purpose
This notebook evaluates **Preference Segments v0** — deterministic, hand-crafted
abstractions of parent intent — and validates that they produce **meaningful,
safe, and explainable differences** in school rankings.

This notebook does **not** perform machine learning or clustering.
All segments are deterministic configurations layered on top of the v2 scoring engine.

---

## 00. Setup & Scope

- Relationship to Notebook 06 (v2 Scoring Engine)
- What this notebook does:
  - Validate Preference Segment behavior
  - Compare ranking outcomes across segments
- What this notebook does NOT do:
  - No ML
  - No learned clusters
  - No weight training

---

## 01. Load Inputs & Feature Space

- Load `school_matrix_v2.npy`
- Load `school_index_v2.csv`
- Load `schools_master_v2.csv`
- Load `feature_config_master_v2.json`
- Load `preference_segments_v0.json`

Validation:
- Matrix shape consistency
- Feature alignment across artifacts

---

## 02. Segment Configuration → Mathematical Representation

For each Preference Segment:

- Resolve segment configuration into:
  - Segment feature vector
  - Segment weight vector
- Confirm:
  - Only known v2 features are referenced
  - Binary tags dominate dense metrics
  - No hidden or implicit logic exists

Output:
- One resolved **segment vector** per segment

---

## 03. Ranking Execution by Segment

For each segment:

- Score all schools using v2 scoring engine
- Rank results
- Save:
  - Top-K schools
  - Score distributions
  - Candidate counts

Goal:
- Validate that segments produce stable, differentiable rankings
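
A ranking is only stable if tied scores always resolve the same way. The sketch below shows one option, a stable descending sort in which ties keep their original row order. `rank_descending` is a hypothetical helper and the scores are toy values, not output of the v2 scoring engine.

```python
import numpy as np

def rank_descending(scores) -> np.ndarray:
    """Return row indices ordered best-first; tied rows keep input order."""
    # Negating the scores and using a stable sort preserves input order
    # within ties. argsort(scores)[::-1] would reverse the order within ties.
    return np.argsort(-np.asarray(scores, dtype=float), kind="stable")

toy_scores = np.array([2.0, 5.0, 2.0, 5.0, 1.0])  # toy data (assumption)
order = rank_descending(toy_scores)
print(order.tolist())
```

Rows 1 and 3 tie for first place and stay in that order, so repeated runs produce the same top-K list.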

---

## 04. Cross-Segment Ranking Comparison

Compare segments on:

- Top-10 / Top-25 overlap
- Rank movement of the same school across segments
- Presence / absence of tier-defining tags (e.g., IB)

Goal:
- Prove segments produce **meaningfully different outcomes**
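
The tag presence check could be expressed as the share of top-K schools that carry a given binary tag. This is a minimal sketch on toy data; `tag_share_in_top_k` and all arrays below are hypothetical and only illustrate the comparison.

```python
import numpy as np

def tag_share_in_top_k(scores, tag_column, k: int) -> float:
    """Fraction of the k best-scoring rows whose binary tag equals 1."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")[:k]
    return float(np.asarray(tag_column)[order].mean())

# Toy data (assumption): 6 schools, one binary tag, two segments.
toy_tag_ib = np.array([1, 0, 1, 0, 0, 0])
toy_scores_a = np.array([9.0, 1.0, 8.0, 2.0, 3.0, 0.5])  # segment that weights the tag
toy_scores_b = np.array([1.0, 4.0, 0.5, 5.0, 3.0, 2.0])  # segment that ignores the tag

share_a = tag_share_in_top_k(toy_scores_a, toy_tag_ib, k=2)
share_b = tag_share_in_top_k(toy_scores_b, toy_tag_ib, k=2)
print(share_a, share_b)
```

A large gap between the two shares is one signal that a segment's tier-defining tag actually shapes its top results.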

---

## 05. Tie Behavior by Segment

For each segment:

- Count unique scores
- Largest tie group size
- % of candidates in tie groups

Compare:
- Segment-driven tie reduction
- Impact of dense metrics under different preferences
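
The three tie metrics above can be computed from a score vector alone. The sketch below is one possible implementation; `tie_stats` is a hypothetical helper, and rounding to 6 decimals before grouping is an assumption made to avoid treating floating-point noise as distinct scores.

```python
import numpy as np

def tie_stats(scores, decimals: int = 6) -> dict:
    """Summarize ties: unique scores, largest tie group, % of rows in ties."""
    rounded = np.round(np.asarray(scores, dtype=float), decimals)
    _, counts = np.unique(rounded, return_counts=True)
    tied = counts[counts > 1]  # groups with at least two rows
    return {
        "unique_scores": int(counts.size),
        "largest_tie_group": int(counts.max()),  # 1 means no ties at all
        "pct_in_ties": float(tied.sum() / rounded.size * 100.0),
    }

# Toy scores (assumption): three rows tie at 3.0 and two rows tie at 1.0.
stats = tie_stats(np.array([3.0, 3.0, 3.0, 2.5, 1.0, 1.0]))
print(stats)
```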

---

## 06. Explainability Differences (Segment Lens)

For selected schools:

- Generate explanations under **multiple segments**
- Compare:
  - Feature contributions
  - Raw context values

Demonstrate:
- Same school → different justification depending on segment
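
To illustrate how one school can earn different justifications, the sketch below breaks a single feature row into per-feature contributions under two segments. The `explain` helper, the three-feature space, and the toy weights are assumptions for illustration. The contribution formula (school value × desired value × weight) follows the weighted linear alignment used in this notebook.

```python
import numpy as np

def explain(x, seg_vec, seg_wts, names):
    """Return (feature, contribution) pairs with positive contribution, largest first."""
    contrib = np.asarray(x) * np.asarray(seg_vec) * np.asarray(seg_wts)
    pairs = [(n, float(c)) for n, c in zip(names, contrib) if c > 0]
    return sorted(pairs, key=lambda p: p[1], reverse=True)

names = ["tag_ib", "score_size_small", "score_attention"]
school = np.array([1.0, 0.4, 0.9])  # toy feature row (assumption)

# Each segment is (desired values, weights); the numbers are illustrative only.
seg_a = (np.array([1.0, 0.0, 1.0]), np.array([5.0, 0.0, 0.8]))
seg_b = (np.array([0.0, 1.0, 1.0]), np.array([0.0, 2.5, 2.5]))

expl_a = explain(school, *seg_a, names)
expl_b = explain(school, *seg_b, names)
print(expl_a)
print(expl_b)
```

Under the first segment the IB tag leads the explanation. Under the second the same school is justified mainly by attention and size.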

---

## 07. Tier & Constraint Sanity Checks

Validate for every segment:

- Hard requirements are never violated
- Tag weights dominate dense metrics
- No dense metric can overturn a tier decision

Explicit checks:
- IB vs non-IB dominance
- Grade-span enforcement
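
One way to test tag dominance is to compare the smallest tag weight in a segment against the sum of its dense weights. The sketch makes two assumptions: tags are the features prefixed `tag_` and dense metrics are those prefixed `score_`, and every dense value lies in [0, 1]. `tag_dominates_dense` is a hypothetical helper. The weights are copied from the v0 config in Section 01.A.

```python
def tag_dominates_dense(features):
    """True if the smallest tag weight exceeds the combined dense weights.

    Returns None when the segment has no tag features (nothing to check).
    Assumes dense feature values are bounded by 1.0.
    """
    tag_w = [f["weight"] for f in features if f["name"].startswith("tag_")]
    dense_w = [f["weight"] for f in features if f["name"].startswith("score_")]
    if not tag_w:
        return None
    return min(tag_w) > sum(dense_w)

academic_first = [
    {"name": "tag_ib", "weight": 5.0},
    {"name": "tag_cais", "weight": 5.0},
    {"name": "score_attention", "weight": 0.8},
    {"name": "score_diversity", "weight": 0.6},
]
progressive_balanced = [
    {"name": "tag_ams_montessori", "weight": 2.0},
    {"name": "tag_waldorf", "weight": 2.0},
    {"name": "score_attention", "weight": 1.0},
    {"name": "score_diversity", "weight": 1.0},
    {"name": "score_size_small", "weight": 0.8},
]

print(tag_dominates_dense(academic_first))        # 5.0 vs 1.4
print(tag_dominates_dense(progressive_balanced))  # 2.0 vs 2.8
```

Under these assumptions `academic_first` passes, while `progressive_balanced` does not: its dense weights sum to 2.8, which exceeds a single tag weight of 2.0. That segment may therefore deserve a closer look in the design review.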

---

## 08. Segment Design Review

Qualitative review:

- What each segment optimizes for
- Known limitations
- Overlaps between segments
- Gaps in coverage

This section informs **future segment refinement**, not ML.

---

## 09. Summary & Product Implications

- What Preference Segments v0 prove
- Why deterministic segmentation is valuable
- Readiness for:
  - Parent-facing UX
  - AI-generated explanations
  - Controlled personalization

---

## 10. Next Steps

- Segment v1 refinement from feedback
- Child Profile → Segment mapping
- Soft personalization (non-learning)
- Future ML clustering roadmap (Notebook 08+)

---

> Preference Segments are **intent abstractions**, not intelligence.  
> Intelligence remains in the math.


## 00. Notebook Setup

### Goal
Validate **Preference Segments v0** (deterministic segment definitions) against real school data + baseline ranking outputs from Notebook 06, then produce:
- **v0 validation report** (coverage, separation, sanity checks)
- **segment diagnostics** (feature distributions, top schools by segment)
- **v1 action list** (what to change in v1: weights, thresholds, new features, missing tags)

### What we already have
- Notebook 06 produced:
  - `schools_master_v2` (Golden Record)
  - Feature matrix (dense + binary features)
  - Scoring/ranking outputs (or at least ability to score)
- Preference Segments v0:
  - A deterministic config describing a few segments and their weights/feature intents.

### Inputs (expected)
- Processed schools dataset: `schools_master_v2.csv`
- Feature config (v2): `feature_config_master_v2.json`
- Preference segments config (v0): `preference_segments_v0.json`
- Optional: cached scoring outputs from Notebook 06 (e.g. `ranked_schools_v2.parquet`)

### Outputs (we will create)
- `../reports/notebook07_segments_v0_validation.md`
- `../reports/notebook07_segments_v0_coverage.csv`
- `../reports/notebook07_segments_v0_top_schools.csv`
- `../reports/notebook07_segments_v0_feature_summary.csv`

### Sections in this notebook
00. Setup  
01. Load Inputs (Schools + Feature Config + Segment Config)  
02+. Validation, diagnostics, and v1 recommendations (later)


In [46]:
# ============================================
# Notebook 07 — Preference Segments (v0 → v1 Validation)
# Section 00: Setup
# ============================================

from __future__ import annotations

import json  # needed by Section 01 (feature config) and later sections
from pathlib import Path

import numpy as np
import pandas as pd

# ----------------------------
# Display settings
# ----------------------------
pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)

# ----------------------------
# Project paths
# ----------------------------
NOTEBOOK_DIR = Path.cwd()
PROJECT_ROOT = NOTEBOOK_DIR.parent

DATA_DIR = PROJECT_ROOT / "data"
PROCESSED_DIR = DATA_DIR / "processed"
REPORTS_DIR = PROJECT_ROOT / "reports"
CONFIG_DIR = PROJECT_ROOT / "config"  # created below if it does not exist yet

REPORTS_DIR.mkdir(parents=True, exist_ok=True)
CONFIG_DIR.mkdir(parents=True, exist_ok=True)

# ----------------------------
# Notebook 06 artifacts
# ----------------------------
SCHOOLS_MASTER_V2_PATH = PROCESSED_DIR / "schools_master_v2.csv"
SCHOOL_MATRIX_V2_PATH  = PROCESSED_DIR / "school_matrix_v2.npy"
SCHOOL_INDEX_V2_PATH   = PROCESSED_DIR / "school_index_v2.csv"
FEATURE_CONFIG_V2_PATH = PROCESSED_DIR / "feature_config_master_v2.json"

# Optional (not required for Section 00)
SCHOOL_MATRIX_AUDIT_PATH   = PROCESSED_DIR / "school_matrix_audit_v2.csv"
SCHOOL_VECTOR_EXPLAIN_PATH = PROCESSED_DIR / "school_vector_explain_v2.json"
SCHOOL_ID_TO_INDEX_PATH    = PROCESSED_DIR / "school_id_to_index.json"

# ----------------------------
# Preference Segments v0 (will create later; do NOT require in Section 00)
# ----------------------------
SEGMENTS_V0_PATH = CONFIG_DIR / "preference_segments_v0.json"

# ----------------------------
# Notebook 07 outputs (later)
# ----------------------------
NB07_SEGMENT_RANKS_PATH = REPORTS_DIR / "notebook07_segment_top_schools.csv"
NB07_DIAGNOSTICS_PATH   = REPORTS_DIR / "notebook07_segment_diagnostics.json"

# ----------------------------
# Guard rails — required inputs ONLY
# ----------------------------
required_files = [
    SCHOOLS_MASTER_V2_PATH,
    SCHOOL_MATRIX_V2_PATH,
    SCHOOL_INDEX_V2_PATH,
    FEATURE_CONFIG_V2_PATH,
]

missing = [p for p in required_files if not p.exists()]
if missing:
    raise FileNotFoundError(
        "Missing required file(s):\n" + "\n".join(str(p) for p in missing)
    )

# ----------------------------
# Final confirmation
# ----------------------------
print("Section 00 complete — setup ready")
print("PROJECT_ROOT:", PROJECT_ROOT)
print("PROCESSED_DIR:", PROCESSED_DIR)
print("CONFIG_DIR created:", CONFIG_DIR.exists())
print("REPORTS_DIR:", REPORTS_DIR)

print("\nInputs:")
print(" - schools_master_v2:", SCHOOLS_MASTER_V2_PATH.name)
print(" - school_matrix_v2:", SCHOOL_MATRIX_V2_PATH.name)
print(" - school_index_v2:", SCHOOL_INDEX_V2_PATH.name)
print(" - feature_config_master_v2:", FEATURE_CONFIG_V2_PATH.name)

print("\nSegments v0 exists:", SEGMENTS_V0_PATH.exists(), "| path:", SEGMENTS_V0_PATH)


Section 00 complete — setup ready
PROJECT_ROOT: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school
PROCESSED_DIR: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/data/processed
CONFIG_DIR created: True
REPORTS_DIR: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/reports

Inputs:
 - schools_master_v2: schools_master_v2.csv
 - school_matrix_v2: school_matrix_v2.npy
 - school_index_v2: school_index_v2.csv
 - feature_config_master_v2: feature_config_master_v2.json

Segments v0 exists: True | path: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/config/preference_segments_v0.json


## 01. Load School Feature Space (Notebook 06 Outputs)

In this section, we load the finalized **school feature space** produced in Notebook 06:

- `schools_master_v2.csv` — canonical school records
- `school_matrix_v2.npy` — dense feature matrix (schools × features)
- `school_index_v2.csv` — row alignment between matrix and schools
- `feature_config_master_v2.json` — feature definitions and semantics

We perform **strict alignment checks** to ensure:
- Matrix rows match school index rows
- Matrix columns match feature configuration
- No silent misalignment before scoring

This guarantees that downstream segment scoring is **numerically correct and interpretable**.
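
Because `school_index_v2.csv` carries an explicit `row_index` column, the row-count check could be extended to confirm that the file order really matches the matrix row order. The following is a minimal sketch on toy data; `check_row_alignment` is a hypothetical helper, not part of the Notebook 06 artifacts.

```python
import numpy as np
import pandas as pd

def check_row_alignment(index_df: pd.DataFrame, n_rows: int) -> None:
    """Raise ValueError unless index_df maps file order to matrix rows 0..n_rows-1."""
    if len(index_df) != n_rows:
        raise ValueError(f"index has {len(index_df)} rows, matrix has {n_rows}")
    # File order must equal matrix order for positional lookups to be valid.
    if not np.array_equal(index_df["row_index"].to_numpy(), np.arange(n_rows)):
        raise ValueError("row_index is not 0..n-1 in file order")
    # Duplicate ids would make id -> row lookups ambiguous.
    if index_df["school_id"].duplicated().any():
        raise ValueError("duplicate school_id values in index")

# Toy index (assumption): three schools in matrix order.
toy_index = pd.DataFrame({"school_id": ["A", "B", "C"], "row_index": [0, 1, 2]})
check_row_alignment(toy_index, n_rows=3)
print("toy index aligned")
```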


In [49]:
# ============================================
# Notebook 07 — Section 01: Load school feature artifacts
# ============================================

# ----------------------------
# 01.1 Load schools_master_v2
# ----------------------------
schools = pd.read_csv(SCHOOLS_MASTER_V2_PATH, low_memory=False)
print("schools_master_v2:", schools.shape)
display(schools.head(3))

# ----------------------------
# 01.2 Load matrix + index
# ----------------------------
X_school = np.load(SCHOOL_MATRIX_V2_PATH)   # shape: (n_schools, n_features)
school_index = pd.read_csv(SCHOOL_INDEX_V2_PATH)

print("\nX_school shape:", X_school.shape)
print("school_index shape:", school_index.shape)
display(school_index.head(5))

# ----------------------------
# 01.3 Load feature config (source-of-truth for feature names)
# ----------------------------
with open(FEATURE_CONFIG_V2_PATH, "r") as f:
    feature_config = json.load(f)

# expected: {"features":[{...}, ...], ...}
features = feature_config.get("features", [])
if not features:
    raise ValueError("feature_config_master_v2.json missing 'features' list or it's empty.")

feature_names = [feat["name"] for feat in features]
print("\nfeature_config features:", len(feature_names))
print("first 10 features:", feature_names[:10])

# ----------------------------
# 01.4 Alignment checks (fail fast)
# ----------------------------
# (A) Matrix rows must match school_index rows
assert X_school.shape[0] == len(school_index), (
    f"Row mismatch: matrix rows={X_school.shape[0]} vs school_index rows={len(school_index)}"
)

# (B) Matrix cols must match feature_config length
assert X_school.shape[1] == len(feature_names), (
    f"Col mismatch: matrix cols={X_school.shape[1]} vs feature_config features={len(feature_names)}"
)

# (C) school_id coverage check (best-effort; only if column exists)
if "school_id" in school_index.columns and "school_id" in schools.columns:
    missing_in_schools = set(school_index["school_id"]) - set(schools["school_id"])
    if missing_in_schools:
        print(f"\n school_index has {len(missing_in_schools)} school_id not found in schools_master_v2 (showing 10):")
        print(list(sorted(missing_in_schools))[:10])
    else:
        print("\n school_index.school_id coverage matches schools_master_v2.school_id")
else:
    print("\nℹ️ Skipping school_id coverage check (school_id column not found in index and/or schools).")

# ----------------------------
# 01.5 Build feature name → column index map (used throughout Notebook 07)
# ----------------------------
feat_to_col = {name: i for i, name in enumerate(feature_names)}
print("\n Section 01 complete — artifacts loaded and aligned")
print("feat_to_col sample:", list(feat_to_col.items())[:5])


schools_master_v2: (124619, 29)


Unnamed: 0,school_id,school_name,city,state,zip,is_public,is_private,has_ib,has_cais,has_ams_montessori,has_waldorf,has_ccd,has_crdc,raw_size_value,raw_size_source,score_size_small,score_size_large,raw_student_teacher_ratio,raw_attention_source,score_attention,grade_span_min,grade_span_max,serves_elementary,serves_middle,serves_high,raw_grade_source,raw_diversity_entropy,raw_diversity_source,score_diversity
0,PUB_10000500870,albertville middle school,albertville,AL,35950,True,False,False,False,False,False,True,False,860.0,teachers_est_x20,0.138033,0.861967,,,,7.0,8.0,False,True,False,ccd_offered_flags,,,
1,PUB_10000500871,albertville high school,albertville,AL,35950,True,False,False,False,False,False,True,False,1820.0,teachers_est_x20,0.042495,0.957505,,,,9.0,12.0,False,False,True,ccd_offered_flags,,,
2,PUB_10000500879,albertville intermediate school,albertville,AL,35950,True,False,False,False,False,False,True,False,840.0,teachers_est_x20,0.14103,0.85897,,,,5.0,6.0,True,True,False,ccd_offered_flags,,,



X_school shape: (124619, 10)
school_index shape: (124619, 2)


| | school_id | row_index |
|---|---|---|
| 0 | PUB_10000500870 | 0 |
| 1 | PUB_10000500871 | 1 |
| 2 | PUB_10000500879 | 2 |
| 3 | PUB_10000500889 | 3 |
| 4 | PUB_10000501616 | 4 |



feature_config features: 10
first 10 features: ['tag_ib', 'tag_cais', 'tag_ams_montessori', 'tag_waldorf', 'serves_elementary', 'serves_middle', 'serves_high', 'score_size_small', 'score_attention', 'score_diversity']

 school_index.school_id coverage matches schools_master_v2.school_id

 Section 01 complete — artifacts loaded and aligned
feat_to_col sample: [('tag_ib', 0), ('tag_cais', 1), ('tag_ams_montessori', 2), ('tag_waldorf', 3), ('serves_elementary', 4)]


## 01.A Create Preference Segments v0 (Deterministic Config)

We externalize Preference Segments v0 into a versioned JSON config:
- Lives in `config/preference_segments_v0.json` under the project root
- References only existing features from `feature_config_master_v2.json`
- Deterministic baseline for validation (v0 → v1)


In [52]:
# ============================================
# Create preference_segments_v0.json (Deterministic baseline)
# Uses a timezone-aware UTC timestamp
# ============================================

import json
from datetime import datetime, timezone

# Ensure config directory exists
CONFIG_DIR.mkdir(parents=True, exist_ok=True)

# Only allow features that exist in the v2 feature space
allowed_features = set(feature_names)  # from Section 01

segments_v0 = {
    "meta": {
        "name": "Preference Segments v0",
        "version": "0.1.0",
        "created_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "notes": "Deterministic baseline segments for Notebook 07 validation. Uses only v2 feature space.",
        "scoring_intent": "Weighted linear alignment; binary tags are strong signals, dense scores are tie-breakers."
    },
    "segments": {
        "academic_first": {
            "label": "Academic First",
            "description": "Strong academics with structured rigor and credentials.",
            "features": [
                {"name": "tag_ib", "value": 1.0, "weight": 5.0},
                {"name": "tag_cais", "value": 1.0, "weight": 5.0},
                {"name": "serves_middle", "value": 1.0, "weight": 1.0},
                {"name": "serves_high", "value": 1.0, "weight": 1.0},
                {"name": "score_attention", "value": 1.0, "weight": 0.8},
                {"name": "score_diversity", "value": 1.0, "weight": 0.6},
            ],
        },

        "small_nurturing": {
            "label": "Small & Nurturing",
            "description": "Intimate environment with high individual attention.",
            "features": [
                {"name": "score_size_small", "value": 1.0, "weight": 2.5},
                {"name": "score_attention", "value": 1.0, "weight": 2.5},
                {"name": "serves_elementary", "value": 1.0, "weight": 1.0},
                {"name": "score_diversity", "value": 1.0, "weight": 0.5},
            ],
        },

        "progressive_balanced": {
            "label": "Progressive Balanced",
            "description": "Balanced academics with progressive philosophy.",
            "features": [
                {"name": "tag_ams_montessori", "value": 1.0, "weight": 2.0},
                {"name": "tag_waldorf", "value": 1.0, "weight": 2.0},
                {"name": "score_attention", "value": 1.0, "weight": 1.0},
                {"name": "score_diversity", "value": 1.0, "weight": 1.0},
                {"name": "score_size_small", "value": 1.0, "weight": 0.8},
            ],
        },

        "balanced_general": {
            "label": "Balanced General",
            "description": "Well-rounded schools with no extreme trade-offs.",
            "features": [
                {"name": "serves_elementary", "value": 1.0, "weight": 1.0},
                {"name": "serves_middle", "value": 1.0, "weight": 1.0},
                {"name": "serves_high", "value": 1.0, "weight": 1.0},
                {"name": "score_attention", "value": 1.0, "weight": 1.0},
                {"name": "score_diversity", "value": 1.0, "weight": 1.0},
                {"name": "score_size_small", "value": 1.0, "weight": 0.8},
            ],
        },
    },
}

# ----------------------------
# Validate referenced features
# ----------------------------
invalid_refs = []
for seg_key, seg in segments_v0["segments"].items():
    for f in seg["features"]:
        if f["name"] not in allowed_features:
            invalid_refs.append((seg_key, f["name"]))

if invalid_refs:
    raise ValueError(
        "Segments reference features not in feature space:\n"
        + "\n".join(f"{seg}: {feat}" for seg, feat in invalid_refs)
    )

# ----------------------------
# Write JSON to config/
# ----------------------------
with open(SEGMENTS_V0_PATH, "w") as f:
    json.dump(segments_v0, f, indent=2)

print("preference_segments_v0.json written")
print("Path:", SEGMENTS_V0_PATH)
print("Segments:", list(segments_v0["segments"].keys()))
print("created_utc:", segments_v0["meta"]["created_utc"])


preference_segments_v0.json written
Path: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/config/preference_segments_v0.json
Segments: ['academic_first', 'small_nurturing', 'progressive_balanced', 'balanced_general']
created_utc: 2025-12-27T17:23:43Z


## 02. Segment Scoring (v0 Validation)

In this section, we:
- Load `preference_segments_v0.json`
- Convert each preference segment into a **segment vector** aligned with the school feature space
- Compute a **weighted linear alignment score** for every school
- Produce a top-K ranked list of schools per segment

This validates whether deterministic Preference Segments v0 produce sensible, differentiable results.
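
Written out, the weighted linear alignment score of school $i$ under segment $k$ is as follows. The symbols $F$, $v^{(k)}$, and $w^{(k)}$ are notation introduced here for illustration.

$$
s_i^{(k)} = \sum_{j=1}^{F} X_{ij}\, v_j^{(k)}\, w_j^{(k)}
$$

Here $X$ is the school matrix with $F$ feature columns, $v^{(k)}$ holds the segment's desired feature values, and $w^{(k)}$ holds its feature weights. Features a segment does not list get value 0 and weight 0, so they contribute nothing. In v0 every listed feature has value 1.0, so the score reduces to a weighted sum of the school's own feature values. Assuming every feature value lies in $[0, 1]$ (an assumption, not verified here), the `academic_first` score is bounded above by $5 + 5 + 1 + 1 + 0.8 + 0.6 = 13.4$.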


In [55]:
# ============================================
# Section 02 — Build segment vectors & score schools
# ============================================

# ----------------------------
# 02.1 Load preference segments v0
# ----------------------------
with open(SEGMENTS_V0_PATH, "r") as f:
    segments_cfg = json.load(f)

segments = segments_cfg["segments"]
segment_keys = list(segments.keys())

print("Loaded segments:", segment_keys)

# ----------------------------
# 02.2 Build segment vectors (aligned to feature space)
# ----------------------------
n_features = X_school.shape[1]

segment_vectors = {}
segment_weights = {}

for seg_key, seg in segments.items():
    vec = np.zeros(n_features, dtype=float)
    wts = np.zeros(n_features, dtype=float)

    for f in seg["features"]:
        col = feat_to_col[f["name"]]
        vec[col] = f.get("value", 1.0)
        wts[col] = f.get("weight", 1.0)

    segment_vectors[seg_key] = vec
    segment_weights[seg_key] = wts

print("Built segment vectors for:", list(segment_vectors.keys()))

# ----------------------------
# 02.3 Scoring function (weighted linear alignment)
# ----------------------------
def score_schools(X, seg_vec, seg_wts):
    """
    X: school matrix (n_schools x n_features)
    seg_vec: desired feature values
    seg_wts: feature weights
    """
    # element-wise: school_value * desired_value * weight
    weighted = X * seg_vec * seg_wts
    return weighted.sum(axis=1)

# ----------------------------
# 02.4 Score all schools per segment
# ----------------------------
segment_scores = {}

for seg_key in segment_keys:
    scores = score_schools(
        X_school,
        segment_vectors[seg_key],
        segment_weights[seg_key],
    )
    segment_scores[seg_key] = scores

print("Scoring complete.")

# ----------------------------
# 02.5 Build top-K tables per segment
# ----------------------------
TOP_K = 50

top_tables = []

for seg_key, scores in segment_scores.items():
    df = pd.DataFrame({
        "school_id": school_index["school_id"],
        "score": scores,
    })

    df = (
        df.merge(
            schools[["school_id", "school_name", "city", "state", "is_public"]],
            on="school_id",
            how="left",
        )
        .sort_values("score", ascending=False)
        .head(TOP_K)
        .assign(segment=seg_key, rank=lambda x: range(1, len(x) + 1))
    )

    top_tables.append(df)

top_schools_df = pd.concat(top_tables, ignore_index=True)

# ----------------------------
# 02.6 Save results
# ----------------------------
top_schools_df.to_csv(NB07_SEGMENT_RANKS_PATH, index=False)

print("Section 02 complete")
print("Saved top schools per segment to:", NB07_SEGMENT_RANKS_PATH)
display(top_schools_df.head(10))


Loaded segments: ['academic_first', 'small_nurturing', 'progressive_balanced', 'balanced_general']
Built segment vectors for: ['academic_first', 'small_nurturing', 'progressive_balanced', 'balanced_general']
Scoring complete.
Section 02 complete
Saved top schools per segment to: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/reports/notebook07_segment_top_schools.csv


| | school_id | score | school_name | city | state | is_public | segment | rank |
|---|---|---|---|---|---|---|---|---|
| 0 | PRI_BB180318 | 12.974635 | silicon valley international school | palo alto | CA | False | academic_first | 1 |
| 1 | PRI_A0770343 | 12.951242 | escuela bilingue internacional | oakland | CA | False | academic_first | 2 |
| 2 | PRI_A0900353 | 12.915775 | the healdsburg school | healdsburg | CA | False | academic_first | 3 |
| 3 | PRI_00081873 | 8.05341 | south peninsula hebrew day school | sunnyvale | CA | False | academic_first | 4 |
| 4 | PRI_00078361 | 8.029544 | trinity school | menlo park | CA | False | academic_first | 5 |
| 5 | PRI_A9101385 | 8.023651 | stanbridge academy | san mateo | CA | False | academic_first | 6 |
| 6 | PRI_00093539 | 8.021807 | the brandeis school of san francisco | san francisco | CA | False | academic_first | 7 |
| 7 | PRI_A9700331 | 8.011884 | bowman school | palo alto | CA | False | academic_first | 8 |
| 8 | PRI_02157311 | 7.9945 | prospect sierra school | el cerrito | CA | False | academic_first | 9 |
| 9 | PRI_A9700620 | 7.990209 | the saklan school | moraga | CA | False | academic_first | 10 |


## 03. Segment Math Validation (Diagnostics)

In this section, we sanity-check segment scoring by answering:
- Do segments have reasonable score distributions?
- How many schools get a non-zero score per segment?
- Are segments overly overlapping (same schools winning everywhere)?
- What features drive the top results (quick contribution breakdown)?

In [58]:
# ============================================
# Section 03 — Diagnostics
# ============================================

# ----------------------------
# 03.1 Score distributions + coverage
# ----------------------------
rows = []
for seg_key, scores in segment_scores.items():
    scores = np.asarray(scores).astype(float)

    rows.append({
        "segment": seg_key,
        "n_schools": int(scores.shape[0]),
        "nonzero_count": int((scores > 0).sum()),
        "nonzero_pct": float((scores > 0).mean() * 100.0),
        "min": float(scores.min()),
        "p50": float(np.percentile(scores, 50)),
        "p90": float(np.percentile(scores, 90)),
        "p99": float(np.percentile(scores, 99)),
        "max": float(scores.max()),
        "mean": float(scores.mean()),
    })

diag_dist = pd.DataFrame(rows).sort_values("segment")
print("=== Score distribution summary ===")
display(diag_dist)

# Save for reporting
diag_dist_path = REPORTS_DIR / "notebook07_segment_diagnostics_distribution.csv"
diag_dist.to_csv(diag_dist_path, index=False)
print("Saved:", diag_dist_path)

# ----------------------------
# 03.2 Overlap: how often segments agree on the same top schools?
# ----------------------------
TOP_K = 50

top_ids_by_seg = {}
for seg_key, scores in segment_scores.items():
    top_idx = np.argsort(scores)[::-1][:TOP_K]
    # top_idx holds matrix row positions, so index positionally rather than by label
    top_ids_by_seg[seg_key] = set(school_index["school_id"].iloc[top_idx].tolist())

overlap_rows = []
seg_keys = list(top_ids_by_seg.keys())
for i in range(len(seg_keys)):
    for j in range(i + 1, len(seg_keys)):
        a, b = seg_keys[i], seg_keys[j]
        inter = top_ids_by_seg[a].intersection(top_ids_by_seg[b])
        overlap_rows.append({
            "seg_a": a,
            "seg_b": b,
            "topk": TOP_K,
            "intersection_count": len(inter),
            "jaccard": (len(inter) / len(top_ids_by_seg[a].union(top_ids_by_seg[b]))) if (top_ids_by_seg[a] or top_ids_by_seg[b]) else 0.0
        })

diag_overlap = pd.DataFrame(overlap_rows).sort_values(["intersection_count", "jaccard"], ascending=False)
print("\n=== Top-K overlap between segments ===")
display(diag_overlap)

diag_overlap_path = REPORTS_DIR / "notebook07_segment_diagnostics_overlap.csv"
diag_overlap.to_csv(diag_overlap_path, index=False)
print("Saved:", diag_overlap_path)

# ----------------------------
# 03.3 Quick explanation: top school's feature contributions for each segment
#     (uses the same math: X * seg_vec * seg_wts)
# ----------------------------
feature_list = feature_names  # from Section 01

explain_rows = []
for seg_key, scores in segment_scores.items():
    best_row = int(np.argmax(scores))
    school_id = school_index.loc[best_row, "school_id"]

    # Pull school record
    school_row = schools.loc[schools["school_id"] == school_id].iloc[0]

    contrib = X_school[best_row, :] * segment_vectors[seg_key] * segment_weights[seg_key]
    contrib_pairs = list(zip(feature_list, contrib.tolist()))
    contrib_pairs.sort(key=lambda x: x[1], reverse=True)

    top_contrib = contrib_pairs[:6]  # top 6 drivers

    explain_rows.append({
        "segment": seg_key,
        "top_school_id": school_id,
        "top_school_name": school_row.get("school_name"),
        "top_school_city": school_row.get("city"),
        "top_school_state": school_row.get("state"),
        "score": float(scores[best_row]),
        "top_feature_drivers": "; ".join([f"{k}={v:.3f}" for k, v in top_contrib if v > 0]),
    })

diag_explain = pd.DataFrame(explain_rows).sort_values("segment")
print("\n=== Top school drivers per segment (quick check) ===")
display(diag_explain)

diag_explain_path = REPORTS_DIR / "notebook07_segment_diagnostics_top_drivers.csv"
diag_explain.to_csv(diag_explain_path, index=False)
print("Saved:", diag_explain_path)

print("\nSection 03 complete — diagnostics saved to /reports")


=== Score distribution summary ===


| | segment | n_schools | nonzero_count | nonzero_pct | min | p50 | p90 | p99 | max | mean |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | academic_first | 124619 | 124619 | 100.0 | 0.3 | 1.574423 | 2.761185 | 3.005099 | 12.974635 | 1.516205 |
| 3 | balanced_general | 124619 | 124619 | 100.0 | 0.534558 | 2.240349 | 4.306014 | 4.8 | 5.229274 | 2.616354 |
| 2 | progressive_balanced | 124619 | 124619 | 100.0 | 0.5 | 1.147619 | 1.580755 | 2.001933 | 4.091309 | 1.146418 |
| 1 | small_nurturing | 124619 | 124619 | 100.0 | 0.25 | 2.855746 | 3.985404 | 5.276101 | 6.02898 | 2.842645 |


Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/reports/notebook07_segment_diagnostics_distribution.csv

=== Top-K overlap between segments ===


| | seg_a | seg_b | topk | intersection_count | jaccard |
|---|---|---|---|---|---|
| 3 | small_nurturing | progressive_balanced | 50 | 23 | 0.298701 |
| 4 | small_nurturing | balanced_general | 50 | 2 | 0.020408 |
| 1 | academic_first | progressive_balanced | 50 | 1 | 0.010101 |
| 5 | progressive_balanced | balanced_general | 50 | 1 | 0.010101 |
| 0 | academic_first | small_nurturing | 50 | 0 | 0.0 |
| 2 | academic_first | balanced_general | 50 | 0 | 0.0 |


Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/reports/notebook07_segment_diagnostics_overlap.csv

=== Top school drivers per segment (quick check) ===


| | segment | top_school_id | top_school_name | top_school_city | top_school_state | score | top_feature_drivers |
|---|---|---|---|---|---|---|---|
| 0 | academic_first | PRI_BB180318 | silicon valley international school | palo alto | CA | 12.974635 | tag_ib=5.000; tag_cais=5.000; serves_middle=1.... |
| 3 | balanced_general | PRI_A0500573 | paideia educational heritage | santa rosa | CA | 5.229274 | serves_elementary=1.000; serves_middle=1.000; ... |
| 2 | progressive_balanced | PRI_A0100729 | angels montessori preschool | concord | CA | 4.091309 | tag_ams_montessori=2.000; score_attention=0.97... |
| 1 | small_nurturing | PRI_A0106319 | alpine montessori | oak ridge | NJ | 6.02898 | score_attention=2.500; score_size_small=2.279;... |


Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/reports/notebook07_segment_diagnostics_top_drivers.csv

Section 03 complete — diagnostics saved to /reports


## 04. Cross-Segment Ranking Comparison

We compare segments on:
- Top-K overlap (counts + Jaccard similarity)
- Rank movement of the same schools across segments
- Biggest “winners” and “losers” when switching segments (vs a baseline)

Goal: prove segments produce meaningfully different outcomes.


In [63]:
# ============================================
# Section 04 — Cross-Segment Ranking Comparison
# ============================================

SEG_KEYS = list(segment_scores.keys())
BASELINE_SEG = "balanced_general"
TOP_K = 50

assert BASELINE_SEG in SEG_KEYS, f"Baseline segment '{BASELINE_SEG}' not found. Have: {SEG_KEYS}"

# ----------------------------
# Helpers
# ----------------------------
def get_topk_school_ids(seg_key: str, k: int = 50) -> list[str]:
    scores = np.asarray(segment_scores[seg_key])
    top_idx = np.argsort(scores)[::-1][:k]
    # top_idx holds matrix row positions, so index positionally rather than by label
    return school_index["school_id"].iloc[top_idx].tolist()

def build_rank_map(seg_key: str) -> dict[str, int]:
    """Returns school_id -> rank(1..N) for the full ranking (1 = best)."""
    scores = np.asarray(segment_scores[seg_key])
    order = np.argsort(scores)[::-1]
    # order holds matrix row positions, so index positionally rather than by label
    ids = school_index["school_id"].iloc[order].tolist()
    return {sid: i + 1 for i, sid in enumerate(ids)}

# ----------------------------
# 04.1 Overlap matrix (Top-K)
# ----------------------------
topk_sets = {seg: set(get_topk_school_ids(seg, TOP_K)) for seg in SEG_KEYS}

# Initialize with explicit fill values so the requested dtypes can actually hold the data
overlap_count = pd.DataFrame(0, index=SEG_KEYS, columns=SEG_KEYS, dtype=int)
overlap_jacc = pd.DataFrame(0.0, index=SEG_KEYS, columns=SEG_KEYS, dtype=float)

for a in SEG_KEYS:
    for b in SEG_KEYS:
        inter = len(topk_sets[a].intersection(topk_sets[b]))
        union = len(topk_sets[a].union(topk_sets[b]))
        overlap_count.loc[a, b] = inter
        overlap_jacc.loc[a, b] = (inter / union) if union else 0.0

overlap_count_path = REPORTS_DIR / "notebook07_section04_topk_overlap_counts.csv"
overlap_jacc_path  = REPORTS_DIR / "notebook07_section04_topk_overlap_jaccard.csv"
overlap_count.to_csv(overlap_count_path)
overlap_jacc.to_csv(overlap_jacc_path)

print("Saved overlap matrices:")
print(" -", overlap_count_path)
print(" -", overlap_jacc_path)

display(overlap_count)
display(overlap_jacc)

# ----------------------------
# 04.2 Rank movement table (union of all Top-K schools)
# ----------------------------
union_ids = set().union(*[topk_sets[s] for s in SEG_KEYS])
union_ids = sorted(union_ids)

rank_maps = {seg: build_rank_map(seg) for seg in SEG_KEYS}

rank_rows = []
for sid in union_ids:
    row = {"school_id": sid}
    for seg in SEG_KEYS:
        row[f"rank__{seg}"] = rank_maps[seg].get(sid, None)
    rank_rows.append(row)

rank_movement = pd.DataFrame(rank_rows)

# Attach basic metadata
meta_cols = [c for c in ["school_id", "school_name", "city", "state", "is_public"] if c in schools.columns]
rank_movement = rank_movement.merge(schools[meta_cols], on="school_id", how="left")

rank_movement_path = REPORTS_DIR / "notebook07_section04_rank_movement_union_topk.csv"
rank_movement.to_csv(rank_movement_path, index=False)
print("Saved rank movement table:", rank_movement_path)

display(rank_movement.head(15))

# ----------------------------
# 04.3 Biggest winners/losers vs baseline (rank delta)
# ----------------------------
# Define delta = baseline_rank - segment_rank
# Positive delta => segment ranks it higher (better) than baseline
baseline_rank = rank_movement[f"rank__{BASELINE_SEG}"]

delta_tables = []

for seg in SEG_KEYS:
    if seg == BASELINE_SEG:
        continue

    seg_rank = rank_movement[f"rank__{seg}"]
    delta = baseline_rank - seg_rank

    tmp = rank_movement.copy()
    tmp["baseline_segment"] = BASELINE_SEG
    tmp["compare_segment"] = seg
    tmp["delta_rank"] = delta

    # Keep only rows where both ranks exist
    tmp = tmp[baseline_rank.notna() & seg_rank.notna()].copy()

    # Winners: ranked much higher in seg vs baseline
    # (.copy() so adding the "direction" column does not write into a slice of tmp)
    winners = tmp.sort_values("delta_rank", ascending=False).head(20).copy()
    winners["direction"] = "winner_vs_baseline"

    # Losers: ranked much lower in seg vs baseline
    losers = tmp.sort_values("delta_rank", ascending=True).head(20).copy()
    losers["direction"] = "loser_vs_baseline"

    delta_tables.append(winners)
    delta_tables.append(losers)

rank_delta_df = pd.concat(delta_tables, ignore_index=True)

rank_delta_path = REPORTS_DIR / "notebook07_section04_rank_delta_vs_baseline.csv"
rank_delta_df.to_csv(rank_delta_path, index=False)
print("Saved rank delta table:", rank_delta_path)

display(rank_delta_df.head(25))

print("\nSection 04 complete")


Saved overlap matrices:
 - /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/reports/notebook07_section04_topk_overlap_counts.csv
 - /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/reports/notebook07_section04_topk_overlap_jaccard.csv


|  | academic_first | small_nurturing | progressive_balanced | balanced_general |
|---|---|---|---|---|
| academic_first | 50.0 | 0.0 | 1.0 | 0.0 |
| small_nurturing | 0.0 | 50.0 | 23.0 | 2.0 |
| progressive_balanced | 1.0 | 23.0 | 50.0 | 1.0 |
| balanced_general | 0.0 | 2.0 | 1.0 | 50.0 |


|  | academic_first | small_nurturing | progressive_balanced | balanced_general |
|---|---|---|---|---|
| academic_first | 1.0 | 0.0 | 0.010101 | 0.0 |
| small_nurturing | 0.0 | 1.0 | 0.298701 | 0.020408 |
| progressive_balanced | 0.010101 | 0.298701 | 1.0 | 0.010101 |
| balanced_general | 0.0 | 0.020408 | 0.010101 | 1.0 |


Saved rank movement table: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/reports/notebook07_section04_rank_movement_union_topk.csv


|  | school_id | rank__academic_first | rank__small_nurturing | rank__progressive_balanced | rank__balanced_general | school_name | city | state | is_public |
|---|---|---|---|---|---|---|---|---|---|
| 0 | PRI_00073771 | 13 | 17995 | 18729 | 8524 | sacred heart schools atherton | atherton | CA | False |
| 1 | PRI_00078361 | 5 | 5436 | 7609 | 2009 | trinity school | menlo park | CA | False |
| 2 | PRI_00079514 | 24 | 13531 | 14932 | 6556 | live oak school | san francisco | CA | False |
| 3 | PRI_00080858 | 23 | 13422 | 14810 | 6489 | town school for boys | san francisco | CA | False |
| 4 | PRI_00081421 | 46 | 16617 | 18277 | 8268 | seven hills school | walnut creek | CA | False |
| 5 | PRI_00081556 | 42 | 13845 | 15700 | 7052 | episcopal day school of st matthew | san mateo | CA | False |
| 6 | PRI_00081873 | 4 | 5889 | 7872 | 2141 | south peninsula hebrew day school | sunnyvale | CA | False |
| 7 | PRI_00082141 | 2326 | 13454 | 16 | 6462 | sacramento waldorf school | fair oaks | CA | False |
| 8 | PRI_00083054 | 34 | 13470 | 15068 | 6651 | hillbrook school | los gatos | CA | False |
| 9 | PRI_00083429 | 50 | 21138 | 28056 | 11000 | head royce school | oakland | CA | False |


Saved rank delta table: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/reports/notebook07_section04_rank_delta_vs_baseline.csv


|  | school_id | rank__academic_first | rank__small_nurturing | rank__progressive_balanced | rank__balanced_general | school_name | city | state | is_public | baseline_segment | compare_segment | delta_rank | direction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PRI_00096733 | 11 | 27680 | 11420 | 20207 | charles armstrong school | belmont | CA | False | balanced_general | academic_first | 20196 | winner_vs_baseline |
| 1 | PRI_00083429 | 50 | 21138 | 28056 | 11000 | head royce school | oakland | CA | False | balanced_general | academic_first | 10950 | winner_vs_baseline |
| 2 | PRI_00073771 | 13 | 17995 | 18729 | 8524 | sacred heart schools atherton | atherton | CA | False | balanced_general | academic_first | 8511 | winner_vs_baseline |
| 3 | PRI_00083495 | 49 | 16941 | 18698 | 8507 | katherine delmar burke school | san francisco | CA | False | balanced_general | academic_first | 8458 | winner_vs_baseline |
| 4 | PRI_00081421 | 46 | 16617 | 18277 | 8268 | seven hills school | walnut creek | CA | False | balanced_general | academic_first | 8222 | winner_vs_baseline |
| 5 | PRI_00083881 | 47 | 15255 | 17707 | 7926 | black pine circle school | berkeley | CA | False | balanced_general | academic_first | 7879 | winner_vs_baseline |
| 6 | PRI_A9100781 | 41 | 14617 | 17212 | 7631 | ecole bilingue de berkeley | berkeley | CA | False | balanced_general | academic_first | 7590 | winner_vs_baseline |
| 7 | PRI_02013539 | 37 | 14206 | 16608 | 7278 | almaden country day school | san jose | CA | False | balanced_general | academic_first | 7241 | winner_vs_baseline |
| 8 | PRI_BB180318 | 1 | 14507 | 15754 | 7088 | silicon valley international school | palo alto | CA | False | balanced_general | academic_first | 7087 | winner_vs_baseline |
| 9 | PRI_00081556 | 42 | 13845 | 15700 | 7052 | episcopal day school of st matthew | san mateo | CA | False | balanced_general | academic_first | 7010 | winner_vs_baseline |



Section 04 complete


## 05. Tie Behavior by Segment

We quantify how “sharp” or “blurry” each segment’s ranking is by measuring **ties**:
- **Exact ties**: identical final scores
- **Rounded ties** (UX lens): ties after rounding scores (default: 3 decimals)

We report:
- Unique score counts
- Largest tie group size
- % of schools that belong to tied groups
- Tie behavior in important bands (Top 50, Top 200, Top 1%)
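
As a minimal illustration of the two tie definitions, the sketch below counts tie groups on a handful of made-up scores, first exactly and then after rounding to 3 decimals. The `toy_scores` values and the `tie_group_sizes` helper are hypothetical and exist only for this example; they are not project data or part of the notebook's own code.

```python
import numpy as np

def tie_group_sizes(values, decimals=None):
    """Return the sizes of all groups of identical scores (size > 1), largest first.

    If `decimals` is given, scores are rounded before grouping (the "UX lens").
    """
    arr = np.asarray(values, dtype=float)
    if decimals is not None:
        arr = np.round(arr, decimals)
    _, counts = np.unique(arr, return_counts=True)
    return sorted(counts[counts > 1].tolist(), reverse=True)

# Toy scores (illustrative only)
toy_scores = [2.5, 2.5, 1.2004, 1.2001, 1.1996, 0.75]

exact_ties = tie_group_sizes(toy_scores)                # only the two 2.5 scores tie exactly
rounded_ties = tie_group_sizes(toy_scores, decimals=3)  # the three ~1.200 scores also merge

print("exact tie groups:  ", exact_ties)
print("rounded tie groups:", rounded_ties)
```

Rounding can only merge groups, never split them, so the rounded tie percentage is always at least as large as the exact one.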


In [66]:
# ============================================
# Section 05 — Tie Behavior by Segment
# ============================================

SEG_KEYS = list(segment_scores.keys())
N = X_school.shape[0]

# Bands to examine (counts)
BANDS = {
    "top_50": 50,
    "top_200": 200,
    "top_1pct": max(1, int(np.ceil(N * 0.01))),
}

ROUND_DECIMALS = 3  # UX lens

def tie_stats(values: np.ndarray) -> dict:
    """
    Compute tie stats for a 1D array of scores.
    Ties are groups where count > 1 for a score value.
    """
    values = np.asarray(values)
    uniq, counts = np.unique(values, return_counts=True)
    tie_counts = counts[counts > 1]

    n = int(values.shape[0])
    n_unique = int(uniq.shape[0])
    n_tie_groups = int((counts > 1).sum())
    largest_tie = int(tie_counts.max()) if tie_counts.size else 1
    pct_in_ties = float((tie_counts.sum() / n) * 100.0) if tie_counts.size else 0.0

    return {
        "n": n,
        "n_unique": n_unique,
        "n_tie_groups": n_tie_groups,
        "largest_tie": largest_tie,
        "pct_in_ties": pct_in_ties,
    }

def top_band_scores(seg_key: str, k: int) -> np.ndarray:
    scores = np.asarray(segment_scores[seg_key]).astype(float)
    idx = np.argsort(scores)[::-1][:k]
    return scores[idx]

# ----------------------------
# 05.1 Whole-population tie summary (exact + rounded)
# ----------------------------
summary_rows = []

for seg in SEG_KEYS:
    scores = np.asarray(segment_scores[seg]).astype(float)

    exact = tie_stats(scores)
    rounded = tie_stats(np.round(scores, ROUND_DECIMALS))

    summary_rows.append({
        "segment": seg,
        "n_schools": exact["n"],

        "n_unique_exact": exact["n_unique"],
        "n_tie_groups_exact": exact["n_tie_groups"],
        "largest_tie_exact": exact["largest_tie"],
        "pct_in_ties_exact": exact["pct_in_ties"],

        f"n_unique_r{ROUND_DECIMALS}": rounded["n_unique"],
        f"n_tie_groups_r{ROUND_DECIMALS}": rounded["n_tie_groups"],
        f"largest_tie_r{ROUND_DECIMALS}": rounded["largest_tie"],
        f"pct_in_ties_r{ROUND_DECIMALS}": rounded["pct_in_ties"],
    })

tie_summary = pd.DataFrame(summary_rows).sort_values("segment")
display(tie_summary)

tie_summary_path = REPORTS_DIR / "notebook07_section05_tie_summary.csv"
tie_summary.to_csv(tie_summary_path, index=False)
print("Saved:", tie_summary_path)

# ----------------------------
# 05.2 Tie behavior in top bands (exact + rounded)
# ----------------------------
band_rows = []

for seg in SEG_KEYS:
    for band_name, k in BANDS.items():
        band_scores = top_band_scores(seg, k)

        exact = tie_stats(band_scores)
        rounded = tie_stats(np.round(band_scores, ROUND_DECIMALS))

        band_rows.append({
            "segment": seg,
            "band": band_name,
            "k": k,

            "n_unique_exact": exact["n_unique"],
            "largest_tie_exact": exact["largest_tie"],
            "pct_in_ties_exact": exact["pct_in_ties"],

            f"n_unique_r{ROUND_DECIMALS}": rounded["n_unique"],
            f"largest_tie_r{ROUND_DECIMALS}": rounded["largest_tie"],
            f"pct_in_ties_r{ROUND_DECIMALS}": rounded["pct_in_ties"],
        })

tie_topk = pd.DataFrame(band_rows).sort_values(["band", "segment"])
display(tie_topk)

tie_topk_path = REPORTS_DIR / "notebook07_section05_tie_topk.csv"
tie_topk.to_csv(tie_topk_path, index=False)
print("Saved:", tie_topk_path)

# ----------------------------
# 05.3 (Optional) Tie group size distribution (rounded scores)
#     Helps see whether ties are mostly size=2/3 or huge clumps.
# ----------------------------
dist_rows = []

for seg in SEG_KEYS:
    scores = np.round(np.asarray(segment_scores[seg]).astype(float), ROUND_DECIMALS)
    _, counts = np.unique(scores, return_counts=True)
    tie_counts = counts[counts > 1]

    if tie_counts.size == 0:
        dist_rows.append({"segment": seg, "tie_group_size": 1, "num_groups": 0})
        continue

    sizes, num_groups = np.unique(tie_counts, return_counts=True)
    for s, ng in zip(sizes.tolist(), num_groups.tolist()):
        dist_rows.append({
            "segment": seg,
            "tie_group_size": int(s),
            "num_groups": int(ng),
        })

tie_dist = pd.DataFrame(dist_rows).sort_values(["segment", "tie_group_size"])
tie_dist_path = REPORTS_DIR / f"notebook07_section05_tie_group_size_dist_r{ROUND_DECIMALS}.csv"  # keep filename in sync with ROUND_DECIMALS
tie_dist.to_csv(tie_dist_path, index=False)
print("Saved:", tie_dist_path)

print("\nSection 05 complete — tie metrics saved to /reports")
print("Bands:", BANDS, "| Rounded decimals:", ROUND_DECIMALS)


|  | segment | n_schools | n_unique_exact | n_tie_groups_exact | largest_tie_exact | pct_in_ties_exact | n_unique_r3 | n_tie_groups_r3 | largest_tie_r3 | pct_in_ties_r3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | academic_first | 124619 | 67244 | 2652 | 15862 | 48.168417 | 2210 | 2107 | 15947 | 99.917348 |
| 3 | balanced_general | 124619 | 86730 | 8994 | 1796 | 37.621069 | 4192 | 3891 | 1797 | 99.758464 |
| 2 | progressive_balanced | 124619 | 79130 | 8613 | 3797 | 43.413926 | 1648 | 1585 | 3842 | 99.949446 |
| 1 | small_nurturing | 124619 | 84440 | 9026 | 3074 | 39.484348 | 4149 | 3842 | 3165 | 99.753649 |


Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/reports/notebook07_section05_tie_summary.csv


|  | segment | band | k | n_unique_exact | largest_tie_exact | pct_in_ties_exact | n_unique_r3 | largest_tie_r3 | pct_in_ties_r3 |
|---|---|---|---|---|---|---|---|---|---|
| 2 | academic_first | top_1pct | 1247 | 727 | 51 | 53.889334 | 166 | 54 | 94.707298 |
| 11 | balanced_general | top_1pct | 1247 | 585 | 308 | 65.517241 | 225 | 321 | 94.386528 |
| 8 | progressive_balanced | top_1pct | 1247 | 305 | 77 | 87.890938 | 165 | 79 | 96.551724 |
| 5 | small_nurturing | top_1pct | 1247 | 313 | 93 | 87.089014 | 238 | 93 | 92.862871 |
| 1 | academic_first | top_200 | 200 | 152 | 15 | 31.0 | 99 | 17 | 67.5 |
| 10 | balanced_general | top_200 | 200 | 142 | 11 | 44.5 | 97 | 11 | 74.0 |
| 7 | progressive_balanced | top_200 | 200 | 28 | 60 | 89.0 | 28 | 60 | 89.0 |
| 4 | small_nurturing | top_200 | 200 | 15 | 49 | 97.5 | 14 | 49 | 97.5 |
| 0 | academic_first | top_50 | 50 | 48 | 2 | 8.0 | 39 | 3 | 42.0 |
| 9 | balanced_general | top_50 | 50 | 35 | 4 | 50.0 | 33 | 4 | 56.0 |


Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/reports/notebook07_section05_tie_topk.csv
Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/reports/notebook07_section05_tie_group_size_dist_r3.csv

Section 05 complete — tie metrics saved to /reports
Bands: {'top_50': 50, 'top_200': 200, 'top_1pct': 1247} | Rounded decimals: 3


## 06. Explainability Differences (Segment Lens)

In this section, we generate **side-by-side explanations** for the *same school* under multiple Preference Segments.

For a small set of spotlight schools, we compute per segment:
- the segment score
- the top contributing features (feature contribution = `school_value × segment_value × segment_weight`)
- a short human-readable explanation

Goal: demonstrate that **segment choice changes not only ranking, but also the “why”** — making the system interpretable and trustworthy for parents.
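
To make the contribution formula concrete, here is a small sketch that applies `school_value × segment_value × segment_weight` to one made-up school row and lists the positive drivers, largest first. The three feature names are taken from this notebook, but every numeric value below is an assumption chosen for illustration, not an actual segment configuration.

```python
import numpy as np

# Hypothetical values, for illustration only
feature_names = ["tag_ib", "score_attention", "score_size_small"]
school_row = np.array([1.0, 0.75, 0.5])      # one school's feature values
segment_vector = np.array([1.0, 1.0, 0.0])   # which features the segment cares about
segment_weight = np.array([3.0, 1.0, 2.0])   # how much each feature counts

# Element-wise contribution per feature; the segment score is their sum
contrib = school_row * segment_vector * segment_weight
score = float(contrib.sum())

# Keep only positive contributions, sorted from largest to smallest
top_drivers = sorted(
    ((name, float(val)) for name, val in zip(feature_names, contrib) if val > 0),
    key=lambda pair: pair[1],
    reverse=True,
)

print("score:", score)
print("top drivers:", top_drivers)
```

Because `score_size_small` has a segment value of 0 here, it contributes nothing and never appears as a driver, even though the school has a non-zero value for it. The same school under a different segment vector would produce a different driver list, which is the effect this section sets out to show.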

In [69]:
# ============================================
# Section 06 — Explainability Differences (Segment Lens)
# ============================================

SEG_KEYS = list(segment_scores.keys())
TOP_DRIVERS_N = 6

# ----------------------------
# 06.1 Pick spotlight schools
#     - Top 1 school from each segment
#     - + optional movers from Section 04 rank_delta_df (if it exists)
# ----------------------------
spotlight_ids = []

# top-1 per segment
for seg in SEG_KEYS:
    scores = np.asarray(segment_scores[seg]).astype(float)
    best_row = int(np.argmax(scores))
    spotlight_ids.append(school_index.loc[best_row, "school_id"])

# optional: add a few big movers (winner/loser) if rank_delta_df exists
if "rank_delta_df" in globals() and isinstance(rank_delta_df, pd.DataFrame) and len(rank_delta_df) > 0:
    # pick 2 winners + 2 losers for one compare segment if available
    # (keep simple: use the first compare segment present)
    first_compare = rank_delta_df["compare_segment"].iloc[0]
    winners = (
        rank_delta_df[(rank_delta_df["compare_segment"] == first_compare) & (rank_delta_df["direction"] == "winner_vs_baseline")]
        .head(2)["school_id"].tolist()
    )
    losers = (
        rank_delta_df[(rank_delta_df["compare_segment"] == first_compare) & (rank_delta_df["direction"] == "loser_vs_baseline")]
        .head(2)["school_id"].tolist()
    )
    spotlight_ids.extend(winners + losers)

# de-dup while preserving order (dict keys keep first-seen insertion order)
spotlight_ids = list(dict.fromkeys(spotlight_ids))

print("Spotlight schools:", len(spotlight_ids))
print(spotlight_ids)

# ----------------------------
# 06.2 Helper: build school_id -> row index mapping
# ----------------------------
sid_to_row = dict(zip(school_index["school_id"].tolist(), school_index["row_index"].tolist()))

# ----------------------------
# 06.3 Helper: feature contribution + short explanation generator
# ----------------------------
def build_short_explanation(seg_key: str, contrib_pairs: list[tuple[str, float]]) -> str:
    """
    Simple, deterministic explanation (no LLM):
    Use top positive drivers only.
    """
    drivers = [name for name, val in contrib_pairs if val > 0][:3]
    if not drivers:
        return "Low alignment on segment drivers."
    return "Strong alignment on: " + ", ".join(drivers) + "."

# ----------------------------
# 06.4 Compute explanations for each spotlight school under each segment
# ----------------------------
rows = []

meta_cols = [c for c in ["school_id", "school_name", "city", "state", "is_public"] if c in schools.columns]

for school_id in spotlight_ids:
    if school_id not in sid_to_row:
        continue

    r = int(sid_to_row[school_id])
    x = X_school[r, :]  # school feature row

    # meta
    meta = schools.loc[schools["school_id"] == school_id, meta_cols]
    if len(meta) == 0:
        school_name = None
        city = None
        state = None
        is_public = None
    else:
        m = meta.iloc[0]
        school_name = m.get("school_name")
        city = m.get("city")
        state = m.get("state")
        is_public = m.get("is_public")

    for seg_key in SEG_KEYS:
        # score (same math as Section 02)
        contrib = x * segment_vectors[seg_key] * segment_weights[seg_key]
        score = float(contrib.sum())

        contrib_pairs = list(zip(feature_names, contrib.tolist()))
        contrib_pairs.sort(key=lambda t: t[1], reverse=True)

        top_drivers = [(n, v) for n, v in contrib_pairs if v > 0][:TOP_DRIVERS_N]
        top_drivers_str = "; ".join([f"{n}={v:.3f}" for n, v in top_drivers]) if top_drivers else ""

        explanation = build_short_explanation(seg_key, contrib_pairs)

        rows.append({
            "school_id": school_id,
            "school_name": school_name,
            "city": city,
            "state": state,
            "is_public": is_public,
            "segment": seg_key,
            "score": score,
            "top_drivers": top_drivers_str,
            "explanation": explanation,
        })

explain_df = pd.DataFrame(rows).sort_values(["school_name", "segment"])
display(explain_df)

# ----------------------------
# 06.5 Save artifact
# ----------------------------
out_path = REPORTS_DIR / "notebook07_section06_explanations_spotlight.csv"
explain_df.to_csv(out_path, index=False)
print("Saved:", out_path)

print("\nSection 06 complete")


Spotlight schools: 8
['PRI_BB180318', 'PRI_A0106319', 'PRI_A0100729', 'PRI_A0500573', 'PRI_00096733', 'PRI_00083429', 'PRI_A1501746', 'PRI_BB161443']


|  | school_id | school_name | city | state | is_public | segment | score | top_drivers | explanation |
|---|---|---|---|---|---|---|---|---|---|
| 4 | PRI_A0106319 | alpine montessori | oak ridge | NJ | False | academic_first | 1.1 | score_attention=0.800; score_diversity=0.300 | Strong alignment on: score_attention, score_di... |
| 7 | PRI_A0106319 | alpine montessori | oak ridge | NJ | False | balanced_general | 3.229274 | serves_elementary=1.000; score_attention=1.000... | Strong alignment on: serves_elementary, score_... |
| 6 | PRI_A0106319 | alpine montessori | oak ridge | NJ | False | progressive_balanced | 2.229274 | score_attention=1.000; score_size_small=0.729;... | Strong alignment on: score_attention, score_si... |
| 5 | PRI_A0106319 | alpine montessori | oak ridge | NJ | False | small_nurturing | 6.02898 | score_attention=2.500; score_size_small=2.279;... | Strong alignment on: score_attention, score_si... |
| 8 | PRI_A0100729 | angels montessori preschool | concord | CA | False | academic_first | 1.079307 | score_attention=0.779; score_diversity=0.300 | Strong alignment on: score_attention, score_di... |
| 11 | PRI_A0100729 | angels montessori preschool | concord | CA | False | balanced_general | 3.091309 | serves_elementary=1.000; score_attention=0.974... | Strong alignment on: serves_elementary, score_... |
| 10 | PRI_A0100729 | angels montessori preschool | concord | CA | False | progressive_balanced | 4.091309 | tag_ams_montessori=2.000; score_attention=0.97... | Strong alignment on: tag_ams_montessori, score... |
| 9 | PRI_A0100729 | angels montessori preschool | concord | CA | False | small_nurturing | 5.614007 | score_attention=2.435; score_size_small=1.929;... | Strong alignment on: score_attention, score_si... |
| 28 | PRI_BB161443 | bais sarah inc | lakewood | NJ | False | academic_first | 1.1 | score_attention=0.800; score_diversity=0.300 | Strong alignment on: score_attention, score_di... |
| 31 | PRI_BB161443 | bais sarah inc | lakewood | NJ | False | balanced_general | 3.229274 | serves_elementary=1.000; score_attention=1.000... | Strong alignment on: serves_elementary, score_... |


Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/reports/notebook07_section06_explanations_spotlight.csv

Section 06 complete


## 07. Tier & Constraint Sanity Checks

This section validates that Preference Segments remain **safe and consistent**:

- Tier-defining tags (e.g., IB / CAIS) should dominate dense metrics.
- Dense metrics (size, attention, diversity) should act as tie-breakers — not overturn tier decisions.
- Grade-span features should behave consistently with segment intent.

We implement deterministic checks and report any violations.
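
One way to reason about "tags dominate dense metrics" from the configuration alone is a weight-budget check: if every dense feature is bounded, a tag is guaranteed to dominate when its weight exceeds the largest total the dense features could ever contribute. The sketch below assumes dense features lie in `[0, 1]`, that contributions are `value × weight`, and that all other features are equal between the two schools being compared. The `tag_dominates_dense` helper and the example weights are hypothetical; this is a sufficient condition for illustration, not the check this notebook runs.

```python
import numpy as np

def tag_dominates_dense(tag_weight, dense_weights, dense_max=1.0):
    """Sufficient condition for tier dominance.

    Assumes each dense feature lies in [0, dense_max] and contributes
    value * weight. If the tag weight exceeds the maximum possible dense
    total, a tagged school with zero dense scores still outscores an
    untagged school with maximal dense scores (all else being equal).
    """
    # Only positive weights can add to an untagged school's score
    max_dense_total = float(np.sum(np.clip(dense_weights, 0.0, None)) * dense_max)
    return tag_weight > max_dense_total

# Hypothetical weights, for illustration only
dense_weights = [1.0, 1.0, 0.5]   # e.g. size, attention, diversity

strong_tag = tag_dominates_dense(tag_weight=5.0, dense_weights=dense_weights)
weak_tag = tag_dominates_dense(tag_weight=2.0, dense_weights=dense_weights)

print("weight 5.0 dominates:", strong_tag)  # 5.0 > 2.5
print("weight 2.0 dominates:", weak_tag)    # 2.0 <= 2.5
```

A configuration that fails this check is not necessarily wrong, but it means dense metrics can overturn the tag for some pair of schools, which is what the empirical checks in this section look for.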

In [72]:
# ============================================
# Section 07 — Tier & Constraint Sanity Checks
# ============================================

SEG_KEYS = list(segment_scores.keys())
TOP_K = 500  # larger K makes the sanity checks more meaningful; note this rebinds the TOP_K used in Section 04

# helper: top-k row indices for a segment
def topk_indices(seg_key: str, k: int = 500) -> np.ndarray:
    scores = np.asarray(segment_scores[seg_key]).astype(float)
    return np.argsort(scores)[::-1][:k]

# helper: compute tag rates in a top-k set
def tag_rate_in_topk(seg_key: str, feature_name: str, k: int = 500) -> float:
    idx = topk_indices(seg_key, k)
    col = feat_to_col[feature_name]
    vals = X_school[idx, col]
    return float(np.mean(vals > 0.5))  # binary tags are 0/1

# helper: contribution of one feature for a single school's feature row under a segment
def feature_contrib(school_row: np.ndarray, seg_key: str, feature_name: str) -> float:
    col = feat_to_col[feature_name]
    return float(school_row[col] * segment_vectors[seg_key][col] * segment_weights[seg_key][col])

# ----------------------------
# 07.1 Tier dominance checks (Top-K composition)
# ----------------------------
tier_checks = []

# For academic_first, we expect IB and/or CAIS to show up strongly at the top.
# Note: CAIS + IB datasets are partial; so we treat this as a "signal check", not a strict pass/fail.
expected_tag_signals = {
    "academic_first": ["tag_ib", "tag_cais"],
    "progressive_balanced": ["tag_ams_montessori", "tag_waldorf"],
}

for seg_key, tags in expected_tag_signals.items():
    if seg_key not in SEG_KEYS:
        continue
    for t in tags:
        rate = tag_rate_in_topk(seg_key, t, TOP_K)
        tier_checks.append({
            "segment": seg_key,
            "top_k": TOP_K,
            "tag": t,
            "pct_topk_with_tag": rate * 100.0,
        })

tier_checks_df = pd.DataFrame(tier_checks).sort_values(["segment", "tag"])
display(tier_checks_df)

tier_checks_path = REPORTS_DIR / "notebook07_section07_tier_tag_presence_topk.csv"
tier_checks_df.to_csv(tier_checks_path, index=False)
print("Saved:", tier_checks_path)

# ----------------------------
# 07.2 Dense metrics should not overturn tier tags
#     Build rank-matched pairs among top candidates:
#     - Compare IB vs non-IB (and CAIS vs non-CAIS) within Academic First
#     - Compare Montessori/Waldorf vs non- for Progressive Balanced
#     A pair is flagged when the i-th best untagged school outscores the i-th best tagged school.
#     Dense-metric sums are saved alongside each pair so the cause can be inspected.
# ----------------------------
def find_top_pairs_for_check(seg_key: str, primary_tag: str, k_pool: int = 5000) -> pd.DataFrame:
    """
    Pull a pool of top candidates, split by tag presence, and compare representative pairs.
    We'll compare top N tagged vs top N non-tagged by score and inspect feature contributions.
    """
    scores = np.asarray(segment_scores[seg_key]).astype(float)
    order = np.argsort(scores)[::-1][:k_pool]
    pool_ids = school_index.loc[order, "school_id"].tolist()
    pool_scores = scores[order]

    col = feat_to_col[primary_tag]
    tag_vals = X_school[order, col] > 0.5

    tagged = [(sid, sc, ridx) for sid, sc, ridx, tv in zip(pool_ids, pool_scores, order, tag_vals) if tv]
    nontag = [(sid, sc, ridx) for sid, sc, ridx, tv in zip(pool_ids, pool_scores, order, tag_vals) if not tv]

    # if no coverage, return empty
    if len(tagged) == 0 or len(nontag) == 0:
        return pd.DataFrame()

    # Compare top N from each
    N = min(200, len(tagged), len(nontag))
    tagged = tagged[:N]
    nontag = nontag[:N]

    rows = []
    for i in range(N):
        sid_t, sc_t, ridx_t = tagged[i]
        sid_n, sc_n, ridx_n = nontag[i]

        # Pull dense contributions for both schools (size/attention/diversity)
        dense_feats = ["score_size_small", "score_attention", "score_diversity"]
        contrib_t = {f: X_school[ridx_t, feat_to_col[f]] * segment_vectors[seg_key][feat_to_col[f]] * segment_weights[seg_key][feat_to_col[f]] for f in dense_feats}
        contrib_n = {f: X_school[ridx_n, feat_to_col[f]] * segment_vectors[seg_key][feat_to_col[f]] * segment_weights[seg_key][feat_to_col[f]] for f in dense_feats}

        rows.append({
            "segment": seg_key,
            "primary_tag": primary_tag,
            "tagged_school_id": sid_t,
            "tagged_score": float(sc_t),
            "nontag_school_id": sid_n,
            "nontag_score": float(sc_n),
            "tagged_dense_sum": float(sum(contrib_t.values())),
            "nontag_dense_sum": float(sum(contrib_n.values())),
            "nontag_beats_tagged": bool(sc_n > sc_t),
        })

    return pd.DataFrame(rows)

guardrail_rows = []

# Academic First: IB and CAIS should dominate
if "academic_first" in SEG_KEYS:
    for primary_tag in ["tag_ib", "tag_cais"]:
        df_pairs = find_top_pairs_for_check("academic_first", primary_tag=primary_tag, k_pool=8000)
        if len(df_pairs) > 0:
            guardrail_rows.append(df_pairs)

# Progressive: Montessori/Waldorf should dominate
if "progressive_balanced" in SEG_KEYS:
    for primary_tag in ["tag_ams_montessori", "tag_waldorf"]:
        df_pairs = find_top_pairs_for_check("progressive_balanced", primary_tag=primary_tag, k_pool=8000)
        if len(df_pairs) > 0:
            guardrail_rows.append(df_pairs)

if guardrail_rows:
    guardrail_df = pd.concat(guardrail_rows, ignore_index=True)

    # Summarize potential "violations"
    violations = (
        guardrail_df.groupby(["segment", "primary_tag"])["nontag_beats_tagged"]
        .mean()
        .reset_index()
        .rename(columns={"nontag_beats_tagged": "pct_pairs_where_nontag_beats_tagged"})
    )
    violations["pct_pairs_where_nontag_beats_tagged"] *= 100.0

    display(violations)

    guardrail_pairs_path = REPORTS_DIR / "notebook07_section07_guardrail_pairs.csv"
    guardrail_df.to_csv(guardrail_pairs_path, index=False)
    print("Saved:", guardrail_pairs_path)

    violations_path = REPORTS_DIR / "notebook07_section07_guardrail_violation_rates.csv"
    violations.to_csv(violations_path, index=False)
    print("Saved:", violations_path)
else:
    print("Guardrail pair checks skipped (insufficient tag coverage in top pool).")

# ----------------------------
# 07.3 Grade-span sanity check (top-k composition)
#     Ensure segments are not accidentally picking schools that don't serve typical spans.
# ----------------------------
grade_checks = []
grade_feats = ["serves_elementary", "serves_middle", "serves_high"]

for seg_key in SEG_KEYS:
    idx = topk_indices(seg_key, TOP_K)
    for gf in grade_feats:
        col = feat_to_col[gf]
        rate = float(np.mean(X_school[idx, col] > 0.5))
        grade_checks.append({
            "segment": seg_key,
            "top_k": TOP_K,
            "grade_flag": gf,
            "pct_topk_true": rate * 100.0,
        })

grade_checks_df = pd.DataFrame(grade_checks).sort_values(["segment", "grade_flag"])
display(grade_checks_df)

grade_checks_path = REPORTS_DIR / "notebook07_section07_grade_span_presence_topk.csv"
grade_checks_df.to_csv(grade_checks_path, index=False)
print("Saved:", grade_checks_path)

print("\nSection 07 complete — tier + constraint sanity checks saved to /reports")


|  | segment | top_k | tag | pct_topk_with_tag |
|---|---|---|---|---|
| 1 | academic_first | 500 | tag_cais | 14.6 |
| 0 | academic_first | 500 | tag_ib | 6.6 |
| 2 | progressive_balanced | 500 | tag_ams_montessori | 1.0 |
| 3 | progressive_balanced | 500 | tag_waldorf | 3.0 |


Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/reports/notebook07_section07_tier_tag_presence_topk.csv


|  | segment | primary_tag | pct_pairs_where_nontag_beats_tagged |
|---|---|---|---|
| 0 | academic_first | tag_cais | 0.0 |
| 1 | academic_first | tag_ib | 90.909091 |
| 2 | progressive_balanced | tag_ams_montessori | 0.0 |
| 3 | progressive_balanced | tag_waldorf | 33.333333 |


Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/reports/notebook07_section07_guardrail_pairs.csv
Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/reports/notebook07_section07_guardrail_violation_rates.csv


|  | segment | top_k | grade_flag | pct_topk_true |
|---|---|---|---|---|
| 0 | academic_first | 500 | serves_elementary | 69.4 |
| 2 | academic_first | 500 | serves_high | 95.2 |
| 1 | academic_first | 500 | serves_middle | 95.4 |
| 9 | balanced_general | 500 | serves_elementary | 100.0 |
| 11 | balanced_general | 500 | serves_high | 100.0 |
| 10 | balanced_general | 500 | serves_middle | 100.0 |
| 6 | progressive_balanced | 500 | serves_elementary | 84.0 |
| 8 | progressive_balanced | 500 | serves_high | 13.8 |
| 7 | progressive_balanced | 500 | serves_middle | 13.4 |
| 3 | small_nurturing | 500 | serves_elementary | 100.0 |


Saved: /Users/jennifer-david/Documents/work/SpringBoard/projects/Capstone Projects/smart-school/reports/notebook07_section07_grade_span_presence_topk.csv

Section 07 complete — tier + constraint sanity checks saved to /reports


## 08. Segment Design Review

This section provides a **qualitative review** of Preference Segments v0, informed by
the quantitative results from Sections 03–07. The goal is not to optimize or retrain,
but to understand **what worked, what didn’t, and why**, so that future refinements
(v1+) are deliberate and principled.

---

### 08.1 What Each Segment Optimizes For

**Academic First**
- Optimizes for: credentialed rigor and academic signaling
- Primary drivers: IB, CAIS, grade span (middle/high)
- Observed behavior:
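  - Produces the sharpest rankings at the top: 48 unique scores among the Top 50, with 8% of those schools in exact ties
  - Strong separation between top schools and the rest, although the long tail collapses into one large tie group (15,862 schools share a single exact score)
  - Differentiates CAIS-tagged schools cleanly (0% of Section 07 pairs lost to untagged schools); IB differentiation is weak (90.9% of pairs lost)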

**Small & Nurturing**
- Optimizes for: individual attention and small size
- Primary drivers: attention score, small-size score
- Observed behavior:
  - Many schools score similarly: the Top 200 contains only 15 unique exact scores, and 97.5% of those schools are tied
  - Rankings are intentionally broad and inclusive
  - Reflects a parent mindset that values environment over prestige

**Progressive Balanced**
- Optimizes for: progressive pedagogy without extremes
- Primary drivers: Montessori/Waldorf tags, attention, diversity
- Observed behavior:
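  - Montessori-tagged schools hold their position (0% of Section 07 pairs lost to untagged schools), though they make up only 1.0% of the Top 500
  - Waldorf signal is weaker and more easily overridden (33.3% of pairs lost to untagged schools)
  - Segment overlaps partially with Small & Nurturing, as expected (23 of the Top 50 are shared)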

**Balanced General**
- Optimizes for: broad, well-rounded coverage
- Primary drivers: grade span + moderate dense metrics
- Observed behavior:
  - Serves as a neutral baseline
  - Produces the least extreme rankings
  - Useful reference point for cross-segment comparison

---

### 08.2 Overlaps and Gaps Between Segments

**Expected overlaps**
- Small & Nurturing ↔ Progressive Balanced: 23 of the Top 50 shared (Jaccard ≈ 0.30)  
  (shared emphasis on attention and environment)

**Intentional non-overlaps**
- Academic First ↔ Small & Nurturing: 0 of the Top 50 shared  
- Academic First ↔ Balanced General: 0 of the Top 50 shared  

These patterns are consistent with segments encoding **distinct parent intents**, not superficial labels.
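
The Jaccard values reported in Section 04 can be recovered from the overlap counts alone, because both Top-K sets have exactly K members and so their union has `2K - overlap` members. The short sketch below checks this arithmetic against the Section 04 tables; `jaccard_from_overlap` is a hypothetical helper written for this example.

```python
def jaccard_from_overlap(overlap: int, k: int) -> float:
    """Jaccard index of two sets of size k that share `overlap` elements."""
    union = 2 * k - overlap
    return overlap / union if union else 0.0

K = 50  # Top-K size, as implied by the diagonal of the overlap-count table

# small_nurturing vs progressive_balanced: 23 shared schools
j_nurturing_progressive = round(jaccard_from_overlap(23, K), 6)
# academic_first vs progressive_balanced: 1 shared school
j_academic_progressive = round(jaccard_from_overlap(1, K), 6)
# small_nurturing vs balanced_general: 2 shared schools
j_nurturing_general = round(jaccard_from_overlap(2, K), 6)

print(j_nurturing_progressive, j_academic_progressive, j_nurturing_general)
```

Note that a Jaccard of about 0.30 corresponds to nearly half of each Top 50 list being shared (23 of 50), so the count table is usually the easier of the two to read.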

**Coverage gaps**
- Some segments (especially Small & Nurturing) lack sufficient feature resolution,
  leading to large tie groups.
- Certain pedagogical tags (e.g., Waldorf, IB) have limited or noisy coverage,
  reducing their dominance.

---

### 08.3 Strengths of Preference Segments v0

- Fully deterministic and explainable
- Easy to audit and debug
- Produces meaningfully different rankings
- Safe by construction (no learned bias, no hidden logic)
- Supports clear “why this school” explanations

Preference Segments v0 successfully demonstrate that **parent intent can be separated
from school popularity**, and that rankings can change transparently based on values.

---

### 08.4 Known Limitations (by Design)

The following limitations are **intentional** in v0:

- No hard requirements (all signals are soft)
- Limited feature space (10 features)
- No geographic or budget constraints
- No negative preferences
- No learning or personalization

These constraints keep v0 simple and interpretable, and they clarify where future
refinement is needed.

---

### 08.5 Implications for Segment v1

Insights from this notebook suggest the following **v1 directions**:

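- Strengthen IB dominance for Academic First (untagged schools outscored IB-tagged schools in 90.9% of Section 07 pairs)
- Improve Waldorf signal robustness (33.3% of Section 07 pairs lost to untagged schools)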
- Add tie-breakers for dense-metric-heavy segments
- Introduce optional hard constraints (grade span, pedagogy floors)
- Expand feature space to increase resolution

These are **configuration and data improvements**, not algorithmic changes.
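
As one possible shape for the tie-breakers listed above, the sketch below orders schools by score, then by a secondary metric, then by `school_id`, so that tied schools always come out in the same order. The `rank_with_tiebreak` helper, the choice of secondary metric, and the toy data are all assumptions made for illustration; v1 may resolve ties differently.

```python
import numpy as np

def rank_with_tiebreak(scores, secondary, ids):
    """Order ids by score (descending), then secondary metric (descending),
    then id (ascending), so the final order is fully deterministic."""
    scores = np.asarray(scores, dtype=float)
    secondary = np.asarray(secondary, dtype=float)
    ids = np.asarray(ids)
    # np.lexsort treats the LAST key as the primary sort key,
    # so keys are listed from least to most significant.
    order = np.lexsort((ids, -secondary, -scores))
    return ids[order].tolist()

# Toy data (illustrative only): three schools tie on the main score
toy_scores = [3.0, 3.0, 3.0, 1.0]
toy_secondary = [0.2, 0.9, 0.2, 0.5]   # hypothetical secondary metric
toy_ids = ["S3", "S1", "S2", "S4"]

ranked = rank_with_tiebreak(toy_scores, toy_secondary, toy_ids)
print(ranked)
```

Here S1 wins the three-way tie on the secondary metric, and S2 precedes S3 only because of the final id comparison. Falling back to the id is arbitrary but reproducible, which matters when a Top-K cutoff lands inside a large tie group.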

---

### 08.6 Design Takeaway

Preference Segments v0 succeed as **intent abstractions**:
they do not attempt to predict outcomes, but instead provide
clear, safe, and explainable lenses through which parents can
explore schools.

The system’s intelligence remains in the math, while segments
act as transparent controls over what the math emphasizes.

This establishes a solid foundation for both product use and
future iteration.


## 09. Summary & Product Implications

This notebook validated **Preference Segments v0** as a practical, safe, and
explainable mechanism for incorporating parent intent into school ranking.

Across Sections 02–07, we demonstrated that deterministic segments layered on
top of the v2 scoring engine can meaningfully change outcomes without introducing
machine learning or opaque logic.

---

### 09.1 What Preference Segments v0 Prove

Preference Segments v0 successfully show that:

- Parent intent can be encoded **deterministically** as configuration.
- The same dataset and scoring engine can produce **very different rankings**
  depending on intent.
- These differences are:
  - explainable (feature-level contributions),
  - measurable (rank movement, overlap),
  - and safe (tier signals dominate dense metrics).

This confirms that personalization does **not** require learned models to be
valuable.
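
Feature-level contributions are what make the differences explainable. The following sketch assumes the score is a weighted sum of feature values, which this section does not state explicitly; the function name, feature names, and values are hypothetical.

```python
import numpy as np

def explain_school(x: np.ndarray, weights: np.ndarray, feature_names: list, top_n: int = 3) -> list:
    """Return the top_n (feature, contribution) pairs for one school."""
    # Under a weighted-sum score, each feature contributes weight * value.
    contributions = x * weights
    order = np.argsort(-contributions, kind="stable")[:top_n]
    return [(feature_names[i], float(contributions[i])) for i in order]

# Hypothetical features and weights for a single school.
feature_names = ["has_ib", "has_montessori", "attention_norm"]
x = np.array([1.0, 0.0, 0.5])
weights = np.array([2.0, 3.0, 1.0])

reasons = explain_school(x, weights, feature_names, top_n=2)
total_score = float(np.dot(x, weights))
```

Because the contributions sum to the total score, a "why this school" explanation built this way accounts for the whole ranking signal rather than a selected part of it.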

---

### 09.2 Why Deterministic Segmentation Is Valuable

Deterministic Preference Segments provide several advantages over ML-first approaches:

- **Transparency**  
  Every ranking decision can be traced to explicit features and weights.

- **Auditability**  
  Segment behavior can be validated, stress-tested, and reviewed independently.

- **Stability**  
  The same inputs and configuration always produce the same ranking; results change only when the data or a segment configuration changes.

- **Product trust**  
  Parents can understand *why* a school appears, not just *that* it appears.

This makes the system especially well-suited for high-stakes, value-driven domains
like education.

---

### 09.3 Product Readiness

Based on this notebook, the system is ready for:

- **Parent-facing exploration**
  - Segment selection as a first interaction
  - Ranked school lists that change visibly by intent

- **Explainable recommendations**
  - “Why this school” explanations grounded in data
  - Segment-specific reasoning rather than generic justifications

- **Guided discovery**
  - Helping parents see trade-offs between rigor, environment, and philosophy
  - Encouraging exploration beyond reputation-based rankings

---

### 09.4 What This System Is — and Is Not

**This system is:**
- A value-aligned ranking engine
- An intent-aware decision support tool
- A foundation for explainable personalization

**This system is not:**
- A prediction of child outcomes
- A popularity or prestige ranking
- A black-box recommender

Keeping this distinction clear is critical for both ethical use and user trust.

---

### 09.5 Key Takeaway

Preference Segments v0 demonstrate that **intent-aware ranking is achievable
without machine learning**, and that doing so yields clearer, safer, and more
interpretable results.

By separating:
- *what parents care about* (segments),
- from *how schools are evaluated* (math),

the system creates a flexible and principled foundation for future growth.

This sets the stage for controlled refinement, richer features, and optional
learning — without sacrificing explainability or trust.


## 10. Next Steps

Preference Segments v0 establish a validated, deterministic foundation for
intent-aware school ranking. The next steps focus on **refinement, resolution,
and controlled personalization**, while preserving explainability and safety.

---

### 10.1 Segment v1 Refinement (Configuration-Level)

Based on findings from Sections 05–07, Segment v1 should prioritize:

- **Strengthening tier dominance**
  - Increase IB weight or introduce IB as a soft floor for Academic First
  - Improve Waldorf signal robustness through data enrichment

- **Reducing tie density**
  - Add additional tie-breaker features (e.g., enrollment stability, class size bands)
  - Increase resolution of dense metrics
  - Consider segment-specific secondary weights

- **Optional hard constraints**
  - Grade-span requirements (e.g., must serve elementary)
  - Pedagogy floors (e.g., Montessori required for Progressive segments)

All refinements should remain deterministic and auditable.
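
Hard constraints and tie-breakers can both be added without changing how scores are computed. The sketch below is one possible arrangement, not a committed v1 design: the function name, the eligibility mask, and the tie-breaker column are assumptions for illustration.

```python
import numpy as np

def rank_with_constraints(scores: np.ndarray, eligible: np.ndarray, tie_breaker: np.ndarray) -> np.ndarray:
    """Return row indices of eligible schools, best first."""
    # Hard constraint: drop ineligible schools before ranking.
    idx = np.flatnonzero(eligible)
    # np.lexsort treats the LAST key as primary: score, then tie-breaker, then row index.
    # Score and tie-breaker are negated so that higher values rank first.
    order = np.lexsort((idx, -tie_breaker[idx], -scores[idx]))
    return idx[order]

# Toy data: school 2 scores highest but fails the hard constraint (made-up values).
scores = np.array([0.8, 0.8, 0.9, 0.8])
eligible = np.array([True, True, False, True])      # e.g., serves elementary grades
tie_breaker = np.array([0.1, 0.5, 0.9, 0.5])        # hypothetical secondary feature

ranking = rank_with_constraints(scores, eligible, tie_breaker)
```

Using the row index as the final sort key keeps the ordering fully deterministic even when both the score and the tie-breaker are equal.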

---

### 10.2 Child Profile → Segment Mapping

With segment behavior validated, the system can safely introduce child profiles:

- Map child attributes to:
  - hard requirements (e.g., grade span)
  - soft preferences (e.g., attention vs diversity)
- Resolve each child profile into:
  - a segment-aligned preference vector
  - optional segment blending (e.g., 70% Academic First, 30% Small & Nurturing)

This preserves the same math while enabling personalization.
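
Segment blending can be sketched as a convex combination of segment weight vectors. This assumes every segment resolves to a weight vector over the same ordered feature list; the function name and the weight values are hypothetical.

```python
import numpy as np

def blend_segments(weight_vectors: dict, mix: dict) -> np.ndarray:
    """Blend segment weight vectors by the given proportions."""
    total = sum(mix.values())
    if total <= 0:
        raise ValueError("mix proportions must sum to a positive value")
    # Normalize so that 70/30 and 0.7/0.3 produce the same blend.
    return sum((share / total) * weight_vectors[name] for name, share in mix.items())

# Hypothetical weight vectors over three shared features.
weight_vectors = {
    "Academic First": np.array([3.0, 1.0, 0.0]),
    "Small & Nurturing": np.array([0.0, 1.0, 2.0]),
}

blended = blend_segments(weight_vectors, {"Academic First": 70, "Small & Nurturing": 30})
```

If the score is a weighted sum, scoring with the blended vector gives the same result as blending the two segments' scores in the same proportions, so the blend stays as auditable as its parts.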

---

### 10.3 Soft Personalization (Non-Learning)

Before introducing ML, the system can support:

- User-adjustable weights within a segment
- What-if comparisons between segments
- Preference sliders that map to existing features

These interactions increase user agency without adding model complexity.

---

### 10.4 Feature & Data Expansion

Improving ranking resolution will primarily come from **better features**, not
new algorithms:

- Pedagogy depth (e.g., IB program type, Montessori accreditation level)
- Class size and student-teacher ratios (where available)
- Program differentiation (arts, STEM, language immersion)
- Transportation, schedule, and logistics features

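Each new feature gives the score more ways to separate schools, which reduces
tie groups, and gives explanations more specific evidence to cite. This holds
only when the feature has adequate coverage; sparse or noisy tags add little
resolution.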

---

### 10.5 Long-Term Learning Roadmap (Optional)

Only after deterministic performance plateaus should learning be introduced:

- Discover latent preference clusters from usage patterns
- Learn segment blending weights (not raw rankings)
- Use ML to suggest refinements, not replace core logic

Any learning should remain **assistive**, not authoritative.

---

### 10.6 Final Note

The core insight from this notebook is that **intelligence does not require opacity**.
By anchoring personalization in transparent math and explicit intent, the system
balances flexibility, trust, and control.

Preference Segments are the interface.
The scoring engine is the intelligence.
