# 03 — Child Profile → Child Vector & Matching

This notebook converts a parent/child questionnaire into a 4-number child vector that uses the same feature order as the school vectors:
Child vector = [academic_rigor, gifted_support, logistics_need, progressive_preference]

## Index

1. **Goals & Context**
   - Purpose of this notebook
   - Inputs and outputs
   - Reference to ADR-002 (Logistics as Hard Constraint)
   - Vector semantics and scope

2. **Load School Artifacts**
   - Load school feature table
   - Load school vectors
   - Load school ID → index mapping
   - Verify shapes and integrity

3. **Define Child Profile Schema (v1)**
   - Minimal, parent-friendly input fields
   - Numeric normalization (0–1 scale)
   - Defaults and validation rules
   - Ensure $V_{child}$ is a 1D array of shape `(N,)` so it is compatible with the $M \times N$ matrix $V_{school}$ (NumPy contract).

4. **Build Child Vector**
   - Map child profile → numerical vector
   - Enforce feature order contract
   - Generate human-readable explanation

5. **Apply Feasibility Filter (Logistics)**
   - Enforce ADR-002 default behavior
   - Output: feasible school feature table, feasible school vectors (subset of the original matrix), and the positional indices of those schools in the original matrix
   - Exclude infeasible schools
   - Optional override logic (future)

6. **Compute Similarity & Rank Schools**
   - Cosine similarity
   - Rank feasible schools
   - Extract top matches

7. **Explain Match Results**
   - Per-dimension contribution
   - Why a school ranked high or low
   - Simple, parent-readable explanation

8. **Save Outputs**
   - Child profile JSON
   - Child vector (.npy)
   - Explanation JSON
   - Match results CSV

---

**Note:**  
This notebook focuses on **deterministic, interpretable matching**.  
No machine learning models are trained in this step.


## 1. Goals & Context

### Purpose of This Notebook

Notebook 03 converts a **child / family profile** into a numerical **child vector**
that can be matched against the **school vectors** generated in Notebook 02.

This notebook answers the question:

> “Given a child’s needs and preferences, which schools are the best fit — and which schools are not feasible at all?”

---

### Inputs

This notebook consumes artifacts produced by **Notebook 02**:

- `school_features_sample.csv`  
- `school_vectors_sample.npy`  
- `school_id_to_index.json`  

These represent the **supply side** of the system.

---

### Outputs

This notebook will produce:

- `child_profile_sample.json`  
- `child_vector_sample.npy`  
- `child_vector_explain.json`  
- `match_results_sample.csv`  

These represent the **demand side** and the initial matching output.

---

### Architectural Constraint: Logistics as Feasibility

**Default behavior (v1):**

1. Logistics is treated as a **hard feasibility constraint**
2. Schools that do not meet the child’s minimum logistics requirement are **excluded**
3. Vector similarity is computed **only on feasible schools**

Formally, a school is included if: **school.logistics_score ≥ child.logistics_threshold**


Vector similarity is then applied to rank the remaining schools.

> This design reflects real-world parenting constraints where certain schools,
> regardless of academic or pedagogical fit, are simply not feasible.
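
The rule above can be sketched in a few lines of NumPy. This is only an illustration: the score values and the names `logistics_scores`, `feasible_mask`, and `feasible_positions` are assumptions made for the example, not artifacts of this notebook.

```python
import numpy as np

# Hypothetical logistics scores for three schools (illustrative values only)
logistics_scores = np.array([0.5, 0.7, 0.2])

# The child's minimum logistics requirement (0.5 is the v1 schema default)
logistics_threshold = 0.5

# A school is feasible when its score meets or exceeds the threshold
feasible_mask = logistics_scores >= logistics_threshold

# Positions of the feasible schools in the original array
feasible_positions = np.flatnonzero(feasible_mask)

print(feasible_mask)       # [ True  True False]
print(feasible_positions)  # [0 1]
```

Note that the comparison is inclusive: a school that sits exactly on the threshold stays in the candidate set.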

---

### Vector Semantics (Locked)

Both child and school vectors share the **same fixed order**:

[Academic Rigor, Gifted Support, Logistics, Progressive Style]

This order is treated as a **contract** across all notebooks and services.

---

### What This Notebook Is (and Is Not)

**This notebook is:**
- Vector construction
- Constraint filtering
- Similarity-based ranking
- Explainable matching

**This notebook is NOT:**
- Training a machine learning model
- Learning weights from data
- Optimizing parameters

This is a **deterministic, interpretable matching system**, designed first for correctness and trust.

---

### Why This Matters

Separating:
- **Feasibility (hard constraints)** from
- **Preference (soft ranking)**

allows the system to:
- Avoid impossible recommendations
- Preserve nuance among viable options
- Support future personalization without breaking trust

This foundation enables later extensions such as:
- User-controlled constraint toggles
- Learned weighting
- Feedback-driven refinement



## 2. Load School Artifacts

In this section we load the **school-side artifacts** generated by Notebook 02.

These artifacts represent the fixed **supply matrix** that the child vector
will be matched against.

Artifacts loaded:

- `school_features_sample.csv`  
  Human-readable feature table (IDs, names, feature scores)

- `school_vectors_sample.npy`  
  NumPy matrix of shape `(num_schools, 4)`

- `school_id_to_index.json`  
  Mapping from `school_internal_id` → row index in the vector matrix

Before proceeding, we validate:
- File existence
- Shapes and dimensions
- Alignment between features, vectors, and index mapping


In [95]:
import os
import json
import numpy as np
import pandas as pd

# ---------------------------------------------------------------------
# Paths to artifacts generated by Notebook 02
# ---------------------------------------------------------------------
processed_dir_notebook01 = "../data/processed/notebook01"
processed_dir_notebook02 = "../data/processed/notebook02"

features_path = os.path.join(processed_dir_notebook02, "school_features_sample.csv")
vectors_path = os.path.join(processed_dir_notebook02, "school_vectors_sample.npy")
mapping_path = os.path.join(processed_dir_notebook02, "school_id_to_index.json")

# ---------------------------------------------------------------------
# Existence checks (fail fast if artifacts are missing)
# ---------------------------------------------------------------------
for path in [features_path, vectors_path, mapping_path]:
    if not os.path.exists(path):
        raise FileNotFoundError(f"Missing required artifact: {path}")

print("All school artifacts found.")

# ---------------------------------------------------------------------
# Load artifacts
# ---------------------------------------------------------------------
school_features_df = pd.read_csv(features_path)
school_vectors = np.load(vectors_path)

with open(mapping_path, "r") as f:
    school_id_to_index = json.load(f)

print("\nArtifacts loaded successfully.")
print("school_features_df shape:", school_features_df.shape)
print("school_vectors shape:", school_vectors.shape)
print("school_id_to_index entries:", len(school_id_to_index))

# ---------------------------------------------------------------------
# Trust-but-verify integrity checks
# ---------------------------------------------------------------------

# 1) Vector row count must match number of schools
num_schools = school_features_df.shape[0]
num_features = school_vectors.shape[1]

assert school_vectors.shape[0] == num_schools, (
    "Mismatch: number of school vectors does not match number of schools"
)

# 2) Explicit feature contract enforcement
# Locked vector order: [Rigor, Gifted, Logistics, Progressive]
assert num_features == 4, (
    f"Expected 4 features per vector, got {num_features}"
)

# 3) School ID ↔ vector index mapping must align exactly
expected_ids = set(school_features_df["school_internal_id"])
mapped_ids = set(school_id_to_index.keys())

assert expected_ids == mapped_ids, (
    "Mismatch between school IDs in feature table and index mapping"
)

print("\nIntegrity checks passed.")

# ---------------------------------------------------------------------
# Visual alignment check (human sanity check)
# ---------------------------------------------------------------------
# NOTE:
# `display()` is intentionally used here because this is a Jupyter notebook.
# If this logic is later moved into a standalone Python script or backend
# service, replace `display()` with `print()` or structured logging.

first_id = school_features_df.iloc[0]["school_internal_id"]
first_idx = school_id_to_index[first_id]

print("\nAlignment check for first school:")
print("School ID:", first_id)
print("Mapped vector index:", first_idx)

print("\nFeature row:")
display(school_features_df.iloc[[0]])

print("Vector row:", school_vectors[first_idx])

print("\nSchool artifacts validated and ready for matching.")

All school artifacts found.

Artifacts loaded successfully.
school_features_df shape: (3, 6)
school_vectors shape: (3, 4)
school_id_to_index entries: 3

Integrity checks passed.

Alignment check for first school:
School ID: SCH0001
Mapped vector index: 0

Feature row:


| | school_internal_id | school_display_name | academic_rigor | gifted_support | logistics | progressive_style |
|---|---|---|---|---|---|---|
| 0 | SCH0001 | Sunnyvale Elementary School | 0.0 | 0.0 | 0.5 | 0.0 |


Vector row: [0.  0.  0.5 0. ]

School artifacts validated and ready for matching.


## 3. Define Child Profile Schema (v1)

This section defines the **child / family input contract** that will be
converted into a numerical child vector.

The schema is intentionally minimal and normalized to a **0–1 scale** so that
parents can answer questions quickly and consistently.

### Child Profile Fields (v1)

Each field maps directly to one dimension of the school vector:

- `pref_academic_rigor`  
  How much academic challenge the child seeks (0 = low, 1 = very high)

- `need_gifted_support`  
  How much gifted / support accommodation the child needs (0–1)

- `need_logistics_support`  
  Minimum logistics support required by the family (0–1)

- `pref_progressive_style`  
  Learning style preference  
  (0 = thrives in structure, 1 = thrives in inquiry-based environments)

---

### Child Vector Naming & Shape Convention

The child vector is defined as:

$$
V_{child} = [\text{Rigor}_{req},\ \text{Support}_{req},\ \text{Logistics}_{req},\ \text{Style}_{pref}]
$$

**Important contract:**

- `V_child` MUST be a **1D NumPy array of length N** (shape `(N,)`, not a 2D `1 × N` row)
- `V_school` is an **M × N matrix**
- Both must share the **same feature order**

In NumPy terms:

```python
V_child.shape == (N,)
V_school.shape == (M, N)
```

This keeps `V_child` compatible with vector similarity operations (e.g., cosine
similarity) without reshaping or broadcasting workarounds.

Violating this convention (e.g., using `(1, N)` or `(N, 1)`) can raise shape
errors or, worse, broadcast silently into misleading similarity scores.
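
To make the risk concrete, here is a small sketch of how each shape behaves in a matrix product. The two school rows reuse the sample vectors shown elsewhere in this notebook; the variable names (`dots_1d`, `dots_col`, `row_vector_failed`) are assumptions for the example.

```python
import numpy as np

V_school = np.array([
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.7, 1.0],
])  # shape (M, N) = (2, 4)
V_child = np.array([0.8, 0.7, 0.9, 0.6])  # shape (N,) = (4,)

# 1D child vector: one dot product per school, result shape (M,)
dots_1d = V_school @ V_child
print(dots_1d.shape)  # (2,)

# Column vector (N, 1): the result becomes (M, 1). Dividing by a flat array
# of per-school norms then broadcasts to (M, M) without raising an error.
dots_col = V_school @ V_child.reshape(-1, 1)
school_norms = np.linalg.norm(V_school, axis=1)  # shape (2,)
print(dots_col.shape)                   # (2, 1)
print((dots_col / school_norms).shape)  # (2, 2)

# Row vector (1, N): the matrix product is not defined and raises
row_vector_failed = False
try:
    V_school @ V_child.reshape(1, -1)
except ValueError:
    row_vector_failed = True
print(row_vector_failed)  # True
```

The `(N, 1)` case is the dangerous one because it fails silently: the code runs, but the result is a matrix rather than one score per school.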

In [97]:
# ---------------------------
# Child Profile Schema (v1)
# ---------------------------
CHILD_PROFILE_SCHEMA_V1 = {
    "version": "v1",
    "vector_order_contract": [
        "pref_academic_rigor",
        "need_gifted_support",
        "need_logistics_support",
        "pref_progressive_style",
    ],
    "fields": {
        "pref_academic_rigor": {
            "type": "float",
            "range": [0.0, 1.0],
            "meaning": "0=low demand for rigor; 1=very high demand for rigor",
        },
        "need_gifted_support": {
            "type": "float",
            "range": [0.0, 1.0],
            "meaning": "0=no gifted/support needs; 1=high gifted/support needs (e.g., 2e)",
        },
        "need_logistics_support": {
            "type": "float",
            "range": [0.0, 1.0],
            "meaning": "0=no logistics needs; 1=strong logistics needs (aftercare/transport/beforecare)",
        },
        "pref_progressive_style": {
            "type": "float",
            "range": [0.0, 1.0],
            "meaning": "0=structure/traditional; 1=inquiry/progressive",
        },

        # ADR-002: Default feasibility filter (hard constraint)
        "treat_logistics_as_hard_requirement": {
            "type": "bool",
            "default": True,
            "meaning": "If True, apply logistics threshold as a hard filter before ranking",
        },
        "logistics_threshold": {
            "type": "float",
            "range": [0.0, 1.0],
            "default": 0.5,
            "meaning": "Minimum required logistics_score for schools (used only if hard requirement is True)",
        },
    },
}

# Vector dimension contract (must match school vectors)
CHILD_VECTOR_DIM = len(CHILD_PROFILE_SCHEMA_V1["vector_order_contract"])

print("Child Profile Schema v1 (contract):")
print(json.dumps(CHILD_PROFILE_SCHEMA_V1, indent=2))
print("\nChild vector dimension (N):", CHILD_VECTOR_DIM)

# ---------------------------
# Validation helper
# ---------------------------
def validate_child_profile(profile: dict, schema: dict) -> bool:
    """
    Validate a child profile against schema constraints.
    Fails fast if required fields are missing or out of range.

    Note:
    - This does NOT build the NumPy vector yet.
    - Vector shape enforcement happens in Section 4.
    """
    fields = schema["fields"]

    # Validate the 4 vector dimensions
    for key in schema["vector_order_contract"]:
        if key not in profile:
            raise ValueError(f"Missing required field: {key}")

        value = profile[key]
        field_def = fields[key]

        if field_def["type"] != "float":
            raise ValueError(f"Vector field {key} must be type float in schema")

        lo, hi = field_def["range"]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"{key} must be numeric, got {type(value)}")

        if not (lo <= float(value) <= hi):
            raise ValueError(f"{key}={value} out of range {lo}–{hi}")

    # Validate ADR-002 controls
    treat_hard = profile.get(
        "treat_logistics_as_hard_requirement",
        schema["fields"]["treat_logistics_as_hard_requirement"]["default"]
    )
    if not isinstance(treat_hard, bool):
        raise TypeError("treat_logistics_as_hard_requirement must be boolean")

    if treat_hard:
        threshold = profile.get(
            "logistics_threshold",
            schema["fields"]["logistics_threshold"]["default"]
        )
        lo, hi = schema["fields"]["logistics_threshold"]["range"]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
            raise TypeError("logistics_threshold must be numeric")
        if not (lo <= float(threshold) <= hi):
            raise ValueError(f"logistics_threshold={threshold} out of range {lo}–{hi}")

    return True

# ---------------------------
# Sample Child Profile (v1)
# ---------------------------
# Example: family needs aftercare, child is advanced and needs support, prefers some inquiry
child_profile_v1 = {
    "version": "v1",
    "pref_academic_rigor": 0.8,
    "need_gifted_support": 0.7,
    "need_logistics_support": 0.9,
    "pref_progressive_style": 0.6,
    "treat_logistics_as_hard_requirement": True,
    "logistics_threshold": 0.5,
}

print("\nSample child profile v1:")
print(json.dumps(child_profile_v1, indent=2))

# Validate sample profile
validate_child_profile(child_profile_v1, CHILD_PROFILE_SCHEMA_V1)
print("\nChild profile v1 validated successfully.")

Child Profile Schema v1 (contract):
{
  "version": "v1",
  "vector_order_contract": [
    "pref_academic_rigor",
    "need_gifted_support",
    "need_logistics_support",
    "pref_progressive_style"
  ],
  "fields": {
    "pref_academic_rigor": {
      "type": "float",
      "range": [
        0.0,
        1.0
      ],
      "meaning": "0=low demand for rigor; 1=very high demand for rigor"
    },
    "need_gifted_support": {
      "type": "float",
      "range": [
        0.0,
        1.0
      ],
      "meaning": "0=no gifted/support needs; 1=high gifted/support needs (e.g., 2e)"
    },
    "need_logistics_support": {
      "type": "float",
      "range": [
        0.0,
        1.0
      ],
      "meaning": "0=no logistics needs; 1=strong logistics needs (aftercare/transport/beforecare)"
    },
    "pref_progressive_style": {
      "type": "float",
      "range": [
        0.0,
        1.0
      ],
      "meaning": "0=structure/traditional; 1=inquiry/progressive"
    },
    "treat_log

## 4. Build Child Vector

This section converts a validated child profile into a numerical vector `V_child`
using the same fixed feature order as the school vectors:

[Academic Rigor, Gifted Support, Logistics, Progressive Style]


### Vector Contract (NumPy)

- `V_child` must be a **1D array** of shape `(N,)`
- `V_school` is a matrix of shape `(M, N)`

In our v1 design:
- `N = 4`
- `V_child.shape == (4,)`
- `V_school.shape == (num_schools, 4)`

This contract prevents silent broadcasting and ensures compatibility with
cosine similarity and downstream ranking.

We also generate a small explanation dictionary so the vector is interpretable.


In [99]:
def build_child_vector(profile: dict, schema: dict) -> tuple[np.ndarray, dict]:
    """
    Convert a validated child profile into a child vector (1D NumPy array)
    following the vector_order_contract.

    Returns:
      - child_vector: np.ndarray of shape (N,)
      - child_vector_explain: dict (human-readable)
    """
    order = schema["vector_order_contract"]

    # Build vector in strict order
    values = [float(profile[k]) for k in order]
    child_vector = np.array(values, dtype=float)

    # Enforce shape contract: (N,)
    expected_dim = len(order)
    assert child_vector.shape == (expected_dim,), (
        f"Child vector shape mismatch. Expected ({expected_dim},) got {child_vector.shape}"
    )

    # Basic explanation (simple and readable)
    child_vector_explain = {
        "version": profile.get("version", "v1"),
        "vector_order_contract": order,
        "vector_values": {k: float(profile[k]) for k in order},
        "notes": {
            "pref_academic_rigor": "Higher = wants more academic challenge",
            "need_gifted_support": "Higher = needs more gifted/2e support",
            "need_logistics_support": "Higher = needs more family logistics support",
            "pref_progressive_style": "Higher = prefers inquiry/progressive environments",
        },
        # ADR-002 controls carried forward for Section 5 filtering
        "adr_002": {
            "treat_logistics_as_hard_requirement": profile.get(
                "treat_logistics_as_hard_requirement",
                schema["fields"]["treat_logistics_as_hard_requirement"]["default"],
            ),
            "logistics_threshold": float(profile.get(
                "logistics_threshold",
                schema["fields"]["logistics_threshold"]["default"],
            )),
        }
    }

    return child_vector, child_vector_explain


# Build child vector from the sample profile created in Section 3
child_vector, child_vector_explain = build_child_vector(child_profile_v1, CHILD_PROFILE_SCHEMA_V1)

print("Child vector built successfully.")
print("child_vector shape:", child_vector.shape)
print("child_vector:", child_vector)

print("\nChild vector explanation:")
print(json.dumps(child_vector_explain, indent=2))


Child vector built successfully.
child_vector shape: (4,)
child_vector: [0.8 0.7 0.9 0.6]

Child vector explanation:
{
  "version": "v1",
  "vector_order_contract": [
    "pref_academic_rigor",
    "need_gifted_support",
    "need_logistics_support",
    "pref_progressive_style"
  ],
  "vector_values": {
    "pref_academic_rigor": 0.8,
    "need_gifted_support": 0.7,
    "need_logistics_support": 0.9,
    "pref_progressive_style": 0.6
  },
  "notes": {
    "pref_academic_rigor": "Higher = wants more academic challenge",
    "need_gifted_support": "Higher = needs more gifted/2e support",
    "need_logistics_support": "Higher = needs more family logistics support",
    "pref_progressive_style": "Higher = prefers inquiry/progressive environments"
  },
  "adr_002": {
    "treat_logistics_as_hard_requirement": true,
    "logistics_threshold": 0.5
  }
}


## 5. Apply Feasibility Filter (ADR-002: Logistics Hard Constraint)

Before ranking schools with vector similarity, we apply a feasibility filter
based on logistics.

**ADR-002 Default Behavior:**
- If `treat_logistics_as_hard_requirement` is True:
  - Exclude schools where `logistics_score < logistics_threshold`
- Otherwise:
  - Do not filter; logistics stays as a soft preference in the vector

This prevents "impossible matches" (e.g., a family needs aftercare but a school
offers none) from appearing in results.

### Note on Index Safety (The “Index Trap”)

This logic assumes that **row order** in `school_features_df` matches
row order in `school_vectors`.

If `school_features_df` is ever sorted or filtered **without resetting the index**,
`df.index` may no longer align with vector row numbers, leading to incorrect matches.

To avoid this class of bugs:
- Filtering should rely on **positional indices**, not DataFrame index values
- Vector slicing should always be driven by row order, not labels
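
The trap is easiest to see once both the table and the matrix have been filtered. The sketch below uses hypothetical data (the `SCH_A`/`SCH_B`/`SCH_C` IDs and all values are invented for illustration) to show how a stale DataFrame label silently selects the wrong vector row.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three schools; logistics is column 2 of each vector
features = pd.DataFrame({
    "school_internal_id": ["SCH_A", "SCH_B", "SCH_C"],
    "logistics": [0.2, 0.5, 0.7],
})
vectors = np.array([
    [0.1, 0.1, 0.2, 0.1],
    [0.3, 0.3, 0.5, 0.3],
    [0.6, 0.6, 0.7, 0.6],
])

mask = features["logistics"].to_numpy() >= 0.5
positions = np.flatnonzero(mask)        # [1, 2]

feasible_vectors = vectors[positions]   # rows are renumbered 0..1
feasible_no_reset = features[mask]      # index labels are still [1, 2]
feasible_reset = features.iloc[positions].reset_index(drop=True)

# Trap: SCH_B keeps label 1, but row 1 of the filtered matrix is SCH_C
stale_label = feasible_no_reset.index[0]
wrong_row = feasible_vectors[stale_label]

# Safe: after reset_index, label 0 is also position 0
right_row = feasible_vectors[feasible_reset.index[0]]

print(wrong_row[2], right_row[2])  # 0.7 0.5
```

No exception is raised in the stale-label case, which is why the filter in this section works from positional indices and resets the DataFrame index afterwards.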


In [101]:
# ---------------------------
# 5. Feasibility Filter (ADR-002)
# ---------------------------

def apply_logistics_feasibility_filter(
    school_features_df: pd.DataFrame,
    school_vectors: np.ndarray,
    child_vector_explain: dict,
):
    """
    Apply ADR-002 logistics feasibility filtering.

    Returns:
        filtered_features_df : pd.DataFrame
        filtered_vectors     : np.ndarray (subset of school_vectors)
        filtered_indices     : np.ndarray (positional indices into original matrix)
    """

    treat_hard = child_vector_explain["adr_002"]["treat_logistics_as_hard_requirement"]
    threshold = float(child_vector_explain["adr_002"]["logistics_threshold"])

    num_schools = school_vectors.shape[0]

    # Defensive check: feature table and vector matrix must align
    assert school_features_df.shape[0] == num_schools, (
        "school_features_df and school_vectors row counts do not match"
    )

    # ---------------------------
    # Case 1: No hard filtering
    # ---------------------------
    if not treat_hard:
        filtered_indices = np.arange(num_schools)
        return (
            school_features_df.copy(),
            school_vectors.copy(),
            filtered_indices,
        )

    # ---------------------------
    # Case 2: Apply logistics hard constraint
    # ---------------------------

    if "logistics" not in school_features_df.columns:
        raise KeyError("school_features_df must contain 'logistics' column")

    # Boolean mask aligned to row order (NOT DataFrame index)
    mask = (school_features_df["logistics"].to_numpy() >= threshold)

    # Positional indices into original arrays (INDEX-SAFE)
    filtered_indices = np.flatnonzero(mask)

    # Slice vectors by position
    filtered_vectors = school_vectors[filtered_indices]

    # Slice features and reset index for clean downstream use
    filtered_features_df = school_features_df.iloc[filtered_indices].reset_index(drop=True)

    return filtered_features_df, filtered_vectors, filtered_indices


# ---------------------------
# Execute filter using current child profile
# ---------------------------

filtered_school_features_df, filtered_school_vectors, filtered_school_indices = (
    apply_logistics_feasibility_filter(
        school_features_df=school_features_df,
        school_vectors=school_vectors,
        child_vector_explain=child_vector_explain,
    )
)

print("ADR-002 feasibility filter applied.")
print(f"Original schools: {school_features_df.shape[0]}")
print(f"Feasible schools: {filtered_school_features_df.shape[0]}")

print("\nFeasible schools:")
display(
    filtered_school_features_df[
        ["school_internal_id", "school_display_name", "logistics"]
    ]
)

print("\nFiltered positional indices (into original school_vectors):")
print(filtered_school_indices)


ADR-002 feasibility filter applied.
Original schools: 3
Feasible schools: 2

Feasible schools:


| | school_internal_id | school_display_name | logistics |
|---|---|---|---|
| 0 | SCH0001 | Sunnyvale Elementary School | 0.5 |
| 1 | SCH0002 | Bay Area Montessori Academy | 0.7 |



Filtered positional indices (into original school_vectors):
[0 1]


## 6. Compute Similarity & Rank Schools

Now that we have:

- `child_vector` (shape `(4,)`)
- `filtered_school_vectors` (shape `(M, 4)`)
- `filtered_school_features_df` (metadata for the same M schools)

we compute similarity scores and rank schools from best fit to worst fit.

### Similarity Method

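We use **cosine similarity**, which measures how closely two vectors point in
the same direction, independent of their magnitude.

- 1.0 = same direction (the child and school profiles have the same shape)
- 0.0 = orthogonal (the two vectors share no non-zero dimension)

Because both child and school vectors are non-negative (normalized to 0–1) and
share the same feature order, scores fall between 0 and 1, which makes cosine
similarity a clean baseline ranking method.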
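
As a worked check, the snippet below computes the score for one pair by hand, using the sample child vector and the SCH0002 school vector from this notebook. The second half illustrates a property worth keeping in mind: cosine similarity ignores magnitude, so a uniformly scaled school vector scores the same. The name `scaled_similarity` is an assumption for the example.

```python
import numpy as np

child = np.array([0.8, 0.7, 0.9, 0.6])
school = np.array([0.0, 0.0, 0.7, 1.0])  # SCH0002 in the sample data

# Dot product: only the non-zero school dimensions contribute
dot = float(child @ school)  # 0.9*0.7 + 0.6*1.0 = 1.23

# Product of the two Euclidean norms
norms = float(np.linalg.norm(child) * np.linalg.norm(school))

similarity = dot / norms
print(round(similarity, 3))  # 0.664

# Scaling the school vector does not change the score
scaled_similarity = float(child @ (0.5 * school)) / float(
    np.linalg.norm(child) * np.linalg.norm(0.5 * school)
)
print(bool(np.isclose(similarity, scaled_similarity)))  # True
```

Because magnitude is ignored, a school with uniformly low scores can rank as high as one with uniformly high scores. The per-dimension explanation in Section 7 is what surfaces that difference.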


In [103]:
import numpy as np
import pandas as pd

# ---------------------------
# 6. Cosine Similarity + Ranking
# ---------------------------

def cosine_similarity_1_to_many(v: np.ndarray, M: np.ndarray) -> np.ndarray:
    """
    Compute cosine similarity between a single vector v (shape (N,))
    and a matrix M (shape (K, N)), returning shape (K,).
    """
    v = np.asarray(v, dtype=float)
    M = np.asarray(M, dtype=float)

    assert v.ndim == 1, f"Expected v to be 1D (N,), got {v.shape}"
    assert M.ndim == 2, f"Expected M to be 2D (K,N), got {M.shape}"
    assert M.shape[1] == v.shape[0], f"Dim mismatch: M is {M.shape}, v is {v.shape}"

    v_norm = np.linalg.norm(v)
    M_norms = np.linalg.norm(M, axis=1)

    # Avoid divide-by-zero
    if v_norm == 0:
        return np.zeros(M.shape[0], dtype=float)

    denom = (M_norms * v_norm)
    denom = np.where(denom == 0, 1e-12, denom)

    sims = (M @ v) / denom
    return sims


# 1) Compute similarity scores
similarities = cosine_similarity_1_to_many(child_vector, filtered_school_vectors)

# 2) Rank schools by similarity (highest first)
ranked_idx = np.argsort(similarities)[::-1]

ranked_results_df = filtered_school_features_df.copy()
ranked_results_df["similarity_score"] = similarities
ranked_results_df = ranked_results_df.iloc[ranked_idx].reset_index(drop=True)

# 3) Display top matches
print("Top matches (ranked):")
display(
    ranked_results_df[
        ["school_internal_id", "school_display_name", "similarity_score",
         "academic_rigor", "gifted_support", "logistics", "progressive_style"]
    ]
)

# Optional: print the best match in a more narrative style
top = ranked_results_df.iloc[0]
print("\nBest match:")
print(f"- {top['school_display_name']} ({top['school_internal_id']})")
print(f"- Similarity: {top['similarity_score']:.3f}")
print(f"- School vector: [{top['academic_rigor']:.2f}, {top['gifted_support']:.2f}, {top['logistics']:.2f}, {top['progressive_style']:.2f}]")
print(f"- Child  vector: [{child_vector[0]:.2f}, {child_vector[1]:.2f}, {child_vector[2]:.2f}, {child_vector[3]:.2f}]")


Top matches (ranked):


| | school_internal_id | school_display_name | similarity_score | academic_rigor | gifted_support | logistics | progressive_style |
|---|---|---|---|---|---|---|---|
| 0 | SCH0002 | Bay Area Montessori Academy | 0.664428 | 0.0 | 0.0 | 0.7 | 1.0 |
| 1 | SCH0001 | Sunnyvale Elementary School | 0.593442 | 0.0 | 0.0 | 0.5 | 0.0 |



Best match:
- Bay Area Montessori Academy (SCH0002)
- Similarity: 0.664
- School vector: [0.00, 0.00, 0.70, 1.00]
- Child  vector: [0.80, 0.70, 0.90, 0.60]


## 7. Explain Match Results (Parent-Friendly)

A matching system must be **explainable** to earn parent trust.

In this section we create a simple explanation for each matched school by
comparing the child vector to the school’s feature scores on each dimension:

- Academic Rigor
- Gifted / Support
- Logistics
- Progressive Style

We also label each dimension by the absolute gap between the child and school values:
- **Strong match** (gap ≤ 0.20)
- **OK match** (gap ≤ 0.45)
- **Weak match** (gap > 0.45)

This explanation is intentionally simple (v1) and can be upgraded later with:
- learned weights
- natural language summaries
- user feedback loops
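
One subtlety with threshold-based labels is floating-point rounding at the boundary. With the defaults used in this section's code (0.20 and 0.45), a gap that reads as exactly 0.2 can land just above the threshold and be labeled "OK match" instead of "Strong match". The sketch below shows the effect and one possible mitigation. `label_gap` and its `eps` parameter are hypothetical and not part of the notebook's code.

```python
def label_gap(child_val: float, school_val: float,
              tol_strong: float = 0.20, tol_ok: float = 0.45,
              eps: float = 1e-9) -> str:
    """Hypothetical variant: same thresholds, with a small epsilon at the boundary."""
    gap = abs(school_val - child_val)
    if gap <= tol_strong + eps:
        return "Strong match"
    if gap <= tol_ok + eps:
        return "OK match"
    return "Weak match"


# Child logistics need 0.9 vs school logistics 0.7
raw_gap = abs(0.7 - 0.9)
print(raw_gap)          # 0.20000000000000007
print(raw_gap <= 0.20)  # False: a strict comparison misses the boundary

print(label_gap(0.9, 0.7))  # Strong match
print(label_gap(0.6, 1.0))  # OK match
print(label_gap(0.8, 0.0))  # Weak match
```

Whether a gap of exactly 0.2 should count as a strong match is a product decision. The point here is only that the outcome should be chosen deliberately rather than left to binary rounding.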


In [105]:
# ---------------------------
# 7. Explain Match Results (Parent-Friendly) — with Directional Insight
# ---------------------------

DIMENSIONS = [
    ("academic_rigor", "Academic Rigor"),
    ("gifted_support", "Gifted / Support"),
    ("logistics", "Logistics"),
    ("progressive_style", "Progressive Style"),
]

# Child-side keys aligned to vector contract (same order as school vectors)
CHILD_KEYS = CHILD_PROFILE_SCHEMA_V1["vector_order_contract"]
CHILD_LABELS = [
    "Rigor need",
    "Support need",
    "Logistics need",
    "Style preference",
]

def match_label_and_direction(child_val: float, school_val: float, tol_strong: float = 0.20, tol_ok: float = 0.45):
    """
    Returns:
      label:     Strong / OK / Weak match
      direction: Aligned / School higher / School lower
      diff:      signed difference (school - child)
      gap:       absolute difference
    """
    child_val = float(child_val)
    school_val = float(school_val)
    diff = school_val - child_val
    gap = abs(diff)

    if gap <= tol_strong:
        return "Strong match", "Aligned", diff, gap
    elif gap <= tol_ok:
        direction = "School higher" if diff > 0 else "School lower"
        return "OK match", direction, diff, gap
    else:
        direction = "School higher" if diff > 0 else "School lower"
        return "Weak match", direction, diff, gap


def explain_school_match(child_vector: np.ndarray, school_row: pd.Series) -> dict:
    """
    Create a parent-friendly explanation for one school.
    Includes directional insight (school higher/lower than child's need).
    """
    explanation = {
        "school_internal_id": school_row["school_internal_id"],
        "school_display_name": school_row["school_display_name"],
        "similarity_score": float(school_row["similarity_score"]),
        "dimensions": [],
        "summary": "",
    }

    for i, (col, dim_name) in enumerate(DIMENSIONS):
        child_val = float(child_vector[i])
        school_val = float(school_row[col])

        label, direction, diff, gap = match_label_and_direction(child_val, school_val)

        explanation["dimensions"].append({
            "dimension": dim_name,
            "child": {
                "label": CHILD_LABELS[i],
                "value": round(child_val, 2),
            },
            "school": {
                "label": f"{dim_name} score",
                "value": round(school_val, 2),
            },
            "match_label": label,
            "direction": direction,           # NEW: directional insight
            "diff": round(diff, 2),           # signed (school - child)
            "gap": round(gap, 2),             # absolute
        })

    # Summary: highlight best 2 (smallest gaps) and worst 1 (largest gap)
    gaps = [(d["gap"], d["dimension"], d["match_label"], d["direction"]) for d in explanation["dimensions"]]
    gaps_sorted = sorted(gaps, key=lambda x: x[0])

    best_two = gaps_sorted[:2]
    worst_one = gaps_sorted[-1]

    explanation["summary"] = (
        f"Best aligned on {best_two[0][1]} and {best_two[1][1]}. "
        f"Weakest alignment on {worst_one[1]} ({worst_one[3]})."
    )

    return explanation


# ---------------------------
# Execute: build explanations for ranked results
# ---------------------------

match_explanations = [explain_school_match(child_vector, row) for _, row in ranked_results_df.iterrows()]

print("Top match explanations:")
for exp in match_explanations[:2]:
    print("\n" + "-" * 70)
    print(f"{exp['school_display_name']} ({exp['school_internal_id']})")
    print(f"Similarity: {exp['similarity_score']:.3f}")
    print(exp["summary"])
    print("Details:")
    for d in exp["dimensions"]:
        print(
            f"- {d['dimension']}: Child {d['child']['value']} vs School {d['school']['value']} "
            f"→ {d['match_label']} ({d['direction']}, diff {d['diff']}, gap {d['gap']})"
        )

# ---------------------------
# Create a compact summary table for saving later
# ---------------------------

explain_rows = []
for exp in match_explanations:
    explain_rows.append({
        "school_internal_id": exp["school_internal_id"],
        "school_display_name": exp["school_display_name"],
        "similarity_score": exp["similarity_score"],
        "summary": exp["summary"],
    })

match_explanations_df = pd.DataFrame(explain_rows)

print("\nExplanation summary table:")
display(match_explanations_df)

# Optional: full JSON preview for the #1 result
print("\nFull JSON explanation (top match):")
print(json.dumps(match_explanations[0], indent=2))


Top match explanations:

----------------------------------------------------------------------
Bay Area Montessori Academy (SCH0002)
Similarity: 0.664
Best aligned on Logistics and Progressive Style. Weakest alignment on Academic Rigor (School lower).
Details:
- Academic Rigor: Child 0.8 vs School 0.0 → Weak match (School lower, diff -0.8, gap 0.8)
- Gifted / Support: Child 0.7 vs School 0.0 → Weak match (School lower, diff -0.7, gap 0.7)
- Logistics: Child 0.9 vs School 0.7 → OK match (School lower, diff -0.2, gap 0.2)
- Progressive Style: Child 0.6 vs School 1.0 → OK match (School higher, diff 0.4, gap 0.4)

----------------------------------------------------------------------
Sunnyvale Elementary School (SCH0001)
Similarity: 0.593
Best aligned on Logistics and Progressive Style. Weakest alignment on Academic Rigor (School lower).
Details:
- Academic Rigor: Child 0.8 vs School 0.0 → Weak match (School lower, diff -0.8, gap 0.8)
- Gifted / Support: Child 0.7 vs School 0.0 → Weak mat

| | school_internal_id | school_display_name | similarity_score | summary |
|---|---|---|---|---|
| 0 | SCH0002 | Bay Area Montessori Academy | 0.664428 | Best aligned on Logistics and Progressive Styl... |
| 1 | SCH0001 | Sunnyvale Elementary School | 0.593442 | Best aligned on Logistics and Progressive Styl... |



Full JSON explanation (top match):
{
  "school_internal_id": "SCH0002",
  "school_display_name": "Bay Area Montessori Academy",
  "similarity_score": 0.6644282038345252,
  "dimensions": [
    {
      "dimension": "Academic Rigor",
      "child": {
        "label": "Rigor need",
        "value": 0.8
      },
      "school": {
        "label": "Academic Rigor score",
        "value": 0.0
      },
      "match_label": "Weak match",
      "direction": "School lower",
      "diff": -0.8,
      "gap": 0.8
    },
    {
      "dimension": "Gifted / Support",
      "child": {
        "label": "Support need",
        "value": 0.7
      },
      "school": {
        "label": "Gifted / Support score",
        "value": 0.0
      },
      "match_label": "Weak match",
      "direction": "School lower",
      "diff": -0.7,
      "gap": 0.7
    },
    {
      "dimension": "Logistics",
      "child": {
        "label": "Logistics need",
        "value": 0.9
      },
      "school": {
        "label": "L
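
As a sanity check on the printed explanation for SCH0002, the summary line and the similarity score can be recomputed by hand from the four displayed values. The sketch below is standalone and uses only the standard library; it assumes the displayed (rounded) values are close enough to the stored ones, and that `diff` is `school - child`, as the printed signs suggest. The variable names are illustrative and are not taken from the notebook's functions.

```python
import math

# Values copied from the printed details for SCH0002 (rounded as displayed).
dimensions = ["Academic Rigor", "Gifted / Support", "Logistics", "Progressive Style"]
child = [0.8, 0.7, 0.9, 0.6]
school = [0.0, 0.0, 0.7, 1.0]

# Per-dimension signed difference (assumed: school - child) and absolute gap.
rows = []
for name, c, s in zip(dimensions, child, school):
    diff = round(s - c, 3)
    rows.append({"dimension": name, "diff": diff, "gap": abs(diff)})

# Same selection rule as the summary step: two smallest gaps, one largest gap.
rows_sorted = sorted(rows, key=lambda r: r["gap"])
best_two = [r["dimension"] for r in rows_sorted[:2]]
worst_one = rows_sorted[-1]["dimension"]

# Cosine similarity between the child vector and the school vector.
dot = sum(c * s for c, s in zip(child, school))
norm_child = math.sqrt(sum(c * c for c in child))
norm_school = math.sqrt(sum(s * s for s in school))
similarity = dot / (norm_child * norm_school)

print("Best two:", best_two)
print("Worst:", worst_one)
print("Similarity:", round(similarity, 3))
```

With these inputs the two smallest gaps are Logistics (0.2) and Progressive Style (0.4), the largest is Academic Rigor (0.8), and the cosine similarity rounds to 0.664, which agrees with the printed summary and score.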

## 8. Save Outputs

This section saves the key artifacts produced by Notebook 03 so the matching
results are reproducible and auditable.

Outputs saved to `../data/processed/`:

- `child_profile_sample.json`
- `child_vector_sample.npy`
- `child_vector_explain.json`
- `match_results_sample.csv`
- `match_explanations_sample.json`
- `match_explanations_summary.csv`


In [107]:
processed_dir = "../data/processed"
os.makedirs(processed_dir, exist_ok=True)

# Paths
child_profile_path = os.path.join(processed_dir, "child_profile_sample.json")
child_vector_path = os.path.join(processed_dir, "child_vector_sample.npy")
child_explain_path = os.path.join(processed_dir, "child_vector_explain.json")

match_results_path = os.path.join(processed_dir, "match_results_sample.csv")
match_explanations_path = os.path.join(processed_dir, "match_explanations_sample.json")
match_explanations_summary_path = os.path.join(processed_dir, "match_explanations_summary.csv")

# ---- Save child artifacts ----
with open(child_profile_path, "w") as f:
    json.dump(child_profile_v1, f, indent=2)

np.save(child_vector_path, child_vector)

with open(child_explain_path, "w") as f:
    json.dump(child_vector_explain, f, indent=2)

# ---- Save match artifacts ----
ranked_results_df.to_csv(match_results_path, index=False)

with open(match_explanations_path, "w") as f:
    json.dump(match_explanations, f, indent=2)

match_explanations_df.to_csv(match_explanations_summary_path, index=False)

print("Saved outputs:")
print("-", child_profile_path)
print("-", child_vector_path)
print("-", child_explain_path)
print("-", match_results_path)
print("-", match_explanations_path)
print("-", match_explanations_summary_path)

Saved outputs:
- ../data/processed/child_profile_sample.json
- ../data/processed/child_vector_sample.npy
- ../data/processed/child_vector_explain.json
- ../data/processed/match_results_sample.csv
- ../data/processed/match_explanations_sample.json
- ../data/processed/match_explanations_summary.csv
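
Since the stated aim is reproducibility and auditability, one option is to reload the saved files and confirm they match what was written. The sketch below illustrates that round-trip check under some assumptions: it writes stand-in data to a temporary directory rather than `../data/processed/`, it covers only the JSON and summary CSV artifacts, and `roundtrip_check` is a hypothetical helper, not a function defined in this notebook.

```python
import csv
import json
import os
import tempfile

# Stand-in records shaped like the explanation summary (IDs and scores from the printed results).
sample_explanations = [
    {"school_internal_id": "SCH0002", "similarity_score": 0.6644282038345252},
    {"school_internal_id": "SCH0001", "similarity_score": 0.593442},
]


def roundtrip_check(explanations, out_dir):
    """Write explanations as JSON and as a summary CSV, reload both, and compare."""
    json_path = os.path.join(out_dir, "match_explanations_sample.json")
    csv_path = os.path.join(out_dir, "match_explanations_summary.csv")

    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(explanations, f, indent=2)

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["school_internal_id", "similarity_score"])
        writer.writeheader()
        writer.writerows(explanations)

    with open(json_path, encoding="utf-8") as f:
        reloaded_json = json.load(f)
    with open(csv_path, newline="", encoding="utf-8") as f:
        reloaded_csv = list(csv.DictReader(f))

    # JSON should round-trip exactly; for the CSV, compare the ID column and row order.
    json_ok = reloaded_json == explanations
    ids_ok = [r["school_internal_id"] for r in reloaded_csv] == [
        e["school_internal_id"] for e in explanations
    ]
    return json_ok and ids_ok


with tempfile.TemporaryDirectory() as tmp_dir:
    all_ok = roundtrip_check(sample_explanations, tmp_dir)

print("Round trip OK:", all_ok)
```

CSV readers return every field as a string, so the sketch compares only IDs and row order for the CSV; numeric columns would need to be parsed before comparison.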
