# 02 – Feature Layer & School Vectors

This notebook takes the schools master data and:
1. Builds a small, interpretable **feature layer** (scores per school).
2. Converts these features into **numeric vectors** suitable for matching.

---

## Index

1. [Notebook Goals & Context](#1-notebook-goals--context)
2. [Load Sample Schools Master Data](#2-load-sample-schools-master-data)
3. [Design the Initial Feature Layer](#3-design-the-initial-feature-layer)
4. [Implement Feature Computation Functions](#4-implement-feature-computation-functions)
5. [Build School Feature Table](#5-build-school-feature-table)
6. [Convert Features to School Vectors](#6-convert-features-to-school-vectors)
7. [Sanity Checks on Vectors](#7-sanity-checks-on-vectors)
8. [Save Feature Layer & Vectors](#8-save-feature-layer--vectors)


## 1. Notebook Goals & Context

This notebook takes the **Schools Master Data** created in Notebook 01 and
begins transforming it into a form that a machine learning system can use for
matching.

We will build two layers:

### 1. A simple Feature Layer (0–1 scores)
We start by creating a small number of **interpretable features** that describe
each school numerically, such as:

- academic_rigor
- gifted_support
- logistics (transportation, before/after care)
- progressive_style (Montessori / inquiry-based influence)

These features are human-readable and help us understand how a school “behaves”
in terms of structure, academic pace, and learning style.

### 2. School vectors (numeric representation)
Once we have the feature scores, we convert them into **vectors** (lists of
numbers). These vectors allow us to compare schools to children using simple,
transparent ML techniques such as **cosine similarity**.

In simple terms:

> A vector is how a computer “sees” a school — as a small list of numbers.

Later, when we represent a child in the same way, we will match the child to
schools by measuring how close their vectors are.
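
As a rough illustration of "how close their vectors are", the sketch below computes cosine similarity between one child vector and two school vectors. All numbers, and the names `cosine_similarity`, `child`, `school_a`, and `school_b`, are hypothetical and chosen only for this example; returning 0.0 for an all-zero vector is a convention assumed here, not something the notebook defines.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Assumed convention: an all-zero vector matches nothing.
        return 0.0
    return float(np.dot(a, b) / denom)

# Hypothetical vectors in the order:
# [academic_rigor, gifted_support, logistics, progressive_style]
child = [0.2, 0.0, 0.6, 0.9]      # made-up child preferences
school_a = [0.0, 0.0, 0.7, 1.0]   # progressive school with good logistics
school_b = [0.99, 0.0, 0.3, 0.0]  # rigorous, traditional high school

sim_a = cosine_similarity(child, school_a)
sim_b = cosine_similarity(child, school_b)
print(f"child vs school_a: {sim_a:.3f}")
print(f"child vs school_b: {sim_b:.3f}")
```

With these made-up numbers, `school_a` scores well above `school_b` because its vector points in nearly the same direction as the child's.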

### What this notebook does
- Loads the sample data from Notebook 01  
- Defines a small, clear set of feature functions  
- Builds a Feature Layer DataFrame  
- Converts features into school vectors  
- Performs basic sanity checks  
- Saves:
  - `school_features_sample.csv`
  - `school_vectors_sample.npy`
  - `school_id_to_index.json`

### Why we keep this notebook simple
The goal is understanding, not over-optimization.  
We begin with only a few features and a simple vectorization pipeline so that the
matching logic stays:

- transparent  
- explainable to parents  
- easy to extend  
- easy to maintain  

---



## 2. Load Sample Schools Master Data

In this section we:

1. Load the sample schools master file  
   (`../data/processed/schools_master_sample.csv`).
2. Normalize data types for key columns:
   - Convert TRUE/FALSE-like columns to real booleans.
   - Leave numeric and string fields as-is for now.

This is important because the CSV stores some flags as text (e.g. "TRUE")
or as empty cells. We want these to become clean booleans (True/False) so
our feature functions behave correctly.



In [62]:
import pandas as pd
import numpy as np

# Path to the sample master file
sample_path = "../data/processed/schools_master_sample.csv"

# 1) Load the CSV
schools_master_df = pd.read_csv(sample_path)

print("Loaded schools_master_sample.csv")
print("Shape:", schools_master_df.shape)

# 2) Normalize boolean-like columns
def normalize_bool_column(series: pd.Series) -> pd.Series:
    """
    Convert a column with values like 'TRUE', 'FALSE', '', NaN
    into a proper boolean Series.

    - 'TRUE' (any case, with/without spaces) -> True
    - everything else (including blank, NaN, 'FALSE') -> False
    """
    return (
        series
        .astype(str)
        .str.strip()
        .str.upper()
        .eq("TRUE")
    )

# List of columns we *intend* to be booleans in the schema
BOOL_COLUMNS = [
    "is_elementary",
    "is_middle",
    "is_high",
    "has_preschool",
    "is_montessori",
    "is_ams_member",
    "is_waldorf",
    "is_progressive",
    "is_gifted_school",
    "is_2e_focused",
    "is_ib_school",
    "is_stem_focus",
    "is_arts_focus",
    "has_ap_program",
    "has_ib_program",
    "has_counselor",
    "has_special_ed_program",
    "has_504_support",
    "has_esl_ell_program",
    "has_gifted_program",
    "offers_ot_pt_speech",
    "has_counseling_services",
    "has_after_school_program",
    "has_transportation",
    "has_before_school_program",
    "has_financial_aid",
    "fuzzy_match_warning",
    "is_active_school",
]

# Only keep the columns that actually exist in this CSV
present_bool_cols = [col for col in BOOL_COLUMNS if col in schools_master_df.columns]

print("\nNormalizing boolean columns:")
print(present_bool_cols)

for col in present_bool_cols:
    schools_master_df[col] = normalize_bool_column(schools_master_df[col])

# quick check of dtypes and a few rows
print("\nDtypes for boolean-like columns:")
print(schools_master_df[present_bool_cols].dtypes)

print("\nPreview of boolean columns for first 3 schools:")
display(schools_master_df[["school_internal_id"] + present_bool_cols].head(3))

print(schools_master_df.columns.tolist())

Loaded schools_master_sample.csv
Shape: (3, 80)

Normalizing boolean columns:

Dtypes for boolean-like columns:
is_elementary                bool
is_middle                    bool
is_high                      bool
has_preschool                bool
is_montessori                bool
is_ams_member                bool
is_waldorf                   bool
is_progressive               bool
is_gifted_school             bool
is_2e_focused                bool
is_ib_school                 bool
is_stem_focus                bool
is_arts_focus                bool
has_ap_program               bool
has_ib_program               bool
has_counselor                bool
has_special_ed_program       bool
has_504_support              bool
has_esl_ell_program          bool
has_gifted_program           bool
offers_ot_pt_speech          bool
has_counseling_services      bool
has_after_school_program     bool
has_transportation           bool
has_before_school_program    bool
has_financial_aid            bool
is_a

| | school_internal_id | is_elementary | is_middle | is_high | has_preschool | is_montessori | is_ams_member | is_waldorf | is_progressive | is_gifted_school | ... | has_esl_ell_program | has_gifted_program | offers_ot_pt_speech | has_counseling_services | has_after_school_program | has_transportation | has_before_school_program | has_financial_aid | fuzzy_match_warning | is_active_school |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SCH0001 | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | True | False | False | False | False | False |
| 1 | SCH0002 | False | False | False | True | True | True | False | True | False | ... | False | False | False | False | True | False | True | True | False | False |
| 2 | SCH0003 | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | True | False | False | False | False |




## 3. Design the Initial Feature Layer 

This notebook computes a small set of **four improved, explainable, 0–1 scores**
that describe each school in a way that is fair, interpretable, and useful for
vector-based matching.

These scores are designed to fix several logic issues:

- No more “high school bias” in academic rigor
- No unfair advantage for AP schools over IB schools
- Gifted vs 2e needs are distinguished more clearly
- After-care is weighted more realistically for working parents
- Progressive specialist schools (Montessori/Waldorf) are treated correctly

The goal is simplicity + explainability + mathematical usefulness.

---

## The Updated Initial Feature Set 

### 1. **academic_rigor (0–1)**  
**Purpose:** Measures the school’s academic challenge level fairly across K–12.

**Includes:**  
- No high-school bias: rigor no longer depends on AP alone  
- IB and AP are weighted equally  
- Gifted K–8 schools can score high  
- Graduation rate is optional and adds only a small bonus  

**Inputs Used:**  
- `is_ib_school` / `has_ib_program`  
- `has_ap_program`  
- `ap_course_count`  
- `is_gifted_school` / `has_gifted_program` (for K–8 rigor)  
- `graduation_rate` (optional)

**Meaning:**  
- 0.0 → Low rigor  
- 1.0 → Very high rigor across any grade span  

---

### 2. **gifted_support (0–1)**  
**Purpose:** Measures how well the school supports *gifted* learners, while
treating 2e specialization as a smaller, more specific signal.

**Includes:**  
- Gifted-support remains dominant  
- 2e-focus is a niche positive, no longer overpowering  
- Percent-gifted (if available) provides smooth tuning  

**Inputs Used:**  
- `is_gifted_school`  
- `has_gifted_program`  
- `is_2e_focused`  
- `pct_gifted_identified` (optional)

**Meaning:**  
- 0.0 → No gifted support  
- 1.0 → Highly specialized gifted/2e school  

---

### 3. **logistics (0–1)**  
**Purpose:** Measures parent convenience and practical support services.

**Includes:**  
- After-school care is weighted highest (reflecting real parent pain points)  
- Transport and before-care still contribute meaningfully  

**Inputs Used:**  
- `has_transportation`  
- `has_before_school_program`  
- `has_after_school_program`

**Weights:**  
- After-care = 0.5  
- Transport = 0.3  
- Before-care = 0.2  

**Meaning:**  
- 0.0 → No family support services  
- 1.0 → Full logistical support  
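
To make the weighting concrete, here is a minimal table-driven sketch of the same arithmetic. The names `LOGISTICS_WEIGHTS` and `logistics_from_flags` are hypothetical and exist only for this illustration; the weights themselves are the ones listed above.

```python
# Weights taken from the list above; the dict and function names are
# illustrative only.
LOGISTICS_WEIGHTS = {
    "has_after_school_program": 0.5,
    "has_transportation": 0.3,
    "has_before_school_program": 0.2,
}

def logistics_from_flags(flags):
    """Sum the weights of every service the school offers (missing = False)."""
    total = sum(
        weight
        for column, weight in LOGISTICS_WEIGHTS.items()
        if flags.get(column, False)
    )
    return float(total)

# A school with after-care and before-care but no transportation:
example = {
    "has_after_school_program": True,
    "has_transportation": False,
    "has_before_school_program": True,
}
print(round(logistics_from_flags(example), 2))  # 0.5 + 0.2
```

Because the three weights sum to 1.0, a school offering all three services reaches the top of the scale without any extra normalization.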

---

### 4. **progressive_style (0–1)**  
**Purpose:** Measures how strongly the school leans toward progressive,
inquiry-based, child-led learning.

**Includes:**  
- Pure Montessori or Waldorf schools now correctly receive **strong** scores  
- Progressive signal is no longer diluted  
- Arts-focus is a mild bonus, not a primary driver  

**Inputs Used:**  
- `is_montessori`  
- `is_waldorf`  
- `is_progressive`  
- `is_arts_focus`

**Meaning:**  
- 0.0 → Traditional/structured  
- 0.8 → Strong progressive school (Montessori/Waldorf/Project-based)  
- 1.0 → Strong progressive + arts integration  

---

## Output of This Section

This defines the exact four features we will compute in Section 4:

- `academic_rigor`  
- `gifted_support`  
- `logistics`  
- `progressive_style`

These four numbers will then form the **school vector**:
`[academic_rigor, gifted_support, logistics, progressive_style]`

This is the foundation of our matching engine.


## 4. Implement Feature Computation Functions

In this section, we implement the improved formulas for the four v1 feature
scores:

1. `compute_academic_rigor_score`
2. `compute_gifted_support_score`
3. `compute_logistics_score`
4. `compute_progressive_style_score`

Each function returns a value between 0 and 1 using simple, explainable rules.
The formulas cover the design points from Section 3:

- AP and IB are weighted equally
- K–8 schools can earn rigor through gifted programming
- Gifted support outweighs 2e focus
- After-care > transport > before-care
- Specialists (Montessori/Waldorf) are no longer diluted

After implementing the functions, we will run them on the sample dataset to
verify that values fall in a reasonable range.



In [64]:
import numpy as np
import pandas as pd

# ---------------------------------------
# Helper utilities
# ---------------------------------------

def _bool(row, col):
    """Safe boolean extraction; NaN → False."""
    val = row.get(col, np.nan)
    if pd.isna(val):
        return False
    return bool(val)

def _float(row, col, default=np.nan):
    """Safe float extraction; NaN → default."""
    val = row.get(col, default)
    if pd.isna(val):
        return default
    try:
        return float(val)
    except (TypeError, ValueError):
        return default

# ---------------------------------------
# Corrected High School Detection Logic
# ---------------------------------------

def _is_high_school(row):
    """
    Determine if a school serves high school grades (9–12).
    Handles strings like:
      - "12"
      - "9-12"
      - "PK-12"
      - "K-12"
      - "KG-12"
      - "07-12"
      - "Ungraded" → False

    A school is treated as high school if the highest grade
    string contains any of: 9, 10, 11, 12.
    """

    hg = row.get("highest_grade", None)
    if hg is None or pd.isna(hg):
        return False

    # Convert to string for flexible parsing
    hg_str = str(hg).upper()

    # High school grade markers
    hs_markers = ["9", "10", "11", "12"]

    # If any HS marker appears anywhere in the grade string → treat as HS
    if any(marker in hg_str for marker in hs_markers):
        return True

    # Fallback: try numeric conversion (for clean numeric data)
    try:
        return float(hg) >= 9
    except (TypeError, ValueError):
        # Non-numeric strings such as "Ungraded" are not high school
        return False

# ---------------------------------------
# 1. Academic Rigor Score (v1)
# ---------------------------------------

def compute_academic_rigor_score(row):
    """
    Academic rigor (0–1), final v1:

    - AP, IB, and gifted K–8 all count as rigor signals.
    - Top AP or IB high schools can reach 1.0.
    - Gifted K–8 can reach ~0.8.
    - Graduation rate is optional and adds up to +0.2.
    """

    score = 0.0

    has_ib = _bool(row, "is_ib_school") or _bool(row, "has_ib_program")
    has_ap = _bool(row, "has_ap_program")
    is_gifted_env = _bool(row, "is_gifted_school") or _bool(row, "has_gifted_program")
    is_high = _is_high_school(row)

    # 1) Base rigor
    if has_ib or has_ap or is_gifted_env:
        score += 0.5

    # 2) Depth
    ap_count = _float(row, "ap_course_count", default=0.0)

    if has_ap:
        # AP depth now contributes up to +0.3
        score += min(ap_count / 10.0, 1.0) * 0.3

    if has_ib:
        # IB DP = +0.3 (strong), MYP = +0.1
        if _bool(row, "has_ib_dp"):
            score += 0.3
        elif _bool(row, "has_ib_myp"):
            score += 0.1

    # 3) K–8 gifted rigor bump
    if (not is_high) and is_gifted_env:
        score += 0.3

    # 4) Graduation rate bonus (HS only)
    grad = _float(row, "graduation_rate", default=np.nan)
    if is_high and not pd.isna(grad):
        if grad > 1.0:
            grad = grad / 100.0
        grad = np.clip(grad, 0.0, 1.0)
        score += grad * 0.2

    return float(np.clip(score, 0.0, 1.0))

# ---------------------------------------
# 2. Gifted Support Score (final v1)
# ---------------------------------------

def compute_gifted_support_score(row):
    """
    Gifted support (0–1):

    - Gifted-only schools score highest.
    - Schools with gifted programs score solidly.
    - 2e-focus is a smaller niche boost.
    - pct_gifted_identified fine-tunes.
    """

    score = 0.0

    if _bool(row, "is_gifted_school"):
        score += 0.7
    elif _bool(row, "has_gifted_program"):
        score += 0.4

    if _bool(row, "is_2e_focused"):
        score += 0.2

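    # pct_gifted_identified is treated as a 0–1 fraction here;
    # 0.10 (10%) or more earns the full +0.2.
    pct = _float(row, "pct_gifted_identified", default=np.nan)
    if not pd.isna(pct):
        score += min(pct / 0.10, 1.0) * 0.2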

    return float(np.clip(score, 0.0, 1.0))

# ---------------------------------------
# 3. Logistics Score (final v1)
# ---------------------------------------

def compute_logistics_score(row):
    """
    Logistics / family support (0–1):

    - After-school program: 0.5
    - Transportation:       0.3
    - Before-school program:0.2
    """

    score = 0.0

    if _bool(row, "has_after_school_program"):
        score += 0.5
    if _bool(row, "has_transportation"):
        score += 0.3
    if _bool(row, "has_before_school_program"):
        score += 0.2

    return float(np.clip(score, 0.0, 1.0))

# ---------------------------------------
# 4. Progressive Style Score (final v1)
# ---------------------------------------

def compute_progressive_style_score(row):
    """
    Progressive / inquiry-based style (0–1):

    - Montessori/Waldorf/Progressive schools → strong base (0.8)
    - Arts-focus → +0.2
    """

    score = 0.0

    if _bool(row, "is_montessori") or _bool(row, "is_waldorf") or _bool(row, "is_progressive"):
        score += 0.8

    if _bool(row, "is_arts_focus"):
        score += 0.2

    return float(np.clip(score, 0.0, 1.0))

# ---------------------------------------
# Smoke Test
# ---------------------------------------

print("Running v1 updated feature functions on sample data with corrected HS detection:\n")

for idx, row in schools_master_df.head(3).iterrows():
    print(f"School: {row.get('school_display_name', 'N/A')}")
    print("  academic_rigor     =", compute_academic_rigor_score(row))
    print("  gifted_support     =", compute_gifted_support_score(row))
    print("  logistics          =", compute_logistics_score(row))
    print("  progressive_style  =", compute_progressive_style_score(row))
    print()


Running v1 updated feature functions on sample data with corrected HS detection:

School: Sunnyvale Elementary School
  academic_rigor     = 0.0
  gifted_support     = 0.0
  logistics          = 0.5
  progressive_style  = 0.0

School: Bay Area Montessori Academy
  academic_rigor     = 0.0
  gifted_support     = 0.0
  logistics          = 0.7
  progressive_style  = 1.0

School: Mountain View International High School
  academic_rigor     = 0.99
  gifted_support     = 0.0
  logistics          = 0.3
  progressive_style  = 0.0



## 5. Build School Feature Table

Now that we have functions to compute:

- `academic_rigor`
- `gifted_support`
- `logistics`
- `progressive_style`

we will apply them to every school and create a compact **feature table**.

This table will be the main input for:

- Vector creation (next section)
- Similarity-based matching logic
- Debugging and explainability (easy to print and inspect)

For now, we keep only:

- `school_internal_id` (stable key)
- `school_display_name` (human-readable)
- the four feature columns


In [66]:
# Columns we want in the feature table
FEATURE_COLUMNS = [
    "academic_rigor",
    "gifted_support",
    "logistics",
    "progressive_style",
]

# 1) Compute feature scores for each school and add as new columns
schools_master_with_features = schools_master_df.copy()

schools_master_with_features["academic_rigor"] = schools_master_with_features.apply(
    compute_academic_rigor_score, axis=1
)
schools_master_with_features["gifted_support"] = schools_master_with_features.apply(
    compute_gifted_support_score, axis=1
)
schools_master_with_features["logistics"] = schools_master_with_features.apply(
    compute_logistics_score, axis=1
)
schools_master_with_features["progressive_style"] = schools_master_with_features.apply(
    compute_progressive_style_score, axis=1
)

# 2) Build the compact feature table
school_features_df = schools_master_with_features[
    ["school_internal_id", "school_display_name"] + FEATURE_COLUMNS
].copy()

print("School feature table created.")
print(f"Rows: {school_features_df.shape[0]}, Columns: {school_features_df.shape[1]}")
school_features_df.head()


School feature table created.
Rows: 3, Columns: 6


| | school_internal_id | school_display_name | academic_rigor | gifted_support | logistics | progressive_style |
|---|---|---|---|---|---|---|
| 0 | SCH0001 | Sunnyvale Elementary School | 0.0 | 0.0 | 0.5 | 0.0 |
| 1 | SCH0002 | Bay Area Montessori Academy | 0.0 | 0.0 | 0.7 | 1.0 |
| 2 | SCH0003 | Mountain View International High School | 0.99 | 0.0 | 0.3 | 0.0 |


## 6. Convert Features to School Vectors

Now that we have a compact feature table in `school_features_df`, we will:

1. Choose a fixed order for the feature columns:
   - `academic_rigor`
   - `gifted_support`
   - `logistics`
   - `progressive_style`

2. Convert these columns into a NumPy array of vectors, one vector per school.

Each school vector has the form:
`[academic_rigor, gifted_support, logistics, progressive_style]`

These vectors are the actual objects that a matching algorithm will use with
similarity functions (e.g., cosine similarity) when comparing schools to a
child's profile vector.
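
As a sketch of how the vector matrix might later be used, the example below ranks the three sample schools against a single child vector in one matrix operation. The school values are the feature scores shown in Section 5; the child vector and the name `rank_by_cosine` are hypothetical, and scoring all-zero vectors as 0.0 is an assumption made for this example.

```python
import numpy as np

def rank_by_cosine(child_vec, vectors, ids):
    """Return (school_id, similarity) pairs sorted from best to worst match."""
    child = np.asarray(child_vec, dtype=float)
    mat = np.asarray(vectors, dtype=float)
    dots = mat @ child
    norms = np.linalg.norm(mat, axis=1) * np.linalg.norm(child)
    # Assumed convention: rows with zero norm get similarity 0.0
    # instead of a division-by-zero NaN.
    sims = np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
    order = np.argsort(-sims)
    return [(ids[i], float(sims[i])) for i in order]

# Feature scores from the sample feature table, in FEATURE_COLUMNS order.
sample_ids = ["SCH0001", "SCH0002", "SCH0003"]
sample_vectors = [
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.7, 1.0],
    [0.99, 0.0, 0.3, 0.0],
]
child_vector = [0.2, 0.0, 0.6, 0.9]  # hypothetical child profile

ranking = rank_by_cosine(child_vector, sample_vectors, sample_ids)
for school_id, sim in ranking:
    print(f"{school_id}: {sim:.3f}")
```

One design point worth keeping in mind: cosine similarity compares direction only, so `[0, 0, 0.5, 0]` and `[0, 0, 1.0, 0]` look identical to it. If the strength of a feature should matter, a distance-based measure may be a better fit.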

In [68]:

# Ensure we are using the same feature column order as before
FEATURE_COLUMNS = [
    "academic_rigor",
    "gifted_support",
    "logistics",
    "progressive_style",
]

# 1) Extract the feature matrix (NumPy array)
school_vectors = school_features_df[FEATURE_COLUMNS].to_numpy(dtype=float)

# 2) Build an index mapping for convenience: school_internal_id -> row index
school_id_to_index = {
    row["school_internal_id"]: idx
    for idx, row in school_features_df.reset_index(drop=True).iterrows()
}

print("School vectors created.")
print("Shape of school_vectors:", school_vectors.shape)
print("\nExample (first school):")
print("ID:", school_features_df.iloc[0]["school_internal_id"])
print("Name:", school_features_df.iloc[0]["school_display_name"])
print("Vector:", school_vectors[0])


School vectors created.
Shape of school_vectors: (3, 4)

Example (first school):
ID: SCH0001
Name: Sunnyvale Elementary School
Vector: [0.  0.  0.5 0. ]


## 7. Sanity Checks on Vectors

Before using `school_vectors` in any matching logic, we run a few basic checks:

1. Confirm the shape is `(N_schools, N_features)`.
2. Ensure there are no missing (`NaN`) or infinite values.
3. Verify that all feature values are between 0 and 1.
4. Inspect a small preview of schools and their vectors.

These checks help catch mistakes early (e.g., wrong column order, bad scaling,
or unexpected missing values).
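
The first three checks can also be written as a single validation function that fails loudly, which is convenient when the notebook is rerun on a larger file. This is a minimal sketch; the name `validate_vectors` is hypothetical, and it uses `np.isfinite` so that infinite values are caught along with NaNs.

```python
import numpy as np

def validate_vectors(vectors, n_features=4):
    """Raise ValueError if the matrix fails any basic sanity check."""
    arr = np.asarray(vectors, dtype=float)
    # 1. Shape must be (N_schools, N_features)
    if arr.ndim != 2 or arr.shape[1] != n_features:
        raise ValueError(f"expected shape (N, {n_features}), got {arr.shape}")
    # 2. No NaN or infinite entries
    if not np.isfinite(arr).all():
        raise ValueError("vectors contain NaN or infinite values")
    # 3. Every score must lie in [0, 1]
    if ((arr < 0.0) | (arr > 1.0)).any():
        raise ValueError("vectors contain values outside [0, 1]")
    return arr.shape

print(validate_vectors([[0.0, 0.0, 0.5, 0.0], [0.99, 0.0, 0.3, 0.0]]))
```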


In [70]:

print("=== Sanity Checks on school_vectors ===\n")

# 1. Shape check
print("Shape of school_vectors:", school_vectors.shape)
print("Expected: (num_schools, 4)\n")

# 2. Value range check
min_val = np.nanmin(school_vectors)
max_val = np.nanmax(school_vectors)
print(f"Min value in vectors: {min_val:.3f}")
print(f"Max value in vectors: {max_val:.3f}")
print("All values should be between 0 and 1.\n")

# 3. NaN / missing value check
nan_mask = np.isnan(school_vectors)
num_nan = nan_mask.sum()
print(f"Number of NaN entries in school_vectors: {num_nan}")
if num_nan > 0:
    print("Warning: some vectors contain NaNs. Investigate source rows.")
else:
    print("Good: no NaNs in school_vectors.\n")

# 4. Per-feature summary from the DataFrame
print("\n=== Per-feature summary ===\n")
feature_summary = school_features_df[
    ["academic_rigor",
     "gifted_support",
     "logistics",
     "progressive_style"]
].describe()

display(feature_summary)

# 5. Quick look at the first few feature rows
print("\n=== First few school vectors with IDs & names ===\n")
for pos, (_, row) in enumerate(school_features_df.head(10).iterrows()):
    # Use the row position, not the index label, to look up the vector
    vec = school_vectors[pos]
    print(f"ID: {row['school_internal_id']}")
    print(f"Name: {row['school_display_name']}")
    print(f"Vector: {vec}")
    print("-" * 60)

=== Sanity Checks on school_vectors ===

Shape of school_vectors: (3, 4)
Expected: (num_schools, 4)

Min value in vectors: 0.000
Max value in vectors: 1.000
All values should be between 0 and 1.

Number of NaN entries in school_vectors: 0
Good: no NaNs in school_vectors.


=== Per-feature summary ===



| | academic_rigor | gifted_support | logistics | progressive_style |
|---|---|---|---|---|
| count | 3.0 | 3.0 | 3.0 | 3.0 |
| mean | 0.33 | 0.0 | 0.5 | 0.333333 |
| std | 0.571577 | 0.0 | 0.2 | 0.57735 |
| min | 0.0 | 0.0 | 0.3 | 0.0 |
| 25% | 0.0 | 0.0 | 0.4 | 0.0 |
| 50% | 0.0 | 0.0 | 0.5 | 0.0 |
| 75% | 0.495 | 0.0 | 0.6 | 0.5 |
| max | 0.99 | 0.0 | 0.7 | 1.0 |



=== First few school vectors with IDs & names ===

ID: SCH0001
Name: Sunnyvale Elementary School
Vector: [0.  0.  0.5 0. ]
------------------------------------------------------------
ID: SCH0002
Name: Bay Area Montessori Academy
Vector: [0.  0.  0.7 1. ]
------------------------------------------------------------
ID: SCH0003
Name: Mountain View International High School
Vector: [0.99 0.   0.3  0.  ]
------------------------------------------------------------


## 8. Save Feature Layer & Vectors

Now that we have:

- `school_features_df` — the compact feature table  
- `school_vectors` — the NumPy matrix used for similarity search  
- `school_id_to_index` — a lookup map from school ID → vector row  

we will save them to the `data/processed/` folder for downstream use.

These files will be consumed later by:

- The child-profile vector notebook  
- The similarity scoring / matching notebook  
- Any prototype API or matching engine  
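
For downstream notebooks, reading the three artifacts back might look like the sketch below. The file names match the ones saved in this section; the function name `load_artifacts` and the consistency check are assumptions added for illustration. To keep the example self-contained, it first writes small stand-in files to a temporary directory instead of touching `../data/processed`.

```python
import json
import os
import tempfile

import numpy as np
import pandas as pd

def load_artifacts(processed_dir):
    """Load the feature table, vector matrix, and ID -> row index map."""
    features = pd.read_csv(os.path.join(processed_dir, "school_features_sample.csv"))
    vectors = np.load(os.path.join(processed_dir, "school_vectors_sample.npy"))
    with open(os.path.join(processed_dir, "school_id_to_index.json")) as f:
        id_to_index = json.load(f)
    # All three artifacts should describe the same number of schools.
    if not (len(features) == vectors.shape[0] == len(id_to_index)):
        raise ValueError("feature table, vectors, and ID map are out of sync")
    return features, vectors, id_to_index

# Stand-in data mirroring the sample outputs, written to a temp directory.
ids = ["SCH0001", "SCH0002", "SCH0003"]
demo_vectors = np.array([[0.0, 0.0, 0.5, 0.0],
                         [0.0, 0.0, 0.7, 1.0],
                         [0.99, 0.0, 0.3, 0.0]])
with tempfile.TemporaryDirectory() as tmp_dir:
    pd.DataFrame({"school_internal_id": ids}).to_csv(
        os.path.join(tmp_dir, "school_features_sample.csv"), index=False)
    np.save(os.path.join(tmp_dir, "school_vectors_sample.npy"), demo_vectors)
    with open(os.path.join(tmp_dir, "school_id_to_index.json"), "w") as f:
        json.dump({sid: i for i, sid in enumerate(ids)}, f)

    features, vectors, id_to_index = load_artifacts(tmp_dir)

# Look up one school's vector through the ID map.
print(vectors[id_to_index["SCH0002"]])
```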


In [72]:
import json
import os

# Make sure the processed data directory exists
processed_dir = "../data/processed"
os.makedirs(processed_dir, exist_ok=True)

# 1. Save the feature table (CSV)
features_path = os.path.join(processed_dir, "school_features_sample.csv")
school_features_df.to_csv(features_path, index=False)
print(f"Saved school feature table → {features_path}")

# 2. Save the vector matrix (.npy binary file)
vectors_path = os.path.join(processed_dir, "school_vectors_sample.npy")
np.save(vectors_path, school_vectors)
print(f"Saved school vectors → {vectors_path}")

# 3. Save the ID → index mapping (JSON)
mapping_path = os.path.join(processed_dir, "school_id_to_index.json")
with open(mapping_path, "w") as f:
    json.dump(school_id_to_index, f, indent=2)
print(f"Saved ID→index map → {mapping_path}")

print("\nAll artifacts saved successfully.")


Saved school feature table → ../data/processed/school_features_sample.csv
Saved school vectors → ../data/processed/school_vectors_sample.npy
Saved ID→index map → ../data/processed/school_id_to_index.json

All artifacts saved successfully.
