# 01 – School Master Schema & Sample Data

This notebook defines the **schools_master_schema** (Golden Record for schools)
and creates a **small sample dataset** (3–5 schools) that matches the schema.

We will use this as a clean, controlled sandbox before dealing with large real datasets.

---

## Index

1. [Notebook Goals & Context](#1-notebook-goals--context)
2. [Final Schools Master Schema (Reference)](#2-final-schools-master-schema-reference)
3. [Create Empty DataFrame with Schema Columns](#3-create-empty-dataframe-with-schema-columns)
4. [Add a Few Sample Schools Manually](#4-add-a-few-sample-schools-manually)
5. [Sanity Checks on the Sample Data](#5-sanity-checks-on-the-sample-data)
6. [Save Sample Data for Later Notebooks](#6-save-sample-data-for-later-notebooks)


---

## 1. Notebook Goals & Context

This notebook is the starting point for the Smart School System.  
Its purpose is to define and validate the **Schools Master Schema**, which serves as
the unified "Golden Record" for all school data used in the project.

### Why this notebook is important
- Establishes a **single, authoritative schema** before ingesting any real datasets.
- Ensures all future data (NCES, CRDC, CA Private Directory, Montessori, IB, etc.)
  can be merged cleanly into the same structure.
- Creates a **small sample dataset** (3–5 schools) to test schema alignment,
  column types, and basic transformations.
- Prepares the foundation for later notebooks:
  - Feature engineering (derived scores)
  - Vectorization (ML-ready numeric features)
  - School–child matching engine

### What this notebook will produce
- A finalized **schools_master_schema** (human-readable reference).
- An empty Pandas DataFrame using all schema columns.
- A few manually constructed sample school records.
- Sanity checks to confirm the schema works in practice.
- A saved sample dataset for use in the next notebook.

---

## 2. Final Schools Master Schema (Reference)

This section defines the **Golden Record schema** for schools.  
It contains only *raw factual fields* (roles: BACKBONE, RAW_FEATURE, META).  
Derived ML scores live in a separate feature layer and are **not** included here.

---

### How to read this schema
- **BACKBONE** → unique identifiers and stable join keys  
- **RAW_FEATURE** → factual data used for later ML feature engineering  
- **META** → ingestion metadata and data quality markers  

Each section below is collapsible for readability.
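
The role tags above can also be carried in code next to the column names, which makes it easy to select, for example, only the join keys. The sketch below is one possible representation, not part of the notebook's pipeline: `ColumnSpec`, `SCHEMA_SPEC_SAMPLE`, and `columns_with_role` are hypothetical names, and only a handful of columns from the tables in this section are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    """One schema column: its name, logical type, and role tag."""
    name: str
    dtype: str
    role: str  # "BACKBONE", "RAW_FEATURE", or "META"


# A few entries copied from the schema tables; this is not the full schema.
SCHEMA_SPEC_SAMPLE = [
    ColumnSpec("school_internal_id", "string", "BACKBONE"),
    ColumnSpec("nces_id", "string", "BACKBONE"),
    ColumnSpec("latitude", "float", "RAW_FEATURE"),
    ColumnSpec("total_enrollment", "int", "RAW_FEATURE"),
    ColumnSpec("record_confidence_score", "float", "META"),
]


def columns_with_role(specs, role):
    """Return the names of all columns tagged with the given role."""
    return [spec.name for spec in specs if spec.role == role]


print(columns_with_role(SCHEMA_SPEC_SAMPLE, "BACKBONE"))
```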

---

### Schools Master Schema (Golden Record)

---

<details>
<summary><strong>1. Identifiers (Backbone)</strong></summary>

| Column name          | Type   | Description                                           | Role      |
|----------------------|--------|-------------------------------------------------------|-----------|
| school_internal_id   | string | Your stable unique ID (primary key).                 | BACKBONE  |
| nces_id              | string | Federal NCES school ID.                              | BACKBONE  |
| lea_id               | string | NCES district/LEA ID.                                | BACKBONE  |
| ppin                 | string | Private School PIN from PSS.                         | BACKBONE  |
| cds_code             | string | CA DOE CDS code (county–district–school).           | BACKBONE  |
| ib_id                | string | IB program identifier (if available).               | BACKBONE  |
| source_school_name   | string | Canonical school name selected from sources.         | RAW_FEATURE |
| alternate_names      | string | Delimited list of alternate names.                  | RAW_FEATURE |

</details>

---

<details>
<summary><strong>2. Location & Geography</strong></summary>

| Column name        | Type   | Description                               | Role        |
|--------------------|--------|-------------------------------------------|-------------|
| address_line_1     | string | Main street address.                      | RAW_FEATURE |
| address_line_2     | string | Suite/building/etc.                       | RAW_FEATURE |
| city               | string | City.                                     | RAW_FEATURE |
| county             | string | County name.                              | RAW_FEATURE |
| state              | string | Two-letter state code.                    | RAW_FEATURE |
| zip_code           | string | Postal code.                              | RAW_FEATURE |
| latitude           | float  | Latitude coordinate.                      | RAW_FEATURE |
| longitude          | float  | Longitude coordinate.                     | RAW_FEATURE |
| urbanicity         | string | Urban/Suburban/Town/Rural.                | RAW_FEATURE |
| nces_locale_code   | string | NCES locale code.                         | RAW_FEATURE |
| commute_zone       | string | UX region (e.g., “South Bay”).            | RAW_FEATURE |

</details>

---

<details>
<summary><strong>3. Basic School Profile</strong></summary>

| Column name          | Type   | Description                           | Role        |
|----------------------|--------|---------------------------------------|-------------|
| school_display_name  | string | Cleaned, user-facing school name.     | RAW_FEATURE |
| school_website       | string | School website URL.                   | RAW_FEATURE |
| school_type          | string | Public / Private / Charter / etc.     | RAW_FEATURE |
| governance_model     | string | District, Independent, Diocesan, etc. | RAW_FEATURE |
| religious_affiliation| string | None / Catholic / Jewish / etc.       | RAW_FEATURE |
| coed_status          | string | Co-ed / Boys / Girls.                 | RAW_FEATURE |
| boarding_status      | string | Day / Boarding / Both.                | RAW_FEATURE |
| year_founded         | int    | Year founded.                         | RAW_FEATURE |

</details>

---

<details>
<summary><strong>4. Grades, Ages, and Structure</strong></summary>

| Column name            | Type   | Description                             | Role        |
|------------------------|--------|-----------------------------------------|-------------|
| lowest_grade           | string | Lowest grade served (PK, K, 1...).      | RAW_FEATURE |
| highest_grade          | string | Highest grade served (5, 8, 12...).     | RAW_FEATURE |
| is_elementary          | bool   | Serves PK–5.                            | RAW_FEATURE |
| is_middle              | bool   | Serves grades 6–8.                      | RAW_FEATURE |
| is_high                | bool   | Serves grades 9–12.                     | RAW_FEATURE |
| has_preschool          | bool   | Has preschool program.                  | RAW_FEATURE |
| age_range_description  | string | Free-text age range from source.        | RAW_FEATURE |
| min_age_months         | int    | Youngest age served (in months).        | RAW_FEATURE |
| max_age_months         | int    | Oldest age served (in months).          | RAW_FEATURE |

</details>

---

<details>
<summary><strong>5. Enrollment & Demographics</strong></summary>

This compact table lists only the key columns; the remaining demographic columns can be added later.

| Column name         | Type  | Description                         | Role        |
|---------------------|-------|-------------------------------------|-------------|
| total_enrollment    | int   | Total student count.                | RAW_FEATURE |
| pct_ell             | float | % English Learners.                 | RAW_FEATURE |
| pct_swd             | float | % Students with disabilities.       | RAW_FEATURE |
| pct_econ_disadvantaged | float | % economically disadvantaged.   | RAW_FEATURE |
| pct_gifted_identified | float | % identified gifted (if avail).  | RAW_FEATURE |

*(Additional race/ethnicity columns are part of the full schema but are omitted from this table for readability. The column list in Section 3 also includes only this core subset for now.)*

</details>

---

<details>
<summary><strong>6. Programs & Pedagogy Tags</strong></summary>

| Column name          | Type  | Description                        | Role        |
|----------------------|-------|------------------------------------|-------------|
| is_montessori        | bool  | Montessori pedagogy school.        | RAW_FEATURE |
| is_ams_member        | bool  | Listed in AMS dataset.             | RAW_FEATURE |
| is_waldorf           | bool  | Waldorf/Steiner school.            | RAW_FEATURE |
| is_progressive       | bool  | Progressive/project-based.         | RAW_FEATURE |
| is_gifted_school     | bool  | Explicit gifted school.            | RAW_FEATURE |
| is_2e_focused        | bool  | Explicit twice-exceptional focus.  | RAW_FEATURE |
| is_ib_school         | bool  | Has at least one IB program.       | RAW_FEATURE |
| is_stem_focus        | bool  | STEM/STEAM emphasis.               | RAW_FEATURE |
| is_arts_focus        | bool  | Arts emphasis.                      | RAW_FEATURE |

</details>

---

<details>
<summary><strong>7. Academics & Rigor</strong></summary>

| Column name         | Type  | Description                         | Role        |
|---------------------|-------|-------------------------------------|-------------|
| has_ap_program      | bool  | School offers AP courses.           | RAW_FEATURE |
| ap_course_count     | int   | Number of AP courses.               | RAW_FEATURE |
| has_ib_program      | bool  | Any IB offering.                    | RAW_FEATURE |
| graduation_rate     | float | Graduation rate (if HS).            | RAW_FEATURE |

</details>

---

<details>
<summary><strong>8. Staffing & Resources</strong></summary>

| Column name            | Type  | Description                         | Role        |
|------------------------|-------|-------------------------------------|-------------|
| student_teacher_ratio  | float | Students per teacher.               | RAW_FEATURE |
| has_counselor          | bool  | At least one counselor.             | RAW_FEATURE |
| counselor_student_ratio| float | Students per counselor.             | RAW_FEATURE |

</details>

---

<details>
<summary><strong>9. Student Support & Services</strong></summary>

| Column name               | Type | Description                              | Role        |
|---------------------------|------|------------------------------------------|-------------|
| has_special_ed_program    | bool | Provides special education.              | RAW_FEATURE |
| has_504_support           | bool | Explicit 504 accommodations.             | RAW_FEATURE |
| has_esl_ell_program       | bool | ESL/ELL support.                         | RAW_FEATURE |
| has_gifted_program        | bool | Gifted enrichment.                       | RAW_FEATURE |
| offers_ot_pt_speech       | bool | On-site OT/PT/Speech services.           | RAW_FEATURE |
| has_counseling_services   | bool | Counseling services available.           | RAW_FEATURE |
| has_after_school_program  | bool | After-school offerings.                  | RAW_FEATURE |
| has_transportation        | bool | School provides transportation.          | RAW_FEATURE |
| has_before_school_program | bool | Before-school care.                      | RAW_FEATURE |

</details>

---

<details>
<summary><strong>10. Equity, Discipline & Safety</strong></summary>

| Column name           | Type  | Description                         | Role        |
|-----------------------|-------|-------------------------------------|-------------|
| suspensions_total     | int   | Number of suspensions.              | RAW_FEATURE |
| expulsions_total      | int   | Number of expulsions.               | RAW_FEATURE |
| bullying_incidents_total | int | Bullying incidents.                 | RAW_FEATURE |
| seclusion_restraint_incidents_total | int | Restraint/seclusion events. | RAW_FEATURE |

</details>

---

<details>
<summary><strong>11. Tuition & Finance</strong></summary>

| Column name      | Type  | Description                     | Role        |
|------------------|-------|---------------------------------|-------------|
| tuition_min      | float | Minimum annual tuition (USD).   | RAW_FEATURE |
| tuition_max      | float | Maximum annual tuition (USD).   | RAW_FEATURE |
| has_financial_aid| bool  | Offers financial aid.           | RAW_FEATURE |

</details>

---

<details>
<summary><strong>12. Meta & Data Quality</strong></summary>

| Column name              | Type  | Description                          | Role   |
|--------------------------|-------|--------------------------------------|--------|
| data_sources             | string| Data source list.                    | META   |
| record_confidence_score  | float | 0–1 confidence in merged record.     | META   |
| fuzzy_match_warning      | bool  | Low-confidence fuzzy match.          | META   |
| last_updated_date        | date  | Last update timestamp.               | META   |
| first_seen_date          | date  | When school was first ingested.      | META   |
| is_active_school         | bool  | School currently open.               | META   |
| closure_year             | int   | Year closed (if applicable).         | META   |

</details>

---

## 3. Create Empty DataFrame with Schema Columns

In this section, we translate the **Schools Master Schema** into a concrete
Pandas DataFrame structure.

For now, we are not loading any real data.  
We only:

1. Define the full list of schema column names.
2. Create an empty `DataFrame` with these columns.
3. Confirm that the structure matches our Golden Record design.

This gives us a stable “container” that all future ETL steps will populate.


In [25]:
import pandas as pd

# 1. Define the schema columns based on Section 2 (Golden Record)
SCHOOL_SCHEMA_COLUMNS = [
    # 1. Identifiers (Backbone)
    "school_internal_id",
    "nces_id",
    "lea_id",
    "ppin",
    "cds_code",
    "ib_id",
    "source_school_name",
    "alternate_names",

    # 2. Location & Geography
    "address_line_1",
    "address_line_2",
    "city",
    "county",
    "state",
    "zip_code",
    "latitude",
    "longitude",
    "urbanicity",
    "nces_locale_code",
    "commute_zone",

    # 3. Basic School Profile
    "school_display_name",
    "school_website",
    "school_type",
    "governance_model",
    "religious_affiliation",
    "coed_status",
    "boarding_status",
    "year_founded",

    # 4. Grades, Ages, and Structure
    "lowest_grade",
    "highest_grade",
    "is_elementary",
    "is_middle",
    "is_high",
    "has_preschool",
    "age_range_description",
    "min_age_months",
    "max_age_months",

    # 5. Enrollment & Demographics (core subset for now)
    "total_enrollment",
    "pct_ell",
    "pct_swd",
    "pct_econ_disadvantaged",
    "pct_gifted_identified",

    # 6. Programs & Pedagogy Tags
    "is_montessori",
    "is_ams_member",
    "is_waldorf",
    "is_progressive",
    "is_gifted_school",
    "is_2e_focused",
    "is_ib_school",
    "is_stem_focus",
    "is_arts_focus",

    # 7. Academics & Rigor
    "has_ap_program",
    "ap_course_count",
    "has_ib_program",
    "graduation_rate",

    # 8. Staffing & Resources
    "student_teacher_ratio",
    "has_counselor",
    "counselor_student_ratio",

    # 9. Student Support & Services
    "has_special_ed_program",
    "has_504_support",
    "has_esl_ell_program",
    "has_gifted_program",
    "offers_ot_pt_speech",
    "has_counseling_services",
    "has_after_school_program",
    "has_transportation",
    "has_before_school_program",

    # 10. Equity, Discipline & Safety
    "suspensions_total",
    "expulsions_total",
    "bullying_incidents_total",
    "seclusion_restraint_incidents_total",

    # 11. Tuition & Finance
    "tuition_min",
    "tuition_max",
    "has_financial_aid",

    # 12. Meta & Data Quality
    "data_sources",
    "record_confidence_score",
    "fuzzy_match_warning",
    "last_updated_date",
    "first_seen_date",
    "is_active_school",
    "closure_year",
]

# 2. Create an empty DataFrame with this schema
schools_master_df = pd.DataFrame(columns=SCHOOL_SCHEMA_COLUMNS)

print(f"Number of columns: {len(schools_master_df.columns)}")
schools_master_df.head()


Number of columns: 80


Unnamed: 0,school_internal_id,nces_id,lea_id,ppin,cds_code,ib_id,source_school_name,alternate_names,address_line_1,address_line_2,...,tuition_min,tuition_max,has_financial_aid,data_sources,record_confidence_score,fuzzy_match_warning,last_updated_date,first_seen_date,is_active_school,closure_year
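
An empty DataFrame built from column names alone carries no type information, so columns that are entirely empty will later default to `float64`. One way to attach the schema types up front is sketched below. The mapping `PANDAS_DTYPES` from schema type names to pandas nullable dtypes is an assumption, not something the schema prescribes, and `SCHEMA_TYPES_SAMPLE` covers only a few columns for illustration.

```python
import pandas as pd

# Assumed mapping from the schema's type names to pandas (nullable) dtypes.
PANDAS_DTYPES = {
    "string": "string",
    "int": "Int64",        # nullable integer, so missing values stay integers
    "float": "Float64",
    "bool": "boolean",     # nullable boolean
    "date": "datetime64[ns]",
}

# Small illustrative subset of the schema: column name -> schema type.
SCHEMA_TYPES_SAMPLE = {
    "school_internal_id": "string",
    "nces_id": "string",
    "latitude": "float",
    "year_founded": "int",
    "is_montessori": "bool",
    "last_updated_date": "date",
}

# Build an empty, typed DataFrame one column at a time.
typed_empty_df = pd.DataFrame({
    col: pd.Series(dtype=PANDAS_DTYPES[type_name])
    for col, type_name in SCHEMA_TYPES_SAMPLE.items()
})

print(typed_empty_df.dtypes)
```

The nullable dtypes (`Int64`, `boolean`) are used here so that integer and boolean columns can hold missing values without being converted to floats.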


## 4. Add a Few Sample Schools Manually

To validate that our schema works in practice, we will manually create
3–5 school records and load them into `schools_master_df`.

These sample rows serve several purposes:

- Confirm that the schema is usable and not missing critical fields.
- Allow early testing of cleaning functions, vector builders, and feature engineering.
- Provide a lightweight dataset for rapid prototyping before working with full NCES/CRDC files.

We intentionally fill **only a subset of fields** for each sample school.
Unfilled fields remain `NaN` and will be populated later during ingestion.


In [27]:
import numpy as np

sample_schools = [
    # ---------------------------------------------------------
    # 1. PUBLIC ELEMENTARY SCHOOL (Structured, Conventional)
    # ---------------------------------------------------------
    {
        "school_internal_id": "SCH0001",
        "nces_id": "063441012345",
        "school_display_name": "Sunnyvale Elementary School",
        "source_school_name": "Sunnyvale Elementary School",
        "city": "Sunnyvale",
        "county": "Santa Clara",
        "state": "CA",
        "zip_code": "94087",
        "latitude": 37.3688,
        "longitude": -122.0363,
        
        "school_type": "Public",
        "governance_model": "District",
        "lowest_grade": "K",
        "highest_grade": "5",
        "is_elementary": True,
        "has_preschool": False,

        "total_enrollment": 520,
        "pct_ell": 0.12,
        "pct_swd": 0.08,
        "pct_econ_disadvantaged": 0.18,

        "student_teacher_ratio": 22.0,
        "has_counselor": True,

        "has_special_ed_program": True,
        "has_504_support": True,
        "has_after_school_program": True,
        "has_transportation": False,
    },

    # ---------------------------------------------------------
    # 2. PRIVATE MONTESSORI SCHOOL (Low structure, high creativity)
    # ---------------------------------------------------------
    {
        "school_internal_id": "SCH0002",
        "ppin": "PSS998877",
        "school_display_name": "Bay Area Montessori Academy",
        "source_school_name": "Bay Area Montessori Academy",
        "city": "Cupertino",
        "county": "Santa Clara",
        "state": "CA",
        "zip_code": "95014",
        "latitude": 37.3220,
        "longitude": -122.0322,

        "school_type": "Private",
        "governance_model": "Independent",
        "religious_affiliation": "None",

        "lowest_grade": "PK",
        "highest_grade": "5",
        "has_preschool": True,
        "age_range_description": "18 months – 11 years",
        "min_age_months": 18,
        "max_age_months": 132,

        "is_montessori": True,
        "is_ams_member": True,
        "is_progressive": True,
        "is_arts_focus": True,

        "tuition_min": 18000,
        "tuition_max": 28000,
        "has_financial_aid": True,

        "has_after_school_program": True,
        "has_before_school_program": True,
        "has_transportation": False,
    },

    # ---------------------------------------------------------
    # 3. IB HIGH SCHOOL (Structured + rigorous academic profile)
    # ---------------------------------------------------------
    {
        "school_internal_id": "SCH0003",
        "nces_id": "063441099999",
        "ib_id": "IB12345",
        "school_display_name": "Mountain View International High School",
        "source_school_name": "Mountain View International High School",
        "city": "Mountain View",
        "county": "Santa Clara",
        "state": "CA",
        "zip_code": "94040",
        "latitude": 37.3861,
        "longitude": -122.0839,

        "school_type": "Public",
        "governance_model": "District",
        "coed_status": "Co-ed",

        "lowest_grade": "9",
        "highest_grade": "12",
        "is_high": True,

        "is_ib_school": True,
        "has_ib_dp": True,
        "has_ib_myp": True,
        "has_ib_program": True,

        "has_ap_program": True,
        "ap_course_count": 12,
        "graduation_rate": 0.95,

        "student_teacher_ratio": 18.0,
        "has_counselor": True,
        "counselor_student_ratio": 350,

        "has_transportation": True,
    }
]

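# Build a DataFrame from the sample records, aligned to the schema columns.
# Keys that are not in SCHOOL_SCHEMA_COLUMNS (e.g. "has_ib_dp", "has_ib_myp") are dropped.
sample_df = pd.DataFrame(sample_schools, columns=SCHOOL_SCHEMA_COLUMNS)

# Replace the empty master frame with the sample rows
schools_master_df = sample_df.copy()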

print(f"Sample rows added: {len(sample_df)}")
schools_master_df.head()


Sample rows added: 3


Unnamed: 0,school_internal_id,nces_id,lea_id,ppin,cds_code,ib_id,source_school_name,alternate_names,address_line_1,address_line_2,...,tuition_min,tuition_max,has_financial_aid,data_sources,record_confidence_score,fuzzy_match_warning,last_updated_date,first_seen_date,is_active_school,closure_year
0,SCH0001,63441012345.0,,,,,Sunnyvale Elementary School,,,,...,,,,,,,,,,
1,SCH0002,,,PSS998877,,,Bay Area Montessori Academy,,,,...,18000.0,28000.0,True,,,,,,,
2,SCH0003,63441099999.0,,,,IB12345,Mountain View International High School,,,,...,,,,,,,,,,
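
Note that the third sample record sets `has_ib_dp` and `has_ib_myp`, which are not in `SCHOOL_SCHEMA_COLUMNS`. Because the DataFrame is built with an explicit `columns=` list, such keys are left out of the result without any warning. A small pre-check can surface them before they disappear. The sketch below uses a hypothetical helper, `find_unknown_keys`, and a shortened column list so that it runs on its own.

```python
# Shortened stand-in for SCHOOL_SCHEMA_COLUMNS, for illustration only.
SCHEMA_COLUMNS_SAMPLE = ["school_internal_id", "is_ib_school", "has_ib_program"]

# One record shaped like the IB high school sample above.
records = [
    {
        "school_internal_id": "SCH0003",
        "is_ib_school": True,
        "has_ib_dp": True,
        "has_ib_myp": True,
        "has_ib_program": True,
    },
]


def find_unknown_keys(records, schema_columns):
    """Return the sorted record keys that are not schema columns."""
    allowed = set(schema_columns)
    unknown = set()
    for record in records:
        unknown |= set(record) - allowed
    return sorted(unknown)


print(find_unknown_keys(records, SCHEMA_COLUMNS_SAMPLE))
# ['has_ib_dp', 'has_ib_myp']
```

Whether such keys should be added to the schema or removed from the sample records is a design decision; the check only makes the mismatch visible.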


## 5. Sanity Checks on the Sample Data

Now that we have a few sample schools in `schools_master_df`, we will run some
basic sanity checks to make sure:

- The DataFrame has the expected number of rows and columns.
- Column names match the schema definition.
- Key identifier fields look reasonable and (eventually) unique.
- Categorical values (e.g., `school_type`) come from the expected set of values.
- There are no obvious type issues (e.g., strings where numbers are expected).

This is still a very small manual dataset, so we are not worried about
statistical properties yet. The goal is simply to confirm that the **schema
works in practice** and feels comfortable to use.


In [29]:
# Quick peek at the first few rows
print("1. Basic shape & preview")
print(f"Rows: {schools_master_df.shape[0]}, Columns: {schools_master_df.shape[1]}")
display(schools_master_df.head())

# Check column names against the expected schema
print("\n2. Columns match schema?")
missing_in_df = set(SCHOOL_SCHEMA_COLUMNS) - set(schools_master_df.columns)
extra_in_df = set(schools_master_df.columns) - set(SCHOOL_SCHEMA_COLUMNS)

print(f"Columns missing in DataFrame: {missing_in_df}")
print(f"Extra columns in DataFrame:   {extra_in_df}")

# Data types overview
print("\n3. Data types:")
display(schools_master_df.dtypes.to_frame("dtype"))

# Check uniqueness of the primary ID
print("\n4. school_internal_id uniqueness:")
print(schools_master_df["school_internal_id"])
print("Unique count:", schools_master_df["school_internal_id"].nunique())

# Basic value counts for a few key categoricals
print("\n5. school_type value counts:")
print(schools_master_df["school_type"].value_counts(dropna=False))

print("\n6. city value counts:")
print(schools_master_df["city"].value_counts(dropna=False))

# Missingness summary (fraction missing) for a small subset of important columns
important_cols = [
    "school_internal_id",
    "school_display_name",
    "school_type",
    "city",
    "state",
    "lowest_grade",
    "highest_grade",
    "total_enrollment",
]

print("\n7. Missing values (important columns):")
missing_summary = schools_master_df[important_cols].isna().mean().sort_values(ascending=False)
display(missing_summary.to_frame("pct_missing"))


1. Basic shape & preview
Rows: 3, Columns: 80


Unnamed: 0,school_internal_id,nces_id,lea_id,ppin,cds_code,ib_id,source_school_name,alternate_names,address_line_1,address_line_2,...,tuition_min,tuition_max,has_financial_aid,data_sources,record_confidence_score,fuzzy_match_warning,last_updated_date,first_seen_date,is_active_school,closure_year
0,SCH0001,63441012345.0,,,,,Sunnyvale Elementary School,,,,...,,,,,,,,,,
1,SCH0002,,,PSS998877,,,Bay Area Montessori Academy,,,,...,18000.0,28000.0,True,,,,,,,
2,SCH0003,63441099999.0,,,,IB12345,Mountain View International High School,,,,...,,,,,,,,,,



2. Columns match schema?
Columns missing in DataFrame: set()
Extra columns in DataFrame:   set()

3. Data types:


| Column              | dtype   |
|---------------------|---------|
| school_internal_id  | object  |
| nces_id             | object  |
| lea_id              | float64 |
| ppin                | object  |
| cds_code            | float64 |
| ...                 | ...     |
| fuzzy_match_warning | float64 |
| last_updated_date   | float64 |
| first_seen_date     | float64 |
| is_active_school    | float64 |



4. school_internal_id uniqueness:
0    SCH0001
1    SCH0002
2    SCH0003
Name: school_internal_id, dtype: object
Unique count: 3

5. school_type value counts:
school_type
Public     2
Private    1
Name: count, dtype: int64

6. city value counts:
city
Sunnyvale        1
Cupertino        1
Mountain View    1
Name: count, dtype: int64

7. Missing values (important columns):


| Column              | pct_missing |
|---------------------|-------------|
| total_enrollment    | 0.666667    |
| school_internal_id  | 0.0         |
| school_display_name | 0.0         |
| school_type         | 0.0         |
| city                | 0.0         |
| state               | 0.0         |
| lowest_grade        | 0.0         |
| highest_grade       | 0.0         |
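
The printed checks above rely on someone reading the output. As a possible next step, the same ideas can be expressed as a function that returns a list of problems, which is easier to reuse once real data arrives. This is a minimal sketch under stated assumptions: `run_sanity_checks`, `ALLOWED_SCHOOL_TYPES`, and `demo_df` are hypothetical names, the allowed set is taken from the examples in the schema table, and percentages are assumed to be stored as fractions between 0 and 1, as in the sample records.

```python
import pandas as pd

# Assumed allowed values, based on the examples in the schema table.
ALLOWED_SCHOOL_TYPES = {"Public", "Private", "Charter"}
# Assumption: percentage columns hold fractions in [0, 1], as in the samples.
PCT_COLUMNS = ["pct_ell", "pct_swd"]


def run_sanity_checks(df):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []

    ids = df["school_internal_id"]
    if ids.isna().any():
        problems.append("school_internal_id has missing values")
    if not ids.is_unique:
        problems.append("school_internal_id has duplicates")

    bad_types = set(df["school_type"].dropna()) - ALLOWED_SCHOOL_TYPES
    if bad_types:
        problems.append(f"unexpected school_type values: {sorted(bad_types)}")

    for col in PCT_COLUMNS:
        values = pd.to_numeric(df[col], errors="coerce").dropna()
        if not values.between(0, 1).all():
            problems.append(f"{col} has values outside [0, 1]")

    return problems


# Tiny stand-in for schools_master_df with the same shape of data.
demo_df = pd.DataFrame({
    "school_internal_id": ["SCH0001", "SCH0002", "SCH0003"],
    "school_type": ["Public", "Private", "Public"],
    "pct_ell": [0.12, None, None],
    "pct_swd": [0.08, None, None],
})

print(run_sanity_checks(demo_df))
```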


## 6. Save Sample Data for Later Notebooks

We save the small sample version of `schools_master_df` so that the next
notebook (02_feature_layer_and_vectors.ipynb) can load it without re-running
the schema setup steps.

This file is not a real dataset; it is only a lightweight example used to:
- Validate schema behavior
- Prototype feature engineering functions
- Prototype vectorization and matching logic

The sample file will be saved under:

``../data/processed/notebook01/schools_master_sample.csv``


In [31]:
import os

# Create processed data directory if missing
processed_dir = "../data/processed/notebook01"
os.makedirs(processed_dir, exist_ok=True)

# File path for sample dataset
sample_path = os.path.join(processed_dir, "schools_master_sample.csv")

# Save the DataFrame
schools_master_df.to_csv(sample_path, index=False)

print(f"Sample dataset saved to: {sample_path}")


Sample dataset saved to: ../data/processed/notebook01/schools_master_sample.csv
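
One caveat when the next notebook reloads this file: CSV stores no type information, so identifier columns such as `nces_id` can be parsed as numbers and lose their leading zeros. Passing an explicit `dtype` to `pd.read_csv` for those columns avoids this. The sketch below demonstrates the difference with an in-memory buffer instead of the real file; `ID_COLUMNS` is a hypothetical name and lists only two of the identifier columns.

```python
import io

import pandas as pd

# Identifier columns that should always be read back as text (illustrative subset).
ID_COLUMNS = ["school_internal_id", "nces_id"]

original = pd.DataFrame({
    "school_internal_id": ["SCH0001"],
    "nces_id": ["063441012345"],
    "total_enrollment": [520],
})

# Write to an in-memory buffer so the example does not touch the file system.
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Default parsing infers nces_id as an integer and drops the leading zero.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Forcing string dtype for ID columns keeps the value exactly as written.
buffer.seek(0)
safe = pd.read_csv(buffer, dtype={col: str for col in ID_COLUMNS})

print(naive["nces_id"].iloc[0])  # 63441012345
print(safe["nces_id"].iloc[0])   # 063441012345
```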
