# SAT Results – Data Cleaning and PostgreSQL Integration

This notebook performs the following tasks:

1. **Explore** the raw SAT results dataset.
2. **Clean and normalize** the data:
   - Handle invalid scores and out-of-range values.
   - Convert percentage fields from strings (e.g., `"85%"`) to numeric.
   - Remove duplicated schools and rows with completely missing scores.
3. **Export** a cleaned version of the dataset as `cleaned_sat_results.csv`.
4. **Load** the cleaned data into a PostgreSQL table using `DataFrame.to_sql` (the table is replaced on each run).

The cleaning logic is implemented in Python (pandas), and the final data
is loaded into a PostgreSQL database hosted on Neon using SQLAlchemy and psycopg2.

In [15]:
import os

import numpy as np
import pandas as pd

import psycopg2
from psycopg2.extras import execute_batch
from sqlalchemy import create_engine

# File paths

RAW_CSV_PATH = "sat-results.csv"            # raw input file
CLEAN_CSV_PATH = "cleaned_sat_results.csv"  # cleaned output file

# Database connection using SQLAlchemy (Neon onboarding database)
#
# SQLAlchemy connection string format:
# postgresql+psycopg2://user:password@host:port/dbname?sslmode=require
#
# The URL is read from the environment so that credentials are not stored
# in the notebook. The variable name DATABASE_URL is an assumption; set it
# to the Neon connection string before running this cell.

DATABASE_URL = os.environ["DATABASE_URL"]

# Create a global SQLAlchemy engine
engine = create_engine(DATABASE_URL)


def get_pg_connection():
    """
    Return a raw psycopg2 connection from the SQLAlchemy engine.

    This allows the rest of the notebook to keep using psycopg2-style
    cursors (for CREATE TABLE, INSERT, etc.) while the underlying
    connection is managed via SQLAlchemy and the provided DATABASE_URL.
    """
    return engine.raw_connection()


# Target table name used later in the notebook (set again in the load cell)
TARGET_TABLE = "jorge_figueroa_sat_results"

In [16]:
conn = None
cur = None

try:
    conn = get_pg_connection()
    cur = conn.cursor()
    cur.execute("SELECT current_database(), version();")
    current_db, version = cur.fetchone()
    print("✅ Connected successfully")
    print("Current database:", current_db)
    print("PostgreSQL version:", version)
except Exception as e:
    print("❌ Connection failed:", e)
finally:
    if cur is not None:
        cur.close()
    if conn is not None:
        conn.close()

✅ Connected successfully
Current database: neondb
PostgreSQL version: PostgreSQL 17.7 (178558d) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14+deb12u1) 12.2.0, 64-bit


## 1. Load and inspect the raw dataset

We start by loading the `sat-results.csv` file and exploring:

- Shape (rows, columns)
- Column names
- Basic data types and null counts

In [17]:
# Load the raw SAT results dataset
df_raw = pd.read_csv(RAW_CSV_PATH)

print("Raw shape (rows, columns):", df_raw.shape)
print("\nColumns:")
print(df_raw.columns.tolist())

df_raw.head()

Raw shape (rows, columns): (493, 11)

Columns:
['DBN', 'SCHOOL NAME', 'Num of SAT Test Takers', 'SAT Critical Reading Avg. Score', 'SAT Math Avg. Score', 'SAT Writing Avg. Score', 'SAT Critical Readng Avg. Score', 'internal_school_id', 'contact_extension', 'pct_students_tested', 'academic_tier_rating']


|   | DBN | SCHOOL NAME | Num of SAT Test Takers | SAT Critical Reading Avg. Score | SAT Math Avg. Score | SAT Writing Avg. Score | SAT Critical Readng Avg. Score | internal_school_id | contact_extension | pct_students_tested | academic_tier_rating |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 29 | 355 | 404 | 363 | 355 | 218160 | x345 | 78% | 2.0 |
| 1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 91 | 383 | 423 | 366 | 383 | 268547 | x234 | NaN | 3.0 |
| 2 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 70 | 377 | 402 | 370 | 377 | 236446 | x123 | NaN | 3.0 |
| 3 | 01M458 | FORSYTH SATELLITE ACADEMY | 7 | 414 | 401 | 359 | 414 | 427826 | x123 | 92% | 4.0 |
| 4 | 01M509 | MARTA VALLE HIGH SCHOOL | 44 | 390 | 433 | 384 | 390 | 672714 | x123 | 92% | 2.0 |


In [6]:
# Basic info: data types and non-null counts
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 493 entries, 0 to 492
Data columns (total 11 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   DBN                              493 non-null    object 
 1   SCHOOL NAME                      493 non-null    object 
 2   Num of SAT Test Takers           493 non-null    object 
 3   SAT Critical Reading Avg. Score  493 non-null    object 
 4   SAT Math Avg. Score              493 non-null    object 
 5   SAT Writing Avg. Score           493 non-null    object 
 6   SAT Critical Readng Avg. Score   493 non-null    object 
 7   internal_school_id               493 non-null    int64  
 8   contact_extension                388 non-null    object 
 9   pct_students_tested              376 non-null    object 
 10  academic_tier_rating             402 non-null    float64
dtypes: float64(1), int64(1), object(9)
memory usage: 42.5+ KB
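The score columns are reported as `object` dtype with no nulls, which suggests that missing values are encoded as text placeholders rather than as NaN. A quick way to see which tokens are involved is sketched below. The toy dataframe and the helper name `non_numeric_tokens` are illustrative assumptions, not part of the original notebook, and the `"n/a"` token is hypothetical.

```python
import pandas as pd

# Toy data standing in for two of the raw score columns (values are made up).
toy = pd.DataFrame({
    "SAT Math Avg. Score": ["404", "s", "423", "s", "1100"],
    "SAT Writing Avg. Score": ["363", "s", "366", "n/a", "370"],
})


def non_numeric_tokens(series: pd.Series) -> pd.Series:
    """Count the non-null values that cannot be parsed as numbers."""
    as_num = pd.to_numeric(series, errors="coerce")
    bad = series[as_num.isna() & series.notna()]
    return bad.value_counts()


for col in toy.columns:
    print(col, non_numeric_tokens(toy[col]).to_dict())
```

Running the same helper on `df_raw` would show which placeholders the cleaning step has to handle. Values that parse but are out of range (such as `"1100"` here) are not flagged by this check.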


## 2. Define cleaning helpers

We define small helper functions to keep the cleaning logic clear and reusable:

- `clean_sat_score(value)`
- `parse_percentage(value)`

These functions encapsulate the messy, real-world rules and keep the rest of
the code easier to read.

In [18]:
def clean_sat_score(value):
    """
    Convert a raw value into a valid SAT score (200–800).

    Rules:
    - If the value is missing or cannot be parsed as a number -> return NaN.
    - If the numeric value is outside the [200, 800] range -> return NaN.
    - Otherwise, return the integer score.
    """
    if pd.isna(value):
        return np.nan

    try:
        num = float(str(value).strip())
    except ValueError:
        # e.g., 's' or other non-numeric values
        return np.nan

    # Enforce valid SAT range
    if 200 <= num <= 800:
        return int(num)
    else:
        return np.nan


def parse_percentage(value):
    """
    Convert raw percentage values to a numeric percentage.

    Handles different formats:
    - '78%'  -> 78.0
    - '92 %' -> 92.0
    - '0.85' -> 85.0 (string values <= 1 are assumed to be proportions,
                      so a literal '1%' is also read as 100.0)
    - int/float input is returned as float unchanged (no rescaling).
    """
    if pd.isna(value):
        return np.nan

    if isinstance(value, (int, float)):
        return float(value)

    text = str(value).strip()
    if text == "":
        return np.nan

    # Remove trailing '%' if present
    if text.endswith("%"):
        text = text[:-1].strip()

    try:
        num = float(text)
        # If value looks like a proportion, convert to percentage
        if num <= 1:
            return num * 100.0
        return num
    except ValueError:
        return np.nan
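For larger datasets, the same range rule can be applied to a whole column at once instead of calling a Python function per value. The following is a minimal sketch of such a vectorized variant. The name `clean_sat_scores` is hypothetical, and unlike `clean_sat_score` it does not truncate fractional values to integers.

```python
import pandas as pd


def clean_sat_scores(series: pd.Series) -> pd.Series:
    """Vectorized sketch: parse to numbers, keep only values in [200, 800]."""
    # Cast to str and strip so padded strings such as ' 404 ' still parse;
    # anything unparseable (including 'None' and 's') becomes NaN.
    numeric = pd.to_numeric(series.astype(str).str.strip(), errors="coerce")
    # between() is inclusive on both ends; values outside the range become NaN.
    return numeric.where(numeric.between(200, 800))


sample = pd.Series(["355", " 404 ", "s", "850", None, 199, 800])
cleaned = clean_sat_scores(sample)
print(cleaned)
```

The result is a float column because NaN cannot be stored in a plain integer column.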

## 3. Clean the SAT dataset

Cleaning steps:

1. Merge duplicate reading score columns:
   - `SAT Critical Reading Avg. Score`
   - `SAT Critical Readng Avg. Score` (typo)

2. Normalize SAT scores using `clean_sat_score` for:
   - Reading, Math, Writing.

3. Normalize other numeric fields:
   - `Num of SAT Test Takers`
   - `pct_students_tested`
   - `academic_tier_rating`
   - `internal_school_id`

4. Remove duplicated schools by `DBN`.

5. Drop rows where all three section scores (reading, math, writing) are missing.
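One detail of step 1 is worth keeping in mind: `fillna` only replaces true NaN values, so a text placeholder in the main reading column is not replaced by a valid value from the typo column. Whether the real data contains such rows is not known here; the toy example below (with made-up values and hypothetical column names `main` and `typo`) only illustrates the difference between merging before and after numeric coercion.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "main": ["355", "s", np.nan],   # 's' is a text placeholder, not NaN
    "typo": ["355", "383", "377"],
})

# Merging the raw strings: the placeholder 's' survives because it is not NaN.
merged_raw = toy["main"].fillna(toy["typo"])

# Coercing to numbers first turns 's' into NaN, so the typo column can fill it.
main_num = pd.to_numeric(toy["main"], errors="coerce")
typo_num = pd.to_numeric(toy["typo"], errors="coerce")
merged_numeric = main_num.fillna(typo_num)

print(merged_raw.tolist())
print(merged_numeric.tolist())
```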

In [19]:
# Start from a copy of the raw dataframe
df = df_raw.copy()

# 1. Merge the two reading score columns (main + typo)

col_reading_main = "SAT Critical Reading Avg. Score"
col_reading_typo = "SAT Critical Readng Avg. Score"

if col_reading_main in df.columns and col_reading_typo in df.columns:
    df["reading_score_raw"] = df[col_reading_main].fillna(df[col_reading_typo])
elif col_reading_main in df.columns:
    df["reading_score_raw"] = df[col_reading_main]
elif col_reading_typo in df.columns:
    df["reading_score_raw"] = df[col_reading_typo]
else:
    df["reading_score_raw"] = np.nan  # fallback

# 2. Apply SAT score cleaning to Reading, Math and Writing

col_math = "SAT Math Avg. Score"
col_writing = "SAT Writing Avg. Score"

df["reading_score"] = df["reading_score_raw"].apply(clean_sat_score)
df["math_score"] = df[col_math].apply(clean_sat_score) if col_math in df.columns else np.nan
df["writing_score"] = df[col_writing].apply(clean_sat_score) if col_writing in df.columns else np.nan

# 3. Normalize number of test takers

num_takers_col = "Num of SAT Test Takers"
if num_takers_col in df.columns:
    df["num_test_takers"] = pd.to_numeric(df[num_takers_col], errors="coerce")
else:
    df["num_test_takers"] = np.nan

# 4. Normalize percentage of students tested

pct_col = "pct_students_tested"
if pct_col in df.columns:
    df["pct_students_tested_clean"] = df[pct_col].apply(parse_percentage)
else:
    df["pct_students_tested_clean"] = np.nan

# 5. Normalize academic tier rating

tier_col = "academic_tier_rating"
if tier_col in df.columns:
    df["academic_tier_rating_clean"] = pd.to_numeric(df[tier_col], errors="coerce")
else:
    df["academic_tier_rating_clean"] = np.nan

# 6. Normalize internal_school_id (optional but useful)

if "internal_school_id" in df.columns:
    df["internal_school_id_clean"] = pd.to_numeric(df["internal_school_id"], errors="coerce")
else:
    df["internal_school_id_clean"] = np.nan

# 7. Remove duplicated schools by DBN

if "DBN" in df.columns:
    before_dup = len(df)
    df = df.drop_duplicates(subset=["DBN"], keep="first")
    after_dup = len(df)
    print(f"Removed {before_dup - after_dup} duplicate rows based on DBN.")
else:
    print("WARNING: DBN column not found; duplicate removal skipped.")

# 8. Drop rows where all three SAT scores are missing

before_na = len(df)
df = df.dropna(subset=["reading_score", "math_score", "writing_score"], how="all")
after_na = len(df)
print(f"Removed {before_na - after_na} rows with all scores missing.")

print("Cleaned intermediate shape:", df.shape)

Removed 15 duplicate rows based on DBN.
Removed 57 rows with all scores missing.
Cleaned intermediate shape: (421, 19)
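The two row filters behave as follows on a small example. This is an illustrative sketch with made-up data: `drop_duplicates` keeps the first row per `DBN`, and `dropna(..., how="all")` removes a row only when every listed score is missing, so a school with a single valid score is kept.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "DBN": ["A", "A", "B", "C"],
    "reading_score": [400, 400, np.nan, np.nan],
    "math_score": [410, 410, np.nan, 450],
    "writing_score": [390, 390, np.nan, np.nan],
})

score_cols = ["reading_score", "math_score", "writing_score"]

# Keep the first occurrence of each DBN.
deduped = toy.drop_duplicates(subset=["DBN"], keep="first")

# Drop rows only if all three scores are missing (school B here).
kept = deduped.dropna(subset=score_cols, how="all")

print(len(toy), len(deduped), len(kept))
print(kept["DBN"].tolist())
```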


## 4. Build the final cleaned dataset and export to CSV

We now select the final set of columns we want to keep:

- `dbn`
- `school_name`
- `internal_school_id`
- `num_test_takers`
- `reading_score`
- `math_score`
- `writing_score`
- `pct_students_tested`
- `academic_tier_rating`

This final dataset is exported as `cleaned_sat_results.csv`.
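Because missing values are stored as NaN, the score and count columns end up as floats (for example `355.0`). If integer output is preferred, pandas' nullable `Int64` dtype is one option. The sketch below uses a toy dataframe and is not part of the original pipeline; it assumes all non-missing values are whole numbers.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "reading_score": [355.0, np.nan, 414.0],
    "num_test_takers": [29.0, 91.0, np.nan],
})

# Nullable integers keep missing values as <NA> instead of forcing floats.
as_int = toy.astype({"reading_score": "Int64", "num_test_takers": "Int64"})

csv_text = as_int.to_csv(index=False)
print(as_int.dtypes)
print(csv_text)
```

In the exported CSV, missing values are still written as empty fields, while present values are written without the trailing `.0`.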

In [20]:
# Build the final cleaned dataframe with standardized column names
final_cols = {}

if "DBN" in df.columns:
    final_cols["dbn"] = df["DBN"]

if "SCHOOL NAME" in df.columns:
    final_cols["school_name"] = df["SCHOOL NAME"]

# Use the cleaned internal_school_id if available
final_cols["internal_school_id"] = df["internal_school_id_clean"]

final_cols["num_test_takers"] = df["num_test_takers"]
final_cols["reading_score"]   = df["reading_score"]
final_cols["math_score"]      = df["math_score"]
final_cols["writing_score"]   = df["writing_score"]
final_cols["pct_students_tested"]   = df["pct_students_tested_clean"]
final_cols["academic_tier_rating"]  = df["academic_tier_rating_clean"]

df_clean = pd.DataFrame(final_cols)

print("Final cleaned shape:", df_clean.shape)
df_clean.head()

Final cleaned shape: (421, 9)


|   | dbn | school_name | internal_school_id | num_test_takers | reading_score | math_score | writing_score | pct_students_tested | academic_tier_rating |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 218160 | 29.0 | 355.0 | 404.0 | 363.0 | 78.0 | 2.0 |
| 1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 268547 | 91.0 | 383.0 | 423.0 | 366.0 | NaN | 3.0 |
| 2 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 236446 | 70.0 | 377.0 | 402.0 | 370.0 | NaN | 3.0 |
| 3 | 01M458 | FORSYTH SATELLITE ACADEMY | 427826 | 7.0 | 414.0 | 401.0 | 359.0 | 92.0 | 4.0 |
| 4 | 01M509 | MARTA VALLE HIGH SCHOOL | 672714 | 44.0 | 390.0 | 433.0 | 384.0 | 92.0 | 2.0 |


In [21]:
# Export the cleaned dataset to CSV
df_clean.to_csv(CLEAN_CSV_PATH, index=False)
print(f"Cleaned CSV exported to: {CLEAN_CSV_PATH}")

Cleaned CSV exported to: cleaned_sat_results.csv


## 5. PostgreSQL integration (Neon)

In this section we:

1. Write the cleaned dataframe to the Neon PostgreSQL database with `DataFrame.to_sql`, which creates the target table (replacing it if it already exists).
2. Read a few rows back from the table to verify the load.
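The load cell in this section relies on `DataFrame.to_sql`. For comparison, a parameterized batch insert with psycopg2's `execute_batch` could look roughly like the sketch below. The helper names `to_rows` and `insert_rows` are hypothetical, the target table is assumed to exist already, and the schema, table, and column names are assumed to come from trusted constants, since only the values are passed as parameters.

```python
import numpy as np
import pandas as pd


def to_rows(df: pd.DataFrame, columns: list) -> list:
    """Convert the selected columns to tuples, mapping NaN to None (SQL NULL)."""
    rows = []
    for record in df[columns].itertuples(index=False, name=None):
        rows.append(tuple(None if pd.isna(v) else v for v in record))
    return rows


def insert_rows(conn, schema: str, table: str, columns: list, rows: list) -> None:
    """Insert rows in batches; conn is a psycopg2-style connection."""
    # Imported here so that the row preparation above works without psycopg2.
    from psycopg2.extras import execute_batch

    col_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    query = f"INSERT INTO {schema}.{table} ({col_list}) VALUES ({placeholders})"
    with conn.cursor() as cur:
        execute_batch(cur, query, rows, page_size=100)
    conn.commit()


toy = pd.DataFrame({
    "dbn": ["01M292", "01M448"],
    "reading_score": [355.0, 383.0],
    "pct_students_tested": [78.0, np.nan],
})
rows = to_rows(toy, ["dbn", "reading_score", "pct_students_tested"])
print(rows)
```

With a live connection, a call such as `insert_rows(get_pg_connection(), TARGET_SCHEMA, TARGET_TABLE, columns, rows)` would append the rows rather than replace the table.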

In [22]:
TARGET_SCHEMA = "nyc_schools"
TARGET_TABLE = "jorge_figueroa_sat_results"

# Write the cleaned dataframe to PostgreSQL using SQLAlchemy
df_clean.to_sql(
    name=TARGET_TABLE,
    con=engine,
    schema=TARGET_SCHEMA,
    if_exists="replace",   # or "append" for incremental loads
    index=False
)

print(f"✅ Loaded {len(df_clean)} rows into {TARGET_SCHEMA}.{TARGET_TABLE}")

✅ Loaded 421 rows into nyc_schools.jorge_figueroa_sat_results


In [23]:
# Optional: quick verification by reading back a few rows from the database
query = f"""
SELECT *
FROM {TARGET_SCHEMA}.{TARGET_TABLE}
LIMIT 5;
"""

check_df = pd.read_sql_query(query, con=engine)
check_df

|   | dbn | school_name | internal_school_id | num_test_takers | reading_score | math_score | writing_score | pct_students_tested | academic_tier_rating |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 218160 | 29.0 | 355.0 | 404.0 | 363.0 | 78.0 | 2.0 |
| 1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 268547 | 91.0 | 383.0 | 423.0 | 366.0 | NaN | 3.0 |
| 2 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 236446 | 70.0 | 377.0 | 402.0 | 370.0 | NaN | 3.0 |
| 3 | 01M458 | FORSYTH SATELLITE ACADEMY | 427826 | 7.0 | 414.0 | 401.0 | 359.0 | 92.0 | 4.0 |
| 4 | 01M509 | MARTA VALLE HIGH SCHOOL | 672714 | 44.0 | 390.0 | 433.0 | 384.0 | 92.0 | 2.0 |
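Beyond eyeballing a few rows, a row-count comparison gives a simple check that the load is complete. The sketch below uses an in-memory SQLite database as a stand-in so that it runs without access to Neon. The table name `sat_results_demo` and the toy data are assumptions, and against PostgreSQL the same query would be run with `con=engine` and the schema-qualified table name.

```python
import sqlite3

import pandas as pd

toy = pd.DataFrame({
    "dbn": ["01M292", "01M448", "01M450"],
    "reading_score": [355.0, 383.0, 377.0],
})

# In-memory SQLite stands in for the real PostgreSQL target.
conn = sqlite3.connect(":memory:")
try:
    toy.to_sql("sat_results_demo", conn, index=False, if_exists="replace")
    counts = pd.read_sql_query("SELECT COUNT(*) AS n FROM sat_results_demo", conn)
    row_count = int(counts["n"].iloc[0])
finally:
    conn.close()

# The number of rows in the table should match the dataframe that was written.
print(row_count == len(toy))
```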


## 6. Summary

- Loaded the raw SAT results dataset from `sat-results.csv`.
- Merged the duplicate reading column `SAT Critical Readng Avg. Score` (typo) into `SAT Critical Reading Avg. Score`, and left `contact_extension` out of the final dataset.
- Renamed the retained columns to `snake_case`.
- Converted:
  - `pct_students_tested` from strings like `"78%"` to numeric percentages (e.g. `78.0`),
  - SAT section scores and other numeric fields to numeric types.
- Enforced the valid SAT score range (200–800) by setting unparseable or out-of-range values to NaN.
- Removed 15 duplicate rows (by `DBN`) and 57 rows with no SAT scores, leaving 421 schools.
- Exported the cleaned dataset to `cleaned_sat_results.csv`.
- Loaded the cleaned data into the Neon PostgreSQL database using SQLAlchemy
  into the table `nyc_schools.jorge_figueroa_sat_results`.

This completes the Day 4 task: data cleaning, schema design, and database
integration for the NYC SAT results dataset.