# CS 677 Final Project – NYC Census Data with SVM (Fixed Version)

This notebook is the **corrected full version** with:

- Proper handling of missing values  
- One-hot encoding for **Borough** and **County**  
- Outlier inspection  
- Train/test split with stratification  
- Standardization (scaling)  
- PCA visualization  
- Multiple ML models: SVM (linear & RBF), Logistic Regression, SGD (stochastic gradient descent)  
- Hyperparameter tuning with GridSearchCV  
- Evaluation metrics (Accuracy, Precision, Recall, F1)  
- Confusion matrix, ROC curves  
- Learning curve (training and validation accuracy vs training set size)  
- Final model comparison table  


## Problem Statement

The goal of this project is to predict whether a New York City census tract is **high income** or **low income** using demographic, socioeconomic, and commuting-related features from the NYC Census Tracts dataset. 

This is a **binary classification** task. We train and compare several models covered in CS 677 – a linear Support Vector Machine (SVM), a non‑linear SVM with RBF kernel, Logistic Regression, and an SGD‑based linear classifier – and evaluate them using accuracy, precision, recall, F1‑score, confusion matrices, ROC curves, and learning curves.

## 1. Imports

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_curve, auc
)
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression, SGDClassifier


## 2. Load Dataset & Initial Overview

In [None]:

# Load NYC census tracts dataset
# Make sure `nyc_census_tracts.csv` is in the same folder as this notebook.
df = pd.read_csv("nyc_census_tracts.csv")

# Drop rows where Income is missing, because the target HighIncome depends on this column
df = df.dropna(subset=["Income"])

df.head()


In [None]:

df.info()


In [None]:

df.describe().T


### 2.1 Missing Values – Imputation

In [None]:

# Count missing values per column
missing_counts = df.isnull().sum().sort_values(ascending=False)
missing_counts.head(15)


We handle missing values using two different strategies:

- **Numeric columns** are filled with their **median**. The median is robust to skewed distributions and outliers, which is common in socioeconomic variables such as income, poverty, or unemployment.
- **Categorical columns** (`Borough`, `County`) are filled with their **mode** (most frequent value). This preserves existing categories without creating artificial new labels.

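After imputation, the dataset has no missing values and is ready for feature engineering and modeling. Note that the medians and modes are computed on the full dataset, before the train/test split, so test rows contribute slightly to the imputed values.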

In [None]:

numeric_cols = df.select_dtypes(include=[np.number]).columns
categorical_cols = ['Borough', 'County']

# Impute numeric columns with median
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Impute categorical columns with mode (most frequent value)
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

df.isnull().sum().sort_values(ascending=False).head()


### 2.2 Outlier Inspection – Income

In [None]:

plt.figure(figsize=(6, 3))
sns.boxplot(x=df['Income'])
plt.title("Boxplot of Income")
plt.show()

Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['Income'] < lower_bound) | (df['Income'] > upper_bound)]
len(outliers), lower_bound, upper_bound


We inspect potential outliers using boxplots and the 1.5 × IQR rule (values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are flagged).

- For **Income**, there are several very high‑income tracts that appear as extreme values.
- We also examine additional economic variables such as `Poverty`, `ChildPoverty`, `IncomePerCap`, and `Unemployment`.

These extreme values correspond to real neighborhoods (for example, very wealthy areas or very disadvantaged tracts), so they carry important information about NYC.  
Therefore, instead of removing them, we **keep these outliers** in the dataset.
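
The IQR rule used for `Income` can be wrapped in a small helper so that the same check applies to any numeric column. The sketch below is illustrative only: `iqr_outlier_mask` is a hypothetical helper name and the values are toy numbers, not taken from the census data.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy values (assumption, not real tract incomes): one value is far above the rest
toy_income = pd.Series([40_000, 45_000, 50_000, 55_000, 60_000, 250_000])
mask = iqr_outlier_mask(toy_income)
n_outliers = int(mask.sum())

# Flag the outliers but keep them in the data, as this notebook does
print(n_outliers, toy_income[mask].tolist())
```

Returning a mask rather than a filtered frame makes it easy to count or inspect the extreme tracts without dropping them.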

In [None]:

# Explore outliers in a few additional economic variables
outlier_cols = ['Poverty', 'ChildPoverty', 'IncomePerCap', 'Unemployment']
for col in outlier_cols:
    plt.figure(figsize=(6, 3))
    sns.boxplot(x=df[col])
    plt.title(f"Boxplot of {col}")
    plt.show()


### 2.3 Correlation Heatmap

In [None]:

plt.figure(figsize=(10, 8))
sns.heatmap(df.select_dtypes(include=[np.number]).corr(), cmap="coolwarm")
plt.title("Correlation Heatmap (Numeric Features)")
plt.show()


## 3. Feature Engineering – HighIncome Target & Encoding

In [None]:

# Create binary HighIncome target based on median Income
median_income = df['Income'].median()
df['HighIncome'] = (df['Income'] > median_income).astype(int)
df['HighIncome'].value_counts()


In [None]:

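# One-hot encode Borough & County (dtype=int yields 0/1 columns instead of booleans).
# Note: each NYC borough coincides with exactly one county, so the County dummies
# largely duplicate the Borough dummies.
df = pd.get_dummies(df, columns=['Borough', 'County'], drop_first=True, dtype=int)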

# Drop ID-like and leakage columns
drop_cols = ['CensusTract', 'Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr']
existing_drop = [c for c in drop_cols if c in df.columns]
df = df.drop(columns=existing_drop)

# Define X and y
X = df.drop(columns=['HighIncome'])
y = df['HighIncome']

X.shape, y.shape


## 4. Train/Test Split & Scaling

We split the data into **training (80%)** and **test (20%)** sets using stratified sampling so that the proportion of high‑income vs low‑income tracts is preserved.
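
As a quick check of what stratification does, the following sketch splits synthetic labels with a 30% positive rate and compares the class proportions in both parts. The arrays `X_demo` and `y_demo` are made up for illustration and are unrelated to the census data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with 30% positives (assumption, for illustration only)
y_demo = np.array([1] * 30 + [0] * 70)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify=y_demo, both splits keep the original positive rate
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```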

Before training the models, we apply **StandardScaler** to all features, fitting it on the training set only and reusing the fitted scaler on the test set.  
These models are sensitive to the scale of the input variables: if one feature has much larger numeric values than others, it can dominate the decision boundary.  
Standardization (zero mean, unit variance) ensures that:

- All features contribute on a similar scale.
- Gradient‑based optimization converges more reliably.
- The SVM margin is not biased by units of measurement.
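
One way to keep scaling statistics from leaking across cross-validation folds is to place the scaler and the classifier in a single scikit-learn `Pipeline`, so the scaler is refit on each training fold. This is a minimal sketch on synthetic data from `make_classification`, not the notebook's actual workflow; the names `pipe_demo` and `cv_scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the tract features (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# The scaler is fit inside each CV training fold, never on the held-out fold
pipe_demo = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
cv_scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=5, scoring="accuracy")

print(cv_scores.mean())
```

The same pipeline object could be passed to `GridSearchCV`, in which case parameter names take a step prefix such as `svc__C`.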

In [None]:

# Split into train and test sets (stratify to keep HighIncome ratio similar)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize features for SVM / Logistic Regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled.shape, X_test_scaled.shape


### 4.1 PCA Visualization

To better understand whether the two classes are separable, we project the standardized features into **two principal components (PC1 and PC2)** using PCA and color the points by income class.  
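Although there is some overlap, high‑income and low‑income tracts form partially distinct clusters, which suggests that the features contain signal for classification. Because only two components are plotted, the classes may be more separable in the full feature space than this projection shows.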
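
How much to trust a 2D projection depends on how much variance the two components retain, which `explained_variance_ratio_` reports. The sketch below uses random synthetic data with one correlated pair of columns (an assumption for illustration), so the numbers say nothing about the census features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
# Make the second column nearly a copy of the first so PC1 captures shared variance
X_demo[:, 1] = X_demo[:, 0] + 0.1 * rng.normal(size=100)

X_std = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2).fit(X_std)

# Fraction of total variance captured by PC1 and PC2, and their sum
ratio = pca_demo.explained_variance_ratio_
print(ratio, ratio.sum())
```

A low combined ratio would mean the scatter plot hides most of the structure in the data.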

In [None]:

# Reduce scaled features to 2D with PCA for visualization
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)

plt.figure(figsize=(6, 4))
scatter = plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1],
                      c=y_train, cmap='coolwarm', alpha=0.7)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA Projection of Training Data")

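# Add legend for income classes.
# With c=y_train in {0, 1}, the scatter uses the two ends of the colormap,
# so the legend patches use coolwarm(0.0) and coolwarm(1.0) to match.
import matplotlib.patches as mpatches
low_patch = mpatches.Patch(color=plt.cm.coolwarm(0.0), label='Low Income')
high_patch = mpatches.Patch(color=plt.cm.coolwarm(1.0), label='High Income')
plt.legend(handles=[low_patch, high_patch], title="Income Class")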

plt.show()


## 5. Model Training – SVM, Logistic Regression, SGD

We train four classification models that correspond to topics covered in CS 677:

1. **SVM (Linear kernel)** – learns a linear decision boundary that maximizes the margin between high‑income and low‑income tracts.
2. **SVM (RBF kernel)** – uses the radial basis function kernel to create a **non‑linear** decision boundary in the original feature space. This allows the model to capture more complex relationships using the kernel trick.
3. **Logistic Regression** – a probabilistic linear classifier that serves as a strong baseline and is easy to interpret.
4. **SGDClassifier (hinge loss)** – an approximate linear SVM trained with **stochastic gradient descent**. It is efficient on large datasets and demonstrates the optimization techniques studied in class.

All models are trained on the standardized features (`X_train_scaled`) and evaluated on the standardized test set (`X_test_scaled`).
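
One practical difference between these models is how they expose scores for ranking-based metrics such as ROC AUC: `SGDClassifier` with hinge loss provides `decision_function` but not `predict_proba`. The sketch below shows a small helper that handles both cases; `positive_class_scores` is a hypothetical name and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import roc_auc_score

def positive_class_scores(model, X):
    """Return a score per sample that ranks the positive class higher."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    # Signed distance to the decision boundary; enough for ROC-style ranking
    return model.decision_function(X)

# Synthetic data (assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
sgd_demo = SGDClassifier(loss="hinge", random_state=0).fit(X_demo, y_demo)
log_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

has_proba = hasattr(sgd_demo, "predict_proba")  # False for hinge loss
auc_sgd = roc_auc_score(y_demo, positive_class_scores(sgd_demo, X_demo))
auc_log = roc_auc_score(y_demo, positive_class_scores(log_demo, X_demo))
print(has_proba, auc_sgd, auc_log)
```

ROC curves depend only on the ordering of the scores, so margins from `decision_function` and probabilities from `predict_proba` can both be used.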

In [None]:

# Train linear SVM
svm_linear = SVC(kernel='linear', probability=True, random_state=42)
svm_linear.fit(X_train_scaled, y_train)

# Train non‑linear SVM with RBF kernel
svm_rbf = SVC(kernel='rbf', probability=True, random_state=42)
svm_rbf.fit(X_train_scaled, y_train)

# Train Logistic Regression baseline
log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Train linear SVM using SGD (hinge loss)
sgd_clf = SGDClassifier(loss='hinge', max_iter=1000, random_state=42)
sgd_clf.fit(X_train_scaled, y_train)


### 5.1 Evaluation Helper & Test Results

In [None]:

# Helper function to compute classification metrics for a given model
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    return {
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision": precision_score(y_test, y_pred),
        "Recall": recall_score(y_test, y_pred),
        "F1 Score": f1_score(y_test, y_pred)
    }

# Evaluate all models on the scaled test set
results = {
    "SVM Linear": evaluate_model(svm_linear, X_test_scaled, y_test),
    "SVM RBF": evaluate_model(svm_rbf, X_test_scaled, y_test),
    "Logistic Regression": evaluate_model(log_reg, X_test_scaled, y_test),
    "SGDClassifier": evaluate_model(sgd_clf, X_test_scaled, y_test)
}

pd.DataFrame(results).T
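
To make the four metrics concrete, the following sketch derives precision, recall, and F1 by hand from the confusion-matrix counts and compares them with scikit-learn's functions. The labels are a made-up ten-sample example, not model output from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels (assumption): 4 positives, 6 negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # of predicted positives, how many are correct
recall = tp / (tp + fn)      # of actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```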


### 5.2 Confusion Matrices – All Models

In [None]:

# Visualize confusion matrices for all four models on the same figure
model_dict = {
    "SVM Linear": svm_linear,
    "SVM RBF": svm_rbf,
    "Logistic Regression": log_reg,
    "SGDClassifier": sgd_clf
}

plt.figure(figsize=(12, 10))
for i, (name, model) in enumerate(model_dict.items(), start=1):
    plt.subplot(2, 2, i)
    cm = confusion_matrix(y_test, model.predict(X_test_scaled))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
    plt.title(name)
    plt.xlabel("Predicted")
    plt.ylabel("Actual")

plt.tight_layout()
plt.show()


### 5.3 ROC Curves – All Models

In [None]:

# Plot ROC curves for all models
plt.figure(figsize=(6, 4))

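for name, model in model_dict.items():
    # SGDClassifier with hinge loss has no predict_proba, so fall back to
    # decision_function; roc_curve only needs scores that rank the samples.
    if hasattr(model, "predict_proba"):
        y_score = model.predict_proba(X_test_scaled)[:, 1]
    else:
        y_score = model.decision_function(X_test_scaled)
    fpr, tpr, _ = roc_curve(y_test, y_score)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"{name} (AUC={roc_auc:.3f})")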

plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curves – All Models")
plt.legend()
plt.show()


## 6. Hyperparameter Tuning – SVM RBF

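To improve performance of the non‑linear SVM, we tune two hyperparameters using `GridSearchCV` with 5‑fold cross‑validation: **C**, the inverse regularization strength (a larger C penalizes misclassified training points more heavily and regularizes less), and **gamma**, the RBF kernel coefficient (a larger gamma gives each training point a narrower region of influence).  
This searches over a small grid of 12 candidate combinations and selects the one that maximizes cross‑validated accuracy.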
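
For reference, scikit-learn documents `gamma="scale"` as `1 / (n_features * X.var())` and `gamma="auto"` as `1 / n_features`. The sketch below computes both by hand on random synthetic data (an assumption for illustration) and evaluates the RBF kernel; `rbf_kernel_value` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, unstandardized features (assumption, for illustration only)
X_demo = rng.normal(loc=0.0, scale=2.0, size=(50, 4))

n_features = X_demo.shape[1]
gamma_auto = 1.0 / n_features
gamma_scale = 1.0 / (n_features * X_demo.var())

def rbf_kernel_value(a, b, gamma):
    """RBF kernel: exp(-gamma * ||a - b||^2)."""
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

# A point compared with itself always yields 1; similarity decays with distance
k_same = rbf_kernel_value(X_demo[0], X_demo[0], gamma_scale)
k_other = rbf_kernel_value(X_demo[0], X_demo[1], gamma_scale)
print(gamma_auto, gamma_scale, k_same, k_other)
```

When the features are standardized, the overall variance is close to 1, so the `"scale"` and `"auto"` settings give nearly the same gamma.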

In [None]:

param_grid = {
    "C": [0.1, 1, 10],
    "gamma": ["scale", "auto", 0.01, 0.001],
    "kernel": ["rbf"]
}

grid_search = GridSearchCV(
    SVC(probability=True, random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

grid_search.fit(X_train_scaled, y_train)
grid_search.best_params_, grid_search.best_score_


In [None]:

best_svm = grid_search.best_estimator_
best_results = evaluate_model(best_svm, X_test_scaled, y_test)
best_results


## 7. Learning Curve – SVM RBF

We use a learning curve to study how the RBF SVM with default hyperparameters (`svm_rbf`, not the tuned model) behaves as we increase the amount of training data.  
The plot shows training and validation accuracy for different training set sizes.

In [None]:

train_sizes, train_scores, test_scores = learning_curve(
    svm_rbf,
    X_train_scaled,
    y_train,
    cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    n_jobs=-1
)

train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)

plt.figure(figsize=(6, 4))
plt.plot(train_sizes, train_mean, marker="o", label="Training Accuracy")
plt.plot(train_sizes, test_mean, marker="o", label="Validation Accuracy")
plt.xlabel("Training Set Size")
plt.ylabel("Accuracy")
plt.title("Learning Curve – SVM RBF")
plt.legend()
plt.show()


### Interpretation of the Learning Curve

- **Training accuracy** starts very high for small training sizes and decreases slightly as we add more data.  
- **Validation accuracy** improves as the training set grows and then stabilizes around a similar range as training accuracy.

This pattern suggests that the RBF SVM benefits from more data and ends up with **low variance**: the training and validation curves are close, so the model is not severely overfitting. Because both curves also level off at a high accuracy, **bias** appears to be low as well, and additional data provides diminishing returns after a certain point.
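
The visual reading above can be backed by numbers: the gap between mean training and validation accuracy at each size, plus the fold-to-fold spread of validation accuracy. This is a minimal sketch on synthetic data from `make_classification`; the `_demo` names are illustrative and the values do not reflect the census results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training data (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

sizes_demo, train_sc, val_sc = learning_curve(
    SVC(kernel="rbf"), X_demo, y_demo, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5),
)

# Gap near zero suggests low variance; a large gap suggests overfitting
gap_demo = train_sc.mean(axis=1) - val_sc.mean(axis=1)
val_std_demo = val_sc.std(axis=1)

for n, g, s in zip(sizes_demo, gap_demo, val_std_demo):
    print(f"n={n:4d}  gap={g:.3f}  val_std={s:.3f}")
```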

## 8. Final Model Comparison

In [None]:

# Combine baseline model metrics with tuned SVM results in a single table
all_results = pd.DataFrame(results).T
all_results.loc["Best SVM (Tuned)"] = best_results
all_results


### Discussion & Conclusion

From the final comparison table we observe that:

- All four models achieve **strong performance**, with accuracies in the high‑80% range.
- The **tuned SVM with RBF kernel** achieves the best overall balance of metrics (highest or near‑highest accuracy, precision, recall, and F1‑score).
- The **linear models** (Linear SVM, Logistic Regression, SGDClassifier) perform only slightly worse, which suggests that much of the class separation is close to linear in the feature space.
- **SGDClassifier** is somewhat behind the other models but is still competitive and demonstrates how stochastic gradient descent can approximate a linear SVM efficiently.

Overall, we conclude that:

1. Demographic and socioeconomic features from the NYC census tracts contain enough signal to reliably distinguish **high‑income vs low‑income** neighborhoods.  
2. Non‑linear SVM with RBF kernel is a strong choice for this problem, especially after hyperparameter tuning and proper scaling.  
3. For deployment in a real system, a simpler linear classifier (e.g., Logistic Regression or Linear SVM) could be preferred if interpretability and training speed are more important than the last few percentage points of accuracy.