# Project Milestone Two: Modeling and Feature Engineering

### Due: Midnight on April 13 (with 2-hour grace period) and worth 25 points

### Overview

This milestone builds on your work from Milestone 1. You will:

1. Evaluate baseline models using default settings.
2. Engineer new features and re-evaluate models.
3. Use feature selection techniques to find promising subsets.
4. Select the top 3 models and fine-tune them for optimal performance.

You must do all work in this notebook and upload it to your team leader's account in Gradescope. There is no Individual Assessment for this Milestone.


In [7]:
# ===================================
# Useful Imports: Add more as needed
# ===================================

# Standard Libraries
import os
import time
import math
import io
import zipfile
import requests
from urllib.parse import urlparse
from itertools import chain, combinations

# Data Science Libraries
import numpy as np
import pandas as pd
import seaborn as sns

# Visualization
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.ticker as mticker  # Optional: Format y-axis labels as dollars

# Scikit-learn (Machine Learning)
from sklearn.model_selection import (
    train_test_split, 
    cross_val_score, 
    GridSearchCV, 
    RandomizedSearchCV, 
    RepeatedKFold
)
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import SequentialFeatureSelector, f_regression, SelectKBest
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, GradientBoostingRegressor

# Progress Tracking

from tqdm import tqdm

# =============================
# Global Variables
# =============================
random_state = 42

# =============================
# Utility Functions
# =============================

# Format y-axis labels as dollars with commas (optional)
def dollar_format(x, pos):
    return f'${x:,.0f}'

# Convert seconds to HH:MM:SS format
def format_hms(seconds):
    return time.strftime("%H:%M:%S", time.gmtime(seconds))



## Prelude: Load your Preprocessed Dataset from Milestone 1

In Milestone 1, you handled missing values, encoded categorical features, and explored your data. Before you begin this milestone, you’ll need to load that cleaned dataset and prepare it for modeling.

Here’s what to do:

1. Return to your Milestone 1 notebook and rerun your code through Part 3, where your dataset was fully cleaned (assume it’s called `df_cleaned`).

2. **Save** the cleaned dataset to a file by running:

>   df_cleaned.to_csv("zillow_cleaned.csv", index=False)

3. Switch to this notebook and **load** the saved data:

>   df = pd.read_csv("zillow_cleaned.csv")

4. Create a **train/test split** using `train_test_split`.  
   
5. **Standardize** the features (but not the target!) using **only the training data.** This ensures consistency across models without introducing data leakage from the test set:

>   scaler = StandardScaler()   
>   X_train_scaled = scaler.fit_transform(X_train)    
  
**Notes:** 

- You will not use the testing set during this milestone — it’s reserved for final evaluation later.
- You will have to redo the scaling step when you introduce new features (which have to be scaled as well).
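
The split-then-scale steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the actual Zillow file: the generated values are made up, and only the column names are borrowed from the dataset. The point it shows is that the scaler learns its statistics from the training rows alone, and the same fitted scaler is later reused with `transform` on held-out rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned dataset (values are made up for illustration).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "calculatedfinishedsquarefeet": rng.normal(1800, 600, 200),
    "bedroomcnt": rng.integers(1, 6, 200).astype(float),
    "taxvaluedollarcnt": rng.normal(450_000, 150_000, 200),
})

X = df.drop(columns=["taxvaluedollarcnt"])
y = df["taxvaluedollarcnt"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics; never refit on test

# Optional: wrap the array in a DataFrame so column names survive for later feature selection.
X_train_scaled = pd.DataFrame(
    X_train_scaled, columns=X_train.columns, index=X_train.index
)
```

Keeping the column names is a convenience rather than a requirement; `fit_transform` returns a plain NumPy array otherwise.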


In [8]:
# Add as many code cells as you need
df_clean = pd.read_csv('zillow_cleaned.csv')

In [9]:
df_clean.head()


Unnamed: 0,airconditioningtypeid,bathroomcnt,bedroomcnt,buildingqualitytypeid,calculatedbathnbr,calculatedfinishedsquarefeet,finishedsquarefeet12,fips,fullbathcnt,garagecarcnt,...,regionidcity,regionidcounty,regionidneighborhood,regionidzip,roomcnt,unitcnt,yearbuilt,numberofstories,censustractandblock,taxvaluedollarcnt
0,1.0,3.5,4.0,6.0,3.5,3100.0,3100.0,6059.0,3.0,2.0,...,53571.0,1286.0,118849.0,96978.0,0.0,1.0,1998.0,1.0,60590630000000.0,1023282.0
1,1.0,1.0,2.0,6.0,1.0,1465.0,1465.0,6111.0,1.0,1.0,...,13091.0,2061.0,118849.0,97099.0,5.0,1.0,1967.0,1.0,61110010000000.0,464000.0
2,1.0,2.0,3.0,6.0,2.0,1243.0,1243.0,6059.0,2.0,2.0,...,21412.0,1286.0,118849.0,97078.0,6.0,1.0,1962.0,1.0,60590220000000.0,564778.0
3,1.0,3.0,4.0,8.0,3.0,2376.0,2376.0,6037.0,3.0,2.0,...,396551.0,3101.0,118849.0,96330.0,0.0,1.0,1970.0,1.0,60373000000000.0,145143.0
4,1.0,3.0,3.0,8.0,3.0,1312.0,1312.0,6037.0,3.0,2.0,...,12447.0,3101.0,268548.0,96451.0,0.0,1.0,1964.0,1.0,60371240000000.0,119407.0


In [10]:
#split the data into train and test sets
X = df_clean.drop(columns=['taxvaluedollarcnt'])
y = df_clean['taxvaluedollarcnt']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
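
One thing worth checking here: the `head()` output above includes an `Unnamed: 0` column, which usually appears when a CSV is written with its index and read back without `index_col`. Because `X` is built by dropping only the target, that column is kept as a feature in the runs reported in this notebook. A minimal sketch of detecting and dropping such a column, using a tiny in-memory CSV (the names `df_demo`, `X_demo`, and `leftover` are illustrative):

```python
import io
import pandas as pd

# Tiny in-memory CSV mimicking a file that was written with the default index=True.
csv_text = "\n".join([
    ",bedroomcnt,taxvaluedollarcnt",
    "0,4.0,1023282.0",
    "1,2.0,464000.0",
])
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo.columns.tolist())  # the saved index shows up as an 'Unnamed: 0' column

# Drop any leftover index columns before building the feature matrix.
leftover = [c for c in df_demo.columns if c.startswith("Unnamed")]
df_demo = df_demo.drop(columns=leftover)
X_demo = df_demo.drop(columns=["taxvaluedollarcnt"])
```

Whether removing the column changes the scores in this notebook is not something the sketch can answer; it would require rerunning the cells.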

### Part 1: Baseline Modeling [3 pts]

Apply the following regression models to the scaled training dataset using **default parameters**:

- Linear Regression
- Ridge Regression
- Lasso Regression
- Decision Tree Regression
- Bagging
- Random Forest
- Gradient Boosting Trees

For each model:
- Use **repeated cross-validation** (e.g., 5 folds, 5 repeats).
- Report the **mean and standard deviation of CV RMSE Score** across all folds in a table. 
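
Since every model is evaluated the same way, the procedure above can be wrapped in one loop. The sketch below is one possible arrangement, not a required one: `cv_rmse_table` is a hypothetical helper name, the data comes from `make_regression` rather than the Zillow set, and it uses scikit-learn's `neg_root_mean_squared_error` scorer, which gives the per-fold RMSE directly (negated).

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor


def cv_rmse_table(models, X, y, n_splits=5, n_repeats=5, random_state=42):
    """Return mean/std of repeated-CV RMSE, one row per model (hypothetical helper)."""
    cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=random_state)
    rows = {}
    for name, model in models.items():
        # scikit-learn maximizes scores, so the RMSE scorer returns negated values.
        scores = -cross_val_score(
            model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
        )
        rows[name] = {"RMSE Mean": scores.mean(), "RMSE Std": scores.std()}
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic data so the sketch runs on its own.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}
rmse_table_demo = cv_rmse_table(models, X_demo, y_demo)
print(rmse_table_demo)
```

Using the same `random_state` in `RepeatedKFold` for every model means all models are scored on identical folds, which makes the rows of the table directly comparable.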


In [11]:
#Linear regression using repeated cross-validation 5 folds and 5 repeats
lr = LinearRegression()
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=random_state)  
scores_lr = cross_val_score(lr, X_train_scaled, y_train, cv=cv, scoring='neg_mean_squared_error')

rmse_scores_lr = np.sqrt(-scores_lr)
rmse_mean_lr = rmse_scores_lr.mean()
rmse_std_lr = rmse_scores_lr.std()
rmse_table = pd.DataFrame({'RMSE Mean': [rmse_mean_lr], 'RMSE Std': [rmse_std_lr]})
rmse_table.index = ['Linear Regression']
print(rmse_table)

                       RMSE Mean      RMSE Std
Linear Regression  296756.896751  44392.551961


In [12]:
#Ridge regression using repeated cross-validation 5 folds and 5 repeats
ridge = Ridge(random_state=random_state)
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=random_state)
scores_rr = cross_val_score(ridge, X_train_scaled, y_train, cv=cv, scoring='neg_mean_squared_error')

rmse_scores_rr = np.sqrt(-scores_rr)
rmse_mean_rr = rmse_scores_rr.mean()
rmse_std_rr = rmse_scores_rr.std()
rmse_table.loc['Ridge'] = [rmse_mean_rr, rmse_std_rr]
print(rmse_table)

                       RMSE Mean      RMSE Std
Linear Regression  296756.896751  44392.551961
Ridge              289219.047275  19222.851917


In [13]:
import warnings
from sklearn.exceptions import ConvergenceWarning

#Lasso regression using repeated cross-validation 5 folds and 5 repeats
# Suppress ConvergenceWarnings
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ConvergenceWarning)
    lasso = Lasso(random_state=random_state)
    cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=random_state)
    scores_l = cross_val_score(lasso, X_train_scaled, y_train, cv=cv, scoring='neg_mean_squared_error')

rmse_scores_l = np.sqrt(-scores_l)
rmse_mean_l = rmse_scores_l.mean()
rmse_std_l = rmse_scores_l.std()
rmse_table.loc['Lasso'] = [rmse_mean_l, rmse_std_l]
print(rmse_table)

                       RMSE Mean      RMSE Std
Linear Regression  296756.896751  44392.551961
Ridge              289219.047275  19222.851917
Lasso              293213.692049  33334.707386


In [14]:
#ElasticNet regression using repeated cross-validation 5 folds and 5 repeats
elastic_net = ElasticNet(random_state=random_state)
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=random_state)
scores_en = cross_val_score(elastic_net, X_train_scaled, y_train, cv=cv, scoring='neg_mean_squared_error')

rmse_scores_en = np.sqrt(-scores_en)
rmse_mean_en = rmse_scores_en.mean()
rmse_std_en = rmse_scores_en.std()
rmse_table.loc['ElasticNet'] = [rmse_mean_en, rmse_std_en]
print(rmse_table)

                       RMSE Mean      RMSE Std
Linear Regression  296756.896751  44392.551961
Ridge              289219.047275  19222.851917
Lasso              293213.692049  33334.707386
ElasticNet         294075.484151   4496.946761


In [15]:
#Decision Tree Regressor using repeated cross-validation 5 folds and 5 repeats
dt = DecisionTreeRegressor(random_state=random_state)
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=random_state)
scores_dt = cross_val_score(dt, X_train_scaled, y_train, cv=cv, scoring='neg_mean_squared_error')

rmse_scores_dt = np.sqrt(-scores_dt)
rmse_mean_dt = rmse_scores_dt.mean()
rmse_std_dt = rmse_scores_dt.std()
rmse_table.loc['Decision Tree'] = [rmse_mean_dt, rmse_std_dt]
print(rmse_table)


                       RMSE Mean      RMSE Std
Linear Regression  296756.896751  44392.551961
Ridge              289219.047275  19222.851917
Lasso              293213.692049  33334.707386
ElasticNet         294075.484151   4496.946761
Decision Tree      344708.754694   5576.475372


In [16]:
#Bagging Regressor using repeated cross-validation 5 folds and 5 repeats
bagging = BaggingRegressor(random_state=random_state)
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=random_state)
scores_bagging = cross_val_score(bagging, X_train_scaled, y_train, cv=cv, scoring='neg_mean_squared_error')

rmse_scores_bagging = np.sqrt(-scores_bagging)
rmse_mean_bagging = rmse_scores_bagging.mean()
rmse_std_bagging = rmse_scores_bagging.std()
rmse_table.loc['Bagging'] = [rmse_mean_bagging, rmse_std_bagging]
print(rmse_table)

                       RMSE Mean      RMSE Std
Linear Regression  296756.896751  44392.551961
Ridge              289219.047275  19222.851917
Lasso              293213.692049  33334.707386
ElasticNet         294075.484151   4496.946761
Decision Tree      344708.754694   5576.475372
Bagging            259849.854440   4266.661459


In [17]:
#Random Forest Regressor using repeated cross-validation 5 folds and 5 repeats
rf = RandomForestRegressor(random_state=random_state, n_jobs=-1)
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=random_state)
scores_rf = cross_val_score(rf, X_train_scaled, y_train, cv=cv, scoring='neg_mean_squared_error')

rmse_scores_rf = np.sqrt(-scores_rf)
rmse_mean_rf = rmse_scores_rf.mean()
rmse_std_rf = rmse_scores_rf.std()
rmse_table.loc['Random Forest'] = [rmse_mean_rf, rmse_std_rf]
print(rmse_table)

                       RMSE Mean      RMSE Std
Linear Regression  296756.896751  44392.551961
Ridge              289219.047275  19222.851917
Lasso              293213.692049  33334.707386
ElasticNet         294075.484151   4496.946761
Decision Tree      344708.754694   5576.475372
Bagging            259849.854440   4266.661459
Random Forest      249410.093806   3926.383553


In [18]:
#Gradient Boosting Regressor using repeated cross-validation 5 folds and 5 repeats
gb = GradientBoostingRegressor(random_state=random_state)
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=random_state)
scores_gb = cross_val_score(gb, X_train_scaled, y_train, cv=cv, scoring='neg_mean_squared_error')

rmse_scores_gb = np.sqrt(-scores_gb)
rmse_mean_gb = rmse_scores_gb.mean()
rmse_std_gb = rmse_scores_gb.std()
rmse_table.loc['Gradient Boosting'] = [rmse_mean_gb, rmse_std_gb]
print(rmse_table)

                       RMSE Mean      RMSE Std
Linear Regression  296756.896751  44392.551961
Ridge              289219.047275  19222.851917
Lasso              293213.692049  33334.707386
ElasticNet         294075.484151   4496.946761
Decision Tree      344708.754694   5576.475372
Bagging            259849.854440   4266.661459
Random Forest      249410.093806   3926.383553
Gradient Boosting  252881.425307   3908.142996


### Part 1: Discussion [2 pts]

In a paragraph or well-organized set of bullet points, briefly compare and discuss:

  - Which models perform best overall?
  - Which are most stable (lowest std)?
  - Any signs of overfitting or underfitting?


### **Which models perform best overall?**
RMSE (root mean squared error) measures the typical size of a prediction error in dollars, so lower mean values indicate better performance. **Random Forest** has the lowest mean CV RMSE (249,410.09), with **Gradient Boosting** close behind (252,881.43). The gap of roughly 3,500 is smaller than either model's standard deviation across folds (about 3,900), so the two are effectively tied. **Bagging** is third (259,849.85), and every linear model falls between roughly 289,000 and 297,000.

### **Which models are most stable (lowest Std)?**
The RMSE Std (standard deviation) measures how much the score varies across the 25 cross-validation fits (5 folds × 5 repeats), so a lower value means a more consistent model. **Gradient Boosting** has the lowest RMSE Std (3,908.14), followed closely by **Random Forest** (3,926.38), then **Bagging** (4,266.66) and **ElasticNet** (4,496.95). The two best-performing models are therefore also the most stable.

### **Signs of overfitting or underfitting?**
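- **Linear Regression** has a high RMSE Mean (296,756.90) and by far the highest Std (44,392.55). All four linear models score well above the ensembles, which is consistent with **underfitting**: a linear fit may be too simple to capture the structure in the data. The large Std also suggests the unregularized fit is sensitive to which rows land in each fold; Ridge's regularization cuts the Std to 19,222.85.
- **Decision Tree** has the highest RMSE Mean (344,708.75) with a fairly low Std (5,576.48), so it is consistently inaccurate rather than erratic. A default, unpruned tree can fit its training data almost perfectly, so a poor CV score is consistent with **overfitting**, although confirming this requires comparing training RMSE with CV RMSE.
- **Bagging**, **Random Forest**, and **Gradient Boosting** have the lowest RMSE Mean and Std values. Combining many trees reduces the variance of a single tree, and these models show the least evidence of either problem.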
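
CV RMSE alone cannot separate overfitting from underfitting; the usual check is to compare it with the training RMSE on the same folds. The sketch below shows one way to do that with `cross_validate(..., return_train_score=True)`. It runs on synthetic data from `make_regression`, so its numbers say nothing about the Zillow results; the variable names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic data so the sketch is self-contained.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=25.0, random_state=42)
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)

candidates = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
}

gaps = {}
for name, model in candidates.items():
    res = cross_validate(
        model, X_demo, y_demo, cv=cv,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
    )
    train_rmse = -res["train_score"].mean()  # error on the rows the model was fit on
    val_rmse = -res["test_score"].mean()     # error on the held-out fold
    gaps[name] = (train_rmse, val_rmse)
    print(f"{name}: train RMSE = {train_rmse:.2f}, CV RMSE = {val_rmse:.2f}")
```

A training RMSE far below the CV RMSE points toward overfitting, while two similarly high values point toward underfitting.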

### Summary:
For overall performance and stability, **Random Forest** and **Gradient Boosting** are the top choices.


### Part 2: Feature Engineering [3 pts]

Consider **at least three new features** based on your Milestone 1, Part 5. Examples include:
- Polynomial terms
- Log or interaction terms
- Groupings or transformations of categorical features

Add these features to `X_train` and then:
- Scale using `StandardScaler` 
- Re-run all models listed above (using default settings again).
- Report updated RMSE scores (mean and std) across repeated CV in a table. 

**Note:**  Recall that this will require creating a new version of the dataset, so effectively you may be running "polynomial regression" using `LinearRegression`. 
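
As an illustration of the three kinds of features listed above, the sketch below adds a log term, a polynomial term, and an interaction term, then rescales the expanded frame. The column names and values are taken from the first rows of the `head()` output earlier in this notebook; the helper `add_features` and the new column names (`log_sqft`, `sqft_squared`, `bath_x_bed`) are hypothetical, and nothing here implies these particular features will help.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Four rows copied from the head() output shown earlier (subset of columns).
X_demo = pd.DataFrame({
    "calculatedfinishedsquarefeet": [3100.0, 1465.0, 1243.0, 2376.0],
    "bathroomcnt": [3.5, 1.0, 2.0, 3.0],
    "bedroomcnt": [4.0, 2.0, 3.0, 4.0],
    "yearbuilt": [1998.0, 1967.0, 1962.0, 1970.0],
})


def add_features(X):
    """Return a copy of X with three illustrative engineered columns (hypothetical helper)."""
    X = X.copy()
    # Log term: compresses the long right tail typical of area measurements.
    X["log_sqft"] = np.log1p(X["calculatedfinishedsquarefeet"])
    # Polynomial term: lets a linear model bend with house size.
    X["sqft_squared"] = X["calculatedfinishedsquarefeet"] ** 2
    # Interaction term: bathrooms and bedrooms considered jointly.
    X["bath_x_bed"] = X["bathroomcnt"] * X["bedroomcnt"]
    return X


X_eng = add_features(X_demo)

# The new columns live on very different scales, so the scaler is refit on the expanded frame.
X_eng_scaled = StandardScaler().fit_transform(X_eng)
print(X_eng.columns.tolist())
```

Features are added before scaling so that products and logs are computed on the original units rather than on standardized values.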

In [19]:
# Add as many code cells as you need

### Part 2: Discussion [2 pts]

Reflect on the impact of your new features:

- Did any models show notable improvement in performance?

- Which new features seemed to help — and in which models?

- Do you have any hypotheses about why a particular feature helped (or didn’t)?

- Were there any unexpected results?



> Your text here

### Part 3: Feature Selection [3 pts]

Using the full set of features (original + engineered):
- Apply **feature selection** methods to investigate whether you can improve performance.
  - You may use forward selection, backward selection, or feature importance from tree-based models.
- For each model, identify the **best-performing subset of features**.
- Re-run each model using only those features.
- Report updated RMSE scores (mean and std) across repeated CV in a table.
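
Two of the selection approaches mentioned above can be sketched side by side: forward selection with `SequentialFeatureSelector` for a linear model, and ranking by `feature_importances_` for a tree ensemble. The data is synthetic, the feature names `f0` to `f7` are placeholders, and keeping exactly three features is an arbitrary choice for the example, not a recommendation.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 features, of which 3 carry signal.
X_arr, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Forward selection: greedily add the feature that most improves CV RMSE.
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,
    direction="forward",
    scoring="neg_root_mean_squared_error",
    cv=5,
)
sfs.fit(X_demo, y_demo)
forward_features = X_demo.columns[sfs.get_support()].tolist()

# Tree-based ranking: impurity-based importances from a fitted forest.
rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_demo, y_demo)
importances = pd.Series(rf.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
top_by_importance = importances.head(3).index.tolist()

print("Forward selection:", forward_features)
print("Top importances:  ", top_by_importance)
```

Setting `direction="backward"` gives backward elimination with the same interface. The two methods need not agree, since one optimizes a linear model's CV score and the other reflects how often and how usefully the trees split on each feature.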


### Part 3: Discussion [2 pts]

Analyze the effect of feature selection on your models:

- Did performance improve for any models after reducing the number of features?

- Which features were consistently retained across models?

- Were any of your newly engineered features selected as important?

- How did feature selection differ between linear and tree-based models?

> Your text here

### Part 4: Fine-Tuning Your Top 3 Models [6 pts]

In this final phase of Milestone 2, you’ll select and refine your **three most promising models and their corresponding data pipelines** based on everything you've done so far.

1. Choose the top 3 models based on performance and interpretability from earlier parts.
2. For each model:
   - Perform hyperparameter tuning using `sweep_parameters`, `GridSearchCV`, `RandomizedSearchCV`, or other techniques from previous homeworks. 
   - Experiment with different versions of your feature engineering and preprocessing — treat these as additional tunable components.
3. Report the mean and standard deviation of CV RMSE score for each model in a summary table.
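
One way to carry out the tuning step is a `RandomizedSearchCV` over a `Pipeline`, which refits the scaler inside each fold and reports the mean and standard deviation of the CV RMSE for the best setting. The sketch below assumes Gradient Boosting as the model and uses synthetic data; the parameter ranges are illustrative guesses, not tuned recommendations, and the course helper `sweep_parameters` is not shown because its interface is not given here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, RepeatedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data so the sketch runs on its own.
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)

# Scaling lives inside the pipeline, so each CV fold is scaled using its own training rows.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", GradientBoostingRegressor(random_state=42)),
])

# Illustrative search space; the "model__" prefix routes each value to the pipeline step.
param_distributions = {
    "model__n_estimators": [50, 100, 200],
    "model__learning_rate": [0.03, 0.1, 0.3],
    "model__max_depth": [2, 3, 4],
}

cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)
search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=5,
    cv=cv,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
search.fit(X_demo, y_demo)

best = search.best_index_
mean_rmse = -search.cv_results_["mean_test_score"][best]
std_rmse = search.cv_results_["std_test_score"][best]
print(search.best_params_)
print(f"CV RMSE: {mean_rmse:.2f} +/- {std_rmse:.2f}")
```

Swapping `RandomizedSearchCV` for `GridSearchCV` (with `param_grid` in place of `param_distributions` and without `n_iter`) tries every combination instead of a random sample.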



In [20]:
# Add as many code cells as you need

### Part 4: Discussion [4 pts]

Reflect on your tuning process and final results:

- What was your tuning strategy for each model? Why did you choose those hyperparameters?
- Did you find that certain types of preprocessing or feature engineering worked better with specific models?
- Provide a ranking of your three models and explain your reasoning — not just based on RMSE, but also interpretability, training time, or generalizability.
- Conclude by considering whether this workflow has produced the results you expected. Typically, you would repeat steps 2-4 and also reconsider the choices you made in Milestone 1 when cleaning the dataset, until reaching the point of diminishing returns; do you think that would have helped here?

> Your text here