## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross-validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [69]:
import pandas as pd

df = pd.read_csv("../data/processed/property_listings_flat.csv")

In [70]:
df = df.dropna(subset=["list_price"])  # drop rows with no price

# drop other rows with too many NaNs
df = df.dropna(axis=0, thresh=5)  # at least 5 non-NA values to keep row
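To make the `thresh` argument concrete, here is a minimal sketch on a toy frame. The values and the `toy` name are made up for illustration and are not taken from the listings file. Note that `thresh` counts non-NA values across all columns, including the price itself.

```python
import numpy as np
import pandas as pd

# Toy frame with 6 columns (hypothetical values, not the real listings data)
toy = pd.DataFrame({
    "list_price": [250000, 310000, 199000],
    "beds": [3, np.nan, 2],
    "baths": [2, np.nan, 1],
    "sqft": [1500, np.nan, 900],
    "garage": [1, np.nan, np.nan],
    "stories": [1, 2, 1],
})

# Number of non-NA values in each row
non_na_counts = toy.notna().sum(axis=1)

# Keep only rows with at least 5 non-NA values
kept = toy.dropna(axis=0, thresh=5)

print(non_na_counts.tolist())
print(kept.index.tolist())
```

Because the real table is much wider than this toy one and every column counts toward the threshold, `thresh=5` is a lenient filter, so rows that pass it can still contain many missing feature values.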


In [71]:
# Drop the target plus text, ID, date, and flag columns that won't be used as features
X = df.drop(columns=[
    "list_price", 
    "sold_price",
    "name",
    "street_address",
    "country",
    "city",
    "postal_code",
    "property_id", 
    "listing_id",
    "property_name",
    "sold_date",
    "is_new_listing",
    "is_new_construction",
    "county_name",
    "list_date",
    "state"
])

y = df["list_price"]


In [72]:
from datetime import datetime

df["year_built"] = pd.to_numeric(df["year_built"], errors="coerce")
df = df.dropna(subset=["year_built"]).copy()  

current_year = datetime.now().year
df["building_age"] = current_year - df["year_built"]


In [73]:
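# year_built is superseded by building_age in df
df = df.drop(columns=["year_built"])

# One-hot encode the categorical variables in X
# NOTE: X was created from df in cell 71, so the building_age edits to df are not reflected in X
X = pd.get_dummies(X, drop_first=True)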


In [74]:
print(X.columns)


Index(['baths', 'baths_1qtr', 'baths_3qtr', 'baths_full', 'baths_half', 'beds',
       'garage', 'lot_sqft', 'sqft', 'stories', 'year_built',
       'price_reduced_amount', 'latitude', 'longitude', 'sub_type_townhouse',
       'type_condo_townhome_rowhome_coop', 'type_condos',
       'type_duplex_triplex', 'type_land', 'type_mobile', 'type_multi_family',
       'type_single_family', 'type_townhomes', 'is_price_reduced_True'],
      dtype='object')
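The printed columns still contain `year_built` and no `building_age`, because `X` was built from `df` before the age column was derived. One possible reordering, sketched on a toy frame with made-up values (the `toy`, `X_toy`, `y_toy`, and `reference_year` names are assumptions for illustration), derives the feature first and only then splits off `X` and `y`:

```python
from datetime import datetime

import pandas as pd

# Toy stand-in for df (hypothetical values)
toy = pd.DataFrame({
    "list_price": [250000, 310000, 199000],
    "year_built": ["1995", "unknown", "2010"],
    "type": ["condos", "single_family", "land"],
})

# 1. Clean year_built and drop rows where it cannot be parsed
toy["year_built"] = pd.to_numeric(toy["year_built"], errors="coerce")
toy = toy.dropna(subset=["year_built"]).copy()

# 2. Derive the age feature while year_built is still available
reference_year = datetime.now().year
toy["building_age"] = reference_year - toy["year_built"]

# 3. Only now separate features and target, then one-hot encode
X_toy = toy.drop(columns=["list_price", "year_built"])
X_toy = pd.get_dummies(X_toy, drop_first=True)
y_toy = toy["list_price"]

print(X_toy.columns.tolist())
```

Using `datetime.now().year` makes the feature depend on the day the notebook is run; a fixed reference year would be a reasonable alternative if reproducibility matters.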


In [76]:
# split the data into training (80%) and test (20%) sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [78]:
import os

os.makedirs("../data/processed", exist_ok=True)

X_train.to_csv("../data/processed/X_train.csv", index=False)
X_test.to_csv("../data/processed/X_test.csv", index=False)
y_train.to_csv("../data/processed/y_train.csv", index=False)
y_test.to_csv("../data/processed/y_test.csv", index=False)

print("Cleaned data saved and ready for modeling.")


Cleaned data saved and ready for modeling.
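When these files are read back in a later notebook, the single-column target files come back as DataFrames rather than Series. A minimal sketch of the round trip, using a temporary directory and made-up prices instead of the real `../data/processed` files:

```python
import os
import tempfile

import pandas as pd

# Hypothetical target values standing in for y_train
y_example = pd.Series([250000, 310000, 199000], name="list_price")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "y_train.csv")
    y_example.to_csv(path, index=False)

    # read_csv always returns a DataFrame, even for a single column
    as_frame = pd.read_csv(path)

    # squeeze("columns") turns a one-column DataFrame back into a Series
    as_series = pd.read_csv(path).squeeze("columns")

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.name)
```

Since `index=False` discards the original row index, the reloaded `X` and `y` files only line up by row position.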


Consider what metrics you want to use to evaluate success.
- Mean squared error is measured in squared dollars. Can you actually relate that number to the size of a typical error?
- Try root mean squared error so that the error is back in the original units (dollars).
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
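To explore these questions, here is a small sketch that computes the four metrics on hand-made numbers. The `regression_report` helper and the price arrays are hypothetical and not part of the project code. Adjusted R² uses the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of rows and p the number of features.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred, n_features):
    """Return RMSE, MAE, R^2 and adjusted R^2 for one set of predictions."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    mae = float(mean_absolute_error(y_true, y_pred))
    r2 = float(r2_score(y_true, y_pred))
    n = len(y_true)
    # Penalize R^2 for the number of features used
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"rmse": rmse, "mae": mae, "r2": r2, "adj_r2": adj_r2}


# Made-up prices in dollars
y_true = np.array([200_000, 250_000, 300_000, 350_000, 400_000])

# Case A: every prediction is off by $10,000
even_miss = y_true + np.array([10_000, -10_000, 10_000, -10_000, 10_000])

# Case B: four perfect predictions and one $50,000 miss
one_big_miss = y_true + np.array([0, 0, 0, 0, 50_000])

report_even = regression_report(y_true, even_miss, n_features=2)
report_big = regression_report(y_true, one_big_miss, n_features=2)

print(report_even)
print(report_big)
```

Both cases have the same MAE of $10,000, but the RMSE of the second case is more than twice as large because squaring magnifies the single large miss. That is what to keep in mind when a few very expensive listings dominate the errors.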

In [None]:
# gather evaluation metrics and compare results
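
One way this cell could be filled in is a loop over untuned baseline models that collects the metrics in a single table. The sketch below runs on synthetic data from `make_regression` so that it is self-contained; in the notebook, `Xtr`, `Xte`, `ytr`, and `yte` (names chosen here for illustration) would be replaced by the real split. The median imputer is an assumption, added because the earlier `dropna` steps do not guarantee that the feature matrix is free of NaNs.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the real train/test split
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Untuned baselines; SVR is scaled because it is sensitive to feature scale
models = {
    "linear_regression": make_pipeline(
        SimpleImputer(strategy="median"), LinearRegression()
    ),
    "svr": make_pipeline(
        SimpleImputer(strategy="median"), StandardScaler(), SVR()
    ),
    "random_forest": make_pipeline(
        SimpleImputer(strategy="median"),
        RandomForestRegressor(n_estimators=100, random_state=42),
    ),
}

rows = []
for name, model in models.items():
    model.fit(Xtr, ytr)
    pred = model.predict(Xte)
    rows.append({
        "model": name,
        "rmse": float(np.sqrt(mean_squared_error(yte, pred))),
        "mae": float(mean_absolute_error(yte, pred)),
        "r2": float(r2_score(yte, pred)),
    })

# One row per model, best (lowest) RMSE first
results = pd.DataFrame(rows).set_index("model").sort_values("rmse")
print(results.round(3))
```

If xgboost is installed, an `XGBRegressor` could be added to the `models` dictionary in the same way, since it follows the scikit-learn estimator interface. Expect the default `SVR` to score poorly when the target is on a large scale such as dollars; that is a property of the untuned defaults rather than of the data.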

## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended that you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
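
As a starting point, here is a hedged sketch of two of the approaches named above, Lasso and RFE, run on synthetic data rather than the listings features. The `f0`...`f19` feature names, the choice of 5 features for RFE, and the use of `LassoCV` to pick the regularization strength are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 of which carry signal
X_demo, y_demo = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=5.0, random_state=0
)
feature_names = np.array([f"f{i}" for i in range(X_demo.shape[1])])

# Lasso: features whose coefficient is shrunk to exactly zero are dropped.
# Scaling first keeps the penalty comparable across features.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
lasso.fit(X_demo, y_demo)
lasso_coefs = lasso.named_steps["lassocv"].coef_
lasso_selected = feature_names[lasso_coefs != 0]

# RFE: repeatedly refit and discard the weakest feature until 5 remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)
rfe.fit(X_demo, y_demo)
rfe_selected = feature_names[rfe.support_]

print("Lasso kept:", lasso_selected.tolist())
print("RFE kept:  ", rfe_selected.tolist())
```

The selected column subset can then be used to refit the baseline models and compare the same metrics against the full feature set.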



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)