## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [124]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np


In [125]:
X_train = pd.read_csv("../data/processed/X_train_scaled.csv")
X_test = pd.read_csv("../data/processed/X_test_scaled.csv")
y_train = pd.read_csv("../data/processed/y_train.csv").squeeze()  # Convert to Series
y_test = pd.read_csv("../data/processed/y_test.csv").squeeze()


In [126]:
columns_to_drop = ["is_foreclosure", "is_price_reduced", "is_new_listing", "sub_type"]
X_train = X_train.drop(columns = columns_to_drop)
X_test = X_test.drop(columns = columns_to_drop)

# Create dummy variables from the "type" column
type_dummies = pd.get_dummies(X_train["type"], prefix="type", drop_first=True)

# Drop the original "type" column and concatenate the dummy variables
X_train = pd.concat([X_train.drop("type", axis=1), type_dummies], axis=1)
X_test = pd.concat([X_test.drop("type", axis=1), type_dummies], axis=1)

columns_with_nans = ["stories", "year_built", "latitude", "longitude"]

for col in columns_with_nans:
    X_train[col] = X_train[col].fillna(X_train[col].median())

for col in columns_with_nans:
    X_test[col] = X_test[col].fillna(X_test[col].median())

X_train = X_train.drop(columns=["list_date"])
X_train = X_train.dropna(subset=["city_encoded"])

X_test = X_test.drop(columns=["list_date"])
X_test = X_test.dropna(subset=["city_encoded"])

# Save cleaned version (optional)
X_train.to_csv("../data/processed/X_train_scaled_cleaned.csv", index=False)
X_test.to_csv("../data/processed/X_test_scaled_cleaned.csv", index=False)

In [127]:
y_train = y_train.loc[X_train.index]
y_test = y_test.loc[X_test.index]


In [None]:
print(X_train.dtypes[X_train.dtypes == "object"])
#X_train
#X_train.isna().sum()

Series([], dtype: object)


In [129]:
def evaluate_model(model, X_test, y_test):
    preds = model.predict(X_test)
    rmse = mean_squared_error(y_test, preds, squared=False)
    mae = mean_absolute_error(y_test, preds)
    r2 = r2_score(y_test, preds)
    return {"RMSE": rmse, "MAE": mae, "R2": r2}


In [130]:
models = {
    "Linear Regression": LinearRegression(),
    "Support Vector Regressor": SVR(),
    "Random Forest": RandomForestRegressor(random_state=41),
    "XGBoost": XGBRegressor(random_state=41)
}

results = {}

for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = evaluate_model(model, X_test, y_test)


In [132]:
results_df = pd.DataFrame(results).T  # Transpose so models are rows
print(results_df.sort_values("RMSE"))  # Sort by RMSE (best to worst)

                                   RMSE            MAE        R2
XGBoost                    71066.752811   33205.092521  0.931205
Random Forest              86408.857850   28173.138858  0.898295
Linear Regression         170553.613213  109095.155286  0.603769
Support Vector Regressor  275198.974802  186455.771427 -0.031622


Consider what metrics you want to use to evaluate success.
- If you think about mean squared error, can we actually relate to the amount of error?
- Try root mean squared error so that error is closer to the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use

In [None]:
# gather evaluation metrics and compare results

## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but its recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferrable. 



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)