## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [13]:
# import models and fit
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler, PolynomialFeatures

from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_squared_error, r2_score

In [14]:
# Load data
X_train_scaled = pd.read_csv('../data/processed/X_train_scaled.csv')
X_test_scaled = pd.read_csv('../data/processed/X_test_scaled.csv')
y_train = pd.read_csv('../data/processed/y_train.csv')
y_test = pd.read_csv('../data/processed/y_test.csv')

X_train_scaled.describe()

Unnamed: 0,list_date,description.year_built,description.baths_3qtr,description.sold_date,description.baths_half,description.lot_sqft,description.sqft,description.baths,description.garage,description.stories,...,other_tags,apartment,condo,condos,land,mobile,multi_family,other_types,single_family,townhomes
count,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0,...,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0
mean,-0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,-0.0
std,1.0001,1.0001,1.0001,1.0001,1.0001,1.0001,1.0001,1.0001,1.0001,1.0001,...,1.0001,1.0001,1.0001,1.0001,1.0001,1.0001,1.0001,1.0001,1.0001,1.0001
min,-23.96421,-3.96725,-0.23277,-3.82099,-0.59648,-0.39023,-1.86397,-1.82047,-0.85703,-1.31506,...,-3.07616,-0.05379,-0.08915,-0.32868,-0.1958,-0.14769,-0.2875,-0.02403,-1.46485,-0.28356
25%,-0.26858,-0.6188,-0.23277,0.01275,-0.59648,-0.27814,-0.6532,-0.95366,-0.85703,-0.24473,...,0.32508,-0.05379,-0.08915,-0.32868,-0.1958,-0.14769,-0.2875,-0.02403,-1.46485,-0.28356
50%,0.21944,0.15458,-0.23277,0.38861,-0.59648,-0.21848,-0.23974,-0.08685,0.00364,-0.24473,...,0.32508,-0.05379,-0.08915,-0.32868,-0.1958,-0.14769,-0.2875,-0.02403,0.68266,-0.28356
75%,0.50848,0.7843,-0.23277,0.53895,1.26527,-0.11477,0.32706,0.77996,0.86431,0.82559,...,0.32508,-0.05379,-0.08915,-0.32868,-0.1958,-0.14769,-0.2875,-0.02403,0.68266,-0.28356
max,1.13927,1.61439,6.10437,0.68929,8.71227,15.75831,6.47928,5.11401,8.61033,9.38818,...,0.32508,18.59211,11.21736,3.04243,5.10718,6.77103,3.47825,41.62131,0.68266,3.52657


In [15]:
# Linear Regression
LR_model = LinearRegression()
LR_model.fit(X_train_scaled, y_train)

y_train_LR = LR_model.predict(X_train_scaled)
y_test_LR = LR_model.predict(X_test_scaled)

train_mse_LR = mean_squared_error(y_train, y_train_LR)
train_r2_LR = r2_score(y_train, y_train_LR)

mse_LR = mean_squared_error(y_test, y_test_LR)
r2_LR = r2_score(y_test, y_test_LR)

LR_metrics = {'MSE':mse_LR, 'R2':r2_LR}

print(f'Train MSE: \t {train_mse_LR}')
print(f'Test MSE: \t {mse_LR}')
print(f'Train R2: \t {train_r2_LR}')
print(f'Test R2: \t {r2_LR}')


Train MSE: 	 3891128394.172198
Test MSE: 	 4147270576.5501804
Train R2: 	 0.9630897414898207
Test R2: 	 0.9593224086647336


In [16]:
# Decision Tree
tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X_train_scaled, y_train)

y_train_tree = tree_model.predict(X_train_scaled)
y_test_tree = tree_model.predict(X_test_scaled)

train_mse_tree = mean_squared_error(y_train, y_train_tree)
train_r2_tree = r2_score(y_train, y_train_tree)

mse_tree = mean_squared_error(y_test, y_test_tree)
r2_tree = r2_score(y_test, y_test_tree)

DTree_metrics = {'MSE':mse_tree, 'R2':r2_tree}

print(f'Train MSE: \t {train_mse_tree}')
print(f'Test MSE: \t {mse_tree}')
print(f'Train R2: \t {train_r2_tree}')
print(f'Test R2: \t {r2_tree}')


Train MSE: 	 0.0
Test MSE: 	 15288097.384615384
Train R2: 	 1.0
Test R2: 	 0.999850050059135


In [17]:
# Random Forest
forest_model = RandomForestRegressor(random_state=42)
forest_model.fit(X_train_scaled, y_train)

y_train_forest = forest_model.predict(X_train_scaled)
y_test_forest = forest_model.predict(X_test_scaled)

train_mse_forest = mean_squared_error(y_train, y_train_forest)
train_r2_forest = r2_score(y_train, y_train_forest)

mse_forest = mean_squared_error(y_test, y_test_forest)
r2_forest = r2_score(y_test, y_test_forest)

Forest_metrics = {'MSE':mse_forest, 'R2':r2_forest}

print(f'Train MSE: \t {train_mse_forest}')
print(f'Test MSE: \t {mse_forest}')
print(f'Train R2: \t {train_r2_forest}')
print(f'Test R2: \t {r2_forest}')


  return fit_method(estimator, *args, **kwargs)


Train MSE: 	 12383518.188349385
Test MSE: 	 48807475.34689576
Train R2: 	 0.9998825330826189
Test R2: 	 0.9995212826123541


In [18]:
# SVR
svr_model = SVR(kernel='rbf')
svr_model.fit(X_train_scaled, y_train)

y_train_svr = svr_model.predict(X_train_scaled)
y_test_svr = svr_model.predict(X_test_scaled)

train_mse_svr = mean_squared_error(y_train, y_train_svr)
train_r2_svr = r2_score(y_train, y_train_tree)

mse_svr = mean_squared_error(y_test, y_test_svr)
r2_svr = r2_score(y_test, y_test_svr)

SVR_metrics = {'MSE':mse_svr, 'R2':r2_svr}

print(f'Train MSE: \t {train_mse_svr}')
print(f'Test MSE: \t {mse_svr}')
print(f'Train R2: \t {train_r2_svr}')
print(f'Test R2: \t {r2_svr}')

  y = column_or_1d(y, warn=True)


Train MSE: 	 110269332924.16615
Test MSE: 	 106859985503.95558
Train R2: 	 1.0
Test R2: 	 -0.04811266595443375


In [19]:
# XGBoots
xgb_model = XGBRegressor(random_state=42)
xgb_model.fit(X_train_scaled, y_train)

y_train_xgb = xgb_model.predict(X_train_scaled)
y_test_xgb = xgb_model.predict(X_test_scaled)

train_mse_xgb = mean_squared_error(y_train, y_train_xgb)
train_r2_xgb = r2_score(y_train, y_train_xgb)

mse_xgb = mean_squared_error(y_test, y_test_xgb)
r2_xgb = r2_score(y_test, y_test_xgb)

XGB_metrics = {'MSE':mse_xgb, 'R2':r2_xgb}

print(f'Train MSE: \t {train_mse_xgb}')
print(f'Test MSE: \t {mse_xgb}')
print(f'Train R2: \t {train_r2_xgb}')
print(f'Test R2: \t {r2_xgb}')

Train MSE: 	 4236153.290046704
Test MSE: 	 29186569.95928856
Train R2: 	 0.9999598169227059
Test R2: 	 0.9997137299475961


## Metrics

Consider what metrics you want to use to evaluate success.
- If you think about mean squared error, can we actually relate to the amount of error?
- Try root mean squared error so that error is closer to the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use

In [20]:
# Set list and index names
dict_list = [LR_metrics, DTree_metrics, Forest_metrics, SVR_metrics, XGB_metrics]
index_names = ['Linear Regression', 'Decision Tree', 'Random Forest', 'SVR', 'XGBoost']

pd.set_option('display.float_format', lambda x: '%.5f' % x)

metricsDF = pd.DataFrame(dict_list, index=index_names)
metricsDF['RMSE'] = metricsDF['MSE']**0.5

metricsDF[['MSE', 'RMSE', 'R2']]

Unnamed: 0,MSE,RMSE,R2
Linear Regression,4147270576.55018,64399.30571,0.95932
Decision Tree,15288097.38462,3909.99967,0.99985
Random Forest,48807475.3469,6986.2347,0.99952
SVR,106859985503.95558,326894.45621,-0.04811
XGBoost,29186569.95929,5402.45962,0.99971


## Model Selection Results
The two scoring metrics we will focus on are RMSE and R2. <br>
Using RMSE will show us an error on the same scale as our expected sale prices and will allow us to compare the different models. <br>
R2 will give us an idea of how accurate the model predictions are with the actual sold price from our test data. <br>

* We can see that the XGBoost gives use the lowest RMSE, so we should select that model.
* The next best model is the decision tree and the random forest models. Given that random forest shows slightly better metrics and builds off of many decision trees, we will use that
    * Using the default parameters for the decision tree and random forest models, we see that the predicted results have no error. We can determine that the max depth of the tree accounts for all variability from the data.
    * In our case, the model predicts the test data accurately, but there may be overfitting. It may be worth setting the parameters to limit the max depth of the tree, limiting parameters, or folding the data. 

## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but its recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferrable. 



In [21]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)