## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [2]:
# import models and fit
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler, PolynomialFeatures

from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_squared_error, r2_score

In [3]:
# Load data
X_train_scaled = pd.read_csv('../data/processed/X_train_scaled.csv')
X_test_scaled = pd.read_csv('../data/processed/X_test_scaled.csv')
y_train = pd.read_csv('../data/processed/y_train.csv')
y_test = pd.read_csv('../data/processed/y_test.csv')

X_train_scaled.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,36,37,38,39,40,41,42,43,44,45
count,5220.0,5220.0,5220.0,5220.0,5220.0,5220.0,5220.0,5220.0,5220.0,5220.0,...,5220.0,5220.0,5220.0,5220.0,5220.0,5220.0,5220.0,5220.0,5220.0,5220.0
mean,4.770981e-15,-1.090316e-15,3.062684e-17,-2.885729e-14,-4.764175e-17,5.444772e-18,-2.205133e-16,2.109849e-16,2.552237e-17,-3.266863e-17,...,-3.266863e-17,4.628056e-17,1.0889540000000001e-17,1.211462e-16,6.805965e-18,2.5862670000000003e-17,0.0,8.167158e-18,-2.463759e-16,-6.125368e-17
std,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,...,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,0.0,1.000096,1.000096,1.000096
min,-24.25031,-3.98416,-0.2327139,-3.85643,-0.5948834,-0.3879067,-1.860708,-1.81539,-0.8734723,-1.298806,...,-0.05368281,-0.0889752,-0.3333333,-0.196467,-0.1467188,-0.2861211,0.0,-0.06937088,-1.44021,-0.2865127
25%,-0.2741189,-0.6221657,-0.2327139,0.009497093,-0.5948834,-0.278756,-0.6563103,-0.949022,-0.8734723,-0.2383075,...,-0.05368281,-0.0889752,-0.3333333,-0.196467,-0.1467188,-0.2861211,0.0,-0.06937088,-1.44021,-0.2865127
50%,0.219013,0.1536791,-0.2327139,0.3885095,-0.5948834,-0.2162251,-0.2361085,-0.08265354,0.002685541,-0.2383075,...,-0.05368281,-0.0889752,-0.3333333,-0.196467,-0.1467188,-0.2861211,0.0,-0.06937088,0.6943432,-0.2865127
75%,0.504629,0.785849,-0.2327139,0.5401145,1.264573,-0.1122035,0.3394031,0.7837149,0.8788434,0.8221913,...,-0.05368281,-0.0889752,-0.3333333,-0.196467,-0.1467188,-0.2861211,0.0,-0.06937088,0.6943432,-0.2865127
max,1.15198,1.619164,6.110714,0.6917195,8.702396,15.77452,6.517127,5.115557,8.764264,9.306182,...,18.62794,11.23909,3.0,5.089913,6.815757,3.495024,0.0,14.41527,0.6943432,3.490246


In [4]:
# Linear Regression
LR_model = LinearRegression()
LR_model.fit(X_train_scaled, y_train)

y_train_LR = LR_model.predict(X_train_scaled)
y_test_LR = LR_model.predict(X_test_scaled)

train_mse_LR = mean_squared_error(y_train, y_train_LR)
train_r2_LR = r2_score(y_train, y_train_LR)

mse_LR = mean_squared_error(y_test, y_test_LR)
r2_LR = r2_score(y_test, y_test_LR)

LR_metrics = {'MSE':mse_LR, 'R2':r2_LR}

print(f'Train MSE: \t {train_mse_LR}')
print(f'Test MSE: \t {mse_LR}')
print(f'Train R2: \t {train_r2_LR}')
print(f'Test R2: \t {r2_LR}')


Train MSE: 	 3479895584.8480883
Test MSE: 	 4328005232.184329
Train R2: 	 0.965973443727785
Test R2: 	 0.961626506100333


In [5]:
# Decision Tree
tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X_train_scaled, y_train)

y_train_tree = tree_model.predict(X_train_scaled)
y_test_tree = tree_model.predict(X_test_scaled)

train_mse_tree = mean_squared_error(y_train, y_train_tree)
train_r2_tree = r2_score(y_train, y_train_tree)

mse_tree = mean_squared_error(y_test, y_test_tree)
r2_tree = r2_score(y_test, y_test_tree)

DTree_metrics = {'MSE':mse_tree, 'R2':r2_tree}

print(f'Train MSE: \t {train_mse_tree}')
print(f'Test MSE: \t {mse_tree}')
print(f'Train R2: \t {train_r2_tree}')
print(f'Test R2: \t {r2_tree}')


Train MSE: 	 0.0
Test MSE: 	 77004555.55555555
Train R2: 	 1.0
Test R2: 	 0.9993172527101206


In [7]:
# Random Forest
forest_model = RandomForestRegressor(random_state=42)
forest_model.fit(X_train_scaled, y_train)

y_train_forest = forest_model.predict(X_train_scaled)
y_test_forest = forest_model.predict(X_test_scaled)

train_mse_forest = mean_squared_error(y_train, y_train_forest)
train_r2_forest = r2_score(y_train, y_train_forest)

mse_forest = mean_squared_error(y_test, y_test_forest)
r2_forest = r2_score(y_test, y_test_forest)

Forest_metrics = {'MSE':mse_forest, 'R2':r2_forest}

print(f'Train MSE: \t {train_mse_forest}')
print(f'Test MSE: \t {mse_forest}')
print(f'Train R2: \t {train_r2_forest}')
print(f'Test R2: \t {r2_forest}')


  return fit_method(estimator, *args, **kwargs)


Train MSE: 	 7239159.801686489
Test MSE: 	 75306351.56509332
Train R2: 	 0.9999292152099539
Test R2: 	 0.9993323095358342


In [8]:
# SVR
svr_model = SVR(kernel='rbf')
svr_model.fit(X_train_scaled, y_train)

y_train_svr = svr_model.predict(X_train_scaled)
y_test_svr = svr_model.predict(X_test_scaled)

train_mse_svr = mean_squared_error(y_train, y_train_svr)
train_r2_svr = r2_score(y_train, y_train_tree)

mse_svr = mean_squared_error(y_test, y_test_svr)
r2_svr = r2_score(y_test, y_test_svr)

SVR_metrics = {'MSE':mse_svr, 'R2':r2_svr}

print(f'Train MSE: \t {train_mse_svr}')
print(f'Test MSE: \t {mse_svr}')
print(f'Train R2: \t {train_r2_svr}')
print(f'Test R2: \t {r2_svr}')

  y = column_or_1d(y, warn=True)


Train MSE: 	 106808398328.74374
Test MSE: 	 118773156180.84033
Train R2: 	 1.0
Test R2: 	 -0.05308120938878913


In [9]:
# XGBoots
xgb_model = XGBRegressor(random_state=42)
xgb_model.fit(X_train_scaled, y_train)

y_train_xgb = xgb_model.predict(X_train_scaled)
y_test_xgb = xgb_model.predict(X_test_scaled)

train_mse_xgb = mean_squared_error(y_train, y_train_xgb)
train_r2_xgb = r2_score(y_train, y_train_xgb)

mse_xgb = mean_squared_error(y_test, y_test_xgb)
r2_xgb = r2_score(y_test, y_test_xgb)

XGB_metrics = {'MSE':mse_xgb, 'R2':r2_xgb}

print(f'Train MSE: \t {train_mse_xgb}')
print(f'Test MSE: \t {mse_xgb}')
print(f'Train R2: \t {train_r2_xgb}')
print(f'Test R2: \t {r2_xgb}')

Train MSE: 	 4386703.237700574
Test MSE: 	 29075644.353282813
Train R2: 	 0.9999571066427346
Test R2: 	 0.999742205935214


## Metrics

Consider what metrics you want to use to evaluate success.
- If you think about mean squared error, can we actually relate to the amount of error?
- Try root mean squared error so that error is closer to the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use

In [20]:
# Set list and index names
dict_list = [LR_metrics, DTree_metrics, Forest_metrics, SVR_metrics, XGB_metrics]
index_names = ['Linear Regression', 'Decision Tree', 'Random Forest', 'SVR', 'XGBoost']

pd.set_option('display.float_format', lambda x: '%.5f' % x)

metricsDF = pd.DataFrame(dict_list, index=index_names)
metricsDF['RMSE'] = metricsDF['MSE']**0.5

metricsDF[['MSE', 'RMSE', 'R2']]

Unnamed: 0,MSE,RMSE,R2
Linear Regression,4328005232.18433,65787.57658,0.96163
Decision Tree,77004555.55556,8775.22396,0.99932
Random Forest,75306351.56509,8677.92323,0.99933
SVR,118773156180.84032,344634.81568,-0.05308
XGBoost,29075644.35328,5392.18363,0.99974


## Model Selection Results
The two scoring metrics we will focus on are RMSE and R2. <br>
Using RMSE will show us an error on the same scale as our expected sale prices and will allow us to compare the different models. <br>
R2 will give us an idea of how accurate the model predictions are with the actual sold price from our test data. <br>

* We can see that the XGBoost gives use the lowest RMSE, so we should select that model.
* The next best model is the decision tree and the random forest models. Given that random forest shows slightly better metrics and builds off of many decision trees, we will use that
    * Using the default parameters for the decision tree and random forest models, we see that the predicted results have no error. We can determine that the max depth of the tree accounts for all variability from the data.
    * In our case, the model predicts the test data accurately, but there may be overfitting. It may be worth setting the parameters to limit the max depth of the tree, limiting parameters, or folding the data. 

## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but its recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferrable. 



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)