## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [1]:
# import models and fit
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler, PolynomialFeatures

from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_squared_error, r2_score

In [2]:
# Load data
X_train_scaled = pd.read_csv('../data/processed/X_train_scaled.csv')
X_test_scaled = pd.read_csv('../data/processed/X_test_scaled.csv')
y_train = pd.read_csv('../data/processed/y_train.csv')
y_test = pd.read_csv('../data/processed/y_test.csv')

X_train_scaled.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,36,37,38,39,40,41,42,43,44,45
count,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0,...,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0,5200.0
mean,9.155070000000001e-17,-3.411972e-15,-4.561138e-15,3.2794280000000005e-17,-5.334536e-15,4.645856e-17,-8.19857e-18,1.222953e-16,-1.667043e-16,4.2359280000000006e-17,...,8.642659e-17,4.3725710000000006e-17,9.155070000000001e-17,5.192428000000001e-17,1.2981070000000002e-17,4.3725710000000006e-17,-6.148928e-18,1.9813210000000003e-17,6.558856000000001e-17,-2.186285e-17
std,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,...,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096,1.000096
min,-1.619049,-23.96421,-3.967251,-0.2327678,-3.820993,-0.5964761,-0.3902324,-1.863974,-1.820466,-0.8570274,...,-3.076163,-0.05378625,-0.0891475,-0.3286841,-0.1958026,-0.147688,-0.2875007,-0.02402615,-1.464853,-0.2835617
25%,-0.9351284,-0.2685789,-0.6187975,-0.2327678,0.01275023,-0.5964761,-0.2781443,-0.6531968,-0.9536568,-0.8570274,...,0.3250803,-0.05378625,-0.0891475,-0.3286841,-0.1958026,-0.147688,-0.2875007,-0.02402615,-1.464853,-0.2835617
50%,-0.04968707,0.2194405,0.1545751,-0.2327678,0.3886074,-0.5964761,-0.2184802,-0.2397351,-0.08684762,0.003641291,...,0.3250803,-0.05378625,-0.0891475,-0.3286841,-0.1958026,-0.147688,-0.2875007,-0.02402615,0.6826622,-0.2835617
75%,0.8660643,0.508478,0.7842986,-0.2327678,0.5389502,1.265274,-0.114766,0.3270565,0.7799616,0.86431,...,0.3250803,-0.05378625,-0.0891475,-0.3286841,-0.1958026,-0.147688,-0.2875007,-0.02402615,0.6826622,-0.2835617
max,1.721196,1.139273,1.614389,6.104366,0.6892931,8.712275,15.75831,6.479283,5.114008,8.610328,...,0.3250803,18.59211,11.21736,3.042435,5.107184,6.77103,3.478252,41.62131,0.6826622,3.52657


In [3]:
# Linear Regression
LR_model = LinearRegression()
LR_model.fit(X_train_scaled, y_train)

y_train_LR = LR_model.predict(X_train_scaled)
y_test_LR = LR_model.predict(X_test_scaled)

train_mse_LR = mean_squared_error(y_train, y_train_LR)
train_r2_LR = r2_score(y_train, y_train_LR)

mse_LR = mean_squared_error(y_test, y_test_LR)
r2_LR = r2_score(y_test, y_test_LR)

LR_metrics = {'MSE':mse_LR, 'R2':r2_LR}

print(f'Train MSE: \t {train_mse_LR}')
print(f'Test MSE: \t {mse_LR}')
print(f'Train R2: \t {train_r2_LR}')
print(f'Test R2: \t {r2_LR}')


Train MSE: 	 3890321884.3568015
Test MSE: 	 4146077707.644794
Train R2: 	 0.9630973918376792
Test R2: 	 0.9593341086570428


In [4]:
# Decision Tree
tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X_train_scaled, y_train)

y_train_tree = tree_model.predict(X_train_scaled)
y_test_tree = tree_model.predict(X_test_scaled)

train_mse_tree = mean_squared_error(y_train, y_train_tree)
train_r2_tree = r2_score(y_train, y_train_tree)

mse_tree = mean_squared_error(y_test, y_test_tree)
r2_tree = r2_score(y_test, y_test_tree)

DTree_metrics = {'MSE':mse_tree, 'R2':r2_tree}

print(f'Train MSE: \t {train_mse_tree}')
print(f'Test MSE: \t {mse_tree}')
print(f'Train R2: \t {train_r2_tree}')
print(f'Test R2: \t {r2_tree}')


Train MSE: 	 0.0
Test MSE: 	 21375707.0
Train R2: 	 1.0
Test R2: 	 0.9997903410790786


In [5]:
# Random Forest
forest_model = RandomForestRegressor(random_state=42)
forest_model.fit(X_train_scaled, y_train)

y_train_forest = forest_model.predict(X_train_scaled)
y_test_forest = forest_model.predict(X_test_scaled)

train_mse_forest = mean_squared_error(y_train, y_train_forest)
train_r2_forest = r2_score(y_train, y_train_forest)

mse_forest = mean_squared_error(y_test, y_test_forest)
r2_forest = r2_score(y_test, y_test_forest)

Forest_metrics = {'MSE':mse_forest, 'R2':r2_forest}

print(f'Train MSE: \t {train_mse_forest}')
print(f'Test MSE: \t {mse_forest}')
print(f'Train R2: \t {train_r2_forest}')
print(f'Test R2: \t {r2_forest}')


  return fit_method(estimator, *args, **kwargs)


Train MSE: 	 12319780.788696423
Test MSE: 	 43155462.35645008
Train R2: 	 0.9998831376794504
Test R2: 	 0.9995767191387159


In [6]:
# SVR
svr_model = SVR(kernel='rbf')
svr_model.fit(X_train_scaled, y_train)

y_train_svr = svr_model.predict(X_train_scaled)
y_test_svr = svr_model.predict(X_test_scaled)

train_mse_svr = mean_squared_error(y_train, y_train_svr)
train_r2_svr = r2_score(y_train, y_train_tree)

mse_svr = mean_squared_error(y_test, y_test_svr)
r2_svr = r2_score(y_test, y_test_svr)

SVR_metrics = {'MSE':mse_svr, 'R2':r2_svr}

print(f'Train MSE: \t {train_mse_svr}')
print(f'Test MSE: \t {mse_svr}')
print(f'Train R2: \t {train_r2_svr}')
print(f'Test R2: \t {r2_svr}')

  y = column_or_1d(y, warn=True)


Train MSE: 	 110270549438.92082
Test MSE: 	 106861159273.6704
Train R2: 	 1.0
Test R2: 	 -0.04812417861653273


In [7]:
# XGBoots
xgb_model = XGBRegressor(random_state=42)
xgb_model.fit(X_train_scaled, y_train)

y_train_xgb = xgb_model.predict(X_train_scaled)
y_test_xgb = xgb_model.predict(X_test_scaled)

train_mse_xgb = mean_squared_error(y_train, y_train_xgb)
train_r2_xgb = r2_score(y_train, y_train_xgb)

mse_xgb = mean_squared_error(y_test, y_test_xgb)
r2_xgb = r2_score(y_test, y_test_xgb)

XGB_metrics = {'MSE':mse_xgb, 'R2':r2_xgb}

print(f'Train MSE: \t {train_mse_xgb}')
print(f'Test MSE: \t {mse_xgb}')
print(f'Train R2: \t {train_r2_xgb}')
print(f'Test R2: \t {r2_xgb}')

Train MSE: 	 3852866.4934079964
Test MSE: 	 40015271.04752819
Train R2: 	 0.9999634526841905
Test R2: 	 0.999607519014543


## Metrics

Consider what metrics you want to use to evaluate success.
- If you think about mean squared error, can we actually relate to the amount of error?
- Try root mean squared error so that error is closer to the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use

In [8]:
# Set list and index names
dict_list = [LR_metrics, DTree_metrics, Forest_metrics, SVR_metrics, XGB_metrics]
index_names = ['Linear Regression', 'Decision Tree', 'Random Forest', 'SVR', 'XGBoost']

pd.set_option('display.float_format', lambda x: '%.5f' % x)

metricsDF = pd.DataFrame(dict_list, index=index_names)
metricsDF['RMSE'] = metricsDF['MSE']**0.5

metricsDF[['MSE', 'RMSE', 'R2']]

Unnamed: 0,MSE,RMSE,R2
Linear Regression,4146077707.64479,64390.04354,0.95933
Decision Tree,21375707.0,4623.38696,0.99979
Random Forest,43155462.35645,6569.28172,0.99958
SVR,106861159273.6704,326896.25154,-0.04812
XGBoost,40015271.04753,6325.76249,0.99961


## Model Selection Results
The two scoring metrics we will focus on are RMSE and R2. <br>
Using RMSE will show us an error on the same scale as our expected sale prices and will allow us to compare the different models. <br>
R2 will give us an idea of how accurate the model predictions are with the actual sold price from our test data. <br>

* We can see that the XGBoost gives use the lowest RMSE, so we should select that model.
* The next best model is the decision tree and the random forest models. Given that random forest shows slightly better metrics and builds off of many decision trees, we will use that
    * Using the default parameters for the decision tree and random forest models, we see that the predicted results have no error. We can determine that the max depth of the tree accounts for all variability from the data.
    * In our case, the model predicts the test data accurately, but there may be overfitting. It may be worth setting the parameters to limit the max depth of the tree, limiting parameters, or folding the data. 

## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but its recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferrable. 



In [9]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)