## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross-validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost
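
One way to try several of these models quickly is to loop over a dictionary of estimators and record the same score for each. The sketch below is only an illustration: it uses synthetic stand-in data (an assumption, so it runs on its own), default hyperparameters, and hypothetical names such as `baseline_models` and `baseline_scores`. xgboost is left out to avoid an extra dependency; if it is installed, `xgboost.XGBRegressor` follows the same `fit`/`predict` pattern and could be added to the dictionary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption): replace with the preprocessed
# X_train / y_train / X_test / y_test loaded in this notebook.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
true_weights = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y_demo = X_demo @ true_weights + rng.normal(scale=0.5, size=300)
X_tr, X_te = X_demo[:200], X_demo[200:]
y_tr, y_te = y_demo[:200], y_demo[200:]

# Candidate baselines, all with default (untuned) settings.
baseline_models = {
    "linear regression": LinearRegression(),
    "support vector machine": SVR(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model and record train/test R^2 for a side-by-side comparison.
baseline_scores = {}
for name, model in baseline_models.items():
    model.fit(X_tr, y_tr)
    baseline_scores[name] = {
        "train_r2": r2_score(y_tr, model.predict(X_tr)),
        "test_r2": r2_score(y_te, model.predict(X_te)),
    }

for name, scores in baseline_scores.items():
    print(f"{name}:\ttrain R^2 = {scores['train_r2']:.3f}\ttest R^2 = {scores['test_r2']:.3f}")
```

When the target is loaded as a single-column DataFrame, as it is in this notebook, passing `y_train.values.ravel()` to `fit` avoids the column-vector warning that some scikit-learn estimators raise.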

In [23]:
#Import preprocessed training and test data
import pandas as pd

#Independent variable training data
X_train = pd.read_csv("../data/preprocessed/X_train_scaled.csv")
X_train = X_train.drop(columns=["Unnamed: 0"])
print(f"X_train shape: {X_train.shape}")

#Target training data
y_train = pd.read_csv("../data/preprocessed/y_train.csv")
y_train = y_train.drop(columns=["Unnamed: 0"])
print(f"y_train shape: {y_train.shape}")

#Independent variable test data
X_test = pd.read_csv("../data/preprocessed/X_test_scaled.csv")
X_test = X_test.drop(columns=["Unnamed: 0"])
print(f"X_test shape: {X_test.shape}")

#Target test data
y_test = pd.read_csv("../data/preprocessed/y_test.csv")
y_test = y_test.drop(columns=["Unnamed: 0"])
print(f"y_test shape: {y_test.shape}")


X_train shape: (3381, 34)
y_train shape: (3381, 1)
X_test shape: (1450, 34)
y_test shape: (1450, 1)


In [24]:
# Train our model
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train, y_train)

Consider what metrics you want to use to evaluate success.
- Mean squared error is measured in squared units (dollars squared). Can you actually relate to that amount of error?
- Try root mean squared error so that the error is expressed in the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
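
To make these questions concrete, the sketch below computes MAE, RMSE, R^2 and adjusted R^2 with NumPy on a handful of made-up prices (illustrative values, not data from this project). The helper `regression_report` is a hypothetical name, and the adjusted R^2 line uses the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1) for n rows and p features. The two prediction sets have the same total absolute error, but one concentrates it in a single outlier.

```python
import numpy as np

def regression_report(y_true, y_pred, n_features):
    """Return MAE, RMSE, R^2 and adjusted R^2 for 1-D or column-vector inputs."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    n = y_true.size
    mae = np.mean(np.abs(residuals))
    rmse = np.sqrt(np.mean(residuals ** 2))
    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    # Adjusted R^2 penalises R^2 for the number of features used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"mae": mae, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}

# Made-up sale prices, for illustration only.
actual = np.array([200_000.0, 250_000.0, 300_000.0, 350_000.0, 400_000.0])
# Every prediction is off by $10,000.
even_errors = actual + np.array([10_000.0, -10_000.0, 10_000.0, -10_000.0, 10_000.0])
# Same total absolute error ($50,000), concentrated in one prediction.
one_outlier = actual + np.array([0.0, 0.0, 0.0, 0.0, 50_000.0])

report_even = regression_report(actual, even_errors, n_features=1)
report_outlier = regression_report(actual, one_outlier, n_features=1)

print(f"Even errors:\tMAE = {report_even['mae']:,.0f}\tRMSE = {report_even['rmse']:,.0f}")
# Even errors: MAE = 10,000, RMSE = 10,000
print(f"One outlier:\tMAE = {report_outlier['mae']:,.0f}\tRMSE = {report_outlier['rmse']:,.0f}")
# One outlier: MAE = 10,000, RMSE = 22,361
```

MAE is identical in both cases, while RMSE more than doubles when the error sits in one row. That is the sense in which RMSE penalises outliers more heavily than MAE.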

In [28]:
# Check performance on train and test set
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np

#Using R2
y_train_pred = reg.predict(X_train)
y_test_pred = reg.predict(X_test)

r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

print(f'Train R^2:\t{r2_train}\nTest R^2:\t{r2_test}')

Train R^2:	0.7373624908699018
Test R^2:	0.7270526876086874


In [29]:
#Using Mean Absolute Error
MAE_train = mean_absolute_error(y_train, y_train_pred)
MAE_test = mean_absolute_error(y_test, y_test_pred)

print(f'Train MAE:\t{MAE_train}\nTest MAE:\t{MAE_test}')

Train MAE:	69045.11110391657
Test MAE:	69106.01792210324


In [30]:
#Using Root Mean Squared Error
RMSE_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
RMSE_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

print(f'Train RMSE:\t{RMSE_train}\nTest RMSE:\t{RMSE_test}')

Train RMSE:	94173.4975302238
Test RMSE:	95747.85514032876


In [69]:
#Demonstration of prediction
#Random row to inspect
import random

#Choose 10 rows to display
for i in range(10):
    j = random.randint(0, len(y_test)-1)

    demo_prediction = round(y_test_pred[j][0])
    demo_actual = round(y_test.iloc[j].item())
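    #Signed error in dollars: positive means the model over-predicted
    demo_difference = demo_prediction - demo_actual
    #Percent by which the actual price differs from the prediction, so its sign is opposite to demo_difference
    demo_difference_percentage = round((demo_actual / demo_prediction - 1)*10000)/100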

    print(f"Index: {j} \t- \tPrediction: ${demo_prediction:,} \tActual: ${demo_actual:,} \tDifference: {demo_difference:,}, {demo_difference_percentage}%")


Index: 63 	- 	Prediction: $190,147 	Actual: $178,000 	Difference: 12,147, -6.39%
Index: 977 	- 	Prediction: $182,404 	Actual: $261,250 	Difference: -78,846, 43.23%
Index: 1175 	- 	Prediction: $93,044 	Actual: $140,000 	Difference: -46,956, 50.47%
Index: 1442 	- 	Prediction: $185,346 	Actual: $150,000 	Difference: 35,346, -19.07%
Index: 1033 	- 	Prediction: $431,152 	Actual: $822,500 	Difference: -391,348, 90.77%
Index: 1393 	- 	Prediction: $199,556 	Actual: $231,650 	Difference: -32,094, 16.08%
Index: 573 	- 	Prediction: $431,901 	Actual: $335,000 	Difference: 96,901, -22.44%
Index: 347 	- 	Prediction: $422,607 	Actual: $440,000 	Difference: -17,393, 4.12%
Index: 6 	- 	Prediction: $216,436 	Actual: $210,000 	Difference: 6,436, -2.97%
Index: 478 	- 	Prediction: $277,584 	Actual: $310,000 	Difference: -32,416, 11.68%


## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection):
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
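
As a starting point, the sketch below applies two of the approaches named above, RFE and Lasso, and then refits a linear model on the reduced feature set to compare test R^2. It runs on synthetic data in which only the first three of ten columns drive the target (an assumption made so the result is easy to check). The choices `n_features_to_select=3` and `alpha=0.1` are illustrative, not recommendations for this dataset.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in data (an assumption): only columns 0, 1 and 2 matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 10))
y_demo = (5.0 * X_demo[:, 0] - 4.0 * X_demo[:, 1] + 3.0 * X_demo[:, 2]
          + rng.normal(scale=0.5, size=400))
X_tr, X_te = X_demo[:300], X_demo[300:]
y_tr, y_te = y_demo[:300], y_demo[300:]

# Recursive feature elimination: repeatedly drop the weakest coefficient
# until the requested number of features remains.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3)
rfe.fit(X_tr, y_tr)
rfe_columns = np.flatnonzero(rfe.support_)

# Lasso: the L1 penalty shrinks uninformative coefficients to exactly zero.
lasso = Lasso(alpha=0.1)
lasso.fit(X_tr, y_tr)
lasso_columns = np.flatnonzero(lasso.coef_)

# Refit on the RFE subset and compare against the full feature set.
full_model = LinearRegression().fit(X_tr, y_tr)
reduced_model = LinearRegression().fit(X_tr[:, rfe_columns], y_tr)
full_r2 = r2_score(y_te, full_model.predict(X_te))
reduced_r2 = r2_score(y_te, reduced_model.predict(X_te[:, rfe_columns]))

print(f"RFE kept columns:\t{rfe_columns.tolist()}")
print(f"Lasso kept columns:\t{lasso_columns.tolist()}")
print(f"Full test R^2:\t\t{full_r2:.4f}\nReduced test R^2:\t{reduced_r2:.4f}")
```

With the notebook's DataFrames, the same selection can be applied by column name, for example `X_train.loc[:, rfe.support_]`. If the reduced model scores about the same as the full one, that supports keeping the simpler feature set.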



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)