## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross-validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost
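
As a rough sketch of how the ideas above could be compared side by side, the loop below fits a few scikit-learn regressors with default settings and records the test RMSE for each. The data comes from `make_regression` and is only a stand-in (an assumption) for the preprocessed housing data loaded in the next cell. The names `baseline_models` and `baseline_rmse` are hypothetical, and xgboost is left out so the example depends only on scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): swap in X_train, y_train, X_test, y_test
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Baseline models with (mostly) default settings, no tuning
baseline_models = {
    "linear regression": LinearRegression(),
    "support vector machine": SVR(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model and record its RMSE on the held-out split
baseline_rmse = {}
for name, model in baseline_models.items():
    model.fit(X_tr, y_tr)
    rmse = float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))
    baseline_rmse[name] = rmse
    print(f"{name:>25}: test RMSE = {rmse:.2f}")
```

Keeping the models in a dictionary makes it cheap to add another candidate later without touching the evaluation loop.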

In [193]:
#Import preprocessed data
import pandas as pd

#Independent variable training data
X_train = pd.read_csv("../data/preprocessed/X_train_scaled.csv")
X_train = X_train.drop(columns=["Unnamed: 0"])
print(f"X_train shape: {X_train.shape}")

#Target training data
y_train = pd.read_csv("../data/preprocessed/y_train.csv")
y_train = y_train.drop(columns=["Unnamed: 0"])
print(f"y_train shape: {y_train.shape}")

#Independent variable test data
X_test = pd.read_csv("../data/preprocessed/X_test_scaled.csv")
X_test = X_test.drop(columns=["Unnamed: 0"])
print(f"X_test shape: {X_test.shape}")

#Target test data
y_test = pd.read_csv("../data/preprocessed/y_test.csv")
y_test = y_test.drop(columns=["Unnamed: 0"])
print(f"y_test shape: {y_test.shape}")


X_train shape: (3381, 34)
y_train shape: (3381, 1)
X_test shape: (1450, 34)
y_test shape: (1450, 1)


In [209]:
#Create function to get all error scores at once
def get_error_scores(y_train, y_train_pred, y_test, y_test_pred, error_type='All', num_results=10):
    # Check performance on train and test set
    from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
    import numpy as np

    # Compare case-insensitively so 'All', 'R2', 'mae', etc. are all accepted
    error_type = error_type.lower()

    if error_type in ('all', 'r2'):
        #Using R2
        r2_train = round(r2_score(y_train, y_train_pred),4)
        r2_test = round(r2_score(y_test, y_test_pred),4)

        print(f'R SQUARED\n\tTrain R2:\t{r2_train}\n\tTest R2:\t{r2_test}')

    if error_type in ('all', 'mae'):
        #Using Mean Absolute Error
        MAE_train = round(mean_absolute_error(y_train, y_train_pred),2)
        MAE_test = round(mean_absolute_error(y_test, y_test_pred),2)

        print(f'MEAN ABSOLUTE ERROR\n\tTrain MAE:\t{MAE_train}\n\tTest MAE:\t{MAE_test}')

    if error_type in ('all', 'rmse'):
        #Using Root Mean Squared Error
        RMSE_train = round(np.sqrt(mean_squared_error(y_train, y_train_pred)),2)
        RMSE_test = round(np.sqrt(mean_squared_error(y_test, y_test_pred)),2)

        print(f'ROOT MEAN SQUARED ERROR\n\tTrain RMSE:\t{RMSE_train}\n\tTest RMSE:\t{RMSE_test}\n')

    # Only show the random sample of predictions when all metrics are requested
    if error_type == 'all':
        display_results_sample(y_test, y_test_pred, num_results)



In [210]:
#Function to display a random sample of predictions next to the actual values
def display_results_sample(y_test, y_test_prediction, num_results=10):
    import random

    print(f"{num_results} Randomly selected results.")

    sum_percentage_error = 0

    #Choose num_results random rows to display (rows may repeat)
    for i in range(num_results):
        j = random.randint(0, len(y_test)-1)

        demo_prediction = round(y_test_prediction[j][0])
        demo_actual = round(y_test.iloc[j].item())
        demo_difference = demo_prediction - demo_actual
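        #Percentage by which the actual value differs from the prediction
        #(positive when the model under-predicts, so its sign is opposite to demo_difference)
        demo_difference_percentage = round((demo_actual / demo_prediction - 1)*100,2)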

        sum_percentage_error += abs(demo_difference_percentage)

        print(f"Index: {j} \t- \tPrediction: ${demo_prediction:,} \tActual: ${demo_actual:,} \tDifference: {demo_difference:,}, {demo_difference_percentage}%")

    average_percentage_error = round(sum_percentage_error / num_results,2)
    print(f"\t\t\t\t\t\t\t\t\tAverage % error = {average_percentage_error}%")


Consider what metrics you want to use to evaluate success.
- If you think about mean squared error, can we actually relate to the amount of error?
- Try root mean squared error so that error is closer to the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
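
To make the questions above concrete, here is a small sketch of how MAE and RMSE react to a single large miss, together with the usual adjusted R^2 formula. The prices are made up for illustration, and the helpers `mae`, `rmse`, and `adjusted_r2` are hypothetical names, not part of this notebook.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors, in dollars."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root mean squared error: squaring weights large errors more heavily."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2: penalizes R^2 for the number of features used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Made-up prices (assumption): every prediction is off by exactly $10,000
y_true = np.array([200_000, 250_000, 300_000, 350_000, 400_000])
y_pred = y_true + 10_000
print(f"No outlier:   MAE = {mae(y_true, y_pred):,.0f}  RMSE = {rmse(y_true, y_pred):,.0f}")

# Same predictions, except one house is now missed by $200,000
y_pred_outlier = y_pred.copy()
y_pred_outlier[-1] = y_true[-1] + 200_000
print(f"With outlier: MAE = {mae(y_true, y_pred_outlier):,.0f}  "
      f"RMSE = {rmse(y_true, y_pred_outlier):,.0f}")

# Adjusted R^2 for a given R^2, sample count, and feature count
print(f"Adjusted R^2: {adjusted_r2(0.7271, n_samples=1450, n_features=34):.4f}")
```

When all errors are equal, MAE and RMSE coincide. Once one error is much larger than the rest, RMSE rises well above MAE, which is worth keeping in mind if a few very expensive houses dominate the error.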

In [211]:
# Train our Linear Regression model
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train, y_train)

#Get predictions
y_train_pred = reg.predict(X_train)
y_test_pred = reg.predict(X_test)

get_error_scores(y_train, y_train_pred, y_test, y_test_pred)

R SQUARED
	Train R2:	0.7374
	Test R2:	0.7271
MEAN ABSOLUTE ERROR
	Train MAE:	69045.11
	Test MAE:	69106.02
ROOT MEAN SQUARED ERROR
	Train RMSE:	94173.5
	Test RMSE:	95747.86

10 Randomly selected results.
Index: 1356 	- 	Prediction: $186,969 	Actual: $150,000 	Difference: 36,969, -19.77%
Index: 225 	- 	Prediction: $567,266 	Actual: $762,000 	Difference: -194,734, 34.33%
Index: 1247 	- 	Prediction: $446,066 	Actual: $280,000 	Difference: 166,066, -37.23%
Index: 1379 	- 	Prediction: $563,924 	Actual: $368,000 	Difference: 195,924, -34.74%
Index: 1110 	- 	Prediction: $51,367 	Actual: $100,000 	Difference: -48,633, 94.68%
Index: 84 	- 	Prediction: $614,979 	Actual: $651,000 	Difference: -36,021, 5.86%
Index: 40 	- 	Prediction: $164,046 	Actual: $153,000 	Difference: 11,046, -6.73%
Index: 396 	- 	Prediction: $372,689 	Actual: $270,000 	Difference: 102,689, -27.55%
Index: 594 	- 	Prediction: $299,558 	Actual: $245,000 	Difference: 54,558, -18.21%
Index: 820 	- 	Prediction: $356,691 	Actual: $29

In [None]:
from sklearn.preprocessing import PolynomialFeatures

# Create 2nd degree polynomial feature set and train model
poly2 = PolynomialFeatures(degree=2)
Xpoly_train = poly2.fit_transform(X_train)
Xpoly_test = poly2.transform(X_test)
print(f'Number of polynomial features: {Xpoly_train.shape[1]}')

# Train our model
reg.fit(Xpoly_train, y_train)
ypoly_train_pred = reg.predict(Xpoly_train)
ypoly_test_pred = reg.predict(Xpoly_test)

get_error_scores(y_train, ypoly_train_pred, y_test, ypoly_test_pred)

Number of polynomial features: 630
R SQUARED
	Train R2:	0.8911
	Test R2:	0.8373
MEAN ABSOLUTE ERROR
	Train MAE:	45371.38
	Test MAE:	54898.79
ROOT MEAN SQUARED ERROR
	Train RMSE:	60651.3
	Test RMSE:	73930.9

10 Randomly selected results.
Index: 638 	- 	Prediction: $531,926 	Actual: $550,000 	Difference: -18,074, 3.4%
Index: 1389 	- 	Prediction: $253,862 	Actual: $145,000 	Difference: 108,862, -42.88%
Index: 313 	- 	Prediction: $487,708 	Actual: $529,900 	Difference: -42,192, 8.65%
Index: 1382 	- 	Prediction: $221,235 	Actual: $315,000 	Difference: -93,765, 42.38%
Index: 972 	- 	Prediction: $39,779 	Actual: $74,250 	Difference: -34,471, 86.66%
Index: 283 	- 	Prediction: $486,318 	Actual: $450,000 	Difference: 36,318, -7.47%
Index: 73 	- 	Prediction: $510,943 	Actual: $785,000 	Difference: -274,057, 53.64%
Index: 834 	- 	Prediction: $313,310 	Actual: $200,000 	Difference: 113,310, -36.17%
Index: 1108 	- 	Prediction: $256,544 	Actual: $288,000 	Difference: -31,456, 12.26%
Index: 485 	- 	Pre

In [214]:
# Create 3rd degree polynomial feature set and train model
poly3 = PolynomialFeatures(degree=3)
Xpoly_train = poly3.fit_transform(X_train)
Xpoly_test = poly3.transform(X_test)
print(f'Number of polynomial features: {Xpoly_train.shape[1]}')

# Train our model
reg.fit(Xpoly_train, y_train)
ypoly_train_pred = reg.predict(Xpoly_train)
ypoly_test_pred = reg.predict(Xpoly_test)

# Check performance on train and test set
get_error_scores(y_train, ypoly_train_pred, y_test, ypoly_test_pred)

Number of polynomial features: 7770
R SQUARED
	Train R2:	1.0
	Test R2:	0.9693
MEAN ABSOLUTE ERROR
	Train MAE:	0.0
	Test MAE:	4554.24
ROOT MEAN SQUARED ERROR
	Train RMSE:	0.0
	Test RMSE:	32094.95

10 Randomly selected results.
Index: 583 	- 	Prediction: $276,000 	Actual: $276,000 	Difference: 0, 0.0%
Index: 927 	- 	Prediction: $162,000 	Actual: $162,000 	Difference: 0, 0.0%
Index: 724 	- 	Prediction: $415,000 	Actual: $415,000 	Difference: 0, 0.0%
Index: 1195 	- 	Prediction: $550,000 	Actual: $550,000 	Difference: 0, 0.0%
Index: 9 	- 	Prediction: $379,000 	Actual: $379,000 	Difference: 0, 0.0%
Index: 32 	- 	Prediction: $515,000 	Actual: $515,000 	Difference: 0, 0.0%
Index: 97 	- 	Prediction: $260,000 	Actual: $260,000 	Difference: 0, 0.0%
Index: 1345 	- 	Prediction: $559,900 	Actual: $559,900 	Difference: 0, 0.0%
Index: 1127 	- 	Prediction: $100,000 	Actual: $100,000 	Difference: 0, 0.0%
Index: 80 	- 	Prediction: $371,500 	Actual: $371,500 	Difference: 0, 0.0%
									Average % error = 
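
The feature counts printed by the two polynomial cells can be checked by hand. With the default `PolynomialFeatures` settings (bias column included, all interaction terms kept), `n` inputs expanded to degree `d` give `C(n + d, d)` columns. The helper `n_polynomial_features` below is a hypothetical name used only for this sketch.

```python
from math import comb

def n_polynomial_features(n_inputs, degree):
    """Number of monomials of total degree <= `degree` in `n_inputs` variables,
    counting the constant (bias) term."""
    return comb(n_inputs + degree, degree)

# 34 is the number of columns in X_train
for degree in (2, 3):
    print(f"degree {degree}: {n_polynomial_features(34, degree)} features")
```

At degree 3 there are 7,770 columns but only 3,381 training rows. With more coefficients than observations, ordinary least squares can generally reproduce the training targets exactly, which is consistent with the train R2 of 1.0 and train errors of 0.0 above. On its own, that training score says nothing about how well the model generalizes.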

## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
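
As one possible starting point, the sketch below applies two of the approaches mentioned above, Lasso-based selection and recursive feature elimination (RFE), using scikit-learn. The data is synthetic (an assumption) with 5 informative features out of 20. The `alpha` value and the choice of 5 features for RFE are illustrative, not recommendations for the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic stand-in data (assumption): 20 features, only 5 carry signal
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# Lasso: the L1 penalty drives some coefficients to zero;
# SelectFromModel keeps the features whose coefficients survive
lasso_selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
lasso_mask = lasso_selector.get_support()
print(f"Lasso kept {lasso_mask.sum()} of {X.shape[1]} features")

# RFE: repeatedly fit the model and drop the weakest feature
# until the requested number remains
rfe_selector = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)
rfe_mask = rfe_selector.get_support()
print(f"RFE kept {rfe_mask.sum()} of {X.shape[1]} features")

# Reduced feature matrix that can be passed to any model for refitting
X_reduced = rfe_selector.transform(X)
print(f"Reduced shape: {X_reduced.shape}")
```

To compare against the full feature set, the selector should be fitted on the training data only and then applied to both train and test sets with `transform`, so the test set plays no part in choosing features.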



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)