## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost
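
One way to try several baseline models at once is to keep them in a dictionary and loop over it. The sketch below is only an illustration: it uses synthetic stand-in data (`X_demo`, `y_demo`, and the `baseline_models` dictionary are hypothetical names, not part of this notebook), and it leaves out xgboost so that scikit-learn is the only dependency.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption; the notebook would use housingData).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Untuned baseline models, keyed by a readable name.
baseline_models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "svr": SVR(kernel="rbf"),
}

# Fit each model and record its test-set R^2.
baseline_r2 = {}
for name, estimator in baseline_models.items():
    estimator.fit(X_tr, y_tr)
    baseline_r2[name] = r2_score(y_te, estimator.predict(X_te))

for name, score in baseline_r2.items():
    print(f"{name}: R^2 = {score:.3f}")
```

Keeping the models in one dictionary makes it easy to add or remove a candidate without duplicating the fit/predict/score cell for each one.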

In [115]:
# import models and fit
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

from sklearn.metrics import mean_squared_error, r2_score

import datetime

In [116]:
# load data, clean data until EDA complete
housingData = pd.read_csv('../data/housingData.csv')

In [117]:
housingData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6597 entries, 0 to 6596
Data columns (total 48 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   list_date                        6597 non-null   int64  
 1   description.year_built           6597 non-null   float64
 2   description.baths_3qtr           6597 non-null   float64
 3   description.sold_date            6597 non-null   int64  
 4   description.sold_price           6597 non-null   float64
 5   description.baths_full           6597 non-null   float64
 6   description.baths_half           6597 non-null   float64
 7   description.lot_sqft             6597 non-null   float64
 8   description.sqft                 6597 non-null   float64
 9   description.baths                6597 non-null   float64
 10  description.garage               6597 non-null   float64
 11  description.stories              6597 non-null   float64
 12  description.beds    

In [118]:
# Splitting training and testing data 
# Run for all columns
X_train, X_test, y_train, y_test = train_test_split(housingData.drop(columns=['description.sold_price']), housingData['description.sold_price'], test_size=0.2, random_state=42)

In [131]:
# Splitting training and testing data 
# Run for only list price, cityAverageListPrice, sqft
columns = ['cityAverageListPrice','list_price','description.sqft']
X_train, X_test, y_train, y_test = train_test_split(housingData[columns], housingData['description.sold_price'], test_size=0.2, random_state=42)


In [145]:
# Splitting training and testing data 
# Run for only list price, sqft
columns = ['list_price','description.sqft']
X_train, X_test, y_train, y_test = train_test_split(housingData[columns], housingData['description.sold_price'], test_size=0.2, random_state=42)


In [70]:
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
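
If the scaling step above is re-enabled, one option is to wrap the scaler and the model in a scikit-learn pipeline so the scaler is fit on the training split only and reused on the test split. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and `scaled_svr` are hypothetical names, and the two made-up features simply stand in for columns on very different scales.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in features on very different scales (an assumption).
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=400)
beds = rng.integers(1, 6, size=400).astype(float)
X_demo = np.column_stack([sqft, beds])
y_demo = 0.002 * sqft + 0.5 * beds + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data only, then applies
# the same transformation before every call to predict.
scaled_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
scaled_svr.fit(X_tr, y_tr)

r2_scaled = r2_score(y_te, scaled_svr.predict(X_te))
print(f"SVR with scaling: R^2 = {r2_scaled:.3f}")
```

Distance-based models such as SVR tend to be sensitive to feature scale, while tree-based models generally are not. With a target measured in dollars, the target's scale may matter for SVR as well; `sklearn.compose.TransformedTargetRegressor` is one option to explore for that.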

In [168]:
# Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)

y_pred_LR = model.predict(X_test)

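mse_LR = mean_squared_error(y_test, y_pred_LR)
# NOTE: R^2 is computed on test rows 900-1099 only, so the value printed below
# is not comparable to the full-test-set R^2 reported for the other models.
r2_LR = r2_score(y_test[900:1100], y_pred_LR[900:1100])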

print(mse_LR, r2_LR)

1087161009689.8822 0.9167663340105557


In [186]:
# Flag listings whose sold price differs from the list price by more than 30%
diffprice = housingData['list_price'] - housingData['description.sold_price']
diffs = diffprice[diffprice.abs() / housingData['list_price'] > 0.3].index

In [187]:
pd.options.display.float_format = '{:.0f}'.format
# diffs holds index labels, so select with .loc rather than .iloc
housingData.loc[diffs, ['list_price','description.sold_price']]

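|      | list_price | description.sold_price |
|------|------------|------------------------|
| 5    | 30000      | 10000                  |
| 28   | 225000     | 30000                  |
| 44   | 1058179    | 710000                 |
| 47   | 1058179    | 534900                 |
| 49   | 1058179    | 430000                 |
| ...  | ...        | ...                    |
| 6529 | 19900      | 9000                   |
| 6537 | 30000      | 48500                  |
| 6582 | 257442     | 162000                 |
| 6589 | 257442     | 676860                 |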


In [150]:
# Decision Tree
tree_model = DecisionTreeRegressor(random_state=42)
tree_model.fit(X_train, y_train)

y_pred_tree = tree_model.predict(X_test)

mse_tree = mean_squared_error(y_test, y_pred_tree)
r2_tree = r2_score(y_test, y_pred_tree)

print(mse_tree, r2_tree)

1096401940025.8744 0.09411833709319561


In [151]:
# Random Forest
forest_model = RandomForestRegressor(random_state=42)
forest_model.fit(X_train, y_train)
y_pred_forest = forest_model.predict(X_test)

mse_forest = mean_squared_error(y_test, y_pred_forest)
r2_forest = r2_score(y_test, y_pred_forest)

print(mse_forest, r2_forest)

1096796499681.2084 0.09379233953368626


In [152]:
# Gradient Booster
gb_model = GradientBoostingRegressor(random_state=42)
gb_model.fit(X_train, y_train)

y_pred_gb = gb_model.predict(X_test)

mse_gb = mean_squared_error(y_test, y_pred_gb)
r2_gb = r2_score(y_test, y_pred_gb)

print(mse_gb, r2_gb)

1094508003931.4076 0.09568316648291231


In [153]:
# SVR
svr_model = SVR(kernel='rbf')
svr_model.fit(X_train, y_train)

y_pred_svr = svr_model.predict(X_test)

mse_svr = mean_squared_error(y_test, y_pred_svr)
r2_svr = r2_score(y_test, y_pred_svr)

print(mse_svr, r2_svr)

1224417437122.868 -0.011652080811508814


In [154]:
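# Test R^2 is very low for the tree-based models (about 0.09) and negative for SVR.
# Low test scores alone do not show overfitting; compare training and test scores first.
# If training scores are much higher than test scores, implement methods to reduce overfitting.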
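
Whether a model is overfitting can be checked by comparing its score on the training split with its score on the test split. The sketch below does this on synthetic noisy data (`X_demo`, `y_demo`, and the helper `train_test_r2` are hypothetical names, not part of this notebook) and contrasts an unrestricted decision tree with a depth-limited one.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (an assumption): only the first feature carries signal.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 4))
y_demo = X_demo[:, 0] + rng.normal(scale=1.0, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)


def train_test_r2(estimator):
    """Return (train R^2, test R^2) for an already fitted estimator."""
    return (
        r2_score(y_tr, estimator.predict(X_tr)),
        r2_score(y_te, estimator.predict(X_te)),
    )


deep_tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_tr, y_tr)

deep_train, deep_test = train_test_r2(deep_tree)
shallow_train, shallow_test = train_test_r2(shallow_tree)

print(f"deep tree:    train R^2 = {deep_train:.3f}, test R^2 = {deep_test:.3f}")
print(f"shallow tree: train R^2 = {shallow_train:.3f}, test R^2 = {shallow_test:.3f}")
```

A large gap between training and test scores points toward overfitting. If both scores are low, the cause is more likely elsewhere, for example uninformative features or outliers in the target.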

Consider what metrics you want to use to evaluate success.
- If you think about mean squared error, can we actually relate to the amount of error?
- Try root mean squared error so that error is closer to the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
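
The metrics in the list above can be computed side by side. The sketch below is one possible helper (`regression_report` is a hypothetical name, and the prices are made-up illustrative values, not results from this dataset). Adjusted R^2 uses the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred, n_features):
    """Return RMSE, MAE, R^2 and adjusted R^2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n_samples = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return {
        # RMSE is in the same units as the target (dollars here).
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2),
        "adj_r2": float(1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)),
    }


# Made-up prices for illustration only.
y_true = [200_000, 250_000, 300_000, 350_000, 400_000]
y_close = [210_000, 240_000, 310_000, 340_000, 410_000]    # every prediction off by 10k
y_outlier = [200_000, 250_000, 300_000, 350_000, 450_000]  # one prediction off by 50k

close = regression_report(y_true, y_close, n_features=1)
outlier = regression_report(y_true, y_outlier, n_features=1)

print("evenly spread errors:", close)
print("single large error:  ", outlier)
```

Both prediction sets have the same mean absolute error, but the one with a single large miss has a noticeably higher RMSE, because squaring the errors weights large misses more heavily.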

In [56]:
# gather evaluation metrics and compare results

In [112]:
# pred_price is not defined in the cells above; it is assumed to come from an earlier run
mse = mean_squared_error(y_test,pred_price)
r2 = r2_score(y_test,pred_price)

In [114]:
print(mse,r2)

1089716073640.0072 0.09964240955102


## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
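
As a starting point, the sketch below shows two of the approaches named above, RFE and Lasso, on synthetic data where only two of eight features drive the target. The data and all variable names (`X_demo`, `y_demo`, `rfe_selected`, `lasso_selected`) are hypothetical, and the number of features to keep is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): only columns 0 and 3 influence the target.
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(300, 8))
y_demo = 5.0 * X_demo[:, 0] - 4.0 * X_demo[:, 3] + rng.normal(scale=0.5, size=300)

# Scale first so coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X_demo)

# Recursive feature elimination: repeatedly drop the weakest feature
# until only n_features_to_select remain.
rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X_scaled, y_demo)
rfe_selected = np.flatnonzero(rfe.support_)

# Lasso: the L1 penalty shrinks some coefficients to zero;
# features with non-zero coefficients are kept.
lasso = LassoCV(cv=5, random_state=42).fit(X_scaled, y_demo)
lasso_selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-6)

print("RFE kept columns:  ", rfe_selected)
print("Lasso kept columns:", lasso_selected)
```

On real data, the selection should be fit on the training split only, and the models refit on the reduced feature set so the chosen metrics can be compared with the full-feature results.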



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)