## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [70]:
# import models and fit
import pandas as pd
import numpy as np


from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import mean_squared_error, r2_score


In [22]:
# load data, clean data until EDA complete
housingData = pd.read_csv('../data/housingData.csv')
numerical_cols = housingData.select_dtypes(include=['number']).columns.tolist()
housingData = housingData[numerical_cols]
housingData = housingData.dropna(subset=['description.year_built'])

In [23]:
housingData.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6012 entries, 0 to 6616
Data columns (total 18 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   description.year_built           6012 non-null   float64
 1   description.baths_3qtr           6012 non-null   float64
 2   description.sold_price           6012 non-null   float64
 3   description.baths_full           6012 non-null   float64
 4   description.baths_half           6012 non-null   float64
 5   description.lot_sqft             6012 non-null   float64
 6   description.sqft                 6012 non-null   float64
 7   description.baths                6012 non-null   float64
 8   description.garage               6012 non-null   float64
 9   description.stories              6012 non-null   float64
 10  description.beds                 6012 non-null   float64
 11  list_price                       6012 non-null   float64
 12  property_id              

In [5]:
housingData.head()

Unnamed: 0,description.year_built,description.sold_price,description.baths_full,description.lot_sqft,description.sqft,description.baths,description.garage,description.stories,description.beds,list_price,property_id,listing_id,price_reduced_amount,location.address.postal_code,location.address.coordinate.lon,location.address.coordinate.lat
0,1998.0,129900.0,2.0,11761.0,1478.0,2.0,2.0,1.0,3.0,129900.0,8846541000.0,622475900.0,0.0,36117.0,-86.178412,32.389075
1,1945.0,88500.0,2.0,6534.0,1389.0,2.0,1.0,2.0,4.0,88000.0,7727981000.0,2961523000.0,3000.0,36107.0,-86.273286,32.382748
2,1969.0,145000.0,2.0,17424.0,2058.0,2.0,0.0,1.0,3.0,149000.0,7320925000.0,619793200.0,0.0,36109.0,-86.221454,32.380023
3,1955.0,65000.0,2.0,9712.0,1432.0,2.0,0.0,1.0,3.0,68000.0,7231605000.0,2957379000.0,9000.0,36107.0,-86.284387,32.386844
4,1984.0,169000.0,2.0,10890.0,1804.0,2.0,0.0,1.0,3.0,169999.0,7700691000.0,2960976000.0,5000.0,36106.0,-86.232662,32.351898


In [79]:
# Splitting training and testing data 
X_train, X_test, y_train, y_test = train_test_split(housingData.drop(columns=['description.sold_price']), housingData['description.sold_price'], test_size=0.2, random_state=42)

In [62]:
# Splitting training and testing data 
X_train, X_test, y_train, y_test = train_test_split(housingData[['list_price','description.sqft']], housingData['description.sold_price'], test_size=0.2, random_state=42)

In [80]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

In [64]:
linR = LinearRegression()
linR.fit(X_train_scaled,y_train)

In [65]:
X_test_scaled = scaler.transform(X_test)
pred_price = linR.predict(X_test_scaled)

In [75]:
import statsmodels.api as sm
X = sm.add_constant(X_train)
lin_reg = sm.OLS(y_train,X)
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                              OLS Regression Results                              
Dep. Variable:     description.sold_price   R-squared:                       0.376
Model:                                OLS   Adj. R-squared:                  0.374
Method:                     Least Squares   F-statistic:                     180.3
Date:                    Sat, 22 Jun 2024   Prob (F-statistic):               0.00
Time:                            15:25:41   Log-Likelihood:                -69177.
No. Observations:                    4809   AIC:                         1.384e+05
Df Residuals:                        4792   BIC:                         1.385e+05
Df Model:                              16                                         
Covariance Type:                nonrobust                                         
                                      coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------

In [76]:
X.drop('description.baths_half',inplace=True, axis=1)
lin_reg = sm.OLS(y_train,X)
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                              OLS Regression Results                              
Dep. Variable:     description.sold_price   R-squared:                       0.376
Model:                                OLS   Adj. R-squared:                  0.374
Method:                     Least Squares   F-statistic:                     192.3
Date:                    Sat, 22 Jun 2024   Prob (F-statistic):               0.00
Time:                            15:26:01   Log-Likelihood:                -69177.
No. Observations:                    4809   AIC:                         1.384e+05
Df Residuals:                        4793   BIC:                         1.385e+05
Df Model:                              15                                         
Covariance Type:                nonrobust                                         
                                      coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------

In [77]:
X.drop('description.year_built',inplace=True, axis=1)
lin_reg = sm.OLS(y_train,X)
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                              OLS Regression Results                              
Dep. Variable:     description.sold_price   R-squared:                       0.376
Model:                                OLS   Adj. R-squared:                  0.374
Method:                     Least Squares   F-statistic:                     206.1
Date:                    Sat, 22 Jun 2024   Prob (F-statistic):               0.00
Time:                            15:26:24   Log-Likelihood:                -69177.
No. Observations:                    4809   AIC:                         1.384e+05
Df Residuals:                        4794   BIC:                         1.385e+05
Df Model:                              14                                         
Covariance Type:                nonrobust                                         
                                      coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------

In [78]:
X.drop('description.lot_sqft',inplace=True, axis=1)
lin_reg = sm.OLS(y_train,X)
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

                              OLS Regression Results                              
Dep. Variable:     description.sold_price   R-squared:                       0.376
Model:                                OLS   Adj. R-squared:                  0.374
Method:                     Least Squares   F-statistic:                     221.9
Date:                    Sat, 22 Jun 2024   Prob (F-statistic):               0.00
Time:                            15:26:40   Log-Likelihood:                -69178.
No. Observations:                    4809   AIC:                         1.384e+05
Df Residuals:                        4795   BIC:                         1.385e+05
Df Model:                              13                                         
Covariance Type:                nonrobust                                         
                                      coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------

In [None]:
X.drop('description.lot_sqft',inplace=True, axis=1)
lin_reg = sm.OLS(y_train,X)
model = lin_reg.fit()
print_model = model.summary()
print(print_model)

Consider what metrics you want to use to evaluate success.
- If you think about mean squared error, can we actually relate to the amount of error?
- Try root mean squared error so that error is closer to the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use

In [56]:
# gather evaluation metrics and compare results

In [66]:
mse = mean_squared_error(pred_price, y_test)
r2 = r2_score(pred_price,y_test)

In [67]:
print(r2)

0.936071029904634


## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but its recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferrable. 



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)