## Creating a Final Model

Import libraries and set the pandas display options:

In [23]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, MinMaxScaler, PowerTransformer
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.metrics import r2_score, mean_squared_error
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

# 
Import our cleaned training data:

In [24]:
train_cleaned = pd.read_csv('../data/train_clean.csv')

# 
Select only the numeric columns, since those are the only variables we need in our model:

In [25]:
train_cleaned_numeric = train_cleaned.select_dtypes(include = np.number)
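
To illustrate what `select_dtypes` does in the cell above (it filters columns rather than converting them), here is a minimal sketch on a hypothetical toy frame. The column names and values are made up for the example and are not read from `train_clean.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the housing set
toy = pd.DataFrame({
    "Lot Area": [13517, 11492],
    "Neighborhood": ["Sawyer", "SawyerW"],
    "log_lot_area": [9.51, 9.35],
})

# Keep only numeric columns; the string column is dropped, nothing is converted
toy_numeric = toy.select_dtypes(include=np.number)
print(list(toy_numeric.columns))
```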

In [26]:
train_cleaned_numeric.head(2)

Unnamed: 0,Id,MS SubClass,Lot Area,Overall Qual,Overall Cond,Year Built,Year Remod/Add,BsmtFin SF 1,Total Bsmt SF,Central Air,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Full Bath,Bedroom AbvGr,Kitchen Qual,TotRms AbvGrd,Fireplaces,Garage Area,Wood Deck SF,Open Porch SF,SalePrice,remodeled,post_recession_sale,HasAlley,has_brick_face,has_basement,has_attached_garage,has_decent_garage,has_nice_fence,newer,pre_war,two_story,planned_development,split,floating_village,low_density_residential,gravel_street,on_hill,in_culdesac,good_hood,bad_hood,near_artery_or_feeder,has_hip_roof,nice_exterior,has_poured_concrete_foundation,poor_functionality,has_fireplace,has_paved_drive,has_porch_or_deck,log_overall_cond,log_lot_area,Neighborhood_Blueste,Neighborhood_BrDale,Neighborhood_BrkSide,Neighborhood_ClearCr,Neighborhood_CollgCr,Neighborhood_Crawfor,Neighborhood_Edwards,Neighborhood_Gilbert,Neighborhood_Greens,Neighborhood_IDOTRR,Neighborhood_MeadowV,Neighborhood_Mitchel,Neighborhood_NAmes,Neighborhood_NPkVill,Neighborhood_NWAmes,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker
0,109.0,60.0,13517.0,6.0,8.0,1976.0,2005.0,533.0,725.0,1.0,725.0,754.0,0.0,1479.0,0.0,2.0,3.0,3.0,6.0,0.0,475.0,0.0,44.0,130500.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0,0,1,1,2.079442,9.511703,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
1,544.0,60.0,11492.0,7.0,5.0,1996.0,1997.0,637.0,913.0,1.0,913.0,1209.0,0.0,2122.0,1.0,2.0,4.0,3.0,8.0,1.0,559.0,0.0,74.0,220000.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0,1,1,1,1.609438,9.349406,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


# 
### The Model
###### 
Create the independent variables (X) and the dependent variable (y) for our model. The target is the natural log of `SalePrice`:

In [27]:
features = train_cleaned_numeric.drop(columns = ['Central Air', 'Overall Cond', 'Id', 'SalePrice', 'Gr Liv Area', 'MS SubClass', 'Year Remod/Add', 
                                         'planned_development', 'has_paved_drive', 'Bsmt Full Bath', 
                                         'Lot Area', 'good_hood', 'bad_hood', 'two_story', 'split', 'newer', 'pre_war', 'Fireplaces', 'Low Qual Fin SF', 
                                         'has_attached_garage', 'Total Bsmt SF', 'Wood Deck SF', 'Open Porch SF', 'has_basement', 'Full Bath', 
                                         'Garage Area', 'has_nice_fence', 'TotRms AbvGrd', 'Bedroom AbvGr']).columns
X = train_cleaned_numeric[features]
y = np.log(train_cleaned_numeric['SalePrice'])

# 
Split the data into training and test sets:

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
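
Because `train_test_split` shuffles the rows at random, the scores reported below will vary from run to run unless a seed is fixed. A minimal sketch of a reproducible split, assuming an arbitrary seed of 42 and synthetic arrays in place of the housing data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# The same random_state yields the same split every time
split_a = train_test_split(X_demo, y_demo, random_state=42)
split_b = train_test_split(X_demo, y_demo, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)

# With the default test_size, 25% of the rows go to the test set
print(split_a[0].shape, split_a[1].shape)
```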

# 
Standardize the features so that all variables are on a comparable scale, fitting the scaler on the training set only:

In [29]:
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)
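
Fitting the scaler on `X_train` keeps the test set out of the scaling step. The cross-validation cells that follow, however, draw their folds from the already scaled `X_train`, so each validation fold has contributed to the scaler's mean and variance. One way to avoid that is to place the scaler and the model in a single pipeline so the scaler is refit inside every fold. This is a sketch on synthetic data from `make_regression`, not the approach used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the housing data
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# The scaler is refit on the training part of each fold
pipe = make_pipeline(StandardScaler(), LassoCV(max_iter=2200))
scores = cross_val_score(pipe, X_tr, y_tr, cv=3)
print(scores.mean())

# Fit on all training rows, then score once on the held-out rows
pipe.fit(X_tr, y_tr)
print(pipe.score(X_te, y_te))
```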

# 
Compare different models using 3-fold cross-validated R2 to decide which is best:

In [30]:
lr = LinearRegression()
lr_scores = cross_val_score(lr, X_train, y_train, cv=3)
lr_scores.mean()

0.8562193835181108

In [31]:
lasso = LassoCV(n_alphas=300, max_iter = 2200)
lasso_scores = cross_val_score(lasso, X_train, y_train, cv=3)
lasso_scores.mean()

0.8573872726904174

In [32]:
ridge = RidgeCV(alphas=np.linspace(.1, 10, 100))
ridge_scores = cross_val_score(ridge, X_train, y_train, cv=3)
ridge_scores.mean()

0.8572458457653943
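
When the mean scores are this close, it can help to look at how much the score varies between folds before reading anything into the ranking. A minimal sketch, assuming synthetic data and the same three model families:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score

# Synthetic data, used only to show the comparison pattern
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=1)

candidates = {
    "linear": LinearRegression(),
    "lasso": LassoCV(max_iter=2200),
    "ridge": RidgeCV(alphas=np.linspace(0.1, 10, 100)),
}

# Report the mean and the fold-to-fold standard deviation of R2 per model
summary = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=3)
    summary[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: mean R2 = {fold_scores.mean():.4f}, std = {fold_scores.std():.4f}")
```

If the gap between two means is smaller than the spread across folds, the models are effectively tied on this metric.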

# 
Lasso scores marginally the best (0.8574, versus 0.8572 for Ridge and 0.8562 for linear regression), so we will move forward and fit our variables to a Lasso model:


In [33]:
lasso.fit(X_train, y_train)

LassoCV(max_iter=2200, n_alphas=300)

# 
Training R2 score:

In [34]:
lasso.score(X_train, y_train)

0.884910211858198

# 
Testing R2 score:

In [35]:
lasso.score(X_test, y_test)

0.9135180771117881

# 
Cross-validation R2 score (default 5-fold):

In [36]:
cross_val_score(lasso, X_train, y_train).mean()

0.8675870570772304

# 
Calculate a baseline RMSE for comparison, using the average sale price as the prediction for every row of data:

In [37]:
baseline_preds = [round(train_cleaned_numeric['SalePrice'].mean()) for x in y]
print('This is our baseline RMSE:')
print('')
print(mean_squared_error(np.exp(y), baseline_preds, squared = False))

This is our baseline RMSE:

79281.58901824827
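
A useful sanity check on this kind of baseline: when every prediction is the mean of the target, the RMSE equals the population standard deviation of the target. The sketch below shows this on a handful of made-up prices. It takes the square root of `mean_squared_error` with NumPy instead of passing `squared=False`, since recent scikit-learn releases deprecate that argument in favor of `root_mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up sale prices, for illustration only
prices = np.array([130500.0, 220000.0, 109000.0, 174000.0, 138500.0])

# Predict the mean price for every row
baseline_preds = np.full(len(prices), prices.mean())

# RMSE as the square root of MSE (works across scikit-learn versions)
baseline_rmse = np.sqrt(mean_squared_error(prices, baseline_preds))
print(baseline_rmse)

# Matches the population standard deviation (ddof=0) of the target
print(np.isclose(baseline_rmse, prices.std()))
```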


# 
Calculate the mean squared error for our training and test predictions and take the square root to calculate RMSE. Predictions are converted back to dollars with `np.exp` first:

In [38]:
print('Training Score:')
mean_squared_error(np.exp(y_train), np.exp(lasso.predict(X_train)), squared = False)

Training Score:


38630.58071966364

In [39]:
print('Testing Score:')
mean_squared_error(np.exp(y_test), np.exp(lasso.predict(X_test)), squared = False)

Testing Score:


20893.57523717234

# 
Get the test predictions in order to calculate residuals (actual minus predicted, in dollars):

In [40]:
test_preds = np.exp(lasso.predict(X_test))

In [41]:
residuals = (np.exp(y_test) - test_preds)

In [42]:
print(residuals.mean())
print('')
print(residuals.max())
print('')
print(residuals.min())

2115.6616044818347

141083.35022248398

-58922.283573638386


# 
Display the coefficients to determine how impactful each variable is in the model. Because the features were standardized, the coefficients are on a comparable scale:

In [43]:
model_coefficients = pd.DataFrame(list(zip(features, lasso.coef_)), columns =['Features', 'Coefficient'])
model_coefficients

Unnamed: 0,Features,Coefficient
0,Overall Qual,0.111918
1,Year Built,0.075632
2,BsmtFin SF 1,0.037438
3,1st Flr SF,0.106711
4,2nd Flr SF,0.088726
5,Kitchen Qual,0.029392
6,remodeled,0.003252
7,post_recession_sale,-0.007374
8,HasAlley,-0.0
9,has_brick_face,0.00246
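
Since the features were standardized and the target is the log of the sale price, each coefficient can be read as the change in log price for a one-standard-deviation increase in that feature, holding the others fixed. Under that reading, `exp(coef) - 1` gives the approximate percentage change in price. The sketch below applies this to the ten coefficients shown above. The column name `pct_change_per_sd` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the table above
coefs = pd.DataFrame({
    "Features": ["Overall Qual", "Year Built", "BsmtFin SF 1", "1st Flr SF",
                 "2nd Flr SF", "Kitchen Qual", "remodeled",
                 "post_recession_sale", "HasAlley", "has_brick_face"],
    "Coefficient": [0.111918, 0.075632, 0.037438, 0.106711, 0.088726,
                    0.029392, 0.003252, -0.007374, -0.0, 0.00246],
})

# Percentage change in price per one-standard-deviation increase
coefs["pct_change_per_sd"] = (np.exp(coefs["Coefficient"]) - 1) * 100

# Rank by absolute size so the strongest effects come first
order = coefs["Coefficient"].abs().sort_values(ascending=False).index
ranked = coefs.reindex(order)
print(ranked.round(2))
```

For example, a one-standard-deviation increase in `Overall Qual` corresponds to a price roughly 11.8% higher. A coefficient of zero, as for `HasAlley`, means Lasso has dropped that feature from the model.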


## Model Analysis

This model is significantly better than our baseline. Our test RMSE (about 20,894) is roughly 74% lower than the baseline RMSE (about 79,282).
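
The relative improvement can be checked directly from the two RMSE values reported above. Note that the baseline was computed over all rows while the test RMSE uses only the held-out rows, so the comparison is approximate.

```python
# RMSE values copied from the outputs above
baseline_rmse = 79281.58901824827
test_rmse = 20893.57523717234

# Fraction by which the test RMSE falls below the baseline
reduction = (baseline_rmse - test_rmse) / baseline_rmse
print(f"{reduction:.1%}")
```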
###### 
Our training R2 score (0.885) is lower than our testing R2 score (0.914). Lasso regularization accepts slightly higher bias in exchange for lower variance, and the higher test score suggests the model performs well on new data, which is the point of the model. The size of this gap also depends on the random train/test split.
###### 
Key to our model was applying a logarithmic transformation to the Sale Price. This created a more normal distribution and significantly improved our model. We applied the same transformation to the Lot Area and Overall Condition independent variables (`log_lot_area` and `log_overall_cond`).
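
The effect of the log transformation on a right-skewed variable can be sketched with synthetic data. The lognormal sample below is an assumption standing in for sale prices. It is not drawn from the housing set.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (lognormal), for illustration only
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=2000))

# Skewness before and after taking the natural log
skew_raw = prices.skew()
skew_log = np.log(prices).skew()
print(skew_raw, skew_log)

# np.exp undoes the transform, which is how predictions return to dollars
round_trip_ok = np.allclose(np.exp(np.log(prices)), prices)
print(round_trip_ok)
```

The skewness of the logged series should be much closer to zero than that of the raw series.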
###### 
This model allows us to infer that the selected variables help predict the sale price of a home. I will elaborate on this in the conclusions portion of this report:

[Link to Conclusions](Conclusions.ipynb)

[Back to Table of Contents](../README.md)