## Overview

__Business Problem__

Businesses find it difficult to assess the appropriate value of homes in certain neighborhoods and markets. More precise, more confident valuations are needed to support decision-making, and a data-driven, model-based approach can help reduce human bias.

Building developers find it difficult to weigh how much individual attributes matter to potential home buyers. A model-based approach can help assess which factors drive the overall value of a home.

__Inference Problem__

- X: 79 explanatory variables describing aspects of residential homes in Ames, Iowa
- y: Home price (`SalePrice`)

Develop a model to predict home price based on a collection of attributes about the home and surrounding neighborhood.

__Methodology__

- Exploratory data analysis
    - Use pandas-profiling in minimal mode for an overview
    - Develop calculated metrics using numerical variables
    - Apply one-hot encoding to relevant categorical variables
- Model setup
    - Baseline set of regression models
        - Linear Regression
        - Random Forest
        - Gradient Boosting
- Model evaluation and iteration
    - Evaluate performance metrics: $R^2$, MSE, MAE, RMSE, [placeholder]
    - Assess feature importance, apply dimensionality reduction
    - Determine which models to exclude, features to exclude/include/adjust
    - Repeat model setup
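
The baseline comparison described in the methodology could be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic (`X_demo`, `y_demo` are stand-ins, not the Ames features), and the choice of 5-fold cross-validated $R^2$ and the hyperparameters shown are assumptions rather than settings taken from this notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption): 200 rows, 3 numeric features,
# a linear signal plus Gaussian noise. Swap in the real X and y when ready.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

# The three baseline regressors named in the methodology
baseline_models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# Mean 5-fold cross-validated R^2 for each model
cv_scores = {}
for name, model in baseline_models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    cv_scores[name] = scores.mean()
    print("{}: mean CV R^2 = {:.3f}".format(name, cv_scores[name]))
```

Keeping the models in a dictionary makes it easy to drop or add candidates between iterations without touching the evaluation loop.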

## Import and Setup

In [34]:
import numpy as np
import pandas as pd
import pandas_profiling as pp

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error


In [None]:
from sklearn.preprocessing import StandardScaler
from scipy.stats import chi2_contingency 
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import SGDRegressor, Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

In [5]:
house_train = pd.read_csv('../data/train.csv')
house_test = pd.read_csv('../data/test.csv')

In [6]:
print("\033[1m" + "Dataframe Shape" + "\033[0m")
print(house_train.shape)
print("\n")

print("\033[1m" + "Column Information" + "\033[0m")
house_train.info()
print("\n")

print("\033[1m" + "Numeric Column Information" + "\033[0m")
print(house_train.describe())
print("\n")

print("\033[1m" + "Categorical Column Unique Values" + "\033[0m")
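# Print unique values only for low-cardinality columns
# (10 or fewer distinct values, counting NaN as a value)
for col in house_train:
    unique_values = house_train[col].unique()
    if len(unique_values) <= 10:
        print('{}: {}'.format(col, unique_values))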

Dataframe Shape
(1460, 81)


Column Information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   o

PavedDrive: ['Y' 'N' 'P']
PoolArea: [  0 512 648 576 555 480 519 738]
PoolQC: [nan 'Ex' 'Fa' 'Gd']
Fence: [nan 'MnPrv' 'GdWo' 'GdPrv' 'MnWw']
MiscFeature: [nan 'Shed' 'Gar2' 'Othr' 'TenC']
YrSold: [2008 2007 2006 2009 2010]
SaleType: ['WD' 'New' 'COD' 'ConLD' 'ConLI' 'CWD' 'ConLw' 'Con' 'Oth']
SaleCondition: ['Normal' 'Abnorml' 'Partial' 'AdjLand' 'Alloca' 'Family']
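
Low-cardinality columns like those listed above are natural candidates for the one-hot encoding step in the methodology. The sketch below uses a small hand-made frame (`demo`, a hypothetical stand-in for `house_train`) and `pd.get_dummies`. Passing `dummy_na=True` is an assumption about how missing values should be treated: it keeps NaN as its own indicator column instead of silently dropping it.

```python
import pandas as pd

# Hand-made stand-in rows (hypothetical), using category labels seen in the output above
demo = pd.DataFrame({
    "PavedDrive": ["Y", "N", "P", "Y"],
    "Fence": [None, "MnPrv", "GdWo", None],
})

# One indicator column per category; dummy_na=True adds a NaN indicator as well
encoded = pd.get_dummies(demo, columns=["PavedDrive", "Fence"], dummy_na=True)
print(encoded.columns.tolist())
```

Whether NaN deserves its own column depends on what a missing value means for each variable, so this choice should be checked against the dataset's data description.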


In [15]:
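# pandas_profiling has since been renamed ydata-profiling, so newer environments
# may need `from ydata_profiling import ProfileReport` instead
from pandas_profiling import ProfileReport
prof = ProfileReport(house_train, minimal=True)  # minimal=True skips the most expensive computations
prof.to_file(output_file='output.html')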


In [28]:
initial_features = ['LotArea', 'BedroomAbvGr', 'FullBath']

X, y = house_train[initial_features], house_train['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [38]:
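reg = LinearRegression().fit(X_train, y_train)
# score() returns R^2; it is computed on the training split here,
# so it does not measure performance on held-out data
r2 = reg.score(X_train, y_train)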

print("The model's R^2 score is {:.2f}".format(r2))

The model's R^2 score is 0.35


In [41]:
# reg.coef_ holds one coefficient per feature, in the order of initial_features
x1, x2, x3 = reg.coef_
intercept = reg.intercept_
print("The initial model function is: y = {:.2f}x1 + {:.2f}x2 + {:.2f}x3 + {:.2f}".format(x1, x2, x3, intercept))

The initial model function is: y = 1.50x1 + -5588.32x2 + 77447.59x3 + 59429.81
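
To make the printed model function concrete, the sketch below applies it by hand to a single house. The coefficients are copied, rounded to two decimals, from the output above, where x1, x2 and x3 correspond to `LotArea`, `BedroomAbvGr` and `FullBath`. The house itself is hypothetical and invented for illustration.

```python
# Coefficients and intercept copied (rounded) from the printed model function
coef_lot_area = 1.50
coef_bedrooms = -5588.32
coef_full_bath = 77447.59
demo_intercept = 59429.81

# Hypothetical house: 10,000 sq ft lot, 3 bedrooms above grade, 2 full baths
lot_area, bedrooms, full_baths = 10000, 3, 2

predicted_price = (
    coef_lot_area * lot_area
    + coef_bedrooms * bedrooms
    + coef_full_bath * full_baths
    + demo_intercept
)
print("Predicted price: {:.2f}".format(predicted_price))
```

The negative bedroom coefficient means that, with lot area and bath count held fixed, each additional bedroom lowers this model's prediction by roughly 5,588.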


In [37]:
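y_pred = reg.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
# mean_squared_error returns the MSE by default; taking the square root directly
# avoids the `squared` argument, which newer scikit-learn releases deprecate
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
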
print("The mean absolute error is {:.2f}".format(mae))
print("The mean squared error is {:.2f}".format(mse))
print("The root mean squared error is {:.2f}".format(rmse))

The mean absolute error is 45417.52
The mean squared error is 4831659711.85
The root mean squared error is 69510.14
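
For reference, the metrics reported above can be written out in plain Python. This is a minimal sketch to make the definitions explicit, not a replacement for the scikit-learn functions. The helper name `regression_metrics` and the toy values are made up for illustration.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for two equal-length sequences."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    rmse = math.sqrt(mse)
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Toy values (not taken from the housing data)
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(toy_true, toy_pred)
print(metrics)
```

MAE and RMSE are both expressed in the units of the target, and RMSE is never smaller than MAE. A gap as wide as the one above (about 69,510 versus 45,418) suggests that a few large errors account for much of the squared error.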
