This notebook contains the work for Steps 4 and 5 of the Data Science Method:

**The Data Science Method**  


1.   Problem Identification 

2.   Data Wrangling 
  * Data Collection
  * Data Organization
  * Data Definition
  * Data Cleaning

3.   Exploratory Data Analysis
  * Build data profile tables and plots
    - Outliers & Anomalies
  * Explore data relationships
  * Identification and creation of features

4.   Pre-processing and Training Data Development
  * Create dummy or indicator features for categorical variables
  * Standardize the magnitude of numeric features
  * Split into testing and training datasets
  * Apply scaler to the testing set

<b>5.   Modeling 
  * Fit Models with Training Data Set
  * Review Model Outcomes — Iterate over additional models as needed.
  * Identify the Final Model</b>

6.   Documentation
  * Review the Results
  * Present and share your findings - storytelling
  * Finalize Code 
  * Finalize Documentation

### STEP 4: Pre-processing and Training Data Development

In [1]:
#load python packages
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
from scipy.stats import chi2_contingency
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn import svm
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVR
from sklearn.metrics import r2_score, explained_variance_score,mean_absolute_error
from sklearn.ensemble import RandomForestRegressor


In [2]:
# set options
pd.set_option('display.max_rows', 1500)

In [3]:
# load the data saved from step 3
# use a forward slash: '\s' inside a normal string is an invalid escape sequence
df = pd.read_csv('data/step3_output.csv')
df.head()


| |SCHEDNUM_x|RECEPTION_NUM|INSTRUMENT|SALE_YEAR|SALE_MONTHDAY|RECEPTION_DATE|SALE_PRICE|GRANTOR|GRANTEE|CLASS|...|UNITS|ASMT_APPR_LAND|TOTAL_VALUE|ASDLAND|ASSESS_VALUE|ASMT_TAXABLE|ASMT_EXEMPT_AMT|NBHD_1_CN_y|LEGL_DESCRIPTION|IMPROVE_VALUE|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|14101001000|2008138043|WD|2008|703|20081008|10.0|ATKINSON,RUSSELL|DREAM BUILDERS LLC|R|...|1|103000|530200|7365|37910|37920|0|N GREEN VALLEY|GREEN VALLEY RANCH FLG #36 B1 L1|427200|
|1|14101001000|2009074518|WD|2009|605|20090615|299000.0|ATKINSON,RUSSELL|PADBURY,CHRISTOPHER R|R|...|1|103000|530200|7365|37910|37920|0|N GREEN VALLEY|GREEN VALLEY RANCH FLG #36 B1 L1|427200|
|2|14101001000|2015157653|WD|2015|1102|20151109|415000.0|PADBURY,CHRISTOPHER R|MACIEL,HORACIO PEREZ|R|...|1|103000|530200|7365|37910|37920|0|N GREEN VALLEY|GREEN VALLEY RANCH FLG #36 B1 L1|427200|
|3|14101001000|2009002129|WD|2008|1024|20090108|10.0|DREAM BUILDERS LLC|ATKINSON,RUSSELL|R|...|1|103000|530200|7365|37910|37920|0|N GREEN VALLEY|GREEN VALLEY RANCH FLG #36 B1 L1|427200|
|4|14101002000|2010094573|WD|2010|823|20100824|350000.0|SHEARON,MARK H &|EFREM,TEWEDROS|R|...|1|90400|572600|6464|40941|40940|0|N GREEN VALLEY|GREEN VALLEY RANCH FLG #36 B1 L2|482200|


## Fields by group, with the handling decision for each

|Field|type|group|definition|notes|
|---|---|---|---|---|
|CCAGE_RM|float64|age|Remodel Year||
|CCYRBLT|float64|age|Year Built||
|RECEPTION_DATE|int64|date|Clerk & Recorder's Reception Date|drop|
|SALE_MONTHDAY|int64|date|Sale Month/Day||
|SALE_YEAR|int64|date|Sale Year||
|PIN|int64|id|Assessor's Property Identification Number|drop|
|RECEPTION_NUM|int64|id|Input Key|drop|
|SCHEDNUM_x|int64|id|Input Key|drop|
|CO_OWNER|object|ignore|Co-Owner|drop|
|MKT_CLUS|float64|loc|||
|NBHD_1_CN_x|object|loc|Neighborhood name - x|category|
|NBHD_1_CN_y|object|loc|Neighborhood name - y|drop|
|NBHD_1_x|int64|loc|Neighborhood code|ignore|
|SITE_DIR|object|loc|Site Street Direction|ignore|
|SITE_MODE|object|loc|Site Street Type|ignore|
|SITE_MORE|object|loc|Site Unit Number|ignore|
|SITE_NAME|object|loc|Site Street Name|ignore|
|SITE_NBR|int64|loc|Site Street Number|ignore|
|TAX_DIST|object|loc|Tax District|category, ignore?|
|ZONE10|object|loc|Zone|category, ignore?|
|CD|int64|other|Building Number|drop|
|CLASS|object|other|#N/A|drop|
|D_CLASS|int64|other|Property Use Class|category, ignore?|
|D_CLASS_CN_x|object|other|Property Use Class Definition - x|drop|
|D_CLASS_CN_y|object|other|Property Use Class Definition - y|drop|
|GRANTEE|object|other|??|drop|
|GRANTOR|object|other|??|drop|
|INSTRUMENT|object|other|??|drop|
|LEGL_DESCRIPTION|object|other|Description of the parcel per the deed |drop|
|PROP_CLASS|int64|other|Property Class(ASMT-PROP-CODE)|only select 1112|
|PROPERTY_CLASS|object|other|Property Class Description|drop|
|STYLE_CN|object|other|Architecture Style Code Definition|similar to STORY but does not agree|
|OWNER|object|owner|Owner|drop|
|OWNER_APT|object|owner|Street Mailing Unit Number|drop|
|OWNER_CITY|object|owner|Mailing City|drop|
|OWNER_DIR|object|owner|Street Mailing Direction|drop|
|OWNER_NUM|object|owner|Street Mailing Number|drop|
|OWNER_ST|object|owner|Street Mailing Street Name|drop|
|OWNER_STATE|object|owner|Mailing State|drop|
|OWNER_TYPE|object|owner|Street Mailing Type|drop|
|OWNER_ZIP|object|owner|Zip Code|drop|
|AREA_ABG|int64|size|Above Grade Improvement Area||
|BED_RMS|int64|size|Number of bedrooms above grade||
|BSMT_AREA|int64|size|Basement Square Footage||
|FBSMT_SQFT|int64|size|Finished Basement Area||
|FULL_B|float64|size|Total number of full baths||
|GRD_AREA|int64|size|Garden Level Square Footage||
|HLF_B|float64|size|Total number of half baths||
|LAND_SQFT|float64|size|Land Area||
|OFCARD|int64|size|Number of Buildings|drop|
|STORY|int64|size|Stories||
|UNITS|int64|size|Number of Units||
|ASDLAND|int64|value|Assessed Land Value|drop|
|ASMT_APPR_LAND|int64|value|Actual Land Value||
|ASMT_EXEMPT_AMT|int64|value|Exempt Amount|drop|
|ASMT_TAXABLE|int64|value|Taxable Amount|drop|
|ASSESS_VALUE|int64|value|Assessed Total Value|drop|
|IMPROVE_VALUE|int64|value|Calculated=Tot Val - Land||
|SALE_PRICE|float64|value|Sale Price||
|TOTAL_VALUE|int64|value|Actual Total Value|drop|


In [4]:
# only select PROP_CLASS = 1112 (Single Family Residential)
df = df[df.PROP_CLASS == 1112]

In [5]:
# instead of looking at the year the house was built (CCYRBLT), transform it to years since built
df = df.assign(AGE = df.SALE_YEAR-df.CCYRBLT)

In [6]:
# instead of looking at year of remodel (CCAGE_RM), transform it to years since remodel
df=df.assign(RM_AGE = df.SALE_YEAR-df.CCAGE_RM)
# if remodel year is after sale (RM_AGE will be negative), then reset RM_AGE to AGE
df.loc[df.RM_AGE < 0, 'RM_AGE'] = df.AGE
# set remodel age to home age if no remodel year is available
df.loc[df.CCAGE_RM == 0, 'RM_AGE'] = df.AGE
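
The two fallback rules above can be checked on a few rows. The sketch below is a minimal illustration, assuming a hypothetical three-row frame (`toy`) that reuses the notebook's column names; it combines both conditions into one mask so the fallback to `AGE` is applied in a single step.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the notebook
toy = pd.DataFrame({
    "SALE_YEAR": [2008, 2008, 2020],
    "CCYRBLT": [1880.0, 1881.0, 2018.0],
    "CCAGE_RM": [2001.0, 2012.0, 0.0],  # 0 means no remodel year recorded
})

# Years since built and years since remodel, both measured at the sale
toy = toy.assign(
    AGE=toy.SALE_YEAR - toy.CCYRBLT,
    RM_AGE=toy.SALE_YEAR - toy.CCAGE_RM,
)

# Fall back to AGE when the remodel is dated after the sale (negative RM_AGE)
# or when there is no remodel year at all
fallback = (toy.RM_AGE < 0) | (toy.CCAGE_RM == 0)
toy.loc[fallback, "RM_AGE"] = toy.loc[fallback, "AGE"]

print(toy)
```

The first row keeps its real remodel age, the second falls back because the remodel postdates the sale, and the third falls back because no remodel year exists.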

In [7]:
# show results of new columns
print(df.groupby(['SALE_YEAR','CCYRBLT','CCAGE_RM','AGE','RM_AGE']).size())

SALE_YEAR  CCYRBLT  CCAGE_RM  AGE    RM_AGE
2008       1876.0   0.0       132.0  132.0       1
           1880.0   0.0       128.0  128.0       1
                    2001.0    128.0  7.0         1
           1881.0   0.0       127.0  127.0       2
                    2012.0    127.0  127.0       1
                                              ... 
2020       2018.0   0.0       2.0    2.0        31
                    2018.0    2.0    2.0        14
                    2019.0    2.0    1.0         1
           2019.0   0.0       1.0    1.0       171
                    2019.0    1.0    1.0        18
Length: 24363, dtype: int64


In [8]:
# drop columns that are not going to be used
df.drop(['RECEPTION_DATE', 'PIN', 'RECEPTION_NUM', 'SCHEDNUM_x', 'CO_OWNER', 'NBHD_1_CN_y', 'NBHD_1_x'
         , 'SITE_DIR', 'SITE_MODE', 'SITE_MORE', 'SITE_NAME', 'SITE_NBR', 'TAX_DIST', 'ZONE10', 'CD', 'CLASS'
         , 'D_CLASS', 'D_CLASS_CN_x', 'D_CLASS_CN_y', 'GRANTEE', 'GRANTOR', 'INSTRUMENT', 'LEGL_DESCRIPTION'
         , 'PROP_CLASS', 'PROPERTY_CLASS', 'STYLE_CN', 'OWNER', 'OWNER_APT', 'OWNER_CITY', 'OWNER_DIR', 'OWNER_NUM'
         , 'OWNER_ST', 'OWNER_STATE', 'OWNER_TYPE', 'OWNER_ZIP', 'OFCARD', 'ASDLAND', 'ASMT_EXEMPT_AMT'
         , 'ASMT_TAXABLE', 'ASSESS_VALUE', 'TOTAL_VALUE', 'CCYRBLT', 'CCAGE_RM'], axis=1, inplace=True)

In [9]:
# drop any rows with NaN values
df.dropna(inplace=True)

In [10]:
# change categorical values (NBHD_1_CN_x) to indicators
df_inds = pd.get_dummies(data=df, columns=['NBHD_1_CN_x'],drop_first=True)
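
To see what `drop_first=True` does, here is a small sketch on a hypothetical four-row frame (`toy`). The neighborhood names are taken from the coefficient table later in the notebook, but the rows themselves are made up.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "AREA_ABG": [1000, 1500, 2000, 1200],
    "NBHD_1_CN_x": ["HILLTOP", "COUNTRY CLUB", "HILLTOP", "NORTHFIELD"],
})

# Same call as above: one indicator column per category, minus the first
toy_inds = pd.get_dummies(data=toy, columns=["NBHD_1_CN_x"], drop_first=True)

print(toy_inds.columns.tolist())
print(toy_inds)
```

Categories are ordered alphabetically, so `COUNTRY CLUB` is the dropped level here. A row from the dropped category has every indicator off, which makes that category the baseline against which the other neighborhood coefficients are read.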

In [11]:
df_inds.describe().T

| |count|mean|std|min|25%|50%|75%|max|
|---|---|---|---|---|---|---|---|---|
|SALE_YEAR|141505.0|2014.118434|3.506308|2008.0|2011.0|2014.0|2017.0|2020.0|
|SALE_MONTHDAY|141505.0|669.957351|328.937146|101.0|412.0|629.0|926.0|1231.0|
|SALE_PRICE|141505.0|313035.398777|370151.400663|1.0|85700.0|265000.0|430000.0|49500000.0|
|MKT_CLUS|141505.0|16.446479|9.742234|1.0|8.0|16.0|24.0|54.0|
|LAND_SQFT|141505.0|6518.791541|2813.477368|0.0|4846.0|6250.0|7380.0|282000.0|
|AREA_ABG|141505.0|1570.051242|823.710111|226.0|1000.0|1305.0|1904.0|16778.0|
|BSMT_AREA|141505.0|670.464104|569.504749|0.0|0.0|747.0|1058.0|8712.0|
|FBSMT_SQFT|141505.0|456.587174|507.854015|0.0|0.0|316.0|851.0|8276.0|
|GRD_AREA|141505.0|33.882146|147.041963|0.0|0.0|0.0|0.0|2875.0|
|STORY|141505.0|1.38405|0.529795|1.0|1.0|1.0|2.0|3.0|


In [17]:
# Create feature and target arrays
# Series.ravel() is deprecated in recent pandas; to_numpy() returns the same 1-D array
y = df_inds['SALE_PRICE'].to_numpy()
X = df_inds.drop(['SALE_PRICE','IMPROVE_VALUE','ASMT_APPR_LAND'], axis=1)

In [18]:
# now scale the X data
scaler = StandardScaler().fit(X)
X_scaled=scaler.transform(X)

In [19]:
# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2, random_state=42)
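
The outline at the top lists the split before "Apply scaler to the testing set", whereas the cells above fit the scaler on all of `X` and split afterwards, which lets test-set statistics influence the scaling. A minimal sketch of the split-first order follows, assuming synthetic data (`X_toy`, `y_toy`) and hypothetical variable names chosen so they do not overwrite the notebook's own.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and sale prices (assumption)
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=1500, scale=800, size=(200, 3))
y_toy = 200 * X_toy[:, 0] + rng.normal(scale=1e4, size=200)

# 1. Split first, with the same test_size and random_state as the notebook
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# 2. Learn the mean and standard deviation from the training rows only
toy_scaler = StandardScaler().fit(X_tr)

# 3. Apply that same fitted scaler to both sets
X_tr_scaled = toy_scaler.transform(X_tr)
X_te_scaled = toy_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3), X_te_scaled.mean(axis=0).round(3))
```

The training columns come out with mean 0 and unit variance by construction; the test columns are only approximately centered, which is expected because the scaler never saw them.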

## Code for original runs
models = [LinearRegression(),
          RandomForestRegressor(n_estimators=100, max_features='sqrt'),
          KNeighborsRegressor(n_neighbors=6),
          SVR(kernel='linear'),
          # LogisticRegression is a classifier; it produced no R2 score in the results below
          LogisticRegression()
          ]
 
rows = []  # one dict of results per model


for model in models:
    # get model name
    m = str(model)
    print(model)
    tmp = {'Model': m[:m.index('(')]}
    # fit model on training dataset
    model.fit(X_train, y_train)
    # predict prices for test dataset and calculate r^2
    tmp['R2_Price'] = r2_score(y_test, model.predict(X_test))
    print(tmp)
    # write obtained data
    rows.append(tmp)
 
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
TestModels = pd.DataFrame(rows)
TestModels.set_index('Model', inplace=True)
 
fig, axes = plt.subplots(ncols=1, figsize=(10, 4))
TestModels.R2_Price.plot(ax=axes, kind='bar', title='R2_Price')
plt.show()

## Results of original runs

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

{'Model': 'LinearRegression', 'R2_Price': 0.45273788242807833}

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='sqrt', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
                      
{'Model': 'RandomForestRegressor', 'R2_Price': 0.3711615155545427}

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=6, p=2,
                    weights='uniform')
                    
{'Model': 'KNeighborsRegressor', 'R2_Price': 0.2364313459928018}

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
    kernel='linear', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
    
{'Model': 'SVR', 'R2_Price': 0.3136196404697671}

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Fit Models with Training Data Set

### Test: Linear Regression, Random Forest, K Nearest Neighbors, SVR, and Logistic Regression

### Model 1: Linear Regression

In [23]:
lm = LinearRegression()
model1 = lm.fit(X_train,y_train)



In [33]:
y_pred =model1.predict(X_test)

# show model results without assessed values included
print('explained variance score: ',explained_variance_score(y_test, y_pred))
print('MAE: ',mean_absolute_error(y_test, y_pred))
print('R2 Score: ',r2_score(y_test, y_pred))
print('intercept: ',lm.intercept_)
pd.DataFrame(abs(lm.coef_),X.columns,columns=['Coeff']).sort_values(by=['Coeff'],ascending=False).head(10)

explained variance score:  0.43084269160439426
MAE:  161026.4060726821
R2 Score:  0.4308415412063876
intercept:  312963.7367260646


| |Coeff|
|---|---|
|AREA_ABG|101630.955503|
|SALE_YEAR|51989.681367|
|NBHD_1_CN_x_COUNTRY CLUB|45979.621083|
|RM_AGE|41388.844287|
|NBHD_1_CN_x_WASHINGTON PK|37045.481762|
|FBSMT_SQFT|32115.04417|
|NBHD_1_CN_x_HILLTOP|30575.498394|
|LAND_SQFT|25248.868651|
|NBHD_1_CN_x_NORTHFIELD|24490.061086|
|NBHD_1_CN_x_W WASHINGTON PK|23120.546345|
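
The table ranks features by `abs(lm.coef_)`, which hides whether a feature pushes the price up or down. Because the features were standardized, each coefficient is the price change per one standard deviation of that feature. The sketch below, assuming synthetic data and hypothetical names (`toy`, `lm_toy`), keeps the sign while still sorting by magnitude.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: price rises with area and falls with age (assumption)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "AREA_ABG": rng.normal(1500, 800, 500),
    "AGE": rng.normal(50, 20, 500),
})
price = 200 * toy.AREA_ABG - 1000 * toy.AGE + rng.normal(0, 1e4, 500)

# Standardize, fit, and label the coefficients with the column names
X_s = StandardScaler().fit_transform(toy)
lm_toy = LinearRegression().fit(X_s, price)
coefs = pd.Series(lm_toy.coef_, index=toy.columns, name="Coeff")

# Order by absolute size but report the signed values
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Applied to the notebook's model, the same reordering would show, for example, whether `RM_AGE` enters with a positive or a negative sign.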


In [22]:
# show model results with assessed values included
print('explained variance score: ',explained_variance_score(y_test, y_pred))
print('MAE: ',mean_absolute_error(y_test, y_pred))
print('intercept: ',lm.intercept_)
pd.DataFrame(abs(lm.coef_),X.columns,columns=['Coeff']).sort_values(by=['Coeff'],ascending=False)

explained variance score:  0.43084269160439426
MAE:  161026.4060726821
intercept:  312963.7367260646


| |Coeff|
|---|---|
|AREA_ABG|101630.955503|
|SALE_YEAR|51989.681367|
|NBHD_1_CN_x_COUNTRY CLUB|45979.621083|
|RM_AGE|41388.844287|
|NBHD_1_CN_x_WASHINGTON PK|37045.481762|
|FBSMT_SQFT|32115.04417|
|NBHD_1_CN_x_HILLTOP|30575.498394|
|LAND_SQFT|25248.868651|
|NBHD_1_CN_x_NORTHFIELD|24490.061086|
|NBHD_1_CN_x_W WASHINGTON PK|23120.546345|


### Model 2: Random Forest

In [34]:
rf = RandomForestRegressor(n_estimators=100, max_features='sqrt')
model2 = rf.fit(X_train,y_train)
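
The forest above is created without `random_state`, so each fit draws different bootstrap samples and feature subsets, and the scores can shift between runs. A minimal sketch of pinning the seed, assuming synthetic data from `make_regression` and an arbitrary seed of 42:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the training data (assumption)
X_toy, y_toy = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=0
)

def fit_forest(seed):
    """Fit a forest with the notebook's max_features setting and a fixed seed."""
    rf_toy = RandomForestRegressor(
        n_estimators=50, max_features="sqrt", random_state=seed
    )
    return rf_toy.fit(X_toy, y_toy)

# Two fits with the same seed give the same predictions
pred_a = fit_forest(42).predict(X_toy)
pred_b = fit_forest(42).predict(X_toy)
same_seed = np.allclose(pred_a, pred_b)

print("identical predictions with the same seed:", same_seed)
```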



In [35]:
y_pred =model2.predict(X_test)

# show model results without assessed values included
print('explained variance score: ',explained_variance_score(y_test, y_pred))
print('MAE: ',mean_absolute_error(y_test, y_pred))
print('R2 score is: ',r2_score(y_test, y_pred))

explained variance score:  0.40940855938697407
MAE:  149111.62126241068
R2 score is:  0.4091512560234932
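
Explained variance and R2 are printed side by side for each model and differ only slightly here. The two coincide exactly when the residuals average to zero; explained variance ignores a constant bias in the predictions, while R2 penalizes it. The hand-computed sketch below uses four made-up values to show the relationship and cross-checks against the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, mean_absolute_error, r2_score

# Made-up values; the predictions are biased upward on average
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([120.0, 190.0, 330.0, 420.0])

resid = y_true - y_hat

# Mean absolute error: average size of the residuals
mae = np.mean(np.abs(resid))

# R2: 1 - (residual sum of squares) / (total sum of squares)
r2 = 1 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Explained variance: 1 - Var(residuals) / Var(y); the residual mean is subtracted out
evs = 1 - np.var(resid) / np.var(y_true)

print("MAE:", mae, "sklearn:", mean_absolute_error(y_true, y_hat))
print("R2: ", r2, "sklearn:", r2_score(y_true, y_hat))
print("EVS:", evs, "sklearn:", explained_variance_score(y_true, y_hat))
```

Since explained variance is never below R2, a visible gap between the two is a quick signal that a model's predictions are systematically high or low.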


### Model 3: K Nearest Neighbors

In [26]:
kn = KNeighborsRegressor(n_neighbors=6)
model3 = kn.fit(X_train,y_train)



In [31]:
y_pred =model3.predict(X_test)
# show model results without assessed values included
print('explained variance score: ',explained_variance_score(y_test, y_pred))
print('MAE: ',mean_absolute_error(y_test, y_pred))
print('R2 score is: ',r2_score(y_test, y_pred))

explained variance score:  0.2665056616452648
MAE:  162448.31895221604
R2 score is:  0.2665034784018232
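
`n_neighbors=6` is a fixed choice above. One way to iterate on it, in line with the "Review Model Outcomes" step, is a cross-validated search over a few candidate values. The sketch below is an illustration only: the data are synthetic and the candidate grid is a hypothetical choice, not a recommendation for this dataset.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption)
X_toy, y_toy = make_regression(
    n_samples=300, n_features=5, noise=15.0, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each CV training fold
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())

# make_pipeline names each step after its lowercased class name
param_grid = {"kneighborsregressor__n_neighbors": [3, 6, 12, 24]}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
grid.fit(X_toy, y_toy)

print(grid.best_params_, round(grid.best_score_, 3))
```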


### Model 4: SVR

In [28]:
sv = SVR(kernel='linear')
model4 = sv.fit(X_train,y_train)

In [29]:
# show model results without assessed values included
y_pred =model4.predict(X_test)

print('explained variance score: ',explained_variance_score(y_test, y_pred))
print('MAE: ',mean_absolute_error(y_test, y_pred))


explained variance score:  0.26435270260857247
MAE:  175630.75694343183


In [30]:
print('R2 score is: ',r2_score(y_test, y_pred))

R2 score is:  0.2628601075556224
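
SVR's `C` and `epsilon` defaults are expressed in the units of the target, so a target measured in hundreds of thousands of dollars may be fitted poorly with `C=1.0` and `epsilon=0.1`. One option to try, offered here as an assumption rather than something the notebook does, is to standardize the target with scikit-learn's `TransformedTargetRegressor`. The sketch uses synthetic data rescaled to a dollar-like range.

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features with a target shifted to a dollar-like scale (assumption)
X_toy, y_toy = make_regression(
    n_samples=400, n_features=5, noise=10.0, random_state=0
)
y_toy = 300_000 + 1_000 * y_toy

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Same settings as the notebook's SVR, fitted on the raw target
svr_raw = SVR(kernel="linear").fit(X_tr, y_tr)

# Same SVR, but the target is standardized for fitting and
# predictions are mapped back to the original units automatically
svr_scaled = TransformedTargetRegressor(
    regressor=SVR(kernel="linear"), transformer=StandardScaler()
).fit(X_tr, y_tr)

r2_raw = r2_score(y_te, svr_raw.predict(X_te))
r2_scaled = r2_score(y_te, svr_scaled.predict(X_te))
print("raw target R2:", r2_raw)
print("scaled target R2:", r2_scaled)
```

Whether this helps on the housing data would need to be checked on the real training and test sets.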
