This notebook contains the work for Step 4 of the Data Science Method:

**The Data Science Method**  


1.   Problem Identification

2.   Data Wrangling
  * Data Collection
  * Data Organization
  * Data Definition
  * Data Cleaning

3.   Exploratory Data Analysis
  * Build data profile tables and plots
    - Outliers & Anomalies
  * Explore data relationships
  * Identification and creation of features

<b>4.   Pre-processing and Training Data Development
  * Create dummy or indicator features for categorical variables
  * Standardize the magnitude of numeric features
  * Split into testing and training datasets
  * Apply scaler to the testing set</b>

5.   Modeling 
  * Fit Models with Training Data Set
  * Review Model Outcomes — Iterate over additional models as needed.
  * Identify the Final Model

6.   Documentation
  * Review the Results
  * Present and share your findings - storytelling
  * Finalize Code 
  * Finalize Documentation

In [1]:
#load python packages
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
from scipy.stats import chi2_contingency
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split



In [2]:
# set options
pd.set_option('display.max_rows', 1500)

In [3]:
# load the data saved from step 3
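# build the path with os.path.join so it is portable and avoids the invalid "\s" escape
df = pd.read_csv(os.path.join('data', 'step3_output.csv'))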
df.head()



|Unnamed: 0|SCHEDNUM_x|RECEPTION_NUM|INSTRUMENT|SALE_YEAR|SALE_MONTHDAY|RECEPTION_DATE|SALE_PRICE|GRANTOR|GRANTEE|CLASS|...|UNITS|ASMT_APPR_LAND|TOTAL_VALUE|ASDLAND|ASSESS_VALUE|ASMT_TAXABLE|ASMT_EXEMPT_AMT|NBHD_1_CN_y|LEGL_DESCRIPTION|IMPROVE_VALUE|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|14101001000|2008138043|WD|2008|703|20081008|10.0|ATKINSON,RUSSELL|DREAM BUILDERS LLC|R|...|1|103000|530200|7365|37910|37920|0|N GREEN VALLEY|GREEN VALLEY RANCH FLG #36 B1 L1|427200|
|1|14101001000|2009074518|WD|2009|605|20090615|299000.0|ATKINSON,RUSSELL|PADBURY,CHRISTOPHER R|R|...|1|103000|530200|7365|37910|37920|0|N GREEN VALLEY|GREEN VALLEY RANCH FLG #36 B1 L1|427200|
|2|14101001000|2015157653|WD|2015|1102|20151109|415000.0|PADBURY,CHRISTOPHER R|MACIEL,HORACIO PEREZ|R|...|1|103000|530200|7365|37910|37920|0|N GREEN VALLEY|GREEN VALLEY RANCH FLG #36 B1 L1|427200|
|3|14101001000|2009002129|WD|2008|1024|20090108|10.0|DREAM BUILDERS LLC|ATKINSON,RUSSELL|R|...|1|103000|530200|7365|37910|37920|0|N GREEN VALLEY|GREEN VALLEY RANCH FLG #36 B1 L1|427200|
|4|14101002000|2010094573|WD|2010|823|20100824|350000.0|SHEARON,MARK H &|EFREM,TEWEDROS|R|...|1|90400|572600|6464|40941|40940|0|N GREEN VALLEY|GREEN VALLEY RANCH FLG #36 B1 L2|482200|
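
The `Unnamed: 0` column in the output above is usually the row index that an earlier `to_csv` call wrote to the file. Assuming that is the case here, it would be carried into the feature matrix unless it is dropped or read back as the index. A minimal sketch with an in-memory CSV (the toy text below only stands in for `step3_output.csv`):

```python
import io

import pandas as pd

# Toy CSV standing in for step3_output.csv (illustrative only).
# The empty first header cell is what pandas displays as "Unnamed: 0".
csv_text = ",SALE_YEAR,SALE_PRICE\n0,2008,10.0\n1,2009,299000.0\n"

# Default read: the saved index comes back as an ordinary column.
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

Whether to use `index_col=0` or to add `Unnamed: 0` to the drop list is a judgment call; either keeps a meaningless row counter out of the model features.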


## Field listing by group, with the handling decision for each field

|Field|Type|Group|Definition|Notes|
|---|---|---|---|---|
|CCAGE_RM|float64|age|Remodel Year||
|CCYRBLT|float64|age|Year Built||
|RECEPTION_DATE|int64|date|Clerk & Recorder's Reception Date|drop|
|SALE_MONTHDAY|int64|date|Sale Month/Day||
|SALE_YEAR|int64|date|Sale Year||
|PIN|int64|id|Assessor's Property Identification Number|drop|
|RECEPTION_NUM|int64|id|Input Key|drop|
|SCHEDNUM_x|int64|id|Input Key|drop|
|CO_OWNER|object|ignore|Co-Owner|drop|
|MKT_CLUS|float64|loc|||
|NBHD_1_CN_x|object|loc|Neighborhood name - x|category|
|NBHD_1_CN_y|object|loc|Neighborhood name - y|drop|
|NBHD_1_x|int64|loc|Neighborhood code|ignore|
|SITE_DIR|object|loc|Site Street Direction|ignore|
|SITE_MODE|object|loc|Site Street Type|ignore|
|SITE_MORE|object|loc|Site Unit Number|ignore|
|SITE_NAME|object|loc|Site Street Name|ignore|
|SITE_NBR|int64|loc|Site Street Number|ignore|
|TAX_DIST|object|loc|Tax District|category, ignore?|
|ZONE10|object|loc|Zone|category, ignore?|
|CD|int64|other|Building Number|drop|
|CLASS|object|other|#N/A|drop|
|D_CLASS|int64|other|Property Use Class|category, ignore?|
|D_CLASS_CN_x|object|other|Property Use Class Definition - x|drop|
|D_CLASS_CN_y|object|other|Property Use Class Definition - y|drop|
|GRANTEE|object|other|??|drop|
|GRANTOR|object|other|??|drop|
|INSTRUMENT|object|other|??|drop|
|LEGL_DESCRIPTION|object|other|Description of the parcel per the deed |drop|
|PROP_CLASS|int64|other|Property Class (ASMT-PROP-CODE)|keep only 1112 (Single Family Residential)|
|PROPERTY_CLASS|object|other|Property Class Description|drop|
|STYLE_CN|object|other|Architecture Style Code Definition|similar to STORY but does not agree; drop|
|OWNER|object|owner|Owner|drop|
|OWNER_APT|object|owner|Street Mailing Unit Number|drop|
|OWNER_CITY|object|owner|Mailing City|drop|
|OWNER_DIR|object|owner|Street Mailing Direction|drop|
|OWNER_NUM|object|owner|Street Mailing Number|drop|
|OWNER_ST|object|owner|Street Mailing Street Name|drop|
|OWNER_STATE|object|owner|Mailing State|drop|
|OWNER_TYPE|object|owner|Street Mailing Type|drop|
|OWNER_ZIP|object|owner|Zip Code|drop|
|AREA_ABG|int64|size|Above Grade Improvement Area||
|BED_RMS|int64|size|Number of bedrooms above grade||
|BSMT_AREA|int64|size|Basement Square Footage||
|FBSMT_SQFT|int64|size|Finished Basement Area||
|FULL_B|float64|size|Total number of full baths||
|GRD_AREA|int64|size|Garden Level Square Footage||
|HLF_B|float64|size|Total number of half baths||
|LAND_SQFT|float64|size|Land Area||
|OFCARD|int64|size|Number of Buildings|drop|
|STORY|int64|size|Stories||
|UNITS|int64|size|Number of Units||
|ASDLAND|int64|value|Assessed Land Value|drop|
|ASMT_APPR_LAND|int64|value|Actual Land Value||
|ASMT_EXEMPT_AMT|int64|value|Exempt Amount|drop|
|ASMT_TAXABLE|int64|value|Taxable Amount|drop|
|ASSESS_VALUE|int64|value|Assessed Total Value|drop|
|IMPROVE_VALUE|int64|value|Calculated=Tot Val - Land||
|SALE_PRICE|float64|value|Sale Price||
|TOTAL_VALUE|int64|value|Actual Total Value|drop|


In [4]:
# drop columns that are not going to be used
drop_cols = [
    # date and id fields
    'RECEPTION_DATE', 'PIN', 'RECEPTION_NUM', 'SCHEDNUM_x',
    # co-owner and location fields
    'CO_OWNER', 'NBHD_1_CN_y', 'NBHD_1_x', 'SITE_DIR', 'SITE_MODE',
    'SITE_MORE', 'SITE_NAME', 'SITE_NBR', 'TAX_DIST', 'ZONE10',
    # other descriptive fields
    'CD', 'CLASS', 'D_CLASS', 'D_CLASS_CN_x', 'D_CLASS_CN_y', 'GRANTEE',
    'GRANTOR', 'INSTRUMENT', 'LEGL_DESCRIPTION', 'PROPERTY_CLASS', 'STYLE_CN',
    # owner mailing fields
    'OWNER', 'OWNER_APT', 'OWNER_CITY', 'OWNER_DIR', 'OWNER_NUM',
    'OWNER_ST', 'OWNER_STATE', 'OWNER_TYPE', 'OWNER_ZIP',
    # size and value fields that are not used
    'OFCARD', 'ASDLAND', 'ASMT_EXEMPT_AMT', 'ASMT_TAXABLE', 'ASSESS_VALUE',
    'TOTAL_VALUE',
]
df.drop(drop_cols, axis=1, inplace=True)
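
With a drop list this long, a misspelled or already-removed column name raises a `KeyError`. One way to guard against that, sketched on a toy frame (the values and the short drop list are made up for illustration), is to report names that are missing before dropping:

```python
import pandas as pd

# Toy frame with a few column names from the field listing (values are made up).
toy = pd.DataFrame({
    "PIN": [1, 2],
    "OWNER": ["A", "B"],
    "SALE_PRICE": [299000.0, 415000.0],
})

# Hypothetical drop list; "CO_OWNER" is deliberately absent from the toy frame.
to_drop = ["PIN", "OWNER", "CO_OWNER"]

# Report names that are not in the frame before dropping.
missing = [col for col in to_drop if col not in toy.columns]
print("not found:", missing)

# errors="ignore" skips missing labels; drop() returns a new frame here,
# so the original `toy` is left unchanged.
trimmed = toy.drop(columns=to_drop, errors="ignore")
print(list(trimmed.columns))
```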

In [5]:
# convert the categorical column (NBHD_1_CN_x) to indicator columns
df_new1 = pd.get_dummies(data=df, columns=['NBHD_1_CN_x'], drop_first=True)
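
To see what `get_dummies` with `drop_first=True` produces, here is a small sketch on a toy frame. The neighborhood labels `A`, `B`, `C` are placeholders, not values from the data set:

```python
import pandas as pd

# Toy frame; "A", "B", "C" are placeholder neighborhood names.
toy = pd.DataFrame({
    "NBHD_1_CN_x": ["A", "B", "C", "A"],
    "SALE_PRICE": [1.0, 2.0, 3.0, 4.0],
})

# One indicator column per category, minus the first category ("A").
encoded = pd.get_dummies(data=toy, columns=["NBHD_1_CN_x"], drop_first=True)
print(list(encoded.columns))

# Rows in the dropped category are identified by all indicators being off.
indicator_cols = ["NBHD_1_CN_x_B", "NBHD_1_CN_x_C"]
print(encoded[indicator_cols])
```

Dropping the first level avoids perfectly collinear indicator columns, which matters for linear models; the dropped category becomes the baseline.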

In [6]:
# only select PROP_CLASS = 1112 (Single Family Residential)
df_sfr = df_new1.loc[df_new1['PROP_CLASS'] == 1112]

In [7]:
# Create feature and target arrays
y = df_sfr['SALE_PRICE']
X = df_sfr.drop('SALE_PRICE', axis=1)

In [8]:
# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
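
As a quick check on what the split returns, the sketch below applies the same call to a ten-row toy data set (the column names are borrowed from the field listing, the values are made up). With `test_size=0.2`, two rows go to the test set, and a fixed `random_state` makes the split repeatable:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and target (values are illustrative only).
X_toy = pd.DataFrame({"AREA_ABG": np.arange(10), "STORY": np.ones(10, dtype=int)})
y_toy = pd.Series(np.arange(10) * 1000.0, name="SALE_PRICE")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Features and target stay aligned: both halves share the same row labels.
print(X_te.index.equals(y_te.index))

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print(X_te.index.equals(X_te2.index))
```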

In [9]:
# scale the feature fields
# (StandardScaler was imported in the first cell)

# Create the standard scaler
SS_scaler = StandardScaler()

# Fit the scaler on the training data only, so no test-set information leaks in
SS_scaler.fit(X_train)

# Transform both the training and the test data using the fitted scaler
X_train_new = SS_scaler.transform(X_train)
X_test_new = SS_scaler.transform(X_test)
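
The scaler is fitted on the training split and then reused on the test split, so the test rows are scaled with the training mean and standard deviation. A minimal sketch on made-up numbers shows the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training and test arrays with two features on different scales.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
scaler.fit(train)                      # learns per-column mean and std from train only

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuses the training statistics

print(scaler.mean_)                    # per-column training means
print(train_scaled.mean(axis=0))       # approximately 0 for each column
print(train_scaled.std(axis=0))        # approximately 1 for each column
print(test_scaled)                     # test values expressed in training units
```

The scaled test values need not have zero mean or unit variance; they are simply expressed in the units learned from the training data.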
