# Pre-Processing and Training Data Development

Here we have 3 simple goals: <br>

1) Create dummy features for categorical data if necessary <br>
2) Scale numerical variables if necessary <br>
3) Split data into training and test sets for modelling <br>

In [1]:
# Import necessary packages:

import pandas as pd
import numpy as np


In [2]:
# First let's load in our data

df = pd.read_csv(r'C:\Users\bronc\Downloads\Capstone 2\Real_Estate_Sales_2001-2018(clean).csv')
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,Month
0,0,0,110540,2011,2012-04-03,Stamford,56 CHERRY HILL ROAD,795870,690000,0.866976,Residential,Single Family,4
1,1,1,120025,2012,2012-10-05,Greenwich,"78 BALDWIN FARMS SOUTH, GREENW",1925560,3224000,1.674318,Residential,Single Family,10
2,2,3,60173,2006,2006-12-28,Windsor,7 ALFORD DR,189630,309000,1.629489,Residential,Single Family,12
3,3,5,14539,2014,2015-05-28,Milford,170 MEADOWSIDE RD,147340,150000,1.018053,Residential,Single Family,5
4,4,9,160629,2016,2017-01-31,Bridgeport,75 EDGEMOOR RD,163380,180000,1.101726,Residential,Single Family,1
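The `Unnamed: 0` columns in the output above usually come from saving a DataFrame together with its index and then reading it back without telling pandas which column is the index. A minimal sketch on a made-up two-row frame (the toy data and the in-memory buffer are assumptions, not the real file) shows the effect and one way to avoid it:

```python
import io
import pandas as pd

# Toy frame standing in for the real data (hypothetical values).
toy = pd.DataFrame({"Town": ["Stamford", "Windsor"], "SaleAmount": [690000, 309000]})

# Writing with the default index=True stores the index as an unnamed first column.
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading it back naively yields an extra "Unnamed: 0" column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Passing index_col=0 treats that first column as the index instead.
buffer.seek(0)
clean = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(clean.columns))
```

Saving with `to_csv(..., index=False)` in the cleaning step would be another way to keep these columns from piling up across notebooks.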


### Create dummy features

In [3]:
#Looking through the data I think it makes the most sense to create dummy features for Town and ResidentialType

dummy_town = pd.get_dummies(df['Town'])
dummy_res = pd.get_dummies(df['ResidentialType'])

In [4]:
dummy_town.head()

Unnamed: 0,Andover,Ansonia,Ashford,Avon,Barkhamsted,Beacon Falls,Berlin,Bethany,Bethel,Bethlehem,...,Willington,Wilton,Winchester,Windham,Windsor,Windsor Locks,Wolcott,Woodbridge,Woodbury,Woodstock
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
dummy_res.sample(5)

Unnamed: 0,Four Family,Multi Family,Single Family,Three Family,Two Family
517265,0,0,1,0,0
376836,0,0,1,0,0
562703,0,0,0,0,1
262464,0,0,1,0,0
394889,0,0,1,0,0


In [6]:
# With those made and verified to have worked let's put them back with our main dataframe

df = pd.concat([df, dummy_town], axis = 1)
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,...,Willington,Wilton,Winchester,Windham,Windsor,Windsor Locks,Wolcott,Woodbridge,Woodbury,Woodstock
0,0,0,110540,2011,2012-04-03,Stamford,56 CHERRY HILL ROAD,795870,690000,0.866976,...,0,0,0,0,0,0,0,0,0,0
1,1,1,120025,2012,2012-10-05,Greenwich,"78 BALDWIN FARMS SOUTH, GREENW",1925560,3224000,1.674318,...,0,0,0,0,0,0,0,0,0,0
2,2,3,60173,2006,2006-12-28,Windsor,7 ALFORD DR,189630,309000,1.629489,...,0,0,0,0,1,0,0,0,0,0
3,3,5,14539,2014,2015-05-28,Milford,170 MEADOWSIDE RD,147340,150000,1.018053,...,0,0,0,0,0,0,0,0,0,0
4,4,9,160629,2016,2017-01-31,Bridgeport,75 EDGEMOOR RD,163380,180000,1.101726,...,0,0,0,0,0,0,0,0,0,0


In [7]:
df = pd.concat([df, dummy_res], axis = 1)
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,...,Windsor Locks,Wolcott,Woodbridge,Woodbury,Woodstock,Four Family,Multi Family,Single Family,Three Family,Two Family
0,0,0,110540,2011,2012-04-03,Stamford,56 CHERRY HILL ROAD,795870,690000,0.866976,...,0,0,0,0,0,0,0,1,0,0
1,1,1,120025,2012,2012-10-05,Greenwich,"78 BALDWIN FARMS SOUTH, GREENW",1925560,3224000,1.674318,...,0,0,0,0,0,0,0,1,0,0
2,2,3,60173,2006,2006-12-28,Windsor,7 ALFORD DR,189630,309000,1.629489,...,0,0,0,0,0,0,0,1,0,0
3,3,5,14539,2014,2015-05-28,Milford,170 MEADOWSIDE RD,147340,150000,1.018053,...,0,0,0,0,0,0,0,1,0,0
4,4,9,160629,2016,2017-01-31,Bridgeport,75 EDGEMOOR RD,163380,180000,1.101726,...,0,0,0,0,0,0,0,1,0,0


I ran the two concatenations one at a time so I could verify after each step that the new columns had been added.
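The same encode-and-attach step can be sketched in one go on a toy frame. The data below is made up, and the `prefix` argument is an addition that keeps the new column names self-describing; the notebook itself uses unprefixed names.

```python
import pandas as pd

# Toy data (hypothetical), mirroring the two categorical columns used above.
toy = pd.DataFrame({
    "Town": ["Stamford", "Windsor", "Stamford"],
    "ResidentialType": ["Single Family", "Two Family", "Single Family"],
})

# One indicator column per category, prefixed with a label for its source column.
dummies = pd.get_dummies(toy[["Town", "ResidentialType"]], prefix=["Town", "ResType"])

# Attach the indicators to the original frame column-wise.
encoded = pd.concat([toy, dummies], axis=1)

# Each row should have exactly one active Town indicator.
town_cols = [c for c in encoded.columns if c.startswith("Town_")]
print(encoded[town_cols].sum(axis=1).tolist())
```

Checking that the indicators sum to one per row is a quick way to confirm the encoding worked without scrolling through 169 town columns.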

### Scale numerical variables

While we do have three numerical columns that are candidates for scaling, I don't think it's necessary. AssessedValue and SaleAmount are both dollar amounts on similar scales already, and SalesRatio is a ratio whose natural scale is what makes it useful. Therefore I won't be using a scaler.
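One further reason scaling is optional for the models used later: ordinary least squares with an intercept gives the same predictions whether or not the features are standardized; only the units of the coefficients change. It would matter for regularized or distance-based models, which are not used here. A small sketch with synthetic numbers (the data, the `ols_predict` helper and all variable names are made up for illustration) checks this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: one large-valued feature and one ratio-like feature.
assessed = rng.uniform(50_000, 900_000, size=200)
ratio = rng.uniform(0.5, 2.0, size=200)
price = 1.3 * assessed + 5_000 * ratio + rng.normal(0, 10_000, size=200)

def ols_predict(features, target):
    """Fit OLS with an intercept via least squares and return fitted values."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef

raw = np.column_stack([assessed, ratio])
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

pred_raw = ols_predict(raw, price)
pred_scaled = ols_predict(scaled, price)

# The fitted values agree up to floating-point error.
print(np.allclose(pred_raw, pred_scaled, rtol=1e-6))
```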

### Split data into training and test sets for modelling

In [8]:
#Import the necessary package
from sklearn.model_selection import train_test_split

We'll see how well our data predicts SaleAmount, as that has the most interesting real-world application.

In [9]:
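# Positional slice from AssessedValue through the Three Family dummy;
# the final column (Two Family) is left out. Copy so later in-place edits are safe.
X = df.iloc[:, 7:186].copy()
y = df[['SaleAmount']]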

In [10]:
X.drop(columns = 'SaleAmount', inplace = True)
X.head()

Unnamed: 0,AssessedValue,SalesRatio,PropertyType,ResidentialType,Month,Andover,Ansonia,Ashford,Avon,Barkhamsted,...,Windsor,Windsor Locks,Wolcott,Woodbridge,Woodbury,Woodstock,Four Family,Multi Family,Single Family,Three Family
0,795870,0.866976,Residential,Single Family,4,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,1925560,1.674318,Residential,Single Family,10,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,189630,1.629489,Residential,Single Family,12,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
3,147340,1.018053,Residential,Single Family,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,163380,1.101726,Residential,Single Family,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [11]:
y.head()

Unnamed: 0,SaleAmount
0,690000
1,3224000
2,309000
3,150000
4,180000


In [12]:
X.shape

(574715, 178)

In [13]:
y.shape

(574715, 1)

Perform the train/test split. I decided to hold out 30% of the rows for the test set.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 123)

Verify that the split worked

In [15]:
X_train.shape

(402300, 178)

In [16]:
X_test.shape

(172415, 178)
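These two shapes are consistent with how, as far as I understand it, `train_test_split` sizes the sets for a fractional `test_size`: the test count is rounded up and the remainder goes to training. A quick arithmetic check, using only the row count reported above:

```python
import math

n_rows = 574715        # total rows, from X.shape above
test_fraction = 0.3    # test_size passed to train_test_split

# Test set size rounds up; the training set takes what is left.
n_test = math.ceil(test_fraction * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)
```

Both numbers should match the row counts of `X_train` and `X_test` shown above.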

The split looks good! Next we'll start modeling our data.

# Modeling

In [17]:
# Import necessary packages

from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm

After looking at our data and some of the modeling techniques I know, I decided to change the data we'll be inputting.

## Model 1

In [18]:
X = df[['AssessedValue', 'ResidentialType']]
X.head()

Unnamed: 0,AssessedValue,ResidentialType
0,795870,Single Family
1,1925560,Single Family
2,189630,Single Family
3,147340,Single Family
4,163380,Single Family


In [19]:
X = pd.get_dummies(X)
X.head()

Unnamed: 0,AssessedValue,ResidentialType_Four Family,ResidentialType_Multi Family,ResidentialType_Single Family,ResidentialType_Three Family,ResidentialType_Two Family
0,795870,0,0,1,0,0
1,1925560,0,0,1,0,0
2,189630,0,0,1,0,0
3,147340,0,0,1,0,0
4,163380,0,0,1,0,0


In [20]:
X = sm.add_constant(X)

Re-do Train Test split

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 123)

Verify that it worked again

In [22]:
X_train.shape

(402300, 7)

In [23]:
X_test.shape

(172415, 7)

Create and fit the model

In [24]:
Model1 = sm.OLS(y_train.astype(float), X_train.astype(float))
Model1_results = Model1.fit()

Get the results

In [25]:
Model1_results.summary()

0,1,2,3
Dep. Variable:,SaleAmount,R-squared:,0.737
Model:,OLS,Adj. R-squared:,0.736
Method:,Least Squares,F-statistic:,224900.0
Date:,"Wed, 17 Mar 2021",Prob (F-statistic):,0.0
Time:,20:30:19,Log-Likelihood:,-5490100.0
No. Observations:,402300,AIC:,10980000.0
Df Residuals:,402294,BIC:,10980000.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.064e+04,5446.349,1.953,0.051,-38.569,2.13e+04
AssessedValue,1.3268,0.001,1052.768,0.000,1.324,1.329
ResidentialType_Four Family,-5.306e+04,8584.206,-6.182,0.000,-6.99e+04,-3.62e+04
ResidentialType_Multi Family,-4.218e+04,2.63e+04,-1.602,0.109,-9.38e+04,9409.263
ResidentialType_Single Family,4.859e+04,5449.745,8.917,0.000,3.79e+04,5.93e+04
ResidentialType_Three Family,2.925e+04,5656.999,5.171,0.000,1.82e+04,4.03e+04
ResidentialType_Two Family,2.804e+04,5553.114,5.049,0.000,1.72e+04,3.89e+04

0,1,2,3
Omnibus:,249508.496,Durbin-Watson:,1.994
Prob(Omnibus):,0.0,Jarque-Bera (JB):,273140377.249
Skew:,-1.524,Prob(JB):,0.0
Kurtosis:,130.614,Cond. No.,2.2e+21
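
The very large condition number in this summary (2.2e+21) suggests the design matrix is singular or close to it, and Df Model is 5 even though six non-constant columns were supplied. A likely cause, assuming every row belongs to exactly one residential type, is that the five ResidentialType indicators add up to the constant column, so the intercept and the dummies cannot be identified separately and the individual dummy coefficients should be read with caution. The sketch below reproduces the effect on made-up data and shows that dropping one level with `drop_first=True` restores full rank; the `design_matrix` helper is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): three categories, every row has exactly one.
toy = pd.DataFrame({
    "AssessedValue": [100.0, 150.0, 120.0, 300.0, 210.0, 180.0],
    "ResidentialType": ["Single", "Two", "Three", "Single", "Two", "Three"],
})

def design_matrix(frame, drop_first):
    """Return a float design matrix with an explicit intercept column."""
    encoded = pd.get_dummies(frame, drop_first=drop_first).astype(float)
    encoded.insert(0, "const", 1.0)
    return encoded

full = design_matrix(toy, drop_first=False)
reduced = design_matrix(toy, drop_first=True)

# Compare the number of columns with the matrix rank in each case.
print(full.shape[1], np.linalg.matrix_rank(full.to_numpy()))
print(reduced.shape[1], np.linalg.matrix_rank(reduced.to_numpy()))
```

With all dummies plus an intercept, the rank falls one short of the column count; with one level dropped, the two agree. Predictions are unaffected either way, which is why the RMSE below is still meaningful.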


Finally, let's use Root Mean Squared Error (RMSE) on the test set to measure how close our model's predictions are to the actual sale amounts.

In [26]:
y_pred = Model1_results.predict(X_test)

In [27]:
def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())

In [28]:
matches = pd.DataFrame(y_test)
matches.rename(columns = {'SaleAmount':'actual'}, inplace=True)
matches["predicted"] = y_pred

rmse(matches["actual"], matches["predicted"])

207995.71176333088

This shows us that our predictions have a root mean squared error of about $208k, which is a large miss for a house price. Let's see if we can do better by incorporating more columns.
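RMSE is not quite an average miss: squaring the errors lets a few large misses dominate the result. A small sketch with made-up numbers compares it with the mean absolute error (MAE), reusing the `rmse` definition from above; the `mae` helper and the toy predictions are additions for illustration only.

```python
import numpy as np

def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())

def mae(predictions, targets):
    # Hypothetical helper: mean absolute error, less sensitive to outliers.
    return np.abs(predictions - targets).mean()

# Toy prices: four close predictions and one large miss on the expensive house.
actual = np.array([150_000.0, 180_000.0, 309_000.0, 690_000.0, 3_224_000.0])
predicted = np.array([160_000.0, 170_000.0, 300_000.0, 700_000.0, 2_224_000.0])

print(rmse(predicted, actual))
print(mae(predicted, actual))
```

Here the single large miss pulls the RMSE well above the MAE, so reporting both on the real test set would give a fuller picture of how far off a typical prediction is.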

## Model 2

In [29]:
X = df.iloc[:, 7:186]
X.head()

Unnamed: 0,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,Month,Andover,Ansonia,Ashford,Avon,...,Windsor,Windsor Locks,Wolcott,Woodbridge,Woodbury,Woodstock,Four Family,Multi Family,Single Family,Three Family
0,795870,690000,0.866976,Residential,Single Family,4,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,1925560,3224000,1.674318,Residential,Single Family,10,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,189630,309000,1.629489,Residential,Single Family,12,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
3,147340,150000,1.018053,Residential,Single Family,5,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,163380,180000,1.101726,Residential,Single Family,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [30]:
# Drop the target and the two raw categorical columns in a single call
X = X.drop(columns = ['SaleAmount', 'PropertyType', 'ResidentialType'])
X.head()

Unnamed: 0,AssessedValue,SalesRatio,Month,Andover,Ansonia,Ashford,Avon,Barkhamsted,Beacon Falls,Berlin,...,Windsor,Windsor Locks,Wolcott,Woodbridge,Woodbury,Woodstock,Four Family,Multi Family,Single Family,Three Family
0,795870,0.866976,4,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,1925560,1.674318,10,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,189630,1.629489,12,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
3,147340,1.018053,5,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,163380,1.101726,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [31]:
X = sm.add_constant(X)

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 123)

In [33]:
X_train.shape

(402300, 177)

In [34]:
X_test.shape

(172415, 177)

In [35]:
Model2 = sm.OLS(y_train.astype(float), X_train.astype(float))
Model2_results = Model2.fit()

In [36]:
Model2_results.summary()

0,1,2,3
Dep. Variable:,SaleAmount,R-squared:,0.77
Model:,OLS,Adj. R-squared:,0.77
Method:,Least Squares,F-statistic:,7694.0
Date:,"Wed, 17 Mar 2021",Prob (F-statistic):,0.0
Time:,20:30:38,Log-Likelihood:,-5462700.0
No. Observations:,402300,AIC:,10930000.0
Df Residuals:,402124,BIC:,10930000.0
Df Model:,175,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,9.223e+04,1555.683,59.287,0.000,8.92e+04,9.53e+04
AssessedValue,1.0893,0.002,687.820,0.000,1.086,1.092
SalesRatio,6.1156,1.278,4.787,0.000,3.612,8.620
Month,117.1209,93.208,1.257,0.209,-65.563,299.805
Andover,-1.416e+04,1.01e+04,-1.397,0.162,-3.4e+04,5705.741
Ansonia,-4.439e+04,4305.059,-10.311,0.000,-5.28e+04,-3.6e+04
Ashford,-3.908e+04,8691.191,-4.497,0.000,-5.61e+04,-2.2e+04
Avon,5.387e+04,3614.418,14.904,0.000,4.68e+04,6.1e+04
Barkhamsted,-3.436e+04,9555.336,-3.595,0.000,-5.31e+04,-1.56e+04

0,1,2,3
Omnibus:,157228.964,Durbin-Watson:,1.996
Prob(Omnibus):,0.0,Jarque-Bera (JB):,156464155.948
Skew:,-0.164,Prob(JB):,0.0
Kurtosis:,99.613,Cond. No.,4.04e+19


In [37]:
y_pred = Model2_results.predict(X_test)

In [38]:
matches = pd.DataFrame(y_test)
matches.rename(columns = {'SaleAmount':'actual'}, inplace=True)
matches["predicted"] = y_pred

rmse(matches["actual"], matches["predicted"])

193853.55676488072

The results here are better, but not by much: the RMSE is still almost $194k. Let's try one more model, this time with a simpler approach.

## Model 3

In [39]:
X = df[['AssessedValue', 'SalesRatio']]
X.head()

Unnamed: 0,AssessedValue,SalesRatio
0,795870,0.866976
1,1925560,1.674318
2,189630,1.629489
3,147340,1.018053
4,163380,1.101726
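
Before leaning on SalesRatio as a predictor, it is worth checking how it relates to the target. In the five rows shown by `df.head()` earlier, SalesRatio matches SaleAmount divided by AssessedValue, which would mean the feature is computed from the very value being predicted. The sketch below checks only those five displayed rows; whether the relationship holds across the full file is an assumption that would need verifying on `df` itself.

```python
import pandas as pd

# The five rows displayed by df.head() earlier in the notebook.
head = pd.DataFrame({
    "AssessedValue": [795870, 1925560, 189630, 147340, 163380],
    "SaleAmount": [690000, 3224000, 309000, 150000, 180000],
    "SalesRatio": [0.866976, 1.674318, 1.629489, 1.018053, 1.101726],
})

# Recompute the ratio from the other two columns.
recomputed = head["SaleAmount"] / head["AssessedValue"]

# Largest gap between the stored and recomputed ratios.
max_gap = (recomputed - head["SalesRatio"]).abs().max()
print(max_gap)
```

A gap no larger than the rounding of the displayed values would indicate that SalesRatio would not be available for a property that has not sold yet, which limits how the models that use it can be applied.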


In [40]:
X = sm.add_constant(X)

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 123)

In [42]:
X_train.shape

(402300, 3)

In [43]:
X_test.shape

(172415, 3)

In [44]:
Model3 = sm.OLS(y_train.astype(float), X_train.astype(float))
Model3_results = Model3.fit()

In [45]:
Model3_results.summary()

0,1,2,3
Dep. Variable:,SaleAmount,R-squared:,0.736
Model:,OLS,Adj. R-squared:,0.736
Method:,Least Squares,F-statistic:,561400.0
Date:,"Wed, 17 Mar 2021",Prob (F-statistic):,0.0
Time:,20:30:50,Log-Likelihood:,-5490300.0
No. Observations:,402300,AIC:,10980000.0
Df Residuals:,402297,BIC:,10980000.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.683e+04,416.158,136.553,0.000,5.6e+04,5.76e+04
AssessedValue,1.3291,0.001,1059.615,0.000,1.327,1.332
SalesRatio,6.6598,1.368,4.869,0.000,3.979,9.340

0,1,2,3
Omnibus:,251040.451,Durbin-Watson:,1.994
Prob(Omnibus):,0.0,Jarque-Bera (JB):,274327629.129
Skew:,-1.544,Prob(JB):,0.0
Kurtosis:,130.891,Cond. No.,428000.0


In [46]:
y_pred = Model3_results.predict(X_test)

In [47]:
matches = pd.DataFrame(y_test)
matches.rename(columns = {'SaleAmount':'actual'}, inplace=True)
matches["predicted"] = y_pred

rmse(matches["actual"], matches["predicted"])

208126.3560958764

This model is actually the worst of the three, with an RMSE of about $208k, marginally higher than Model 1's. Our best model was Model 2, the one using most of the columns.
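
The three test-set RMSE values reported above can be put side by side to quantify the gaps. The numbers are copied from the notebook output; the dictionary and the percentage calculation are just an illustrative summary.

```python
# Test-set RMSE values as reported by the notebook.
rmse_by_model = {
    "Model 1": 207995.71176333088,
    "Model 2": 193853.55676488072,
    "Model 3": 208126.3560958764,
}

best = min(rmse_by_model, key=rmse_by_model.get)
worst = max(rmse_by_model, key=rmse_by_model.get)

# How much lower the best model's RMSE is, relative to each model.
for name, value in rmse_by_model.items():
    gain = 100 * (value - rmse_by_model[best]) / value
    print(f"{name}: RMSE {value:,.0f} (best model is {gain:.1f}% lower)")
```

Seen this way, Model 2 improves on the other two by under 7%, and Models 1 and 3 are nearly indistinguishable.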