# Pre-Processing and Training Data Development

Here we have three simple goals:

1) Create dummy features for categorical data if necessary
2) Scale numerical variables if necessary
3) Split data into training and test sets for modelling

In [1]:
# Import necessary packages:

import pandas as pd
import numpy as np


In [2]:
# First let's load in our data

df = pd.read_csv(r'C:\Users\bronc\Downloads\Capstone 2\Real_Estate_Sales_2001-2018(clean).csv')
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,Month
0,0,0,110540,2011,2012-04-03,Stamford,56 CHERRY HILL ROAD,795870,690000,0.866976,Residential,Single Family,4
1,1,1,120025,2012,2012-10-05,Greenwich,"78 BALDWIN FARMS SOUTH, GREENW",1925560,3224000,1.674318,Residential,Single Family,10
2,2,3,60173,2006,2006-12-28,Windsor,7 ALFORD DR,189630,309000,1.629489,Residential,Single Family,12
3,3,5,14539,2014,2015-05-28,Milford,170 MEADOWSIDE RD,147340,150000,1.018053,Residential,Single Family,5
4,4,9,160629,2016,2017-01-31,Bridgeport,75 EDGEMOOR RD,163380,180000,1.101726,Residential,Single Family,1


### Create dummy features

In [3]:
# Looking through the data, it makes the most sense to create dummy features for Town and ResidentialType.
# Because there are so many towns, those with fewer than 1000 observations are aggregated into an 'OtherTown' category.
# To do this, we first need the value counts for Town.
df['Town'].value_counts()

Bridgeport       20271
Waterbury        16554
Stamford         15131
New Haven        13044
West Hartford    12669
                 ...  
Franklin           264
Eastford           249
Hartland           240
Canaan             217
Union              143
Name: Town, Length: 169, dtype: int64

In [4]:
# Towns with fewer than 1000 observations, taken from the value counts above
small_towns = [
    'Franklin', 'Eastford', 'Hartland', 'Canaan', 'Union', 'Bozrah', 'Colebrook',
    'Cornwall', 'Scotland', 'Norfolk', 'Chaplin', 'Hampton', 'Warren', 'Bridgewater',
    'North Canaan', 'Voluntown', 'Morris', 'Sprague', 'Roxbury', 'Kent', 'Barkhamsted',
    'Andover', 'Bethlehem', 'Lyme', 'Chester', 'Ashford', 'Pomfret', 'Lisbon',
    'Middlefield', 'Salem', 'Deep River', 'Willington', 'Sharon', 'Sterling',
    'Harwinton', 'Preston', 'Sherman', 'Canterbury', 'Goshen', 'Beacon Falls',
    'Washington', 'Salisbury', 'Bolton', 'Bethany', 'North Stonington',
]

# Replace every listed town with the single catch-all label
df['Town'] = df['Town'].replace(small_towns, 'OtherTown')

In [5]:
df['Town'].value_counts()

OtherTown       25864
Bridgeport      20271
Waterbury       16554
Stamford        15131
New Haven       13044
                ...  
Middlebury       1161
Marlborough      1115
New Hartford     1076
East Granby      1040
Columbia         1024
Name: Town, Length: 125, dtype: int64
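
The town names replaced above can also be derived from the counts rather than typed by hand. The following is a minimal sketch on a toy Series, assuming the same rule (fewer than a minimum number of observations). `lump_rare_categories` is a hypothetical helper written for this illustration, not a pandas function.

```python
import pandas as pd


def lump_rare_categories(series, min_count, other_label="OtherTown"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; overwrite the rest with the catch-all label
    return series.where(~series.isin(rare), other_label)


# Toy data: two frequent towns and two rare ones
towns = pd.Series(["Bridgeport"] * 5 + ["Stamford"] * 4 + ["Union"] * 2 + ["Canaan"])
lumped = lump_rare_categories(towns, min_count=3)
print(lumped.value_counts())
```

On the real data the call would presumably be `lump_rare_categories(df['Town'], min_count=1000)`, which keeps the threshold in one place if it ever changes.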

In [6]:
dummy_town = pd.get_dummies(df['Town'])
dummy_res = pd.get_dummies(df['ResidentialType'])

In [7]:
dummy_town.head()

Unnamed: 0,Ansonia,Avon,Berlin,Bethel,Bloomfield,Branford,Bridgeport,Bristol,Brookfield,Brooklyn,...,Wethersfield,Wilton,Winchester,Windham,Windsor,Windsor Locks,Wolcott,Woodbridge,Woodbury,Woodstock
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
dummy_res.sample(5)

Unnamed: 0,Four Family,Multi Family,Single Family,Three Family,Two Family
405346,0,0,1,0,0
516079,0,0,1,0,0
166090,0,0,1,0,0
455166,0,0,1,0,0
423672,0,0,1,0,0


In [9]:
# With those made and verified to have worked let's put them back with our main dataframe

df = pd.concat([df, dummy_town], axis = 1)
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,...,Wethersfield,Wilton,Winchester,Windham,Windsor,Windsor Locks,Wolcott,Woodbridge,Woodbury,Woodstock
0,0,0,110540,2011,2012-04-03,Stamford,56 CHERRY HILL ROAD,795870,690000,0.866976,...,0,0,0,0,0,0,0,0,0,0
1,1,1,120025,2012,2012-10-05,Greenwich,"78 BALDWIN FARMS SOUTH, GREENW",1925560,3224000,1.674318,...,0,0,0,0,0,0,0,0,0,0
2,2,3,60173,2006,2006-12-28,Windsor,7 ALFORD DR,189630,309000,1.629489,...,0,0,0,0,1,0,0,0,0,0
3,3,5,14539,2014,2015-05-28,Milford,170 MEADOWSIDE RD,147340,150000,1.018053,...,0,0,0,0,0,0,0,0,0,0
4,4,9,160629,2016,2017-01-31,Bridgeport,75 EDGEMOOR RD,163380,180000,1.101726,...,0,0,0,0,0,0,0,0,0,0


In [10]:
df = pd.concat([df, dummy_res], axis = 1)
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,...,Windsor Locks,Wolcott,Woodbridge,Woodbury,Woodstock,Four Family,Multi Family,Single Family,Three Family,Two Family
0,0,0,110540,2011,2012-04-03,Stamford,56 CHERRY HILL ROAD,795870,690000,0.866976,...,0,0,0,0,0,0,0,1,0,0
1,1,1,120025,2012,2012-10-05,Greenwich,"78 BALDWIN FARMS SOUTH, GREENW",1925560,3224000,1.674318,...,0,0,0,0,0,0,0,1,0,0
2,2,3,60173,2006,2006-12-28,Windsor,7 ALFORD DR,189630,309000,1.629489,...,0,0,0,0,0,0,0,1,0,0
3,3,5,14539,2014,2015-05-28,Milford,170 MEADOWSIDE RD,147340,150000,1.018053,...,0,0,0,0,0,0,0,1,0,0
4,4,9,160629,2016,2017-01-31,Bridgeport,75 EDGEMOOR RD,163380,180000,1.101726,...,0,0,0,0,0,0,0,1,0,0


I ran the two concatenations in that order and checked `df.head()` after each one to verify that the columns had been added successfully.
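
As an alternative sketch, `pd.get_dummies` can encode several columns and attach them in one call through its `columns` argument. Note two differences from the approach above: the original text columns are dropped, and the new column names are prefixed with the source column name. The toy frame below is made up for illustration.

```python
import pandas as pd

# Toy frame with the same column names as the real data
toy = pd.DataFrame(
    {
        "AssessedValue": [795870, 1925560, 189630],
        "Town": ["Stamford", "Greenwich", "Windsor"],
        "ResidentialType": ["Single Family", "Single Family", "Two Family"],
    }
)

# Encode both categorical columns at once; untouched columns are kept as-is
encoded = pd.get_dummies(toy, columns=["Town", "ResidentialType"])
print(encoded.columns.tolist())
```

The prefixes avoid name clashes if a town and a residential type ever share a label.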

### Scale numerical variables

Scaling is deferred to the modeling stage, where a StandardScaler is applied to each model's feature set.
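
When scaling is done at the modeling stage, a common pattern is to split first and fit the scaler on the training rows only, so that test-set statistics do not influence the transformation. The sketch below uses synthetic data and hypothetical variable names (`X_demo`, `demo_scaler`); it is an illustration of that pattern, not the workflow used later in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data standing in for AssessedValue
rng = np.random.default_rng(123)
X_demo = rng.normal(loc=200_000, scale=50_000, size=(1000, 1))
y_demo = 1.5 * X_demo[:, 0] + rng.normal(scale=10_000, size=1000)

# Split first, then learn the mean and standard deviation from the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=123)
demo_scaler = StandardScaler().fit(X_tr)

# Apply the same training-set statistics to both splits
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)
print(X_tr_scaled.mean(), X_tr_scaled.std())
```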

### Split data into training and test sets for modelling

In [11]:
# Import the necessary packages
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

We'll predict SaleAmount, since it has the most interesting real-world application.

In [12]:
X = df.iloc[:, 7:144]
y = df[['SaleAmount']]

In [13]:
X = X.drop(columns = 'SaleAmount')  # reassign instead of dropping in place on a slice of df
X.head()

Unnamed: 0,AssessedValue,SalesRatio,PropertyType,ResidentialType,Month,Ansonia,Avon,Berlin,Bethel,Bloomfield,...,Windsor Locks,Wolcott,Woodbridge,Woodbury,Woodstock,Four Family,Multi Family,Single Family,Three Family,Two Family
0,795870,0.866976,Residential,Single Family,4,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,1925560,1.674318,Residential,Single Family,10,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,189630,1.629489,Residential,Single Family,12,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,147340,1.018053,Residential,Single Family,5,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,163380,1.101726,Residential,Single Family,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [14]:
y.head()

Unnamed: 0,SaleAmount
0,690000
1,3224000
2,309000
3,150000
4,180000


In [15]:
X.shape

(574715, 135)

In [16]:
y.shape

(574715, 1)

Perform the train/test split. I chose to hold out 30% of the rows as the test set.

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 123)

Verify that the split worked

In [18]:
X_train.shape

(402300, 135)

In [19]:
X_test.shape

(172415, 135)
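
The two shapes can be checked by hand. As far as I know, `train_test_split` rounds a fractional test size up and gives the remaining rows to the training set, so the arithmetic below is a sketch under that assumption:

```python
import math

n_rows = 574715          # rows in X, from X.shape above
test_fraction = 0.3

# 0.3 * 574715 = 172414.5, which rounds up to 172415 test rows
n_test = math.ceil(test_fraction * n_rows)
n_train = n_rows - n_test
print(n_train, n_test)
```

These match the 402300 and 172415 rows reported for X_train and X_test.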

The split looks good! Next we'll start modelling our data

# Modeling

In [20]:
# Import necessary packages

from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

After looking at the data and the modeling techniques available to me, I decided to change the features we'll be inputting.

## Model 1

In [21]:
X = df[['AssessedValue', 'ResidentialType']]
X.head()

Unnamed: 0,AssessedValue,ResidentialType
0,795870,Single Family
1,1925560,Single Family
2,189630,Single Family
3,147340,Single Family
4,163380,Single Family


In [22]:
X['ResidentialType'].value_counts()

Single Family    524082
Two Family        32932
Three Family      16728
Four Family         904
Multi Family         69
Name: ResidentialType, dtype: int64

In [23]:
X = X.copy()  # explicit copy so the assignment below does not raise SettingWithCopyWarning
X['ResidentialType'] = X['ResidentialType'].replace(['Two Family', 'Three Family', 'Four Family'], 'Multi Family')
X['ResidentialType'].value_counts()

Single Family    524082
Multi Family      50633
Name: ResidentialType, dtype: int64

In [24]:
X = pd.get_dummies(X)
X.head()

Unnamed: 0,AssessedValue,ResidentialType_Multi Family,ResidentialType_Single Family
0,795870,0,1
1,1925560,0,1
2,189630,0,1
3,147340,0,1
4,163380,0,1


In [25]:
scaler = StandardScaler()

In [26]:
scaler.fit(X)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [27]:
X_scaled = scaler.transform(X)

In [28]:
X_scaled = sm.add_constant(X_scaled)

Redo the train/test split with the scaled features

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.3, random_state = 123)

Verify that it worked again

In [30]:
X_train.shape

(402300, 4)

In [31]:
X_test.shape

(172415, 4)

Create and fit the model

In [32]:
Model1 = sm.OLS(y_train.astype(float), X_train.astype(float))
Model1_results = Model1.fit()

Get the results

In [33]:
Model1_results.summary()

0,1,2,3
Dep. Variable:,SaleAmount,R-squared:,0.736
Model:,OLS,Adj. R-squared:,0.736
Method:,Least Squares,F-statistic:,562000.0
Date:,"Wed, 24 Mar 2021",Prob (F-statistic):,0.0
Time:,18:28:59,Log-Likelihood:,-5490100.0
No. Observations:,402300,AIC:,10980000.0
Df Residuals:,402297,BIC:,10980000.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,3.356e+05,322.291,1041.397,0.000,3.35e+05,3.36e+05
x1,3.421e+05,325.024,1052.656,0.000,3.42e+05,3.43e+05
x2,-3075.9693,162.286,-18.954,0.000,-3394.044,-2757.895
x3,3075.9693,162.286,18.954,0.000,2757.895,3394.044

0,1,2,3
Omnibus:,249295.541,Durbin-Watson:,1.994
Prob(Omnibus):,0.0,Jarque-Bera (JB):,272688204.798
Skew:,-1.521,Prob(JB):,0.0
Kurtosis:,130.509,Cond. No.,2790000000000000.0


Finally, let's use Root Mean Squared Error (RMSE) to measure how closely the model's predictions match the held-out test data

In [34]:
y_pred = Model1_results.predict(X_test)

In [35]:
def rmse(predictions, targets):
    return np.sqrt(((predictions - targets) ** 2).mean())
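
As a quick sanity check of this helper, here is a worked example with made-up numbers: two predictions that are each off by 10 should give an RMSE of exactly 10. The comparison against `mean_squared_error` is only there to confirm the two routes agree.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def rmse(predictions, targets):
    # Same definition as in the notebook cell above
    return np.sqrt(((predictions - targets) ** 2).mean())


predictions = np.array([110.0, 90.0])
targets = np.array([100.0, 100.0])

manual = rmse(predictions, targets)                               # sqrt((100 + 100) / 2)
via_sklearn = np.sqrt(mean_squared_error(targets, predictions))   # same quantity via scikit-learn
print(manual, via_sklearn)
```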

In [36]:
matches = pd.DataFrame(y_test)
matches.rename(columns = {'SaleAmount':'actual'}, inplace=True)
matches["predicted"] = y_pred

rmse(matches["actual"], matches["predicted"])

208030.65417067677

Here the R-squared is 0.736 and the RMSE is about 208,031, which is high next to the fitted intercept of about 335,600. This model is decent, but I'm sure we can do better. Next, let's try using more variables.
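
One detail in the summary above is worth a note: x2 and x3 have equal and opposite coefficients and the condition number is about 2.8e15. That is the usual signature of perfect collinearity. Both ResidentialType dummies are kept, and since one is always 1 minus the other, they carry the same information. The sketch below uses made-up values (in units of $100k) and a hypothetical helper, `rank_with_constant`, to show how `drop_first=True` removes the redundant column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "AssessedValue": [7.9, 19.3, 1.9, 1.5, 1.6, 1.2],
        "ResidentialType": [
            "Single Family", "Single Family", "Multi Family",
            "Single Family", "Multi Family", "Single Family",
        ],
    }
)

full = pd.get_dummies(toy, dtype=float)                      # keeps both dummy columns
reduced = pd.get_dummies(toy, drop_first=True, dtype=float)  # drops one level as the baseline


def rank_with_constant(frame):
    """Return (rank, number of columns) of the design matrix with an intercept column."""
    design = np.column_stack([np.ones(len(frame)), frame.to_numpy(dtype=float)])
    return int(np.linalg.matrix_rank(design)), design.shape[1]


print(rank_with_constant(full))     # rank is lower than the column count
print(rank_with_constant(reduced))  # rank equals the column count
```

With the baseline level dropped, the remaining dummy coefficient reads as the difference from that baseline.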

## Model 2

In [37]:
X = df.iloc[:, 7:144]
X.head()

Unnamed: 0,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,Month,Ansonia,Avon,Berlin,Bethel,...,Windsor Locks,Wolcott,Woodbridge,Woodbury,Woodstock,Four Family,Multi Family,Single Family,Three Family,Two Family
0,795870,690000,0.866976,Residential,Single Family,4,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,1925560,3224000,1.674318,Residential,Single Family,10,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,189630,309000,1.629489,Residential,Single Family,12,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,147340,150000,1.018053,Residential,Single Family,5,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,163380,180000,1.101726,Residential,Single Family,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [38]:
X = X.drop(columns = ['SaleAmount', 'PropertyType', 'ResidentialType'])
X.head()

Unnamed: 0,AssessedValue,SalesRatio,Month,Ansonia,Avon,Berlin,Bethel,Bloomfield,Branford,Bridgeport,...,Windsor Locks,Wolcott,Woodbridge,Woodbury,Woodstock,Four Family,Multi Family,Single Family,Three Family,Two Family
0,795870,0.866976,4,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,1925560,1.674318,10,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,189630,1.629489,12,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,147340,1.018053,5,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,163380,1.101726,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0


In [39]:
X.describe()

Unnamed: 0,AssessedValue,SalesRatio,Month,Ansonia,Avon,Berlin,Bethel,Bloomfield,Branford,Bridgeport,...,Windsor Locks,Wolcott,Woodbridge,Woodbury,Woodstock,Four Family,Multi Family,Single Family,Three Family,Two Family
count,574715.0,574715.0,574715.0,574715.0,574715.0,574715.0,574715.0,574715.0,574715.0,574715.0,...,574715.0,574715.0,574715.0,574715.0,574715.0,574715.0,574715.0,574715.0,574715.0,574715.0
mean,209766.0,2.674725,6.794531,0.004971,0.006984,0.005211,0.005129,0.006339,0.005883,0.035271,...,0.004211,0.004762,0.002939,0.002944,0.002814,0.001573,0.00012,0.911899,0.029107,0.057301
std,257898.9,232.3596,3.233131,0.070331,0.08328,0.072001,0.071437,0.079364,0.076474,0.184465,...,0.064754,0.068845,0.054131,0.054179,0.052968,0.039629,0.010957,0.283442,0.168105,0.232418
min,1.0,8.153814e-07,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,95050.0,1.266434,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
50%,142590.0,1.594454,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
75%,224070.0,2.041974,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
max,4948370.0,141666.7,12.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Since we're using Month, let's turn it into dummy variables to allow for better modeling. I'll group the months into quarters.

In [40]:
X['Month'] = X['Month'].replace(1, 'Q1')
X['Month'] = X['Month'].replace(2, 'Q1')
X['Month'] = X['Month'].replace(3, 'Q1')
X['Month'] = X['Month'].replace(4, 'Q2')
X['Month'] = X['Month'].replace(5, 'Q2')
X['Month'] = X['Month'].replace(6, 'Q2')
X['Month'] = X['Month'].replace(7, 'Q3')
X['Month'] = X['Month'].replace(8, 'Q3')
X['Month'] = X['Month'].replace(9, 'Q3')
X['Month'] = X['Month'].replace(10, 'Q4')
X['Month'] = X['Month'].replace(11, 'Q4')
X['Month'] = X['Month'].replace(12, 'Q4')
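
The same month-to-quarter mapping can be written with integer division. This is a compact sketch on the five sample months shown earlier (4, 10, 12, 5, 1); the variable names are illustrative.

```python
import pandas as pd

months = pd.Series([4, 10, 12, 5, 1])

# Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on
quarters = "Q" + ((months - 1) // 3 + 1).astype(str)
print(quarters.tolist())
```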

In [41]:
X = pd.get_dummies(X)

In [42]:
X.head()

Unnamed: 0,AssessedValue,SalesRatio,Ansonia,Avon,Berlin,Bethel,Bloomfield,Branford,Bridgeport,Bristol,...,Woodstock,Four Family,Multi Family,Single Family,Three Family,Two Family,Month_Q1,Month_Q2,Month_Q3,Month_Q4
0,795870,0.866976,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
1,1925560,1.674318,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
2,189630,1.629489,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
3,147340,1.018053,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
4,163380,1.101726,0,0,0,0,0,0,1,0,...,0,0,0,1,0,0,1,0,0,0


In [43]:
scaler.fit(X)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [44]:
X_scaled = scaler.transform(X)

In [45]:
X_scaled = sm.add_constant(X_scaled)

In [46]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.3, random_state = 123)

In [47]:
X_train.shape

(402300, 137)

In [48]:
X_test.shape

(172415, 137)

In [49]:
Model2 = sm.OLS(y_train.astype(float), X_train.astype(float))
Model2_results = Model2.fit()

In [50]:
Model2_results.summary()

0,1,2,3
Dep. Variable:,SaleAmount,R-squared:,0.769
Model:,OLS,Adj. R-squared:,0.769
Method:,Least Squares,F-statistic:,10070.0
Date:,"Wed, 24 Mar 2021",Prob (F-statistic):,0.0
Time:,18:29:28,Log-Likelihood:,-5463500.0
No. Observations:,402300,AIC:,10930000.0
Df Residuals:,402166,BIC:,10930000.0
Df Model:,133,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,3.356e+05,301.694,1112.475,0.000,3.35e+05,3.36e+05
x1,2.824e+05,406.584,694.524,0.000,2.82e+05,2.83e+05
x2,1396.3083,297.393,4.695,0.000,813.427,1979.189
x3,-3574.2509,302.115,-11.831,0.000,-4166.388,-2982.114
x4,3826.1468,299.304,12.783,0.000,3239.519,4412.775
x5,-2540.7556,299.876,-8.473,0.000,-3128.504,-1953.008
x6,-610.4970,298.524,-2.045,0.041,-1195.596,-25.399
x7,-2002.7499,300.275,-6.670,0.000,-2591.279,-1414.221
x8,-1418.1587,298.771,-4.747,0.000,-2003.741,-832.576

0,1,2,3
Omnibus:,158530.146,Durbin-Watson:,1.996
Prob(Omnibus):,0.0,Jarque-Bera (JB):,161996013.317
Skew:,-0.196,Prob(JB):,0.0
Kurtosis:,101.306,Cond. No.,3470000000000000.0


In [51]:
y_pred = Model2_results.predict(X_test)

In [52]:
matches = pd.DataFrame(y_test)
matches.rename(columns = {'SaleAmount':'actual'}, inplace=True)
matches["predicted"] = y_pred

rmse(matches["actual"], matches["predicted"])

194199.838772732

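Next we rerun the model after eliminating columns whose p-value is above 0.05. In the summary these are x13, x14, x15, x21, x30, x58, x70, x81, x102, x126, and x129 (columns whose p-values were still very close to 0.05 were kept). Because the constant occupies position 0 of the design matrix, label xN corresponds to X.columns[N - 1], which is why the indices below are each one lower.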

In [53]:
X.drop(X.columns[[12, 13, 14, 20, 29, 57, 69, 80, 101, 125, 128]], axis = 1, inplace = True)

In [54]:
scaler.fit(X)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [55]:
X_scaled = scaler.transform(X)

In [56]:
X_scaled = sm.add_constant(X_scaled)

In [57]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.3, random_state = 123)

In [58]:
X_train.shape

(402300, 126)

In [59]:
X_test.shape

(172415, 126)

In [60]:
Model2 = sm.OLS(y_train.astype(float), X_train.astype(float))
Model2_results = Model2.fit()

In [61]:
Model2_results.summary()

0,1,2,3
Dep. Variable:,SaleAmount,R-squared:,0.769
Model:,OLS,Adj. R-squared:,0.769
Method:,Least Squares,F-statistic:,10810.0
Date:,"Wed, 24 Mar 2021",Prob (F-statistic):,0.0
Time:,18:29:46,Log-Likelihood:,-5463500.0
No. Observations:,402300,AIC:,10930000.0
Df Residuals:,402175,BIC:,10930000.0
Df Model:,124,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,3.356e+05,301.692,1112.479,0.000,3.35e+05,3.36e+05
x1,2.824e+05,406.523,694.638,0.000,2.82e+05,2.83e+05
x2,1396.2923,297.392,4.695,0.000,813.413,1979.172
x3,-3365.4872,315.650,-10.662,0.000,-3984.152,-2746.823
x4,4073.5838,318.070,12.807,0.000,3450.176,4696.991
x5,-2326.6924,313.922,-7.412,0.000,-2941.969,-1711.416
x6,-398.2621,312.483,-1.275,0.202,-1010.720,214.196
x7,-1766.7274,317.201,-5.570,0.000,-2388.432,-1145.023
x8,-1190.9161,314.684,-3.784,0.000,-1807.686,-574.146

0,1,2,3
Omnibus:,158534.675,Durbin-Watson:,1.996
Prob(Omnibus):,0.0,Jarque-Bera (JB):,161991495.945
Skew:,-0.196,Prob(JB):,0.0
Kurtosis:,101.304,Cond. No.,1380000000000000.0


I decided to continue eliminating columns with high p-values until all remaining p-values were below 0.05

In [62]:
X.drop(X.columns[[5, 12, 19, 27, 33, 43, 50, 70, 87, 105, 117, 118, 119, 120]], axis = 1, inplace = True)

In [63]:
scaler.fit(X)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [64]:
X_scaled = scaler.transform(X)

In [65]:
X_scaled = sm.add_constant(X_scaled)

In [66]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.3, random_state = 123)

In [67]:
X_train.shape

(402300, 112)

In [68]:
X_test.shape

(172415, 112)

In [69]:
Model2 = sm.OLS(y_train.astype(float), X_train.astype(float))
Model2_results = Model2.fit()

In [70]:
Model2_results.summary()

0,1,2,3
Dep. Variable:,SaleAmount,R-squared:,0.769
Model:,OLS,Adj. R-squared:,0.769
Method:,Least Squares,F-statistic:,12180.0
Date:,"Wed, 24 Mar 2021",Prob (F-statistic):,0.0
Time:,18:30:03,Log-Likelihood:,-5463500.0
No. Observations:,402300,AIC:,10930000.0
Df Residuals:,402189,BIC:,10930000.0
Df Model:,110,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,3.356e+05,301.726,1112.359,0.000,3.35e+05,3.36e+05
x1,2.824e+05,406.172,695.247,0.000,2.82e+05,2.83e+05
x2,1396.8906,297.427,4.697,0.000,813.944,1979.838
x3,-3242.3011,310.268,-10.450,0.000,-3850.418,-2634.184
x4,4323.1073,311.137,13.895,0.000,3713.288,4932.927
x5,-2118.4211,308.714,-6.862,0.000,-2723.492,-1513.350
x6,-1529.6747,310.917,-4.920,0.000,-2139.062,-920.287
x7,-979.4077,308.812,-3.172,0.002,-1584.670,-374.145
x8,-1.14e+04,349.292,-32.627,0.000,-1.21e+04,-1.07e+04

0,1,2,3
Omnibus:,158508.536,Durbin-Watson:,1.996
Prob(Omnibus):,0.0,Jarque-Bera (JB):,161792231.399
Skew:,-0.196,Prob(JB):,0.0
Kurtosis:,101.244,Cond. No.,1340000000000000.0


In [71]:
X.drop(X.columns[[36]], axis = 1, inplace = True)

In [72]:
scaler.fit(X)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [73]:
X_scaled = scaler.transform(X)

In [74]:
X_scaled = sm.add_constant(X_scaled)

In [75]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.3, random_state = 123)

In [76]:
X_train.shape

(402300, 111)

In [77]:
X_test.shape

(172415, 111)

In [78]:
Model2 = sm.OLS(y_train.astype(float), X_train.astype(float))
Model2_results = Model2.fit()

In [79]:
Model2_results.summary()

0,1,2,3
Dep. Variable:,SaleAmount,R-squared:,0.769
Model:,OLS,Adj. R-squared:,0.769
Method:,Least Squares,F-statistic:,12290.0
Date:,"Wed, 24 Mar 2021",Prob (F-statistic):,0.0
Time:,18:30:20,Log-Likelihood:,-5463500.0
No. Observations:,402300,AIC:,10930000.0
Df Residuals:,402190,BIC:,10930000.0
Df Model:,109,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,3.356e+05,301.727,1112.358,0.000,3.35e+05,3.36e+05
x1,2.824e+05,406.161,695.277,0.000,2.82e+05,2.83e+05
x2,1396.9512,297.427,4.697,0.000,814.003,1979.899
x3,-3225.4652,310.051,-10.403,0.000,-3833.156,-2617.774
x4,4342.7821,310.841,13.971,0.000,3733.543,4952.021
x5,-2101.2360,308.487,-6.811,0.000,-2705.861,-1496.611
x6,-1510.6849,310.641,-4.863,0.000,-2119.531,-901.839
x7,-961.2195,308.557,-3.115,0.002,-1565.982,-356.457
x8,-1.135e+04,347.962,-32.624,0.000,-1.2e+04,-1.07e+04

0,1,2,3
Omnibus:,158511.173,Durbin-Watson:,1.996
Prob(Omnibus):,0.0,Jarque-Bera (JB):,161799207.443
Skew:,-0.196,Prob(JB):,0.0
Kurtosis:,101.246,Cond. No.,1830000000000000.0


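In [80]:
# Refresh the predictions from the final fit before scoring. The original run skipped
# this step, so y_pred still held the In [51] predictions from the first Model 2 fit.
y_pred = Model2_results.predict(X_test)

matches = pd.DataFrame(y_test)
matches.rename(columns = {'SaleAmount':'actual'}, inplace=True)
matches["predicted"] = y_pred

rmse(matches["actual"], matches["predicted"])

194199.838772732

The RMSE recorded above is identical to the In [52] value because it was computed from those earlier predictions; rerunning the cell with the refreshed y_pred gives the RMSE of the reduced model.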

## Model 3

Next let's try a Random Forest model with fewer variables

In [81]:
X = df[['AssessedValue', 'SalesRatio']]
X.head()

Unnamed: 0,AssessedValue,SalesRatio
0,795870,0.866976
1,1925560,1.674318
2,189630,1.629489
3,147340,1.018053
4,163380,1.101726


In [82]:
scaler.fit(X)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [83]:
X_scaled = scaler.transform(X)

In [84]:
from sklearn.ensemble import RandomForestRegressor

In [85]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.3, random_state = 123)

In [86]:
X_train.shape

(402300, 2)

In [87]:
X_test.shape

(172415, 2)

In [88]:
Model3 = RandomForestRegressor(random_state = 123)
# Pass y as a 1-D array; a single-column DataFrame triggers a DataConversionWarning
Model3.fit(X_train, y_train.values.ravel())


RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=123, verbose=0, warm_start=False)

In [89]:
y_pred = Model3.predict(X_test)

In [90]:
r2_score(y_test, y_pred)

0.9996200059108645

In [91]:
matches = pd.DataFrame(y_test)
matches.rename(columns = {'SaleAmount':'actual'}, inplace=True)
matches["predicted"] = y_pred

rmse(matches["actual"], matches["predicted"])

7793.665618712889

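The R-squared score and root mean squared error are both very good for this model: R-squared is very near 1 and the RMSE is by far the lowest we've seen so far. One caution: in the sample rows above, SalesRatio equals SaleAmount divided by AssessedValue (for example, 690000 / 795870 ≈ 0.866976), so these two features together nearly determine the target, and the score likely reflects that relationship more than genuine predictive power.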
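
A quick check on the five sample rows shown earlier, as a sketch: multiplying AssessedValue by SalesRatio reproduces SaleAmount to within rounding of the displayed ratio, which is worth keeping in mind when reading the score above.

```python
import numpy as np

# Values copied from the first five rows of df.head()
assessed = np.array([795870, 1925560, 189630, 147340, 163380])
ratio = np.array([0.866976, 1.674318, 1.629489, 1.018053, 1.101726])
sale = np.array([690000, 3224000, 309000, 150000, 180000])

# If SalesRatio = SaleAmount / AssessedValue, this product recovers the target
reconstructed = assessed * ratio
print(np.round(reconstructed))
print(np.allclose(reconstructed, sale, rtol=1e-5))
```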

## Model 4

Finally, let's try a model that focuses on regularization: ridge regression with a cross-validated penalty. I'll use the same features and split as Model 3.

In [92]:
alpha_range = 10.**np.arange(-2,3)
alpha_range

array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])

In [93]:
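from sklearn.linear_model import RidgeCV
# Note: the `normalize` argument was removed in scikit-learn 1.2. On newer versions it must be
# omitted (X_scaled is already standardized), which may change the selected alpha.
Model4 = RidgeCV(alphas = alpha_range, normalize = True, scoring = 'neg_mean_squared_error')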

In [94]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.3, random_state = 123)

In [95]:
Model4.fit(X_train, y_train)

RidgeCV(alphas=array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]), cv=None,
        fit_intercept=True, gcv_mode=None, normalize=True,
        scoring='neg_mean_squared_error', store_cv_values=False)

In [96]:
Model4.alpha_

0.01
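
The selected alpha is the smallest value in the grid, which hints that little regularization is needed here. For readers on a recent scikit-learn, where the `normalize` argument is no longer available, the sketch below shows the same kind of search on pre-standardized synthetic data. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature regression problem
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(200, 2))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Same alpha grid as above: 0.01, 0.1, 1, 10, 100
alphas = 10.0 ** np.arange(-2, 3)

# Standardize explicitly, then let RidgeCV pick alpha by cross-validated MSE
X_std = StandardScaler().fit_transform(X_demo)
model = RidgeCV(alphas=alphas, scoring="neg_mean_squared_error").fit(X_std, y_demo)
print(model.alpha_, model.coef_)
```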

In [97]:
y_pred = Model4.predict(X_test)

In [98]:
from sklearn import metrics

In [99]:
r2_score(y_test, y_pred)

0.729110543378975

In [100]:
np.sqrt(metrics.mean_squared_error(y_test, y_pred))

208089.13228230696

Here the R-squared (0.729) is lower than the values reported for the other models, and the RMSE (about 208,089) is the highest of the four.

## Conclusion

It looks like the Random Forest regressor was by far the best model: its R-squared was almost one (0.9996) and its RMSE (about 7,794) was the lowest by far. On these metrics, this would be the model to implement to predict sale amount, though it should be re-checked without SalesRatio, which is computed from the sale amount itself.
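
To put the four results side by side, the sketch below collects the metrics reported above into one table. The numbers are copied from the outputs (rounded); note that R-squared for Models 1 and 2 comes from the statsmodels training summary, while for Models 3 and 4 it is `r2_score` on the test set, so the columns are not strictly like for like.

```python
import pandas as pd

# Metrics copied from the notebook outputs above
summary = pd.DataFrame(
    {
        "r_squared": [0.736, 0.769, 0.99962, 0.72911],
        "rmse": [208030.65, 194199.84, 7793.67, 208089.13],
    },
    index=[
        "Model 1 (OLS)",
        "Model 2 (OLS)",
        "Model 3 (Random Forest)",
        "Model 4 (RidgeCV)",
    ],
)

# Lowest test RMSE first
print(summary.sort_values("rmse"))
```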