# Pre-Processing and Training Data Development

Here we have three simple goals:

1) Create dummy features for categorical data if necessary
2) Scale numerical variables if necessary
3) Split data into training and test sets for modelling

In [1]:
# Import necessary packages:

import pandas as pd

In [2]:
# First let's load in our data

df = pd.read_csv(r'C:\Users\bronc\Downloads\Capstone 2\Real_Estate_Sales_2001-2018(clean).csv')
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,PropertyType,ResidentialType,Month
0,0,0,110540,2011,2012-04-03,Stamford,56 CHERRY HILL ROAD,795870,690000,0.866976,Residential,Single Family,4
1,1,1,120025,2012,2012-10-05,Greenwich,"78 BALDWIN FARMS SOUTH, GREENW",1925560,3224000,1.674318,Residential,Single Family,10
2,2,3,60173,2006,2006-12-28,Windsor,7 ALFORD DR,189630,309000,1.629489,Residential,Single Family,12
3,3,5,14539,2014,2015-05-28,Milford,170 MEADOWSIDE RD,147340,150000,1.018053,Residential,Single Family,5
4,4,9,160629,2016,2017-01-31,Bridgeport,75 EDGEMOOR RD,163380,180000,1.101726,Residential,Single Family,1
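The `Unnamed: 0` style columns in the output above usually come from a DataFrame index that was written to CSV on an earlier save and then read back as an ordinary column. The sketch below reproduces that on a toy frame (the `toy` data and all variable names are made up for illustration) and shows two common ways to avoid or remove such columns. Whether this is how the columns arose in this particular file is an assumption.

```python
import io
import pandas as pd

# Toy frame standing in for the real data (values are illustrative)
toy = pd.DataFrame({"Town": ["Stamford", "Windsor"], "SaleAmount": [690000, 309000]})

# By default to_csv writes the index as an unnamed first column,
# which read_csv then loads back as 'Unnamed: 0'
reloaded = pd.read_csv(io.StringIO(toy.to_csv()))
print(reloaded.columns.tolist())

# Writing with index=False keeps that extra column out of the file
clean = pd.read_csv(io.StringIO(toy.to_csv(index=False)))
print(clean.columns.tolist())

# For a file that already contains stray columns, drop them by name pattern
tidied = reloaded.loc[:, ~reloaded.columns.str.startswith("Unnamed")]
print(tidied.columns.tolist())
```

The stray columns are harmless here because they fall outside the feature slice used later, but removing them makes positional indexing easier to reason about.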


### Create dummy features

In [3]:
# Looking through the data, Town and ResidentialType are the categorical columns that make the most sense to encode as dummy features

dummy_town = pd.get_dummies(df['Town'])
dummy_res = pd.get_dummies(df['ResidentialType'])

In [4]:
dummy_town.head()

Unnamed: 0,Andover,Ansonia,Ashford,Avon,Barkhamsted,Beacon Falls,Berlin,Bethany,Bethel,Bethlehem,...,Willington,Wilton,Winchester,Windham,Windsor,Windsor Locks,Wolcott,Woodbridge,Woodbury,Woodstock
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
dummy_res.sample(5)

Unnamed: 0,Four Family,Multi Family,Single Family,Three Family,Two Family
105447,0,0,1,0,0
492318,0,0,1,0,0
100563,0,0,1,0,0
362320,0,0,1,0,0
459475,0,0,1,0,0


In [6]:
# With those made and verified to have worked let's put them back with our main dataframe

df = pd.concat([df, dummy_town], axis = 1)
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,...,Willington,Wilton,Winchester,Windham,Windsor,Windsor Locks,Wolcott,Woodbridge,Woodbury,Woodstock
0,0,0,110540,2011,2012-04-03,Stamford,56 CHERRY HILL ROAD,795870,690000,0.866976,...,0,0,0,0,0,0,0,0,0,0
1,1,1,120025,2012,2012-10-05,Greenwich,"78 BALDWIN FARMS SOUTH, GREENW",1925560,3224000,1.674318,...,0,0,0,0,0,0,0,0,0,0
2,2,3,60173,2006,2006-12-28,Windsor,7 ALFORD DR,189630,309000,1.629489,...,0,0,0,0,1,0,0,0,0,0
3,3,5,14539,2014,2015-05-28,Milford,170 MEADOWSIDE RD,147340,150000,1.018053,...,0,0,0,0,0,0,0,0,0,0
4,4,9,160629,2016,2017-01-31,Bridgeport,75 EDGEMOOR RD,163380,180000,1.101726,...,0,0,0,0,0,0,0,0,0,0


In [7]:
df = pd.concat([df, dummy_res], axis = 1)
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,SerialNumber,ListYear,DateRecorded,Town,Address,AssessedValue,SaleAmount,SalesRatio,...,Windsor Locks,Wolcott,Woodbridge,Woodbury,Woodstock,Four Family,Multi Family,Single Family,Three Family,Two Family
0,0,0,110540,2011,2012-04-03,Stamford,56 CHERRY HILL ROAD,795870,690000,0.866976,...,0,0,0,0,0,0,0,1,0,0
1,1,1,120025,2012,2012-10-05,Greenwich,"78 BALDWIN FARMS SOUTH, GREENW",1925560,3224000,1.674318,...,0,0,0,0,0,0,0,1,0,0
2,2,3,60173,2006,2006-12-28,Windsor,7 ALFORD DR,189630,309000,1.629489,...,0,0,0,0,0,0,0,1,0,0
3,3,5,14539,2014,2015-05-28,Milford,170 MEADOWSIDE RD,147340,150000,1.018053,...,0,0,0,0,0,0,0,1,0,0
4,4,9,160629,2016,2017-01-31,Bridgeport,75 EDGEMOOR RD,163380,180000,1.101726,...,0,0,0,0,0,0,0,1,0,0


I ran the two concatenations one at a time so that each `df.head()` could confirm the new columns had been added successfully.
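The same encode-then-concatenate pattern can be checked on a small frame where the result is easy to inspect by eye. This is a minimal sketch: the `toy` frame is made up, and the row-sum check is an extra sanity test that is not part of the notebook above.

```python
import pandas as pd

# Toy frame with the two categorical columns encoded above (values are illustrative)
toy = pd.DataFrame({
    "Town": ["Stamford", "Windsor", "Stamford"],
    "ResidentialType": ["Single Family", "Two Family", "Single Family"],
})

# One indicator column per distinct category value
dummy_town = pd.get_dummies(toy["Town"])
dummy_res = pd.get_dummies(toy["ResidentialType"])

# axis=1 appends the indicators as new columns, aligned on the row index
combined = pd.concat([toy, dummy_town, dummy_res], axis=1)

# Each row should have exactly one town flag and one residential-type flag set
town_ok = (combined[dummy_town.columns].sum(axis=1) == 1).all()
res_ok = (combined[dummy_res.columns].sum(axis=1) == 1).all()
print(combined.columns.tolist(), town_ok, res_ok)
```

On the real data a row-sum other than 1 would point to missing values in the source column, since `get_dummies` leaves all indicators unset for a missing category by default.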

### Scale numerical variables

While we do have three numerical columns that are candidates for scaling, I don't think it's necessary. AssessedValue and SaleAmount are already on similar scales, and SalesRatio needs to stay on its own scale to remain useful. Therefore I won't be using a scaler.
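To make the comparison of scales concrete, the sketch below looks at the ranges of the three columns using only the five rows printed by `df.head()` earlier, so it is not representative of the full dataset. It also shows what standardization would look like if a later, scale-sensitive model called for it. That step is hypothetical and is not applied anywhere in this notebook.

```python
import pandas as pd

# The five rows shown by df.head() above; too few to draw conclusions from
sample = pd.DataFrame({
    "AssessedValue": [795870, 1925560, 189630, 147340, 163380],
    "SaleAmount": [690000, 3224000, 309000, 150000, 180000],
    "SalesRatio": [0.866976, 1.674318, 1.629489, 1.018053, 1.101726],
})

# Compare the ranges of the candidate columns side by side
print(sample.agg(["min", "max"]))

# Hypothetical standardization: subtract the mean, divide by the
# population standard deviation (ddof=0), column by column
standardized = (sample - sample.mean()) / sample.std(ddof=0)
print(standardized.round(2))
```

If scaling were ever adopted, the mean and standard deviation would normally be computed on the training split only and then reused for the test split.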

### Split data into training and test sets for modelling

In [8]:
#Import the necessary package
from sklearn.model_selection import train_test_split

We'll see how well our data predicts SaleAmount, as that has the most interesting real-world application.

In [9]:
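# Features: columns 7 through 185, starting at AssessedValue. The slice end is exclusive, so the final column of df ('Two Family') is not included
X = df.iloc[:, 7:186]
# Target: column 8, SaleAmount
y = df.iloc[:, 8]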

In [10]:
# Remove the target from the features; reassigning avoids an in-place change on a slice of df
X = X.drop(columns = 'SaleAmount')
X.head()

Unnamed: 0,AssessedValue,SalesRatio,PropertyType,ResidentialType,Month,Andover,Ansonia,Ashford,Avon,Barkhamsted,...,Windsor,Windsor Locks,Wolcott,Woodbridge,Woodbury,Woodstock,Four Family,Multi Family,Single Family,Three Family
0,795870,0.866976,Residential,Single Family,4,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,1925560,1.674318,Residential,Single Family,10,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,189630,1.629489,Residential,Single Family,12,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
3,147340,1.018053,Residential,Single Family,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,163380,1.101726,Residential,Single Family,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [11]:
y.head()

0     690000
1    3224000
2     309000
3     150000
4     180000
Name: SaleAmount, dtype: int64

In [12]:
X.shape

(574715, 178)

In [13]:
y.shape

(574715,)
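The positional slices in `In [9]` depend on the exact column order of `df`. A name-based selection is one alternative that keeps working if columns are added or reordered. The sketch below uses a toy frame with a subset of the real column names; the `non_features` list is hypothetical and would need to be adapted to the real frame.

```python
import pandas as pd

# Toy stand-in for df: a subset of its columns, values taken from the head() output
toy = pd.DataFrame({
    "Address": ["56 CHERRY HILL ROAD", "7 ALFORD DR"],
    "AssessedValue": [795870, 189630],
    "SaleAmount": [690000, 309000],
    "SalesRatio": [0.866976, 1.629489],
    "Month": [4, 12],
    "Stamford": [1, 0],
    "Windsor": [0, 1],
})

target = "SaleAmount"
non_features = ["Address", target]  # hypothetical list of columns to leave out

# drop returns a new frame, so the original is left untouched
X_toy = toy.drop(columns=non_features)
y_toy = toy[target]

print(X_toy.columns.tolist())
print(y_toy.tolist())
```

Selecting by name also makes it obvious which text columns remain in the feature set, which matters because most estimators expect numeric input.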

Perform the train/test split. I decided to hold out 30% of the data as the test set.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 123)

Verify that the split worked

In [15]:
X_train.shape

(402300, 178)

In [16]:
X_test.shape

(172415, 178)
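Beyond comparing shapes, a few extra checks can confirm that a split neither loses nor duplicates rows and that features stay aligned with their targets. The sketch below runs them on a ten-row toy frame (the data and the `_toy`, `_tr`, `_te` names are made up for illustration) with the same `test_size` and `random_state` as above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten toy rows; column names and values are made up for illustration
X_toy = pd.DataFrame({"feature": range(10)})
y_toy = pd.Series(range(10), name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123
)

# Together the two parts should account for every row exactly once
rows_preserved = len(X_tr) + len(X_te) == len(X_toy)
no_overlap = set(X_tr.index).isdisjoint(X_te.index)

# Features and targets should carry the same row labels within each part
aligned = X_tr.index.equals(y_tr.index) and X_te.index.equals(y_te.index)

print(X_tr.shape, X_te.shape, rows_preserved, no_overlap, aligned)
```

The same three checks can be run on `X_train`, `X_test`, `y_train`, and `y_test` from the cells above.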

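The split looks good! The 402,300 training rows and 172,415 test rows add up to the original 574,715, with 30% held out for testing. Next we'll start modelling our data.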