# Prepping the Data
We need to:
- [x] remove columns and rows with a high share of nulls
- [x] remove unnecessary columns
- [ ] remove outliers
- [ ] impute/remove leftover nulls
- [ ] create new features
- [ ] reorder/rename columns

Preprocessing
- [ ] scale the data
- [ ] split into train, validate, test (stratify)

In [1]:
import acquire

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = acquire.get_home_data()

In [3]:
df

Unnamed: 0,id,parcelid,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,...,structuretaxvaluedollarcnt,taxvaluedollarcnt,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock,logerror,transactiondate
0,1,10759547,,,,0.0,0.0,,,,...,,27516.0,2015.0,27516.0,,,,,0.055619,2017-01-01
1,6,10933547,,,,0.0,0.0,,,,...,404013.0,563029.0,2016.0,159016.0,6773.34,,,,-0.001011,2017-01-01
2,14,11142747,,,,0.0,0.0,,,,...,,4265.0,2015.0,4265.0,,,,,-0.008935,2017-01-02
3,15,11193347,,,,0.0,0.0,,,,...,,10.0,2016.0,10.0,,,,,0.008669,2017-01-02
4,16,11215747,,,,0.0,0.0,,,,...,,10.0,2016.0,10.0,,,,,-0.021896,2017-01-02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70359,77609,11212539,1.0,,,3.0,4.0,,8.0,3.0,...,129566.0,162019.0,2016.0,32453.0,2860.33,,,6.037911e+13,0.020615,2017-09-20
70360,77610,11212639,1.0,,,3.0,4.0,,8.0,3.0,...,100744.0,125923.0,2016.0,25179.0,2394.26,,,6.037911e+13,0.013209,2017-09-21
70361,77611,11212962,1.0,,,2.0,3.0,,6.0,2.0,...,149241.0,198988.0,2016.0,49747.0,3331.81,,,6.037911e+13,0.037129,2017-09-21
70362,77612,11213162,1.0,,,3.0,3.0,,8.0,3.0,...,118900.0,148600.0,2016.0,29700.0,2510.53,,,6.037911e+13,0.007204,2017-09-25


# Remove Nulls in Columns/Rows

In [4]:
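# thresh is the minimum number of non-null values a column needs to be kept,
# so this keeps columns that are at least 75 percent non-null
thresh_hold = df.shape[0] * .75

# remove columns with more than 25 percent nulls
df = df.dropna(axis=1, thresh=thresh_hold)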

df.columns

Index(['id', 'parcelid', 'bathroomcnt', 'bedroomcnt', 'calculatedbathnbr',
       'calculatedfinishedsquarefeet', 'finishedsquarefeet12', 'fips',
       'fullbathcnt', 'latitude', 'longitude', 'lotsizesquarefeet',
       'propertycountylandusecode', 'propertylandusetypeid',
       'rawcensustractandblock', 'regionidcity', 'regionidcounty',
       'regionidzip', 'roomcnt', 'yearbuilt', 'structuretaxvaluedollarcnt',
       'taxvaluedollarcnt', 'assessmentyear', 'landtaxvaluedollarcnt',
       'taxamount', 'censustractandblock', 'logerror', 'transactiondate'],
      dtype='object')

In [5]:
# remove other columns we don't need, like multiple sqft
df = df.drop(columns=['finishedsquarefeet12'])

In [6]:
# keep rows that are at least 75 percent non-null
thresh_hold = df.shape[1] * .75

# remove rows with more than 25 percent nulls
df = df.dropna(axis=0, thresh=thresh_hold)
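
The `thresh` argument of `dropna` is the minimum number of non-null values a column (or row) must have to be kept, which is easy to misread as a null count. A minimal sketch on a hypothetical toy frame (the name `toy` and its values are made up for illustration) shows the behavior; `math.ceil` is used here only to make the cutoff an explicit integer:

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy frame: 4 rows, 3 columns
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],           # 4 non-null values
    "b": [1.0, np.nan, 3.0, 4.0],        # 3 non-null values
    "c": [np.nan, np.nan, np.nan, 4.0],  # 1 non-null value
})

# Columns need at least 75 percent of the rows to be non-null (3 of 4)
col_cutoff = math.ceil(toy.shape[0] * 0.75)
kept_cols = toy.dropna(axis=1, thresh=col_cutoff)
print(list(kept_cols.columns))  # ['a', 'b']

# Rows need at least 75 percent of the columns to be non-null (3 of 3 here)
row_cutoff = math.ceil(toy.shape[1] * 0.75)
kept_rows = toy.dropna(axis=0, thresh=row_cutoff)
print(list(kept_rows.index))  # [3]
```
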

# Removing Outliers 
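
One common way to handle this step is an interquartile-range (IQR) fence. The sketch below is an assumption about how it could be done, not the method used in this notebook; the helper `remove_outliers_iqr`, the multiplier `k=1.5`, and the toy `sqft` values are all hypothetical:

```python
import pandas as pd


def remove_outliers_iqr(frame, columns, k=1.5):
    """Drop rows whose value in any listed column falls outside the IQR fences."""
    keep = pd.Series(True, index=frame.index)
    for col in columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        # Nulls are kept here so they can be imputed or dropped in a later step
        keep &= frame[col].between(lower, upper) | frame[col].isna()
    return frame[keep]


# Hypothetical toy data with one extreme value
toy = pd.DataFrame({"sqft": [900, 1000, 1100, 1200, 1300, 25000]})
trimmed = remove_outliers_iqr(toy, ["sqft"])
print(len(toy), len(trimmed))  # 6 5
```
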

# Exploring leftover nulls to determine if dropping or imputing with mean, median, mode

In [7]:
df.isnull().sum()

id                                 0
parcelid                           0
bathroomcnt                        0
bedroomcnt                         0
calculatedbathnbr                980
calculatedfinishedsquarefeet     687
fips                               0
fullbathcnt                      980
latitude                           0
longitude                          0
lotsizesquarefeet               5805
propertycountylandusecode          0
propertylandusetypeid              0
rawcensustractandblock             0
regionidcity                    1251
regionidcounty                     0
regionidzip                      161
roomcnt                            0
yearbuilt                        703
structuretaxvaluedollarcnt       141
taxvaluedollarcnt                  5
assessmentyear                     0
landtaxvaluedollarcnt            529
taxamount                        103
censustractandblock              622
logerror                           0
transactiondate                    0
dtype: int64

### Calculatedbathnbr will be dealt with when creating features later
### Fullbathcnt

In [9]:
df.fullbathcnt.value_counts()

2.0     34734
3.0     15312
1.0     13778
4.0      2853
5.0       955
6.0       334
7.0       128
8.0        40
9.0        26
10.0        9
11.0        4
13.0        1
19.0        1
12.0        1
20.0        1
Name: fullbathcnt, dtype: int64

- Fullbathcnt has 980 nulls
- The most common value, 2, has 34,734 observations
- Will impute the nulls with this mode

In [10]:
mode = df.fullbathcnt.mode()[0]

df['fullbathcnt'] = df.fullbathcnt.fillna(mode)
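
The mode imputation above can be checked on a small example. This is a minimal sketch on a hypothetical Series (the values are made up); it relies on `Series.mode()` ignoring nulls by default and returning the most frequent value first:

```python
import numpy as np
import pandas as pd

# Hypothetical bath counts with two missing values
baths = pd.Series([2.0, 2.0, 3.0, 1.0, np.nan, 2.0, np.nan], name="fullbathcnt")

nulls_before = baths.isna().sum()
most_common = baths.mode()[0]  # nulls are skipped when finding the mode
filled = baths.fillna(most_common)

print(nulls_before, most_common, filled.isna().sum())  # 2 2.0 0
```

Comparing `value_counts()` before and after the fill is a quick sanity check: the count for the mode should grow by exactly the number of nulls.
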

### Calculated Finished SQFT

In [13]:
df.calculatedfinishedsquarefeet.value_counts()

1200.0     159
1080.0     135
1120.0     126
960.0      125
1040.0     124
          ... 
5088.0       1
10937.0      1
6493.0       1
4976.0       1
7182.0       1
Name: calculatedfinishedsquarefeet, Length: 4777, dtype: int64

- there are 687 null values for calculated finished sqft
- the most common value is 1,200 sqft, with only 159 observations
- no single value dominates, so we will impute with the mean instead of the mode

In [17]:
# calculate mean of column
mean = df.calculatedfinishedsquarefeet.mean()

# fill nulls in column with mean
df['calculatedfinishedsquarefeet'] = df.calculatedfinishedsquarefeet.fillna(mean)
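
Because a few very large homes can pull the mean upward, it may be worth comparing the mean and median before settling on a fill value. The following is only a sketch on hypothetical numbers; the notebook itself uses the mean:

```python
import numpy as np
import pandas as pd

# Hypothetical square footage with one very large home and one null
sqft = pd.Series([1000.0, 1200.0, 1400.0, np.nan, 10000.0])

mean_value = sqft.mean()      # nulls are skipped by default
median_value = sqft.median()
print(mean_value, median_value)  # 3400.0 1300.0

# Fill with the mean, as the notebook does
filled_mean = sqft.fillna(mean_value)
```

When the two statistics are far apart, as in this toy example, the median is usually the less distorted choice.
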

### Full bath count

In [18]:
df.fullbathcnt.value_counts()

2.0     35714
3.0     15312
1.0     13778
4.0      2853
5.0       955
6.0       334
7.0       128
8.0        40
9.0        26
10.0        9
11.0        4
13.0        1
19.0        1
12.0        1
20.0        1
Name: fullbathcnt, dtype: int64

In [19]:
df.fullbathcnt.mean()

2.1831629480746706

- This column originally had 980 nulls, which were already filled with the mode (2) above
- The value of 2 now has over 35,000 observations
- Both the mode and mean are about 2
- Filling with 2 again below leaves the column unchanged

In [21]:
df['fullbathcnt'] = df.fullbathcnt.fillna(2)

### lotsizesquarefeet

In [22]:
df.lotsizesquarefeet.value_counts()

6000.0     1135
5000.0      447
7200.0      409
7000.0      277
6500.0      267
           ... 
12867.0       1
21799.0       1
23519.0       1
15285.0       1
47132.0       1
Name: lotsizesquarefeet, Length: 17940, dtype: int64

- There are 5,805 nulls in this column
- The most common value (6,000) has only 1,135 observations
- Will remove the column instead of imputing

In [25]:
df = df.drop('lotsizesquarefeet', axis=1)

### Region ID City
- has 1,251 nulls
- we already have latitude and longitude with no nulls for location
- will drop this column

In [28]:
df = df.drop('regionidcity',axis=1)

### Region ID Zip
- will replace nulls with 90000 to represent an unknown zip code (using 0 would create outliers)
- could use latitude/longitude or clustering to recover the actual values if necessary
- however, with only 161 nulls this is not significant enough to worry about

In [29]:
df.regionidzip.value_counts()

97118.0    588
96987.0    560
96368.0    530
96193.0    523
97319.0    487
          ... 
96039.0      6
96329.0      5
96226.0      3
96467.0      1
97177.0      1
Name: regionidzip, Length: 386, dtype: int64

In [31]:
df['regionidzip'] = df.regionidzip.fillna(90_000)
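
A sentinel such as 90000 hides which rows were originally missing. One option, sketched here on hypothetical data, is to record a flag before filling; the column name `regionidzip_missing` is an assumption and not part of the original dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical zip codes with one missing value
toy = pd.DataFrame({"regionidzip": [97118.0, np.nan, 96987.0]})

# Record which rows were missing before the sentinel overwrites them
toy["regionidzip_missing"] = toy["regionidzip"].isna()
toy["regionidzip"] = toy["regionidzip"].fillna(90_000)

print(toy["regionidzip"].tolist())          # [97118.0, 90000.0, 96987.0]
print(toy["regionidzip_missing"].tolist())  # [False, True, False]
```
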

### Year Built
703 nulls

In [33]:
df.yearbuilt.value_counts(), df.yearbuilt.mean()

(1955.0    2182
 1950.0    1856
 1954.0    1849
 1956.0    1715
 1953.0    1600
           ... 
 1891.0       1
 1889.0       1
 1886.0       1
 1880.0       1
 1862.0       1
 Name: yearbuilt, Length: 133, dtype: int64,
 1965.5359657580273)

### structuretaxvaluedollarcnt
141 nulls

In [35]:
df.structuretaxvaluedollarcnt.value_counts(), df.structuretaxvaluedollarcnt.mean()

(100000.0    34
 105000.0    32
 88000.0     29
 200000.0    29
 104023.0    29
             ..
 302595.0     1
 78443.0      1
 867330.0     1
 356352.0     1
 5.0          1
 Name: structuretaxvaluedollarcnt, Length: 48602, dtype: int64,
 176124.63272284687)

### taxvaluedollarcnt
5 nulls

### landtaxvaluedollarcnt
529 nulls

### taxamount
103 nulls

### censustractandblock
622 nulls
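
For the columns above that have no decision yet, a null summary with percentages can help decide between dropping and imputing. This is a minimal sketch; the helper `null_summary` and the toy values are hypothetical:

```python
import numpy as np
import pandas as pd


def null_summary(frame):
    """Return null counts and percentages for columns that have any nulls."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "n_null": counts,
        "pct_null": (counts / len(frame) * 100).round(2),
    })
    return summary[summary["n_null"] > 0].sort_values("n_null", ascending=False)


# Hypothetical toy frame
toy = pd.DataFrame({
    "fips": [6037.0, 6037.0, 6059.0, 6111.0],
    "yearbuilt": [1955.0, np.nan, 1950.0, 1954.0],
    "taxamount": [np.nan, np.nan, 100.0, 200.0],
})
summary = null_summary(toy)
print(summary)
```
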

# Create Features

### calculate our own bed_plus_bath
- we have a column from the database (calculatedbathnbr), however it has 980 nulls
- bathroom and bedroom count by themselves have no nulls
- calculate our own and drop the original

In [None]:
df['bed_plus_bath'] = df.bathroomcnt + df.bedroomcnt
df = df.drop('calculatedbathnbr',axis=1)
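
The same feature step can be verified on a small frame. This is a sketch on hypothetical values, showing that the new column has no nulls because both inputs are complete:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; calculatedbathnbr has a null, the two inputs do not
toy = pd.DataFrame({
    "bathroomcnt": [2.0, 3.0, 1.5],
    "bedroomcnt": [3.0, 4.0, 2.0],
    "calculatedbathnbr": [2.0, np.nan, 1.5],
})

toy["bed_plus_bath"] = toy["bathroomcnt"] + toy["bedroomcnt"]
toy = toy.drop(columns=["calculatedbathnbr"])

print(toy["bed_plus_bath"].tolist())  # [5.0, 7.0, 3.5]
```
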