# Scrub the Data
The scrubbing phase of ROSEMED is very important, especially for linear regression models, which require numerical inputs and generally behave better when those inputs are roughly normally distributed. We therefore need to eliminate missing or faulty data, either by filling values in or by dropping them.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/interim/combined_df.csv', index_col=0)

Time to check out our dataset and see what can be done to clean it up and put it in the right format.

In [2]:
df.head()

Unnamed: 0,id,school_kms,school_mins,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,...,zipcode,lat,long,sqft_living15,sqft_lot15,median_household_income,median_home_value,walk_score,transit_score,bike_score
0,1000102,20.761,26.117,9/16/2014,280000.0,6,3.0,2400,9373,2.0,...,98002,47.3262,-122.214,2060,7316,43568,167100,49,,72
2,1000102,20.761,26.117,4/22/2015,300000.0,6,3.0,2400,9373,2.0,...,98002,47.3262,-122.214,2060,7316,43568,167100,49,,72
8,1200019,18.352,26.15,5/8/2014,647500.0,4,1.75,2060,26036,1.0,...,98166,47.4444,-122.351,2590,21891,63320,386900,10,,18
9,1200021,18.372,25.7,8/11/2014,400000.0,3,1.0,1460,43000,1.0,...,98166,47.4434,-122.347,2250,20023,63320,386900,9,,24
10,2800031,18.673,17.917,4/1/2015,235000.0,3,1.0,1430,7599,1.5,...,98168,47.4783,-122.265,1290,10320,49233,240000,34,,25


## Dealing with NaNs

In [3]:
df.isna().sum()

id                            0
school_kms                    0
school_mins                   0
date                          0
price                         0
bedrooms                      0
bathrooms                     0
sqft_living                   0
sqft_lot                      0
floors                        0
waterfront                 2376
view                         63
condition                     0
grade                         0
sqft_above                    0
sqft_basement                 0
yr_built                      0
yr_renovated               3842
zipcode                       0
lat                           0
long                          0
sqft_living15                 0
sqft_lot15                    0
median_household_income       0
median_home_value             0
walk_score                    0
transit_score                 0
bike_score                    0
dtype: int64

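Things to change:
1. Deal with NaNs:
    - Waterfront: If we assume a missing value means there is no waterfront, we can fill with zeros.
    - View: Fill with the mode (the code below uses 0).
    - Yr_renovated: Will drop this column. With 3842 missing values and 19k zeros, it carries very little information.
    - Scores: isna() reports no missing values, but some entries are the string ' None'. These are replaced with a -1 placeholder during the type conversion further down.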
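
The plan above can be tried on a small frame before touching the real data. This is only a sketch: the `toy` frame and its values are invented for illustration, and looking up the mode with `mode()[0]` is one possible way to avoid hard-coding the fill value.

```python
import numpy as np
import pandas as pd

# Invented toy data standing in for the three columns discussed above.
toy = pd.DataFrame({
    'waterfront': [0.0, np.nan, 1.0, np.nan],
    'view': [0.0, 0.0, np.nan, 2.0],
    'yr_renovated': [0.0, np.nan, 1991.0, 0.0],
})

# Share of missing values per column, to help decide between fill and drop.
missing_share = toy.isna().mean()
print(missing_share)

# Treat a missing waterfront flag as "no waterfront".
toy['waterfront'] = toy['waterfront'].fillna(0)

# Fill view with its most frequent value instead of a hard-coded constant.
view_mode = toy['view'].mode()[0]
toy['view'] = toy['view'].fillna(view_mode)

# Drop the column judged too sparse to be useful.
toy = toy.drop(columns=['yr_renovated'])
```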

In [4]:
df['waterfront'] = df['waterfront'].fillna(0)

In [5]:
df['waterfront'].isna().sum()

0

In [6]:
df['view'] = df['view'].fillna(0)

In [7]:
df['view'].isna().sum()

0

## Dropping Columns
There are some columns we cannot use, so we drop them here.
Date gives the day the house was sold, somewhere between 2014 and 2015. That range is too narrow to show a trend, and looking at the month sold did not reveal any useful pattern.

Top_schools was left over from finding the distance/time from house to school.

Waterfront has a lot of missing values. When I plotted the flagged homes on the map, they were only a fraction of the homes that were actually located on the water.
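
One way to check whether the month of sale carries any signal is to compare the median price per month. The sketch below is illustrative only: it uses the five rows shown in the `head()` output rather than the full dataset, and the `month` column is a name introduced here.

```python
import pandas as pd

# The five date/price pairs visible in the head() output above.
toy = pd.DataFrame({
    'date': ['9/16/2014', '4/22/2015', '5/8/2014', '8/11/2014', '4/1/2015'],
    'price': [280000.0, 300000.0, 647500.0, 400000.0, 235000.0],
})

# Parse the M/D/YYYY strings and pull out the month of sale.
toy['month'] = pd.to_datetime(toy['date'], format='%m/%d/%Y').dt.month

# Median price per month; a flat profile would support dropping the column.
monthly_median = toy.groupby('month')['price'].median()
print(monthly_median)
```

With only five rows this says nothing about the real data; the same two lines applied to the full frame would be the actual check.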

In [8]:
to_drop = ['date', 'yr_renovated', 'waterfront']

In [9]:
df = df.drop(to_drop, axis=1)

In [10]:
df

Unnamed: 0,id,school_kms,school_mins,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,view,...,zipcode,lat,long,sqft_living15,sqft_lot15,median_household_income,median_home_value,walk_score,transit_score,bike_score
0,1000102,20.761,26.117,280000.0,6,3.00,2400,9373,2.0,0.0,...,98002,47.3262,-122.214,2060,7316,43568,167100,49,,72
2,1000102,20.761,26.117,300000.0,6,3.00,2400,9373,2.0,0.0,...,98002,47.3262,-122.214,2060,7316,43568,167100,49,,72
8,1200019,18.352,26.150,647500.0,4,1.75,2060,26036,1.0,0.0,...,98166,47.4444,-122.351,2590,21891,63320,386900,10,,18
9,1200021,18.372,25.700,400000.0,3,1.00,1460,43000,1.0,0.0,...,98166,47.4434,-122.347,2250,20023,63320,386900,9,,24
10,2800031,18.673,17.917,235000.0,3,1.00,1430,7599,1.5,0.0,...,98168,47.4783,-122.265,1290,10320,49233,240000,34,,25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22666,9842300095,3.102,8.817,365000.0,5,2.00,1600,4168,1.5,0.0,...,98126,47.5297,-122.381,1190,4168,66408,358100,51,,56
22667,9842300485,3.038,7.783,380000.0,2,1.00,1040,7372,1.0,0.0,...,98126,47.5285,-122.378,1930,5150,66408,358100,56,,59
22668,9842300540,2.952,8.050,339000.0,3,1.00,1100,4128,1.0,0.0,...,98126,47.5296,-122.379,1510,4538,66408,358100,54,,53
22669,9895000040,1.302,3.150,399900.0,2,1.75,1410,1005,1.5,0.0,...,98027,47.5446,-122.018,1440,1188,100644,478800,72,,73


## Dealing with non-numerical data types

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21597 entries, 0 to 22670
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       21597 non-null  int64  
 1   school_kms               21597 non-null  float64
 2   school_mins              21597 non-null  float64
 3   price                    21597 non-null  float64
 4   bedrooms                 21597 non-null  int64  
 5   bathrooms                21597 non-null  float64
 6   sqft_living              21597 non-null  int64  
 7   sqft_lot                 21597 non-null  int64  
 8   floors                   21597 non-null  float64
 9   view                     21597 non-null  float64
 10  condition                21597 non-null  int64  
 11  grade                    21597 non-null  int64  
 12  sqft_above               21597 non-null  int64  
 13  sqft_basement            21597 non-null  object 
 14  yr_built              

In [12]:
# fill in '?' with 0 and change type to float
df['sqft_basement'] = df['sqft_basement'].replace({'?': 0}).astype(float)

Rather than treating the unknown basements as 0, we should be able to fill them in, since sqft_basement = sqft_living - sqft_above.

In [13]:
df['sqft_basement'] = df['sqft_living'] - df['sqft_above']

In [14]:
(df['sqft_basement'] == 0).sum()

13110

Doesn't make a huge difference.
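
To see how much the recomputation changes, the placeholder version and the derived version can be compared row by row. This is a sketch on invented values; `as_float`, `derived` and `changed` are names introduced here, not part of the notebook.

```python
import pandas as pd

# Invented toy rows; '?' marks an unknown basement size, as in the raw column.
toy = pd.DataFrame({
    'sqft_living': [2400, 2060, 1460, 1430],
    'sqft_above': [2400, 1500, 1460, 1000],
    'sqft_basement': ['0.0', '?', '0.0', '430.0'],
})

# Placeholder approach: map '?' to zero, then cast to float.
as_float = toy['sqft_basement'].replace({'?': '0'}).astype(float)

# Derived approach: basement is whatever living area is not above ground.
derived = toy['sqft_living'] - toy['sqft_above']

# Count the rows where the two approaches disagree.
changed = int((as_float != derived).sum())
print(f'{changed} row(s) changed; zeros before: {(as_float == 0).sum()}, '
      f'after: {(derived == 0).sum()}')
```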

Geometry will stay as an object. We will be using that for graphing, and will drop it before any modeling.

In [15]:
# replace the ' None' strings and any NaNs with a -1 placeholder, then cast to float
for col in ['walk_score', 'transit_score', 'bike_score']:
    df[col] = df[col].replace({' None': -1}).astype(float).fillna(-1)
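
An alternative to matching the exact string `' None'` is `pd.to_numeric` with `errors='coerce'`, which turns anything non-numeric into NaN. The sketch below is a possible variant, not what the notebook does; the toy values are invented, and whether to keep NaN or use the -1 placeholder is left as a choice.

```python
import numpy as np
import pandas as pd

# Invented toy scores mixing numeric strings, ' None' and a real NaN.
toy = pd.DataFrame({
    'walk_score': ['49', ' None', '10'],
    'transit_score': [np.nan, ' None', '34'],
    'bike_score': ['72', '18', ' None'],
})

score_cols = ['walk_score', 'transit_score', 'bike_score']
for col in score_cols:
    # Anything that cannot be parsed as a number (including ' None') becomes NaN.
    toy[col] = pd.to_numeric(toy[col], errors='coerce')

# Either keep the NaNs, or apply the same -1 placeholder as the cell above.
with_sentinel = toy[score_cols].fillna(-1)
print(with_sentinel)
```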

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21597 entries, 0 to 22670
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       21597 non-null  int64  
 1   school_kms               21597 non-null  float64
 2   school_mins              21597 non-null  float64
 3   price                    21597 non-null  float64
 4   bedrooms                 21597 non-null  int64  
 5   bathrooms                21597 non-null  float64
 6   sqft_living              21597 non-null  int64  
 7   sqft_lot                 21597 non-null  int64  
 8   floors                   21597 non-null  float64
 9   view                     21597 non-null  float64
 10  condition                21597 non-null  int64  
 11  grade                    21597 non-null  int64  
 12  sqft_above               21597 non-null  int64  
 13  sqft_basement            21597 non-null  int64  
 14  yr_built              

## Dealing with outliers

Getting rid of homes with prices of 700,000 or more

In [17]:
df = df[(df['price'] < 700000)] # keep only houses priced under 700,000

In [18]:
df = df[df['bedrooms'] <= 4] # remove any houses with more than 4 bedrooms

In [19]:
df[(df['bedrooms'] == 1) & (df['price'] > 1000000)] # check for odd outliers (none remain after the price filter)

Unnamed: 0,id,school_kms,school_mins,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,view,...,zipcode,lat,long,sqft_living15,sqft_lot15,median_household_income,median_home_value,walk_score,transit_score,bike_score


In [20]:
df[(df['bedrooms'] == 2) & (df['price'] > 2000000)] # check for odd outliers (none remain after the price filter)

Unnamed: 0,id,school_kms,school_mins,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,view,...,zipcode,lat,long,sqft_living15,sqft_lot15,median_household_income,median_home_value,walk_score,transit_score,bike_score


In [21]:
df = df[(df['bathrooms'] <= 5)] # remove houses with more than 5 bathrooms

In [22]:
df = df[df['sqft_living15'].between(500, 4000)] # keep rows with sqft_living15 between 500 and 4000 (inclusive)

In [23]:
df = df[(df['sqft_above'] <= 4000)] # remove homes with more than 4000 sqft above ground

In [24]:
df = df[(df['sqft_lot'] <= 500000)] # remove homes with a lot larger than 500,000 sqft

In [25]:
df = df[(df['floors'] <= 3)] # remove homes with more than 3 floors

In [26]:
df = df[(df['school_kms'] < 40)] # keep homes less than 40 km from the school

In [27]:
df = df[(df['school_mins'] < 40)] # keep homes less than 40 minutes from the school

In [28]:
df = df[(df['grade'] >= 4) & (df['grade'] <= 10)]

In [29]:
df = df[(df['median_household_income'] < 150000)]

In [30]:
df = df[(df['median_home_value'] > 200000) & (df['median_home_value'] < 700000)]

In [31]:
# keep scores strictly between 0 and 100; this also drops the -1 placeholders set above
df = df[(df['walk_score'] < 100) & (df['walk_score'] > 0)]
df = df[(df['bike_score'] < 100) & (df['bike_score'] > 0)]
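
The filters above could also be collected in one place, which keeps the thresholds next to each other and reports how many rows each one removes. This is a sketch under assumptions: `apply_bounds` is a hypothetical helper, its bounds are inclusive on both sides (the cells above mix strict and inclusive comparisons), and the toy rows are invented.

```python
import pandas as pd

def apply_bounds(frame, bounds):
    """Keep rows inside inclusive (low, high) bounds; None means unbounded."""
    kept = frame
    for col, (low, high) in bounds.items():
        before = len(kept)
        mask = pd.Series(True, index=kept.index)
        if low is not None:
            mask &= kept[col] >= low
        if high is not None:
            mask &= kept[col] <= high
        kept = kept[mask]
        # Report the effect of each filter so no threshold removes rows silently.
        print(f'{col}: removed {before - len(kept)} rows')
    return kept

# Invented toy rows for illustration.
toy = pd.DataFrame({
    'price': [280000.0, 647500.0, 1200000.0, 400000.0],
    'bedrooms': [6, 4, 3, 3],
    'floors': [2.0, 1.0, 3.5, 1.0],
})

bounds = {'price': (None, 699999), 'bedrooms': (None, 4), 'floors': (None, 3)}
filtered = apply_bounds(toy, bounds)
```

Because the filters run in sequence, the per-column counts depend on their order: a row already removed by the price bound is not counted again for floors.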

In [32]:
df.to_csv('../data/processed/housing_data.csv')