# Initial Imports and Data Cleaning

## Import Libraries and Custom Functions

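The following cell imports our custom functions from `Mod1_Functions.py`.

We also import pandas and numpy. matplotlib, seaborn, and scikit-learn are not imported explicitly here; names from those libraries are available only if `Mod1_Functions` exposes them through the wildcard import.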

In [1]:
from Mod1_Functions import *
import pandas as pd
import numpy as np

## Import Raw Dataset

We load the CSV data into a pandas DataFrame with `pandas.read_csv`.

In [2]:
df_raw = pd.read_csv('kc_house_data.csv')

## Clean Data Using Custom Function

Our `clean_dataframe()` function takes two inputs: a dataframe and a dictionary of adjustments.

We first define the `data_adjustments` dictionary, whose keys are the fields we want to change and whose values are lists of adjustment parameters.

The parameters in each list are: `[datatype, value to replace, value to replace with, replacement array]`

- datatype: must be a valid data type
- value to replace: a single value, which can be a string, an integer, or `np.nan`
- value to replace with: either a single value or a list of strings containing the names of other columns in the dataframe (see replacement array below)
- replacement array: a list of floats or integers, each of which is multiplied by the associated column in the "value to replace with" list; for example, `[1, -1]` paired with `['sqft_living', 'sqft_above']` corresponds to `sqft_living - sqft_above`

In the dictionary below, `None` is used wherever a parameter does not apply.

In [3]:
data_adjustments = {'date': ['datetime64', None, None, None], 
                    'bedrooms': [None, 33, 4, None], 
                    'waterfront': [str, np.nan, 'missing', None], 
                    'view': [str, np.nan, 0, None],
                    'sqft_basement': [float, '?', ['sqft_living','sqft_above'], [1, -1]]
                   }

In [4]:
df_clean = clean_dataframe(df_raw, data_adjustments)
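
The source of `clean_dataframe()` lives in `Mod1_Functions.py` and is not shown here. As a rough illustration only, the following sketch shows one way a function could interpret the four-element adjustment lists described above. The name `clean_dataframe_sketch`, the order of operations (replace first, then cast), and the use of `pd.to_datetime` for datetime types are all assumptions, not the author's implementation.

```python
import numpy as np
import pandas as pd


def clean_dataframe_sketch(df, adjustments):
    """Hypothetical stand-in for clean_dataframe(); the real one may differ."""
    out = df.copy()
    for col, (dtype, old, new, weights) in adjustments.items():
        if old is not None:
            # NaN never equals itself, so it needs isna() instead of ==.
            if isinstance(old, float) and np.isnan(old):
                mask = out[col].isna()
            else:
                mask = out[col] == old
            if isinstance(new, list):
                # Assumed meaning: weighted sum of the named columns,
                # e.g. 1 * sqft_living + (-1) * sqft_above.
                fill = sum(w * out[c] for c, w in zip(new, weights))
            else:
                fill = new
            # Keep existing values where mask is False, use fill elsewhere.
            out[col] = out[col].where(~mask, fill)
        if dtype is not None:
            if isinstance(dtype, str) and dtype.startswith('datetime'):
                out[col] = pd.to_datetime(out[col])
            else:
                out[col] = out[col].astype(dtype)
    return out


# Tiny made-up frame, only to exercise the sketch.
demo = pd.DataFrame({
    'date': ['2014-10-13', '2015-02-25'],
    'bedrooms': [3, 33],
    'waterfront': [np.nan, 1.0],
    'sqft_living': [1000, 2000],
    'sqft_above': [800, 1900],
    'sqft_basement': ['?', '100.0'],
})
demo_adjustments = {
    'date': ['datetime64', None, None, None],
    'bedrooms': [None, 33, 4, None],
    'waterfront': [str, np.nan, 'missing', None],
    'sqft_basement': [float, '?', ['sqft_living', 'sqft_above'], [1, -1]],
}
cleaned = clean_dataframe_sketch(demo, demo_adjustments)
print(cleaned)
```

Under these assumptions, the `'?'` in `sqft_basement` is filled with `sqft_living - sqft_above` for that row before the column is cast to float.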

## Add Features Calculated from Other Columns

We derive three fields from the date of sale: day of week, day of month, and month.

We also create a custom binned variable `'yr_renovated_cat'` which categorizes whether the house has been renovated and whether that renovation was recent.

In [13]:
# The .dt accessor extracts date parts from a datetime64 column directly
df_clean['dayofweek'] = df_clean['date'].dt.dayofweek
df_clean['month'] = df_clean['date'].dt.month
df_clean['day'] = df_clean['date'].dt.day


# Set the number of years within which a renovation counts as recent
n_years = 15
df_clean['yr_renovated_cat'] = df_clean['yr_renovated'].apply(renovated_cat, n_years=n_years)
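
The `renovated_cat` helper comes from `Mod1_Functions.py` and its body is not shown. The sketch below is one plausible shape for such a binning function. The name `renovated_cat_sketch`, the three category labels, the treatment of `0` and missing values as "not renovated", and the `reference_year` parameter are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def renovated_cat_sketch(yr_renovated, n_years=15, reference_year=2015):
    """Hypothetical binning of a renovation year into three categories."""
    # Assumption: 0 or NaN means the house was never renovated.
    if pd.isna(yr_renovated) or yr_renovated == 0:
        return 'not_renovated'
    # Assumption: "recent" means within n_years of reference_year.
    if reference_year - yr_renovated <= n_years:
        return 'recent'
    return 'older'


# Series.apply forwards extra keyword arguments to the function.
years = pd.Series([0.0, np.nan, 2010.0, 1990.0])
categories = years.apply(renovated_cat_sketch, n_years=15)
print(categories.tolist())
```

Passing `n_years` as a keyword through `Series.apply` mirrors the call in the cell above, so the threshold can be changed in one place.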

## Set Data Fields as Categorical to Avoid Treating Them as Numbers

In [14]:
categorical_columns = ['floors', 'waterfront', 'view', 'condition', 'grade', 'zipcode', 'dayofweek', 'day', 'month', 'yr_renovated_cat']

set_to_categorical(df_clean, categorical_columns)

print(df_clean.dtypes)

id                           int64
date                datetime64[ns]
price                      float64
bedrooms                     int64
bathrooms                  float64
sqft_living                  int64
sqft_lot                     int64
floors                    category
waterfront                category
view                      category
condition                 category
grade                     category
sqft_above                   int64
sqft_basement              float64
yr_built                     int64
yr_renovated               float64
zipcode                   category
lat                        float64
long                       float64
sqft_living15                int64
sqft_lot15                   int64
dayofweek                 category
month                     category
day                       category
yr_renovated_cat          category
dtype: object
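
Since `set_to_categorical` is called without assigning its result, it presumably modifies the dataframe in place. A minimal sketch of a helper with that behavior, under that assumption, might look like the following; the name `set_to_categorical_sketch` and the example frame are hypothetical.

```python
import pandas as pd


def set_to_categorical_sketch(df, columns):
    """Hypothetical helper: cast the given columns to 'category' in place."""
    for col in columns:
        df[col] = df[col].astype('category')


# Made-up data, only to show the dtype change.
example = pd.DataFrame({
    'floors': [1.0, 2.0, 1.5],
    'zipcode': [98178, 98125, 98028],
    'price': [221900.0, 538000.0, 180000.0],
})
set_to_categorical_sketch(example, ['floors', 'zipcode'])
print(example.dtypes)
```

Columns such as `zipcode` are numeric codes rather than quantities, so storing them as `category` keeps later modeling steps from treating them as ordered numbers.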
