In [7]:
import pandas as pd
import numpy as np
from my_functions import *

In [8]:
houses_train = pd.read_csv('../datasets/train.csv')
houses_test = pd.read_csv('../datasets/test.csv')

# Cleaning, Engineering, and Selecting Features

In this notebook, I clean the data, engineer new features, and select features for my model, using insights from the data exploration in the [previous EDA notebook](01_EDA.ipynb). 
**Note that the actual functions are stored in `my_functions.py`.**

## Feature Engineering

### Removing Outliers
* Homes with `1st Flr SF` of 4000 sq ft or more
* Homes with `Gr Liv Area` of 4000 sq ft or more
* A garage with `Garage Yr Blt` of 2207 (a year in the future, so presumably a data-entry error)

In [14]:
def remove_outliers(data):
    return data[(data['1st Flr SF'] < 4000) &
               (data['Gr Liv Area'] < 4000) &
               (data['Garage Yr Blt'] != 2207)]
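
One detail of the filter above is worth checking: a home with no garage has a missing `Garage Yr Blt`, and since `NaN != 2207` evaluates to `True`, such rows are kept rather than dropped. A small sketch on made-up rows (the values are illustrative, not taken from the dataset) applies the same mask:

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values, not from the Ames data) to exercise each condition
toy = pd.DataFrame({
    '1st Flr SF':    [1200, 4500, 1500, 900],
    'Gr Liv Area':   [1800, 4600, 1500, 900],
    'Garage Yr Blt': [1995.0, 2005.0, 2207.0, np.nan],
})

# Same boolean mask as in remove_outliers
mask = ((toy['1st Flr SF'] < 4000) &
        (toy['Gr Liv Area'] < 4000) &
        (toy['Garage Yr Blt'] != 2207))
kept = toy[mask]

# Row 1 fails the area checks, row 2 has the 2207 garage year,
# and row 3 (NaN garage year) survives because NaN != 2207 is True
print(kept.index.tolist())  # [0, 3]
```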

### Imputing Missing Data
Missing values are filled with 0 in numeric columns and with 'NA' in categorical columns. According to the [data dictionary](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt), most NaN values are intentional and signal that the home doesn't have a particular feature.

In [15]:
# Thanks Will Badr for this! https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779
def imp_data(data):
    """Fills NaN values in place: 0 for numeric columns, 'NA' for everything else."""
    has_nulls = data.isnull().any()
    null_columns = data.columns[has_nulls]
    for column in null_columns:
        if pd.api.types.is_numeric_dtype(data[column]):
            # Integer/float column: NaN values likely mean the value is 0
            data[column] = data[column].fillna(0)
        else:
            # Categorical column: NaN likely means the home lacks the feature
            data[column] = data[column].fillna('NA')
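
The sketch below shows the intended result on a toy frame. It decides between 0 and 'NA' with a dtype check (`is_numeric_dtype`), and the column values are made up for illustration rather than taken from the dataset.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame (made-up values): one numeric and one text column with gaps
toy = pd.DataFrame({
    'Garage Area': [480.0, np.nan, 300.0],
    'Pool QC': ['Gd', np.nan, np.nan],
    'Lot Area': [9000, 8500, 10000],
})

filled = toy.copy()
for column in filled.columns[filled.isnull().any()]:
    # Numeric columns get 0, all other columns get the string 'NA'
    fill_value = 0 if is_numeric_dtype(filled[column]) else 'NA'
    filled[column] = filled[column].fillna(fill_value)

print(filled)
```

Working on a copy keeps the original frame available for comparison; the notebook's own function modifies the frame it is given.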

### Creating Dummies
I chose to dummify most nominal categories. Later, I'll select features based on correlation and significance.

In [18]:
def category_to_dummies(dataframe, list_of_columns):
    for column in list_of_columns:
        # Creates dummy columns named {column}_{value_in_row}; drop_first omits the first level
        dummy_split = pd.get_dummies(dataframe[column], prefix=column, drop_first=True)
        for dummy_key in dummy_split:
            dataframe[dummy_key] = dummy_split[dummy_key]  # adds each dummy column to the original dataframe
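
To make the naming and the effect of `drop_first=True` concrete, here is a minimal sketch on a single toy column (the rows are made up). With two levels, `Grvl` and `Pave`, the alphabetically first level is dropped and only one dummy column remains.

```python
import pandas as pd

# Toy column (made-up rows) with two levels: 'Grvl' and 'Pave'
toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})

# The prefix gives columns named Street_<value>; drop_first removes 'Street_Grvl'
dummies = pd.get_dummies(toy['Street'], prefix='Street', drop_first=True)

print(dummies.columns.tolist())  # ['Street_Pave']
```

Note that `get_dummies` only creates columns for values that actually occur, so the train and test sets can end up with different dummy columns if a level appears in only one of them.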

In [16]:
# choosing categories to dummify
nominal_categories = [
                      'MS Zoning',
                      'MS SubClass',
                      'Foundation',
                      'BsmtFin Type 1',
                      'BsmtFin Type 2',
                      'Exterior 1st',
                      'Exterior 2nd',
                      'Heating',
                      'Street',
                      'Neighborhood',
                      'Garage Finish',
                      'Lot Config',
                      'Lot Shape',
                      'Roof Matl',
                      'Roof Style',
                      'Land Contour',
                      'Utilities',
                      'Land Slope',
                      'House Style',
                      'Electrical',
                      'Garage Type',
                      'Sale Type',
                      'Functional',
                      'Exter Qual',
                      'Exter Cond',
                      'Bsmt Qual',
                      'Condition 1',
                      'Condition 2',
                      'Bsmt Cond',
                      'Heating QC',
                      'Kitchen Qual',
                      'Fireplace Qu',
                      'Garage Qual',
                      'Garage Cond',
                      'Pool QC', 
                      'Full Bath',
                      'Half Bath',
                      'Bedroom AbvGr',
                      'Kitchen AbvGr',
                      'TotRms AbvGrd',
                     ]

### Creating Logs
I took the log of some columns to bring their distributions closer to normal. This helped the model predict homes with `SalePrice` outside of the interquartile range.

In [17]:
categories_to_log = ['Lot Area', '1st Flr SF', 'BsmtFin SF 1', 'Gr Liv Area']
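
The function that applies the transform lives in `my_functions.py` and is not shown here, so the following is only a sketch of how it might look. The helper name `add_log_columns` and the `_log` column suffix are hypothetical, and `np.log1p` is an assumption on my part: columns such as `BsmtFin SF 1` can be 0 (for example after missing values are imputed with 0), and a plain `np.log` would turn those into `-inf`.

```python
import numpy as np
import pandas as pd

def add_log_columns(dataframe, columns):
    """Hypothetical helper: adds a '<column>_log' column for each name given."""
    for column in columns:
        # log1p computes log(1 + x), so a value of 0 stays 0 instead of becoming -inf
        dataframe[f'{column}_log'] = np.log1p(dataframe[column])

# Toy frame (made-up values) including a zero to show the log1p behaviour
toy = pd.DataFrame({'Lot Area': [8000, 12000], 'BsmtFin SF 1': [0, 650]})
add_log_columns(toy, ['Lot Area', 'BsmtFin SF 1'])

print(toy.filter(like='_log'))
```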