# !!!!!!! CLEAR OUTPUT BEFORE COMMIT !!!!!!
TODO:
- Histogram of home age?

In [None]:
import pandas as pd
import numpy as np
import os
#import sklearn

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
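def one_hot_encode(df, col_name_list):
    """Append dummy columns for each listed column, then drop the originals."""
    for col_name in col_name_list:
        df = pd.concat((df, pd.get_dummies(df[col_name], prefix=col_name)), axis='columns')
    return df.drop(columns=col_name_list)

def a_f_ohe(df, column_name, one_hot_encode_list):
    """Prefix a column's values with its name and queue the column for one-hot encoding."""
    if not one_hot_encode_list:
        one_hot_encode_list = []
    # Cast to str so numeric codes (e.g. MSSubClass) become categories, but keep
    # nulls as nulls so pd.get_dummies leaves those rows all-zero instead of
    # creating a '<column>nan' category.
    prefixed = column_name + df[column_name].astype(str)
    df[column_name] = prefixed.where(df[column_name].notna())
    if column_name not in one_hot_encode_list:
        one_hot_encode_list = one_hot_encode_list + [column_name]
    return one_hot_encode_list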

In [None]:
df = pd.read_csv(os.path.join('data','train.csv'))

In [None]:
pd.options.display.max_columns = 1000
pd.options.display.max_rows = 1000
df.head()

How many records are we dealing with?

In [None]:
len(df.index)

Because the training and testing datasets are small, we don't need to worry about converting data types to smaller sizes to conserve memory.
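If the footprint ever needs checking, a minimal sketch along these lines could measure it and show what downcasting would buy. The `toy` frame and its values are made up for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; the values are made up.
toy = pd.DataFrame({
    'LotArea': np.array([8450, 9600, 11250, 9550], dtype='int64'),
    'OverallQual': np.array([7, 6, 7, 7], dtype='int64'),
})

# Total in-memory footprint in bytes (deep=True also counts string contents).
bytes_before = toy.memory_usage(deep=True).sum()

# Downcast each integer column to the smallest integer type that holds its values.
small = pd.DataFrame({
    col: pd.to_numeric(toy[col], downcast='integer') for col in toy.columns
})
bytes_after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print(bytes_before, '->', bytes_after)
```

On a dataset this small the saving is negligible, which is why the notebook skips this step.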

What data types are we dealing with?

In [None]:
df.dtypes

Let's take a look at a couple of features and get some simple summary statistics.

In [None]:
df['1stFlrSF'].describe()

In [None]:
df['MSSubClass'].value_counts()

Which fields have null values?

In [None]:
# Count nulls per column, then express them as a percentage of all records
df_nulls = df.isnull().sum().to_frame(name='num_nulls')
df_nulls['total_records'] = len(df.index)
df_nulls['pct_null'] = df_nulls['num_nulls'] / df_nulls['total_records'] * 100
df_nulls[df_nulls['num_nulls'] > 0].sort_values(by='pct_null', ascending=False)

Wow, PoolQC (Pool Quality) is almost always empty. Two questions follow:
- Is it because most of the houses have no pool?
- Do any houses have a pool but no quality value?

In [None]:
df.loc[df['PoolArea'] > 0, ['PoolQC','PoolArea']]

There are no pools without a quality rating.

In [None]:
df.loc[~df['PoolQC'].isnull(), ['PoolQC','PoolArea']]

This shows us that only 7 houses have a pool, and that no pool quality value exists where there is no pool (pool area = 0).
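The two lookups above can also be folded into a single consistency check. The following is a sketch on a made-up `toy` frame (only the column names come from the dataset); a row counts as a mismatch when it has a pool area without a quality value, or a quality value without a pool area.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    'PoolArea': [0, 512, 0, 648],
    'PoolQC': [np.nan, 'Ex', np.nan, 'Gd'],
})

has_pool = toy['PoolArea'] > 0
has_quality = toy['PoolQC'].notna()

# XOR is True only where exactly one of the two conditions holds.
mismatches = toy[has_pool ^ has_quality]
print(len(mismatches))
```

An empty `mismatches` frame means the two columns agree on which houses have pools.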

Pools are rare among houses in general, so a mostly empty PoolQC is expected. It's important to keep the feature rather than drop it just because so few records have a value.

To handle these we can One Hot Encode (pd.get_dummies) for the pool quality values. Any house without a pool will have 0 for all the quality measurements to work around the null values in the pool quality feature.
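As a quick illustration of that behavior, here is a small sketch on made-up values. By default (`dummy_na=False`) `pd.get_dummies` creates no column for nulls, so rows without a pool quality end up with zeros in every quality column.

```python
import numpy as np
import pandas as pd

# Made-up values: two houses with a pool quality, two without.
toy = pd.DataFrame({'PoolQC': ['Ex', np.nan, 'Gd', np.nan]})

# Nulls get no dummy column of their own unless dummy_na=True is passed.
dummies = pd.get_dummies(toy['PoolQC'], prefix='PoolQC')

# Number of active indicator columns per row: 1 where a quality exists, else 0.
active_per_row = dummies.astype(int).sum(axis='columns').tolist()

print(list(dummies.columns))
print(active_per_row)
```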

Whenever another feature gets the same treatment, we should check its nulls explicitly first, as we did with PoolQC.

In [None]:
ohe_list = [] #['PoolQC','Alley','Fence','MiscFeature']

In [None]:
#df = one_hot_encode(df, ohe_list)

In [None]:
#df.head()

**MSSubClass:** Identifies the type of dwelling involved in the sale.

In [None]:
# Prefix the categories so they make a bit more sense when we OHE them.
# Initially it's imported as a number so we'll force it to be a string
ohe_list = a_f_ohe(df, 'MSSubClass', ohe_list)

**MSZoning:** Identifies the general zoning classification of the sale.

In [None]:
ohe_list = a_f_ohe(df, 'MSZoning', ohe_list)

**LotFrontage:** Linear feet of street connected to property

Nothing to do here; this is a plain number. Perhaps scale it later?
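If scaling turns out to be useful, one common option is standardization (subtract the mean, divide by the standard deviation). This is a sketch on made-up values, not a decision the notebook has taken; the `LotFrontage_scaled` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values, including one missing entry.
toy = pd.DataFrame({'LotFrontage': [65.0, 80.0, np.nan, 60.0, 84.0]})

col = toy['LotFrontage']
# mean() and std() skip nulls by default, so missing rows stay null after scaling.
toy['LotFrontage_scaled'] = (col - col.mean()) / col.std()

print(toy)
```

When a separate test set is scored, the mean and standard deviation should be taken from the training data and reused, rather than recomputed on the test data.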

**LotArea:** Lot size in square feet

Same as LotFrontage: a plain number that may benefit from scaling.

**Street:** Type of road access to property

In [None]:
ohe_list = a_f_ohe(df, 'Street', ohe_list)

**Alley:** Type of alley access to property

In [None]:
ohe_list = a_f_ohe(df, 'Alley', ohe_list)

**LotShape:** General shape of property

In [None]:
ohe_list = a_f_ohe(df, 'LotShape', ohe_list)

**LandContour:** Flatness of the property

In [None]:
ohe_list = a_f_ohe(df, 'LandContour', ohe_list)

**Utilities:** Type of utilities available

In [None]:
ohe_list = a_f_ohe(df, 'Utilities', ohe_list)

**LotConfig:** Lot configuration

In [None]:
ohe_list = a_f_ohe(df, 'LotConfig', ohe_list)

**LandSlope:** Slope of property

In [None]:
ohe_list = a_f_ohe(df, 'LandSlope', ohe_list)

**Neighborhood:** Physical locations within Ames city limits

In [None]:
ohe_list = a_f_ohe(df, 'Neighborhood', ohe_list)

**Condition1:** Proximity to various conditions

Are these mutually exclusive values? They appear so.

In [None]:
df['Condition1'].value_counts()

In [None]:
ohe_list = a_f_ohe(df, 'Condition1', ohe_list)

**Condition2:** Proximity to various conditions (if more than one is present)

In [None]:
df['Condition2'].value_counts()

In [None]:
ohe_list = a_f_ohe(df, 'Condition2', ohe_list)
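Encoding Condition1 and Condition2 separately yields two sets of indicator columns. If the two columns draw on the same set of labels, as their descriptions suggest, an alternative is a single set of indicators that is 1 when a label appears in either column. The sketch below uses made-up rows with no nulls, and the `Condition_` prefix is a hypothetical naming choice.

```python
import pandas as pd

# Made-up rows; the labels follow the style seen in the value counts above.
toy = pd.DataFrame({
    'Condition1': ['Norm', 'Feedr', 'Artery', 'Norm'],
    'Condition2': ['Norm', 'Norm', 'Feedr', 'Norm'],
})

# Every label that appears in either column, in a fixed order.
labels = sorted(set(toy['Condition1']) | set(toy['Condition2']))

# Encode each column without a prefix and align both to the same label set.
d1 = pd.get_dummies(toy['Condition1']).reindex(columns=labels, fill_value=False)
d2 = pd.get_dummies(toy['Condition2']).reindex(columns=labels, fill_value=False)

# A label is present for a house if either column carries it.
combined = (d1.astype(bool) | d2.astype(bool)).astype(int).add_prefix('Condition_')
print(combined)
```

The trade-off is that the combined form drops the distinction between the first and second condition, so it is worth comparing both encodings before settling on one.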

**BldgType:** Type of dwelling

In [None]:
ohe_list = a_f_ohe(df, 'BldgType', ohe_list)

**HouseStyle:** Style of dwelling

In [None]:
ohe_list = a_f_ohe(df, 'HouseStyle', ohe_list)

**OverallQual:** Rates the overall material and finish of the house

Rated from 1 (Very Poor) to 10 (Very Excellent). Since this is numeric and has a natural ordering, we'll leave it alone.

**OverallCond:** Rates the overall condition of the house

Rated from 1 (Very Poor) to 10 (Very Excellent). Since this is numeric and has a natural ordering, we'll leave it alone.

**YearBuilt:** Original construction date

Convert this to the home's age in full years.

In [None]:
from datetime import datetime
df['Age'] = datetime.now().year - df['YearBuilt']
del df['YearBuilt']

**YearRemodAdd:** Remodel date (same as construction date if no remodeling or additions)

This follows the same process as YearBuilt.

In [None]:
df['RemodelAge'] = datetime.now().year - df['YearRemodAdd']
del df['YearRemodAdd']
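One caveat with `datetime.now().year` is that the ages shift every time the notebook is run in a later year. A possible alternative, sketched here on made-up rows, is to measure both ages at the time of sale. It assumes the data has a sale-year column named `YrSold`, which is not shown in this notebook so far.

```python
import pandas as pd

# Made-up rows; 'YrSold' is assumed to be the year the house was sold.
toy = pd.DataFrame({
    'YearBuilt': [2003, 1976, 1915],
    'YearRemodAdd': [2003, 1976, 1970],
    'YrSold': [2008, 2007, 2006],
})

# Age of the house and of its last remodel at the time of sale, in full years.
toy['Age'] = toy['YrSold'] - toy['YearBuilt']
toy['RemodelAge'] = toy['YrSold'] - toy['YearRemodAdd']

print(toy[['Age', 'RemodelAge']])
```

Ages computed this way stay the same no matter when the notebook is re-run.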

# LEFT OFF HERE!!!!
