# Cleaning, Engineering, and Selecting Features

In this notebook, I clean the data, engineer new features, and select features for my model, using insights from my data exploration in the [previous EDA notebook](01_EDA.ipynb).
**Note that all functions are stored in `my_functions.py`.**

### Imports and reading in data

In [35]:
import pandas as pd
import numpy as np
from my_functions import clean_test_data_export_csv

In [36]:
houses_train = pd.read_csv('../datasets/train.csv')
houses_test = pd.read_csv('../datasets/test.csv')

## Feature Engineering

### Removing Outliers
* Homes over 4000 in 1st Flr SF
* Homes over 4000 in Gr Liv Area
* A garage with a recorded build year of 2207 (`Garage Yr Blt`)

In [37]:
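def remove_outliers(data):
    # Keep homes under 4000 sq ft of above-grade living area and drop any
    # row with the impossible garage year. NaN garage years pass `!= 2207`.
    # There is no separate '1st Flr SF' filter in this function.
    return data[(data['Gr Liv Area'] < 4000) &
                (data['Garage Yr Blt'] != 2207)]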

In [38]:
houses_train = remove_outliers(houses_train.copy())
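
A quick check on toy data (the values below are made up for illustration, not taken from the dataset) shows how this filter treats missing garage years. Because `NaN != 2207` evaluates to `True`, homes without a garage year are kept rather than dropped:

```python
import numpy as np
import pandas as pd

# Toy rows (made-up values) standing in for the training data
toy = pd.DataFrame({
    'Gr Liv Area':   [1500, 5642, 1800, 2000],
    'Garage Yr Blt': [1999, 2008, 2207, np.nan],
})

def remove_outliers(data):
    return data[(data['Gr Liv Area'] < 4000) &
                (data['Garage Yr Blt'] != 2207)]

kept = remove_outliers(toy)
# Row 1 fails the area filter, row 2 fails the garage-year filter,
# and row 3 (NaN garage year) passes both conditions.
print(kept.index.tolist())
```

This matters because the imputation step comes after outlier removal, so the garage year is still `NaN` for homes without a garage at this point.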

### Imputing missing data 
I impute missing values with 0 for numeric columns and with the string `'NA'` for categorical columns. According to the [data dictionary](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt), most NaN values are intentional and signal that the home doesn't have a particular feature.

In [39]:
# Thanks Will Badr for this! https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779
def imp_data(data):
    """Fill NaNs in place: 0 for numeric columns, 'NA' for everything else."""
    null_columns = data.columns[data.isnull().any()]
    for column in null_columns:
        if pd.api.types.is_numeric_dtype(data[column]):
            # Numeric column: NaN most likely means the home lacks the feature
            data[column] = data[column].fillna(0)
        else:
            data[column] = data[column].fillna('NA')

In [40]:
imp_data(houses_train)
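
As a sanity check, the same numeric-versus-categorical rule can be sketched on a toy frame. The values are invented, and building a `fill_values` dictionary for `DataFrame.fillna` is just one way to express the rule, not necessarily how `my_functions.py` does it:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame with made-up values
toy = pd.DataFrame({
    'Garage Area': [480.0, np.nan, 600.0],   # numeric with a gap
    'Pool QC':     [np.nan, 'Gd', np.nan],   # categorical with gaps
    'Lot Area':    [9000, 11000, 8500],      # no missing values
})

# Numeric columns get 0, everything else gets the string 'NA'
fill_values = {
    col: 0 if is_numeric_dtype(toy[col]) else 'NA'
    for col in toy.columns[toy.isnull().any()]
}
toy_imputed = toy.fillna(fill_values)

# No missing values should remain after imputation
print(toy_imputed.isnull().sum().sum())
```

Running the same `isnull().sum().sum()` check on `houses_train` after `imp_data` is a cheap way to confirm nothing was missed.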

### Creating Dummies
I chose to dummify most nominal categories. Later, I'll select features based on correlation and significance.

In [41]:
def category_to_dummies(dataframe, list_of_columns):
    for column in list_of_columns:
        # Dummy columns are named {column}_{value}; drop_first drops the first level, which becomes the baseline
        dummy_split = pd.get_dummies(dataframe[column], prefix=column, drop_first=True)
        for dummy_key in dummy_split:
            dataframe[dummy_key] = dummy_split[dummy_key]  # add each dummy column to the original dataframe

In [42]:
# choosing categories to dummify
nominal_categories = [
                      'MS Zoning',
                      'MS SubClass',
                      'Foundation',
                      'BsmtFin Type 1',
                      'BsmtFin Type 2',
                      'Exterior 1st',
                      'Exterior 2nd',
                      'Heating',
                      'Street',
                      'Neighborhood',
                      'Garage Finish',
                      'Lot Config',
                      'Lot Shape',
                      'Roof Matl',
                      'Roof Style',
                      'Land Contour',
                      'Utilities',
                      'Land Slope',
                      'House Style',
                      'Electrical',
                      'Garage Type',
                      'Sale Type',
                      'Functional',
                      'Exter Qual',
                      'Exter Cond',
                      'Bsmt Qual',
                      'Condition 1',
                      'Condition 2',
                      'Bsmt Cond',
                      'Heating QC',
                      'Kitchen Qual',
                      'Fireplace Qu',
                      'Garage Qual',
                      'Garage Cond',
                      'Pool QC', 
                      'Full Bath',
                      'Half Bath',
                      'Bedroom AbvGr',
                      'Kitchen AbvGr',
                      'TotRms AbvGrd',
                     ]

In [43]:
category_to_dummies(houses_train, nominal_categories)
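
To make the naming and the effect of `drop_first=True` concrete, here is a small sketch on a made-up `Street` column. The first level in sorted order (`Grvl`) is dropped and becomes the baseline, so only `Street_Pave` remains:

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})

dummies = pd.get_dummies(toy['Street'], prefix='Street', drop_first=True)

# Only the non-baseline level gets a column
print(dummies.columns.tolist())
```

Depending on the pandas version, the dummy values are stored as integers or booleans; both multiply cleanly with numeric columns when building interaction terms.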

### Engineering New Features

I wanted to create some new features based on some of my data exploration:
* Convert `Year Built` into two categories, **Pre-1983** and **1983 to present**
* Convert home years into ages (`Year Built`, `Year Remod/Add`)
* Make `has_garage` feature to determine whether a home has a garage or not

In [44]:
def create_new_features(data):
    # Flag homes built in 1983 or later (1) versus 1982 or earlier (0)
    greater_than_1982 = data['Year Built'] > 1982
    data['built_1983_to_present'] = np.where(greater_than_1982, 1, 0)
    # Convert years into ages, measured from 2010
    data['age_of_home'] = 2010 - data['Year Built']
    data['years_since_remodel'] = data['Year Remod/Add'].apply(lambda x: 2010 - x if x != 0 else x)
    # Garage Yr Blt was imputed with 0 for homes without a garage
    data['has_garage'] = np.where(data['Garage Yr Blt'] > 0, 1, 0)

In [45]:
create_new_features(houses_train)

### Creating Logs
I took the log of some columns to make their distributions closer to normal. This helped when predicting homes with `SalePrice` outside of the interquartile range.

In [46]:
categories_to_log = ['Lot Area', 'BsmtFin SF 1']

In [47]:
def log_cols(data, columns):
    for column in columns:
        # log(0) is undefined, so map non-positive values to 1 (log(1) = 0) first
        positive_values = data[column].mask(data[column] <= 0, 1)
        data[f"log_{column.replace(' ', '_').lower()}"] = np.log(positive_values)

In [48]:
log_cols(houses_train, categories_to_log)
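
One way to check whether a log transform is doing its job is to compare skewness before and after. The sketch below uses synthetic right-skewed data as a stand-in for a column like `Lot Area`. The distribution parameters are arbitrary assumptions, not fitted to the real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values (assumed shape, not the real column)
skewed = pd.Series(rng.lognormal(mean=9, sigma=0.5, size=2000))

skew_before = skewed.skew()
skew_after = np.log(skewed).skew()

# A skewness near 0 indicates a roughly symmetric distribution
print(f"skew before: {skew_before:.2f}, after log: {skew_after:.2f}")
```

The same two `.skew()` calls on `houses_train['Lot Area']` and `houses_train['log_lot_area']` would show how much the transform helps on the actual data.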

### Looking at Feature Correlation

In [49]:
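# Showing the absolute value of each feature's correlation with SalePrice
# (numeric_only=True skips the remaining text columns, which newer pandas versions would otherwise reject)
houses_train.drop(columns=['Id', 'PID']).corr(numeric_only=True).abs().sort_values(by='SalePrice', ascending=False)[['SalePrice']].head(15)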

| Feature | SalePrice |
|---|---|
| SalePrice | 1.0 |
| Overall Qual | 0.803336 |
| Gr Liv Area | 0.719598 |
| Total Bsmt SF | 0.664912 |
| Garage Area | 0.655215 |
| Garage Cars | 0.648271 |
| 1st Flr SF | 0.648054 |
| Exter Qual_TA | 0.600715 |
| built_1983_to_present | 0.598625 |
| Year Built | 0.572148 |


These are the top correlated features. Note that these are absolute values, so some of these features may be negatively correlated with sale price. They are also pairwise correlations, which means they do not hold the other variables constant.
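
To see the direction of each relationship alongside its strength, the signed and absolute correlations can be shown side by side. This is a minimal sketch on invented numbers; the column names `quality` and `age_of_home` here are only illustrative:

```python
import pandas as pd

# Made-up values: price rises with quality and falls with age
toy = pd.DataFrame({
    'SalePrice':   [100, 150, 200, 250, 300],
    'quality':     [3, 5, 6, 8, 9],
    'age_of_home': [60, 45, 30, 12, 5],
})

signed = toy.corr(numeric_only=True)['SalePrice'].drop('SalePrice')

# Rank by strength, but keep the sign visible
ranked = (
    pd.DataFrame({'corr': signed, 'abs_corr': signed.abs()})
    .sort_values('abs_corr', ascending=False)
)
print(ranked)
```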

In [50]:
# # Uncomment for an informative (but bulky) correlation heatmap; needs `import matplotlib.pyplot as plt` and `import seaborn as sns`
# plt.figure(figsize=(1, 70))
# sns.heatmap(houses_train.drop(columns=['Id', 'PID']).corr().sort_values(by='SalePrice', ascending=False)[['SalePrice']],
#             vmin=-1,
#             vmax=1,
#             cmap='RdBu')
# plt.yticks(fontsize=8);

It looks like `Overall Qual` is the feature most strongly correlated with `SalePrice`. From the data dictionary, we can see that `Overall Qual` is defined as:
>The overall material and finish of the house on a scale of 1-10.

Eventually, we'll select features for the model using a combination of **correlation and p-value thresholds**. For now, the highly correlated features you see above will be the basis for creating interaction terms.
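
As a rough sketch of what a combined correlation and p-value filter could look like, the function below keeps numeric columns whose Pearson correlation with the target passes both cutoffs. The function name `select_features` and the thresholds `0.3` and `0.05` are assumptions for illustration, not the values used later in this project:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def select_features(data, target, min_abs_corr=0.3, max_p_value=0.05):
    """Return numeric columns whose Pearson r with `target` passes both thresholds."""
    selected = []
    numeric = data.select_dtypes(include='number').drop(columns=[target])
    for column in numeric.columns:
        r, p_value = pearsonr(numeric[column], data[target])
        if abs(r) >= min_abs_corr and p_value <= max_p_value:
            selected.append(column)
    return selected

# Synthetic data: one feature drives the target, one is pure noise
rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
toy = pd.DataFrame({
    'strong_feature': signal,
    'noise_feature': rng.normal(size=n),
    'SalePrice': 3 * signal + rng.normal(scale=0.5, size=n),
})
print(select_features(toy, 'SalePrice'))
```

Note that `select_dtypes(include='number')` skips boolean columns, so dummy columns stored as booleans would need to be cast to integers before using a filter like this.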

### Creating Interaction Terms

Since `Overall Qual` reflects the material and finish of the house, most of these interaction terms pair `Overall Qual` with other features related to material and finish. Other interaction terms I had in mind, which the function below does not create, include:
* `Overall Qual x Neighborhood`: for the few neighborhoods that had relatively high absolute correlation values
* `age_of_home x age_of_garage`: the ages had strong correlations, so I wanted to account for the relationship between home age and garage age

In [51]:
def create_interaction_terms(data):
    # Features to multiply with Overall Qual
    interaction_partners = [
        'Gr Liv Area',
        'Exter Qual_Gd',
        'Exter Qual_TA',
        'Foundation_PConc',
        'BsmtFin Type 1_GLQ',
        'Full Bath_1',
        'Full Bath_2',
        'Fireplace Qu_NA',
        'Garage Cars',
        'Garage Area',
        'Exterior 1st_VinylSd',
        'Exterior 2nd_VinylSd',
    ]
    for feature in interaction_partners:
        data[f'Overall Qual x {feature}'] = data['Overall Qual'] * data[feature]

In [52]:
create_interaction_terms(houses_train)
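
The `Overall Qual x Neighborhood` idea mentioned above could be sketched by multiplying `Overall Qual` with each neighborhood dummy column. The helper `add_dummy_interactions` is hypothetical, and the two neighborhoods in the toy frame are placeholders; which neighborhoods are worth interacting is a choice this notebook does not spell out:

```python
import pandas as pd

def add_dummy_interactions(data, base='Overall Qual', prefix='Neighborhood_'):
    """Multiply `base` by every dummy column whose name starts with `prefix`."""
    dummy_columns = [col for col in data.columns if col.startswith(prefix)]
    for col in dummy_columns:
        # Cast to int so boolean dummies behave like 0/1 indicators
        data[f'{base} x {col}'] = data[base] * data[col].astype(int)
    return data

# Made-up rows with two example neighborhood dummies
toy = pd.DataFrame({
    'Overall Qual': [5, 8, 7],
    'Neighborhood_NridgHt': [0, 1, 0],
    'Neighborhood_StoneBr': [0, 0, 1],
})
add_dummy_interactions(toy)
print(toy.filter(like=' x ').columns.tolist())
```

Each interaction column equals `Overall Qual` for homes in that neighborhood and 0 elsewhere, which lets a linear model fit a separate quality slope per neighborhood.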

### Removing Collinear Features

Having created all these features, I now need to remove some that are collinear with others.

In [53]:
def remove_features(data):
    """Return the names of the columns to keep (everything not listed below)."""
    features_to_remove = [
        '2nd Flr SF',
        'Gr Liv Area',
        'Garage Area',
        'Year Built',
        'Garage Yr Blt',
        'Year Remod/Add',
        'MS SubClass',
        'Garage Cars',
    ]
    columns_rm = [col for col in data.columns if col not in features_to_remove]
    return columns_rm

In [54]:
columns_rm = remove_features(houses_train)

In [55]:
houses_train_clean = houses_train[columns_rm]
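
One way to find candidates for a removal list like the one above is to scan the correlation matrix for feature pairs above some cutoff. This is a sketch under assumptions: the helper name `highly_correlated_pairs` and the `0.8` threshold are illustrative, and the toy values are made up:

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(data, threshold=0.8):
    """Return feature pairs whose absolute correlation exceeds `threshold`."""
    corr = data.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Made-up values: garage size in cars and in square feet move together
toy = pd.DataFrame({
    'Garage Cars':  [1, 2, 2, 3, 1],
    'Garage Area':  [250, 480, 500, 800, 280],
    'Lot Frontage': [60, 80, 45, 70, 90],
})
result = highly_correlated_pairs(toy)
print(result)
```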

### Storing the Cleaned Data

The training data has now been through every step above, so I export it with the new features added and the collinear columns removed. For the test data, I apply the same steps using (you guessed it) **another function** that wraps all of the functions I've been making.

In [56]:
houses_train_clean.to_csv('../datasets/cleaned/houses_train_clean.csv', index=False)

In [57]:
# Rather than repeat every step for my test data, I've packed all my functions into one larger function that applies the steps above to the test data
clean_test_data_export_csv(houses_test, nominal_categories, categories_to_log).to_csv('../datasets/cleaned/houses_test_clean.csv', index=False)
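
One thing to watch when dummifying train and test separately is that the two sets can end up with different dummy columns, for example when a category level appears in only one of them. Whether `clean_test_data_export_csv` handles this is not shown here, so the following is only a sketch of one common fix, using `DataFrame.reindex` on made-up frames:

```python
import pandas as pd

# Made-up frames: each has a Heating level the other lacks
train = pd.DataFrame({
    'Lot Area': [9000, 8500],
    'Heating_GasW': [0, 1],
    'SalePrice': [200000, 150000],
})
test = pd.DataFrame({
    'Lot Area': [10000],
    'Heating_Floor': [1],
})

# Use the training feature columns as the reference layout
feature_columns = [col for col in train.columns if col != 'SalePrice']

# Missing dummies are added as 0; columns unseen in training are dropped
test_aligned = test.reindex(columns=feature_columns, fill_value=0)
print(test_aligned.columns.tolist())
```

Filling with 0 is only appropriate for dummy columns; a missing continuous column would point to a problem earlier in the cleaning steps.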

## What's next?
I've got the clean data stored and ready to model, but I want to do one final scan to make sure there's no multicollinearity.
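
A common tool for that kind of scan is the variance inflation factor (VIF). The sketch below reads the VIFs off the diagonal of the inverse correlation matrix, which works as long as no feature is an exact linear combination of the others. The helper name, the synthetic columns, and the usual rule of thumb that a VIF above about 10 deserves a look are all assumptions for illustration:

```python
import numpy as np
import pandas as pd

def variance_inflation_factors(features):
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    corr = features.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns).sort_values(ascending=False)

# Synthetic data: two nearly collinear area columns and one independent column
rng = np.random.default_rng(0)
n = 500
area = rng.normal(1500, 400, size=n)
toy = pd.DataFrame({
    'living_area': area,
    'first_floor_area': 0.6 * area + rng.normal(0, 30, size=n),
    'lot_frontage': rng.normal(70, 20, size=n),
})
vifs = variance_inflation_factors(toy)
print(vifs.round(1))
```

The two area columns should show large VIFs while the independent column stays near 1, which is the pattern to look for in the cleaned training data.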