<center>
<img align="center" src="http://sydney.edu.au/images/content/about/logo-mono.jpg">
</center>
<h1 align="center" style="margin-top:10px">Statistical Learning and Data Mining</h1>
<h2 align="center" style="margin-top:20px">Week 3 Tutorial: Feature Engineering</h2>
<br>

This notebook explores useful feature engineering tools for tabular data. Keep in mind that feature engineering depends heavily on the context and the model, so use your analytical skills to design good features for each situation.

An individual feature engineering step may make little difference to the performance of the final model. However, these small improvements can add up to a significant increase in accuracy.

<a href="#1.-Ames-Housing-Data">Ames Housing Data</a> <br>
<a href="#2.-Data-cleaning">Data cleaning</a> <br>
<a href="#3.-Type-inference">Type inference</a> <br>
<a href="#4.-Continuous-predictors">Continuous predictors</a> <br>
<a href="#5.-Nominal-predictors">Nominal predictors</a> <br>
<a href="#6.-Discrete-predictors">Discrete predictors</a> <br>
<a href="#7.-Ordinal-predictors">Ordinal predictors</a> <br>
<a href="#8.-Missing-values">Missing values</a> <br>
<a href="#9.-Interaction-effects">Interaction effects</a> <br>

This notebook relies on the following imports and settings. 

In [1]:
# Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Plot settings
sns.set_context('notebook') # optimises figures for notebook display
sns.set_style('ticks') # set default plot style
colours = ['#4E79A7','#F28E2C','#E15759','#76B7B2','#59A14F', 
           '#EDC949','#AF7AA1','#FF9DA7','#9C755F','#BAB0AB']
sns.set_palette(colours) # set custom color scheme
%matplotlib inline
plt.rcParams['figure.figsize'] = (9, 6)

In [3]:
# Methods
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# 1. Ames Housing Data

We'll continue working with the Ames Housing dataset from last week. The original source is [De Cock (2011)](http://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627). Because our focus will be on feature engineering, it's important to use the [documentation](https://ww2.amstat.org/publications/jse/v19n3/Decock/DataDocumentation.txt) to understand the variables.

In [4]:
# data = pd.read_csv('Data/AmesHousing.csv')
data = pd.read_csv('AmesHousing.csv')
data.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,20,RL,141.0,31770,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,5,2010,WD,Normal,215000
1,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,20,RL,93.0,11160,Pave,,Reg,Lvl,AllPub,Corner,...,0,,,,0,4,2010,WD,Normal,244000
4,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,Inside,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


Below, we repeat some of the steps from last week. We split the data into training and validation sets and apply a log transformation to the response variable, `SalePrice`. 

Like last week, our main metric to evaluate the quality of the predictions will be the RMSE on the log scale.
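
As a reminder of what this metric measures, here is a minimal sketch of the RMSE on the log scale. The helper name `rmse_log` and the toy prices are made up for illustration and are not part of the Ames data or of last week's code.

```python
import numpy as np

def rmse_log(y_true_log, y_pred_log):
    """RMSE between two arrays that are already on the log scale."""
    y_true_log = np.asarray(y_true_log, dtype=float)
    y_pred_log = np.asarray(y_pred_log, dtype=float)
    return float(np.sqrt(np.mean((y_true_log - y_pred_log) ** 2)))

# Toy illustration with made-up prices (hypothetical values)
toy_prices = np.array([100000.0, 200000.0, 300000.0])
toy_preds = toy_prices * 1.10  # every prediction is 10% too high

toy_rmse = rmse_log(np.log(toy_prices), np.log(toy_preds))
print(round(toy_rmse, 4))
```

Because the errors are differences of logs, the metric reflects relative rather than absolute errors: a 10% overprediction contributes the same amount whether the house is cheap or expensive.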

In [5]:
from sklearn.model_selection import train_test_split

data['LogSalePrice'] = np.log(data['SalePrice'])

index_train, index_valid = train_test_split(data.index, train_size=0.7, random_state=42)

train = data.loc[index_train, :].copy()
valid = data.loc[index_valid, :].copy()

y_train = train['LogSalePrice']
y_valid = valid['LogSalePrice']

# 2. Data cleaning

Data frequently contain errors, so we should try to detect and fix problems early on. Consider, for example, the `GarageYrBlt` variable. 

In [6]:
data.describe()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice,LogSalePrice
count,2930.0,2440.0,2930.0,2930.0,2930.0,2930.0,2930.0,2907.0,2929.0,2929.0,...,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0,2930.0
mean,57.387372,69.22459,10147.921843,6.094881,5.56314,1971.356314,1984.266553,101.896801,442.629566,49.722431,...,47.533447,23.011604,2.592491,16.002048,2.243345,50.635154,6.216041,2007.790444,180796.060068,12.020969
std,42.638025,23.365335,7880.017759,1.411026,1.111537,30.245361,20.860286,179.112611,455.590839,169.168476,...,67.4834,64.139059,25.141331,56.08737,35.597181,566.344288,2.714492,1.316613,79886.692357,0.407587
min,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,12789.0,9.456341
25%,20.0,58.0,7440.25,5.0,5.0,1954.0,1965.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2007.0,129500.0,11.771436
50%,50.0,68.0,9436.5,6.0,5.0,1973.0,1993.0,0.0,370.0,0.0,...,27.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,160000.0,11.982929
75%,70.0,80.0,11555.25,7.0,6.0,2001.0,2004.0,164.0,734.0,0.0,...,70.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,213500.0,12.271392
max,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1526.0,...,742.0,1012.0,508.0,576.0,800.0,17000.0,12.0,2010.0,755000.0,13.534473


In [7]:
data['GarageYrBlt'].describe().round(0)

count    2771.0
mean     1978.0
std        26.0
min      1895.0
25%      1960.0
50%      1979.0
75%      2002.0
max      2207.0
Name: GarageYrBlt, dtype: float64

The maximum value of 2207 lies in the future, so it is clearly a typing error. The most plausible intended year is 2007, so let's fix it. You'll find errors like this as you do exploratory data analysis. 

In [8]:
for df in (data, train, valid):  # train and valid were copied before this fix
    df.loc[df['GarageYrBlt'] == 2207, 'GarageYrBlt'] = 2007

In [9]:
data['GarageYrBlt'].describe().round(0)

count    2771.0
mean     1978.0
std        25.0
min      1895.0
25%      1960.0
50%      1979.0
75%      2002.0
max      2010.0
Name: GarageYrBlt, dtype: float64
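
Errors like the one above can also be searched for systematically. The sketch below flags year values that fall after the year of sale. The helper name `flag_implausible_years` and the toy data frame are hypothetical, and the rule itself is an assumption rather than part of the dataset documentation. A flagged row is a candidate for inspection, not necessarily an error.

```python
import pandas as pd

def flag_implausible_years(df, year_cols, sold_col='YrSold'):
    """Return, per column, the rows whose year is later than the sale year."""
    flagged = {}
    for col in year_cols:
        # Comparisons with NaN evaluate to False, so missing years are skipped
        mask = df[col] > df[sold_col]
        if mask.any():
            flagged[col] = df.loc[mask, [col, sold_col]]
    return flagged

# Toy data (made up) that mimics the typo found above
toy = pd.DataFrame({
    'YearBuilt': [1990, 2005, 2001],
    'GarageYrBlt': [1990.0, 2207.0, None],
    'YrSold': [2008, 2007, 2009],
})

problems = flag_implausible_years(toy, ['YearBuilt', 'GarageYrBlt'])
for col, rows in problems.items():
    print(col)
    print(rows)
```

On the real data, the same call could be tried with `data` and the year columns listed in the documentation, followed by a manual look at each flagged row before changing anything.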

The `dataprep` package has many useful functions for cleaning data. Check the [documentation](https://docs.dataprep.ai/user_guide/clean/introduction.html) to see what's available. 