<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-Data" data-toc-modified-id="Load-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load Data</a></span></li><li><span><a href="#Extract-Features-and-Targets" data-toc-modified-id="Extract-Features-and-Targets-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Extract Features and Targets</a></span></li><li><span><a href="#Create-Validation-Set" data-toc-modified-id="Create-Validation-Set-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Create Validation Set</a></span></li><li><span><a href="#Explore-Data" data-toc-modified-id="Explore-Data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Explore Data</a></span></li><li><span><a href="#Drop-Features-with-Missing-Data" data-toc-modified-id="Drop-Features-with-Missing-Data-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Drop Features with Missing Data</a></span></li><li><span><a href="#Imputation" data-toc-modified-id="Imputation-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Imputation</a></span></li></ul></div>

# Import Packages

In [1]:
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Import Data

## Load Data

In [13]:
# Define data locations
data_dir        = '../Data/house-prices-advanced-regression-techniques/'
train_file_name = 'train.csv'
test_file_name  = 'test.csv'

# Load training and testing data
train_data = pd.read_csv( data_dir + train_file_name, index_col='Id' )
test_data  = pd.read_csv( data_dir + test_file_name, index_col='Id' )

# Remove rows with missing targets
train_data.dropna( axis=0, subset=['SalePrice'], inplace=True)

## Extract Features and Targets

In [14]:
# Extract targets and features
y = train_data.SalePrice
X = train_data.copy()
X.drop( ['SalePrice'], axis=1, inplace=True )

# As instructed by the course, use only numerical data
X = X.select_dtypes( exclude=['object'] )
X_test = test_data.select_dtypes( exclude=['object'] )

## Create Validation Set

In [15]:
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

## Explore Data

To avoid extraneous output, uncomment only the command of interest.

In [16]:
#X_train.describe()
X_train.head()
#y_train.describe()
#y_train.head()

#X_test.describe()
#X_test.head()
#list(X_test.columns) 

Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
619,20,90.0,11694,9,5,2007,2007,452.0,48,0,...,774,0,108,0,0,260,0,0,7,2007
871,20,60.0,6600,5,5,1962,1962,0.0,0,0,...,308,0,0,0,0,0,0,0,8,2009
93,30,80.0,13360,5,7,1921,2006,0.0,713,0,...,432,0,0,44,0,0,0,0,8,2009
818,20,,13265,8,5,2002,2002,148.0,1218,0,...,857,150,59,0,0,0,0,0,7,2008
303,20,118.0,13704,7,5,2001,2002,150.0,0,0,...,843,468,81,0,0,0,0,0,1,2006


# Model Creation

The model follows the specification given in the Kaggle course.

In [17]:
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Evaluate System

Here, a system is the combination of a model and a dataset.

In [18]:
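def score_system( m=model, X_t=X_train, X_v=X_val, y_t=y_train, y_v=y_val ):
    # Fit the model on the training split and return the validation MAE
    m.fit( X_t, y_t )
    pred_val = m.predict( X_v )
    # mean_absolute_error expects (y_true, y_pred)
    return mean_absolute_error( y_v, pred_val )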
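One thing worth keeping in mind with `score_system`: Python evaluates default argument values once, when the `def` statement runs. If an earlier cell is re-run and rebinds `X_train` or `model`, the function keeps using the old objects until its own cell is re-run. The sketch below shows the effect with plain lists standing in for the datasets. The names `data`, `total_bound_at_def`, and `total_looked_up_at_call` are made up for illustration and are not part of this notebook.

```python
# Stand-in for a dataset variable such as X_train (made-up values).
data = [1, 2, 3]

def total_bound_at_def(values=data):
    # `values` defaults to the object `data` referred to
    # when this `def` statement was executed.
    return sum(values)

def total_looked_up_at_call(values=None):
    # Resolve the default at call time instead.
    if values is None:
        values = data
    return sum(values)

# Rebind the name to a new object, as happens when a cell
# that reassigns X_train is re-run.
data = [10, 20, 30]

bound_result = total_bound_at_def()       # still sums [1, 2, 3]
late_result = total_looked_up_at_call()   # sums [10, 20, 30]
print(bound_result, late_result)
```

In this notebook the cells run top to bottom, so the defaults are the intended splits. Passing the arguments explicitly, or using the `None` pattern, is one way to avoid stale defaults if cells get re-run out of order.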

# Handle Missing Data

In [19]:
# Shape of training data (num_data_points, num_features)
print( X_train.shape )

# Number of missing values for each feature of training data
num_missing_val_per_feature = X_train.isnull().sum()
print( num_missing_val_per_feature[num_missing_val_per_feature > 0] )

# Features with missing values
feats_missing_vals = list(X_train.columns[X_train.isnull().any()])

(1168, 36)
LotFrontage    212
MasVnrArea       6
GarageYrBlt     58
dtype: int64


## Drop Features with Missing Data

In [20]:
reduced_X_train    = X_train.drop( feats_missing_vals, axis=1, inplace=False )
reduced_X_val      = X_val.drop( feats_missing_vals, axis=1, inplace=False )

In [21]:
score_system( X_t=reduced_X_train, X_v=reduced_X_val )

17837.82570776256

## Imputation

In [53]:
def score_imputation( replacement, X_t=X_train, X_v=X_val ):
    # Impute missing values with specified replacement
    imputed_X_train = X_t.fillna( replacement )
    imputed_X_val   = X_v.fillna( replacement )
    return score_system( X_t=imputed_X_train, X_v=imputed_X_val)

In [54]:
# Calculate mean, median, and min for each feature with missing values
feat_means   = X_train[feats_missing_vals].mean( skipna=True )
feat_medians = X_train[feats_missing_vals].median( skipna=True )
feat_mins    = X_train[feats_missing_vals].min( skipna=True )

print( 'Mean imputation MAE: %.2f' % score_imputation( feat_means ) )
print( 'Median imputation MAE: %.2f' % score_imputation( feat_medians ) )
print( 'Min imputation MAE: %.2f' % score_imputation( feat_mins ) )
print( 'Scalar (0) imputation MAE: %.2f' % score_imputation( 0 ) )

Mean imputation MAE: 18062.89
Median imputation MAE: 17791.60
Min imputation MAE: 18079.88
Scalar (0) imputation MAE: 18017.67


# Mixed Dropping and Imputation

Reading the feature descriptions suggests, for each feature with missing values, whether dropping it or imputing it makes more sense.

LotFrontage - Linear feet of street connected to property  
- Likely missing when no street is directly connected to the property, as with an apartment or condo.  
- If so, imputing 0 for NaN makes sense.

MasVnrArea - Masonry veneer area in square feet  
- Likely missing when the house has no masonry veneer.  
- If so, imputing 0 for NaN makes sense.

GarageYrBlt - Year garage was built  
- Likely missing when the house has no garage.  
- If so, no fill value is really meaningful (0 is not a plausible year), and simply removing the feature may give better predictions.
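The per-feature reasoning above can be sketched on a tiny made-up DataFrame. Only the column names come from the dataset; the values, the name `toy`, and the `fill_values` dictionary are assumptions for illustration. Passing a dictionary to `fillna` makes the choice of fill value explicit for each column, whereas passing a scalar 0 fills every remaining NaN.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names match the real data.
toy = pd.DataFrame(
    {
        "LotFrontage": [65.0, np.nan, 80.0],
        "MasVnrArea": [np.nan, 120.0, 0.0],
        "GarageYrBlt": [2003.0, np.nan, 1976.0],
        "LotArea": [8450, 9600, 11250],
    }
)

# Drop the feature for which no fill value is meaningful,
# then fill the remaining gaps column by column.
fill_values = {"LotFrontage": 0, "MasVnrArea": 0}
mixed = toy.drop(columns=["GarageYrBlt"]).fillna(value=fill_values)

print(mixed)
```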

In [55]:
# Drop year garage was built
reduced_X_train    = X_train.drop( ['GarageYrBlt'], axis=1, inplace=False )
reduced_X_val      = X_val.drop( ['GarageYrBlt'], axis=1, inplace=False )

# Perform scalar imputation with 0's
print( 'Mixed dropping and scalar (0) imputation MAE: %.2f' % score_imputation( 0, X_t=reduced_X_train, X_v=reduced_X_val ) )

Mixed dropping and scalar (0) imputation MAE: 18133.38


# Scikit-Learn Built-In Imputer

An approach that reinvents less of the wheel is to use scikit-learn's built-in `SimpleImputer` class.

I think I prefer the way I performed imputation above. It takes less code, the operations seem simpler, and it works natively with pandas DataFrames.

In [57]:
from sklearn.impute import SimpleImputer

In [65]:
# Fit the imputer on the training data only, then apply it to both sets
median_imputer  = SimpleImputer( strategy='median' )

# SimpleImputer returns NumPy arrays, so restore the column names and Id index
imputed_X_train = pd.DataFrame( median_imputer.fit_transform( X_train ), columns=X_train.columns, index=X_train.index )
imputed_X_val   = pd.DataFrame( median_imputer.transform( X_val ), columns=X_val.columns, index=X_val.index )

score_system( X_t=imputed_X_train, X_v=imputed_X_val )

17791.59899543379
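This score equals the median imputation MAE from the Imputation section (17791.60). That is what I would expect, since both approaches fill gaps with medians computed on the training set. A small check on made-up data (the frames `train` and `val` and their values are assumptions for illustration) suggests the two routes produce the same filled values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training and validation frames with non-default indices.
train = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0, 10.0], "b": [4.0, 5.0, np.nan, 6.0]},
    index=[11, 12, 13, 14],
)
val = pd.DataFrame({"a": [np.nan, 2.0], "b": [7.0, np.nan]}, index=[21, 22])

# pandas route: fill with per-column medians of the training data.
medians = train.median()
val_fillna = val.fillna(medians)

# scikit-learn route: fit on training data, transform validation data,
# then wrap the returned array back into a DataFrame.
imputer = SimpleImputer(strategy="median")
imputer.fit(train)
val_sklearn = pd.DataFrame(
    imputer.transform(val), columns=val.columns, index=val.index
)

same = np.allclose(val_fillna.to_numpy(), val_sklearn.to_numpy())
print(same)
```

The main practical difference in this sketch is that `SimpleImputer` hands back a NumPy array, so the column names and index have to be restored by hand.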