<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-Data" data-toc-modified-id="Load-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load Data</a></span></li><li><span><a href="#Extract-Features-and-Targets" data-toc-modified-id="Extract-Features-and-Targets-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Extract Features and Targets</a></span></li><li><span><a href="#Create-Validation-Set" data-toc-modified-id="Create-Validation-Set-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Create Validation Set</a></span></li><li><span><a href="#Explore-Data" data-toc-modified-id="Explore-Data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Explore Data</a></span></li><li><span><a href="#Manage-Missing-Categorical-Data" data-toc-modified-id="Manage-Missing-Categorical-Data-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Manage Missing Categorical Data</a></span></li><li><span><a href="#Manage-Missing-Numerical-Data" data-toc-modified-id="Manage-Missing-Numerical-Data-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Manage Missing Numerical Data</a></span><ul class="toc-item"><li><span><a href="#Drop-Numerical-Features-with-Missing-Data" data-toc-modified-id="Drop-Numerical-Features-with-Missing-Data-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Drop Numerical Features with Missing Data</a></span></li><li><span><a href="#Imputation" data-toc-modified-id="Imputation-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Imputation</a></span></li><li><span><a href="#Mixed-Dropping-and-Imputation" data-toc-modified-id="Mixed-Dropping-and-Imputation-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Mixed Dropping and Imputation</a></span></li></ul></li><li><span><a href="#Drop-Categorical-Data" data-toc-modified-id="Drop-Categorical-Data-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Drop Categorical Data</a></span></li><li><span><a href="#Label-Encode-Categorical-Data" data-toc-modified-id="Label-Encode-Categorical-Data-8"><span 
class="toc-item-num">8&nbsp;&nbsp;</span>Label Encode Categorical Data</a></span></li><li><span><a href="#One-Hot-Encode-Categorical-Data" data-toc-modified-id="One-Hot-Encode-Categorical-Data-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>One-Hot Encode Categorical Data</a></span></li><li><span><a href="#Join-Training-and-Validation" data-toc-modified-id="Join-Training-and-Validation-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Join Training and Validation</a></span></li><li><span><a href="#Create-Pipeline-with-Winning-Transformation-Combo" data-toc-modified-id="Create-Pipeline-with-Winning-Transformation-Combo-11"><span class="toc-item-num">11&nbsp;&nbsp;</span>Create Pipeline with Winning Transformation Combo</a></span></li><li><span><a href="#Fit-Pipeline-to-Data-and-Make-Predictions" data-toc-modified-id="Fit-Pipeline-to-Data-and-Make-Predictions-12"><span class="toc-item-num">12&nbsp;&nbsp;</span>Fit Pipeline to Data and Make Predictions</a></span></li></ul></div>

# Import Packages

In [29]:
from collections import defaultdict

import pandas as pd
import pandas_profiling as pp

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Import Data

## Load Data

In [2]:
# Define data locations
data_dir        = '../Data/house-prices-advanced-regression-techniques/'
train_file_name = 'train.csv'
test_file_name  = 'test.csv'

# Load training and testing data
train_data = pd.read_csv( data_dir + train_file_name, index_col='Id' )
test_data  = pd.read_csv( data_dir + test_file_name, index_col='Id' )

# Rearrange columns alphabetically 
train_data = train_data.reindex(sorted(train_data.columns), axis=1)
test_data = test_data.reindex(sorted(test_data.columns), axis=1)

# Remove rows with missing targets
train_data.dropna( axis=0, subset=['SalePrice'], inplace=True)

## Extract Features and Targets

In [3]:
# Extract targets and features
y = train_data.SalePrice
X = train_data.copy()
X.drop( ['SalePrice'], axis=1, inplace=True )

X_test = test_data.copy()

## Create Validation Set

This is not needed if we are using cross-validation.

In [None]:
X_train, X_val, y_train, y_val = train_test_split( X, y, 
                                                   train_size=0.8, 
                                                   test_size=0.2, 
                                                   random_state=0)

## Explore Data

In [None]:
pp.ProfileReport( X )

# Manage Missing Data

## Manage Missing Categorical Data

In [4]:
# Get the columns for categorical features that are missing values
cat_feats_missing_vals = list(X.columns[(X.dtypes =='object') & X.isnull().any()])
print( cat_feats_missing_vals )

['Alley', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'BsmtQual', 'Electrical', 'Fence', 'FireplaceQu', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType', 'MasVnrType', 'MiscFeature', 'PoolQC']


Based on the provided data description, most of the missing values should be replaced with "NA".

The only special cases are:
 
MasVnrType: Masonry veneer type
- None: None

BsmtExposure: Refers to walkout or garden level walls
- No:   No Exposure
- NA:   No Basement

Electrical: Electrical system
- No good default label, so we will remove this feature. Additionally, the Profile Report shows that this feature has a uniqueness of 0.4%, so it may not tell us much about the data anyway.

In [5]:
# Possible reasons for missing values in BsmtExposure are "no exposure" or "no basement"
# A simple cross-reference counts the rows where another basement feature has a value but BsmtExposure does not
(X['BsmtExposure'].isnull() & X['BsmtCond'].notnull()).sum()

1

This occurs once, so to be safe we map NaN to "No".
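
The cross-reference above can be illustrated on a small toy frame. This is a minimal sketch with made-up values (the `toy` frame is hypothetical, not taken from the dataset): a row where both basement columns are missing is consistent with "no basement", while a row with a basement condition but no exposure value is the ambiguous case being counted.

```python
import pandas as pd

# Hypothetical toy data for the two basement columns
toy = pd.DataFrame({
    'BsmtCond':     ['TA', None, 'Gd', 'TA'],
    'BsmtExposure': ['No', None, None, 'Av'],
})

# Row 1: both values missing, consistent with "no basement"
# Row 2: BsmtCond is present but BsmtExposure is missing, so a basement
#        exists and the missing exposure cannot mean "no basement"
has_basement_but_no_exposure = toy['BsmtExposure'].isnull() & toy['BsmtCond'].notnull()
n_inconsistent = int(has_basement_but_no_exposure.sum())
print(n_inconsistent)
```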

In [6]:
# Custom transformer that maps missing categorical labels to their appropriate label
class CustomCategoicalImputer( BaseEstimator, TransformerMixin ):
    # Class constructor
    def __init__( self ):
        self._label_mapping = {}
        
    def fit( self, X, y=None ):
        # Default fill label for every feature with missing values in the fitted data
        feats_missing_vals = list(X.columns[X.isnull().any()])
        
        for feat in feats_missing_vals:
            self._label_mapping[feat] = 'NA'
        
        # Special cases taken from the data description
        self._label_mapping['MasVnrType'] = 'None'
        self._label_mapping['BsmtExposure'] = 'No'

        return self 
    
    def transform( self, X, y=None ):
        # Return a new frame so the caller's data is not modified in place;
        # errors='ignore' keeps this safe when 'Electrical' is not among the columns
        X = X.drop( columns=['Electrical'], errors='ignore' )
        return X.fillna( value=self._label_mapping )

In [7]:
custom_impute_categorical_pipe_tuple = ('custom_impute_categorical',
                                        CustomCategoicalImputer())

Imputation should be applied to all columns, because a feature may be missing values in the test set but not in the training set. Some features, however, have no good default value that can be inferred from their description. For this reason we apply "Most Frequent" imputation to all of the columns after performing the custom imputation.

In [8]:
most_frequent_impute_categorical_pipe_tuple = ('most_frequent_impute_categorical',
                                               SimpleImputer(strategy='most_frequent'))
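
As a minimal sketch of why this fallback helps, the toy example below (hypothetical frames `train_toy` and `test_toy`, not the real data) fits a most-frequent imputer on a column that is complete in training and then fills a value that is missing only at test time.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: the column is complete in training
# but has a missing value in the test set
train_toy = pd.DataFrame({'Street': ['Pave', 'Pave', 'Grvl']})
test_toy  = pd.DataFrame({'Street': ['Grvl', np.nan]})

# The most frequent training label ('Pave') is learned during fit
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(train_toy)

# The missing test value is replaced with the label learned from training
filled = imputer.transform(test_toy)
print(filled)
```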

## Manage Missing Numerical Data

### Drop Numerical Features with Missing Data

Dropping features with missing values is as simple as not including those columns in the ColumnTransformer.

We should perform some sort of imputation afterwards in the event a feature is missing a value in the test set and not in the training set. 

As such we can just perform imputation on the columns with no missing values in the training set. 

In [9]:
numerical_feats              = X.columns[X.dtypes != 'object']
numerical_feats_missing_vals = X.columns[(X.dtypes != 'object') & X.isna().any()]

In [12]:
drop_missing_numerical_trans_tuple = ('drop_missing_numerical', 
                                      SimpleImputer(strategy='median'), 
                                      numerical_feats.difference(numerical_feats_missing_vals))
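
The following sketch, on a hypothetical two-column frame, shows the mechanism this relies on: with its default `remainder='drop'`, a `ColumnTransformer` discards any column that no transformer tuple lists.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data: LotFrontage has a missing value, LotArea does not
toy = pd.DataFrame({
    'LotArea':     [8450.0, 9600.0, 11250.0],
    'LotFrontage': [65.0, np.nan, 68.0],
})

# Only list the columns that are complete in the training data
complete_cols = [c for c in toy.columns if not toy[c].isna().any()]

# Columns not named in any tuple are dropped (remainder='drop' is the default)
ct = ColumnTransformer(transformers=[('keep_complete',
                                      SimpleImputer(strategy='median'),
                                      complete_cols)])
out = ct.fit_transform(toy)
print(out.shape)
```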

### Imputation

Imputation can be done with the "SimpleImputer" class transforming all of the numerical columns.

The example below uses median imputation, but any strategy can be used by defining a new transformation tuple.

In [10]:
numerical_feats = X.columns[X.dtypes != 'object']

In [11]:
impute_missing_numerical_trans_tuple = ('impute_missing_numerical', 
                                        SimpleImputer(strategy='median'), 
                                        numerical_feats)

### Mixed Dropping and Imputation

The feature descriptions give some intuition about whether removing a feature or imputing its missing values makes more sense.

LotFrontage - Linear feet of street connected to property  
- Likely missing if no street is connected to the property, such as an apartment or condo.  
- If this is the case, it makes sense to fill NaN values with 0

MasVnrArea  - Masonry veneer area in square feet  
- Likely missing if there is no masonry veneer  
- If this is the case, it makes sense to fill NaN values with 0

GarageYrBlt - Year garage was built  
- Likely missing if there is no garage  
- If this is the case, imputation does not make much sense and simply removing the feature may result in better predictions

Mixing dropping and imputation is again as simple as leaving the offending columns out of the ColumnTransformer. Note that the tuple defined here still uses median imputation for the remaining columns rather than zeros.

In [13]:
numerical_feats         = X.columns[X.dtypes != 'object']
numerical_feats_to_drop = ['GarageYrBlt']

In [14]:
mixed_drop_impute_missing_numerical_trans_tuple = ('mixed_drop_impute_missing_numerical', 
                                                   SimpleImputer(strategy='median'), 
                                                   numerical_feats.difference(numerical_feats_to_drop))
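
The zero-fill idea described for LotFrontage and MasVnrArea is not what the tuple above does (it uses the median). A minimal sketch of the zero-fill variant, on hypothetical toy data and with hypothetical names `zero_fill_feats` and `median_feats`, could look like this. It was not part of the evaluated combinations.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 80.0],
    'MasVnrArea':  [196.0, 0.0, np.nan],
    'GarageYrBlt': [2003.0, np.nan, 1976.0],
    'LotArea':     [8450.0, np.nan, 11250.0],
})

zero_fill_feats = ['LotFrontage', 'MasVnrArea']  # missing is assumed to mean "none"
median_feats    = ['LotArea']                    # everything else gets the median
# GarageYrBlt is listed nowhere, so it is dropped

ct = ColumnTransformer(transformers=[
    ('zero_fill',   SimpleImputer(strategy='constant', fill_value=0), zero_fill_feats),
    ('median_fill', SimpleImputer(strategy='median'),                 median_feats),
])

# Output column order: LotFrontage, MasVnrArea, LotArea
out = ct.fit_transform(toy)
print(out)
```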

# Manage Categorical Data

## Drop Categorical Data

Dropping categorical data is as simple as not including a tuple for categorical data in the ColumnTransformer pipeline. A transformation tuple is defined below in order to help automate the process of comparing all pipeline methods. 

In [15]:
categorical_feats = X.columns[X.dtypes == 'object']

In [16]:
drop_categorical_trans_tuple = ('drop_categorical', 'drop', categorical_feats)

## Label Encode Categorical Data

The standard label encoder does not have a way of encoding labels found in the test set that are not in the training set. 

For this reason we write our own that assigns the encoding 0 to any label in the test set that does not appear in the training set.

In [17]:
# Custom transformer that label-encodes categorical features,
# assigning 0 to any label that was not seen during fitting
class CustomCategoicalLabelEncoder( BaseEstimator, TransformerMixin ):
    # Class constructor
    def __init__( self ):
        self._label_map = {}
        
    def fit( self, X, y=None ):        
        for feat in X.columns:
            # Unseen labels fall back to 0; seen labels are numbered from 1
            unique_dict = defaultdict(lambda: 0)
            for i, value in enumerate(X[feat].unique()):
                unique_dict[value] = i+1
    
            self._label_map[feat] = unique_dict

        return self 
    
    def transform( self, X, y=None ):
        # Encode a copy so the caller's data is not modified in place
        X = X.copy()
        for feat in X.columns:
            X[feat] = X[feat].map( self._label_map[feat] )
        return X
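
The fallback to 0 for unseen labels can be checked in isolation. This is a small sketch on hypothetical label values; it relies on pandas using a dictionary's `__missing__` default when a `defaultdict` is passed to `Series.map`.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical labels: 'Fa' appears only in the test data
train_labels = pd.Series(['Gd', 'TA', 'Gd', 'Ex'])
test_labels  = pd.Series(['TA', 'Fa', 'Ex'])

# Labels seen in training are numbered from 1 in order of first appearance
label_map = defaultdict(lambda: 0)
for i, label in enumerate(train_labels.unique()):
    label_map[label] = i + 1

# Unseen labels are encoded with the defaultdict's default value
encoded = test_labels.map(label_map)
print(encoded.tolist())
```

Recent scikit-learn versions also provide `OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=...)`, which may serve a similar purpose; it was not used in this notebook.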

In [18]:
categorical_feats = X.columns[X.dtypes == 'object']

In [19]:
custom_label_categorical_pipe_tuple  = ('custom_label_categorical',
                                        CustomCategoicalLabelEncoder())

custom_label_categorical_pipeline    = Pipeline( steps=[custom_impute_categorical_pipe_tuple,
                                                        custom_label_categorical_pipe_tuple,
                                                        most_frequent_impute_categorical_pipe_tuple] )

custom_label_categorical_trans_tuple = ('custom_label_categorical',
                                        custom_label_categorical_pipeline,
                                        categorical_feats)

## One-Hot Encode Categorical Data

We can one-hot encode the data with the sklearn OneHotEncoder, passing only the columns whose cardinality is at or below a chosen threshold.

In [20]:
card_thresh = 10
categorical_feats  = X.columns[X.dtypes == 'object']
low_card_cat_feats = [feat for feat in categorical_feats if X[feat].nunique() <= card_thresh]

In [21]:
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use that name on newer versions
one_hot_categorical_pipe_tuple  = ('one_hot_categorical',
                                   OneHotEncoder(handle_unknown='ignore', sparse=False))

one_hot_categorical_pipeline    = Pipeline( steps=[custom_impute_categorical_pipe_tuple,
                                                   most_frequent_impute_categorical_pipe_tuple,
                                                   one_hot_categorical_pipe_tuple] )

one_hot_categorical_trans_tuple = ('one_hot_categorical',
                                   one_hot_categorical_pipeline,
                                   low_card_cat_feats)
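
As a minimal sketch of what `handle_unknown='ignore'` does, the toy example below (hypothetical frames, not the real data) encodes a test label that never appeared in training. The default sparse output is converted with `.toarray()` so the snippet does not depend on the `sparse`/`sparse_output` parameter name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 'Dirt' appears only in the test set
train_toy = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
test_toy  = pd.DataFrame({'Street': ['Grvl', 'Dirt']})

# Categories are learned from training and sorted: ['Grvl', 'Pave']
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_toy)

# An unknown label produces a row of zeros instead of raising an error
encoded = encoder.transform(test_toy).toarray()
print(encoded)
```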

# Model Creation

Defined via the specs on the Kaggle course.

In [22]:
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Evaluate System

A system is the combination of a model and a dataset.

The score_system function is not needed when using cross-validation.

In [None]:
def score_system( pipeline, X_t=X_train, X_v=X_val, y_t=y_train, y_v=y_val ):
    # Fit on the training split and report MAE on the validation split
    pipeline.fit( X_t, y_t )
    pred_val = pipeline.predict( X_v )
    return mean_absolute_error( y_v, pred_val )

In [37]:
def evaluate_transformation_combinations( cat_trans, num_trans, m=model ):
    for num_transform in num_trans:
        print( 'MAE with %s:' % (num_transform) )
        for cat_transform in cat_trans:
            preprocessor = ColumnTransformer( transformers=[num_trans[num_transform],
                                                            cat_trans[cat_transform]] )
            pipeline = Pipeline( steps=[('preprocessor', preprocessor),
                                        ('model', m)] )
            
            scores = -1 * cross_val_score( pipeline, X, y, cv=5,
                                           scoring='neg_mean_absolute_error')
            
            print( "\tAverage MAE score (across experiments) with %s: %f" % (cat_transform, scores.mean()) )

In [25]:
# Dataset slices to ensure transformations work
card_thresh = 10
categorical_feats  = X.columns[X.dtypes == 'object']
low_card_cat_feats = [feat for feat in categorical_feats if X[feat].nunique() <= card_thresh]

numerical_feats              = X.columns[X.dtypes != 'object']
numerical_feats_missing_vals = X.columns[(X.dtypes != 'object') & X.isna().any()]
numerical_feats_to_drop      = ['GarageYrBlt'] # Feats to drop during mixed dropping and imputation

In [26]:
categorical_transformations = {'drop_categorical'         : drop_categorical_trans_tuple,
                               'custom_label_categorical' : custom_label_categorical_trans_tuple,
                               'one_hot_categorical'      : one_hot_categorical_trans_tuple}

numerical_transformations   = {'drop_missing_numerical'              : drop_missing_numerical_trans_tuple,
                               'impute_missing_numerical'            : impute_missing_numerical_trans_tuple,
                               'mixed_drop_impute_missing_numerical' : mixed_drop_impute_missing_numerical_trans_tuple}

In [38]:
evaluate_transformation_combinations( categorical_transformations,
                                      numerical_transformations )

MAE with drop_missing_numerical:
	Average MAE score (across experiments) with drop_categorical: 17884.688402
	Average MAE score (across experiments) with custom_label_categorical: 17575.180740
	Average MAE score (across experiments) with one_hot_categorical: 17656.401849
MAE with impute_missing_numerical:
	Average MAE score (across experiments) with drop_categorical: 18053.033783
	Average MAE score (across experiments) with custom_label_categorical: 17662.064911
	Average MAE score (across experiments) with one_hot_categorical: 17720.302932
MAE with mixed_drop_impute_missing_numerical:
	Average MAE score (across experiments) with drop_categorical: 17989.217826
	Average MAE score (across experiments) with custom_label_categorical: 17614.623589
	Average MAE score (across experiments) with one_hot_categorical: 17665.810514
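
To pick the winning combination from the scores printed above, the values can be collected into a dictionary and the minimum taken. This is a small sketch; the dictionary `mae_scores` is a hypothetical helper that simply copies the reported numbers.

```python
# Reported average MAE per (numerical, categorical) transformation combination
mae_scores = {
    ('drop_missing_numerical', 'drop_categorical'):                      17884.688402,
    ('drop_missing_numerical', 'custom_label_categorical'):              17575.180740,
    ('drop_missing_numerical', 'one_hot_categorical'):                   17656.401849,
    ('impute_missing_numerical', 'drop_categorical'):                    18053.033783,
    ('impute_missing_numerical', 'custom_label_categorical'):            17662.064911,
    ('impute_missing_numerical', 'one_hot_categorical'):                 17720.302932,
    ('mixed_drop_impute_missing_numerical', 'drop_categorical'):         17989.217826,
    ('mixed_drop_impute_missing_numerical', 'custom_label_categorical'): 17614.623589,
    ('mixed_drop_impute_missing_numerical', 'one_hot_categorical'):      17665.810514,
}

# Lower MAE is better, so the winner is the key with the smallest value
best_combo = min(mae_scores, key=mae_scores.get)

# Gap between the worst and best combination
spread = max(mae_scores.values()) - min(mae_scores.values())

print(best_combo, mae_scores[best_combo], spread)
```

The gap between the best and worst combination is small relative to the scores themselves, so the ranking may be sensitive to the random seed and fold assignment; that sensitivity was not checked here.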


# Submit Predictions

## Join Training and Validation

This allows us to use the most data on hand to produce the best model possible for prediction.

In [39]:
# Dataset slices to ensure transformations work
card_thresh = 10
categorical_feats  = X.columns[X.dtypes == 'object']
low_card_cat_feats = [feat for feat in categorical_feats if X[feat].nunique() <= card_thresh]

numerical_feats              = X.columns[X.dtypes != 'object']
numerical_feats_missing_vals = X.columns[(X.dtypes != 'object') & X.isna().any()]

## Create Pipeline with Winning Transformation Combo

In [43]:
preprocess = ColumnTransformer( transformers=[numerical_transformations['drop_missing_numerical'],
                                              categorical_transformations['custom_label_categorical']] )
pipeline = Pipeline( steps=[('preprocess', preprocess), 
                            ('model', model)] )

In [44]:
pipeline.fit( X, y )

preds_test = pipeline.predict( X_test )

## Fit Pipeline to Data and Make Predictions

In [42]:
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

In [53]:
b = range( 50, 450, 50 )


In [56]:
b[7]

400