## Kaggle House Price Prediction Competition

In this notebook I process the training and test datasets to remove NaNs. The notebook generates a CSV file for each of the NaN-processed datasets. It is provided for anyone who wishes to tackle this competition without first having to clean the data themselves.

Some processing takes place before the main body of the function; the specifics are explained in the relevant sections below.

Where values needed to be imputed, the training set has been used to calculate the replacement values. Other choices could have been made for replacement values; an explanation of the choices made is given below, and it should be easy enough to modify this notebook to change how values are imputed.

To make this notebook as concise as possible I have removed all of the cells containing the data inspection that I did to determine how to process the NaNs.

### NaN removal explained
I have not set out to create an exhaustive process for removing all NaNs from any data set using these features. Rather, I have looked to tackle the NaNs present in these specific training and test sets, generalising where possible.

#### Starting Point
19 features in the training set contain NaNs; 33 features in the test set contain NaNs.

##### Lot Frontage

All dwellings are houses so should have some frontage, i.e. there are no flats which would not have any street connected to the property. I will impute this with the mean frontage for the Neighbourhood of the house.
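
As a rough illustration of the neighbourhood-mean idea, the sketch below uses a small made-up frame rather than the competition files. The names `train_demo`, `test_demo` and `impute_frontage` are hypothetical, and it uses `Series.map` to look up the mean for each row, which is one possible way to do the lookup and not necessarily how the rest of this notebook does it.

```python
import pandas as pd

# Toy data for illustration only (not the competition files)
train_demo = pd.DataFrame({
    'Neighborhood': ['NAmes', 'NAmes', 'Mitchel', 'Mitchel'],
    'LotFrontage': [60.0, 80.0, 50.0, None],
})
test_demo = pd.DataFrame({
    'Neighborhood': ['NAmes', 'Mitchel', 'Mitchel'],
    'LotFrontage': [None, None, 45.0],
})

# Mean frontage per neighbourhood, computed from the training data only
# (NaNs are skipped by mean() by default)
frontage_means = train_demo.groupby('Neighborhood')['LotFrontage'].mean()

def impute_frontage(df, means):
    '''Hypothetical helper: fill missing LotFrontage with the neighbourhood mean'''
    out = df.copy()
    out['LotFrontage'] = out['LotFrontage'].fillna(out['Neighborhood'].map(means))
    return out

train_filled = impute_frontage(train_demo, frontage_means)
test_filled = impute_frontage(test_demo, frontage_means)
print(test_filled)
```

Because the means come from the training frame in both calls, the test set is filled without using any of its own values.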

##### Alley 

NaN for this feature is defined in the data description as meaning there is no alley access, so it is changed to 'None'.

##### Masonry Veneer Type and Area

There are multiple instances where MasVnrArea is 0; in these cases a NaN MasVnrType is set to 'None'. There is one instance in the test set where an area is given but no type (record 1150). Inspecting this record, the house is made of plywood, and the most common veneer type for plywood houses that have one is 'BrkFace'.

##### Basement Variables
Test observation 660 has a NaN for TotalBsmtSF. Examining this record it does not appear that there is a basement. I have therefore set TotalBsmtSF to 0 which will allow the other basement feature processing to occur without error.

Where there is a total basement area greater than 0 but the basement quality and/or condition entries are missing, these have been assumed to be 'TA', which is the coding for typical quality.

Where the total basement area is zero and qualitative basement feature values are missing, these have been set to 'None'. Where a basement is present but 'BsmtExposure' is missing, it has been set to 'No' to fit with the convention for this feature.

If a basement has a valid BsmtFinType1 and a NaN for BsmtFinType2, the latter has been set to 'Unf'.

If total basement SF = 0 and BsmtFinSF1 is NaN then BsmtFinSF1 and BsmtUnfSF have been set to 0.

Missing BsmtFinSF2 has been set to 0.
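
The qualitative basement rules can be sketched on a toy frame as below. This is only an illustration: `bsmt_demo` and its values are made up, and the steps follow the order used in `na_processing`, where a missing `BsmtExposure` becomes 'No' only when a basement is present.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only: row 0 has a basement with gaps,
# row 1 has no basement, row 2 is complete
bsmt_demo = pd.DataFrame({
    'TotalBsmtSF':  [800.0, 0.0, 650.0],
    'BsmtQual':     [np.nan, np.nan, 'Gd'],
    'BsmtCond':     ['TA', np.nan, 'TA'],
    'BsmtExposure': [np.nan, np.nan, 'Av'],
    'BsmtFinType1': ['GLQ', np.nan, 'ALQ'],
    'BsmtFinType2': [np.nan, np.nan, 'Unf'],
})
qual_cols = ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']
has_bsmt = bsmt_demo['TotalBsmtSF'] > 0

# Basement present but quality/condition missing -> typical ('TA')
for col in ['BsmtQual', 'BsmtCond']:
    bsmt_demo.loc[has_bsmt & bsmt_demo[col].isna(), col] = 'TA'

# No basement -> every missing qualitative value becomes 'None'
for col in qual_cols:
    bsmt_demo.loc[~has_bsmt & bsmt_demo[col].isna(), col] = 'None'

# Basement present but exposure missing -> 'No'
bsmt_demo.loc[has_bsmt & bsmt_demo['BsmtExposure'].isna(), 'BsmtExposure'] = 'No'

# Valid BsmtFinType1 but missing BsmtFinType2 -> 'Unf'
missing_type2 = bsmt_demo['BsmtFinType1'].notna() & bsmt_demo['BsmtFinType2'].isna()
bsmt_demo.loc[missing_type2, 'BsmtFinType2'] = 'Unf'

print(bsmt_demo)
```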


##### Electrical
The house missing this info (training id 1379) has air con which implies it must have electrical. I will impute this as the modal electrical type.
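
A minimal sketch of modal imputation, assuming a made-up `Electrical` column (the frames `train_demo` and `test_demo` are hypothetical). The mode is taken from the training data and then reused for any frame that needs filling.

```python
import pandas as pd

# Toy data for illustration only
train_demo = pd.DataFrame({'Electrical': ['SBrkr', 'SBrkr', 'FuseA', None]})
test_demo = pd.DataFrame({'Electrical': ['FuseA', None]})

# Series.mode() ignores NaN by default and returns the modes sorted,
# so [0] picks the most frequent value (the first one if there is a tie)
modal_electrical = train_demo['Electrical'].mode()[0]

train_filled = train_demo['Electrical'].fillna(modal_electrical)
test_filled = test_demo['Electrical'].fillna(modal_electrical)
print(modal_electrical)
```

The same pattern applies to any feature imputed with a single modal value.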

##### FireplaceQu
All NaNs for this feature correspond to the 'Fireplaces' feature being 0 so are set to 'None'.

##### Garage Variables
Test observation 1116 appears to have a garage but is missing almost all information about it other than that it is detached. I have dealt with this as a special case rather than in the main function. To impute the missing values I have grouped the training houses by their zone and garage type, worked out the modal value of the other garage features for each group, and input these into test observation 1116. The code applies the same fill to test observation 666.

Test observation 1132 has a 'GarageYrBlt' value of 2207. Looking at this record it seems that this should have been 2007, so this has been explicitly corrected.

Other qualitative garage feature NaNs correspond to houses without garages so have been set to 'None'.
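
The grouped modal lookup can be sketched as follows. The data is made up, only two garage features are shown, and `first_mode` and `target_row` are hypothetical names; the sketch uses the pandas `mode` method, which may differ from the function used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy training data for illustration only
train_demo = pd.DataFrame({
    'MSZoning':     ['RL', 'RL', 'RL', 'RM'],
    'GarageType':   ['Detchd', 'Detchd', 'Detchd', 'Detchd'],
    'GarageFinish': ['Unf', 'Unf', 'RFn', 'Fin'],
    'GarageCars':   [2, 2, 1, 1],
})

def first_mode(values):
    '''Hypothetical helper: most frequent value in a group (first if tied)'''
    return values.mode().iloc[0]

# One row of modal garage values per (zone, garage type) pair
garage_modes = (train_demo
                .groupby(['MSZoning', 'GarageType'])[['GarageFinish', 'GarageCars']]
                .agg(first_mode))

# A record that is known to have a detached garage but nothing else
target_row = pd.Series({'MSZoning': 'RL', 'GarageType': 'Detchd',
                        'GarageFinish': np.nan, 'GarageCars': np.nan})

# Look up the matching group and fill only the missing entries
fill_values = garage_modes.loc[(target_row['MSZoning'], target_row['GarageType'])]
filled_row = target_row.fillna(fill_values)
print(filled_row)
```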

##### Pool Quality
There are some Pools which have an area but no quality value. These have been set to Gd as this is the most common quality in the training set (albeit with a small sample size). All other NaNs are 'None'.
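
A small sketch of the two-step pool rule on made-up data (`pool_demo` is hypothetical). The order matters: pools with an area are handled first, and only then are the remaining NaNs treated as "no pool".

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
pool_demo = pd.DataFrame({
    'PoolArea': [0, 512, 444],
    'PoolQC':   [np.nan, np.nan, 'Ex'],
})

# Step 1: a pool exists but its quality is missing -> 'Gd'
has_pool = pool_demo['PoolArea'] > 0
pool_demo.loc[has_pool & pool_demo['PoolQC'].isna(), 'PoolQC'] = 'Gd'

# Step 2: every remaining NaN means there is no pool -> 'None'
pool_demo['PoolQC'] = pool_demo['PoolQC'].fillna('None')
print(pool_demo)
```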

##### Fence
NaNs are assumed as not having a fence so set to 'None'.

##### Misc Feature
NaNs set to 'None'.

##### MSZoning
All houses must be zoned, so NaNs have been imputed as the modal zone type.

##### Utilities
All houses should have some utility access so this has been imputed as the modal value for the feature.

##### Exterior Covering
I have assumed that houses in the same neighbourhood have similar styles and that all houses have some exterior covering. NaNs have been imputed as the modal type for their neighbourhood.

##### Kitchen Quality
Where a kitchen is present but has a NaN for quality, this has been set to typical, 'TA'.

##### Functionality
NaNs for this feature have been set to 'Typ' as this fits with the most common entry and likelihood based on feature description.

##### Sale Type
NaNs set to 'WD' as the most likely value.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 100)
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from scipy.stats import mode

In [2]:
#Import data and take a copy for experimenting during exploration

test = pd.read_csv('test.csv')
house_test = test.copy()

train = pd.read_csv('train.csv')
house_train = train.copy()

In [3]:
display('Features containing NaNs in test set = {}'.format(house_test.isna().any().sum()))
display('Features containing NaNs in training set = {}'.format(house_train.isna().any().sum()))

na_col_train = house_train.isna().any()
train_na = house_train.loc[:,na_col_train].isna().sum()

na_col_test = house_test.isna().any()
test_na = house_test.loc[:,na_col_test].isna().sum()

pd.DataFrame([train_na,test_na],index=['train','test']).T

'Features containing NaNs in test set = 33'

'Features containing NaNs in training set = 19'

| Feature | train | test |
|---|---|---|
| LotFrontage | 259.0 | 227.0 |
| Alley | 1369.0 | 1352.0 |
| MasVnrType | 8.0 | 16.0 |
| MasVnrArea | 8.0 | 15.0 |
| BsmtQual | 37.0 | 44.0 |
| BsmtCond | 37.0 | 45.0 |
| BsmtExposure | 38.0 | 44.0 |
| BsmtFinType1 | 37.0 | 42.0 |
| BsmtFinType2 | 38.0 | 42.0 |
| Electrical | 1.0 | |


In [4]:
#Process NaNs in training and test data frames

#Deal with Garage NaNs in records 1116 and 666 of the test set
garage_cols = [col for col in test.columns if 'Garage' in col]
garage_cols.remove('GarageType')
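#Modal value per (zone, garage type) group; pandas mode is used so that string columns are handled
garage_groups = house_train.groupby(['MSZoning','GarageType'])[garage_cols].agg(lambda x: x.mode().iloc[0] if x.notna().any() else np.nan)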

def fill_row_na(input_df,row,fill_group):
    '''Fill missing values in one dataframe row using modal values grouped by MSZoning and GarageType (computed outside the function)'''
    df = input_df.copy() # take copy of data frame so as not to modify the original
    zone = df.iloc[row,df.columns.get_loc("MSZoning")]
    gtype = df.iloc[row,df.columns.get_loc('GarageType')]
    fill_values = fill_group.loc[(zone,gtype)] # modal values for this row's group
    for ind, item in fill_values.items(): # Series.iteritems() was removed in pandas 2.0
        if pd.isna(df.loc[row,ind]): # pd.isna also copes with string values, unlike np.isnan
            df.loc[row,ind] = item
    return df

house_test = fill_row_na(house_test,1116,garage_groups)
house_test = fill_row_na(house_test,666,garage_groups)


#Correct GarageYrBlt = 2207 in test set
house_test.loc[1132,'GarageYrBlt'] = 2007

#Test set record 660 creates a specific problem as it records a NaN for TotalBsmtSF. Setting this to 0 will allow
#the na_processing function below to handle the other NaNs
house_test.loc[house_test['TotalBsmtSF'].isna(),'TotalBsmtSF'] = 0

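#One test observation has a veneer area but no type, set this to BrkFace as it is the most common veneer type for plywood houses
#Only rows with a missing type are changed so that existing veneer types are not overwritten
house_test.loc[(house_test['Neighborhood']=='Mitchel') & (house_test['MasVnrArea']>0) & (house_test['MasVnrType'].isna()),'MasVnrType'] = 'BrkFace'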

def na_processing(input_df,training):
    '''Function for processing remaining NaNs in training and test data sets. Values are either imputed, or set to 0 or None'''
    
    df = input_df.copy() # take copy of dataframe to avoid modifying the original other than with function call
    
    #Lot Frontage
    lot_frontage_fill = training.groupby('Neighborhood').agg({'LotFrontage':'mean'})
    df = df.set_index('Neighborhood')
    df['LotFrontage'] = df['LotFrontage'].fillna(lot_frontage_fill['LotFrontage']) # assign back rather than chained inplace fill
    df = df.reset_index()
    
    #Alley
    df['Alley'] = df['Alley'].fillna('None')
    
    #Masonry Veneer Area and Type
    df['MasVnrArea'] = df['MasVnrArea'].fillna(0)
    df['MasVnrType'] = df['MasVnrType'].fillna('None')
    
    #Basement Variables
    df.loc[(df['TotalBsmtSF']>0) & (df['BsmtQual'].isna()) ,'BsmtQual'] = 'TA'
    
    df.loc[(df['TotalBsmtSF']>0) & (df['BsmtCond'].isna()),'BsmtCond'] = 'TA'
    
    df.loc[df['TotalBsmtSF']==0,['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2']] = \
    df.loc[df['TotalBsmtSF']==0,['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2']].fillna('None')
    
    df.loc[(df['TotalBsmtSF']>0) & (df['BsmtExposure'].isna()),'BsmtExposure'] = 'No'
    
    df.loc[(df['BsmtFinType1'].notna()) & (df['BsmtFinType2'].isna()),'BsmtFinType2'] = 'Unf'
    
    df.loc[(df['TotalBsmtSF']==0) & (df['BsmtFinSF1'].isna()),['BsmtFinSF1','BsmtUnfSF']] = 0
    
    df['BsmtFinSF2'] = df['BsmtFinSF2'].fillna(0)
    
    df.loc[:,['BsmtFullBath','BsmtHalfBath']] = df.loc[:,['BsmtFullBath','BsmtHalfBath']].fillna(0)

    #Electrical
    df['Electrical'] = df['Electrical'].fillna(training['Electrical'].mode()[0])
    
    #Fireplace Quality
    df['FireplaceQu'] = df['FireplaceQu'].fillna('None')
    
    #Garage Variables
    garage_cols = [col for col in df.columns if 'Garage' in col]
    garage_cols.remove('GarageCars')
    garage_cols.remove('GarageArea')
    df.loc[:,garage_cols] = df.loc[:,garage_cols].fillna('None')
    
    #Pool Quality
    df.loc[(df['PoolArea']>0) & (df['PoolQC'].isna()),'PoolQC'] = 'Gd'
    df['PoolQC'] = df['PoolQC'].fillna('None')
    
    #Fence
    df['Fence'] = df['Fence'].fillna('None')
    
    #Misc Feature
    df['MiscFeature'] = df['MiscFeature'].fillna('None')
    
    #MS Zoning
    df['MSZoning'] = df['MSZoning'].fillna(training['MSZoning'].mode()[0])

    #Utilities
    df['Utilities'] = df['Utilities'].fillna(training['Utilities'].mode()[0])
    
    #Exterior Covering
    #Modal exterior per neighbourhood; pandas mode is used so that string values are handled
    exterior_fill = training.groupby('Neighborhood').agg({'Exterior1st': lambda x: x.mode().iloc[0],\
                                                        'Exterior2nd': lambda x: x.mode().iloc[0]})
    
    df = df.set_index('Neighborhood')
    df['Exterior1st'] = df['Exterior1st'].fillna(exterior_fill['Exterior1st'])
    df['Exterior2nd'] = df['Exterior2nd'].fillna(exterior_fill['Exterior2nd'])
    df = df.reset_index()
    
    #Kitchen Quality
    df.loc[(df['KitchenAbvGr']>0) & (df['KitchenQual'].isna()),'KitchenQual'] = 'TA'
    
    #Functionality
    df['Functional'] = df['Functional'].fillna('Typ')
    
    #Sale Type
    df['SaleType'] = df['SaleType'].fillna('WD')
       
    return df

test_processed = na_processing(house_test,house_train)
train_processed = na_processing(house_train,house_train)


display('Features containing NaNs in test set = {}'.format(test_processed.isna().any().sum()))
display('Features containing NaNs in training set = {}'.format(train_processed.isna().any().sum()))

'Features containing NaNs in test set = 0'

'Features containing NaNs in training set = 0'

In [5]:
test_processed.to_csv('test_processed.csv')
train_processed.to_csv('train_processed.csv')