### Kaggle House Price Prediction Competition

This dataset has 79 feature columns so a large part of the work will be in feature selection and engineering.

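The evaluation metric is RMSE (root mean squared error), which the competition computes on the logarithm of the predicted and observed sale prices.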
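As a rough sketch of the metric, the functions below compute RMSE and a log variant, matching the competition's scoring on the logarithm of the predicted and observed prices. The names `rmse` and `rmse_log` and the toy prices are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmse_log(y_true, y_pred):
    """RMSE computed on natural logs; assumes strictly positive prices."""
    return rmse(np.log(y_true), np.log(y_pred))

# Toy prices, made up for illustration only
actual = [200000, 150000, 310000]
predicted = [210000, 140000, 300000]
print(rmse(actual, predicted))
print(rmse_log(actual, predicted))
```

Working on the log scale means a 10% miss on a cheap house costs about as much as a 10% miss on an expensive one.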

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 100)
import numpy as np
import re
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns
import mlxtend.plotting as mlxp
import warnings
warnings.filterwarnings('ignore')

from scipy.stats import kstest, levene, boxcox, mode


In [2]:
#Import data and take a copy for experimenting during exploration

test = pd.read_csv('test.csv')
test_id = test['Id'] # save id column for indexing final submission
house_test = test.copy()
house_test.drop(['Id'],inplace=True,axis=1)

train = pd.read_csv('train.csv')
train_id = train['Id']
house_train = train.copy()
house_train.drop(['Id'],inplace=True,axis=1)

In [3]:
# display(house_train.info()) #the number of features make this unwieldy =>

#split feature columns by data type to inspect further
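# group the column names by dtype name (DataFrame.groupby(axis=1) is deprecated in recent pandas)
type_dict = {str(k): list(v.index) for k, v in house_train.dtypes.groupby(house_train.dtypes.astype(str))}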

# display(house_train.loc[:,type_dict['int64']].info())
# display(house_train.loc[:,type_dict['int64']].head())

# display(house_train.loc[:,type_dict['float64']].info())
# display(house_train.loc[:,type_dict['float64']].head())

# display(house_train.loc[:,type_dict['object']].info())
# display(house_train.loc[:,type_dict['object']].head())
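The split by data type can also be sketched without any groupby call, which sidesteps differences between pandas versions. The frame `toy` below is a made-up stand-in for `house_train`; only the dictionary-building loop is the point.

```python
import pandas as pd

# Toy frame standing in for house_train (values are made up)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, 80.0, None],
    "MSZoning": ["RL", "RL", "RM"],
})

# Map each dtype name to the list of columns with that dtype
type_dict = {}
for col, dtype in toy.dtypes.items():
    type_dict.setdefault(str(dtype), []).append(col)

print(type_dict)
```

Each list can then be passed to `.loc[:, ...]` to inspect one group of columns at a time.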

In [4]:
#Inspect numerical columns to get a feel for the shape of the data
# house_train.describe().T

In [5]:
# house_test.describe().T
# house_test.info()

In [21]:
display('Features containing NaNs in test set = {}'.format(house_test.isna().any().sum()))
display('Features containing NaNs in training set = {}'.format(house_train.isna().any().sum()))

na_col_train = house_train.isna().any()
train_na = house_train.loc[:,na_col_train].isna().sum()

na_col_test = house_test.isna().any()
test_na = house_test.loc[:,na_col_test].isna().sum()

pd.DataFrame([train_na,test_na],index=['train','test']).T

'Features containing NaNs in test set = 33'

'Features containing NaNs in training set = 19'

Feature,train,test
LotFrontage,259.0,227.0
Alley,1369.0,1352.0
MasVnrType,8.0,16.0
MasVnrArea,8.0,15.0
BsmtQual,37.0,44.0
BsmtCond,37.0,45.0
BsmtExposure,38.0,44.0
BsmtFinType1,37.0,42.0
BsmtFinType2,38.0,42.0
Electrical,1.0,


In [7]:
house_test[(house_test['BsmtQual'].isna()) & (house_test['BsmtCond'].notna())]

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
757,70,C (all),60.0,5280,Pave,,Reg,Lvl,AllPub,Corner,Gtl,IDOTRR,Feedr,Norm,1Fam,2Story,4,7,1895,1950,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,TA,TA,Stone,,Fa,No,Unf,0.0,Unf,0.0,173.0,173.0,GasA,Ex,N,SBrkr,825,536,0,1361,0.0,0.0,1,0,2,1,TA,6,Typ,0,,Detchd,1895.0,Unf,1.0,185.0,Fa,TA,Y,0,123,0,0,0,0,,,,0,7,2008,WD,Normal
758,50,C (all),52.0,5150,Pave,,Reg,Lvl,AllPub,Corner,Gtl,IDOTRR,Feedr,Norm,1Fam,1.5Fin,4,7,1910,2000,Gable,CompShg,Plywood,Plywood,,0.0,TA,TA,PConc,,TA,No,Unf,0.0,Unf,0.0,356.0,356.0,GasA,TA,N,FuseA,671,378,0,1049,0.0,0.0,1,0,2,1,TA,5,Typ,0,,Detchd,1910.0,Unf,1.0,195.0,Po,Fa,N,0,0,0,0,0,0,,,,0,5,2008,WD,Normal


19 features in the training set contain NaNs and 33 features in the test set contain NaNs.

Looking at these in more detail:

LotFrontage - all dwellings are houses so should have some frontage. I will impute this with the mean frontage for the OverallQual of the house

Alley - defined in data description as meaning there is no alley access - change to "None"

MasVnrType + MasVnrArea - there are multiple instances where the Area is 0, in which case the NaN type should be set to "None". There is one instance in the test set (record 1150) where an Area is given but no type. Inspecting this record, the house is made of plywood, and the most common veneer type for plywood houses that have one is "BrkFace".

Electrical - the house missing this info (training id 1379) has air conditioning, which implies it must have an electrical system. I will impute this as the modal electrical type

MSZoning - all houses must be zoned, impute as modal zone type

Utilities - impute modal type

Exterior1st - all houses have some kind of finish => impute modally based on MSSubClass

KitchenQual - impute based on MSSubClass

Functional - the descriptor says to assume typical unless information to the contrary so I will impute as typical

SaleType - impute modally

GarageCars and GarageArea - these both belong to test record 1116, which is listed as having a detached garage. If this were a training observation I might have dropped it, but as it is a test record I cannot. One option is to assume that the garage was entered incorrectly. However, the majority of houses in that zone type (RM) do have garages, so I will impute using the modal values for houses in the same zone with a detached garage.

The test set also has a GarageYrBlt listed as 2207. This is clearly a mistake and needs to be corrected.

All other NaNs look likely to be intentionally null values.

For imputing values I will calculate the fill values from the training set, even when they are imputed into the test set. This keeps information from the test set out of the preprocessing, which is good practice.
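A minimal sketch of that approach is shown below: the fill values are computed once from the training frame and then applied to both frames. The frames `toy_train` and `toy_test` are made-up stand-ins for `house_train` and `house_test`, and only two of the columns discussed above are covered.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for house_train / house_test (values are made up)
toy_train = pd.DataFrame({
    "OverallQual": [5, 5, 7, 7, 7],
    "LotFrontage": [60.0, 70.0, 80.0, np.nan, 100.0],
    "Electrical": ["SBrkr", "SBrkr", "FuseA", np.nan, "SBrkr"],
})
toy_test = pd.DataFrame({
    "OverallQual": [5, 7],
    "LotFrontage": [np.nan, np.nan],
    "Electrical": [np.nan, "FuseA"],
})

# Statistics come from the training frame only
frontage_by_qual = toy_train.groupby("OverallQual")["LotFrontage"].mean()
electrical_mode = toy_train["Electrical"].mode()[0]

for frame in (toy_train, toy_test):
    # Look up the group mean for each row, then fill only the gaps
    frame["LotFrontage"] = frame["LotFrontage"].fillna(
        frame["OverallQual"].map(frontage_by_qual)
    )
    frame["Electrical"] = frame["Electrical"].fillna(electrical_mode)

print(toy_test)
```

One caveat of this sketch: a quality level that appears in the test frame but never in the training frame has no group mean, so those rows would stay NaN and need a fallback such as the overall training mean.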

In [8]:
#Deal with Garage NaNs in test record 1116 of the test set
# garage_cols = [col for col in test.columns if 'Garage' in col]
# garage_cols.remove('GarageType')
# garage_groups = house_train.groupby(['MSZoning','GarageType'])[garage_cols].agg(lambda x: mode(x)[0])
# garage_groups_1116 = garage_groups.loc['RM','Detchd']
    
# def fill_row_na(df,row,fill_group):
#     '''function to fill in missing values for a particular dataframe row using a groupby object created outside the function'''
#     for ind, item in fill_group.items():  # Series.iteritems() was removed in pandas 2.0
#         df.loc[row,ind] = item
#     return df

# display(house_test.loc[1116,garage_cols])
# house_test = fill_row_na(house_test,1116,garage_groups_1116)
# display(house_test.loc[1116,garage_cols])
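The row-level fill above can be sketched with pandas alone, which avoids relying on `scipy.stats.mode` for non-numeric columns. This is an illustration under assumptions: the toy frames are made up, `first_mode` is a hypothetical helper, and unlike the commented cell this version of `fill_row_na` only overwrites cells that are actually missing.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for house_train / house_test (values are made up)
toy_train = pd.DataFrame({
    "MSZoning": ["RM", "RM", "RM", "RL"],
    "GarageType": ["Detchd", "Detchd", "Detchd", "Attchd"],
    "GarageCars": [1.0, 1.0, 2.0, 2.0],
    "GarageArea": [240.0, 240.0, 400.0, 480.0],
})
toy_test = pd.DataFrame({
    "MSZoning": ["RM"],
    "GarageType": ["Detchd"],
    "GarageCars": [np.nan],
    "GarageArea": [np.nan],
})

garage_cols = ["GarageCars", "GarageArea"]

def first_mode(s):
    """Most frequent non-null value, or NaN when the group has none."""
    m = s.mode()
    return m.iloc[0] if not m.empty else np.nan

# Modal garage values per (zone, garage type), from the training frame
garage_modes = toy_train.groupby(["MSZoning", "GarageType"])[garage_cols].agg(first_mode)

def fill_row_na(df, row, fill_values):
    """Fill only the missing cells of one row from a Series of fill values."""
    for col, value in fill_values.items():
        if pd.isna(df.loc[row, col]):
            df.loc[row, col] = value
    return df

toy_test = fill_row_na(toy_test, 0, garage_modes.loc[("RM", "Detchd")])
print(toy_test)
```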

In [9]:
#Investigating extra MasVnr NaN in test set

# display(house_test[house_test['MasVnrType'].isna()])
# display(house_train[house_train['Exterior1st']=='Plywood']['MasVnrType'].value_counts())
# house_test[(house_test['MasVnrType'].isna()) & (house_test['MasVnrArea'].notna())]

In [None]:
#Correct GarageYrBlt = 2207 in test set
# display(house_test.query('GarageYrBlt == 2207'))

# #The house was built in 2006 and has a RemodAdd date of 2007. It seems reasonable to infer that this is the garage being built
# house_test.loc[1132,'GarageYrBlt'] = 2007
# display(house_test.loc[1132,'GarageYrBlt'])
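Rather than hard-coding the row, the same kind of error can be found with a simple consistency check: a garage cannot have been built after the house was sold. The sketch below uses a made-up frame, and reusing `YearRemodAdd` as the replacement follows the reasoning in the cell above rather than any general rule.

```python
import pandas as pd

# Toy stand-in for house_test (values are made up)
toy_test = pd.DataFrame({
    "YearBuilt": [2006, 1995],
    "YearRemodAdd": [2007, 1995],
    "GarageYrBlt": [2207.0, 1995.0],
    "YrSold": [2007, 2009],
})

# Flag garages dated later than the sale year
bad_year = toy_test["GarageYrBlt"] > toy_test["YrSold"]
print(toy_test[bad_year])

# Replace the flagged values with the remodel year of the same row
toy_test.loc[bad_year, "GarageYrBlt"] = toy_test.loc[bad_year, "YearRemodAdd"].astype(float)
print(toy_test["GarageYrBlt"])
```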

In [25]:
#Investigate missing Utilities observations
# house_train['Utilities'].value_counts()
#Almost all houses have 'AllPub' utilities => fillna with this

AllPub    1459
NoSeWa       1
Name: Utilities, dtype: int64

In [11]:
#Process NaNs in training and test data frames

# fill values are calculated from the training set and applied column by column
# lot_frontage_fill = house_train.groupby('OverallQual')['LotFrontage'].mean()
# house_train['LotFrontage'] = house_train['LotFrontage'].fillna(house_train['OverallQual'].map(lot_frontage_fill))
# house_test['LotFrontage'] = house_test['LotFrontage'].fillna(house_test['OverallQual'].map(lot_frontage_fill))

# house_train['Electrical'] = house_train['Electrical'].fillna(house_train['Electrical'].mode()[0])

# house_test['MSZoning'] = house_test['MSZoning'].fillna(house_train['MSZoning'].mode()[0])

# house_test['Utilities'] = house_test['Utilities'].fillna(house_train['Utilities'].mode()[0])

In [12]:
# house_test.loc[house_test['GarageCars'].isna()]
# house_test.groupby('GarageType').agg({'MSZoning':'count'})

In [13]:
#Inspect numerical columns for correlation to sale price and shape of data
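# numeric_only=True keeps object columns out of the calculation (required in pandas 2.x)
# correlation = house_train.corr(numeric_only=True)['SalePrice']
# kurt = house_train.kurtosis(numeric_only=True)
# skew = house_train.skew(numeric_only=True)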
# cols = ['Price_Correlation','Kurtosis','Skewness']

# house_numerical = pd.concat([correlation,kurt,skew],axis=1)
# house_numerical.columns = cols
# display(house_numerical.sort_values(['Price_Correlation'],ascending=False))

In [14]:
# fig, ax = plt.subplots(1,3,figsize=(20,8))
# house_train.hist(column='SalePrice',bins=20,ax=ax[0])

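# standardise first: kstest(..., 'norm') compares against the standard normal (mean 0, sd 1)
# display(kstest((house_train['SalePrice'] - house_train['SalePrice'].mean()) / house_train['SalePrice'].std(),'norm'))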
# display('Sale Price Skew = {:.2f}'.format(house_train['SalePrice'].skew()))
# display('Sale Price Kurtosis = {:.2f}'.format(house_train['SalePrice'].kurtosis()))

# sales_price_log = np.log(house_train['SalePrice'])
# ax[1].hist(sales_price_log,bins=20,color='red')
# ax[1].set_title('Log_SalePrice')
# display('Log Skew = {:.2f}'.format(sales_price_log.skew()))
# display('Log Kurtosis = {:.2f}'.format(sales_price_log.kurtosis()))

# sns.boxplot(y='SalePrice',data=house_train,ax=ax[2])


The Sale Price target is right-tail skewed in its base form rather than normally distributed. Taking the log of Sale Price brings it much closer to a normal distribution, so it may help the model to predict log Sale Price and then take the exponential to create the final predictions.
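The effect of the log transform, and the way back to the original scale, can be sketched on synthetic data. The prices below are drawn from a log-normal distribution purely for illustration and are not the competition data; `log_predictions` is a placeholder for real model output.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" (log-normal), not the competition data
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=2000)))

log_prices = np.log(prices)
print("raw skew = {:.2f}".format(prices.skew()))
print("log skew = {:.2f}".format(log_prices.skew()))

# A model trained on the log target predicts log prices;
# np.exp maps those predictions back to the original scale.
log_predictions = log_prices.to_numpy()  # placeholder for real model output
recovered = np.exp(log_predictions)
```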

There are two clear outliers which should probably be removed from the training set before modelling.

In [15]:
# house_train['Log_SalePrice'] = np.log(house_train['SalePrice'])
# correlation_two = house_train.corr(numeric_only=True)['Log_SalePrice']
# log_house_numerical = house_numerical.join(correlation_two)
# log_house_numerical.rename(columns={'Log_SalePrice':'Log_Price_Corr'},inplace=True)
# log_house_numerical = log_house_numerical[['Price_Correlation','Log_Price_Corr','Kurtosis','Skewness']].sort_values('Log_Price_Corr',ascending=False)
# display(log_house_numerical.head(10))

Taking the log of the Sale Price improves the correlation coefficient of most numerical variables, including nine of the top 10, without changing their order. This implies that using the log of the Sale Price may improve model accuracy, particularly in simpler models. I will continue to do base EDA using the Sale Price, as this is the real-world value, but may use its log in model building.
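The side-by-side comparison can be sketched compactly with `DataFrame.corrwith`. The data here is synthetic and built so that log price is linear in the features, so it only illustrates the mechanics of the comparison and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
# Synthetic data: log price is linear in the features, plus noise
toy = pd.DataFrame({
    "OverallQual": rng.integers(1, 11, size=n),
    "GrLivArea": rng.normal(1500, 400, size=n),
})
log_price = (10.5 + 0.15 * toy["OverallQual"]
             + 0.0004 * toy["GrLivArea"]
             + rng.normal(0, 0.1, size=n))
toy["SalePrice"] = np.exp(log_price)
toy["Log_SalePrice"] = log_price

# Correlation of each feature with the raw and the log target
features = toy[["OverallQual", "GrLivArea"]]
comparison = pd.DataFrame({
    "Price_Correlation": features.corrwith(toy["SalePrice"]),
    "Log_Price_Corr": features.corrwith(toy["Log_SalePrice"]),
}).sort_values("Log_Price_Corr", ascending=False)
print(comparison)
```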

In [16]:
#Delete some extraneous variables created so far
# del[house_numerical,correlation,correlation_two]

In [17]:
#Examine Overall Quality and Condition
# fig, ax = plt.subplots(1,3,figsize=(20,8))
# sns.boxplot(x='OverallQual',y='SalePrice',data=house_train,ax=ax[0])
# sns.boxplot(x='OverallCond',y='SalePrice',data=house_train,ax=ax[1])
# sns.scatterplot(x='OverallQual',y='OverallCond',data=house_train,ax=ax[2])

# QualCon = log_house_numerical.loc[['OverallQual','OverallCond']]
# display(QualCon)

# Qual_var = house_train.groupby('OverallQual').agg({'SalePrice':'var'}).rename(columns={'SalePrice':'var_SalePrice'})
# display(Qual_var.T)

# display('Levene test of OverallQual and Sale Price = {}'.format(levene(house_train['OverallQual'],house_train['SalePrice'])))
# display('Levene test of OverallQual and Log Sale Price = {}'.format(levene(house_train['OverallQual'],house_train['Log_SalePrice'])))
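Levene's test is usually applied to one variable split into groups, asking whether its variance is equal across them. A sketch under that reading, testing the spread of the price within each OverallQual level, is given below. The data is synthetic, `levene_by_group` is a hypothetical helper, and this is one possible alternative to the check above rather than the notebook's own method.

```python
import numpy as np
import pandas as pd
from scipy.stats import levene

rng = np.random.default_rng(2)
# Synthetic prices whose spread grows with quality (made-up numbers)
quals = np.repeat([4, 6, 8], 200)
prices = np.exp(11.0 + 0.2 * quals + rng.normal(0, 0.3, size=quals.size))
toy = pd.DataFrame({"OverallQual": quals, "SalePrice": prices})
toy["Log_SalePrice"] = np.log(toy["SalePrice"])

def levene_by_group(df, value_col, group_col):
    """Levene test of equal variance of value_col across levels of group_col."""
    samples = [g[value_col].to_numpy() for _, g in df.groupby(group_col)]
    return levene(*samples)

raw = levene_by_group(toy, "SalePrice", "OverallQual")
logged = levene_by_group(toy, "Log_SalePrice", "OverallQual")
print(raw.pvalue, logged.pvalue)
```

A small p-value suggests the variance differs between quality levels. In this synthetic setup the raw prices are built to be heteroscedastic and the log prices are not.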

In [18]:
# house_train['Box_OverallQual'] = boxcox(house_train['OverallQual'])[0]
# display(house_train[['Box_OverallQual','OverallQual','Log_SalePrice']].corr()['Log_SalePrice'])
# display('Levene test on Box-Cox transformed OverallQual = {}'.format(levene(house_train['Box_OverallQual'], \
#                             house_train['Log_SalePrice'])))