### Kaggle House Price Prediction Competition

This dataset has 79 feature columns so a large part of the work will be in feature selection and engineering.

The evaluation metric is rmse

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 100)
import numpy as np
import re
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import seaborn as sns
import mlxtend.plotting as mlxp
import warnings
warnings.filterwarnings('ignore')

from scipy.stats import kstest, levene, boxcox, mode


In [2]:
#Import data and take a copy for experimenting during exploration

test = pd.read_csv('test.csv')
test_id = test['Id'] # save id column for indexing final submission
house_test = test.copy()
house_test.drop(['Id'],inplace=True,axis=1)

train = pd.read_csv('train.csv')
train_id = train['Id']
house_train = train.copy()
house_train.drop(['Id'],inplace=True,axis=1)

In [3]:
# display(house_train.info()) #the number of features make this unwieldy =>

#split feature columns by data type to inspect further
type_dict = {str(k): list(v) for k, v in house_train.groupby(house_train.dtypes, axis=1)}

# display(house_train.loc[:,type_dict['int64']].info())
# display(house_train.loc[:,type_dict['int64']].head())

# display(house_train.loc[:,type_dict['float64']].info())
# display(house_train.loc[:,type_dict['float64']].head())

# display(house_train.loc[:,type_dict['object']].info())
# display(house_train.loc[:,type_dict['object']].head())

In [4]:
#Inspect numerical columns to get a feel for the shape of the data
# house_train.describe().T

In [5]:
# house_test.describe().T
# house_test.info()

In [6]:
display('Features containing NaNs in test set = {}'.format(house_test.isna().any().sum()))
display('Features containing NaNs in training set = {}'.format(house_train.isna().any().sum()))

na_col_train = house_train.isna().any()
train_na = house_train.loc[:,na_col_train].isna().sum()

na_col_test = house_test.isna().any()
test_na = house_test.loc[:,na_col_test].isna().sum()

pd.DataFrame([train_na,test_na],index=['train','test']).T

'Features containing NaNs in test set = 33'

'Features containing NaNs in training set = 19'

Unnamed: 0,train,test
LotFrontage,259.0,227.0
Alley,1369.0,1352.0
MasVnrType,8.0,16.0
MasVnrArea,8.0,15.0
BsmtQual,37.0,44.0
BsmtCond,37.0,45.0
BsmtExposure,38.0,44.0
BsmtFinType1,37.0,42.0
BsmtFinType2,38.0,42.0
Electrical,1.0,


18 features in the training set contain NaNs, 33 features in the test set contain NaNs

Looking at these is more detail:

LotFrontage - all dwellings are houses so should have some frontage. I will impute this with the mean frontage for the Neighbourhood of the house

Alley - defined in data description as meaning there is no alley access - change to "None"

MasVrnType + MasVrnArea have multiple instances where the Area is 0 in which case that NaN should be set to None. There is one instance in the test set where an Area is given but no type. This is for record 1150. Inspecting this record the house is made of plywood and the most common Veneer type for Plywood houses when they have one is "BrkFace"

Electrical - The house missing this info (training id 1379) has air con which implies it must have electrical. I will impute this as the modal electrical type

MSZoning - all hosues must be zoned, impute as modal zone type

Utilities - impute modal type

Exterior1st - all houses have some kind of fininsh => impute modally based on MSSubClass

KitchenQual - impute based on MSSubClass

Functional - the descriptor says to assume typical unless information to the contrary so I will impute as typical

SaleType - impute modally

GarageCars and GarageArea - these both belong to test record 1116 which is listed as having a detached garage. If this was a training observation I may have dropped it but as it is a test record I cannot. One option is to assume that the garage being present was entered incorrectly, however the majority of houses in that zone type (RM) do have garages so I will impute using the modal values for houses in the same zone with a detached garage.

The test set also has a GarageYrBlt listed as 2207. This is clearly a mistake and needs to be corrected.

All other NaNs looks likely to be intentionally null values

For imputing values I will use the training set to calculate values even if they are to be imputed into the test set as the training is normally larger making this good practice.

In [7]:
#Investigate missing Utilities observations
# house_train['Utilities'].value_counts()
#Almost all houses have 'AllPub' utilities => fillna with this

In [8]:
#Investigate missing BsmtQual values
# display(house_test[(house_test['BsmtQual'].isna()) & (house_test['TotalBsmtSF']>0)])
# house_train.loc[house_train['Neighborhood']=='IDOTRR'].groupby(['MSZoning','BsmtQual'])\
# .agg({'BsmtQual':'count','MSZoning':'count'})

#There are two values in the test set where there is a non-zero TotalBsmtSF area recorded but NA BsmtQual. Both houses
#come from the same Neighbourhood and Zoning. Looking at the training set houses with these characteristics have TA BsmtQual
#so I will impute as this.
#All Bsmt qualitative values where BsmtArea = 0 can be set to None

# house_train.loc[house_train['TotalBsmtSF']==0,['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2']] = \
# house_train.loc[house_train['TotalBsmtSF']==0,['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2']].fillna('hello')
# display(house_train.loc[house_train['TotalBsmtSF']==0,['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2']])

In [56]:
#Investigating extra MasVnr NaN in test set

# display(house_train.loc[house_train['MasVnrType'].isna(),['MasVnrArea','MasVnrType']])

#There is a single entry in the test data set which has a veneer area but not a veneer type. This can be filled using the most
#common type for the Neighbourhood, all other missing values can be set to NaN

ind = house_test.index[(house_test['MasVnrArea']>0) & house_test['MasVnrType'].isna()]
neighborhood = house_test.loc[ind,'Neighborhood'].values
area_veneer = house_test.loc[house_train['Neighborhood']=='Mitchel']
display(neighborhood,area_veneer)

array(['Mitchel'], dtype=object)

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
5,60,RL,75.0,10000,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Gilbert,Norm,Norm,1Fam,2Story,6,5,1993,1994,Gable,CompShg,HdBoard,HdBoard,,0.0,TA,TA,PConc,Gd,TA,No,Unf,0.0,Unf,0.0,763.0,763.0,GasA,Gd,Y,SBrkr,763,892,0,1655,0.0,0.0,2,1,3,1,TA,7,Typ,1,TA,Attchd,1993.0,Fin,2.0,440.0,TA,TA,Y,157,84,0,0,0,0,,,,0,4,2010,WD,Normal
46,60,RL,80.0,10791,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NWAmes,Norm,Norm,1Fam,2Story,6,5,1993,1993,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,PConc,Gd,TA,Mn,GLQ,1137.0,Unf,0.0,143.0,1280.0,GasA,Ex,Y,SBrkr,1280,1215,0,2495,1.0,0.0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,1993.0,Unf,2.0,660.0,TA,TA,Y,224,32,0,0,0,0,,,,0,3,2010,WD,Normal
71,30,RM,56.0,4485,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Artery,Norm,1Fam,1Story,5,7,1920,1950,Gable,CompShg,Wd Sdng,Wd Sdng,,0.0,TA,TA,PConc,TA,TA,No,BLQ,579.0,Unf,0.0,357.0,936.0,GasA,TA,Y,SBrkr,936,0,0,936,1.0,0.0,1,0,2,1,TA,5,Typ,1,Gd,,,,0.0,0.0,,,P,51,0,135,0,0,0,,MnPrv,,0,5,2010,WD,Normal
81,50,RM,53.0,5830,Pave,,Reg,Lvl,AllPub,Corner,Gtl,BrkSide,Feedr,Feedr,1Fam,1.5Fin,5,6,1950,1997,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,Gd,CBlock,TA,TA,No,Rec,788.0,Unf,0.0,200.0,988.0,GasA,Ex,Y,SBrkr,1030,582,0,1612,0.0,0.0,1,1,3,1,TA,7,Typ,0,,Detchd,1950.0,Unf,1.0,363.0,TA,TA,Y,0,0,0,0,0,0,,MnPrv,,0,3,2010,WD,Normal
137,80,RL,85.0,19645,Pave,,IR1,Lvl,AllPub,FR2,Gtl,Crawfor,Norm,Norm,1Fam,SLvl,7,6,1994,2007,Gable,CompShg,VinylSd,VinylSd,BrkFace,44.0,TA,TA,PConc,Gd,TA,No,GLQ,343.0,Unf,0.0,80.0,423.0,GasA,Ex,Y,SBrkr,896,756,0,1652,1.0,0.0,2,1,3,1,Gd,6,Typ,0,,BuiltIn,1994.0,RFn,2.0,473.0,TA,TA,Y,0,0,0,0,0,0,,,,0,6,2010,WD,Normal
186,20,RL,85.0,11050,Pave,,IR1,Lvl,AllPub,Corner,Gtl,NWAmes,Norm,Norm,1Fam,1Story,7,5,1975,1975,Gable,CompShg,Plywood,Plywood,,0.0,TA,TA,CBlock,TA,TA,No,ALQ,636.0,Unf,0.0,540.0,1176.0,GasA,Fa,Y,SBrkr,1193,0,0,1193,0.0,0.0,2,0,3,1,TA,5,Typ,1,TA,Attchd,1975.0,Unf,2.0,506.0,TA,TA,Y,40,0,0,0,0,0,,,,0,8,2009,WD,Normal
201,60,RL,95.0,12350,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NridgHt,Norm,Norm,1Fam,2Story,9,5,2009,2009,Gable,CompShg,VinylSd,VinylSd,,0.0,Gd,TA,PConc,Ex,TA,No,GLQ,986.0,Unf,0.0,379.0,1365.0,GasA,Ex,Y,SBrkr,1365,1325,0,2690,1.0,0.0,2,1,3,1,Ex,8,Typ,1,Gd,Attchd,2009.0,RFn,3.0,864.0,TA,TA,Y,0,197,0,0,0,0,,,,0,7,2009,New,Partial
274,20,RL,,7791,Pave,,IR1,Lvl,AllPub,Inside,Gtl,Sawyer,RRAe,Norm,1Fam,1Story,5,8,1963,1995,Gable,CompShg,Plywood,Plywood,,0.0,Gd,Gd,CBlock,TA,TA,No,ALQ,624.0,Unf,0.0,288.0,912.0,GasA,Ex,Y,SBrkr,912,0,0,912,1.0,0.0,1,0,3,1,Gd,6,Typ,0,,Attchd,1963.0,RFn,1.0,300.0,TA,TA,Y,0,0,0,0,0,0,,GdWo,,0,10,2009,WD,Normal
276,20,RL,,15676,Pave,,IR1,Low,AllPub,Inside,Gtl,Veenker,Norm,Norm,1Fam,1Story,8,8,1980,1980,Gable,CompShg,VinylSd,VinylSd,BrkFace,115.0,Gd,Gd,CBlock,Gd,Gd,Gd,ALQ,1733.0,Rec,92.0,189.0,2014.0,GasA,Gd,Y,SBrkr,2014,0,0,2014,1.0,0.0,2,0,2,1,Gd,6,Maj1,2,Gd,Attchd,1980.0,RFn,3.0,864.0,TA,TA,Y,462,0,0,255,0,0,,MnPrv,,0,4,2009,WD,Normal
295,20,RL,60.0,7436,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,4,7,1960,1960,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,CBlock,TA,TA,No,ALQ,734.0,Unf,0.0,160.0,894.0,GasA,Gd,Y,SBrkr,894,0,0,894,1.0,0.0,1,0,2,1,TA,5,Typ,1,Po,Detchd,1988.0,Unf,2.0,396.0,TA,TA,Y,0,0,0,360,0,0,,GdWo,,0,8,2009,WD,Normal


In [21]:
#Process NaNs in training and test data frames

#Deal with Garage NaNs in test record 1116 of the test set
garage_cols = [col for col in test.columns if 'Garage' in col]
garage_cols.remove('GarageType')
garage_groups = house_train.groupby(['MSZoning','GarageType'])[garage_cols].agg(lambda x: mode(x)[0])
garage_groups_1116 = garage_groups.loc['RM','Detchd']
    
def fill_row_na(df,row,fill_group):
    '''function to fill in missing values for a particular dataframe row using a groupby object created outside the function'''
    for ind, item in fill_group.iteritems():
        df.loc[row,ind] = item
    return df

house_test = fill_row_na(house_test,1116,garage_groups_1116)

#Correct GarageYrBlt = 2207 in test set
house_test.loc[1132,'GarageYrBlt'] = 2007

#Test set record 660 creates a specific problem as it records a NaN for TotalBsmtSF. Setting this to 0 will allow
#the na_processing function below to handle the other NaNs
house_test.loc[house_test['TotalBsmtSF'].isna(),'TotalBsmtSF'] = 0



df_tst = house_test.copy()
df_trn = house_train.copy()
df_train = house_train.copy()

na_col_trn = df_trn.isna().any()
trn_na = df_trn.loc[:,na_col_trn].isna().sum()

na_col_tst = df_tst.isna().any()
tst_na = df_tst.loc[:,na_col_tst].isna().sum()

display(pd.DataFrame([trn_na,tst_na],index=['train','test']).T)

def na_processing(df,df_train):
    lot_frontage_fill = df_train.groupby('Neighborhood').agg({'LotFrontage':'mean'})
    df = df.set_index('Neighborhood')
    df['LotFrontage'].fillna(lot_frontage_fill['LotFrontage'],inplace=True)
    df = df.reset_index()

    df['Electrical'].fillna(df_train['Electrical'].mode()[0],inplace=True)

    df['MSZoning'].fillna(df_train['MSZoning'].mode()[0],inplace=True)

    df['Utilities'].fillna(df_trn['Utilities'].mode()[0],inplace=True)

    df.loc[(df['BsmtQual'].isna()) & (df['TotalBsmtSF']>0),'BsmtQual'] = 'TA'
    
    df.loc[df['TotalBsmtSF']==0,['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2']] = \
    df.loc[df['TotalBsmtSF']==0,['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2']].fillna('None')

    df['Alley'].fillna('None',inplace=True)
    
    df['FireplaceQu'].fillna('None',inplace=True)
    
    garage_cols = [col for col in test.columns if 'Garage' in col]
    df.loc[:,garage_cols] = df.loc[:,garage_cols].fillna('None')
    
    df['Fence'].fillna('None',inplace=True)
    
    df['MiscFeature'].fillna('None',inplace=True)
    
    mas_veneer_fill = df_train.groupby('Neighborhood')['MasVnrType'].agg(lambda x: mode(x)[0])
    df = df.set_index('Neighborhood')
#     df['MasVnrType'].fillna(mas_veneer_fill,inplace=True)
#     df = df.reset_index()
    
#     df['MasVnrArea'].fillna(0,inplace=True)
#     df['MasVnrType'].fillna('None',inplace=True)
    
    return df

df_tst = na_processing(df_tst,df_train)
df_trn = na_processing(df_trn,df_train)


na_col_trn = df_trn.isna().any()
trn_na = df_trn.loc[:,na_col_trn].isna().sum()

na_col_tst = df_tst.isna().any()
tst_na = df_tst.loc[:,na_col_tst].isna().sum()

display(pd.DataFrame([trn_na,tst_na],index=['train','test']).T)


Unnamed: 0,train,test
LotFrontage,259.0,227.0
Alley,1369.0,1352.0
MasVnrType,8.0,16.0
MasVnrArea,8.0,15.0
BsmtQual,37.0,44.0
BsmtCond,37.0,45.0
BsmtExposure,38.0,44.0
BsmtFinType1,37.0,42.0
BsmtFinType2,38.0,42.0
Electrical,1.0,


Unnamed: 0,train,test
MasVnrType,8.0,16.0
MasVnrArea,8.0,15.0
BsmtExposure,1.0,2.0
BsmtFinType2,1.0,
PoolQC,1453.0,1456.0
Exterior1st,,1.0
Exterior2nd,,1.0
BsmtCond,,3.0
BsmtFinSF1,,1.0
BsmtFinSF2,,1.0


In [None]:
#Inspect numerical columns for correlation to sale price and shape of data
# correlation = house_train.corr()['SalePrice']
# kurt = house_train.kurtosis()
# skew = house_train.skew()
# cols = ['Price_Correlation','Kurtosis','Skewness']

# house_numerical = pd.concat([correlation,kurt,skew],axis=1)
# house_numerical.columns = cols
# display(house_numerical.sort_values(['Price_Correlation'],ascending=False))

In [None]:
# fig, ax = plt.subplots(1,3,figsize=(20,8))
# house_train.hist(column='SalePrice',bins=20,ax=ax[0])

# display(kstest(house_train['SalePrice'],'norm'))
# display('Sale Price Skew = {:.2f}'.format(house_train['SalePrice'].skew()))
# display('Sale Price Kurtosis = {:.2f}'.format(house_train['SalePrice'].kurtosis()))

# sales_price_log = np.log(house_train['SalePrice'])
# ax[1].hist(sales_price_log,bins=20,color='red')
# ax[1].set_title('Log_SalePrice')
# display('Log Skew = {:.2f}'.format(sales_price_log.skew()))
# display('Log Kurtosis = {:.2f}'.format(sales_price_log.kurtosis()))

# sns.boxplot(y='SalePrice',data=house_train,ax=ax[2])


The Sale Price target is normally distributed though in its base form is right-tail skewed. Taking the log of Sale Price corrects this so it may help the model to predict log Sale Price and then take the exponential to create the final predictions.

There are two clear outliers which should probably be removed from the training set before modelling.

In [None]:
# house_train['Log_SalePrice'] = np.log(house_train['SalePrice'])
# correlation_two = house_train.corr()['Log_SalePrice']
# log_house_numerical = house_numerical.join(correlation_two)
# log_house_numerical.rename(columns={'Log_SalePrice':'Log_Price_Corr'},inplace=True)
# log_house_numerical = log_house_numerical[['Price_Correlation','Log_Price_Corr','Kurtosis','Skewness']].sort_values('Log_Price_Corr',ascending=False)
# display(log_house_numerical.head(10))

Taking the log of the Sale Price improves the correlation factor of most nuerical variables, including nine of the top 10, without changing their order. This implies that using the log of the Sale Price may improve model accuracy, particularly in simpler models. I will continue to do base EDA using the Sale Price as this is the real-world value but may use its log in model buidling.

In [None]:
#Delete some extraneous variables created so far
# del[house_numerical,correlation,correlation_two]

In [None]:
#Examine Overall Quality and Condition
# fig, ax = plt.subplots(1,3,figsize=(20,8))
# sns.boxplot(x='OverallQual',y='SalePrice',data=house_train,ax=ax[0])
# sns.boxplot(x='OverallCond',y='SalePrice',data=house_train,ax=ax[1])
# sns.scatterplot(x='OverallQual',y='OverallCond',data=house_train,ax=ax[2])

# QualCon = log_house_numerical.loc[['OverallQual','OverallCond']]
# display(QualCon)

# Qual_var = house_train.groupby('OverallQual').agg({'SalePrice':'var'}).rename(columns={'SalePrice':'var_SalePrice'})
# display(Qual_var.T)

# display('Levene test of OverallQual and Sale Price = {}'.format(levene(house_train['OverallQual'],house_train['SalePrice'])))
# display('Levene test of OverallQual and Log Sale Price = {}'.format(levene(house_train['OverallQual'],house_train['Log_SalePrice'])))

In [None]:
# house_train['Box_OverallQual'] = boxcox(house_train['OverallQual'])[0]
# display(house_train[['Box_OverallQual','OverallQual','Log_SalePrice']].corr()['Log_SalePrice'])
# display('Levene test on Box-Cox transformer OverallQual = {}'.format(levene(house_train['Box_OverallQual'], \
#                             house_train['Log_SalePrice'])))