# Preprocessing the house prices dataset
In the following notebooks we are going to preprocess the data: handle missing values, transform the variables and treat outliers. We will also build a specialized pipeline for those transformations.

In this notebook specifically, missing values will be treated.

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
%matplotlib inline
pd.plotting.register_matplotlib_converters()
plt.rc('figure', figsize=(16, 6))

In [2]:
orig_data = pd.read_csv("data/train.csv", index_col="Id")

In [3]:
# copying the dataset for analysis
house_data = orig_data.copy()
house_data.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000


In [4]:
# checking missing values in each column
missing_val_count_by_column = house_data.isnull().sum()
print(missing_val_count_by_column[missing_val_count_by_column > 0])

LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64
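
The same two lines for counting missing values are repeated several times in this notebook. A minimal sketch of a reusable helper is shown here; the name `missing_summary` and the toy frame are hypothetical and only illustrate the idea, they are not part of the original analysis.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    counts = counts[counts > 0]
    return pd.DataFrame({
        "missing": counts,
        "percent": (100 * counts / len(df)).round(1),
    }).sort_values("missing", ascending=False)


# Toy frame standing in for house_data (values are made up for illustration)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotArea": [8450, 9600, 11250, 9550],
})
summary = missing_summary(toy)
print(summary)
```

Having the percentage next to the count makes it easier to judge which columns are mostly empty and which have only a handful of gaps.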


As seen in the univariate and bivariate analysis, many of these variables have missing values because the house does not have that feature. For example, it is not possible to determine pool quality for a house without a pool. We have to be careful, though, because not all missing values have such a simple explanation.

First, let's check whether the missing values in the "Garage" variables have this cause. The "GarageArea" variable helps here: we can count the rows where it equals 0 (no garage).

In [5]:
house_data["GarageArea"][house_data["GarageArea"] == 0].count()

81

This matches the number of missing values exactly. Let's check further whether the rows with a "GarageArea" of 0 also have missing values in the remaining garage variables:

In [6]:
garage_features = ["GarageType", "GarageYrBlt", "GarageFinish", "GarageQual", "GarageCond"]
for feature in garage_features:    
    print(house_data[feature][house_data["GarageArea"] == 0].isnull().sum())

81
81
81
81
81


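They do! Therefore we can replace them with a category "NoGarage" for now, until we decide how to use these variables in the model. Note that GarageYrBlt is numeric, so filling it with a string turns it into a mixed-type column that will need separate handling later.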

In [7]:
for feature in garage_features:
    house_data[feature] = house_data[feature].fillna("NoGarage")

We can now check that none of the garage features have missing values:

In [8]:
missing_val_count_by_column = house_data.isnull().sum()
print(missing_val_count_by_column[missing_val_count_by_column > 0])

LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64


Similarly, we can replace the missing values in FireplaceQu, PoolQC and MiscFeature using the same method.
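
The method is always the same: compare the missing rows of a quality column with the rows where a related count or area column is 0, and fill only if they agree. A minimal sketch of that check-then-fill step, assuming a hypothetical helper `fill_if_explained` and made-up toy data, could look like this:

```python
import numpy as np
import pandas as pd


def fill_if_explained(df, indicator, feature, label):
    """Fill NaNs in `feature` (in place) only if every NaN row has indicator == 0.

    Returns the number of missing rows NOT explained by the indicator.
    """
    missing = df[feature].isnull()
    absent = df[indicator] == 0
    unexplained = int((missing & ~absent).sum())
    if unexplained == 0:
        df[feature] = df[feature].fillna(label)
    return unexplained


toy = pd.DataFrame({
    "Fireplaces": [0, 1, 0, 2],
    "FireplaceQu": [np.nan, "Gd", np.nan, "TA"],
    "PoolArea": [0, 0, 512, 0],
    "PoolQC": [np.nan, np.nan, np.nan, np.nan],  # row 2 has a pool but no quality
})

# (indicator column, feature column, label) triples mirroring the cells in this notebook
checks = [("Fireplaces", "FireplaceQu", "NoFireplace"), ("PoolArea", "PoolQC", "NoPool")]
results = {feature: fill_if_explained(toy, ind, feature, label)
           for ind, feature, label in checks}
print(results)
```

In the toy data the pool column is deliberately left unfilled, because one missing quality value belongs to a house that does have a pool and so needs a closer look.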

In [9]:
house_data["Fireplaces"][house_data['Fireplaces'] == 0].count()

690

In [10]:
print(house_data['FireplaceQu'][house_data["Fireplaces"] == 0].isnull().sum())

690


In [11]:
house_data["FireplaceQu"] = house_data["FireplaceQu"].fillna("NoFireplace")

In [12]:
house_data["PoolArea"][house_data['PoolArea'] == 0].count()

1453

In [13]:
print(house_data['PoolQC'][house_data["PoolArea"] == 0].isnull().sum())

1453


In [14]:
house_data["PoolQC"] = house_data["PoolQC"].fillna("NoPool")

In [15]:
house_data["MiscVal"][house_data['MiscVal'] == 0].count()

1408

In [16]:
print(house_data['MiscFeature'][house_data["MiscVal"] == 0].isnull().sum())

1406


In [17]:
# All 1406 missing MiscFeature values occur in rows with MiscVal == 0,
# so we can treat those missing values as having no miscellaneous feature.
# There also seem to be two rows that name a miscellaneous feature but have a value of 0.
house_data["MiscFeature"] = house_data["MiscFeature"].fillna("NoFeature")
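
The counts above differ by two (1408 rows with a value of 0, 1406 missing features). A small sketch of how such inconsistent rows could be pulled out for inspection is shown here on made-up toy data; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "MiscFeature": [np.nan, "Shed", "Othr", np.nan, "Shed"],
    "MiscVal": [0, 400, 0, 0, 0],
})

# Rows that list a feature but record no value for it
feature_without_value = toy[toy["MiscFeature"].notnull() & (toy["MiscVal"] == 0)]
# Rows with a value but no named feature (would contradict the "no feature" reading)
value_without_feature = toy[toy["MiscFeature"].isnull() & (toy["MiscVal"] > 0)]

print(feature_without_value)
print(len(value_without_feature))
```

If the second selection is empty, filling the missing features with "NoFeature" does not hide any recorded value.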

As explained in data_description.txt, missing values in Alley and Fence mean there is no alley access or no fence. Therefore we can simply replace those values with separate categories.

In [18]:
house_data["Fence"] = house_data["Fence"].fillna("NoFence")
house_data["Alley"] = house_data["Alley"].fillna("NoAlley")
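
As an alternative sketch, `DataFrame.fillna` also accepts a dictionary that maps column names to fill values, so several columns can be handled in one call. The toy frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Fence": ["MnPrv", np.nan, np.nan],
    "Alley": [np.nan, "Grvl", np.nan],
    "LotFrontage": [65.0, np.nan, 68.0],
})

# Map each column to its "feature absent" label; unlisted columns are left alone
absent_labels = {"Fence": "NoFence", "Alley": "NoAlley"}
filled = toy.fillna(value=absent_labels)
print(filled)
```

Keeping the labels in one dictionary also documents in a single place which categories were introduced during preprocessing.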

In [19]:
missing_val_count_by_column = house_data.isnull().sum()
print(missing_val_count_by_column[missing_val_count_by_column > 0])

LotFrontage     259
MasVnrType        8
MasVnrArea        8
BsmtQual         37
BsmtCond         37
BsmtExposure     38
BsmtFinType1     37
BsmtFinType2     38
Electrical        1
dtype: int64


There are a lot of missing values in LotFrontage (over 17%) with no obvious way to replace them. Given how thoroughly the other variables were recorded, it seems unlikely that these values are missing because of random errors. Perhaps it was difficult to establish the exact lot frontage for many houses.

To eliminate the missing values in this case I will impute them with a general central value. To avoid data leakage, I will do this only after the data is split into training and validation sets.
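
A minimal sketch of what leakage-free imputation could look like is given below. It assumes the median as the central value and a simple shuffled split; both are assumptions for illustration, since the actual choice is made in a later notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({"LotFrontage": rng.integers(20, 120, size=20).astype(float)})
toy.loc[[1, 7, 15], "LotFrontage"] = np.nan  # inject missing values

# Simple shuffled 75/25 split (a stand-in for whatever split is used later)
shuffled = toy.sample(frac=1.0, random_state=0)
train, valid = shuffled.iloc[:15].copy(), shuffled.iloc[15:].copy()

# The statistic is learned from the training rows only ...
train_median = train["LotFrontage"].median()
# ... and then applied to both parts, so validation rows never influence it
train["LotFrontage"] = train["LotFrontage"].fillna(train_median)
valid["LotFrontage"] = valid["LotFrontage"].fillna(train_median)
```

Computing the median on the full dataset before splitting would let information from the validation rows leak into the training data, which is exactly what this ordering avoids.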

For MasVnrType I will assume that the data is missing because of random error. It will be replaced by the mode, that is, no masonry veneer ("None"), and the corresponding MasVnrArea will be set to 0.

For Electrical I will assume the same; it will also be replaced by the mode.

In [20]:
# checking if both MasVnrType and MasVnrArea have missing values in the same rows
house_data["MasVnrArea"][house_data["MasVnrType"].isnull()].isnull().sum()

8

In [21]:
house_data["MasVnrArea"] = house_data["MasVnrArea"].fillna(0)
house_data["MasVnrType"] = house_data["MasVnrType"].fillna("None")

In [22]:
house_data["Electrical"] = house_data["Electrical"].fillna(house_data["Electrical"].mode()[0])

In [23]:
missing_val_count_by_column = house_data.isnull().sum()
print(missing_val_count_by_column[missing_val_count_by_column > 0])

LotFrontage     259
BsmtQual         37
BsmtCond         37
BsmtExposure     38
BsmtFinType1     37
BsmtFinType2     38
dtype: int64


We are left with only the basement variables. To fix BsmtFinType1 and BsmtFinType2 we can check the variables that hold their square footage (BsmtFinSF1 and BsmtFinSF2). We need to be careful, though, as a value of 0 square feet corresponds to unfinished basements as well as to "No basement". To count the zero-area rows that are not Unfinished we can use the following:

In [24]:
# rows with zero finished area that are not labelled "Unf" (missing types count as not "Unf")
zero_sf1 = house_data["BsmtFinSF1"] == 0
unfinished1 = house_data["BsmtFinType1"] == "Unf"
(zero_sf1 & ~unfinished1).sum()

37

It seems that all rows with 0 square feet in BsmtFinSF1 that are not unfinished correspond to the "No Basement" category, since their count matches the 37 missing values.
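
The same comparison can be read off a cross-tabulation of finish type against zero area. The following is a sketch on made-up toy data, with the missing type given an explicit "missing" label (a hypothetical name) so that it appears in the table.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "BsmtFinType1": ["GLQ", "Unf", np.nan, "ALQ", np.nan, "Unf"],
    "BsmtFinSF1": [706, 0, 0, 978, 0, 0],
})

# Label NaN explicitly so it shows up as its own row in the table
type_label = toy["BsmtFinType1"].fillna("missing")
area_label = (toy["BsmtFinSF1"] == 0).map({True: "zero", False: "nonzero"}).rename("area")
table = pd.crosstab(type_label, area_label)
print(table)
```

If the "missing" row has entries only in the "zero" column, every missing type belongs to a house with no finished basement area.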

Let's do further checks on BsmtQual, BsmtCond and BsmtFinType1:

In [25]:
house_data["BsmtQual"][house_data['BsmtFinSF1'] == 0].isnull().sum()

37

In [26]:
house_data["BsmtCond"][house_data['BsmtFinSF1'] == 0].isnull().sum()

37

In [27]:
house_data["BsmtFinType1"][house_data['BsmtFinSF1'] == 0].isnull().sum()

37

Therefore we can fill these values with a separate category "NoBsmt".

In [28]:
house_data["BsmtQual"] = house_data["BsmtQual"].fillna("NoBsmt")
house_data["BsmtCond"] = house_data["BsmtCond"].fillna("NoBsmt")
house_data["BsmtFinType1"] = house_data["BsmtFinType1"].fillna("NoBsmt")

In [29]:
missing_val_count_by_column = house_data.isnull().sum()
print(missing_val_count_by_column[missing_val_count_by_column > 0])

LotFrontage     259
BsmtExposure     38
BsmtFinType2     38
dtype: int64


In [30]:
# same check for the second finish type
zero_sf2 = house_data["BsmtFinSF2"] == 0
unfinished2 = house_data["BsmtFinType2"] == "Unf"
(zero_sf2 & ~unfinished2).sum()

37

It seems that only 37 of the 38 missing values can be explained by the lack of a basement. One of them might be a true missing value.

Let's check if missing values in Exposure and FinType2 are related:

In [31]:
house_data["BsmtExposure"][house_data["BsmtFinType2"].isnull()].isnull().sum()

37

We conclude that for 37 rows every basement variable is missing. In addition, one row is missing only BsmtExposure and another row is missing only BsmtFinType2:

In [32]:
house_data["BsmtExposure"][house_data["BsmtFinType2"].notnull()].isnull().sum()

1

In [33]:
house_data["BsmtExposure"][house_data["BsmtFinType2"].isnull()].notnull().sum()

1
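
Rows like these two can also be found in one step by counting missing basement columns per row. This is a sketch on made-up toy data; only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

bsmt_cols = ["BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2"]
toy = pd.DataFrame(
    [
        ["Gd", "TA", "No", "GLQ", "Unf"],    # complete
        [np.nan] * 5,                        # no basement at all
        ["Gd", "TA", np.nan, "Unf", "Unf"],  # only exposure missing
        ["Gd", "TA", "No", "GLQ", np.nan],   # only second finish type missing
    ],
    columns=bsmt_cols,
)

n_missing = toy[bsmt_cols].isnull().sum(axis=1)
all_missing = n_missing == len(bsmt_cols)
partly_missing = (n_missing > 0) & ~all_missing
print(toy[partly_missing])
```

Rows where all basement columns are missing are candidates for the "NoBsmt" category, while partly missing rows are more likely genuine gaps.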

I will replace those two single values with the respective modes:

In [34]:
mode_exposure = house_data["BsmtExposure"].mode()[0]
print(mode_exposure)
mode_type2 = house_data["BsmtFinType2"].mode()[0]
print(mode_type2)

No
Unf


In [35]:
# This chained assignment raises a SettingWithCopyWarning. Here it did fix the value in the dataframe,
# but that is not guaranteed: with pandas Copy-on-Write enabled, chained assignment does not modify the original.
# It can be avoided by assigning to the exact row directly, or by using df.loc, which I do in the next cell.
house_data["BsmtExposure"][house_data["BsmtFinType2"].notnull()] = house_data["BsmtExposure"][house_data["BsmtFinType2"].notnull()].fillna(mode_exposure)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  house_data["BsmtExposure"][house_data["BsmtFinType2"].notnull()] = house_data["BsmtExposure"][house_data["BsmtFinType2"].notnull()].fillna(mode_exposure)


In [36]:
# This does not raise any warning and is also faster than chained indexing
mask = house_data["BsmtExposure"].notnull()
house_data.loc[mask, "BsmtFinType2"] = house_data["BsmtFinType2"][mask].fillna(mode_type2)
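
A slightly more direct variant, sketched here on made-up toy data, builds a mask that selects exactly the cells to change and assigns the mode through a single `.loc` call, without a separate `fillna`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "BsmtExposure": ["No", np.nan, "Gd", np.nan],
    "BsmtFinType2": ["Unf", np.nan, np.nan, "Unf"],
})

# Select exactly the cells to change: BsmtFinType2 missing while BsmtExposure is known
mask = toy["BsmtFinType2"].isnull() & toy["BsmtExposure"].notnull()
mode_type2 = toy["BsmtFinType2"].mode()[0]
toy.loc[mask, "BsmtFinType2"] = mode_type2
print(toy)
```

Row 1, where both columns are missing, is left untouched so it can still receive the "NoBsmt" category afterwards.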

In [37]:
house_data["BsmtExposure"][house_data["BsmtFinType2"].isnull()].notnull().sum()

0

Now that we have eliminated those genuinely missing values, we can replace the remaining missing values in BsmtExposure and BsmtFinType2 with the category "NoBsmt".

In [38]:
house_data["BsmtExposure"] = house_data["BsmtExposure"].fillna("NoBsmt")
house_data["BsmtFinType2"] = house_data["BsmtFinType2"].fillna("NoBsmt")

In [39]:
missing_val_count_by_column = house_data.isnull().sum()
print(missing_val_count_by_column[missing_val_count_by_column > 0])

LotFrontage    259
dtype: int64


So the only variable with missing values left is LotFrontage, which will be dealt with when we split the data into training and validation sets.

Finally, we save the data to a CSV file.

In [40]:
house_data.to_csv("data/train_preprocessed.csv")
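
Because `pd.read_csv` parses some strings (for example "NA") as missing by default, it can be worth confirming that the saved file reloads with the same missing-value counts. The sketch below uses an in-memory buffer and made-up toy data instead of the real file. Whether a label such as "None" is affected depends on the pandas version, so treat that as an assumption to verify.

```python
import io

import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"GarageType": ["Attchd", "NoGarage"], "LotFrontage": [65.0, np.nan]},
    index=pd.Index([1, 2], name="Id"),
)

buffer = io.StringIO()  # in-memory stand-in for data/train_preprocessed.csv
toy.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col="Id")

# Compare missing-value counts before saving and after reloading
missing_before = toy.isnull().sum()
missing_after = reloaded.isnull().sum()
print(missing_after[missing_after > 0])
```

If the two counts differ, one of the category labels was most likely read back as a missing value, and passing `keep_default_na=False` with an explicit `na_values` list to `read_csv` is one way to control that.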