In [27]:
# Load Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

##### For now I am just loading the downloaded .csv files from the working directory, but it may be better to load them directly from Kaggle. Thoughts?

In [28]:
# Load data sets
dfTrain = pd.read_csv("train.csv")
dfTest = pd.read_csv("test.csv")

print("Shape of training data set:", dfTrain.shape)
print("Shape of testing data set: ", dfTest.shape)

Shape of training data set: (1460, 81)
Shape of testing data set:  (1459, 80)


##### It might be a good idea to combine training and testing data sets for EDA purposes, in order to capture all of the odd cases.

In [29]:
# Combine data sets
dfBoth = pd.concat([dfTrain, dfTest], keys=['train', 'test'], names=['dataSet', 'index'])

print("Shape of combined data set:", dfBoth.shape)

Shape of combined data set: (2919, 81)
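
##### A small sketch of how the `keys` argument lets us split the combined frame back apart later. The toy frames and their values below are made up for illustration (they are not the Kaggle data). It also shows why `SalePrice` ends up null for every test row.

```python
import pandas as pd

# Toy stand-ins for dfTrain and dfTest (hypothetical values, not the Kaggle data)
toyTrain = pd.DataFrame({"Id": [1, 2, 3],
                         "LotArea": [8450, 9600, 11250],
                         "SalePrice": [208500, 181500, 223500]})
toyTest = pd.DataFrame({"Id": [4, 5],
                        "LotArea": [11622, 14267]})

# Same call pattern as the cell above: the keys become the outer index level
toyBoth = pd.concat([toyTrain, toyTest], keys=["train", "test"],
                    names=["dataSet", "index"])

# Selecting on the outer level recovers each original set
trainPart = toyBoth.loc["train"]
# The test set never had SalePrice, so the column is all NaN there and can be dropped
testPart = toyBoth.loc["test"].drop(columns="SalePrice")

print(trainPart.shape, testPart.shape)
print("Null SalePrice values:", toyBoth["SalePrice"].isnull().sum())
```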


##### I think it's a good idea to first deal with *Null* values.

In [30]:
# Get a sorted list of the numbers of Null values
missingVals = dfBoth.isnull().sum()
missingVals = missingVals[missingVals > 0]
missingVals.sort_values()

Electrical         1
GarageArea         1
GarageCars         1
Exterior1st        1
Exterior2nd        1
KitchenQual        1
SaleType           1
TotalBsmtSF        1
BsmtFinSF1         1
BsmtUnfSF          1
BsmtFinSF2         1
Utilities          2
Functional         2
BsmtHalfBath       2
BsmtFullBath       2
MSZoning           4
MasVnrArea        23
MasVnrType        24
BsmtFinType1      79
BsmtFinType2      80
BsmtQual          81
BsmtExposure      82
BsmtCond          82
GarageType       157
GarageCond       159
GarageQual       159
GarageFinish     159
GarageYrBlt      159
LotFrontage      486
FireplaceQu     1420
SalePrice       1459
Fence           2348
Alley           2721
MiscFeature     2814
PoolQC          2909
dtype: int64

##### I was tempted to just remove any rows that have Null values in any of the columns with 5 or fewer Null values (Electrical through MSZoning), but I think we can make sense of some of them. For instance, the NaNs in `GarageArea` and `GarageCars` probably mean there is no garage and can reasonably be replaced with zero.
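
##### Before deciding either way, it may help to count how many rows a drop would actually cost. This is only a sketch on a made-up toy frame; `threshold`, `fewCols` and `rowsAffected` are names I am introducing here, and on the real data the threshold would be 5.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with two sparse-null columns and one mostly-null column
toy = pd.DataFrame({
    "Electrical": ["SBrkr", np.nan, "SBrkr", "FuseA"],
    "GarageCars": [2.0, 1.0, np.nan, 2.0],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
})

threshold = 1  # would be 5 for the real combined data set
missing = toy.isnull().sum()

# Columns that have at least one null but no more than the threshold
fewCols = missing[(missing > 0) & (missing <= threshold)].index.tolist()

# Number of rows with a null in any of those columns, i.e. rows a drop would remove
rowsAffected = int(toy[fewCols].isnull().any(axis=1).sum())

print(fewCols, rowsAffected)
```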

In [31]:
# Replace null values with zero (assign back instead of calling inplace on a column)
dfBoth["GarageArea"] = dfBoth["GarageArea"].fillna(0)
dfBoth["GarageCars"] = dfBoth["GarageCars"].fillna(0)

In [32]:
# Get the counts for each garage size (in cars)
dfBoth.GarageCars.value_counts()

2.0    1594
1.0     776
3.0     374
0.0     158
4.0      16
5.0       1
Name: GarageCars, dtype: int64

##### Note the 158 zeroes indicating no garage (157 originally, plus the one NaN we just filled). The other Garage variables all have 157-159 Null values. We can set up a rule to change those NaNs to the string `'None'` for rows where `GarageCars == 0`.

In [33]:
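# Conditionally replace null values with 'None'
# Note: this overwrites every garage column on rows where GarageCars == 0,
# and 'None' is a string, so GarageYrBlt becomes a mixed-type column
garageVars = ["GarageType", "GarageCond", "GarageQual", 
              "GarageFinish", "GarageYrBlt"]
for col in garageVars:
    dfBoth.loc[dfBoth['GarageCars'] == 0.0, col] = 'None'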
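
##### A possible stricter variant, sketched on a made-up toy frame: only fill cells that are actually null on no-garage rows, and leave `GarageYrBlt` alone so it stays numeric. This is an alternative to consider, not what the cell above does.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): row 1 has no garage; row 2 has a garage
# but a missing GarageQual, which should remain visible as NaN
toy = pd.DataFrame({
    "GarageCars": [2.0, 0.0, 1.0],
    "GarageType": ["Attchd", np.nan, "Detchd"],
    "GarageQual": ["TA", np.nan, np.nan],
    "GarageYrBlt": [2003.0, np.nan, np.nan],
})

noGarage = toy["GarageCars"] == 0
for col in ["GarageType", "GarageQual"]:
    # Only touch cells that are null AND belong to a no-garage row
    toy.loc[noGarage & toy[col].isnull(), col] = "None"

# GarageYrBlt is deliberately not filled with a string, so its dtype stays float
print(toy)
```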

##### Similarly, Null values in the numeric Basement variables likely indicate no basement and can also be replaced with zero.

In [34]:
# Replace null values with zero
bsmtVars = ["TotalBsmtSF", "BsmtFinSF1", "BsmtFinSF2", 
            "BsmtUnfSF", "BsmtHalfBath", "BsmtFullBath"]
for i in bsmtVars:
    dfBoth[i] = dfBoth[i].fillna(0)

In [35]:
# Conditionally replace null values with 'None'
bsmtCatVars = ["BsmtFinType1", "BsmtFinType2", "BsmtQual", 
               "BsmtCond", "BsmtExposure"]
for i in bsmtCatVars:
    dfBoth.loc[dfBoth['TotalBsmtSF'] == 0.0, i] = 'None'

In [36]:
# Get a sorted list of the numbers of Null values
missingVals = dfBoth.isnull().sum()
missingVals = missingVals[missingVals > 0]
missingVals.sort_values()

Electrical         1
GarageCond         1
GarageQual         1
Exterior1st        1
Exterior2nd        1
GarageFinish       1
GarageYrBlt        1
BsmtFinType2       1
SaleType           1
KitchenQual        1
Utilities          2
BsmtQual           2
Functional         2
BsmtCond           3
BsmtExposure       3
MSZoning           4
MasVnrArea        23
MasVnrType        24
LotFrontage      486
FireplaceQu     1420
SalePrice       1459
Fence           2348
Alley           2721
MiscFeature     2814
PoolQC          2909
dtype: int64

##### There are still Garage-related variables coming up null. I checked and they're all the same entry (test set, index 666). I suspect there is no garage and the number of cars for this entry (1.0) was mis-coded. We could delete this entry or set it up as no garage.
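
##### For reference, a sketch of the kind of check that turns up such an entry: filter for rows where any garage column is still null. The toy frame and its values are made up, and `garageCols` and `suspect` are names I am introducing for illustration.

```python
import numpy as np
import pandas as pd

# Toy combined frame (hypothetical values) with the same two-level index as dfBoth
toy = pd.DataFrame(
    {"GarageCars": [2.0, 0.0, 1.0],
     "GarageType": ["Attchd", "None", "Detchd"],
     "GarageQual": ["TA", "None", np.nan]},
    index=pd.MultiIndex.from_tuples(
        [("train", 0), ("test", 0), ("test", 1)],
        names=["dataSet", "index"]),
)

garageCols = ["GarageType", "GarageQual"]

# Keep only the rows where at least one garage column is still null
suspect = toy[toy[garageCols].isnull().any(axis=1)]

print(suspect)
```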

##### I haven't looked at the remaining Basement variables yet, but I suspect something similar to the Garage variables.
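
##### One way to look into this might be to count the nulls per row across the Basement columns, which shows whether they come from one entry or several. Again only a sketch on made-up toy values; `bsmtCols`, `nullsPerRow` and `partial` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): rows 2 and 3 have a basement but incomplete details
toy = pd.DataFrame({
    "TotalBsmtSF": [856.0, 0.0, 1124.0, 936.0],
    "BsmtQual": ["Gd", "None", np.nan, "TA"],
    "BsmtCond": ["TA", "None", "TA", np.nan],
    "BsmtExposure": ["No", "None", "Av", np.nan],
})

bsmtCols = ["BsmtQual", "BsmtCond", "BsmtExposure"]

# Number of basement columns still null in each row
nullsPerRow = toy[bsmtCols].isnull().sum(axis=1)

# Rows with at least one remaining null, shown next to their basement area
partial = toy.loc[nullsPerRow > 0, ["TotalBsmtSF"] + bsmtCols]

print(partial)
```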