# 2 - Data Cleaning

In [1]:
import pandas as pd
train_df = pd.read_csv('../data/housing_train.csv')
test_df = pd.read_csv('../data/housing_test.csv')

Each `MSSubClass` code implies a particular `HouseStyle`. This is how they should match up:

1Story	One story
- 20	1-STORY 1946 & NEWER ALL STYLES
- 30	1-STORY 1945 & OLDER
- 40	1-STORY W/FINISHED ATTIC ALL AGES
- 120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER

1.5Fin	One and one-half story: 2nd level finished
- 50	1-1/2 STORY FINISHED ALL AGES
- 150	1-1/2 STORY PUD - ALL AGES

1.5Unf	One and one-half story: 2nd level unfinished
- 45	1-1/2 STORY - UNFINISHED ALL AGES

2Story	Two story
- 60	2-STORY 1946 & NEWER
- 70	2-STORY 1945 & OLDER
- 160	2-STORY PUD - 1946 & NEWER


2.5Fin	Two and one-half story: 2nd level finished<br>
2.5Unf	Two and one-half story: 2nd level unfinished
- 75	2-1/2 STORY ALL AGES

SFoyer	Split Foyer
- 85	SPLIT FOYER

SLvl	Split Level
- 80	SPLIT OR MULTI-LEVEL

Misc (No obvious floor number)
- 90	DUPLEX - ALL STYLES AND AGES
- 180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
- 190	2 FAMILY CONVERSION - ALL STYLES AND AGES
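The list above can also be written down as a lookup table, so that mismatches are flagged in one pass instead of code by code. This is only a minimal sketch: `EXPECTED_STYLES`, `find_style_mismatches` and the toy DataFrame are hypothetical names and values, not part of the notebook. It follows the list strictly, so it would also flag the split-level rows that the notebook later chooses to leave alone.

```python
import pandas as pd

# Expected HouseStyle values per MSSubClass, transcribed from the list above.
# Codes 90, 180 and 190 are left out because they imply no particular style.
EXPECTED_STYLES = {
    20: {'1Story'}, 30: {'1Story'}, 40: {'1Story'}, 120: {'1Story'},
    50: {'1.5Fin'}, 150: {'1.5Fin'},
    45: {'1.5Unf'},
    60: {'2Story'}, 70: {'2Story'}, 160: {'2Story'},
    75: {'2.5Fin', '2.5Unf'},
    85: {'SFoyer'},
    80: {'SLvl'},
}

def find_style_mismatches(df):
    """Return rows whose HouseStyle is not allowed for their MSSubClass."""
    def is_mismatch(row):
        allowed = EXPECTED_STYLES.get(row['MSSubClass'])
        # Codes without an expected style are never flagged.
        return allowed is not None and row['HouseStyle'] not in allowed
    mask = df.apply(is_mismatch, axis=1)
    return df.loc[mask, ['MSSubClass', 'HouseStyle']]

# Toy data (hypothetical), not the housing data.
toy = pd.DataFrame({
    'MSSubClass': [20, 20, 75, 90, 50],
    'HouseStyle': ['1Story', 'SLvl', '2.5Unf', '2Story', '2Story'],
})
mismatches = find_style_mismatches(toy)
print(mismatches)
```

On the real data the same function could be called on `train_df` and `test_df` in turn.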

### Check for mismatches

Should be one story:

In [2]:
print('Train: \nCode 20\n',train_df[train_df['MSSubClass']==20]['HouseStyle'].value_counts(),'\n')
print('Code 30\n',train_df[train_df['MSSubClass']==30]['HouseStyle'].value_counts(),'\n')
print('Code 40\n',train_df[train_df['MSSubClass']==40]['HouseStyle'].value_counts(),'\n')
print('Code 120\n',train_df[train_df['MSSubClass']==120]['HouseStyle'].value_counts())

Train: 
Code 20
 HouseStyle
1Story    534
2Story      1
SLvl        1
Name: count, dtype: int64 

Code 30
 HouseStyle
1Story    67
1.5Fin     1
1.5Unf     1
Name: count, dtype: int64 

Code 40
 HouseStyle
1Story    4
Name: count, dtype: int64 

Code 120
 HouseStyle
1Story    86
SFoyer     1
Name: count, dtype: int64


Codes 20, 30, 120 have mismatches.

In [3]:
print('Test: \nCode 20\n',test_df[test_df['MSSubClass']==20]['HouseStyle'].value_counts(),'\n')
print('Code 30\n',test_df[test_df['MSSubClass']==30]['HouseStyle'].value_counts(),'\n')
print('Code 40\n',test_df[test_df['MSSubClass']==40]['HouseStyle'].value_counts(),'\n')
print('Code 120\n',test_df[test_df['MSSubClass']==120]['HouseStyle'].value_counts())

Test: 
Code 20
 HouseStyle
1Story    543
Name: count, dtype: int64 

Code 30
 HouseStyle
1Story    69
1.5Fin     1
Name: count, dtype: int64 

Code 40
 HouseStyle
1Story    1
1.5Fin    1
Name: count, dtype: int64 

Code 120
 HouseStyle
1Story    94
SFoyer     1
Name: count, dtype: int64


Codes 30, 40, 120 have mismatches.

Should be 1.5 stories:

In [4]:
print('Train: \nCode 45\n',train_df[train_df['MSSubClass']==45]['HouseStyle'].value_counts(),'\n')
print('Code 50\n',train_df[train_df['MSSubClass']==50]['HouseStyle'].value_counts(),'\n')
print('Code 150\n',train_df[train_df['MSSubClass']==150]['HouseStyle'].value_counts())

Train: 
Code 45
 HouseStyle
1.5Unf    12
Name: count, dtype: int64 

Code 50
 HouseStyle
1.5Fin    141
2Story      3
Name: count, dtype: int64 

Code 150
 Series([], Name: count, dtype: int64)


Code 50 has mismatches, and code 150 is never used.

In [5]:
print('Test: \nCode 45\n',test_df[test_df['MSSubClass']==45]['HouseStyle'].value_counts(),'\n')
print('Code 50\n',test_df[test_df['MSSubClass']==50]['HouseStyle'].value_counts(),'\n')
print('Code 150\n',test_df[test_df['MSSubClass']==150]['HouseStyle'].value_counts())

Test: 
Code 45
 HouseStyle
1.5Unf    4
1.5Fin    2
Name: count, dtype: int64 

Code 50
 HouseStyle
1.5Fin    141
1.5Unf      1
2Story      1
Name: count, dtype: int64 

Code 150
 HouseStyle
1.5Fin    1
Name: count, dtype: int64


Code 50 has a mismatch again, as does 45.

Should be 2 stories:

In [7]:
print('Train: \nCode 60\n',train_df[train_df['MSSubClass']==60]['HouseStyle'].value_counts(),'\n')
print('Code 70\n',train_df[train_df['MSSubClass']==70]['HouseStyle'].value_counts(),'\n')
print('Code 160\n',train_df[train_df['MSSubClass']==160]['HouseStyle'].value_counts())

Train: 
Code 60
 HouseStyle
2Story    298
SLvl        1
Name: count, dtype: int64 

Code 70
 HouseStyle
2Story    59
2.5Fin     1
Name: count, dtype: int64 

Code 160
 HouseStyle
2Story    63
Name: count, dtype: int64


Code 70 has a mismatch. Code 60 arguably does too, but as a split level is just a special style of multistory house, I'm going to leave it. There's no way to know whether it's a two-level split or more.

In [6]:
print('Test: \nCode 60\n',test_df[test_df['MSSubClass']==60]['HouseStyle'].value_counts(),'\n')
print('Code 70\n',test_df[test_df['MSSubClass']==70]['HouseStyle'].value_counts(),'\n')
print('Code 160\n',test_df[test_df['MSSubClass']==160]['HouseStyle'].value_counts())

Test: 
Code 60
 HouseStyle
2Story    274
1.5Fin      1
2.5Unf      1
Name: count, dtype: int64 

Code 70
 HouseStyle
2Story    64
2.5Unf     3
1.5Fin     1
Name: count, dtype: int64 

Code 160
 HouseStyle
2Story    64
SLvl       1
Name: count, dtype: int64


Codes 60 and 70 have mismatches. I am going to leave the split level on code 160.

Should be 2.5 stories:

In [8]:
print('Train: \nCode 75\n',train_df[train_df['MSSubClass']==75]['HouseStyle'].value_counts(),'\n')
print('Test: \nCode 75\n',test_df[test_df['MSSubClass']==75]['HouseStyle'].value_counts())

Train: 
Code 75
 HouseStyle
2.5Unf    9
2.5Fin    6
2Story    1
Name: count, dtype: int64 

Test: 
Code 75
 HouseStyle
2.5Unf    6
2Story    1
Name: count, dtype: int64


Both train and test have mismatches on code 75.

Should be split foyer:

In [9]:
print('Code 85\n',train_df[train_df['MSSubClass']==85]['HouseStyle'].value_counts(),'\n')
print('Code 85\n',test_df[test_df['MSSubClass']==85]['HouseStyle'].value_counts())

Code 85
 HouseStyle
SFoyer    20
Name: count, dtype: int64 

Code 85
 HouseStyle
SFoyer    28
Name: count, dtype: int64


No mismatches.

Should be split level:

In [10]:
print('Code 80\n',train_df[train_df['MSSubClass']==80]['HouseStyle'].value_counts(),'\n')
print('Code 80\n',test_df[test_df['MSSubClass']==80]['HouseStyle'].value_counts())

Code 80
 HouseStyle
SLvl    58
Name: count, dtype: int64 

Code 80
 HouseStyle
SLvl    60
Name: count, dtype: int64


No mismatches.
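The per-code checks above could also be condensed into a single table with `pd.crosstab`, which counts every combination of the two columns at once. A minimal sketch on a toy DataFrame (hypothetical values standing in for `train_df`):

```python
import pandas as pd

# Toy stand-in for train_df (hypothetical values).
toy = pd.DataFrame({
    'MSSubClass': [20, 20, 20, 60, 60, 80],
    'HouseStyle': ['1Story', '1Story', '2Story', '2Story', 'SLvl', 'SLvl'],
})

# Rows are MSSubClass codes, columns are HouseStyle values, cells are counts.
style_table = pd.crosstab(toy['MSSubClass'], toy['HouseStyle'])
print(style_table)
```

Any nonzero cell outside the expected style for a code is a candidate mismatch.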

### Fixing mismatches

In the train set

In [12]:
# setting 1 story
# Code 20: 2 Story or Split Level
train_df.loc[
    (train_df['MSSubClass'] == 20)
    & train_df['HouseStyle'].isin(['2Story', 'SLvl']),
    'HouseStyle'] = '1Story'

# Code 30: 1.5 Finished or 1.5 Unfinished
train_df.loc[
    (train_df['MSSubClass'] == 30)
    & train_df['HouseStyle'].isin(['1.5Fin', '1.5Unf']),
    'HouseStyle'] = '1Story'

# Code 120: Split Foyer
train_df.loc[
    (train_df['MSSubClass'] == 120)
    & (train_df['HouseStyle'] == 'SFoyer'),
    'HouseStyle'] = '1Story'

In [13]:
print('Check: \nCode 20\n',train_df[train_df['MSSubClass']==20]['HouseStyle'].value_counts(),'\n')
print('Code 30\n',train_df[train_df['MSSubClass']==30]['HouseStyle'].value_counts(),'\n')
print('Code 120\n',train_df[train_df['MSSubClass']==120]['HouseStyle'].value_counts())

Check: 
Code 20
 HouseStyle
1Story    536
Name: count, dtype: int64 

Code 30
 HouseStyle
1Story    69
Name: count, dtype: int64 

Code 120
 HouseStyle
1Story    87
Name: count, dtype: int64


In [14]:
# setting 1.5Fin story, Code 50: 2 Story
train_df.loc[
    (train_df['MSSubClass'] == 50)
    & (train_df['HouseStyle'] == '2Story'),
    'HouseStyle'] = '1.5Fin'

In [15]:
print('Check: \nCode 50\n',train_df[train_df['MSSubClass']==50]['HouseStyle'].value_counts())

Check: 
Code 50
 HouseStyle
1.5Fin    144
Name: count, dtype: int64


In [16]:
# setting 2 stories
# Code 70: 2.5 Finished
train_df.loc[
    (train_df['MSSubClass'] == 70)
    & (train_df['HouseStyle'] == '2.5Fin'),
    'HouseStyle'] = '2Story'

In [17]:
print('Check: \nCode 70\n',train_df[train_df['MSSubClass']==70]['HouseStyle'].value_counts())

Check: 
Code 70
 HouseStyle
2Story    60
Name: count, dtype: int64


For the 2.5 story code, 75, there is no way to tell whether a house is meant to be finished or unfinished. As there is only one code and there are not very many of these houses, even when the test data is included, I am going to code them all as simply `2.5Story`.

In [18]:
# setting 2.5 stories, code 75
train_df.loc[train_df['MSSubClass'] == 75, 'HouseStyle'] = '2.5Story'

In [19]:
print('Check: \nCode 75\n',train_df[train_df['MSSubClass']==75]['HouseStyle'].value_counts())

Check: 
Code 75
 HouseStyle
2.5Story    16
Name: count, dtype: int64


In the test set

In [20]:
# setting 1 story
# Code 30: 1.5 Finished
test_df.loc[
    (test_df['MSSubClass'] == 30)
    & (test_df['HouseStyle'] == '1.5Fin'),
    'HouseStyle'] = '1Story'

# Code 40: 1.5 Finished
test_df.loc[
    (test_df['MSSubClass'] == 40)
    & (test_df['HouseStyle'] == '1.5Fin'),
    'HouseStyle'] = '1Story'

# Code 120: Split Foyer
test_df.loc[
    (test_df['MSSubClass'] == 120)
    & (test_df['HouseStyle'] == 'SFoyer'),
    'HouseStyle'] = '1Story'

In [21]:
print('Check: \nCode 30\n',test_df[test_df['MSSubClass']==30]['HouseStyle'].value_counts(),'\n')
print('Code 40\n',test_df[test_df['MSSubClass']==40]['HouseStyle'].value_counts(),'\n')
print('Code 120\n',test_df[test_df['MSSubClass']==120]['HouseStyle'].value_counts())

Check: 
Code 30
 HouseStyle
1Story    70
Name: count, dtype: int64 

Code 40
 HouseStyle
1Story    2
Name: count, dtype: int64 

Code 120
 HouseStyle
1Story    95
Name: count, dtype: int64


In [22]:
# setting 1.5Unf, Code 45: 1.5 Finished
test_df.loc[
    (test_df['MSSubClass'] == 45)
    & (test_df['HouseStyle'] == '1.5Fin'),
    'HouseStyle'] = '1.5Unf'

# setting 1.5Fin story, Code 50: 1.5 Unfinished or 2 Story
test_df.loc[
    (test_df['MSSubClass'] == 50)
    & test_df['HouseStyle'].isin(['1.5Unf', '2Story']),
    'HouseStyle'] = '1.5Fin'

In [23]:
print('Test \nCode 45\n',test_df[test_df['MSSubClass']==45]['HouseStyle'].value_counts(),'\n')
print('Code 50\n',test_df[test_df['MSSubClass']==50]['HouseStyle'].value_counts())

Test 
Code 45
 HouseStyle
1.5Unf    6
Name: count, dtype: int64 

Code 50
 HouseStyle
1.5Fin    143
Name: count, dtype: int64


In [24]:
# setting 2 stories
# Code 60: 1.5 Finished or 2.5 Unfinished
test_df.loc[
    (test_df['MSSubClass'] == 60)
    & test_df['HouseStyle'].isin(['1.5Fin', '2.5Unf']),
    'HouseStyle'] = '2Story'

# Code 70: 1.5 Finished or 2.5 Unfinished
test_df.loc[
    (test_df['MSSubClass'] == 70)
    & test_df['HouseStyle'].isin(['1.5Fin', '2.5Unf']),
    'HouseStyle'] = '2Story'

In [25]:
print('Test: \nCode 60\n',test_df[test_df['MSSubClass']==60]['HouseStyle'].value_counts(),'\n')
print('Code 70\n',test_df[test_df['MSSubClass']==70]['HouseStyle'].value_counts())

Test: 
Code 60
 HouseStyle
2Story    276
Name: count, dtype: int64 

Code 70
 HouseStyle
2Story    68
Name: count, dtype: int64


In [26]:
# setting 2.5 stories, code 75
test_df.loc[test_df['MSSubClass'] == 75, 'HouseStyle'] = '2.5Story'

In [27]:
print('Check: \nCode 75\n',test_df[test_df['MSSubClass']==75]['HouseStyle'].value_counts())

Check: 
Code 75
 HouseStyle
2.5Story    7
Name: count, dtype: int64


### Filling NAs

The NA values for `Alley` and `Fence` can be assumed to mean 'None': that category is listed in the documentation but never appears in the data.

In [28]:
train_df['Alley'] = train_df['Alley'].fillna('None')
test_df['Alley'] = test_df['Alley'].fillna('None')

In [29]:
train_df['Fence'] = train_df['Fence'].fillna('None')
test_df['Fence'] = test_df['Fence'].fillna('None')

In [30]:
train_df['LotFrontage'].describe()

count    1201.000000
mean       70.049958
std        24.284752
min        21.000000
25%        59.000000
50%        69.000000
75%        80.000000
max       313.000000
Name: LotFrontage, dtype: float64

It can be assumed that the entries not filled in are properties that do not have lot frontage on a street, given that there are no 0 values.
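The premise here, that 0 never appears as a real value, can be checked directly before filling. A minimal sketch on a toy Series (hypothetical values standing in for the `LotFrontage` column):

```python
import pandas as pd

# Toy stand-in for the LotFrontage column (hypothetical values).
frontage = pd.Series([65.0, None, 80.0, 21.0, None], name='LotFrontage')

n_missing = int(frontage.isna().sum())
n_zero = int((frontage == 0).sum())
print(f'missing: {n_missing}, zeros: {n_zero}')

# Only use 0 as the "no frontage" marker if it is not already a real value.
if n_zero == 0:
    frontage = frontage.fillna(0)
```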

In [31]:
train_df['LotFrontage'] = train_df['LotFrontage'].fillna(0)
test_df['LotFrontage'] = test_df['LotFrontage'].fillna(0)

I am assuming that if no masonry type was filled in, there isn't one, so I am giving those 'None', in accordance with the documentation.

In [32]:
train_df['MasVnrType'] = train_df['MasVnrType'].fillna('None')
test_df['MasVnrType'] = test_df['MasVnrType'].fillna('None')

In [33]:
train_df[train_df['MasVnrArea'].isna()][['MasVnrType','MasVnrArea']]

Unnamed: 0,MasVnrType,MasVnrArea
234,,
529,,
650,,
936,,
973,,
977,,
1243,,
1278,,


As there is no veneer, I'm filling the area NAs with 0.

In [34]:
train_df['MasVnrArea'] = train_df['MasVnrArea'].fillna(0)
test_df['MasVnrArea'] = test_df['MasVnrArea'].fillna(0)

In [35]:
# NA flags for the five descriptive garage columns
garage_cols = ['GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond']
garage_na = train_df[garage_cols].isna()
print(f"Missing garage info \nType: {garage_na['GarageType'].sum()}")
print(f"Year built: {garage_na['GarageYrBlt'].sum()} \nFinish: {garage_na['GarageFinish'].sum()}")
print(f"Quality: {garage_na['GarageQual'].sum()} \nCondition: {garage_na['GarageCond'].sum()}")
print(f"Missing all five: {garage_na.all(axis=1).sum()}")

Missing garage info 
Type: 81
Year built: 81 
Finish: 81
Quality: 81 
Condition: 81
Missing all five: 81


In [36]:
# rows where the garage type is missing
no_garage_type = train_df['GarageType'].isna()
print(f"Missing GarageArea: {train_df['GarageArea'].isna().sum()}")
print(f"Missing GarageCars: {train_df['GarageCars'].isna().sum()}")
print(f"Has 0 GarageArea + other missing values: {((train_df['GarageArea'] == 0) & no_garage_type).sum()}")
print(f"Has 0 GarageCars + other missing values: {((train_df['GarageCars'] == 0) & no_garage_type).sum()}")

Missing GarageArea: 0
Missing GarageCars: 0
Has 0 GarageArea + other missing values: 81
Has 0 GarageCars + other missing values: 81


The missing garage values seem to be the result of the property not having a garage, so they can be filled with the appropriate "no garage" values ('None' or 'NA').
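The same kind of check (every NA sits on a row where the matching size column is 0) comes up again for basements, fireplaces and pools, so it could be wrapped in a small helper. This is a sketch only; `missing_only_when_zero` and the toy DataFrame are hypothetical and not part of the notebook.

```python
import pandas as pd

def missing_only_when_zero(df, cols, size_col):
    """True if every NA in `cols` sits on a row where `size_col` is 0."""
    any_missing = df[cols].isna().any(axis=1)
    return bool((df.loc[any_missing, size_col] == 0).all())

# Toy stand-in for train_df (hypothetical values).
toy = pd.DataFrame({
    'GarageType': ['Attchd', None, 'Detchd'],
    'GarageFinish': ['Fin', None, 'Unf'],
    'GarageArea': [480, 0, 240],
})
ok = missing_only_when_zero(toy, ['GarageType', 'GarageFinish'], 'GarageArea')
print(ok)
```

A `False` result would mean some rows need individual attention before a blanket fill.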

In [37]:
train_df['GarageType'] = train_df['GarageType'].fillna('None')
test_df['GarageType'] = test_df['GarageType'].fillna('None')

train_df['GarageYrBlt'] = train_df['GarageYrBlt'].fillna('None')
test_df['GarageYrBlt'] = test_df['GarageYrBlt'].fillna('None')

train_df['GarageFinish'] = train_df['GarageFinish'].fillna('NA')
test_df['GarageFinish'] = test_df['GarageFinish'].fillna('NA')

train_df['GarageQual'] = train_df['GarageQual'].fillna('NA')
test_df['GarageQual'] = test_df['GarageQual'].fillna('NA')

train_df['GarageCond'] = train_df['GarageCond'].fillna('NA')
test_df['GarageCond'] = test_df['GarageCond'].fillna('NA')

In [38]:
print(f'Missing basement finish 1: {len(train_df[train_df['BsmtFinType1'].isna()])}')
print(f'Missing basement finish 2: {len(train_df[train_df['BsmtFinType2'].isna()])}')
print(f'Missing basement exposure: {len(train_df[train_df['BsmtExposure'].isna()])}')
print(f'Missing basement quality: {len(train_df[train_df['BsmtQual'].isna()])}')
print(f'Missing basement condition: {len(train_df[train_df['BsmtCond'].isna()])}')

Missing basement finish 1: 37
Missing basement finish 2: 38
Missing basement exposure: 38
Missing basement quality: 37
Missing basement condition: 37


In [39]:
# rows with no basement area where all five descriptive columns are NA
bsmt_cols = ['BsmtFinType1', 'BsmtFinType2', 'BsmtExposure', 'BsmtQual', 'BsmtCond']
no_bsmt_all_na = (train_df['TotalBsmtSF'] == 0) & train_df[bsmt_cols].isna().all(axis=1)
print(f"No basement area, missing all values: {no_bsmt_all_na.sum()}")

No basement area, missing all values: 37


Most of the missing values are in cases where there is no basement, as shown by the 0 for total square footage. However:

In [41]:
display(train_df[(train_df['TotalBsmtSF']!=0)&
    (train_df['BsmtFinType2'].isna())][['TotalBsmtSF', 'BsmtFinType1', 'BsmtFinSF1',
'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtExposure', 'BsmtQual', 'BsmtCond']])
display(train_df[(train_df['TotalBsmtSF']!=0)&
    (train_df['BsmtExposure'].isna())][['TotalBsmtSF', 'BsmtFinType1',
'BsmtFinType2', 'BsmtExposure', 'BsmtQual', 'BsmtCond']])

Unnamed: 0,TotalBsmtSF,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF.1,BsmtExposure,BsmtQual,BsmtCond
332,3206,GLQ,1124,,479,1603,3206,No,Gd,TA


Unnamed: 0,TotalBsmtSF,BsmtFinType1,BsmtFinType2,BsmtExposure,BsmtQual,BsmtCond
948,936,Unf,Unf,,Gd,TA


In [40]:
display(train_df['BsmtFinType2'].value_counts())
display(train_df['BsmtExposure'].value_counts())

BsmtFinType2
Unf    1256
Rec      54
LwQ      46
BLQ      33
ALQ      19
GLQ      14
Name: count, dtype: int64

BsmtExposure
No    953
Av    221
Gd    134
Mn    114
Name: count, dtype: int64

In the case of row 332, the separate areas recorded for finish 1 and finish 2 (1124 and 479) imply that there are two different finishes, but there's no way to know what *kind* of finish the second is. I am going to assign it 'Rec', as it is the most common value aside from unfinished, which the square footage values rule out in this case.

For row 948, I assume the missing value means there is no exposure, so I am giving it the 'No' value from the documentation.
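The "most common value other than unfinished" can be read off `value_counts()` programmatically instead of by eye. A minimal sketch on a toy Series (hypothetical values standing in for `BsmtFinType2`):

```python
import pandas as pd

# Toy stand-in for the BsmtFinType2 column (hypothetical values).
fin_type2 = pd.Series(['Unf', 'Unf', 'Unf', 'Rec', 'Rec', 'LwQ', None])

# value_counts() ignores NA and sorts by frequency, most common first.
counts = fin_type2.value_counts()

# Drop the excluded category, then take the most frequent remaining label.
fallback = counts.drop('Unf').idxmax()
print(fallback)
```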

In [42]:
train_df.loc[332, 'BsmtFinType2'] = 'Rec'
train_df.loc[948, 'BsmtExposure'] = 'No'

Now the rest can be filled.

In [43]:
train_df['BsmtFinType1'] = train_df['BsmtFinType1'].fillna('None')
train_df['BsmtFinType2'] = train_df['BsmtFinType2'].fillna('None')
train_df['BsmtExposure'] = train_df['BsmtExposure'].fillna('NA')
train_df['BsmtQual'] = train_df['BsmtQual'].fillna('None')
train_df['BsmtCond'] = train_df['BsmtCond'].fillna('None')

There are extra issues with the test version of these columns.

In [44]:
print(f'Missing test basement values:')
print(f'Quality: {len(test_df[test_df['BsmtQual'].isna()])}')
print(f'Condition: {len(test_df[test_df['BsmtCond'].isna()])}')
print(f'Exposure: {len(test_df[test_df['BsmtExposure'].isna()])}')
print(f'Finish 1: {len(test_df[test_df['BsmtFinType1'].isna()])}')
print(f'Finish 1 area: {len(test_df[test_df['BsmtFinSF1'].isna()])}')
print(f'Finish 2: {len(test_df[test_df['BsmtFinType2'].isna()])}')
print(f'Finish 2 area: {len(test_df[test_df['BsmtFinSF2'].isna()])}')
print(f'Unfinished area: {len(test_df[test_df['BsmtUnfSF'].isna()])}')
print(f'Total area: {len(test_df[test_df['TotalBsmtSF'].isna()])}')
print(f'Full bath: {len(test_df[test_df['BsmtFullBath'].isna()])}')
print(f'Half bath:{len(test_df[test_df['BsmtHalfBath'].isna()])}')

Missing test basement values:
Quality: 44
Condition: 45
Exposure: 44
Finish 1: 42
Finish 1 area: 1
Finish 2: 42
Finish 2 area: 1
Unfinished area: 1
Total area: 1
Full bath: 2
Half bath:2


In [45]:
test_df[test_df['BsmtFinSF1'].isna()][['HouseStyle','BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 
                    'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath']]

Unnamed: 0,HouseStyle,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,BsmtFullBath,BsmtHalfBath
660,1Story,,,,,,,,,,,


Given that this row is a single-story house, I am assuming it does not have a basement, so all of these values get the appropriate 0/None value. All of the columns with a single missing entry can be filled in based on this row.

In [46]:
test_df['BsmtFinSF1'] = test_df['BsmtFinSF1'].fillna(0)
test_df['BsmtFinSF2'] = test_df['BsmtFinSF2'].fillna(0)
test_df['BsmtUnfSF'] = test_df['BsmtUnfSF'].fillna(0)
test_df['TotalBsmtSF'] = test_df['TotalBsmtSF'].fillna(0)

In [47]:
test_df[test_df['BsmtFullBath'].isna()][['HouseStyle', 'BsmtFullBath', 'BsmtHalfBath']]

Unnamed: 0,HouseStyle,BsmtFullBath,BsmtHalfBath
660,1Story,,
728,1Story,,


The same is true of the bathroom values; there is no basement so there can't be bathrooms.

In [48]:
test_df['BsmtFullBath'] = test_df['BsmtFullBath'].fillna(0)
test_df['BsmtHalfBath'] = test_df['BsmtHalfBath'].fillna(0)

In [49]:
test_df[
    (test_df['TotalBsmtSF']!=0)&(test_df['BsmtQual'].isna())
][['TotalBsmtSF', 'BsmtQual', 'BsmtCond', 'BsmtExposure']]

Unnamed: 0,TotalBsmtSF,BsmtQual,BsmtCond,BsmtExposure
757,173.0,,Fa,No
758,356.0,,TA,No


As there is no way to know the quality (which corresponds to the height of the basement), I am going to assign these 'TA' for 'typical'. 

In [50]:
# total area is not 0 AND quality is NA
test_df.loc[
    (test_df['TotalBsmtSF'] != 0)
    & test_df['BsmtQual'].isna(),
    'BsmtQual'] = 'TA'

In [51]:
print(f'Missing values vs basement area \nQuality: {test_df[test_df['BsmtQual'].isna()][['TotalBsmtSF']].value_counts()}')
print(f'\nCondition: {test_df[test_df['BsmtCond'].isna()][['TotalBsmtSF']].value_counts()}')
print(f'\nExposure: {test_df[test_df['BsmtExposure'].isna()][['TotalBsmtSF']].value_counts()}')
print(f'\nFinish type 1: {test_df[test_df['BsmtFinType1'].isna()][['TotalBsmtSF']].value_counts()}')
print(f'\nFinish type 2: {test_df[test_df['BsmtFinType2'].isna()][['TotalBsmtSF']].value_counts()}')

Missing values vs basement area 
Quality: TotalBsmtSF
0.0            42
Name: count, dtype: int64

Condition: TotalBsmtSF
0.0            42
995.0           1
1127.0          1
1426.0          1
Name: count, dtype: int64

Exposure: TotalBsmtSF
0.0            42
725.0           1
1595.0          1
Name: count, dtype: int64

Finish type 1: TotalBsmtSF
0.0            42
Name: count, dtype: int64

Finish type 2: TotalBsmtSF
0.0            42
Name: count, dtype: int64


In [52]:
test_df['BsmtQual'] = test_df['BsmtQual'].fillna('None')
test_df['BsmtFinType1'] = test_df['BsmtFinType1'].fillna('None')
test_df['BsmtFinType2'] = test_df['BsmtFinType2'].fillna('None')

In [53]:
test_df[
    (test_df['TotalBsmtSF']>0)&
    (test_df['BsmtCond'].isna())
][['MSSubClass', 'HouseStyle', 'BsmtQual', 'BsmtCond', 'BsmtUnfSF', 'TotalBsmtSF']]

Unnamed: 0,MSSubClass,HouseStyle,BsmtQual,BsmtCond,BsmtUnfSF,TotalBsmtSF
580,20,1Story,Gd,,0.0,1426.0
725,20,1Story,TA,,94.0,1127.0
1064,80,SLvl,TA,,240.0,995.0


In [54]:
test_df['BsmtCond'].value_counts()

BsmtCond
TA    1295
Fa      59
Gd      57
Po       3
Name: count, dtype: int64

This shows that there are mismatches between the values related to the number of stories and the presence of a basement. Ignoring that for now, and with no direct way to decide on the condition of the basement, I am going to use 'TA', as it is 'typical' and also the most common value.

In [55]:
# total area is not 0 AND condition is NA
test_df.loc[
    (test_df['TotalBsmtSF'] != 0)
    & test_df['BsmtCond'].isna(),
    'BsmtCond'] = 'TA'

In [56]:
test_df[test_df['BsmtCond'].isna()]['TotalBsmtSF'].value_counts()

TotalBsmtSF
0.0    42
Name: count, dtype: int64

Since all of these are rows where there is no basement, they can be filled with 'None'.

In [57]:
test_df['BsmtCond'] = test_df['BsmtCond'].fillna('None')
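The two-step pattern used here for `BsmtCond` (a fallback label where a basement exists, a "no basement" label everywhere else) could be captured in one helper. This is a sketch under assumptions: `fill_by_presence` and the toy DataFrame are hypothetical, and the size column is assumed to have had its own NAs filled already, as is done for `TotalBsmtSF` earlier.

```python
import pandas as pd

def fill_by_presence(df, col, size_col, present_value, absent_value):
    """Fill NAs in `col`: `present_value` where `size_col` is nonzero,
    `absent_value` where it is 0. Returns a new Series."""
    filled = df[col].copy()
    missing = filled.isna()
    has_feature = df[size_col] != 0
    filled[missing & has_feature] = present_value
    filled[missing & ~has_feature] = absent_value
    return filled

# Toy stand-in for test_df (hypothetical values).
toy = pd.DataFrame({
    'BsmtCond': ['TA', None, None, 'Gd'],
    'TotalBsmtSF': [800.0, 1127.0, 0.0, 950.0],
})
toy['BsmtCond'] = fill_by_presence(toy, 'BsmtCond', 'TotalBsmtSF', 'TA', 'None')
print(toy)
```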

In [58]:
test_df[
    (test_df['TotalBsmtSF']!=0)&
    (test_df['BsmtExposure'].isna())
][['HouseStyle', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'TotalBsmtSF']]

Unnamed: 0,HouseStyle,BsmtQual,BsmtCond,BsmtExposure,TotalBsmtSF
27,1Story,Gd,TA,,1595.0
888,2Story,Gd,TA,,725.0


Again, I am assuming that missing values correspond to no exposure, given that these rows have basements.

In [59]:
# total area is not 0 AND exposure is NA
test_df.loc[
    (test_df['TotalBsmtSF'] != 0)
    & test_df['BsmtExposure'].isna(),
    'BsmtExposure'] = 'No'

In [60]:
test_df[test_df['BsmtExposure'].isna()][['TotalBsmtSF']].value_counts()

TotalBsmtSF
0.0            42
Name: count, dtype: int64

The rest of these do not have a basement, so they get filled with 'NA', matching the train set.

In [61]:
test_df['BsmtExposure'] = test_df['BsmtExposure'].fillna('NA')

In [62]:
print(f'Missing fireplace quality: \nTrain: {len(train_df[train_df['FireplaceQu'].isna()])}')
print(f'Test: {len(test_df[test_df['FireplaceQu'].isna()])}')


Missing fireplace quality: 
Train: 690
Test: 730


In [63]:
print(f'0 fireplaces, missing quality: \nTrain: {len(train_df[(train_df['FireplaceQu'].isna())&(train_df['Fireplaces']==0)])}')
print(f'Test: {len(test_df[(test_df['FireplaceQu'].isna())&(test_df['Fireplaces']==0)])}')

0 fireplaces, missing quality: 
Train: 690
Test: 730


Given that all NAs are associated with there being no fireplace, the NAs are being filled with 'None'.

In [64]:
train_df['FireplaceQu'] = train_df['FireplaceQu'].fillna('None')
test_df['FireplaceQu'] = test_df['FireplaceQu'].fillna('None')

In [65]:
print(f'Missing pool quality: \nTrain: {len(train_df[train_df['PoolQC'].isna()])}')
print(f'Test: {len(test_df[test_df['PoolQC'].isna()])}')
print(f'\nMissing quality and pool area 0: \nTrain: {len(train_df[(train_df['PoolArea']==0)&(train_df['PoolQC'].isna())])}')
print(f'Test: {len(test_df[(test_df['PoolArea']==0)&(test_df['PoolQC'].isna())])}')

Missing pool quality: 
Train: 1453
Test: 1456

Missing quality and pool area 0: 
Train: 1453
Test: 1453


In [66]:
test_df[(test_df['PoolArea']!=0)&(test_df['PoolQC'].isna())][['PoolArea','PoolQC']]

Unnamed: 0,PoolArea,PoolQC
960,368,
1043,444,
1139,561,


There is no way to really know what the pool quality value actually is, so I am going to assign the 'Average/Typical' value, as it is the middle value both descriptively and in terms of encoding. 

In [67]:
# pool area is not 0 AND pool quality is NA
test_df.loc[
    (test_df['PoolArea'] != 0)
    & test_df['PoolQC'].isna(),
    'PoolQC'] = 'TA'

The rest can be filled in with 'None', as they correspond to 0s in the area column. 

In [68]:
train_df['PoolQC'] = train_df['PoolQC'].fillna('None')
test_df['PoolQC'] = test_df['PoolQC'].fillna('None')

Per the documentation: "Home functionality (Assume typical unless deductions are warranted)"<br>
Therefore both empty values can be filled with `Typ`. 

In [69]:
test_df['Functional'] = test_df['Functional'].fillna('Typ')

In [None]:
print(test_df['SaleType'].isna().value_counts(),'\n')
print(test_df[test_df['SaleType'].isna()][['SaleType','SaleCondition','YearBuilt','YrSold']],'\n')
print(f'Train: {train_df[train_df['SaleCondition']=='Normal'][['SaleType','SaleCondition']].value_counts()} \n')
print(f'Test: {test_df[test_df['SaleCondition']=='Normal'][['SaleType','SaleCondition']].value_counts()}')

I could delete this row, but if I want to keep it, it is probably safe to fill in the `SaleType` with 'WD': for a `SaleCondition` of 'Normal', the vast majority of datapoints have that value (96.8% for train, 95.9% for test).
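Percentages like these can be computed directly with `value_counts(normalize=True)` on the filtered column. A minimal sketch on a toy DataFrame (hypothetical values, so the shares differ from the real ones):

```python
import pandas as pd

# Toy stand-in for train_df (hypothetical values).
toy = pd.DataFrame({
    'SaleCondition': ['Normal', 'Normal', 'Normal', 'Normal', 'Partial'],
    'SaleType': ['WD', 'WD', 'WD', 'COD', 'New'],
})

# Share of each SaleType among rows with a Normal sale condition.
normal_sales = toy.loc[toy['SaleCondition'] == 'Normal', 'SaleType']
shares = normal_sales.value_counts(normalize=True)
print(shares)
```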

In [None]:
print(train_df['Electrical'].isna().value_counts(),'\n')
print(train_df[train_df['Electrical'].isna()][['Electrical','YearBuilt']])

In [None]:
print(f'Train: {train_df['Electrical'].value_counts()},\n')
print(f'Test: {test_df['Electrical'].value_counts()} \n')
print(train_df[train_df['Electrical']=='SBrkr'][['Electrical','YearBuilt']].describe())

Similarly, the vast majority of the data (91.4% of train, 91.6% of test) has the 'SBrkr' value, and the descriptive stats show no meaningful connection between building age and electrical status. So again, it is probably safe to use that value in the interest of keeping the row.

Utilities: Type of utilities available
		
       AllPub	All public Utilities (E,G,W,& S)	
       NoSewr	Electricity, Gas, and Water (Septic Tank)
       NoSeWa	Electricity and Gas Only
       ELO	Electricity only

### Remove outliers

Using the thresholds from the previous notebook.

In [None]:
train_df[train_df['LotFrontage']>250]

In [None]:
train_df[train_df['LotArea']>140000]

In [None]:
train_df[train_df['BsmtFinSF1']>4000]

In [None]:
train_df[train_df['BsmtFinSF2']>1300]

In [None]:
train_df[train_df['TotalBsmtSF']>5000]

In [None]:
train_df[train_df['1stFlrSF']>4000]

In [None]:
train_df[train_df['GrLivArea']>5000]

In [None]:
train_df[train_df['EnclosedPorch']>500]

In [None]:
train_df[train_df['MiscVal']>6000]

In [None]:
train_df[train_df['SalePrice']>700000]
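The cells above only display the rows beyond each threshold. If the intent is to drop them, the thresholds could be collected in one place and applied together. This is a sketch under that assumption; `THRESHOLDS`, `drop_outliers` and the toy DataFrame are hypothetical names and values, with the limits copied from the cells above.

```python
import pandas as pd

# Upper limits copied from the cells above.
THRESHOLDS = {
    'LotFrontage': 250, 'LotArea': 140000, 'BsmtFinSF1': 4000,
    'BsmtFinSF2': 1300, 'TotalBsmtSF': 5000, '1stFlrSF': 4000,
    'GrLivArea': 5000, 'EnclosedPorch': 500, 'MiscVal': 6000,
    'SalePrice': 700000,
}

def drop_outliers(df, thresholds):
    """Drop rows that exceed the limit in any listed column present in df."""
    is_outlier = pd.Series(False, index=df.index)
    for col, limit in thresholds.items():
        if col in df.columns:
            is_outlier |= df[col] > limit
    return df.loc[~is_outlier]

# Toy stand-in for train_df (hypothetical values).
toy = pd.DataFrame({
    'GrLivArea': [1500, 5642, 2100],
    'SalePrice': [180000, 160000, 755000],
})
cleaned = drop_outliers(toy, THRESHOLDS)
print(cleaned)
```

Columns missing from the frame are skipped, so the same dictionary would not fail on a frame without `SalePrice`.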