# 2 - Data Cleaning

In [1]:
import pandas as pd
train_df = pd.read_csv('../data/housing_train.csv')
test_df = pd.read_csv('../data/housing_test.csv')

There is an apparent relation between `MSSubClass` and `HouseStyle`. This is how they should match up:

1Story	One story
- 20	1-STORY 1946 & NEWER ALL STYLES
- 30	1-STORY 1945 & OLDER
- 40	1-STORY W/FINISHED ATTIC ALL AGES
- 120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER

1.5Fin	One and one-half story: 2nd level finished
- 50	1-1/2 STORY FINISHED ALL AGES
- 150	1-1/2 STORY PUD - ALL AGES

1.5Unf	One and one-half story: 2nd level unfinished
- 45	1-1/2 STORY - UNFINISHED ALL AGES

2Story	Two story
- 60	2-STORY 1946 & NEWER
- 70	2-STORY 1945 & OLDER
- 75	2-1/2 STORY ALL AGES
- 160	2-STORY PUD - 1946 & NEWER

SFoyer	Split Foyer
- 85	SPLIT FOYER

SLvl	Split Level
- 80	SPLIT OR MULTI-LEVEL

Misc
- 90	DUPLEX - ALL STYLES AND AGES
- 180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
- 190	2 FAMILY CONVERSION - ALL STYLES AND AGES

Check for mismatches:

Should be one story:

In [2]:
print('Code 20\n',train_df[train_df['MSSubClass']==20]['HouseStyle'].value_counts(),'\n')
print('Code 30\n',train_df[train_df['MSSubClass']==30]['HouseStyle'].value_counts(),'\n')
print('Code 40\n',train_df[train_df['MSSubClass']==40]['HouseStyle'].value_counts(),'\n')
print('Code 120\n',train_df[train_df['MSSubClass']==120]['HouseStyle'].value_counts())

Should be 1.5 stories:

In [None]:
print('Code 45\n',train_df[train_df['MSSubClass']==45]['HouseStyle'].value_counts(),'\n')
print('Code 50\n',train_df[train_df['MSSubClass']==50]['HouseStyle'].value_counts(),'\n')
print('Code 150\n',train_df[train_df['MSSubClass']==150]['HouseStyle'].value_counts())

Should be 2 stories:

In [None]:
print('Code 60\n',train_df[train_df['MSSubClass']==60]['HouseStyle'].value_counts(),'\n')
print('Code 70\n',train_df[train_df['MSSubClass']==70]['HouseStyle'].value_counts(),'\n')
print('Code 75\n',train_df[train_df['MSSubClass']==75]['HouseStyle'].value_counts(),'\n')
print('Code 160\n',train_df[train_df['MSSubClass']==160]['HouseStyle'].value_counts())

Should be split foyer:

In [None]:
print('Code 85\n',train_df[train_df['MSSubClass']==85]['HouseStyle'].value_counts())

Should be split level:

In [None]:
print('Code 80\n',train_df[train_df['MSSubClass']==80]['HouseStyle'].value_counts())

No obvious floor number based on code:

In [None]:
print('Code 90\n',train_df[train_df['MSSubClass']==90]['HouseStyle'].value_counts(),'\n')
print('Code 180\n',train_df[train_df['MSSubClass']==180]['HouseStyle'].value_counts(),'\n')
print('Code 190\n',train_df[train_df['MSSubClass']==190]['HouseStyle'].value_counts())
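The per-code checks above could also be done in one pass. The sketch below is only one possible way to do it: `EXPECTED_STYLE` and `find_style_mismatches` are hypothetical names introduced here, the dictionary simply restates the mapping listed earlier, and `toy_df` is made-up data standing in for `train_df`.

```python
import pandas as pd

# Expected HouseStyle for each MSSubClass code, restating the mapping listed earlier.
# Codes 90, 180 and 190 are left out because they imply no particular style.
EXPECTED_STYLE = {
    20: '1Story', 30: '1Story', 40: '1Story', 120: '1Story',
    50: '1.5Fin', 150: '1.5Fin',
    45: '1.5Unf',
    60: '2Story', 70: '2Story', 75: '2Story', 160: '2Story',
    85: 'SFoyer',
    80: 'SLvl',
}

def find_style_mismatches(df):
    """Return rows whose HouseStyle differs from the one implied by MSSubClass."""
    expected = df['MSSubClass'].map(EXPECTED_STYLE)
    # Only compare rows whose code has an expected style.
    mask = expected.notna() & (df['HouseStyle'] != expected)
    return df.loc[mask, ['MSSubClass', 'HouseStyle']].assign(Expected=expected[mask])

# Toy data for illustration only (not taken from the housing files).
toy_df = pd.DataFrame({
    'MSSubClass': [20, 20, 60, 90, 85],
    'HouseStyle': ['1Story', '2Story', '2Story', '1Story', 'SLvl'],
})

mismatches = find_style_mismatches(toy_df)
print(mismatches)

# A cross-tabulation gives the same overview as the value_counts() cells, in one table.
print(pd.crosstab(toy_df['MSSubClass'], toy_df['HouseStyle']))
```

In the notebook itself, `find_style_mismatches(train_df)` would list every row that disagrees with the mapping.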

Functional: Home functionality (Assume typical unless deductions are warranted)

       Typ	Typical Functionality
       Min1	Minor Deductions 1
       Min2	Minor Deductions 2
       Mod	Moderate Deductions
       Maj1	Major Deductions 1
       Maj2	Major Deductions 2
       Sev	Severely Damaged
       Sal	Salvage only

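Missing `Functional` values in the test set can be filled with `Typ`, since the data description says to assume typical functionality unless deductions are warranted.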

In [None]:
print(test_df['Functional'].isna().value_counts(),'\n')
print(test_df[test_df['Functional'].isna()][['Functional']])
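A minimal sketch of the fill itself, assuming a plain `fillna` is all that is needed. `toy_test` is a made-up stand-in for `test_df`, used here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Toy stand-in for test_df (illustration only, not real housing rows).
toy_test = pd.DataFrame({'Functional': ['Typ', np.nan, 'Min1', np.nan]})

# Count the gaps first so the fill can be verified afterwards.
n_missing = toy_test['Functional'].isna().sum()

# 'Typ' is the documented default: "Assume typical unless deductions are warranted".
toy_test['Functional'] = toy_test['Functional'].fillna('Typ')

print(f"Filled {n_missing} missing value(s)")
print(toy_test['Functional'].value_counts())
```

The same two lines applied to `test_df['Functional']` would fill the real column.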

In [None]:
print(test_df['SaleType'].isna().value_counts(),'\n')
print(test_df[test_df['SaleType'].isna()][['SaleType','SaleCondition','YearBuilt','YrSold']],'\n')
# Outer double quotes keep the f-strings valid on Python versions before 3.12.
print(f"Train: {train_df[train_df['SaleCondition']=='Normal'][['SaleType','SaleCondition']].value_counts()} \n")
print(f"Test: {test_df[test_df['SaleCondition']=='Normal'][['SaleType','SaleCondition']].value_counts()}")

I could delete this row, but if I want to keep it, it is probably safe to fill in the `SaleType` with 'WD': for a `SaleCondition` of 'Normal', the vast majority of data points have that value (96.8% for train, 95.9% for test).
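The share quoted above can be computed directly with `value_counts(normalize=True)`, and the fill can be limited to rows with a 'Normal' sale condition. This is a sketch under those assumptions; `toy` is invented data, so its 75% share is not the figure from the real files.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    'SaleType': ['WD', 'WD', 'WD', 'New', np.nan, 'COD'],
    'SaleCondition': ['Normal', 'Normal', 'Normal', 'Partial', 'Normal', 'Normal'],
})

# Share of each SaleType among 'Normal' sales (missing values are excluded).
normal = toy[toy['SaleCondition'] == 'Normal']
shares = normal['SaleType'].value_counts(normalize=True)
top = shares.idxmax()
print(shares)

# Fill only the rows that are both missing SaleType and 'Normal'.
mask = toy['SaleType'].isna() & (toy['SaleCondition'] == 'Normal')
toy.loc[mask, 'SaleType'] = top
```

Restricting the fill with `mask` keeps the imputation tied to the condition that justified it.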

In [None]:
print(train_df['Electrical'].isna().value_counts(),'\n')
print(train_df[train_df['Electrical'].isna()][['Electrical','YearBuilt']],'\n')

In [None]:
# Outer double quotes keep the f-strings valid on Python versions before 3.12.
print(f"Train: {train_df['Electrical'].value_counts()} \n")
print(f"Test: {test_df['Electrical'].value_counts()} \n")
print(train_df[train_df['Electrical']=='SBrkr'][['Electrical','YearBuilt']].describe())

Similarly, the vast majority of data (91.4% of train, 91.6% of test) has the 'SBrkr' value, and the descriptive stats suggest there is no meaningful connection between building age and electrical status to worry about, so again it is probably safe to use that value in the interest of keeping the row.
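One way to check the age question more directly would be to compare `YearBuilt` across every `Electrical` category rather than for 'SBrkr' alone, then fill with the most common value. The snippet is a sketch on invented data (`toy`), not a result from the housing files.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    'Electrical': ['SBrkr', 'SBrkr', 'FuseA', 'SBrkr', np.nan],
    'YearBuilt': [1995, 2004, 1950, 1972, 2006],
})

# Build-year statistics per electrical type (rows with a missing type are skipped).
by_type = toy.groupby('Electrical')['YearBuilt'].describe()
print(by_type)

# Fill the gap with the most frequent category.
most_common = toy['Electrical'].mode().iloc[0]
toy['Electrical'] = toy['Electrical'].fillna(most_common)
```

If the per-category build years differed sharply, filling by the category typical for the row's `YearBuilt` might be a better choice than the overall mode.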

Utilities: Type of utilities available
		
       AllPub	All public Utilities (E,G,W,& S)	
       NoSewr	Electricity, Gas, and Water (Septic Tank)
       NoSeWa	Electricity and Gas Only
       ELO	Electricity only

### Remove outliers

Using the thresholds from the previous notebook.

In [None]:
train_df[train_df['LotFrontage']>250]

In [None]:
train_df[train_df['LotArea']>140000]

In [None]:
train_df[train_df['BsmtFinSF1']>4000]

In [None]:
train_df[train_df['BsmtFinSF2']>1300]

In [None]:
train_df[train_df['TotalBsmtSF']>5000]

In [None]:
train_df[train_df['1stFlrSF']>4000]

In [None]:
train_df[train_df['GrLivArea']>5000]

In [None]:
train_df[train_df['EnclosedPorch']>500]

In [None]:
train_df[train_df['MiscVal']>6000]

In [None]:
train_df[train_df['SalePrice']>700000]
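The cells above only display the candidate rows. A possible way to actually drop them is sketched below. `OUTLIER_THRESHOLDS` and `drop_outliers` are hypothetical names introduced here, the limits are copied from the filter cells above, and `toy_df` is made-up data standing in for `train_df`.

```python
import pandas as pd

# Upper bounds copied from the filter cells above.
OUTLIER_THRESHOLDS = {
    'LotFrontage': 250, 'LotArea': 140000, 'BsmtFinSF1': 4000, 'BsmtFinSF2': 1300,
    'TotalBsmtSF': 5000, '1stFlrSF': 4000, 'GrLivArea': 5000, 'EnclosedPorch': 500,
    'MiscVal': 6000, 'SalePrice': 700000,
}

def drop_outliers(df, thresholds):
    """Return df without rows that exceed any threshold; columns not in df are skipped."""
    keep = pd.Series(True, index=df.index)
    for col, upper in thresholds.items():
        if col in df.columns:
            # NaN > upper evaluates to False, so rows with missing values are kept.
            keep &= ~(df[col] > upper)
    return df[keep]

# Toy data for illustration only.
toy_df = pd.DataFrame({
    'LotFrontage': [60.0, 313.0, 80.0, None],
    'GrLivArea': [1500, 1200, 5600, 900],
    'SalePrice': [200000, 150000, 160000, 120000],
})

cleaned = drop_outliers(toy_df, OUTLIER_THRESHOLDS)
print(f"Dropped {len(toy_df) - len(cleaned)} of {len(toy_df)} rows")
```

In the notebook this would be `train_df = drop_outliers(train_df, OUTLIER_THRESHOLDS)`, applied to the training set only, as in the cells above.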