# Problem Statement Reminder
The Iowa Real Estate Investors Association (IaREIA) has asked for a strong predictive model on which to base their investment plans for the Ames, Iowa region.

Model Evaluation Focuses on:
- R2 value
- RMSE
- Max Error
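
As a reminder of what these three metrics measure, here is a minimal sketch that computes them with plain numpy. The function name `regression_metrics` and the sale prices are made up for illustration; this is not the evaluation code used later in the project.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Hypothetical helper: R2, RMSE and max error for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "max_error": float(np.max(np.abs(residuals))),
    }

# made-up sale prices, only to exercise the function
metrics = regression_metrics([100000, 150000, 200000], [110000, 150000, 190000])
print(metrics)
```

RMSE and max error are in the same units as SalePrice (dollars), while R2 is unitless.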

---

In [2]:
import pandas as pd
import numpy as np

In [3]:
df_train = pd.read_csv('./datasets/train.csv')
df_test = pd.read_csv('./datasets/test.csv')

---
# Relevant Helper Functions

In [2]:
#importing helper functions
from helper_functions import character_df
from helper_functions import fillna_centrl_tendcy
from helper_functions import null_reminders 
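
The source of `helper_functions` is not included in this notebook. Judging only from the columns it returns further down (column_name, null_count, percent_missing, categorical_unique), `character_df` might look roughly like the sketch below. The name `character_df_sketch` and its internals are my assumption, not the real helper.

```python
import pandas as pd

def character_df_sketch(dataframe):
    """Hypothetical stand-in for helper_functions.character_df (assumed behavior)."""
    rows = []
    for col in dataframe.columns:
        null_count = int(dataframe[col].isnull().sum())
        is_numeric = pd.api.types.is_numeric_dtype(dataframe[col])
        rows.append({
            "column_name": col,
            "null_count": null_count,
            "percent_missing": round(100 * null_count / len(dataframe), 2),
            # list the unique labels only for non-numeric columns
            "categorical_unique": "not cat." if is_numeric else list(dataframe[col].unique()),
        })
    return pd.DataFrame(rows)

# toy data, not the real train.csv
toy = pd.DataFrame({
    "Alley": ["Pave", None, None, None],
    "Lot Area": [13517, 11492, 7922, 9802],
})
summary = character_df_sketch(toy)
print(summary)
```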


---

# Data Cleaning

In [8]:
print(df_train.shape)
df_train.head()

(2051, 81)


Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice
0,109,533352170,60,RL,,13517,Pave,,IR1,Lvl,...,0,0,,,,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,,IR1,Lvl,...,0,0,,,,0,4,2009,WD,220000
2,153,535304180,20,RL,68.0,7922,Pave,,Reg,Lvl,...,0,0,,,,0,1,2010,WD,109000
3,318,916386060,60,RL,73.0,9802,Pave,,Reg,Lvl,...,0,0,,,,0,4,2010,WD,174000
4,255,906425045,50,RL,82.0,14235,Pave,,IR1,Lvl,...,0,0,,,,0,3,2010,WD,138500


After a quick inspection of the train.csv file and my df_train, it is clear that the 'NA' values were imported by pandas as NaN. However, because so much of the data in these columns is NA, I will still drop them, since there is not enough data to get an accurate representation of each feature as a whole.

In [9]:
df_train.shape

(2051, 81)

In [10]:
df_charac=character_df(df_train)   
df_charac[df_charac['percent_missing']>40]

Unnamed: 0,column_name,null_count,percent_missing,categorical_unique
7,Alley,1911,93.17,"[nan, Pave, Grvl]"
58,Fireplace Qu,1000,48.76,"[nan, TA, Gd, Po, Ex, Fa]"
73,Pool QC,2042,99.56,"[nan, Fa, Gd, Ex, TA]"
74,Fence,1651,80.5,"[nan, MnPrv, GdPrv, GdWo, MnWw]"
75,Misc Feature,1986,96.83,"[nan, Shed, TenC, Gar2, Othr, Elev]"


The features listed below will be dropped because such a high percentage of their values are missing; replacing the NaNs with any measure of central tendency would not yield a representative result. The exception is 'PID', which is not missing but will not be used as a predictor in these models. After collaboration with David Coons and Hank Butler, it seems possible to use it to determine location, and potentially proximity to a location of importance, but that is too advanced for the purposes of this project.
<br><br>

Features automatically to be dropped:
 ['Alley', 'Pool QC', 'Fence', 'Misc Feature', 'PID', 'Fireplace Qu'] 
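
The high-null part of this list could also be derived directly from the data instead of being typed by hand. A minimal sketch on toy data, assuming the same 40% cutoff used above ('PID' would still need to be added manually since it is dropped for a different reason):

```python
import numpy as np
import pandas as pd

# toy data, not the real train.csv
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, "Pave", np.nan, np.nan],
    "Lot Area": [13517, 11492, 7922, 9802, 14235],
    "Fence": [np.nan, "MnPrv", np.nan, np.nan, "GdWo"],
})

# share of missing values per column, as a percentage
percent_missing = toy.isnull().mean() * 100

# columns above the 40% threshold
high_null_cols = percent_missing[percent_missing > 40].index.tolist()
cleaned = toy.drop(columns=high_null_cols)
print(high_null_cols, cleaned.columns.tolist())
```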

In [11]:
df_train.drop(['Alley', 'Pool QC', 'Fence', 'Misc Feature', 'PID', 'Fireplace Qu'], axis =1, inplace = True)

--- 
##### Checking data types 
This subsection checks how the features are typed in the dataframe against the data dictionary entries

In [12]:
df_charac=character_df(df_train)   # dropped columns earlier so not a total of 81 features anymore
df_categorical=df_charac.drop(columns=['null_count','percent_missing'])
df_categorical.head()

Unnamed: 0,column_name,categorical_unique
0,Id,not cat.
1,MS SubClass,not cat.
2,MS Zoning,"[RL, RM, FV, C (all), A (agr), RH, I (all)]"
3,Lot Frontage,not cat.
4,Lot Area,not cat.


After a quick check of the categorical column above against the data dictionary, I can see that some ordinal variables are represented as categorical. For the purposes of this analysis that is acceptable since there are only two such variables; if there were more, a closer analysis of whether or not to treat them as categorical would be appropriate.
The following features were misrepresented:
- Overall Quality ('Overall Qual')
- Overall Condition ('Overall Cond')


--- 
##### Categorical Alterations
This subsection works through cleaning categorical variables to match the data dictionary entries

In [13]:
# will clean 'MS Zoning'  to match data dictionary entries
df_train['MS Zoning'].value_counts()

RL         1598
RM          316
FV          101
C (all)      19
RH           14
A (agr)       2
I (all)       1
Name: MS Zoning, dtype: int64

In [14]:
# the actual replacement implemented: strip the ' (all)' / ' (agr)' suffixes
# (the leading space is removed too, so no trailing whitespace is left in the labels)
df_train['MS Zoning'] = (df_train['MS Zoning']
                         .str.replace(' (all)', '', regex=False)
                         .str.replace(' (agr)', '', regex=False))

df_train['MS Zoning'].value_counts()

RL     1598
RM      316
FV      101
C        19
RH       14
A         2
I         1
Name: MS Zoning, dtype: int64

In [15]:
# None is an actual category in data dictionary so will not alter
df_train['Mas Vnr Type'].value_counts() 

None       1218
BrkFace     630
Stone       168
BrkCmn       13
Name: Mas Vnr Type, dtype: int64

In [16]:
# features not relevant to basement quality, stored so the reminder function can filter them out
features_to_drop = ['Id', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area', 'Street','Lot Shape', 'Land Contour', 'Utilities', 'Lot Config', 'Land Slope','Neighborhood', 'Condition 1',
 'Condition 2', 'Bldg Type','House Style', 'Overall Qual', 'Overall Cond', 'Year Built','Year Remod/Add', 'Roof Style', 'Roof Matl', 'Exterior 1st','Exterior 2nd', 'Mas Vnr Type', 'Mas Vnr Area', 
'Exter Qual','Exter Cond', 'Foundation',  'Full Bath','Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'Kitchen Qual','TotRms AbvGrd', 'Functional', 'Fireplaces', 'Garage Type',
'Garage Yr Blt', 'Garage Finish', 'Garage Cars', 'Garage Area', 'Garage Qual', 'Garage Cond', 'Paved Drive', 'Wood Deck SF',
'Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch','Pool Area', 'Misc Val', 'Mo Sold', 'Yr Sold' ]

In [17]:
# taking quick glance at null details
null_reminders(dataframe=df_train, column_name='Bsmt Qual',features_to_drop=features_to_drop,value_cnt='Yes').head()


is null sum:  55
TA    887
Gd    864
Ex    184
Fa     60
Po      1
Name: Bsmt Qual, dtype: int64
df_column_ shape (55, 21)


Unnamed: 0,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,...,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Sale Type,SalePrice
12,,,,,0.0,,0.0,0.0,0.0,GasA,...,Y,SBrkr,1288,728,0,2016,0.0,0.0,WD,131000
93,,,,,0.0,,0.0,0.0,0.0,GasA,...,Y,SBrkr,1535,0,0,1535,0.0,0.0,WD,118858
114,,,,,0.0,,0.0,0.0,0.0,GasA,...,N,SBrkr,660,0,0,660,0.0,0.0,WD,63900
146,,,,,0.0,,0.0,0.0,0.0,GasA,...,Y,SBrkr,495,1427,0,1922,0.0,0.0,ConLD,198500
183,,,,,0.0,,0.0,0.0,0.0,Wall,...,N,FuseA,733,0,0,733,0.0,0.0,WD,13100


After some consideration, noting that many of these houses probably did not have a basement (they were 'NA' in the train.csv file), I feel comfortable renaming these nulls with the string 'no_basement' so they can be input into the model after one-hot encoding. Will do so below and then move on to the next categorical variable inspection.
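
Before relabeling, the "no basement" assumption can be checked by confirming that every row with a missing quality also reports zero basement area. A small sketch on toy data (the values are made up; only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# toy data, not the real train.csv
toy = pd.DataFrame({
    "Bsmt Qual": ["TA", np.nan, "Gd", np.nan],
    "Total Bsmt SF": [725.0, 0.0, 1057.0, 0.0],
})

missing_qual = toy["Bsmt Qual"].isnull()

# True only if every row with a missing quality has 0 sq ft of basement.
# A NaN area compares as False here, so such rows would be flagged for a closer look.
all_missing_have_no_area = bool((toy.loc[missing_qual, "Total Bsmt SF"] == 0).all())
print(all_missing_have_no_area)
```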

In [18]:
# filling NaN values with 'no_basement' for columns = ['Bsmt Qual','Bsmt Cond','BsmtFin Type 2','Bsmt Exposure','BsmtFin Type 1']
columns2replace = ['Bsmt Qual','Bsmt Cond','BsmtFin Type 2','Bsmt Exposure','BsmtFin Type 1']

# performing replacement and verifying
for column in columns2replace:
    print(column)
    df_train[column] = df_train[column].fillna('no_basement')
    print('is null sum: ',df_train[column].isnull().sum() )
    print('') # just for break b/w loops

Bsmt Qual
is null sum:  0

Bsmt Cond
is null sum:  0

BsmtFin Type 2
is null sum:  0

Bsmt Exposure
is null sum:  0

BsmtFin Type 1
is null sum:  0



In [19]:
# second verification
null_reminders(dataframe=df_train, column_name='Bsmt Qual',features_to_drop=features_to_drop,value_cnt='Yes')


is null sum:  0
TA             887
Gd             864
Ex             184
Fa              60
no_basement     55
Po               1
Name: Bsmt Qual, dtype: int64
df_column_ shape (0, 21)


Unnamed: 0,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Heating,...,Central Air,Electrical,1st Flr SF,2nd Flr SF,Low Qual Fin SF,Gr Liv Area,Bsmt Full Bath,Bsmt Half Bath,Sale Type,SalePrice


Will repeat the same process for the following columns and column names <br>
- Garage Type = [Attchd, Detchd, BuiltIn, Basment, nan, 2Types...<br>
- Garage Finish = [RFn, Unf, Fin, nan]<br>
- Garage Qual = [TA, Fa, nan, Gd, Ex, Po]<br>
- Garage Cond = [TA, Fa, nan, Po, Gd, Ex]<br>

In [20]:
# manually selected what features to isolate
features_to_drop = ['Id', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area', 'Street',
'Lot Shape', 'Land Contour', 'Utilities', 'Lot Config', 'Land Slope','Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type','House Style', 'Overall Qual', 'Overall Cond', 'Year Built','Year Remod/Add', 'Roof Style', 'Roof Matl', 'Exterior 1st','Exterior 2nd', 'Mas Vnr Type', 'Mas Vnr Area', 'Exter Qual','Exter Cond', 'Foundation', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure','BsmtFin Type 1', 'BsmtFin SF 1', 'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', 'Heating', 'Heating QC', 'Central Air','Electrical', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath','Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'Kitchen Qual','TotRms AbvGrd', 'Functional', 'Fireplaces', 'Paved Drive', 'Wood Deck SF','Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch','Pool Area', 'Misc Val', 'Mo Sold',                                      ]

# quick glance at garage specific features
null_reminders(dataframe=df_train, column_name='Garage Type',features_to_drop=features_to_drop,value_cnt='Yes').head()


is null sum:  113
Attchd     1213
Detchd      536
BuiltIn     132
Basment      27
2Types       19
CarPort      11
Name: Garage Type, dtype: int64
df_column_ shape (113, 10)


Unnamed: 0,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Yr Sold,Sale Type,SalePrice
28,,,,0.0,0.0,,,2007,WD,119600
53,,,,0.0,0.0,,,2010,WD,76000
65,,,,0.0,0.0,,,2007,New,147000
79,,,,0.0,0.0,,,2007,WD,129850
101,,,,0.0,0.0,,,2007,WD,86000


After revisiting the data dictionary, I noted that the columns ['Garage Type', 'Garage Yr Blt', 'Garage Finish', 'Garage Qual', 'Garage Cond'] all appear to be missing together for each of these houses, so it is likely they were originally 'NA' (No Garage) entries in train.csv. I will replace them with the string 'no_garage' to have accurate dummy columns later on.
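
One thing to keep in mind with this choice: 'Garage Yr Blt' holds years, so mixing in a string leaves the column non-numeric, and it would then be treated like a categorical feature when creating dummies. The sketch below shows the effect on toy values, plus one possible alternative (a separate indicator column). The alternative is only an illustration, not what this notebook does.

```python
import numpy as np
import pandas as pd

# toy values, not the real data
years = pd.Series([1976.0, np.nan, 2005.0], name="Garage Yr Blt")

# what the column looks like once the string label sits next to the years
mixed = pd.Series([1976.0, "no_garage", 2005.0], name="Garage Yr Blt")
is_still_numeric = pd.api.types.is_numeric_dtype(mixed)

# illustrative alternative: keep the years numeric and flag garage presence separately
has_garage = years.notnull().astype(int)
print(is_still_numeric, has_garage.tolist())
```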

In [21]:
# filling NaN values with 'no_garage' for columns = ['Garage Type', 'Garage Yr Blt', 'Garage Finish', 'Garage Qual', 'Garage Cond']
columns2replace = ['Garage Type', 'Garage Yr Blt', 'Garage Finish', 'Garage Qual', 'Garage Cond']

# performing replacement and verifying
for column in columns2replace:
    print(column)
    df_train[column] = df_train[column].fillna('no_garage')
    print('is null sum: ',df_train[column].isnull().sum() )
    print('') # just for break b/w loops

# quick glance at garage specific features for confirmation
null_reminders(dataframe=df_train, column_name='Garage Type',features_to_drop=features_to_drop,value_cnt='Yes').head()

Garage Type
is null sum:  0

Garage Yr Blt
is null sum:  0

Garage Finish
is null sum:  0

Garage Qual
is null sum:  0

Garage Cond
is null sum:  0

is null sum:  0
Attchd       1213
Detchd        536
BuiltIn       132
no_garage     113
Basment        27
2Types         19
CarPort        11
Name: Garage Type, dtype: int64
df_column_ shape (0, 10)


Unnamed: 0,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond,Yr Sold,Sale Type,SalePrice


--- 
##### Continuous Feature Alterations
This subsection works through cleaning continuous variables to match the data dictionary entries based on logical assumptions. It focuses on features with larger numbers of nulls to replace at a time, not those with only 1 null value.

In [22]:
# getting a glimpse of what is left to work with
charac_df = character_df(df_train)
charac_df[ charac_df['null_count']>1 ]

Unnamed: 0,column_name,null_count,percent_missing,categorical_unique
3,Lot Frontage,330,16.09,not cat.
24,Mas Vnr Type,22,1.07,"[BrkFace, None, nan, Stone, BrkCmn]"
25,Mas Vnr Area,22,1.07,not cat.
46,Bsmt Full Bath,2,0.1,not cat.
47,Bsmt Half Bath,2,0.1,not cat.


In [23]:
# again manually selecting what columns I do not want to look at in my  reminder function
features_to_drop = [  'MS SubClass', 'MS Zoning', 
 'Overall Cond', 'Year Built','Year Remod/Add', 'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type', 'Mas Vnr Area', 'Exter Qual','Exter Cond', 'Foundation', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure','BsmtFin Type 1', 'BsmtFin SF 1', 'BsmtFin Type 2', 'BsmtFin SF 2','Bsmt Unf SF', 'Total Bsmt SF', 'Heating', 'Heating QC', 'Central Air','Electrical', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF','Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath','Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'Kitchen Qual','TotRms AbvGrd', 'Functional', 'Fireplaces', 'Garage Type', 'Garage Yr Blt', 'Garage Finish', 'Garage Cars', 'Garage Area','Garage Qual', 'Garage Cond', 'Paved Drive', 'Wood Deck SF','Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch','Pool Area', 'Misc Val', 'Mo Sold', 'Yr Sold', 'Sale Type']
df_lot_frnt = null_reminders(dataframe=df_train, column_name='Lot Frontage',features_to_drop=features_to_drop,value_cnt='No').head()
df_lot_frnt.head()

is null sum:  330
df_column_ shape (330, 16)


Unnamed: 0,Id,Lot Frontage,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,SalePrice
0,109,,13517,Pave,IR1,Lvl,AllPub,CulDSac,Gtl,Sawyer,RRAe,Norm,1Fam,2Story,6,130500
7,145,,12160,Pave,IR1,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,142000
8,1942,,15783,Pave,Reg,Lvl,AllPub,Inside,Gtl,OldTown,Artery,Norm,1Fam,1Story,5,112500
23,12,,7980,Pave,IR1,Lvl,AllPub,Inside,Gtl,Gilbert,Norm,Norm,1Fam,1Story,6,185000
27,1534,,11700,Pave,IR1,HLS,AllPub,Inside,Mod,Crawfor,Norm,Norm,1Fam,1.5Fin,5,198000


For this particular feature, each row with a missing value still belongs to a 'Lot Config' group, so replacing the nulls with the average of their group should be a decent estimate in case this feature is selected as an input to the model.
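
The code of `fillna_centrl_tendcy` is not shown here, so the following is only a sketch of how a group-mean fill can be done in pandas, on made-up numbers:

```python
import numpy as np
import pandas as pd

# toy data, not the real train.csv
toy = pd.DataFrame({
    "Lot Config": ["Inside", "Inside", "Inside", "Corner", "Corner"],
    "Lot Frontage": [60.0, 80.0, np.nan, 90.0, np.nan],
})

# mean frontage of each row's own Lot Config group, aligned to the original index
group_mean = toy.groupby("Lot Config")["Lot Frontage"].transform("mean")

# only the missing entries are replaced; observed values stay untouched
toy["Lot Frontage"] = toy["Lot Frontage"].fillna(group_mean)
print(toy)
```

Here the missing 'Inside' row gets 70.0 (the mean of 60 and 80) and the missing 'Corner' row gets 90.0.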

In [24]:
#actual replacement of nulls
fillna_centrl_tendcy(dataframe=df_train,change_column='Lot Frontage',groupby_column='Lot Config',function='mean')

# checking output again to ensure no other nulls were left
df_lot_frnt = null_reminders(dataframe=df_train, column_name='Lot Frontage',features_to_drop=features_to_drop,value_cnt='No').head()
df_lot_frnt.head()

is null sum:  0
df_column_ shape (0, 16)


Unnamed: 0,Id,Lot Frontage,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,Land Slope,Neighborhood,Condition 1,Condition 2,Bldg Type,House Style,Overall Qual,SalePrice


In [25]:
# again manually selecting what columns I do not want to look at in my reminder function
features_to_drop = [ 'Id', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area', 'Street', 'Lot Shape', 'Land Contour', 'Utilities', 'Lot Config', 'Land Slope','Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type','House Style', 'Overall Qual', 'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style', 'Roof Matl', 'Exterior 1st','Exterior 2nd','Exter Qual','Exter Cond', 'Foundation', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure','BsmtFin Type 1', 'BsmtFin SF 1', 'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', 'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath','Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional', 'Fireplaces', 'Garage Type','Garage Yr Blt', 'Garage Finish', 'Garage Cars', 'Garage Area','Garage Qual', 'Garage Cond', 'Paved Drive', 'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch','Pool Area', 'Misc Val', 'Mo Sold', 'Yr Sold', 'Sale Type','SalePrice']


df_mas_vnr = null_reminders(dataframe=df_train, column_name='Mas Vnr Type',features_to_drop=features_to_drop,value_cnt='Yes')
df_mas_vnr.head()

is null sum:  22
None       1218
BrkFace     630
Stone       168
BrkCmn       13
Name: Mas Vnr Type, dtype: int64
df_column_ shape (22, 2)


Unnamed: 0,Mas Vnr Type,Mas Vnr Area
22,,
41,,
86,,
212,,
276,,


From the NaN in the Mas Vnr Area (masonry veneer area in square feet) column I will assume that these houses did not have any masonry veneer at all. So these values will be filled with the corresponding data dictionary values: 'None' for Mas Vnr Type and 0 square feet for Mas Vnr Area.

In [26]:
# performing replacement for  ['Mas Vnr Type']
columns2replace = ['Mas Vnr Type']
for column in columns2replace:
    df_train[column] = df_train[column].fillna('None')
    print('is null sum: ',df_train[column].isnull().sum() )
    print('') # just for break b/w loops

# performing replacement for  ['Mas Vnr Area']
columns2replace = ['Mas Vnr Area']
for column in columns2replace:
    df_train[column] = df_train[column].fillna(0)
    print('is null sum: ',df_train[column].isnull().sum() )
    print('') # just for break b/w loops

# quick glance for confirmation (note: this call passes 'Garage Type', so the value counts printed below are for garage type, not masonry)
null_reminders(dataframe=df_train, column_name='Garage Type',features_to_drop=features_to_drop,value_cnt='Yes').head()

is null sum:  0

is null sum:  0

is null sum:  0
Attchd       1213
Detchd        536
BuiltIn       132
no_garage     113
Basment        27
2Types         19
CarPort        11
Name: Garage Type, dtype: int64
df_column_ shape (0, 2)


Unnamed: 0,Mas Vnr Type,Mas Vnr Area


In [27]:
# again manually selecting what columns I do not want to look at in my reminder function
features_to_drop = [ 'Id', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area', 'Street', 'Heating'	,'Heating QC'	,'Central Air'	,'Electrical',	'1st Flr SF',	'2nd Flr SF',	'Low Qual Fin SF'	,'Gr Liv Area','Lot Shape', 'Land Contour', 'Utilities', 'Lot Config', 'Land Slope','Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type','House Style', 'Overall Qual', 'Overall Cond', 'Year Built','Year Remod/Add', 'Roof Style', 'Roof Matl', 'Exterior 1st','Exterior 2nd', 'Mas Vnr Type', 'Mas Vnr Area', 'Exter Qual','Exter Cond', 'Foundation','Bedroom AbvGr', 'Kitchen AbvGr', 'Kitchen Qual','TotRms AbvGrd', 'Functional', 'Fireplaces', 'Garage Type','Garage Yr Blt', 'Garage Finish', 'Garage Cars', 'Garage Area','Garage Qual', 'Garage Cond', 'Paved Drive', 'Wood Deck SF','Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch','Pool Area', 'Misc Val', 'Mo Sold', 'Yr Sold', 'Sale Type','SalePrice']


df_bsmt_baths = null_reminders(dataframe=df_train, column_name='Bsmt Full Bath',features_to_drop=features_to_drop,value_cnt='Yes')
df_bsmt_baths.head()

is null sum:  2
0.0    1200
1.0     824
2.0      23
3.0       2
Name: Bsmt Full Bath, dtype: int64
df_column_ shape (2, 13)


Unnamed: 0,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath
616,no_basement,no_basement,no_basement,no_basement,0.0,no_basement,0.0,0.0,0.0,,,3,1
1327,no_basement,no_basement,no_basement,no_basement,,no_basement,,,,,,1,0


Clearly the NaN values within this dataframe should all be replaced with 0, since there are no basements in these houses. Hopefully this will take care of some of the random null values that would otherwise have had to be cleaned later on. These columns are all numeric (square footage or bath counts), so replacing with 0 is an accurate statement across the board if there are no basements.

In [28]:
# performing replacement for  ['BsmtFin SF 1','BsmtFin SF 2' ,'Bsmt Unf SF' ,'Total Bsmt SF', 'Bsmt Full Bath','Bsmt Half Bath']
columns2replace = ['BsmtFin SF 1','BsmtFin SF 2' ,'Bsmt Unf SF' ,'Total Bsmt SF', 'Bsmt Full Bath','Bsmt Half Bath']
for column in columns2replace:
    df_train[column] = df_train[column].fillna(0)
    print('is null sum: ',df_train[column].isnull().sum() )
    print('') # just for break b/w loops

# quick glance at bsmt specific features for confirmation
null_reminders(dataframe=df_train, column_name='BsmtFin SF 1',features_to_drop=features_to_drop,value_cnt='No').head()

is null sum:  0

is null sum:  0

is null sum:  0

is null sum:  0

is null sum:  0

is null sum:  0

is null sum:  0
df_column_ shape (0, 13)


Unnamed: 0,Bsmt Qual,Bsmt Cond,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,BsmtFin Type 2,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Bsmt Full Bath,Bsmt Half Bath,Full Bath,Half Bath


In [29]:
# checking one last time for any more nulls in df_train that aren't random
charac_df = character_df(df_train)
charac_df[ charac_df['null_count']>=1 ]

Unnamed: 0,column_name,null_count,percent_missing,categorical_unique
59,Garage Cars,1,0.05,not cat.
60,Garage Area,1,0.05,not cat.


Apart from the two single nulls above, df_train is now complete based on logical and reasonable assumptions. These may or may not affect the model later on, but they are the assumptions that will be made when moving forward through the analysis.

--- 
##### Random Nulls leftover 
This subsection works through the random nulls left in the dataframe

In [30]:
# getting a glimpse of what is left to work with
charac_df = character_df(df_train)
charac_df[ charac_df['null_count']>0 ]

Unnamed: 0,column_name,null_count,percent_missing,categorical_unique
59,Garage Cars,1,0.05,not cat.
60,Garage Area,1,0.05,not cat.


In [31]:
# manually selecting columns to drop in reminder function
features_to_drop = ['Id', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area', 'Street','Lot Shape', 'Land Contour', 'Utilities', 'Lot Config', 'Land Slope','Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type','House Style', 'Overall Qual', 'Overall Cond', 'Year Built','Year Remod/Add', 'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type', 'Mas Vnr Area', 'Exter Qual','Exter Cond', 'Foundation', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure','BsmtFin Type 1', 'BsmtFin SF 1', 'BsmtFin Type 2', 'BsmtFin SF 2','Bsmt Unf SF', 'Total Bsmt SF', 'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF',  'Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'Kitchen Qual','Paved Drive', 'Wood Deck SF','Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch', 'Pool Area', 'Misc Val', 'Mo Sold', 'Yr Sold', 'Sale Type','SalePrice']

In [32]:
null_reminders(dataframe=df_train, column_name='Garage Cars',features_to_drop=features_to_drop,value_cnt='Yes').head()

is null sum:  1
2.0    1136
1.0     524
3.0     263
0.0     113
4.0      13
5.0       1
Name: Garage Cars, dtype: int64
df_column_ shape (1, 10)


Unnamed: 0,TotRms AbvGrd,Functional,Fireplaces,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond
1712,6,Typ,0,Detchd,no_garage,no_garage,,,no_garage,no_garage


Clearly these were missed in the cleaning above. Since there is no garage, I will replace both with 0, as both are numeric values. They are in the same row, which makes the replacement easier.

In [33]:
# performing replacement for ['Garage Area','Garage Cars']
columns2replace = ['Garage Area','Garage Cars']
for column in columns2replace:
    df_train[column] = df_train[column].fillna(0)
    print('is null sum: ',df_train[column].isnull().sum() )
    print('') # just for break b/w loops
null_reminders(dataframe=df_train, column_name='Garage Cars',features_to_drop=features_to_drop,value_cnt='No').head()

is null sum:  0

is null sum:  0

is null sum:  0
df_column_ shape (0, 10)


Unnamed: 0,TotRms AbvGrd,Functional,Fireplaces,Garage Type,Garage Yr Blt,Garage Finish,Garage Cars,Garage Area,Garage Qual,Garage Cond


Finally, I will confirm that all the null values have been taken care of, so that the clean csv can be exported before beginning EDA and preparation for the modeling process.

In [34]:
# confirming there are no more null values
charac_df = character_df(df_train)
charac_df[ charac_df['null_count']>0 ]

Unnamed: 0,column_name,null_count,percent_missing,categorical_unique


---
# Checking for Duplicates
This section shows a few ways to check for duplicates within df_train. Found the actual function through this Medium [article](https://towardsdatascience.com/finding-and-removing-duplicate-rows-in-pandas-dataframe-c6117668631f).
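
A full-row check only flags rows that match in every column, including the identifier. If 'Id' is unique per sale (an assumption on my part), it can be worth also checking a subset of columns. A small sketch on toy data:

```python
import pandas as pd

# toy data, not the real train.csv
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Lot Area": [8000, 9500, 8000, 9500],
    "SalePrice": [130000, 220000, 130000, 221000],
})

# full-row comparison: nothing is flagged because Id differs on every row
full_row_dupes = int(toy.duplicated().sum())

# compare only selected columns; keep=False marks every member of a duplicate group
dupes_ignoring_id = toy.duplicated(subset=["Lot Area", "SalePrice"], keep=False)
print(full_row_dupes, dupes_ignoring_id.tolist())
```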

In [35]:
df_train.duplicated().sum()

0

In [36]:
df_train.loc[df_train.duplicated(), :]


Unnamed: 0,Id,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,...,Open Porch SF,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice


Oddly enough there were no duplicates, so I can move on to the next step and check for outliers in the following notebook, 'EDA-Outlier-Exploration.csv'. For the purposes of model exploration there may be several different approaches to handling the outliers, and each scenario will be exported as a csv to input into the model notebook in the future.

---
### Last Minute Formatting 

In [45]:
# replacing spaces with '_' and converting columns to lowercase
df_train.columns = df_train.columns.str.strip().str.lower()
df_train.columns = df_train.columns.str.replace(' ','_')
print(df_train.columns[:2])

# will do the same for the test.csv for the future
df_test.columns = df_test.columns.str.strip().str.lower()
df_test.columns = df_test.columns.str.replace(' ','_')
print(df_test.columns[:2])


Index(['id', 'ms_subclass'], dtype='object')
Index(['id', 'pid'], dtype='object')
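
Since the same renaming is applied to both dataframes, it could be wrapped in one small function so train and test cannot drift apart. A sketch, where `normalize_columns` is a name I made up for illustration:

```python
import pandas as pd

def normalize_columns(dataframe):
    """Return a copy with stripped, lowercased, underscore-separated column names."""
    out = dataframe.copy()
    out.columns = (out.columns.str.strip()
                              .str.lower()
                              .str.replace(" ", "_", regex=False))
    return out

# toy frames with only column names, not the real data
toy_train = normalize_columns(pd.DataFrame(columns=["Id", "MS SubClass", " Lot Area "]))
toy_test = normalize_columns(pd.DataFrame(columns=["Id", "PID"]))
print(toy_train.columns.tolist(), toy_test.columns.tolist())
```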


---
# Exporting clean dataframe (scrubbed_df_train.csv)
Exporting the clean dataframe (df_train) to the datasets folder. The df_test dataframe, with its renamed columns, is written back to test.csv.

In [48]:
df_train.to_csv('./datasets/scrubbed_df_train.csv',index=False)
df_test.to_csv('./datasets/test.csv',index=False)
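
One caveat when reading the scrubbed file back in: depending on the pandas version, the default list of NA markers in `read_csv` may include strings such as 'None', which would turn the 'None' masonry category back into NaN. A hedged sketch of a round trip that only treats empty fields as missing (toy data, in-memory buffer instead of the real file):

```python
import io
import pandas as pd

# toy data using the renamed column style, not the real scrubbed file
toy = pd.DataFrame({
    "mas_vnr_type": ["None", "BrkFace"],
    "mas_vnr_area": [0.0, 120.0],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# disable the default NA markers and treat only empty fields as missing
reloaded = pd.read_csv(buffer, keep_default_na=False, na_values=[""])
print(reloaded["mas_vnr_type"].tolist(), int(reloaded.isnull().sum().sum()))
```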