# Ames Housing Price Regression

## Import Relevant Library

In [1]:
import pandas as pd

## Preliminary Data Exploration
### Load our Train Data

In [2]:
train = pd.read_csv("../assets/train.csv")
train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


## Standardize column names and set `id` column as index column

In [3]:
train.columns = train.columns.str.lower()

In [4]:
train.set_index('id', inplace = True)

In [5]:
train.head()

Unnamed: 0_level_0,mssubclass,mszoning,lotfrontage,lotarea,street,alley,lotshape,landcontour,utilities,lotconfig,...,poolarea,poolqc,fence,miscfeature,miscval,mosold,yrsold,saletype,salecondition,saleprice
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000


### Function for checking null values:

In [6]:
def null_fields_checker(df):
    null_columns = df.isnull().sum()

    for key, value in null_columns.iteritems():
        if value >0:
            print(key, ":", value)

In [7]:
null_fields_checker(train)

lotfrontage : 259
alley : 1369
masvnrtype : 8
masvnrarea : 8
bsmtqual : 37
bsmtcond : 37
bsmtexposure : 38
bsmtfintype1 : 37
bsmtfintype2 : 38
electrical : 1
fireplacequ : 690
garagetype : 81
garageyrblt : 81
garagefinish : 81
garagequal : 81
garagecond : 81
poolqc : 1453
fence : 1179
miscfeature : 1406


In [8]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   mssubclass     1460 non-null   int64  
 1   mszoning       1460 non-null   object 
 2   lotfrontage    1201 non-null   float64
 3   lotarea        1460 non-null   int64  
 4   street         1460 non-null   object 
 5   alley          91 non-null     object 
 6   lotshape       1460 non-null   object 
 7   landcontour    1460 non-null   object 
 8   utilities      1460 non-null   object 
 9   lotconfig      1460 non-null   object 
 10  landslope      1460 non-null   object 
 11  neighborhood   1460 non-null   object 
 12  condition1     1460 non-null   object 
 13  condition2     1460 non-null   object 
 14  bldgtype       1460 non-null   object 
 15  housestyle     1460 non-null   object 
 16  overallqual    1460 non-null   int64  
 17  overallcond    1460 non-null   int64  
 18  yearbuil

Appears that `alley`, `fireplacequ`, `poolqc`, `fence`, and `miscfeature` have a huge amount of missing data. I will be removing these columns.

In [9]:
train.drop(columns = ["alley","fireplacequ", "poolqc", "fence", "miscfeature"], inplace = True)
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 75 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   mssubclass     1460 non-null   int64  
 1   mszoning       1460 non-null   object 
 2   lotfrontage    1201 non-null   float64
 3   lotarea        1460 non-null   int64  
 4   street         1460 non-null   object 
 5   lotshape       1460 non-null   object 
 6   landcontour    1460 non-null   object 
 7   utilities      1460 non-null   object 
 8   lotconfig      1460 non-null   object 
 9   landslope      1460 non-null   object 
 10  neighborhood   1460 non-null   object 
 11  condition1     1460 non-null   object 
 12  condition2     1460 non-null   object 
 13  bldgtype       1460 non-null   object 
 14  housestyle     1460 non-null   object 
 15  overallqual    1460 non-null   int64  
 16  overallcond    1460 non-null   int64  
 17  yearbuilt      1460 non-null   int64  
 18  yearremo

## Numeric Encoding of Categorical Rating Features
There are 15 features, namely:  
`lotshape`  
`landslope`  
`exterqual`  
`extercond`  
`bsmtqual`  
`bsmtcond`  
`bsmtexposure`  
`bsmtfintype1`  
`bsmtfintype2` 
`centralair`  
`heatingqc`  
`kitchenqual`  
`functional`  
`garagefinish`  
`garagequal`  
`garagecond`  
that are features of categorical ratings. I will be numerically encoding them.

### `lotshape`
 3 - Reg Regular	 
 2 - IR1 Slightly irregular  
 1 - IR2 Moderately Irregular  
 0 - IR3 Irregular  

In [10]:
train['lotshape'].replace('Reg',3, inplace = True)
train['lotshape'].replace('IR1',2, inplace = True)
train['lotshape'].replace('IR2',1, inplace = True)
train['lotshape'].replace('IR3',0, inplace = True)
train['lotshape'].head()

id
1    3
2    3
3    2
4    2
5    2
Name: lotshape, dtype: int64

### `landslope`
2 - Gtl	Gentle slope  
1 - Mod	Moderate Slope	
0 - Sev	Severe Slope  

In [11]:
train['landslope'].replace('Gtl',2, inplace = True)
train['landslope'].replace('Mod',1, inplace = True)
train['landslope'].replace('Sev',0, inplace = True)
train['landslope'].head()

id
1    2
2    2
3    2
4    2
5    2
Name: landslope, dtype: int64

### `exterqual` 
		
4 - Ex Excellent  
3 - Gd Good  
2 - TA Average/Typical  
1 - Fa Fair  
0 - Po Poor  

In [12]:
train['exterqual'].replace('Ex',4, inplace = True)
train['exterqual'].replace('Gd',3, inplace = True)
train['exterqual'].replace('TA',2, inplace = True)
train['exterqual'].replace('Fa',1, inplace = True)
train['exterqual'].replace('Po',0, inplace = True)
train['exterqual'].head()

id
1    3
2    2
3    3
4    2
5    3
Name: exterqual, dtype: int64

### `extercond` 
		
4 - Ex Excellent  
3 - Gd Good  
2 - TA Average/Typical  
1 - Fa Fair  
0 - Po Poor  

In [13]:
train['extercond'].replace('Ex',4, inplace = True)
train['extercond'].replace('Gd',3, inplace = True)
train['extercond'].replace('TA',2, inplace = True)
train['extercond'].replace('Fa',1, inplace = True)
train['extercond'].replace('Po',0, inplace = True)
train['extercond'].head()

id
1    2
2    2
3    2
4    2
5    2
Name: extercond, dtype: int64

### `bsmtqual`
5 - Ex	Excellent (100+ inches)	
4 - Gd	Good (90-99 inches)
3 - TA	Typical (80-89 inches)
2 - Fa	Fair (70-79 inches)
1 - Po	Poor (<70 inches
0 - NA	No Basement

In [14]:
train['bsmtqual'].replace('Ex',5, inplace = True)
train['bsmtqual'].replace('Gd',4, inplace = True)
train['bsmtqual'].replace('TA',3, inplace = True)
train['bsmtqual'].replace('Fa',2, inplace = True)
train['bsmtqual'].replace('Po',1, inplace = True)
train['bsmtqual'].replace('NA',0, inplace = True)
train['bsmtqual'].head()

id
1    4.0
2    4.0
3    4.0
4    3.0
5    4.0
Name: bsmtqual, dtype: float64

### `bsmtcond`
5 - Ex	Excellent
4 - Gd	Good
3 - TA	Typical - slight dampness allowed
2 - Fa	Fair - dampness or some cracking or settling
1 - Po	Poor - Severe cracking, settling, or wetness
0 - NA	No Basement

In [15]:
train['bsmtcond'].replace('Ex',5, inplace = True)
train['bsmtcond'].replace('Gd',4, inplace = True)
train['bsmtcond'].replace('TA',3, inplace = True)
train['bsmtcond'].replace('Fa',2, inplace = True)
train['bsmtcond'].replace('Po',1, inplace = True)
train['bsmtcond'].replace('NA',0, inplace = True)
train['bsmtcond'].head()

id
1    3.0
2    3.0
3    3.0
4    4.0
5    3.0
Name: bsmtcond, dtype: float64

### `bsmtexposure`
4 - Gd	Good Exposure
3 - Av	Average Exposure (split levels or foyers typically score average or above)	
2 - Mn	Mimimum Exposure
1 - No	No Exposure
0 - NA	No Basement

In [16]:
train['bsmtexposure'].replace('Gd',4, inplace = True)
train['bsmtexposure'].replace('Av',3, inplace = True)
train['bsmtexposure'].replace('Mn',2, inplace = True)
train['bsmtexposure'].replace('No',1, inplace = True)
train['bsmtexposure'].replace('NA',0, inplace = True)
train['bsmtexposure'].head()

id
1    1.0
2    4.0
3    2.0
4    1.0
5    3.0
Name: bsmtexposure, dtype: float64

### `bsmtfintype1`
6 - GLQ	Good Living Quarters
5 - ALQ	Average Living Quarters
4 - BLQ	Below Average Living Quarters	
3 - Rec	Average Rec Room
2 - LwQ	Low Quality
1 - Unf	Unfinshed
0 - NA	No Basement

In [17]:
train['bsmtfintype1'].replace('GLQ',6, inplace = True)
train['bsmtfintype1'].replace('ALQ',5, inplace = True)
train['bsmtfintype1'].replace('BLQ',4, inplace = True)
train['bsmtfintype1'].replace('Rec',3, inplace = True)
train['bsmtfintype1'].replace('LwQ',2, inplace = True)
train['bsmtfintype1'].replace('Unf',1, inplace = True)
train['bsmtfintype1'].replace('NA',0, inplace = True)
train['bsmtfintype1'].head()

id
1    6.0
2    5.0
3    6.0
4    5.0
5    6.0
Name: bsmtfintype1, dtype: float64

### `bsmtfintype2`
6 - GLQ	Good Living Quarters
5 - ALQ	Average Living Quarters
4 - BLQ	Below Average Living Quarters	
3 - Rec	Average Rec Room
2 - LwQ	Low Quality
1 - Unf	Unfinshed
0 - NA	No Basement

In [18]:
train['bsmtfintype2'].replace('GLQ',6, inplace = True)
train['bsmtfintype2'].replace('ALQ',5, inplace = True)
train['bsmtfintype2'].replace('BLQ',4, inplace = True)
train['bsmtfintype2'].replace('Rec',3, inplace = True)
train['bsmtfintype2'].replace('LwQ',2, inplace = True)
train['bsmtfintype2'].replace('Unf',1, inplace = True)
train['bsmtfintype2'].replace('NA',0, inplace = True)
train['bsmtfintype2'].head()

id
1    1.0
2    1.0
3    1.0
4    1.0
5    1.0
Name: bsmtfintype2, dtype: float64

### `heatingqc`
4 - Ex Excellent  
3 - Gd Good  
2 - TA Average/Typical  
1 - Fa Fair  
0 - Po Poor  

In [19]:
train['heatingqc'].replace('Ex',4, inplace = True)
train['heatingqc'].replace('Gd',3, inplace = True)
train['heatingqc'].replace('TA',2, inplace = True)
train['heatingqc'].replace('Fa',1, inplace = True)
train['heatingqc'].replace('Po',0, inplace = True)
train['heatingqc'].head()

id
1    4
2    4
3    4
4    3
5    4
Name: heatingqc, dtype: int64

### `centralair`
1 - Y	Yes  
0 - N	No  

In [20]:
train['centralair'].replace('Y',1, inplace = True)
train['centralair'].replace('N',0, inplace = True)
train['centralair'].head()

id
1    1
2    1
3    1
4    1
5    1
Name: centralair, dtype: int64

### `kitchenqual`
4 - Ex Excellent  
3 - Gd Good  
2 - TA Average/Typical  
1 - Fa Fair  
0 - Po Poor  

In [21]:
train['kitchenqual'].replace('Ex',4, inplace = True)
train['kitchenqual'].replace('Gd',3, inplace = True)
train['kitchenqual'].replace('TA',2, inplace = True)
train['kitchenqual'].replace('Fa',1, inplace = True)
train['kitchenqual'].replace('Po',0, inplace = True)
train['kitchenqual'].head()

id
1    3
2    2
3    3
4    3
5    3
Name: kitchenqual, dtype: int64

### `functional`
7 - Typ	Typical Functionality  
6 - Min1	Minor Deductions 1  
5 - Min2	Minor Deductions 2  
4 - Mod	Moderate Deductions  
3 - Maj1	Major Deductions 1  
2 - Maj2	Major Deductions 2  
1 - Sev	Severely Damaged  
0 - Sal	Salvage only  

In [22]:
train['functional'].replace('Typ',7, inplace = True)
train['functional'].replace('Min1',6, inplace = True)
train['functional'].replace('Min2',5, inplace = True)
train['functional'].replace('Mod',4, inplace = True)
train['functional'].replace('Maj1',3, inplace = True)
train['functional'].replace('Maj2',2, inplace = True)
train['functional'].replace('Sev',1, inplace = True)
train['functional'].replace('Sal',0, inplace = True)
train['functional'].head()

id
1    7
2    7
3    7
4    7
5    7
Name: functional, dtype: int64

### `garagefinish`
3 - Fin	Finished  
2 - RFn	Rough Finished  	
1 - Unf	Unfinished  
0 - NA	No Garage  

In [23]:
train['garagefinish'].replace('Fin',3, inplace = True)
train['garagefinish'].replace('RFn',2, inplace = True)
train['garagefinish'].replace('Unf',1, inplace = True)
train['garagefinish'].replace('NA',0, inplace = True)
train['garagefinish'].head()

id
1    2.0
2    2.0
3    2.0
4    1.0
5    2.0
Name: garagefinish, dtype: float64

### `garagequal`
5 - Ex Excellent  
4 - Gd Good  
3 - TA Average/Typical  
2 - Fa Fair  
1 - Po Poor  
0 - NA No Garage

In [24]:
train['garagequal'].replace('Ex',5, inplace = True)
train['garagequal'].replace('Gd',4, inplace = True)
train['garagequal'].replace('TA',3, inplace = True)
train['garagequal'].replace('Fa',2, inplace = True)
train['garagequal'].replace('Po',1, inplace = True)
train['garagequal'].replace('NA',0, inplace = True)
train['garagequal'].head()

id
1    3.0
2    3.0
3    3.0
4    3.0
5    3.0
Name: garagequal, dtype: float64

### `garagecond`
5 - Ex Excellent  
4 - Gd Good  
3 - TA Average/Typical  
2 - Fa Fair  
1 - Po Poor  
0 - NA No Garage  

In [25]:
train['garagecond'].replace('Ex',5, inplace = True)
train['garagecond'].replace('Gd',4, inplace = True)
train['garagecond'].replace('TA',3, inplace = True)
train['garagecond'].replace('Fa',2, inplace = True)
train['garagecond'].replace('Po',1, inplace = True)
train['garagecond'].replace('NA',0, inplace = True)
train['garagecond'].head()

id
1    3.0
2    3.0
3    3.0
4    3.0
5    3.0
Name: garagecond, dtype: float64

Now that all the listed columns have been numerically encoded, lets check for anymore string data in our dataset

In [26]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 75 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   mssubclass     1460 non-null   int64  
 1   mszoning       1460 non-null   object 
 2   lotfrontage    1201 non-null   float64
 3   lotarea        1460 non-null   int64  
 4   street         1460 non-null   object 
 5   lotshape       1460 non-null   int64  
 6   landcontour    1460 non-null   object 
 7   utilities      1460 non-null   object 
 8   lotconfig      1460 non-null   object 
 9   landslope      1460 non-null   int64  
 10  neighborhood   1460 non-null   object 
 11  condition1     1460 non-null   object 
 12  condition2     1460 non-null   object 
 13  bldgtype       1460 non-null   object 
 14  housestyle     1460 non-null   object 
 15  overallqual    1460 non-null   int64  
 16  overallcond    1460 non-null   int64  
 17  yearbuilt      1460 non-null   int64  
 18  yearremo

## Setting Dummy Variables for All Other non-numeric columns
Appears we still have a number of non-numeric columns we have to dummify. Namely these 23 columns:  
`mszoning`  
`street`  
`landcontour`  
`utilities`  
`lotconfig`  
`neighbourhood`  
`condition1`  
`condition2`  
`bldgtype`  
`housestyle`  
`roofstyle`  
`roofmatl`  
`exterior1st`  
`exterior2nd`  
`masvnrtype`  
`foundation`  
`heating`  
`centralair`  
`electrical`  
`garagetype`  
`paveddrive`  
`saletype`  
`salecondition`  

### Create a function to set Dummy Variables for our Columns

In [27]:
def dummify_column(df, column, set_list):
    dummy_df = pd.get_dummies(df[column])
    
    missing_columns = [col for col in set_list if col not in dummy_df.columns]
    print("These categories are missing from the data: " + str(missing_columns))
    
    for x in set_list:
        if x in dummy_df.columns:
            df[(x +"_" + column).lower()] = dummy_df[x]
        elif x in missing_columns:
            df[(x +"_" + column).lower()] = 0
    
    print(str(column) + " has been concatenated with input DataFrame. All missing categories are set to 0")
    df.drop(columns = str(column), inplace = True)
    return df

### `mszoning`
A	Agriculture  
C(all)	Commercial  
FV	Floating Village Residential  
I	Industrial  
RH	Residential High Density  
RL	Residential Low Density  
RP	Residential Low Density Park   
RM	Residential Medium Density  

In [28]:
mszoning_setlist = ['A',"C (all)",'FV', 'I', 'RH', 'RL', 'RP', 'RM']
train_test = dummify_column(train,"mszoning", mszoning_setlist)
train_test.head()

These categories are missing from the data: ['A', 'I', 'RP']
mszoning has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,street,lotshape,landcontour,utilities,lotconfig,landslope,neighborhood,...,salecondition,saleprice,a_mszoning,c (all)_mszoning,fv_mszoning,i_mszoning,rh_mszoning,rl_mszoning,rp_mszoning,rm_mszoning
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,Pave,3,Lvl,AllPub,Inside,2,CollgCr,...,Normal,208500,0,0,0,0,0,1,0,0
2,20,80.0,9600,Pave,3,Lvl,AllPub,FR2,2,Veenker,...,Normal,181500,0,0,0,0,0,1,0,0
3,60,68.0,11250,Pave,2,Lvl,AllPub,Inside,2,CollgCr,...,Normal,223500,0,0,0,0,0,1,0,0
4,70,60.0,9550,Pave,2,Lvl,AllPub,Corner,2,Crawfor,...,Abnorml,140000,0,0,0,0,0,1,0,0
5,60,84.0,14260,Pave,2,Lvl,AllPub,FR2,2,NoRidge,...,Normal,250000,0,0,0,0,0,1,0,0


### `street`
Grvl	Gravel	
Pave	Paved

In [29]:
street_setlist = ['Grvl', 'Pave']
train_test = dummify_column(train,"street", street_setlist)
train_test.head()

These categories are missing from the data: []
street has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landcontour,utilities,lotconfig,landslope,neighborhood,condition1,...,a_mszoning,c (all)_mszoning,fv_mszoning,i_mszoning,rh_mszoning,rl_mszoning,rp_mszoning,rm_mszoning,grvl_street,pave_street
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,Lvl,AllPub,Inside,2,CollgCr,Norm,...,0,0,0,0,0,1,0,0,0,1
2,20,80.0,9600,3,Lvl,AllPub,FR2,2,Veenker,Feedr,...,0,0,0,0,0,1,0,0,0,1
3,60,68.0,11250,2,Lvl,AllPub,Inside,2,CollgCr,Norm,...,0,0,0,0,0,1,0,0,0,1
4,70,60.0,9550,2,Lvl,AllPub,Corner,2,Crawfor,Norm,...,0,0,0,0,0,1,0,0,0,1
5,60,84.0,14260,2,Lvl,AllPub,FR2,2,NoRidge,Norm,...,0,0,0,0,0,1,0,0,0,1


### `landcontour`  
Lvl	Near Flat/Level	 
Bnk	Banked - Quick and significant rise from street grade to building  
HLS	Hillside - Significant slope from side to side  
Low	Depression  

In [30]:
landcontour_setlist = ['Lvl', 'Bnk', 'HLS', 'Low']
train_test = dummify_column(train,"landcontour", landcontour_setlist)
train_test.head()

These categories are missing from the data: []
landcontour has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,utilities,lotconfig,landslope,neighborhood,condition1,condition2,...,rh_mszoning,rl_mszoning,rp_mszoning,rm_mszoning,grvl_street,pave_street,lvl_landcontour,bnk_landcontour,hls_landcontour,low_landcontour
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,AllPub,Inside,2,CollgCr,Norm,Norm,...,0,1,0,0,0,1,1,0,0,0
2,20,80.0,9600,3,AllPub,FR2,2,Veenker,Feedr,Norm,...,0,1,0,0,0,1,1,0,0,0
3,60,68.0,11250,2,AllPub,Inside,2,CollgCr,Norm,Norm,...,0,1,0,0,0,1,1,0,0,0
4,70,60.0,9550,2,AllPub,Corner,2,Crawfor,Norm,Norm,...,0,1,0,0,0,1,1,0,0,0
5,60,84.0,14260,2,AllPub,FR2,2,NoRidge,Norm,Norm,...,0,1,0,0,0,1,1,0,0,0


### `utilities`
AllPub	All public Utilities (E,G,W,& S)	  
NoSewr	Electricity, Gas, and Water (Septic Tank)  
NoSeWa	Electricity and Gas Only  
ELO	Electricity only	  

In [31]:
utilities_setlist = ['AllPub', 'NoSewr', 'NoSeWa', 'ELO']
train_test = dummify_column(train,"utilities", utilities_setlist)
train_test.head()

These categories are missing from the data: ['NoSewr', 'ELO']
utilities has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,lotconfig,landslope,neighborhood,condition1,condition2,bldgtype,...,grvl_street,pave_street,lvl_landcontour,bnk_landcontour,hls_landcontour,low_landcontour,allpub_utilities,nosewr_utilities,nosewa_utilities,elo_utilities
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,Inside,2,CollgCr,Norm,Norm,1Fam,...,0,1,1,0,0,0,1,0,0,0
2,20,80.0,9600,3,FR2,2,Veenker,Feedr,Norm,1Fam,...,0,1,1,0,0,0,1,0,0,0
3,60,68.0,11250,2,Inside,2,CollgCr,Norm,Norm,1Fam,...,0,1,1,0,0,0,1,0,0,0
4,70,60.0,9550,2,Corner,2,Crawfor,Norm,Norm,1Fam,...,0,1,1,0,0,0,1,0,0,0
5,60,84.0,14260,2,FR2,2,NoRidge,Norm,Norm,1Fam,...,0,1,1,0,0,0,1,0,0,0


### `lotconfig`
Inside	Inside lot
Corner	Corner lot
CulDSac	Cul-de-sac
FR2	Frontage on 2 sides of property
FR3	Frontage on 3 sides of property

In [32]:
lotconfig_setlist = ['Inside', 'Corner', 'CulDSac', 'FR2', 'FR3']
train_test = dummify_column(train,"lotconfig", lotconfig_setlist)
train_test.head()

These categories are missing from the data: []
lotconfig has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,neighborhood,condition1,condition2,bldgtype,housestyle,...,low_landcontour,allpub_utilities,nosewr_utilities,nosewa_utilities,elo_utilities,inside_lotconfig,corner_lotconfig,culdsac_lotconfig,fr2_lotconfig,fr3_lotconfig
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,CollgCr,Norm,Norm,1Fam,2Story,...,0,1,0,0,0,1,0,0,0,0
2,20,80.0,9600,3,2,Veenker,Feedr,Norm,1Fam,1Story,...,0,1,0,0,0,0,0,0,1,0
3,60,68.0,11250,2,2,CollgCr,Norm,Norm,1Fam,2Story,...,0,1,0,0,0,1,0,0,0,0
4,70,60.0,9550,2,2,Crawfor,Norm,Norm,1Fam,2Story,...,0,1,0,0,0,0,1,0,0,0
5,60,84.0,14260,2,2,NoRidge,Norm,Norm,1Fam,2Story,...,0,1,0,0,0,0,0,0,1,0


### `neighborhood`
Blmngtn	Bloomington Heights  
Blueste	Bluestem  
BrDale	Briardale  
BrkSide	Brookside  
ClearCr	Clear Creek  
CollgCr	College Creek  
Crawfor	Crawford  
Edwards	Edwards  
Gilbert	Gilbert  
IDOTRR	Iowa DOT and Rail Road  
MeadowV	Meadow Village  
Mitchel	Mitchell  
Names	North Ames  
NoRidge	Northridge  
NPkVill	Northpark Villa  
NridgHt	Northridge Heights  
NWAmes	Northwest Ames  
OldTown	Old Town  
SWISU	South & West of Iowa State University  
Sawyer	Sawyer  
SawyerW	Sawyer West  
Somerst	Somerset  
StoneBr	Stone Brook  
Timber	Timberland  
Veenker	Veenker  

In [33]:
# This is just to process the text for the setlist as it is quite painful to do it manually
clean_text = "'Blmngtn Blueste BrDale BrkSide ClearCr CollgCr Crawfor Edwards Gilbert IDOTRR MeadowV Mitchel Names NoRidge NPkVill NridgHt NWAmes OldTown SWISU Sawyer SawyerW Somerst StoneBr Timber Veenker'"
# clean_text.replace("  ", "\'"+ ", " +  "\'")
clean_text.replace(" ", "\', \'")

"'Blmngtn', 'Blueste', 'BrDale', 'BrkSide', 'ClearCr', 'CollgCr', 'Crawfor', 'Edwards', 'Gilbert', 'IDOTRR', 'MeadowV', 'Mitchel', 'Names', 'NoRidge', 'NPkVill', 'NridgHt', 'NWAmes', 'OldTown', 'SWISU', 'Sawyer', 'SawyerW', 'Somerst', 'StoneBr', 'Timber', 'Veenker'"

In [34]:
neighborhood_setlist = ['Blmngtn', 'Blueste', 'BrDale', 'BrkSide', 'ClearCr', 'CollgCr', 'Crawfor', 'Edwards', 'Gilbert', 'IDOTRR', 'MeadowV', 'Mitchel', 'Names', 'NoRidge', 'NPkVill', 'NridgHt', 'NWAmes', 'OldTown', 'SWISU', 'Sawyer', 'SawyerW', 'Somerst', 'StoneBr', 'Timber', 'Veenker']
train_test = dummify_column(train,"neighborhood", neighborhood_setlist)
train_test.head()

These categories are missing from the data: ['Names']
neighborhood has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,condition1,condition2,bldgtype,housestyle,overallqual,...,nridght_neighborhood,nwames_neighborhood,oldtown_neighborhood,swisu_neighborhood,sawyer_neighborhood,sawyerw_neighborhood,somerst_neighborhood,stonebr_neighborhood,timber_neighborhood,veenker_neighborhood
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,Norm,Norm,1Fam,2Story,7,...,0,0,0,0,0,0,0,0,0,0
2,20,80.0,9600,3,2,Feedr,Norm,1Fam,1Story,6,...,0,0,0,0,0,0,0,0,0,1
3,60,68.0,11250,2,2,Norm,Norm,1Fam,2Story,7,...,0,0,0,0,0,0,0,0,0,0
4,70,60.0,9550,2,2,Norm,Norm,1Fam,2Story,7,...,0,0,0,0,0,0,0,0,0,0
5,60,84.0,14260,2,2,Norm,Norm,1Fam,2Story,8,...,0,0,0,0,0,0,0,0,0,0


### `condition1` 
Artery	Adjacent to arterial street  
Feedr	Adjacent to feeder street	  
Norm	Normal	  
RRNn	Within 200' of North-South Railroad  
RRAn	Adjacent to North-South Railroad  
PosN	Near positive off-site feature--park, greenbelt, etc.  
PosA	Adjacent to postive off-site feature  
RRNe	Within 200' of East-West Railroad  
RRAe	Adjacent to East-West Railroad  

In [35]:
condition1_setlist = ['Artery', 'Feedr', 'Norm', 'RRNn', 'RRAn','PosN','PosA','RRNe','RRAe']
train_test = dummify_column(train,"condition1", condition1_setlist)
train_test.head()

These categories are missing from the data: []
condition1 has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,condition2,bldgtype,housestyle,overallqual,overallcond,...,veenker_neighborhood,artery_condition1,feedr_condition1,norm_condition1,rrnn_condition1,rran_condition1,posn_condition1,posa_condition1,rrne_condition1,rrae_condition1
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,Norm,1Fam,2Story,7,5,...,0,0,0,1,0,0,0,0,0,0
2,20,80.0,9600,3,2,Norm,1Fam,1Story,6,8,...,1,0,1,0,0,0,0,0,0,0
3,60,68.0,11250,2,2,Norm,1Fam,2Story,7,5,...,0,0,0,1,0,0,0,0,0,0
4,70,60.0,9550,2,2,Norm,1Fam,2Story,7,5,...,0,0,0,1,0,0,0,0,0,0
5,60,84.0,14260,2,2,Norm,1Fam,2Story,8,5,...,0,0,0,1,0,0,0,0,0,0


### `condition2` 
Artery	Adjacent to arterial street  
Feedr	Adjacent to feeder street	  
Norm	Normal	  
RRNn	Within 200' of North-South Railroad  
RRAn	Adjacent to North-South Railroad  
PosN	Near positive off-site feature--park, greenbelt, etc.  
PosA	Adjacent to postive off-site feature  
RRNe	Within 200' of East-West Railroad  
RRAe	Adjacent to East-West Railroad  

In [36]:
condition2_setlist = ['Artery', 'Feedr', 'Norm', 'RRNn', 'RRAn','PosN','PosA','RRNe','RRAe']
train_test = dummify_column(train,"condition2", condition2_setlist)
train_test.head()

These categories are missing from the data: ['RRNe']
condition2 has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,bldgtype,housestyle,overallqual,overallcond,yearbuilt,...,rrae_condition1,artery_condition2,feedr_condition2,norm_condition2,rrnn_condition2,rran_condition2,posn_condition2,posa_condition2,rrne_condition2,rrae_condition2
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,1Fam,2Story,7,5,2003,...,0,0,0,1,0,0,0,0,0,0
2,20,80.0,9600,3,2,1Fam,1Story,6,8,1976,...,0,0,0,1,0,0,0,0,0,0
3,60,68.0,11250,2,2,1Fam,2Story,7,5,2001,...,0,0,0,1,0,0,0,0,0,0
4,70,60.0,9550,2,2,1Fam,2Story,7,5,1915,...,0,0,0,1,0,0,0,0,0,0
5,60,84.0,14260,2,2,1Fam,2Story,8,5,2000,...,0,0,0,1,0,0,0,0,0,0


### `bldgtype` 
1Fam	Single-family Detached	  
2FmCon	Two-family Conversion; originally built as one-family dwelling  
Duplx	Duplex  
TwnhsE	Townhouse End Unit  
TwnhsI	Townhouse Inside Unit  

In [37]:
bldgtype_setlist = ['1Fam', '2FmCon', 'Duplx', 'TwnhsE', 'TwnhsI']
train_test = dummify_column(train,"bldgtype", bldgtype_setlist)
train_test.head()

These categories are missing from the data: ['2FmCon', 'Duplx', 'TwnhsI']
bldgtype has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,housestyle,overallqual,overallcond,yearbuilt,yearremodadd,...,rran_condition2,posn_condition2,posa_condition2,rrne_condition2,rrae_condition2,1fam_bldgtype,2fmcon_bldgtype,duplx_bldgtype,twnhse_bldgtype,twnhsi_bldgtype
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,2Story,7,5,2003,2003,...,0,0,0,0,0,1,0,0,0,0
2,20,80.0,9600,3,2,1Story,6,8,1976,1976,...,0,0,0,0,0,1,0,0,0,0
3,60,68.0,11250,2,2,2Story,7,5,2001,2002,...,0,0,0,0,0,1,0,0,0,0
4,70,60.0,9550,2,2,2Story,7,5,1915,1970,...,0,0,0,0,0,1,0,0,0,0
5,60,84.0,14260,2,2,2Story,8,5,2000,2000,...,0,0,0,0,0,1,0,0,0,0


### `housestyle` 
1Story	One story  
1.5Fin	One and one-half story: 2nd level finished  
1.5Unf	One and one-half story: 2nd level unfinished  
2Story	Two story  
2.5Fin	Two and one-half story: 2nd level finished  
2.5Unf	Two and one-half story: 2nd level unfinished  
SFoyer	Split Foyer  
SLvl	Split Level  

In [38]:
housestyle_setlist = ['1Story', '1.5Fin', '1.5Unf', '2Story','2.5Fin','2.5Unf','SFoyer','SLvl']
train_test = dummify_column(train,"housestyle", housestyle_setlist)
train_test.head()

These categories are missing from the data: []
housestyle has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,overallqual,overallcond,yearbuilt,yearremodadd,roofstyle,...,twnhse_bldgtype,twnhsi_bldgtype,1story_housestyle,1.5fin_housestyle,1.5unf_housestyle,2story_housestyle,2.5fin_housestyle,2.5unf_housestyle,sfoyer_housestyle,slvl_housestyle
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,7,5,2003,2003,Gable,...,0,0,0,0,0,1,0,0,0,0
2,20,80.0,9600,3,2,6,8,1976,1976,Gable,...,0,0,1,0,0,0,0,0,0,0
3,60,68.0,11250,2,2,7,5,2001,2002,Gable,...,0,0,0,0,0,1,0,0,0,0
4,70,60.0,9550,2,2,7,5,1915,1970,Gable,...,0,0,0,0,0,1,0,0,0,0
5,60,84.0,14260,2,2,8,5,2000,2000,Gable,...,0,0,0,0,0,1,0,0,0,0


### `roofstyle` 
Flat	Flat  
Gable	Gable  
Gambrel	Gabrel (Barn)  
Hip	Hip  
Mansard	Mansard  
Shed	Shed  

In [39]:
roofstyle_setlist = ['Flat', 'Gable', 'Gambrel', 'Hip','Mansard','Shed']
train_test = dummify_column(train,"roofstyle", roofstyle_setlist)
train_test.head()

These categories are missing from the data: []
roofstyle has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,overallqual,overallcond,yearbuilt,yearremodadd,roofmatl,...,2.5fin_housestyle,2.5unf_housestyle,sfoyer_housestyle,slvl_housestyle,flat_roofstyle,gable_roofstyle,gambrel_roofstyle,hip_roofstyle,mansard_roofstyle,shed_roofstyle
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,7,5,2003,2003,CompShg,...,0,0,0,0,0,1,0,0,0,0
2,20,80.0,9600,3,2,6,8,1976,1976,CompShg,...,0,0,0,0,0,1,0,0,0,0
3,60,68.0,11250,2,2,7,5,2001,2002,CompShg,...,0,0,0,0,0,1,0,0,0,0
4,70,60.0,9550,2,2,7,5,1915,1970,CompShg,...,0,0,0,0,0,1,0,0,0,0
5,60,84.0,14260,2,2,8,5,2000,2000,CompShg,...,0,0,0,0,0,1,0,0,0,0


### `roofmatl` 
ClyTile	Clay or Tile  
CompShg	Standard (Composite) Shingle  
Membran	Membrane  
Metal	Metal  
Roll	Roll  
Tar&Grv	Gravel & Tar  
WdShake	Wood Shakes  
WdShngl	Wood Shingles  

In [40]:
roofmatl_setlist = ['ClyTile', 'CompShg', 'Membran', 'Metal','Roll','Tar&Grv','WdShake','WdShngl']
train_test = dummify_column(train,"roofmatl", roofmatl_setlist)
train_test.head()

These categories are missing from the data: []
roofmatl has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,overallqual,overallcond,yearbuilt,yearremodadd,exterior1st,...,mansard_roofstyle,shed_roofstyle,clytile_roofmatl,compshg_roofmatl,membran_roofmatl,metal_roofmatl,roll_roofmatl,tar&grv_roofmatl,wdshake_roofmatl,wdshngl_roofmatl
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,7,5,2003,2003,VinylSd,...,0,0,0,1,0,0,0,0,0,0
2,20,80.0,9600,3,2,6,8,1976,1976,MetalSd,...,0,0,0,1,0,0,0,0,0,0
3,60,68.0,11250,2,2,7,5,2001,2002,VinylSd,...,0,0,0,1,0,0,0,0,0,0
4,70,60.0,9550,2,2,7,5,1915,1970,Wd Sdng,...,0,0,0,1,0,0,0,0,0,0
5,60,84.0,14260,2,2,8,5,2000,2000,VinylSd,...,0,0,0,1,0,0,0,0,0,0


### `exterior1st` 
AsbShng	Asbestos Shingles  
AsphShn	Asphalt Shingles  
BrkComm	Brick Common  
BrkFace	Brick Face  
CBlock	Cinder Block  
CemntBd	Cement Board  
HdBoard	Hard Board  
ImStucc	Imitation Stucco  
MetalSd	Metal Siding  
Other	Other  
Plywood	Plywood  
PreCast	PreCast	  
Stone	Stone  
Stucco	Stucco  
VinylSd	Vinyl Siding  
Wd Sdng	Wood Siding  
WdShing	Wood Shingles  

In [41]:
exterior1st_setlist = ['AsbShng', 'AsphShn', 'BrkComm', 'BrkFace','CBlock','CemntBd','HdBoard','ImStucc','MetalSd','Other','Plywood','PreCast','Stone','Stucco','VinylSd','Wd Sdng','WdShing']
train_test = dummify_column(train,"exterior1st", exterior1st_setlist)
train_test.head()

These categories are missing from the data: ['Other', 'PreCast']
exterior1st has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,overallqual,overallcond,yearbuilt,yearremodadd,exterior2nd,...,imstucc_exterior1st,metalsd_exterior1st,other_exterior1st,plywood_exterior1st,precast_exterior1st,stone_exterior1st,stucco_exterior1st,vinylsd_exterior1st,wd sdng_exterior1st,wdshing_exterior1st
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,7,5,2003,2003,VinylSd,...,0,0,0,0,0,0,0,1,0,0
2,20,80.0,9600,3,2,6,8,1976,1976,MetalSd,...,0,1,0,0,0,0,0,0,0,0
3,60,68.0,11250,2,2,7,5,2001,2002,VinylSd,...,0,0,0,0,0,0,0,1,0,0
4,70,60.0,9550,2,2,7,5,1915,1970,Wd Shng,...,0,0,0,0,0,0,0,0,1,0
5,60,84.0,14260,2,2,8,5,2000,2000,VinylSd,...,0,0,0,0,0,0,0,1,0,0


### `exterior2nd` 
AsbShng	Asbestos Shingles  
AsphShn	Asphalt Shingles  
BrkComm	Brick Common  
BrkFace	Brick Face  
CBlock	Cinder Block  
CemntBd	Cement Board  
HdBoard	Hard Board  
ImStucc	Imitation Stucco  
MetalSd	Metal Siding  
Other	Other  
Plywood	Plywood  
PreCast	PreCast	  
Stone	Stone  
Stucco	Stucco  
VinylSd	Vinyl Siding  
Wd Sdng	Wood Siding  
WdShing	Wood Shingles  

In [42]:
exterior2nd_setlist = ['AsbShng', 'AsphShn', 'BrkComm', 'BrkFace','CBlock','CemntBd','HdBoard','ImStucc','MetalSd','Other','Plywood','PreCast','Stone','Stucco','VinylSd','Wd Sdng','WdShing']
train_test = dummify_column(train,"exterior2nd", exterior2nd_setlist)
train_test.head()

These categories are missing from the data: ['BrkComm', 'CemntBd', 'PreCast', 'WdShing']
exterior2nd has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,overallqual,overallcond,yearbuilt,yearremodadd,masvnrtype,...,imstucc_exterior2nd,metalsd_exterior2nd,other_exterior2nd,plywood_exterior2nd,precast_exterior2nd,stone_exterior2nd,stucco_exterior2nd,vinylsd_exterior2nd,wd sdng_exterior2nd,wdshing_exterior2nd
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,7,5,2003,2003,BrkFace,...,0,0,0,0,0,0,0,1,0,0
2,20,80.0,9600,3,2,6,8,1976,1976,,...,0,1,0,0,0,0,0,0,0,0
3,60,68.0,11250,2,2,7,5,2001,2002,BrkFace,...,0,0,0,0,0,0,0,1,0,0
4,70,60.0,9550,2,2,7,5,1915,1970,,...,0,0,0,0,0,0,0,0,0,0
5,60,84.0,14260,2,2,8,5,2000,2000,BrkFace,...,0,0,0,0,0,0,0,1,0,0


### `masvnrtype` 
BrkCmn	Brick Common  
BrkFace	Brick Face  
CBlock	Cinder Block  
None	None  
Stone	Stone  

In [43]:
masvnrtype_setlist = ['BrkCmn', 'BrkFace', 'CBlock', 'None', 'Stone']
train_test = dummify_column(train,"masvnrtype", masvnrtype_setlist)
train_test.head()

These categories are missing from the data: ['CBlock']
masvnrtype has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,overallqual,overallcond,yearbuilt,yearremodadd,masvnrarea,...,stone_exterior2nd,stucco_exterior2nd,vinylsd_exterior2nd,wd sdng_exterior2nd,wdshing_exterior2nd,brkcmn_masvnrtype,brkface_masvnrtype,cblock_masvnrtype,none_masvnrtype,stone_masvnrtype
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,7,5,2003,2003,196.0,...,0,0,1,0,0,0,1,0,0,0
2,20,80.0,9600,3,2,6,8,1976,1976,0.0,...,0,0,0,0,0,0,0,0,1,0
3,60,68.0,11250,2,2,7,5,2001,2002,162.0,...,0,0,1,0,0,0,1,0,0,0
4,70,60.0,9550,2,2,7,5,1915,1970,0.0,...,0,0,0,0,0,0,0,0,1,0
5,60,84.0,14260,2,2,8,5,2000,2000,350.0,...,0,0,1,0,0,0,1,0,0,0


### `foundation`  
BrkTil	Brick & Tile  
CBlock	Cinder Block  
PConc	Poured Contrete	  
Slab	Slab  
Stone	Stone  
Wood	Wood  

In [44]:
foundation_setlist = ['BrkTil', 'CBlock', 'PConc', 'Slab', 'Stone', 'Wood']
train_test = dummify_column(train,"foundation", foundation_setlist)
train_test.head()

These categories are missing from the data: []
foundation has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,overallqual,overallcond,yearbuilt,yearremodadd,masvnrarea,...,brkface_masvnrtype,cblock_masvnrtype,none_masvnrtype,stone_masvnrtype,brktil_foundation,cblock_foundation,pconc_foundation,slab_foundation,stone_foundation,wood_foundation
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,7,5,2003,2003,196.0,...,1,0,0,0,0,0,1,0,0,0
2,20,80.0,9600,3,2,6,8,1976,1976,0.0,...,0,0,1,0,0,1,0,0,0,0
3,60,68.0,11250,2,2,7,5,2001,2002,162.0,...,1,0,0,0,0,0,1,0,0,0
4,70,60.0,9550,2,2,7,5,1915,1970,0.0,...,0,0,1,0,1,0,0,0,0,0
5,60,84.0,14260,2,2,8,5,2000,2000,350.0,...,1,0,0,0,0,0,1,0,0,0


### `heating`  
Floor	Floor Furnace  
GasA	Gas forced warm air furnace  
GasW	Gas hot water or steam heat  
Grav	Gravity furnace	  
OthW	Hot water or steam heat other than gas  
Wall	Wall furnace  

In [45]:
heating_setlist = ['Floor', 'GasA', 'GasW', 'Grav', 'OthW', 'Wall']
train_test = dummify_column(train,"heating", heating_setlist)
train_test.head()

These categories are missing from the data: []
heating has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,overallqual,overallcond,yearbuilt,yearremodadd,masvnrarea,...,pconc_foundation,slab_foundation,stone_foundation,wood_foundation,floor_heating,gasa_heating,gasw_heating,grav_heating,othw_heating,wall_heating
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,7,5,2003,2003,196.0,...,1,0,0,0,0,1,0,0,0,0
2,20,80.0,9600,3,2,6,8,1976,1976,0.0,...,0,0,0,0,0,1,0,0,0,0
3,60,68.0,11250,2,2,7,5,2001,2002,162.0,...,1,0,0,0,0,1,0,0,0,0
4,70,60.0,9550,2,2,7,5,1915,1970,0.0,...,0,0,0,0,0,1,0,0,0,0
5,60,84.0,14260,2,2,8,5,2000,2000,350.0,...,1,0,0,0,0,1,0,0,0,0


### `electrical`
SBrkr	Standard Circuit Breakers & Romex  
FuseA	Fuse Box over 60 AMP and all Romex wiring (Average)	  
FuseF	60 AMP Fuse Box and mostly Romex wiring (Fair)  
FuseP	60 AMP Fuse Box and mostly knob & tube wiring (poor)  
Mix	Mixed  

In [46]:
electrical_setlist = ['SBrkr', 'FuseA', 'FuseF', 'FuseP', 'Mix']
train_test = dummify_column(train,"electrical", electrical_setlist)
train_test.head()

These categories are missing from the data: []
electrical has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,overallqual,overallcond,yearbuilt,yearremodadd,masvnrarea,...,gasa_heating,gasw_heating,grav_heating,othw_heating,wall_heating,sbrkr_electrical,fusea_electrical,fusef_electrical,fusep_electrical,mix_electrical
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,7,5,2003,2003,196.0,...,1,0,0,0,0,1,0,0,0,0
2,20,80.0,9600,3,2,6,8,1976,1976,0.0,...,1,0,0,0,0,1,0,0,0,0
3,60,68.0,11250,2,2,7,5,2001,2002,162.0,...,1,0,0,0,0,1,0,0,0,0
4,70,60.0,9550,2,2,7,5,1915,1970,0.0,...,1,0,0,0,0,1,0,0,0,0
5,60,84.0,14260,2,2,8,5,2000,2000,350.0,...,1,0,0,0,0,1,0,0,0,0


### `garagetype`
2Types	More than one type of garage  
Attchd	Attached to home  
Basment	Basement Garage  
BuiltIn	Built-In (Garage part of house - typically has room above garage)  
CarPort	Car Port  
Detchd	Detached from home  
NA	No Garage  

In [47]:
garagetype_setlist = ['2Types', 'Attchd', 'Basment', 'BuiltIn', 'CarPort', 'Detchd', 'NA']
train_test = dummify_column(train,"garagetype", garagetype_setlist)
train_test.head()

These categories are missing from the data: ['NA']
garagetype has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,overallqual,overallcond,yearbuilt,yearremodadd,masvnrarea,...,fusef_electrical,fusep_electrical,mix_electrical,2types_garagetype,attchd_garagetype,basment_garagetype,builtin_garagetype,carport_garagetype,detchd_garagetype,na_garagetype
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,7,5,2003,2003,196.0,...,0,0,0,0,1,0,0,0,0,0
2,20,80.0,9600,3,2,6,8,1976,1976,0.0,...,0,0,0,0,1,0,0,0,0,0
3,60,68.0,11250,2,2,7,5,2001,2002,162.0,...,0,0,0,0,1,0,0,0,0,0
4,70,60.0,9550,2,2,7,5,1915,1970,0.0,...,0,0,0,0,0,0,0,0,1,0
5,60,84.0,14260,2,2,8,5,2000,2000,350.0,...,0,0,0,0,1,0,0,0,0,0


### `paveddrive`
Y	Paved   
P	Partial Pavement  
N	Dirt/Gravel  

In [48]:
paveddrive_setlist = ['Y', 'P', 'N']
train_test = dummify_column(train,"paveddrive", paveddrive_setlist)
train_test.head()

These categories are missing from the data: []
paveddrive has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,overallqual,overallcond,yearbuilt,yearremodadd,masvnrarea,...,2types_garagetype,attchd_garagetype,basment_garagetype,builtin_garagetype,carport_garagetype,detchd_garagetype,na_garagetype,y_paveddrive,p_paveddrive,n_paveddrive
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,7,5,2003,2003,196.0,...,0,1,0,0,0,0,0,1,0,0
2,20,80.0,9600,3,2,6,8,1976,1976,0.0,...,0,1,0,0,0,0,0,1,0,0
3,60,68.0,11250,2,2,7,5,2001,2002,162.0,...,0,1,0,0,0,0,0,1,0,0
4,70,60.0,9550,2,2,7,5,1915,1970,0.0,...,0,0,0,0,0,1,0,1,0,0
5,60,84.0,14260,2,2,8,5,2000,2000,350.0,...,0,1,0,0,0,0,0,1,0,0


### `saletype`  
WD 	Warranty Deed - Conventional  
CWD	Warranty Deed - Cash  
VWD	Warranty Deed - VA Loan  
New	Home just constructed and sold  
COD	Court Officer Deed/Estate  
Con	Contract 15% Down payment regular terms  
ConLw	Contract Low Down payment and low interest  
ConLI	Contract Low Interest  
ConLD	Contract Low Down  
Oth	Other  

In [49]:
saletype_setlist = ['WD', 'CWD', 'VWD', 'New', 'COD', 'Con', 'ConLw', 'ConLI', 'ConLD', 'Oth']
train_test = dummify_column(train,"saletype", saletype_setlist)
train_test.head()

These categories are missing from the data: ['VWD']
saletype has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,overallqual,overallcond,yearbuilt,yearremodadd,masvnrarea,...,wd_saletype,cwd_saletype,vwd_saletype,new_saletype,cod_saletype,con_saletype,conlw_saletype,conli_saletype,conld_saletype,oth_saletype
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,7,5,2003,2003,196.0,...,1,0,0,0,0,0,0,0,0,0
2,20,80.0,9600,3,2,6,8,1976,1976,0.0,...,1,0,0,0,0,0,0,0,0,0
3,60,68.0,11250,2,2,7,5,2001,2002,162.0,...,1,0,0,0,0,0,0,0,0,0
4,70,60.0,9550,2,2,7,5,1915,1970,0.0,...,1,0,0,0,0,0,0,0,0,0
5,60,84.0,14260,2,2,8,5,2000,2000,350.0,...,1,0,0,0,0,0,0,0,0,0


### `salecondition` 
Normal	Normal Sale  
Abnorml	Abnormal Sale -  trade, foreclosure, short sale  
AdjLand	Adjoining Land Purchase  
Alloca	Allocation - two linked properties with separate deeds, typically condo with a garage unit	  
Family	Sale between family members  
Partial	Home was not completed when last assessed (associated with New Homes)  

In [50]:
salecondition_setlist = ['Normal', 'Abnorml', 'AdjLand', 'Alloca', 'Family', 'Partial']
train_test = dummify_column(train,"salecondition", salecondition_setlist)
train_test.head()

These categories are missing from the data: []
salecondition has been concatenated with input DataFrame. All missing categories are set to 0


Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,overallqual,overallcond,yearbuilt,yearremodadd,masvnrarea,...,conlw_saletype,conli_saletype,conld_saletype,oth_saletype,normal_salecondition,abnorml_salecondition,adjland_salecondition,alloca_salecondition,family_salecondition,partial_salecondition
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,7,5,2003,2003,196.0,...,0,0,0,0,1,0,0,0,0,0
2,20,80.0,9600,3,2,6,8,1976,1976,0.0,...,0,0,0,0,1,0,0,0,0,0
3,60,68.0,11250,2,2,7,5,2001,2002,162.0,...,0,0,0,0,1,0,0,0,0,0
4,70,60.0,9550,2,2,7,5,1915,1970,0.0,...,0,0,0,0,0,1,0,0,0,0
5,60,84.0,14260,2,2,8,5,2000,2000,350.0,...,0,0,0,0,1,0,0,0,0,0


In [51]:
train_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Columns: 228 entries, mssubclass to partial_salecondition
dtypes: float64(11), int64(61), uint8(156)
memory usage: 1.0 MB


All named columns are now setup into dummy variables.

## Checking for Null Values again

In [52]:
null_fields_checker(train)

lotfrontage : 259
masvnrarea : 8
bsmtqual : 37
bsmtcond : 37
bsmtexposure : 38
bsmtfintype1 : 37
bsmtfintype2 : 38
garageyrblt : 81
garagefinish : 81
garagequal : 81
garagecond : 81


## A/B Testing 1: Setting all Null Values as Zeroes
I will be A/B Testing this dataset first by cleaning null values as zeroes instead. The rationale is that if these values are missing, they may have meaning as they might imply that these features are missing from the house and are thus meaningful.

In [53]:
train_zero = train_test
train_zero.fillna(0, inplace = True)
null_fields_checker(train_zero)

No results from our Null Field Checker function means there are no longer Null Values in our column.

In [54]:
train_zero.head()

Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,overallqual,overallcond,yearbuilt,yearremodadd,masvnrarea,...,conlw_saletype,conli_saletype,conld_saletype,oth_saletype,normal_salecondition,abnorml_salecondition,adjland_salecondition,alloca_salecondition,family_salecondition,partial_salecondition
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,7,5,2003,2003,196.0,...,0,0,0,0,1,0,0,0,0,0
2,20,80.0,9600,3,2,6,8,1976,1976,0.0,...,0,0,0,0,1,0,0,0,0,0
3,60,68.0,11250,2,2,7,5,2001,2002,162.0,...,0,0,0,0,1,0,0,0,0,0
4,70,60.0,9550,2,2,7,5,1915,1970,0.0,...,0,0,0,0,0,1,0,0,0,0
5,60,84.0,14260,2,2,8,5,2000,2000,350.0,...,0,0,0,0,1,0,0,0,0,0


# A/B Testing 2: Imputing Data with the Mean of the column

In [55]:
train_mean = train_test

for col in train_mean.columns:
    train_mean[col].fillna(train_mean.mean(), inplace = True)
    
null_fields_checker(train_mean)

No results from our Null Field Checker function means there are no longer Null Values in our column.

In [56]:
train_mean.head()

Unnamed: 0_level_0,mssubclass,lotfrontage,lotarea,lotshape,landslope,overallqual,overallcond,yearbuilt,yearremodadd,masvnrarea,...,conlw_saletype,conli_saletype,conld_saletype,oth_saletype,normal_salecondition,abnorml_salecondition,adjland_salecondition,alloca_salecondition,family_salecondition,partial_salecondition
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,65.0,8450,3,2,7,5,2003,2003,196.0,...,0,0,0,0,1,0,0,0,0,0
2,20,80.0,9600,3,2,6,8,1976,1976,0.0,...,0,0,0,0,1,0,0,0,0,0
3,60,68.0,11250,2,2,7,5,2001,2002,162.0,...,0,0,0,0,1,0,0,0,0,0
4,70,60.0,9550,2,2,7,5,1915,1970,0.0,...,0,0,0,0,0,1,0,0,0,0
5,60,84.0,14260,2,2,8,5,2000,2000,350.0,...,0,0,0,0,1,0,0,0,0,0


# Save Our Data into another CSV for our next notebooks to load up.

In [57]:
train_zero.to_csv("../assets/clean_zero_train.csv",index = True)
train_mean.to_csv("../assets/clean_mean_train.csv",index = True)

---