# 3 - Feature Engineering

In [1447]:
import pandas as pd
pd.set_option('display.max_columns', None)
train = pd.read_pickle('../pickles/cleaned/train_cleaned')
test = pd.read_pickle('../pickles/cleaned/test_cleaned')

### Value Mapping<p>
In order to be used in regression, all columns need to be in a numerical format. Additionally, some columns can be combined into more meaningful data points, such as room counts.<p>
First, hierarchical values, such as those ranging from 'poor' to 'excellent', can be converted into numerical ones quite easily, with 1 as the worst and counting up, using 0 where the property doesn't have the feature at all. <p>
Some value schemes are shared across multiple columns, so those columns can all be mapped together.

5 <- Ex	(Excellent) <br>
4 <- Gd	(Good)<br>
3 <- TA	(Average/Typical)<br>
2 <- Fa	(Fair)<br>
1 <- Po	(Poor)<br>
0 <- None	(Doesn't have)<p>

Columns this applies to:<p>
`ExterQual`: Evaluates the quality of the material on the exterior<br>
`ExterCond`: Evaluates the present condition of the material on the exterior<br>
`BsmtQual`: Evaluates the height of the basement<br>
`BsmtCond`: Evaluates the general condition of the basement<br>
`HeatingQC`: Heating quality and condition<br>
`KitchenQual`: Kitchen quality<br>
`FireplaceQu`: Fireplace quality<br>
`GarageQual`: Garage quality<br>
`GarageCond`: Garage condition<br>
`PoolQC`: Pool quality

In [1448]:
mapping1 = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'None':0}

In [1449]:
train.loc[:, 'PoolQC'] = train['PoolQC'].map(mapping1)
train.loc[:, 'FireplaceQu'] = train['FireplaceQu'].map(mapping1)
train.loc[:, 'GarageCond'] = train['GarageCond'].map(mapping1)
train.loc[:, 'GarageQual'] = train['GarageQual'].map(mapping1)
train.loc[:, 'KitchenQual'] = train['KitchenQual'].map(mapping1)
train.loc[:, 'HeatingQC'] = train['HeatingQC'].map(mapping1)
train.loc[:, 'BsmtCond'] = train['BsmtCond'].map(mapping1)
train.loc[:, 'BsmtQual'] = train['BsmtQual'].map(mapping1)
train.loc[:, 'ExterCond'] = train['ExterCond'].map(mapping1)
train.loc[:, 'ExterQual'] = train['ExterQual'].map(mapping1)

In [1450]:
test.loc[:, 'PoolQC'] = test['PoolQC'].map(mapping1)
test.loc[:, 'FireplaceQu'] = test['FireplaceQu'].map(mapping1)
test.loc[:, 'GarageCond'] = test['GarageCond'].map(mapping1)
test.loc[:, 'GarageQual'] = test['GarageQual'].map(mapping1)
test.loc[:, 'KitchenQual'] = test['KitchenQual'].map(mapping1)
test.loc[:, 'HeatingQC'] = test['HeatingQC'].map(mapping1)
test.loc[:, 'BsmtCond'] = test['BsmtCond'].map(mapping1)
test.loc[:, 'BsmtQual'] = test['BsmtQual'].map(mapping1)
test.loc[:, 'ExterCond'] = test['ExterCond'].map(mapping1)
test.loc[:, 'ExterQual'] = test['ExterQual'].map(mapping1)
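The assignments above all follow one pattern, so they could also be written as a loop over the column names. The following is a minimal sketch on toy data, not the notebook's actual frames: `toy_train`, `toy_test`, `quality_map` and `quality_cols` are illustrative names, and only two of the ten columns are shown. It uses plain column assignment rather than `.loc[:, col] = ...`; depending on the pandas version, the `.loc` form may keep the column's original object dtype.

```python
import pandas as pd

# Toy frames standing in for train/test (illustrative values only)
toy_train = pd.DataFrame({'ExterQual': ['Gd', 'TA'], 'KitchenQual': ['Ex', 'Fa']})
toy_test = pd.DataFrame({'ExterQual': ['Po', 'Gd'], 'KitchenQual': ['TA', 'TA']})

# Same scale as mapping1 above
quality_map = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0}

# Hypothetical column list; extend with the other shared-scale columns
quality_cols = ['ExterQual', 'KitchenQual']

for frame in (toy_train, toy_test):
    for col in quality_cols:
        # Assign the mapped Series back to the same column
        frame[col] = frame[col].map(quality_map)

print(toy_train)
print(toy_test)
```

Keeping the column list in one place also guarantees that train and test receive exactly the same treatment.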

6 <- GLQ (Good Living Quarters)<br>
5 <- ALQ (Average Living Quarters)<br>
4 <- BLQ (Below Average Living Quarters)	<br>
3 <- Rec (Average Rec Room)<br>
2 <- LwQ (Low Quality)<br>
1 <- Unf (Unfinished)<br>
0 <- None (Doesn't have)<p>

Columns this applies to:<p>
`BsmtFinType1`: Rating of basement finished area<br>
`BsmtFinType2`: Rating of basement finished area (if multiple types)

In [1451]:
mapping2 = {'GLQ':6, 'ALQ':5, 'BLQ':4, 'Rec':3, 'LwQ':2, 'Unf':1, 'None':0}

train.loc[:, 'BsmtFinType1'] = train['BsmtFinType1'].map(mapping2)
train.loc[:, 'BsmtFinType2'] = train['BsmtFinType2'].map(mapping2)

test.loc[:, 'BsmtFinType1'] = test['BsmtFinType1'].map(mapping2)
test.loc[:, 'BsmtFinType2'] = test['BsmtFinType2'].map(mapping2)

2 <- Grvl	(Gravel)<br>
1 <- Pave	(Paved)<br>
0 <- None (Only on `Alley`; no alley access)<p>

Applies to:<p>
`Street`: Type of road access to property<br>
`Alley`: Type of alley access to property

In [1452]:
mapping3 = {'Grvl':2, 'Pave':1, 'None':0}

train.loc[:, 'Street'] = train['Street'].map(mapping3)
train.loc[:, 'Alley'] = train['Alley'].map(mapping3)

test.loc[:, 'Street'] = test['Street'].map(mapping3)
test.loc[:, 'Alley'] = test['Alley'].map(mapping3)
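One thing to watch with dictionary mapping: `Series.map` returns `NaN` for any value that is not a key in the dictionary, so a misspelled or unexpected label silently becomes missing data. A small sketch of a check for this, using a toy Series with a deliberately wrong label (`col`, `mapped` and `unmapped` are illustrative names):

```python
import pandas as pd

# Toy column; 'pave' is a deliberate typo that the mapping does not cover
col = pd.Series(['Grvl', 'Pave', 'None', 'pave'])
mapping = {'Grvl': 2, 'Pave': 1, 'None': 0}

mapped = col.map(mapping)

# Labels that were present before mapping but turned into NaN afterwards
unmapped = sorted(col[mapped.isna() & col.notna()].unique())
print(unmapped)
```

Running a check like this after each mapping cell would flag any label that the mapping missed before it reaches the model.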

The rest of the columns all have unique value sets and must be mapped individually.<p><br></p>

I'm going to convert the values for `LotShape` into numeric values, based on increasing irregularity. <p>
1 <- Reg	(Regular)<br>
2 <- IR1	(Slightly irregular)<br>
3 <- IR2	(Moderately Irregular)<br>
4 <- IR3	(Irregular)

In [1453]:
shapemap = {'Reg':1, 'IR1':2, 'IR2':3, 'IR3':4}

train.loc[:, 'LotShape'] = train['LotShape'].map(shapemap)

test.loc[:, 'LotShape'] = test['LotShape'].map(shapemap)

I am doing the same for the `LandSlope` column, increasing with severity.<p>
1 <- Gtl	(Gentle slope)<br>
2 <- Mod	(Moderate Slope)<br>
3 <- Sev	(Severe Slope)

In [1454]:
slopemap = {'Gtl':1, 'Mod':2, 'Sev':3}

train.loc[:, 'LandSlope'] = train['LandSlope'].map(slopemap)

test.loc[:, 'LandSlope'] = test['LandSlope'].map(slopemap)

The values of the `LandContour` column imply an ordered hierarchy, and I am going to treat them as such.<p>

4 <- Lvl	(Near Flat/Level)<br>
3 <- Bnk	(Banked)<br>
2 <- HLS	(Hillside)<br>
1 <- Low	(Depression)

In [1455]:
contmap = {'Lvl':4,'Bnk':3,'HLS':2,'Low':1}

train.loc[:, 'LandContour'] = train['LandContour'].map(contmap)

test.loc[:, 'LandContour'] = test['LandContour'].map(contmap)

The `Utilities` column:<p>

4 <- AllPub	(All public Utilities)<br>
3 <- NoSewr	(Electricity, Gas, and Water (Septic Tank))<br>
2 <- NoSeWa	(Electricity and Gas Only)<br>
1 <- ELO	(Electricity only)

In [1456]:
utilmap = {'AllPub':4, 'NoSewr':3, 'NoSeWa':2, 'ELO':1}

train.loc[:, 'Utilities'] = train['Utilities'].map(utilmap)

test.loc[:, 'Utilities'] = test['Utilities'].map(utilmap)

The values of the `LotConfig` column also imply an ordered hierarchy, with more street frontage being more desirable.<p>

1 <- Inside	(Inside lot)<br>
2 <- Corner	(Corner lot)<br>
3 <- CulDSac	(Cul-de-sac)<br>
4 <- FR2	(Frontage on 2 sides)<br>
5 <- FR3	(Frontage on 3 sides)

In [1457]:
configmap = {'Inside':1,'Corner':2,'CulDSac':3,'FR2':4,'FR3':5}

train.loc[:, 'LotConfig'] = train['LotConfig'].map(configmap)

test.loc[:, 'LotConfig'] = test['LotConfig'].map(configmap)

The `BsmtExposure` column:<p>

4 <- Gd	(Good Exposure)<br>
3 <- Av	(Average Exposure)<br>
2 <- Mn	(Minimum Exposure)<br>
1 <- No	(No Exposure)<br>
0 <- None	(No Basement)

In [1458]:
bsmtmap = {'Gd':4, 'Av':3, 'Mn':2, 'No':1, 'None':0}

train.loc[:, 'BsmtExposure'] = train['BsmtExposure'].map(bsmtmap)

test.loc[:, 'BsmtExposure'] = test['BsmtExposure'].map(bsmtmap)

The `Functional` column:<p>

8 <- Typ	(Typical Functionality)<br>
7 <- Min1	(Minor Deductions 1)<br>
6 <- Min2	(Minor Deductions 2)<br>
5 <- Mod	(Moderate Deductions)<br>
4 <- Maj1	(Major Deductions 1)<br>
3 <- Maj2	(Major Deductions 2)<br>
2 <- Sev	(Severely Damaged)<br>
1 <- Sal	(Salvage only)


In [1459]:
functmap = {'Typ':8, 'Min1':7, 'Min2':6, 'Mod':5, 'Maj1':4, 'Maj2':3, 'Sev':2, 'Sal':1}

train.loc[:, 'Functional'] = train['Functional'].map(functmap)

test.loc[:, 'Functional'] = test['Functional'].map(functmap)

The `GarageFinish` column:<p>

3 <- Fin (Finished)<br>
2 <- RFn (Rough Finished)<br>
1 <- Unf (Unfinished)<br>
0 <- None (No Garage)

In [1460]:
garagemap = {'Fin':3, 'RFn':2, 'Unf':1, 'None':0}

train.loc[:, 'GarageFinish'] = train['GarageFinish'].map(garagemap)

test.loc[:, 'GarageFinish'] = test['GarageFinish'].map(garagemap)

The `PavedDrive` column:<p>
3 <- Y	(Paved) <br>
2 <- P	(Partial Pavement)<br>
1 <- N	(Dirt/Gravel)

In [1461]:
pavemap = {'Y':3, 'P':2, 'N':1}

train.loc[:, 'PavedDrive'] = train['PavedDrive'].map(pavemap)

test.loc[:, 'PavedDrive'] = test['PavedDrive'].map(pavemap)

The `Fence` column contains values which look like they could be contrasting pairs, but because they are all in the same column, they can't be treated that way. As such, I think the best way to treat them is to assume they are meant to be hierarchical.<p>
4 <- GdPrv	(Good Privacy)<br>
3 <- MnPrv	(Minimum Privacy)<br>
2 <- GdWo	(Good Wood)<br>
1 <- MnWw	(Minimum Wood/Wire)<br>
0 <- None	(No Fence)

In [1462]:
fencemap = {
'GdPrv':4,'MnPrv':3,'GdWo':2,'MnWw':1,'None':0
}

train.loc[:, 'Fence'] = train['Fence'].map(fencemap)

test.loc[:, 'Fence'] = test['Fence'].map(fencemap)

Additionally, the `CentralAir` column currently contains Yes/No values, and as such can be re-mapped using the standard 1/0 schema.

In [1463]:
airmap = {'Y':1, 'N':0}

train.loc[:, 'CentralAir'] = train['CentralAir'].map(airmap)

test.loc[:, 'CentralAir'] = test['CentralAir'].map(airmap)

The `GarageYrBlt` column has 'None' written in where a property has no garage. In order for this column to be useable, and because the type of garage (or lack thereof) is going to be handled later during One Hot Encoding, I am going to have to use a 0 for the 'None's. 

In [1464]:
train.loc[((train[train['GarageYrBlt']=='None'].index).tolist()), 'GarageYrBlt'] = 0

test.loc[((test[test['GarageYrBlt']=='None'].index).tolist()), 'GarageYrBlt'] = 0
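An equivalent way to express this replacement is with a boolean mask followed by an explicit numeric conversion. This is a sketch on a toy frame (the name `toy` and its values are illustrative), assuming the column holds a mix of numbers and the string 'None':

```python
import pandas as pd

# Toy column mixing years with the 'None' placeholder
toy = pd.DataFrame({'GarageYrBlt': [1998, 'None', 2005.0, 'None']})

# Swap the placeholder for 0 wherever it appears
no_garage = toy['GarageYrBlt'] == 'None'
cleaned = toy['GarageYrBlt'].mask(no_garage, 0)

# Coerce the result to a numeric dtype so it can be used in regression
toy['GarageYrBlt'] = pd.to_numeric(cleaned)
print(toy.dtypes)
```

The `pd.to_numeric` call is there because replacing the strings alone can leave the column with an object dtype even though every value is a number.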

### One Hot Encoding<p>
Features that aren't numerical and have no obvious hierarchy need to be handled via One Hot Encoding.

In [1465]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False)
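For readers unfamiliar with the technique, here is a small illustration of what one-hot expansion produces. It uses `pd.get_dummies` instead of the `OneHotEncoder` above purely to keep the sketch short; the toy frame and its values are made up.

```python
import pandas as pd

# Toy categorical column
toy = pd.DataFrame({'RoofStyle': ['Gable', 'Hip', 'Gable', 'Flat']})

# One indicator column per distinct label, named '<prefix>_<label>'
dummies = pd.get_dummies(toy['RoofStyle'], prefix='RoofStyle', dtype=float)
print(dummies)
```

Each row has exactly one indicator set to 1, so no ordering between the labels is implied.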

In the interest of not having a massive number of columns at the end, some values are going to be combined if ***a)*** they are not very numerous and ***b)*** they can logically be combined.

There are some column pairs that share the same possible values. In order to capture both columns' data for each pair, I'm expanding both of them and then combining them into a single column set, in which each row has at most two columns flagged.

`Condition1` and `Condition2` share the following value set:<p>

Artery - Adjacent to arterial street<br>
Feedr - Adjacent to feeder street<br>
Norm - Normal	<br>
RRNn - Within 200' of North-South Railroad<br>
RRAn - Adjacent to North-South Railroad<br>
PosN - Near positive off-site feature--park, greenbelt, etc.<br>
PosA - Adjacent to positive off-site feature<br>
RRNe - Within 200' of East-West Railroad<br>
RRAe - Adjacent to East-West Railroad<p>

The vast majority of values are in the 'Norm' category, so I am going to be combining some of the other values. While 'Artery' and 'Feedr' could potentially be combined into a single 'Street' value, I can see the logic in differentiating when a property is on a much busier street. I will be combining the directional railroad values, but maintaining the differentiation for distance.<p>

**RRN - Within 200' of Railroad**<p>
RRNn - Within 200' of North-South Railroad<br>
RRNe - Within 200' of East-West Railroad<p>
<br>
**RRA - Adjacent to Railroad**<p>
RRAn - Adjacent to North-South Railroad<br>
RRAe - Adjacent to East-West Railroad

In [1466]:
# apply 'RRN' to Condition1 column of train
train.loc[((train[
    (train['Condition1']=='RRNn')|
    (train['Condition1']=='RRNe')
    ].index).tolist()), 'Condition1'] = 'RRN'

# apply 'RRN' to Condition2 column of train
train.loc[((train[
    (train['Condition2']=='RRNn')|
    (train['Condition2']=='RRNe')
    ].index).tolist()), 'Condition2'] = 'RRN'

# apply 'RRN' to Condition1 column of test
test.loc[((test[
    (test['Condition1']=='RRNn')|
    (test['Condition1']=='RRNe')
    ].index).tolist()), 'Condition1'] = 'RRN'

# apply 'RRN' to Condition2 column of test
test.loc[((test[
    (test['Condition2']=='RRNn')|
    (test['Condition2']=='RRNe')
    ].index).tolist()), 'Condition2'] = 'RRN'

In [1467]:
# apply 'RRA' to Condition1 column of train
train.loc[((train[
    (train['Condition1']=='RRAn')|
    (train['Condition1']=='RRAe')
    ].index).tolist()), 'Condition1'] = 'RRA'

# apply 'RRA' to Condition2 column of train
train.loc[((train[
    (train['Condition2']=='RRAn')|
    (train['Condition2']=='RRAe')
    ].index).tolist()), 'Condition2'] = 'RRA'

# apply 'RRA' to Condition1 column of test
test.loc[((test[
    (test['Condition1']=='RRAn')|
    (test['Condition1']=='RRAe')
    ].index).tolist()), 'Condition1'] = 'RRA'

# apply 'RRA' to Condition2 column of test
test.loc[((test[
    (test['Condition2']=='RRAn')|
    (test['Condition2']=='RRAe')
    ].index).tolist()), 'Condition2'] = 'RRA'
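The eight index-based assignments above can also be expressed as a single lookup dictionary passed to `Series.replace`. A sketch on a toy frame (`toy` and `railroad_merge` are illustrative names):

```python
import pandas as pd

# Toy frame with the two condition columns
toy = pd.DataFrame({
    'Condition1': ['RRNn', 'Norm', 'RRAe', 'Feedr'],
    'Condition2': ['Norm', 'RRNe', 'Norm', 'RRAn'],
})

# Directional railroad labels collapsed by distance only
railroad_merge = {'RRNn': 'RRN', 'RRNe': 'RRN', 'RRAn': 'RRA', 'RRAe': 'RRA'}

for col in ('Condition1', 'Condition2'):
    # Labels not listed in the dictionary are left unchanged
    toy[col] = toy[col].replace(railroad_merge)

print(toy)
```

Unlike `map`, `replace` leaves unlisted values such as 'Norm' and 'Feedr' untouched, which is the behaviour wanted here.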

Handling the train set:

In [1468]:
# expand both columns and combine for collation
train_c1 = pd.DataFrame(data=(enc.fit_transform(train[['Condition1']])),columns=(enc.get_feature_names_out()))
train_c2 = pd.DataFrame(data=(enc.fit_transform(train[['Condition2']])),columns=(enc.get_feature_names_out()))
train_condition = pd.concat([train_c1,train_c2],axis=1)

In [1469]:
# create columns for collating values
train_condition['Cond_Artery'] = 0.0
train_condition['Cond_Feedr'] = 0.0
train_condition['Cond_Norm'] = 0.0
train_condition['Cond_PosA'] = 0.0
train_condition['Cond_PosN'] = 0.0
train_condition['Cond_RRA'] = 0.0
train_condition['Cond_RRN'] = 0.0

In [1470]:
# where either 'Artery' expansion is 1, fill master column with a 1
train_condition.loc[((train_condition[
    (train_condition['Condition1_Artery']==1)| 
    (train_condition['Condition2_Artery']==1)
].index).tolist()), 'Cond_Artery'] = 1

# where either 'Feedr' expansion is 1, fill master column with a 1
train_condition.loc[((train_condition[
    (train_condition['Condition1_Feedr']==1)|
    (train_condition['Condition2_Feedr']==1)
].index).tolist()), 'Cond_Feedr'] = 1

# where either 'Norm' expansion is 1, fill master column with a 1
train_condition.loc[((train_condition[
    (train_condition['Condition1_Norm']==1)|
    (train_condition['Condition2_Norm']==1)
].index).tolist()), 'Cond_Norm'] = 1

# where either 'PosA' expansion is 1, fill master column with a 1
train_condition.loc[((train_condition[
    (train_condition['Condition1_PosA']==1)|
    (train_condition['Condition2_PosA']==1)
].index).tolist()), 'Cond_PosA'] = 1

# where either 'PosN' expansion is 1, fill master column with a 1
train_condition.loc[((train_condition[
    (train_condition['Condition1_PosN']==1)|
    (train_condition['Condition2_PosN']==1)
].index).tolist()), 'Cond_PosN'] = 1

# where either 'RRA' expansion is 1, fill master column with a 1
train_condition.loc[((train_condition[
    (train_condition['Condition1_RRA']==1)|
    (train_condition['Condition2_RRA']==1)
].index).tolist()), 'Cond_RRA'] = 1

# where either 'RRN' expansion is 1, fill master column with a 1
train_condition.loc[((train_condition[
    (train_condition['Condition1_RRN']==1)|
    (train_condition['Condition2_RRN']==1)
].index).tolist()), 'Cond_RRN'] = 1
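The "either expansion is 1" logic above can be sketched more compactly by adding the two indicator frames and capping the result at 1. This is an alternative formulation on toy data, using `pd.get_dummies` rather than the encoder above; `d1`, `d2` and `combined` are illustrative names.

```python
import pandas as pd

# Toy frame with the two condition columns
toy = pd.DataFrame({
    'Condition1': ['Artery', 'Norm', 'Feedr', 'Norm'],
    'Condition2': ['Norm', 'Norm', 'Artery', 'PosN'],
})

# Expand each column without a prefix so shared labels line up by name
d1 = pd.get_dummies(toy['Condition1'], dtype=float)
d2 = pd.get_dummies(toy['Condition2'], dtype=float)

# Add over the union of labels (a label missing from one frame counts as 0),
# cap at 1 so a label present in both columns is still a single flag,
# then apply the shared 'Cond_' prefix
combined = d1.add(d2, fill_value=0).clip(upper=1).add_prefix('Cond_')
print(combined)
```

Because the addition works over the union of labels, a label that appears in only one of the two columns needs no special handling.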

In [1471]:
# merge the columns to keep back onto main dataframe
train = pd.concat([train.reset_index(),train_condition[['Cond_Artery','Cond_Feedr','Cond_Norm','Cond_RRN','Cond_RRA',
                                                        'Cond_PosN','Cond_PosA']]],axis=1)
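A note on the `reset_index()` call: `pd.concat(..., axis=1)` aligns rows by index label, not by position, and a frame built from the encoder's raw output carries a fresh 0..n-1 index. The sketch below shows what happens when the main frame's index differs, which is presumably why the reset is needed here (an assumption; the toy index values are made up).

```python
import pandas as pd

# Main frame whose index does not start at 0
main = pd.DataFrame({'SalePrice': [100, 200, 300]}, index=[10, 11, 12])

# Encoded frame built from an array, so it gets the default index 0, 1, 2
encoded = pd.DataFrame({'Cond_Norm': [1.0, 0.0, 1.0]})

# Without a reset, no labels match: 6 rows, each with one NaN
misaligned = pd.concat([main, encoded], axis=1)

# With a reset, rows pair up by position: 3 rows, no NaN
aligned = pd.concat([main.reset_index(drop=True), encoded], axis=1)

print(misaligned.shape, aligned.shape)
```

Without `drop=True`, `reset_index()` also keeps the old index as an extra column (named `index`, or the index's own name if it has one), which should be excluded from the model features later.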

And the test set:

In [1472]:
# expand both columns and combine for collation
test_c1 = pd.DataFrame(data=(enc.fit_transform(test[['Condition1']])),columns=(enc.get_feature_names_out()))
test_c2 = pd.DataFrame(data=(enc.fit_transform(test[['Condition2']])),columns=(enc.get_feature_names_out()))
test_condition = pd.concat([test_c1,test_c2],axis=1)

In [1473]:
# rename columns that did not appear in the second set of conditions
test_condition = test_condition.rename(columns={'Condition1_RRA':'Cond_RRA','Condition1_RRN':'Cond_RRN'})

# create columns for collating values
test_condition['Cond_Artery'] = 0.0
test_condition['Cond_Feedr'] = 0.0
test_condition['Cond_Norm'] = 0.0
test_condition['Cond_PosA'] = 0.0
test_condition['Cond_PosN'] = 0.0

In [1474]:
# where either 'Artery' expansion is 1, fill master column with a 1
test_condition.loc[((test_condition[
    (test_condition['Condition1_Artery']==1)|
    (test_condition['Condition2_Artery']==1)
].index).tolist()), 'Cond_Artery'] = 1

# where either 'Feedr' expansion is 1, fill master column with a 1
test_condition.loc[((test_condition[
    (test_condition['Condition1_Feedr']==1)|
    (test_condition['Condition2_Feedr']==1)
].index).tolist()), 'Cond_Feedr'] = 1

# where either 'Norm' expansion is 1, fill master column with a 1
test_condition.loc[((test_condition[
    (test_condition['Condition1_Norm']==1)|
    (test_condition['Condition2_Norm']==1)
].index).tolist()), 'Cond_Norm'] = 1

# where either 'PosA' expansion is 1, fill master column with a 1
test_condition.loc[((test_condition[
    (test_condition['Condition1_PosA']==1)|
    (test_condition['Condition2_PosA']==1)
].index).tolist()), 'Cond_PosA'] = 1

# where either 'PosN' expansion is 1, fill master column with a 1
test_condition.loc[((test_condition[
    (test_condition['Condition1_PosN']==1)|
    (test_condition['Condition2_PosN']==1)
].index).tolist()), 'Cond_PosN'] = 1

In [1475]:
# merge the columns to keep back onto main dataframe
test = pd.concat([test.reset_index(),test_condition[['Cond_Artery','Cond_Feedr','Cond_Norm','Cond_RRN','Cond_RRA',
                                                        'Cond_PosN','Cond_PosA']]],axis=1)

`Exterior1st` and `Exterior2nd` share the following set of possible values:<p>

AsbShng - Asbestos Shingles<br>
AsphShn - Asphalt Shingles<br>
BrkComm - Brick Common<br>
BrkFace - Brick Face<br>
CBlock - Cinder Block<br>
CemntBd - Cement Board<br>
HdBoard - Hard Board<br>
ImStucc - Imitation Stucco<br>
MetalSd - Metal Siding<br>
Other - Other<br>
Plywood - Plywood<br>
PreCast - PreCast<br>
Stone - Stone<br>
Stucco - Stucco<br>
VinylSd - Vinyl Siding<br>
Wd Sdng - Wood Siding<br>
WdShing - Wood Shingles<p>

There aren't any obvious places to combine values here.<p>
<br>
Handling the train set:

In [1476]:
train_ext1 = pd.DataFrame(data=(enc.fit_transform(train[['Exterior1st']])),columns=(enc.get_feature_names_out()))
train_ext2 = pd.DataFrame(data=(enc.fit_transform(train[['Exterior2nd']])),columns=(enc.get_feature_names_out()))
train_exterior = pd.concat([train_ext1,train_ext2],axis=1)

In [1477]:
# rename the column that only appears in one dataframe
train_exterior = train_exterior.rename(columns={'Exterior2nd_Other':'Ext_Other'})

# create columns for collating values
train_exterior['Ext_AsbShng'] = 0.0
train_exterior['Ext_AsphShn'] = 0.0
train_exterior['Ext_BrkComm'] = 0.0
train_exterior['Ext_BrkFace'] = 0.0
train_exterior['Ext_CBlock'] = 0.0
train_exterior['Ext_CemntBd'] = 0.0
train_exterior['Ext_HdBoard'] = 0.0
train_exterior['Ext_ImStucc'] = 0.0
train_exterior['Ext_MetalSd'] = 0.0
train_exterior['Ext_Plywood'] = 0.0
train_exterior['Ext_Stone'] = 0.0
train_exterior['Ext_Stucco'] = 0.0
train_exterior['Ext_VinylSd'] = 0.0
train_exterior['Ext_Wd_Sdng'] = 0.0
train_exterior['Ext_WdShing'] = 0.0

In [1478]:
# where either 'AsbShng' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_AsbShng']==1)|
    (train_exterior['Exterior2nd_AsbShng']==1)
].index).tolist()), 'Ext_AsbShng'] = 1

# where either 'AsphShn' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_AsphShn']==1)|
    (train_exterior['Exterior2nd_AsphShn']==1)
].index).tolist()), 'Ext_AsphShn'] = 1

# where either 'BrkComm' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_BrkComm']==1)|
    (train_exterior['Exterior2nd_Brk Cmn']==1)
].index).tolist()), 'Ext_BrkComm'] = 1

# where either 'BrkFace' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_BrkFace']==1)|
    (train_exterior['Exterior2nd_BrkFace']==1)
].index).tolist()), 'Ext_BrkFace'] = 1

# where either 'CBlock' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_CBlock']==1)|
    (train_exterior['Exterior2nd_CBlock']==1)
].index).tolist()), 'Ext_CBlock'] = 1

# where either 'CemntBd' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_CemntBd']==1)|
    (train_exterior['Exterior2nd_CmentBd']==1)
].index).tolist()), 'Ext_CemntBd'] = 1

# where either 'HdBoard' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_HdBoard']==1)|
    (train_exterior['Exterior2nd_HdBoard']==1)
].index).tolist()), 'Ext_HdBoard'] = 1

# where either 'ImStucc' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_ImStucc']==1)|
    (train_exterior['Exterior2nd_ImStucc']==1)
].index).tolist()), 'Ext_ImStucc'] = 1

# where either 'MetalSd' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_MetalSd']==1)|
    (train_exterior['Exterior2nd_MetalSd']==1)
].index).tolist()), 'Ext_MetalSd'] = 1

# where either 'Plywood' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_Plywood']==1)|
    (train_exterior['Exterior2nd_Plywood']==1)
].index).tolist()), 'Ext_Plywood'] = 1

# where either 'Stone' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_Stone']==1)|
    (train_exterior['Exterior2nd_Stone']==1)
].index).tolist()), 'Ext_Stone'] = 1

# where either 'Stucco' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_Stucco']==1)|
    (train_exterior['Exterior2nd_Stucco']==1)
].index).tolist()), 'Ext_Stucco'] = 1

# where either 'VinylSd' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_VinylSd']==1)|
    (train_exterior['Exterior2nd_VinylSd']==1)
].index).tolist()), 'Ext_VinylSd'] = 1

# where either 'Wd Sdng' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_Wd Sdng']==1)|
    (train_exterior['Exterior2nd_Wd Sdng']==1)
].index).tolist()), 'Ext_Wd_Sdng'] = 1

# where either 'WdShing' expansion is 1, fill master column with a 1
# (Exterior2nd spells this label 'Wd Shng')
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_WdShing']==1)|
    (train_exterior['Exterior2nd_Wd Shng']==1)
].index).tolist()), 'Ext_WdShing'] = 1

In [1479]:
# merge the columns to keep back onto main dataframe
train = pd.concat([train,train_exterior[['Ext_AsbShng','Ext_AsphShn','Ext_BrkComm','Ext_BrkFace','Ext_CBlock','Ext_CemntBd','Ext_HdBoard','Ext_ImStucc',
'Ext_MetalSd','Ext_Other','Ext_Plywood','Ext_Stone','Ext_Stucco','Ext_VinylSd','Ext_Wd_Sdng','Ext_WdShing']]],axis=1)

The test set:

In [1480]:
test_ext1 = pd.DataFrame(data=(enc.fit_transform(test[['Exterior1st']])),columns=(enc.get_feature_names_out()))
test_ext2 = pd.DataFrame(data=(enc.fit_transform(test[['Exterior2nd']])),columns=(enc.get_feature_names_out()))
test_exterior = pd.concat([test_ext1,test_ext2],axis=1)

In [1481]:
# rename columns that only appear in one dataframe
test_exterior = test_exterior.rename(columns={'Exterior2nd_ImStucc':'Ext_ImStucc','Exterior2nd_Stone':'Ext_Stone'})

# create columns for collating values
test_exterior['Ext_AsbShng'] = 0.0
test_exterior['Ext_AsphShn'] = 0.0
test_exterior['Ext_BrkComm'] = 0.0
test_exterior['Ext_BrkFace'] = 0.0
test_exterior['Ext_CBlock'] = 0.0
test_exterior['Ext_CemntBd'] = 0.0
test_exterior['Ext_HdBoard'] = 0.0
test_exterior['Ext_MetalSd'] = 0.0
test_exterior['Ext_Other'] = 0.0
test_exterior['Ext_Plywood'] = 0.0
test_exterior['Ext_Stucco'] = 0.0
test_exterior['Ext_VinylSd'] = 0.0
test_exterior['Ext_Wd_Sdng'] = 0.0
test_exterior['Ext_WdShing'] = 0.0

In [1482]:
# where either 'AsbShng' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_AsbShng']==1)|
    (test_exterior['Exterior2nd_AsbShng']==1)
].index).tolist()), 'Ext_AsbShng'] = 1

# where either 'AsphShn' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_AsphShn']==1)|
    (test_exterior['Exterior2nd_AsphShn']==1)
].index).tolist()), 'Ext_AsphShn'] = 1

# where either 'BrkComm' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_BrkComm']==1)|
    (test_exterior['Exterior2nd_Brk Cmn']==1)
].index).tolist()), 'Ext_BrkComm'] = 1

# where either 'BrkFace' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_BrkFace']==1)|
    (test_exterior['Exterior2nd_BrkFace']==1)
].index).tolist()), 'Ext_BrkFace'] = 1

# where either 'CBlock' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_CBlock']==1)|
    (test_exterior['Exterior2nd_CBlock']==1)
].index).tolist()), 'Ext_CBlock'] = 1

# where either 'CemntBd' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_CemntBd']==1)|
    (test_exterior['Exterior2nd_CmentBd']==1)
].index).tolist()), 'Ext_CemntBd'] = 1

# where either 'HdBoard' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_HdBoard']==1)|
    (test_exterior['Exterior2nd_HdBoard']==1)
].index).tolist()), 'Ext_HdBoard'] = 1

# where either 'MetalSd' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_MetalSd']==1)|
    (test_exterior['Exterior2nd_MetalSd']==1)
].index).tolist()), 'Ext_MetalSd'] = 1

# where either 'Other' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_Other']==1)|
    (test_exterior['Exterior2nd_Other']==1)
].index).tolist()), 'Ext_Other'] = 1

# where either 'Plywood' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_Plywood']==1)|
    (test_exterior['Exterior2nd_Plywood']==1)
].index).tolist()), 'Ext_Plywood'] = 1

# where either 'Stucco' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_Stucco']==1)|
    (test_exterior['Exterior2nd_Stucco']==1)
].index).tolist()), 'Ext_Stucco'] = 1

# where either 'VinylSd' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_VinylSd']==1)|
    (test_exterior['Exterior2nd_VinylSd']==1)
].index).tolist()), 'Ext_VinylSd'] = 1

# where either 'Wd Sdng' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_Wd Sdng']==1)|
    (test_exterior['Exterior2nd_Wd Sdng']==1)
].index).tolist()), 'Ext_Wd_Sdng'] = 1

# where either 'WdShing' expansion is 1, fill master column with a 1
# (Exterior2nd spells this label 'Wd Shng')
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_WdShing']==1)|
    (test_exterior['Exterior2nd_Wd Shng']==1)
].index).tolist()), 'Ext_WdShing'] = 1

In [1483]:
# merge the columns to keep back onto main dataframe
test = pd.concat([test,test_exterior[['Ext_AsbShng','Ext_AsphShn','Ext_BrkComm','Ext_BrkFace','Ext_CBlock','Ext_CemntBd','Ext_HdBoard','Ext_ImStucc',
'Ext_MetalSd','Ext_Other','Ext_Plywood','Ext_Stone','Ext_Stucco','Ext_VinylSd','Ext_Wd_Sdng','Ext_WdShing']]],axis=1)
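The column names used above show that `Exterior2nd` spells a few labels differently from `Exterior1st` ('Brk Cmn', 'CmentBd', 'Wd Shng'). One way to avoid pairing them by hand is to normalise the spellings before expansion. A sketch on toy data, assuming 'Wd Shng' is the wood shingles label (`spelling_fix` is an illustrative name):

```python
import pandas as pd

# Toy frame using the spellings that appear in each column
toy = pd.DataFrame({
    'Exterior1st': ['BrkComm', 'CemntBd', 'WdShing', 'Wd Sdng'],
    'Exterior2nd': ['Brk Cmn', 'CmentBd', 'Wd Shng', 'Wd Sdng'],
})

# Spellings used only in Exterior2nd, mapped to their Exterior1st equivalents
spelling_fix = {'Brk Cmn': 'BrkComm', 'CmentBd': 'CemntBd', 'Wd Shng': 'WdShing'}

# Labels not listed in the dictionary (e.g. 'Wd Sdng') are left unchanged
toy['Exterior2nd'] = toy['Exterior2nd'].replace(spelling_fix)
print(toy)
```

Once both columns use the same labels, the expanded column names differ only in their prefix, and the two near-identical wood labels are much harder to mix up.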

None of the remaining columns share values, so they can't be combined during expansion.

The `MSZoning` column:<p>

A -	Agriculture<br>
C -	Commercial<br>
FV -	Floating Village Residential<br>
I -	Industrial<br>
RH -	Residential High Density<br>
RL -	Residential Low Density<br>
RP -	Residential Low Density Park <br>
RM -	Residential Medium Density<p>

The 'Residential' values can't be combined, as the 'RL' value is already the vast majority of the entries. 

In [1484]:
train_zone = pd.DataFrame(data=(enc.fit_transform(train[['MSZoning']])),columns=(enc.get_feature_names_out()))

# simplifying name of 'C (all)' column to just 'C'
train_zone = train_zone.rename(columns={'MSZoning_C (all)':'MSZoning_C'})

# merge new columns back into main dataframe
train = pd.concat([train,train_zone],axis=1)

In [1485]:
test_zone = pd.DataFrame(data=(enc.fit_transform(test[['MSZoning']])),columns=(enc.get_feature_names_out()))

# simplifying name of 'C (all)' column to just 'C'
test_zone = test_zone.rename(columns={'MSZoning_C (all)':'MSZoning_C'})

# merge new columns back into main dataframe
test = pd.concat([test,test_zone],axis=1)

The `Neighborhood` column:<p>
Obviously, there is no way to combine neighborhoods.

In [1486]:
train_nbhd = pd.DataFrame(data=(enc.fit_transform(train[['Neighborhood']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_nbhd],axis=1)

In [1487]:
test_nbhd = pd.DataFrame(data=(enc.fit_transform(test[['Neighborhood']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
test = pd.concat([test,test_nbhd],axis=1)

The `BldgType` column:<p>

1Fam -	Single-family Detached	<br>
2FmCon -	Two-family Conversion<br>
Duplx -	Duplex<br>
TwnhsE -	Townhouse End Unit<br>
TwnhsI -	Townhouse Inside Unit<p>

Combining the two 'Townhouse' values would produce a category covering more than 10% of the entries in the train set, which is larger than I want a merged value to be, so they won't be combined. 

In [1488]:
train_type = pd.DataFrame(data=(enc.fit_transform(train[['BldgType']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_type],axis=1)

In [1489]:
test_type = pd.DataFrame(data=(enc.fit_transform(test[['BldgType']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
test = pd.concat([test,test_type],axis=1)

The `HouseStyle` column:<p>

1Story - One story<br>
1.5Fin -	One and one-half story: 2nd level finished<br>
1.5Unf -	One and one-half story: 2nd level unfinished<br>
2Story -	Two story<br>
2.5Story -	Two and one-half story<br>
SFoyer -	Split Foyer<br>
SLvl -	Split Level<p>

Combining '1.5Fin' and '1.5Unf' results in a larger category than I would like, but the fact that there are columns for unfinished square footage renders this distinction redundant, so I *am* going to condense them into a single value. 

In [1490]:
# set 1.5Story for train
train.loc[((train[
    (train['HouseStyle']=='1.5Fin')|
    (train['HouseStyle']=='1.5Unf')
].index).tolist()), 'HouseStyle'] = '1.5Story'

# set 1.5Story for test
test.loc[((test[
    (test['HouseStyle']=='1.5Fin')|
    (test['HouseStyle']=='1.5Unf')
].index).tolist()), 'HouseStyle'] = '1.5Story'

In [1491]:
train_style = pd.DataFrame(data=(enc.fit_transform(train[['HouseStyle']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_style],axis=1)

In [1492]:
test_style = pd.DataFrame(data=(enc.fit_transform(test[['HouseStyle']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
test = pd.concat([test,test_style],axis=1)

The `RoofStyle` column:<p>

Flat -	Flat<br>
Gable -	Gable<br>
Gambrel -	Gambrel (Barn)<br>
Hip -	Hip<br>
Mansard -	Mansard<br>
Shed -	Shed<p>
Nothing here can be combined.

In [1493]:
train_RfStyl = pd.DataFrame(data=(enc.fit_transform(train[['RoofStyle']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_RfStyl],axis=1)

In [1494]:
test_RfStyl = pd.DataFrame(data=(enc.fit_transform(test[['RoofStyle']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
test = pd.concat([test,test_RfStyl],axis=1)

The `RoofMatl` column:<p>
ClyTile -	Clay or Tile<br>
CompShg -	Standard (Composite) Shingle<br>
Membran -	Membrane<br>
Metal -	Metal<br>
Roll -	Roll<br>
Tar&Grv -	Gravel & Tar<br>
WdShake -	Wood Shakes<br>
WdShngl -	Wood Shingles<p>
Nothing to combine.

In [1495]:
train_RfMat = pd.DataFrame(data=(enc.fit_transform(train[['RoofMatl']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_RfMat],axis=1)

In [1496]:
test_RfMat = pd.DataFrame(data=(enc.fit_transform(test[['RoofMatl']])),columns=(enc.get_feature_names_out()))

# create columns for options present in train but missing in test
test_RfMat['RoofMatl_Membran'] = 0.0
test_RfMat['RoofMatl_Metal'] = 0.0
test_RfMat['RoofMatl_Roll'] = 0.0

# merge new columns back into main dataframe
test = pd.concat([test,test_RfMat],axis=1)

The `MasVnrType` column:<p>
BrkCmn -	Brick Common<br>
BrkFace -	Brick Face<br>
CBlock -	Cinder Block<br>
None -	None<br>
Stone -	Stone<p>
Nothing to combine.

In [1497]:
train_vnr = pd.DataFrame(data=(enc.fit_transform(train[['MasVnrType']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_vnr],axis=1)

In [1498]:
test_vnr = pd.DataFrame(data=(enc.fit_transform(test[['MasVnrType']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
test = pd.concat([test,test_vnr],axis=1)

The `Foundation` column:<p>
BrkTil -	Brick & Tile<br>
CBlock -	Cinder Block<br>
PConc -	Poured Concrete<br>
Slab -	Slab<br>
Stone -	Stone<br>
Wood -	Wood<p>
Nothing to combine.

In [1499]:
train_found = pd.DataFrame(data=(enc.fit_transform(train[['Foundation']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_found],axis=1)

In [1500]:
test_found = pd.DataFrame(data=(enc.fit_transform(test[['Foundation']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
test = pd.concat([test,test_found],axis=1)

The `Heating` column:<p>
Floor -	Floor Furnace<br>
GasA -	Gas forced warm air furnace<br>
GasW -	Gas hot water or steam heat<br>
Grav -	Gravity furnace	<br>
OthW -	Hot water or steam heat other than gas<br>
Wall -	Wall furnace<p>
I will not be combining the two 'Gas' values, as 'GasA' contains the vast majority of the entries.

In [1501]:
train_heat = pd.DataFrame(data=(enc.fit_transform(train[['Heating']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_heat],axis=1)

In [1502]:
test_heat = pd.DataFrame(data=(enc.fit_transform(test[['Heating']])),columns=(enc.get_feature_names_out()))

# create columns for options present in train but missing in test
test_heat['Heating_Floor'] = 0.0
test_heat['Heating_OthW'] = 0.0

# merge new columns back into main dataframe
test = pd.concat([test,test_heat],axis=1)

The `Electrical` column:<p>
SBrkr -	Standard Circuit Breakers & Romex<br>
FuseA -	Fuse Box over 60 AMP and all Romex wiring (Average)	<br>
FuseF -	60 AMP Fuse Box and mostly Romex wiring (Fair)<br>
FuseP -	60 AMP Fuse Box and mostly knob & tube wiring (poor)<br>
Mix -	Mixed<p>
I don't know enough about wiring to know if there is a way to combine these. 

In [1503]:
train_elec = pd.DataFrame(data=(enc.fit_transform(train[['Electrical']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_elec],axis=1)

In [1504]:
test_elec = pd.DataFrame(data=(enc.fit_transform(test[['Electrical']])),columns=(enc.get_feature_names_out()))

# create column for option present in train but missing in test
test_elec['Electrical_Mix'] = 0.0

# merge new columns back into main dataframe
test = pd.concat([test,test_elec],axis=1)

The `GarageType` column:<p>
2Types -	More than one type of garage<br>
Attchd -	Attached to home<br>
Basment -	Basement Garage<br>
BuiltIn -	Built-In<br>
CarPort -	Car Port<br>
Detchd -	Detached from home<br>
None -	No Garage<p>
No combinations possible.

In [1505]:
train_garage = pd.DataFrame(data=(enc.fit_transform(train[['GarageType']])),columns=(enc.get_feature_names_out()))

# drop none column
train_garage = train_garage.drop('GarageType_None',axis=1)

# merge new columns back into main dataframe
train = pd.concat([train,train_garage],axis=1)

In [1506]:
test_garage = pd.DataFrame(data=(enc.fit_transform(test[['GarageType']])),columns=(enc.get_feature_names_out()))

# drop none column
test_garage = test_garage.drop('GarageType_None',axis=1)

# merge new columns back into main dataframe
test = pd.concat([test,test_garage],axis=1)

The `MiscFeature` column:<p>
Elev -	Elevator<br>
Gar2 -	2nd Garage (if not described in garage section)<br>
Othr -	Other<br>
Shed -	Shed (over 100 SF)<br>
TenC -	Tennis Court<br>
NA -	None<p>
Nothing to combine.

In [1507]:
train_misc = pd.DataFrame(data=(enc.fit_transform(train[['MiscFeature']])),columns=(enc.get_feature_names_out()))

# create column for option present in test but missing in train
train_misc['MiscFeature_Gar2'] = 0.0

# drop nan column
train_misc = train_misc.drop('MiscFeature_nan',axis=1)

# merge new columns back into main dataframe
train = pd.concat([train,train_misc],axis=1)

In [1508]:
test_misc = pd.DataFrame(data=(enc.fit_transform(test[['MiscFeature']])),columns=(enc.get_feature_names_out()))

# create column for option present in train but missing in test
test_misc['MiscFeature_TenC'] = 0.0

# drop nan column
test_misc = test_misc.drop('MiscFeature_nan',axis=1)

# merge new columns back into main dataframe
test = pd.concat([test,test_misc],axis=1)

The `SaleType` column:<p>
WD - 	Warranty Deed - Conventional<br>
CWD -	Warranty Deed - Cash<br>
VWD -	Warranty Deed - VA Loan<br>
New -	Home just constructed and sold<br>
COD -	Court Officer Deed/Estate<br>
Con -	Contract 15% Down payment regular terms<br>
ConLw -	Contract Low Down payment and low interest<br>
ConLI -	Contract Low Interest<br>
ConLD -	Contract Low Down<br>
Oth -	Other<p>

I'm going to combine the 'Warranty Deed' and 'Contract' groups into single values. 'VWD' is never used, and 'CWD' accounts for less than 1% of each set. Combined, the 'Contract' values account for only 1% to 2% of the values in each set. <p>

**AWD - Any Warranty Deed**<p>
WD - 	Warranty Deed - Conventional<br>
CWD -	Warranty Deed - Cash<p>
<br>
**Cont - All Contracts**<p>
Con -	Contract 15% Down payment regular terms<br>
ConLw -	Contract Low Down payment and low interest<br>
ConLI -	Contract Low Interest<br>
ConLD -	Contract Low Down<br>

In [1509]:
# apply 'AWD' to train
train.loc[((train[
    (train['SaleType']=='WD')|
    (train['SaleType']=='CWD')
    ].index).tolist()), 'SaleType'] = 'AWD'

# apply 'AWD' to test
test.loc[((test[
    (test['SaleType']=='WD')|
    (test['SaleType']=='CWD')
    ].index).tolist()), 'SaleType'] = 'AWD'

In [1510]:
# apply 'Cont' to train
train.loc[((train[
    (train['SaleType']=='Con')|
    (train['SaleType']=='ConLw')|
    (train['SaleType']=='ConLI')|
    (train['SaleType']=='ConLD')
    ].index).tolist()), 'SaleType'] = 'Cont'

# apply 'Cont' to test
test.loc[((test[
    (test['SaleType']=='Con')|
    (test['SaleType']=='ConLw')|
    (test['SaleType']=='ConLI')|
    (test['SaleType']=='ConLD')
    ].index).tolist()), 'SaleType'] = 'Cont'
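
The same grouping can also be expressed as a single lookup table. This is a sketch only, using a made-up `toy` frame rather than the real data; `sale_groups` is a name introduced here for illustration.

```python
import pandas as pd

# toy stand-in for the SaleType column (values are illustrative only)
toy = pd.DataFrame(
    {'SaleType': ['WD', 'CWD', 'Con', 'ConLw', 'ConLI', 'ConLD', 'New', 'COD']}
)

# map each original code to its combined group; codes not listed are left unchanged
sale_groups = {
    'WD': 'AWD', 'CWD': 'AWD',
    'Con': 'Cont', 'ConLw': 'Cont', 'ConLI': 'Cont', 'ConLD': 'Cont',
}
toy['SaleType'] = toy['SaleType'].replace(sale_groups)
```

Keeping the groups in one dictionary means the same mapping can be applied to both `train` and `test` without repeating the conditions.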

In [1511]:
train_SType = pd.DataFrame(data=(enc.fit_transform(train[['SaleType']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_SType],axis=1)

In [1512]:
test_SType = pd.DataFrame(data=(enc.fit_transform(test[['SaleType']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
test = pd.concat([test,test_SType],axis=1)

The `SaleCondition` column:<p>
Normal -	Normal Sale<br>
Abnorml -	Abnormal Sale -  trade, foreclosure, short sale<br>
AdjLand -	Adjoining Land Purchase<br>
Alloca -	Allocation - two linked properties with separate deeds, typically condo with a garage unit	<br>
Family -	Sale between family members<br>
Partial -	Home was not completed when last assessed (associated with New Homes)<p>
No combinations.

In [1513]:
train_SCond = pd.DataFrame(data=(enc.fit_transform(train[['SaleCondition']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_SCond],axis=1)

In [1514]:
test_SCond = pd.DataFrame(data=(enc.fit_transform(test[['SaleCondition']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
test = pd.concat([test,test_SCond],axis=1)

### Combining and Creating Columns<p>
The total living space for each property can be calculated using the `GrLivArea` column, for the area above ground level, and the `TotalBsmtSF` column, for the area of the basement, if one exists.

In [1515]:
train['TtlLivSF'] = train['GrLivArea']+train['TotalBsmtSF']

test['TtlLivSF'] = test['GrLivArea']+test['TotalBsmtSF']

In [1516]:
print('Train head:')
display(train[['GrLivArea', 'TotalBsmtSF', 'TtlLivSF']].head())
print('Test head:')
display(test[['GrLivArea', 'TotalBsmtSF', 'TtlLivSF']].head())

Train head:


Unnamed: 0,GrLivArea,TotalBsmtSF,TtlLivSF
0,1710,856,2566
1,1262,1262,2524
2,1786,920,2706
3,1717,756,2473
4,2198,1145,3343


Test head:


Unnamed: 0,GrLivArea,TotalBsmtSF,TtlLivSF
0,896,882.0,1778.0
1,1329,1329.0,2658.0
2,1629,928.0,2557.0
3,1604,926.0,2530.0
4,1280,1280.0,2560.0


The total number of bathrooms can also be calculated from the four bathroom-count columns. (The half-bath columns are multiplied by 0.5 so that each half bath adds 0.5 to the total.)

In [1517]:
train['TotalBath'] = train['BsmtFullBath']+(train['BsmtHalfBath']*0.5)+train['FullBath']+(train['HalfBath']*0.5)

test['TotalBath'] = test['BsmtFullBath']+(test['BsmtHalfBath']*0.5)+test['FullBath']+(test['HalfBath']*0.5)

In [1518]:
print('Train head:')
display(train[['BsmtFullBath', 'BsmtHalfBath', 'FullBath','HalfBath', 'TotalBath']].head())
print('Test head:')
display(test[['BsmtFullBath', 'BsmtHalfBath', 'FullBath','HalfBath', 'TotalBath']].head())

Train head:


Unnamed: 0,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,TotalBath
0,1,0,2,1,3.5
1,0,1,2,0,2.5
2,1,0,2,1,3.5
3,1,0,1,0,2.0
4,1,0,2,1,3.5


Test head:


Unnamed: 0,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,TotalBath
0,0.0,0.0,1,0,1.0
1,0.0,0.0,1,1,1.5
2,0.0,0.0,2,1,2.5
3,0.0,0.0,2,1,2.5
4,0.0,0.0,2,0,2.0


Because the story counts in `HouseStyle` only cover the floors above ground, I am going to make a column for the total number of floors, using the columns that record the area of the basement, first floor, and second floor. There is no column for the area of a third story, which matters for properties marked as 2.5 stories. No property in the 2.5 category lacks a basement, so I don't need to worry about that and can encode them all as having 4 total floors. <p>
I am going to create the new column with a value of 0 so I can check that all the rows have been encoded correctly.

In [1519]:
# create new column 
train['TtlFloors'] = 0

# encode the applicable values
# only a ground floor
train.loc[((train[(train['1stFlrSF']>0)&(train['TotalBsmtSF']==0)&(train['2ndFlrSF']==0)
].index).tolist()), 'TtlFloors'] = 1

# ground floor and basement
train.loc[((train[(train['1stFlrSF']>0)&(train['TotalBsmtSF']>0)&(train['2ndFlrSF']==0)
].index).tolist()), 'TtlFloors'] = 2

# ground floor and second floor
train.loc[((train[(train['1stFlrSF']>0)&(train['TotalBsmtSF']==0)&(train['2ndFlrSF']>0)
].index).tolist()), 'TtlFloors'] = 2

# three floors
train.loc[((train[(train['1stFlrSF']>0)&(train['TotalBsmtSF']>0)&(train['2ndFlrSF']>0)
].index).tolist()), 'TtlFloors'] = 3

# the 2.5 story category
train.loc[((train[train['HouseStyle']=='2.5Story'
].index).tolist()), 'TtlFloors'] = 4

In [1520]:
# check there are no 0s left
train['TtlFloors'].value_counts()

TtlFloors
2    809
3    595
1     27
4     18
Name: count, dtype: int64

In [1521]:
# create new column 
test['TtlFloors'] = 0

# encode the applicable values
# only a ground floor
test.loc[((test[(test['1stFlrSF']>0)&(test['TotalBsmtSF']==0)&(test['2ndFlrSF']==0)
].index).tolist()), 'TtlFloors'] = 1

# ground floor and basement
test.loc[((test[(test['1stFlrSF']>0)&(test['TotalBsmtSF']>0)&(test['2ndFlrSF']==0)
].index).tolist()), 'TtlFloors'] = 2

# ground floor and second floor
test.loc[((test[(test['1stFlrSF']>0)&(test['TotalBsmtSF']==0)&(test['2ndFlrSF']>0)
].index).tolist()), 'TtlFloors'] = 2

# all three floors
test.loc[((test[(test['1stFlrSF']>0)&(test['TotalBsmtSF']>0)&(test['2ndFlrSF']>0)
].index).tolist()), 'TtlFloors'] = 3

# the 2.5 story category
test.loc[((test[test['HouseStyle']=='2.5Story'
].index).tolist()), 'TtlFloors'] = 4

In [1522]:
# check there are no 0s left
test['TtlFloors'].value_counts()

TtlFloors
2    813
3    602
1     34
4     10
Name: count, dtype: int64
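
The floor-count logic above is repeated for `train` and `test`. It could be wrapped in a helper that counts how many of the three area columns are greater than zero. This is a minimal sketch on made-up rows; `count_floors` is a hypothetical name. Unlike the cells above, it would not leave a 0 for a row with no first-floor area, so it assumes every property has `1stFlrSF` > 0.

```python
import pandas as pd

def count_floors(df):
    """Return a Series with the total number of floors per property."""
    # each comparison yields a boolean Series; summing them counts the levels present
    floors = (
        (df['TotalBsmtSF'] > 0).astype(int)
        + (df['1stFlrSF'] > 0).astype(int)
        + (df['2ndFlrSF'] > 0).astype(int)
    )
    # 2.5-story homes are all encoded as 4 floors, as described above
    return floors.mask(df['HouseStyle'] == '2.5Story', 4)

# toy rows covering each case (values are illustrative only)
toy = pd.DataFrame({
    'TotalBsmtSF': [0, 800, 0, 900, 700],
    '1stFlrSF': [1000, 800, 900, 900, 700],
    '2ndFlrSF': [0, 0, 600, 900, 700],
    'HouseStyle': ['1Story', '1Story', '2Story', '2Story', '2.5Story'],
})
toy['TtlFloors'] = count_floors(toy)
```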

Looking at the 5 columns related to porches, there are many properties that show multiple types of porch, but no cases of all 5 kinds being present.

In [1523]:
display(train[train['WoodDeckSF']>0][['WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch']].head(5))
print('Looking for all 5 in Train:')
display(train[(train['WoodDeckSF']>0)&(train['OpenPorchSF']>0)&(train['EnclosedPorch']>0)&(train['3SsnPorch']>0)&(train['ScreenPorch']>0)])
print('Looking for all 5 in Test:')
display(test[(test['WoodDeckSF']>0)&(test['OpenPorchSF']>0)&(test['EnclosedPorch']>0)&(test['3SsnPorch']>0)&(test['ScreenPorch']>0)])

Unnamed: 0,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch
1,298,0,0,0,0
4,192,84,0,0,0
5,40,30,0,320,0
6,255,57,0,0,0
7,235,204,228,0,0


Looking for all 5 in Train:


Unnamed: 0,index,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,Cond_Artery,Cond_Feedr,Cond_Norm,Cond_RRN,Cond_RRA,Cond_PosN,Cond_PosA,Ext_AsbShng,Ext_AsphShn,Ext_BrkComm,Ext_BrkFace,Ext_CBlock,Ext_CemntBd,Ext_HdBoard,Ext_ImStucc,Ext_MetalSd,Ext_Other,Ext_Plywood,Ext_Stone,Ext_Stucco,Ext_VinylSd,Ext_Wd_Sdng,Ext_WdShing,MSZoning_C,MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Neighborhood_Blmngtn,Neighborhood_Blueste,Neighborhood_BrDale,Neighborhood_BrkSide,Neighborhood_ClearCr,Neighborhood_CollgCr,Neighborhood_Crawfor,Neighborhood_Edwards,Neighborhood_Gilbert,Neighborhood_IDOTRR,Neighborhood_MeadowV,Neighborhood_Mitchel,Neighborhood_NAmes,Neighborhood_NPkVill,Neighborhood_NWAmes,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,HouseStyle_1.5Story,HouseStyle_1Story,HouseStyle_2.5Story,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,RoofStyle_Flat,RoofStyle_Gable,RoofStyle_Gambrel,RoofStyle_Hip,RoofStyle_Mansard,RoofStyle_Shed,RoofMatl_CompShg,RoofMatl_Membran,RoofMatl_Metal,RoofMatl_Roll,RoofMatl_Tar&Grv,RoofMatl_WdShake,RoofMatl_WdShngl,MasVnrType_BrkCmn,MasVnrType_BrkFace,MasVnrType_None,MasVnrType_Stone,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,Heating_Floor,Heating_GasA,Heating_GasW,Heating_Grav,Heating_OthW,Heating_Wall,Electrical_FuseA,Electrical_FuseF,Electrical_FuseP,Electrical_Mix,Electrical_SBrkr,GarageType_2Types,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,MiscFeature_Othr,MiscFeature_Shed,MiscFeature_TenC,MiscFeature_Gar2,SaleType_AWD,SaleType_COD,SaleType_Cont,SaleType_New,SaleType_Oth,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial,TtlLivSF,TotalBath,TtlFloors


Looking for all 5 in Test:


Unnamed: 0,index,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,Cond_Artery,Cond_Feedr,Cond_Norm,Cond_RRN,Cond_RRA,Cond_PosN,Cond_PosA,Ext_AsbShng,Ext_AsphShn,Ext_BrkComm,Ext_BrkFace,Ext_CBlock,Ext_CemntBd,Ext_HdBoard,Ext_ImStucc,Ext_MetalSd,Ext_Other,Ext_Plywood,Ext_Stone,Ext_Stucco,Ext_VinylSd,Ext_Wd_Sdng,Ext_WdShing,MSZoning_C,MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Neighborhood_Blmngtn,Neighborhood_Blueste,Neighborhood_BrDale,Neighborhood_BrkSide,Neighborhood_ClearCr,Neighborhood_CollgCr,Neighborhood_Crawfor,Neighborhood_Edwards,Neighborhood_Gilbert,Neighborhood_IDOTRR,Neighborhood_MeadowV,Neighborhood_Mitchel,Neighborhood_NAmes,Neighborhood_NPkVill,Neighborhood_NWAmes,Neighborhood_NoRidge,Neighborhood_NridgHt,Neighborhood_OldTown,Neighborhood_SWISU,Neighborhood_Sawyer,Neighborhood_SawyerW,Neighborhood_Somerst,Neighborhood_StoneBr,Neighborhood_Timber,Neighborhood_Veenker,BldgType_1Fam,BldgType_2fmCon,BldgType_Duplex,BldgType_Twnhs,BldgType_TwnhsE,HouseStyle_1.5Story,HouseStyle_1Story,HouseStyle_2.5Story,HouseStyle_2Story,HouseStyle_SFoyer,HouseStyle_SLvl,RoofStyle_Flat,RoofStyle_Gable,RoofStyle_Gambrel,RoofStyle_Hip,RoofStyle_Mansard,RoofStyle_Shed,RoofMatl_CompShg,RoofMatl_Tar&Grv,RoofMatl_WdShake,RoofMatl_WdShngl,RoofMatl_Membran,RoofMatl_Metal,RoofMatl_Roll,MasVnrType_BrkCmn,MasVnrType_BrkFace,MasVnrType_None,MasVnrType_Stone,Foundation_BrkTil,Foundation_CBlock,Foundation_PConc,Foundation_Slab,Foundation_Stone,Foundation_Wood,Heating_GasA,Heating_GasW,Heating_Grav,Heating_Wall,Heating_Floor,Heating_OthW,Electrical_FuseA,Electrical_FuseF,Electrical_FuseP,Electrical_SBrkr,Electrical_Mix,GarageType_2Types,GarageType_Attchd,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,MiscFeature_Gar2,MiscFeature_Othr,MiscFeature_Shed,MiscFeature_TenC,SaleType_AWD,SaleType_COD,SaleType_Cont,SaleType_New,SaleType_Oth,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial,TtlLivSF,TotalBath,TtlFloors


I am going to create a column that holds the total area of porch space. I thought about making a column to group porches into ranked groups based on total size, and about converting the individual columns into categorical Y/N columns, but I have decided that either would be redundant.

In [1524]:
train['PorchTtlSF'] = train['WoodDeckSF']+train['OpenPorchSF']+train['EnclosedPorch']+train['3SsnPorch']+train['ScreenPorch']

test['PorchTtlSF'] = test['WoodDeckSF']+test['OpenPorchSF']+test['EnclosedPorch']+test['3SsnPorch']+test['ScreenPorch']
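
The same total can be written as a row-wise sum over a list of columns. This is a sketch using three rows from the train preview shown earlier; `porch_cols` is a name introduced here for illustration.

```python
import pandas as pd

porch_cols = ['WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch']

# three rows copied from the train preview displayed above
toy = pd.DataFrame(
    [[298, 0, 0, 0, 0], [192, 84, 0, 0, 0], [40, 30, 0, 320, 0]],
    columns=porch_cols,
)

# sum across the porch columns for each row
toy['PorchTtlSF'] = toy[porch_cols].sum(axis=1)
```

One difference to be aware of: `sum(axis=1)` skips missing values by default, whereas adding the columns with `+` produces `NaN` for any row that has a missing value.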

Finally, I am dropping the original text columns that have now been expanded into encoded columns, along with the extra `index` column. I am also going to set the `Id` column as the index and drop the `MSSubClass` column, as I think it is largely redundant with the `MSZoning` and `HouseStyle` categories.

In [1525]:
droplist = [
    'index','Condition1','Condition2','Exterior1st','Exterior2nd','Foundation','Heating','HouseStyle','Electrical','RoofStyle','RoofMatl',
    'MasVnrType','MSZoning','BldgType','Neighborhood','GarageType','MiscFeature','SaleType','SaleCondition','MSSubClass']

In [1526]:
train_feat = train.drop(droplist,axis=1).set_index('Id')

test_feat = test.drop(droplist,axis=1).set_index('Id')
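
The column lists displayed earlier show that the encoded columns appear in a different order in `train` and `test` (for example, the `RoofMatl_` and `Heating_` columns), because the manually added columns were appended at the end. If a later step depends on column position, the test columns can be reordered to match train. This is a minimal sketch on made-up frames, assuming `SalePrice` is the only column that exists in train but not in test.

```python
import pandas as pd

# toy frames whose encoded columns are in different orders (values are illustrative only)
toy_train = pd.DataFrame(
    {'LotArea': [8450], 'Heating_Floor': [0.0], 'Heating_GasA': [1.0], 'SalePrice': [208500]}
)
toy_test = pd.DataFrame(
    {'LotArea': [11622], 'Heating_GasA': [1.0], 'Heating_Floor': [0.0]}
)

# every train column except the target, in train's order
feature_cols = [c for c in toy_train.columns if c != 'SalePrice']

# reorder test to match; a column missing from test would be created and filled with 0
toy_test_aligned = toy_test.reindex(columns=feature_cols, fill_value=0)
```

Note that `reindex` also drops any test column that is not in `feature_cols`, so it is worth comparing the two column sets before relying on it.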

### Normalization 
Several of the numerical columns contain very large numbers that need to be normalized to prevent them being overemphasized in the model. I am going to use MinMaxScaler to accomplish this.

In [1527]:
from sklearn.preprocessing import MinMaxScaler

train_norm = train_feat.copy()
test_norm = test_feat.copy()

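Each column is scaled using its own minimum and maximum from the training set. `MinMaxScaler` computes these per column, so a single scaler fitted on all of the columns would give the same result; here I fit a separate scaler for each column.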

In [1528]:
scaler1 = MinMaxScaler()
train_norm['LotFrontage'] = scaler1.fit_transform(train_norm[['LotFrontage']])
test_norm['LotFrontage'] = scaler1.transform(test_norm[['LotFrontage']])

In [1529]:
scaler2 = MinMaxScaler()
train_norm['LotArea'] = scaler2.fit_transform(train_norm[['LotArea']])
test_norm['LotArea'] = scaler2.transform(test_norm[['LotArea']])

In [1530]:
scaler3 = MinMaxScaler()
train_norm['YearBuilt'] = scaler3.fit_transform(train_norm[['YearBuilt']])
test_norm['YearBuilt'] = scaler3.transform(test_norm[['YearBuilt']])

In [1531]:
scaler4 = MinMaxScaler()
train_norm['YearRemodAdd'] = scaler4.fit_transform(train_norm[['YearRemodAdd']])
test_norm['YearRemodAdd'] = scaler4.transform(test_norm[['YearRemodAdd']])

In [1532]:
scaler5 = MinMaxScaler()
train_norm['MasVnrArea'] = scaler5.fit_transform(train_norm[['MasVnrArea']])
test_norm['MasVnrArea'] = scaler5.transform(test_norm[['MasVnrArea']])

In [1533]:
scaler6 = MinMaxScaler()
train_norm['BsmtFinSF1'] = scaler6.fit_transform(train_norm[['BsmtFinSF1']])
test_norm['BsmtFinSF1'] = scaler6.transform(test_norm[['BsmtFinSF1']])

In [1534]:
scaler7 = MinMaxScaler()
train_norm['BsmtFinSF2'] = scaler7.fit_transform(train_norm[['BsmtFinSF2']])
test_norm['BsmtFinSF2'] = scaler7.transform(test_norm[['BsmtFinSF2']])

In [1535]:
scaler8 = MinMaxScaler()
train_norm['BsmtUnfSF'] = scaler8.fit_transform(train_norm[['BsmtUnfSF']])
test_norm['BsmtUnfSF'] = scaler8.transform(test_norm[['BsmtUnfSF']])

In [1536]:
scaler9 = MinMaxScaler()
train_norm['TotalBsmtSF'] = scaler9.fit_transform(train_norm[['TotalBsmtSF']])
test_norm['TotalBsmtSF'] = scaler9.transform(test_norm[['TotalBsmtSF']])

In [1537]:
scaler10 = MinMaxScaler()
train_norm['1stFlrSF'] = scaler10.fit_transform(train_norm[['1stFlrSF']])
test_norm['1stFlrSF'] = scaler10.transform(test_norm[['1stFlrSF']])

In [1538]:
scaler11 = MinMaxScaler()
train_norm['2ndFlrSF'] = scaler11.fit_transform(train_norm[['2ndFlrSF']])
test_norm['2ndFlrSF'] = scaler11.transform(test_norm[['2ndFlrSF']])

In [1539]:
scaler12 = MinMaxScaler()
train_norm['LowQualFinSF'] = scaler12.fit_transform(train_norm[['LowQualFinSF']])
test_norm['LowQualFinSF'] = scaler12.transform(test_norm[['LowQualFinSF']])

In [1540]:
scaler13 = MinMaxScaler()
train_norm['GrLivArea'] = scaler13.fit_transform(train_norm[['GrLivArea']])
test_norm['GrLivArea'] = scaler13.transform(test_norm[['GrLivArea']])

In [1541]:
scaler14 = MinMaxScaler()
train_norm['GarageYrBlt'] = scaler14.fit_transform(train_norm[['GarageYrBlt']])
test_norm['GarageYrBlt'] = scaler14.transform(test_norm[['GarageYrBlt']])

In [1542]:
scaler15 = MinMaxScaler()
train_norm['GarageArea'] = scaler15.fit_transform(train_norm[['GarageArea']])
test_norm['GarageArea'] = scaler15.transform(test_norm[['GarageArea']])

In [1543]:
scaler16 = MinMaxScaler()
train_norm['WoodDeckSF'] = scaler16.fit_transform(train_norm[['WoodDeckSF']])
test_norm['WoodDeckSF'] = scaler16.transform(test_norm[['WoodDeckSF']])

In [1544]:
scaler17 = MinMaxScaler()
train_norm['OpenPorchSF'] = scaler17.fit_transform(train_norm[['OpenPorchSF']])
test_norm['OpenPorchSF'] = scaler17.transform(test_norm[['OpenPorchSF']])

In [1545]:
scaler18 = MinMaxScaler()
train_norm['EnclosedPorch'] = scaler18.fit_transform(train_norm[['EnclosedPorch']])
test_norm['EnclosedPorch'] = scaler18.transform(test_norm[['EnclosedPorch']])

In [1546]:
scaler19 = MinMaxScaler()
train_norm['3SsnPorch'] = scaler19.fit_transform(train_norm[['3SsnPorch']])
test_norm['3SsnPorch'] = scaler19.transform(test_norm[['3SsnPorch']])

In [1547]:
scaler20 = MinMaxScaler()
train_norm['ScreenPorch'] = scaler20.fit_transform(train_norm[['ScreenPorch']])
test_norm['ScreenPorch'] = scaler20.transform(test_norm[['ScreenPorch']])

In [1548]:
scaler21 = MinMaxScaler()
train_norm['PoolArea'] = scaler21.fit_transform(train_norm[['PoolArea']])
test_norm['PoolArea'] = scaler21.transform(test_norm[['PoolArea']])

In [1549]:
scaler22 = MinMaxScaler()
train_norm['MiscVal'] = scaler22.fit_transform(train_norm[['MiscVal']])
test_norm['MiscVal'] = scaler22.transform(test_norm[['MiscVal']])

In [1550]:
scaler23 = MinMaxScaler()
train_norm['YrSold'] = scaler23.fit_transform(train_norm[['YrSold']])
test_norm['YrSold'] = scaler23.transform(test_norm[['YrSold']])

In [1551]:
scaler24 = MinMaxScaler()
train_norm['TtlLivSF'] = scaler24.fit_transform(train_norm[['TtlLivSF']])
test_norm['TtlLivSF'] = scaler24.transform(test_norm[['TtlLivSF']])

In [1552]:
scaler25 = MinMaxScaler()
train_norm['PorchTtlSF'] = scaler25.fit_transform(train_norm[['PorchTtlSF']])
test_norm['PorchTtlSF'] = scaler25.transform(test_norm[['PorchTtlSF']])
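
Because `MinMaxScaler` scales each column independently, the per-column cells above could also be written with one scaler over a list of columns. This is a minimal sketch on made-up frames; `scale_cols` and the `toy_` names are introduced here for illustration, and the real list would contain the 25 columns scaled above.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# toy frames standing in for train_feat and test_feat (values are illustrative only)
toy_train = pd.DataFrame({'LotArea': [1000, 3000, 5000], 'YrSold': [2006, 2008, 2010]})
toy_test = pd.DataFrame({'LotArea': [2000, 6000], 'YrSold': [2007, 2010]})

scale_cols = ['LotArea', 'YrSold']

# work on float copies so the scaled values keep their decimals
toy_train_norm = toy_train.astype(float)
toy_test_norm = toy_test.astype(float)

# a single scaler learns a separate min and max for each column
toy_scaler = MinMaxScaler()
toy_train_norm[scale_cols] = toy_scaler.fit_transform(toy_train[scale_cols])

# test is transformed with the min and max learned from train
toy_test_norm[scale_cols] = toy_scaler.transform(toy_test[scale_cols])
```

Since the test data is scaled with the training minimum and maximum, test values can fall outside the 0 to 1 range, as the second `LotArea` value in the toy test frame does.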

### Saving new pickles:<p>
I am keeping pickles of each dataframe without normalization applied for easy reference.

In [1553]:
# dataframes without normalization
train_feat.to_pickle('../pickles/features/train_features')
test_feat.to_pickle('../pickles/features/test_features')

# dataframes with normalization
train_norm.to_pickle('../pickles/features/train_normalized')
test_norm.to_pickle('../pickles/features/test_normalized')