# 3 - Feature Engineering

In [670]:
import pandas as pd
pd.set_option('display.max_columns', None)
train = pd.read_pickle('../pickles/cleaned/train_cleaned')
test = pd.read_pickle('../pickles/cleaned/test_cleaned')

### Value Mapping<p>
In order to be used in regression, all columns need to be in a numerical format. Additionally, some columns can be combined into more meaningful data points, such as room counts.<p>
First, hierarchical values, such as those ranging from 'poor' to 'excellent', can be converted into numerical ones quite easily, with 1 as the worst and counting up, using 0 where there is no data at all. <p>
Some value schemas are shared across multiple columns, so those columns can all be mapped together.

5 <- Ex	(Excellent) <br>
4 <- Gd	(Good)<br>
3 <- TA	(Average/Typical)<br>
2 <- Fa	(Fair)<br>
1 <- Po	(Poor)<br>
0 <- None	(Doesn't have)<p>

Columns this applies to:<p>
`ExterQual`: Evaluates the quality of the material on the exterior<br>
`ExterCond`: Evaluates the present condition of the material on the exterior<br>
`BsmtQual`: Evaluates the height of the basement<br>
`BsmtCond`: Evaluates the general condition of the basement<br>
`HeatingQC`: Heating quality and condition<br>
`KitchenQual`: Kitchen quality<br>
`FireplaceQu`: Fireplace quality<br>
`GarageQual`: Garage quality<br>
`GarageCond`: Garage condition<br>
`PoolQC`: Pool quality

In [671]:
mapping1 = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'None':0}

In [672]:
train.loc[:, 'PoolQC'] = train['PoolQC'].map(mapping1)
train.loc[:, 'FireplaceQu'] = train['FireplaceQu'].map(mapping1)
train.loc[:, 'GarageCond'] = train['GarageCond'].map(mapping1)
train.loc[:, 'GarageQual'] = train['GarageQual'].map(mapping1)
train.loc[:, 'KitchenQual'] = train['KitchenQual'].map(mapping1)
train.loc[:, 'HeatingQC'] = train['HeatingQC'].map(mapping1)
train.loc[:, 'BsmtCond'] = train['BsmtCond'].map(mapping1)
train.loc[:, 'BsmtQual'] = train['BsmtQual'].map(mapping1)
train.loc[:, 'ExterCond'] = train['ExterCond'].map(mapping1)
train.loc[:, 'ExterQual'] = train['ExterQual'].map(mapping1)

In [673]:
test.loc[:, 'PoolQC'] = test['PoolQC'].map(mapping1)
test.loc[:, 'FireplaceQu'] = test['FireplaceQu'].map(mapping1)
test.loc[:, 'GarageCond'] = test['GarageCond'].map(mapping1)
test.loc[:, 'GarageQual'] = test['GarageQual'].map(mapping1)
test.loc[:, 'KitchenQual'] = test['KitchenQual'].map(mapping1)
test.loc[:, 'HeatingQC'] = test['HeatingQC'].map(mapping1)
test.loc[:, 'BsmtCond'] = test['BsmtCond'].map(mapping1)
test.loc[:, 'BsmtQual'] = test['BsmtQual'].map(mapping1)
test.loc[:, 'ExterCond'] = test['ExterCond'].map(mapping1)
test.loc[:, 'ExterQual'] = test['ExterQual'].map(mapping1)
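The per-column assignments above all apply one dictionary, so they can also be written as a loop. The following is a minimal sketch on toy data, not the notebook's own cell: `toy_train`, `toy_test`, `quality_map` and `quality_cols` are hypothetical names, and only two of the ten columns are included to keep it short.

```python
import pandas as pd

# Toy stand-ins for the cleaned train/test frames (hypothetical data).
toy_train = pd.DataFrame({'ExterQual': ['Gd', 'TA'], 'PoolQC': ['None', 'Ex']})
toy_test = pd.DataFrame({'ExterQual': ['Fa', 'Ex'], 'PoolQC': ['None', 'None']})

quality_map = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0}
quality_cols = ['ExterQual', 'PoolQC']  # extend with the other eight columns

# Apply the same mapping to every listed column in both frames.
for frame in (toy_train, toy_test):
    for col in quality_cols:
        # Plain column assignment replaces the column, so the result
        # takes the numeric dtype of the mapped values.
        frame[col] = frame[col].map(quality_map)

print(toy_train)
```

Keeping the column list in one place means train and test cannot drift apart if a column is added or removed later.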

6 <- GLQ (Good Living Quarters)<br>
5 <- ALQ (Average Living Quarters)<br>
4 <- BLQ (Below Average Living Quarters)	<br>
3 <- Rec (Average Rec Room)<br>
2 <- LwQ (Low Quality)<br>
1 <- Unf (Unfinished)<br>
0 <- None (Doesn't have)<p>

Columns this applies to:<p>
`BsmtFinType1`: Rating of basement finished area<br>
`BsmtFinType2`: Rating of basement finished area (if multiple types)

In [674]:
mapping2 = {'GLQ':6, 'ALQ':5, 'BLQ':4, 'Rec':3, 'LwQ':2, 'Unf':1, 'None':0}

train.loc[:, 'BsmtFinType1'] = train['BsmtFinType1'].map(mapping2)
train.loc[:, 'BsmtFinType2'] = train['BsmtFinType2'].map(mapping2)

test.loc[:, 'BsmtFinType1'] = test['BsmtFinType1'].map(mapping2)
test.loc[:, 'BsmtFinType2'] = test['BsmtFinType2'].map(mapping2)

2 <- Grvl	(Gravel)<br>
1 <- Pave	(Paved)<br>
0 <- None (Only on `Alley`; no alley access)<p>

Applies to:<p>
`Street`: Type of road access to property<br>
`Alley`: Type of alley access to property

In [675]:
mapping3 = {'Grvl':2, 'Pave':1, 'None':0}

train.loc[:, 'Street'] = train['Street'].map(mapping3)
train.loc[:, 'Alley'] = train['Alley'].map(mapping3)

test.loc[:, 'Street'] = test['Street'].map(mapping3)
test.loc[:, 'Alley'] = test['Alley'].map(mapping3)

The rest of the columns all have unique value sets and must be mapped individually.<p><br></p>

I'm going to convert the values for `LotShape` into numeric values, based on increasing irregularity. <p>
1 <- Reg	(Regular)<br>
2 <- IR1	(Slightly irregular)<br>
3 <- IR2	(Moderately Irregular)<br>
4 <- IR3	(Irregular)

In [676]:
shapemap = {'Reg':1, 'IR1':2, 'IR2':3, 'IR3':4}

train.loc[:, 'LotShape'] = train['LotShape'].map(shapemap)

test.loc[:, 'LotShape'] = test['LotShape'].map(shapemap)

I am doing the same for the `LandSlope` column, increasing with severity.<p>
1 <- Gtl	(Gentle slope)<br>
2 <- Mod	(Moderate Slope)<br>
3 <- Sev	(Severe Slope)

In [677]:
slopemap = {'Gtl':1, 'Mod':2, 'Sev':3}

train.loc[:, 'LandSlope'] = train['LandSlope'].map(slopemap)

test.loc[:, 'LandSlope'] = test['LandSlope'].map(slopemap)

The values of the `LandContour` column imply an ordered hierarchy, and I am going to treat them as such.<p>

4 <- Lvl	(Near Flat/Level)<br>
3 <- Bnk	(Banked)<br>
2 <- HLS	(Hillside)<br>
1 <- Low	(Depression)

In [678]:
contmap = {'Lvl':4,'Bnk':3,'HLS':2,'Low':1}

train.loc[:, 'LandContour'] = train['LandContour'].map(contmap)

test.loc[:, 'LandContour'] = test['LandContour'].map(contmap)

The `Utilities` column:<p>

4 <- AllPub	(All public Utilities)<br>
3 <- NoSewr	(Electricity, Gas, and Water (Septic Tank))<br>
2 <- NoSeWa	(Electricity and Gas Only)<br>
1 <- ELO	(Electricity only)

In [679]:
utilmap = {'AllPub':4, 'NoSewr':3, 'NoSeWa':2, 'ELO':1}

train.loc[:, 'Utilities'] = train['Utilities'].map(utilmap)

test.loc[:, 'Utilities'] = test['Utilities'].map(utilmap)

The values of the `LotConfig` column also imply an ordered hierarchy, with more street frontage being more desirable.<p>

1 <- Inside	(Inside lot)<br>
2 <- Corner	(Corner lot)<br>
3 <- CulDSac	(Cul-de-sac)<br>
4 <- FR2	(Frontage on 2 sides)<br>
5 <- FR3	(Frontage on 3 sides)

In [680]:
configmap = {'Inside':1,'Corner':2,'CulDSac':3,'FR2':4,'FR3':5}

train.loc[:, 'LotConfig'] = train['LotConfig'].map(configmap)

test.loc[:, 'LotConfig'] = test['LotConfig'].map(configmap)

The `BsmtExposure` column:<p>

4 <- Gd	(Good Exposure)<br>
3 <- Av	(Average Exposure)<br>
2 <- Mn	(Minimum Exposure)<br>
1 <- No	(No Exposure)<br>
0 <- None	(No Basement)

In [681]:
bsmtmap = {'Gd':4, 'Av':3, 'Mn':2, 'No':1, 'None':0}

train.loc[:, 'BsmtExposure'] = train['BsmtExposure'].map(bsmtmap)

test.loc[:, 'BsmtExposure'] = test['BsmtExposure'].map(bsmtmap)

The `Functional` column:<p>

8 <- Typ	(Typical Functionality)<br>
7 <- Min1	(Minor Deductions 1)<br>
6 <- Min2	(Minor Deductions 2)<br>
5 <- Mod	(Moderate Deductions)<br>
4 <- Maj1	(Major Deductions 1)<br>
3 <- Maj2	(Major Deductions 2)<br>
2 <- Sev	(Severely Damaged)<br>
1 <- Sal	(Salvage only)


In [682]:
functmap = {'Typ':8, 'Min1':7, 'Min2':6, 'Mod':5, 'Maj1':4, 'Maj2':3, 'Sev':2, 'Sal':1}

train.loc[:, 'Functional'] = train['Functional'].map(functmap)

test.loc[:, 'Functional'] = test['Functional'].map(functmap)

The `GarageFinish` column:<p>

3 <- Fin (Finished)<br>
2 <- RFn (Rough Finished)<br>
1 <- Unf (Unfinished)<br>
0 <- None (No Garage)

In [683]:
garagemap = {'Fin':3, 'RFn':2, 'Unf':1, 'None':0}

train.loc[:, 'GarageFinish'] = train['GarageFinish'].map(garagemap)

test.loc[:, 'GarageFinish'] = test['GarageFinish'].map(garagemap)

The `PavedDrive` column:<p>
3 <- Y	(Paved) <br>
2 <- P	(Partial Pavement)<br>
1 <- N	(Dirt/Gravel)

In [684]:
pavemap = {'Y':3, 'P':2, 'N':1}

train.loc[:, 'PavedDrive'] = train['PavedDrive'].map(pavemap)

test.loc[:, 'PavedDrive'] = test['PavedDrive'].map(pavemap)

The `Fence` column contains values which look like they could be contrasting pairs, but because they are all in the same column, they can't be treated that way. As such, I think the best way to treat them is to assume they are meant to be hierarchical.<p>
4 <- GdPrv	(Good Privacy)<br>
3 <- MnPrv	(Minimum Privacy)<br>
2 <- GdWo	(Good Wood)<br>
1 <- MnWw	(Minimum Wood/Wire)<br>
0 <- None	(No Fence)

In [685]:
fencemap = {
'GdPrv':4,'MnPrv':3,'GdWo':2,'MnWw':1,'None':0
}

train.loc[:, 'Fence'] = train['Fence'].map(fencemap)

test.loc[:, 'Fence'] = test['Fence'].map(fencemap)

Additionally, the `CentralAir` column currently contains Yes/No values, and as such can be re-mapped using the standard 1/0 schema.

In [686]:
airmap = {'Y':1, 'N':0}

train.loc[:, 'CentralAir'] = train['CentralAir'].map(airmap)

test.loc[:, 'CentralAir'] = test['CentralAir'].map(airmap)
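One thing to keep in mind with all of the mappings above: `Series.map` with a dictionary returns `NaN` for any value that has no key, without raising an error. A minimal sketch of a guard against that follows; `map_checked` is a hypothetical helper, not part of the notebook, and the sample values are made up.

```python
import pandas as pd

def map_checked(series, mapping):
    """Map values through `mapping`, raising if any non-missing value has no key."""
    unmapped = set(series.dropna().unique()) - set(mapping)
    if unmapped:
        raise ValueError(f"{series.name}: no mapping for {sorted(unmapped)}")
    return series.map(mapping)

# Hypothetical sample column; 'Y'/'N' mirror the CentralAir values.
air = pd.Series(['Y', 'N', 'Y'], name='CentralAir')
print(map_checked(air, {'Y': 1, 'N': 0}).tolist())

# A stray value surfaces as an error instead of a silent NaN.
try:
    map_checked(pd.Series(['Y', 'Maybe'], name='CentralAir'), {'Y': 1, 'N': 0})
except ValueError as err:
    print(err)
```

Values that are already missing are left alone by this check, so it only flags spellings the dictionary does not cover.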

### One Hot Encoding<p>
Features that aren't numerical and have no obvious hierarchy need to be handled via One Hot Encoding.

In [687]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False)

There are some column pairs that share the same possible values. In order to capture both columns' data for each pair, I'm expanding both of them and then combining them into a single column set, so that each row has at most two 1s across the combined columns.

`Condition1` and `Condition2` share the following value set:<p>

Artery - Adjacent to arterial street<br>
Feedr - Adjacent to feeder street<br>
Norm - Normal	<br>
RRNn - Within 200' of North-South Railroad<br>
RRAn - Adjacent to North-South Railroad<br>
PosN - Near positive off-site feature--park, greenbelt, etc.<br>
PosA - Adjacent to positive off-site feature<br>
RRNe - Within 200' of East-West Railroad<br>
RRAe - Adjacent to East-West Railroad

In [688]:
# expand both columns and concatenate them so their values can be combined
train_c1 = pd.DataFrame(data=(enc.fit_transform(train[['Condition1']])),columns=(enc.get_feature_names_out()))
train_c2 = pd.DataFrame(data=(enc.fit_transform(train[['Condition2']])),columns=(enc.get_feature_names_out()))
train_condition = pd.concat([train_c1,train_c2],axis=1)

In [689]:
# rename columns that did not appear in the second set of conditions
train_condition = train_condition.rename(columns={'Condition1_RRAe':'Cond_RRAe','Condition1_RRNe':'Cond_RRNe'})

# create columns to hold the combined values
train_condition['Cond_Artery'] = 0.0
train_condition['Cond_Feedr'] = 0.0
train_condition['Cond_Norm'] = 0.0
train_condition['Cond_PosA'] = 0.0
train_condition['Cond_PosN'] = 0.0
train_condition['Cond_RRAn'] = 0.0
train_condition['Cond_RRNn'] = 0.0

In [690]:
# where either 'Artery' expansion is 1, fill master column with a 1
train_condition.loc[((train_condition[
    (train_condition['Condition1_Artery']==1)| 
    (train_condition['Condition2_Artery']==1)
].index).tolist()), 'Cond_Artery'] = 1

# where either 'Feedr' expansion is 1, fill master column with a 1
train_condition.loc[((train_condition[
    (train_condition['Condition1_Feedr']==1)|
    (train_condition['Condition2_Feedr']==1)
].index).tolist()), 'Cond_Feedr'] = 1

# where either 'Norm' expansion is 1, fill master column with a 1
train_condition.loc[((train_condition[
    (train_condition['Condition1_Norm']==1)|
    (train_condition['Condition2_Norm']==1)
].index).tolist()), 'Cond_Norm'] = 1

# where either 'PosA' expansion is 1, fill master column with a 1
train_condition.loc[((train_condition[
    (train_condition['Condition1_PosA']==1)|
    (train_condition['Condition2_PosA']==1)
].index).tolist()), 'Cond_PosA'] = 1

# where either 'PosN' expansion is 1, fill master column with a 1
train_condition.loc[((train_condition[
    (train_condition['Condition1_PosN']==1)|
    (train_condition['Condition2_PosN']==1)
].index).tolist()), 'Cond_PosN'] = 1

# where either 'RRAn' expansion is 1, fill master column with a 1
train_condition.loc[((train_condition[
    (train_condition['Condition1_RRAn']==1)|
    (train_condition['Condition2_RRAn']==1)
].index).tolist()), 'Cond_RRAn'] = 1

# where either 'RRNn' expansion is 1, fill master column with a 1
train_condition.loc[((train_condition[
    (train_condition['Condition1_RRNn']==1)|
    (train_condition['Condition2_RRNn']==1)
].index).tolist()), 'Cond_RRNn'] = 1

In [691]:
# merge the columns to keep back onto main dataframe
train = pd.concat([train.reset_index(),train_condition[['Cond_Artery','Cond_Feedr','Cond_Norm','Cond_RRNn','Cond_RRAn',
                                                        'Cond_PosN','Cond_PosA','Cond_RRNe','Cond_RRAe']]],axis=1)

And the test set:

In [692]:
# expand both columns and concatenate them so their values can be combined
test_c1 = pd.DataFrame(data=(enc.fit_transform(test[['Condition1']])),columns=(enc.get_feature_names_out()))
test_c2 = pd.DataFrame(data=(enc.fit_transform(test[['Condition2']])),columns=(enc.get_feature_names_out()))
test_condition = pd.concat([test_c1,test_c2],axis=1)

In [693]:
# rename columns that did not appear in the second set of conditions
test_condition = test_condition.rename(columns={'Condition1_RRAe':'Cond_RRAe','Condition1_RRAn':'Cond_RRAn','Condition1_RRNe':'Cond_RRNe','Condition1_RRNn':'Cond_RRNn'})

# create columns to hold the combined values
test_condition['Cond_Artery'] = 0.0
test_condition['Cond_Feedr'] = 0.0
test_condition['Cond_Norm'] = 0.0
test_condition['Cond_PosA'] = 0.0
test_condition['Cond_PosN'] = 0.0

In [694]:
# where either 'Artery' expansion is 1, fill master column with a 1
test_condition.loc[((test_condition[
    (test_condition['Condition1_Artery']==1)|
    (test_condition['Condition2_Artery']==1)
].index).tolist()), 'Cond_Artery'] = 1

# where either 'Feedr' expansion is 1, fill master column with a 1
test_condition.loc[((test_condition[
    (test_condition['Condition1_Feedr']==1)|
    (test_condition['Condition2_Feedr']==1)
].index).tolist()), 'Cond_Feedr'] = 1

# where either 'Norm' expansion is 1, fill master column with a 1
test_condition.loc[((test_condition[
    (test_condition['Condition1_Norm']==1)|
    (test_condition['Condition2_Norm']==1)
].index).tolist()), 'Cond_Norm'] = 1

# where either 'PosA' expansion is 1, fill master column with a 1
test_condition.loc[((test_condition[
    (test_condition['Condition1_PosA']==1)|
    (test_condition['Condition2_PosA']==1)
].index).tolist()), 'Cond_PosA'] = 1

# where either 'PosN' expansion is 1, fill master column with a 1
test_condition.loc[((test_condition[
    (test_condition['Condition1_PosN']==1)|
    (test_condition['Condition2_PosN']==1)
].index).tolist()), 'Cond_PosN'] = 1

In [695]:
# merge the columns to keep back onto main dataframe
test = pd.concat([test.reset_index(),test_condition[['Cond_Artery','Cond_Feedr','Cond_Norm','Cond_RRNn','Cond_RRAn',
                                                        'Cond_PosN','Cond_PosA','Cond_RRNe','Cond_RRAe']]],axis=1)
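The same "1 if the value appears in either column" result can be computed directly from the two original columns, without building and then merging the separate expansions. This is a minimal sketch only: `combine_pair` is a hypothetical helper and the rows are toy data.

```python
import pandas as pd

def combine_pair(df, col_a, col_b, prefix):
    """Return one 0/1 indicator column per value seen in either column."""
    values = sorted(set(df[col_a].dropna()) | set(df[col_b].dropna()))
    out = pd.DataFrame(index=df.index)
    for value in values:
        # 1.0 when the value appears in either column of the pair
        hit = (df[col_a] == value) | (df[col_b] == value)
        out[f'{prefix}_{value}'] = hit.astype(float)
    return out

# Hypothetical toy rows using values from the Condition1/Condition2 list.
toy = pd.DataFrame({
    'Condition1': ['Norm', 'Feedr', 'RRAe'],
    'Condition2': ['Norm', 'Artery', 'Norm'],
})
print(combine_pair(toy, 'Condition1', 'Condition2', 'Cond'))
```

Because the value list is taken from whichever frame is passed in, train and test can still end up with different columns; passing a fixed list of values instead would avoid that.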

`Exterior1st` and `Exterior2nd` share the following set of possible values:<p>

AsbShng - Asbestos Shingles<br>
AsphShn - Asphalt Shingles<br>
BrkComm - Brick Common<br>
BrkFace - Brick Face<br>
CBlock - Cinder Block<br>
CemntBd - Cement Board<br>
HdBoard - Hard Board<br>
ImStucc - Imitation Stucco<br>
MetalSd - Metal Siding<br>
Other - Other<br>
Plywood - Plywood<br>
PreCast - PreCast<br>
Stone - Stone<br>
Stucco - Stucco<br>
VinylSd - Vinyl Siding<br>
Wd Sdng - Wood Siding<br>
WdShing - Wood Shingles<br>

In [696]:
train_ext1 = pd.DataFrame(data=(enc.fit_transform(train[['Exterior1st']])),columns=(enc.get_feature_names_out()))
train_ext2 = pd.DataFrame(data=(enc.fit_transform(train[['Exterior2nd']])),columns=(enc.get_feature_names_out()))
train_exterior = pd.concat([train_ext1,train_ext2],axis=1)

In [697]:
# rename the column that only appears in one dataframe
train_exterior = train_exterior.rename(columns={'Exterior2nd_Other':'Ext_Other'})

# create columns to hold the combined values
train_exterior['Ext_AsbShng'] = 0.0
train_exterior['Ext_AsphShn'] = 0.0
train_exterior['Ext_BrkComm'] = 0.0
train_exterior['Ext_BrkFace'] = 0.0
train_exterior['Ext_CBlock'] = 0.0
train_exterior['Ext_CemntBd'] = 0.0
train_exterior['Ext_HdBoard'] = 0.0
train_exterior['Ext_ImStucc'] = 0.0
train_exterior['Ext_MetalSd'] = 0.0
train_exterior['Ext_Plywood'] = 0.0
train_exterior['Ext_Stone'] = 0.0
train_exterior['Ext_Stucco'] = 0.0
train_exterior['Ext_VinylSd'] = 0.0
train_exterior['Ext_Wd_Sdng'] = 0.0
train_exterior['Ext_WdShing'] = 0.0

In [698]:
# where either 'AsbShng' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_AsbShng']==1)|
    (train_exterior['Exterior2nd_AsbShng']==1)
].index).tolist()), 'Ext_AsbShng'] = 1

# where either 'AsphShn' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_AsphShn']==1)|
    (train_exterior['Exterior2nd_AsphShn']==1)
].index).tolist()), 'Ext_AsphShn'] = 1

# where either 'BrkComm' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_BrkComm']==1)|
    (train_exterior['Exterior2nd_Brk Cmn']==1)
].index).tolist()), 'Ext_BrkComm'] = 1

# where either 'BrkFace' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_BrkFace']==1)|
    (train_exterior['Exterior2nd_BrkFace']==1)
].index).tolist()), 'Ext_BrkFace'] = 1

# where either 'CBlock' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_CBlock']==1)|
    (train_exterior['Exterior2nd_CBlock']==1)
].index).tolist()), 'Ext_CBlock'] = 1

# where either 'CemntBd' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_CemntBd']==1)|
    (train_exterior['Exterior2nd_CmentBd']==1)
].index).tolist()), 'Ext_CemntBd'] = 1

# where either 'HdBoard' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_HdBoard']==1)|
    (train_exterior['Exterior2nd_HdBoard']==1)
].index).tolist()), 'Ext_HdBoard'] = 1

# where either 'ImStucc' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_ImStucc']==1)|
    (train_exterior['Exterior2nd_ImStucc']==1)
].index).tolist()), 'Ext_ImStucc'] = 1

# where either 'MetalSd' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_MetalSd']==1)|
    (train_exterior['Exterior2nd_MetalSd']==1)
].index).tolist()), 'Ext_MetalSd'] = 1

# where either 'Plywood' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_Plywood']==1)|
    (train_exterior['Exterior2nd_Plywood']==1)
].index).tolist()), 'Ext_Plywood'] = 1

# where either 'Stone' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_Stone']==1)|
    (train_exterior['Exterior2nd_Stone']==1)
].index).tolist()), 'Ext_Stone'] = 1

# where either 'Stucco' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_Stucco']==1)|
    (train_exterior['Exterior2nd_Stucco']==1)
].index).tolist()), 'Ext_Stucco'] = 1

# where either 'VinylSd' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_VinylSd']==1)|
    (train_exterior['Exterior2nd_VinylSd']==1)
].index).tolist()), 'Ext_VinylSd'] = 1

# where either 'Wd Sdng' expansion is 1, fill master column with a 1
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_Wd Sdng']==1)|
    (train_exterior['Exterior2nd_Wd Sdng']==1)
].index).tolist()), 'Ext_Wd_Sdng'] = 1

# where either 'WdShing' expansion is 1, fill master column with a 1
# ('Wd Shng' is the Exterior2nd spelling of 'WdShing')
train_exterior.loc[((train_exterior[
    (train_exterior['Exterior1st_WdShing']==1)|
    (train_exterior['Exterior2nd_Wd Shng']==1)
].index).tolist()), 'Ext_WdShing'] = 1

In [699]:
# merge the columns to keep back onto main dataframe
train = pd.concat([train,train_exterior[['Ext_AsbShng','Ext_AsphShn','Ext_BrkComm','Ext_BrkFace','Ext_CBlock','Ext_CemntBd','Ext_HdBoard','Ext_ImStucc',
'Ext_MetalSd','Ext_Other','Ext_Plywood','Ext_Stone','Ext_Stucco','Ext_VinylSd','Ext_Wd_Sdng','Ext_WdShing']]],axis=1)

The test set:

In [700]:
test_ext1 = pd.DataFrame(data=(enc.fit_transform(test[['Exterior1st']])),columns=(enc.get_feature_names_out()))
test_ext2 = pd.DataFrame(data=(enc.fit_transform(test[['Exterior2nd']])),columns=(enc.get_feature_names_out()))
test_exterior = pd.concat([test_ext1,test_ext2],axis=1)

In [701]:
# rename columns that only appear in one dataframe
test_exterior = test_exterior.rename(columns={'Exterior2nd_ImStucc':'Ext_ImStucc','Exterior2nd_Stone':'Ext_Stone'})

# create columns to hold the combined values
test_exterior['Ext_AsbShng'] = 0.0
test_exterior['Ext_AsphShn'] = 0.0
test_exterior['Ext_BrkComm'] = 0.0
test_exterior['Ext_BrkFace'] = 0.0
test_exterior['Ext_CBlock'] = 0.0
test_exterior['Ext_CemntBd'] = 0.0
test_exterior['Ext_HdBoard'] = 0.0
test_exterior['Ext_MetalSd'] = 0.0
test_exterior['Ext_Other'] = 0.0
test_exterior['Ext_Plywood'] = 0.0
test_exterior['Ext_Stucco'] = 0.0
test_exterior['Ext_VinylSd'] = 0.0
test_exterior['Ext_Wd_Sdng'] = 0.0
test_exterior['Ext_WdShing'] = 0.0

In [702]:
# where either 'AsbShng' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_AsbShng']==1)|
    (test_exterior['Exterior2nd_AsbShng']==1)
].index).tolist()), 'Ext_AsbShng'] = 1

# where either 'AsphShn' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_AsphShn']==1)|
    (test_exterior['Exterior2nd_AsphShn']==1)
].index).tolist()), 'Ext_AsphShn'] = 1

# where either 'BrkComm' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_BrkComm']==1)|
    (test_exterior['Exterior2nd_Brk Cmn']==1)
].index).tolist()), 'Ext_BrkComm'] = 1

# where either 'BrkFace' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_BrkFace']==1)|
    (test_exterior['Exterior2nd_BrkFace']==1)
].index).tolist()), 'Ext_BrkFace'] = 1

# where either 'CBlock' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_CBlock']==1)|
    (test_exterior['Exterior2nd_CBlock']==1)
].index).tolist()), 'Ext_CBlock'] = 1

# where either 'CemntBd' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_CemntBd']==1)|
    (test_exterior['Exterior2nd_CmentBd']==1)
].index).tolist()), 'Ext_CemntBd'] = 1

# where either 'HdBoard' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_HdBoard']==1)|
    (test_exterior['Exterior2nd_HdBoard']==1)
].index).tolist()), 'Ext_HdBoard'] = 1

# where either 'MetalSd' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_MetalSd']==1)|
    (test_exterior['Exterior2nd_MetalSd']==1)
].index).tolist()), 'Ext_MetalSd'] = 1

# where either 'Other' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_Other']==1)|
    (test_exterior['Exterior2nd_Other']==1)
].index).tolist()), 'Ext_Other'] = 1

# where either 'Plywood' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_Plywood']==1)|
    (test_exterior['Exterior2nd_Plywood']==1)
].index).tolist()), 'Ext_Plywood'] = 1

# where either 'Stucco' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_Stucco']==1)|
    (test_exterior['Exterior2nd_Stucco']==1)
].index).tolist()), 'Ext_Stucco'] = 1

# where either 'VinylSd' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_VinylSd']==1)|
    (test_exterior['Exterior2nd_VinylSd']==1)
].index).tolist()), 'Ext_VinylSd'] = 1

# where either 'Wd Sdng' expansion is 1, fill master column with a 1
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_Wd Sdng']==1)|
    (test_exterior['Exterior2nd_Wd Sdng']==1)
].index).tolist()), 'Ext_Wd_Sdng'] = 1

# where either 'WdShing' expansion is 1, fill master column with a 1
# ('Wd Shng' is the Exterior2nd spelling of 'WdShing')
test_exterior.loc[((test_exterior[
    (test_exterior['Exterior1st_WdShing']==1)|
    (test_exterior['Exterior2nd_Wd Shng']==1)
].index).tolist()), 'Ext_WdShing'] = 1

In [703]:
# merge the columns to keep back onto main dataframe
test = pd.concat([test,test_exterior[['Ext_AsbShng','Ext_AsphShn','Ext_BrkComm','Ext_BrkFace','Ext_CBlock','Ext_CemntBd','Ext_HdBoard','Ext_ImStucc',
'Ext_MetalSd','Ext_Other','Ext_Plywood','Ext_Stone','Ext_Stucco','Ext_VinylSd','Ext_Wd_Sdng','Ext_WdShing']]],axis=1)
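Several `Exterior2nd` values are spelled differently from their `Exterior1st` counterparts, which is why the cells above pair up column names by hand. A minimal sketch of normalizing the spellings before expansion follows. It assumes `Wd Shng` is the `Exterior2nd` spelling of `WdShing`; the alias names come from the column names used above, and `ext2_aliases` and the sample values are hypothetical.

```python
import pandas as pd

# Exterior2nd spellings that differ from their Exterior1st counterparts,
# taken from the column names used in the cells above.
ext2_aliases = {'Brk Cmn': 'BrkComm', 'CmentBd': 'CemntBd', 'Wd Shng': 'WdShing'}

# Hypothetical toy column standing in for train['Exterior2nd'].
ext2 = pd.Series(['Wd Shng', 'Wd Sdng', 'CmentBd', 'VinylSd'], name='Exterior2nd')
ext2_clean = ext2.replace(ext2_aliases)
print(ext2_clean.tolist())
```

Once both columns use the same spellings, matching names across the pair no longer depends on remembering which variant belongs to which column.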

None of the remaining columns share value sets, so each one is expanded on its own.

The `MSZoning` column:

In [704]:
train_zone = pd.DataFrame(data=(enc.fit_transform(train[['MSZoning']])),columns=(enc.get_feature_names_out()))

# simplifying name of 'C (all)' column to just 'C'
train_zone = train_zone.rename(columns={'MSZoning_C (all)':'MSZoning_C'})

# merge new columns back into main dataframe
train = pd.concat([train,train_zone],axis=1)

In [705]:
test_zone = pd.DataFrame(data=(enc.fit_transform(test[['MSZoning']])),columns=(enc.get_feature_names_out()))

# simplifying name of 'C (all)' column to just 'C'
test_zone = test_zone.rename(columns={'MSZoning_C (all)':'MSZoning_C'})

# merge new columns back into main dataframe
test = pd.concat([test,test_zone],axis=1)

The `Neighborhood` column:

In [706]:
train_nbhd = pd.DataFrame(data=(enc.fit_transform(train[['Neighborhood']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_nbhd],axis=1)

In [707]:
test_nbhd = pd.DataFrame(data=(enc.fit_transform(test[['Neighborhood']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
test = pd.concat([test,test_nbhd],axis=1)

The `BldgType` column:

In [708]:
train_type = pd.DataFrame(data=(enc.fit_transform(train[['BldgType']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_type],axis=1)

In [709]:
test_type = pd.DataFrame(data=(enc.fit_transform(test[['BldgType']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
test = pd.concat([test,test_type],axis=1)

The `HouseStyle` column:

In [710]:
train_style = pd.DataFrame(data=(enc.fit_transform(train[['HouseStyle']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_style],axis=1)

In [711]:
test_style = pd.DataFrame(data=(enc.fit_transform(test[['HouseStyle']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
test = pd.concat([test,test_style],axis=1)

The `RoofStyle` column:

In [712]:
train_RfStyl = pd.DataFrame(data=(enc.fit_transform(train[['RoofStyle']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_RfStyl],axis=1)

In [713]:
test_RfStyl = pd.DataFrame(data=(enc.fit_transform(test[['RoofStyle']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
test = pd.concat([test,test_RfStyl],axis=1)

The `RoofMatl` column:

In [714]:
train_RfMat = pd.DataFrame(data=(enc.fit_transform(train[['RoofMatl']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_RfMat],axis=1)

In [715]:
test_RfMat = pd.DataFrame(data=(enc.fit_transform(test[['RoofMatl']])),columns=(enc.get_feature_names_out()))

# create columns for options present in train but missing in test
test_RfMat['RoofMatl_Membran'] = 0.0
test_RfMat['RoofMatl_Metal'] = 0.0
test_RfMat['RoofMatl_Roll'] = 0.0

# merge new columns back into main dataframe
test = pd.concat([test,test_RfMat],axis=1)

The `MasVnrType` column:

In [716]:
train_vnr = pd.DataFrame(data=(enc.fit_transform(train[['MasVnrType']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_vnr],axis=1)

In [717]:
test_vnr = pd.DataFrame(data=(enc.fit_transform(test[['MasVnrType']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
test = pd.concat([test,test_vnr],axis=1)

The `Foundation` column:

In [718]:
train_found = pd.DataFrame(data=(enc.fit_transform(train[['Foundation']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_found],axis=1)

In [719]:
test_found = pd.DataFrame(data=(enc.fit_transform(test[['Foundation']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
test = pd.concat([test,test_found],axis=1)

The `Heating` column:

In [720]:
train_heat = pd.DataFrame(data=(enc.fit_transform(train[['Heating']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_heat],axis=1)

In [721]:
test_heat = pd.DataFrame(data=(enc.fit_transform(test[['Heating']])),columns=(enc.get_feature_names_out()))

# create columns for options present in train but missing in test
test_heat['Heating_Floor'] = 0.0
test_heat['Heating_OthW'] = 0.0

# merge new columns back into main dataframe
test = pd.concat([test,test_heat],axis=1)

The `Electrical` column:

In [722]:
train_elec = pd.DataFrame(data=(enc.fit_transform(train[['Electrical']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_elec],axis=1)

In [723]:
test_elec = pd.DataFrame(data=(enc.fit_transform(test[['Electrical']])),columns=(enc.get_feature_names_out()))

# create column for option present in train but missing in test
test_elec['Electrical_Mix'] = 0.0

# merge new columns back into main dataframe
test = pd.concat([test,test_elec],axis=1)

The `GarageType` column:

In [724]:
train_garage = pd.DataFrame(data=(enc.fit_transform(train[['GarageType']])),columns=(enc.get_feature_names_out()))

# drop none column
train_garage = train_garage.drop('GarageType_None',axis=1)

# merge new columns back into main dataframe
train = pd.concat([train,train_garage],axis=1)

In [725]:
test_garage = pd.DataFrame(data=(enc.fit_transform(test[['GarageType']])),columns=(enc.get_feature_names_out()))

# drop none column
test_garage = test_garage.drop('GarageType_None',axis=1)

# merge new columns back into main dataframe
test = pd.concat([test,test_garage],axis=1)

The `MiscFeature` column:

In [726]:
train_misc = pd.DataFrame(data=(enc.fit_transform(train[['MiscFeature']])),columns=(enc.get_feature_names_out()))

# create column for option present in test but missing in train
train_misc['MiscFeature_Gar2'] = 0.0

# drop nan column
train_misc = train_misc.drop('MiscFeature_nan',axis=1)

# merge new columns back into main dataframe
train = pd.concat([train,train_misc],axis=1)

In [727]:
test_misc = pd.DataFrame(data=(enc.fit_transform(test[['MiscFeature']])),columns=(enc.get_feature_names_out()))

# create column for option present in train but missing in test
test_misc['MiscFeature_TenC'] = 0.0

# drop nan column
test_misc = test_misc.drop('MiscFeature_nan',axis=1)

# merge new columns back into main dataframe
test = pd.concat([test,test_misc],axis=1)

In [728]:
train['MiscFeature'].value_counts()

MiscFeature
Shed    47
Othr     2
TenC     1
Name: count, dtype: int64

In [729]:
test['MiscFeature'].value_counts()

MiscFeature
Shed    46
Gar2     3
Othr     2
Name: count, dtype: int64
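When each frame is missing a category that the other has, as with `TenC` and `Gar2` here, the two expansions can be aligned by reindexing both to the union of their columns. This is a minimal sketch using `pd.get_dummies` rather than the encoder used in this notebook, on hypothetical toy frames.

```python
import pandas as pd

# Hypothetical toy frames mirroring the value counts above.
toy_train = pd.DataFrame({'MiscFeature': ['Shed', 'TenC', None]})
toy_test = pd.DataFrame({'MiscFeature': ['Shed', 'Gar2', None]})

# get_dummies skips missing values by default, so no nan column is created.
train_misc = pd.get_dummies(toy_train['MiscFeature'], prefix='MiscFeature', dtype=float)
test_misc = pd.get_dummies(toy_test['MiscFeature'], prefix='MiscFeature', dtype=float)

# Reindex both frames to the sorted union of their columns, filling gaps with 0.
all_cols = sorted(set(train_misc.columns) | set(test_misc.columns))
train_misc = train_misc.reindex(columns=all_cols, fill_value=0.0)
test_misc = test_misc.reindex(columns=all_cols, fill_value=0.0)

print(list(train_misc.columns))
```

Reindexing also puts the columns in the same order in both frames, which matters if a model is later fed the raw arrays.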

The `SaleType` column:

In [730]:
train_SType = pd.DataFrame(data=(enc.fit_transform(train[['SaleType']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_SType],axis=1)

In [731]:
test_SType = pd.DataFrame(data=(enc.fit_transform(test[['SaleType']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
test = pd.concat([test,test_SType],axis=1)

The `SaleCondition` column:

In [732]:
train_SCond = pd.DataFrame(data=(enc.fit_transform(train[['SaleCondition']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
train = pd.concat([train,train_SCond],axis=1)

In [733]:
test_SCond = pd.DataFrame(data=(enc.fit_transform(test[['SaleCondition']])),columns=(enc.get_feature_names_out()))

# merge new columns back into main dataframe
test = pd.concat([test,test_SCond],axis=1)

In [734]:
droplist = [
    'index','Condition1','Condition2','Exterior1st','Exterior2nd',
    'Foundation','Heating','HouseStyle',
    'Electrical','RoofStyle','RoofMatl',
    'MasVnrType','MSZoning','BldgType',
    'Neighborhood','GarageType','MiscFeature','SaleType', 'SaleCondition'
]
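Python silently concatenates adjacent string literals, so a missing comma in a long list like this one produces a single fused name rather than an error. A minimal sketch of a sanity check before dropping is shown below. The `demo` frame and `demo_droplist` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical miniature frame, for illustration only.
demo = pd.DataFrame({'HouseStyle': ['1Story'], 'Electrical': ['SBrkr'], 'LotArea': [8450]})

# Two adjacent literals with no comma become one string.
fused = ['HouseStyle' 'Electrical']
print(fused)

def missing_columns(df, names):
    """Return the names that do not match any column in df."""
    return [name for name in names if name not in df.columns]

# The fused name is caught because no such column exists.
print(missing_columns(demo, fused))

# A correctly separated list passes the check and can be dropped safely.
demo_droplist = ['HouseStyle', 'Electrical']
not_found = missing_columns(demo, demo_droplist)
if not_found:
    raise KeyError(f'droplist entries not found in columns: {not_found}')
demo = demo.drop(columns=demo_droplist)
```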

### Combining and Creating Columns<p>
The total living space for each property can be calculated by adding the `GrLivArea` column (the area above ground level) to the `TotalBsmtSF` column (the area of the basement, which is 0 where there is no basement).

In [735]:
train['TtlLivSF'] = train['GrLivArea']+train['TotalBsmtSF']
test['TtlLivSF'] = test['GrLivArea']+test['TotalBsmtSF']

In [736]:
print('Train head:')
display(train[['GrLivArea', 'TotalBsmtSF', 'TtlLivSF']].head())
print('Test head:')
display(test[['GrLivArea', 'TotalBsmtSF', 'TtlLivSF']].head())

Train head:


| | GrLivArea | TotalBsmtSF | TtlLivSF |
|---|---|---|---|
| 0 | 1710 | 856 | 2566 |
| 1 | 1262 | 1262 | 2524 |
| 2 | 1786 | 920 | 2706 |
| 3 | 1717 | 756 | 2473 |
| 4 | 2198 | 1145 | 3343 |


Test head:


| | GrLivArea | TotalBsmtSF | TtlLivSF |
|---|---|---|---|
| 0 | 896 | 882.0 | 1778.0 |
| 1 | 1329 | 1329.0 | 2658.0 |
| 2 | 1629 | 928.0 | 2557.0 |
| 3 | 1604 | 926.0 | 2530.0 |
| 4 | 1280 | 1280.0 | 2560.0 |


The total number of bathrooms can also be calculated from the four bathroom-count columns. The half-bath columns are multiplied by 0.5 so that each half bath adds half a bathroom to the total.

In [737]:
train['TotalBath'] = train['BsmtFullBath']+(train['BsmtHalfBath']*0.5)+train['FullBath']+(train['HalfBath']*0.5)
test['TotalBath'] = test['BsmtFullBath']+(test['BsmtHalfBath']*0.5)+test['FullBath']+(test['HalfBath']*0.5)

In [738]:
print('Train head:')
display(train[['BsmtFullBath', 'BsmtHalfBath', 'FullBath','HalfBath', 'TotalBath']].head())
print('Test head:')
display(test[['BsmtFullBath', 'BsmtHalfBath', 'FullBath','HalfBath', 'TotalBath']].head())

Train head:


| | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | TotalBath |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 2 | 1 | 3.5 |
| 1 | 0 | 1 | 2 | 0 | 2.5 |
| 2 | 1 | 0 | 2 | 1 | 3.5 |
| 3 | 1 | 0 | 1 | 0 | 2.0 |
| 4 | 1 | 0 | 2 | 1 | 3.5 |


Test head:


| | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | TotalBath |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1 | 0 | 1.0 |
| 1 | 0.0 | 0.0 | 1 | 1 | 1.5 |
| 2 | 0.0 | 0.0 | 2 | 1 | 2.5 |
| 3 | 0.0 | 0.0 | 2 | 1 | 2.5 |
| 4 | 0.0 | 0.0 | 2 | 0 | 2.0 |
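Since the same bathroom formula is applied to both train and test, it could be wrapped in a small helper so the weighting lives in one place. This is only a sketch: `add_total_bath` and the `demo` frame are hypothetical names, and the two demo rows copy the first two rows of the train output above.

```python
import pandas as pd

def add_total_bath(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a TotalBath column; half baths count as 0.5."""
    full = df['BsmtFullBath'] + df['FullBath']
    half = df['BsmtHalfBath'] + df['HalfBath']
    return df.assign(TotalBath=full + 0.5 * half)

# Hypothetical two-row frame mirroring the first two train rows.
demo = pd.DataFrame({
    'BsmtFullBath': [1, 0],
    'BsmtHalfBath': [0, 1],
    'FullBath': [2, 2],
    'HalfBath': [1, 0],
})
demo = add_total_bath(demo)
print(demo['TotalBath'].tolist())  # [3.5, 2.5]
```

The helper returns a new frame rather than modifying its input, so it would be used as `train = add_total_bath(train)` and `test = add_total_bath(test)`.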


Because `HouseStyle` only counts the stories above ground, I am going to make a column for the total number of floors, using the columns that record the area of the basement, the first floor and the second floor. There is no column for the area of a third story, so properties marked as 2.5 stories cannot be identified from the area columns alone. Every property in the 2.5 category has a basement, so I can encode them all as having 4 total floors. <p>
I am going to create the new column with a value of 0 so I can check that every row has been encoded: any 0 left afterwards marks a row that matched none of the rules.

In [739]:
# create new column 
train['TtlFloors'] = 0

# encode the applicable values
# only a ground floor
train.loc[(train['1stFlrSF']>0)&(train['TotalBsmtSF']==0)&(train['2ndFlrSF']==0), 'TtlFloors'] = 1

# ground floor and basement
train.loc[(train['1stFlrSF']>0)&(train['TotalBsmtSF']>0)&(train['2ndFlrSF']==0), 'TtlFloors'] = 2

# ground floor and second floor
train.loc[(train['1stFlrSF']>0)&(train['TotalBsmtSF']==0)&(train['2ndFlrSF']>0), 'TtlFloors'] = 2

# all three floors
train.loc[(train['1stFlrSF']>0)&(train['TotalBsmtSF']>0)&(train['2ndFlrSF']>0), 'TtlFloors'] = 3

# the 2.5 story category (overrides the values set above)
train.loc[train['HouseStyle']=='2.5Story', 'TtlFloors'] = 4

In [740]:
# check there are no 0s left
train['TtlFloors'].value_counts()

TtlFloors
2    809
3    595
1     27
4     18
Name: count, dtype: int64

In [741]:
# create new column 
test['TtlFloors'] = 0

# encode the applicable values
# only a ground floor
test.loc[(test['1stFlrSF']>0)&(test['TotalBsmtSF']==0)&(test['2ndFlrSF']==0), 'TtlFloors'] = 1

# ground floor and basement
test.loc[(test['1stFlrSF']>0)&(test['TotalBsmtSF']>0)&(test['2ndFlrSF']==0), 'TtlFloors'] = 2

# ground floor and second floor
test.loc[(test['1stFlrSF']>0)&(test['TotalBsmtSF']==0)&(test['2ndFlrSF']>0), 'TtlFloors'] = 2

# all three floors
test.loc[(test['1stFlrSF']>0)&(test['TotalBsmtSF']>0)&(test['2ndFlrSF']>0), 'TtlFloors'] = 3

# the 2.5 story category (overrides the values set above)
test.loc[test['HouseStyle']=='2.5Story', 'TtlFloors'] = 4

In [742]:
# check there are no 0s left
test['TtlFloors'].value_counts()

TtlFloors
2    813
3    602
1     34
4     10
Name: count, dtype: int64
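The floor-count rules are identical for train and test, so they could also be written once as a function. The sketch below uses `np.select`, which takes the first matching condition for each row, so the 2.5-story rule is listed first to reproduce the override in the cells above. `total_floors` and the `demo` frame are hypothetical, and the `'2.5Story'` label is taken as-is from the code above.

```python
import numpy as np
import pandas as pd

def total_floors(df: pd.DataFrame) -> pd.Series:
    """Count floors from the area columns; 0 marks a row that matched no rule."""
    has_first = (df['1stFlrSF'] > 0).to_numpy()
    has_bsmt = (df['TotalBsmtSF'] > 0).to_numpy()
    has_second = (df['2ndFlrSF'] > 0).to_numpy()
    is_2_5 = (df['HouseStyle'] == '2.5Story').to_numpy()

    conditions = [
        is_2_5,                               # 2.5 story category
        has_first & has_bsmt & has_second,    # all three floors
        has_first & (has_bsmt ^ has_second),  # ground floor plus exactly one other
        has_first & ~has_bsmt & ~has_second,  # only a ground floor
    ]
    choices = [4, 3, 2, 1]
    values = np.select(conditions, choices, default=0)
    return pd.Series(values, index=df.index, name='TtlFloors')

# Hypothetical rows covering each rule once.
demo = pd.DataFrame({
    '1stFlrSF':    [856, 1262, 1000, 900, 1100],
    'TotalBsmtSF': [856, 1262, 0, 0, 800],
    '2ndFlrSF':    [854, 0, 0, 700, 900],
    'HouseStyle':  ['2Story', '1Story', '1Story', '2Story', '2.5Story'],
})
demo['TtlFloors'] = total_floors(demo)
print(demo['TtlFloors'].tolist())  # [3, 2, 1, 2, 4]
```

As in the original cells, a remaining 0 signals a row that none of the rules covered, so the `value_counts()` check still applies.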