# 3 - Feature Engineering

In [201]:
import pandas as pd
pd.set_option('display.max_columns', None)
train = pd.read_pickle('../pickles/cleaned/train_cleaned')
test = pd.read_pickle('../pickles/cleaned/test_cleaned')

### Value Mapping<p>
To be used in regression, all columns need to be in a numerical format. Additionally, some columns can be combined into more meaningful data points, such as total living area and bathroom counts.<p>
First, hierarchical values, such as those ranging from 'poor' to 'excellent', can be converted into numerical ones quite easily, with 1 as the worst and counting up, using 0 where the house does not have the feature at all.

5 <- Ex	(Excellent) <br>
4 <- Gd	(Good)<br>
3 <- TA	(Average/Typical)<br>
2 <- Fa	(Fair)<br>
1 <- Po	(Poor)<br>
0 <- None	(Doesn't have)<p>

Columns this applies to:<p>
*PoolQC*: Pool quality<br>
*FireplaceQu*: Fireplace quality<br>
*GarageCond*: Garage condition<br>
*GarageQual*: Garage quality<br>
*KitchenQual*: Kitchen quality<br>
*HeatingQC*: Heating quality and condition<br>
*BsmtCond*: Evaluates the general condition of the basement<br>
*BsmtQual*: Evaluates the height of the basement<br>
*ExterCond*: Evaluates the present condition of the material on the exterior<br>
*ExterQual*: Evaluates the quality of the material on the exterior 

In [202]:
mapping1 = {'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1, 'None':0}

In [203]:
train.loc[:, 'PoolQC'] = train['PoolQC'].map(mapping1)
train.loc[:, 'FireplaceQu'] = train['FireplaceQu'].map(mapping1)
train.loc[:, 'GarageCond'] = train['GarageCond'].map(mapping1)
train.loc[:, 'GarageQual'] = train['GarageQual'].map(mapping1)
train.loc[:, 'KitchenQual'] = train['KitchenQual'].map(mapping1)
train.loc[:, 'HeatingQC'] = train['HeatingQC'].map(mapping1)
train.loc[:, 'BsmtCond'] = train['BsmtCond'].map(mapping1)
train.loc[:, 'BsmtQual'] = train['BsmtQual'].map(mapping1)
train.loc[:, 'ExterCond'] = train['ExterCond'].map(mapping1)
train.loc[:, 'ExterQual'] = train['ExterQual'].map(mapping1)

In [204]:
test.loc[:, 'PoolQC'] = test['PoolQC'].map(mapping1)
test.loc[:, 'FireplaceQu'] = test['FireplaceQu'].map(mapping1)
test.loc[:, 'GarageCond'] = test['GarageCond'].map(mapping1)
test.loc[:, 'GarageQual'] = test['GarageQual'].map(mapping1)
test.loc[:, 'KitchenQual'] = test['KitchenQual'].map(mapping1)
test.loc[:, 'HeatingQC'] = test['HeatingQC'].map(mapping1)
test.loc[:, 'BsmtCond'] = test['BsmtCond'].map(mapping1)
test.loc[:, 'BsmtQual'] = test['BsmtQual'].map(mapping1)
test.loc[:, 'ExterCond'] = test['ExterCond'].map(mapping1)
test.loc[:, 'ExterQual'] = test['ExterQual'].map(mapping1)
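
The per-column assignments above could also be written as a single loop over a list of columns. The following is only a sketch under assumptions: `map_ordinal` is a hypothetical helper and `toy` is a made-up DataFrame, neither is part of the original notebook. It also raises an error when a value is missing from the mapping, because `Series.map` otherwise turns unmapped values into `NaN` without any warning.

```python
import pandas as pd

# Same quality scale as mapping1 above
QUALITY_MAP = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0}

def map_ordinal(df, columns, mapping):
    """Hypothetical helper: return a copy of df with `columns` replaced by integer codes."""
    out = df.copy()
    for col in columns:
        mapped = out[col].map(mapping)
        # Series.map yields NaN for anything not in the mapping; fail loudly instead
        unmapped = out.loc[mapped.isna(), col].unique()
        if len(unmapped) > 0:
            raise ValueError(f"{col}: values not in mapping: {list(unmapped)}")
        out[col] = mapped.astype(int)
    return out

# Made-up data standing in for train/test
toy = pd.DataFrame({
    'KitchenQual': ['Gd', 'TA', 'Ex'],
    'PoolQC': ['None', 'None', 'Fa'],
})
toy_mapped = map_ordinal(toy, ['KitchenQual', 'PoolQC'], QUALITY_MAP)
print(toy_mapped)
```

Calling the same helper once for `train` and once for `test` would keep the two frames encoded identically.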

6 <- GLQ (Good Living Quarters)<br>
5 <- ALQ (Average Living Quarters)<br>
4 <- BLQ (Below Average Living Quarters)	<br>
3 <- Rec (Average Rec Room)<br>
2 <- LwQ (Low Quality)<br>
1 <- Unf (Unfinished)<br>
0 <- None (Doesn't have)<p>

Columns this applies to:<p>
*BsmtFinType1*: Rating of basement finished area<br>
*BsmtFinType2*: Rating of basement finished area (if multiple types)

In [205]:
mapping2 = {'GLQ':6, 'ALQ':5, 'BLQ':4, 'Rec':3, 'LwQ':2, 'Unf':1, 'None':0}

In [206]:
train.loc[:, 'BsmtFinType1'] = train['BsmtFinType1'].map(mapping2)
train.loc[:, 'BsmtFinType2'] = train['BsmtFinType2'].map(mapping2)

In [207]:
test.loc[:, 'BsmtFinType1'] = test['BsmtFinType1'].map(mapping2)
test.loc[:, 'BsmtFinType2'] = test['BsmtFinType2'].map(mapping2)

2 <- Grvl	(Gravel)<br>
1 <- Pave	(Paved)<br>
0 <- None (Only on `Alley`; no alley access)<p>

Applies to:<p>
*Street*: Type of road access to property<br>
*Alley*: Type of alley access to property

In [208]:
mapping3 = {'Grvl':2, 'Pave':1, 'None':0}

In [209]:
train.loc[:, 'Street'] = train['Street'].map(mapping3)
train.loc[:, 'Alley'] = train['Alley'].map(mapping3)

In [210]:
test.loc[:, 'Street'] = test['Street'].map(mapping3)
test.loc[:, 'Alley'] = test['Alley'].map(mapping3)

3 <- Y	(Paved) <br>
2 <- P	(Partial Pavement)<br>
1 <- N	(Dirt/Gravel)<p>

Only applies to:<p>
*PavedDrive*: Paved driveway

In [211]:
pavemap = {'Y':3, 'P':2, 'N':1}

In [212]:
train.loc[:, 'PavedDrive'] = train['PavedDrive'].map(pavemap)

test.loc[:, 'PavedDrive'] = test['PavedDrive'].map(pavemap)

3 <- Fin (Finished)<br>
2 <- RFn (Rough Finished)<br>
1 <- Unf (Unfinished)<br>
0 <- None (No Garage)<p>

Only applies to:<p>
*GarageFinish*: Interior finish of the garage

In [213]:
garagemap = {'Fin':3, 'RFn':2, 'Unf':1, 'None':0}

In [214]:
train.loc[:, 'GarageFinish'] = train['GarageFinish'].map(garagemap)

test.loc[:, 'GarageFinish'] = test['GarageFinish'].map(garagemap)

8 <- Typ	(Typical Functionality)<br>
7 <- Min1	(Minor Deductions 1)<br>
6 <- Min2	(Minor Deductions 2)<br>
5 <- Mod	(Moderate Deductions)<br>
4 <- Maj1	(Major Deductions 1)<br>
3 <- Maj2	(Major Deductions 2)<br>
2 <- Sev	(Severely Damaged)<br>
1 <- Sal	(Salvage only)<p>

Only applies to:<p>
*Functional*: Home functionality (Assume typical unless deductions are warranted)

In [215]:
functmap = {'Typ':8, 'Min1':7, 'Min2':6, 'Mod':5, 'Maj1':4, 'Maj2':3, 'Sev':2, 'Sal':1}

In [216]:
train.loc[:, 'Functional'] = train['Functional'].map(functmap)

test.loc[:, 'Functional'] = test['Functional'].map(functmap)

4 <- Gd	(Good Exposure)<br>
3 <- Av	(Average Exposure)<br>
2 <- Mn	(Minimum Exposure)<br>
1 <- No	(No Exposure)<br>
0 <- None	(No Basement)<p>

Only applies to:<p>
*BsmtExposure*: Refers to walkout or garden level walls

In [217]:
bsmtmap = {'Gd':4, 'Av':3, 'Mn':2, 'No':1, 'None':0}

In [218]:
train.loc[:, 'BsmtExposure'] = train['BsmtExposure'].map(bsmtmap)

test.loc[:, 'BsmtExposure'] = test['BsmtExposure'].map(bsmtmap)

4 <- AllPub	(All public Utilities)<br>
3 <- NoSewr	(Electricity, Gas, and Water (Septic Tank))<br>
2 <- NoSeWa	(Electricity and Gas Only)<br>
1 <- ELO	(Electricity only)<p>

Only applies to:<p>
*Utilities*: Type of utilities available<br>

In [219]:
utilmap = {'AllPub':4, 'NoSewr':3, 'NoSeWa':2, 'ELO':1}

In [220]:
train.loc[:, 'Utilities'] = train['Utilities'].map(utilmap)

test.loc[:, 'Utilities'] = test['Utilities'].map(utilmap)

Additionally, the `CentralAir` column currently contains Yes/No values, and as such can be re-mapped using the standard 1/0 schema.

In [221]:
airmap = {'Y':1, 'N':0}

In [222]:
train.loc[:, 'CentralAir'] = train['CentralAir'].map(airmap)

test.loc[:, 'CentralAir'] = test['CentralAir'].map(airmap)
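
Since the goal of this section is a fully numerical dataset, it may be worth confirming that the mapped columns actually carry a numeric dtype. Depending on the pandas version, assigning through `.loc[:, col]` can keep the original `object` dtype even though the values are now integers. The sketch below assumes that situation and uses a made-up `toy` frame rather than the real data.

```python
import pandas as pd

# Made-up frame: integer codes that are still stored as object dtype
toy = pd.DataFrame({
    'CentralAir': pd.Series([1, 0, 1], dtype=object),
    'Utilities': pd.Series([4, 4, 3], dtype=object),
})

# Convert every remaining object column to a numeric dtype
for col in toy.select_dtypes(include='object').columns:
    toy[col] = pd.to_numeric(toy[col])

print(toy.dtypes)
```

`pd.to_numeric` raises by default if a column still contains non-numeric strings, which also makes it a cheap check that no mapping was missed.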

### Combining Columns<p>
The total living space for each property can be calculated using the `GrLivArea` column, for the area above ground level, and the `TotalBsmtSF` column, for the area of the basement, if one exists.

In [246]:
train['TtlLivSF'] = train['GrLivArea']+train['TotalBsmtSF']
test['TtlLivSF'] = test['GrLivArea']+test['TotalBsmtSF']

In [252]:
print('Train head:')
display(train[['GrLivArea', 'TotalBsmtSF', 'TtlLivSF']].head())
print('Test head:')
display(test[['GrLivArea', 'TotalBsmtSF', 'TtlLivSF']].head())

Train head:


| | GrLivArea | TotalBsmtSF | TtlLivSF |
|---|---|---|---|
| 0 | 1710 | 856 | 2566 |
| 1 | 1262 | 1262 | 2524 |
| 2 | 1786 | 920 | 2706 |
| 3 | 1717 | 756 | 2473 |
| 4 | 2198 | 1145 | 3343 |


Test head:


| | GrLivArea | TotalBsmtSF | TtlLivSF |
|---|---|---|---|
| 0 | 896 | 882.0 | 1778.0 |
| 1 | 1329 | 1329.0 | 2658.0 |
| 2 | 1629 | 928.0 | 2557.0 |
| 3 | 1604 | 926.0 | 2530.0 |
| 4 | 1280 | 1280.0 | 2560.0 |
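
The float values in the test set's `TotalBsmtSF` column may indicate that some rows are missing a basement area, although the output above does not confirm this. If so, a plain sum would leave `TtlLivSF` as `NaN` for those rows. A minimal sketch of one way to handle that, assuming a missing value means no basement, with a made-up `toy` frame:

```python
import numpy as np
import pandas as pd

# Made-up rows; the last one has no recorded basement area
toy = pd.DataFrame({
    'GrLivArea': [896, 1329, 1000],
    'TotalBsmtSF': [882.0, 1329.0, np.nan],
})

# Assumption: a missing basement area is treated as 0 square feet
toy['TtlLivSF'] = toy['GrLivArea'] + toy['TotalBsmtSF'].fillna(0)
print(toy)
```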


The total number of bathrooms can also be calculated from the four columns that track bathroom counts. (The half-bath columns are multiplied by 0.5 so that each half bath adds only half a bathroom to the total.)

In [253]:
train['TotalBath'] = train['BsmtFullBath']+(train['BsmtHalfBath']*0.5)+train['FullBath']+(train['HalfBath']*0.5)
test['TotalBath'] = test['BsmtFullBath']+(test['BsmtHalfBath']*0.5)+test['FullBath']+(test['HalfBath']*0.5)

In [255]:
print('Train head:')
display(train[['BsmtFullBath', 'BsmtHalfBath', 'FullBath','HalfBath', 'TotalBath']].head())
print('Test head:')
display(test[['BsmtFullBath', 'BsmtHalfBath', 'FullBath','HalfBath', 'TotalBath']].head())

Train head:


| | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | TotalBath |
|---|---|---|---|---|---|
| 0 | 1 | 0 | 2 | 1 | 3.5 |
| 1 | 0 | 1 | 2 | 0 | 2.5 |
| 2 | 1 | 0 | 2 | 1 | 3.5 |
| 3 | 1 | 0 | 1 | 0 | 2.0 |
| 4 | 1 | 0 | 2 | 1 | 3.5 |


Test head:


| | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | TotalBath |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1 | 0 | 1.0 |
| 1 | 0.0 | 0.0 | 1 | 1 | 1.5 |
| 2 | 0.0 | 0.0 | 2 | 1 | 2.5 |
| 3 | 0.0 | 0.0 | 2 | 1 | 2.5 |
| 4 | 0.0 | 0.0 | 2 | 0 | 2.0 |
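
The bathroom formula is applied identically to both frames, so it could be wrapped in a small function. This is a sketch only: `total_baths` is a hypothetical name, and the `toy` rows reuse three of the rows shown in the train output above so the results can be compared.

```python
import pandas as pd

def total_baths(df):
    """Hypothetical helper: count bathrooms, weighting each half bath as 0.5."""
    return (df['BsmtFullBath'] + df['FullBath']
            + 0.5 * (df['BsmtHalfBath'] + df['HalfBath']))

# Rows 0, 1 and 3 of the train head shown above
toy = pd.DataFrame({
    'BsmtFullBath': [1, 0, 1],
    'BsmtHalfBath': [0, 1, 0],
    'FullBath': [2, 2, 1],
    'HalfBath': [1, 0, 0],
})
toy['TotalBath'] = total_baths(toy)
print(toy)
```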
